BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//talks.cam.ac.uk//v3//EN
BEGIN:VTIMEZONE
TZID:Europe/London
BEGIN:DAYLIGHT
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
TZNAME:BST
DTSTART:19700329T010000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
TZNAME:GMT
DTSTART:19701025T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
CATEGORIES:Isaac Newton Institute Seminar Series
SUMMARY:A fast algorithm with minimax optimal guarantees for topic mod
 els with an unknown number of topics - Flori Bunea (Cornell Universit
 y)
DTSTART;TZID=Europe/London:20180625T160000
DTEND;TZID=Europe/London:20180625T164500
UID:TALK107371AThttp://talks.cam.ac.uk
URL:http://talks.cam.ac.uk/talk/index/107371
DESCRIPTION:Topic models have become popular for the analysis of da
 ta that consists of a collection of n independent multinomial obser
 vations. The model links all cell probabilities\, collected in a p x n ma
 trix\, via the assumption that this matrix can be factorized as the pro
 duct of two non-negative matrices A and W. Topic models were original
 ly developed in text mining\, where one browses through n documents\, ba
 sed on a dictionary of p words\, and covering K topics. In this termino
 logy\, the matrix A is called the word-topic matrix\, and it is the ma
 in target of estimation. It can be viewed as a matrix of conditional pro
 babilities\, and it is uniquely defined under appropriate separability as
 sumptions\, discussed in this talk. Notably\, the unique A is required to sa
 tisfy what is commonly known as the anchor word assumption\, under whi
 ch A has an unknown number of rows that are respectively proportion
 al to the K-dimensional canonical basis vectors. The indices of such ro
 ws are referred to as anchor words. Recent computationally feasible al
 gorithms\, with theoretical guarantees\, constructively utilize this as
 sumption by linking the estimation of the set of anchor words with the es
 timation of the K vertices of a simplex. This crucial step in the estima
 tion of A requires K to be known\, and cannot be easily extended to the mo
 re realistic set-up in which K is unknown. In this talk\, we take a dif
 ferent view on anchor word estimation\, and on the estimation of the wo
 rd-topic matrix A. We propose a new method of estimation in topic mo
 dels that is not a variation on the existing simplex-finding algori
 thms\, and that estimates K from the observed data. We derive new fini
 te-sample minimax lower bounds for the estimation of A\, as well as new up
 per bounds for our proposed estimator. We describe the scenarios in whi
 ch our estimator is minimax adaptive. Our finite-sample analysis is va
 lid for any n\, p\, K and document length N\, and both p and K are allo
 wed to increase with n\, a situation not handled well by previous analy
 ses. We illustrate\, via simulations\, that the new algorithm is fas
 ter and more accurate than the current ones\, even though we start ou
 t with a computational and theoretical disadvantage of not knowing the cor
 rect number of topics K\, while we provide the competing methods with the cor
 rect value in our simulations.
LOCATION:Seminar Room 1\, Newton Institute
CONTACT:INI IT
END:VEVENT
END:VCALENDAR