Modeling text with Dirichlet compound multinomial distributions

Charles Elkan, UCSD
Wednesday 24 January 2007, 14:00-15:00
TCM Seminar Room, Cavendish Laboratory, Department of Physics.

If you have a question about this talk, please contact Christian Steinruecken.

The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Pólya distribution, is a generative model for documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. I will present a new family of distributions that closely approximate DCMs and that, unlike DCMs, form an exponential family. These so-called EDCM distributions give insight into DCM properties, and lead to an algorithm for EDCM maximum-likelihood training that is 100 times faster than the corresponding DCM method. Next, I will discuss expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. This algorithm is competitive with the best known methods, and superior from the point of view of finding models with low perplexity. Finally, I will explain the Fisher kernel induced by DCMs and its connection with the well-known TF-IDF heuristic for information retrieval.
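
As an illustration of the burstiness property described above, the following minimal sketch evaluates the standard DCM (multivariate Pólya) log-probability of a word-count vector. It is not taken from the talk: the function name `dcm_log_pmf`, the two-word vocabulary, and the parameter values are assumptions chosen only for this example.

```python
from math import lgamma, exp


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM distribution.

    counts: non-negative integer count per vocabulary word (illustrative input).
    alpha:  positive DCM parameter per vocabulary word (illustrative input).
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)  # document length
    s = sum(alpha)   # sum of the DCM parameters
    # Multinomial coefficient: n! / prod(x_w!)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalizing ratio: Gamma(s) / Gamma(s + n)
    log_p += lgamma(s) - lgamma(s + n)
    # Per-word terms: Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_p += sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_p


# Hypothetical two-word vocabulary with small parameters (an assumption).
alpha = [0.1, 0.1]
p_repeat = exp(dcm_log_pmf([2, 0], alpha))  # same word twice
p_mixed = exp(dcm_log_pmf([1, 1], alpha))   # two different words
print(p_repeat, p_mixed)
```

With these assumed parameters, the two-word document that repeats one word has probability about 0.458, while the document using both words once has probability about 0.083. A multinomial with equal word probabilities would give 0.25 and 0.5 respectively, so the small-parameter DCM favors repetition.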

This talk is part of the Inference Group series.
