Modeling text with Dirichlet compound multinomial distributions

If you have a question about this talk, please contact Christian Steinruecken.

The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a generative model for documents that accounts for burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. I will present a new family of distributions that closely approximate DCMs and that, unlike DCMs, form an exponential family. These so-called EDCM distributions give insight into DCM properties, and lead to an algorithm for EDCM maximum-likelihood training that is 100 times faster than the corresponding DCM method. Next, I will discuss expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. This algorithm is competitive with the best known methods, and superior at finding models with low perplexity. Finally, I will explain the Fisher kernel induced by DCMs and its connection with the well-known TF-IDF heuristic for information retrieval.
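
To make the burstiness claim concrete, here is a minimal Python sketch of the standard DCM log-probability for a vector of word counts. It also includes a small-parameter approximation of the kind the abstract alludes to. The function names `dcm_log_prob` and `edcm_log_approx` are hypothetical, and the approximation is an assumption based on the identity that Gamma(x + a) / Gamma(a) is close to a * Gamma(x) for small a and x >= 1. It is not taken from the talk itself.

```python
import math


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM with parameters alpha.

    p(x) = n! / prod(x_w!) * Gamma(s) / Gamma(s + n)
           * prod(Gamma(x_w + alpha_w) / Gamma(alpha_w)),
    where n = sum(x) and s = sum(alpha). All alpha_w must be positive.
    """
    n = sum(counts)
    s = sum(alpha)
    log_p = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)
    for x, a in zip(counts, alpha):
        log_p += math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
    return log_p


def edcm_log_approx(counts, beta):
    """Small-parameter approximation (an assumption, see the lead-in).

    Replaces Gamma(x + b) / Gamma(b) by b * Gamma(x) for every word with
    x >= 1, giving n! * Gamma(s) / Gamma(s + n) * prod(beta_w / x_w).
    The result is not exactly normalized.
    """
    n = sum(counts)
    s = sum(beta)
    log_q = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)
    for x, b in zip(counts, beta):
        if x >= 1:
            log_q += math.log(b) - math.log(x)
    return log_q


if __name__ == "__main__":
    # Two-word vocabulary, two-word documents, small alpha (bursty regime).
    alpha = [0.1, 0.1]
    for doc in ([2, 0], [1, 1], [0, 2]):
        print(doc, math.exp(dcm_log_prob(doc, alpha)))
```

With `alpha = [0.1, 0.1]`, the document `[2, 0]` (one word repeated) gets probability about 0.458 and `[1, 1]` gets about 0.083. A multinomial with equal word probabilities would assign 0.25 and 0.5 respectively, so the DCM strongly favours repetition, which is the burstiness effect described above.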

This talk is part of the Inference Group series.


© 2006-2024, University of Cambridge.