Modeling text with Dirichlet compound multinomial distributions
If you have a question about this talk, please contact Christian Steinruecken.
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a generative model for documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. I will present a new family of distributions that closely approximate DCMs and that, unlike DCMs, form an exponential family. These so-called EDCM distributions give insight into DCM properties, and lead to an algorithm for EDCM maximum-likelihood training that is 100x faster than the corresponding DCM method. Next, I will discuss expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. This algorithm is competitive with the best known methods, and superior from the point of view of finding models with low perplexity. Finally, I will explain the Fisher kernel induced by DCMs and its connection with the well-known TF-IDF heuristic for information retrieval.
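To make the burstiness property concrete, here is a minimal Python sketch of the standard DCM log-probability of a word-count vector, together with one possible small-parameter approximation. The function names `dcm_log_prob` and `edcm_log_prob` are hypothetical, and the approximate form is an assumption: it is obtained by replacing Γ(x + α)/Γ(α) with αΓ(x) for x ≥ 1, which holds when α is small, and it may differ in detail from the EDCM presented in the talk.

```python
from math import exp, lgamma, log

def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM (multivariate Polya)
    distribution with parameter vector alpha (all entries > 0)."""
    n = sum(counts)
    s = sum(alpha)
    # n! * Gamma(s) / Gamma(s + n)
    logp = lgamma(n + 1) + lgamma(s) - lgamma(s + n)
    for x, a in zip(counts, alpha):
        # Gamma(x + a) / (Gamma(a) * x!)
        logp += lgamma(x + a) - lgamma(a) - lgamma(x + 1)
    return logp

def edcm_log_prob(counts, beta):
    """Assumed small-parameter approximation to the DCM: each factor
    Gamma(x + b) / (Gamma(b) * x!) is replaced by b / x for x >= 1.
    The result is close to, but not exactly, normalized."""
    n = sum(counts)
    s = sum(beta)
    logp = lgamma(n + 1) + lgamma(s) - lgamma(s + n)
    for x, b in zip(counts, beta):
        if x >= 1:
            logp += log(b) - log(x)
    return logp

# Two-word vocabulary, document length 2, small symmetric parameters.
alpha = [0.01, 0.01]
p_repeat = exp(dcm_log_prob([2, 0], alpha))  # same word twice
p_mixed = exp(dcm_log_prob([1, 1], alpha))   # one of each word
```

With small parameters the DCM gives far more mass to repeating one word than to using each word once, whereas a multinomial with equal word probabilities would give the mixed document twice the probability of either repeated one. The approximate form depends on the counts only through which words appear and through the log(x) terms, which is what makes an exponential-family treatment plausible.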
This talk is part of the Inference Group series.