Information theoretic model selection in clustering

If you have a question about this talk, please contact Peter Orbanz.

Partitioning data sets into groups is an important preprocessing step for compression, prototype extraction, or outlier removal. Various criteria of connectedness or proximity have been proposed to group data according to structural similarity, but in general it is unclear which method or model to use. In the spirit of information theory, we propose a decision process to determine the amount of information that can be extracted from data, conditioned on a hypothesis class of partitions. A sender-receiver scenario defines an approximation capacity for a clustering problem, which quantizes the hypothesis class and thereby introduces sets of statistically indistinguishable partitionings. The quality of a clustering model is determined by its ability to extract more “signal” bits from a data source than a competing data interpretation.

Empirical evidence for this model selection concept is provided by cluster validation in computer security, namely multilabel clustering of Boolean data for role-based access control, and also in the analysis of microarray data.

This talk is part of the Machine Learning @ CUED series.
