Information theoretic model selection in clustering
If you have a question about this talk, please contact Peter Orbanz.
Partitioning of data sets into groups defines an important
preprocessing step for compression, prototype extraction or outlier
removal. Various criteria of connectedness or proximity have been
proposed to group data according to structural similarity, but in
general it is unclear which method or model to use. In the spirit of
information theory we propose a decision process to determine the
amount of extractable information from data conditioned on a
hypothesis class of partitions. A sender-receiver scenario defines an
approximation capacity for a clustering problem, which quantizes the
hypothesis class and thereby introduces sets of statistically
indistinguishable partitionings. The quality of a clustering model is
determined by its ability to extract more “signal” bits from a data
source than a competing data interpretation.
Empirical evidence for this model selection concept is provided by
cluster validation in computer security, i.e., multilabel clustering
of Boolean data for role-based access control, and also in the analysis of
microarray data.
This talk is part of the Machine Learning @ CUED series.