Janne Sinkkonen, Janne Nikkilä, Leo Lahti, and Samuel Kaski.
Associative Clustering
ECML 2004, Pisa, Italy, 2004. Accepted for publication.
Clustering by maximizing the dependency between two paired,
continuous-valued multivariate data sets is studied. The new
method, associative clustering (AC), maximizes a Bayes factor
between similarly parameterized models for dependent and independent
cluster sets. The setup is analogous (but not identical) to that of
the Information Bottleneck (IB), for which our approach
offers a well-founded and asymptotically well-behaved criterion for
small data sets: with suitable prior assumptions the Bayes factor
becomes equivalent to the hypergeometric probability of a
contingency table, while for large data it becomes the standard
mutual information. An optimization algorithm is introduced, with
empirical comparisons to a combination of IB and K-means, and to
plain K-means. Two case studies cluster genes 1) to find
dependencies between gene expression and transcription factor
binding, and 2) to find dependencies between expression in different
organisms.
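
The link between the hypergeometric probability of a contingency table and mutual information can be checked numerically. The sketch below is not the AC algorithm or the paper's exact Bayes factor; it only assumes the standard multivariate hypergeometric formula for a table with fixed margins and the usual empirical mutual information (in nats). The function names are hypothetical. For a large table, the negative log-probability divided by the sample size should approach the mutual information.

```python
import math


def log_hypergeom_table_prob(table):
    """Log probability of a contingency table given its row and column sums.

    Uses the multivariate hypergeometric formula
    P = (prod_i r_i! * prod_j c_j!) / (N! * prod_ij n_ij!).
    """
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)

    def log_fact(k):
        # log(k!) via the log-gamma function
        return math.lgamma(k + 1)

    return (
        sum(log_fact(r) for r in rows)
        + sum(log_fact(c) for c in cols)
        - log_fact(n)
        - sum(log_fact(x) for r in table for x in r)
    )


def empirical_mutual_information(table):
    """Mutual information (nats) of the empirical joint distribution."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    mi = 0.0
    for i, r in enumerate(table):
        for j, x in enumerate(r):
            if x > 0:  # empty cells contribute nothing
                mi += (x / n) * math.log(x * n / (rows[i] * cols[j]))
    return mi


if __name__ == "__main__":
    base = [[30, 10], [10, 30]]  # illustrative counts, not data from the paper
    for scale in (1, 10, 1000):
        table = [[x * scale for x in r] for r in base]
        n = sum(map(sum, table))
        print(n, -log_hypergeom_table_prob(table) / n,
              empirical_mutual_information(table))
```

The two printed quantities differ noticeably for the smallest table and move closer as the counts grow, which is the small-sample versus large-sample behavior the abstract describes.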