Janne Sinkkonen and Samuel Kaski. Semisupervised clustering based on conditional distributions in an auxiliary space. Technical Report A60, Helsinki University of Technology, Publications in Computer and Information Science, Espoo, Finland, 2000.

We study the problem of learning groups or categories that are local in the continuous primary space, but homogeneous in terms of the distributions of an associated auxiliary random variable over a discrete auxiliary space. Assuming variation in the auxiliary space is meaningful, the categories will emphasize similarly meaningful aspects of the primary space. From a data set consisting of pairs of primary and auxiliary items, the categories are learned by minimizing a Kullback-Leibler divergence-based distortion between (implicitly estimated) distributions of the auxiliary data, conditioned on the primary data. Still, the categories are defined solely in terms of the primary space. An on-line algorithm resembling traditional Hebb-type competitive learning is introduced for learning the categories. Minimizing the distortion criterion turns out to be equivalent to maximizing the mutual information between the categories and the auxiliary data. In addition, connections to density estimation and to the related Information Bottleneck principle are outlined. In a case study, text documents are clustered by the similarity of the keywords associated with the documents.
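The equivalence between minimizing the distortion and maximizing mutual information can be checked numerically on a small discrete toy problem. The sketch below is not the on-line algorithm of the report: it assumes a finite primary space with hard category assignments, and it sets each category's prototype distribution to the within-category average of p(c|x). All function names and the toy numbers are hypothetical. Under these assumptions the distortion equals I(X;C) - I(V;C), where V is the category, so for fixed data a lower distortion means a higher I(V;C).

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """Mutual information (nats) of a joint probability table joint[a][c]."""
    pa = [sum(row) for row in joint]
    pc = [sum(col) for col in zip(*joint)]
    return sum(
        pac * math.log(pac / (pa[a] * pc[c]))
        for a, row in enumerate(joint)
        for c, pac in enumerate(row)
        if pac > 0
    )

def distortion_and_information(p_x, p_c_given_x, assignment, n_categories):
    """Return (distortion, I(X;C), I(V;C)) for a hard assignment x -> category.

    Assumes every category receives at least one primary item.
    """
    n_aux = len(p_c_given_x[0])
    # Joint p(x, c) over primary items and auxiliary values.
    joint_xc = [[px * pc for pc in cond] for px, cond in zip(p_x, p_c_given_x)]
    # Joint p(v, c): sum p(x, c) over the items assigned to each category.
    joint_vc = [[0.0] * n_aux for _ in range(n_categories)]
    for row, v in zip(joint_xc, assignment):
        for c, value in enumerate(row):
            joint_vc[v][c] += value
    # Prototype of category v: the conditional distribution p(c | v).
    prototypes = [[value / sum(row) for value in row] for row in joint_vc]
    # Expected KL divergence from p(c | x) to the prototype of x's category.
    distortion = sum(
        px * kl(cond, prototypes[v])
        for px, cond, v in zip(p_x, p_c_given_x, assignment)
    )
    return distortion, mutual_information(joint_xc), mutual_information(joint_vc)

# Toy data (hypothetical): four primary items, two auxiliary values.
p_x = [0.25, 0.25, 0.25, 0.25]
p_c_given_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]

# One grouping puts items with similar p(c|x) together, the other mixes them.
good = distortion_and_information(p_x, p_c_given_x, [0, 0, 1, 1], 2)
bad = distortion_and_information(p_x, p_c_given_x, [0, 1, 0, 1], 2)
```

Here `good` and `bad` each hold the distortion, I(X;C), and I(V;C) for one grouping. Comparing them shows that the grouping with the lower distortion is also the one whose categories carry more information about the auxiliary variable.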

Sami Kaski <sami.kaski@hut.fi>
Last modified: Wed Mar 9 08:40:56 EET 2005