Samuel Kaski, Janne Sinkkonen, and Arto Klami. Discriminative
clustering. Neurocomputing, 69:18-41, 2005.
A distributional clustering model for continuous data is reviewed
and new methods for optimizing and regularizing it are introduced
and compared. Based on samples of discrete-valued auxiliary data
associated with samples of the continuous primary data, the continuous
data space is partitioned into Voronoi regions that are maximally
homogeneous in terms of the discrete data. Then only variation in
the primary data associated with variation in the discrete data
affects the clustering; the discrete data "supervises" the
clustering. Because the whole continuous space is partitioned, new
samples can be easily clustered by the continuous part of the data
alone. In experiments, the approach is shown to produce more
homogeneous clusters than alternative methods. Two regularization
methods are demonstrated to further improve the results: an
entropy-type penalty for unequal cluster sizes, and the inclusion of
a model for the marginal density of the primary data.
The latter is also interpretable as a special kind of joint
distribution modeling with tunable emphasis on discrimination and
the marginal density.
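
The partitioning idea in the abstract can be illustrated with a minimal sketch. It assumes that each cluster is represented by a prototype vector and that a sample belongs to the Voronoi region of its nearest prototype under Euclidean distance. The function names, the toy data, and the use of conditional entropy as a homogeneity measure are illustrative assumptions, not the objective function or optimization method of the paper. The cluster-size entropy is shown only as one plausible quantity that an entropy-type penalty for unequal cluster sizes could be built on.

```python
import math
from collections import Counter


def assign_voronoi(x, prototypes):
    """Return the index of the prototype nearest to x (squared Euclidean distance)."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in prototypes]
    return dists.index(min(dists))


def entropy(counts):
    """Shannon entropy (in nats) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def conditional_entropy(clusters, labels):
    """H(label | cluster); 0 means every cluster contains a single label."""
    n = len(labels)
    by_cluster = {}
    for j, c in zip(clusters, labels):
        by_cluster.setdefault(j, Counter())[c] += 1
    return sum(
        sum(cnt.values()) / n * entropy(list(cnt.values()))
        for cnt in by_cluster.values()
    )


def cluster_size_entropy(clusters):
    """Entropy of the cluster sizes; largest when all clusters are equally large."""
    return entropy(list(Counter(clusters).values()))


# Hypothetical toy data: continuous primary samples with discrete auxiliary labels.
X = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
labels = ["a", "a", "b", "b"]
prototypes = [(0.0, 0.0), (1.0, 1.0)]  # assumed fixed here, not learned

clusters = [assign_voronoi(x, prototypes) for x in X]

# A new sample is clustered from its continuous part alone; no label is needed.
new_cluster = assign_voronoi((0.8, 0.7), prototypes)
```

In this sketch the prototypes are fixed by hand. A discriminative clustering method would instead adjust them so that the label distributions inside the Voronoi regions become as homogeneous as possible.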