Samuel Kaski and Janne Sinkkonen.
Principle of learning metrics for data analysis.
The Journal of VLSI Signal Processing-Systems for Signal, Image,
and Video Technology, special issue on Data Mining and Biomedical
Applications of Neural Networks, to appear.
Visualization and clustering of multivariate data are usually based on
mutual distances of samples, measured by heuristic means such as the
Euclidean distance of vectors of extracted features. Our recently
developed methods remove this arbitrariness by learning to measure
important differences. The effect is equivalent to changing the metric
of the data space. It is assumed that variation of the data is
important only to the extent it causes variation in auxiliary
data which is available paired to the primary data. The learning of
the metric is supervised by the auxiliary data, whereas the data
analysis in the new metric is unsupervised. We review two approaches:
a clustering algorithm and another that is based on an explicitly
generated metric. Applications have so far been in exploratory
analysis of texts, gene function, and bankruptcy. Connections between
the two approaches are derived, leading to promising new approaches to
the clustering problem.
Keywords: Discriminative clustering, exploratory data
analysis, Fisher information matrix, information metric, learning
metrics
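The learning-metrics idea can be sketched concretely: the local metric is the Fisher information matrix of the conditional distribution p(c|x) of the auxiliary data given the primary data, so a displacement dx is "long" exactly when it changes p(c|x). The sketch below assumes a softmax conditional model with fixed, illustrative parameters W and b; the function names and model are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fisher_metric(x, W, b):
    """Fisher information matrix J(x) of p(c|x) for a softmax model.

    J(x) = sum_c p(c|x) * g_c g_c^T, where g_c = grad_x log p(c|x).
    For p(c|x) = softmax(Wx + b): grad_x log p(c|x) = W[c] - sum_k p(k|x) W[k].
    """
    p = softmax(W @ x + b)
    mean_w = p @ W                  # sum_k p(k|x) W[k]
    G = W - mean_w                  # row c holds grad_x log p(c|x)
    return (p[:, None] * G).T @ G   # sum_c p_c g_c g_c^T

def local_distance_sq(x, dx, W, b):
    """Squared local distance dx^T J(x) dx in the learned metric."""
    J = fisher_metric(x, W, b)
    return dx @ J @ dx
```

With a two-class model whose decision depends only on the first coordinate, a displacement along the second coordinate leaves p(c|x) unchanged and so has zero length in this metric, while a displacement along the first coordinate has positive length.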