Samuel Kaski. Learning metrics for exploratory data analysis. In David Miller, Tulay Adali, Jan Larsen, Marc Van Hulle, and Scott Douglas, editors, Neural Networks for Signal Processing XI, Proceedings of the 2001 IEEE Signal Processing Society Workshop, pages 53-62. IEEE, New York, NY, 2001.

Visualization and cluster analysis of multivariate data are usually based on distances between samples in a data space. The distance measure is often chosen heuristically, for instance by selecting suitable features and then using a global Euclidean metric. We have developed methods that remove this arbitrariness by measuring distances only along important (local) directions. The metric is learned from auxiliary data that is paired with the primary data during the learning process. Changes in the primary data are assumed to be important or relevant if they cause changes in the auxiliary data; for example, in the analysis of gene expression the auxiliary data can indicate the functional classes of the genes. The new distance measures can be used, for instance, in clustering and in Self-Organizing Map-based data visualization. The methods have so far been applied to the analysis of bankruptcy, text documents, and gene expression.
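
The idea of measuring distances only along locally relevant directions can be illustrated with a small sketch. It assumes a Fisher-information form of the local metric, d^2(x, x+dx) = dx^T J(x) dx, where J(x) is computed from a model of the auxiliary data c given the primary data x. The softmax model, its parameters W and b, and all function names are illustrative assumptions and are not taken from the abstract above.

```python
import numpy as np

def class_posteriors(x, W, b):
    """Hypothetical model of the auxiliary data: p(c | x) = softmax(W x + b)."""
    logits = W @ x + b
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def fisher_metric(x, W, b):
    """Local metric J(x) = sum_c p(c|x) g_c g_c^T, with g_c = grad_x log p(c|x)."""
    p = class_posteriors(x, W, b)
    grads = W - p @ W                       # row c holds grad_x log p(c|x)
    return (grads * p[:, None]).T @ grads

def local_sq_distance(x, dx, W, b):
    """Squared local distance between x and x + dx under the metric J(x)."""
    J = fisher_metric(x, W, b)
    return float(dx @ J @ dx)

# Toy setting (assumed): the auxiliary class depends only on the first coordinate.
W = np.array([[2.0, 0.0],
              [-2.0, 0.0]])
b = np.zeros(2)
x = np.zeros(2)

d_relevant = local_sq_distance(x, np.array([0.1, 0.0]), W, b)    # about 0.04
d_irrelevant = local_sq_distance(x, np.array([0.0, 0.1]), W, b)  # 0.0
print(d_relevant, d_irrelevant)
```

In this toy setting a step along the second coordinate leaves p(c | x) unchanged and therefore has zero length, whereas a global Euclidean metric would treat both steps as equally long.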


Sami Kaski <sami.kaski'at'hut.fi>
Last modified: Wed Mar 9 08:38:26 EET 2005