drCCA: dimensionality reduction with Canonical Correlation Analysis

Brief summary: drCCA is a simple, efficient, and completely linear data fusion tool based on canonical correlation analysis.

Integration of multiple information sources is an increasingly important task in bioinformatics applications. Understanding, predicting, or even just efficiently exploring cellular mechanisms requires information from several sources, such as gene expression, protein concentrations, or transcription factor binding. Combining data sources, which can in general have very different forms of representation, is a non-trivial problem, but even partial solutions are useful. Combining several sources is also advantageous when using just one type of data, because it reduces noise, which is often a significant issue in biological experiments that have high dimensionality but relatively few samples.

We offer a simple tool for combining several data sources with co-occurring samples into a single vectorial data set of low dimensionality. The method is motivated by bioinformatics applications, but it is generally usable for data fusion tasks in other fields as well. The method aims to retain the variation that is shared between the original data sources, while reducing the dimensionality by ignoring variation that is specific to any one of the data sources alone. Such variation is assumed to be either noise or at least less interesting, as it is related to a phenomenon not visible in the other sources, even though those contain measurements of the exact same objects.

The drCCA method uses generalized canonical correlation analysis to perform a linear projection of the collection of data sets. Because the method is completely linear, it is fast to compute for large data sets, making genome-wide fusion possible. The package includes regularization and tools for automatically selecting the final dimensionality of the combined data set.
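The idea of a linear, shared-variation-preserving projection can be illustrated with a minimal Python/NumPy sketch. It is not the drCCA package's code or API. It assumes one common formulation of generalized CCA: whiten each data set separately, concatenate the whitened sets, and run PCA on the result. The function names (whiten, gcca_fuse), the ridge-style regularization parameter reg, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def whiten(X, reg=1e-6):
    """Center X (samples x features) and rescale it to near-identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    # Ridge-style regularization (an assumed choice) keeps the inverse stable.
    cov += reg * np.eye(cov.shape[0])
    eigval, eigvec = np.linalg.eigh(cov)
    return Xc @ eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def gcca_fuse(datasets, n_components, reg=1e-6):
    """Fuse co-occurring data sets into one low-dimensional representation."""
    Z = np.hstack([whiten(X, reg) for X in datasets])
    # After whitening, within-set variance is equalized, so the leading
    # principal components of Z are driven by between-set correlation.
    cov = Z.T @ Z / (Z.shape[0] - 1)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    return Z @ eigvec[:, order], eigval[order]

# Synthetic example: two sources share one latent signal, the rest is noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
X1 = np.hstack([shared + 0.1 * rng.normal(size=(200, 1)),
                rng.normal(size=(200, 3))])
X2 = np.hstack([rng.normal(size=(200, 2)),
                shared + 0.1 * rng.normal(size=(200, 1))])

fused, eigval = gcca_fuse([X1, X2], n_components=1)
corr = abs(np.corrcoef(fused[:, 0], shared[:, 0])[0, 1])
print(fused.shape, corr)
```

In this sketch the single fused component should track the shared signal closely, while the source-specific noise dimensions are discarded. With two data sets, eigenvalues well above 1 indicate shared variation, which hints at how a final dimensionality could be chosen; the package's own selection procedure is described in the publication below.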

Publication:

More information on the algorithm can be found in the following publication:

Abhishek Tripathi, Arto Klami and Samuel Kaski. Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics, 2008, 9:111. (Open Access)

If you use the package, please cite the above paper.

Documentation:

You can read the HTML documentation included in the package.

Package:

The drCCA software package runs under R, a free language and environment for statistical computing. The latest version of R (currently 2.6.0) is recommended, but the package should also work with older versions.

You can download the drCCA package and read its license here.

Support:

If you have any comments or bug reports on the package, contact Abhishek Tripathi.