RESEARCH ON BIOINFORMATICS

by the Statistical Machine Learning And Bioinformatics group.

Combining the different kinds of current high-throughput data produces new systems-level hypotheses about gene function and regulation, and ultimately about the functioning of biological organisms. We develop probabilistic modeling, statistical data analysis, and machine learning methods to advance this field. We currently have four main focus areas:

Systems-level translational medicine

Translational medicine refers to translating molecular-level models and inferences to the patient level. We develop methods for the next step of translational medicine, where the goal is to translate systems-level molecular understanding from model organisms to humans. This includes fusing metabolomics and transcriptomics data, discovering disease effects, and mapping them between tissues and organisms with probabilistic methods. The initial disease focus is on Type I diabetes. A related project:

TRANSCENDO
In Silico models of disease pathogenesis and therapy. A project in the MASI program of Tekes, in collaboration with VTT Biotechnology (M. Oresic), VTT Information Technology (I. Karanta), and University of Turku (E. Savontaus).

Genomics of human endogenous retroviruses

About eight per cent of human DNA consists of remains of specific kinds of transposons called human endogenous retroviruses (HERV). Retroviruses that infect humans, such as HIV, are viruses capable of copying their genetic code into human DNA, and they become endogenous once they have been copied to the germ line. Human endogenous retroviruses are remains of ancient infections. It has been suggested that they may have functions in regulating the activity of human genes and may produce proteins under some conditions. We developed and applied methods for exploring the class structure of HERVs, as well as their association with expression, using statistical models such as mixtures of hidden Markov models. A related project:

MICMAN
Analysis of transposable elements in the human genome, in collaboration with Prof. Jonas Blomberg of Uppsala University. Ended in 2006. The evaluation report (www.aka.fi/julkaisut > Julkaisusarjan julkaisut > 1/07 Microbes and Man Research Programme 2003-2005) of the programme praised the project for "good quality and creative basic research."

Data fusion for systems biology

A major component of systems biology is integration of information from multiple sources. For example, in cancer it is known that some of the gene expression changes are due to copy number changes in the genome. Both gene expression and copy number changes can be measured, but to find the interesting dependencies between the two, sophisticated integration is required. Another example is gene expression in man and mouse, where we wish to find genes and gene groups with either different or similar activity in the two organisms, in order to study which properties of mice generalize to man. This subfield could be called comparative functional genomics.
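
As a minimal sketch of what dependency detection between two data sources can look like, the following Python example applies classical canonical correlation analysis (CCA) to simulated data. CCA is used here only as a simple, well-known stand-in and is not the group's own method; the function name, the variable names, the regularization constant, and the simulated copy number and expression matrices are all assumptions made for illustration.

```python
import numpy as np

def first_canonical_correlation(X, Y, reg=1e-6):
    """Largest canonical correlation between two views, with its weight vectors.

    X and Y hold the same samples in rows (hypothetical example: copy number
    and expression measurements). `reg` is a small ridge term, an assumption
    added only to keep the covariance matrices well conditioned.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten both views with Cholesky factors: M = Lx^{-1} Cxy Ly^{-T}
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T

    # Singular values of M are the canonical correlations
    U, s, Vt = np.linalg.svd(M)
    wx = np.linalg.solve(Lx.T, U[:, 0])   # weights in the original X space
    wy = np.linalg.solve(Ly.T, Vt[0])     # weights in the original Y space
    return s[0], wx, wy

# Simulated data: one hidden factor drives copy number feature 0
# and expression feature 3; everything else is independent noise.
rng = np.random.default_rng(0)
n_samples = 200
shared = rng.normal(size=n_samples)
copy_number = rng.normal(size=(n_samples, 5))
expression = rng.normal(size=(n_samples, 8))
copy_number[:, 0] += 2.0 * shared
expression[:, 3] += 2.0 * shared

rho, wx, wy = first_canonical_correlation(copy_number, expression)
print("first canonical correlation:", round(float(rho), 3))
print("strongest copy number feature:", int(np.argmax(np.abs(wx))))
print("strongest expression feature:", int(np.argmax(np.abs(wy))))
```

In this toy setting the largest weights should point to the two features that share the hidden factor. Real copy number and expression data have far more features than samples, so plain CCA would need stronger regularization or a probabilistic formulation.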

We have introduced methods for focusing on relevant variation in several data sets, with relevance determined either by auxiliary data sets (for example Gene Ontology classes) or symmetrically by several sets in data fusion. Probabilistic data fusion, mutual dependency modeling, and learning metrics methods (see Statistical machine learning and data mining) provide state-of-the-art tools for this. Related projects:

MULTIBIO
Project funded by Tekes and the industry, in collaboration with Prof. Sakari Knuutila's group at the Laboratory of Cytomolecular Genetics, Prof. Jaakko Kangasjärvi's Plant Stress group at the Finnish Centre of Excellence in Plant Signal Research, and Dr. Alvis Brazma's Microarray group at the European Bioinformatics Institute.
SYSBIO
MUDFUN project in the SYSBIO program of the Academy of Finland, in collaboration with Neuroscience Center, University of Helsinki (Prof. Eero Castrén), From Data to Knowledge Research Unit, Helsinki University of Technology (Prof. Jaakko Hollmén), and Laboratory of Cytomolecular Genetics, University of Helsinki (Prof. Sakari Knuutila).
NeoBio
SYMBOLIC project in the NeoBio program of Tekes, in collaboration with Control Engineering Laboratory, HUT (Prof. Heikki Hyötyniemi), MediCel Inc. (Dr. Christophe Roos), and the Finnish IT Center for Science (Dr. Minna Laine). Ended 2006.
LIFE2000
Analysis of functional genomics data, in collaboration with Prof. Eero Castrén of the A. I. Virtanen Institute of University of Kuopio. Ended 2003.

Information visualization for high-throughput data

The large, high-dimensional data sets produced by high-throughput methods are prime application areas for information visualization and other informatics methods. So far we have visualized gene interaction graphs, where the task is to make huge graphs understandable. A second application area is the construction of visual interfaces to gene expression databanks. A large community-resource or private gene expression databank consists of numerous data sets submitted by several parties, and a key challenge is how best to use the databanks to support further research. Information visualization methods produce an interface to the databank that visually highlights the similarities and differences of the data sets.
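
To illustrate how a collection of data sets can be laid out so that similar sets end up close together, here is a minimal Python sketch using classical multidimensional scaling (MDS). Classical MDS is chosen only because it is short and standard; it is not the interface or method described above. The summary profiles, the two simulated groups, and all function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def classical_mds(D, dims=2):
    """Embed items in `dims` dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]      # keep the largest ones
    top_vals = np.clip(vals[order], 0.0, None) # guard against tiny negatives
    return vecs[:, order] * np.sqrt(top_vals)

# Hypothetical input: one 20-dimensional summary profile per data set.
# Data sets 0-2 and 3-5 are simulated as two distinct groups.
rng = np.random.default_rng(1)
profiles = rng.normal(size=(6, 20))
profiles[3:] += 5.0

D = pairwise_distances(profiles)
coords = classical_mds(D, dims=2)
for i, (x, y) in enumerate(coords):
    print(f"data set {i}: ({x:.2f}, {y:.2f})")
```

The resulting two-dimensional coordinates could be drawn as a scatter plot in which each point is one data set. How the data sets are summarized and compared is the important design decision, and it is left open here.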

Other

Preliminary joint projects with several other groups.

PUBLICATIONS

2010

Abhishek Tripathi, Arto Klami, Matej Orešič, Samuel Kaski. Matching samples of multiple views. Accepted for Data Mining and Knowledge Discovery.
Leo Lahti, Juha E.A. Knuuttila, and Samuel Kaski. Global modeling of transcriptional responses in interaction networks. Bioinformatics (2010). (html; pdf; implementation in R and Matlab)
Ilkka Huopaniemi, Tommi Suvitaival, Matej Orešič, and Samuel Kaski. Graphical multi-way models. Accepted to ECML PKDD 2010.
Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Orešič, and Samuel Kaski. Multivariate multi-way analysis of multi-source data. Bioinformatics 2010 26: i391-i398 & ISMB 2010 (pdf)
José Caldas and Samuel Kaski. Hierarchical generative biclustering for microRNA expression analysis. In Proceedings of the 14th International Conference on Research in Computational Molecular Biology (RECOMB), 2010. (html)
Juuso Parkkinen and Samuel Kaski. Searching for functional gene modules with interaction component models. BMC Systems Biology 2010, 4:4. (html)

2009

Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Orešič, and Samuel Kaski. Multi-Way, Multi-View Learning. Talk in the NIPS 2009 workshop on Learning from Multiple Sources with Application to Robotics, December 12, Whistler, Canada. (pdf)
Simon Rogers, Arto Klami, Janne Sinkkonen, Mark Girolami, Samuel Kaski. Infinite Factorization of Multiple Non-parametric Views. Machine Learning 2009, DOI 10.1007/s10994-009-5155-1. (online)
Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Orešič, and Samuel Kaski. Two-way analysis of high-dimensional collinear data. In Data Mining and Knowledge Discovery 19(2):261-276, 2009 (pdf)
Leo Lahti, Samuel Myllykangas, Sakari Knuutila, and Samuel Kaski. Dependency detection with similarity constraints. In Proc. MLSP'09 IEEE International Workshop on Machine Learning for Signal Processing, to appear. (preprint pdf)
Leo Lahti, Laura L. Elo, Tero Aittokallio, and Samuel Kaski. Probabilistic analysis of probe reliability in differential gene expression studies with short oligonucleotide arrays. Accepted for publication in IEEE/ACM Transactions on Computational Biology and Bioinformatics. (abstract; pdf; preprint.pdf)
Mohamed Guled, Leo Lahti, Pamela M Lindholm, Kaisa Salmenkivi, Izhar Bagwan, Andrew G Nicholson and Sakari Knuutila. CDKN2A, NF2 and JUN Are Dysregulated Among Other Genes by miRNAs in Malignant Mesothelioma - a miRNA Microarray Analysis. Accepted for publication in Genes, Chromosomes and Cancer.
Hasan Ogul. Variable context Markov chains for HIV protease cleavage site prediction. Biosystems (in press)
José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma, and Samuel Kaski. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12): i145-i153, 2009. (html). See also: Software, Poster (best poster award at the 5th ISCB Student Council Symposium).
Farid Benachenhou, Patric Jern, Merja Oja, Göran Sperber, Vidar Blikstad, Panu Somervuo, Samuel Kaski, and Jonas Blomberg. Evolutionary conservation of orthoretroviral long terminal repeats (LTRs) and ab initio detection of single LTRs in genomic data. PLoS ONE, 4(4):e5179, 2009. (html/pdf)
Suvi Savola, Arto Klami, Abhishek Tripathi, Tarja Niini, Massimo Serra, Piero Picci, Samuel Kaski, Diana Zambelli, Katia Scotlandi and Sakari Knuutila. Combined use of expression and CGH arrays pinpoints novel candidate genes in Ewing sarcoma family of tumors. BMC Cancer 9:17, 2009. (html)
Abhishek Tripathi, Arto Klami and Samuel Kaski. Using Dependencies to Pair Samples for Multi-View Learning. In the IEEE 2009 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009), in press. (abstract, preprint pdf) (Runner-up Best Student Paper Award)

2008

José Caldas and Samuel Kaski. Bayesian biclustering with the plaid model. In proceedings of the IEEE International Workshop on Machine Learning for Signal Processing XVIII (MLSP), Cancún, Mexico, pages 291-296, 2008. (html)
Janne Nikkilä, Marko Sysi-Aho, Andrey Ermolov, Tuulikki Seppänen-Laakso, Olli Simell, Samuel Kaski and Matej Orešič. Gender-dependent progression of systemic metabolic states in early childhood. Molecular Systems Biology, 2008, 4:197. (html)
Abhishek Tripathi, Arto Klami and Samuel Kaski. Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics, 2008, 9:111. (html)

2007

Juuso Parkkinen and Samuel Kaski. Searching for functional gene modules with interaction component models. Presentation in the NIPS 2007 workshop on Machine Learning in Computational Biology, December 7, Whistler, Canada. (abstract, pdf extended abstract)
Merja Oja. Methods for Exploring Genomic Data Sets: Application to Human Endogenous Retroviruses. D.Sc. thesis. Dissertations in Computer and Information Science, Report D23. Espoo, Finland, 2007.
Abhishek Tripathi, Arto Klami, and Samuel Kaski. Linear Data Fusion with drCCA. Poster in the conference Machine Learning in Systems Biology 07 (MLSB), Evry, France, September 24-25, 2007. (abstract, poster)
Samuel Kaski, Juho Rousu, and Esko Ukkonen. Probabilistic modeling and machine learning in structural and systems biology; editorial of a special issue. BMC Bioinformatics, 8(Suppl 2):S1, 2007. (html)
Merja Oja, Jaakko Peltonen, Jonas Blomberg and Samuel Kaski. Methods for estimating human endogenous retrovirus activities from EST databases. BMC Bioinformatics, 8(Suppl 2):S11, 2007. (html)
Antti Ajanki, Janne Nikkilä, and Samuel Kaski. Discovering condition-dependent Bayesian networks for gene regulation. In proceedings of Fifth IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), Tuusula, Finland, 10-12 June, 2007. (abstract, pdf)
Penny Nymark, Pamela M Lindholm, Mikko V Korpela, Leo Lahti, Salla Ruosaari, Samuel Kaski, Jaakko Hollmén, Sisko Anttila, Vuokko L Kinnula, and Sakari Knuutila. Specific Gene Expression Profiles in Asbestos-exposed Epithelial and Mesothelial Lung Cell Lines. BMC Genomics 8:62, 2007. (abstract)
Jarkko Venna and Samuel Kaski. Comparison of visualization methods for an atlas of gene expression data sets. Information Visualization, 6:139-154, 2007. (abstract, preprint pdf)

2006

Merja Oja, Jaakko Peltonen, and Samuel Kaski. A hidden Markov model mixture for estimating human endogenous retrovirus activities from expressed sequence databases. Poster in the European Conference on Computational Biology 2006 (ECCB). Eilat, Israel, January 21-24, 2007. (abstract)
Juho Rousu, Samuel Kaski, and Esko Ukkonen, editors. Probabilistic Modeling and Machine Learning in Structural and Systems Biology. Workshop Proceedings; Tuusula, Finland, June 17-18. University of Helsinki, Helsinki, Finland, 2006.
Udo Seiffert, Barbara Hammer, Samuel Kaski, and Thomas Villmann. Neural networks and machine learning in bioinformatics - theory and applications. In Proceedings of ESANN'06, 14th European Symposium on Artificial Neural Networks, pages 521-532. d-side, Evere, Belgium, 2006. (Introductory article for a workshop in ESANN'06.)
Jarkko Venna and Samuel Kaski. Visualizing Gene Interaction Graphs with Local Multidimensional Scaling. In Proceedings of ESANN'06, 14th European Symposium on Artificial Neural Networks, pages 557-562. d-side, Evere, Belgium, 2006. (abstract, preprint pdf)
Merja Oja, Jaakko Peltonen, and Samuel Kaski. Estimation of human endogenous retrovirus activities from expressed sequence databases. In Juho Rousu, Samuel Kaski, and Esko Ukkonen, editors, Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB 2006), workshop proceedings, pages 50-54, Helsinki University Printing House, 2006. (abstract, gzipped postscript, pdf)
Janne Nikkilä, Antti Honkela, and Samuel Kaski. Exploring the independence of gene regulatory modules. In Juho Rousu, Samuel Kaski, and Esko Ukkonen, editors, Probabilistic Modeling and Machine Learning in Structural and Systems Biology (PMSB 2006), workshop proceedings, pages 131-136, Helsinki University Printing House, 2006. (pdf)

2005

Samuel Kaski and Janne Nikkilä. Of mice and men and yeast. CSC News, 4/2005. (pdf; a popular article on our dependency exploration studies in bioinformatics; note: the guy in the figure is neither of us but an editorial joke...)
Janne Nikkilä, Christophe Roos, Eerika Savia, and Samuel Kaski. Explorative modeling of yeast stress response and its regulation with gCCA and associative clustering. International Journal of Neural Systems, vol. 15, nr. 4, pages 237-246, 2005. (IJNS journal, pdf)
Laura L. Elo, Leo Lahti, Heli Skottman, Minna Kyläniemi, Riitta Lahesmaa and Tero Aittokallio. Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Research 2005 33(22):e193 (abstract, full text, PDF)
Janne Nikkilä. Exploratory Cluster Analysis of Genomic High-Throughput Data Sets and Their Dependencies. D.Sc. thesis. Dissertations in Computer and Information Science, Report D11. Espoo, Finland, 2005.
Merja Oja, Göran O. Sperber, Jonas Blomberg and Samuel Kaski. Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups. International Journal of Neural Systems Vol. 15, No. 3 (2005) 163-179. (abstract, pdf)
Samuel Kaski, Janne Nikkilä, Janne Sinkkonen, Leo Lahti, Juha Knuuttila, and Christophe Roos. Associative clustering for exploring dependencies between functional genomics data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Special Issue on Machine Learning for Bioinformatics -- Part 2, vol. 2, nr. 3, pages 203-216, 2005 (abstract, preprint pdf, ps, gzipped ps; the most thorough description of associative clustering, including three bioinformatics case studies)
Janne Sinkkonen, Samuel Kaski, Janne Nikkilä, and Leo Lahti. Associative Clustering (AC): Technical Details. Technical Report A84, Helsinki University of Technology, Publications in Computer and Information Science, Espoo, Finland, April 2005. (ps, pdf; accompanying report including only additional technical details and derivations of the method.)
Arto Klami, Janne Nikkilä, Christophe Roos, and Samuel Kaski. Extracting yeast stress genes by dependencies between stress treatments. Poster in the European Conference on Computational Biology 05 (ECCB), Madrid, Spain, September 28-October 1, 2005. (abstract, pdf (A4 size))
Samuel Kaski, Eerika Savia and Kai Puolamäki. Predicting binding of transcriptional regulators with a two-way latent grouping model. Poster in the 13th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) Detroit, Michigan, June 25-29 2005. (abstract, long abstract (pdf), poster (pdf))
Jarkko Venna and Samuel Kaski. Visualized atlas of a gene expression databank. In Proceedings of Symposium of Knowledge Representation in Bioinformatics. Espoo, Finland, 15.-17. June 2005. To appear. (abstract, pdf)
Janne Nikkilä, Christophe Roos, and Samuel Kaski. Integration of transcription factor binding and gene expression by associative clustering. In Bounsaythip, Hollmén, Kaski, Oresic (eds.): KRBIO05, Proceedings of Symposium of Knowledge Representation in Bioinformatics, pages 22-29, Espoo, Finland, 15.-17. June 2005, Otamedia Ltd. (pdf)
Arto Klami and Samuel Kaski. Non-parametric dependent components. In Proceedings of ICASSP'05, IEEE International Conference on Acoustics, Speech, and Signal Processing, pages V-209 - V-212, IEEE, 2005. (abstract, pdf; a generalization of canonical correlation analysis for non-Gaussian data, used for characterizing yeast stress.)
Samuel Kaski, Janne Nikkilä, Eerika Savia, and Christophe Roos. Discriminative clustering of yeast stress response. In Udo Seiffert, Lakhmi Jain, and Patric Schweizer, editors, Bioinformatics using Computational Intelligence Paradigms, pages 75-92. Springer, Berlin, 2005. (preprint abstract, preprint pdf; originally submitted in 2003)

2004

Merja Oja, Göran Sperber, Jonas Blomberg, and Samuel Kaski. Grouping and Visualizing Human Endogenous Retroviruses by Bootstrapping Median Self-organizing Maps. In Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. San Diego, California, USA, 7-8 October, pages 95-101. 2004. (preprint abstract, preprint pdf)
Samuel Kaski, Janne Nikkilä, and Christophe Roos. What do data sets have in common? The yeast stress case. Poster in ISMB/ECCB 2004. (abstract, postscript (A4 size), gzipped postscript, pdf)
Janne Sinkkonen, Janne Nikkilä, Leo Lahti, and Samuel Kaski. Associative Clustering. ECML 2004, Pisa, Italy. In: Boulicaut, Esposito, Giannotti, Pedreschi (eds.): Machine Learning: ECML 2004 (Proceedings of 15th European Conference on Machine Learning), Lecture Notes in Computer Science 3201, pages 396-406, 2004. (abstract, pdf)
Janne Nikkilä, Christophe Roos, and Samuel Kaski. Exploring dependencies between yeast stress genes and their regulators. In: Zheng Rong Yang, Richard Everson, and Hujun Yin, editors, Proceedings of International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2004), pp. 92-98, Springer, 2004. (abstract, postscript, gzipped postscript)
Leo Lahti, Juha Knuuttila, Janne Nikkilä, and Samuel Kaski. Exploring dependencies in the expression of orthologous man-mouse gene pairs. In Bioinformatics 2004, Linköping, Sweden, June 3-6, 2004. A poster. (abstract, postscript (A4 size), gzipped postscript (A4 size))
Merja Oja, Jonas Blomberg, and Samuel Kaski. Class discovery and visualization for human endogenous retroviruses by bootstrapping Median Self-organizing Maps. In Bioinformatics 2004, Linköping, Sweden, June 3-6, 2004. A poster. (abstract, postscript (A4 size), gzipped postscript)

2003

Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics, 4:48, 2003.
Janne Nikkilä, Christophe Roos, Janne Sinkkonen, and Samuel Kaski. Associative clustering to find dependencies between expression profiles and transcription factor binding. In Proceedings of ECCB2003, European Conference on Computational Biology, Paris, France, September 27-30, 2003, pages 433-434. A poster. (abstract)
Janne Sinkkonen, Janne Nikkilä, Leo Lahti and Samuel Kaski. Associative Clustering by Maximizing a Bayes Factor. Technical Report A68, Helsinki University of Technology, Publications in Computer and Information Science, Espoo, Finland, June 2003. (postscript, gzipped postscript) (Extension of DC to clustering of both margins of continuous co-occurrence data.)
Merja Oja, Panu Somervuo, Samuel Kaski, and Teuvo Kohonen. Clustering of human endogenous retrovirus sequences with median self-organizing map. In Workshop on Self-Organizing Maps, 11-14 September 2003. Hibikino, Japan. Proceedings on CD. (abstract, postscript, gzipped postscript)
Samuel Kaski and Jaakko Peltonen. Informative discriminant analysis. In Tom Fawcett and Nina Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pp. 329-336, AAAI Press, Menlo Park, CA, 2003. (abstract, postscript, gzipped postscript, pdf) (A generalization of linear discriminant analysis for data visualization.)

2002

Merja Oja, Janne Nikkilä, Petri Törönen, Eero Castrén, and Samuel Kaski. Learning metrics for visualizing gene functional similarities. In Pekka Ala-Siuru and Samuel Kaski, editors, STeP 2002 - Intelligence, The Art of Natural and Artificial. The 10th Finnish Artificial Intelligence Conference, Oulu, Finland 15-17 Dec. 2002, pages 31-40, 2002. (abstract, postscript, gzipped postscript)
Janne Nikkilä, Petri Törönen, Samuel Kaski, Jarkko Venna, Eero Castrén, and Garry Wong. Analysis and visualization of gene expression data using Self-Organizing Maps. Neural Networks, Special issue on New Developments on Self-Organizing Maps, vol. 15, issue 8-9, pages 953-966, 2002. (postscript (preprint), gzipped postscript (preprint))
Merja Oja, Petri Törönen, Janne Nikkilä, Eero Castrén, and Samuel Kaski. Learning metrics for SOM-based clustering and visualization of yeast gene expression data. In Bioinformatics 2002, Bergen, Norway, April 4-7, 2002. A poster. (abstract, postscript (A4 size), gzipped postscript (A4 size))
Merja Oja, Janne Nikkilä, Petri Törönen, Garry Wong, Eero Castrén, and Samuel Kaski. Exploratory clustering of gene expression profiles of mutated yeast strains. In Wei Zhang and Ilya Shmulevich, editors, Computational And Statistical Approaches To Genomics, pages 65-78. Kluwer Academic Publishers, 2002. (abstract, postscript, gzipped postscript)
Janne Sinkkonen and Samuel Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14:217-239, 2002. (abstract, postscript, gzipped postscript)

2001

Samuel Kaski, Janne Nikkilä, Petri Törönen, Eero Castrén, and Garry Wong. Analysis and visualization of gene expression data using self-organizing maps. In Proceedings of NSIP-01, IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, 2001. Proceedings on CD-ROM. (abstract, postscript, gzipped postscript)
Samuel Kaski. SOM-based exploratory analysis of gene expression data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 124-131. Springer, London, 2001. (abstract, postscript, gzipped postscript) (Invited talk, overlapping with the previous paper.)
Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä. Clustering gene expression data by mutual information with gene function. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks - ICANN 2001, pages 81-86. Springer, Berlin, 2001. (abstract, postscript, gzipped postscript) (Short version of the application in the 2002 Neural Computation paper)
Janne Nikkilä, Petri Törönen, Janne Sinkkonen, and Samuel Kaski. Analysis of gene expression data using semi-supervised clustering. In Proceedings of Bioinformatics 2001, Skövde, Sweden, March 29-April 1, pages 2-3. 2001. (abstract).

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
