Introduction

TreeSegment is a universal segmentation system that partitions a sequence of categorical data into homogeneous, non-overlapping regions according to a minimum-complexity description of the sequence segments by variable length Markov chains (VLMCs), also known as tree models and variable order Markov (VOM) models. TreeSegment achieves two data mining task at the same time: it discovers segment boundaries and obtains minimal context trees corresponding to those segments. Such context trees contain the probability distribution vectors that capture essential features of the corresponding segments. In order to find an optimal segmentation the algorithm uses the minimum description length (MDL) principle and two alternative scoring methods: the Bayesian Information Criterion (BIC) and the Krichevsky-Trofimov Probability (KT) that score segments in terms of minimal context trees corresponding to those segments. TreeSegment is one of the most advanced segmentation algorithms, which outperforms other segmentation algorithms using fixed length Markov chains (MCs) in terms of accuracy and is also capable of recognizing partition points in cases where the other algorithms fail. In extensive experiments that we conducted on DNA sequences we demonstrate that the method selects segments that are consistent with the annotated regions of the genes. Our results suggest that our method may be useful in aiding the process of gene discovery and annotation.

Example DNA segmentation

Figure 1: Segmentations obtained for Caenorhabditis elegans gene F33E11.3;top: annotated sequence;middle: segmentation obtained by BIC method;bottom: segmentation obtained by KT method. The numbers above each segment denote depths of the corresponding trees


Why do we need tree models in segmentation?

The short answer is as follows: comparing to MC segmentation methods (0-MC, 1-MC, etc.) by using tree models we can fit high order models to segments to maximize the likelihood, without being penalized for an exponential increase in the number of parameters (as in the case of ordinary MCs). Consider the following example sequence S of length 2n over the alphabet &Alpha={A, C, G, T} that is a concatenation of two segments S1 and S2 of length n each generated as follows (Figure 4):
  1. S2 was generated from a 2-VLMC model M 2(T2, &theta(T2)), where tree T2 has only one node at depth 2 and the remaining three nodes at depth 1
  2. S1 was generated from a 1-MC M1(T1, &theta(T1)), where T1=T2|1 and &theta(T1)= &theta(T2|1)), where T2|1 is a truncation of tree T2 upto depth 1. Thus, M1 and M2 are identical in terms of 1-MC and 0-MC.
Figure 4: Segmenting an example sequence generated by a 1-MC and a 2-VLMC
Since M1 and M2 are identical in terms of 1-MC and 0-MC and M2 has only one context at depth 2 then MC segmentation methods will select M1 as the underlying model for the whole sequence S to avoid being penalized for a full 2-MC for S2, i.e., for useless parameters corresponding to the non-existent contexts in the generating tree T2. Thus, a VLMC segmentation method will select T2 for S2 without paying for the the non-existent contexts and therefore it will properly select 2 segments with the partition point in the middle of S.

Manual

TreeSegment has been implemented in C++, includes only standard libraries, and is compatible with all UNIX/LINUX systems.

Installation

  1. Uncompress the archive by typing: tar xvfz tseg.tgz
  2. Go to the tseg directory and type:
    1. make clean
    2. make

Usage

NAME
        tseg - TreeSegment: optimal segmentation using tree models

USAGE
        tseg [options] [input-file]

DESCRIPTION
        tseg segments a file using tree models

OPTIONS
-M m: fitted model, where:
        model=vlmc: means fitting a VLMC according to each segment
        model=mc: means fitting a fixed-depth tree to each segment

-S s: scoring method, where:
        s=bic:  means using the Context algorithm to obtain a VLMC and BIC criterion to score the tree
        s=kt:   means using the original version of the CTM algorithm to obtain a VLMC that optimizes KT
                probability of the tree by considering a local decision: the parent node versus its children.
        s=kt1:  means using the modified version of the CTM algorithm to obtain a VLMC that optimizes KT
                probability of the tree by considering a local decision: the parent node versus all subsets
                of children. Note that this method was mainly developed for DNA where the alphabet size
                |A|=4. For |A| > 4 this method is inefficient (exponential in |A|) and the gain in segmentation
                accuracy may not be significant to justify using this modified method comparing to the original CTM
                algorithm (s=kt)

-D d: the depth of the tree

-K k: maximum number of segments. The algorithm selects the optimal number of segment from 1 to K

-Q q:   step size for computing the score of a segment only for segment boundaries that are a multiple of Q.
        The default value of q is 1 and by setting |q| > 1 one can achieve a O(1/q^{2}) speedup of the
        segmentation algorithm

-P p:   pseudocount for computing the BIC score (s=bic). It sets a pseudocount value for computing the likelihood of
        a segment given a tree. The default value is 0 and p can range from 0 to 1.

-A a: alphabet size

-V: (verbose) print trees corresponding to all segments

-fastaDNA: the input file is a DNA sequence in FASTA format

INPUT FILE FORMAT
        An ASCII file, where alphabet symbols as (integer alphabet indexes from 0 to |A|-1)
        are separated with the new line (carriage return) symbol.

RESULTS
        Array containing pairs {partition point, <depth of the corresponding tree>] of the form
        I[k']:[{p1, <d1>}, {p2, <d2>},  {pk', <dk'>}], where k' is the optimal number of partition points

EXAMPLE
        tseg -M vlmc -S kt -D 3 -K 20 -A 4 -P 0.5 -fastaDNA Y50D4C.3

Example output

> tseg -M vlmc -S bic -D 3 -K 20 -A 4 -V 1mc2vlmc

MININMAL SEGMENTATION

I[2]:[{0, <1>}, {1000, <2>}]  /*vector of partition points*/
SCORE: 3.15632e+03 /*total score*/

{0, 999} /*the first segment boundaries}
->[1 2 1]  /*initial memory and the corresponding context tree*/

/*Every node is represented as follows:
(label)-{Nn(c)| [Nn(c, 0), Nn(c, 1), ..., Nn(c, A-1)]},
where c is the context corresponding to the node and Nn(c) and Nn(c,a) is the number of occurences of
strings c and ca respectively in a sequence of length n. Thus, P(a|c)=Nn(c, a)/N(c).
*/

<0>   (-1)-{997 | [558, 222, 92, 125]} /*root has alwas label -1*/
<1>     +-->(1)-{223 | [112, 34, 28, 49]}-T /*terminal node corresponding to context 1*/
<1>     +-->(0, 3)-{682 | [424, 159, 50, 49]}-V /*virtual node corresponding to context {0, 3}*/
<1>     `-->(2)-{92 | [22, 29, 14, 27]}-T

{1000, 1999}/*the second segment boundaries*/
->[3 0 1] /*initial memory and the corresponding context tree*/
<0>   (-1)-{997 | [561, 222, 102, 112]}-F /*full node*/
<1>     +-->(3)-{112 | [77, 22, 11, 2]}-T
<1>     +-->(0)-{561 | [346, 137, 39, 39]}
<2>     |    +-->(3, 2)-{98 | [47, 33, 16, 2]}-V
<2>     |    +-->(0)-{346 | [227, 103, 13, 3]}-T
<2>     |    `-->(1)-{117 | [72, 1, 10, 34]}-T
<1>     +-->(1)-{222 | [117, 32, 30, 43]}
<2>     |    +-->(0, 3, 2)-{190 | [89, 32, 26, 43]}-V /*virtual node corresponding to context 1{0, 3}*/
<2>     |    `-->(1)-{32 | [28, 0, 4, 0]}-T
<1>     `-->(2)-{102 | [21, 31, 22, 28]}-T

Download

TreeSegment system version 24-03-2007

Publications

Robert Gwadera, Aristides Gionis and Heikki Mannila, Optimal Segmentation using Tree Models , Knowledge and Information Systems (KAIS), Volume 15, Number 3, Pages: 259-283, June 2008.