TreeSegment is a universal segmentation system that partitions a sequence of categorical data into homogeneous, non-overlapping regions according to a minimum-complexity description of the sequence segments by variable length Markov chains (VLMCs), also known as tree models and variable order Markov (VOM) models. TreeSegment achieves two data mining task at the same time: it discovers segment boundaries and obtains minimal context trees corresponding to those segments. Such context trees contain the probability distribution vectors that capture essential features of the corresponding segments. In order to find an optimal segmentation the algorithm uses the minimum description length (MDL) principle and two alternative scoring methods: the Bayesian Information Criterion (BIC) and the Krichevsky-Trofimov Probability (KT) that score segments in terms of minimal context trees corresponding to those segments. TreeSegment is one of the most advanced segmentation algorithms, which outperforms other segmentation algorithms using fixed length Markov chains (MCs) in terms of accuracy and is also capable of recognizing partition points in cases where the other algorithms fail. In extensive experiments that we conducted on DNA sequences we demonstrate that the method selects segments that are consistent with the annotated regions of the genes. Our results suggest that our method may be useful in aiding the process of gene discovery and annotation.
|
NAME
tseg - TreeSegment: optimal segmentation using tree models
USAGE
tseg [options] [input-file]
DESCRIPTION
tseg segments a file using tree models
OPTIONS
-M m: fitted model, where:
model=vlmc: means fitting a VLMC according to each segment
model=mc: means fitting a fixed-depth tree to each segment
-S s: scoring method, where:
s=bic: means using the Context algorithm to obtain a VLMC and BIC criterion to score the tree
s=kt: means using the original version of the CTM algorithm to obtain a VLMC that optimizes KT
probability of the tree by considering a local decision: the parent node versus its children.
s=kt1: means using the modified version of the CTM algorithm to obtain a VLMC that optimizes KT
probability of the tree by considering a local decision: the parent node versus all subsets
of children. Note that this method was mainly developed for DNA where the alphabet size
|A|=4. For |A| > 4 this method is inefficient (exponential in |A|) and the gain in segmentation
accuracy may not be significant to justify using this modified method comparing to the original CTM
algorithm (s=kt)
-D d: the depth of the tree
-K k: maximum number of segments. The algorithm selects the optimal number of segment from 1 to K
-Q q: step size for computing the score of a segment only for segment boundaries that are a multiple of Q.
The default value of q is 1 and by setting |q| > 1 one can achieve a O(1/q^{2}) speedup of the
segmentation algorithm
-P p: pseudocount for computing the BIC score (s=bic). It sets a pseudocount value for computing the likelihood of
a segment given a tree. The default value is 0 and p can range from 0 to 1.
-A a: alphabet size
-V: (verbose) print trees corresponding to all segments
-fastaDNA: the input file is a DNA sequence in FASTA format
INPUT FILE FORMAT
An ASCII file, where alphabet symbols as (integer alphabet indexes from 0 to |A|-1)
are separated with the new line (carriage return) symbol.
RESULTS
Array containing pairs {partition point, <depth of the corresponding tree>] of the form
I[k']:[{p1, <d1>}, {p2, <d2>}, {pk', <dk'>}], where k' is the optimal number of partition points
EXAMPLE
tseg -M vlmc -S kt -D 3 -K 20 -A 4 -P 0.5 -fastaDNA Y50D4C.3
> tseg -M vlmc -S bic -D 3 -K 20 -A 4 -V 1mc2vlmc
MININMAL SEGMENTATION
I[2]:[{0, <1>}, {1000, <2>}] /*vector of partition points*/
SCORE: 3.15632e+03 /*total score*/
{0, 999} /*the first segment boundaries}
->[1 2 1] /*initial memory and the corresponding context tree*/
/*Every node is represented as follows:
(label)-{Nn(c)| [Nn(c, 0), Nn(c, 1), ..., Nn(c, A-1)]},
where c is the context corresponding to the node and Nn(c) and Nn(c,a) is the number of occurences of
strings c and ca respectively in a sequence of length n. Thus, P(a|c)=Nn(c, a)/N(c).
*/
<0> (-1)-{997 | [558, 222, 92, 125]} /*root has alwas label -1*/
<1> +-->(1)-{223 | [112, 34, 28, 49]}-T /*terminal node corresponding to context 1*/
<1> +-->(0, 3)-{682 | [424, 159, 50, 49]}-V /*virtual node corresponding to context {0, 3}*/
<1> `-->(2)-{92 | [22, 29, 14, 27]}-T
{1000, 1999}/*the second segment boundaries*/
->[3 0 1] /*initial memory and the corresponding context tree*/
<0> (-1)-{997 | [561, 222, 102, 112]}-F /*full node*/
<1> +-->(3)-{112 | [77, 22, 11, 2]}-T
<1> +-->(0)-{561 | [346, 137, 39, 39]}
<2> | +-->(3, 2)-{98 | [47, 33, 16, 2]}-V
<2> | +-->(0)-{346 | [227, 103, 13, 3]}-T
<2> | `-->(1)-{117 | [72, 1, 10, 34]}-T
<1> +-->(1)-{222 | [117, 32, 30, 43]}
<2> | +-->(0, 3, 2)-{190 | [89, 32, 26, 43]}-V /*virtual node corresponding to context 1{0, 3}*/
<2> | `-->(1)-{32 | [28, 0, 4, 0]}-T
<1> `-->(2)-{102 | [21, 31, 22, 28]}-T
TreeSegment system version 24-03-2007
Robert Gwadera, Aristides Gionis and Heikki Mannila, Optimal Segmentation using Tree Models , Knowledge and Information Systems (KAIS), Volume 15, Number 3, Pages: 259-283, June 2008.