MultiStreamSegment is a universal multi-stream segmentation system for segmenting synchronized multi-stream sequences also known as multi-attribute sequences. MultiStreamSegment describes the multi-sequences in probabilistic terms using Dynamic Bayesian Networks (DBN). It achives two data mining tasks at once: (I) it discovers the segment boundaries; and (II) it discovers probabilistic dependencies in multi-sequences by fitting an optimal DBN for each segment. It uses the Bayesian Information Criterion (BIC) and a variant of the Minimum Description Length (MDL) Principle to select an optimal DBN structure and the number of segments of the multi-sequence.
|
NAME
mseg - MultiStreamSegment: optimal segmentation using bayesian networks
USAGE
mseg [options] [input-file]
DESCRIPTION
mseg segments a file using a bayesian network
OPTIONS
-S s: scoring method, where:
s=bic:
s=kt:
-K k: maximum number of segments. The algorithm selects the optimal number of segment from 1 to K
-Q q: step size for computing the score of a segment only for segment boundaries that are a multiple of Q.
The default value of Q is 1 and by setting |Q| > 1 one can achieve a O(1/Q^{2}) speedup of the
segmentation algorithm
-P p: pseudocount for computing the BIC score (s=bic). It sets a pseudocount value for computing the likelihood of
a segment given a tree. The default value is 0 and p can range from 0 to 1.
-A a: alphabet size
-I i: number of streams.
-H h: considers the first h symbols of the input file.
-V: (verbose) print trees corresponding to all segments.
INPUT FILE FORMAT
An ASCII file, where alphabet symbols as (integer alphabet indexes from 0 to |A|-1)
are separated with the new line (carriage return) symbol.
RESULTS
Array containing pairs {partition point, ] of the form
I[k']:[{p1, }, {p2, }, {pk', }], where k' is the optimal number of partition points
EXAMPLE
mseg -S bic -K 20 -A 2 -I 3 -Q 100 y.gen
> mseg -S bic -K 20 -A 2 -I 3 -Q 100 y.gen MININMAL SEGMENTATION
MultiStreamSegment system version 8-10-2008
Robert Gwadera, Janne Toivola and Jaakko Hollmen, Segmenting Multi-Atrribute Sequences using Dynamic Bayesian Networks , The 1st Workshop on Data Mining of Uncertain Data in conjunction with The 2007 IEEE International Conference on Data Mining (ICDM-07), 465-470, Omaha Nebraska, October 2007.