Introduction

MultiStreamSegment is a universal multi-stream segmentation system for segmenting synchronized multi-stream sequences also known as multi-attribute sequences. MultiStreamSegment describes the multi-sequences in probabilistic terms using Dynamic Bayesian Networks (DBN). It achives two data mining tasks at once: (I) it discovers the segment boundaries; and (II) it discovers probabilistic dependencies in multi-sequences by fitting an optimal DBN for each segment. It uses the Bayesian Information Criterion (BIC) and a variant of the Minimum Description Length (MDL) Principle to select an optimal DBN structure and the number of segments of the multi-sequence.

Example segmentation

Figure 1: Segmentations of a 10-attribute sequence


Manual

MultiStreamSegment has been implemented in C++, includes only standard libraries, and is compatible with all UNIX/LINUX systems.

Installation

  1. Uncompress the archive by typing: tar xvfz tseg.tgz
  2. Go to the tseg directory and type:
    1. make clean
    2. make

Usage


NAME
        mseg - MultiStreamSegment: optimal segmentation using bayesian networks

USAGE
        mseg [options] [input-file]

DESCRIPTION
        mseg segments a file using a bayesian network

OPTIONS

-S s: scoring method, where:
        s=bic:
        s=kt:

-K k: maximum number of segments. The algorithm selects the optimal number of segment from 1 to K

-Q q:   step size for computing the score of a segment only for segment boundaries that are a multiple of Q.
        The default value of Q is 1 and by setting |Q| > 1 one can achieve a O(1/Q^{2}) speedup of the
        segmentation algorithm

-P p:   pseudocount for computing the BIC score (s=bic). It sets a pseudocount value for computing the likelihood of
        a segment given a tree. The default value is 0 and p can range from 0 to 1.

-A a: alphabet size

-I i: number of streams.

-H h: considers the first h symbols of the input file.

-V: (verbose) print trees corresponding to all segments.



INPUT FILE FORMAT
        An ASCII file, where alphabet symbols as (integer alphabet indexes from 0 to |A|-1)
        are separated with the new line (carriage return) symbol.

RESULTS
        Array containing pairs {partition point, ] of the form
        I[k']:[{p1, }, {p2, },  {pk', }], where k' is the optimal number of partition points

EXAMPLE
        mseg -S bic -K 20 -A 2 -I 3 -Q 100  y.gen

Example output

>  mseg -S bic -K 20 -A 2 -I 3 -Q 100  y.gen

MININMAL SEGMENTATION


Download

MultiStreamSegment system version 8-10-2008

Publications

Robert Gwadera, Janne Toivola and Jaakko Hollmen, Segmenting Multi-Atrribute Sequences using Dynamic Bayesian Networks , The 1st Workshop on Data Mining of Uncertain Data in conjunction with The 2007 IEEE International Conference on Data Mining (ICDM-07), 465-470, Omaha Nebraska, October 2007.