T-61.5020 Statistical Natural Language Processing
Exercises 1 -- Basics of probability calculus
Version 1.0
Let us consider a stemming program for Finnish. Using the context, it can deduce whether the stem of the word-form ``siitä'' is ``se'' (it) or ``siittää'' (conceive). For a word-form derived from ``se'', the program determines the correct stem with probability 0.95, and the same holds for word-forms of ``siittää''. Because the stem ``se'' is much more common, only one occurrence of ``siitä'' in a thousand should actually be stemmed to ``siittää''.
The program reports that for an observed word-form ``siitä'', the corresponding stem is ``siittää''. What is the probability that the program is correct?
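Hint (the notation below is introduced here, not part of the original statement): the problem can be set up with Bayes' rule. Writing $S$ for the event that the true stem is ``siittää'' and $O$ for the event that the program outputs ``siittää'', the quantity asked for is
\begin{align*}
P(S \mid O) &= \frac{P(O \mid S)\, P(S)}{P(O)} \\
            &= \frac{P(O \mid S)\, P(S)}{P(O \mid S)\, P(S) + P(O \mid \bar{S})\, P(\bar{S})},
\end{align*}
where the problem statement gives $P(S) = 0.001$, $P(O \mid S) = 0.95$ and $P(O \mid \bar{S}) = 0.05$.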
So-called Zipf's law is often referred to when simple statistics are calculated from a language. Let the words be tabulated so that the most common word comes first (rank $r = 1$), and the rest follow in order of frequency ($r = 2, 3, \ldots$). Next to each word is written how many times it occurred in the text (frequency $f$). Zipf's law asserts that the product of rank and frequency is approximately constant:
\[
f \cdot r \approx \mathrm{const}, \qquad \text{i.e.} \qquad f \propto \frac{1}{r}.
\]
Does this apply to a randomly generated language that has 30 letters, one of them being the word boundary?
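This is not required for the exercise, but the claim is easy to probe numerically. The following Python sketch (the symbol set and text length are arbitrary choices of ours) generates such a random language, with all 30 symbols equally likely, and tabulates $f \cdot r$ for the most frequent words:

import random
from collections import Counter

random.seed(0)

# 29 letters plus ' ' as the word boundary give a 30-symbol alphabet;
# each symbol in the text is drawn independently and uniformly.
letters = "abcdefghijklmnopqrstuvwxyzåäö"   # 29 letters
symbols = letters + " "                     # 30th symbol: word boundary

text = "".join(random.choice(symbols) for _ in range(1_000_000))
# Consecutive boundaries produce empty strings; drop them.
counts = Counter(w for w in text.split(" ") if w)

# Tabulate words by descending frequency and inspect rank * frequency.
for r, (word, f) in enumerate(counts.most_common(20), start=1):
    print(f"r={r:2d}  word={word!r}  f={f:6d}  f*r={f * r}")

Inspecting the printed $f \cdot r$ column for different ranks lets one judge how well the random language follows the law.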
Ideal MDL is of little practical use, since it has been shown that it is impossible to design an algorithm that finds the shortest possible Turing machine. Therefore several variations exist, of which one quite common variant is the so-called two-part coding scheme.
In the two-part coding scheme we first select a model class that can describe some data given the model parameters $\theta$. The goal is to describe and send a set of data, $D$, that is assumed to be generated by a model of the chosen class, with the minimum number of bits possible. As the receiver does not know the parameters that we choose, they must be sent as well. Denote by $L(\theta)$ the description length needed to encode the parameters, and by $L(D \mid \theta)$ the description length needed to encode the data when the parameters are known. Thus we need to minimize the total code length $L(\theta) + L(D \mid \theta)$.
In statistical modeling, the coding of the data given the model corresponds to the probability distribution $P(D \mid \theta)$, and the coding of the parameters to the distribution $P(\theta)$. From information theory we know that if the probability of a message is $p$, the minimum code length for the message is $-\log_2 p$ bits. Show that the optimal selection of the parameters in the two-part coding scheme is equivalent to Maximum A Posteriori (MAP) estimation.
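Hint (the intermediate step below is our own phrasing, not part of the original statement): substituting the minimum code lengths into the total length gives
\begin{align*}
\hat{\theta} &= \arg\min_\theta \left[ L(\theta) + L(D \mid \theta) \right] \\
             &= \arg\min_\theta \left[ -\log_2 P(\theta) - \log_2 P(D \mid \theta) \right],
\end{align*}
which should be compared with the definition of the MAP estimate, $\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid D)$.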