T-61.5020 Statistical Natural Language Processing
Answers 1 -- Basics of probability calculus
Version 1.0
First of the given probabilities, , tells that is we see a word of three letters, the probability that it is an abbreviation is 0.8, and 0.2 for something else. Next one, , tells that the probability for a random word being exactly three letters long is .
The probability for a random word being three letter abbreviation
is get by product of the given probabilities. First we look how
probable it is for a word to be three letters long, then how
probable it is to be abbreviation when being three letters:
To answer the given question, we need the Bayes' theorem:
To generate a one letter word, the given random language should generate two symbols, i.e. word boundary after something else. The probability for this is
Respectively, the probability of a word of two letters is
As the probability of the word is directly proportional to its expected incidence in the test data, we can make a table similar to the table 1.3 in the book by directly calculating probabilities. As words of same length have equal probability, and they cannot be sorted by frequency, we count the value for only one word per length. The results are presented in table 1 and drawn to Figure 1.
15 | 1111 | 16111 |
450 | 37.04 | 16648 |
13064 | 1.235 | 16129 |
378900 | 0.0412 | 15593 |
1098800 | 0.00137 | 15073 |
318660000 | 0.0000457 | 14570 |
Expectation value:
Variance:
Now we can use formula
Expectation value of the sum of independent random variables
Variance of a random variable multiplied by a constant
Variance of the sum of independent random variables
Now we have all the needed formulas. We want to calculate expectation value for the sum , where is the random variable corresponding to the first throw and to the second.
The expectation value and variance do not tell everything about the distribution. In figure 2 there are results for varying number of dice tosses simulated using Matlab. The shape of the distribution moves nearer to the normal (gaussian) distribution as the number of dices grow. This is why natural phenomena are often modelled using normal distribution: If many small random events affect to the result, it will be normal distributed. This is also is good excuse for transforming calculations to easier forms.
More formal proof for that the distribution will approach normal is found from http:// mathworld.wolfram.com/CentralLimitTheorem.html
The goal is to minimize the total code length