T-61.5020 Statistical Natural Language Processing
Answers 8 -- N-gram language models 
Version 1.0
[Table: probability estimates for the sentence continuations, as given by the test person and by corpus statistics.]
We can see that the estimates given by a human are somewhat off, and those given by the statistics are badly off.
The statistics were estimated from a corpus of 30 million words. In that corpus, none of the trigrams occurred even once. If the words were stemmed, 11 sentences containing the trigram ``tuntua jo hyvä'' were found. The estimates badly need smoothing, and they will not be very good even after that.
The estimates given by our human subject can also be doubted, as some quite possible sentences get a probability of zero. One example might be ``Kyllä alkaa tuntumaan jo kumisaapas jalassa'' (free translation: ``Now the rubber boots are beginning to be felt on my feet''), something that one might say after a long walk.
When the full sentence was given to the test person, we got quite good estimates (third column). To get even close with a computer, the language model would need to understand the grammatical syntax of Finnish as well as the semantic meanings of words (e.g. that February is at the end of the winter season).
In the following, $C(\cdot)$ tells the number of occurrences in the training set.
In the unigram estimates we forget everything about the previous words, the bigram estimates depend on just one previous word, and the trigram estimates use a history of two words.
For unigrams, the maximum likelihood estimate is
$$P(w) = \frac{C(w)}{N},$$
where $N$ is the number of samples in the training set. These estimates are independent of the word histories.
[Table: maximum likelihood unigram estimates.]
In the bigram estimates we use the previous word, i.e.
$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}.$$
[Table: maximum likelihood bigram estimates.]
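As an illustration, the following Python sketch collects the counts and computes these maximum likelihood estimates. The toy corpus and the example words are placeholders, not the data of the exercise.

```python
from collections import Counter

def ml_estimates(tokens):
    """Maximum likelihood unigram and bigram estimates from a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def p_unigram(w):
        # P(w) = C(w) / N
        return unigram_counts[w] / n

    def p_bigram(w, prev):
        # P(w | prev) = C(prev w) / C(prev); an unseen history gets probability 0
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    return p_unigram, p_bigram

# Toy usage with a placeholder corpus
tokens = "kyllä se alkaa tuntua jo hyvältä se alkaa tuntua".split()
p_uni, p_bi = ml_estimates(tokens)
print(p_uni("alkaa"))           # C(alkaa) / N
print(p_bi("tuntua", "alkaa"))  # C(alkaa tuntua) / C(alkaa)
```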
In add-one smoothing, one occurrence is added to each count:
$$P(w) = \frac{C(w) + 1}{N + V},$$
where $V$ is the size of the vocabulary.
So we get the following estimates for unigrams:
    
[Table: add-one smoothed unigram estimates.]
[Table: add-one smoothed bigram estimates.]
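A small sketch of how the add-one estimates could be computed, assuming the add-one formula reconstructed above; the bigram version is taken to be the analogous $(C(h\,w)+1)/(C(h)+V)$. The vocabulary size and the toy corpus are placeholders.

```python
from collections import Counter

def add_one_estimates(tokens, vocab_size):
    """Add-one smoothed estimates:
    P(w) = (C(w) + 1) / (N + V),  P(w | h) = (C(h w) + 1) / (C(h) + V)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def p_unigram(w):
        return (unigram_counts[w] + 1) / (n + vocab_size)

    def p_bigram(w, h):
        return (bigram_counts[(h, w)] + 1) / (unigram_counts[h] + vocab_size)

    return p_unigram, p_bigram

# Placeholder usage: even an unseen word such as "gorilla" now gets a non-zero probability
p_uni, p_bi = add_one_estimates("kyllä se alkaa tuntua jo hyvältä".split(),
                                vocab_size=50000)
print(p_uni("gorilla"))   # (0 + 1) / (6 + 50000)
```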
More generally, we can pretend that every word has been seen $\varepsilon$ times before the training corpus:
$$P(w) = \frac{C(w) + \varepsilon}{N + \varepsilon V}.$$
The estimates are:
[Table: unigram estimates with the additive constant $\varepsilon$.]
[Table: bigram estimates with the additive constant $\varepsilon$.]
A good value for the constant $\varepsilon$ can be found by taking part of the 
    training data out of the actual training and using it to optimize
    the parameter.
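The held-out tuning could look roughly like the following sketch: candidate values of the additive constant (here called eps; the symbol is assumed) are compared by the log-likelihood they give to a held-out set. The data and candidate values are placeholders.

```python
import math
from collections import Counter

def additive_unigram(train_tokens, vocab_size, eps):
    """Additive smoothing: P(w) = (C(w) + eps) / (N + eps * V)."""
    counts = Counter(train_tokens)
    n = len(train_tokens)
    return lambda w: (counts[w] + eps) / (n + eps * vocab_size)

def tune_eps(train_tokens, heldout_tokens, vocab_size, candidates):
    """Choose the additive constant that gives the highest held-out log-likelihood."""
    best_eps, best_ll = None, float("-inf")
    for eps in candidates:
        p = additive_unigram(train_tokens, vocab_size, eps)
        ll = sum(math.log(p(w)) for w in heldout_tokens)
        if ll > best_ll:
            best_eps, best_ll = eps, ll
    return best_eps

# Placeholder data; in practice the held-out part is split off from the training corpus
train = "kyllä se alkaa tuntua jo hyvältä se alkaa".split()
heldout = "se tuntua hyvältä".split()
print(tune_eps(train, heldout, vocab_size=50000, candidates=[0.01, 0.1, 0.5, 1.0]))
```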
In absolute discounting, a small constant $D$ is subtracted from every non-zero count. The discounting can naturally 
  be done only if there were some occurrences. When the discounting is 
  done to the bigram estimates, some probability mass is left over, and 
  it can be used as an interpolation weight (or alternatively as a back-off 
  weight) with the unigram probabilities. Thus, combining absolute 
  discounting and interpolation, we get the following bigram estimate:
$$P(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1} w_i) - D,\, 0\big)}{C(w_{i-1})} + \gamma(w_{i-1})\, P(w_i),$$
where $\gamma(w_{i-1})$ is a coefficient that normalizes the distribution 
  to sum up to one, and it depends on the bigram history. Its value can be 
  estimated from how many different following words the history had:
$$\gamma(w_{i-1}) = \frac{D \cdot N_{1+}(w_{i-1}\,\bullet)}{C(w_{i-1})},$$
where $N_{1+}(w_{i-1}\,\bullet)$ is the number of different words that have followed $w_{i-1}$ in the training data.
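The following sketch implements this interpolated bigram estimate with absolute discounting. The discount value and the toy counts in the usage example are hypothetical, not those of the exercise.

```python
from collections import Counter

def discounted_bigram(tokens, discount):
    """Interpolated bigram with absolute discounting:
    P(w | h) = max(C(h w) - D, 0) / C(h) + gamma(h) * P(w),
    gamma(h) = D * (number of different words seen after h) / C(h)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    followers = {}
    for h, w in bigram_counts:
        followers.setdefault(h, set()).add(w)
    n = len(tokens)

    def p_unigram(w):
        return unigram_counts[w] / n

    def gamma(h):
        return discount * len(followers.get(h, ())) / unigram_counts[h]

    def p_bigram(w, h):
        if unigram_counts[h] == 0:
            return p_unigram(w)   # unseen history: fall back to the unigram alone
        discounted = max(bigram_counts[(h, w)] - discount, 0.0) / unigram_counts[h]
        return discounted + gamma(h) * p_unigram(w)

    return p_bigram, gamma

# Hypothetical toy data and discount D = 0.5: here "olla" occurs 5 times with
# 4 distinct followers, so gamma(olla) = D * 4/5.
tokens = "olla voi olla hyvä olla siellä olla tai olla".split()
p_bi, gamma = discounted_bigram(tokens, discount=0.5)
print(gamma("olla"), p_bi("hyvä", "olla"))
```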
Let's start by calculating the interpolation coefficients for the 
  histories ``olla'' and ``vaikuttaa''. The first one had five different 
  right contexts (following words) and it has occurred five times. 
The second one has occurred only once, and thus has had only one right context.
$$\gamma(\text{olla}) = \frac{D \cdot 5}{5} = D, \qquad \gamma(\text{vaikuttaa}) = \frac{D \cdot 1}{1} = D$$
Now we can estimate the interpolated bigram probabilities:
    
[Table: interpolated bigram estimates.]
Notice that the method does not give any probability mass to words that did not occur in the training data, such as ``gorilla''. This can be fixed by using some additional smoothing method for the unigram distribution. Another alternative would be to continue the interpolation with a uniform distribution (sometimes called ``zero-gram'').
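One possible way to continue the interpolation down to a uniform ``zero-gram'' distribution is sketched below, assuming the unigram counts are also absolutely discounted; the vocabulary size and the discount are placeholders.

```python
from collections import Counter

def discounted_unigram_with_uniform(tokens, vocab_size, discount):
    """Unigram estimates discounted by D; the freed mass goes to a uniform 1/V
    distribution, so that unseen words (e.g. "gorilla") get a non-zero probability."""
    counts = Counter(tokens)
    n = len(tokens)
    gamma = discount * len(counts) / n   # mass freed by discounting every seen word

    def p(w):
        return max(counts[w] - discount, 0.0) / n + gamma / vocab_size

    return p

# Placeholder vocabulary size and discount
p = discounted_unigram_with_uniform("kyllä se alkaa tuntua jo hyvältä".split(),
                                    vocab_size=50000, discount=0.5)
print(p("gorilla"))   # now non-zero
```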
Interpolation or back-off to lower order n-grams is not always 
  very reliable. As an example, think about the bigram 
  ``San Francisco''. As the name is quite commonly used, the estimate 
  $P(\text{Francisco} \mid \text{San})$ would be very reliable.
  So if the previous word was ``San'', there is no problem. But what if 
  it wasn't? In that case, the probability of ``Francisco'' should be 
  clearly lower than what the unigram distribution, which knows nothing 
  about the previous word, would give. Based on this idea, Kneser-Ney 
  smoothing estimates the lower order distributions by counting 
  in how many different contexts the word has occurred.
  The current state-of-the-art smoothing technique is modified Kneser-Ney
  interpolation, which applies these kinds of type counts and absolute 
  discounting with three separately optimized discounts.
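A minimal sketch of the Kneser-Ney idea for the lower order distribution: the unigram probability of a word is based on the number of different words it has been seen to follow, so that a word like ``Francisco'', which practically always follows ``San'', gets a low continuation probability. The toy corpus is a placeholder.

```python
from collections import Counter

def continuation_unigram(tokens):
    """Kneser-Ney style lower order distribution:
    P_cont(w) = |{v : C(v w) > 0}| / |{(v, w') : C(v w') > 0}|."""
    bigram_types = set(zip(tokens, tokens[1:]))           # distinct bigrams
    left_contexts = Counter(w for _, w in bigram_types)   # distinct predecessors of w
    total_types = len(bigram_types)
    return lambda w: left_contexts[w] / total_types

# Placeholder corpus: "Francisco" is frequent but only ever follows "San",
# so its continuation probability stays low compared to its raw frequency.
tokens = ("San Francisco on kaupunki ja San Francisco on suuri ja "
          "kaupunki on suuri").split()
p_cont = continuation_unigram(tokens)
print(p_cont("Francisco"), p_cont("on"))
```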
[Table: model probabilities for the words of the test data.]
Now we can compute the sum of the logarithmic probabilities:
[Calculation of the sum of the logarithmic probabilities.]
Inserting the result into the expression for perplexity:
$$\text{Perplexity} = P(w_1 \ldots w_N)^{-1/N} = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid h_i)\Big)$$
[Numeric value of the perplexity for the test data.]
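A sketch of the perplexity computation described above: the logarithmic probabilities of the test words are summed and plugged into the perplexity expression. The probability values are placeholders, not those of the exercise table.

```python
import math

def perplexity(word_probabilities):
    """Perplexity = exp(-(1/N) * sum of ln P(w_i | history))."""
    n = len(word_probabilities)
    log_prob_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_prob_sum / n)

# Placeholder probabilities for the words of a test sentence
print(perplexity([0.1, 0.02, 0.005, 0.2]))
```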