T-61.5020 Statistical Natural Language Processing
Answers 7 -- Word sense disambiguation
Version 1.0
Let's start from Bayes' theorem, choose the nearest 10 words as the context, and finally write the expression open.
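Sketched in standard naive Bayes notation, with s_k denoting a sense, c the context and w_1, ..., w_10 the context words (the symbols are chosen here for illustration and need not match the original answer sheet), the derivation is:

% Sketch of the derivation in standard notation (symbols assumed).
\begin{align*}
  % Bayes' theorem for a sense s_k given the context c:
  P(s_k \mid c) &= \frac{P(c \mid s_k)\,P(s_k)}{P(c)} \\
  % The context is the window of the nearest 10 words:
  c &= (w_1, w_2, \ldots, w_{10}) \\
  % Assume the context words are independent given the sense and drop
  % P(c), which is the same for every sense:
  \hat{s} &= \arg\max_{s_k} \; P(s_k) \prod_{j=1}^{10} P(w_j \mid s_k)
\end{align*}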
None of the approximations used is entirely correct; the crudest one is probably the assumption that the context words are independent of each other. However, these approximations give a method that is easy to apply in practice.
We need two estimates: the probability P(w_j | s_k) that a word w_j occurs in the context of the sense s_k, and the prior probability P(s_k). As we have an equal number of occurrences for the senses sataa=rain and sataa=number, we can simply set the prior to P(s_k) = 1/2 for both senses.
Maximum likelihood (ML) estimation is applied in the course book.
In our problem we were asked to use priors, so let's define a small prior according to which all words are equally probable, and add it to the maximum likelihood estimators with some coefficient.
This can be thought of as if every known word had already occurred 0.5 times in both context types.
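Written out as a formula, this pseudo-count view corresponds to the following smoothed estimator, where C(w, s_k) denotes the count of word w in contexts of sense s_k, |V| the vocabulary size and epsilon the coefficient; the notation and the exact parametrization are assumptions here:

% Add-0.5 (Lidstone-style) smoothing of the ML estimate.
\[
  \hat{P}(w \mid s_k)
    = \frac{C(w, s_k) + \varepsilon}{\sum_{w'} C(w', s_k) + \varepsilon\,|V|},
  \qquad \varepsilon = 0.5 .
\]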
A large coefficient emphasises the prior, and thus a small amount of evidence from the training set does not change the estimate much.
The same calculations are carried out for the number sense.
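As a minimal sketch of the whole supervised disambiguator in Python: the training contexts below are toy data made up for illustration (the actual exercise used Finnish contexts of ``sataa''), and the function names are not from the original solution.

from collections import Counter
import math

# Toy training data: (sense, context words) pairs, made up for illustration.
TRAIN = [
    ("rain",   ["vettä", "taivaalta", "pilvi", "märkä"]),
    ("rain",   ["lunta", "räntää", "ulkona", "kylmä"]),
    ("number", ["kaksi", "euroa", "maksaa", "kirja"]),
    ("number", ["viisi", "metriä", "pitkä", "matka"]),
]

def train(data):
    """Collect context-word counts per sense and the vocabulary."""
    counts, vocab = {}, set()
    for sense, words in data:
        counts.setdefault(sense, Counter()).update(words)
        vocab.update(words)
    return counts, vocab

def disambiguate(context, counts, vocab, prior=0.5, pseudo=0.5):
    """Return the sense with the highest log posterior.  Every word is
    treated as if it had already occurred `pseudo` times with each sense
    (add-0.5 smoothing); `prior` is 0.5 for both senses."""
    best, best_score = None, float("-inf")
    for sense, wc in counts.items():
        denom = sum(wc.values()) + pseudo * len(vocab)
        score = math.log(prior)
        for w in context[:10]:          # the nearest 10 words as the context
            score += math.log((wc[w] + pseudo) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

counts, vocab = train(TRAIN)
print(disambiguate(["huomenna", "sataa", "vettä", "koko", "päivän"],
                   counts, vocab))      # -> "rain"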
Let's look up the dictionary definitions of the words in the test sentence. These are compared to the dictionary definitions of the two senses of the studied word. The sense whose definition shares more words with the test sentence and with the dictionary definitions of the other words in the sentence is chosen as the correct one.
In this case, from the definition of ``ampuminen'' (shooting) we find the words ``harjoitella'' (to practise) and ``varusmies'' (conscript), which also appear in the test sentence. The word ``sarjatuli'' (automatic fire) is found in the definition of ``kivääri'' (rifle), so three points for shooting.
From the definition of ``ammuminen'' (mooing) we find the word ``niityllä'' (in the meadow), which also appears in the test sentence. One point for mooing.
So for this sentence the chosen sense is shooting (3 points versus 1).
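As a minimal sketch of this overlap counting (simplified Lesk) in Python; the dictionary definitions below are placeholders standing in for the real Finnish dictionary entries:

def lesk_score(sense_definition, sentence_words, other_definitions):
    """Count how many words of the sense's definition also occur in the
    test sentence or in the definitions of the other sentence words."""
    pool = set(sentence_words)
    for definition in other_definitions:
        pool.update(definition)
    return len(set(sense_definition) & pool)

# Placeholder data: the test sentence and the definition of "kivääri".
sentence = ["varusmies", "harjoitella", "kivääri", "niityllä"]
other_defs = [["sarjatuli", "ase"]]
definitions = {
    "ampuminen (shooting)": ["harjoitella", "varusmies", "sarjatuli"],
    "ammuminen (mooing)":   ["lehmä", "niityllä"],
}
for sense, definition in definitions.items():
    print(sense, lesk_score(definition, sentence, other_defs))
# -> ampuminen (shooting) 3, ammuminen (mooing) 1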
Let's see how many hits Google gives:
search phrase | hits
prices go up | 111000
price goes up | 88100
total (go up) | 199100
prices slant | 58
prices lean | 2520
prices lurch | 21
price slants | 1
price leans | 63
price lurches | 114
total (slant/lean/lurch) | 2777
This example clearly points to the sense ``go up''.
What about the next example? If we do the translation and search
using the given word order, we will get no hits (excluding the
hits for this exercise problem). So we try to find
documents where the words may occur in any order:
search words (in any order) | hits
want shin hoof liver or snout | 260
like shin hoof liver or snout | 304
covet shin hoof liver or snout | 219
desire shin hoof liver or snout | 243
total (noun senses) | 1026
want kick poke cost or suffer | 43500
We see that the verb senses of the words win here, although the noun senses would probably be more correct. Not all of the searches are even needed, because the first search for the verb sense alone already produces more hits than all of the other senses together. In addition, most of the hits returned by the first four searches came from dictionaries.
As the noun senses shin, hoof, liver and snout are much rarer than the verb senses, they get far fewer hits. In this situation we should probably normalize the hit counts in some way. This example was also harder than the first one because this time the sentence was not a common, fixed phrase.
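A rough sketch of the hit-count comparison in Python; get_hit_count is a hypothetical stand-in for a web search API, and the optional normalization shown here is only one possible way to compensate for rare paraphrases:

def choose_sense(queries_by_sense, get_hit_count, normalizers=None):
    """queries_by_sense: dict mapping a sense to its paraphrased queries.
    normalizers: optional dict mapping a sense to a query whose hit count
    approximates how common that sense's words are on their own."""
    scores = {}
    for sense, queries in queries_by_sense.items():
        total = float(sum(get_hit_count(q) for q in queries))
        if normalizers:
            total /= max(get_hit_count(normalizers[sense]), 1)
        scores[sense] = total
    # The sense whose paraphrases produce the most (normalized) hits wins.
    return max(scores, key=scores.get)

# Example with some of the hit counts from the first table above.
fixed_hits = {"prices go up": 111000, "price goes up": 88100,
              "prices lean": 2520, "price leans": 63}
print(choose_sense({"go up": ["prices go up", "price goes up"],
                    "lean":  ["prices lean", "price leans"]},
                   get_hit_count=lambda q: fixed_hits.get(q, 0)))   # -> go up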
The problem is to estimate the probability P(s_k | c) of the sense s_k when we know the context c.
The convergence of the algorithm, as the E- and M-steps are iterated, is illustrated in Figure 1. In this case the priors were kept fixed for the first 15 iterations, which improved the stability. We see that the algorithm can separate the numbers and the trees. For sentences 8 and 9 the model overlearns and assigns them to only one sense. If the amount of training data were larger, these estimates might also be more reasonable.
The same algorithm can be used, for example, to divide a set of documents into topics. In that case, the contexts would be the full documents.
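A minimal sketch of the EM sense discrimination in Python, assuming a two-sense naive Bayes (bag-of-words) mixture; the initialization, the smoothing and the number of iterations are arbitrary choices here and need not match those behind Figure 1:

import math
import random
from collections import defaultdict

def em_senses(contexts, n_senses=2, n_iter=30, pseudo=0.5, seed=0):
    """Unsupervised sense discrimination with EM.  contexts is a list of
    word lists; returns the list of P(sense | context) distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # Soft random initialization of the responsibilities P(sense | context).
    resp = []
    for _ in contexts:
        row = [rng.random() + 0.5 for _ in range(n_senses)]
        resp.append([r / sum(row) for r in row])

    for _ in range(n_iter):
        # M-step: re-estimate the priors and the smoothed word probabilities.
        priors = [(sum(r[k] for r in resp) + pseudo)
                  / (len(contexts) + pseudo * n_senses)
                  for k in range(n_senses)]
        word_prob = []
        for k in range(n_senses):
            counts = defaultdict(float)
            for c, r in zip(contexts, resp):
                for w in c:
                    counts[w] += r[k]
            denom = sum(counts.values()) + pseudo * len(vocab)
            word_prob.append({w: (counts[w] + pseudo) / denom for w in vocab})

        # E-step: recompute P(sense | context) under the current model.
        new_resp = []
        for c in contexts:
            log_post = [math.log(priors[k])
                        + sum(math.log(word_prob[k][w]) for w in c)
                        for k in range(n_senses)]
            m = max(log_post)
            post = [math.exp(lp - m) for lp in log_post]
            new_resp.append([p / sum(post) for p in post])
        resp = new_resp
    return resp

# Example: each inner list is one context of the ambiguous word.
print(em_senses([["vettä", "pilvestä"], ["kaksi", "euroa"],
                 ["lunta", "vettä"], ["kolme", "euroa"]]))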
[6.]
Here we present one possible example solution step by step. The most important points where we have made an arbitrary decision that may increase inaccuracy, and that could just as well be made otherwise, are marked with italics.
Using the method described above we got the results in Table 1. Here we used a map of size . If no correct answers are available, it is easier to evaluate the result when there is only a small number of groups. For example, for the words ``sade'' and ``komissio'', the results for a map were 59% and 98%. Figure 2 shows the grouping of the words ``sade'' and ``komissio'' on the map.
word pair | training correct % (word 1 / word 2) | test correct % (word 1 / word 2)
Lappi / Pariisi | 63 / 55 | 61 / 53
sade / komissio | 66 / 93 | 66 / 92
Venäjä / tammikuu | 80 / 60 | 78 / 60
Halonen / TPS | 62 / 74 | 63 / 70
leijona / ydinvoima | 70 / 55 | 75 / 48
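Only as a rough sketch of one possible way to group the contexts of two words on a small word map: bag-of-words context vectors and a self-organizing map via the third-party minisom package are assumptions here, as are the map size and the training parameters.

import numpy as np
from minisom import MiniSom          # third-party package: pip install minisom
from collections import Counter, defaultdict

def context_vectors(contexts, vocab):
    """Bag-of-words vectors over a fixed vocabulary, one per context."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(contexts), len(vocab)))
    for i, words in enumerate(contexts):
        for w in words:
            if w in index:
                X[i, index[w]] += 1.0
    return X

def map_groups(X, width=2, height=2, iters=1000, seed=0):
    """Train a small self-organizing map and return the winning map unit
    (the group) for each context vector."""
    som = MiniSom(width, height, X.shape[1], sigma=1.0,
                  learning_rate=0.5, random_seed=seed)
    som.random_weights_init(X)
    som.train_random(X, iters)
    return [som.winner(x) for x in X]

def correct_percentage(groups, senses):
    """Label each map unit with its majority sense and report how large a
    share of the contexts fall into a unit of their own sense."""
    per_unit = defaultdict(Counter)
    for g, s in zip(groups, senses):
        per_unit[g][s] += 1
    correct = sum(c.most_common(1)[0][1] for c in per_unit.values())
    return 100.0 * correct / len(senses)

With real data, the sense labels would simply record which of the two words (e.g. ``sade'' or ``komissio'') each context came from, and a larger map gives more but smaller groups.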