T-61.5020 Statistical Natural Language Processing
Exercises 9 -- Statistical machine translation
Version 1.1
The corpora include XML-style tags and other information not needed here. They can be removed using a Python script available on the course's web page. The corpus package has a separate file for each of the two languages, and lines with the same line number in the two files contain corresponding sentences.
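As a rough illustration of what the cleaning and alignment step involves, the following sketch strips XML-style tags and pairs up lines by line number. It is not the course's script: the function names strip_tags and read_parallel and the simple tag pattern are assumptions made for this example, and the "other information" mentioned above is not handled.

```python
import re

# Hypothetical pattern: anything between '<' and '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(line):
    """Remove XML-style tags and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", line).split())

def read_parallel(src_lines, tgt_lines):
    """Pair lines that share a line number in the two language files."""
    return [(strip_tags(s), strip_tags(t)) for s, t in zip(src_lines, tgt_lines)]
```

With real data, src_lines and tgt_lines would be the lines read from the two language files of the corpus package.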
Next, choose a relatively common word in the source language (e.g. Finnish)
and find all the sentences in the Finnish corpus that contain it. Then go
through the target-language corpus (e.g. English) and collect all the words
that occur in the corresponding sentences (lines), together with their
co-occurrence counts. Finally, try to find the most probable translation(s)
among this set of words.
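The counting step could be sketched as follows. The exercise does not say how to pick the most probable translation, so the Dice coefficient used here is only one possible heuristic, chosen because raw counts tend to favour very frequent words such as "the". The function names, the whitespace tokenisation, and the lowercasing are all assumptions of this example.

```python
from collections import Counter

def cooccurrence_counts(pairs, source_word):
    """Count, per target word, the aligned sentences it shares with source_word."""
    counts = Counter()
    n_hits = 0  # number of source sentences containing source_word
    for src, tgt in pairs:
        if source_word in src.lower().split():
            n_hits += 1
            # set(): count each target word at most once per sentence
            counts.update(set(tgt.lower().split()))
    return counts, n_hits

def rank_candidates(pairs, source_word, top=5):
    """Rank target words by Dice coefficient (an assumed scoring choice)."""
    counts, n_hits = cooccurrence_counts(pairs, source_word)
    totals = Counter()  # target sentences containing each word, corpus-wide
    for _, tgt in pairs:
        totals.update(set(tgt.lower().split()))
    scores = {w: 2 * c / (n_hits + totals[w]) for w, c in counts.items()}
    # Highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy aligned sentence pairs (invented for illustration only)
pairs = [
    ("talo on suuri", "the house is big"),
    ("talo on punainen", "the house is red"),
    ("auto on suuri", "the car is big"),
]
print(rank_candidates(pairs, "talo"))
```

In this toy data "the" and "is" co-occur with "talo" as often as "house" does, but they also appear in the sentence without "talo", so normalising by overall frequency ranks "house" first.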