T-61.5020 Statistical Natural Language Processing
Exercises 9 -- Statistical machine translation
Version 1.1
The corpora include XML-style tags and other information not needed here. These can be removed with a Python script available on the course web page. The corpus package has separate files for the two languages, and lines with the same number in corresponding files contain corresponding sentences.
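For illustration only, the following minimal sketch shows one way such a clean-up could look. It is not the script from the course web page. The names `strip_tags` and `read_parallel`, the UTF-8 encoding, and the assumption that every tag has the form `<...>` are all hypothetical.

```python
import re

# Assumption: tags look like <...> and never span several lines.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(line):
    """Remove XML-style tags from one line and collapse extra whitespace."""
    return " ".join(TAG_RE.sub(" ", line).split())

def read_parallel(source_path, target_path):
    """Read two line-aligned files and return a list of (source, target) pairs.

    Lines are cleaned but never dropped, so line i of one file still
    corresponds to line i of the other.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        return [(strip_tags(s), strip_tags(t)) for s, t in zip(src, tgt)]
```

Lines that consist only of tags become empty strings rather than being deleted, which keeps the line-number alignment between the two languages intact.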
Next, choose a relatively common word in the source language (e.g. Finnish). Find all the sentences in the Finnish corpus that contain that word. Then go through the target-language corpus (e.g. English) and collect all the words that occur in the corresponding sentences (lines), together with their co-occurrence counts. Then try to find the most probable translation(s) from this set of words.
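The counting step could be sketched roughly as follows. The function name `cooccurrence_counts`, the whitespace tokenisation, the lowercasing, and the tiny example sentences are assumptions made for illustration. The sketch counts each target word at most once per sentence pair, which is only one possible reading of "co-occurrence count".

```python
from collections import Counter

def cooccurrence_counts(source_sentences, target_sentences, source_word):
    """Count target words in the sentences aligned with source sentences
    that contain source_word (each word counted once per sentence pair)."""
    counts = Counter()
    for src, tgt in zip(source_sentences, target_sentences):
        # Assumption: simple lowercased whitespace tokenisation.
        if source_word in src.lower().split():
            counts.update(set(tgt.lower().split()))
    return counts

# Made-up toy data, aligned line by line.
fi = ["talo on suuri", "tämä on talo", "auto on punainen"]
en = ["the house is big", "this is a house", "the car is red"]

counts = cooccurrence_counts(fi, en, "talo")
for word, n in counts.most_common():
    print(word, n)
```

In this toy example "house" and "is" both co-occur twice with "talo", which suggests that raw counts alone tend to favour frequent function words. Comparing each count with the word's overall frequency in the target corpus is one possible way to take this into account.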