T-61.5020 project work

Instructions for project work, spring 2008

T-61.5020 Statistical natural language processing

Project work

General info

The purpose of the project work is to apply one or two statistical methods to a natural language problem presented during the course. Write a report on the work and the results. The work can be thought of as a small research project.

It is preferable to carry out the project work before attending an exam.

The project work reports that are returned by 31.5.2008 will be graded during spring 2008. Reports returned later are graded at the earliest convenience of the course personnel.

Finished reports may be returned by e-mail as a PDF file.

Project report

Write a report describing your work.

The report should begin with a title page containing the course code and name, student name(s) and ID(s), and the topic.

In the report, briefly describe the research problem, the methods used, the experiments carried out, the results, and the conclusions, and include a list of references.

If you use a data set other than the ones provided, also describe that data set and append samples of it to the report.

Attach program code as an appendix. If, in addition, you use some ready-made programs or tools, mention these in the report.

The length of the report should be 5-10 pages, not counting the program code that should be included as an appendix.

Working in pairs

It is possible to do the project jointly with a partner. In this case, an extended version of the project should be carried out, for example by applying the methods to larger or additional data sets, by using several methods, or by extending the work in some other way.

Only one report is written, in which the distribution of work between the pair is also described. The report may be somewhat longer (10-15 pages) to reflect the extended content of the project.

Working in pairs is especially recommended if you want to go more deeply into a topic and the workload would otherwise become too heavy.

Grading

Project works are graded 5, 3, 1 or failed. Of these, 5 means excellent and 1 means pass.

A project grade of 1 can lower the course grade, and correspondingly a 5 can raise it, when the points obtained in the exam are close to a grade boundary (within 1-2 points of it). The exception is a course grade of 5, which cannot be raised.

Topics

Statistical machine translation

Make a raw translation of the lecture materials of the course from Finnish into English. Details to be agreed upon.

Word sense disambiguation

Apply two different methods to word sense disambiguation. One of the methods should be unsupervised and the other supervised. Apply the methods either to Finnish or English data sets. Analyze the benefits and the problems of the methods.

Alternatively, you can choose only one method and apply it to both languages, and consider/analyze the suitability of the method for each language.
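As an illustration of what a supervised method might look like, the following is a minimal sketch of a naive Bayes classifier over bag-of-words contexts. Naive Bayes is only one possible choice and is not prescribed by these instructions; the add-one smoothing, the sense labels and the toy training contexts are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (context_words, sense) pairs. Returns the counts."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for w in context:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def classify(model, context):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical toy data: contexts of the ambiguous word "bank".
training = [
    (["river", "water", "shore"], "bank/river"),
    (["money", "loan", "account"], "bank/finance"),
    (["water", "fishing"], "bank/river"),
    (["account", "interest", "money"], "bank/finance"),
]
model = train_naive_bayes(training)
print(classify(model, ["loan", "interest"]))
```

With real data, the contexts would be the words in a window around each tagged occurrence, and accuracy would be measured on held-out test data.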

English data set: Senseval

Pick at least two words to be disambiguated from the Senseval data. Report results on the Senseval test data for the same words. You can also compare your results to those obtained with different methods in the Senseval competition.

Note: the dictionary data is included only in case someone wants to apply a dictionary-based method instead. It is not necessarily needed.

Finnish data set: STT

This data set does not include correct sense taggings for any ambiguous words. However, one can use it as data for the pseudo-word disambiguation problem. For example, create a pseudo-ambiguous word by replacing all occurrences of the words 'banaani' and 'ovi' with the ambiguous word 'banaaniovi'. The original words are thus the correct senses to be recognized.

Produce at least two pseudo words (i.e. pairs or combinations of several words) and apply the methods to those. Report results on a separate test set divided from the STT data.
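The pseudo-word construction and the separate test set can be sketched as follows. This is a minimal sketch under several assumptions: the text is already tokenized into sentences, only exact token matches are replaced (inflected forms of Finnish words would need lemmatization or a list of forms), and the function names and the toy sentences are hypothetical.

```python
import random

def make_pseudoword_data(sentences, senses, pseudoword):
    """Replace every token found in `senses` with `pseudoword`.

    Returns one (tokens, position, original_word) example per replaced
    occurrence; the original word is the correct sense to be recognized.
    """
    examples = []
    for tokens in sentences:
        replaced = [pseudoword if t in senses else t for t in tokens]
        for i, t in enumerate(tokens):
            if t in senses:
                examples.append((replaced, i, t))
    return examples

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction as a separate test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy sentences; real input would be tokenized text from the STT data.
sentences = [
    ["banaani", "on", "keltainen"],
    ["ovi", "on", "auki"],
    ["tämä", "banaani", "on", "kypsä"],
]
examples = make_pseudoword_data(sentences, {"banaani", "ovi"}, "banaaniovi")
train, test = split_train_test(examples, test_fraction=0.34)
```

Fixing the random seed keeps the train/test division reproducible, which makes reported results easier to compare between methods.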

Information retrieval

Data

CACM data set

Methods

Apply two different methods, at least one of which is one of the following:

Salton's vector space model (VSM),
Latent semantic indexing (LSI),
Self-organizing map (SOM)

Compare the suitability of the methods to the task and discuss the advantages and disadvantages of each method.
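To make the vector space model concrete, here is a minimal sketch that ranks documents by the cosine similarity of TF-IDF vectors. The particular weighting (raw term frequency times log(N/df)), the function names and the three toy documents are assumptions for illustration; they are not taken from the CACM data set.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

# Hypothetical toy collection.
docs = [
    "sorting algorithms for large files".split(),
    "parsing algorithms for context free grammars".split(),
    "memory allocation in operating systems".split(),
]
print(rank("parsing grammars".split(), docs))
```

Ties are broken by document index here; a real evaluation would compare the ranking against the relevance judgments supplied with the collection.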

Individual topic

The project work can be carried out on an individual topic, as well. First you should obtain approval for your topic from the lecturer, as follows:

Send a description of the proposed topic, about half an A4 page long, covering the research problem, the data set you propose to examine and have at your disposal, and the methods you plan to apply. If necessary, discuss and refine the topic with the lecturer.

Style analysis

The task is to develop a system that is given as training data either a) texts by two (or more) different authors or b) texts of two (or more) different genres. After the training phase, the system is tested on texts that were not part of the training data, in order to check how well it classifies the test data. The statistics underlying the style analysis can be, for example:

word lengths
sentence lengths
information on the distribution of individual words
character or word n-grams
punctuation distributions
distributions of the words on a given word list (for example, personal pronouns)
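A few of the statistics listed above can be computed as in the following minimal sketch. The function name, the regular expressions used for splitting words and sentences, and the default English pronoun list are assumptions for illustration; a real system would need a tokenizer suited to the language of the texts.

```python
import re
from collections import Counter

def style_features(text, word_list=("i", "you", "he", "she", "we", "they")):
    """Compute simple style statistics for one text."""
    # Naive sentence split on runs of sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    punctuation = Counter(re.findall(r"[.,;:!?]", text))
    n_words = len(words) or 1  # avoid division by zero on empty input
    word_counts = Counter(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / (len(sentences) or 1),
        "punctuation": dict(punctuation),
        "word_list_freq": {w: word_counts[w] / n_words for w in word_list},
    }

features = style_features("I came. I saw, and we won!")
print(features["mean_sentence_length"])
```

Feature dictionaries like this one can be flattened into vectors and given to any classifier for the author or genre decision.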

Question answering system

The task is to develop a system that answers questions posed in English in the field of computer and information science. The special topics are Kohonen's Self-Organizing Map (SOM) and Independent Component Analysis (ICA).

As its answer to a question, the system retrieves from a collection of documents the one that, in the sense of the method used, most closely matches the given query. Such a query could be, for example, "How to choose the neighborhood function when using the self-organizing map?". Your task is to choose a suitable method for this problem and to collect a set of questions for testing.

The data is a set of documents collected from http://www.cis.hut.fi/ and preprocessed to some extent (see the instructions).

The collection includes the following kinds of documents:

Documentation of the Matlab package SOM Toolbox
http://www.cis.hut.fi/projects/somtoolbox/package/docs2/
Excerpts from Aapo Hyvärinen's ICA tutorial for the IJCNN99 conference
http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/
Excerpts from the biennial report 2000-2001 of the Laboratory of Computer and Information Science and the Neural Networks Research Centre
http://www.cis.hut.fi/research/reports/biennial00-01/
Excerpts from the introductory part of Timo Honkela's doctoral thesis
http://www.cis.hut.fi/tho/thesis/
Abstracts of the WSOM'97 conference
http://www.cis.hut.fi/wsom97/

The rights of the authors of each part remain unchanged.

The data can be unpacked with the following Unix commands:

gunzip hut_nnrc_collection03.tar.gz
tar xvf hut_nnrc_collection03.tar

Note that the collection is unpacked into the directory in which the tar command is given. The collection contains 370 documents, which are encoded in an XML-like style as follows:

     <document>
     <source> ... where the document was obtained ... </source>
     <author> ... author(s) ... </author>
     <text>
     ... body text ...
     </text>
     </document>

In some files the author field comes before the source field. In the WSOM collection, the author tags are on different lines from the author information itself.
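Because the field order and line breaks vary, one tolerant way to read the documents is to extract each field with a regular expression rather than rely on a fixed layout. The following is a minimal sketch; the function names and the sample document are hypothetical, and it assumes that the tags are never nested and that the files have already been read into a string with a suitable encoding.

```python
import re

DOC_RE = re.compile(r"<document>(.*?)</document>", re.DOTALL)

def field(block, name):
    """Return the stripped content of <name>...</name>, or '' if absent."""
    m = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
    return m.group(1).strip() if m else ""

def parse_documents(raw):
    """Parse all <document> blocks regardless of field order or line breaks."""
    return [
        {name: field(block, name) for name in ("source", "author", "text")}
        for block in DOC_RE.findall(raw)
    ]

# Hypothetical sample with the author field first and tags on their own lines.
sample = """<document>
<author>
A. Author
</author>
<source> http://www.cis.hut.fi/wsom97/ </source>
<text>
Body text of the abstract.
</text>
</document>"""
docs = parse_documents(sample)
print(docs[0]["author"])
```

The resulting dictionaries can then be tokenized and indexed by whichever retrieval method is chosen.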