T-61.5020 project work
Instructions for project work, spring 2008
T-61.5020 Statistical natural language processing
Project work
General info
The purpose of the project work is to apply one or two of the
statistical methods presented during the course to a natural language
problem. You should write a report on the work and the results. The
work can be thought of as a small research project.
It is preferable to carry out the project work before taking the exam.
Project reports that are submitted by
31.5.2008 will be graded during spring 2008. Reports
submitted later are graded at the earliest convenience of the course
personnel.
Finished reports may be submitted by e-mail as a PDF file.
Project report
Write a report describing your work.
The report should begin with a title page containing the course code
and name, student name(s) and ID(s), and the topic.
In the report, briefly describe the research problem, the methods
used, the experiments carried out, the results and the conclusions,
and list the references.
If you use data sets other than the ones provided, also describe the
data set and append samples of it to the report.
Attach program code as an appendix. If, in addition, you use some
ready-made programs or tools, mention these in the report.
The length of the report should be 5-10 pages, not counting the program code
that should be included as an appendix.
Working in pairs
It is possible to do the project jointly with a partner. In this case,
an extended version of the project should be carried out, for example
by applying the methods to larger or additional data sets, by
using several methods, or by extending the work in some other way.
Only one report is written, and it should also describe how the work
was divided between the partners. The report may be somewhat longer
(10-15 pages) to reflect the extended content of the project.
Working in pairs is especially recommended if you want to go more
deeply into a topic and the workload would otherwise become
too heavy.
Grading
The project work is graded as 5, 3, 1 or failed. Of these, 5
means excellent and 1 means a pass.
A project grade of 1 can lower the course grade and, correspondingly,
a 5 can raise it when the points obtained in the exam are close to a
grade boundary (within 1-2 points of it). The exception is a course
grade of 5, which cannot be raised.
Topics
Statistical machine translation
Make a raw translation of the lecture materials of the course from
Finnish into English. Details to be agreed upon.
Word sense disambiguation
Apply two different methods to word sense disambiguation. One of the
methods should be unsupervised and the other supervised. Apply the
methods either to Finnish or English data sets. Analyze the benefits
and the problems of the methods.
Alternatively, you can choose only one method and apply it to both
languages, and consider/analyze the suitability of the method for
each language.
English data set: Senseval
Pick at least two words to be disambiguated from the Senseval data.
Report results on the Senseval test data for the same words. You can
also compare your results to those obtained with different methods in
the Senseval competition.
Note: the dictionary data is included only in case someone wants to
apply a dictionary-based method instead. It is not necessarily
needed.
Finnish data set: STT
This data set does not include correct sense taggings for any
ambiguous words. However, it can be used as data for the
pseudo-word disambiguation problem. For example, create a
pseudo-ambiguous word by replacing all occurrences of the words
'banaani' and 'ovi' with the ambiguous word 'banaaniovi'. The original
words are thus the correct senses to be recognized.
Produce at least two pseudo words (i.e. pairs or combinations of
several words) and apply the methods to those. Report results on a
separate test set divided from the STT data.
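The pseudo-word construction described above can be sketched roughly as
follows. This is only a minimal illustration, assuming
whitespace-tokenized text: the function name make_pseudo_word_data and
the toy sentences are hypothetical and not part of the STT data, and
inflected forms of the words (which are common in Finnish) are ignored.

```python
def make_pseudo_word_data(sentences, words, pseudo_word):
    """Replace all occurrences of the given words with one pseudo word.

    Returns one (context_tokens, position, label) triple per replaced
    occurrence; the label is the original word, i.e. the correct "sense".
    """
    targets = set(words)
    examples = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Hide every target word in the sentence behind the pseudo word.
        context = [pseudo_word if t in targets else t for t in tokens]
        for position, token in enumerate(tokens):
            if token in targets:
                examples.append((context, position, token))
    return examples


# Hypothetical toy sentences (assumption: not taken from the STT data).
sentences = [
    "banaani on keltainen",
    "ovi oli auki",
]
examples = make_pseudo_word_data(sentences, ["banaani", "ovi"], "banaaniovi")
for context, position, label in examples:
    print(label, position, context)
```

The resulting examples can then be split into training and test sets,
with the labels used only for evaluation in the unsupervised case.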
Information retrieval
Data
CACM data set
Methods
Apply two different methods, at least one of which is one of the
following:
- Salton's vector space model (VSM),
- Latent semantic indexing (LSI),
- Self-organizing map (SOM)
Compare the suitability of the methods to the task and discuss the
advantages and disadvantages of each method.
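As a rough illustration of the first option, a vector space model with
TF-IDF weighting and cosine similarity could look like the sketch
below. The weighting scheme (raw term frequency times log inverse
document frequency), the function names and the toy documents are
assumptions made for this example; they are not prescribed by the
course and the documents are not taken from the CACM data set.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors (as dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, vectors, idf):
    """Return document indices sorted by decreasing similarity to the query."""
    counts = Counter(query_tokens)
    # Query terms that never occur in the collection are ignored.
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(query, vec) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy documents (assumption: not from the CACM data set).
docs = [
    "parsing algorithms for programming languages".split(),
    "sorting algorithms and their complexity".split(),
    "operating systems and memory management".split(),
]
vectors, idf = tfidf_vectors(docs)
ranking = rank("sorting complexity".split(), vectors, idf)
print(ranking)
```

The same ranking function can serve as a baseline against which a
second method, such as LSI or SOM, is compared.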
Individual topic
The project work can also be carried out on an individual topic.
First you should obtain approval for your topic from the
lecturer, as follows:
Send a description of the topic you propose, about half an A4 page
long, covering the research problem, the data set you propose to
examine and have at your disposal, and the methods you plan to apply.
If necessary, discuss and refine the topic with the lecturer.
Style analysis
The task is to develop a system that is given as training data
either a) texts by two (or more)
different authors or b) texts from two (or more) different genres.
After the training phase, the system is tested on texts that are not
part of the training data, in order to check how well
the system classifies the test data.
The statistics on which the style analysis is based can be,
for example:
- word lengths
- sentence lengths
- information about the distribution of individual words
- character or word n-grams
- punctuation distributions
- distributions of the words on a given word list (for example, personal pronouns)
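A few of the statistics listed above could be computed along the lines
of the sketch below. It is only an illustration under simple
assumptions: sentences are split at '.', '!' and '?', words are
maximal runs of word characters, and the function name style_features,
the sample text and the word list are hypothetical.

```python
import re
from collections import Counter


def style_features(text, word_list):
    """Compute a few simple style statistics for one text."""
    # Naive sentence split: any run of '.', '!' or '?' ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    punctuation = Counter(ch for ch in text if ch in ".,;:!?")
    word_counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero for empty texts
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / (len(sentences) or 1),
        "punctuation": dict(punctuation),
        "word_list_freq": {w: word_counts[w] / n_words for w in word_list},
    }


# Hypothetical sample text and word list.
text = "I came. I saw, and I conquered!"
features = style_features(text, ["i", "we"])
print(features)
```

Feature dictionaries like this one can be turned into fixed-length
vectors and given to any classifier chosen for the task.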
Question answering system
The task is to develop a system that answers questions posed in
English in the field of computer and information science.
The special topics are Kohonen's Self-Organizing Map (SOM)
and Independent Component Analysis (ICA).
As its answer to a question, the system retrieves from a set of
documents the one that, in the sense of the method used, most closely
matches the given query. Such a query could be, for example,
"How to choose the neighborhood function when using
the self-organizing map?". Your task is to choose a suitable method
for this task and to collect a set of questions for testing.
The data consists of
a set of documents collected from http://www.cis.hut.fi/
and preprocessed to some extent (see the
instructions).
The collection includes the following kinds of documents:
- Documentation of the MATLAB package SOM Toolbox
http://www.cis.hut.fi/projects/somtoolbox/package/docs2/
- Excerpts from Aapo Hyvärinen's ICA tutorial
for the IJCNN99 conference
http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/
- Excerpts from the biennial report of the Laboratory of
Computer and Information Science and the Neural Networks
Research Centre for 2000-2001
http://www.cis.hut.fi/research/reports/biennial00-01/
- Excerpts from the introductory part of Timo Honkela's doctoral thesis
http://www.cis.hut.fi/tho/thesis/
- Abstracts of the WSOM'97 conference
http://www.cis.hut.fi/wsom97/
The rights of the authors of each part remain
unchanged.
The collection can be unpacked with the following unix commands:
gunzip hut_nnrc_collection03.tar.gz
tar xvf hut_nnrc_collection03.tar
Note that the collection is extracted into the directory
in which the tar command is given.
The collection contains 370 documents, which are encoded
in an XML-like style as follows:
<document>
<source> ... where the document was obtained from ... </source>
<author> ... author(s) ... </author>
<text>
... body text ...
</text>
</document>
In some files the author field comes before
the source field. In the WSOM collection the author tags are
on different lines than the author information itself.
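Reading the three fields of one document could be sketched as below.
This is a minimal example, not an official loader: it assumes that the
content of one document has already been read into a string, and it
uses regular expressions rather than an XML parser, since it is not
known whether the files are well-formed XML. The function name
parse_document and the sample document are hypothetical.

```python
import re

# One pattern per field; DOTALL lets a field span several lines.
FIELD_RE = {
    name: re.compile(r"<{0}>(.*?)</{0}>".format(name), re.DOTALL)
    for name in ("source", "author", "text")
}


def parse_document(raw):
    """Extract the source, author and text fields from one document string.

    Each field is searched for independently, so the order of the fields
    does not matter, and tags may be on different lines than their content.
    A missing field is returned as None.
    """
    fields = {}
    for name, pattern in FIELD_RE.items():
        match = pattern.search(raw)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Hypothetical sample in the described format, with the author field
# first and the author tags on their own lines.
sample = """<document>
<author>
A. Author
</author>
<source> http://www.cis.hut.fi/wsom97/ </source>
<text>
Body text of the abstract.
</text>
</document>"""

doc = parse_document(sample)
print(doc)
```

Because each field is matched separately, the same function handles
both field orders mentioned above.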