- 1.
- Let's first modify the given table into a format that suits the
calculations better.
Table:
Modified tables
(tp = true positives, fp = false positives,
fn = false negatives, tn = true negatives)

engine 1     | relevant | non-relevant
returned     | 4 tp     | 6 fp
not returned | 2 fn     | 9988 tn

engine 2     | relevant | non-relevant
returned     | 6 tp     | 4 fp
not returned | 0 fn     | 9990 tn
The following table gives the definitions of the first five measures
and the results of applying them (N = 10000 is the total number of
documents).
Table:
Results. Note that only the precision and recall values are
in a region that is easy to understand.

measure   | definition | engine 1            | engine 2            | ratio of
precision | tp/(tp+fp) | 4/10 = 0.4          | 6/10 = 0.6          | relevants in returned
recall    | tp/(tp+fn) | 4/6 ≈ 0.667         | 6/6 = 1.0           | relevants found
fallout   | fp/(fp+tn) | 6/9994 ≈ 0.0006     | 4/9994 ≈ 0.0004     | returned non-relevants
accuracy  | (tp+tn)/N  | 9992/10000 = 0.9992 | 9996/10000 = 0.9996 | correctly classified
error     | (fp+fn)/N  | 8/10000 = 0.0008    | 4/10000 = 0.0004    | incorrectly classified
The F-measure is defined using both the precision and the recall:

F = (β² + 1)PR / (β²P + R),

where P stands for precision and R for recall. The parameter β
controls the weighting between them. If we choose β = 1,

F = 2PR / (P + R).

For the first engine F = 2 · 0.4 · (2/3) / (0.4 + 2/3) = 0.5, and for
the second F = 2 · 0.6 · 1 / (0.6 + 1) = 0.75.
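As a quick check, all of the measures above can be computed directly
from the two contingency tables; this is a minimal sketch (the
function name and the dictionary layout are my own, not from the
exercise):

```python
# Compute the five evaluation measures and the F-measure (beta = 1)
# from the counts in a contingency table.

def measures(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,                 # relevants in returned
        "recall": recall,                       # relevants found
        "fallout": fp / (fp + tn),              # returned non-relevants
        "accuracy": (tp + tn) / n,              # correctly classified
        "error": (fp + fn) / n,                 # incorrectly classified
        # With beta = 1 the F-measure is the harmonic mean of P and R.
        "F1": 2 * precision * recall / (precision + recall),
    }

engine1 = measures(tp=4, fp=6, fn=2, tn=9988)
engine2 = measures(tp=6, fp=4, fn=0, tn=9990)
print(engine1["F1"], engine2["F1"])  # 0.5 and 0.75
```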
When calculating uninterpolated average precision, we go through the
ranked list of returned documents, and whenever a relevant document is
seen, we calculate the precision over the documents processed so far.
Relevant documents that were not returned are taken into account with
a precision of zero. Finally we take the average over these
precisions.
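The procedure just described can be sketched as follows; the ranked
list at the bottom is hypothetical example data, not taken from the
exercise:

```python
# Uninterpolated average precision: record the precision at each
# relevant document in the ranking, count never-returned relevants
# as zero precision, and average over all relevant documents.

def average_precision(ranked_relevance, total_relevant):
    precisions = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    # Relevant documents that were never returned contribute zero.
    precisions.extend([0.0] * (total_relevant - hits))
    return sum(precisions) / total_relevant

# Hypothetical ranking: True marks a relevant document. Two of the
# three relevant documents are returned, at ranks 1 and 3.
ap = average_precision([True, False, True, False], total_relevant=3)
print(ap)  # (1/1 + 2/3 + 0) / 3 = 5/9
```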
- 2.
- Word frequencies in the documents were given for the two words. The
total number of documents is N. Inverse Document Frequency is defined
as

IDF = -log₂(df/N),

where df is the number of documents that contain the word. Evaluating
this for both words shows that the first word gets almost twice the
weight of the second word.
The idea in Residual Inverse Document Frequency (RIDF) is that we can
model the occurrences of a word using a Poisson distribution. This
works well for words that are evenly distributed over the corpus.
Words that are important for the content, however, usually occur in
bursts inside the documents that discuss the corresponding topic, and
therefore the Poisson distribution gives an incorrect estimate of
their frequencies. In RIDF we measure the difference between the
observed IDF and the IDF predicted by the Poisson model: the larger
the difference, the more the word tells about the document. (Note:
there are many errors in this section of the course book's first
edition.)
The actual calculations are the following. On average, the word occurs

λ = cf/N

times in a document, where cf is the word's total frequency in the
collection. The probability that the word occurs k times in a given
document is obtained from the Poisson distribution:

p(k) = e^(-λ) λ^k / k!

RIDF is defined as

RIDF = IDF + log₂(1 - p(0)) = -log₂(df/N) + log₂(1 - e^(-λ)).

I.e., we take from the Poisson distribution the probability that the
word occurs at least once in a document (1 - p(0) = 1 - e^(-λ)). IDF,
on the other hand, was based on the observed value of that probability
(df/N). Simplifying the expression of RIDF:

RIDF = log₂( N(1 - e^(-λ)) / df ).

Assigning the given values, we see that RIDF weighted the first word
2.5 times more than IDF did. Thus both methods estimate that the first
word is a more relevant search term than the second.
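The two weightings can be sketched directly from the definitions
above; since the exercise's actual counts are not reproduced here, the
corpus numbers below (N, df, cf) are hypothetical:

```python
import math

def idf(df, n_docs):
    # IDF from the observed document frequency: -log2(df/N).
    return -math.log2(df / n_docs)

def ridf(df, cf, n_docs):
    # Expected occurrences per document under the Poisson model.
    lam = cf / n_docs
    # RIDF = IDF + log2(1 - p(0)), where p(0) = e^(-lambda) is the
    # Poisson probability of zero occurrences.
    return idf(df, n_docs) + math.log2(1 - math.exp(-lam))

# Hypothetical bursty word: appears in 100 of 10000 documents, but
# 300 times in total (about 3 occurrences per containing document).
print(idf(100, 10000), ridf(100, 300, 10000))
```

A word whose occurrences are spread evenly (cf close to df) gets an
RIDF near zero; a bursty word like the one above keeps a clearly
positive residual.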
- 3.
- The requested document-word matrix is presented in table 3.
Table:
Document-word matrix (the documents are numbered d1–d7 here)

|             | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
| Schumacher  |  0 |  1 |  0 |  1 |  0 |  0 |  0 |
| rata        |  1 |  1 |  1 |  0 |  0 |  1 |  0 |
| formula     |  1 |  0 |  1 |  1 |  0 |  0 |  0 |
| kolari      |  0 |  0 |  1 |  1 |  0 |  0 |  0 |
| galaksi     |  0 |  0 |  0 |  0 |  1 |  1 |  0 |
| tähti       |  0 |  0 |  1 |  0 |  0 |  1 |  1 |
| planeetta   |  0 |  0 |  0 |  0 |  0 |  1 |  1 |
| meteoriitti |  0 |  0 |  0 |  0 |  1 |  0 |  0 |
In Singular Value Decomposition (SVD) we decompose the matrix as

A = U S V^T.

Here U is an orthogonal 8 × 8 matrix, S is a diagonal 8 × 7 matrix,
and V an orthogonal 7 × 7 matrix. The matrices are presented in
tables 4, 5, and 6.
Table:
The matrix U

| Schumacher  | -0.200 | -0.336 |  0.290 |  0.115 |  0.823 |  0.007 |  0.121 | -0.243 |
| rata        | -0.590 |  0.007 |  0.184 |  0.686 | -0.232 | -0.183 |  0.025 |  0.243 |
| formula     | -0.435 | -0.464 | -0.040 | -0.225 | -0.333 |  0.609 |  0.045 | -0.243 |
| kolari      | -0.317 | -0.361 | -0.108 | -0.494 |  0.071 | -0.438 | -0.285 |  0.485 |
| galaksi     | -0.200 |  0.400 |  0.602 | -0.242 | -0.053 |  0.028 | -0.563 | -0.243 |
| tähti       | -0.464 |  0.376 | -0.408 | -0.213 |  0.034 | -0.345 |  0.275 | -0.485 |
| planeetta   | -0.257 |  0.476 | -0.234 | -0.070 |  0.363 |  0.530 | -0.007 |  0.485 |
| meteoriitti | -0.026 |  0.116 |  0.534 | -0.336 | -0.132 | -0.048 |  0.713 |  0.243 |
Table:
The matrix S

| 2.949 | 0     | 0     | 0     | 0     | 0     | 0     |
| 0     | 2.107 | 0     | 0     | 0     | 0     | 0     |
| 0     | 0     | 1.459 | 0     | 0     | 0     | 0     |
| 0     | 0     | 0     | 1.311 | 0     | 0     | 0     |
| 0     | 0     | 0     | 0     | 1.183 | 0     | 0     |
| 0     | 0     | 0     | 0     | 0     | 0.638 | 0     |
| 0     | 0     | 0     | 0     | 0     | 0     | 0.460 |
| 0     | 0     | 0     | 0     | 0     | 0     | 0     |
Table:
The matrix V (the rows correspond to documents d1–d7)

| d1 | -0.348 | -0.217 |  0.099 |  0.352 | -0.478 |  0.669 |  0.152 |
| d2 | -0.268 | -0.156 |  0.325 |  0.611 |  0.499 | -0.275 |  0.316 |
| d3 | -0.613 | -0.210 | -0.255 | -0.187 | -0.390 | -0.559 |  0.130 |
| d4 | -0.323 | -0.551 |  0.098 | -0.460 |  0.474 |  0.279 | -0.261 |
| d5 | -0.077 |  0.245 |  0.779 | -0.440 | -0.157 | -0.030 |  0.328 |
| d6 | -0.512 |  0.598 |  0.099 |  0.124 |  0.094 |  0.048 | -0.587 |
| d7 | -0.244 |  0.404 | -0.440 | -0.216 |  0.335 |  0.290 |  0.583 |
Table:
Scaled two-dimensional document vectors (the columns of S_2 V_2^T,
scaled to unit length)

|   | d1     | d2     | d3     | d4     | d5     | d6     | d7     |
| 1 | -0.913 | -0.924 | -0.971 | -0.634 | -0.400 | -0.768 | -0.646 |
| 2 | -0.407 | -0.384 | -0.238 | -0.773 |  0.917 |  0.640 |  0.764 |
Table:
Correlations of documents

|    | d1     | d2    | d3    | d4     | d5    | d6    | d7    |
| d1 |  1.000 |       |       |        |       |       |       |
| d2 |  1.000 | 1.000 |       |        |       |       |       |
| d3 |  0.984 | 0.988 | 1.000 |        |       |       |       |
| d4 |  0.894 | 0.882 | 0.800 | 1.000  |       |       |       |
| d5 | -0.008 | 0.018 | 0.171 | -0.455 | 1.000 |       |       |
| d6 |  0.441 | 0.464 | 0.594 | -0.008 | 0.894 | 1.000 |       |
| d7 |  0.279 | 0.304 | 0.446 | -0.180 | 0.958 | 0.985 | 1.000 |
We reduce the inner dimension to two by keeping only the two largest
singular values of S and leaving the remaining dimensions out of the
matrices U and V. Now the similarity of the documents can be compared
using the matrix S_2 V_2^T. If its columns are scaled to unit length,
it is easy to calculate the correlations between the documents. This
scaled matrix is in table 7. (The similarity of words could be
compared from U S correspondingly.) From the correlation matrix
(table 8) we see that the Formula 1 and astronomy related articles
correlate much more strongly within each group than across the
groups. Documents d5 and d7, which share no words and were therefore
completely uncorrelated before, are now clearly correlated (0.958).
We have projected the data into a two-dimensional space, and similar
articles have ended up near each other in that reduced dimension.
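The whole reduction can be reproduced with a short NumPy sketch using
the document-word matrix of table 3 (document numbering d1–d7 as
above; the signs of the singular vectors may differ from the tables,
but the correlations are unaffected):

```python
import numpy as np

# Document-word matrix from table 3: rows = words, columns = d1..d7.
A = np.array([
    [0, 1, 0, 1, 0, 0, 0],  # Schumacher
    [1, 1, 1, 0, 0, 1, 0],  # rata
    [1, 0, 1, 1, 0, 0, 0],  # formula
    [0, 0, 1, 1, 0, 0, 0],  # kolari
    [0, 0, 0, 0, 1, 1, 0],  # galaksi
    [0, 0, 1, 0, 0, 1, 1],  # tähti
    [0, 0, 0, 0, 0, 1, 1],  # planeetta
    [0, 0, 0, 0, 1, 0, 0],  # meteoriitti
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the two largest singular values: each document becomes a
# two-dimensional vector (a column of S2 @ V2^T).
docs_2d = np.diag(s[:2]) @ Vt[:2, :]

# Scale the columns to unit length, so the dot products reproduce the
# document correlations of table 8.
docs_2d /= np.linalg.norm(docs_2d, axis=0)
corr = docs_2d.T @ docs_2d

# d5 and d7 (indices 4 and 6) share no words, yet end up similar.
print(round(corr[4, 6], 3))
```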