[Figure 18 plot: Accuracy versus Threshold, both axes from 0 to 100, comparing the Original Paradigm and the New Paradigm.]
Figure 18. Original paradigm versus new, using SO-LSA with TASA and GI.
6. DISCUSSION OF RESULTS
LSA has not yet been scaled up to corpora of the sizes that are available for PMI-IR, so
we were unable to evaluate SO-LSA on the larger corpora
that were used to evaluate
SO-PMI. However, the experiments suggest that SO-LSA is able to use data more
efficiently than SO-PMI, and SO-LSA might surpass the accuracy attained by SO-PMI
with AV-ENG, given a corpus of comparable size.
PMI measures the degree of association between two words by the frequency with
which they co-occur. That is, if PMI(word_1, word_2) is positive, then word_1 and word_2
tend to occur near each other. Resnik [1995] argues that such
word-word co-occurrence
approaches are able to capture “relatedness” of words, but
do not specifically address
similarity of meaning. LSA, on the other hand, measures the degree of association
between two words by comparing the contexts in which the two words occur. That is, if
LSA(word_1, word_2) is positive, then (in general) there are many words, word_i, such that
word_1 tends to occur near word_i and word_2 tends to occur near word_i. It appears that such
word-context co-occurrence approaches correlate better with human judgments of
semantic similarity than word-word co-occurrence approaches [Landauer 2002]. This
could help explain LSA’s apparent efficiency of data usage.
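To make the distinction concrete, the following minimal Python sketch (illustrative only; the toy windows and function names are not from our experiments) contrasts a word-word measure, PMI computed from co-occurrence counts, with a word-context measure, the cosine of two context vectors. In the sketch, "good" and "superb" never share a window, so their PMI is undefined without smoothing, yet their context cosine is high because they occur near the same neighbours.

import math
from collections import Counter, defaultdict

# Toy "corpus": each list stands for one neighbourhood (e.g. a ten-word window).
windows = [
    ["good", "excellent", "service"],
    ["good", "meal", "wine"],
    ["superb", "excellent", "service"],
    ["superb", "meal", "wine"],
    ["bad", "poor", "service"],
]
n = len(windows)

# Word-word statistics: how often each word and each unordered pair share a window.
word_count = Counter(w for win in windows for w in set(win))
pair_count = Counter(frozenset((a, b)) for win in windows
                     for a in set(win) for b in set(win) if a < b)

def pmi(w1, w2):
    # PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ); positive when the two
    # words share a window more often than chance.  Undefined (log of zero)
    # when they never share a window.
    p12 = pair_count[frozenset((w1, w2))] / n
    return math.log2(p12 / ((word_count[w1] / n) * (word_count[w2] / n)))

# Word-context statistics: each word's vector of neighbouring words.
context = defaultdict(Counter)
for win in windows:
    for w in win:
        context[w].update(x for x in win if x != w)

def context_cosine(w1, w2):
    # LSA-style association: cosine of the two context vectors.  It can be
    # high even when w1 and w2 never share a window, provided they occur
    # near the same neighbours.
    v1, v2 = context[w1], context[w2]
    dot = sum(v1[x] * v2[x] for x in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

print(pmi("good", "excellent"))          # positive: they share a window
print(context_cosine("good", "superb"))  # 1.0: same neighbours, no shared window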
Laplace smoothing was used in SO-PMI primarily to prevent division by zero, rather
than to provide resistance to noise, which is why the relatively small value of 0.01 was
chosen. The experiments show that the performance of SO-PMI is not particularly
sensitive to the value of the smoothing factor with larger corpora.
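The following one-function sketch (a generic smoothed PMI estimated from hit counts, not the exact form of equation (17)) shows where the smoothing factor enters: adding 0.01 to each count leaves non-zero counts essentially unchanged, but keeps the estimate defined when a co-occurrence count is zero.

import math

def smoothed_pmi(hits_near, hits_1, hits_2, total, eps=0.01):
    # Generic Laplace-smoothed PMI from hit counts: adding a small constant
    # eps to every count keeps the logarithm defined when a count is zero.
    # With eps = 0.01 the non-zero counts are barely altered, so the main
    # effect is protection against zeros rather than resistance to noise.
    return math.log2(((hits_near + eps) * total) /
                     ((hits_1 + eps) * (hits_2 + eps)))

# Zero co-occurrences now give a large negative score instead of an error.
print(smoothed_pmi(hits_near=0, hits_1=500, hits_2=800, total=10_000))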
The size of the neighbourhood for SO-PMI seems
to be an important parameter,
especially when the corpus is small. For the TASA corpus, a neighbourhood size of 1000
words (which is the same as a whole document, since the largest document is 650 words
long) yields the best results. On the other hand, for the larger corpora, a neighbourhood
size of ten words (NEAR) results in higher accuracy than using the whole document
(AND). For best results, it seems that the neighbourhood size should be tuned for the
given corpus and the given test words (rarer test words
will tend to need larger
neighbourhoods).
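The sketch below (illustrative; the tokenised document and the function name are invented) shows how the neighbourhood size changes what counts as a co-occurrence: with a ten-word neighbourhood, as with the NEAR operator, two distant words in the same document do not co-occur, whereas with a document-sized neighbourhood, as with AND, they do.

def cooccur(tokens, word1, word2, neighbourhood):
    # True if word1 and word2 ever appear within `neighbourhood` tokens of
    # each other.  neighbourhood = 10 roughly corresponds to the NEAR
    # operator; neighbourhood = len(tokens) to whole-document co-occurrence (AND).
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    return any(abs(i - j) <= neighbourhood for i in pos1 for j in pos2)

doc = ("the service was excellent " + "filler " * 50 + "but the wine was poor").split()
print(cooccur(doc, "excellent", "poor", neighbourhood=10))         # False
print(cooccur(doc, "excellent", "poor", neighbourhood=len(doc)))   # True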
Given the TASA corpus and the GI lexicon, SO-LSA appears to work best with a 250
dimensional space. This is approximately the same number as other researchers have
found useful in other applications of LSA [Deerwester
et al. 1990; Landauer and Dumais
1997]. However, the accuracy with 200 or 300 dimensions is almost the same as the
accuracy with 250 dimensions; SO-LSA is not especially sensitive to the value of this
parameter.
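For readers unfamiliar with the mechanics, the following sketch (using NumPy and a tiny invented word-by-document matrix) shows the essential LSA step: the count matrix is reduced to k latent dimensions by a truncated singular value decomposition, and the association between two words is the cosine of their k-dimensional vectors.

import numpy as np

def lsa_word_vectors(word_by_doc, k=250):
    # Truncated SVD of a word-by-document count matrix.  Each word is then
    # represented by its row of U scaled by the singular values, restricted
    # to the first k latent dimensions.
    U, s, _ = np.linalg.svd(word_by_doc, full_matrices=False)
    k = min(k, len(s))            # cannot keep more dimensions than exist
    return U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Tiny invented matrix: rows are words, columns are documents.
counts = np.array([[2.0, 0.0, 1.0],    # "good"
                   [1.0, 0.0, 2.0],    # "excellent"
                   [0.0, 3.0, 0.0]])   # "bad"
vectors = lsa_word_vectors(counts, k=2)
print(cosine(vectors[0], vectors[1]))  # high: "good" and "excellent"
print(cosine(vectors[0], vectors[2]))  # low:  "good" and "bad"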
The experiments with alternative paradigm words
show that both SO-PMI and
SO-LSA are sensitive to the choice of paradigm words. It appears that the difference
between the original paradigm words and the new paradigm words is that the former are
less context-sensitive. Since SO-A estimates semantic orientation by association with the
paradigm words, it is not surprising that it is important to use paradigm words that are
robust, in the sense that their semantic orientation is relatively insensitive to context.
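In outline, SO-A can be sketched as follows (the paradigm word lists and the association function below are placeholders for the sets and measures defined earlier in the paper): the semantic orientation of a word is its total association with the positive paradigm words minus its total association with the negative paradigm words, where the association measure is PMI for SO-PMI and LSA cosine for SO-LSA.

def so_a(word, pos_paradigm, neg_paradigm, assoc):
    # SO-A: total association with the positive paradigm words minus total
    # association with the negative paradigm words.  `assoc` is any word
    # association measure: PMI for SO-PMI, LSA cosine for SO-LSA.
    return (sum(assoc(word, p) for p in pos_paradigm)
            - sum(assoc(word, q) for q in neg_paradigm))

# Placeholder paradigm sets (stand-ins for the seven positive and seven
# negative paradigm words used in the experiments) and a toy association.
pos_words = ["good", "excellent"]
neg_words = ["bad", "poor"]
toy_assoc = lambda w1, w2: 1.0 if (w1, w2) in {("superb", "good"),
                                               ("superb", "excellent")} else 0.0
print(so_a("superb", pos_words, neg_words, toy_assoc))   # > 0: positive orientation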
7. LIMITATIONS AND FUTURE WORK
A limitation of SO-A is the size of the corpora required for good performance. A large
corpus of text requires significant disk space and processing time.
In our experiments
with SO-PMI, we paused for five seconds between each query, as a courtesy to AltaVista.
Processing the 3,596 words taken from the General Inquirer lexicon required 50,344
queries, which took about 70 hours. This can be reduced to 10 hours, using equation (22)
instead of equation (17), but there may be a loss of accuracy, as we saw in Section 5.5.
However, improvements in hardware will reduce the impact of this limitation. In the
future, corpora of a hundred billion words will be common and the average desktop
computer will be able to process them easily. Today, we can indirectly work with corpora
of this size through web search engines, as we have done in this paper. With a little bit of
creativity, a web search engine can tell us a lot about language use.
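As a rough check of the timing figures above (the per-word query count for equation (22) is an assumption on our part, inferred from the reported ten-hour figure):

PAUSE_SECONDS = 5           # courtesy pause between AltaVista queries
NUM_WORDS = 3596            # test words from the General Inquirer lexicon
QUERIES_EQ17 = 50344        # reported number of queries for equation (17)

print(QUERIES_EQ17 * PAUSE_SECONDS / 3600)    # about 69.9 hours ("about 70 hours")

# Assumption: equation (22) needs roughly two queries per word, which is
# consistent with the reported ten-hour figure:
# 3,596 words * 2 queries * 5 seconds is almost exactly 10 hours.
print(NUM_WORDS * 2 * PAUSE_SECONDS / 3600)   # about 10.0 hours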
The ideas in SO-A can likely be extended to many other semantic aspects of words.
The General Inquirer lexicon has 182 categories of word tags [Stone
et al. 1966] and this
paper has only used two of them, so there is no shortage of future work. For example,
another interesting pair of categories
in General Inquirer is strong and
weak. Although