[Plot: accuracy (y) versus threshold (x), two curves: Original Paradigm and New Paradigm.]
Figure 15. Original paradigm versus new, using SO-PMI with AV-ENG and GI.


[Plot: accuracy (y, 0-100) versus threshold (x, 0-100), two curves: Original Paradigm and New Paradigm.]
Figure 16. Original paradigm versus new, using SO-PMI with AV-CA and GI.
[Plot: accuracy (y, 0-100) versus threshold (x, 0-100), two curves: Original Paradigm and New Paradigm.]
Figure 17. Original paradigm versus new, using SO-PMI with TASA and GI.


[Plot: accuracy (y, 0-100) versus threshold (x, 0-100), two curves: Original Paradigm and New Paradigm.]
Figure 18. Original paradigm versus new, using SO-LSA with TASA and GI.
6. DISCUSSION OF RESULTS 
LSA has not yet been scaled up to corpora of the sizes that are available for PMI-IR, so 
we were unable to evaluate SO-LSA on the larger corpora that were used to evaluate 
SO-PMI. However, the experiments suggest that SO-LSA is able to use data more 
efficiently than SO-PMI, and SO-LSA might surpass the accuracy attained by SO-PMI 
with AV-ENG, given a corpus of comparable size. 
PMI measures the degree of association between two words by the frequency with 
which they co-occur. That is, if PMI(word_1, word_2) is positive, then word_1 
and word_2 tend to occur near each other. Resnik [1995] argues that such 
word-word co-occurrence approaches are able to capture “relatedness” of words, 
but do not specifically address similarity of meaning. LSA, on the other hand, 
measures the degree of association between two words by comparing the contexts 
in which the two words occur. That is, if LSA(word_1, word_2) is positive, then 
(in general) there are many words, word_i, such that word_1 tends to occur near 
word_i and word_2 tends to occur near word_i. It appears that such word-context 
co-occurrence approaches correlate better with human judgments of semantic 
similarity than word-word co-occurrence approaches [Landauer 2002]. This could 
help explain LSA’s apparent efficiency of data usage.
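
To make the contrast concrete, here is a minimal Python sketch (ours, not the 
code used in the experiments; the counts and context vectors are assumed to 
come from some corpus):

    from math import log2, sqrt

    def pmi(count_xy, count_x, count_y, n):
        # Word-word association: PMI(x, y) = log2(p(x, y) / (p(x) * p(y))),
        # estimated from the joint count and the two marginal counts over
        # a corpus of n tokens.
        return log2((count_xy * n) / (count_x * count_y))

    def lsa_style_sim(ctx_x, ctx_y):
        # Word-context association: cosine of the context vectors of x and y
        # (dicts mapping context words to weights). Two words can score high
        # here by occurring near the same words without ever co-occurring.
        dot = sum(ctx_x[w] * ctx_y[w] for w in ctx_x.keys() & ctx_y.keys())
        norm_x = sqrt(sum(v * v for v in ctx_x.values()))
        norm_y = sqrt(sum(v * v for v in ctx_y.values()))
        return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
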
Laplace smoothing was used in SO-PMI primarily to prevent division by zero, rather 
than to provide resistance to noise, which is why the relatively small value of 0.01 was 
chosen. The experiments show that the performance of SO-PMI is not particularly 
sensitive to the value of the smoothing factor with larger corpora. 
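
In hit-count form, the smoothing factor simply keeps the ratio defined (a 
sketch of the equation (22)-style calculation, in which the paradigm words are 
OR'd into one positive query and one negative query; hits() stands in for a 
search-engine query, and we assume the constant is added to every count):

    from math import log2

    EPS = 0.01  # smoothing factor: guards against zero hit counts

    def so_pmi(hits_w_near_pos, hits_w_near_neg, hits_pos, hits_neg):
        # SO-PMI(word) from four hit counts; EPS is added to every count so
        # that a rare word with zero hits does not cause division by zero.
        return log2(((hits_w_near_pos + EPS) * (hits_neg + EPS)) /
                    ((hits_w_near_neg + EPS) * (hits_pos + EPS)))
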
The size of the neighbourhood for SO-PMI seems to be an important parameter, 
especially when the corpus is small. For the TASA corpus, a neighbourhood size of 1000 
words (which is the same as a whole document, since the largest document is 650 words 
long) yields the best results. On the other hand, for the larger corpora, a neighbourhood 
size of ten words (NEAR) results in higher accuracy than using the whole document 
(AND). For best results, it seems that the neighbourhood size should be tuned for the 
given corpus and the given test words (rarer test words will tend to need larger 
neighbourhoods). 
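
As a toy picture of what this parameter controls (a sketch over a tokenized 
document, not the search-engine implementation):

    def near_count(tokens, w1, w2, window=10):
        # Count co-occurrences of w1 and w2 within `window` tokens of each
        # other. window=10 approximates AltaVista's NEAR operator; a window
        # as large as the document approximates AND, i.e. whole-document
        # co-occurrence, the setting that worked best for the small TASA
        # corpus.
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        return sum(1 for i in pos1 for j in pos2 if 0 < abs(i - j) <= window)
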
Given the TASA corpus and the GI lexicon, SO-LSA appears to work best with a 
250-dimensional space. This is approximately the same number as other researchers have 
found useful in other applications of LSA [Deerwester et al. 1990; Landauer and Dumais 
1997]. However, the accuracy with 200 or 300 dimensions is almost the same as the 
accuracy with 250 dimensions; SO-LSA is not especially sensitive to the value of this 
parameter. 
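
This parameter is the rank of the truncated SVD at the heart of LSA (a minimal 
numpy sketch, not the LSA implementation used in the experiments):

    import numpy as np

    def lsa_space(word_by_context, k=250):
        # Truncated SVD of a word-by-context matrix: keep the top k singular
        # values, giving each word a k-dimensional vector. k = 250 gave the
        # best accuracy here, with 200 and 300 nearly indistinguishable.
        u, s, _ = np.linalg.svd(word_by_context, full_matrices=False)
        return u[:, :k] * s[:k]  # one row vector per word
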
The experiments with alternative paradigm words show that both SO-PMI and 
SO-LSA are sensitive to the choice of paradigm words. It appears that the difference 
between the original paradigm words and the new paradigm words is that the former are 
less context-sensitive. Since SO-A estimates semantic orientation by association with the 
paradigm words, it is not surprising that it is important to use paradigm words that are 
robust, in the sense that their semantic orientation is relatively insensitive to context.
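
This dependence is visible directly in the definition of SO-A (a sketch; assoc 
stands for whichever association measure is plugged in, and the word lists are 
the original paradigm words from the experiments above):

    # The original paradigm words: seven positive and seven negative.
    POSITIVE = ["good", "nice", "excellent", "positive", "fortunate",
                "correct", "superior"]
    NEGATIVE = ["bad", "nasty", "poor", "negative", "unfortunate",
                "wrong", "inferior"]

    def so_a(word, assoc):
        # Semantic orientation: total association with the positive paradigm
        # words minus total association with the negative ones.
        return (sum(assoc(word, p) for p in POSITIVE)
                - sum(assoc(word, n) for n in NEGATIVE))
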
7. LIMITATIONS AND FUTURE WORK 
A limitation of SO-A is the size of the corpora required for good performance. A large 
corpus of text requires significant disk space and processing time. In our experiments 
with SO-PMI, we paused for five seconds between each query, as a courtesy to AltaVista. 
Processing the 3,596 words taken from the General Inquirer lexicon required 50,344 
queries, which took about 70 hours. This can be reduced to 10 hours, using equation (22) 
instead of equation (17), but there may be a loss of accuracy, as we saw in Section 5.5. 
However, improvements in hardware will reduce the impact of this limitation. In the 
future, corpora of a hundred billion words will be common and the average desktop 
computer will be able to process them easily. Today, we can indirectly work with corpora 
of this size through web search engines, as we have done in this paper. With a little bit of 
creativity, a web search engine can tell us a lot about language use. 
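
The timing figures above follow directly from the query counts (inferred from 
the numbers in this section: 50,344 / 3,596 = 14 queries per word under 
equation (17), versus two OR'd queries per word under equation (22)):

    SECONDS_PER_QUERY = 5     # courtesy pause between AltaVista queries
    WORDS = 3596              # test words from the General Inquirer lexicon

    q17 = WORDS * 14          # 50,344 queries, one per (word, paradigm word)
    q22 = WORDS * 2           # 7,192 queries, paradigm words OR'd together

    print(q17 * SECONDS_PER_QUERY / 3600)  # ~69.9 hours
    print(q22 * SECONDS_PER_QUERY / 3600)  # ~10.0 hours
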


The ideas in SO-A can likely be extended to many other semantic aspects of words. 
The General Inquirer lexicon has 182 categories of word tags [Stone et al. 1966] and this 
paper has only used two of them, so there is no shortage of future work. For example, 
another interesting pair of categories in General Inquirer is strong and weak. Although 
