[Figure 5 plot: accuracy (y-axis, 0 to 80) against Laplace smoothing factor (x-axis, 0.0001 to 10000, logarithmic scale), with one curve each for the 100%, 75%, 50%, and 25% thresholds.]
Figure 5. Effect of Laplace smoothing factor with TASA and the GI lexicon.
These three figures show that the optimal smoothing factor increases as the size of the
corpus increases, as expected. The figures also show that the impact of the smoothing
factor decreases as the corpus size increases. There is less need for smoothing when a
large quantity of data is available. The baseline smoothing factor of 0.01 was chosen to
avoid division by zero, not to provide resistance to noise. The benefit from optimizing the
smoothing factor for noise resistance is small for large corpora.
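
The role of the smoothing factor can be illustrated with a minimal Python sketch. It assumes the orientation score is a base-2 log-ratio of co-occurrence counts, with the smoothing factor added to each co-occurrence count; this section does not restate the formula, so that form is an assumption. The function name `smoothed_orientation` and its parameters are hypothetical.

```python
import math


def smoothed_orientation(near_pos, near_neg, total_pos, total_neg, smoothing=0.01):
    """Log-ratio association score with additive (Laplace-style) smoothing.

    near_pos / near_neg: co-occurrence counts of the target word with the
        positive / negative paradigm words (hypothetical inputs).
    total_pos / total_neg: overall counts of the paradigm words, assumed > 0.
    smoothing: constant added to each co-occurrence count (assumed placement).
    """
    numerator = (near_pos + smoothing) * total_neg
    denominator = (near_neg + smoothing) * total_pos
    # A positive smoothing value keeps the denominator non-zero even when
    # near_neg is 0, so the ratio is always defined.
    return math.log2(numerator / denominator)


# Same sparse counts, two different smoothing factors.
sparse_small = smoothed_orientation(3, 0, 100, 100, smoothing=0.01)
sparse_large = smoothed_orientation(3, 0, 100, 100, smoothing=10)
```

With three positive co-occurrences and no negative ones, the factor of 0.01 yields a large score, while the factor of 10 pulls the score much closer to zero. When counts are large, as with a big corpus, the added constant changes the ratio very little, which is consistent with the smaller impact of smoothing reported above.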
5.4. Varying the Neighbourhood Size
The AltaVista NEAR operator restricts search to a fixed neighbourhood of ten words, but
we can vary the neighbourhood size with the TASA corpus, since we have a local copy of
the corpus. Figure 6 shows accuracy as a function of the neighbourhood size, as we vary
the size from 2 to 1000 words, using TASA and the GI lexicon.
The advantage of a small neighbourhood is that words that occur closer to each other
are more likely to be semantically related. The disadvantage is that, for any pair of words,
there will usually be more occurrences of the pair within a large neighbourhood than
within a small neighbourhood, so a larger neighbourhood will tend to have higher
statistical reliability. An optimal neighbourhood size will balance these conflicting
effects. A larger corpus should yield better statistical reliability than a smaller corpus, so
the optimal neighbourhood size will be smaller with a larger corpus. The optimal
neighbourhood size will also be determined by the frequency of the words in the test set.
Rare words will favour a larger neighbourhood size than frequent words.
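
The trade-off above can be made concrete with a minimal Python sketch that counts how often two words fall within a given neighbourhood of each other. It assumes a tokenized corpus held in memory and treats two occurrences as co-occurring when their positions differ by at most `neighbourhood` words; this is one possible definition, not necessarily the one used in the experiments. The function name `count_cooccurrences` and the toy sentence are hypothetical.

```python
def count_cooccurrences(tokens, word_a, word_b, neighbourhood):
    """Count pairs of occurrences of word_a and word_b that lie within
    `neighbourhood` token positions of each other (assumed definition)."""
    positions_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    return sum(
        1
        for i in positions_a
        for j in positions_b
        if 0 < abs(i - j) <= neighbourhood
    )


# Toy corpus for illustration only.
tokens = "the good film was long but the ending was excellent".split()

narrow = count_cooccurrences(tokens, "good", "excellent", 2)   # 0
wide = count_cooccurrences(tokens, "good", "excellent", 10)    # 1
```

Widening the neighbourhood can only keep or increase the count, which is why a larger neighbourhood gives more data per word pair, at the cost of admitting pairs that are less closely related.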
Figure 6 shows that, for the TASA corpus and the GI lexicon, it seems best to have a
neighbourhood size of at least 100 words. The TASA corpus is relatively small, so it is
not surprising that a large neighbourhood size is best. The baseline neighbourhood size of
10 words is clearly suboptimal for TASA.
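[Figure 6 plot: accuracy (y-axis, 0 to 80) against neighbourhood size (x-axis, 1 to 1000 words, logarithmic scale), using TASA and the GI lexicon.]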