[Figure 2: plot of Accuracy against Threshold, both axes running from 0 to 100, with one curve each for the AV-ENG, AV-CA, and TASA corpora.]
Figure 2. Accuracy of SO-PMI with the GI lexicon and the three corpora.
5.3. Varying the Laplace Smoothing Factor
As we mentioned in Section 3.1, we used a Laplace smoothing factor of 0.01 in the
baseline version of SO-PMI. In this section, we explore the impact of varying the
smoothing factor.
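
To make the role of the smoothing factor concrete, the following is a minimal sketch of a smoothed SO-PMI score for a single word. It assumes that the smoothing factor is simply added to every hit count before the log-ratio is taken, and it collapses the paradigm words into one positive and one negative query. The function name, the argument names, and the counts in the usage note are hypothetical illustrations, not values from our experiments.

```python
import math


def so_pmi(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """Smoothed SO-PMI for one word (illustrative sketch only).

    near_pos / near_neg: hits for the word NEAR the positive / negative query
    hits_pos / hits_neg: hits for the positive / negative query alone
    smoothing: Laplace smoothing factor, assumed here to be added to each count
    """
    numerator = (near_pos + smoothing) * (hits_neg + smoothing)
    denominator = (near_neg + smoothing) * (hits_pos + smoothing)
    return math.log2(numerator / denominator)


# Hypothetical counts for a rare word: 3 positive and 1 negative co-occurrences.
small = so_pmi(3, 1, 1000, 1000, smoothing=0.01)
large = so_pmi(3, 1, 1000, 1000, smoothing=100)
```

Under these assumptions, `small` is well above zero while `large` is close to zero: a large smoothing factor swamps low co-occurrence counts and pulls the score toward neutral. A word with no co-occurrences at all receives a score of exactly zero rather than causing a division by zero.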
Figure 3 graphs the accuracy of SO-PMI as a function of the smoothing factor, which
varies from 0.0001 to 10,000 (note the logarithmic scale), using the AV-ENG corpus and
the GI lexicon. There are four curves, for four different thresholds on the percentage of
the full test set that is classified. The smoothing factor has relatively little impact until it
rises above 10, at which point the accuracy begins to fall off. The optimal value is about
1, although the difference between 1 and 0.1 or 0.01 is slight.
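
The threshold curves can be read with the help of a small sketch. It assumes that words are ranked by the absolute value of their SO-PMI score (used as a confidence measure), that only the top percentage of the test set is classified, and that the sign of the score gives the predicted label. The function name and the toy data are hypothetical.

```python
import math


def accuracy_at_threshold(scores, labels, percent):
    """Accuracy over the most confident `percent` of the test set.

    scores: SO-PMI values, one per word
    labels: +1 for positive words, -1 for negative words
    percent: share of the test set to classify, from 0 to 100
    """
    if not scores:
        raise ValueError("empty test set")
    # Rank by confidence, taken here to be the absolute score (an assumption).
    ranked = sorted(zip(scores, labels), key=lambda pair: abs(pair[0]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    kept = ranked[:keep]
    # A score of exactly zero is treated as a negative prediction in this sketch.
    correct = sum(1 for score, label in kept if (score > 0) == (label > 0))
    return correct / len(kept)


# Toy data: the two most confident words are right, the two least confident are wrong.
toy_scores = [2.5, -1.8, 0.3, -0.1]
toy_labels = [1, -1, -1, 1]
top_half = accuracy_at_threshold(toy_scores, toy_labels, 50)
full_set = accuracy_at_threshold(toy_scores, toy_labels, 100)
```

In this toy example the accuracy is higher on the top half than on the full set, which is the pattern the separate threshold curves are meant to expose.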
Figure 4 shows the same experimental setup, except using the AV-CA corpus. We see the
same general pattern, but the accuracy begins to decline a little earlier, when the
smoothing factor rises above 0.1. The highest accuracy is attained when the smoothing
factor is about 0.1. The AV-CA corpus (approximately 2 × 10^9 words) is more sensitive
to the smoothing factor than the AV-ENG corpus (approximately 1 × 10^11 words). A
smoothing factor of about 0.1 seems to help SO-PMI handle the increased noise, due to
the smaller corpus (compare Figure 3 and Figure 4).