[Figure 1: plot of accuracy against threshold for the AV-ENG, AV-CA, and TASA corpora.]
Figure 1. Accuracy of SO-PMI with the HM lexicon and the three corpora.
Table 5 shows the accuracy of SO-PMI with the GI lexicon, which includes adverbs,
nouns, and verbs, in addition to adjectives. Figure 2 gives more detail. Compared with
Table 4 and Figure 1, there is a slight drop in accuracy, but the general trends are the
same.
Table 5. The accuracy of SO-PMI with the GI lexicon and the three corpora.
Percent of full test set          Size of test set   Accuracy with AV-ENG   Accuracy with AV-CA   Accuracy with TASA
100%                              3596               82.84%                 76.06%                61.26%
75%                               2697               90.66%                 81.76%                63.92%
50%                               1798               95.49%                 87.26%                47.33%
25%                               899                97.11%                 89.88%                68.74%
Approx. num. of words in corpus                      1 × 10^11              2 × 10^9              1 × 10^7
[Figure 2: plot of accuracy (0 to 100) against threshold (0 to 100) for the AV-ENG, AV-CA, and TASA corpora.]
Figure 2. Accuracy of SO-PMI with the GI lexicon and the three corpora.
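The threshold rows in Table 5 can be illustrated with a minimal sketch. It assumes that a threshold of N% means classifying only the N% of test words with the largest absolute SO-PMI, and that a word is labelled positive when its score is above zero. The function name, the input layout, and the toy scores are hypothetical and are not taken from the experiments.

```python
def accuracy_at_threshold(scored, fraction):
    """Accuracy on the most confidently scored share of a test set.

    scored   -- list of (so_score, gold_label) pairs, gold_label is +1 or -1
    fraction -- share of the test set to classify, e.g. 0.75 for 75%
    Returns (number of words classified, accuracy on those words).
    """
    # Rank by confidence: larger |score| means a more confident decision.
    ranked = sorted(scored, key=lambda pair: abs(pair[0]), reverse=True)
    kept = ranked[:round(len(ranked) * fraction)]
    if not kept:
        raise ValueError("threshold leaves no words to classify")
    # Assumption: a score of exactly zero is treated as negative.
    correct = sum(1 for score, gold in kept if (score > 0) == (gold > 0))
    return len(kept), correct / len(kept)


# Hypothetical toy scores: the least confident word is the only error.
toy = [(5.0, +1), (-4.0, -1), (3.0, +1), (-0.5, +1)]
for fraction in (1.0, 0.75, 0.5):
    print(fraction, accuracy_at_threshold(toy, fraction))
```

With the full GI test set of 3596 words, the same rounding gives 2697, 1798, and 899 words at 75%, 50%, and 25%, which matches the test set sizes in Table 5.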
5.3. Varying the Laplace Smoothing Factor
As we mentioned in Section 3.1, we used a Laplace smoothing factor of 0.01 in the
baseline version of SO-PMI. In this section, we explore the impact of varying the
smoothing factor.
Figure 3 graphs the accuracy of SO-PMI as a function of the smoothing factor, which
varies from 0.0001 to 10,000 (note the logarithmic scale), using the AV-ENG corpus and
the GI lexicon. There are four curves, for four different thresholds on the percentage of
the full test set that is classified. The smoothing factor has relatively little impact until it
rises above 10, at which point the accuracy begins to fall off. The optimal value is about
1, although the difference between 1 and 0.1 or 0.01 is slight.
Figure 4 shows the same experimental setup, except using the AV-CA corpus. We see
the same general pattern, but the accuracy begins to decline a little earlier, when the
smoothing factor rises above 0.1. The highest accuracy is attained when the smoothing
factor is about 0.1. The AV-CA corpus (approximately 2 × 10^9 words) is more sensitive
to the smoothing factor than the AV-ENG corpus (approximately 1 × 10^11 words). A
smoothing factor of about 0.1 seems to help SO-PMI handle the increased noise, due to
the smaller corpus (compare Figure 3 and Figure 4).
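The effect of the smoothing factor can be sketched with a small example. It assumes that SO-PMI is the log of the hit-count ratio defined earlier in the paper and that the smoothing factor is added to each co-occurrence (NEAR) count, so that a zero count does not make the ratio undefined. The function name, argument names, and hit counts below are hypothetical and chosen only for illustration.

```python
import math


def so_pmi(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """Smoothed SO-PMI of one word from raw hit counts (a sketch).

    near_pos -- hits(word NEAR pword) for each positive paradigm word
    near_neg -- hits(word NEAR nword) for each negative paradigm word
    hits_pos -- hits(pword) for each positive paradigm word (must be > 0)
    hits_neg -- hits(nword) for each negative paradigm word (must be > 0)
    """
    # Work in log space so that products of large counts cannot overflow.
    log_numerator = (sum(math.log2(h + smoothing) for h in near_pos)
                     + sum(math.log2(h) for h in hits_neg))
    log_denominator = (sum(math.log2(h) for h in hits_pos)
                       + sum(math.log2(h + smoothing) for h in near_neg))
    return log_numerator - log_denominator


# Hypothetical counts: the word co-occurs mostly with positive paradigm words.
near_pos, near_neg = [10, 0], [0, 1]
hits_pos, hits_neg = [1000, 1000], [1000, 1000]

for factor in (0.0001, 0.01, 1, 100, 10000):
    score = so_pmi(near_pos, near_neg, hits_pos, hits_neg, smoothing=factor)
    print(f"smoothing={factor:<8} SO-PMI={score:.4f}")
```

In this toy case a very large smoothing factor swamps the observed co-occurrence counts and pulls the score toward zero, while small factors leave it almost unchanged. This is one way to read the flat region and the later fall-off in the curves, although the toy counts are far too small to say anything about where the fall-off begins on a real corpus.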