generated by reducing multiple-entry words to single entries. Some words with multiple
senses were tagged as both “Positiv” and “Negativ”. For example, “mind” in the sense of
“intellect” is positive, but “mind” in the sense of “beware” is negative. These ambiguous
words were not included in our set of 3,596 words. We also excluded the fourteen
paradigm words (good/bad, nice/nasty, etc.).
Of the words in the HM lexicon, 47.7% also appear in the GI lexicon (324 positive,
313 negative). The agreement between the two lexicons on the orientation of these shared
words is 98.3% (6 terms are positive in HM but negative in GI; 5 terms are negative in
HM but positive in GI).
The AltaVista search engine is available at http://www.altavista.com/. Based on
reports in the popular press and our own tests with various queries, we estimate that the
AltaVista index contained approximately 350 million English web pages at the time our
experiments were carried out. This corresponds to roughly one hundred billion words.
We call this the AV-ENG corpus. The set of web pages indexed by AltaVista is
constantly changing, but there is enough stability that our experiments were reliably
repeatable over the course of several months.
In order to examine the effect of corpus size on learning, we used AV-CA, a subset of
the AV-ENG corpus. The AV-CA corpus was produced by adding “AND host:.ca” to
every query to AltaVista, which restricts the search results to the web pages with “ca” in
the host domain name. This consists mainly of hosts that end in “ca” (the Canadian
domain), but it also includes a few hosts with “ca” in other parts of the domain name
(such as “http://www.ca.com/”). The AV-CA corpus contains approximately 7 million
web pages (roughly two billion words), about 2% of the size of the AV-ENG corpus.
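To make the query construction concrete, the following Python sketch (an illustration only; the helper name and example query are hypothetical) shows how the host restriction is appended to a query:

    def restrict_to_ca(query):
        # Append AltaVista's host restriction, so that matching pages come
        # only from hosts with "ca" in the domain name.
        return query + " AND host:.ca"

    # Hypothetical co-occurrence query for a word and a paradigm word:
    print(restrict_to_ca('"unpredictable" NEAR "poor"'))
    # prints: "unpredictable" NEAR "poor" AND host:.ca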
Our experiments with SO-LSA are based on the online demonstration of LSA,
available at http://lsa.colorado.edu/. This demonstration allows a choice of several
different corpora. We chose the largest corpus,
the TASA-ALL corpus, which we call
simply TASA. In the online LSA demonstration, TASA is called the “General Reading
up to 1st year college (300 factors)” topic space. The corpus contains a wide variety of
short documents,
taken from novels, newspaper articles, and other sources. It was
collected by Touchstone Applied Science Associates, to develop The Educator’s Word
Frequency Guide. The TASA corpus contains approximately 10 million words, about
0.5% of the size of the AV-CA corpus.
The TASA corpus is not indexed by AltaVista. For SO-PMI, the following
experimental results were generated by emulating AltaVista on a local copy of the TASA
corpus. We used a simple Perl script to calculate the hits()
function for TASA, as a
surrogate for sending queries to AltaVista.
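As a rough sketch of what this emulation involves (a Python illustration, not the actual Perl script; the one-document-per-line file format and the ten-word NEAR window are assumptions), hits() can be computed by scanning the local corpus:

    import re

    def load_documents(path):
        # Assumed format: one TASA document per line in a local text file.
        with open(path) as f:
            return [re.findall(r"[a-z']+", line.lower()) for line in f]

    def hits(documents, word, near_word=None, window=10):
        # Emulate AltaVista's hits(): count the documents that match.
        # With near_word given, a document matches when the two words
        # occur within `window` tokens of each other, in the spirit of
        # AltaVista's NEAR operator.
        count = 0
        for tokens in documents:
            if near_word is None:
                matched = word in tokens
            else:
                pos1 = [i for i, t in enumerate(tokens) if t == word]
                pos2 = [i for i, t in enumerate(tokens) if t == near_word]
                matched = any(abs(i - j) <= window
                              for i in pos1 for j in pos2)
            count += matched
        return count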
5.2. SO-PMI Baseline
Table 4 shows the accuracy of SO-PMI in its baseline configuration, as described in
Section 3.1. These results are for all three corpora, tested with the HM lexicon. In this
table, the strength (absolute value) of the semantic orientation was used as a measure of
confidence that the word will be correctly classified. Test words were sorted in
descending order of the absolute value of their semantic orientation, and the top-ranked
words (the highest-confidence words) were then classified. For example, the second row
in Table 4 shows the accuracy when the top 75% of the test words (those with the highest
confidence) were classified and the remaining 25% (those with the lowest confidence)
were ignored.
Table 4. The accuracy of SO-PMI with the HM lexicon and the three corpora.
Percent of full   Size of     Accuracy with   Accuracy with   Accuracy with
test set          test set    AV-ENG          AV-CA           TASA
100%              1336        87.13%          80.31%          61.83%
75%               1002        94.41%          85.93%          64.17%
50%                668        97.60%          91.32%          46.56%
25%                334        98.20%          92.81%          70.96%
Approx. num. of
words in corpus               1 × 10^11       2 × 10^9        1 × 10^7
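The confidence-based thresholding behind the rows of Table 4 can be sketched as follows (a minimal illustration with names of our own choosing; so is a word's SO-PMI value and gold is its label in the HM lexicon):

    def accuracy_at_threshold(scored_words, fraction):
        # scored_words: (word, so, gold_is_positive) triples.
        # Rank by |so| (confidence), classify only the top fraction, and
        # predict "positive" when so > 0.
        ranked = sorted(scored_words, key=lambda w: abs(w[1]), reverse=True)
        top = ranked[:int(len(ranked) * fraction)]
        correct = sum((so > 0) == gold for _, so, gold in top)
        return correct / len(top)

    # The second row of Table 4 corresponds to fraction = 0.75.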
The performance of SO-PMI in Table 4 can be compared to the performance of the
HM algorithm in Table 2 (Section 4.1), since both use the HM lexicon, but there are
some differences in the evaluation, since the HM algorithm is supervised but SO-PMI is
unsupervised. Because the HM algorithm is supervised, part of the HM lexicon must be
set aside for training, so the algorithm cannot be evaluated on the whole lexicon. Aside
from this caveat, it appears that the performance of
the HM algorithm is roughly
comparable to the performance of SO-PMI with the AV-CA corpus, which is about one
hundred times larger than the corpus used by Hatzivassiloglou and McKeown [1997]
(2 × 10^9 words versus 2 × 10^7 words). This suggests that the HM algorithm makes more
efficient use of corpora than SO-PMI, but the advantage of SO-PMI is that it can easily
be scaled up to very large corpora, where it can achieve significantly higher accuracy.
The results of these experiments are shown in more detail in Figure 1. The percentage
of the full test set (labeled “threshold” in the figure) varies from 5% to 100% in increments
of 5%. Three curves are plotted, one for each of the three corpora. The figure shows that
a smaller corpus not only results in lower accuracy, but also results in less stability. With
the larger corpora, the curves are relatively smooth; with the smallest corpus, the curve
looks quite noisy.
[Figure 1. Accuracy of SO-PMI (vertical axis, 0 to 100) as the threshold (horizontal axis, 0 to 100) varies, with one curve for each of the three corpora.]