The HM lexicon is a list of 1,336 labeled adjectives created by Hatzivassiloglou and McKeown [1997]. The GI lexicon is a list of 3,596 labeled words extracted from the General Inquirer lexicon [Stone et al. 1966]. The AV-ENG corpus is the set of English web pages indexed by the AltaVista search engine. The AV-CA corpus is the set of English web pages in the Canadian domain that are indexed by AltaVista. The TASA corpus is a set of short English documents gathered from a variety of sources by Touchstone Applied Science Associates.
The HM lexicon consists of 1,336 adjectives, 657 positive and 679 negative
[Hatzivassiloglou and McKeown 1997]. We described this lexicon earlier, in Sections 1
and 4.1. We use the HM lexicon to allow comparison between the approach of
Hatzivassiloglou and McKeown [1997] and the SO-A algorithms described here.
Since the HM lexicon is limited to adjectives, most of the following experiments use a second lexicon, the GI lexicon, which consists of 3,596 adjectives, adverbs, nouns, and verbs, 1,614 positive and 1,982 negative [Stone et al. 1966]. The General Inquirer lexicon is available at http://www.wjh.harvard.edu/~inquirer/. The lexicon was developed by Philip Stone and his colleagues, beginning in the 1960s, and continues to grow. It was designed as a tool for content analysis, a technique used by social scientists, political scientists, and psychologists for objectively identifying specified characteristics of messages [Stone et al. 1966].
The full General Inquirer lexicon has 182 categories of word tags and 11,788 words.
The words tagged “Positiv” (1,915 words) and “Negativ” (2,291 words) have
(respectively) positive and negative semantic orientations. Table 3 lists some examples.
Table 3. Examples of “Positiv” and “Negativ” words.

Positiv                        Negativ
abide      absolve             abandon        abhor
ability    absorbent           abandonment    abject
able       absorption          abate          abnormal
abound     abundance           abdicate       abolish
Words with multiple senses may have multiple entries in the lexicon. The list of 3,596
words (1,614 positive and 1,982 negative) used in the subsequent experiments was
generated by reducing multiple-entry words to single entries. Some words with multiple
senses were tagged as both “Positiv” and “Negativ”. For example, “mind” in the sense of
“intellect” is positive, but “mind” in the sense of “beware” is negative. These ambiguous
words were not included in our set of 3,596 words. We also excluded the fourteen
paradigm words (good/bad, nice/nasty, etc.).
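As a rough illustration (not part of the original experiments), this clean-up can be sketched in Python as follows. The input format, the "#" sense marker, and all names here are assumptions for the sketch, not details taken from the General Inquirer distribution.

```python
# Minimal sketch: collapse multiple senses, drop ambiguous words and
# paradigm words. Assumes raw entries arrive as (word, tag) pairs.

PARADIGM_WORDS = {"good", "bad", "nice", "nasty"}  # plus the other ten paradigm words

def build_gi_lexicon(entries):
    """entries: iterable of (word, tag) pairs, tag in {"Positiv", "Negativ"}.

    Returns a dict mapping each retained word to its single tag.
    """
    tags = {}
    for word, tag in entries:
        base = word.lower().split("#")[0]  # e.g. "MIND#1" -> "mind" (assumed sense marker)
        tags.setdefault(base, set()).add(tag)
    # Keep only unambiguous, non-paradigm words.
    return {w: t.pop() for w, t in tags.items()
            if len(t) == 1 and w not in PARADIGM_WORDS}
```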
Of the 1,336 words in the HM lexicon, 637 (47.7%) also appear in the GI lexicon (324 positive, 313 negative). The two lexicons agree on the orientation of 626 of these 637 shared words, an agreement of 98.3% (6 terms are positive in HM but negative in GI; 5 terms are negative in HM but positive in GI).
The AltaVista search engine is available at http://www.altavista.com/. Based on estimates in the popular press and our own tests with various queries, the AltaVista index contained approximately 350 million English web pages at the time our experiments were carried out, which corresponds to roughly one hundred billion words.
We call this the AV-ENG corpus. The set of web pages indexed by AltaVista is
constantly changing, but there is enough stability that our experiments were reliably
repeatable over the course of several months.
In order to examine the effect of corpus size on learning, we used AV-CA, a subset of
the AV-ENG corpus. The AV-CA corpus was produced by adding “AND host:.ca” to
every query to AltaVista, which restricts the search results to web pages with “ca” in the host domain name. The resulting set consists mainly of hosts that end in “ca” (the Canadian domain), but it also includes a few hosts with “ca” in other parts of the domain name
(such as “http://www.ca.com/”). The AV-CA corpus contains approximately 7 million
web pages (roughly two billion words), about 2% of the size of the AV-ENG corpus.
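As a concrete illustration, the restriction amounts to appending a suffix to each query string; the helper name below is hypothetical.

```python
def restrict_to_ca(query: str) -> str:
    # Append AltaVista's host restriction so that only pages with
    # "ca" in the host domain name are counted.
    return query + " AND host:.ca"
```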
Our experiments with SO-LSA are based on the online demonstration of LSA,
available at http://lsa.colorado.edu/. This demonstration allows a choice of several
different corpora. We chose the largest corpus, the TASA-ALL corpus, which we call
simply TASA. In the online LSA demonstration, TASA is called the “General Reading
up to 1st year college (300 factors)” topic space. The corpus contains a wide variety of
short documents taken from novels, newspaper articles, and other sources. It was collected by Touchstone Applied Science Associates to develop The Educator’s Word
Frequency Guide. The TASA corpus contains approximately 10 million words, about
0.5% of the size of the AV-CA corpus.
The TASA corpus is not indexed by AltaVista. For SO-PMI, the following
experimental results were generated by emulating AltaVista on a local copy of the TASA
corpus. We used a simple Perl script to calculate the hits() function for TASA, as a
surrogate for sending queries to AltaVista.
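For illustration, a rough Python counterpart to such a surrogate is sketched below (the original script was written in Perl and is not reproduced in this paper). The ten-word NEAR window and the in-memory list of documents are assumptions of the sketch, not details given here.

```python
import re

WINDOW = 10  # assumed width of AltaVista's NEAR operator

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def hits(term, documents, near=None):
    """Count documents matching a query, emulating AltaVista's hits().

    hits("romantic", docs) counts documents containing "romantic";
    hits("romantic", docs, near="excellent") counts documents in which
    the two terms occur within WINDOW tokens of each other.
    """
    count = 0
    for doc in documents:
        tokens = tokenize(doc)
        pos1 = [i for i, t in enumerate(tokens) if t == term]
        if not pos1:
            continue
        if near is None:
            count += 1
        else:
            pos2 = [i for i, t in enumerate(tokens) if t == near]
            if any(abs(i - j) <= WINDOW for i in pos1 for j in pos2):
                count += 1
    return count
```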
5.2. SO-PMI Baseline
Table 4 shows the accuracy of SO-PMI in its baseline configuration, as described in
Section 3.1. These results are for all three corpora, tested with the HM lexicon. In this
table, the strength (absolute value) of the semantic orientation was used as a measure of
confidence that the word would be correctly classified. Test words were sorted in
descending order of the absolute value of their semantic orientation and the top ranked
words (the highest confidence words) were then classified. For example, the second row
in Table 4 shows the accuracy when the top 75% (with highest confidence) were
classified and the last 25% (with lowest confidence) were ignored.
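This thresholding procedure amounts to a simple sort on the absolute value of SO-PMI. A minimal sketch follows, assuming the scores are available as a word-to-score mapping; the function name and input format are illustrative, not from the original evaluation code.

```python
def classify_top_fraction(so_scores, fraction):
    """Classify only the highest-confidence fraction of the test words.

    so_scores: dict mapping each test word to its SO-PMI value.
    fraction: portion of the test set to classify, e.g. 0.75.
    Returns (word, label) pairs for the words with the largest |SO|.
    """
    ranked = sorted(so_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    top = ranked[:int(len(ranked) * fraction)]
    return [(w, "positive" if so >= 0 else "negative") for w, so in top]
```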
Table 4. The accuracy of SO-PMI with the HM lexicon and the three corpora.

Percent of full   Size of     Accuracy with   Accuracy with   Accuracy with
test set          test set    AV-ENG          AV-CA           TASA
100%              1336        87.13%          80.31%          61.83%
75%               1002        94.41%          85.93%          64.17%
50%               668         97.60%          91.32%          46.56%
25%               334         98.20%          92.81%          70.96%
Approx. num. of
words in corpus               1 × 10^11       2 × 10^9        1 × 10^7
The performance of SO-PMI in Table 4 can be compared to the performance of the
HM algorithm in Table 2 (Section 4.1), since both use the HM lexicon, but there are
some differences in the evaluation, because the HM algorithm is supervised while SO-PMI is
unsupervised. Because the HM algorithm is supervised, part of the HM lexicon must be
set aside for training, so the algorithm cannot be evaluated on the whole lexicon. Aside
from this caveat, it appears that the performance of the HM algorithm is roughly
comparable to the performance of SO-PMI with the AV-CA corpus, which is about one
hundred times larger than the corpus used by Hatzivassiloglou and McKeown [1997]
(2 × 10^9 words versus 2 × 10^7 words). This suggests that the HM algorithm makes more
efficient use of corpora than SO-PMI, but the advantage of SO-PMI is that it can easily
be scaled up to very large corpora, where it can achieve significantly higher accuracy.
The results of these experiments are shown in more detail in Figure 1. The percentage of the full test set (labeled threshold in the figure) varies from 5% to 100% in increments of 5%. Three curves are plotted, one for each of the three corpora. The figure shows that a smaller corpus not only results in lower accuracy, but also in less stability. With the larger corpora, the curves are relatively smooth; with the smallest corpus, the curve looks quite noisy.
[Figure 1. Accuracy of SO-PMI with the HM lexicon on the three corpora, as the threshold varies from 5% to 100%; both axes range from 0 to 100.]