Pwords = a set of words with positive semantic orientation
Nwords = a set of words with negative semantic orientation
A(word1, word2) = a measure of association between word1 and word2

SO-A(word) = \sum_{pword \in Pwords} A(word, pword) - \sum_{nword \in Nwords} A(word, nword)          (4)
Pwords = {good, nice, excellent, positive, fortunate, correct, and superior}
Nwords = {bad, nasty, poor, negative, unfortunate, wrong, and inferior}.
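To make the definition concrete, the following minimal Python sketch computes SO-A for an association measure supplied by the caller; the function name and signature are illustrative assumptions rather than part of the original formulation.

```python
# A minimal sketch of SO-A: the association measure A(word1, word2) is supplied
# by the caller (it could be PMI, an LSA cosine, or any other measure).
from typing import Callable, Iterable

PWORDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
NWORDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]


def so_a(word: str,
         assoc: Callable[[str, str], float],
         pwords: Iterable[str] = PWORDS,
         nwords: Iterable[str] = NWORDS) -> float:
    """Semantic orientation of `word`: total association with the positive
    paradigm words minus total association with the negative paradigm words."""
    return (sum(assoc(word, pword) for pword in pwords)
            - sum(assoc(word, nword) for nword in nwords))
```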
strategy. This paper examines SO-PMI (Semantic Orientation from Pointwise Mutual
Information) and SO-LSA (Semantic Orientation from Latent Semantic Analysis).
3.1. Semantic Orientation from PMI
PMI-IR [Turney 2001] uses Pointwise Mutual Information (PMI) to calculate the strength
of the semantic association between words [Church and Hanks 1989]. Word co-
occurrence statistics are obtained using Information Retrieval (IR). PMI-IR has been
empirically evaluated using 80 synonym test questions from the Test of English as a
Foreign Language (TOEFL), obtaining a score of 74% [Turney 2001], comparable to that
produced by direct thesaurus search [Littman 2001].
The Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows [Church and Hanks 1989]:
PMI(word_1, word_2) = \log_2 \frac{p(word_1 \;\&\; word_2)}{p(word_1)\, p(word_2)}          (7)
Here, p(word1 & word2) is the probability that word1 and word2 co-occur. If the words are
statistically independent, the probability that they co-occur is given by the product
p(word1) p(word2). The ratio between p(word1 & word2) and p(word1) p(word2) is a
measure of the degree of statistical dependence between the words. The log of the ratio
corresponds to a form of correlation, which is positive when the words tend to co-occur
and negative when the presence of one word makes it likely that the other word is absent.
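For illustration only (the probabilities below are invented), a small Python sketch of equation (7):

```python
import math


def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise mutual information (equation 7): the log (base 2) of the ratio
    between the observed co-occurrence probability and the probability expected
    if the two words were statistically independent."""
    return math.log2(p_xy / (p_x * p_y))


# Toy numbers, assumed purely for illustration: two words that each occur in 1%
# of contexts but co-occur in 0.05% of contexts co-occur five times more often
# than independence predicts, so PMI is positive.
print(pmi(0.0005, 0.01, 0.01))  # log2(5) ≈ 2.32
```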
PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR)
and noting the number of hits (matching documents). The following experiments use the
AltaVista Advanced Search engine (see http://www.altavista.com/sites/search/adv), which indexes approximately 350 million web pages
(counting only those pages that are in English). Given a (conservative) estimate of 300
words per web page, this represents a corpus of at least one hundred billion words.
AltaVista was chosen over other search engines because it has a NEAR operator. The
AltaVista NEAR operator constrains the search to documents that contain the words
within ten words of one another, in either order. Previous work has shown that NEAR
performs better than AND when measuring the strength of semantic association between
words [Turney 2001]. We experimentally compare NEAR and AND in Section 5.4.
SO-PMI is an instance of SO-A. From equation (4), we have:
SO-PMI(word) = \sum_{pword \in Pwords} PMI(word, pword) - \sum_{nword \in Nwords} PMI(word, nword)          (8)
Let hits(query) be the number of hits returned by the search engine, given the query,
query. We calculate PMI(word1, word2) from equation (7) as follows:
PMI(word_1, word_2) = \log_2 \frac{\frac{1}{N}\, hits(word_1 \; NEAR \; word_2)}{\frac{1}{N}\, hits(word_1) \cdot \frac{1}{N}\, hits(word_2)}          (9)
Here, N is the total number of documents indexed by the search engine. Combining
equations (8) and (9), we have:
SO-PMI(word) = \log_2 \frac{\prod_{pword \in Pwords} hits(word \; NEAR \; pword) \cdot \prod_{nword \in Nwords} hits(nword)}{\prod_{pword \in Pwords} hits(pword) \cdot \prod_{nword \in Nwords} hits(word \; NEAR \; nword)}          (10)
Note that N, the total number of documents, drops out of the final equation. Equation (10)
is a log-odds ratio [Agresti 1996].
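To spell out why N vanishes, rewrite each PMI term of equation (8) using equation (9):

PMI(word_1, word_2) = \log_2 \frac{N \cdot hits(word_1 \; NEAR \; word_2)}{hits(word_1)\, hits(word_2)} = \log_2 N + \log_2 \frac{hits(word_1 \; NEAR \; word_2)}{hits(word_1)\, hits(word_2)}

Each of the seven positive paradigm words contributes +\log_2 N and each of the seven negative paradigm words contributes -\log_2 N, so the N terms cancel exactly because Pwords and Nwords are the same size.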
Calculating the semantic orientation of a word via equation (10) requires twenty-eight
queries to AltaVista (assuming there are fourteen paradigm words). Since the two
products in (10) that do not contain word are constant for all words, they only need to be
calculated once. Ignoring these two constant products, the experiments required only
fourteen queries per word.
To avoid division by zero, 0.01 was added to the number of hits. This is a form of
Laplace smoothing. We examine the effect of varying this parameter in Section 5.3.
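The following Python sketch puts equation (10) and the smoothing together; hits(query) is an assumed stand-in for the search-engine interface (the NEAR query strings are illustrative only), and the paradigm lists are repeated so the sketch is self-contained.

```python
import math
from typing import Callable

PWORDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
NWORDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]


def so_pmi(word: str, hits: Callable[[str], float], smooth: float = 0.01) -> float:
    """Semantic orientation via equation (10). `hits` takes a query string and
    returns the number of matching documents. Adding `smooth` (0.01) to every
    count avoids division by zero when a query returns no hits."""
    numerator = 1.0
    denominator = 1.0
    for pword in PWORDS:
        numerator *= hits(f"{word} NEAR {pword}") + smooth
        denominator *= hits(pword) + smooth
    for nword in NWORDS:
        numerator *= hits(nword) + smooth
        denominator *= hits(f"{word} NEAR {nword}") + smooth
    return math.log2(numerator / denominator)
```

In practice the hits(pword) and hits(nword) factors would be computed once and cached, since, as noted above, they are constant across all words.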
Pointwise Mutual Information is only one of many possible measures of word
association. Several others are surveyed in Manning and Schütze [1999]. Dunning [1993]
suggests the use of likelihood ratios as an improvement over PMI. To calculate likelihood
ratios for the association of two words, X and Y, we need to know four numbers:
k(X Y) = the frequency that X occurs within a given neighbourhood of Y          (11)
k(~X Y) = the frequency that Y occurs in a neighbourhood without X          (12)
k(X ~Y) = the frequency that X occurs in a neighbourhood without Y          (13)
k(~X ~Y) = the frequency that neither X nor Y occur in a neighbourhood.          (14)
If the neighbourhood size is ten words, then we can use hits(X NEAR Y) to estimate
k(X Y) and hits(X) - hits(X NEAR Y) to estimate k(X ~Y), but note that these are only
rough estimates, since hits(X NEAR Y) is the number of documents that contain X near Y,
not the number of neighbourhoods that contain X and Y. Some preliminary experiments
suggest that this distinction is important, since alternatives to PMI (such as likelihood
ratios [Dunning 1993] and the Z-score [Smadja 1993]) appear to perform worse than PMI
when used with search engine hit counts.
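As a rough sketch only, under the document-level approximation just described (and extending it, by analogy, to the two counts the text does not spell out), the contingency counts (11)-(14) and Dunning's log-likelihood ratio could be estimated from hit counts as follows:

```python
import math
from typing import Callable


def likelihood_ratio(x: str, y: str, hits: Callable[[str], float], n_docs: float) -> float:
    """Dunning-style log-likelihood ratio (G^2) for the association of X and Y.
    The contingency counts are rough, document-level estimates of (11)-(14):
      k(X Y)   ~ hits(X NEAR Y)
      k(X ~Y)  ~ hits(X) - hits(X NEAR Y)
      k(~X Y)  ~ hits(Y) - hits(X NEAR Y)                      (assumed, by analogy)
      k(~X ~Y) ~ n_docs - hits(X) - hits(Y) + hits(X NEAR Y)   (assumed, by analogy)
    """
    k11 = hits(f"{x} NEAR {y}")
    k12 = hits(x) - k11
    k21 = hits(y) - k11
    k22 = n_docs - hits(x) - hits(y) + k11
    total = k11 + k12 + k21 + k22

    def term(observed: float, expected: float) -> float:
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return 2.0 * (term(k11, row1 * col1 / total)
                  + term(k12, row1 * col2 / total)
                  + term(k21, row2 * col1 / total)
                  + term(k22, row2 * col2 / total))
```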
However, if we do not restrict our attention to measures of word association that are
compatible with search engine hit counts, there are many possibilities. In the next
subsection, we look at one of them, Latent Semantic Analysis.
3.2. Semantic Orientation from LSA
SO-LSA applies Latent Semantic Analysis (LSA) to calculate the strength of the
semantic association between words [Landauer and Dumais 1997]. LSA uses the Singular
Value Decomposition (SVD) to analyze the statistical relationships among words in a
corpus.
The first step is to use the text to construct a matrix X, in which the row vectors
represent words and the column vectors represent chunks of text (e.g., sentences,
paragraphs, documents). Each cell represents the weight of the corresponding word in the
corresponding chunk of text. The weight is typically the tf-idf score (Term Frequency
times Inverse Document Frequency) for the word in the chunk. (tf-idf is a standard tool in
information retrieval [van Rijsbergen 1979].)
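A brief sketch of this first step, using scikit-learn's TfidfVectorizer purely as a convenience (the original work does not prescribe any particular toolkit); the chunks are invented for illustration:

```python
# Build the word-by-chunk matrix X with tf-idf weights. TfidfVectorizer returns
# a chunk-by-word matrix, so it is transposed to put words on the rows and
# chunks of text on the columns, as described above.
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [  # toy chunks of text, invented for illustration
    "the movie was excellent and the acting was superb",
    "a poor script and wooden, unfortunate acting",
    "a good cast wasted on an inferior story",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chunks).T  # rows: words, columns: chunks
words = vectorizer.get_feature_names_out()
print(X.shape)  # (number of distinct words, number of chunks)
```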
The next step is to apply singular value decomposition [Golub and Van Loan 1996] to