Microsoft Word turney-littman-acm doc



$U_k \Sigma_k V_k^T$, which is equivalent to using the corresponding rows of $U_k$
[Deerwester et al. 1990; Bartell et al. 1992; Schütze 1993; Landauer and Dumais 1997].
The semantic orientation of a word, word, is calculated by SO-LSA from equation (4), as
follows:

$$\text{SO-LSA}(word) = \sum_{pword \in Pwords} \text{LSA}(word, pword) - \sum_{nword \in Nwords} \text{LSA}(word, nword) \qquad (15)$$

For the paradigm words, we have the following (from equations (5), (6), and (15)):

$$\text{SO-LSA}(word) = [\text{LSA}(word, good) + \ldots + \text{LSA}(word, superior)] - [\text{LSA}(word, bad) + \ldots + \text{LSA}(word, inferior)] \qquad (16)$$

As with SO-PMI, a word, word, is classified as having a positive semantic orientation
when SO-LSA(word) is positive and a negative orientation when SO-LSA(word) is
negative. The magnitude of SO-LSA(word) represents the strength of the semantic
orientation.

[5] The tf-idf score gives more weight to terms that are statistically “surprising”. This heuristic works well for
information retrieval, but its impact on determining semantic orientation is unknown.

4. RELATED WORK
Related work falls into three groups: work on classifying words by positive or negative
semantic orientation (Section 4.1), classifying reviews (e.g., movie reviews) as positive
or negative (Section 4.2), and recognizing subjectivity in text (Section 4.3).
4.1. Classifying Words
Hatzivassiloglou and McKeown [1997] treat the problem of determining semantic
orientation as a problem of classifying words, as we also do in this paper. They note that
there are linguistic constraints on the semantic orientations of adjectives in conjunctions.
As an example, they present the following three sentences:
1. The tax proposal was simple and well received by the public.
2. The tax proposal was simplistic, but well received by the public.
3. (*) The tax proposal was simplistic and well received by the public.
The third sentence is incorrect, because we use “and” with adjectives that have the same
semantic orientation (“simple” and “well-received” are both positive), but we use “but”
with adjectives that have different semantic orientations (“simplistic” is negative).
Hatzivassiloglou and McKeown [1997] use a four-step supervised learning algorithm
to infer the semantic orientation of adjectives from constraints on conjunctions:
1. All conjunctions of adjectives are extracted from the given corpus.
2. A supervised learning algorithm combines multiple sources of evidence to label pairs 
of adjectives as having the same semantic orientation or different semantic 
orientations. The result is a graph where the nodes are adjectives and links indicate 
sameness or difference of semantic orientation.
3. A clustering algorithm processes the graph structure to produce two subsets of 
adjectives, such that links across the two subsets are mainly different-orientation 
links, and links inside a subset are mainly same-orientation links. 
4. Since it is known that positive adjectives tend to be used more frequently than 
negative adjectives, the cluster with the higher average frequency is classified as 
having positive semantic orientation. 
For brevity, we will call this the HM algorithm. 
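Steps 3 and 4 can be illustrated with a much-simplified sketch. It assumes that the same/different links from step 2 are already available and mutually consistent, and it replaces HM's clustering with plain breadth-first label propagation, which is not the method used by Hatzivassiloglou and McKeown. The function names, the link list, and the frequency counts are all invented for illustration; only the three adjectives come from the example sentences above.

```python
from collections import defaultdict, deque

def two_way_partition(links):
    """Split adjectives into two groups by propagating link labels.

    links: iterable of (adj_a, adj_b, same), where same is True for a
    same-orientation link and False for a different-orientation link.
    Assumption: links are consistent. A conflicting link is ignored
    (the first assignment wins), and each disconnected component is
    oriented arbitrarily. A real clustering step would handle both.
    """
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))

    side = {}
    for start in graph:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour, same in graph[node]:
                if neighbour not in side:
                    side[neighbour] = side[node] if same else 1 - side[node]
                    queue.append(neighbour)

    groups = [set(), set()]
    for adjective, s in side.items():
        groups[s].add(adjective)
    return groups

def label_groups(groups, frequency):
    """Step 4: the group with the higher average frequency is called positive."""
    def mean_frequency(group):
        return sum(frequency[a] for a in group) / len(group) if group else 0.0

    first, second = groups
    if mean_frequency(first) >= mean_frequency(second):
        return {"positive": first, "negative": second}
    return {"positive": second, "negative": first}

# Links implied by the example sentences: "and" joins same-orientation
# adjectives, "but" joins different-orientation adjectives.
LINKS = [
    ("simple", "well-received", True),
    ("simplistic", "well-received", False),
]
# Hypothetical corpus frequencies, made up for this sketch.
FREQUENCY = {"simple": 120, "well-received": 40, "simplistic": 10}

labels = label_groups(two_way_partition(LINKS), FREQUENCY)
print(labels)
```

With these invented counts, "simple" and "well-received" fall into the more frequent group and are labeled positive, while "simplistic" is labeled negative.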
Like SO-PMI and SO-LSA, HM can produce a real-valued number that indicates both 
the direction (positive or negative) and the strength of the semantic orientation. The 
clustering algorithm (Step 3 above) can produce a “goodness-of-fit” measure that 
indicates how well an adjective fits in its assigned cluster.
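To make the idea of a real-valued orientation concrete, the following sketch computes an SO-LSA score in the form of equation (15): summed similarity to positive paradigm words minus summed similarity to negative paradigm words. The two-dimensional vectors are invented stand-ins for compressed LSA row vectors, the paradigm lists contain only the four words named explicitly in equation (16), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def so_lsa(word, vectors, pwords, nwords):
    """Sum of similarities to positive words minus sum to negative words."""
    positive = sum(cosine(vectors[word], vectors[p]) for p in pwords)
    negative = sum(cosine(vectors[word], vectors[n]) for n in nwords)
    return positive - negative

# Hypothetical word vectors, made up for illustration only.
VECTORS = {
    "good": [1.0, 0.0],
    "superior": [0.9, 0.1],
    "bad": [0.0, 1.0],
    "inferior": [0.1, 0.9],
    "pleasant": [0.8, 0.2],
    "dreadful": [0.2, 0.8],
}
# Subset of the paradigm words: only those written out in equation (16).
PWORDS = ["good", "superior"]
NWORDS = ["bad", "inferior"]

for w in ("pleasant", "dreadful"):
    score = so_lsa(w, VECTORS, PWORDS, NWORDS)
    orientation = "positive" if score > 0 else "negative"
    print(f"{w}: SO-LSA = {score:+.3f} ({orientation})")
```

The sign of the score gives the direction and its absolute value gives the strength, which is the sense in which the score is real-valued.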
Hatzivassiloglou and McKeown [1997] used a corpus of 21 million words and 
evaluated HM with 1,336 manually labeled adjectives (657 positive and 679 negative). 
Their results are given in Table 2. HM classifies adjectives with accuracies ranging from 
78% to 92%, depending on Alpha, as described next.
Table 2. The accuracy of HM with a 21 million-word corpus. [6]

Alpha   Accuracy   Size of test set   Percent of “full” test set
        78.08%     730                100.0%
        82.56%     516                70.7%
        87.26%     369                50.5%
        92.37%     236                32.3%

[6] This table is derived from Table 3 in Hatzivassiloglou and McKeown [1997].

Alpha is a parameter that is used to partition the 1,336 labeled adjectives into training
and testing sets. As Alpha increases, the training set grows and the testing set becomes
smaller. The precise definition of Alpha is complicated, but the basic idea is to put the
hard cases (the adjectives for which there are few conjunctions in the given corpus) in the
training set and the easy cases (the adjectives for which there are many conjunctions) in
the testing set. As Alpha increases, the testing set becomes increasingly easy (that is, the
adjectives that remain in the testing set are increasingly well covered by the given
corpus). In essence, the idea is to improve accuracy by abstaining from classifying the
difficult (rare, sparsely represented) adjectives. As expected, the accuracy rises as Alpha
rises. This suggests that the accuracy will improve with larger corpora.
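The trade-off between accuracy and coverage in Table 2 can be checked with a few lines of arithmetic. The sketch below recomputes the "percent of full test set" column from the test-set sizes. It also estimates the number of correctly classified adjectives in each row as accuracy times size, rounded. That count is an inference from the table, not a figure reported in it, and the function name is hypothetical.

```python
# Rows of Table 2: (accuracy in percent, size of test set).
TABLE_2 = [(78.08, 730), (82.56, 516), (87.26, 369), (92.37, 236)]

def coverage_report(rows):
    """Return (accuracy, size, coverage %, estimated correct) per row.

    Coverage is measured against the first row, the "full" test set.
    The estimated correct count is inferred, not taken from the source.
    """
    full_size = rows[0][1]
    report = []
    for accuracy, size in rows:
        coverage = round(100.0 * size / full_size, 1)
        correct = round(accuracy / 100.0 * size)
        report.append((accuracy, size, coverage, correct))
    return report

for accuracy, size, coverage, correct in coverage_report(TABLE_2):
    print(f"accuracy {accuracy:.2f}%  size {size:3d}  "
          f"coverage {coverage:5.1f}%  estimated correct {correct}")
```

The recomputed coverage values match the last column of the table, and they show that the gain from 78% to 92% accuracy comes at the cost of abstaining on roughly two thirds of the full test set.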
This algorithm is able to achieve good accuracy levels, but it has some limitations. In 
contrast with SO-A, HM is restricted to adjectives and it requires labeled adjectives as 
training data (in step 2).
Although each step in HM, taken by itself, is relatively simple, the combination of the 
four steps makes theoretical analysis challenging. In particular, the interaction between 
the supervised labeling (step 2) and the clustering (step 3) is difficult to analyze. For 
example, the degree of regularization (i.e., smoothing, pruning) in the labeling step may 
have an impact on the quality of the clusters. By contrast, SO-PMI is captured in a single 
formula (equation (10)), which takes the form of the familiar log-odds ratio [Agresti 
1996]. 
HM has only been evaluated with adjectives, but it seems likely that it would work 
with adverbs. For example, we would tend to say “He ran quickly (+) 
