5.5. Product versus Disjunction
Recall equation (10), for calculating SO-PMI(word):
(17)
As we discussed in Section 3.1, this equation requires fourteen queries to AltaVista for
each word (ignoring the constant terms). In this section, we investigate whether the
number of queries can be reduced by combining the paradigm words, using the OR
operator.
For convenience, we introduce the following definitions:
(18)
(19)
Given the fourteen paradigm words, for example, we have the following (from equations
(5), (6), (18), and (19)):
(20)
(21)
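As a minimal sketch of how such a combined query might be assembled, the snippet below joins a list of paradigm words with OR. The function name `build_or_query` is hypothetical, and the word lists are illustrative subsets containing only the words named in the text; the elided words are not filled in.

```python
def build_or_query(words):
    """Join paradigm words into one parenthesized OR query string."""
    if not words:
        raise ValueError("need at least one paradigm word")
    return "(" + " OR ".join(words) + ")"


# Illustrative subsets only: the text names these words and elides the rest.
pwords = ["good", "nice", "superior"]
nwords = ["bad", "nasty", "inferior"]

pquery = build_or_query(pwords)
nquery = build_or_query(nwords)
print(pquery)  # (good OR nice OR superior)
print(nquery)  # (bad OR nasty OR inferior)
```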
We attempt to approximate (17) as follows (see footnote 8):
(22)
Calculating the semantic orientation of a word using equation (22) requires only two
queries per word, instead of fourteen (ignoring the constant terms, hits(Pquery) and
hits(Nquery)).
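The difference in query count can be illustrated with a small sketch. Everything here is an assumption made for illustration: `CountingHits` is a hypothetical stand-in for the search engine that returns toy counts and records the queries it receives, the paradigm words are placeholder tokens, and the 0.01 smoothing from footnote 8 is applied to every hit count.

```python
import math

SMOOTHING = 0.01  # added to hit counts to avoid division by zero (footnote 8)


class CountingHits:
    """Hypothetical stand-in for a search engine: looks up toy hit counts
    and records every query it receives."""

    def __init__(self, table):
        self.table = table
        self.queries = []

    def __call__(self, query):
        self.queries.append(query)
        return self.table.get(query, 0)


def or_query(words):
    """Combine paradigm words into a single parenthesized OR query."""
    return "(" + " OR ".join(words) + ")"


def so_pmi_product(word, pwords, nwords, hits):
    """Product form (equation (17)), accumulated as a sum of logarithms."""
    total = 0.0
    for p in pwords:
        total += math.log2(hits(f"{word} NEAR {p}") + SMOOTHING)
        total -= math.log2(hits(p) + SMOOTHING)
    for n in nwords:
        total += math.log2(hits(n) + SMOOTHING)
        total -= math.log2(hits(f"{word} NEAR {n}") + SMOOTHING)
    return total


def so_pmi_disjunction(word, pwords, nwords, hits):
    """Disjunction form (equation (22)): one NEAR query per polarity."""
    pquery, nquery = or_query(pwords), or_query(nwords)
    num = (hits(f"{word} NEAR {pquery}") + SMOOTHING) * (hits(nquery) + SMOOTHING)
    den = (hits(f"{word} NEAR {nquery}") + SMOOTHING) * (hits(pquery) + SMOOTHING)
    return math.log2(num / den)


def near_query_count(hits):
    """Count the word-dependent (NEAR) queries that were issued."""
    return sum(" NEAR " in q for q in hits.queries)


# Placeholder tokens standing in for the seven positive and seven negative
# paradigm words; they are not the actual paradigm words.
pwords = [f"p{i}" for i in range(1, 8)]
nwords = [f"n{i}" for i in range(1, 8)]

product_hits = CountingHits({})
so_pmi_product("word", pwords, nwords, product_hits)

disjunction_hits = CountingHits({})
so_pmi_disjunction("word", pwords, nwords, disjunction_hits)

print(near_query_count(product_hits), near_query_count(disjunction_hits))  # 14 2
```

The queries without NEAR correspond to the constant terms, which do not depend on the word and could be issued once and cached.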
Figure 9 plots the performance of product (equation (17)) versus disjunction (equation
(22)) for SO-PMI with the AV-ENG corpus and the GI lexicon. Figure 10 shows the
performance with the AV-CA corpus and Figure 11 with the TASA corpus. For the
largest corpus, there is a clear advantage to using our original equation (17), but the two
equations have similar performance with the smaller corpora. Since the execution time of
SO-PMI is almost completely dependent on the number of queries sent to AltaVista,
equation (22) executes seven times faster than equation (17). Therefore the disjunction
equation should be preferred for smaller corpora and the product equation should be
preferred for larger corpora.

Footnote 8: We use OR here, because using AND or NEAR would almost always result in zero hits. We add 0.01 to the
hits, to avoid division by zero.

Equations (17) to (22):

$$\text{SO-PMI}(word) = \log_2 \left( \frac{\prod_{pword \in Pwords} \text{hits}(word \text{ NEAR } pword) \cdot \prod_{nword \in Nwords} \text{hits}(nword)}{\prod_{pword \in Pwords} \text{hits}(pword) \cdot \prod_{nword \in Nwords} \text{hits}(word \text{ NEAR } nword)} \right) \tag{17}$$

$$Pquery = \operatorname*{OR}_{pword \in Pwords} pword \tag{18}$$

$$Nquery = \operatorname*{OR}_{nword \in Nwords} nword \tag{19}$$

$$Pquery = (\text{good OR nice OR } \ldots \text{ OR superior}) \tag{20}$$

$$Nquery = (\text{bad OR nasty OR } \ldots \text{ OR inferior}) \tag{21}$$

$$\text{SO-PMI}(word) = \log_2 \left( \frac{\text{hits}(word \text{ NEAR } Pquery) \cdot \text{hits}(Nquery)}{\text{hits}(word \text{ NEAR } Nquery) \cdot \text{hits}(Pquery)} \right) \tag{22}$$