Information extraction from the web using a search engine Citation for published version (apa)


g
for a given instance $i \in I_a$. We use the co-occurrences between instances in $I_a$ and $I_g$ to compute scores $s(i, j)$ that express the applicability of tag $j$ to instance $i$. For each instance $i$, we identify an initial mapping $m'(i)$ by selecting the tag with the highest score.
Subsequently, we investigate whether we can use the hypothesis that related
instances often share a category, as we have created methods to identify relatedness
between instances. We hence reuse the values $t(i, i')$ to find a final mapping.
Using either PM, PCM or DM, we can also acquire co-occurrence counts for pairs $(i, j) \in I_a \times I_g$. The function $S(i, j)$ expressing the extent of applicability of $j$ to $i$ is defined similarly to the function $T$, namely

$$S(i, j) = \frac{\mathrm{co}(i, j)}{\sum_{i' \in I_a} \mathrm{co}(i', j)}, \tag{6.3}$$

and we normalize this function as follows

$$s(i, j) = \frac{S(i, j)}{\sum_{j' \in I_g} S(i, j')}. \tag{6.4}$$
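The two normalization steps of Eq. (6.3) and (6.4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `applicability_scores`, the dictionary layout of the counts, the toy data, and the choice to return 0 when a denominator is 0 are not taken from the text.

```python
def applicability_scores(co, instances, tags):
    """Return s[(i, j)] computed as in Eq. (6.3) and (6.4).

    co maps (instance, tag) pairs to co-occurrence counts;
    pairs that are missing from co count as 0 (assumption).
    """
    # Denominator of Eq. (6.3): total count of tag j over all instances i'.
    tag_total = {j: sum(co.get((i, j), 0) for i in instances) for j in tags}

    # Eq. (6.3): S(i, j) = co(i, j) / sum_{i'} co(i', j).
    # A tag that never co-occurs gets S = 0 (assumption, avoids division by 0).
    S = {
        (i, j): (co.get((i, j), 0) / tag_total[j]) if tag_total[j] else 0.0
        for i in instances
        for j in tags
    }

    # Eq. (6.4): s(i, j) = S(i, j) / sum_{j'} S(i, j').
    s = {}
    for i in instances:
        row_total = sum(S[(i, j)] for j in tags)
        for j in tags:
            s[(i, j)] = S[(i, j)] / row_total if row_total else 0.0
    return s


# Hypothetical toy counts, for illustration only.
instances = ["artist_1", "artist_2"]
tags = ["rock", "jazz"]
co = {
    ("artist_1", "rock"): 30, ("artist_1", "jazz"): 10,
    ("artist_2", "rock"): 10, ("artist_2", "jazz"): 30,
}
scores = applicability_scores(co, instances, tags)
```

After the second normalization, the scores of each instance sum to 1 over all tags, provided the instance co-occurs with at least one tag.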

Now, $s(i, j)$ can be read as the probability that tag $j$ is applicable to $i$. If we are interested in the tag $m'(i)$ most applicable to $i$, we thus select the $j$ such that $s(i, j)$ is maximized,

$$m'(i) = \operatorname*{argmax}_{j \in I_g} s(i, j). \tag{6.5}$$
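As a minimal sketch, the selection in Eq. (6.5) amounts to one `max` call per instance. The function name `initial_mapping`, the score values, and the handling of ties (the first tag in the given order wins) are assumptions made for this illustration.

```python
def initial_mapping(s, instances, tags):
    """Eq. (6.5): map every instance to the tag j that maximizes s(i, j).

    s maps (instance, tag) pairs to normalized scores.
    max() keeps the first maximal tag in the order of `tags`;
    this tie handling is an assumption of the sketch.
    """
    return {i: max(tags, key=lambda j: s[(i, j)]) for i in instances}


# Hypothetical normalized scores, for illustration only.
instances = ["artist_1", "artist_2"]
tags = ["rock", "jazz"]
s = {
    ("artist_1", "rock"): 0.75, ("artist_1", "jazz"): 0.25,
    ("artist_2", "rock"): 0.25, ("artist_2", "jazz"): 0.75,
}
mapping = initial_mapping(s, instances, tags)
```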
The instance categorization problem focuses on the identification of one single tag or category for a given instance. We investigate whether we can improve the initial mapping $m'$ by using the assumption that related instances in $I_a$ often share a category. We are hence interested in whether the computed relatedness between instances in $I_a$ helps to improve the precision of the mapping $m'$.
We combine the scores $t$ with the initial mapping $m'$ as follows. For each $i \in I_a$, we inspect $m'$ to determine the category that is assigned most often to $i$ and its $k - 1$ most related instances. We thus expect that the most appropriate category $j$ for $i$ is most often mapped by $m'$ among $i$ and its nearest neighbors.
For each instance $i \in I_a$, we construct an ordered list $B_k(i)$ with $i$ and its $k - 1$ nearest neighbors

$$B_k(i) = (i_1, i_2, \ldots, i_k)$$

with $i$ as its first element, i.e. $i = i_1$, and

$$t(i, i_l) \geq t(i, i_{l+1}), \quad \text{for } 1 \leq l < k.$$
For a final mapping $m$ of instances $i \in I_a$ to a category in $I_g$, we inspect the category most frequently mapped by $m'$ to $i$ and its $k - 1$ nearest neighbors,

$$m(i, k) = \operatorname*{argmax}_{j \in I_g} \Big( \sum_{i' \in B_k(i)} \tau(i', j) \Big)$$

with

$$\tau(i', j) = \begin{cases} 1 & \text{if } m'(i') = j \\ 0 & \text{otherwise.} \end{cases}$$
If two categories have an equal score, we select the one that occurs first in the ordered list, that is, the category that $m'$ maps to $i$ or to the instance most related to $i$.
Hence, we address the instance categorization problem by selecting the single
tag (category, genre, etc.) m(i, k).
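The neighborhood vote and its tie-breaking rule can be sketched as follows. This is an illustration under assumptions, not the implementation used in the experiments: the function name `final_mapping`, the dictionary layout of $t$ and of the initial mapping, and the toy data are hypothetical.

```python
def final_mapping(i, k, instances, t, initial_map):
    """Return m(i, k): the category most often assigned by the initial
    mapping to i and its k - 1 most related instances.

    t maps (i, i') pairs to relatedness scores,
    initial_map maps each instance to its initial category.
    """
    # B_k(i): i first, then the other instances by decreasing t(i, i').
    others = sorted(
        (x for x in instances if x != i),
        key=lambda x: t[(i, x)],
        reverse=True,
    )
    neighborhood = [i] + others[: k - 1]

    # Count how often each category is assigned within B_k(i).
    votes = {}
    for x in neighborhood:
        category = initial_map[x]
        votes[category] = votes.get(category, 0) + 1

    # Tie-break: among the top-scoring categories, take the one that
    # occurs first when walking through B_k(i) in order.
    best = max(votes.values())
    for x in neighborhood:
        if votes[initial_map[x]] == best:
            return initial_map[x]


# Hypothetical toy data, for illustration only.
instances = ["a", "b", "c", "d"]
t = {("a", "b"): 0.5, ("a", "c"): 0.3, ("a", "d"): 0.2}
initial_map = {"a": "jazz", "b": "rock", "c": "rock", "d": "jazz"}

category_k3 = final_mapping("a", 3, instances, t, initial_map)
category_k2 = final_mapping("a", 2, instances, t, initial_map)
```

With $k = 3$ the two nearest neighbors outvote the initial category of `a`. With $k = 2$ the vote is tied, so the category of `a` itself is kept.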
6.2.3 Tagging Instances
With respect to the instance tagging problem, we assume that multiple tags may be applicable to an instance. Hence, we are interested in an ordered list of tags for a given instance in $I_a$. Similar to the approach for the instance categorization problem, we will start with the scores $s(i, j)$ to compute an initial ordered list of tags for instance $i$. Likewise, we investigate whether the use of instance relatedness can lead to improvements over the initial tagging.
When addressing the instance categorization problem, we assumed the relation between instances and tags to be functional. That is, each instance in $I_a$ was assumed to be related to at most one tag (e.g. a genre or art style). When dealing with the instance tagging problem, however, we assume that multiple tags are applicable to a given instance. Thus the question is which of the tags are most applicable and to what extent.
The use of the score $s(i, j)$ is a first approximation to identify the tags most related to the given instance $i$. Similar to the computation of the final mapping $m$, we use the similarity between the instances in $I_a$ to obtain a final score.
The degree of relatedness of an instance $i'$ to $i$ is given by $t(i, i')$. For tag $j$, the degree of applicability of $j$ to $i$ is given by $s(i, j)$.
We use the computed scores of relatedness $t(i, i')$ to improve the initial tagging $s(i, j)$. If two instances are closely related, we expect similar tags for the two. Hence, if $i'$ is closely related to $i$, we want $s(i', j)$ to contribute significantly to the final score $p(i, j)$. Using the normalized scoring functions, we can compute the applicability $p'(i, j)$ of tag $j$ to instance $i$ as follows

$$p'(i, j) = \sum_{i',\, i' \neq i} t(i, i') \cdot s(i', j). \tag{6.6}$$
If a high score is erroneously found for $s(i, j)$, the effect of this error is reduced when closely related instances $i'$ have low scores $s(i', j)$.
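A minimal sketch of the weighted sum in Eq. (6.6), followed by a ranking of the tags on the resulting score. The function name `neighbor_weighted_scores`, the dictionary layout, and the toy values are assumptions made for this illustration.

```python
def neighbor_weighted_scores(i, instances, tags, t, s):
    """Eq. (6.6): p'(i, j) = sum over i' != i of t(i, i') * s(i', j).

    t maps (i, i') pairs to relatedness scores,
    s maps (instance, tag) pairs to normalized applicability scores.
    """
    return {
        j: sum(t[(i, x)] * s[(x, j)] for x in instances if x != i)
        for j in tags
    }


# Hypothetical toy data, for illustration only.
instances = ["a", "b", "c"]
tags = ["rock", "jazz"]
t = {("a", "b"): 0.6, ("a", "c"): 0.4}
s = {
    ("a", "rock"): 0.2, ("a", "jazz"): 0.8,
    ("b", "rock"): 0.9, ("b", "jazz"): 0.1,
    ("c", "rock"): 0.5, ("c", "jazz"): 0.5,
}

p_prime = neighbor_weighted_scores("a", instances, tags, t, s)
# Ordered tag list for instance "a", most applicable tag first.
ranking = sorted(tags, key=lambda j: p_prime[j], reverse=True)
```

In this toy example the own scores of `a` favor `jazz`, yet its neighbors push `rock` to the top of the ranking, since $s(i, j)$ itself does not enter the sum in Eq. (6.6).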
However, $p'(i, j)$ does not suffice as no self-relatedness score $t(i, i)$ is defined. We do consider $s(i, j)$ relevant when computing the scores for the tags with respect to instance $i$. Hence, we introduce a weight $w$ for $s(i, j)$ as a substitute for $t(i, i)$ in the score $p(i, j)$,
