Information extraction from the web using a search engine



... $I_a$ and $I_g$ and collect the top $k$ documents for each of the queries.
For instances in $I_a$, we retrieve each document using the URLs found by the
search engine. We count the occurrences of the categories in $I_g$ (thus the
names of the categories) in the retrieved documents for the initial mapping
$m_0$. From the documents retrieved with a category $g \in I_g$, we similarly
extract the occurrences of instances in $I_a$.
The documents obtained using DM are the most relevant for each element
$b \in I_a$. For the instances in $I_a$ queried, we expect to find biographies,
fan pages, pages of museums, entries in database sites and so on. The labels in
$I_g$ (e.g. the genres, styles or other descriptors) mentioned in these pages
will most probably reflect the genre of the artist queried. Thus
$\mathrm{co}(i, j)$ is here defined as follows.
Definition [DM co-occurrences]. $\mathrm{co}_{DM}(i, j)$ gives the number of
occurrences of $j$ in documents found with $i$, plus the number of occurrences
of $i$ in documents found with $j$. $\Box$
Like PM, this method also requires only $O(n+m)$ queries. However, additional
data communication is required since for each query up to $k$ documents have to
be downloaded instead of using only the data provided by the search engine.
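
The definition of DM co-occurrences can be illustrated with a minimal sketch. It assumes the documents for each query have already been downloaded and reduced to plain text; the names `count_occurrences`, `co_dm` and `docs`, the whole-word case-insensitive matching, and the toy texts are all illustrative assumptions rather than part of the method described above.

```python
import re


def count_occurrences(term, text):
    """Count case-insensitive whole-word occurrences of term in text."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def co_dm(docs, instances, categories):
    """Compute co_DM(i, j) for every instance i and category j.

    docs maps each queried term to the list of document texts retrieved
    for it (hypothetical input; fetching the documents is out of scope).
    """
    co = {}
    for i in instances:
        for j in categories:
            # Occurrences of j in the documents found with i ...
            j_in_i = sum(count_occurrences(j, d) for d in docs.get(i, []))
            # ... plus occurrences of i in the documents found with j.
            i_in_j = sum(count_occurrences(i, d) for d in docs.get(j, []))
            co[(i, j)] = j_in_i + i_in_j
    return co


# Made-up toy documents, one per query.
docs = {
    "Rembrandt": ["Rembrandt was a Baroque painter. Baroque art flourished."],
    "Rubens": ["Rubens painted in Antwerp."],
    "Baroque": ["Baroque painters include Rembrandt and Rubens."],
}
co = co_dm(docs, ["Rembrandt", "Rubens"], ["Baroque"])
print(co)
```

In this toy run only three queries are represented (two instances and one category), in line with the $O(n+m)$ query count; all counting happens on the downloaded texts.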
6.2 Processing Extracted Subjective Information
In the previous section, we discussed three methods to identify relation instances
on the web. Here we show how we use the numbers of co-occurrences of these
related instances to address the three problems presented in this chapter.

6.2.1 Identifying Relatedness between Instances
Having gathered a list of co-occurrences of instances in $I_a$ using either PM,
PCM or DM, we are interested in the extent to which these instances are
expressed to be related. We assume that two instances are related when they are
relatively often mentioned in the same context. For each instance $i$ we could
consider the instance $i' \in I_a$ with the highest $\mathrm{co}(i, i')$ to be
the most related to $i$. However, we observe that, in that case, frequently
occurring instances have a relatively large probability to be related to any
other instance. This observation leads to an approach inspired by the theory of
pointwise mutual information [Manning & Schütze, 1999; Downey et al., 2005].
We use $T(i, i')$ to express the relatedness of instance $i'$ to $i$ as follows,
\[
T(i, i') = \frac{\mathrm{co}(i, i')}{\sum_{i'',\, i'' \neq i'} \mathrm{co}(i'', i')}.
\tag{6.1}
\]
The function $T$ can be normalized to $t$, i.e. with values
$0 \leq t(i, i') \leq 1$:
\[
t(i, i') = \frac{T(i, i')}{\sum_{i'' \in I_a} T(i, i'')}.
\tag{6.2}
\]
We address the Instance Relatedness Problem using $t(i, i')$ by identifying an
ordered list of all instances related to $i$.
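
As a rough illustration of (6.1) and (6.2), the sketch below ranks the instances related to a given instance. The function names, the toy counts, the treatment of co-occurrence counts as symmetric, and the exclusion of $i'' = i$ from the normalizing sum are assumptions made for this example only.

```python
def co_lookup(co, a, b):
    """Return co(a, b); assumes symmetric counts, missing pairs count as 0."""
    return co.get((a, b), co.get((b, a), 0))


def raw_relatedness(co, instances, i, i_prime):
    """T(i, i') as in (6.1): co(i, i') over the summed co(i'', i'), i'' != i'."""
    total = sum(co_lookup(co, x, i_prime) for x in instances if x != i_prime)
    return co_lookup(co, i, i_prime) / total if total else 0.0


def ranked_related(co, instances, i):
    """Return (instance, t(i, instance)) pairs sorted by decreasing t, as in (6.2)."""
    others = [x for x in instances if x != i]  # assumption: skip i itself
    T = {x: raw_relatedness(co, instances, i, x) for x in others}
    norm = sum(T.values())
    t = {x: (T[x] / norm if norm else 0.0) for x in others}
    return sorted(t.items(), key=lambda pair: pair[1], reverse=True)


# Made-up counts: P co-occurs often with everything, C mostly with A.
instances = ["A", "P", "C", "D"]
co = {("A", "P"): 10, ("A", "C"): 8, ("P", "C"): 2, ("P", "D"): 40}

ranking = ranked_related(co, instances, "A")
for name, score in ranking:
    print(name, round(score, 3))
```

With these counts the raw co-occurrences would put the frequently occurring P first for A, whereas dividing by the total count of each candidate moves C ahead of P, which is the effect the normalization in (6.1) is meant to achieve.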
6.2.2 Categorizing Instances
The Instance Categorization Problem handles the identification of a most
applicable $j \in I$
