Information extraction from the web using a search engine


$I_a \times I_g$ and $(i, i')$ in $I_a \times I_a$, we apply this method and keep track of
the total number of co-occurrences encountered. From the snippets returned by
the search engine, we thus identify the elements of either $I_a$ or $I_g$ to measure the
number of co-occurrences of the pairs. Hence, using PM, $co(i, j)$ for instances $i$ and $j$
is defined as follows.

Definition [PM co-occurrences]. $co_{PM}(i, j)$ gives the sum of the number of
occurrences of $i$ when querying patterns with $j$, plus the number of occurrences of
$j$ when querying patterns with $i$. □

Using PM we only need $O(m + n)$ queries to collect co-occurrences of pairs in
$I_a \times I_g$ and $I_a \times I_a$, for $I_a$ of size $n$ and $I_g$ of size $m$.
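The PM definition above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the `fetch_snippets` callable (standing in for a search engine client) and the pattern strings are all hypothetical, and occurrences are counted by plain case-insensitive substring matching, which is an assumption. Each instance is queried once per pattern, so the number of queries grows with $m + n$ rather than with the number of pairs.

```python
from collections import Counter
from typing import Callable, Iterable, List


def count_found_instances(
    query_instances: Iterable[str],
    target_instances: List[str],
    patterns: List[str],
    fetch_snippets: Callable[[str], List[str]],
) -> Counter:
    """Return counts[(queried, found)]: occurrences of `found` in the
    snippets retrieved when querying patterns filled in with `queried`."""
    counts: Counter = Counter()
    for queried in query_instances:
        for pattern in patterns:
            # Hypothetical pattern format: "{}" marks the instance slot.
            query = pattern.format(queried)
            for snippet in fetch_snippets(query):
                text = snippet.lower()
                for found in target_instances:
                    if found != queried:
                        # Assumption: simple substring counting.
                        counts[(queried, found)] += text.count(found.lower())
    return counts


def co_pm(i: str, j: str, counts: Counter) -> int:
    """Occurrences of i when querying with j, plus occurrences of j
    when querying with i (symmetric by construction)."""
    return counts[(j, i)] + counts[(i, j)]
```

Because the counts are gathered per queried instance and only combined afterwards, every pair can be scored from the same set of snippets without issuing a query per pair.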
Page-Count-based Method (PCM). As a first alternative to PM, we extract
the estimated number of hits $co(i, j)$ [Cilibrasi & Vitanyi, 2007; Knees et al.,
2004; Mika, 2007; Gligorov et al., 2007]. This method of finding co-occurrences
between instances is based on analyzing the total number of occurrences
of pairs of instances on the web. We identify the co-occurrences $co(i, j)$ as follows.

Definition [PCM co-occurrences]. $co_{PCM}(i, j)$ gives the number of hits for the
search engine query "i", "j". □

We assume that the order of the terms $i$ and $j$ in the query does not affect the
number of hits; thus we assume $co(i, j) = co(j, i)$.
This Page-Count-based Method (PCM) is simple and intuitive. If, for example,
we are interested in categorizing music artists into genres, we analyze the number
of hits for queries that combine the name of the artist with each genre.
Assuming Johnny Cash to be a country artist, we expect more documents to contain
both the terms Country and Johnny Cash than Reggae and Johnny Cash. An
important drawback of PCM is its high Google complexity. For large sets this can
be problematic [Cafarella, Downey, Soderland, & Etzioni, 2005]. Moreover, the
number of hits can fluctuate over time [Véronis, 2006], which hampers the reuse
of old hit counts.
Using PCM we thus need to perform $m \cdot n$ queries to collect the co-occurrences
between tags and instances and $\frac{1}{2}(n^2 - n)$ queries to gather all pairs of
co-occurrences between the instances in $I_a$. Hence, the Google Complexity of PCM is
$O(mn + n^2)$. When we assume that the size of $I_g$ does not exceed $n$, the Google
Complexity of PCM is $O(n^2)$.
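To make the query count concrete, the following Python sketch enumerates the PCM queries for a set of instances and a set of tags. The function name `pcm_queries` is hypothetical; the query string follows the `"i", "j"` form of the PCM definition, and each unordered instance pair is queried once, relying on the assumption that $co(i, j) = co(j, i)$.

```python
from itertools import combinations, product
from typing import List


def pcm_queries(instances: List[str], tags: List[str]) -> List[str]:
    """Enumerate one PCM query per (instance, tag) pair and per
    unordered pair of distinct instances."""
    pairs = list(product(instances, tags))       # m * n pairs
    pairs += list(combinations(instances, 2))    # (n^2 - n) / 2 pairs
    return [f'"{i}", "{j}"' for i, j in pairs]


def pcm_query_count(n: int, m: int) -> int:
    """Closed form of the number of queries listed by pcm_queries."""
    return m * n + (n * n - n) // 2
```

For $n = 1000$ instances and $m = 10$ tags this gives $10{,}000 + 499{,}500 = 509{,}500$ queries, dominated by the quadratic term.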
Document-based Method (DM). In the Document-based Method (DM) approach
we collect the first $k$ URLs of the documents returned by the search engine
for a given query, constructed using a known instance. These $k$ URLs are the most
relevant for the query submitted, based on the ranking used by the search engine
[Brin & Page, 1998]. The corresponding documents are subsequently scanned for
occurrences of instances of the related class [De Boer et al., 2007].
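The retrieval and scanning step described above might be sketched in Python as follows. This is only an illustration under stated assumptions: `search_urls` and `fetch_text` are hypothetical callables standing in for a search engine client and a page downloader, and the sketch counts, per related instance, the number of retrieved documents that mention it at least once, which is one possible way to score occurrences and not necessarily the one used in the original work.

```python
from collections import Counter
from typing import Callable, List


def scan_top_documents(
    instance: str,
    related_instances: List[str],
    k: int,
    search_urls: Callable[[str], List[str]],
    fetch_text: Callable[[str], str],
) -> Counter:
    """Count, for each related instance, how many of the first k
    documents returned for `instance` mention it."""
    found: Counter = Counter()
    # Keep only the k highest-ranked URLs for the query.
    for url in search_urls(instance)[:k]:
        text = fetch_text(url).lower()
        for other in related_instances:
            # Assumption: case-insensitive substring match.
            if other.lower() in text:
                found[other] += 1
    return found
```

One query per known instance suffices here; the cost shifts from the number of queries to the number of documents that have to be downloaded and scanned.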
In the first phase of the algorithm, we query all instances in both I
