Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	14/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 10 11 12 13 14 15 16 17 ... 57

s
, we want to have many search results containing instances
from c
o
.
3. If relation r is not functional, the pattern should be broad, i.e. among the
search results when querying a combination of the pattern and an instance in
I
s
there must be as many distinct r-related instances from c
o
as possible.
Note that these criteria are language independent. To measure the three criteria,
we use the validation set to combine the patterns found with instances i in the
validation set V . Hereto we define the following variables.
1. The frequency of a pattern S, ffreq(S), the number of occurrences of S
found in the identification phase.
2. The precision fprec(S) of a pattern S, is given by
fprec(S) =
∑
i∈V
P(S, i)
|V |
,
for instances i ∈ V , where P(S, i) is defined as
P(S, i) =
F
I
(S, i)
F
O
(S, i)
where

3.1 Identifying Effective Patterns
37
F
I
(S, i) is the number of snippets
containing instances of c
o
when querying S with i
and
F
O
(s, x)is the total number of snippets found
when querying S with i.
3. The broadness of a pattern S, fspr(S) where
fspr(S) =
∑
i∈V
B(S, x),
with
B(s, x) =
the number of distinct instances of class c
o
found
when querying S with i.
The larger we choose the validation set V , the more reliable the measures for
precision and broadness.
We finally calculate the score of the patterns. For the non-functional relations,
we do so by multiplying the individual scores:
score(S) = ffreq(S)· fprec(S)· fspr(S)
(3.1)
For the functional relations, the aim is to obtain results with low scores for
fspr(S). For these cases we propose the following function to combine the three
parameters.
score(S) = ffreq(S)· fprec(S) ·
1
fspr(S)
(3.2)
For efficiency reasons, we only compute the scores of the patterns with the
highest frequencies. We apply this heuristic, as we are in practice likely to obtain
thousands of patterns in the identification phase. For the case of the computation of
the scores for patterns expressing functional relations, ffreq(S) is the upperbound
for score(S). Hence, if it holds that ffreq(S
0
) is smaller than score(S) for patterns
S and S
0
, then score(S
0
) is also smaller than score(S). Therefore not all patterns
need to be evaluated to find the most effective ones. For the case of the non-
functional relations, the maximum value for fspr(S) can be estimated to find an
upperbound for score(S).
We express the Google complexity of the effective pattern identification algo-
rithm in terms of the sizes of the identification and validation sets T and V . The

38
...
[placeholder] [pattern] [known instance]
|
{z
}
query
...
|
{z
}
search result
Figure 3.1. Query, pattern, search result and placeholder.
identification phase requires 2|T | queries, as each pair in T is queried twice. Each
pattern evaluated in the second phase requires |V | queries to determine the score.
Hence, assuming that the m most frequently identified patterns are evaluated, the
total Google complexity of the algorithm is 2|T | + m|V |.
3.2 Identifying Instances
As discussed in the first chapters of this thesis, the ontology population task differs
from the general information extraction task in a number of aspects. These differ-
ences also have their consequences in choosing a strategy in identifying instances
from texts.
As we use the web as a corpus, we can assume that the instances are redun-
dantly available. For our task it is not necessary to recognize each of the encoun-
tered occurrences [McDowell & Cafarella, 2006]. In recognizing instances, the
focus should be on the precision. If we extract erroneous instances and use them
in newly constructed queries, this will potentially lead to the extraction of more
erroneous instances. Hence, we opt for a strategy with high precision, while the
bootstrapping mechanisms should lead to a high recall for the ontology considered.
As we opt for a pattern-based approach, we know the context of the potential
instance (i.e. the queried expression) and its placeholder (either preceding or fol-
lowing the search query). We define the maximal distance to the query in terms of
the number of words. Figure 3.1 and Tables 3.4 and 3.5 illustrate this task. Within
the search results in Table 3.4, the task is to identify professions like
scientist and
chemist at a placeholder following the queried expressing. In Table 3.5 a challenge
is to recognize
Rachel Carson as a Person, contrary to Project Manager.
Now the problem is to identify instances at the placeholder.
The Instance Identification Problem. Given is an initial ontology with relation r
on the classes c

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 10 11 12 13 14 15 16 17 ... 57