Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	10/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 6 7 8 9 10 11 12 13 ... 57

P(r) of patterns expressing r and a non-empty set of instances for either the ob-
ject or the subject class. Using a known instance and a pattern, we can formulate
queries that potentially lead to relevant texts.
Using an ontology O that meets the requirements, we populate O using the
following approach. We iteratively select a relation r in R (e.g.
born in) and a
pattern S corresponding to this relation (e.g. ‘
was born in’). We then select a class,
i.e. either the subject or the object class for r, and take a known instance from
this class (e.g.
Alan Turing from the subject class Person). The selected instance
and pattern are then combined into a search engine query (
Alan Turing was born
in). Subsequently, we extract instances of the unqueried class from the search
results. This procedure is continued until no unqueried instance-pattern pairs exist.
New patterns can be learned by analyzing texts containing newly extracted relation
instances. Using newly learned patterns, the ontology population procedure can be

2.2 Extraction Information from the Web using Patterns
29
do ¬ stop criterion →
do ∃
r,c
j
∃
i∈I
j
, S∈P(r)
“i – S combination unqueried” →
combine pattern S and instance i into query ;
collect snippets or documents from the search results ;
extract instances of the related class c
k
from search results ;
store the extracted instances i
0
in class c
k
;
store the extracted relation instances (i, i
0
) ∈ I
j
× I
k
in relation r;
od
find new patterns for the relations in R ;
od
Table 2.2. Sketch of the ontology population algorithm.
repeated. In Table 2.2 we give an overview of the approach in pseudo-code.
When initially no patterns are provided, the algorithm can be used to identify
patterns. However, in that case, non-empty sets of relation instances are required.
As a stop criterion we simply use a fixed number of iterations. The extraction
of the instances in the texts as well as the identification of patterns can be studied
in isolation and are the topics of the next chapter.
Google complexity. As extracted instances are used as queries, one can easily
observe that the Google complexity of the approach cannot be expressed in terms
of the size of the input, the initial ontology O. However, the Google complexity
can be expressed in terms of the size of the populated ontology O
0
.
After the termination of the algorithm, each instance in the output ontology
has been queried with every matching pattern. Suppose we have pat(r
a
) patterns
for relation r
a
, then the total number of queries N
q
in the population phase can be
expressed as follows.
N
q
=
∑
r
a
∈R
pat(r
a
) · (|I
s
| + |I
o
|)
(2.3)
Hence, assuming that a constant number of queries is used, the Google com-
plexity is linear in the sum of the sizes of the sets of instances in the populated
ontology.
Bootstrapping. It is notable that the algorithm features multiple bootstrapping
mechanisms. For an ontology with incomplete classes, the following bootstrapping

30
steps apply.

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 6 7 8 9 10 11 12 13 ... 57