Information extraction from the web using a search engine

□
We consider relation instance $j$ to be an inhabitant of a relation named $n$ if the statement “$i_s$ $n$ $i_o$” (e.g. Eindhoven is located in the Netherlands) is true.
Figure 2.1 visualizes an example ontology. Relations are considered between instances in the central class person and instances in all the other classes.
2.1.3 The Ontology Population Problem
As stated in the introductory chapter, we restrict ourselves to using natural language texts on the web. Before we focus on the actual process of extracting information from such texts, we first need to find potentially relevant texts. For information extraction tasks over a large collection of documents, a document retrieval system is necessary to identify relevant texts.
Figure 2.1. An example ontology on historical persons. [Figure not reproduced; classes: Person, Nationality, Profession, Gender, Fame, Period; relation labels: has, related_with, lived.]
As the web is a collection of billions of text documents, there is a need to select potentially relevant documents or document fragments. As we consider document retrieval a separate concern, we chose to use an off-the-shelf search engine.
Using a search engine, we hence need to formulate queries that result in relevant
documents. Having retrieved a relevant document, we can focus on the extraction
of information, i.e. populating the initial ontology. We consider the following two
subproblems in ontology population from texts on the web using a search engine.
The Class Instantiation Problem. Given an initial ontology $O$ with class $c_j$, identify instances of $c_j$ using texts found with a web search engine. □
The Relation Instantiation Problem. Given an initial ontology $O$ with relation $r = (n, c_s, c_o, \phi, J)$, find relation instances $(i, i') \in I_s \times I_o$. □
These two subproblems in information extraction are combined in the ontology population problem.
The Ontology Population Problem (OP). Given an initial ontology $O$, instantiate the classes and relations by extracting information from texts on the web found with a search engine. □
Given an initial ontology $O$, we use $O'$ to refer to the populated ontology.
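To make the problem statements above more concrete, the following minimal Python sketch models an ontology as classes with instance sets and relations with instance pairs. It is an illustration under stated assumptions, not the representation used in this work: the names `OntologyClass`, `Relation` and `add_instance` are hypothetical, the classes City and Country are chosen only to fit the Eindhoven example, $J$ is assumed to denote the set of relation instances, and $\phi$ is left out because it is not defined in this section.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class c_j with its set of instances (hypothetical representation)."""
    name: str
    instances: set = field(default_factory=set)


@dataclass
class Relation:
    """A relation with name n, subject class c_s and object class c_o."""
    name: str
    subject_class: OntologyClass
    object_class: OntologyClass
    # Assumed to correspond to J, the set of relation instances.
    instances: set = field(default_factory=set)

    def add_instance(self, subject: str, obj: str) -> None:
        # A relation instance must be a pair from I_s x I_o.
        if subject not in self.subject_class.instances:
            raise ValueError(f"{subject!r} is not an instance of {self.subject_class.name}")
        if obj not in self.object_class.instances:
            raise ValueError(f"{obj!r} is not an instance of {self.object_class.name}")
        self.instances.add((subject, obj))


# Initial ontology O: classes and a relation, all still without instances.
city = OntologyClass("City")
country = OntologyClass("Country")
located_in = Relation("is located in", city, country)

# Populating O (yielding O'): class instantiation first, then relation instantiation.
city.instances.add("Eindhoven")
country.instances.add("the Netherlands")
located_in.add_instance("Eindhoven", "the Netherlands")
```

In this sketch the populated ontology $O'$ is simply the same structure after instances have been added; rejecting pairs outside $I_s \times I_o$ mirrors the constraint in the Relation Instantiation Problem.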
Popular search engines currently only give access to a limited list of possibly interesting web pages. A user can get an idea of the relevance of the pages presented by analyzing the title and the snippet shown for each page. When a user has sent an accurate query to the search engine, the actual information required by the user may already be contained in the snippet.
If these snippets and titles are sufficiently usable for web information extraction purposes, the documents themselves do not have to be downloaded and processed. We therefore formulate the following alternative problem description.
The Snippet Ontology Population Problem (SOP). Given an initial ontology $O$, instantiate the classes and relations by extracting information from search engine snippets. □
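As a rough illustration of what extracting relation instances from snippets could look like, the sketch below searches a list of snippets for the statement form “$i_s$ $n$ $i_o$” introduced earlier. It is a simplified sketch, not the extraction method of this work: the function name `extract_relation_instances` is hypothetical, the snippets are hand-written placeholders rather than real search engine output, and the instances of both classes are assumed to be known already.

```python
import re


def extract_relation_instances(snippets, relation_name, subjects, objects):
    """Return pairs (i_s, i_o) for which some snippet contains 'i_s n i_o'."""
    found = set()
    for subj in subjects:
        for obj in objects:
            # Literal match of "<subject> <relation name> <object>", case-insensitive.
            pattern = re.compile(
                r"\b" + re.escape(subj) + r"\s+" + re.escape(relation_name)
                + r"\s+" + re.escape(obj) + r"\b",
                re.IGNORECASE,
            )
            if any(pattern.search(snippet) for snippet in snippets):
                found.add((subj, obj))
    return found


# Hand-written placeholder snippets (not actual search results).
snippets = [
    "Eindhoven is located in the Netherlands, in the south of the country.",
    "Paris is the capital of France.",
]

pairs = extract_relation_instances(
    snippets,
    relation_name="is located in",
    subjects={"Eindhoven", "Paris"},
    objects={"the Netherlands", "France"},
)
print(pairs)
```

Only the first snippet contains the full statement, so the second one contributes no relation instance even though it mentions a known subject and object.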
2.1.4 Evaluating a Populated Ontology
Having populated an ontology, we want to obtain insight into the quality of the extracted information in terms of soundness and completeness. That is, the extracted information needs on the one hand to be correct and on the other hand to be as complete as possible.
To this end, we use the standard measures precision and recall. To measure precision and recall, we assume a ground truth ontology $O_{ref}$ to be given.
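For reference, the standard set-based definitions of the two measures can be sketched as follows: precision is the fraction of extracted items that also occur in the ground truth, and recall is the fraction of ground truth items that were extracted. The function name `precision_recall`, the example sets, and the convention of returning 0.0 for an empty set are assumptions made for this illustration only.

```python
def precision_recall(extracted, reference):
    """Compute (precision, recall) of an extracted set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    correct = extracted & reference
    # Assumed convention: a measure is 0.0 when its denominator is empty.
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Placeholder instance sets; "a" and "b" are the correctly extracted items.
extracted = {"a", "b", "c"}
reference = {"a", "b", "d", "e"}
precision, recall = precision_recall(extracted, reference)
print(precision, recall)
```

Here two of the three extracted items are correct, and two of the four reference items are found.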
For the set O
