Information extraction from the web using a search engine

In this thesis, we will analyze the Google complexity in terms of the required
number of queries for the populated ontology O′.
To restrict the Google complexity, we need accurate queries for which we
can expect the search engine to return highly relevant information. The actual
requirements depend on the application of the data. If the collection of information
is a one-time effort, a run time of a couple of days would be acceptable.
However, for real-time or semi-real-time applications, a more efficient approach is
required.
Design Constraint. The Google complexity of the approach chosen to populate
an ontology should be such that the algorithm terminates within days. □
In this chapter, we present a method with a Google complexity that is linear
in the size of the output. In Chapter 5, we focus on two applications of web
information extraction, with a constant Google complexity.
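As a rough illustration of the design constraint above, the following Python
sketch estimates the run time of a method whose Google complexity is linear in
the size of the output. The function names, the number of queries per instance
and the daily query limit are assumptions made for this example only; they are
not figures taken from this thesis or from any search engine.

```python
import math


def estimated_queries(num_instances: int, queries_per_instance: int) -> int:
    """Number of queries when the cost grows linearly with the output size.

    Both parameters are hypothetical inputs chosen by the reader.
    """
    if num_instances < 0 or queries_per_instance < 0:
        raise ValueError("counts must be non-negative")
    return num_instances * queries_per_instance


def estimated_days(total_queries: int, queries_per_day: int) -> int:
    """Whole days needed under an assumed daily query limit."""
    if queries_per_day <= 0:
        raise ValueError("queries_per_day must be positive")
    return math.ceil(total_queries / queries_per_day)


if __name__ == "__main__":
    # Hypothetical numbers: 5000 instances, 3 queries each, 1000 queries a day.
    total = estimated_queries(5000, 3)
    print(total, estimated_days(total, 1000))
```

Under these assumed numbers the estimate is 15,000 queries, or 15 days. Whether
such a run time is acceptable depends on the application, as discussed above.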
Limitations on Text Processing
Having retrieved a potentially relevant document from the web, the task is to
identify relevant instances and their relations. Traditionally, approaches in
information extraction (and natural language processing in general) can be split
into data-oriented and knowledge-oriented ones.
In a data-oriented information extraction approach, instances and relations are
typically recognized using an annotated training set. In a representative text corpus,
relevant information such as part-of-speech tags, dependency parse trees and noun
phrases is marked. These annotated texts are used to train a machine learning
algorithm to classify a text without annotations, the test set. The underlying
assumption is that instances of the same class appear in a similar context, are
morphologically similar, or have the same role in the sentence.
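To make the assumption about similar contexts concrete, the sketch below trains
a deliberately simple context-word voting classifier on a toy annotated set. It
is an illustration only and not the method of any of the systems cited in this
chapter; the function names and the training examples are hypothetical.

```python
from collections import Counter, defaultdict


def train(annotated):
    """Count context words per class.

    `annotated` is a list of (context_tokens, label) pairs, i.e. a toy
    stand-in for a manually annotated training set.
    """
    counts = defaultdict(Counter)
    for tokens, label in annotated:
        counts[label].update(token.lower() for token in tokens)
    return counts


def classify(counts, tokens):
    """Return the class whose training contexts share most words with `tokens`."""
    if not counts:
        raise ValueError("the model has not been trained")
    scores = {
        label: sum(context[token.lower()] for token in tokens)
        for label, context in counts.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical annotated contexts for two classes.
    model = train([
        (["the", "film", "directed", "by"], "Movie"),
        (["new", "album", "released", "by"], "Musical Artist"),
    ])
    print(classify(model, ["her", "latest", "album"]))
```

The sketch only exploits the context of an instance. Morphological similarity
and the role in the sentence, the other two cues mentioned above, are left out.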
In a knowledge-oriented approach, on the other hand, we create a model to
recognize instances and relations in texts. We hence use our own knowledge of
language to create recognition rules. For example, we could state that two
capitalized words preceded by mr. indicate the name of a male person.
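The example rule can be written down as a regular expression. The following
Python sketch is one possible encoding, under the assumptions that a capitalized
word is an ASCII capital followed by lowercase letters and that both mr. and Mr.
are accepted; the names `MR_RULE` and `find_male_names` are introduced here for
illustration.

```python
import re

# "mr." or "Mr." followed by two capitalized words (ASCII letters only,
# which is a simplification made for this sketch).
MR_RULE = re.compile(r"\b[Mm]r\.\s+([A-Z][a-z]+\s+[A-Z][a-z]+)")


def find_male_names(text: str) -> list:
    """Return all two-word names in `text` that are preceded by mr."""
    return MR_RULE.findall(text)


if __name__ == "__main__":
    sample = "Yesterday mr. John Smith met Mr. Peter Jones and mrs. Ann Lee."
    print(find_male_names(sample))
```

On the sample sentence the rule yields John Smith and Peter Jones, while the
name after mrs. is not matched. Names with more than two words, initials or
non-ASCII characters fall outside this simple rule.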
Whether we use a data-oriented or a knowledge-oriented approach to populate an
ontology, the approach is domain dependent. The annotations or rules that are used
to recognize some class c_j (e.g. Movie, Musical Artist) cannot be used to recognize
instances of some other class (e.g. Person). An additional problem for a
data-oriented approach is the lack of available annotations.
Supervised data-oriented approaches in natural language processing make use
of a representative training corpus. The text in this training corpus is annotated for
the specific NLP task, for example part-of-speech tagging [Brill, 1992] or the
identification of dependencies within a sentence [Lin, 1998; Marneffe, MacCartney, &
Manning, 2006]. Such low level features are commonly used in information
extraction methods [Collins, 2002; Etzioni et al., 2005]. The common annotations for
information extraction in the available corpora focus on standard, restricted named
entity recognition tasks, such as the recognition of person names, companies and –
in the biomedical domain – protein names. The more regular a corpus is, the better
a system performs on a given NLP task [McCallum, 2005].
The web texts found with a search engine, and especially the snippets, are
irregular, as they are multilingual and contain typos and broken sentences. Due to
the irregularity of the texts and the lack of representative training data, it is
therefore not likely that low level features like parts-of-speech can be identified
reliably. An additional problem is that annotated training data is not available for
all the class instantiation tasks we are interested in.
Given these considerations, we choose not to make use of manually annotated
training data or of off-the-shelf systems that are trained on such data. Hence, to
opt for a generic approach in ontology population, we formulate the following
constraint.
Design Constraint. To facilitate a generic approach, we do not make use
of manually annotated training data. □
In Chapter 4 we return to this topic, where we evaluate the use of an
off-the-shelf supervised named entity recognizer to identify person names in snippets.
In the next chapter, taking this design constraint into account, we discuss
options in rule-based and unsupervised machine learning approaches in ontology
population.
2.2.2 Sketch of the Approach
In this section, we present a global approach to populate an initial ontology O.
As discussed earlier in this chapter, we are confronted with the design constraint
that the availability of the search engines is limited. This forces us to formulate
precise queries, in order to obtain highly relevant search results.
Now, if an ontology with complete classes is given, the task is only to populate
the relations, i.e. to find relation instances. In other words, we have to find and
recognize natural language formulations of the subject – relation – object triplets.
If we are interested in the class instantiation problem, the tasks are quite similar.
For a class named n, the task is to find terms t for which the triplet 't is a n'
is expressed.
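A minimal sketch of recognizing the 't is a n' formulation in retrieved texts is
given below. It assumes, for illustration only, that a term t is a sequence of
capitalized words directly preceding "is a" or "is an"; the function names and
the sample snippets are hypothetical and not part of the method described in
this thesis.

```python
import re


def is_a_pattern(class_name: str) -> "re.Pattern":
    """Build a pattern for '<Term> is a/an <class_name>'.

    Assumption of this sketch: a term is one or more capitalized words.
    """
    term = r"((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
    return re.compile(
        r"\b" + term + r"\s+is\s+(?:a|an)\s+" + re.escape(class_name) + r"\b"
    )


def extract_instances(snippets, class_name: str) -> set:
    """Collect all terms t for which 't is a n' occurs in the snippets."""
    pattern = is_a_pattern(class_name)
    found = set()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            found.add(match.group(1))
    return found


if __name__ == "__main__":
    # Hypothetical snippets, as they might be returned by a search engine.
    snippets = [
        "Amsterdam is a city in the Netherlands.",
        "I think New York is a city too.",
        "this is a city",
    ]
    print(sorted(extract_instances(snippets, "city")))
```

The capitalization heuristic is crude: a sentence-initial pronoun, as in
"It is a city", would also be extracted as a term and would have to be filtered
in a later step.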
Hence, the class instantiation problem can easily be rewritten into a relation
instantiation problem for incomplete classes. Suppose we are handed the following
class instantiation problem: O = ({c
