• Large number of queries. This approach leads to $|I_s| \cdot |I_o|$ queries and
therefore, in general, does not have a Google complexity that is linear in the
total number of instances.
• Not generally applicable. As such an approach assumes the classes to be
complete, it cannot be used to solve the general ontology population problem
for incomplete classes.
• No solution for relation identification. The co-occurrence of two instances
in a document does not necessarily reflect the intended relation. Hence,
either the query needs to be more specific [Cimiano & Staab, 2004] or the
documents need to be processed [Knees et al., 2004].
As an alternative, we formulate queries containing one known instance. Such
an approach would lead to a Google complexity linear in the number of instances in
$O'$, if we formulate a constant number of queries per instance. Having formulated
a query containing an instance, the texts retrieved by the search engine are to be
processed to recognize an instance of the other class and evidence for the relation
between the two.
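
The difference in query counts between the two strategies can be illustrated with a minimal Python sketch. The instance lists, the query patterns, and the function names below are hypothetical and chosen only for illustration; they are not taken from the text. No search engine is contacted; the sketch only builds and counts the query strings.

```python
from itertools import product


def pairwise_queries(subjects, objects):
    """One query per (subject, object) pair: |I_s| * |I_o| queries in total."""
    return [f'"{s}" "{o}"' for s, o in product(subjects, objects)]


def single_instance_queries(known_instances, patterns):
    """A constant number of queries (one per pattern) for each known instance."""
    return [p.format(instance=i) for i in known_instances for p in patterns]


# Hypothetical toy data (assumption, not from the source).
subjects = ["Amsterdam", "Paris", "Berlin"]
objects = ["Netherlands", "France", "Germany", "Belgium"]

# Hypothetical query patterns; each contains exactly one known instance.
patterns = ['"{instance} is the capital of"', '"{instance}, the capital of"']

pairwise = pairwise_queries(subjects, objects)
single = single_instance_queries(subjects, patterns)

print(len(pairwise))  # 3 * 4 = 12 queries
print(len(single))    # 3 * 2 = 6 queries
```

With a fixed set of patterns, the second count grows linearly with the number of known instances, whereas the first grows with the product of the two class sizes.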
A very simple language model. The web as a corpus – and especially the collection
of snippets returned by a search engine – is multilingual and contains typos,
broken sentences, slang, jokes, and other irregularities. As no representative
annotations or reliable tools are available for such data, we opt for a very
simple language model to identify instances and their relations.
We focus on sentences where the instances of the subject and object class are
related by a small text fragment. We ignore the rest of the context. Given a relation