3.2 Identifying Instances
suffix tree can be built to efficiently represent the instances and match them with
the texts found [Gusfield, 1997].
To match the instances with the search results, we can simply opt for an exact
string matching approach. As an alternative, we can allow small variations
based on the edit distance (to compensate for typos) or ignore case
differences.
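As an illustration of these three matching options, the following minimal Python sketch may help. It is not the implementation used in this work: the function names edit_distance and find_instance, the parameters max_dist and ignore_case, and the choice to compare the instance only against text windows of the same length are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def find_instance(instance: str, text: str,
                  max_dist: int = 0, ignore_case: bool = False) -> list:
    """Return the start offsets of all windows of len(instance) characters
    in text that lie within max_dist edits of instance.

    max_dist=0 gives exact matching. Fixed-length windows are a
    simplification (an assumption of this sketch); they handle substituted
    characters well but are only approximate for inserted or dropped ones.
    """
    if ignore_case:
        instance, text = instance.lower(), text.lower()
    n = len(instance)
    hits = []
    for start in range(len(text) - n + 1):
        if edit_distance(instance, text[start:start + n]) <= max_dist:
            hits.append(start)
    return hits
```

With max_dist=0 and ignore_case=False the function reduces to exact matching. Setting ignore_case=True lets the instance Boston match the occurrence boston, and max_dist=1 lets Brahms match the misspelling Brahns. Larger values of max_dist quickly introduce false matches for short instance names, so a small threshold seems advisable.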
Compensating for Ambiguous Terms. Homonyms are a common phenomenon
in natural language, and are one of the factors that complicate natural language
processing. Homonyms can have meanings that are quite distant (for example, the
term Boston may refer to the city and the pop band) or more closely related (e.g.
Theo van Gogh is the name of two different persons, Groningen is both the name
of a Dutch province and its capital city, and Boston is both a band and the title
of their debut album).
Hence, when encountering an occurrence of one of the instances in a text, we
are not guaranteed that the term indeed refers to the intended instance. Ideally, for
each occurrence of an instance in a text we want to observe whether the occurrence
indeed reflects the intended instance. However, the automatic parsing of texts is
troublesome. Moreover, even if an instance is identified as a subject or object within
a sentence, we still do not know whether the term indeed reflects the instance.
For terms with only one meaning, we can be more confident that occurrences
of these terms indeed refer to the intended instance. For terms with numerous
definitions, however, this relation can be much less certain. To this end, we propose
a mechanism to estimate the likeliness that a term indeed refers to the instance.
We use the define functionality in Google to obtain the number of senses of a
term. For example, by querying define: Tool, we obtained a list of 31 definitions
for the term Tool, collected from various online dictionaries and encyclopedias.
This indicates that Tool is an ambiguous term. By contrast, terms such as
Daft Punk, Fatboy Slim and Johannes Brahms lead to precisely one definition. We
define n(i) to be the number of definitions for i that are returned by Google. If no
definitions are returned, we consider n(i) to be 1.
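The definition of n(i), including the fallback to 1, can be written down as a short Python sketch. The function name num_senses is hypothetical, and the sketch assumes that the definitions returned for a term have already been retrieved and are passed in as a list of strings; how they are retrieved is deliberately left out.

```python
def num_senses(definitions: list) -> int:
    """n(i): the number of definitions found for instance i.

    `definitions` is assumed to hold the definition strings returned for
    the term. An empty list means no definitions were returned, in which
    case n(i) is taken to be 1.
    """
    return max(1, len(definitions))
```

Under this rule, a term for which no definition is found is treated the same as a term with exactly one definition, so both are regarded as unambiguous.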
Tables 3.6 and 3.7 give examples of the definitions for an ambiguous and
an unambiguous instance of the class Artist.
Based on the number of definitions n(i) returned by Google, we formulate an
estimate p(i) for the likeliness that i