Information extraction from the web using a search engine


... c_s and c_o. Using a query combining an instance in I_s and a pattern
expressing r, we obtain query results Q. Given Q, identify instances of class c_o at
the placeholder.
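
As an illustration of how a query can combine an instance with a pattern, the
following minimal Python sketch fills one placeholder of a pattern and leaves the
other one open. The function name `instantiate` and the bracket syntax handling are
assumptions made for this example; the text does not prescribe an implementation.

```python
import re

# A placeholder is a class name in square brackets, e.g. [Person].
PLACEHOLDER = re.compile(r"\[(\w+)\]")


def instantiate(pattern: str, class_name: str, instance: str) -> str:
    """Fill the placeholder of the queried class; drop the open placeholder."""
    def fill(match):
        # Keep the instance for the queried class, remove the other placeholder.
        return instance if match.group(1) == class_name else ""

    # Collapse the whitespace left behind by the removed placeholder.
    return " ".join(PLACEHOLDER.sub(fill, pattern).split())


print(instantiate("[Person] was a [Profession]", "Person", "Marie Curie"))
print(instantiate("[Person] was a [Profession]", "Profession", "scientist"))
```

The first call yields the phrase `Marie Curie was a`, the second `was a scientist`;
the removed placeholder marks the side of the phrase where instances of the other
class are expected.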
3.2 Identifying Instances

Marie Curie was a world-renowned scientist who made many important discoveries,
Marie Curie was a dedicated humanitarian, eager to
Marie Curie was a two-time Nobel Prize winner and one of the first women ever to
Marie Curie was a lone genius who found new
Marie Curie was a famous scientist as you think. She was born in Poland 1867.
Marie Curie was a brilliant scientist who received two Nobel Prizes.
Marie Curie was a world-renowned scientist who made many important
Marie Curie was a physicist and chemist of Polish upbringing and, subsequently,
Marie Curie was a Polish-born physicist and chemist and one of the most famous scientists of her time.
Marie Curie was a Polish physicist and chemist who lived between 1867-1934.
Marie Curie was a Polish chemist and pioneer in the early field of radiology and
Marie Curie was a Polish-born physicist and chemist and one of the most famous
Marie Curie was a real hero as she
Table 3.4. Example search results for the pattern [Person] was a [Profession],
instantiated with Marie Curie.

A class in the given initial ontology may be either complete or incomplete. In the
first case, it may seem trivial to recognize instances in texts, as we can match the
text at the placeholder with the set of instances. However, as the terms representing
the instances can be ambiguous (e.g. Live, Madonna), we present an approach to
compensate for the ambiguity. For the task with incomplete classes, we focus on
both knowledge-oriented and data-oriented approaches. The design constraint that
no representative manually annotated texts are available hampers the latter.
3.2.1 Instance Identification for Complete Classes
Given is an ontology O with a complete class c_a, for example Painter. Note that
for complete classes all relevant instances are included in the ontology. Hence, if a
class is labeled as complete, all relevant instances for the given setting are
available.

What were some of Sir Isaac Newton’s other jobs before he was a scientist
scientist, while Benjamin Franklin was a scientist
if Einstein was a scientist
Project Manager was a scientist
Poste on 2008-03-13 09:59 by Mike. Fucked Up. I met this new woman. Apparently she was a scientist
Charles W. Buggs (1906-1991) Charles W. Buggs was a scientist
describe an occupation: "My father was a scientist
therefore the proposition that Mahatma Gandhi was a scientist
Kurt Godel was a scientist
gathered to celebrate the centennial of Rachel Carson. She was a scientist
Rachel Carson was a scientist
Niels Bohr was a scientist
Werner Heisenberg was a scientist
Orange Research and Education, and was a scientist
his flaws and excesses (well depicted in the movie), Kinsey was a scientist
Table 3.5. Example search results for the pattern [Person] was a [Profession],
instantiated with scientist.
We defined the placeholder for instances of the unqueried class in terms of the
maximum number of words between the queried expression and the instance to be
identified, given that these words are within the same sentence. We allow distances
larger than 0 to compensate for variations in adjectives, adverbs and the like. For
example, for the pattern [hyponym] is a [hypernym], the words directly preceding
the phrase "is a" are typically used to specify the hyponym (Chapter 5.1).
For simplicity, we use the full stop marker to detect sentence boundaries.
Recently, Kiss and Strunk [2006] have proposed a more elaborate method to
distinguish sentence boundaries from abbreviations in a multilingual corpus using an
unsupervised approach.
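
The placeholder definition above can be sketched as follows. This is a minimal
example under stated assumptions, not the implementation used in this work: the
function name `placeholder_words`, its parameters, and the `side` argument are
hypothetical, and sentences are split on the full stop only.

```python
from typing import List


def placeholder_words(snippet: str, phrase: str, max_words: int, side: str) -> List[str]:
    """Return up to max_words words adjacent to phrase, within the same sentence."""
    start = snippet.lower().find(phrase.lower())
    if start < 0:
        return []  # the queried phrase does not occur in this snippet
    if side == "left":
        # Keep only the text after the last full stop before the phrase.
        context = snippet[:start].rsplit(".", 1)[-1]
        return context.split()[-max_words:] if max_words > 0 else []
    # Right side: keep only the text up to the next full stop after the phrase.
    context = snippet[start + len(phrase):].split(".", 1)[0]
    return context.split()[:max_words]


print(placeholder_words(
    "gathered to celebrate the centennial of Rachel Carson. She was a scientist",
    "was a scientist", 2, "left"))   # ['She']
print(placeholder_words(
    "Marie Curie was a brilliant scientist who received two Nobel Prizes.",
    "Marie Curie was a", 2, "right"))  # ['brilliant', 'scientist']
```

Splitting on the full stop alone also cuts at abbreviations: for the snippet
"Charles W. Buggs was a scientist" the left context is reduced to "Buggs", which is
the kind of error a more elaborate boundary detector is meant to avoid.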
Having defined the placeholder for instances of class c_a, we scan the search
results for occurrences of the instances in the class. For larger sets of instances, a
suffix tree can be built to efficiently represent the instances and match them with
the texts found [Gusfield, 1997].
To match the instances with the search results, we can simply opt for an exact
string matching approach. As an alternative, we can allow small variations
based on the edit distance (to compensate for encountered typos) or ignore case
differences.
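
The matching options above can be sketched as follows. This is an illustrative
example only: it uses a plain linear scan over the known instances rather than a
suffix tree, and the names `edit_distance`, `match_instance`, `max_edits` and
`ignore_case` are assumptions introduced for this sketch.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def match_instance(words: List[str], instances: List[str],
                   max_edits: int = 0, ignore_case: bool = True) -> Optional[str]:
    """Return the first known instance found among contiguous word spans."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    for instance in instances:
        n = len(instance.split())
        # Compare every span of n consecutive words with the instance.
        for k in range(len(words) - n + 1):
            candidate = " ".join(words[k:k + n])
            if edit_distance(norm(candidate), norm(instance)) <= max_edits:
                return instance
    return None


known = ["Marie Curie", "Niels Bohr"]
print(match_instance(["niels", "bohr"], known))               # Niels Bohr
print(match_instance(["Neils", "Bohr"], known))               # None
print(match_instance(["Neils", "Bohr"], known, max_edits=2))  # Niels Bohr
```

With `max_edits=0` and `ignore_case=False` the function reduces to exact string
matching; a larger `max_edits` tolerates more typos at the cost of more false
matches between similar names.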
Compensating for Ambiguous Terms. Homonyms are a common phenomenon
in natural language, and are one of the factors that complicate natural language
processing. Homonyms can have meanings that are quite distant (for example the
term Boston may refer to the city and the pop band) or more closely related (e.g.
Theo van Gogh is the name of two different persons, Groningen is both the name
of a Dutch province and its capital city, Boston is both a band and the title of their
debut album).
Hence, when encountering an occurrence of one of the instances in a text, we
are not guaranteed that the term indeed refers to the intended instance. Ideally, for
each occurrence of an instance in a text we want to observe whether the occurrence
indeed reflects the intended instance. However, the automatic parsing of texts is
troublesome. Moreover, if an instance is identified as a subject or object within a
sentence, then we still do not know whether the term indeed reflects the instance.
For terms with only one meaning, we can be more confident that occurrences
of these terms indeed refer to the intended instance. For terms with numerous
definitions, however, this relation can be much less certain. To this end, we propose a
mechanism to estimate the likelihood that a term indeed refers to the instance.
We use the define functionality in Google to obtain the number of senses of a
term. For example, by querying define: Tool, we obtained a list of 31 definitions
for the term Tool, collected from various online dictionaries and encyclopedias.
This indicates that Tool is an ambiguous term. In contrast, terms such as
Daft Punk, Fatboy Slim and Johannes Brahms lead to precisely one definition. We
define n(i) to be the number of definitions for i that are returned by Google. If no
definitions are returned, we consider n(i) to be 1.
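
The definition of n(i) can be written down as a small function. The sketch below
assumes that the definitions have already been collected into a dictionary; the
name `sense_count` and the dictionary layout are hypothetical, and the definition
texts are placeholders rather than actual search results.

```python
from typing import Dict, List


def sense_count(term: str, definitions: Dict[str, List[str]]) -> int:
    """n(i): the number of definitions found for term, or 1 if none were found."""
    return max(1, len(definitions.get(term, [])))


# Placeholder entries standing in for the definitions returned by a lookup.
collected = {
    "Tool": ["<definition>"] * 31,   # 31 definitions, as reported for Tool
    "Daft Punk": ["<definition>"],   # precisely one definition
}

print(sense_count("Tool", collected))             # 31
print(sense_count("Daft Punk", collected))        # 1
print(sense_count("Some Unknown Act", collected))  # 1 (no definitions returned)
```

Note that a term without any definition and a term with exactly one definition
receive the same value, n(i) = 1.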
In Tables 3.6 and 3.7 examples are given for definitions for an ambiguous and
an unambiguous instance of the class Artist.
Based on the number of definitions n(i) returned by Google, we formulate an
estimate p(i) for the likelihood that i
