Information extraction from the web using a search engine Citation for published version (apa)


$I_o$ be this set of countries, and let $I_s$ be the set {countries, country}. The set of relation instances $J$ consists of all instance combinations (countries, a) and (country, a), for $a \in I_o$. We apply the text pattern learning algorithm as discussed in Section 3.1 on this set of relation instances.
Using the proposed pattern learning algorithm, we identified almost 40,000 patterns. We computed fspr and fprec for the 1,000 most frequently found patterns. In Table 4.6, we give the 25 most effective patterns found by the algorithm.
The common hyponym patterns ‘like’ and ‘such as’ prove to be the most effective. This observation is useful when we want to minimize the number of queries for hyponym patterns. Other common hyponym patterns with high scores are ‘including’, ‘and other’ and ‘namely’. All infix patterns identified by Hearst ([1992], Table 2.1 on page 28) are identified here as well.
Apart from hyponym patterns that can be generally used, we also find patterns that are specific to the given setting. Patterns like ‘code for’ and ‘flag of’ are very usable to identify the studied relation. Such phrases are directly recognizable as usable patterns, but may not be straightforward to identify manually. Other patterns containing an adjective (e.g. ‘is a sovereign’) are perhaps over-specific, but still usable. The combinations of ‘is a’, ‘is an’ or ‘is the’ with an adjective occur 2,400 times in total in the list.
We conclude that the commonly used hyponym patterns are indeed also identified as effective patterns. Moreover, some patterns that are very typical for this setting (i.e. all countries and their hypernyms) are identified as well. Although these patterns are intuitive formulations of the studied relation, they are less straightforward to find manually.

PATTERN                              FREQ   PREC   SPR
(countries) like                      645   0.66   134
(countries) such as                   537   0.54   126
is a small (country)                  142   0.69   110
(country) code for                    342   0.36    84
(country) map of                      345   0.34    78
(countries) including                 430   0.21    93
is the only (country)                 138   0.55   102
is a (country)                        339   0.22    99
(country) flag of                     251   0.63    46
and other (countries)                 279   0.34    72
and neighboring (countries)           164   0.43    92
(country) name republic of             83   0.93    76
(country) book of                      59   0.77   118
is a poor (country)                    63   0.73   106
is the first (country)                 53   0.70   112
(countries) except                    146   0.37    76
(country) code for calling            157   0.95    26
is an independent (country)            62   0.55   114
and surrounding (countries)            84   0.40   107
is one of the poorest (countries)      61   0.75    78
and several other (countries)          65   0.59    90
among other (countries)                84   0.38    97
is a sovereign (country)               48   0.69    89
or any other (countries)               87   0.58    58
(countries) namely                     58   0.44   109
Table 4.6. Learned hyponym patterns and their scores.
4.3.2 Recognizing Instances using Memory-Based Learning
Contrary to the case study in Section 4.2, we now focus on a data-driven approach to identify instances. We apply the method discussed in Section 3.1.2 to extract instances by classifying feature vectors derived from the search results.
As the focus is on the instance identification task, we use the best scoring hyponym patterns found with the complete list of countries. We selected the 11 most effective patterns to express the relation (cf. Table 4.6). Contrary to the pattern learning set-up, the class $c_o$ is incomplete. We now assume that only the country names starting with A – D are given, i.e. from Afghanistan to the Dominican Republic. These names of countries are used to annotate search results and train the classifier as described in Chapter 2. No relation instances are provided. The complete class with instances country and countries remains unchanged.

4.3 Identifying Countries
I am deeply grateful to the Secretary General for the opportunity of working ... and destitute state - such as Afghanistan – provides fertile ground for ...
In an uninhabited region such as Antarctica, ...
with vaunted ski industries such as Austria and Switzerland insist they aren’t ...
Table 4.7. Example search results used to train the classifier.
We run the ontology population algorithm using the described initial ontology. Using the known names of countries, we construct a training set with all known instance-pattern combinations, i.e. all the snippets found with ‘like’ (from ‘like Afghanistan’ up to ‘like the Dominican Republic’) and the ten other patterns.
Table 4.7 shows some example sentences that are used in the training set of the classifier. Note that Switzerland will not be annotated as a country name, as the term is not an instance in the initial ontology.
The annotated search results are used to identify instances of $c_{country}$ when querying instances of the other class. We recognize instances in the search results for the 22 queries (11 patterns, 2 instances).
We compare two approaches in the data-oriented identification of instances: one where we use the focus word in the vectors to be classified, and one where the actual focus word is left out. Approaches that use the focus word in the feature vector may be biased towards this feature. In the worst case, only the instances in the training set will be recognized.
To describe the context of the focus word, we construct a vector with the 5 tokens preceding and the 5 tokens following it. Moreover, for each of these context features, we add one of the more general features discussed in Chapter 3. The general feature describing the focus word is maintained for both approaches.
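The construction of such a context vector can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the padding symbol and the `shape` function (standing in for the general features of Chapter 3, which are not specified here) are assumptions.

```python
from typing import List

PAD = "_"  # assumed placeholder for positions outside the snippet


def shape(token: str) -> str:
    """Hypothetical 'general' feature: a coarse description of the token form."""
    if token == PAD:
        return "PAD"
    if token.isdigit():
        return "NUM"
    if token[:1].isupper():
        return "CAP"
    if token.isalpha():
        return "LOW"
    return "OTHER"


def context_vector(tokens: List[str], i: int, window: int = 5,
                   use_focus_word: bool = True) -> List[str]:
    """Build the feature vector for the focus word tokens[i]."""
    padded = [PAD] * window + tokens + [PAD] * window
    j = i + window  # index of the focus word in the padded list
    context = padded[j - window:j] + padded[j + 1:j + 1 + window]
    features: List[str] = []
    for tok in context:
        # each context token plus its more general description
        features.extend([tok, shape(tok)])
    if use_focus_word:
        features.append(tokens[i])
    # the general feature of the focus word is kept in both variants
    features.append(shape(tokens[i]))
    return features
```

Calling `context_vector(tokens, i, use_focus_word=False)` gives the variant in which the actual focus word is left out, so the classifier has to rely on the context and the general focus feature only.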
The vectors in the training set are labeled with one of the three following classes: start, intern, not. From the classified feature vectors in the test set, all focus words labeled start and possible subsequent focus words labeled intern are extracted.
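The extraction step can be illustrated with a short sketch. It assumes the classifier output is available as a list of (word, label) pairs in text order; that representation is an assumption made for the example.

```python
from typing import List, Tuple


def extract_instances(labeled: List[Tuple[str, str]]) -> List[str]:
    """Join each 'start' word with the 'intern' words that directly follow it."""
    instances: List[str] = []
    current: List[str] = []
    for word, label in labeled:
        if label == "start":
            if current:
                instances.append(" ".join(current))
            current = [word]
        elif label == "intern" and current:
            current.append(word)
        else:
            # 'not', or an 'intern' that has no preceding 'start'
            if current:
                instances.append(" ".join(current))
            current = []
    if current:
        instances.append(" ".join(current))
    return instances
```

In this sketch an intern label without a preceding start is ignored, which is one possible reading of "possible subsequent focus words".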
We evaluate the instances identified using the complete list of countries from the CIA factbook. A term is thus considered correct when its exact string representation is found in this list, and incorrect in all other cases.
We use the number of times an instance is identified as a confidence measure p for its correctness and sort all instances found by decreasing occurrence. In Figure 4.1 the results of the two alternative classification approaches are compared: the approach with the focus word in the feature vector and the one without. We plot the precision of the n most frequently found terms against the absolute recall.

[Figure 4.1 plot: "Identifying country names using memory-based learning". Axes: precision (0 to 1) and recall (0 to 180). Series: using focus word, including seed list; without focus word, including seed list.]
Figure 4.1. Recall (# countries identified, including the countries in the seed list) and precision for the instance identification.

The figure shows that the 50 most frequently found instances are all correct if we use the approach with the focus word in the vector. The precision drops steeply for recall above 120 instances. For the other approach, errors occur among the best rated instances, but the precision is quite constant for recall levels between 20 and 100.
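The points of such a curve can be computed along the following lines. The sketch assumes the extracted terms are given as a flat list with repetitions and the evaluation list as a set of exact strings; both names are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def precision_recall_points(found: Iterable[str],
                            gold: Set[str]) -> List[Tuple[int, float]]:
    """For each n, return (absolute recall, precision) of the n most frequent terms."""
    ranked = [term for term, _ in Counter(found).most_common()]
    points: List[Tuple[int, float]] = []
    correct = 0
    for n, term in enumerate(ranked, start=1):
        if term in gold:  # exact string match against the evaluation list
            correct += 1
        points.append((correct, correct / n))
    return points
```

Terms with equal counts keep the order in which they were first seen, so ties are broken arbitrarily with respect to correctness.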
In Table 4.8 we give the most frequent instances that are evaluated to be incorrect. The table shows that many of these instances are geographic locations (regions, nations, continents) or variations on country names (e.g. The UK, Myanmar). Taiwan is not recognized as a sovereign country by the USA and is not included in the CIA factbook.
Other errors include common words such as The and What and compass directions. While for example the multi-word terms United States, Saudi Arabia and South Korea are recognized, we also identify parts of these names (e.g. Arabia, United) as separate instances.
As the approach using the focus word in the vector may be biased towards the instances that occur in the training set, we also compare the results by leaving out the country names starting with A – D in the evaluation set.
The          Europe    America   United    States         This
USA          What      Country   Korea     Sri            European
Countries    Thai      How       American  Which          South
There        Flutter   However   England   Arabia         BALI
North        African   London    Saudi     International  Asia
California   The UK    Western   People    Saturday       Our
Zealand      That      Shop      They      National       Taiwan
Sea          Gold      Rib       Napa      Cape           Myanmar
Table 4.8. The best-ranked incorrect terms found using the classification without the focus word.

In Figure 4.2 the results are compared for the instances found that were not included in the mentioned seed list. This figure shows that neither of the two methods clearly outperforms the other. This is an expected result, as for this evaluation set the focus word itself does not contain relevant information.

[Figure 4.2 plot: "Identifying country names using memory-based learning". Axes: precision (0.1 to 1) and recall (0 to 120). Series: using focus word, excluding seed list; without focus word, excluding seed list.]
Figure 4.2. Recall (# countries identified, excluding the countries in the seed list) and precision for the instance identification.
As the results of this experiment cannot be compared with previous work, the quality of the results is hard to judge. We conclude that at precision levels of around 0.5 up to 80 new instances can be identified. Among the false positives are many synonyms of country names and other geographic entities. The total set of countries has a size of 270 instances, while 73 countries were used in the annotation phase. Given the limited number of queries and the simplicity of the method, we are encouraged by these results. Analysis of the erroneous results shows that many false positives are geographic locations or parts of instances identified more frequently (e.g. apart from Sri Lanka also Sri and Lanka are extracted).
By adding check functions, the precision can be improved. Alternatively, using Cimiano and Staab’s method [2004], we can distinguish between the various geographic locations found. For example, if more evidence is found that London is a city or Asia is a continent, the hypothesis that London and Asia are countries can be rejected.
4.4 The Presidents of the United States of America
In this case study, we focus on an ontology describing the presidents of the US. We choose this setting as the required information can be expected to be redundantly available and the evaluation of the extracted information is relatively easy, as an indisputable complete list of presidents is available as ground truth.
We want to identify a complete list of all past presidents (from George Washington to George W. Bush). We define the initial ontology as follows. Given are the classes named US President and Rank and the relations succeeded (with subject and object class US President) and order (on US President and Rank). The instances of the complete class Rank (first president, second president .. 50th president) are given beforehand. Note that the current president, George W. Bush, is the 43rd in line and the last instances are added for evaluation purposes.
Using this set-up, we focus on two tasks. First we populate the relations with a given complete class US President. The second task is to populate the ontology where US President is incomplete. We compare a rule-based approach with instance identification using memory-based learning.
4.4.1 Identifying Relations
We apply the ontology population on the US president ontology with complete classes, starting with the pattern [US President] was the [Rank] for the order relation and [US President] succeeded [US President] for the succeeded relation.
Again, we use the snippets and page titles found with the search engine. The most frequent relations found are used to identify new patterns.
In Section 3.1 we argued that for non-functional relations broad patterns are to be selected. For functional relations on the other hand, patterns are to be selected that connect the subject instance to few object instances. As the two relations are in general functional (see footnote 6), we expect the ontology population algorithm to perform best with the functional setting for the two relations.

             NON-FUNCTIONAL         FUNCTIONAL
iteration    succeeded   order      succeeded   order
1            0.88        0.89       0.88        0.89
2            0.67        0.94       0.76        0.94
3            0.71        0.94       0.81        0.94
3            0.57        0.94       0.79        0.94
4            0.52        0.94       0.81        0.94
Table 4.9. F-Measure per iteration for the given relations.
We evaluate the populated ontologies after each iteration. For the order relation, we focus on the president that is most often found for a given rank. For example, we take instance 26th and extract the most frequent relation instance containing 26th. The succeeded relation is evaluated in a similar manner. As Grover Cleveland succeeded both Chester Arthur and Benjamin Harrison, we focus on the two most frequently occurring relation instances with Grover Cleveland as subject instance.
In Table 4.9 the F-measures (combining precision and recall, page 20) of the extracted relation instances are given for the first four iterations.
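The evaluation of the order relation can be sketched as follows. The sketch assumes the extracted relation instances are available as (president, rank) pairs with repetitions, and it uses the standard harmonic mean of precision and recall as F-measure; the exact definition on page 20 is not reproduced here, so that choice is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def most_frequent_per_rank(extracted: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep, for every rank, the president most often found with it."""
    by_rank: Dict[str, Counter] = defaultdict(Counter)
    for president, rank in extracted:
        by_rank[rank][president] += 1
    return {rank: counts.most_common(1)[0][0] for rank, counts in by_rank.items()}


def f_measure(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Harmonic mean of precision and recall over (rank, president) pairs."""
    correct = sum(1 for rank, pres in predicted.items() if gold.get(rank) == pres)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For the succeeded relation the same idea applies, except that for Grover Cleveland the two most frequent relation instances would be kept instead of one.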
As the ontology populated after the first iteration is based on the two manually selected patterns, the results do not differ for this iteration. For the subsequent iterations, we observe differences between the functional and non-functional approach in the succeeded relation. The F-measure for both approaches is lower than the one for the first iteration. The non-functional results deteriorate as more and more patterns are added (e.g. ‘(president) and (president)’) that do not express the intended relation. No differences are observed in the results for the order relation. This can be explained by the fact that few other relations are imaginable that combine the instances of the two classes. Moreover, as the vast majority of the instances were added in the first iterations, the ranking is stabilized after iteration 2.
We conclude that the ontology population method using the search engine snippets gives good results. The results for the succeeded relation show the effect of the distinction between functional and non-functional relations.
4.4.2 Identifying Instances
Having focussed on a task with complete classes above, the class president is now empty. Hence, the instances of US President are initially to be found using the order relation, as no queries can be formulated for the succeeded relation.
Footnote 6: Grover Cleveland was the only president to serve two non-consecutive terms.
In two alternative runs of the algorithm, the instances of US President are identified as follows.
• Rule-based approach. We have formulated a regular expression describing person names. We accept the longest sequence with either two or three capitalized words, or a sequence with two capitalized words and a capital and period in between (e.g. John F. Kennedy).
• Using memory-based learning. Like the previous experiment in Section 4.3, we classify vectors describing the search results and extract instances from these vectors. We use the ten most recent presidents in the training set for the first iteration. In the feature vectors generated from the corresponding search results, we do not include the focus words in the classification. Hence, the ten presidents in the training set are not instantly added when populating the ontology but have to be extracted from the search results.
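The rule-based option can be approximated with a regular expression such as the one below. The exact expression used in the experiments is not given, so this is only a sketch of the description above; in particular, treating a capitalized word as one uppercase letter followed by lowercase letters is an assumption that excludes names such as McKinley.

```python
import re
from typing import List

CAP = r"[A-Z][a-z]+"  # assumed shape of a capitalized word

NAME_RE = re.compile(
    rf"\b{CAP} [A-Z]\. {CAP}\b"        # two capitalized words with an initial in between
    rf"|\b{CAP}(?: {CAP}){{1,2}}\b"    # longest run of two or three capitalized words
)


def find_names(text: str) -> List[str]:
    """Return all candidate person names in the text, left to right."""
    return NAME_RE.findall(text)
```

Because the quantifier is greedy, a run of three capitalized words is preferred over its two-word prefix. Sentence-initial capitalized words are not handled separately in this sketch.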
Compared to the experiment in Section 4.3, the training set for the memory-based learning approach is small. We are interested in whether we can find a long list of instances using the bootstrapping mechanisms.
The evaluation of the extracted instances is not straightforward, as many alternatives may refer to the same president (e.g. President Clinton, Bill Clinton, William Jefferson Clinton). We therefore decided to automatically evaluate the instances found using Google’s define functionality.
We consider an extracted instance to be a president of the United States if:
- indeed definitions are found for the given term, and
- the word president is found in at least one of the definitions, and
- the terms United States or US are found in at least one of the definitions.
When we inspected the results for the populated ontology, we encountered many terms that refer to vice-presidents, presidents of other countries and other statesmen. Although these instances are not correct instances of US President, one could argue that they are instances of some superclass politician. We therefore propose a second evaluation where we focus on politicians in general. Using Google define, we consider an instance to be a politician if:
- definitions are found for the term, and
- at least one of the words president, minister, leader, statesmen or politician is found in the definitions.
Naturally, US presidents are included in this definition.
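The two criteria can be written down as simple checks on a list of definition strings. The sketch below assumes the definitions for a term have already been retrieved (retrieval itself is left out), and it uses case-insensitive substring matching for the keywords, which is an assumption about how "is found in" should be read.

```python
import re
from typing import List

POLITICIAN_WORDS = ("president", "minister", "leader", "statesmen", "politician")


def is_us_president(definitions: List[str]) -> bool:
    """First criterion: definitions exist, mention 'president' and the US."""
    if not definitions:
        return False
    has_president = any("president" in d.lower() for d in definitions)
    has_us = any("United States" in d or re.search(r"\bUS\b", d) is not None
                 for d in definitions)
    return has_president and has_us


def is_politician(definitions: List[str]) -> bool:
    """Second criterion: definitions exist and mention one of the keywords."""
    return any(word in d.lower() for d in definitions for word in POLITICIAN_WORDS)
```

With substring matching, a definition containing "vice-president" also satisfies the first check, which is in line with the vice-presidents encountered among the results.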
Figure 4.3 gives the precision and absolute recall for the first iteration. It shows that using the rule-based method in total 120 terms were found that refer to US presidents, while history has only known 42 distinct presidents. Using the rule-based approach all presidents were identified, the last one, James Buchanan, at a precision level of 0.47. Hillary Clinton is said to be the 44th president of the United States (see footnote 7).
Using the rule-based approach, in total 120 correct distinct variations of the names of the 42 presidents were found. Apart from these 120, we also found 61 names of presidents of other countries and 42 other politicians.

               RULE-BASED            USING MBL
               succeeded   order     succeeded   order
iteration 1.   0.95        0.94      0.25        0.27
iteration 2.   0.98        1.00      0.25        0.27
Table 4.10. F-Measure for the first two iterations for the given relations.
The performance of the approach using memory-based learning is disappointing. As only few names were included in the training set, classification of the context is biased towards these names. The only presidents identified are also present in the initial training set. As we do not abstract from the context, but only from the focus word, words in the context like Bill and Clinton signal the presence of an intern or start vector. For example, a focus word is only classified as intern when it is preceded by one of the first names of the presidents in the training set (e.g. Bill or Ronald). As no new president names are learned, the performance does not improve in the next iterations, when newly identified patterns are applied.
With respect to the relation instances found, the F-measures for the first two iterations are given in Table 4.10. As only few names were identified using memory-based learning, the F-measures for the found relations are not convincing either. The relations using the rule-based approach are precisely identified. With respect to the second iteration using the rule-based approach, only the succeeded relations between James Polk and John Tyler and between James Monroe and James Madison were not extracted, while the precision for the succeeded relation was 1.
We conclude that for the given task, the simple rule-based method gives convincing results. On the other hand, the approach using automatically annotated training data is disappointing, especially as the search results for the rank have been shown to contain all relevant data to populate the ontology.
Footnote 7: The experiment was conducted in February 2008.

[Figure 4.3 plot: "Identifying instances for US president ontology". Axes: precision (0.1 to 1) and # correct (0 to 250). Series: politicians, using MBL; US presidents, using MBL; politicians, rule-based; US presidents, rule-based.]
Figure 4.3. Recall and precision for the instance identification.
4.5 Extracting Historical Persons from the Web
In this last section of this chapter we focus on the population of an ontology on historical persons. Such information is currently not present as such on the Web. By combining and structuring the knowledge on diverse web pages, we are able to find answers to questions such as Who are important novelists from Ireland?, Which notable people were born in 1648?, Who are popular female composers?. To present the information in an attractive manner, we compute a fame rank for each person based on the presence on the web.
In the given initial ontology (cf. Figure 4.4) all classes but Person are complete, while Person is empty. The class Period contains all combinations of years that are likely to denote the periods of life of a historical person. For example ‘1345 - 1397’ is an instance of Period. The class Nationality contains all names of countries in the world. We identify derivatives of country names as well and use them as synonyms (e.g. American for United States, English for United Kingdom and Flemish for Belgium). A hierarchy among the instances of Nationality is defined using the names of the continents, such that we can for example select a list of historical persons from Europe. Likewise, the instances of Profession reflect 88 professions. For the instances male and female we have added a list of derivatives to be used as synonyms, namely the terms he, his, son of, brother of, father of, man and men for male and the analogous words for female. We use the class Fame to rank the retrieved instances of Person according to their presence on the web. Hence the task is to identify a collection of biographies of historical persons and to identify a social network between the persons found. As persons may have more than one profession and can be related to multiple other people, we are interested in a ranked list of professions and related persons for each historical person found. For efficiency reasons, we only extract information from the snippets returned by the search engine.

[Figure 4.4 diagram: the classes Nationality, Profession, Gender, Fame, Period and Person, with relations labeled has (four times), lived and related_with.]
Figure 4.4. The ontology on historical persons to be populated.
We use the given instances in the ontology O to populate the class Person, i.e. to find a list of names of historical persons. We again use the instances available in the given ontology and combine these with patterns to formulate queries and hence create a corpus to extract information from.
Suppose we use instances in the class Profession to extract the persons. When querying for the instance composer, it is likely that a few well-known composers dominate the search results. As we are interested in a rich ontology of historical persons, this is thus a less suitable approach.
The class Period contains all combinations of years that are likely to denote the periods of life of a historical person. Hence, the number of instances known for the class Period is by far the largest of all complete classes in O. As it is unlikely that many important historical persons share both a date of birth and a date of death, the use of this class is best suited to obtain a long and diverse list of persons. The names of historical persons are often followed in texts by a period in years (e.g. ‘Vincent van Gogh (1853 - 1890)’). As this period is likely to denote the period he or she lived in, we choose the pattern ‘(year of birth – year of death)’ to collect snippets to identify the names of historical persons.
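Generating the Period instances and the corresponding queries can be sketched as below. Which year range and which life spans count as "likely" is not stated here, so all four parameters are hypothetical and only serve to make the example concrete.

```python
from typing import Iterator


def period_queries(first_year: int, last_year: int,
                   min_age: int, max_age: int) -> Iterator[str]:
    """Yield one query string per plausible (birth, death) combination.

    All bounds are hypothetical parameters, not values from the experiments.
    """
    for birth in range(first_year, last_year + 1):
        for age in range(min_age, max_age + 1):
            death = birth + age
            if death <= last_year:
                yield f"({birth} - {death})"
```

Each yielded string, such as `(1853 - 1890)`, would be sent as a phrase query, and the returned snippets form the corpus from which names are extracted.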
4.5.1 Identifying Person Names in Snippets
Having obtained a collection of snippets, the next problem is to extract instances from the texts, in this case person names. We choose to identify the names within the snippets using a rule-based approach.
First we extract all terms directly preceding the queried expression that match a regular expression similar to the approach in the previous case study. That is, we extract terms of two or three capitalized words and compensate for initials, inversions (e.g. ‘Bach, Johann Sebastian’), middle names, Scottish names (e.g. McCloud) and the like.
Subsequently, we remove extracted terms that contain a word in a taboo list (e.g. ‘Biography’) and names that only occur once within the snippets. Having filtered out a set of potential names of persons, we use string matching among the extracted names to remove typos and names extracted for the wrong period.
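The first two filtering steps can be sketched as follows. The taboo list shown contains only the one example word mentioned above and is otherwise hypothetical; the string matching step for typos and wrong periods is not covered.

```python
from collections import Counter
from typing import Iterable, List, Set

TABOO: Set[str] = {"biography"}  # hypothetical list; only 'Biography' is from the text


def filter_names(candidates: Iterable[str], taboo: Set[str] = TABOO) -> List[str]:
    """Drop candidates with a taboo word and candidates that occur only once."""
    counts = Counter(candidates)
    kept: List[str] = []
    for name, n in counts.items():
        if n < 2:
            continue  # names found only once are discarded
        if any(word.lower() in taboo for word in name.split()):
            continue  # contains a taboo word
        kept.append(name)
    return kept
```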
Using the 80,665 periods identified, we obtain a list of 28,862 terms to be added as instances to the class Person. Simultaneously, we extract the relations between the periods queried and the extracted instances.
In the evaluation we analyze the quality of the extracted instances and compare the rule-based approach with a state-of-the-art named entity recognizer using a hidden Markov model [Duda et al., 2000].
4.5.2 Using Mined Names to Find Additional Biographical Information
Having found a list of instances of the class Person, we first determine a ranking of the instances extracted.
Finding a Rank. To present the extracted information in an entertaining manner, we determined the number of hits for each identified person. As names are not always unique descriptors, we queried for the combination of the last name and period (e.g. ‘Rubens (1577 - 1640)’). Although the number of hits returned by a search engine is an estimate and irregularities may occur [Véronis, 2006], we consider this simple and efficient technique to be well suited for this purpose.
Now we use the names of these instances in a similar fashion to acquire biographical information for the 10,000 best ranked persons. To limit the number of queries per instance, we select the pattern ‘was’ to reflect the relation between Person on the one hand and Nationality, Gender and Profession on the other hand. By querying phrases such as ‘Napoleon Bonaparte was’ we thus expect to acquire sentences containing the biographical information. Table 4.11 contains examples of the sentences used to determine biographical information. We scan these sentences for occurrences of the instances (and their synonyms) of the related classes.
Relating persons to a gender. We simply counted the instances and their synonyms within the snippets that refer to the gender ‘male’ and the corresponding words that refer to ‘female’. We related each instance of Person to the gender with the highest count.
Relating persons to a nationality. We assigned the nationality with the highest count.
Relating persons to professions. For each person, we assigned the profession p that most frequently occurred within the snippets retrieved. Moreover, as persons may have multiple professions, all other professions with a count of at least half of the count of p were added.
Hence, using one query per instance of Person, we identify basic biographical information.
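The rule for professions can be illustrated with a small sketch. It counts single-word profession names in lowercased snippets and ignores synonyms; both simplifications, and the tokenization by a simple regular expression, are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def assign_professions(snippets: Iterable[str], professions: Set[str]) -> List[str]:
    """Return the most frequent profession p and all others with at least half its count."""
    counts: Counter = Counter()
    for snippet in snippets:
        for word in re.findall(r"[a-z]+", snippet.lower()):
            if word in professions:
                counts[word] += 1
    if not counts:
        return []
    ranked = counts.most_common()
    top_count = ranked[0][1]
    # keep p itself and every profession whose count is at least half of p's count
    return [p for p, c in ranked if 2 * c >= top_count]
```

Gender and nationality follow the simpler rule of taking only the single instance with the highest count.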

Napoleon Bonaparte was the greatest military genius of the 19th century
Napoleon Bonaparte was born of lower noble status in Ajaccio, Corsica on August 15, 1769
Napoleon Bonaparte was effectively dictator of France beginning in 1799 and
Napoleon Bonaparte was the emperor of France in the early 1800s
Napoleon Bonaparte was a bully, rude and insulting
Napoleon Bonaparte was in Egypt and was not enjoying his tour
Napoleon Bonaparte was a great warrior and a notable conqueror
Napoleon Bonaparte was born on August 15, 1769 to Carlo and Letizia Bonaparte
Napoleon Bonaparte was defeated at Waterloo
Table 4.11. Example search results for the query ‘Napoleon Bonaparte was’.
PROFESSION       COUNT     PROFESSION     COUNT
philosopher       1275     designer         222
composer           804     scientist        215
mathematician      773     musician         213
poet               668     historian        210
physicist          501     inventor         208
writer             478     essayist         201
playwright         469     engineer         199
novelist           429     singer           198
sculptor           362     dramatist        186
author             352     theorist         175
critic             346     illustrator      171
astronomer         343     journalist       166
painter            329     statesman        138
politician         323     teacher          138
artist             286     mystic           133
architect          284     educator         132
director           270     theologian       127
conductor          267     physician        125
actor              261     printmaker       124
pianist            224     scholar          112
Table 4.12. The professions that were found most often.
4.5.3 Evaluating the Identified Biographical Information
The rank assigned to each of the persons in the list provides a mechanism to present the extracted data in an attractive manner. Table 4.13 gives the list of the 25 best ranked persons and the identified biographical information. Using the criterion defined in Section 4.5, Johann Sebastian Bach is thus the best known historical figure.

Johann Sebastian Bach (1685-1750)      Germany          composer,organist
Wolfgang Amadeus Mozart (1756-1791)    Austria          composer,musician
Ludwig van Beethoven (1770-1827)       Germany          composer
Albert Einstein (1879-1955)            Germany          scientist,physicist
Franz Schubert (1797-1828)             Austria          composer
Johannes Brahms (1833-1897)            Germany          composer
William Shakespeare (1564-1616)        United Kingdom   author,poet
Joseph Haydn (1732-1809)               Austria          composer
Johann Wolfgang Goethe (1749-1832)     Germany          philosopher,director,poet..
Charles Darwin (1809-1882)             United Kingdom   naturalist
Robert Schumann (1810-1856)            Germany          composer
Leonardo da Vinci (1452-1519)          Italy            artist,scientist,inventor
Giuseppe Verdi (1813-1901)             Italy            composer
Frederic Chopin (1810-1849)            Poland           composer,pianist,poet
Antonio Vivaldi (1678-1741)            Italy            composer
Richard Wagner (1813-1883)             Germany          composer
Ronald Reagan (1911-2004)              United States    president
Franz Liszt (1811-1886)                Hungary          pianist,composer
Claude Debussy (1862-1918)             France           composer
Henry Purcell (1659-1695)              United Kingdom   composer
Immanuel Kant (1724-1804)              Germany          philosopher
James Joyce (1882-1941)                Ireland          author
Friedrich Schiller (1759-1805)         Germany          poet,dramatist
Georg Philipp Telemann (1681-1767)     Germany          composer
Antonin Dvorak (1841-1904)             Czech Republic   composer
Table 4.13. The 25 historical persons with the highest rank.

As the data is structured, we can also perform queries to select subsets of the full ranked list of persons. For example, we can create a list of best ranked artists (Table 4.14), or a ‘society’ of poets (Table 4.15). We note that Frédéric Chopin is often referred to as ‘the poet of the piano’. Table 4.16 shows that Vincent van Gogh is the best ranked Dutch painter.
In Table 4.14 we give the top-40 persons that have as first profession either artist or painter. Persons that also have artist or painter as one of their professions, but not as their highest-scoring profession, are Sir Winston Churchill, John Ruskin and Kahlil Gibran. Their highest-scoring professions are politician, author and poet, respectively.
The reader can verify that the given list of extracted persons is highly accurate. However, lacking a benchmark set of the best known historical persons, we manually evaluated samples of the extracted ontology to estimate precision and recall.
Leonardo da Vinci (1452 - 1519)        Italy            artist, scientist,...
Pablo Picasso (1881 - 1973)            Spain            artist
Vincent van Gogh (1853 - 1890)         Netherlands      artist, painter
Claude Monet (1840 - 1926)             France           artist, painter,...
Pierre-Auguste Renoir (1841 - 1919)    France           painter
Paul Gauguin (1848 - 1903)             France           painter
Edgar Degas (1834 - 1917)              France           artist, painter,...
Paul Cezanne (1839 - 1906)             France           painter, artist
Salvador Dali (1904 - 1989)            Spain            artist
Henri Michaux (1899 - 1984)            Belgium          artist, poet
Gustav Klimt (1862 - 1918)             Austria          painter, artist
Peter Paul Rubens (1577 - 1640)        Belgium          artist, painter
Katsushika Hokusai (1760 - 1849)       Japan            painter
Amedeo Modigliani (1884 - 1920)        Italy            artist, painter
JMW Turner (1775 - 1851)               United Kingdom   artist, painter
James Mcneill Whistler (1834 - 1903)   United States    artist
Rene Magritte (1898 - 1967)            Belgium          artist, painter
Henri Matisse (1869 - 1954)            France           artist
Rembrandt van Rijn (1606 - 1669)       Netherlands      artist, painter
Edouard Manet (1832 - 1883)            France           artist, painter
Herm Albright (1876 - 1944)            -                artist, engraver,...
Marc Chagall (1887 - 1985)             Russia           painter, artist
Edvard Munch (1863 - 1944)             Norway           painter, artist
Wassily Kandinsky (1866 - 1944)        Russia           artist, painter
Francisco Goya (1746 - 1828)           Spain            artist, painter
Table 4.14. The 25 artists with the highest rank.

Precision. To estimate the precision of the class Person, we selected three decennia, namely 1220-1229, 1550-1559 and 1880-1889, and analyzed for each the candidate persons that were found to be born in this decennium. For the first two decennia we analyzed the complete list; for decennium 1880-1889 we analyzed only the first 1000 as well as the last 1000 names. This resulted in a precision of 0.94, 0.95, and 0.98, respectively. As the decennium of 1880-1889 resulted in considerably more names, we take a weighted average of these results. This yields an estimated precision for the complete list of 0.98.
We compare the precision of the rule-based approach with a state-of-the-art
machine-learning-based algorithm, the Stanford Named Entity Recognizer (SNER
[Finkel et al., 2005]), trained on the CoNLL 2003 English training data. Focussing
on persons born in the year 1882, we extracted 1,211 terms using the rule-based
approach. SNER identified 24,652 unique terms as person names in the same snippets.
When we apply the same post-processing to the SNER-extracted data (i.e. removing
typos by string matching, single-word names and names extracted for different
periods), 2,760 terms remain, of which 842 overlap with the terms extracted using
the rule-based approach.
We manually inspected each of these 2,760 terms, resulting in a precision of
only 62%. Around half of the correctly extracted names are not recognized by the
rule-based approach, most of them because these names did not directly
precede the queried period.
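The statement that around half of the correct SNER names are missed by the rule-based approach can be checked from the reported figures. The sketch below is only a back-of-the-envelope check: it assumes that all 842 overlapping terms are correct person names, which the text does not state explicitly.

```python
# Figures reported in the text for persons born in 1882.
sner_terms = 2760       # SNER terms remaining after post-processing
sner_precision = 0.62   # manually assessed precision of those terms
overlap = 842           # terms also extracted by the rule-based approach

# Estimated number of correct person names among the SNER terms.
sner_correct = round(sner_terms * sner_precision)

# Assumption (not stated in the text): every overlapping term is correct.
missed_by_rules = sner_correct - overlap
share_missed = missed_by_rules / sner_correct

print(sner_correct, missed_by_rules, round(share_missed, 2))
```

Under this assumption, roughly 51% of the correct SNER names are not found by the rule-based approach, in line with "around half".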
To estimate the precision of the extracted biographical relations, we inspected
randomly selected sublists of the top 2500 persons. When we focus on the best
scoring professions for the 2500 persons, we estimate the precision of this relation
to be 96%. We did not encounter erroneously assigned genders, and in 98% of the
cases where a Nationality was found, it was the right one.
Hence, we conclude that the ontology populated using the rule-based approach
is precise.

William Shakespeare (1564-1616)     United Kingdom  author, poet
Johann Wolfgang Goethe (1749-1832)  Germany         poet, psychologist, philosopher,...
Frederic Chopin (1810-1849)         Poland          composer, pianist, poet
Friedrich Schiller (1759-1805)      Germany         poet, dramatist
Oscar Wilde (1854-1900)             Ireland         author, poet
Jorge Luis Borges (1899-1986)       Argentina       author, poet
Victor Hugo (1802-1885)             France          author, poet, novelist
Ralph Waldo Emerson (1803-1882)     United States   poet, philosopher, author
William Blake (1757-1827)           United Kingdom  poet
Dante Alighieri (1265-1321)         Italy           poet
Robert Frost (1874-1963)            United States   poet
Heinrich Heine (1797-1856)          Germany         poet
Robert Louis Stevenson (1850-1894)  Samoa           engineer, author, poet
Alexander Pope (1688-1744)          United Kingdom  poet
Hildegard von Bingen (1098-1179)    Germany         composer, scientist, poet
Lord Byron (1788-1824)              Greece          poet
John Donne (1572-1631)              United Kingdom  poet, author
Henri Michaux (1899-1984)           Belgium         poet
Walt Whitman (1819-1892)            United States   poet
Robert Burns (1759-1796)            United Kingdom  poet
Table 4.15. The 20 best ranked poets.

Vincent van Gogh (1853-1890)
Rembrandt van Rijn (1606-1669)
Johannes Vermeer (1632-1675)
Piet Mondrian (1872-1944)
Carel Fabritius (1622-1654)
Kees van Dongen (1877-1968)
Willem de Kooning (1904-1997)
Pieter de Hooch (1629-1684)
Jan Steen (1626-1679)
Adriaen van Ostade (1610-1685)
Table 4.16. The 10 best ranked painters from the Netherlands.

Recall. We estimate the recall of the instances found for Person by choosing
a diverse set of six books containing short biographies of historical persons. Of
the 1049 persons named in the books, 1033 were present in our list, which gives a
recall of 0.98 (Table 4.18).
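As a small worked example, the recall figures can be recomputed from the counts in Table 4.18. The sketch assumes that the overall recall is the pooled ratio of found to named persons over all six books, rather than the mean of the per-book values; this reading reproduces the totals 1049 and 1033 given in the text.

```python
# (book, persons named in the book, persons found in our list), from Table 4.18.
books = [
    ("The Science Book", 156, 147),
    ("The Art Book", 358, 353),
    ("The Dutch Painters: 100 Seventeenth Century Masters", 108, 106),
    ("Philosophy: 100 Essential Thinkers", 78, 78),
    ("Herinneringen in Steen", 195, 195),
    ("Scientists and Inventions", 154, 154),
]


def recall(found, total):
    """Fraction of the persons named in a source that were found."""
    return found / total


# Per-book recall, rounded to two decimals as in the table.
per_book = {title: round(recall(found, total), 2) for title, total, found in books}

# Pooled recall over all books.
total_named = sum(total for _, total, _ in books)
total_found = sum(found for _, _, found in books)
overall = round(recall(total_found, total_named), 2)

print(total_named, total_found, overall)
```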

Cesar Franck (1822 - 1890, B)                organist, composer, pianist
Vincent van Gogh (1853 - 1890, NL)           artist, painter
Roland de Lassus (1532 - 1594, B)            composer
Abraham Kuyper (1837 - 1920, NL)             theologian, politician
Henri Michaux (1899 - 1984, B)               artist, poet
Peter Paul Rubens (1577 - 1640, B)           artist, painter
Baruch Spinoza (1632 - 1677, NL)             philosopher
Rene Magritte (1898 - 1967, B)               artist, painter
Christiaan Huygens (1629 - 1695, NL)         astronomer, scientist,...
Rembrandt van Rijn (1606 - 1669, NL)         artist, painter
Johannes Vermeer (1632 - 1675, NL)           painter, artist
Edsger Wybe Dijkstra (1930 - 2002, NL)       computer scientist
Anthony van Dyck (1599 - 1641, B)            painter
MC Escher (1898 - 1972, NL)                  artist
Antony van Leeuwenhoek (1632 - 1723, NL)     scientist
Piet Mondrian (1872 - 1944, NL)              artist, painter
Hugo Grotius (1583 - 1645, NL)               lawyer, philosopher,...
Jan Pieterszoon Sweelinck (1562 - 1621, NL)  composer, organist,...
Andreas Vesalius (1514 - 1564, B)            physician
Hieronymus Bosch (1450 - 1516, NL)           painter
Audrey Hepburn (1929 - 1993, B)              actress, princess
Ferdinand Verbiest (1623 - 1688, B)          astronomer
Desiderius Erasmus (1466 - 1536, NL)         philosopher, reformer,...
Theo van Gogh (1957 - 2004, NL)              judge, artist
Gerard Dou (1613 - 1675, NL)                 painter, artist
Nicolaas Beets (1814 - 1903, NL)             king, poet, writer
Carel Fabritius (1622 - 1654, NL)            painter
Georges Simenon (1903 - 1989, B)             author
Kees van Dongen (1877 - 1968, NL)            painter
Gerardus Mercator (1512 - 1594, B)           cartographer
Emile Verhaeren (1855 - 1916, B)             poet, dramatist
Abel Janszoon Tasman (1603 - 1659, NL)       explorer
Pieter de Hooch (1629 - 1684, NL)            painter
Jan van Goyen (1596 - 1656, NL)              artist
Hendrick Goltzius (1558 - 1617, NL)          artist
Simon Stevin (1548 - 1620, NL)               mathematician
Jacob Jordaens (1593 - 1678, B)              artist, painter
Jan Steen (1626 - 1679, NL)                  artist, painter,...
Jacobus Arminius (1560 - 1609, NL)           theologian
Guillaume Dufay (1400 - 1474, B)             composer
Table 4.17. The Belgian/Dutch persons with the highest rank.

From Wikipedia, we extracted a list of important 1882-born people⁸. The list
contains 44 persons. Of these 44 persons, 34 are indeed mentioned in the Google
snippets found with the queried patterns. Using the rule-based approach, we
identified 24 of these persons within the snippets. The other ones were only mentioned
once (and hence not recognized) or found in different places in the snippets, i.e.
not directly preceding the queried period. Using SNER, we identified 27 persons
from the Wikipedia list.
⁸ http://en.wikipedia.org/wiki/1882, visited January 2007
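The counts for the Wikipedia list of 1882-born people translate into the following recall values. This is a minimal sketch using only the four numbers reported above; the variable names are ours, and the distinction between recall against the full list and recall against the persons present in the snippets is our own framing.

```python
wikipedia_persons = 44  # persons on the Wikipedia list for 1882
in_snippets = 34        # of these, mentioned in the retrieved snippets
rule_based = 24         # identified by the rule-based approach
sner = 27               # identified by SNER

# Recall against the full Wikipedia list.
recall_rule_based = round(rule_based / wikipedia_persons, 2)
recall_sner = round(sner / wikipedia_persons, 2)

# No extractor working on these snippets can exceed this recall.
recall_upper_bound = round(in_snippets / wikipedia_persons, 2)

# Recall of the rule-based approach relative to what the snippets contain.
recall_within_snippets = round(rule_based / in_snippets, 2)

print(recall_rule_based, recall_sner, recall_upper_bound, recall_within_snippets)
```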
For the recall of the identified biographical relations, we observe that of the
10,000 persons that we considered, all were given a gender, 77% were given a
nationality, and 95% were given one or more professions.
Hence, we conclude that using simple methods we have extracted reliable
information on historical persons and their biographies with good recall.

BOOK                                                 TOTAL  FOUND  RECALL
The Science Book                                     156    147    0.94
The Art Book                                         358    353    0.99
The Dutch Painters: 100 Seventeenth Century Masters  108    106    0.98
Philosophy: 100 Essential Thinkers                   78     78     1.00
Herinneringen in Steen                               195    195    1.00
Scientists and Inventions                            154    154    1.00
Table 4.18. Recall for six popular scientific editions.

4.6 Conclusions
We evaluated the method to populate an ontology using a web search engine in
a number of case studies. We show that simple web information extraction
techniques can be used to precisely populate ontologies. For all studied cases, the
snippets proved to be a sufficient corpus to extract the information from. By
combining and structuring information from the Web, we create a valuable addition to
the knowledge already available.
The use of pattern-instance combinations in queries is an effective approach
to access relevant search results. We have shown that the redundancy of information
on the web enables us to precisely identify instances using the rule-based
approach. For the data-oriented approach, the use of a large and representative set of
known instances is crucial.
The relation instances in the various case studies were precisely identified using
the pattern-based approach. With manually constructed patterns as well
as with learned patterns, good results were achieved in the studied cases.

5 Application: Extracting Inferable Information From the Web
In the previous chapter, we have focused on the extraction of factual information
that is easily verifiable using structured content on the web. The evaluations give
confidence in the quality of the output of the method.
Now we focus on the extraction of information that is not present as such on the
web, but can be inferred by combining data extracted from multiple websites. We
do so by presenting two case studies. In Section 5.1 we focus on an information
demand from the Nederlands instituut voor Beeld en Geluid. Having defined a
thesaurus (or, an ontology) of keywords, the question is how to link a user-input
term to the keyword that is semantically closest. We use methods developed in the
previous chapters to address this task. Section 5.1 is based on [Geleijnse & Korst,
2007].
Section 5.2 focusses on the extraction of lyrics from the web. With numerous
fan pages and legal and less legal websites on lyrics, it is not straightforward to find a
reliable version of the lyrics for a given song. Section 5.2 is based on [Geleijnse
& Korst, 2006a] and [Korst & Geleijnse, 2006] as well as the related patent
applications (page 166). We use a rule-based approach to identify lyrics within pages
found with a search engine and combine all versions into a most plausible version
for the given song.
5.1 Improving the Accessibility of a Thesaurus-Based Catalog
To consistently annotate items and to facilitate their retrieval, cultural heritage
institutions often use controlled vocabularies for indexing their collections. For the
Nederlands instituut voor Beeld en Geluid¹ (B&G), the annotations are currently
the only basis to retrieve the audiovisual content. The maintenance of the
annotations as well as the addition of new annotated material is a costly and laborious
task. Many of the hundreds of thousands of documents are therefore only sparsely
annotated.
B&G uses a dedicated thesaurus, the GTAA (Gemeenschappelijke Thesaurus
Audiovisuele Archieven, Common Thesaurus Audiovisual Archives), as a controlled
vocabulary. Especially for the items that are briefly annotated, e.g. where a
summary of the content is missing, searching for GTAA terms is the most effective
mechanism for retrieval.
Although the use of a controlled vocabulary such as the GTAA provides a uniform
annotation over the whole collection, it also gives rise to two problems. On
the one hand, the retrieval of items – both for professionals and for the general
public – depends on the knowledge of the content of the GTAA. Proper use of the
terms in the GTAA is crucial for both indexing and retrieval. On the other hand, the
controlled vocabulary is updated from time to time as new terms become relevant.
B&G chooses to limit the size of its controlled vocabulary. Therefore, all annotations
that contain terms that are removed from the vocabulary have to be updated,
as expired terms are mapped to terms within the latest version of the GTAA.
As proper use of the GTAA is of value for the accessibility of the B&G catalog,
we focus on an assistant to identify proper terms within the thesaurus for a search
demand. Given an arbitrary search term, we want to identify GTAA terms with
a similar meaning. Such a mapping between the term and the GTAA can be of
assistance for those who want to search the catalog as it will provide more effective
search results. For those annotating an audiovisual production it can also be of use,
as it can help to find the closest terms within the GTAA.
For many languages, such as Dutch, no structured knowledge is available to
derive a mapping between an arbitrary term and the thesaurus. We therefore use
unstructured texts to extract such a mapping, by deploying techniques developed
in the fields of ontology mapping and web content mining. We derive semantic
relations between a query term and the thesaurus using search engine snippets.
We illustrate that the method presented is domain and language independent
by evaluating mappings of terms both to the Dutch GTAA and to the Agricultural
Thesaurus² (NALT) of the United States National Agricultural Library.
¹ The Netherlands Institute for Sound & Vision, http://www.beeldengeluid.nl
² http://agclass.nal.usda.gov/agt/agt.shtml

5.1.1 Related Work
Together with the development of the semantic web, the research topic of ontology
matching arose [Shvaiko et al., 2006]. In ontology matching, the task is to combine
or create relations between two separately designed ontologies. Although most
approaches are based on the structures of the ontologies combined with lexical
matches (e.g. [Shvaiko & Euzenat, 2005; Meilicke, Stuckenschmidt, & Tamilin,
2007]), web content mining has recently been deployed for this task
[Gligorov et al., 2007; Van Hage, Kolb, & Schreiber, 2006].
Web information extraction applied to the cultural heritage domain is addressed
by De Boer et al. [2007]. Here, ontologies of painters and art movements are
linked by analyzing web pages on art movements. The number of search engine
hits is used in [De Boer et al., 2006] to identify the periods corresponding to art
styles. In [Cilibrasi & Vitanyi, 2007] such numbers are used to identify relatedness
between Dutch 17th century painters. In [Navigli & Velardi, 2006] a method is
presented to create structured knowledge on the arts domain using the definitions
in a glossary. Patterns in the glosses are used to identify relations. These relations
link the concept to a named entity, extracted using a named entity recognizer (NER).
As an alternative to the use of web content mining to improve the
accessibility of the catalog, Malaisé et al. [2007] created a method to link the Dutch
GTAA thesaurus to the English WordNet [Fellbaum, 1998] via a bilingual online
dictionary. As the GTAA contains many multi-word terms and compounds, such a
mapping cannot always be found. Moreover, it is not trivial to link an arbitrary
given term via WordNet to the GTAA.
5.1.2 Problem Description and Outline
Given is a thesaurus, i.e. a list of terms and their semantic relations. Typical relations
are the broader term relation (BT) between a term and a more general term
(e.g. herring gull and seagull), its counterpart the narrower term relation (NT), and
the related term relation (RT) for terms that are associated with one another. Moreover,
a thesaurus can contain preferred and non-preferred terms. The latter refer to
the former via the use relation (US); used for (UF) is its inverse. As a thesaurus solely
consists of a set of terms and their mutual relations, it can easily be described using
the terminology posed in Chapter 2. Van Assem et al. [2004] proposed a mechanism
to convert a thesaurus to semantic web format.
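To make the relation types concrete, the following sketch stores a thesaurus as a set of terms with typed relations and adds the inverse of each relation automatically. The class and method names are hypothetical and not taken from the thesis; treating RT as symmetric is our reading of "associated with one another". The example data is the NALT entry for earthworms.

```python
from collections import defaultdict

# Inverse of each relation type: BT/NT and US/UF are each other's inverses.
# RT is treated as its own inverse (an assumption of this sketch).
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT", "US": "UF", "UF": "US"}


class Thesaurus:
    """Hypothetical minimal container: term -> relation type -> related terms."""

    def __init__(self):
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, term, relation, other):
        """Store `term --relation--> other` together with the inverse statement."""
        self.relations[term][relation].add(other)
        self.relations[other][INVERSE[relation]].add(term)

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation`, sorted."""
        return sorted(self.relations.get(term, {}).get(relation, ()))

    def terms(self):
        """Return all terms that occur in at least one relation."""
        return sorted(self.relations)


nalt = Thesaurus()
for broader in ("invertebrates", "soil invertebrates"):
    nalt.add("earthworms", "BT", broader)
for associated in ("earthworm burrows", "Lumbricidae", "vermiculture", "worm casts"):
    nalt.add("earthworms", "RT", associated)

print(nalt.related("earthworms", "BT"))
print(nalt.related("invertebrates", "NT"))
```

Storing the inverse at insertion time keeps lookups symmetric: asking for the narrower terms of invertebrates returns earthworms without a separate statement.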
Apart from the standard thesaurus relations, the GTAA also distinguishes one or
more categories for each preferred term. There are 15
main categories (e.g. sports and leisure), each containing 3 to 7 subcategories
(e.g. recreation). The terms in the GTAA are mostly in plural, but the singular forms
are added as well.
Example terms from both the GTAA and the NALT are given in Tables 5.1 and
5.2. For the GTAA term the translations in English are given. For example, the
NALT entry shows that invertebrates is a broader term (BT) for earthworms.

bioscooppersoneel (cinema personnel)
  1D05.03  economy – trades, services
  1D12.01  arts and culture – general
  1D13.02  sports and leisure – recreation
  BT       personeel (personnel)
  BT       werknemers (employees)
  NT       filmoperateurs (film operators)
  RT       bioscopen (cinemas)
  RT       film (film)
  UF       explicateurs (± silent film commentator)
Table 5.1. Example terms from the GTAA

earthworms
  BT       invertebrates
  BT       soil invertebrates
  RT       earthworm burrows
  RT       Lumbricidae
  RT       vermiculture
  RT       worm casts
Table 5.2. Example term from the NALT

Currently, detailed knowledge of the content of the GTAA thesaurus T is
crucial for describing (and redescribing) items within the catalog. Moreover, the
recall of briefly described items will improve when using search terms within the
GTAA. Hence, an assistant is desired that suggests terms from the GTAA for a
given query term. The problem addressed in this section is the following.
Thesaurus Mapping Problem. Given a term v and a thesaurus T, find the
term t ∈ T that is semantically closest to v.
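The problem statement can be read as picking the term in T that maximizes a similarity score with v. The sketch below only illustrates this structure and is not the method developed in this section: the function names are hypothetical, and the word-overlap measure is a toy stand-in for a real measure of semantic closeness, which would be derived from search engine snippets.

```python
def closest_term(v, thesaurus_terms, similarity):
    """Return the term t in T with the highest similarity(v, t).

    On ties, the first such term in `thesaurus_terms` is returned.
    """
    return max(thesaurus_terms, key=lambda t: similarity(v, t))


def token_overlap(a, b):
    """Toy stand-in similarity: Jaccard overlap of lower-cased word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# A few NALT terms taken from Table 5.2.
T = ["invertebrates", "soil invertebrates", "earthworm burrows", "worm casts"]
best = closest_term("earthworm", T, token_overlap)
print(best)
```

Passing the similarity as a parameter keeps the selection step independent of how closeness is measured, so a snippet-based score can replace the lexical placeholder without changing `closest_term`.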
