Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	27/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 23 24 25 26 27 28 29 30 ... 57

v, the task is to identify a mapping, i.e. the most relevant term in T . We search the
three lists to select the ‘
best of three’.
This
best of three term is selected as follows. We select the term with the
highest average rank that is found by all methods. If no such term exists, we select
the term with the highest average rank over two of the three methods. We leave
this ‘
best of three’ field blank if no term is identified by more than one method.
Finally, the
set of winners is identified, consisting of at most four terms: the
best scoring terms for the individual methods plus the
best of three.
As the algorithm presented is intended to be an assistant for users of the catalog
rather than a fully automatic mapping, we also present the outputs of the individual
methods. An
HTML
page is generated where the terms are linked with the corre-
sponding entries in the thesaurus. Hence, even if the best suited term is not found,
the user can navigate to this term by clicking a closely related term.
As the number of queries is linear in the number of patterns for a given in-
put term, a real-time application is possible. With an (inefficient) implementation
where 21 queries (1 for the categories and 10 for both the hyponyms and enumer-
ation patterns) per term are performed sequentially, the method returns the results
within a minute.
5.1.8 Experimental Results
In this section we present experiments with two thesauri. We evaluate the perfor-
mance when mapping terms to the Dutch
GTAA
for the audiovisual archives, where
in the second part of this subsection we use the United States
NALT
agricultural
thesaurus.
To be of assistance, the relevant terms within the results of the method should
be observable at a single glance. We therefore not only analyze the performance
of the
best of three term, but also the precision of the three individual methods and
the average ranking of the terms in the benchmark set.
Experiments with the GTAA
We performed two experiments with the
GTAA
. In the first experiment, we map a
set of ‘expired terms’ to the thesaurus, where in the second we use a ‘leave-one-
out’ strategy to evaluate the recall and precision of the method. That is, we remove
a term t from the thesaurus T and map this term to T \{t}.
Mapping expired terms to the GTAA. As novel items are constantly added to
the archive of B&G, the
GTAA
is updated from time to time as well. A major

5.1 Improving the Accessibility of a Thesaurus-Based Catalog
85
[query] [keyword]
[query] en [keyword]
[keyword] en [query]
[keyword] [query]
[keyword] zoals [query]
[query] en andere [keyword]
[keyword] als [query]
[keyword] of [query]
[query] of [keyword]
[keyword] van [query]
[query] [keyword]
[keyword] en [query]
[query] en [keyword]
[keyword] [query]
[query] of [keyword]
[query] en de [keyword]
[query] de [keyword]
[keyword] of [query]
[query] in [keyword]
[query] van [keyword]
Table 5.3. The 10 hyponym patterns (left) and enumeration patterns (right) used
for the
GTAA
.
problem is replacement of ‘expired terms’ with terms within the latest version of
the
GTAA
.
In this experiment we discuss the applicability of our method to resolve this
problem. As a benchmark set, B&G provided us a list of 78 pairs of such expired
terms and the terms within the (current)
GTAA
to replace them.
We have automatically learned the patterns (Chapter 3) for the hyponym and
enumeration relations by selecting all terms in the thesaurus starting with a – e and
their
BT
or
RT
respectively. For each of the two methods, we use the 10 patterns
that are found to be most effective (Table 5.3). We use [query] as a placeholder
for the term to be queried (thus outside the thesaurus) and [keyword] denotes the
place in the snippets where we search for terms within the thesaurus. It is notable
that there is an overlap between the patterns for the two relations. The patterns
zoals and en andere are Dutch translations of the patterns first identified by Hearst
[1992] and translated by IJzereef in [2004].
The results for the test with the expired terms can be found in Tables 5.4 and
5.5. Table 5.4 shows the accuracy of the highest ranked terms for the three indi-
vidual methods as well as the accuracy for the
best of three and set of winners. It
shows that the
best of three provides the correct term – i.e. the benchmark – in 13
cases, while the highest scored terms with the hyponym, enumeration and lexical
methods are correct in only 12, 4, and 7 cases respectively. For 15 terms, no
best
of three could be identified.
We have also analyzed the recall of the terms that are
1 click away from the
benchmark term by either the
US
,
BT
,
NT
or
RT
relation. Hence, these terms found
have a closely related meaning. For example, given the term
bioscooppersoneel

86
WITHOUT CATEGORIES
USING CATEGORIES
benchmark
1 click away
benchmark
1 click away
#1 hyponyms
13
24
12
22
#1 enumerations
2
13
4
16
#1 lexical
6
11
7
15
best of three
14
21
13
24
set of winners
20
41
18
41
Table 5.4.
Performance for the 78 expired terms. We compare the precision
using the computed categories and subcategories (l.) with the approach where no
(sub)categories were assigned to the expired terms.
(see Table 5.1), the terms
personeel, werknemers, filmoperateurs, bioscopen, film
and
explicateurs are all linked to this term. If the method select either one of these
terms, the user can navigate in one step to the term
bioscooppersoneel. We analyze
the number of times one of the
one click away terms is among the search results.
The average rank is computed using the
one click away term with the highest rank.
If we analyze the term selected as the
best of three, then 24 out of 78 are one
click away from the term in the benchmark set. The set of winners contains such a
term in 41 out of the 78 cases.
Table 5.4 shows that the performance of the lexical method improves when
we take the category information into account. Contrary to our assumptions, the
results of the method using hyponym patterns does not improve when using the
category information. For the given benchmark set, the ranking of the enumeration
patterns is slightly improved when using the category information.
Table 5.5 focusses on presence of the benchmark term within the full ordered
lists. Depicted is the number of times the actual benchmark term is identified and
their average rank. Hence, the benchmark term is identified in 46 out of the 78 cases
using the hyponym patterns. However, the average rank of both the benchmark
term and of the terms with distance 1 is better using the enumeration patterns. Here
we see that the use of category information has a positive effect on the ranking of
the method using the hyponym patterns.
When analyzing the results of the methods, we encounter numerous terms
found that are intuitively correct. In Table 5.6 we give a number of examples
of expired terms, the
best of three alternative found and the benchmark as given by
B&G.
For example, for the term
tabakswinkels (tobacco shop) the term in the bench-
mark set,
detailhandel (retail trade), was not found. However, the found sugges-

5.1 Improving the Accessibility of a Thesaurus-Based Catalog
87
WITHOUT CATEGORIES
USING CATEGORIES
benchmark
1 click away
benchmark
1 click away
recall hyponyms
46
70
46
70
average ranking
14.84
8.14
9.45
7.72
recall enumerations
16
44
16
44
average ranking
7.43
3.63
7.31
3.47
Table 5.5. Recall and average rank for the 78 expired terms.
term
best of three
benchmark
tabaksplanten
planten
tabak
tobacco plants
plants
tobacco
tabakswinkels
winkels
detailhandel
tobacco shops
shops
retail trade
tegelzetters
bouwvakkers
ambachten
tiler
construction workers
crafts
titanium
metalen
chemische elementen
titanium
metals
chemical elements
toxicologie
geneeskunde
vergiftigingen
toxicology
medical science
poisonings
troepen
militairen
krijgsmacht
troops
soldiers
armed forces
tweeverdieners
gezinnen
inkomens
two-earner family
families
incomes
Table 5.6. Example terms and their English translations with their ’best of three’
and benchmark mapping.
tions
kiosk, sigaretten (cigarettes) and winkels (shops) seem valid alternatives as
well. For
treinongelukken (train accidents), we find ongelukken (accidents), ram-
pen (disasters) and verkeersongelukken (traffic accidents), where the correct term
was
spoorwegongelukken (railway accidents).
To test the effect of the choice of the values for q(t, v), we varied the value for
the main category. We compute the ranks for the three individual methods by incre-
mentally increasing the value for the main category, from 0.1 (the value for
sharing

88
benchmark
#1 lexical
138
#1 hyponyms
71
#1 enumerations
87
best of three
159
set of winners
239
benchmark
recall hyponyms:
343
average ranking:
13.66
recall enumerations:
146
average ranking:
2.12
Table 5.7. Performance: recall and average rank for the 573
GTAA
terms
no category) to 1 (the value for sharing a subcategory). For this experiments, the
differences encountered were negligible. Hence, q(t, v) can be simplified by only
distinguishing two cases: either v and t share a subcategory or not. The use of the
subcategory information is of particular use for the lexical- and enumeration-based
methods.
Leave one out. In this second experiment with the
GTAA
, we use the thesaurus
itself as a benchmark set. We proceed as follows. We select a term t within the
GTAA
that has a broader term b. We then remove t from the thesaurus and use this
thesaurus T \ t as a reference. The task is now to find the mapping of the term
outside the thesaurus (i.e. t) to b.
We use the same patterns as in the previous experiment. For fairness, we there-
fore will only evaluate with terms in the thesaurus starting with the letters
f to z
that have a broader term. This resulted in a test with 573 terms. When a term has
multiple broader terms, in the evaluations we focus on the best scoring one.
The results for these tests are given in Table 5.7. It shows that in 239 out of
the 573 terms (i.e. 42%) the correct term is among the set of winners (of size at
most 4). Table 5.7 shows again that the recall using the hyponym patterns is larger,
but the ranking of the enumeration-based method is more precise. The lower recall
using the enumerations can be explained by the structure of the thesaurus. As the
GTAA
is quite flat, for many terms found within the snippets no broader term is
defined.
As an example, in Table 5.8 we give the best scoring output for the query term
fietspaden (bicycle tracks). The broader term in the thesaurus is infrastructuur
(infrastructure).
Given the difficulty of the tasks, and the fact that the mappings chosen in the
benchmarks are sometimes debatable, we consider the results of the experiments
convincing. As the correct answer is present in the majority of the cases (as the
hyponym pattern method found 343 out of 573 correct mappings), the method

5.1 Improving the Accessibility of a Thesaurus-Based Catalog
89
best of three:
paden
lexical:
paden
paths/tracks
hyponyms:
wegen
roads
hyponyms:
trottoirs
sidewalks
hyponyms:
paden
hyponyms:
meren
lakes
hyponyms:
padden
toads
enumeration:
infrastructuur
infrastructure
enumeration:
paden
enumeration:
openbare voorzieningen
public services
enumeration:
wegen
enumeration:
beroepen
professions
Table 5.8. Best scoring output for
fietspaden
(bicycle tracks).
can be of value as an assistant for those working with the
GTAA
or the catalog of
B&G. The experiment with the expired term showed that the determination of the
categories improves the performance of the lexical approach. The results suggest
that this preliminary step can be omitted for the other two approaches.
Experiment with NALT
To illustrate that the methods used are suited for English as well, we perform the
last experiment with the Agricultural Thesaurus (
NALT
) of the US National Agri-
cultural Library.
The
NALT
contains Latin names for animals and plants, names for molecules
and bacteria, but also product names such as
Brie Cheese, champagne and fish
steaks.
We learn the patterns using the terms starting with the letter
a. The patterns
found for
NALT
are given in Table 5.9. As the
NALT
does not categorize the terms,
we omit the step as described in Section 5.1.3. As the
NALT
consists of 68581
terms, we also leave out the collection of the number of search engine hits for each
thesaurus term, as this would require 14 days using the
Yahoo!
API
.
We test the performance of the methods on the 3321
NALT
terms starting with
the letters
b to d that have a broader term. The results are given in Table 5.10. As an
illustration Table 5.11 gives the output for the term
dietary cation anion difference,
where
feed formulation is its broader term in the
NALT
. The right mapping for the
(more common term)
bitterness indeed was found (Table 5.12).
For the hyponym-pattern based approach, typically a long list of terms is found
that all co-occur once with the queried term. As no frequency information is avail-

90
[query] [keyword]
[query] and [keyword]
[query] are [keyword]
[query] or [keyword]
[keyword] and [query]
[query] the [keyword]
[query] and other [keyword]
[keyword] such as [query]
[keyword] including [query]
[keyword] or [query]
[query] [keyword]
[query] and [keyword]
[keyword] and [query]
[query] or [keyword]
[query] are [keyword]
[query] such as [keyword]
[query] of [keyword]
[keyword] or [query]
[query] as [keyword]
[query] for [keyword]
Table 5.9. The 10 hyponym (l.) and enumeration (r.) patterns used for
NALT
.
benchmark
#1 lexical
169
#1 hyponyms
10
#1 enumerations
100
best of three
192
set of winners
286
benchmark
recall hyponyms:
468
average ranking:
41.02
recall enumerations:
301
average ranking:
8.39
Table 5.10. Performance for the 3321
NALT
terms
able, the ranking is just alphabetic. Using the enumeration method however, less
terms are found. Moreover, as multiple hyponyms contribute to the score of their
hypernym, the scores for the terms found tend to differ. Hence, although the num-
ber of correct mappings found with the enumeration patterns is smaller, the ranking
of the method is much more reliable than the hyponym-based method.
It immediately shows that the results for
NALT
are far more modest. However,
given the nature of the
NALT
and the fact that we did not correct for the frequencies
of the terms, we consider the results as a proof of concept that this method is also
applicable to another domain and language.
5.1.9 Conclusions
We have developed an algorithm to assist people to find alternative terms within a
thesaurus for a given query term.
The algorithm developed combines three approaches to map a term to a term
within a thesaurus. We use both texts found with
Yahoo! as well as a simple
lexical matching technique to make the mapping. The algorithm is constructed in-

5.1 Improving the Accessibility of a Thesaurus-Based Catalog
91
hyponyms:
ammonia
hyponyms:
anions
hyponyms:
buffering capacity
hyponyms:
fever
hyponyms:
literature
hyponyms:
milk
hyponyms:
milk fever
hyponyms:
placenta
hyponyms:
retained placenta
hyponyms:
salts
hyponyms:
species differences
hyponyms:
urea
hyponyms:
viscosity
enumeration:
periparturient diseases and disorders
enumeration:
pregnancy complications
Table 5.11. The output for
dietary cation anion difference
.
best of three:
flavor
...
hyponyms:
face
hyponyms:
families
hyponyms:
fear
hyponyms:
flavor
hyponyms:
food choices
hyponyms:
garlic
...
enumeration:
flavor
enumeration:
ketones
enumeration:
thermodynamics
enumeration:
physics
enumeration:
light
enumeration:
grapes
Table 5.12. Part of the output for
bitterness
; 116 alternatives with the same score
were found using the hyponym patterns.
dependently from the content of the thesaurus and can easily be mapped to another
language. The combination of independent methods lead to considerably better
performances than any of the individual methods.
The method can facilitate searching a catalog with the use of index terms, since

92
the algorithm can present a small number of alternative thesaurus terms for a given
term. This reduces the number of alternatives from the 5,000
GTAA
terms to only
a handful. The experiments with the
GTAA
show that the algorithm indeed can be
usable as an assistant to find the right terms within the thesaurus.
5.2 Extracting Lyrics from the Web
In the second section of this chapter, we focus on a rather different application of
information extraction: the identification and construction of lyrics from the web.
We present an approach to automatically retrieve and extract lyrics of arbitrary
popular songs from the web. An increasing amount of music is distributed via the
Internet without the lyrics being included. Our approach offers the possibility to
retrieve the lyrics of popular songs with little or no user effort allowing them to be
read or sung during playback.
Lyrics are also used in karaoke-like settings. Applications that synchronize
the music with its lyrics are the focus of ongoing research [Y. Wang, Kan, Nwe,
Shenoy, & Yin, 2004; Iskander, Wang, Kan, & Li, 2006; Chen, Gao, Zhu, & Sun,
2006; Geleijnse et al., 2008]. These methods take both the audio-file and the lyrics
as input. In this section we show that the retrieval of lyrics can be done automati-
cally.
The lyrics of a song can also be used to extract additional information on the
corresponding song. For example, if we want to create a playlist with, say 60%
Christmas songs, the use of lyrics is the most obvious method to achieve this goal.
Earlier work by Logan et.al. [Logan, Kositsky, & Moreno, 2004] focussed on the
semantic analysis of lyrics. A set of songs was classified by genre using either
features extracted from audio or from the lyrics. Their results indicate that the
combination of the two could result in a better classification. Where Logan et.al.
used a model where genres were associated with frequently occurring words in
lyrics of the genre, other directions can be taken as well to extract information
from the lyrics. The lyrics can be used to detect the topic of a song (e.g. Christmas,
New York, lost love, summer holiday), its mood [Balog, Mishne, & De Rijke,
2006], the song’s structure or the language in which it is sung [Mahedero, Mart´ınez,
Cano, Koppenberger, & Gouyon, 2005]. Hence such meta-data extracted from
the lyrics can be used to find similar songs. Feature extracting from lyrics may
thus be a valuable addition next to acoustic features extracted from audio (e.g.
[McKinney & Breebaart, 2003]) and external sources such as reviews [Dave &
Lawrence, 2003; Whitman & Ellis, 2004] or arbitrary web pages on music [Knees
et al., 2004; Geleijnse & Korst, 2006c].
On the web several lyrics sites offer the lyrics of large collections of songs. So,
users could access one of these lyrics sites. Apart from having to extract the lyrics

5.2 Extracting Lyrics from the Web
93
from the pages manually, these sites have two disadvantages. Each of the sites
offers only a relatively small part of the total amount of available songs. In addition,
the lyrics they offer are rather error prone. The lyrics sites usually depend on
lyrics that have been uploaded by end users. Detailed analysis reveals considerable
differences in the lyrics offered by the various sites for the same song.
In this section we present an approach to retrieve multiple lyrics versions of a
given song using a search engine. This also offers access to the lyrics of songs that
can be found on the web but do not appear at popular lyrics sites. Furthermore, by
aligning the multiple lyrics versions we give direct insight into the differences in
these lyrics versions.
Our approach consists of the following steps. Using only the song title and
artist name, web pages are extracted that potentially contain lyrics of the song.
Usually such web pages contain, in addition to the lyrics, other material, including
surrounding text, advertisements, etc. From each of these web pages, the text
fragment that is expected to comprise the lyrics is identified and extracted. Next,
the multiple text fragments are compared to remove outliers. In order to compute
a ‘correct’ version of a song’s lyrics, the remaining text fragments are aligned on a
word-by-word basis, aiming to maximize the number of matching words.
To the best of our knowledge, Knees, Schedl & Widmer [2005] are the only
ones that consider web-based lyrics extraction. Our approach differs from theirs in
the following aspects. Although they use a similar method to acquire web pages
containing the lyrics, the subsequent steps of our approach are far more efficient.
Before carrying out the actual alignment, we reduce the number of text fragments
by removing outliers (i.e., text fragments that do not relate to the lyrics) by using a
fingerprinting method. In addition, Knees et al. perform an approximate multiple
sequence alignment of n web documents by aligning each pair of texts in a hier-
archical fashion, requiring a total of n
2
log n alignments of web document pairs.
In contrast, we select one reference text fragment and align each of the other text
fragments with this reference text, requiring only n alignments of text fragment
pairs, while retaining the same quality of solutions.
The outline of this section is as follows. First, we discuss an approach to iden-
tify a collection of documents that are likely to contain the intended lyrics (Sec-
tion 5.2.1). In Section 5.2.2 we present an algorithm to extract the lyrics from these
documents. The collected set of lyrics is likely to contain other texts than solely
the lyrics of the intended song. We present an efficient method to remove outliers
in Section 5.2.3. Having identified a set of lyrics that are expected to reflect the
intended song, we present a multiple sequence alignment algorithm to construct a
single version of the lyrics in Section 5.2.4. In Section 5.2.5 we discuss the com-
putational complexity of the proposed algorithms and make a detailed comparison
with the algorithm of Knees et al. We present experimental results in Section 5.2.6

94
allinanchor:
“[Artist]”, “[Title]”, lyrics
allintitle:
“[Artist]”, “[Title]”, lyrics
allinanchor:
“[Title]”, lyrics
allintitle:
“[Title]”, lyrics
“[Title]”, lyrics
Table 5.13. The query templates used to retrieve pages containing relevant lyrics.
and end with concluding remarks in Section 5.2.7.
5.2.1 Website-independent Lyrics Extraction
In this section we describe how we retrieve and select a number of texts that each
potentially constitute the lyrics of a given song. We assume that we are only given
the title of the song and the name of the performing artist. The lyrics extraction
algorithm consists of two steps. First we collect
URL
s of documents that potentially
contain the lyrics of the song of interest (Section 5.2.1). Since we are interested in
the lyrics themselves – and not in the documents as a whole – we then extract the
lyrics from these documents using a website independent approach (Section 5.2.2).
Collecting URLs of Documents
We first retrieve the
URL
s of documents that potentially contain the lyrics of the
intended song. We focus on heterogeneous sources and opt for an approach that is
website independent.
To obtain potentially relevant documents we use a list of queries with place-
holders. For our experiments we heuristically constructed the list of query tem-
plates in Table 5.13. The expressions [Artist] and [Title] are placeholders
for the artist name and song title.
The first query returns web pages that contain the exact strings of the song title
and artist name (by putting them between quotes) and the word
lyrics in the anchor
text surrounding the links to these pages (by adding the allinanchor clause).
The anchor text is the text that directly relates to a link and can be considered as a
condensed summary of the content of the page.
As an alternative to the allinanchor option we also use the allintext
option. For this application, the allinanchor-option usually offers more hits than
the allintitle-option, where the allintitle-option requires that the terms in the query
appear in the title of the web page. The allinanchor-option is especially worthwhile
for rare songs, for which the allintitle-option may give no or only a small number

5.2 Extracting Lyrics from the Web
95
of results. For the more well-known songs, the top-ranked results for both options
seem to be of equal quality.
We store the
URL
s of the search results of the first query in a list L in the order
as they are presented by the search engine. The results of the following queries are
added to the tail of L, where we omit double
URL
s.
5.2.2 Extracting Lyrics from the Documents
After retrieving the list of relevant
URL
s, we want to extract the lyrics from the
corresponding documents.
We make use of the textual structure of lyrics. Within a book, we can observe
at a single glance whether the text is prose or poetry. Like poetry, lyrics typically
consist of stanzas separated by blank lines. Each stanza consists of one or more
lines, where each line has a maximum number of characters. We opt for a rule-
based approach to recognize the lyrics in text (Chapter 3.2.2).
In hypertext, the end of a line is marked with
-tag. Each line within the
lyrics will thus end with such a tag. We assume the markup within the lyrics to be
constant. Hence, the only tags within the lyrics part of the
HTML
source file of the
page will be the end-of-line markers.
We use these characteristics to identify lyrics in a hypertext as follows. Let

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 23 24 25 26 27 28 29 30 ... 57