A gap (A_ij = ’*’) is counted as a word, unless it is succeeded or preceded in its row
A_i◦ by only gaps. Otherwise, it is just skipped. If the most-occurring word is not
a gap, then this word is added to the final version. If the most-occurring word is a
gap, then this is just skipped for the final version.
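The column-wise voting described above can be sketched as follows. This is only an illustrative implementation, assuming the multiple alignment is available as a list of equal-length token rows in which ’*’ denotes a gap; all function and variable names are ours, not those of the original implementation.

```python
# Sketch: build the final version by per-column voting over a multiple alignment.
from collections import Counter

GAP = "*"

def only_gaps_before_or_after(row, j):
    """True if the gap at column j is preceded or succeeded in its row by only gaps."""
    return all(t == GAP for t in row[:j]) or all(t == GAP for t in row[j + 1:])

def final_version(alignment):
    """alignment: list of equal-length token lists (the rows of the alignment matrix A)."""
    final = []
    for j in range(len(alignment[0])):
        votes = Counter()
        for row in alignment:
            token = row[j]
            if token == GAP and only_gaps_before_or_after(row, j):
                continue                      # leading/trailing gap: just skip it
            votes[token] += 1                 # other gaps are counted as words
        if votes:
            word, _ = votes.most_common(1)[0]
            if word != GAP:                   # a most-occurring gap is not added
                final.append(word)
    return final
```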
To determine the similarity between the resulting final version and the ground
truth version we simply construct an optimal 2-alignment of these versions. Each
column of this 2-alignment can be associated with one of the following four cases,
namely a match, a substitution, a gap in the final version, or a gap in the ground
truth. The fractions of columns relating to these four cases are denoted by r_ma, r_su,
r_gf, and r_gg, respectively. Clearly, these fractions are between 0 and 1 and sum up
to 1. Analogously to Knees et al., we define the recall as

rec = r_ma + r_gf    (5.6)

and precision as

pre = r_ma + r_gg.    (5.7)
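As an illustration, the four fractions and the two measures of Equations (5.6) and (5.7) can be computed from a 2-alignment as in the following sketch, where the aligned final and ground-truth versions are assumed to be given as equal-length token lists with ’*’ as the gap symbol.

```python
# Sketch: derive r_ma, r_su, r_gf, r_gg, recall and precision from a 2-alignment.
GAP = "*"

def evaluate_alignment(final, truth):
    assert len(final) == len(truth)
    counts = {"ma": 0, "su": 0, "gf": 0, "gg": 0}
    for f, t in zip(final, truth):
        if f == GAP:
            counts["gf"] += 1        # gap in the final version
        elif t == GAP:
            counts["gg"] += 1        # gap in the ground truth
        elif f == t:
            counts["ma"] += 1        # match
        else:
            counts["su"] += 1        # substitution
    n = len(final)
    r = {key: value / n for key, value in counts.items()}
    recall = r["ma"] + r["gf"]       # Equation (5.6)
    precision = r["ma"] + r["gg"]    # Equation (5.7)
    return r, recall, precision
```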
For the total set of 258 songs our algorithm obtains an average recall of 0.93 and
an average precision of 0.86. This result is very similar to the best results obtained
by Knees et al. Occasionally, recall is considerably lower than the average recall
due to the fact that in the ground truth version a chorus is repeated explicitly while
it is not in the extracted final version. Likewise, for some songs, the precision
is considerably lower than the average precision due to the fact that in the final
version a chorus is repeated explicitly while it is not in the ground truth version.
Since these differences cannot really be considered as errors, we also determined
the average value of r_su. This is only 0.02. In other words, in the alignments
of the extracted version with the ground truth version, only 2 out of 100 words
correspond to a substitution. We note that these substitutions still contain many
pairs such as (movin’, moving), (yeah, yea), (’re, are) that cannot really be considered
as wrong.
The above results are averaged over all 258 songs. However, for 7 of the 258
songs the recall and precision are substantially below the above averages because
the algorithm selected the lyrics of another song. In all seven cases, the selected
song was by the same group or artist. In addition, in four of these seven cases
the song title of the intended song appears in the lyrics or even in the title of the
extracted song. For example, when searching for the lyrics of A Long Way From
Home by The Kinks, the lyrics of Long Distance were found. These lyrics contain the
string ‘a long way from home’. For three of the seven cases, the intended song was
found as the second-largest cluster. For the four other cases, the clustering resulted
in many small clusters, with an average fraction of outliers of 0.70, which could be
used as an indication that something is wrong. Furthermore, when extracting the
lyrics of all songs of a given artist, it can be easily checked whether the extracted
lyrics for different songs incidentally are (very) similar. Hence, these errors can be
detected automatically.
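One possible way to automate this check is sketched below: compare the extracted lyrics of all songs by one artist pairwise and flag pairs whose similarity exceeds a threshold. The similarity measure and the threshold of 0.9 are illustrative choices, not part of the original method.

```python
# Sketch: flag suspiciously similar lyrics among the songs extracted for one artist.
from difflib import SequenceMatcher
from itertools import combinations

def flag_similar_lyrics(lyrics_by_title, threshold=0.9):
    """lyrics_by_title: dict mapping song title -> extracted lyrics string."""
    suspects = []
    for (title_a, text_a), (title_b, text_b) in combinations(lyrics_by_title.items(), 2):
        similarity = SequenceMatcher(None, text_a, text_b).ratio()
        if similarity >= threshold:
            suspects.append((title_a, title_b, similarity))
    return suspects
```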
Comparison with the Yahoo! Music Collection
Since April 2007, Yahoo! Music has provided access to song lyrics for “hundreds
of thousands of songs”, being “the largest database of high quality lyrics” [3,4]. To
investigate whether the algorithm we presented is now obsolete, we compare the
results of our algorithm with the Yahoo! Music lyrics collection.

[3] http://www.gracenote.com/corporate/press/article.html/date=2007042400
[4] It is notable that the (growing) content of Yahoo! Music is restricted to material where
the copyright is granted. The experiment with Yahoo! Music was conducted on August 3, 2007.

Cathy Dennis - Touch Me (All Night Long)
Floyd Cramer - On the Rebound
Frank Mills - Music Box Dancer
Groove Theory - Tell Me
Horst Jankowski - A Walk in The Black Forest
Inner Circle - Bad Boys

Table 5.16. Examples of songs that were not retrieved by the algorithm.
An external company handed us a set of 609 song titles. We test our algorithm
on this collection and compare the results of our method with the content of the
Yahoo! Music lyrics database. The set mainly contains well-known artists and
songs from various genres.
For each song, we query
Yahoo! Music at most three times. In contrast to the
experiment with the 258 songs, this collection contains a number of song titles
and artist names containing parentheses. If the query lyrics, [songtitle],
[artist] fails, we remove the texts between parentheses in both the song title
and the artist name (e.g.
Blowin’ Me Up (With Her Love) is now queried as Blowin’
Me Up). If no results are found after the adaptation, we leave out the artist name
in the query.
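A sketch of this three-step query fallback is given below. The exact query formatting and the search_lyrics helper are illustrative assumptions; any search interface returning a (possibly empty) result list could be plugged in.

```python
# Sketch: query with song title and artist, then without parenthesized parts,
# then without the artist name.
import re

def strip_parentheses(text):
    return re.sub(r"\s*\([^)]*\)", "", text).strip()

def find_lyrics(song_title, artist, search_lyrics):
    queries = [
        f"lyrics, {song_title}, {artist}",
        f"lyrics, {strip_parentheses(song_title)}, {strip_parentheses(artist)}",
        f"lyrics, {strip_parentheses(song_title)}",     # leave out the artist name
    ]
    for query in queries:
        results = search_lyrics(query)                  # hypothetical search helper
        if results:
            return results
    return []
```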
As no ground truth is available for this set, we manually inspect the output of
the algorithm. We consider the retrieved lyrics to be correct if we recognize them
as the lyrics corresponding to the queried song.
The results of this experiment are as follows. Using the algorithm as described
in this article, we find the lyrics of 577 songs. For only 32 songs we either did not find any
lyrics, or the lyrics retrieved did not correspond to the song; see Table 5.16 for
some examples.
When we query the
Yahoo! Music lyrics collection, we only find 191 correct
lyrics for the given song-artist combinations. Additionally, 12 lyrics of different
versions of the queried song were found. For example, for
Strangers in the Night
by
Frank Sinatra, the version of the song by Bette Midler was finally retrieved
using the third query. For 108 songs, lyrics to a different song were found. Hence,
no results were found for 301 out of the 609 songs in the collection.
Of the songs that were not found by the algorithm presented, three could be
found in
Yahoo! Music (see Table 5.18). Table 5.17 contains some example songs
that were not found in
Yahoo! Music, but were indeed retrieved using our algo-
rithm.
Beck - Loser
Crosby, Stills, Nash & Young - Woodstock
Aerosmith - Angel
Anita Baker - Giving You the Best That I Got
Bob Marley & The Wailers - Who Is Mr. Brown

Table 5.17. Examples of songs that were retrieved by the algorithm, but were not
found in Yahoo! Music.
Inner Circle - Bad Boys
Chitty Chitty Bang Bang Original Cast - Chitty Takes Flight (Finale to Act One)
Groove Theory - Tell Me

Table 5.18. The three songs that were found in Yahoo! Music, but could not be
retrieved using the algorithm.
Although the lyrics provided by
Yahoo! Music may be of high quality, this ex-
periment shows that some well-known songs are not included. As no complete and
reliable web site is available for collecting lyrics, the algorithm described remains
a valuable tool for music research.
5.2.7 Concluding remarks
We have presented an approach to retrieve lyrics versions from the web using a
search engine, and to efficiently align them. In comparison to the approach by
Knees et al., our approach is much more efficient but nevertheless gives comparable
results. A second experiment illustrated that the algorithm is also able to find lyrics
of songs that are not stored in the large lyrics collection of
Yahoo! Music. The algorithm as
presented in this article can be a valuable tool for those researching lyrics-based
music information retrieval [Kleedorfer, 2008; Mahedero et al., 2005]. Moreover,
the lyrics found can be used as a basis for automatic lyrics synchronization [Chen
et al., 2006; Y. Wang et al., 2004] and creating visual effects using images and
colored lights [Sekulovski et al., 2008; Geleijnse et al., 2008].
6 Discovering Information by Extracting Community Data
Apart from factual information, the web is also a valuable source for gathering
community-based data, as people with numerous backgrounds, interests and ideas
contribute to its content. Hence the web can also be used to
extract opinions, characterizations and perceived relatedness between items.
We extract and combine information from diverse sources on the web to char-
acterize items such as
Madonna and The Great Gatsby using community-based
data. By combining information from various sources such as fan pages, newspa-
per reviews, gossip magazines and music websites, the aim is to create a character-
ization of, for example,
Madonna as expressed on the web. By combining this data,
we create new information that may not be verifiable, as it is not available as such.
This chapter is organized as follows. In Section 6.1, we discuss the problem
definition and two alternative methods to extract information from the web. In
Section 6.2 we present a method to process extracted data into characterizations.
Section 6.3 focuses on the evaluation of the extracted community-based data using
data from a social website, while in Section 6.4 we present a number of case studies
followed by conclusions in Section 6.5.
6.1 Extracting Subjective Information from the Web
In this chapter, we are interested in the characterization of an item or concept by the
web community. For example, given the latest novel by
Philip Roth or Madonna’s
new single, we want to know the way people describe such items. Moreover, given
a book or an artist, which other books or artists are considered to be related?
Users of so-called
social websites – or folksonomies – such as Flickr.com,
YouTube.com and Last.fm are invited to label the items described on these sites.
Unlike the thesauri studied in Section 5.1, the tags applied to the items have no
formally defined semantics, and the vocabulary is uncontrolled. However, in
practice tagging has proven to be an effective mechanism to describe and retrieve
content.
For a collection of items to be well searchable, a large and active community
is required that
explicitly labels the items with tags. Items that are not labeled or
labeled with less intuitive tags may thus not be retrievable. However, such commu-
nity websites describe items that are often also described on many other web pages.
Users are thus invited to enter knowledge that is potentially already available on
the web.
Current ontology population methods based on texts on the web (e.g. [Etzioni
et al., 2005; McDowell & Cafarella, 2006]) focus on factual information rather
than on more subjective, community-based descriptors of items. Here we present
methods to efficiently identify and structure information of the kind found on social
websites. We focus on the labeling of items, such as musical artists, with tags
from unstructured web sources. Hence, we propose a method where the tagging of
artists is done implicitly by the web community. We thus compute the semantics of
an item (e.g. a musical artist) in terms of tags as perceived by the web community.
Previous methods (e.g. [Mika, 2007; Cilibrasi & Vitanyi, 2007; Schedl, Pohle,
Knees, & Widmer, 2006]) use a quadratic number of queries to a search engine. In
this chapter, we compare the efficient techniques discussed in Chapter 2 with such
approaches.
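To illustrate the difference in scale only (the numbers below are made up): co-occurrence approaches that query a search engine once for every item-tag pair need a number of queries that grows with the product of the two class sizes, whereas per-item extraction techniques such as those of Chapter 2 need only a small, fixed number of queries per item.

```python
# Sketch: rough query counts for a pairwise co-occurrence approach versus a
# per-item extraction approach (illustrative numbers only).
n_items, n_tags = 1000, 250

pairwise_queries = n_items * n_tags      # one page-count query per (item, tag) pair
per_item_queries = n_items * 3           # e.g. a few pattern queries per item

print(pairwise_queries, per_item_queries)   # 250000 3000
```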
A method to automatically label items with tags can on the one hand be used
as an alternative to the labeling of items by a community. On the other hand, the
computed tags can be used in support of a community website. For example,
computed tags can be presented as a suggestion to the user or can be used to avoid
a ‘cold start’ problem for items in a collection that have not been labeled yet.
6.1.1 Relatedness, Categories and Tags
To describe instances using the collective knowledge of the web community, we
thus adopt the notion of tags. In this chapter, we assume that a collection of in-
stances (books, popular artists, painters, etc.) is given as well as a list of relevant
descriptors or tags. This leads to the ontology population problem with complete
classes (Chapter 2).
Problem Definitions. Given is an ontology O with two complete classes c