5.2 Extracting Lyrics from the Web
5.2.3 Selecting a Subset of Lyrics
After having collected a number of URLs of documents and identified a number
of potential lyrics (the set P) from the documents collected, the task remains to
remove the texts that are not the lyrics of the intended song s.
The set P of text fragments that is extracted as described in the previous section
is likely to also contain text fragments that do not relate to the lyrics of the
intended song, for a number of reasons, such as the following.
- A text fragment can be the lyrics of another song by the same artist (especially
if the title of the intended song is a subsequence of the title of the other song).
- A text fragment can be the listing of an album’s songs in which the intended
song appears (especially if the song title is identical to the album title).
- A text fragment can be a listing of a playlist.
In this stage we want to remove these so-called outliers, since they do not
reflect the intended lyrics. We use the assumption that the majority of the extracted
text fragments constitutes the lyrics of the intended song.
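One way to act on this majority assumption is to group fragments that are approximately similar and keep only the largest group. The following is a minimal sketch of that idea, not the method used in this chapter: the similarity measure (the ratio from Python's difflib), the threshold of 0.8, the single-linkage grouping, and the function names similarity and largest_cluster are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate similarity in [0, 1] between two text fragments."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences, which would distort the ratio.
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


def largest_cluster(fragments: list[str], threshold: float = 0.8) -> list[str]:
    """Link fragments whose similarity reaches the (assumed) threshold and
    return the members of the largest linked group, in original order."""
    n = len(fragments)
    parent = list(range(n))  # union-find forest over fragment indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage: any sufficiently similar pair joins two groups.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(fragments[i], fragments[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters: dict[int, list[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    if not clusters:
        return []
    best = max(clusters.values(), key=len)
    return [fragments[i] for i in best]
```

Called on a list that mixes several slightly different transcriptions of one song with a track listing, the function is expected to return only the transcriptions, provided they really do form the majority. The pairwise comparison is quadratic in the number of fragments, which is acceptable for a small candidate set but would need a cheaper measure for large ones.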
We cluster the text fragments on the basis of similarity and retain only the text
fragments in the largest cluster. As variations frequently occur in representations
of lyrics to the same song, exact string matching is unsuited for this purpose.
Approximate string matching techniques of strings of lengths s_0 and s_1 are in general