5.2 Extracting Lyrics from the Web
retrieval of our algorithm with the retrieval of the Yahoo! lyrics service, which
uses the GraceNote lyrics collection.
The 258 Song Test Set
Our lyrics extraction and alignment algorithms were tested on the set of 258 songs
from Knees, Schedl & Widmer [2005], for which we obtained the ground-truth versions
from these authors. The ground-truth versions are the lyrics exactly as they appear
in the CD booklets. We next give experimental results for the successive steps in
the algorithm.
Collecting documents. To give the reader an idea of the number of documents that
are expected to contain the lyrics of the various songs: for the 258 songs, Google
reported an average of 507 hits for the first query (containing the allinanchor
option). However, for 6 songs this first query returned no hits.
Extracting lyrics. Extracting the lyrics from the documents gives a substantial
reduction in size. On average, the size of the extracted lyrics is only 7% of the
original document size. However, the reduction is rather modest in comparison to
the size of the documents after they have been stripped of HTML tags and
corresponding links: on average, the size of the extracted lyrics is 79% of the
stripped document size.
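
The two size measures above can be illustrated with a minimal sketch. It assumes
that sizes are measured in characters and that "stripped" means dropping all tags
together with the text inside links, scripts and style blocks; the text does not
specify either detail. The names TagStripper, strip_html and size_ratios are
hypothetical, and the lyrics extraction step itself is not shown.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text outside tags.

    Assumption: text inside <a>, <script> and <style> elements is dropped,
    as one possible reading of "HTML tags and corresponding links".
    """

    SKIPPED = {"a", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(document):
    """Return the document text with tags and skipped elements removed."""
    stripper = TagStripper()
    stripper.feed(document)
    stripper.close()
    return "".join(stripper.parts)


def size_ratios(document, lyrics):
    """Return (lyrics/original, lyrics/stripped) size ratios in characters."""
    stripped = strip_html(document)
    return len(lyrics) / len(document), len(lyrics) / len(stripped)
```

Averaging the two returned ratios over all retrieved documents would give figures
comparable in kind to the 7% and 79% reported above, under the stated assumptions.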
Removing outliers. On average, 38% of the extracted text fragments were found
to be outliers. Comparing the size of the largest cluster with that of the second
largest, we find that on average the first is four times as large as the second.
Hence, on average, there is a clear winner among the clusters.
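
As an illustration of the cluster comparison, the following sketch computes the
ratio between the largest and the second-largest cluster for one song. It assumes
that each extracted fragment has already been assigned a cluster label; the
function name cluster_size_ratio and the label representation are hypothetical.

```python
from collections import Counter


def cluster_size_ratio(labels):
    """Ratio of the largest to the second-largest cluster size.

    `labels` holds one cluster label per extracted text fragment.
    Returns None when there are fewer than two clusters.
    """
    sizes = sorted(Counter(labels).values(), reverse=True)
    if len(sizes) < 2:
        return None
    return sizes[0] / sizes[1]


# Example: 8 fragments in cluster "A", 2 in "B", 1 in "C".
ratio = cluster_size_ratio(["A"] * 8 + ["B"] * 2 + ["C"])
```

A ratio well above 1, such as the average of four reported above, indicates that
one cluster clearly dominates.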
Multiple sequence alignment. For the 258 songs we derived the following results.
To compare the result of the multiple sequence alignment with the ground truth,
we transform the multiple sequence alignment into a final version by applying
simple majority voting on a word-by-word level. For each column in the
m-alignment, we select the word that occurs most often, where a gap is handled as
follows. When the different words in a given column are counted, a
gap (
a
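
The word-by-word majority voting can be sketched as follows. Because the gap rule
is cut off in the text above, this sketch makes its own assumption: a gap is
counted like any other candidate, and a column is dropped from the final version
when the gap wins. The gap symbol "-" and the name majority_vote are hypothetical,
and ties are resolved in favour of the candidate seen first in the column.

```python
from collections import Counter

GAP = "-"  # hypothetical gap symbol, not taken from the source


def majority_vote(alignment, gap=GAP):
    """Collapse an m-alignment into one word sequence by column-wise voting.

    `alignment` is a list of rows of equal length; each cell is a word or
    the gap symbol. Assumption: gaps vote like words, and a column whose
    winner is the gap contributes nothing to the result.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of the alignment must have equal length")

    result = []
    for column in zip(*alignment):
        # most_common(1) returns the most frequent cell; ties go to the
        # cell encountered first in the column.
        winner, _count = Counter(column).most_common(1)[0]
        if winner != gap:
            result.append(winner)
    return result


alignment = [
    ["hello", "darkness", "my", "old", "friend"],
    ["hello", "darkness", "-", "old", "friend"],
    ["hello", "darknes", "-", "old", "friend"],
]
final_version = majority_vote(alignment)
```

In this toy alignment the misspelled "darknes" is outvoted, and the third column
is dropped because the gap occurs most often.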