Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	33/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 29 30 31 32 33 34 35 36 ... 57

n text fragments to n/2 2-alignments. It does so by repeatedly selecting the best
aligning pair of text fragments, until it obtains n/2 2-alignments. To determine the

104
best aligning pair, an l × l dynamic programming table is constructed for each pair
of text fragments. This results in a total number of word comparisons of O(n
2
l
2
)
in the first iteration.
In the second iteration, it pairs the n/2 2-alignments to n/4 4-alignments, again
by repeatedly selecting the best aligning pair. To determine the best aligning pair,
a dynamic programming table is constructed for each pair of the set of n/2 2-
alignments, where each entry in the table requires 4 word comparisons. This results
in a total number of word comparisons of O(n
2
l
2
) in the second iteration.
Analogously, in the ith iteration the algorithm pairs the n/2
i−1
2
i−1
-alignments
to n/2
i
2
i
-alignments, where each entry in the dynamic programming tables re-
quires 2
2(i−1)
word comparisons. Again, this results in a total of O(n
2
l
2
) word
comparisons in the ith iteration.
Since the total alignment procedure will consist of log n iterations, we obtain a
total number of O(n
2
log n l
2
) word comparisons.
Our algorithm. In comparison, our algorithm first extracts text fragments from
the n documents by using the regular expression in O(nl) time, where l denotes
the maximum number of words in a collected document. Next, it selects a sub-
set of text fragments using the fingerprints in O(n lm) + O(n
2
m) time, where m is
the number of words that are used in a fingerprint. Constructing a fingerprint for
each of the n documents requires O(n lm) time. Constructing the graph and deter-
mining the largest connected component requires O(n
2
m) time. Next, a multiple
sequence alignment is constructed, by only aligning each of the text fragments to a
single reference text fragment. Assuming that there are no outliers in the collected
documents, this requires O(nl
2
) word comparisons. In addition, the insertion of
additional gaps to construct the final multiple sequence alignment requires O(nl)
time.
Hence, the total time complexity of our algorithm amounts to O(n lm + n
2
m +
nl
2
). Since m < l, we can simplify this total time complexity to O(n
2
m + nl
2
).
Furthermore, in practice one may assume that nm < l
2
. In that case, the total time
complexity can be further simplified to O(nl
2
).
As both multiple sequence alignment approaches give approximate results, it re-
mains to be evaluated whether our algorithm is able to obtain results of at least the
same quality. This is the subject of the next section.
5.2.6 Experimental Results
We tested the algorithms on two test sets. The first set was also used by Knees,
Schedl and Widmer in their lyrics extraction method [2005]. We tested the perfor-
mance using the lyrics of the songs as found in the
CD
booklets.
We conducted a second experiment on a set of 608 songs. We compare the

5.2 Extracting Lyrics from the Web
105
retrieval of our algorithm with the retrieval of the
Yahoo! lyrics service, which
uses the
GraceNote lyrics collection.
The 258 Song Test Set
Our lyrics extraction and alignment algorithms were tested on the set of 258 songs
from Knees, Schedl & Widmer [2005], of which we obtained the ground truth ver-
sions from these authors. The ground truth versions are the versions as they exactly
appear in the CD booklets. We next give experimental results for the successive
steps in the algorithm.
Collecting documents. To give the reader an idea of the number of docu-
ments that are expected to contain the lyrics of the various songs, for the 258
songs Google reported an average of 507 hits for the first query (containing the
allinanchor-option). However, using this first query, for 6 songs no hits were found.
Extracting lyrics. By extracting the lyrics from the documents, we get a sub-
stantial reduction. On average, the size of the extracted lyrics is only 7% of the
original document size. However, the reduction is rather modest in comparison to
the size of the documents after they have been stripped from HTML-tags and cor-
responding links. On average, the size of the extracted lyrics is 79% of the stripped
document size.
Removing outliers. On average, 38% of the extracted text fragments were found
to be outliers. Comparing the size of the largest cluster with the size of the second
largest cluster, we obtain that on average the first is four times as large as the second
one. Hence, on average, there is a clear winner among the clusters.
Multiple sequence alignment. For the 258 songs we derived the following re-
sults. To compare the results of the multiple sequence alignment with that of the
ground truth, we transform the multiple sequence alignment into a final version, by
applying simple majority voting on a word-by-word level. For each column in the
m-alignment, we select the word that occurs most often, where a gap is handled as
follows. When for a given column the different words are being counted, then a
gap (a

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 29 30 31 32 33 34 35 36 ... 57