5.2 Extracting Lyrics from the Web
103
cause I know the dreams that you keep |
that’s
where we meet
Cause I know the dreams that you keep |
That’s
where we meet
Cause I know the dreams that you keep |
That’s
where we meet
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
I know the dreams that you keep is wearing me
*
*
Cos
I know the dreams that you keep is wearing me
*
*
cause I know the dreams that you keep |
that’s
where we meet
----- -
-- ------- ----- -- ----
Figure 5.3. A fragment of a multiple sequence alignment of 10 lyrics versions of
the song
No distance left to run by
Blur.
5.2.5 Computational Complexity
In this section we analyze the computational complexity of the lyrics retrieval and
alignment algorithms presented in the previous sections, and we give a detailed
comparison with the computational complexity of the algorithms presented by
Knees et al. [2005]. In this analysis we do not take into account the time required
for issuing queries to Google and for retrieving the documents from the web. The
time required for these activities are assumed to be identical for both approaches.
Algorithm by Knees et al. The document collection approach of Knees et al. is
very similar to our approach. However, Knees et al. do not extract the lyrics from
the documents before aligning. They only remove HTML-tags and corresponding
links and preprocess the documents to handle annotations. For example, they re-
place annotations such as ‘
repeat chorus’ by the actual chorus. Knees et al. also do
not consider removing outliers before aligning. In conclusion, if both algorithms
start with the same collection of documents, but our approach will retain fewer and
shorter text fragments for the actual alignment.
Let us assume, nevertheless, that both multiple alignment algorithms have the
same number of text fragments as input, i.e., that there are no outliers in the set of
n
documents. For ease of analysis, let us assume that we have
n text fragments each
with a length of
l words, and let us assume
n to be a power of two. Furthermore,
let us assume that all alignments remain of
O(
l) length.
The multiple sequence alignment proposed by Knees et al. follows an iterative
hierarchical approach [Corpet, 1988]. For ease of reference, an alignment consist-
ing of
n rows is called an
n-alignment. In the first iteration the algorithm pairs the
Dostları ilə paylaş: