Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	32/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 28 29 30 31 32 33 34 35 ... 57

0
= (S
0
, S
1
, . . . , S
m
), and suppose that these word-sequences are ordered
by decreasing length, i.e., |S
i
| ≥ |S
i+1
| for all i = 0, 1, · · · , m − 1. Then, we select
S
0
as reference sequence and construct an optimal alignment between the reference
sequence and each of the other sequences, resulting in alignments A
1
, A
2
, . . . , A
m
.
The resulting m alignments A
1
, A
2
, . . . , A
m
are next combined into a single mul-
tiple alignment A = [a
i j
] containing m + 1 rows. We could simply take A
1◦
= A
1
1◦
,
and A
i+1◦
= A
i
2◦
for i = {1, . . . , m}. However, the upper rows in the different align-
ments (i.e. the rows corresponding to S
0
) will generally have a different number
of gaps, and these gaps will generally be at different positions. Let us consider
aligning the following three character sequences as an example: ‘california dream-
ing’ and ‘californian dream’ and ‘calif. dreaming’, then the first will be chosen as
reference sequence, resulting in the alignments A
1
and A
2
given as follows.
california* dreaming
california dreaming
californian dream***
calif.**** dreaming
Just taking the rows, as suggested above, from these two alignments would
result in an alignment A given as follows.
california* dreaming
californian dream***
calif.***** dreaming
To obtain a better alignment, a gap can be inserted right after the tenth character
in the third row, i.e., at the position where there is a gap in the upper row of A
1
.
We propose the following simple strategy to account for differences in gaps in
the upper rows of the given alignments A
1
, A
2
, . . . , A
m
. We simultaneously scan the
upper rows of the given alignments from beginning to end. If at a given position,
there is a gap in any of these upper rows, then for each alignment A
i
that does not
have a gap at this position, we insert a gap at that position both in the upper and
lower rows of A
i
.
A small example of the multiple sequence alignments that we obtain in this
fashion is given in Figure 5.3. For easy visual inspection the words in the same
column are appended with spaces such that they are all of equal length. The char-
acter | denotes a line break, which is considered as a single word. In addition, an
extra row has been added to underline the columns where there is disagreement.

5.2 Extracting Lyrics from the Web
103
cause I know the dreams that you keep |
that’s
where we meet
Cause I know the dreams that you keep |
That’s
where we meet
Cause I know the dreams that you keep |
That’s
where we meet
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
i know the dreams that you keep is wearing me
*
*
Cos
I know the dreams that you keep is wearing me
*
*
Cos
I know the dreams that you keep is wearing me
*
*
cause I know the dreams that you keep |
that’s
where we meet
----- -
-- ------- ----- -- ----
Figure 5.3. A fragment of a multiple sequence alignment of 10 lyrics versions of
the song No distance left to run by Blur.
5.2.5 Computational Complexity
In this section we analyze the computational complexity of the lyrics retrieval and
alignment algorithms presented in the previous sections, and we give a detailed
comparison with the computational complexity of the algorithms presented by
Knees et al. [2005]. In this analysis we do not take into account the time required
for issuing queries to Google and for retrieving the documents from the web. The
time required for these activities are assumed to be identical for both approaches.
Algorithm by Knees et al. The document collection approach of Knees et al. is
very similar to our approach. However, Knees et al. do not extract the lyrics from
the documents before aligning. They only remove HTML-tags and corresponding
links and preprocess the documents to handle annotations. For example, they re-
place annotations such as ‘
repeat chorus’ by the actual chorus. Knees et al. also do
not consider removing outliers before aligning. In conclusion, if both algorithms
start with the same collection of documents, but our approach will retain fewer and
shorter text fragments for the actual alignment.
Let us assume, nevertheless, that both multiple alignment algorithms have the
same number of text fragments as input, i.e., that there are no outliers in the set of n
documents. For ease of analysis, let us assume that we have n text fragments each
with a length of l words, and let us assume n to be a power of two. Furthermore,
let us assume that all alignments remain of O(l) length.
The multiple sequence alignment proposed by Knees et al. follows an iterative
hierarchical approach [Corpet, 1988]. For ease of reference, an alignment consist-
ing of n rows is called an n-alignment. In the first iteration the algorithm pairs the

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 28 29 30 31 32 33 34 35 ... 57