Information extraction from the web using a search engine Citation for published version (apa)

Yüklə 0,9 Mb.

Pdf görüntüsü

səhifə	28/57
tarix	09.02.2022
ölçüsü	0,9 Mb.
	#52298

1 ... 24 25 26 27 28 29 30 31 ... 57

D be a document containing r end-of-line markers. Then, D can be considered
as a sequence (t
0
,t
1
, . . . ,t
r
) with r + 1 fragments, where the ith end-of-line marker
separates t
i−1
and t
i
.
Since we opt for a time-efficient algorithm, we scan the full document only
once. We therefore map the string of r + 1 fragments t
i
, 0 ≤ i ≤ r, to a string L of
the form (b|l|n)
r+1
, i.e. a string of length r + 1 containing only bs, ls and ns. The
mapping m of the fragments t
i
meets the following criteria.
- m(t
i
) = b whenever t
i
is empty or only consists of white space characters
(blank line),
- m(t
i
) = l (a lyrics line) whenever t
i
(a) does not contain any HTML-tags,
(b) contains between 3 and 80 characters and
(c) half of the characters are letters,
and
- m(t
i
) = n otherwise (non lyrics).
In terms of bs, ls and ns, lyrics can now be described by a regular expression.
As we provided the algorithm with a list of query templates, we also input a list of
regular expressions that describe lyrics (Table 5.14).

96
R
0
= (l
1−20
· b)
1−12
· l
1−20
R
1
= l
3−40
Table 5.14. Regular expressions R
0
and R
1
expressing lyrics within a page.
Here, a
i− j
denotes a sequence of as of a length l, i ≤ l ≤ j. The second ex-
pression accepts a wider variety of texts. This is useful where there no stanzas are
indicated, e.g. in lyrics of rap songs.
We match the string L to a regular expression as follows. As we want to extract
the full lyrics (and for example not omit the last stanza), we want to identify the
first longest substring that is described by the regular expression.
We match the string L with the regular expression R
0
= R
0
· (l + b + n)
∗
to find
a prefix of L that matches R
0
. If the first character of L is either a b or a l this is a
O(1) operation. If not, the matching has an upperbound of the maximum number
of lines in the lyrics as described by R
0
, hence 21 · 12 + 20 = 273, or the length of
L if |L| ≤ 273. If no such match is found we remove the first character of L and
repeat the procedure until either a match is found or L is the empty string. The
search for a substring matching R
0
thus has a time complexity of O(|L|
2
).
After having found a substring of L that matches R
0
, we want to find the longest
prefix that matches R
0
. We use a bounded linear search to determine the length of
the longest prefix that matches R
0
. We remove the last character from the string
until the full string matches R
0
. Hence, the identification of the longest substring
matching a regular expression has a total time complexity of O(|L|
2
).
Some web sites contain the lyrics within preformatted text (put between tags
and
). As a preprocessing step, we first extract such fragments
from the document. The preformatted text is then mapped to a sequence L with the
same rules (using the new line character instead of the
-tag).
We match the sequence L to the regular expressions R
0
and R
1
and extract the
fragment as described above. We use the corresponding substring in the document
to obtain the lyrics within it. Finally, we check the text that directly precedes and
succeeds the resulting substring, since the first and last lines of the lyrics may not
be included in this substring.
Again for efficiency reasons, not all documents in the list L are downloaded.
From the list of
URL
s, the top element is taken and the corresponding document
is retrieved. The above extraction algorithm is applied to the retrieved document.
The texts retrieved with either one of the regular expressions are added to a set P
of potential lyrics. We continue this process until lyrics have been extracted from
40 documents, or the list L is empty.

5.2 Extracting Lyrics from the Web
97
5.2.3 Selecting a Subset of Lyrics
After having collected a number of URLs of documents and identified a number
of potential lyrics (the set P) from the documents collected, the task remains to
remove the texts that are not the lyrics of the intended song s.
The set P of text fragments that is extracted as described in the previous section
is likely to also contain text fragments that do no relate to the lyrics of the intended
song, for a number of reasons, such as the following ones.
- A text fragment can be the lyrics of another song by the same artist (espe-
cially if the title of the intended song is a subsequence of the title of the other
song).
- A text fragment can be the listing of an album’s songs in which the intended
song appears (especially if the song title is identical to the album title).
- A text fragment can be a listing of a playlist.
In this stage we want to remove these so-called outliers, since they do not
reflect the intended lyrics. We use the assumption that the majority of the extracted
text fragments constitutes the lyrics of the intended song.
We cluster the text fragments on the basis of similarity and retain only the text
fragments in the largest cluster. As variations frequently occur in representations
of lyrics to the same song, exact string matching is unsuited for this purpose. Ap-
proximate string matching techniques of strings of lengths s
0
and s
1
are in general

Yüklə 0,9 Mb.

Dostları ilə paylaş:

1 ... 24 25 26 27 28 29 30 31 ... 57