Information extraction from the web using a search engine


... n_i the tag is applied to artist i. The top count is the sum over all
(at most 100) n_i. Table 6.5 contains examples of tags applied once in
the 1995 artist collection and their top counts. If this sum is one, only one user has
tagged one artist with this tag. Hence, the larger the top count, the more people will
have used the tag to describe their music. Out of these 7,981 tags, 7,238 have a top
count of at most 100. For comparison, the tag ’rock’ has a top count of 150,519.
Hence, we can conclude that the tags that are found only once in a collection of
artists are in general uncommon descriptors for an artist.
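The top count can be illustrated with a minimal sketch, assuming the per-artist counts n_i for a tag are already available as a mapping. The function name `top_count` and the example data are hypothetical and not part of the Audioscrobbler service.

```python
def top_count(counts_per_artist, limit=100):
    """Sum the per-artist counts n_i of one tag over at most `limit` top artists."""
    # Keep only the `limit` largest counts, mirroring the cap of 100 top artists.
    top = sorted(counts_per_artist.values(), reverse=True)[:limit]
    return sum(top)


# Hypothetical counts for one tag: artist -> number of times the tag was applied.
example = {"artist a": 3, "artist b": 1, "artist c": 0}
print(top_count(example))
```

A top count of one then corresponds to a single user having tagged a single artist with the tag.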
Based on these small experiments, we conclude that the most frequently used tags
are relevant characterizations for musical artists. Moreover, although users can
label an artist with any term, the list of frequently used tags is relatively small. We
conclude that the number of tags that describe multiple artists is on the order of
thousands. If we select the tags that apply to at least 5% of the artists, the number
is on the order of hundreds.
5 e.g. http://ws.audioscrobbler.com/1.0/tag/post-hardcore/topartists.xml gives
the top artists and counts for post-hardcore.

6.3 Evaluating Extracted Subjective Information
High top count           Low top count
post-hardcore (8134)     fagzzz (0)
twee (4036)              when somebody loves you (0)
futurepop (3162)         ravens music (0)
mathcore (2865)          bands i met (0)
piano rock (2558)        most definitely a bamf (1)
Table 6.5. Examples of tags occurring only once, with high and low top counts.
6.3.2 Filtering the Tags
As indicated above, not all tags provide sufficient information for our task: tags
occur with small spelling variations, and catalog data (such as the names of
artists or songs) are used as tags as well. Moreover, tags that are applied to only
a few artists cannot be used to discriminate between artists, as no semantics are
defined for tags. Suppose that we have a collection I_a of artists. We present a
simple method to filter out such meaningless tags.
Normalizing Tags. As we want tags to be descriptive, we filter out tags attached
to i ∈ I_a as follows.
• If a tag equals the name of the artist, we remove it.
• We compute a normalized form for all tags by
– turning them into lowercase,
– computing the stem of all words in the tags using Porter’s stemming
algorithm [Porter, 1980], and
– removing all non-letter-or-digit characters in the tags.
• If two tags have the same normalized form, we remove the second one in the
list.
• We remove every infrequently applied tag. In our experiments, we remove
the tags that are applied to fewer than 5% of the artists in I_a.
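The normalization steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `crude_stem` is a placeholder that only strips a plural "s" and stands in for Porter's stemming algorithm, the comparison with the artist name is made case-insensitive, and all function names and example data are hypothetical.

```python
from collections import Counter


def crude_stem(word):
    # Placeholder stemmer (an assumption): strips a plural "s" only.
    # A real Porter stemmer would be passed in instead.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(tag, stem=crude_stem):
    # Lowercase, stem every word, then drop all non-letter-or-digit characters.
    stemmed = "".join(stem(word) for word in tag.lower().split())
    return "".join(ch for ch in stemmed if ch.isalnum())


def filter_tags(artist_tags, min_fraction=0.05, stem=crude_stem):
    """artist_tags maps an artist name to its ordered list of tags."""
    cleaned = {}
    for artist, tags in artist_tags.items():
        seen, kept = set(), []
        for tag in tags:
            # Remove a tag that equals the artist name (case-insensitive: an assumption).
            if tag.lower() == artist.lower():
                continue
            form = normalize(tag, stem)
            # Same normalized form as an earlier tag in the list: drop the later one.
            if form in seen:
                continue
            seen.add(form)
            kept.append(tag)
        cleaned[artist] = kept
    # Each artist contributes a normalized form at most once, so this counts artists.
    support = Counter(normalize(t, stem) for tags in cleaned.values() for t in tags)
    threshold = min_fraction * len(artist_tags)
    return {
        artist: [t for t in tags if support[normalize(t, stem)] >= threshold]
        for artist, tags in cleaned.items()
    }


# Hypothetical two-artist collection; the threshold is raised to 100% so that
# the frequency filter has a visible effect on such a small example.
tags = {
    "Eminem": ["hip hop", "Eminem", "hiphop", "rap"],
    "Artist B": ["Hip-Hop", "rap", "bands i met"],
}
result = filter_tags(tags, min_fraction=1.0)
print(result)
```

Counting support on the normalized form rather than on the raw string lets spelling variants such as "hip hop" and "Hip-Hop" contribute to the same frequency.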
Track Filtering. As we want the tags to reflect the music of the artist, we propose
a further filtering step based on the tags applied to the best-scoring tracks of the
artist. Audioscrobbler provides the most popular tracks per artist, based on the
listening behavior of the Last.fm users. As tracks can also be tagged individually,
we can compare the tags applied to the artist with the tags applied to the top
tracks. In the Track Filtering step, we filter out tags applied to the artist that are
not applied to the artist's top tracks.
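A minimal sketch of the Track Filtering step, assuming that a tag is kept when it is applied to at least one of the top tracks and that tags are compared as exact strings. Both are assumptions, as are the function name and the example data.

```python
def track_filter(artist_tags, top_track_tags):
    """Keep the artist tags that also occur on at least one top track."""
    # Union of all tags applied to the artist's top tracks.
    on_tracks = set()
    for tags in top_track_tags:
        on_tracks.update(tags)
    # Preserve the original order of the artist's tags.
    return [tag for tag in artist_tags if tag in on_tracks]


# Hypothetical data: tags on the artist and on two of the artist's top tracks.
artist = ["rap", "seen live", "metal"]
tracks = [["rap", "hip-hop"], ["rap"]]
print(track_filter(artist, tracks))
```

Here 'seen live' and 'metal' are dropped because neither is applied to any of the listed tracks.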

Normalization     Track filtering
hip hop           alternative
Eminem            seen live
hiphop            metal
Aftermath         classic rock
Table 6.6. Tags removed for Eminem after normalization (l.) and track filtering (r.).
By removing the tags that do not reflect the (most popular) music of an artist,
we perform a second filtering step.
