p(i,j) = w \cdot s(i,j) + (1-w) \cdot \sum_{i',\, i' \neq i} t(i,i') \cdot s(i',j). \qquad (6.7)
Note that p(i, j) = s(i, j) for w = 1. For instance i and tag j, p(i, j) can be read
as a confidence estimate that tag j is applicable to instance i. Since both s and t
are normalized, simple calculus shows that p is normalized as well, i.e. the sum of
p(i, j) over all tags j equals 1:
\sum_{j'} p(i,j')
  = \sum_{j'} \Big( w \cdot s(i,j') + (1-w) \cdot \sum_{i',\, i' \neq i} t(i,i') \cdot s(i',j') \Big)
  = \sum_{j'} \big( w \cdot s(i,j') \big) + \sum_{j'} \Big( (1-w) \cdot \sum_{i',\, i' \neq i} t(i,i') \cdot s(i',j') \Big)
  = w \cdot \sum_{j'} s(i,j') + (1-w) \cdot \sum_{j'} \sum_{i',\, i' \neq i} t(i,i') \cdot s(i',j')
  = w + (1-w) \cdot \sum_{j'} \sum_{i',\, i' \neq i} t(i,i') \cdot s(i',j')
  = w + (1-w) \cdot \sum_{i',\, i' \neq i} \sum_{j'} t(i,i') \cdot s(i',j')
  = w + (1-w) \cdot \sum_{i',\, i' \neq i} t(i,i') \cdot \sum_{j'} s(i',j')
  = w + (1-w) \cdot \sum_{i',\, i' \neq i} t(i,i')
  = w + (1-w)
  = 1.
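As a concrete illustration of Eq. (6.7) and the normalization property, the following sketch combines hypothetical normalized tables s and t and checks that the resulting p sums to 1 per instance. All instance names, tag names, and scores are invented for illustration; `compute_p` is not code from the text.

```python
# Sketch of Eq. (6.7): p(i,j) = w*s(i,j) + (1-w) * sum_{i' != i} t(i,i')*s(i',j).
# The instances, tags, and scores below are invented; s is normalized per
# instance over its tags, t per instance over the other instances.

def compute_p(s, t, w):
    """Combine direct tag scores s with the tag scores of similar instances."""
    p = {}
    for i in s:
        for j in s[i]:
            neighbour = sum(t[i][k] * s[k][j] for k in s if k != i)
            p[(i, j)] = w * s[i][j] + (1 - w) * neighbour
    return p

s = {  # s(i, j): each row sums to 1 over the tags
    "a": {"rock": 0.7, "pop": 0.3},
    "b": {"rock": 0.2, "pop": 0.8},
    "c": {"rock": 0.5, "pop": 0.5},
}
t = {  # t(i, i'): each row sums to 1 over the other instances
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 0.3, "b": 0.7},
}

p = compute_p(s, t, w=0.8)
# As derived above, sum over j of p(i,j) equals 1 for every instance i.
for i in s:
    assert abs(sum(p[(i, j)] for j in s[i]) - 1.0) < 1e-9
```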
It now remains to find an appropriate value for w. One approach is to identify
a training set of artists and related tags. Using the co-occurrences acquired we can
determine the value of w, 0 ≤ w ≤ 1, for which the scores of the tags fit the training
set best. In the following section, we investigate whether the performance of the
artist tagging method indeed improves for values of w smaller than 1.
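One simple way to realize such a fit is a grid search over w. The sketch below is a hypothetical illustration, not the procedure used in the experiments: the tables s and t and the training targets are invented, and the targets are generated with a known w so the search can recover it.

```python
# Hypothetical sketch of fitting w on a training set: evaluate Eq. (6.7)
# for a grid of w values and keep the one with the smallest squared error
# against known target scores. All data below is invented for illustration.

def p_score(s, t, w, i, j):
    """Eq. (6.7) for a single instance/tag pair."""
    neighbour = sum(t[i][k] * s[k][j] for k in s if k != i)
    return w * s[i][j] + (1 - w) * neighbour

def fit_w(s, t, targets, steps=100):
    """Grid search over 0 <= w <= 1 minimizing squared error on targets."""
    best_w, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        err = sum((p_score(s, t, w, i, j) - y) ** 2
                  for (i, j), y in targets.items())
        if err < best_err:
            best_w, best_err = w, err
    return best_w

s = {"a": {"rock": 0.7, "pop": 0.3}, "b": {"rock": 0.2, "pop": 0.8}}
t = {"a": {"b": 1.0}, "b": {"a": 1.0}}

# Toy targets generated with w = 0.7, so the search should recover 0.7.
targets = {(i, j): p_score(s, t, 0.7, i, j) for i in s for j in s[i]}
best_w = fit_w(s, t, targets)
```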
Complexity Analysis. We analyze the computational complexity of computing the
values p(i, j). First we determine the complexity of creating all values for t(i, i′)
and s(i, j). We assume that the values for the co-occurrence scores co(i, j) and
co(i, i′) are stored in ordered lookup tables.
For instances i and i′ with co(i, i′) ≥ 1, t(i, i′) can be rewritten as follows,
t(i,i') = \frac{T(i,i')}{\sum_{i'} T(i,i')}
        = \frac{co(i,i') \,/\, \sum_{i'',\, i'' \neq i'} co(i'',i')}{\sum_{i'} \big( co(i,i') \,/\, \sum_{i_1,\, i_1 \neq i'} co(i_1,i') \big)}
        = \frac{co(i,i')}{c(i') \cdot \sum_{i'} co(i,i')/c(i')}
where c(i) is given by

c(i) = \sum_{i',\, i' \neq i} co(i,i'). \qquad (6.8)
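The preprocessing step can be sketched as follows: compute c(i) as in Eq. (6.8) and then t(i, i′) via the rewritten quotient above. The co-occurrence counts below are invented for illustration, and co is assumed symmetric here; this is a sketch, not the actual implementation.

```python
# Sketch of the preprocessing step: c(i) per Eq. (6.8), then
# t(i,i') = (co(i,i')/c(i')) normalized over all i' != i.
# The co-occurrence counts below are invented; co is assumed symmetric.

instances = ["a", "b", "c"]
co = {  # co(i, i'): invented co-occurrence counts
    ("a", "b"): 4, ("b", "a"): 4,
    ("a", "c"): 1, ("c", "a"): 1,
    ("b", "c"): 5, ("c", "b"): 5,
}

# Eq. (6.8): c(i) = sum over i' != i of co(i, i')
c = {i: sum(co.get((i, k), 0) for k in instances if k != i) for i in instances}

def t_value(i, iprime):
    """t(i, i') = (co(i,i')/c(i')) / sum_{k != i} co(i,k)/c(k)."""
    num = co.get((i, iprime), 0) / c[iprime]
    den = sum(co.get((i, k), 0) / c[k] for k in instances if k != i)
    return num / den

# Each row of t is normalized over the other instances, as required.
for i in instances:
    row = sum(t_value(i, k) for k in instances if k != i)
    assert abs(row - 1.0) < 1e-9
```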
The value of c(i) can be computed in n steps, where n is the size of I_a. Hence,
computing c(i) for all i requires O(n²) time. The values of c(i) and co(i, j) can
then be looked up in O(log n), while the computation of a single t(i, i′) requires n
lookups of c(i′). Each t(i, i′) can thus be computed in O(n log n), so computing
all values of t(i, i′) requires a time complexity of O(n³ log n).
The computation of s(i, j) can be done in a similar fashion, requiring
O(m log(nm)) steps, where m is the size of I_g. All values of s(i, j) are thus
computed in O(nm² log(nm)).
We store the values of s and t again in a lookup table. Assuming that the number
of tags in I_g does not exceed the number of instances in I_a, we conclude that
the preprocessing step requires a computational complexity of O(n³ log n).
Having computed all values for s(i, j) and t(i, i′), we can now compute p(i, j)
for a given w. The values of t(i, i′) · s(i′, j) are computed in n − 1 steps of two
lookups each. Hence, the value of p(i, j) can be computed in O(n log n). Creating
ordered lists of tags for all instances in I_a thus requires a time complexity of
O(m · n² log n).
In total, the time complexity for the computation of all values of p(i, j) is thus
O(n³ log n + nm² log(mn) + mn² log n). If we assume m < n, this reduces to
O(n³ log n). Especially for PCM this value is realistic, as it is to be expected that
most co-occurrence counts are at least 1.
6.3 Evaluating Extracted Subjective Information
In this section, we investigate whether information from social websites can be
used to evaluate the populated ontologies computed with the methods discussed
in the previous section. We focus on one of the larger social websites, Last.fm,
and its topic: music. We investigate the consistency of the tags provided by the
Last.fm community and compare this data with the concept of genre that
professionals often use to characterize music.
Researchers in music information retrieval widely consider musical genre to be
an ill-defined concept [Aucouturier & Pachet, 2003; Scaringella, Zoia, & Mlynek,
2006; McKay & Fujinaga, 2006]. Several studies also showed that there is no
consensus on genre taxonomies [Aleksovski, Kate, & Harmelen, 2006; Pachet
& Cazaly, 2000]. However, automatic genre classification is a popular topic of
research in music information retrieval (e.g. [Basili, Serafini, & Stellato, 2004;
Tzanetakis & Cook, 2002; Li, Ogihara, & Li, 2003; Pampalk, Flexer, & Widmer,
2005; Schedl et al., 2006]).
McKay and Fujinaga [2006] conclude that musical genre classification is worth
pursuing. One of their suggestions is to abandon the idea that only one genre is
applicable to a recording. Hence, multiple genres can be applicable to one recording
and a ranked list of genres should be computed per recording.
Today, the content of web sites such as
del.icio.us, flickr.com and youtube.com
is generated by their users. Such sites use community-based
tags to describe the
available items (photos, films, music, (scientific) literature, etc.). Although tags
have proven to be suitable descriptors for items, no clear semantics are defined.
Users can label an item with any term. The more an item is labeled with a tag, the
more the tag is assumed to be relevant to the item.
Last.fm is a popular internet radio station where users are invited to tag the
music and artists they listen to. Moreover, for each artist, a list of similar artists
is given based on the listening behavior of the users. Ellis et al. [2002] propose a
community-based approach to create a ground truth in musical artist similarity. The
research question was whether artist similarities as perceived by a large community
can be predicted using data from All Music Guide and from shared folders for peer-
to-peer networks. Now, with the
Last.fm data available for downloading, such
community-based data is freely available for non-commercial use.
In Last.fm, tags are terms provided by users to describe music. They “are simply
opinion and can be whatever you want them to be”¹. For example, Madonna’s
music is perceived as pop, glamrock and dance as well as 80s and camp. When we
are interested in describing music in order to serve a community (e.g. in a
recommender system), community-created descriptors can be valuable features.
In this section we investigate whether the
Last.fm data can be used to generate
a ground truth to describe musical artists. Although we abandon the idea of
characterizing music with labels with defined semantics (e.g. genres), we follow the
suggestion of McKay and Fujinaga [2006] to characterize music with a ranked
list of labels. We focus on the way listeners perceive artists and their music, and
propose to create a ground truth using community data rather than to define one
by experts. In line with the ideas of Ellis et al. [2002], we use artist similarities
as identified by a community to create a ground truth in artist similarity. As tastes
and opinions change over time, a ground truth for music characterization should
be dynamic. We therefore present an algorithm to create a ground truth from the
dynamically changing
Last.fm data instead of defining it once and for all.
6.3.1 Analyzing the Last.fm Tags
Last.fm users are invited to tag artists, albums, and individual tracks. The 100
top-ranked tags (with respect to the frequency with which a tag is assigned) for
these three categories are easily accessible via the Audioscrobbler web services
API². By analyzing the listening behavior of its users, Last.fm also provides artist
similarities via Audioscrobbler³. Per artist, a list of the 100 most similar artists is
presented.

¹ http://www.Last.fm/help/faq/?category=Tags

rap            Gangsta Rap
Hip-Hop        Aftermath
hip hop        favorites
Eminem         metal
hiphop         Favorite
pop            rnb
rock           dance
alternative    american
detroit        classic rock
seen live      r and b
Table 6.1. Top 20 tags for Eminem.
We first analyze tags for artists. As the lists of the top-ranked tags tend to contain
noise, we propose a simple mechanism to filter out such noise (Section 6.3.2).
In order to check the consistency of the tags, we inspect whether users label
similar artists with the same tags. We end Section 6.3 with a proposed mechanism to
create a dynamic ground truth in artist tagging and similarity using
Last.fm data.
Tagging of Artists
In Table 6.1, the 20 top-ranked tags for the artist Eminem are given, as found with
the Audioscrobbler web service. The terms
rap, hiphop and detroit can be seen as
descriptive for the artist and his music. Eminem is tagged with multiple terms that
reflect a genre, but the tag rap is more significant than metal.
Without questioning the quality or applicability of the terms in the list in
Table 6.1, we observe some noise in the tagging of this artist. Whether we consider
Eminem to be a hip-hop artist or not, after encountering the second highest ranked
tag Hip-Hop, the tags hip hop and hiphop do not provide any new information.
Moreover, the tag Eminem does not provide any new information with respect to
the catalog data. The tags favorite and good do not seem very discriminative.
To investigate whether the tags are indeed descriptive for a particular artist, we
collected the tags applied to a set of artists. In [Schedl et al., 2006]⁴, a list of 1,995
artists was derived from All Music Guide. We calculated the number of artists that
are labeled with each of the tags. The most frequently occurring tags over all artists
are given in Table 6.2. Table 6.3 contains some of the tags that are applied only to
one artist. For the 1,995 artists, we encountered 14,146 unique tags.

² http://ws.audioscrobbler.com
³ e.g. http://ws.audioscrobbler.com/1.0/artist/Madonna/similar.xml
⁴ http://www.cp.jku.at/people/schedl/music/C1995a_artists_genres.txt

jazz (809)          country (308)
seen live (658)     hard rock (294)
rock (633)          singer songwriter (291)
60s (623)           oldies (289)
blues (497)         female vocalists (285)
soul (423)          punk (282)
classic rock (415)  folk (281)
alternative (397)   heavy metal (277)
funk (388)          hip-hop (267)
pop (381)           instrumental (233)
favorites (349)     rnb (231)
american (345)      progressive rock (229)
metal (334)         electronica (215)
electronic (310)    dance (209)
indie (309)         alternative rock (208)
Table 6.2. The 30 most popular tags and their frequencies for the set of 1,995 artists.

grimey (1)             stuff that needs further exploration (1)
disco noir (1)         american virgin festival (1)
gdo02 (1)              lektroluv compilation (1)
808 state (1)          electro techo (1)
iiiii (1)              richer bad rappers have not existed (1)
mussikk (1)            crappy girl singers (1)
good gym music (1)     techno manchester electronic acid house (1)
knarz (1)              music i tried but didnt like (1)
Table 6.3. Some of the least used tags for the 1,995 artists.
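The per-tag artist counts behind Tables 6.2 and 6.3 amount to a simple counting pass over the artist tag lists. The sketch below uses an invented artist-to-tag mapping for illustration; the real input would be the Audioscrobbler top-tag lists for the 1,995 artists.

```python
from collections import Counter

# Count, for each tag, the number of distinct artists labeled with it,
# as done for Tables 6.2 and 6.3. The mapping below is invented.
artist_tags = {
    "Eminem": ["rap", "Hip-Hop", "detroit"],
    "Madonna": ["pop", "dance", "80s"],
    "Miles Davis": ["jazz", "instrumental"],
    "Radiohead": ["rock", "alternative", "seen live"],
    "Coldplay": ["rock", "pop", "seen live"],
}

tag_counts = Counter(tag for tags in artist_tags.values() for tag in set(tags))

# Analogue of Table 6.2: the most frequently applied tags.
most_common = tag_counts.most_common(3)
# Analogue of Table 6.3: tags applied to only one artist.
singletons = sorted(t for t, n in tag_counts.items() if n == 1)
```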
If a tag is applied to many diverse artists, it cannot be considered to be
discriminative. We observe that there are no tags that are applied to a majority of the
artists. The high number of artists labeled with jazz can be explained by the fact
that the 1,995-artist-set contains 810 jazz artists. All frequent tags seem relevant
characterizations for musical artists or for the relation of the users to the artists
(e.g.
seen live).
The most debatable tag among the best scoring ones may be favorites. Table 6.4
contains a list of the top artists for this tag, as extracted from Audioscrobbler. We
notice that no mainstream dance or pop artists are among the list of 100 top artists
for favorites. The 100 top artists for seen live are artists that toured in the 00s.

Radiohead
Coldplay
The Decemberists
Pink Floyd
Death Cab for Cutie
The Postal Service
The Beatles
Bright Eyes
The Shins
Elliot Smith
Table 6.4. The 10 top artists for the tag ’favorites’.
Tags that are applied to only one or a few artists are not informative either.
Since we do not consider the semantics of the tags, uniquely occurring tags cannot
be used to compute artist similarities.
We observe that the tags that are only applied once to artists in this set are more
prosaic, are in a language other than English, or simply contain typos (cf. “electro
techo” in Table 6.3). It is notable that in total 7,981 tags (56%) are applied to only
one artist. Only 207 tags are applied to at least 50 out of the 1,995 artists.
To check whether the 7,981 tags are descriptive for a larger set of artists, we
computed the top count. For each of the at most 100 top artists⁵ for this tag, we
extract the number of times n