Information extraction from the web using a search engine Citation for published version (apa)


$S_n$ of the top $n$ tags for artist $i$ in the ground truth (i.e. the normalized
Last.fm data) and evaluated precision $p$ and recall $r$ of the computed ordered
list $L_m$ of the $m$ most applicable tags for $i$:
\[
p = \frac{|S_n \cap L_m|}{|L_m|}
\quad\text{and}\quad
r = \frac{|S_n \cap L_m|}{|S_n|}
\]
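As a minimal sketch of the two measures defined above, precision and recall of a computed tag list can be calculated from plain set overlap. The function name and the example tags are hypothetical and serve only as illustration; they are not taken from the experiments.

```python
def precision_recall(ground_truth_top_n, computed_top_m):
    """Return (precision, recall) of a computed tag list against ground-truth tags.

    ground_truth_top_n plays the role of S_n, computed_top_m the role of L_m.
    """
    s_n = set(ground_truth_top_n)
    l_m = set(computed_top_m)
    overlap = len(s_n & l_m)
    # Guard against empty lists; returning 0.0 here is an assumption of this sketch.
    precision = overlap / len(l_m) if l_m else 0.0
    recall = overlap / len(s_n) if s_n else 0.0
    return precision, recall


# Hypothetical example: four ground-truth tags, three computed tags.
truth = ["rock", "pop", "indie", "british"]
computed = ["rock", "indie", "fun"]
p, r = precision_recall(truth, computed)
print(f"precision={p:.3f} recall={r:.3f}")
```

When fewer than m tags are found for an artist, the computed list is shorter than the ground-truth list, so the same overlap is divided by a smaller number for precision than for recall.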
The average recall and precision for the 25 highest ranked computed tags (i.e.
m = 25), compared with the 25 highest ranked tags on Last.fm (i.e. n = 25), are
given in Figure 6.14. For all values of w the precision is marginally larger than
the recall, as we found fewer than 25 tags for a few of the artists. We note that
for the given set of tags random precision and recall are both 0.10.
For the given settings, we hence obtain precision and recall rates between 0.25
and 0.3. For w = 0.25 we obtained the best results. Hence, we again observe that
the use of artist similarities improves the labeling of the artists.
For w = 0.25 we compute the average precision and recall of the top n Last.fm
tags by repeatedly increasing m from 1 to 100. The results for various values of n
can be found in Figure 6.15.
Figure 6.14. Precision and Recall for the 25 best scoring computed tags with
respect to the 25 best scoring normalized Last.fm tags for the 224 artists.
(Plot: precision and recall, between 0.25 and 0.3, against w from 0 to 1.)

Figure 6.15. Precision and Recall for the n best scoring computed tags with
respect to the 25 best scoring normalized Last.fm tags for the 224 artists with
w = 0.25.
(Plot: precision against recall; curves for n = 10, 25, 50 and 100.)

Ranking. We also evaluate the ranking itself, hence the correlation between the
ranking of tag $t_i$ in the ground truth, $g_a(t_i)$, and the computed ranking
$r_a(t_i)$. For a given artist, we focus on the ranking of the tags that are both
in the ground truth data and in the computed list.
The average correlation coefficient ρ per w for the 224 artists is given in
Figure 6.16. The correlation is indeed positive, but weak, for all values of w. We
note that the value for ρ is slightly lower for values of w approaching both 0 and 1.
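The ranking comparison can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the tags shared by both lists are re-ranked within the shared subset, both lists are assumed to be free of duplicates and ties, and the function name and example tags are hypothetical.

```python
def spearman_on_common_tags(ground_truth, computed):
    """Spearman's rho over the tags that occur in both ordered lists.

    Assumes no duplicate tags and no ties. Returns None when fewer than
    two tags are shared, since rho is undefined in that case.
    """
    shared = set(ground_truth) & set(computed)
    # Assumption: shared tags are re-ranked 1..k by their order in each list.
    g_order = [t for t in ground_truth if t in shared]
    r_order = [t for t in computed if t in shared]
    k = len(g_order)
    if k < 2:
        return None
    g_rank = {t: pos for pos, t in enumerate(g_order, start=1)}
    r_rank = {t: pos for pos, t in enumerate(r_order, start=1)}
    d_squared = sum((g_rank[t] - r_rank[t]) ** 2 for t in g_order)
    # Closed form of Spearman's rho for rankings without ties.
    return 1 - 6 * d_squared / (k * (k * k - 1))


# Hypothetical example: three tags are shared, two of them swapped.
truth = ["rock", "pop", "indie", "british"]
computed = ["indie", "rock", "fun", "british"]
rho = spearman_on_common_tags(truth, computed)
print(rho)
```

Averaging this value over all artists for each setting of w gives one point per w on a curve such as the one in Figure 6.16.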
6.4 Experimental Results

Figure 6.16. Spearman’s correlation coefficient between the 224 artist tagging
and the Last.fm ground truth.
(Plot: Spearman’s rank correlation coefficient ρ, between 0.32 and 0.345, against
w from 0 to 1.)

The results of the labeling of artists with the tags as applied by Last.fm users
are modest. Given the difficulty of the task and the nature of the ground truth, we
are nevertheless encouraged by the results.
We observe that some frequently applied tags occur infrequently in web texts
(e.g. ‘i want to hear everything streamable by them’). Such tags were rarely
identified in the texts on the web. On the other hand, among the best scoring tags
we find terms that seem less descriptive but often occur on the web, for example
good, hot and fun.
Tagging Books
In this second experiment, we focus on books and their tags. Using the social
website LibraryThing.com, we create a ground truth for the 500 most popular books
on this website (Table 6.16 gives the top 25 at the moment of conducting the
experiment). After normalization, we reduced the size of the ground truth set of
tags to 286. The book titles have been slightly simplified by removing the text
after the colon (e.g. in Animal farm : a fairy story).
As two author-title combinations are less likely to co-occur within a sentence,
we gather the co-occurrence scores for the books in $I_a$ using DM. We query the
book title and the name of the author and gather the (at most) 100 resulting
documents. Again, the pages of the evaluation website are excluded. To identify
co-occurrences, we scan the documents only for the titles of the other books. The
identification of co-occurrences between tags and books is done in a similar
fashion.
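The document-based counting described above might look roughly like the sketch below. Several details are assumptions rather than facts from the text: the score is taken to be the number of retrieved documents in which another title occurs, matching is a case-insensitive substring test, the limit of 100 is applied before pages of the evaluation website are dropped, and the function name, URLs and texts are hypothetical. Retrieving the documents from a search engine is left out.

```python
from urllib.parse import urlparse


def document_cooccurrences(documents, other_titles, excluded_domain, max_docs=100):
    """Count, per other title, the retrieved documents that mention it.

    documents: list of (url, text) pairs returned for one book query.
    """
    counts = {title: 0 for title in other_titles}
    for url, text in documents[:max_docs]:
        host = urlparse(url).hostname or ""
        # Skip pages hosted on the evaluation website or its subdomains.
        if host == excluded_domain or host.endswith("." + excluded_domain):
            continue
        lowered = text.lower()
        for title in other_titles:
            # Assumption: a plain case-insensitive substring match is enough.
            if title.lower() in lowered:
                counts[title] += 1
    return counts


# Hypothetical documents retrieved for one book query.
docs = [
    ("https://example.org/a", "I loved The Hobbit and 1984."),
    ("https://www.librarything.com/work/1", "The Hobbit"),
    ("https://example.net/b", "Brave New World and the hobbit"),
]
counts = document_cooccurrences(
    docs, ["The Hobbit", "1984", "Brave New World"], "librarything.com"
)
print(counts)
```

The same function can be reused for tags by passing the tag list instead of the other book titles, in line with the remark that tags and books are handled in a similar fashion.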

1. Harry Potter and the Sorcerer’s Stone by J.K. Rowling (21,415)
2. Harry Potter and the Half-Blood Prince by J.K. Rowling (20,650)
3. Harry Potter and the Order of the Phoenix by J.K. Rowling (19,510)
4. Harry Potter and the Goblet of Fire by J.K. Rowling (18,658)
5. Harry Potter and the Chamber of Secrets by J.K. Rowling (18,638)
6. Harry Potter and the Prisoner of Azkaban by J.K. Rowling (18,567)
7. The Da Vinci code by Dan Brown (16,013)
8. The Hobbit by J.R.R. Tolkien (14,538)
9. 1984 by George Orwell (13,655)
10. The Catcher in the Rye by J.D. Salinger (13,363)
11. Pride and prejudice by Jane Austen (12,813)
12. To Kill a Mockingbird by Harper Lee (11,890)
13. The Great Gatsby by F. Scott Fitzgerald (11,331)
14. The Lord of the Rings by J.R.R. Tolkien (10,572)
15. Jane Eyre by Charlotte Bronte (9,847)
16. The Curious Incident of the Dog in the Night-Time by Mark Haddon (9,526)
17. Brave New World by Aldous Huxley (9,142)
18. Life of Pi : a novel by Yann Martel (9,071)
19. Animal Farm : a fairy story by George Orwell (8,967)
20. Angels & Demons by Dan Brown (8,799)
Table 6.16. The taste of the crowds: the most popular books among the
LibraryThing community in August 2007. The figures between parentheses reflect the
number of people claiming to own the book.
Spearman’s rank correlation coefficient is given in Figure 6.17. It shows that,
contrary to the previous experiments, the use of related books has a negative
effect on the correlation with the ground truth. Again, the correlation is
positive, but weak, with the best coefficients around 0.3.
Using w = 1, we also computed the average precision and recall for the top
25 tags in the ground truth set, see Figure 6.18. These results are also comparable
with the Last.fm experiment.
fiction, classic, novel, paperback, literature, 20th century, Favorites, American,
fantasy, hardcover, series, science fiction, english, british, american literature,
sf, Contemporary Fiction, Humor, contemporary, 1001 books
Table 6.17. The 20 most frequently applied tags on LibraryThing.com.

Figure 6.17. Spearman’s correlation coefficient between the computed tags and
the LibraryThing ground truth.
(Plot: Spearman’s rank correlation coefficient, between 0.18 and 0.32, against w
from 0 to 1; curves for n = 10, 25, 50, 100, 250 and 500.)

The results for the tagging experiments are open to improvement. Mika [2007]
proposes to compute a semantic distance between tags. In future work, such an
approach can be used both to identify a ‘cleaner’ ground truth and to identify
synonyms of tags. The use of learned synonyms may improve the performance as
many tags occur infrequently in unstructured sources on the web. Hence, the
assumption that instances can be linked to tags using occurrences of pairs of the
two in texts does not hold for these cases.
Future work should therefore focus on the identification of formulations of
tags in unstructured texts. Using an annotated training set of artists and tags we
can learn such formulations. Moreover, currently we assume the tags in the set I
