Information extraction from the web using a search engine Citation for published version (apa)


a
and a list of movements I_g in art using Wikipedia and map the two. From Wikipedia we extracted a set I_a of 1,280 well-known painters from the article List of painters and a set I_g of 77 movements in art from List of art movements⁹. We tested the performance of the algorithm on the subset of 160 painters who could be extracted from the Wikipedia pages describing movements (e.g. from the page on Abstract Expressionism). The other 1,120 painters are either not mentioned on the pages describing styles or are mentioned on more than one page. However, when computing similarities between the painters, we take all 1,280 painters into account. For the elements of I_g in this test no synonyms were added. For fairness, we excluded pages from the domain wikipedia.org in the search queries.

⁹ www.wikipedia.org. Both pages visited in April 2006.

6.4 Experimental Results

[Painter] synthetic [Movement]
[Painter] [Movement]
[Movement] artist [Painter]
[Movement] [Painter]
[Painter] and other [Movement]
[Painter] express [Movement]
[Painter] and [Movement]
[Painter] of the [Movement]
[Painter] tog initiativ til [Movement]
[Painter] uit de [Movement]
[Painter] experimenting with [Movement]
[Painter] and the [Movement]
[Painter] surrealism [Movement]
[Painter] arte [Movement]
Table 6.14. Best scoring learned patterns for painter - movement relation.

PAINTER-MOVEMENT
method         k = 0    best    (corresp. k)
PCM            0.35     0.35    (0)
PM             0.54     0.64    (18)
DM             0.65     0.81    (20)
PM-STEMMING    0.53     0.62    (28)
Table 6.15. Precision without related instances and best precision per method.
For PM, we selected learned patterns for the mapping between the elements in I_a and I_g. For learning, we used instance-pairs outside the test set. The best scoring patterns can be found in Table 6.14. For the relation between the instances in I_a, the patterns found were mostly enumeration patterns, e.g. “including b and”. The complete details of both experiments and the patterns used in PM can be found on the web page¹⁰. Due to the rareness of some of the painters and names of movements, we did not use any additional terms in the queries for DM or PCM.
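To illustrate how a learned pattern such as those in Table 6.14 can serve as a query template, the following minimal sketch fills a pattern with one painter-movement pair. The function name `build_query`, the quoting of the phrase, the `-site:` exclusion syntax and the example pair are assumptions made for illustration; they are not details taken from the experiments.

```python
def build_query(pattern, painter, movement, excluded_domains=()):
    """Fill a learned pattern with one instance pair and return a phrase query.

    The placeholders [Painter] and [Movement] follow Table 6.14; the
    '-site:' exclusion operator is an assumed search-engine syntax.
    """
    phrase = pattern.replace("[Painter]", painter).replace("[Movement]", movement)
    query = '"' + phrase + '"'
    for domain in excluded_domains:
        query += " -site:" + domain
    return query


# Illustrative pair only; any painter from I_a and movement from I_g could be used.
patterns = ["[Painter] of the [Movement]", "[Movement] artist [Painter]"]
queries = [
    build_query(p, "Jackson Pollock", "Abstract Expressionism", ["wikipedia.org"])
    for p in patterns
]
```

Each resulting string would be submitted to the search engine once per instance pair, which is why the number of patterns and candidate pairs determines the query load of PM.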
In Table 6.15 the performance of the initial mapping m_0 can be found for the three methods (k = 0). The experiments show that in general the use of related instances improves the categorization (see Table 6.15 and Figure 6.13). It shows again that the methods with the lowest Google Complexity, thus PM and DM, perform better than PCM.
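The role of k can be made concrete with a small sketch. The exact rule for combining an instance's initial mapping with those of its related instances is defined in an earlier section that is not reproduced here, so the code below assumes a simple majority vote over the instance itself and its k most related instances. The names `categorize` and `precision`, the tie-breaking rule and the toy data are all hypothetical.

```python
from collections import Counter


def categorize(instance, initial_mapping, related, k):
    """Assumed majority vote over the instance's own initial category and
    the initial categories of its k most related instances (k = 0 returns
    the initial mapping unchanged)."""
    votes = Counter()
    own = initial_mapping.get(instance)
    if own is not None:
        votes[own] += 1
    for neighbour in related.get(instance, [])[:k]:
        category = initial_mapping.get(neighbour)
        if category is not None:
            votes[category] += 1
    if not votes:
        return None
    best = max(votes.values())
    # Assumed tie-break: keep the instance's own initial category.
    if own is not None and votes[own] == best:
        return own
    return next(c for c, n in votes.most_common() if n == best)


def precision(mapping, ground_truth):
    """Fraction of test instances mapped to their ground-truth category."""
    correct = sum(1 for i, c in ground_truth.items() if mapping.get(i) == c)
    return correct / len(ground_truth)


# Toy data: p1 is initially mislabelled, its two most related instances are not.
m0 = {"p1": "A", "p2": "B", "p3": "B", "p4": "A"}
related = {"p1": ["p2", "p3", "p4"]}
truth = {"p1": "B"}
by_k = {k: precision({"p1": categorize("p1", m0, related, k)}, truth) for k in range(4)}
```

In this toy case the precision is 0 for k = 0, rises to 1 for k = 2 and falls back to 0 for k = 3. Reporting a best precision together with its corresponding k, as in Table 6.15, reflects this kind of dependence.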
Although in the painter-movement experiment the number of categories identified (77) is much larger than in the previous experiment (16), the performance of PM and especially DM is still good. The results of PCM indicate that when the precision of the intermediate mapping is low (35%), the use of related instances does not improve the results. In this experiment we even observe a deterioration of the performance. Here DM clearly outperforms PM. This can be explained by the fact that using PM considerably fewer painter-movement pairs could be extracted. We expected the recall of PM to increase when applying stemming on the names of movements and the texts extracted [Porter, 1980]. Although the number of pairs extracted slightly increases, the precision does not improve (Table 6.15).

¹⁰ http://gijsg.dse.nl/webconmine/

Figure 6.13. Precision for categorization of the painters. (Plot of precision against k for the painter-movement categorization; curves: dm, pm, pm-stem, pcm.)
6.4.3 Tagging Instances
In this subsection, we focus on two case-studies on the tagging of instances related to the methods described in Section 6.2.3. We compare the extracted lists of tags with ground truth extracted from a social website. No previous work is known to us in this field. We therefore present two exploratory studies in the automatic tagging of instances. In the first experiment, we tag the set of 224 artists and evaluate the tagging using Last.fm. The second experiment focusses on books, where the results are compared with data from LibraryThing.com.
Tagging Musical Artists
In this experiment, we focus on the tagging of the 224 artists as done by the Last.fm community using the method described in Section 6.3. Using a large set of artists, we select the 248 most frequently applied tags after the normalization procedure¹¹. We investigate whether our method is well suited to label the 224 artists and compare the results with the tags as applied by the Last.fm users.

¹¹ The list of tags used can be found at http://gijsg.dse.nl/tags224.html
The previous experiments showed that PM was the most successful alternative to identify artist similarities, while DM outperformed PM with respect to the labeling of artists with genre names. We hence use DM to find the co-occurrences between artist names and tags and reuse the results from PM to identify the artist similarities. For fairness, the pages from Last.fm and Audioscrobbler.com are excluded from the search results.
Per artist in the test set, an average of 79 tags was identified using DM. All tags in the test set were linked to at least one artist; however, a score could not be identified for every tag/artist combination, as not all artists are related to one another. We compare the computed ranking of the tags for the artists with a normalized ranking as identified by the Last.fm users as described in Section 6.3. For instance, the terms 'Rocker' and 'rock' have the same normalized form.
We evaluate the computed rankings for the different values of w as follows. We first evaluate the precision and recall for the highest ranked tags and secondly compute Spearman's rank correlation between the computed ranking and the one from Last.fm.
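Both evaluation measures can be sketched in a few lines. The code below assumes the standard definitions: precision and recall of the n highest ranked computed tags against a reference tag set, and Spearman's rho for two tie-free rankings, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), computed over the tags that occur in both rankings. The function names and the example tags are hypothetical, and the exact selection of tag sets used in the experiments may differ.

```python
def precision_recall_at_n(computed_ranking, reference_tags, n):
    """Precision and recall of the n highest ranked computed tags
    with respect to a set of reference tags."""
    top = computed_ranking[:n]
    hits = sum(1 for tag in top if tag in reference_tags)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(reference_tags) if reference_tags else 0.0
    return precision, recall


def spearman_rho(ranking_a, ranking_b):
    """Spearman's rank correlation for two rankings without ties,
    restricted to the tags present in both rankings."""
    in_b = set(ranking_b)
    common = [tag for tag in ranking_a if tag in in_b]
    n = len(common)
    if n < 2:
        raise ValueError("need at least two common tags")
    rank_a = {tag: i for i, tag in enumerate(common)}
    rank_b = {tag: i for i, tag in enumerate(t for t in ranking_b if t in rank_a)}
    d_squared = sum((rank_a[tag] - rank_b[tag]) ** 2 for tag in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Illustrative rankings only, not data from the experiments.
computed = ["rock", "indie", "pop", "jazz"]
reference = ["rock", "pop", "indie", "jazz"]
p, r = precision_recall_at_n(computed, {"rock", "pop", "metal"}, 2)
rho = spearman_rho(computed, reference)
```

Here one of the two top-ranked tags is in the reference set, so the precision is 0.5 and the recall is 1/3, and swapping two neighbouring tags in a four-tag ranking gives rho = 0.8.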
Precision and Recall. We selected the set S
