Information extraction from the web using a search engine


a
with
ambiguous terms.
We computed the artist similarities using T. Since no ground truth is available for this collection of artists, we cannot evaluate the precision. In Table 6.11 we give the most related artists for six of the artists in the collection, using the method discussed in Section 6.2.1.
We observe that the artists Tool, Live and Fish frequently occur among the most related artists. Especially for less famous artists, where the data is sparse, these artists are often found among the nearest neighbors. Tool is one of the 5 most similar artists for 1227 of the 1731 other artists, Live is in the top 5 for 1334 of them, and Fish for 724.
Unlike most of the artist names in the commonly used evaluation sets, these frequently occurring artist names are very ambiguous. A number of examples of irrelevant snippets due to artist name ambiguity can be found in Table 6.9.
In Section 3.2 we addressed this problem by using Google's define functionality. We use the number of definitions as an estimator for the probability that the term indeed reflects the intended instance.

ARTIST    BASELINE    p_lin    p_sqrt
Live      1227        1        54
Tool      1334        0        642
Fish      724         0        7
Juli      691         1251     1207

Table 6.10. Number of times an ambiguous artist name occurs among the top 5 nearest neighbors of the 1731 other artists.
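The counts in Table 6.10 could be reproduced along the following lines. This is a minimal sketch, not the implementation used for the experiments: the function names, the dictionary layout `sim[(a, b)]` (similarity of candidate `b` as seen from artist `a`) and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def top_k_neighbors(sim, artist, k=5):
    """Return the k artists most similar to `artist`.

    `sim` maps (artist, candidate) pairs to a similarity score
    (hypothetical layout). Ties are broken alphabetically.
    """
    candidates = [(b, s) for (a, b), s in sim.items() if a == artist and b != artist]
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [b for b, _ in candidates[:k]]


def neighbor_counts(sim, artists, k=5):
    """Count how often each artist appears in the top-k lists of the others."""
    counts = Counter()
    for artist in artists:
        counts.update(top_k_neighbors(sim, artist, k))
    return counts


# Toy data with three placeholder artists, not taken from the collection.
toy_sim = {
    ("a", "x"): 0.9, ("a", "b"): 0.1,
    ("b", "x"): 0.8, ("b", "a"): 0.2,
    ("x", "a"): 0.5, ("x", "b"): 0.4,
}
toy_counts = neighbor_counts(toy_sim, ["a", "b", "x"], k=1)
```

An artist whose count approaches the number of other artists, as in the BASELINE column, is a candidate for name ambiguity rather than genuine similarity.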
Ideally, for each occurrence of an artist name in a text we want to observe whether the occurrence indeed reflects the intended artist. However, the automatic parsing of sentences is troublesome, as the snippets contain broken sentences and may be multilingual. Moreover, if an artist name is identified as a subject or object within a sentence, we still do not know whether the term indeed reflects the artist.
We therefore aim for a method where we estimate the probability that a term a indeed reflects the intended artist named a. Using functions $p_{\mathrm{lin}}$ (equation (3.3) on page 43) or $p_{\mathrm{sqrt}}$ (equation (3.4)), we estimate the relatedness between two instances as follows,

\[
T'(i, i') = \frac{co'(i, i')}{\sum_{i'',\, i'' \neq i'} co'(i'', i')}, \tag{6.9}
\]

with

\[
co'(i, i') = co(i, i') \cdot p(i) \cdot p(i'). \tag{6.10}
\]

Note that for $p(i) = p(i') = 1$, we have the baseline function $T(i, i')$.
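Equations (6.9) and (6.10) can be illustrated with a short sketch. The function name, the dictionary layout and the toy numbers below are assumptions for illustration only. The probabilities `p` are taken as given input, since the definitions of p_lin and p_sqrt (equations (3.3) and (3.4)) are not repeated in this section.

```python
from collections import defaultdict


def corrected_relatedness(co, p):
    """Compute T'(i, i') from co-occurrence counts and ambiguity estimates.

    co: dict mapping (i, i') to the co-occurrence count co(i, i')
    p:  dict mapping each instance to its estimated probability p(i)
    """
    # Equation (6.10): co'(i, i') = co(i, i') * p(i) * p(i')
    co_corr = {(i, j): c * p[i] * p[j] for (i, j), c in co.items() if i != j}

    # Denominator of (6.9): sum over all i'' != i' of co'(i'', i')
    totals = defaultdict(float)
    for (i, j), c in co_corr.items():
        totals[j] += c

    # Equation (6.9); pairs with a zero denominator are left out
    return {(i, j): c / totals[j] for (i, j), c in co_corr.items() if totals[j] > 0}


# Toy counts for three placeholder instances A, B and C.
toy_co = {("A", "B"): 10, ("C", "B"): 30}

# With p = 1 everywhere, T' reduces to the baseline T.
baseline = corrected_relatedness(toy_co, {"A": 1.0, "B": 1.0, "C": 1.0})

# Halving p for the (assumed ambiguous) instance C shifts weight towards A.
corrected = corrected_relatedness(toy_co, {"A": 1.0, "B": 1.0, "C": 0.5})
```

In the toy data, the relatedness of A to B rises from 10/40 to 10/25 once the co-occurrences with C are discounted, which is the intended effect of the correction on ambiguous names.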
In this section we investigate the effect of the ambiguity correction on the performance on the test sets.
For both the sets of 224 and 1995 artists, we collected the number of definitions for all the artist names. We recomputed the artist similarities using the linear approach $p_{\mathrm{lin}}$ and the square root approach $p_{\mathrm{sqrt}}$ and compared the two with the baseline.
An alternative approach is to explicitly add terms such as 'music' to the query expression. However, this approach leads to fewer snippets, while the snippets returned contained fewer related instances.

[Figure: "Precision for k-NN Artist Similarity". Horizontal axis: k (0 to 30); vertical axis: precision (0.3 to 0.9); curves: baseline, linear, sqrt.]
Figure 6.9. Precision for the set of 224 artists using the three ambiguity estimators.

We present the results for the sets $I_{224}$ and $I_{1995}$ in Figures 6.9 and 6.10. For the set of 224 artists, the performance of the methods using disambiguation is slightly lower than that of the baseline approach. This result is expected, as no ambiguous terms occur in the set of 224 artists. For the set of 1995 artists, however, the results improve using either the linear or the sqrt approach. We note that, contrary to the set of 224 artists, the 1995 set does contain some ambiguous names such as Autograph, Gamma Ray and Hypocrisy.
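A precision curve such as the one in Figure 6.9 could be computed as sketched below. The evaluation criterion itself is defined elsewhere in the thesis and is not restated in this section, so the sketch rests on an assumption: a neighbor counts as correct when it carries the same ground-truth label (for example a genre) as the query artist. All names and data are hypothetical.

```python
def precision_at_k(neighbors, labels, k):
    """Average fraction of the k nearest neighbors sharing the query's label.

    neighbors: dict mapping each artist to a ranked list of other artists,
               most similar first
    labels:    dict mapping each artist to a ground-truth label (assumed)
    """
    per_artist = []
    for artist, ranked in neighbors.items():
        top = ranked[:k]
        if not top:
            continue  # skip artists without any neighbors
        hits = sum(1 for other in top if labels[other] == labels[artist])
        per_artist.append(hits / len(top))
    return sum(per_artist) / len(per_artist) if per_artist else 0.0


# Toy ranking over three placeholder artists.
toy_neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
toy_labels = {"a": "rock", "b": "rock", "c": "jazz"}

# One point of the curve per value of k.
curve = {k: precision_at_k(toy_neighbors, toy_labels, k) for k in (1, 2)}
```

Evaluating the function once per similarity variant (baseline, linear, sqrt) and per k would give three curves comparable to those in the figure, under the stated assumption about what counts as a correct neighbor.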
For the set of 1732 artists in our own collection, we compare the number of times that ambiguous artist names occur among the 5 nearest neighbors of the other artists (Table 6.10). We note that for the term Juli only one definition is found. Although the distribution of ambiguous names is quite different for $p_{\mathrm{lin}}$ and $p_{\mathrm{sqrt}}$, we cannot draw conclusions on which approach is better suited, as currently no ground truth for artist similarity ranking is available. Hence, a ground truth data set for such a diverse collection with ambiguous artist names is needed. With such a set, we can obtain better insight into the quality of web information extraction methods for these purposes.
6.4.2 Categorizing Instances
In this subsection, we focus on experiments that address the categorization of the
instances in I
