Information extraction from the web using a search engine

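\[
\begin{aligned}
C &= \{c_{\text{Director}},\ c_{\text{Actor}},\ c_{\text{Movie}}\},\\
R &= \{r_{\text{acts-in}},\ r_{\text{directed}}\},\\
I_{\text{Director}} &= \{\text{Steven Spielberg},\ \text{Francis Ford Coppola}\},\\
I_{\text{Actor}} &= \emptyset,\\
I_{\text{Movie}} &= \emptyset,\\
r_{\text{acts-in}} &= (\text{acts in},\ c_{\text{Actor}},\ c_{\text{Movie}},\ \phi,\ \emptyset),\\
r_{\text{directed}} &= (\text{directed},\ c_{\text{Director}},\ c_{\text{Movie}},\ \phi,\ \emptyset).
\end{aligned}
\]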
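
As a rough illustration, the initial ontology defined above could be held in memory as follows. This is only a sketch: the `Relation` dataclass and its field names are hypothetical, the fourth tuple component ϕ is left out because its definition lies outside this section, and the fifth component is assumed to be the (initially empty) set of relation instances.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """Hypothetical container for one relation of the ontology."""
    name: str           # surface name, e.g. "acts in"
    subject_class: str  # class of the first argument
    object_class: str   # class of the second argument
    # Assumed reading of the last tuple component: the relation
    # instances, empty at the start of the experiment.
    instances: set = field(default_factory=set)


# Known instances per class; only Director is seeded.
instances = {
    "Director": {"Steven Spielberg", "Francis Ford Coppola"},
    "Actor": set(),
    "Movie": set(),
}

relations = {
    "acts-in": Relation("acts in", "Actor", "Movie"),
    "directed": Relation("directed", "Director", "Movie"),
}
```
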
We thus identify only three classes, each of which is incomplete. Director is
the only class for which instances are provided. For the two relations, no
relation instances are given. The goal is to identify movies directed by these
directors using patterns expressing the directed relation. The movies found are
used to find starring actors, and those actors are in turn the basis of the
search for other movies in which they played, and so on. In this experiment, we
focus on the population of O in one iteration, thus using a predefined set of
patterns. We extract the information from the top 100 snippets returned by
Google.
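
The alternation between directors, movies and actors described above can be read as a worklist procedure. The text does not spell out the control flow, so the sketch below is one possible reading. The function `populate` and the callback `extract` are hypothetical names; `extract` stands in for querying the search engine with the fixed patterns and recognizing instances in the snippets.

```python
from collections import deque


def populate(seed_directors, extract):
    """Grow the three classes from a set of seed directors.

    `extract(cls, instance)` is a caller-supplied (hypothetical) function
    that returns an iterable of (class_name, instance) pairs found when
    querying with `instance`.
    """
    found = {"Director": set(seed_directors), "Actor": set(), "Movie": set()}
    queue = deque(("Director", d) for d in seed_directors)
    while queue:
        cls, inst = queue.popleft()
        for new_cls, new_inst in extract(cls, inst):
            # Only instances not seen before are queued for querying.
            if new_inst not in found[new_cls]:
                found[new_cls].add(new_inst)
                queue.append((new_cls, new_inst))
    return found
```

Because every instance is queued at most once, the loop terminates as soon as no new instances are found.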
For the two relations considered, we have manually selected the following
patterns and placeholders.
\[
\begin{aligned}
P_{\text{acts-in}} &= \{\ \text{[Movie] starring [Actor]}\ \},\\
P_{\text{directed}} &= \{\ \text{[Director]'s [Movie]},\quad \text{[Movie], director [Director]}\ \}.
\end{aligned}
\]
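
To make the role of the patterns concrete, the following sketch turns a pattern and a known instance into a search phrase and records on which side of the phrase the unknown instance is expected. The helper `build_queries`, the stripping of commas, and the use of quoted phrase queries are assumptions made for illustration.

```python
import re

# The manually selected patterns listed above.
PATTERNS = {
    "acts-in": ["[Movie] starring [Actor]"],
    "directed": ["[Director]'s [Movie]", "[Movie], director [Director]"],
}


def build_queries(relation, known_class, instance):
    """Return (query, target_class, side) triples for one known instance.

    `side` says whether the target is expected in the text following or
    preceding the queried phrase.
    """
    queries = []
    slot = f"[{known_class}]"
    for pattern in PATTERNS[relation]:
        if slot not in pattern:
            continue
        filled = pattern.replace(slot, instance)
        match = re.search(r"\[(\w+)\]", filled)  # the remaining placeholder
        if match is None:
            continue
        before = filled[:match.start()].strip(" ,")
        after = filled[match.end():].strip(" ,")
        if before:
            queries.append((f'"{before}"', match.group(1), "following"))
        else:
            queries.append((f'"{after}"', match.group(1), "preceding"))
    return queries
```

For the seed director Steven Spielberg, this yields the phrases "Steven Spielberg's" (movie expected after the phrase) and "director Steven Spielberg" (movie expected before it).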
Instance identification. For all three classes, as a placeholder, we use the
remaining part of the sentence preceding or following the queried expression. We
do so because multiple instances of the same class are often enumerated (e.g. in
the sentence “Titanic starring Leonardo Di Caprio and Kate Winslet.”). As actors
and directors are generally both persons, we apply the same recognition rules to
these two classes.
As the structure of instances of Movie is less regular, we focus on the context
of such instances in the texts. We recognize an instance of Movie if it is
placed between quotation marks. A person’s name (an instance of the class
Director or Actor) is recognized as the longest sequence of two or three
capitalized words.
Another feature of the recognition function is the use of tabu words (also
called stop words in the literature). An extracted term is rejected as an
instance if it contains one of the tabu words. We use a list of about 90 tabu
words for person names (containing words like ‘DVD’ and ‘Biography’). For movie
titles we use a much shorter list, since movie titles can be much more diverse.
We have constructed the tabu word lists based on the output of a first run of
the algorithm.
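
The recognition rules above can be approximated with two regular expressions and a tabu filter. This is a minimal sketch under simplifying assumptions: a "capitalized word" is taken to be an uppercase ASCII letter followed by word characters or hyphens, the tabu sets contain only the two words quoted in the text (the full lists are not given), and all names in the code are hypothetical.

```python
import re

# Longest run of two or three capitalized words (greedy quantifier).
PERSON_RE = re.compile(r"\b[A-Z][\w-]*(?: [A-Z][\w-]*){1,2}")
# Text between straight or curly double quotation marks.
MOVIE_RE = re.compile(r'["“]([^"“”]+)["”]')

# Only the two examples named in the text; the real list has about 90 words.
PERSON_TABU = {"dvd", "biography"}
# The much shorter movie list is not given in the text; left empty here.
MOVIE_TABU = set()


def contains_tabu(term, tabu):
    """True if any word of `term` is a tabu word (case-insensitive)."""
    return any(word.lower() in tabu for word in term.split())


def find_persons(text):
    """Candidate Director/Actor instances in `text`."""
    return [m.group(0) for m in PERSON_RE.finditer(text)
            if not contains_tabu(m.group(0), PERSON_TABU)]


def find_movies(text):
    """Candidate Movie instances: quoted spans that pass the tabu filter."""
    return [title.strip() for title in MOVIE_RE.findall(text)
            if not contains_tabu(title, MOVIE_TABU)]
```

Applied to the text following "Titanic starring" in the example sentence, `find_persons` returns both enumerated names, whereas a span such as "Steven Spielberg Biography" is rejected as a whole because it contains a tabu word.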
We check each of the extracted candidate instances using one of the following
queries: “The movie [Movie]”, “[Actor] plays”, or “[Director] directed”. A
candidate is accepted if the number of hits for the query exceeds a threshold.
After some tests we chose 5 as the threshold value, since this threshold
filtered out not only false instances but also most of the common spelling
errors in true instances.
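
The acceptance check can be sketched as follows. The hit counter is passed in as a function because no particular search API is implied by the text; `hit_count`, `accept` and the quoting of the query are assumptions made for illustration.

```python
# Check queries per class, as listed in the text.
CHECK_QUERIES = {
    "Movie": "The movie {}",
    "Actor": "{} plays",
    "Director": "{} directed",
}
THRESHOLD = 5  # value chosen in the text after some tests


def accept(candidate, cls, hit_count):
    """Accept `candidate` if its check query has more than THRESHOLD hits.

    `hit_count(query)` is a caller-supplied (hypothetical) function that
    returns the number of search-engine hits for a query string.
    """
    query = '"' + CHECK_QUERIES[cls].format(candidate) + '"'
    return hit_count(query) > THRESHOLD
```

The comparison is strict, following the wording that the number of hits must exceed the threshold.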
Experimental results. We have found 7,000 instances of the class Actor, 3,300
of Director and 12,000 of Movie. The total number of retrieved instances
increases by about 7% when 500 query results are used instead of 100.
We first ran the algorithm with the names of two (well-known) directors as
input: Francis Ford Coppola and Steven Spielberg. Afterwards, we experimented
with other, less famous directors as input.
An interesting observation is that the outputs are independent of the input sets.
That is, when we take a subset of the output of an experiment as the input of
another experiment, the outputs are the same, modulo some small differences due
to the changes in the Google query results over time.
When we analyze the precision of the results, we use the data from the Internet
Movie Database (IMDb, http://www.imdb.com) as a reference. An instance in the
populated ontology is accepted as correct if it can be found in IMDb. We have
manually checked three sequences of 100 instances (at the beginning, middle and
end of the generated file) of each class. Based on the exact matches with the
entries in IMDb, we estimate a precision of 0.78. Most misclassified instances
were misspellings or different formulations of the same entity (e.g. “Leo
DiCaprio” and “Leonardo DiCaprio”). Other identified instances, like James Bond
(found as an instance of both Movie and Actor) and Mickey Mouse (found as an
actor in Fantasia 2000), were related to the movie domain, but are not instances
of the intended classes. It is also notable that not all instances found relate
to distinct concepts. For example, The Karate Kid as well as Karate Kid were
identified as movies; likewise, Leo DiCaprio and Leonardo DiCaprio were added to
the class Actor, while Hable con Ella and Talk to Her are alternative titles for
the same movie.
We have also analyzed the precision of the relations. We estimate the precision
of the relation between movie and director at around 0.85, and between movie and
actor at around 0.90.

CATEGORY        RECALL
Best Actor      0.96
Best Actress    0.94
Best Director   0.98
Best Picture    0.87
Table 4.1. Recall of Academy Award Winners
With respect to the recall of the algorithm, we first observe that the number of
entries in IMDb exceeds our ontology by far. Although our algorithm performs
especially well on recent productions, we are also interested in how well it
performs on classic movies, actors and directors. First, we made lists of all
Academy Award winners (1927-2005) in a number of relevant categories, and
checked the recall (Table 4.1).
IMDb has a top 250 of best movies ever, of which 85% were retrieved. We
observe that results are strongly oriented towards Hollywood productions. We also
made a list of all winners of the Cannes Film Festival, the ‘Palme d’Or’. Alas, our
algorithm only extracted 26 of the 58 winning movies in this category.
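
The recall figures above are fractions of a reference list that also occur in the populated ontology. A minimal sketch of that computation, assuming exact string matching between the two sets (the function name `recall` is hypothetical):

```python
def recall(retrieved, reference):
    """Fraction of `reference` items that also occur in `retrieved`."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference list must not be empty")
    return len(reference & set(retrieved)) / len(reference)
```

For the Palme d'Or list, 26 of 58 winners corresponds to a recall of roughly 0.45. With exact matching, a movie retrieved only under an alternative title would count as a miss.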
Sumida et al. [2006] used a large Japanese web corpus to identify a list of movie
titles. The texts are scanned for hyponym patterns [Hearst, 1992], with phrases
like “Movies such as”. Movie titles are extracted by detecting text between the
Japanese variant of quotation marks. They report a precision of 83%. KnowItAll
[Etzioni et al., 2005] uses hyponym patterns to find actors and movie titles, as
well as the patterns “[Actor] stars in [Movie]” and “[Actor] star of [Movie]”.
Noun phrases are extracted as candidate instances. These candidate instances are
subsequently checked using additional queries. The focus of the evaluation is on
the population of the classes, rather than on the identification of relation
instances. Precision and recall are formulated in terms of the texts processed,
rather than using a ground-truth ontology. For instances of the class Actor,
high precision is obtained for various levels of recall. The precision for
instances of Movie found with KnowItAll is lower.
4.2 Identifying Burger King and its Empire
Inspired by one of the list questions in the 2004 Text Retrieval Conference (TREC)
Question Answering track [Voorhees, 2004] (‘What countries is Burger King
located in?’), we are interested in populating an ontology with restaurants and the

PATTERN    PREC    SPR    FREQ
