“All science is either physics or stamp collecting.”
— Ernest Rutherford.
Whether we want to know the name of Canada’s capital or gather opinions on
Philip Roth’s new novel, the World Wide Web is currently the de facto source for
finding an arbitrary piece of information. In an era where a community-based source
such as Wikipedia is found to be as accurate as the Encyclopaedia Britannica [Giles,
2005], the collective knowledge of internet contributors is an unsurpassed collection
of facts, analyses and opinions. This knowledge makes it easier for people to gather
facts, form an opinion or buy a cheap and reliable product.
With its rise in the late nineties, the web was intended as a medium to distribute
content among an audience. Like newspapers and magazines, the communication
was mostly one-way. The content published on the web was presented in an often
attractive format and lay-out, using a natural language (e.g. Dutch) with which the
audience is most acquainted.
Nowadays, only a few years later, the web is a place where people can easily
contribute, share and reuse thoughts, stories or other expressions of creativity. The
popularity of social web sites enriches the information available on the web. This
mechanism turned the web into a place where people can form nuanced opinions
about virtually any imaginable subject.
To enable people to share and reuse content, such as the location of that great
Vietnamese restaurant in Avignon on Google Maps, information on the web is
currently not only presented in a human-friendly fashion, but also in formats that
allow machines to interpret it. The so-called Social Web, or Web 2.0, enables people
to easily create and publish content. Moreover, content can be easily reused and
combined.
A movement parallel to the social web is the semantic web. The semantic web
community has created a dedicated formal language to express concepts, predicates
and relations between concepts. Using this mathematical language for general
information, knowledge can be expressed on every imaginable topic. The semantic
web can be seen as a distributed knowledge base. Instead of browsing through web
pages, the semantic web enables direct access to information.
The more information is already expressed in the semantic web languages, the
easier it becomes to represent new information. For example, to model the concept
First Lady of the United States, it may be necessary to first model the concepts
country, United States, person, president, married, time, period and so on. The reuse
of previously defined notions makes the content of the semantic web richer, as
content created by various parties can be linked and combined.
In the late sixties in Eindhoven, N.G. de Bruijn and his group developed Automath,
a mathematical language and system for mathematics [De Bruijn, 1968; Nederpelt,
Geuvers, & De Vrijer, 2004]. Automath is a dedicated formal language to express
mathematics. The project can be seen as an attempt to formulate and propagate a
universal language for mathematics that is checked by a system. Such languages
serve two goals. On the one hand, they are a means to ensure mathematical
correctness. If a theorem is provided with a proof in the mathematical language,
and the well-designed system accepts this proof, then the theorem can be considered
to be true. On the other hand, the language provides a means of clear and
unambiguous communication.
Białystok, Poland, the home town of the constructed language Esperanto, is
the base of one of the most active projects on formal mathematical languages. The
Mizar system builds on a set of axioms, from which a collection of mathematics has
been formalized (i.e. derived) throughout the years. Although the Mizar team has
succeeded in completely formalizing an entire handbook on continuous lattices (by
16 authors in 8 years’ time), the formalization of an elementary theory in another
mathematical subject (group theory) proved to be too ambitious [Geleijnse, 2004].
In spite of the work done by the semantic web and formal mathematics re-
searchers, both mathematicians and web publishers prefer natural language over
dedicated artificial languages to express their thoughts and findings. In mathe-
matics, dedicated researchers are formalizing (or translating) definitions, theorems
and their proofs into formal languages. The translation of mathematics into formal
languages was the topic of my 2004 master’s thesis. In this thesis, I will discuss
approaches to capture information on the web in a dedicated formalism. Although
both topics may be closer to stamp collecting than to physics, I do hope that you
will enjoy this work.
1.1 Information on the Web
In this thesis, we focus on information that is represented in natural language texts
on the web. We make use of the text itself rather than of the formatting. Hence,
we extract information from unstructured texts rather than from formatted tables
or XML. Although some web sites may be more authoritative than others, we do
not distinguish between sources as such.
Suppose we are interested in a specific piece of information, for example the
capital of Australia. Nowadays, the web is an obvious source to learn this and
many other facts. The process of retrieving such information generally starts with
the use of a search engine, for example
Google or perhaps the search engine in
Wikipedia. As we are unaware of the name of Australia’s capital, we query for
terms that can be expected to co-occur with this specific piece of information. The
term Australia is of course a good candidate, but the combination of the words
Australia and capital is more likely to lead to relevant pages.
The everyday internet user has learned to formulate effective search engine
queries. However, the fact
‘Canberra is the capital of Australia’ still has to be
identified within the search results. The search engine returns documents that are
likely to reveal this information, but we have to search the retrieved documents for
the fact itself.
To understand a text, we have to be able to parse the sentences, know the precise
semantics of the words, recognize co-references, read between the lines, resolve
ambiguities, and so on. Hence, this is not a trivial task for machines.
The study of information extraction addresses a subproblem of document (or,
text) understanding: the identification of instances of classes (e.g. names of per-
sons, locations or organizations) and their relations in a text (e.g. the expressed
relation between
Canberra and Australia). In this thesis we study how information
extraction can be applied to a specific text corpus: the web.
In this thesis, we focus on the following problem. We are given a domain of
interest, expressed using classes and relations. The goal is to extract information
from unstructured texts on the web. We first find relevant texts on the web using
a search engine. Having retrieved a collection of relevant texts, we focus on two
information extraction tasks. On the one hand we are interested in discovering and
extracting instances of the given classes, while on the other hand we extract
relations between such instances. The extracted information is stored in a structured,
machine-interpretable format.
With structured information available, we can easily find the information we
are interested in. The extracted information can be used in intelligent applications,
e.g. in recommender systems to acquire additional metadata. This metadata can
be used to make meaningful recommendations for music or TV programs. For
example, suppose a user has expressed a preference for TV programs relating to
France. The recommender system may be able to recognize regions such as
Languedoc and Midi-Pyrénées and cities such as Cahors and Perpignan using the
extracted information. Likewise, if the user has expressed a preference for French
music, the system will be able to recognize the names of artists like Carla Bruni
and Charles Aznavour.
1.1.1 Structured Information on the Web
Of course, not all information on the web is unstructured. As alternative sources
for information, we distinguish the following three structured representations of
information on the web.
• The semantic web and other XML-based languages. Pages written in these
languages form dedicated subparts of the web for machine-interpretable
information. Information represented in these formats can fairly easily be
extracted.
• Web sites with a uniform lay-out. Large web sites that make use of a
database typically present their content in a uniform lay-out. For example,
the Amazon page for a CD by Jan Smit has a lay-out similar to that of the
page for Spice by The Spice Girls. Hence, given a page within Amazon,
we can easily identify the title, price, reviews and other information based
on the lay-out.
• Tables and other formatted elements inside web pages. The columns of a
table typically store similar elements. For example, if multiple terms from
one column are known to be soccer players, all other terms in the column
can be expected to be soccer players as well, as illustrated in the sketch
below.
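The following is a minimal sketch of this idea, not an implementation from this
thesis; the function, threshold and example names are purely illustrative.

    # Hypothetical sketch: if enough cells of a table column are known instances
    # of a class, assume the remaining cells belong to that class as well.

    def label_column(column_cells, known_instances, min_matches=2):
        """Return all cells as candidate instances if the column matches the class."""
        matches = [cell for cell in column_cells if cell in known_instances]
        if len(matches) >= min_matches:
            # Propagate the class label to the whole column.
            return set(column_cells)
        return set()

    known_soccer_players = {"Johan Cruijff", "Marco van Basten"}
    column = ["Johan Cruijff", "Marco van Basten", "Dennis Bergkamp"]
    print(label_column(column, known_soccer_players))
    # -> all three cells become candidate soccer players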
When we are interested in information that is available on the web, from a
practical point of view, the use of unambiguous structured information is always
preferred over the extraction of information from unstructured texts. However, as
not all information is available in such a manner, web information extraction –
from unstructured texts – is a relevant research topic.
1.1.2 The Social Web and its Potential
The web as we know it today enables us to get a nuanced view on products, events,
people and so on. The internet community can easily create content in the form of
weblogs, comments, reviews, movies, images and so on. All this information can
be used to form an opinion, or to help in, for example, selecting the right mattress
to buy or the right book to read. Although the content provided by amateurs may undermine
the influence of journalists, critics and other professionals [Keen, 2007], we can
learn from the
collective knowledge of the web contributors.
Where the semantic web merely focusses on representing the facts of life, the
social web touches on a more vague or abstract representation of knowledge: the
‘wisdom of the crowds’. This collective knowledge can be seen as a sign of the
times, or a general opinion about a subject.
1.2 Information Extraction and Web Information Extraction
Web Information Extraction (WIE) is the task of identifying, structuring and combining
information from natural language texts on the web. Given a domain of interest,
we want to create a knowledge base on this topic.
As information gathering from structured sources is in general easier and more
reliable than the use of unstructured texts, web information extraction is particu-
larly interesting for the following information demands.
- Information that cannot be extracted from structured or semi-structured
sources, such as XML documents, single web sites or tables, but is spread
across various web pages.
- Information that is expected to be present on the web. Obviously, we
cannot extract information that is not present in the studied corpus. Hence,
we can say in general that web information extraction is suited for all topics
that people write about.
1.2.1 A Comparison between Web Information Extraction and Traditional
Information Extraction
Information extraction (IE) is the task of identifying instances (named entities and
other terms of interest) and relations between those instances in a collection of
texts, called a text corpus. In this work, instances can be terms and other linguistic
entities (e.g.
twentieth president, guitarist, sexy) as well as given names (e.g. The
Beatles, Eindhoven, John F. Kennedy).
For example, consider the following two sentences.
George W. Bush is the current president of the United States. He was born
in New Haven, CT.
We may consider
George W. Bush, current president, the United States and
New Haven, CT to be instances in the presented example. A task in information
extraction could be to isolate these terms and identify their
class, or the other way
around: when given a class (e.g.
Location), find the instances.
As we deal with natural language, ambiguities and variations may occur. For
example, one can argue that the sequence
president of the United States is a pro-
fession rather than
the current president or current president of the United States.
Apart from identifying such entities, a second information extraction task may
be to identify relations between the entities. The verb ‘is’ reflects the relation ‘has
profession’ in the first sentence. To identify the place of birth, we have to observe
that ‘he’ is an anaphor referring to George W. Bush.
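As an illustration only, the result of these two tasks on the example sentences
could be written down as simple tuples, for instance as below; the class and
relation names are chosen for this sketch and are not the formalism used later in
this thesis.

    # Illustrative sketch: instances with their classes, and relations as
    # (subject, predicate, object) triples extracted from the two sentences.
    instances = [
        ("George W. Bush", "Person"),
        ("current president", "Profession"),
        ("the United States", "Country"),
        ("New Haven, CT", "Location"),
    ]

    relations = [
        ("George W. Bush", "has profession", "current president"),
        ("George W. Bush", "was born in", "New Haven, CT"),  # requires resolving 'he'
    ]

    for name, cls in instances:
        print(name, "is an instance of", cls)
    for subject, predicate, obj in relations:
        print(subject, "-", predicate, "-", obj)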
Traditional information extraction tasks focus on the identification of named
entities in large text corpora such as collections of newspaper articles or biomedical
texts. In this thesis however, we focus on the web as a corpus.
Suppose that we are interested in a list of all countries in the world with their
capitals. When we extract information from a collection of newspaper articles
(e.g. three months of the New York Times), we cannot expect all information to be
present. At best, we can try to discover every country-capital combination that is
expressed within the corpus. However, when we use the web as a corpus, we can
expect that every country-capital combination is expressed at least once. Moreover,
each of the combinations is likely to be expressed on various pages with multiple
formulations. For example,
’Amsterdam is the capital of the Netherlands’ and ’The
Netherlands and its capital Amsterdam (...)’ are different formulations of the same
fact. In principle, we have to be able to interpret only one of the formulations to
extract the country-capital combination. Hence, in comparison with a ’traditional’
newspaper corpus, we can both set different objectives and apply different methods
to extract information from the web.
With respect to the task of information extraction, the nature of this corpus has
implications for the method, potential objectives and evaluation. Table 1.1 lists the
most important differences between the two types of corpora.
Newspaper corpus: No or little redundancy. Especially for smaller corpora, we
cannot expect that information is redundantly present.
Web corpus: Redundancy. Because of the size of the web, we can expect
information to be duplicated, or formulated in various ways. If we are interested
in a fact, we have to be able to identify just one of the formulations to extract it.

Newspaper corpus: Constant and reliable. In corpus-based IE, it is assumed that
the information in the corpus is correct.
Web corpus: Temporal and unreliable. The content of the web is created over
several years by numerous contributors. The data is thus unreliable and may be
out-dated. Statements that are correctly extracted are not necessarily true, or may
be outdated.

Newspaper corpus: Often monolingual and homogeneous. If the author or nature
(e.g. articles from the Wall Street Journal) of the corpus is known beforehand, it
is easier to develop heuristics or to train named entity recognizers (NERs).
Web corpus: Multilingual and heterogeneous. The web is not restricted to a single
language and the texts are produced by numerous authors for diverse audiences.

Newspaper corpus: Annotated test corpora available. In order to train supervised
learning based named entity recognizers, test corpora are available where instances
of a limited number of classes are marked within the text.
Web corpus: No representative annotated corpora. As no representative annotated
texts are available, the web as a corpus is currently less suited for supervised
machine learning approaches.

Newspaper corpus: Static. Experimental results are independent of time and place
as the corpora are static.
Web corpus: Dynamic. The contents of the web change continuously; results of
experiments may thus also change over time.

Newspaper corpus: Facts only. Information extraction tasks on newspaper corpora
mainly focus on the identification of facts.
Web corpus: Facts and opinions. As a multitude of users contributes to the web,
its contents are also suited for opinion mining.

Newspaper corpus: Corpus is Key. In traditional information extraction, the task
is to identify all information that can be found in the corpus. The information
extracted is expected to be as complete as possible with respect to the knowledge
represented in the corpus.
Web corpus: Information Demand is Key. As for many information demands the
web can be expected to contain all information required, the evaluation is based
on the soundness and completeness of the extracted information itself.

Table 1.1: Comparison between the Web as a corpus and ‘traditional’ corpora.
1.2.2 Three Information Demands
We separate the information that can be extracted from the web into three cate-
gories: facts, inferable information and community-based knowledge.
Fact Mining
The first and probably most obvious category of information that can be extracted
from the web is factual information. In this category we focus on the extraction
of factual statements (e.g. ‘Tom Cruise stars in Top Gun’, ‘Brussels is Belgium’s
capital’). Such statements can be expected to be expressed within a single document
or even within a sentence. Hence, the extraction of factual information focusses on
the identification of a collection of factual statements, each expressed within a
single document.
In Chapter 4, we focus on the extraction of such factual information from the
web. We use the extracted information to gain insight into the performance of our
algorithms, as a ground truth is often available for these information demands.
Mining Inferable Data
An application domain other than factual data is the extraction of inferable data
from the web. Inferable data is not present as such on the web, but once it is
discovered it can be recognized by human judges as true or relevant. We create such
information by combining data from multiple sources. For example, the average
price of a 19 inch LCD television in shops in Eindhoven can be identified by
combining data from multiple web sites.
In Chapter 5, we discuss two information demands, where the required infor-
mation is inferred from data extracted from the web. First, we present a method to
extract lyrics from the web. Although many dedicated websites exist on this topic,
it is not trivial to return a correct version of the lyrics of a given song. As many
typos, mishearings and other errors occur in the lyrics present on the web, there
is a need to construct a correct version using the various versions available. Such a
correct version may not even be present on the web. When a user is given such a
version, however, it is relatively easy to judge its correctness.
The second application focuses on an information demand from a Dutch au-
diovisual archive. The collection of audiovisual material is annotated using a
dedicated thesaurus, a list of keywords and their relations. To retrieve a particular
document, knowledge of the content of this thesaurus is crucial. However,
neither professional users nor the general audience can be expected to know each
and every word that is contained in the thesaurus. Using web information extrac-
tion techniques, we present a method to link a given keyword to the term in the
thesaurus with the closest meaning.
Community-based Knowledge Mining
The web is not only a well-suited text corpus to mine factual information. As a
large community of users contributes to the contents of the web, it can also be
used to mine more subjective knowledge. For example, we call Paul Gauguin
a post-impressionist related to Vincent van Gogh, and Christina Aguilera a pop
artist similar to Britney Spears. Such qualifications may not all be facts, but rather
thoughts shared by a large community.
In the last part of this thesis (Chapter 6) we focus on methods to automatically
find such internet community-based information. On the one hand we classify
instances (e.g. pop artists) into categories, and on the other hand we identify a
distance matrix of related instances. The information found can be used to create an auto-
mated folksonomy: a knowledge base where items are tagged using implicit input
from multiple users.
In restricted domains (e.g. Movies) for fact mining, the use of information
extraction techniques for semi-structured information may well be usable. The
Internet Movie Database (http://www.imdb.com), for example, is a reliable,
semi-structured source to extract data on movies. When we are interested in
subjective data based on opinions of the web community, however, we cannot
restrict ourselves to a single source. We combine data from multiple web sites, and
thus multiple contributors, to characterize instances. We can, however, use
semi-structured data from social websites such as last.fm as a benchmark on
restricted domains like music [Geleijnse, Schedl, &
Knees, 2007].
1.3 Related Work
We first focus on research on the extraction of information from semi-structured
sources on the web. While the problem addressed is similar to the one in this
thesis (i.e. extracting and combining information from multiple documents into
a structured machine-interpretable format), the source and therefore the methods
differ. In the second subsection, we focus on related research fields. Finally,
Section 1.3.3 focusses on previous work specific to web information extraction.
1.3.1 Gathering Information from Structured Sources
Information extraction from structured sources is thoroughly described in, for
example, [Chang, Kayed, Girgis, & Shaalan, 2006] and [Crescenzi & Mecca, 2004].
These methods, ‘wrappers’, make use of the homogeneous lay-out of large web
sites with pages that are constructed using a database.
As discussed in Section 1.1.1, web sites such as
amazon.com and imdb.com
make use of a database and present automatically generated web pages. The lay-
out is uniform over the whole site, but the relevant information changes from page
to page. For example, within an online music store, the information related to a
particular album is page dependent. The performing artist, the title of the album
and other catalogue data can be found at the exact same place on the page. The
HTML source of two such pages will also only differ at these places. For pages
within a large web site, a wrapper algorithm can be created to extract the
information of interest from an arbitrary page within the site. Agichtein and
Gravano [2000] make use of the
homogeneous lay-out of large websites to extract information by first annotating a
number of pages using a training set of known instances. Etzioni and others [2005]
combine the extraction of information from unstructured sources with the identifi-
cation of instances within tables. Shchekotykhin et al. [2007] describe a method to
recognize tables on a specific domain (digital cameras and notebooks) and extract
the information represented in these tables. In [Auer et al., 2007] structured text
from
Wikipedia is used to create semantic web content.
1.3.2 Related Fields and Tasks
In this subsection, we mention several tasks that are closely related to web
information extraction.
Information Retrieval Information retrieval is often referred to as the task of
returning an (ordered) list of relevant documents for a given query [Van Rijsbergen,
1979]. Kraaij [2004] gives an overview of commonly used models and techniques
as well as evaluation methods for information retrieval.
A high quality document retrieval system is an essential aspect of an informa-
tion extraction system as the retrieval of relevant documents or fragments is the
first step in any large scale information extraction task.
In this work, we use a web search engine that retrieves relevant documents
using an indexed collection of web pages [Brin & Page, 1998]. These pages are
used to extract the information from the domain of interest. On the other hand,
extracted information, such as given names, can be used to index documents in an
information retrieval system.
Named Entity Recognition In the nineties, the Message Understanding Conferences
(MUC) focused on the recognition of named entities (such as names of
persons and organizations) in a collection of texts [Chinchor, 1998]. Initially, this
work was mostly based on rules concerning the syntax and context of such named
entities. For example, two capitalized words preceded by the string ‘mr.’ would
denote the name of a male person. As the creation of such rules is a laborious task,
approaches became popular where named entities were recognized using machine
learning techniques [Mitchell, 1997], for example in [Zhou & Su, 2002; Brothwick,
1999; Finkel, Grenager, & Manning, 2005]. However, such approaches typically
make use of annotated training sets where instances (e.g. ‘Microsoft’) are labeled
with their class (‘Organization’). For tasks where instances of other classes (e.g.
the class Movie or Record Producer) are to be recognized, annotated data may not
be at hand.
The identification of more complex entities is studied by Downey et al. [2007].
With statistical techniques based on the collocation of subsequent words, terms
such as movie titles are identified. Alternative rule-based approaches also give con-
vincing results using the web as a corpus [Sumida, Torisawa, & Shinzato, 2006].
Schutz and Buitelaar [2005] focus on the recognition of relations between named
entities in the soccer domain by using dependency parse trees [Lin, 1998].
Question Answering Question Answering is a task where one is offered a ques-
tion in natural language [Voorhees, 2004]. Using a large text corpus, an answer to
this question is to be returned. Although many variations in this task occur, typi-
cally the question is to be parsed to determine the type of the answer. For example,
the type of the answer for
Who killed John F. Kennedy? is person. Based on the
content of the corpus, a person name is to be returned. Question Answering also
focusses on other types of questions with a more difficult answer structure (e.g. Why
did Egyptians shave their eyebrows?), for which the shortest possible text fragment
is to be returned [Verberne, Boves, Oostdijk, & Coppen, 2007]. Dumais et al. [2002] use
the redundancy of information in a large corpus in a question answering system.
Statements can be found at different places in the text and in different formulations.
Hence, answers to a given question can possibly be found at multiple parts in the
corpus. Dumais et al. extract candidate answers to the questions at multiple places
in the corpus and subsequently select the final answer from the set of candidate
answers.
Information extraction can be used for a question-answering setting, as the
answer is to be extracted from a corpus [Abney, Collins, & Singhal, 2000]. Un-
like question-answering, we are not interested in finding a single statement (corre-
sponding to a question), but in
all statements in a pre-defined domain. Functional relations in information
extraction, where an instance is related to at most one other instance, correspond
to factoid questions. For example, the question In which
country was Vincent van Gogh born? corresponds to finding instances of Person
and
Country and the ‘was born in’-relation between the two. Non-functional re-
lations, where instances can be related to multiple other instances, can be used to
identify answers to list questions, for example “name all books written by Louis-
Ferdinand Céline” or “which countries border Germany?” [Dumais et al., 2002;
Schlobach, Ahn, Rijke, & Jijkoun, 2007].
1.3.3 Previous Work on Web Information Extraction
Information extraction and ontology construction are two closely related fields.
For reliable information extraction, we need background information, e.g. an
ontology. On the other hand, we need information extraction to generate broad and
highly usable ontologies. A good overview of state-of-the-art ontology learning
and population from text can be found in [Cimiano, 2006].
McCallum [2005] gives a broad introduction to the field of information extrac-
tion. He concludes that the accuracy of information extraction systems depends
not only on the design of the system, but also on the regularity of the texts
processed.
Hyponym extraction is by far the most studied topic in web information
extraction. The task is, given a term, to find its broader term (i.e. its
hypernym), or, given a hypernym, to find a list of its hyponyms. Etzioni and colleagues
have developed KnowItAll: a hybrid web information extraction system [2005]
that finds lists of instances of a given class from the web using a search engine. It
combines hyponym patterns [Hearst, 1992] and learned patterns for instances of the
class to identify and extract named entities. Moreover, it uses adaptive wrapper
algorithms [Crescenzi & Mecca, 2004] to extract information from HTML markup
such as tables. KnowItAll is efficient in terms of the required number of search
engine queries, as the instances are not used to formulate queries. In [Downey, Etzioni, &
Soderland, 2005] the information extracted by KnowItAll is post-processed using
a combinatorial model based on the redundancy of information on the web.
The extraction of general relations from texts on the web has recently been studied
in [Banko, Cafarella, Soderland, Broadhead, & Etzioni, 2007] and [Bunescu &
Mooney, 2007]. Craven et al. manually labeled instances such as person names
and names of institutions to identify relations between instances from university
home pages. Recent systems use an unsupervised approach to extract relations
from the web. Sazedj and Pinto [2006] map parse trees of sentences to the verb
describing a relation to extract relations from text.
Cimiano and Staab [2004] describe a method that uses a search engine to verify
a hypothesized relation. For example, if we are interested in the ‘is a’ or hyponym
relation and we have the instance
Nile, we can use a search engine to query phrases
expressing this relation (e.g.
“rivers such as the Nile” and “cities such as the Nile”).
The number of hits to such queries is used to determine the validity of the hypothe-
sis. Per instance, the number of queries is linear in the number of classes (e.g.
city
and
river) considered.
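A rough sketch of this hit-count idea is given below; it is not the implementation
of Cimiano and Staab, and the function hit_count is a hypothetical placeholder
for a call to a search engine API.

    # Hypothetical sketch: decide to which class an instance most likely belongs
    # by comparing search engine hit counts for hyponym phrases.

    def hit_count(phrase):
        """Placeholder for a search engine call returning the number of hits."""
        raise NotImplementedError

    def best_class(instance, classes, pattern="{cls} such as {inst}"):
        # One query per candidate class, so the number of queries is linear
        # in the number of classes considered.
        counts = {cls: hit_count(pattern.format(cls=cls, inst=instance))
                  for cls in classes}
        return max(counts, key=counts.get)

    # e.g. best_class("the Nile", ["rivers", "cities"]) compares the hit counts
    # for "rivers such as the Nile" and "cities such as the Nile".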
In [De Boer, Someren, & Wielinga, 2007] a number of documents on art styles
are collected. Names of painters are identified within these documents. The doc-
uments are evaluated by counting the number of painters in a training set (of e.g.
expressionists) that appear in the document. Painters appearing on the best ranked
documents are then mapped to the style. De Boer et al. use a training set and
page evaluation, where other methods simply observe co-occurrences [Cilibrasi &
Vitanyi, 2007].
A document-based technique in artist clustering is described in [Knees, Pam-
palk, & Widmer, 2004]. For all music artists in a given set, a number of documents
is collected using a search engine. For sets of related artists a number of discrim-
inative terms is learned. These terms are used to cluster the artists using support
vector machines.
The number of search engine hits for pairs of instances can be used to com-
pute a semantic distance between the instances [Cilibrasi & Vitanyi, 2007]. The
nature of the relation is not identified, but the technique can, for example, be used
to cluster related instances. In [Zadel & Fujinaga, 2004] a similar method is used to
cluster artists using search engine counts. In [Schedl, Knees, & Widmer, 2005], the
number of search engine hits for combinations of artists is used in clustering artists.
However, the total number of hits provided by the search engine is an estimate and
not always reliable [Véronis, 2006].
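For reference, the hit-count-based distance of Cilibrasi and Vitanyi, the Normalized
Google Distance, is commonly stated as follows, where f(x) denotes the number of
hits for term x, f(x, y) the number of hits for the pair, and N the (estimated)
number of pages indexed by the search engine:

    \mathrm{NGD}(x,y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x,y)}
                             {\log N - \min\{\log f(x),\, \log f(y)\}}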
In [Pang, Lee, & Vaithyanathan, 2002; Dave & Lawrence, 2003; Kim & Hovy,
2004; Pang & Lee, 2005] methods are discussed to identify opinions on reviewed
products. For example, given a set of reviews of some flat-screen television mined
from the web, the task is to assign a grade to the product or to its specific
features (e.g. the quality of the speakers).
The extraction of social networks using web data is a frequently addressed
topic. For example, Mori et al. [2006] use tf·idf (see [Salton & Buckley, 1988;
Manning & Schütze, 1999]) to identify relations between politicians and locations,
and Jin, Matsuo and Ishizuka [2006] use inner-sentence co-occurrences of company
names to identify a network of related companies.
1.4 Outline
This thesis is organized as follows. In the next chapter, we formulate the prob-
lem and give an outline of the method to extract information from the web. This
method gives rise to two subproblems: on the one hand the identification of
relations in texts, and on the other hand the identification of the terms and given
names of interest. We will discuss these subproblems in Chapter 3. To obtain
evidence for the applicability of the methods discussed in this thesis, in Chapter 4
we present a number of case studies, where we extract factual information from the
web. Chapter 5 focusses on two applications of web information extraction. In
contrast to the case studies in Chapter 4, the information extracted here cannot be
found in structured sources. Chapter 6 handles the extraction of community-based data from the
web, where we find
tags for a set of instances. Finally, the conclusions can be
found in Chapter 7.
2 A Pattern-Based Approach to Web Information Extraction
In this chapter we present a global outline for an approach to extract information
from the web. To this end, we first define a formal model for the concept ‘information’.
Next, we discuss the design constraints that are specific to both the corpus,
i.e. the web, and the use of a state-of-the-art search engine. Based on the design
constraints, a global method to extract information from the web is presented.
2.1 Introduction
In this section, we first focus on a model to represent information. Using the def-
initions provided in Section 2.1.2, we formulate our problem definition in Sec-
tion 2.1.3.
2.1.1 A Model for ‘Information’
Finding a suitable representation of information is one of the key tasks in computing
science. We call data information when it has a meaning, that is, when it can be
used for some purpose, for example the answering of questions.
To represent the concept information, we let ourselves be inspired by the
semantic web community. This community uses the concept ontology, which is
defined by Gruber as ‘a specification of a conceptualization’ [1995]. Wikipedia pro-
vides a perhaps somewhat more practical definition: ‘ontology is a data model that
represents a set of concepts within a domain and the relationships between those
concepts’ (http://en.wikipedia.org/, article Ontology (Computer Science), accessed
December 2007).
In the semantic web languages, an information unit or
statement consists of a
triplet of the form subject - predicate - object, for example
Amsterdam - is capi-
tal of - the Netherlands or the Netherlands - has capital - Amsterdam. Analogous
to the object-oriented programming paradigm, we speak of
classes and their in-
stances. Note that in this model instances are part of the ontology. This allows us
to express knowledge on concepts such as
Amsterdam and their domains (City),
but also enables us to express relations between concepts. As the predicates can be
as refined as required, this model can be used to express statements that are more
complex.
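To make this concrete, such statements could be written down as plain triples, as
in the following sketch; this is ordinary Python used for illustration, not an actual
semantic web notation.

    # Illustrative sketch: statements as (subject, predicate, object) triples,
    # with instances assigned to classes.
    class_of = {
        "Amsterdam": "Capital",
        "the Netherlands": "Country",
    }

    statements = [
        ("Amsterdam", "is capital of", "the Netherlands"),
        ("the Netherlands", "has capital", "Amsterdam"),
    ]

    # Which subjects stand in a given relation to a given object?
    def subjects_of(predicate, obj):
        return [s for (s, p, o) in statements if p == predicate and o == obj]

    print(subjects_of("is capital of", "the Netherlands"))  # ['Amsterdam']
    print(class_of["Amsterdam"])                             # 'Capital'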
The semantic web languages OWL and RDFS enable the formulation of prop-
erties of classes and relations. These languages are rich [Smith, Welty, & McGuin-
ness, 2004], but complex [Ter Horst, 2005]. In this work, we opt for a simple
formalization, as our focus is on the extraction of information rather than on the
use of the extracted information. We note that constructs that allow reasoning,
such as axioms and temporal properties, are not included in this formalism.
An
initial ontology serves three purposes.
1. It is a specification of a domain of interest. Using the classes and relations,
the concepts of interest are described. A domain is specified by defining the
relevant classes (e.g.
City, Capital) and relevant relations (e.g. is located in
defined on classes
City and Country).
2. The ontology is used to specify the inhabitants of the classes and relations:
the formalizations of the statements describing the actual instances and their
relations. For example,
Amsterdam is an instance of the class Capital and the
pair (
Amsterdam, the Netherlands) may be a relation instance of is located
in.
3. We use the ontology to specify an information demand. By defining classes
and their instances as well as relations and relation instances, we model the
domain and indicate the information that is to be extracted from the web.
Now suppose we are interested in a specific piece of information, for exam-
ple:
the Capital of Australia, artists similar to Michael Jackson, the art movements
associated with Pablo Picasso or the profession Leonardo da Vinci is best known
for. We assume that such information can easily be deduced from an ontology that
contains all relevant data. The aim of this work is to automatically fill, or
populate,
an ontology that describes a domain of interest. We hence focus on populating an
ontology on the one hand with instances and on the other hand with pairs of related
instances.
2.1.2 Definitions and Problem Statement
The semantic web languages are created to describe information in a machine-readable
fashion, where each concept is given a unique, unambiguous descriptor,
a uniform resource identifier (e.g. http://dbpedia.org/resource/Information_extraction
is the URI for the research topic of Information Extraction).
By reusing the defined URIs, distributed content can be linked and a
connected knowledge base is built.
For reasons of simplicity we abstract from the semantic web notations. By
keeping the definitions simple, the notations introduced in this thesis can be translated
into the semantic web languages fairly easily, as we maintain the subject -
predicate - object structure used in those languages.
We define an ontology O as follows.
Definition [Ontology]. An ontology O is a pair (C, R), with C the set of classes
and R the set of relations.
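As a small illustration of this definition (the internal representation of classes and
relations below is an assumption of this sketch, not a notation prescribed by the
thesis):

    # Sketch: an ontology as a pair (C, R). Relations are represented here as a
    # name together with the pair of classes they are defined on.
    classes = {"City", "Capital", "Country"}
    relations = {("is located in", ("City", "Country"))}

    ontology = (classes, relations)   # O = (C, R)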