Information extraction from the web using a search engine


• We need to compose a strategy to retrieve relevant text.
• We have to focus on a method to actually extract information from these
texts.
We will argue that the strategy chosen to retrieve documents affects the
process of extracting information from these documents: the way search engine
queries are formulated determines which texts are retrieved, and hence which
extraction method is suitable. In this section, we will discuss a global method
to populate an ontology (i.e. to extract information) using texts retrieved
with a search engine.

Before presenting an ontology population algorithm in Section 2.2.2, we first
discuss the consequences of choosing a commercial search engine and the web as
a corpus.
2.2.1 Design Constraints
The use of a commercial search engine and the nature of the texts on the web lead
to requirements that constrain the design of a method to extract information from
the web.
Search Engine Restrictions
In this thesis, we use a search engine that provides us with ‘the most relevant’
pages on the web for a given query. As the web is a collection of billions of
changing, emerging and disappearing pages, it is infeasible to extract information
from each and every one of them. We hence need a reliable web document
retrieval system, and we use a state-of-the-art search engine to find relevant
documents. The design of such a search engine is a separate concern and outside
the scope of this thesis. Therefore, we choose to use commercial search engines
for our purposes. Using search engines like Yahoo! or Google also facilitates the
reuse of the methods developed, as programmer’s interfaces are provided.
The use of a (commercial) search engine also has important disadvantages.
• A query sent to the search engine from two different machines can give
different search results, as the services of large search engines are distributed.
• The search results differ over time, as the web changes and the pages indexed
and ranked are continuously updated.
Hence, an experiment making use of a distributed search engine can give
different results when conducted at another time or place. For this reason, the
use of static corpora as test sets in information extraction is currently the only
basis to objectively compare experimental results. Consequently, experimental
results of alternative approaches in web information extraction are hard to
compare.
In the first chapter, we give a comparison between static corpora and the Web
as a corpus. We choose not to test our methods on static corpora to benchmark
the performance with other methods, as our method is specifically designed for
the characteristics of the Web. However, where possible we do compare our web
information extraction approach with work by others.
An initiative where a snapshot of the web is stored and indexed would be a
stimulus for the field of web information extraction. Such a time- and
location-independent search engine would facilitate a reliable comparison of
alternative approaches in web information extraction.

2.2 Extraction Information from the Web using Patterns
Currently, both Google and Yahoo! allow a limited number of automated
queries per day. At the moment of writing this thesis, Google allows only 1,000
queries a day, where each query returns at most 10 search results. Hence, if for a
given query expression the maximum of 1,000 search results is available, we need
to formulate 100 queries using the Google API to retrieve them all. Yahoo!
currently is more generous, allowing 5,000 automated queries per day, where at
most 100 search results are returned per query.
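The arithmetic behind these limits can be made explicit with a small sketch. The function names below are our own illustration and not part of any search engine API; the figures are the quotas quoted above, and we assume that a full result list of 1,000 hits is wanted for every query expression.

```python
def queries_needed(results_wanted: int, results_per_query: int) -> int:
    """Number of queries needed to page through `results_wanted` results."""
    if results_wanted <= 0 or results_per_query <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division: the last page may be only partially filled.
    return -(-results_wanted // results_per_query)


def expressions_per_day(daily_quota: int, results_wanted: int,
                        results_per_query: int) -> int:
    """Number of query expressions that can be fully retrieved in one day."""
    return daily_quota // queries_needed(results_wanted, results_per_query)


# Quotas as quoted in the text (illustrative; they change over time).
google_queries = queries_needed(1000, 10)            # 100 queries per expression
google_per_day = expressions_per_day(1000, 1000, 10)  # 10 expressions per day
yahoo_queries = queries_needed(1000, 100)            # 10 queries per expression
yahoo_per_day = expressions_per_day(5000, 1000, 100)  # 500 expressions per day
```

Under these assumptions, retrieving all results for a single expression already consumes a tenth of the daily Google quota, which is why the number of queries has to be treated as a scarce resource.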
Hence, this restriction on search engine use requires us to analyze our approach
not only in terms of time and space complexity, but also in terms of the order of
the number of queries, which we term the Google complexity.
Definition [Google Complexity]. For a web information extraction algorithm
using a search engine, we refer to the required number of queries as the
Google complexity.
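One way to observe the Google complexity of a procedure in practice is to count the calls it makes to the search engine. The sketch below is only an illustration: `CountingSearch`, `fake_search` and `toy_procedure` are hypothetical names, the search function is a stand-in that contacts no real service, and the toy procedure (one query per combination of a term and a query template) is not the algorithm presented in Section 2.2.2.

```python
from typing import Callable, List


class CountingSearch:
    """Wraps a search function and counts the number of queries issued."""

    def __init__(self, search_fn: Callable[[str], List[str]]) -> None:
        self._search_fn = search_fn
        self.query_count = 0

    def __call__(self, query: str) -> List[str]:
        self.query_count += 1
        return self._search_fn(query)


def fake_search(query: str) -> List[str]:
    """Stand-in for a real search engine call (assumption: returns snippets)."""
    return [f"snippet for {query!r}"]


def toy_procedure(search: Callable[[str], List[str]],
                  terms: List[str], templates: List[str]) -> List[str]:
    """Hypothetical procedure: one query per (term, template) combination."""
    snippets: List[str] = []
    for term in terms:
        for template in templates:
            snippets.extend(search(template.format(term)))
    return snippets


search = CountingSearch(fake_search)
toy_procedure(search, ["Amsterdam", "Paris", "Rome"], ["{} is", "about {}"])
# The toy procedure issues len(terms) * len(templates) queries.
print(search.query_count)
```

For this toy procedure the query count grows with the product of the number of terms and the number of templates, which is the kind of statement a Google complexity analysis aims to make.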
