Aviation Data Mining
David A. Pagels
Division of Science and Mathematics
University of Minnesota, Morris
Morris, Minnesota, USA 56267
pagel093@morris.umn.edu
ABSTRACT
We explore different methods of data mining in the field of
aviation and their effectiveness. The field of aviation is al-
ways searching for new ways to improve safety. However, due
to the large amounts of aviation data collected daily, parsing
through it all by hand would be impossible. Because of this,
problems are often found by investigating accidents. With
the relatively new field of data mining we are able to parse
through an otherwise unmanageable amount of data to find
patterns and anomalies that indicate potential incidents be-
fore they happen. The data mining methods outlined in this
paper include Multiple Kernel Learning algorithms, Hidden
Markov Models, Hidden Semi-Markov Models, and Natural
Language Processing.
Keywords
Aviation, Data Mining, Multiple Kernel Learning, Hidden
Markov Model, Hidden Semi-Markov Model, Natural Lan-
guage Processing
1. INTRODUCTION
On January 31st, 2000, a plane travelling from Puerto
Vallarta, Mexico to Seattle, Washington dove from 18,000 feet
into the Pacific Ocean, killing all 88 people on board. The cause of this
accident was found to be “a loss of airplane pitch control
resulting from the in-flight failure of the horizontal stabi-
lizer trim system jackscrew assembly’s acme nut threads.
The thread failure was caused by excessive wear resulting
from Alaska Airlines’ insufficient lubrication of the jackscrew
assembly”[2].
The cause of this accident was predictable
through analysis of flight data recordings. There are many
other incidents that would also be preventable through anal-
ysis of the flight data recordings.
Data mining is a broad field of data science that was
developed to make predictions on future data based on
patterns found in collected data. Finding patterns in aviation
data manually is impracticable due to the massive amount of
data produced every day. Data mining has been able to
start addressing this problem. Although they are not yet
optimized for mining of aviation data in their current state,
some common data mining methods, such as kernel methods,
text classification, and Hidden Semi-Markov Models,
are being explored. Kernel techniques have been largely
developed around either discrete or continuous data. This
limitation makes them unsuited for use on the combined discrete
and continuous data collected in aviation. Hidden Markov
Models are limited to analyzing sequences without the
ability to take into account the duration of actions. Aviation
incident reports often contain a small amount of information
per report, while current methods of text classification
require large amounts of descriptive data. Although these
approaches are not optimal for aviation data, we can use
these concepts to produce new approaches for data mining.
This work is licensed under the Creative Commons Attribution-
Noncommercial-Share Alike 3.0 United States License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or
send a letter to Creative Commons, 171 Second Street, Suite 300,
San Francisco, California, 94105, USA.
UMM CSci Senior Seminar Conference, December 2014
Morris, MN.
Section 2 explains the concepts necessary to understand
the approaches to aviation data mining outlined in
sections 3 and 4. Section 3 introduces and explains three
methods of data mining in aviation. The first uses Multiple
Kernel Learning to find patterns in combined discrete and
continuous data. The second compares the effectiveness of
the Hidden Markov Model with that of the Hidden
Semi-Markov Model in detecting anomalies. The third
analyzes the effectiveness of a text classification algorithm.
Section 4 discusses the relative success found in the results
of these methods, and section 5 summarizes their
effectiveness.
2. BACKGROUND
To discuss several methods of data mining used in aviation
today, we need to understand several data mining concepts.
These concepts include supervised, semi-supervised, and un-
supervised learning; text classification; Natural Language
Processing (NLP); Support Vector Machines (SVMs); Hid-
den Markov Models (HMMs); Hidden Semi-Markov Models
(HSMMs); and kernels. To summarize, we classify the data
by searching for key words in text (NLP), by finding clus-
ters of data (kernels), or by observing the probability of a
sequence of events (HMMs and HSMMs).
2.1 Aviation Data
We show implementation of these methods on three types
of aviation data in this paper. The first is flight recording
data collected by the flight data recorder, informally
known as the black box. Planes equipped with flight data
recorders typically record up to 500 variables of data per
second for the duration the plane is being operated [2].
Some of the variables described in these flight data
recordings are time, altitude, vertical acceleration, and
heading [1]. Some of these variables are discrete and some
are continuous.
Pagels: Aviation Data Mining
Published by University of Minnesota Morris Digital Well, 2015
The second type of data is synthetic data. This is data
generated with flight anomalies intentionally placed in it
to test the ability of the algorithms to recognize them.
These are referred to as dispersed anomalies, and may be
an unconventional sequence of events, an unusual duration
between events, etc. Some of
the synthetic data used in this paper is data generated from
a robust flight simulator, FlightGear. The FlightGear simu-
lator is often used in the aviation industry and in academia
due to its accuracy [6].
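As a rough illustration of what planting an anomaly in synthetic data
could look like, the sketch below swaps two adjacent events (an
unconventional sequence) and stretches one gap between events (an
unusual duration). The event names, the gap values, and both helper
functions are hypothetical; they are not taken from FlightGear or from
the cited work.

```python
import random

# Hypothetical discrete events for a nominal landing approach.
NOMINAL = ["flaps_1", "gear_down", "flaps_2", "flaps_full", "touchdown"]

def inject_order_anomaly(events, rng):
    """Return a copy of events with two adjacent events swapped, plus the swap index."""
    i = rng.randrange(len(events) - 1)
    out = list(events)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out, i

def inject_duration_anomaly(durations, rng, factor=5.0):
    """Return a copy of durations with one gap multiplied by `factor`, plus its index."""
    i = rng.randrange(len(durations))
    out = list(durations)
    out[i] *= factor
    return out, i

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
swapped, swap_at = inject_order_anomaly(NOMINAL, rng)

gaps = [10.0, 12.0, 8.0, 30.0]  # made-up seconds between consecutive events
stretched, stretch_at = inject_duration_anomaly(gaps, rng)
```

Because the position of each planted anomaly is returned, a detection
algorithm's output can be checked against it.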
The third type of data is aviation incident reports. These
reports do not have any strict conventions, do not require
pilots to use specific terms, and include narratives. Since
this data is not uniform, we must find a method to determine
the relevant and important data.
2.2 Labels and Labeled Data
A label is a descriptive word assigned to data based on
some property of the data.
The labels in this paper are
called shaping factors, or shapers, of an aviation incident.
Examples of shapers in an aviation incident might include
illness, hazardous environment, a distracted pilot, etc.
2.3 Supervised, Semi-Supervised, and Unsupervised Learning
There are many methods of finding a function to describe
data. This function is commonly called the model, as it is
made to model some set of data. Three such methods in-
clude supervised, semi-supervised, and unsupervised learn-
ing. Supervised learning uses labeled data to form the func-
tion. Semi-supervised learning uses some labeled data along
with some unlabeled data to form the function. Unsuper-
vised learning uses no labeled data to form the function. The
term supervised in this context means that the labels for the
data have already been found and are being used to con-
struct the new model in a somewhat predictable way. The
set of data used in supervised and semi-supervised learning
is called the training set.
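The difference between learning with and without labels can be
sketched on a toy one-dimensional data set. This is a minimal sketch
under assumptions: the values, the label names, the nearest-centroid
model, and the two-cluster k-means procedure are all chosen for
illustration and are not the methods evaluated in this paper.

```python
def fit_supervised(points, labels):
    """Supervised: use the labels to compute one centroid (mean) per label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign x to the label (or cluster id) whose centroid is nearest."""
    return min(centroids, key=lambda key: abs(centroids[key] - x))

def fit_unsupervised(points, iterations=10):
    """Unsupervised: 2-means clustering. No labels are used; clusters get ids 0 and 1."""
    centroids = {0: min(points), 1: max(points)}
    for _ in range(iterations):
        groups = {0: [], 1: []}
        for x in points:
            groups[predict(centroids, x)].append(x)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = {k: (sum(v) / len(v) if v else centroids[k])
                     for k, v in groups.items()}
    return centroids

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]      # made-up training set
labels = ["nominal"] * 3 + ["anomalous"] * 3  # hypothetical labels

model = fit_supervised(points, labels)   # centroids keyed by label
clusters = fit_unsupervised(points)      # centroids keyed by cluster id
```

The supervised model answers with a meaningful label, while the
unsupervised one can only report an anonymous cluster id. A
semi-supervised approach would sit between the two, for example by
using a few labeled points to name clusters found in unlabeled data.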
2.4 Natural Language Processing
Natural Language Processing (NLP) is a field of computer
science focused on gathering meaningful data from text gen-
erated by humans. Aviation incident reports are not uni-
form, as they are filled out by humans. To get meaningful
data from these reports, we first have to identify the overall
picture of the data. This process is called text classification.
Text classification is a general term and there are several dif-
ferent methods of text classification. The research outlined
in this paper classifies text by using some prelabeled incident
reports. Using the reports and the shapers associated with
these reports, we can then find words in the reports that
are commonly associated with a shaper. These words are
referred to as expanders. While these expanders are being
found, we can label an unlabeled report as likely to be
associated with a shaper if it contains a minimum number
of expanders.
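The idea of finding expanders and then labeling reports by counting
them can be sketched as follows. This is only an illustration under
assumptions: the example reports are invented, and the rule used to
pick expanders (a word must occur in at least `min_count` reports
carrying the shaper and in no report lacking it) is a deliberately
crude stand-in for whatever criterion the actual research uses.

```python
from collections import Counter

def find_expanders(labeled_reports, shaper, min_count=2):
    """Return words tied to `shaper` under the crude, assumed rule described above."""
    with_shaper = Counter()  # number of shaper-labeled reports containing each word
    without = set()          # words seen in reports that lack the shaper
    for text, shapers in labeled_reports:
        words = set(text.lower().split())
        if shaper in shapers:
            with_shaper.update(words)
        else:
            without |= words
    return {w for w, c in with_shaper.items()
            if c >= min_count and w not in without}

def label_report(text, expanders, min_expanders=2):
    """True if the report contains at least `min_expanders` distinct expanders."""
    words = set(text.lower().split())
    return len(words & expanders) >= min_expanders

# Invented prelabeled reports: (narrative, set of shapers).
labeled = [
    ("pilot felt sick and dizzy during climb", {"illness"}),
    ("captain was sick with fever and dizzy", {"illness"}),
    ("heavy fog and ice during approach", {"hazardous environment"}),
]

expanders = find_expanders(labeled, "illness")
is_illness = label_report("first officer became dizzy and sick", expanders)
is_not_illness = label_report("engine ran rough during taxi", expanders)
```

Common words such as "and" are dropped here only because they also
appear in a report without the shaper; a realistic system would need a
more careful way to filter uninformative words.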
2.5 Kernels and Support Vector Machines
A Support Vector Machine (SVM) classifies new data into
one of two categories. The data is represented by vectors,
which are denoted by an arrow over a variable. The SVM
classifies by separating the data with a hyperplane: a line,
plane, or higher-dimensional analogue that acts as the
decision boundary that best separates the two categories of
data. This hyperplane is constructed by the SVM.
For example, the hyperplane in Figure 1 is the line
separating the two clusters. Sometimes the data is not
linearly separable, and the SVM is unable to produce a
separating hyperplane. When this is the case, a kernel trick
is used. A kernel trick maps the data into a higher dimension
so that a hyperplane may be found by the SVM [Figure 2].
The hyperplane in the left image of Figure 3 is the plane
separating the two clusters; it is then shown mapped back
into two dimensions in the right image of Figure 3. These
clusters are considered labeled after the hyperplane is
constructed. The label is determined by the location of the data
point relative to the hyperplane. A kernel is a function that
measures the similarity between two data points; it is used
to compare unlabeled data with the known data points and
label it accordingly.
Figure 1: Linearly separable data [4].
Figure 2: In the case of non-linearly separable data,
we can use a kernel trick to map the data to a higher
dimension [4].
Figure 3: Once the data is mapped to a higher di-
mension, we find a hyperplane to separate it [4].
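The effect of mapping data to a higher dimension can be sketched with
points that form an inner cluster surrounded by an outer ring, which no
straight line separates in two dimensions. The points, the `lift`
mapping, and the hand-picked threshold below are assumptions made for
illustration; a real SVM learns the hyperplane from the training set,
and with a kernel it works from pairwise similarities rather than
computing the lifted coordinates explicitly.

```python
import math

def lift(point):
    """Map a 2-D point (x, y) to 3-D by appending its squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, threshold):
    """Label a point by which side of the plane z = threshold its lifted image falls on."""
    return "outer" if lift(point)[2] > threshold else "inner"

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: a similarity in (0, 1] that equals 1 when a == b."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]   # made-up inner cluster
outer = [(2.0, 0.0), (0.0, -2.1), (-1.5, 1.5)]   # made-up surrounding ring
threshold = 1.0  # hand-picked separating plane, not learned

inner_labels = [classify(p, threshold) for p in inner]
outer_labels = [classify(p, threshold) for p in outer]
```

In the lifted space the flat plane z = 1 separates the two groups, and
mapped back into two dimensions that plane corresponds to the unit
circle. The `rbf_kernel` function shows one common similarity measure:
nearby points score close to 1 and distant points score close to 0.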