
Aviation Data Mining


Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal, Vol. 2 [2015], Iss. 1, Art. 3

http://digitalcommons.morris.umn.edu/horizons/vol2/iss1/3

Table 1: Positive and Negative Expanders [8].

Shaping Factor | Positive Expanders | Negative Expanders
Familiarity | unfamiliar, layout, unfamiliarity, rely | (none listed)
Physical Environment | cloud, snow, ice, wind | (none listed)
Physical Factors | fatigue, tire, night, rest, hotel, awake, sleep, sick | declare, emergency, advisory, separation
Preoccupation | distract, preoccupied, awareness, situational, task, interrupt, focus, eye, configure, sleep | declare, ice, snow, crash, fire, rescue, anti, smoke
Pressure | bad, decision, extend, fuel, calculate, reserve, diversion, alternate | (none listed)

with that set, it is labeled with the shaper associated with the expanders. For example, if the bootstrapping algorithm is labeling reports with the Preoccupation label using the set of positive expanders in Table 1, and an unlabeled report contains the words “awareness”, “task”, “eye”, and “smoke”, it would label this report with the Preoccupation shaper regardless of any negative expander.

Figure 8: The bootstrapping algorithm [8].

The bootstrapping algorithm (Figure 8) takes several inputs: a set of positively labeled training examples of a shaper, a set of negatively labeled training examples of a shaper, a set of unlabeled narratives, and the number of bootstrapping iterations (k). Positive examples of a shaper are narratives containing words that indicate that shaper; negative examples are narratives containing words that indicate the shaper is not appropriate for them. To find more expanders, two empty sets are initialized, one for positive expanders and one for negative expanders. The algorithm then calls the ExpandTrainingSet function k times. In each iteration, if the set of positively labeled training examples is larger than the set of negatively labeled training examples, new positive expanders are found, and vice versa.

The second function of the algorithm, ExpandTrainingSet, finds four new expanders. Its inputs are the narrative sets of the positive and negative shaper examples, each in its respective variable (A or B, depending on the sizes of the narrative sets), the set of narratives not yet assigned a shaper, and a unigram feature set: the set of positive or negative expanders (again depending on the sizes of the positive and negative narrative sets). For every word t, a score is computed as the log of the number of narratives in one set (P or N) containing t divided by the number of narratives in the other set containing t, and the word with the maximum score is selected. When expanding the positively labeled training examples, the positive narrative set is used in the numerator and the negative narrative set in the denominator; the roles are reversed when expanding the negatively labeled training examples. If the selected word is not already an expander, it is added to the set of expanders. After the four new expanders are found, every unlabeled narrative that contains more than 3 words from the relevant list of expanders is added to the relevant set of labeled training examples.
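The expansion step described above can be sketched in Python. This is a minimal illustration and not the implementation from [8]: the function and parameter names are hypothetical, narratives are assumed to be already tokenized into word lists, and the add-one smoothing in the score is an assumption made here only to avoid division by zero.

```python
import math
from collections import Counter


def narrative_counts(narratives):
    """Map each word to the number of narratives that contain it."""
    counts = Counter()
    for words in narratives:
        counts.update(set(words))
    return counts


def expand_training_set(a, b, unlabeled, expanders, n_new=4, min_hits=4):
    """Grow labeled set `a` using words that favor `a` over `b`.

    a, b, unlabeled: lists of narratives, each a list of word tokens.
    expanders: words already chosen as expanders for `a`.
    min_hits=4 encodes "more than 3 expander words" from the text.
    """
    in_a = narrative_counts(a)
    in_b = narrative_counts(b)
    # Log ratio of narrative counts; the +1 smoothing is an assumption.
    scores = {t: math.log((in_a[t] + 1) / (in_b[t] + 1)) for t in in_a}

    # Add the n_new best-scoring words that are not yet expanders.
    expanders = set(expanders)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    for t in [t for t in ranked if t not in expanders][:n_new]:
        expanders.add(t)

    # Move unlabeled narratives with enough expander words into `a`.
    grown, remaining = list(a), []
    for words in unlabeled:
        hits = len(expanders & set(words))
        (grown if hits >= min_hits else remaining).append(words)
    return grown, remaining, expanders
```

Calling the function with the positive set as `a` and the negative set as `b` expands the positive examples; swapping the two arguments expands the negative examples instead.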

To use the semi-supervised bootstrapping algorithm, some data must already be labeled. To initially label a test set of incident reports, two graduate students not affiliated with the research labeled 1,333 randomly selected reports from the ASRS database.

The reports were then classified with LIBSVM, an existing software package, to provide a baseline against which the bootstrapping algorithm could be compared.

RESULTS

In this section we discuss the outcomes of the three methods compared to their respective baselines.

4.1 Multiple Kernel Learning

The data of concern in the set of 3500 flights were the points below 10,000 feet mean sea level (MSL). The flight data was passed through an algorithm that removed flights whose sensors or sensor values were unreasonable, likely due to noise or other malfunctions. This left 2500 of the original 3500 flights for analysis. To find a training set for the algorithm from these 2500 flights, “an aggressive data quality filter was applied to the remaining flights”, which returned “approximately 500 flights” [2]. Of the 2500 flights, 227 were found to be anomalous by the MKAD method. Of these 227 anomalous flights, 19 were discrete, 94 were continuous, and 114 were heterogeneous (discrete and continuous). Table 2 shows the results of the multiple kernel research: the overlap between the anomalies found by MKAD and those found by the baselines. Based on this overlap, the study [2] suggests that the MKAD approach was able to detect anomalies indicated by both discrete and continuous data more effectively than the baseline methods.

Pagels: Aviation Data Mining

Published by University of Minnesota Morris Digital Well, 2015

Algorithms

Overlap of anomalous flights (with MKAD)

Discrete

Continuous

Heterogeneous

21%

59%

34%

53%

54%

O & S

58%

59%

67%

MKAD

114

Table 2: Overlap between MKAD approach and baselines. The baselines are represented by O for Orca and S for SequenceMiner. The values of O & S are the union of their anomalous sets [2].

4.2 HMMs and HSMMs

Overall, of the scenarios listed in section 3.2, the HSMM performed better on scenarios 1 and 2 and performed similarly to the HMM on scenarios 3, 4, and 5. While the authors of the paper discussed possible methods to further improve anomaly detection using an HSMM, the result confirms the relevance and importance of an algorithm that takes the duration of states into account.
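To illustrate why duration matters, the sketch below contrasts the state duration an ordinary HMM implies with an explicit duration distribution of the kind an HSMM can use. In an HMM, the time spent in a state follows a geometric distribution determined by that state's self-transition probability, so the single most likely duration is always one step. The function names and all numbers are hypothetical and chosen only for illustration; they do not come from the study.

```python
def hmm_duration_pmf(self_transition, d):
    """Probability that an HMM stays in a state for exactly d steps.

    The state is kept d - 1 times and then left once, which gives a
    geometric distribution over durations.
    """
    return self_transition ** (d - 1) * (1.0 - self_transition)


def hsmm_duration_pmf(duration_table, d):
    """Probability of duration d under an explicit duration table."""
    return duration_table.get(d, 0.0)


# Hypothetical state that normally lasts about five steps.
self_transition = 0.8                      # mean duration 1 / (1 - 0.8) = 5
explicit = {4: 0.2, 5: 0.6, 6: 0.2}        # assumed HSMM duration table

for d in (1, 5, 20):
    print(d, hmm_duration_pmf(self_transition, d),
          hsmm_duration_pmf(explicit, d))
```

Under these assumed numbers the HMM considers a one-step visit the most likely outcome, while the explicit table assigns it zero probability, so an unusually short or long stay in a state can be scored as anomalous.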

4.3 Semi-Supervised Bootstrapping Algorithm

A sample of the words indicative of certain labels, or expanders, found when the bootstrapping algorithm was run on the set of incident reports may be found in Table 1.

We can get an idea of the effectiveness of the bootstrapping algorithm from the expanders. According to a table in Semi-Supervised Cause Identification from Aviation Safety Reports [8], 1.8% of the reports were annotated with the ‘Pressure’ shaper. Even though only a small percentage of the data set has pressure as a cause, it is easy to see how the positive expanders shown in Table 1 can indicate pressure as a cause leading to the incident.

The bootstrapping algorithm’s effectiveness was measured by F-measure, which combines precision and recall (commonly as their harmonic mean). Precision is the fraction of the reports assigned a shaper by the algorithm that truly have that shaper; recall is the fraction of the reports that truly have a shaper that the algorithm labeled with it. The bootstrapping algorithm yielded “a relative error reduction of 6.3% in F-measure over a purely supervised baseline when applied to the minority classes” [8].
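As a worked illustration of these measures, the sketch below computes precision, recall, and the balanced F-measure for one shaper from sets of report identifiers. It also shows one common way to define relative error reduction, treating 1 - F as the error; that this matches the definition used in [8] is an assumption. All names and numbers are hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Precision, recall, and balanced F-measure for one shaper.

    predicted: set of report ids the classifier labeled with the shaper.
    actual: set of report ids that truly have the shaper.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


def relative_error_reduction(f_baseline, f_new):
    """Fraction of the baseline error (1 - F) removed by the new method."""
    baseline_error = 1.0 - f_baseline
    return (baseline_error - (1.0 - f_new)) / baseline_error


# Hypothetical example: 4 reports labeled, 5 reports truly have the shaper.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, f)
```

In this made-up example three of the four labeled reports are correct, so precision is 3/4, and three of the five true reports were found, so recall is 3/5.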

CONCLUSION

Techniques in data mining show signs of improving the ability to detect anomalies in aviation data. Using Multiple Kernel Learning, we are now able to detect heterogeneous anomalies in data, where before we were only able to find either discrete or continuous anomalies. We have learned that a Hidden Semi-Markov Model approach to detecting anomalies is preferable to a Hidden Markov Model approach, because Hidden Semi-Markov Models can model the probability of sequences in which the duration of states is significant. Lastly, we have looked at a new text classification approach to identifying causes in aviation incident reports, with an emphasis on minority causes. To accomplish this, the bootstrapping algorithm was used to find causes based on key words contained in the aviation reports. Some continuing problems that have yet to be addressed in the field of data mining of aviation data include:

• Overly generalized data in incident reports, making cause identification a difficult task

• Providing a simple way for these methods to be deployed

• Linking reports between other data (e.g. linking incident report to aircraft maintenance records) [7].

ACKNOWLEDGEMENTS

Many thanks to Peter Dolan, Elena Machkasova, and Andrew Latterner for their invaluable feedback.

REFERENCES

[1] 14 Code of Federal Regulations 121.344. 2011.

[2] S. Das, B. L. Matthews, A. N. Srivastava, and N. C. Oza. Multiple kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case study. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 47–56. ACM, 2010.

[3] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977.

[4] E. Kim. Everything you wanted to know about the kernel trick (but were too afraid to ask). http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html, 2013.

[5] J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2):107–144, 2007.

[6] I. Melnyk, P. Yadav, M. Steinbach, J. Srivastava, V. Kumar, and A. Banerjee. Detection of precursors to aviation safety incidents due to human factors. In Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, pages 407–412. IEEE, 2013.

[7] Z. Nazeri, E. Bloedorn, and P. Ostwald. Experiences in mining aviation safety data. In ACM SIGMOD Record, volume 30, pages 562–566. ACM, 2001.

[8] I. Persing and V. Ng. Semi-supervised cause identification from aviation safety reports. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 843–851. Association for Computational Linguistics, 2009.
