Introduction to Pattern Recognition: Prediction in Bioinformatics

Date: 04.01.2017
Introduction to Pattern Recognition

Prediction in Bioinformatics
What do we want to predict?

Features from sequence
Data mining

How can we predict?

Homology / Alignment
Pattern Recognition / Statistical Methods / Machine Learning

What is prediction?

Generalization / Overfitting
Preventing overfitting: Homology reduction

How do we measure prediction?

Performance measures
Threshold selection

Sequence → structure → function

Prediction from DNA sequence

Protein-coding genes

transcription factor binding sites
transcription start/stop
translation start/stop
splicing: donor/acceptor sites

Non-coding RNA

tRNAs
rRNAs
miRNAs

General features

Structure (curvature/bending)
Binding (histones etc.)

Folding / structure

Post-Translational Modifications

Attachment: phosphorylation, glycosylation, lipid attachment
Cleavage: signal peptides, propeptides, transit peptides
Sorting: secretion, import into various organelles, insertion into membranes

Interactions
Function

Enzyme activity
Transport
Receptors
Structural components
etc...

Protein sorting in eukaryotes

Data: UniProt annotation of protein sorting

Annotations relevant for protein sorting are found in:

the CC (comments) lines
cross-references (DR lines) to GO (Gene Ontology)
the FT (feature table) lines

ID INS_HUMAN Reviewed; 110 AA.
AC P01308;
...
DE Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN Name=INS;
...
CC -!- SUBCELLULAR LOCATION: Secreted.
...
DR GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT SIGNAL 1 24

3 types of non-experimental qualifiers in the CC and FT lines:

Potential: Predicted by sequence analysis methods
Probable: Inconclusive experimental evidence
By similarity: Predicted by alignment to proteins with known location

Problems in database parsing

Prediction methods

Homology / Alignment
Simple pattern recognition

Example:

PROSITE entry PS00014, ER_TARGET:
Endoplasmic reticulum targeting sequence.
Pattern: [KRHQSA]-[DENQ]-E-L>
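
A PROSITE pattern of this kind maps almost directly onto a regular expression: square brackets list the residues allowed at a position, hyphens separate positions, and the trailing `>` anchors the match to the C-terminus. The following is a minimal Python sketch; the helper name `has_er_target` and the example sequences are made up for illustration.

```python
import re

# PROSITE PS00014: [KRHQSA]-[DENQ]-E-L>
# The trailing '>' means "C-terminal end", written '$' in a regex.
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL$")

def has_er_target(sequence: str) -> bool:
    """Return True if the sequence ends with the ER targeting motif."""
    return ER_TARGET.search(sequence.strip().upper()) is not None

print(has_er_target("MKLLSAVALLKDEL"))  # True: ends in KDEL
print(has_er_target("MKDELLSAVALLGG"))  # False: KDEL is not C-terminal
```

A pattern like this gives only a yes/no answer; it cannot express that some residues at a position are more likely than others.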

Statistical methods

Weight matrices: calculate amino acid probabilities
Other examples: Regression, variance analysis, clustering
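
As an illustration of the weight matrix idea, the sketch below estimates amino acid probabilities per position from a set of aligned sites and converts them to log-odds weights. The pseudocount, the uniform background (1/20 per residue) and the four toy sites are assumptions chosen for the example, not values from the slides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_weight_matrix(aligned_sites, pseudocount=1.0):
    """Per-position log-odds weights from equal-length aligned sites.

    Assumes a uniform background; real methods usually use the observed
    amino acid frequencies of the database instead.
    """
    length = len(aligned_sites[0])
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            prob = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(prob / background)
        matrix.append(weights)
    return matrix

def score(matrix, window):
    """Sum the position-specific weights for a window of the same length."""
    return sum(weights[aa] for weights, aa in zip(matrix, window))

# Toy training data (hypothetical aligned sites)
sites = ["KDEL", "HDEL", "KEEL", "RDEL"]
wm = build_weight_matrix(sites)
print(score(wm, "KDEL"), score(wm, "GGGG"))
```

Every parameter here is obtained by direct counting, which is what separates this from the iteratively trained methods.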

Machine learning

Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
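
To make "iterative training" concrete, here is a minimal sketch of the simplest possible case: a single logistic unit (a neural network with no hidden layer) whose two parameters are adjusted step by step by gradient descent. The one-dimensional toy data, the learning rate and the number of epochs are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.5, epochs=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of log-loss w.r.t. z
            grad_w += error * x
            grad_b += error
        # Move the parameters a small step against the average gradient
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Toy data: one made-up feature value per example, with a 0/1 class label
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```

Unlike counting frequencies for a weight matrix, there is no single formula that yields `w` and `b`; they are reached by repeated small corrections.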

Prediction of subcellular localisation from sequence

Signal-based prediction

Signal peptides

von Heijne 1983, 1986 [WM]
SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]

Mitochondrial & chloroplast transit peptides

Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
Protein Prowler* (Hawkins & Bodén 2006) [NN]

* = also predicts signal peptides

Nuclear localisation signals

PredictNLS (Cokol et al. 2000) [regex]
NucPred (Heddad et al. 2004) [regex, GA]

Composition-based prediction

Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
Chou and Elrod 1998 [12 categories; covariant discriminant]
NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
SubLoc (Hua and Sun 2001) [4 categories; SVM]
PLOC (Park and Kanehisa 2003) [12 categories; SVM]
LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]
BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]

Pro:

does not require knowledge of signals
works even if N-terminus is wrong

Con:

cannot identify isoform differences
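
The input to a composition-based method can be sketched as a 20-dimensional vector of relative amino acid frequencies. The function name `composition` and the example sequence are illustrative assumptions; the published methods listed above add further features and a trained classifier on top.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the relative frequencies of the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

seq = "MKLLSAVALLKDEL"
vec = composition(seq)
print(len(vec), vec[AMINO_ACIDS.index("L")])
```

Because the vector ignores residue order, it needs no knowledge of where a sorting signal sits in the sequence.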

A simple statistical method: Linear regression
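
For a straight line y = a·x + b, the least-squares parameters can be calculated directly from the data, with no iterative training. A minimal sketch with made-up data points:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```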

Overfitting

A classification problem

How to estimate parameters for prediction?

Model selection

The test set method

Cross Validation

Which kind of Cross Validation?
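
One common kind is k-fold cross-validation: the data set is split into k folds, and each fold serves once as the test set while the remaining folds are used for training. Leave-one-out is the special case where k equals the number of examples. The sketch below only generates the index splits; the function name and the fixed random seed are assumptions for illustration.

```python
import random

def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Note that a purely random split like this one ignores any relatedness between the examples.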

Problem: sequences are related

If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance

Solution: Homology reduction

Calculate all pairwise similarities in the data set
Define a threshold for being "neighbours" (too closely related)
Count the neighbours of each example and remove the example with the most neighbours
Repeat until no example has any neighbours left
Alternative: Homology partitioning
Keep all examples, but cluster them so that neighbours always end up in the same fold, never split between training and test
Should be combined with weighting
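
Both procedures can be sketched in a few lines of Python. The reduction function follows the greedy steps listed above. The partitioning function is read here as keeping each cluster of related sequences together in one fold, so that training and test folds share no neighbours. The data layout (a dict keyed by `frozenset` pairs), the function names and the toy similarity values are assumptions made for the example, and weighting is not shown.

```python
def neighbour_graph(names, similarity, threshold):
    """Map each name to the set of names it is too closely related to."""
    graph = {name: set() for name in names}
    for pair, value in similarity.items():
        if value > threshold:
            a, b = sorted(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph

def homology_reduce(names, similarity, threshold):
    """Greedily remove the example with most neighbours until none remain."""
    graph = neighbour_graph(names, similarity, threshold)
    kept = sorted(names)
    while kept:
        worst = max(kept, key=lambda n: len(graph[n]))
        if not graph[worst]:
            break  # nobody has neighbours left
        kept.remove(worst)
        for other in graph.pop(worst):
            graph[other].discard(worst)
    return kept

def homology_partition(names, similarity, threshold, k):
    """Keep all examples; place each cluster of neighbours in a single fold."""
    graph = neighbour_graph(names, similarity, threshold)
    seen, clusters = set(), []
    for start in sorted(names):
        if start in seen:
            continue
        # Collect the connected component containing `start`
        cluster, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.append(node)
            for other in sorted(graph[node] - seen):
                seen.add(other)
                stack.append(other)
        clusters.append(sorted(cluster))
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)  # fill the smallest fold first
    return folds

# Toy data: hypothetical pairwise similarities between four sequences
names = ["A", "B", "C", "D"]
sim = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.2,
    frozenset(("A", "D")): 0.1,
}
reduced = homology_reduce(names, sim, threshold=0.5)
folds = homology_partition(names, sim, threshold=0.5, k=2)
print(reduced)  # ['A', 'C', 'D']
print(folds)    # [['A', 'B', 'C'], ['D']]
```

In the toy data, B is related to both A and C, so removing B alone is enough for reduction, whereas partitioning keeps all three in one fold.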

Defining a threshold for homology reduction
