Introduction to Pattern Recognition: Prediction in Bioinformatics
Date: 04.01.2017 | Size: 3.95 MB | #4400
Prediction in Bioinformatics
What do we want to predict?
- Features from sequence
- Data mining
How can we predict?
- Homology / Alignment
- Pattern Recognition / Statistical Methods / Machine Learning
What is prediction?
- Generalization / Overfitting
- Preventing overfitting: Homology reduction
How do we measure prediction?
- Performance measures
- Threshold selection
Sequence → structure → function
Protein-coding genes
- transcription factor binding sites
- transcription start/stop
- translation start/stop
- splicing: donor/acceptor sites
Non-coding RNA
General features
- Structure (curvature/bending)
- Binding (histones etc.)
Folding / structure
Post-Translational Modifications
- Attachment: phosphorylation, glycosylation, lipid attachment
- Cleavage: signal peptides, propeptides, transit peptides
- Sorting: secretion, import into various organelles, insertion into membranes
Interactions
Function
- Enzyme activity
- Transport
- Receptors
- Structural components
- etc...
Protein sorting in eukaryotes
Data: UniProt annotation of protein sorting
Annotations relevant for protein sorting are found in:
- the CC (comments) lines
- cross-references (DR lines) to GO (Gene Ontology)
- the FT (feature table) lines
Example entry:
ID   INS_HUMAN   Reviewed;   110 AA.
AC   P01308;
...
DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN   Name=INS;
...
CC   -!- SUBCELLULAR LOCATION: Secreted.
...
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
...
FT   SIGNAL   1   24
Three types of non-experimental qualifiers in the CC and FT lines:
- Potential: Predicted by sequence analysis methods
- Probable: Inconclusive experimental evidence
- By similarity: Predicted by alignment to proteins with known location
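As an illustration of how the CC, DR and FT lines and the three qualifiers might be picked out of an entry, here is a minimal sketch. The function names `find_qualifier` and `sorting_annotations` are hypothetical, the line layout is assumed to be "two-letter code, whitespace, content", and the last input line is invented only to show a qualifier being detected. Real entries have more variation than this handles.

```python
import re

# The three non-experimental qualifiers named in the text.
QUALIFIERS = ("Potential", "Probable", "By similarity")

def find_qualifier(text):
    """Return the first non-experimental qualifier found in text, or None."""
    for q in QUALIFIERS:
        if re.search(r"\b" + re.escape(q) + r"\b", text, re.IGNORECASE):
            return q
    return None

def sorting_annotations(lines):
    """Collect sorting-related CC, DR (GO) and FT SIGNAL lines with any qualifier."""
    hits = []
    for line in lines:
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "CC" and "SUBCELLULAR LOCATION:" in rest:
            value = rest.split("SUBCELLULAR LOCATION:", 1)[1].strip()
            hits.append(("location", value, find_qualifier(value)))
        elif code == "DR" and rest.startswith("GO;"):
            hits.append(("go", rest, None))
        elif code == "FT" and rest.split()[:1] == ["SIGNAL"]:
            hits.append(("signal", rest, find_qualifier(rest)))
    return hits

entry = [
    "ID INS_HUMAN Reviewed; 110 AA.",
    "CC -!- SUBCELLULAR LOCATION: Secreted.",
    "DR GO; GO:0005576; C:extracellular region; IC:UniProtKB.",
    "FT SIGNAL 1 24",
    # Hypothetical line, added only to show a qualifier being detected:
    "FT SIGNAL 1 20 Potential.",
]
for kind, value, qualifier in sorting_annotations(entry):
    print(kind, "|", value, "|", qualifier or "no qualifier")
```

When building a data set, annotations carrying one of these qualifiers would typically be filtered out or handled separately, since they are not backed by direct experimental evidence.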
Problems in database parsing
Prediction methods
Homology / Alignment
Simple pattern recognition
- Example: PROSITE entry PS00014, ER_TARGET:
- Endoplasmic reticulum targeting sequence.
- Pattern: [KRHQSA]-[DENQ]-E-L>
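The ER_TARGET pattern above can be matched with an ordinary regular expression. The sketch below translates a small subset of PROSITE syntax into Python `re` syntax; the helper name `prosite_to_regex` and the two test sequences are hypothetical, and features outside the listed subset (for example an anchor inside square brackets) are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE syntax into a Python regex.

    Handled: '-' separators, [..] allowed sets, {..} forbidden sets,
    x wildcards, (n) and (n,m) repeats, '<' and '>' terminal anchors.
    """
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")                      # any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex

er_target = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")
print(er_target)

# Hypothetical sequences: the first ends in KDEL, the second does not.
for seq in ("MKTAYIAKQRKDEL", "KDELMKTAYIAKQR"):
    print(seq, bool(re.search(er_target, seq)))
```

The trailing `>` matters: it ties the motif to the C-terminus, so a KDEL stretch elsewhere in the sequence does not match.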
Statistical methods
- Weight matrices: calculate amino acid probabilities
- Other examples: Regression, variance analysis, clustering
Machine learning
- Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
- Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
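The weight-matrix idea listed under statistical methods can be sketched as follows: amino acid probabilities are counted directly per position from aligned examples and turned into log-odds scores. This is only one common formulation; the uniform background, the pseudocount of 1 and the example tetrapeptides are assumptions made for illustration, not values from the source.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def weight_matrix(aligned, pseudocount=1.0):
    """Return per-position log2-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    matrix = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            prob = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(prob / background)
        matrix.append(scores)
    return matrix

def score(matrix, window):
    """Sum the position-specific scores of a window of the same length."""
    return sum(column[aa] for column, aa in zip(matrix, window))

# Hypothetical aligned C-terminal tetrapeptides
sites = ["KDEL", "HDEL", "KEEL", "RDEL", "KDEL"]
wm = weight_matrix(sites)
print(round(score(wm, "KDEL"), 2), round(score(wm, "GGGG"), 2))
```

Every parameter here is obtained by direct calculation from counts, which is what separates this from the iteratively trained methods in the machine learning group.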
Prediction of subcellular localisation from sequence
Signal-based prediction
Signal peptides
- von Heijne 1983, 1986 [WM]
- SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]
Mitochondrial & chloroplast transit peptides
- Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
- ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
- iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
- Protein Prowler* (Hawkins & Bodén 2006) [NN]
- * = also includes signal peptides
Nuclear localisation signals
- PredictNLS (Cokol et al. 2000) [regex]
- NucPred (Heddad et al. 2004) [regex, GA]
Composition-based prediction
- Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
- ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
- Chou and Elrod 1998 [12 categories; covariant discriminant]
- NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
- SubLoc (Hua and Sun 2001) [4 categories; SVM]
- PLOC (Park and Kanehisa 2003) [12 categories; SVM]
- LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]
- BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]
- Pro:
- Con:
- cannot identify isoform differences
A simple statistical method: Linear regression
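Only the slide title survives here, so the following is a generic sketch of ordinary least-squares fitting of a straight line, not a reconstruction of the original slide. The function name `fit_line` and the data points are hypothetical. It shows parameters being obtained by direct calculation rather than iterative training.

```python
def fit_line(xs, ys):
    """Least-squares estimates (slope, intercept) for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalised; the factor 1/n cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```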
Overfitting
A classification problem
How to estimate parameters for prediction?
Model selection
The test set method
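A minimal sketch of the test set method, under stated assumptions: part of the data is held out, parameters are estimated on the rest, and performance is reported on the held-out part only. The one-feature threshold classifier, the 30% test fraction and the data are all hypothetical stand-ins chosen to keep the example short.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of examples where 'x above threshold' agrees with the label."""
    correct = sum((x > threshold) == (label == 1) for x, label in data)
    return correct / len(data)

# Hypothetical (feature value, label) pairs, 10 per class
data = [(i / 10, 0) for i in range(10)] + [(2 + i / 10, 1) for i in range(10)]
train, test = train_test_split(data)
threshold = fit_threshold(train)
print("train accuracy:", accuracy(threshold, train))
print("test accuracy: ", accuracy(threshold, test))
```

Only the test accuracy is an estimate of how the model generalizes; the training accuracy is reported for comparison, since a large gap between the two is the usual sign of overfitting.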
Cross Validation
Which kind of Cross Validation?
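As a sketch of the two most common variants: in k-fold cross validation the data is divided into k folds and each fold serves once as the test set; leave-one-out is the special case where k equals the number of examples. The helper names, the mean-value "model" and the data are hypothetical, and folds are assigned by striding through the list, which assumes the order of the examples carries no information.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(examples, k, fit, evaluate):
    """Average the test score over k train/test rounds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = fit([examples[j] for j in train_idx])
        scores.append(evaluate(model, [examples[j] for j in test_idx]))
    return sum(scores) / len(scores)

def fit_mean(train):
    """Toy model: predict the mean of the training values."""
    return sum(train) / len(train)

def mean_squared_error(model, test):
    return sum((y - model) ** 2 for y in test) / len(test)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical measurements
print("3-fold:       ", cross_validate(values, 3, fit_mean, mean_squared_error))
print("leave-one-out:", cross_validate(values, len(values), fit_mean, mean_squared_error))
```

Smaller k means fewer training rounds and smaller training sets; leave-one-out uses nearly all data for training in every round but needs as many rounds as there are examples.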
Problem: sequences are related
Solution: Homology reduction
- Calculate all pairwise similarities in the data set
- Define a threshold for being "neighbours" (too closely related)
- Calculate the number of neighbours for each example, and remove the example with the most neighbours
- Repeat until there are no examples with neighbours left
Alternative: Homology partitioning
- Keep all examples, but cluster them so that neighbours always end up in the same fold (no example has a neighbour in a different fold)
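Both procedures can be sketched in a few lines. This is an illustrative reading of the steps above, not a reference implementation: the function names are hypothetical, `identity` is a toy stand-in for a real alignment-based similarity and assumes equal-length sequences, ties are broken alphabetically, and the partitioning uses single-linkage clusters assigned largest-first to the currently smallest fold.

```python
def identity(a, b):
    """Toy similarity: fraction of identical positions (equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def homology_reduce(names, similarity, threshold):
    """Repeatedly remove the example with the most neighbours until none remain."""
    neighbours = {
        a: {b for b in names if b != a and similarity(a, b) > threshold}
        for a in names
    }
    kept = set(names)
    while kept:
        counts = {a: len(neighbours[a] & kept) for a in kept}
        worst = max(sorted(kept), key=counts.get)  # first in sorted order on ties
        if counts[worst] == 0:
            break
        kept.remove(worst)
    return sorted(kept)

def homology_partition(names, similarity, threshold, k):
    """Keep all examples; put each cluster of neighbours into a single fold."""
    cluster_of = {a: {a} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) > threshold and cluster_of[a] is not cluster_of[b]:
                merged = cluster_of[a] | cluster_of[b]
                for member in merged:
                    cluster_of[member] = merged
    clusters = {id(c): c for c in cluster_of.values()}.values()
    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=lambda c: (-len(c), min(c))):
        min(folds, key=len).extend(sorted(cluster))
    return folds

# Hypothetical sequences; neighbours share more than 70% identical positions
seqs = ["AAAA", "AAAC", "AACC", "GGGG", "GGGT"]
print(homology_reduce(seqs, identity, 0.7))
print(homology_partition(seqs, identity, 0.7, 2))
```

Reduction discards data to guarantee that no two remaining examples are neighbours; partitioning keeps everything but can produce folds of unequal size when clusters are large.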
Defining a threshold for homology reduction