Introduction to Pattern Recognition: Prediction in Bioinformatics



Date: 04.01.2017


Introduction to Pattern Recognition

  • Prediction in Bioinformatics

  • What do we want to predict?

    • Features from sequence
    • Data mining
  • How can we predict?

    • Homology / Alignment
    • Pattern Recognition / Statistical Methods / Machine Learning
  • What is prediction?

    • Generalization / Overfitting
    • Preventing overfitting: Homology reduction
  • How do we measure prediction?

    • Performance measures
    • Threshold selection

Sequence → structure → function



Prediction from DNA sequence

  • Protein-coding genes

    • transcription factor binding sites
    • transcription start/stop
    • translation start/stop
    • splicing: donor/acceptor sites
  • Non-coding RNA

    • tRNAs
    • rRNAs
    • miRNAs
  • General features

    • Structure (curvature/bending)
    • Binding (histones etc.)


Prediction from protein sequence

  • Folding / structure

  • Post-Translational Modifications

    • Attachment: phosphorylation, glycosylation, lipid attachment
    • Cleavage: signal peptides, propeptides, transit peptides
    • Sorting: secretion, import into various organelles, insertion into membranes
  • Interactions

  • Function

    • Enzyme activity
    • Transport
    • Receptors
    • Structural components
    • etc...


Protein sorting in eukaryotes



Data: UniProt annotation of protein sorting

  • Annotations relevant for protein sorting are found in:

    • the CC (comments) lines
    • cross-references (DR lines) to GO (Gene Ontology)
    • the FT (feature table) lines
    • Example (excerpt from the entry for human insulin):
      • ID   INS_HUMAN   Reviewed;   110 AA.
      • AC   P01308;
      • ...
      • DE   Insulin precursor [Contains: Insulin B chain; Insulin A chain].
      • GN   Name=INS;
      • ...
      • CC   -!- SUBCELLULAR LOCATION: Secreted.
      • ...
      • DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
      • ...
      • FT   SIGNAL   1   24
  • Three types of non-experimental qualifiers occur in the CC and FT lines:

    • Potential: Predicted by sequence analysis methods
    • Probable: Inconclusive experimental evidence
    • By similarity: Predicted by alignment to proteins with known location
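The annotation lines listed above can be pulled out of a flat-file entry with a few string operations. The following is a minimal sketch, not a full UniProt parser: it assumes the two-letter line codes shown in the example entry, and it assumes that a non-experimental qualifier appears in parentheses at the end of a location, e.g. "(By similarity)". The function names are hypothetical.

```python
import re

# Excerpt in the flat-file layout of the example entry (line code, then content).
ENTRY = """\
ID   INS_HUMAN      Reviewed;   110 AA.
AC   P01308;
CC   -!- SUBCELLULAR LOCATION: Secreted.
DR   GO; GO:0005576; C:extracellular region; IC:UniProtKB.
FT   SIGNAL        1     24
"""

QUALIFIERS = ("Potential", "Probable", "By similarity")
# Assumption: the qualifier is written in parentheses at the end of the text.
QUALIFIER_RE = re.compile(r"\s*\((" + "|".join(QUALIFIERS) + r")\)$")


def split_qualifier(text):
    """Return (text, qualifier); qualifier is None if no qualifier is present."""
    match = QUALIFIER_RE.search(text)
    if match:
        return text[:match.start()], match.group(1)
    return text, None


def parse_sorting_annotation(entry):
    """Collect sorting-related annotation from the CC, DR and FT lines."""
    result = {"location": [], "go_component": [], "features": []}
    for line in entry.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "CC" and "SUBCELLULAR LOCATION:" in content:
            location = content.split("SUBCELLULAR LOCATION:", 1)[1]
            result["location"].append(split_qualifier(location.strip().rstrip(".")))
        elif code == "DR" and content.startswith("GO;"):
            fields = [f.strip() for f in content.rstrip(".").split(";")]
            # "C:" marks the cellular component branch of GO.
            if len(fields) >= 3 and fields[2].startswith("C:"):
                result["go_component"].append((fields[1], fields[2][2:]))
        elif code == "FT":
            parts = content.split()
            if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
                result["features"].append((parts[0], int(parts[1]), int(parts[2])))
    return result
```

Calling `parse_sorting_annotation(ENTRY)` yields the location "Secreted" without a qualifier, the GO cellular component term, and the SIGNAL feature with its start and end positions. Keeping the qualifier separate makes it easy to exclude predicted or similarity-based annotation when building a training set.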


Problems in database parsing



Prediction methods

  • Homology / Alignment

  • Simple pattern recognition

    • Example:
      • PROSITE entry PS00014, ER_TARGET:
      • Endoplasmic reticulum targeting sequence.
      • Pattern: [KRHQSA]-[DENQ]-E-L>
  • Statistical methods

    • Weight matrices: calculate amino acid probabilities
    • Other examples: Regression, variance analysis, clustering
  • Machine learning

    • Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
    • Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
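A PROSITE pattern such as the ER_TARGET example above maps almost directly onto a regular expression. The sketch below handles only the common pattern elements (separators, alternatives, exclusions, wildcards, repeats and the terminal anchors); the function name `prosite_to_regex` is hypothetical and this is not a complete implementation of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                     # '-' only separates positions
    regex = regex.replace("x", ".")                    # 'x' matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {..} excludes residues
    regex = regex.replace("(", "{").replace(")", "}")   # (n) or (n,m) repeats
    regex = regex.replace("<", "^").replace(">", "$")   # N- / C-terminal anchors
    return regex


er_target = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")
has_signal = bool(re.search(er_target, "MKTLLAGSKDEL"))
```

The trailing `>` becomes `$`, so the pattern only matches at the C-terminus: a sequence ending in KDEL matches, while the same four residues in the middle of a sequence do not.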


Prediction of subcellular localisation from sequence



Signal-based prediction

  • Signal peptides

    • von Heijne 1983, 1986 [WM]
    • SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]
  • Mitochondrial & chloroplast transit peptides

    • Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
    • ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
    • iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
    • Protein Prowler* (Hawkins & Bodén 2006) [NN]
      • * = also predicts signal peptides
  • Nuclear localisation signals

    • PredictNLS (Cokol et al. 2000) [regex]
    • NucPred (Heddad et al. 2004) [regex, GA]
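The methods marked [WM] above are weight matrices. As an illustration of the general idea, the sketch below estimates position-specific amino acid probabilities from a set of aligned example windows and scores a candidate window by its summed log-odds. The training windows, the uniform background and the pseudocount are assumptions made for the example; this is not the matrix published by von Heijne.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, background=None, pseudocount=1.0):
    """Return one {residue: log2 odds} column per alignment position."""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all training windows must have the same length")
    if background is None:
        # Assumption: uniform background frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    matrix = []
    for pos in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        matrix.append({aa: math.log2(counts[aa] / total / background[aa])
                       for aa in AMINO_ACIDS})
    return matrix


def score(matrix, window):
    """Sum the log-odds of the residues observed at each position."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_window(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max((score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

For example, a matrix built from the toy windows `["ALA", "AVA", "ALA", "GLA"]` scores `"ALA"` higher than `"WWW"`, and `best_window` locates the motif inside a longer sequence. The pseudocount keeps the log-odds finite for residues never seen at a position.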


Composition-based prediction

  • Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]

  • ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]

  • Chou and Elrod 1998 [12 categories; covariant discriminant]

  • NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]

  • SubLoc (Hua and Sun 2001) [4 categories; SVM]

  • PLOC (Park and Kanehisa 2003) [12 categories; SVM]

  • LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]

  • BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]
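All of the methods above start from the amino acid composition of the whole protein. The sketch below computes that 20-dimensional feature vector and, purely for illustration, assigns a category with a nearest-centroid rule using Euclidean distance. The nearest-centroid classifier is a stand-in chosen for brevity; it is not the classifier used by any of the listed methods.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each of the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)  # ignores X, B, Z etc.
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda label: squared_distance(vector, centroids[label]))
```

In practice the centroids would be the mean composition vectors of the training proteins in each category. Because composition discards the order of the residues, such methods do not depend on an intact N-terminal sorting signal.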



A simple statistical method: Linear regression
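Linear regression fits a straight line y = a·x + b by minimising the sum of squared residuals, and its two parameters can be calculated directly from the data. A minimal sketch for a single input variable, with a hypothetical function name:

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For these four points, which lie exactly on y = 2x + 1, the fit recovers a slope of 2 and an intercept of 1.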



Overfitting



A classification problem



How to estimate parameters for prediction?



Model selection



The test set method
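In the test set method, part of the data is set aside before training, the parameters are estimated on the remainder, and the error is measured only on the held-out part. A minimal sketch; the 30% test fraction and the fixed random seed are arbitrary choices for the example:

```python
import random


def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the examples and return (training set, test set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(predict, examples):
    """Average squared difference between predictions and targets."""
    return sum((predict(x) - y) ** 2 for x, y in examples) / len(examples)


data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
test_error = mean_squared_error(lambda x: 2 * x + 1, test)
```

Only the error on `test` estimates how well a model generalizes; the error on `train` can always be reduced by adding parameters. The price is that the held-out examples are lost for training.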



Cross Validation
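Cross-validation repeats the test set method k times: the data are divided into k folds, each fold serves once as the test set while the other k-1 folds are used for training, and the k test errors are averaged. The sketch below assumes that the order of the examples carries no structure, since it assigns examples to folds in an interleaved fashion; `fit` and `error` are hypothetical callables supplied by the user.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of examples")
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_validate(examples, k, fit, error):
    """Average test error over k folds.

    fit(training_examples) returns a model;
    error(model, test_examples) returns a number.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = fit([examples[i] for i in train_idx])
        scores.append(error(model, [examples[i] for i in test_idx]))
    return sum(scores) / k
```

Setting k equal to the number of examples gives leave-one-out cross-validation. Every example is used for testing exactly once, so no data are wasted, at the cost of training k models.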



Which kind of Cross Validation?



Problem: sequences are related



Solution: Homology reduction

  • Calculate all pairwise similarities in the data set

  • Define a threshold for being "neighbours" (too closely related)

  • Count the neighbours of each example, and remove the example with the most neighbours

  • Repeat until there are no examples with neighbours left

  • Alternative: Homology partitioning

    • Keep all examples, but cluster them so that neighbours always end up in the same fold (never split between training and test data)
    • Should be combined with weighting
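The reduction steps above can be sketched as a greedy loop, given any pairwise similarity function and a threshold. The partitioning variant keeps each cluster of neighbours together in one fold, so that no test example has a neighbour in the training data. In this sketch the similarity function is supplied by the caller, clusters are formed by single linkage, and clusters are assigned to folds largest first; these are assumptions for illustration, and weighting is not included.

```python
def homology_reduce(names, similarity, threshold):
    """Greedily remove the example with the most neighbours until none remain."""
    neighbours = {a: {b for b in names if b != a and similarity(a, b) > threshold}
                  for a in names}
    while neighbours:
        worst = max(neighbours, key=lambda a: len(neighbours[a]))
        if not neighbours[worst]:
            break  # no example has any neighbour left
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return list(neighbours)


def homology_partition(names, similarity, threshold, k):
    """Assign whole single-linkage clusters of neighbours to k folds."""
    seen, clusters = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # collect everything reachable through neighbour links
            current = stack.pop()
            members.append(current)
            for other in names:
                if other not in seen and similarity(current, other) > threshold:
                    seen.add(other)
                    stack.append(other)
        clusters.append(members)
    folds = [[] for _ in range(k)]
    for members in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(members)  # fill the smallest fold first
    return folds
```

With five sequences in which a, b and c form one chain of neighbours and d and e another, reduction keeps three mutually unrelated sequences, while partitioning into two folds places a, b and c in one fold and d and e in the other. Both functions compare all pairs, which is only practical for data sets of moderate size.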



Defining a threshold for homology reduction


