BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction

BCB 444/544, Lecture 28
Gene Prediction - finish it
Promoter Prediction
#28_Oct29

Required Reading (before lecture)
- Mon Oct 29 - Lecture 28: Promoter & Regulatory Element Prediction, Chp 9 - pp 113 - 126
- Wed Oct 31 - Lecture 29: Phylogenetics Basics, Chp 10 - pp 127 - 141
- Thurs Nov 1 - Lab 9: Gene & Regulatory Element Prediction
- Fri Nov 2 - Lecture 30: Phylogenetic Tree Construction Methods & Programs, Chp 11 - pp 142 - 169

Assignments & Announcements
- Mon Oct 29 - HW#5 will be posted today
- HW#5 = Hands-on exercises with phylogenetics and tree-building software
- Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

BCB 544 “Team” Projects
- Last week of classes will be devoted to Projects
- Written reports due: Mon Dec 3 (no class that day)
- Oral presentations (20-30 min) will be: Wed-Fri Dec 5, 6, 7
- 1 or 2 teams will present during each class period
- See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
- Due: PART 1 - ASAP; PART 2 - meeting prior to 5 PM Fri Nov 2
- Part 1 - Brief outline of Project, email to Drena & Michael
- After response/approval, then Part 2 - More detailed outline of project
  - Read a few papers and summarize status of problem
  - Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
- Nov 1 Thurs - BBMB Seminar, 4:10 in 1414 MBB
  Todd Yeates, UCLA: TBA - something cool about structure and evolution?
- Nov 2 Fri - BCB Faculty Seminar, 2:10 in 102 ScI
  Bob Jernigan, BBMB, ISU: Control of Protein Motions by Structure

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
- Categories of Gene Prediction Programs
- Gene Prediction in Prokaryotes
- Gene Prediction in Eukaryotes

Computational Gene Prediction: Approaches
- Ab initio methods
  - Search by signal: find DNA sequences involved in gene expression
  - Search by content: test statistical properties distinguishing coding from non-coding DNA
- Similarity-based methods
  - Database search: exploit similarity to proteins, ESTs, cDNAs
  - Comparative genomics: exploit aligned genomes (do other organisms have similar sequence?)
- Hybrid methods - best

Computational Gene Prediction: Algorithms
- Neural Networks (NNs) (more on these later), e.g., GRAIL
- Linear discriminant analysis (LDA) (see text), e.g., FGENES, MZEF
- Markov Models (MMs) & Hidden Markov Models (HMMs), e.g.,
  - GeneSeqer - uses MMs
  - GENSCAN - uses 5th order HMMs (see text)
  - HMMgene - uses conditional maximum likelihood (see text)

Signals Search
Approach: Build models (PSSMs, profiles, HMMs, ...) and search against DNA. Detected instances provide evidence for genes.

Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
- Nucleotide/amino acid distribution
- GC content (CpG islands, exon/intron)
- Uneven usage of synonymous codons (codon bias)
- Hexamer frequency - most discriminative of these for identifying coding potential
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Human Codon Usage

Predicting Genes based on Codon Usage Differences
Algorithm: Process sliding window
- Use codon frequencies to compute probability of coding versus non-coding
- Plot log-likelihood ratio

Similarity-Based Methods: Database Search
- In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
- Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
- Problems:
  - Will not find “new” or RNA genes (non-coding genes)
  - Limits of similarity are hard to define
  - Small exons might be overlooked

Similarity-Based Methods: Comparative Genomics
- Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
- Advantages: May find uncharacterized or RNA genes
- Problems:
  - Finding suitable evolutionary distance
  - Finding limits of high similarity (functional regions)
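
The sliding-window algorithm from the “Predicting Genes based on Codon Usage Differences” slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular program: the function names `codon_frequencies` and `window_scores`, the pseudocount, the 30-nt window, and the uniform non-coding model (every codon at 1/64) are all assumptions made for the example.

```python
import math
from collections import Counter

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame training sequences.

    A pseudocount keeps unseen codons from getting probability zero.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def window_scores(dna, coding_freq, noncoding_freq=None,
                  window=30, step=3, frame=0):
    """Return (start, log-likelihood ratio) for each window in one frame.

    Positive scores favor the coding model, negative scores the
    non-coding model. If no non-coding model is given, a uniform
    background (1/64 per codon) is assumed. `window` should be a
    multiple of 3.
    """
    if noncoding_freq is None:
        noncoding_freq = {c: 1.0 / 64 for c in CODONS}
    dna = dna.upper()
    scores = []
    for start in range(frame, len(dna) - window + 1, step):
        llr = 0.0
        for i in range(start, start + window, 3):
            codon = dna[i:i + 3]
            if codon in coding_freq:  # skip codons with N or other symbols
                llr += math.log(coding_freq[codon] / noncoding_freq[codon])
        scores.append((start, llr))
    return scores
```

Plotting the second element of each tuple against the first gives the log-likelihood curve described on the slide. In practice all six reading frames would be scored, and the non-coding frequencies would be estimated from real intergenic or intronic sequence rather than assumed uniform.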

Human-Mouse Homology
Comparison of 1196 orthologous genes: sequence identity between genes in human vs mouse
- Exons: 84.6%
- Protein: 85.4%
- Introns: 35%
- 5' UTRs: 67%
- 3' UTRs: 69%

Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel, vbrendel@iastate.edu
Brendel et al (2004) Bioinformatics 20: 1157
19、bs #28- Promoter Prediction,Perform pairwise alignment with large gaps in one sequence (due to introns) Align genomic DNA with cDNA, ESTs, protein sequences Score semi-conserved sequences at splice junctions Using Bayesian probability model & 1st order MMScore coding constraints in translated exons
20、Using Bayesian model,Spliced Alignment Algorithm,GeneSeqer - Brendel et al.- ISU,http:/deepc2.psi.iastate.edu/cgi-bin/gs.cgi,Brendel et al (2004) Bioinformatics 20: 1157 http:/bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157,Brendel 2005,BCB 444/544 F07 ISU Dobbs #28- Promoter Predic

Splice Site Detection
Do DNA sequences surrounding splice “consensus” sequences contribute to splicing signal? YES
Figure legend:
- i: ith position in sequence
- [...]: avg information content over all positions 20 nt from splice site
- [...]: avg sample standard deviation of [...]
Brendel 2005
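
To make “information content at the ith position” concrete, here is a small sketch that computes it for a set of aligned splice-site sequences. It assumes the standard Shannon-based definition for a 4-letter alphabet, I_i = 2 + sum over bases of f * log2(f), with no small-sample correction; whether this matches the exact formula behind the figure is an assumption, and the function name `information_content` is hypothetical.

```python
import math

def information_content(sites):
    """Per-position information content (in bits) of aligned DNA sites.

    `sites` is a list of equal-length strings, e.g. windows centered on
    annotated splice sites. A position where all sites share one base
    scores 2 bits; a position with uniform base usage scores 0 bits.
    """
    length = len(sites[0])
    result = []
    for i in range(length):
        column = [s[i].upper() for s in sites]
        bits = 2.0  # maximum for four equally likely bases
        for base in "ACGT":
            f = column.count(base) / len(column)
            if f > 0:
                bits += f * math.log2(f)
        result.append(bits)
    return result
```

Plotting these values against position around donor or acceptor sites gives an “information content vs position” curve; positions that rise clearly above the background level far from the splice site are the ones that contribute to the signal.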

Information Content vs Position
Which sequences are exons & which are introns? How can you tell?
Brendel 2005

Markov Model for Spliced Alignment
Brendel 2005

Evaluation of Splice Site Prediction
Fig 5.11 Baxevanis & Ouellette 2005
- TP = positive instance correctly predicted as positive
- FP = negative instance incorrectly predicted as positive
- TN = negative instance correctly predicted as negative
- FN = positive instance incorrectly predicted as negative

Evaluation of Predictions
Do not memorize this!
Figure labels: Predicted Positives, True Positives, False Positives
Measures shown: Sensitivity (Coverage, Recall), Specificity, Normalized specificity, Misclassification rates

Evaluation of Predictions - in English
- Sensitivity = Coverage = Recall = TP / (TP + FN)
  In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
- Specificity = TP / (TP + FP)
  In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as “Specificity” is sometimes referred to as “Positive predictive value”).
IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!

Best Measures for Comparison?
- ROC curves (Receiver Operating Characteristic (?!)) http://en.wikipedia.org/wiki/Roc_curve
- Correlation Coefficient: Matthews correlation coefficient (MCC)
  - MCC = 1 for a perfect prediction
  - MCC = 0 for a completely random assignment
  - MCC = -1 for a “perfectly incorrect” prediction
Do not memorize this!
In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate). Note that in this definition “specificity” has its medical-jargon meaning, TN / (TN + FP), so 1 - specificity = FPR.

GeneSeqer: Input
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel 2005
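
The measures from the “Evaluation of Predictions” slides can be computed directly from the four counts TP, FP, TN, FN. The sketch below uses the definitions given in these slides (so “specificity” here is TP / (TP + FP), i.e. positive predictive value) together with the standard MCC formula. The function name `evaluate` and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return sensitivity, specificity (as defined in these slides) and MCC."""
    # Sensitivity (coverage, recall): fraction of real positives found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity as used in this lecture: fraction of predicted
    # positives that are real (positive predictive value).
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    # Matthews correlation coefficient; set to 0.0 by convention when
    # any marginal total is zero and the formula is undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity,
            "specificity": specificity,
            "mcc": mcc}
```

For instance, `evaluate(100, 100, 0, 0)` models a predictor that labels every test case positive when half are truly positive: sensitivity is 1.0, but specificity is only 0.5 and MCC falls back to 0.0, which is why sensitivity alone says little about performance.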

GeneSeqer: Output
Brendel 2005

GeneSeqer: Gene Evidence Summary
Brendel 2005

Gene Prediction - Problems & Status?
Common errors?
- False positive intergenic region: 2 annotated genes actually correspond to a single gene
- False negative intergenic region: One annotated gene structure actually contains 2 genes
- False negative gene prediction: Missing gene (no annotation)
- Other:
  - Partially incorrect gene annotation
  - Missing annotation of alternative transcripts
Current status?
- For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  - Limitation? Training data: predictions are organism specific
- Combined ab initio/homology based predictions: Improved accuracy
  - Limitation? Availability of identifiable sequence homologs in databases

Recommended Gene Prediction Software
- Ab initio
  - GENSCAN: http://genes.mit.edu/GENSCAN.html
  - GeneMark.hmm: http://exon.gatech.edu/GeneMark/
  - others: GRAIL, FGENES, MZEF, HMMgene
- Similarity-based
  - BLAST, GenomeScan, EST2Genome, Twinscan
- Combined
  - GeneSeqer: http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
  - ROSETTA
- Consensus: because results depend on organisms & specific task, always use more than one program!
  - Two servers that report consensus predictions: GeneComber, DIGIT

Other Gene Prediction Resources: at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/

Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
- 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
- 4.2 Using MZEF To Find Internal Coding Exons
- 4.3 Using GENEID to Identify Genes
- 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
- 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
- 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
- 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
- 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
- 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
- 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/

Chp 9 - Promoter & Regulatory Element Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 9 Promoter & Regulatory Element Prediction
- Promoter & Regulatory Elements in Prokaryotes
- Promoter & Regulatory Elements in Eukaryotes
- Prediction Algorithms

Eukaryotes vs Prokaryotes: Genomes
Eukaryotic genomes
- Are packaged in chromatin & sequestered in a nucleus
- Are larger and have multiple linear chromosomes
- Contain mostly non-protein coding DNA (98-99%)
Prokaryotic genomes
- DNA is associated with a nucleoid, but no nucleus
- Much smaller, usually single, circular chromosome
- Contain mostly protein encoding DNA

Eukaryotes vs Prokaryotes: Gene Structure

Eukaryotes vs Prokaryotes: Genes
Eukaryotic genes
- Are larger and more complex than in prokaryotes
- Contain introns that are “spliced” out to generate mature mRNAs*
- Often undergo alternative splicing, giving rise to multiple RNAs*
- Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)*
* In biology, statements such as this include an implicit “usually” or “often”

Eukaryotes vs Prokaryotes: Levels of Gene Regulation
Primary level of control?
- Prokaryotes: Transcription initiation
- Eukaryotes: Transcription is also very important, but expression is regulated at multiple levels, many of which are post-transcriptional:
  - RNA processing, transport, stability
  - Translation initiation
  - Protein processing, transport, stability
  - Post-translational modification (PTM)
  - Subcellular localization
Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

Eukaryotes vs Prokaryotes: Regulatory Elements
Prokaryotes:
- Promoters & operators (for operons) - cis-acting DNA signals
- Activators & repressors - trans-acting proteins (we won't discuss these)
Eukaryotes:
- Promoters & enhancers (for single genes) - cis-acting
- Transcription factors - trans-acting
Important difference? What the RNA polymerase actually binds

Prokaryotic Promoters
- RNA polymerase complex recognizes promoter sequences located very close to and on 5' side (“upstream”) of transcription initiation site
- Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for “transcription factors” binding first
- Prokaryotic promoter sequences are highly conserved:
  - -10 region
  - -35 region

Eukaryotic Promoters
- Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
- Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
- Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
  - -30 region “TATA” box
  - -100 region “CCAAT” box

Eukaryotic Promoters vs Enhancers
- Both promoters & enhancers are binding sites for transcription factors (TFs)
- Promoters: essential for initiation of transcription; located “relatively” close to start site (usually [...] 100 kb)

Eukaryotic genes are transcribed by 3 different RNA polymerases
(Location of promoter regions, TFBSs & TFs differ, too)
Brown Fig 9.18, BIOS Scientific Publishers Ltd, 1999
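
Because prokaryotic promoters are built around conserved -35 and -10 regions, a very simple “search by signal” can be sketched as a scan for two hexamers separated by a spacer. The slides do not give the consensus sequences, so the values below are assumptions: the commonly cited E. coli sigma-70 consensus TTGACA (-35) and TATAAT (-10), a spacer of 16-19 bp, and at most 2 mismatches per box. The function names are hypothetical, and real predictors use weighted models (PSSMs, HMMs) rather than plain mismatch counts.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(dna, box35="TTGACA", box10="TATAAT",
                   spacer=(16, 19), max_mismatch=2):
    """Scan one strand for -35/-10 box pairs with an allowed spacer.

    Returns a list of (pos35, pos10, total_mismatches) tuples, where
    positions are 0-based starts of each box in `dna`.
    """
    dna = dna.upper()
    n35, n10 = len(box35), len(box10)
    hits = []
    for i in range(len(dna) - n35 + 1):
        m35 = mismatches(dna[i:i + n35], box35)
        if m35 > max_mismatch:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + n35 + gap
            if j + n10 > len(dna):
                break  # larger gaps would also run past the end
            m10 = mismatches(dna[j:j + n10], box10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits
```

A scan like this produces many false positives on real genomes, which is one reason eukaryotic promoters, with their less conserved sequences and dependence on transcription factor binding, are much harder to predict.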