A Field Guide part 2.ppt

Uploaded by: Liu Yun  Document ID: 377825  Upload date: 2018-10-09  Format: PPT  Pages: 98  Size: 2.82MB

A Field Guide part 2
August 30, 2005
University of Colorado Health Sciences Center

Part 2
    Entrez: text searching
        a GenBank record
        preview/index
    BLAST: sequence searching
        pre-computed searches
        algorithms
        what's new?
    VAST: structure searching
    Example: mapping oligos to a genome

GenBank Records

The Flatfile Format

A Typical GenBank Record
    LOCUS       NM_019570    4279 bp    mRNA    linear    INV 28-OCT-2004
    DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
    ACCESSION   NM_019570
    VERSION     NM_019570.3  GI:50811869
    KEYWORDS    .

GenBank Record: Feature Table
    GenPept identifier

GenBank Record: Feature Table, cont.

GenBank Record: sequence

Indexing for Nucleotide UID 59958365
    Field               Indexed Terms
    primary accession   NM_001012399
    title               Bos taurus hemochromatosis (hfe), mRNA.
    organism            Bos taurus
    sequence length     1168
    modification date   2005/02/19
    properties          biomol mrna
                        gbdiv mam
                        srcdb refseq

Global Entrez Search: HFE
    HFE

Entrez Nucleotide: HFE
    137 records
    Not HFE

Smarter Query
    hfe[title] AND human[orgn]

hfe[title] AND human[orgn] (cont)
    Primary data

Preview/Index

Preview/Index: Properties, srcdb
    AND srcdb refseq[Properties]
    AND srcdb ddbj/embl/genbank[Properties]

    #1  hfe                                      137
    #2  hfe[title] AND human[orgn]                42
    #3  #2 AND srcdb refseq[prop]                 11
    #4  #2 AND srcdb ddbj/embl/genbank[prop]      31

Database Queries
    #5  #4 AND gbdiv pri[prop]                    29
    #6  #4 AND gbdiv est[prop]                     2

Molecule Queries
    #1  hfe                                      116
    #2  hfe[title] AND human[orgn]                42
    #3  #2 AND biomol mrna[prop]                  29
    #4  #2 AND biomol genomic[prop]               13

More Queries
    Fields are database-specific

Other Entrez Databases
    UniSTS: markers on the Genethon map of human chromosome 12
        Genethon[Map Name] AND human[organism] AND 12[chromosome]
    UniGene: rat clusters that have at least one mRNA
        rat[organism] NOT 0[mrna count]
    Structure: structures of bacterial kinases with resolutions below 2 Å
        bacteria[organism] AND kinase AND 000.00:002.00[resolution]
    SNP: uniquely mapped microsatellites on human chr2
        (microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]

Basic Local Alignment Search Tool

BLAST Web Searches, 2005
    200,000

Nucleotide
or protein: Related Sequences
BLAST link: BLink

Precomputed BLAST Services
    Transcript clusters: UniGene
    Protein homologs: HomoloGene

Link to Related Sequences

Related Sequences
    Most similar
    Least similar

BLink (BLAST Link)

BLink Output

Global vs Local Alignment
    Seq1: WHEREISWALTERNOW (16 aa)
    Seq2: HEWASHEREBUTNOWISHERE (21 aa)

The Flavors of BLAST
    Standard BLAST
        nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
        traditional “contiguous” word hit
    Megablast
        optimized for large batch searches
        can use discontiguous words
    PSI-BLAST
        constructs PSSMs automatically; uses as query
        very sensitive protein search
    RPS BLAST
        searches a database of PSSMs
        tool for conserved domain searches
    “contiguous”
    discontiguous

    Fast: heuristic approach based on Smith-Waterman
    Local alignments
    Statistical significance: Expect value
    Versatile: blastn, blastp, blastx,
    tblastn, tblastx, rps-blast, psi-blast; www, standalone, and network clients

Why Is BLAST So Popular?

How BLAST Works
    Make lookup table of “words” for query
    Scan database for hits
    Ungapped extensions of hits (initial HSPs)
    Gapped extensions (no traceback)
    Gapped extensions (traceback; alignment details)

Nucleotide Words
    GTACTGGACAT
     TACTGGACATG
      ACTGGACATGG
       CTGGACATGGA
        TGGACATGGAC
         GGACATGGACC
          GACATGGACCC
           ACATGGACCCT
    . . .
    Make a lookup table of words

Protein Words
    GTQ
     TQI
      QIT
       ITV
        TVE
         VED
          EDL
           DLF
    . . .
    Make a lookup table of words
    -f 11 = blastp default

Minimum Requirements for a Hit
    Nucleotide BLAST requires one exact match
    Protein BLAST requires two neighboring matches within 40 aa
    GTQITVEDLFYNISEI   YYN
    ATCGCCATGCTTAATTGGGCTT   CATGCTTAATT
    Labels: neighborhood words; one exact match; two matches
    -A 40 = blastp default

BLASTP Summary
    High-scoring pair (HSP)

Scoring Systems - Nucleotides
    Identity matrix
           A   G   C   T
       A  +1  -3  -3  -3
       G  -3  +1  -3  -3
       C  -3  -3  +1  -3
       T  -3  -3  -3  +1

    CAGGTAGCAAGCTTGCATGTCA
    || ||||||||||||  |||||
    CACGTAGCAAGCTTG-GTGTCA
    raw score = 19 - 9 = 10
    -r 1 -q -3

Scoring Systems - Proteins
    Position Independent Matrices
        PAM Matrices (Percent Accepted Mutation)
            Derived from observation; small dataset of alignments
            Implicit model of evolution
            All calculated from PAM1
            PAM250 widely used
        BLOSUM Matrices (BLOck SUbstitution Matrices)
            Derived from observation; large dataset of highly conserved blocks
            Each matrix derived separately from blocks with a defined percent identity cutoff
            BLOSUM62: default matrix for BLAST
    Position Specific Score
    Matrices (PSSMs): PSI- and RPS-BLAST

BLOSUM62
    A  4
    R -1  5
    N -2  0  6
    D -2 -2  1  6
    C  0 -3 -3 -3  9
    Q -1  1  0  0 -3  5
    E -1  0  0  2 -4  2  5
    G  0 -2  0 -1 -3 -2 -2  6
    H -2  0  1 -1 -3  0  0 -2  8
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
    X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Position-Specific Score Matrix
    DAF-1
    Serine/Threonine protein kinases catalytic loop

           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    435 K -1  0  0 -1 -2  3  0  3  0 -2 -2  1 -1 -1 -1 -1 -1 -1 -1 -2
    436 E  0  1  0  2 -1  0  2 -1  0 -1 -1  0  0  0 -1  0  0 -1 -1 -1
    437 S  0  0 -1  0  1  1  0  1  1  0 -1  0  0  0  2  0 -1 -1  0 -1
    438 N -1  0 -1 -1  1  0 -1  3  3 -1 -1  1 -1  0  0 -1 -1  1  1 -1
    439 K -2  1  1 -1 -2  0 -1 -2 -2 -1 -2  5  1 -2 -2 -1 -1 -2 -2 -1
    440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1  0 -3  7 -1 -2 -3 -1 -1
    441 A  3 -2  1 -2  0 -1  0  1 -2 -2 -2  0 -1 -2  3  1  0 -3 -3  0
    442 M -3 -4 -4 -4 -3 -4 -4 -5 -4  7  0 -4  1  0 -4 -4 -2 -4 -1  2
    443 A  4 -4 -4 -4  0 -4 -4 -3 -4  4 -1 -4 -2 -3 -4 -1 -2 -4 -3  4
    444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5  0 -5
    445 R -4  8 -3 -4  0 -1 -2 -3 -2 -5 -4  0 -3 -2 -4 -3 -3  0 -4 -5
    446 D -4 -4 -1  8 -6 -2  0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
    447 I -4 -5 -6 -6 -3 -4 -5 -6 -5  3  5 -5  1  1 -5 -5 -3 -4 -3  1
    448 K  0  0  1 -3 -5 -1 -1 -3 -3 -5 -5  7 -4 -5 -3 -1 -2 -5 -4 -4
    449 S  0 -3 -2 -3  0 -2 -2 -3 -3 -4 -4 -2 -4 -5  2  6  2 -5 -4 -4
    450 K  0  3  0  1 -5  0  0 -4 -1 -4 -3  4 -3 -2  2  1 -1 -5 -4 -4
    451 N -4 -3  8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
    452 I -3 -5 -5 -6  0 -5 -5 -6 -5  6  2 -5  2 -2 -5 -4 -3 -5 -3  3
    453 M -4 -4 -6 -6 -3 -4 -5 -6 -5  0  6 -5  1  0 -5 -4 -3 -4 -3  0
    454 V -3 -3 -5 -6 -3 -4 -5 -6 -5  3  3 -4  2 -2 -5 -4 -3 -5 -3  5
    455 K -2  1  1  4 -5  0 -1 -2  1 -4 -2  4 -3 -2 -3  0 -1 -5 -2 -3
    456 N  1  1  3  0 -4 -1  1  0 -3 -4 -4  3 -2 -5 -2  2 -2 -5 -4 -4
    457 D -3 -2  5  5 -1 -1  1 -1  0 -5 -4  0 -2 -5 -1  0 -2 -6 -4 -5
    458 L -3 -1  0 -3  0 -3 -2  3 -4 -2  3  0  1  1 -2 -2 -3  5 -1 -3

Position-Specific Score Matrix
    catalytic loop

Local Alignment Statistics
    High scores of local alignments between two random sequences follow the Extreme Value Distribution
    [Figure: distribution of alignment scores; axes: Score (S), Alignments; labels: your score, expected number of random hits]
    Expect Value E = number of database hits you expect to find by chance with score ≥ S
    More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Gapped Alignments
    Gapping provides more biologically realistic alignments
    Gapped BLAST parameters are simulated for each scoring matrix
    Affine gap costs = -(a+bk)
        a = gap open penalty
        b = gap extend penalty
    A gap of length 1 receives the score -(a+b)

An Alignment BLAST Cannot Make
    1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
        || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
    1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

    61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
        | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
    61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

    121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
        |||| || ||||| ||  ||    |  | ||||  || |||
    121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

    Reason: no contiguous exact match of 7 bp.

An Alignment BLAST Can Make
    Solution: compare protein sequences; BLASTX
    BLAST 2 Sequences (blastx) output:
    Score = 290 bits (741), Expect = 7e-77
    Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
    Frame = +3

Other BLAST Algorithms
    Megablast
    Discontiguous Megablast
    PSI-BLAST
    PHI-BLAST

Megablast: NCBI's Genome Annotator
    Long alignments of similar DNA sequences
    Greedy algorithm
    Concatenation of query sequences
    Faster than blastn; less sensitive
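The reason given for the alignment BLAST cannot make (no contiguous exact match of 7 bp) can be checked directly from the two aligned nucleotide sequences. The sketch below is a minimal illustration under stated assumptions, not BLAST's own seeding code: the helper names `longest_identical_run` and `can_seed` are hypothetical, and the check compares the sequences column by column exactly as they are aligned on the slide, whereas BLAST itself looks for word hits at every offset through its lookup table.

```python
# The two 161-bp sequences from "An Alignment BLAST Cannot Make",
# written in the same 60-column blocks as the slide.
SEQ_A = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
SEQ_B = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def longest_identical_run(a: str, b: str) -> int:
    """Length of the longest stretch of identical columns in an ungapped alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0  # extend the run on a match, reset on a mismatch
        best = max(best, run)
    return best


def can_seed(a: str, b: str, word_size: int) -> bool:
    """True if the alignment contains an exact match at least word_size long."""
    return longest_identical_run(a, b) >= word_size


print(longest_identical_run(SEQ_A, SEQ_B))    # 6
print(can_seed(SEQ_A, SEQ_B, word_size=7))    # False
```

The longest run of identical columns is 6, one short of a 7 bp word, so a contiguous-word search has nothing to extend from along this alignment even though the sequences are clearly related.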

MegaBLAST & Word Size
    Trade-off: sensitivity vs speed
    Too fast for you?

Discontiguous Megablast
    Uses discontiguous word matches
    Better for cross-species comparisons

Templates for Discontiguous Words
    W    t    type         template
    11   16   coding       1101101101101101
    11   16   non-coding   1110010110110111
    12   16   coding       1111101101101101
    12   16   non-coding   1110110110110111
    11   18   coding       101101100101101101
    11   18   non-coding   111010010110010111
    12   18   coding       101101101101101101
    12   18   non-coding   111010110010110111
    11   21   coding       100101100101100101101
    11   21   non-coding   111010010100010010111
    12   21   coding       100101101101100101101
    12   21   non-coding   111010010110010010111

    W = word size; # matches in template
    t = template length
    Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002; 18(3):440-5.

Discontiguous (Cross-species) MegaBLAST

Discontiguous Word Options

MegaBLAST vs Discontiguous MegaBLAST
    NM_017460
    Homo sapiens cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4), transcript variant 1, mRNA (2768
    letters), vs Drosophila

MegaBLAST vs Discontiguous MegaBLAST
    MegaBLAST = “No significant similarity found.”
    Discontiguous megaBLAST =

Another Example . . .
    Discontiguous megaBLAST = numerous hits . . .
    Query: NM_078651 Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp)
    /note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2
    MegaBLAST = “No significant similarity found.”
    Database: nr (nt), Mammalia[orgn]

Ex: Discontiguous MegaBLAST

Ex: BLASTN

PSI-BLAST
    Example: Confirming relationships of purine nucleotide metabolism proteins
    Position-specific Iterated BLAST

    gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
    MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
    VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
    EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
    RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
    VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

PSI-BLAST
    0.005: E value cutoff for PSSM

RESULTS: Initial BLASTP
    Same results as protein-protein BLAST; different format

Results of First PSSM Search
    Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Tenth PSSM Search: Convergence
    Just below threshold, another nucleotide metabolism enzyme

Reverse PSI-BLAST (RPS)-BLAST
    Adenosine/AMP Deaminase Domain
    . . .

PHI-BLAST
    gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
    MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE
    LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV
    IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI
    LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI
    ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
    Pattern: [GA]xxxxGK[ST]

Genome BLAST

Genome BLAST via Map Viewer

Example Search Pathways: Hemochromatosis
    Gene
    “hemochromatosis”
    HFE
    nucleotide sequence

Example: Human Genome BLAST

Human Genome BLAST: Results

Human Genome BLAST: MapViewer

What's New?

BLAST Databases
    Nucleotide
        refseq_rna = NM_*, XM_*
        refseq_genomic = NC_*, NG_*
        env_nt: environmental sample[filter], e.g., 16S rRNA
    Protein
        refseq = NP_*, XP_*
        env_nr

New Formatter
    Select lower case
    Select red
    gray line = same database hit
    hsps color-coded independently

BLAST Output: Alignments & Filter
    low complexity sequence filtered

Advanced Options
    Limit to Organism
        all[filter] NOT ma
    Example Entrez Queries
        all[Filter] NOT mammalia[Organism]
        ray finned fishes[Organism]
        srcdb refseq[Properties]
        Nucleotide only:
            biomol mrna[Properties]
            biomol genomic[Properties]
    Other Advanced
        -e 10000   expect value
        -v 2000    descriptions
        -b 2000    alignments
        -e 10000 -v 2000

Searching by Structure
    Why search for similar structures?
        Find homologs with low sequence similarity
        Explore protein evolution: similar protein folds can support different functions
        Identify conserved core elements to model related proteins of unknown structure

Indexing into MMDB
    Structure
    MMDB: Molecular Modeling Data Base
    Structure Summary
    Conserved Domains
    3D Domain Neighbors
    Structure Neighbors

3D Domains
    [Figure: domains labeled 1 to 4]

Conserved Domains
    SH3
    SH2

VAST: Alignment
    For each protein chain,
        locate SSEs (secondary structure elements),
        represent SSEs as individual vectors,
        align the vectors.
    [Figure: vectors labeled 1 to 6; Human IL-4; IL-4 & Leptin]

VAST
    Structure neighbors
    Taq DNA polymerase

VAST Results for the Chain
    Table view

VAST
    Vector Alignment Search Tool
    3D Domain structure neighbors

VAST Results for Domain 1
    Not found with Chain query!

Best way to convert PDB files to MMDB format for viewing with Cn3D!
    submit file to PDB

Example: Mapping Oligos Onto a Genome
    forward CCATGGCGACCCTGGAAAAGC
    reverse CAGCAGCGGCTGTGCCTGCGG

Map Oligos Onto Genome
    CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
    -W 7 -e 1000

Genome BLAST Results

Primer Alignments
    forward primer
    reverse primer

MapViewer
    Sequence View (sv)
    forward
    reverse

Service Addresses
    BLAST: blast-help@ncbi.nlm.nih.gov
    General Help: info@ncbi.nlm.nih.gov
    Wayne Matten: matten@ncbi.nlm.nih.gov
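
The combined query in the oligo-mapping example joins the forward and reverse oligos with a run of N's, so both can be placed on the genome in a single search. Below is a minimal sketch of building such a query. The function name `join_oligos` and the default spacer length of 10 are illustrative assumptions, not part of any NCBI tool.

```python
def join_oligos(forward: str, reverse: str, spacer_len: int = 10) -> str:
    """Join two oligos with a run of N's to form one nucleotide query."""
    if spacer_len < 0:
        raise ValueError("spacer_len must not be negative")
    for name, seq in (("forward", forward), ("reverse", reverse)):
        # Accept only unambiguous bases plus N; reject empty input.
        if not seq or set(seq.upper()) - set("ACGTN"):
            raise ValueError(f"{name} oligo must be a non-empty ACGTN string")
    return forward.upper() + "N" * spacer_len + reverse.upper()


# Oligos from the example above.
FORWARD = "CCATGGCGACCCTGGAAAAGC"
REVERSE = "CAGCAGCGGCTGTGCCTGCGG"

query = join_oligos(FORWARD, REVERSE)
print(query)
print(len(query))  # 52
```

Short oligos need a small word size and a permissive expect threshold to produce hits at all, which is what the word size of 7 and expect value of 1000 in the example provide.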
