A Comparison of Algorithms for Species Identification based .ppt

Uploaded by: 孙刚 | Document ID: 377811 | Upload date: 2018-10-09 | Format: PPT | Pages: 16 | Size: 343.50 KB

A Comparison of Algorithms for Species Identification based on DNA barcodes

Bogdan Paşaniuc
CSE Department, University of Connecticut
Joint work with Alexander Gusev, Sotirios Kentros, James Lindsay and Ion Măndoiu

Introduction
- Several methods have been proposed for assigning specimens to species:
  TaxI (Steinke et al. 05), likelihood ratio test (Matz & Nielsen 06), BOLD-IDS (Ratnasingham & Hebert 07)
- No direct comparisons on standardized benchmarks
- This work:
  - Direct comparison of methods from three main classes: distance-based, tree-based, and statistical model-based
  - Explores the effect of repository size: #barcodes/species, #species

Species identification problem
- Given a repository containing barcodes from known species and a new barcode, find its species

Datasets
- Fishes of Australia Container Part (Ward et al. 05): 754 barcodes, 211 species, 113 genera
- Cowries (Meyer and Paulay 05): 2036 barcodes, 263 species, 46 genera
- Birds of North America - Phase II (Kerr K.C.R. et al. 07): 2589 barcodes, 656 species, 289 genera
- Bats of Guyana (Clare E.L. et al. 06): 840 barcodes, 96 species, 50 genera
- Hesperiidae of the ACG 1 (Hajibabaei M. et al. 05): 4267 barcodes, 561 species, 207 genera
- 90% of the barcodes used for training and 10% for testing

Distance-based methods
- Barcode assigned to the closest species
- Two variants: Minimum/Maximum or Average
- Hamming distance (MIN-HD, AVG-HD)
  - Percent of sequence divergence
- Amino acid similarity (MAX-AA-SIM, AVG-AA-SIM)
  - BLOSUM62 matrix used to score similarity
- Convex score similarity (MAX-CS-SIM)
  - Higher score given to longer consecutive runs of matches
- Tri-nucleotide frequency distance (MIN-3FREQ)
  - Euclidean distance between vectors of frequencies
- Combined method (COMB)
  - Assignment made using majority rule

Distance-based methods

Tree-based methods
- Exemplar NJ (Meyer & Paulay 05)
  - One exemplar per species (chosen at random)
  - One neighbor joining tree for exemplars + unknown barcodes
- Profile NJ (Muller et al. 04)
  - Distance between profiles
  - Neighbor joining tree for the species profiles
- Phylogenetic Traversal
  - Construct NJ tree from training profiles
  - Traverse down the tree (from the root)
  - Choose the least distant branch
- Substitution models: UNC, JK, K2P, TN

Tree-based methods

Statistical model-based
- Likelihood ratio test for species membership using MCMC (Matz & Nielsen 06)
  - Impractical runtime even for a moderate #species
- Scalable models explored: position weight matrices, Markov chains, hidden Markov models
  - Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families

Positional Weight Matrix (PWM)
- Assumption: independence of loci
- P(x|SP) = P(x_1|SP) * P(x_2|SP) * ... * P(x_n|SP)
- For each locus i, P(x_i|SP) is estimated as the frequency of nucleotide x_i at that locus among the DB sequences from species SP
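As a rough illustration of the PWM model described above, the sketch below estimates per-locus nucleotide frequencies for each species and assigns a barcode to the species with the highest log-likelihood. It is not the authors' implementation: the function names, the `repository` dictionary layout, and the add-one pseudocount (used here only to avoid zero probabilities) are assumptions, and barcodes are assumed to be aligned and of equal length.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_pwm(sequences, pseudocount=1.0):
    """Estimate P(base at locus i | species) from aligned, equal-length barcodes.

    The pseudocount is an assumption of this sketch, not part of the slides.
    """
    length = len(sequences[0])
    pwm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def log_likelihood(barcode, pwm):
    """log P(x|SP) under locus independence: the sum of per-locus log-probabilities.

    Characters outside ACGT (for example N or gaps) are skipped.
    """
    return sum(math.log(col[base]) for base, col in zip(barcode, pwm) if base in col)

def classify(barcode, repository):
    """Return the species whose PWM gives the barcode the highest log-likelihood."""
    models = {sp: train_pwm(seqs) for sp, seqs in repository.items()}
    return max(models, key=lambda sp: log_likelihood(barcode, models[sp]))

# Toy repository with hypothetical species names and 4-base "barcodes".
toy_repository = {"sp_a": ["ACGT", "ACGA"], "sp_b": ["TTGT", "TTGA"]}
print(classify("ACGT", toy_repository))
```

Working in log space avoids numerical underflow when the product runs over several hundred loci, as it would for real barcodes.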

Inhomogeneous Markov Chain (IMC)
- Takes into account dependencies between consecutive loci
- [Figure: chain diagram with a start state followed by one state per base (A, C, T, G) at each of locus 1, locus 2, locus 3, locus 4]

Hidden Markov Model (HMM)
- Same structure as the IMC
- Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate
- Barcode x is generated along path p with probability equal to the product of the emission and transition probabilities along p
- P(x|HMM) = sum of these probabilities over all paths
  - Efficiently computed by the forward algorithm

Probabilistic model-based methods
- HMM not scalable: genus-level identification

Comparison of representative methods

Effect of #barcodes/species
- BOLD species with at least 25 barcodes (270 species, 17197 barcodes)
- Randomly picked 5-20 barcodes from all species
- All remaining barcodes used in testing

Effect of #species
- BOLD species with at least 10 barcodes (690 species, 23558 barcodes)
- Randomly picked 100 to 690 species (10 barcodes per species)
- All remaining barcodes from the picked species used in testing

Conclusions & Ongoing work
- Presented an initial comparison of a broad range of species assignment methods
- Ongoing work explores further effects:
  - New species detection
  - Barcode length/quality
  - Runtime scalability (up to millions of species)
  - More datasets
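The distance-based rules from the earlier slide (MIN-HD, AVG-HD, MIN-3FREQ) can be sketched as follows. This is an illustrative sketch rather than the code behind the reported comparison: the function names and the `repository` layout are hypothetical, barcodes are assumed to be aligned and of equal length for the Hamming variants, and counting tri-nucleotides over overlapping windows is an assumption the slides do not spell out.

```python
import math
from collections import Counter
from itertools import product

def hamming_fraction(a, b):
    """Fraction of mismatching positions between two aligned, equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def assign_min_hd(barcode, repository):
    """MIN-HD: pick the species containing the single closest barcode."""
    return min(
        repository,
        key=lambda sp: min(hamming_fraction(barcode, s) for s in repository[sp]),
    )

def assign_avg_hd(barcode, repository):
    """AVG-HD: pick the species with the smallest average distance to the barcode."""
    return min(
        repository,
        key=lambda sp: sum(hamming_fraction(barcode, s) for s in repository[sp])
        / len(repository[sp]),
    )

# All 64 tri-nucleotides in a fixed order, so frequency vectors are comparable.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def trimer_frequencies(seq):
    """Frequency vector of overlapping tri-nucleotides (overlap is an assumption)."""
    windows = max(len(seq) - 2, 1)
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts[t] / windows for t in TRIMERS]

def assign_min_3freq(barcode, repository):
    """MIN-3FREQ: smallest Euclidean distance between tri-nucleotide frequency vectors."""
    query = trimer_frequencies(barcode)
    return min(
        repository,
        key=lambda sp: min(
            math.dist(query, trimer_frequencies(s)) for s in repository[sp]
        ),
    )
```

A combined rule in the spirit of COMB could take a majority vote over the labels returned by several such functions; how ties are broken is not stated in the slides.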

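The IMC and HMM slides can be illustrated in the same spirit. The sketch below trains locus-specific transition probabilities for one species, scores a barcode under the chain, and then scores it under an HMM with the same structure using the forward algorithm. Several choices are assumptions of this sketch, not statements from the slides: the function names, the pseudocount, input restricted to the characters A, C, G, T, and the reading of the emission rule as "each of the three other bases is emitted with probability `mutation_rate`, the associated base with the remainder".

```python
import math

BASES = "ACGT"

def train_imc(sequences, pseudocount=1.0):
    """Estimate initial[b] = P(x[0] = b) and transitions[i][a][b] = P(x[i+1] = b | x[i] = a)."""
    n = len(sequences[0])
    first = {b: pseudocount for b in BASES}
    for s in sequences:
        first[s[0]] += 1
    total = sum(first.values())
    initial = {b: c / total for b, c in first.items()}
    transitions = []
    for i in range(n - 1):
        counts = {a: {b: pseudocount for b in BASES} for a in BASES}
        for s in sequences:
            counts[s[i]][s[i + 1]] += 1
        transitions.append(
            {a: {b: c / sum(row.values()) for b, c in row.items()}
             for a, row in counts.items()}
        )
    return initial, transitions

def imc_log_likelihood(x, model):
    """log P(x | IMC): initial probability times locus-specific transitions."""
    initial, transitions = model
    ll = math.log(initial[x[0]])
    for i, t in enumerate(transitions):
        ll += math.log(t[x[i]][x[i + 1]])
    return ll

def hmm_log_likelihood(x, model, mutation_rate=0.01):
    """log P(x | HMM) via the forward algorithm, rescaled at each locus."""
    initial, transitions = model

    def emit(state, base):
        # Assumed emission rule: see the lead-in paragraph.
        return 1 - 3 * mutation_rate if state == base else mutation_rate

    forward = {s: initial[s] * emit(s, x[0]) for s in BASES}
    log_scale = 0.0
    for i, t in enumerate(transitions):
        forward = {
            s: emit(s, x[i + 1]) * sum(forward[p] * t[p][s] for p in BASES)
            for s in BASES
        }
        norm = sum(forward.values())
        forward = {s: v / norm for s, v in forward.items()}
        log_scale += math.log(norm)
    return log_scale + math.log(sum(forward.values()))

def classify(barcode, models, score=imc_log_likelihood):
    """Return the species whose model scores the barcode highest."""
    return max(models, key=lambda sp: score(barcode, models[sp]))
```

With `mutation_rate=0.0` every state can only emit its own base, so only one path has non-zero probability and the HMM score reduces to the IMC score, which is a convenient sanity check. The forward pass costs a constant amount of work per locus here because there are only four states per locus.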