生物信息学实验.ppt

上传人:medalangle361 文档编号:389368 上传时间:2018-10-14 格式:PPT 页数:55 大小:1.83MB
下载 相关 举报
生物信息学实验.ppt_第1页
第1页 / 共55页
生物信息学实验.ppt_第2页
第2页 / 共55页
生物信息学实验.ppt_第3页
第3页 / 共55页
生物信息学实验.ppt_第4页
第4页 / 共55页
生物信息学实验.ppt_第5页
第5页 / 共55页
亲,该文档总共55页,到这儿已超出免费预览范围,如果喜欢就下载吧!
资源描述

1、2018/10/14,1,生物信息学实验,实验2 隐马尔科夫模型,上海交通大学 生命科学技术学院 生物信息学与生物统计学系,2018/10/14,2,生物学中常用的统计模型,Structured probability models Markov models Hidden markov modelsArtificial Neural Network (A.N.N),2018/10/14,3,Introduction,Hidden Markov Models (HMMs) 最早是在上个世纪60年代末70年代初提出来的。 进入80年代以后,逐渐被利用在各个领域。,2018/10/14,4,Int

2、roduction,Hidden Markov Models 作为一种强有力的统计学模型,主要被应用在一些连续行的或时间延续性的事件建模上 语音识别系统。 生物学中的DNA/protein序列的分析 机器人的控制。 文本文件的信息提取。,2018/10/14,5,HMM的优点,1,它的数学结构非常丰富,适用于各个领域的研究。 2,在很多领域中,已经证明它的结果和实际符合的相当好。,2018/10/14,6,Probability Review,2018/10/14,7,独立事件概率,设想我们做一连串的实验,而每次实验所可能发生的结果定为 E1,E2, En,。(可能是有限也可能是无限)。每一个

3、结果 Ek,如果给定一个出现的可能性 pk(即概率),则某一特定样本之序列 Ej1 Ej2 Ejn出现的概率为 p(Ej1 Ej2 Ejn) =pj1 Pjn。,2018/10/14,8,马尔科夫链,一般及常用的统计中,彼此相互独立大概是最有用的一个观念。用简单的术语來说,互相独立就是彼此毫不相干,一点牵涉都沒有。 但是实际生活中很多事件是相互关联的 不是互相独立也就是相互关联的意思,但是要怎样相关呢?如何在相关中作一些简单的分类呢?马尔科夫链就是要描述在相关这个概念中最简单的一种。但即使如此,有关马可夫链的理论已经相当丰富了。在概率理论中,它几乎占了绝大的部分。,2018/10/14,9,马

4、尔科夫链,在马尔科夫链中考虑最简单的相关性。在在这种情况下,我们不能给任一个事件 Ej 一個概率 pj 但我们给一对事件 (Ej,Ek) 一個概率 pjk,这个时候 pjk 的解释是一种条件概率,就是假设在某次实验中 Ej 已经出现,而在下一次实验中 Ek 出现的概率。除了 pjk 之外,还需要知道第一次实验中 Ej 出現的機率 aj。有了这些资料后,一個样本序列 Ej0 Ej1 Ejn(也就是说第零次实验结果是Ej0,第一次一次是 Ej1第 n 次实验是 Ejn)的概率就很清楚的是 P(Ej0,Ej1,Ejn) =aj pj0j1 pj1j2 pjn-1jn。,2018/10/14,10,隐

5、马尔科夫模型,但是在大多数情况下我们所观察到的值并不是序列本身的元素。 即观察值不等于状态值。 故我们引入隐马尔科夫模型。,2018/10/14,11,定义,一个HMM 是一个五元组:(X , O, A, B, ) 其中:X = q1,.qN:状态的有限集合O = v1,.,vM:观察值的有限集合A = aij,aij = p(Xt+1 = qj |Xt = qi):转移概率B = bik,bik = p(Ot = vk | Xt = qi):输出概率 = i, i = p(X1 = qi):初始状态分布,2018/10/14,12,假设,对于一个随机事件,有一个观察值序列:O1,.,OT 该

6、事件隐含着一个状态序列:X1,.,XT 假设1:马尔可夫假设(状态构成一阶马尔可夫链) p(Xi|Xi-1X1) = p(Xi|Xi-1) 假设2:不动性假设(状态与具体时间无关)p(Xi+1|Xi) = p(Xj+1|Xj),对任意i,j成立 假设3:输出独立性假设(输出仅与当前状态有关) p(O1,.,OT | X1,.,XT) = p(Ot | Xt),2018/10/14,13,马尔科夫链 Vs 隐马尔科夫模型,Markov chains have entirely observable states. However a “Hidden Markov Model” is a mode

7、l of a Markov Source which admits an element each time slot depending upon the state. The states are not directly observed,2018/10/14,14,Problems,令 = A,B, 为给定HMM的参数, 令 = O1,.,OT 为观察值序列, 隐马尔可夫模型(HMM)的三个基本问题: 评估问题:对于给定模型,求某个观察值序列的概率p(|) ;forward algorithm 解码问题:对于给定模型和观察值序列,求可能性最大的状态序列;viterbi algorith

8、m 学习问题:对于给定的一个观察值序列,调整参数,使得观察值出现的概率p(|)最大。Forward-backward algorithm,2018/10/14,15,Solutions,Evaluation problem:forward algorithm 定义向前变量 采用动态规划算法,复杂度O(N2T) Decoding problem:Viterbi algorithm 采用动态规划算法,复杂度O(N2T) Learning problem:forward-backward algorithm EM算法的一个特例,带隐变量的最大似然估计,2018/10/14,16,Struct HMM

9、,typedef struct /* number of states; Q=1,2,.,N */ int N; /* number of observation symbols; V=1,2,.,M*/int M; /* A1N1N. aij is the transition prob of going from state i * at time t to state j at time t+1 */ double *A;/* B1N1M. bjk is the probability of observing symbol k in state j */ double *B; /* p

10、i1N pii is the initial state distribution. */double *pi; HMM;,2018/10/14,17,算法:向前算法(1),算法:向前算法(2),定义前向变量为HMM在时间t输出序列O1Ot,并且位于状态Si的概率:,2018/10/14,19,算法:向前算法(3),迭代公式为:,结果为:,2018/10/14,20,Forward algorithm,算法:向后算法(1),2018/10/14,22,算法:Viterbi算法(1),The Viterbi algorithm is a dynamic programming algorithm

11、 that computes the most likely state transition path given an observed sequence of symbols. It is actually very similar to the forward algorithm。,2018/10/14,23,Viterbi algorithm,2018/10/14,24,Viterbi in c,/* 1. Initialization */ for (i = 1; i N; i+) delta1i = phmm-pii * (phmm-BiO1); psi1i = 0; /* 2.

12、 Recursion */ for (t = 2; t N; j+) maxval = 0.0; maxvalind = 1; for (i = 1; i N; i+) val = deltat-1i*(phmm-Aij); if (val maxval) maxval = val; maxvalind = i; deltatj = maxval*(phmm-BjOt); psitj = maxvalind; ,2018/10/14,25,生物学中的数学模型,2018/10/14,26,马氏链,2018/10/14,27,马氏链,2018/10/14,28,马氏链,2018/10/14,29,

13、隐马可夫模型,2018/10/14,30,隐马可夫模型,2018/10/14,31,隐马可夫模型 profile,2018/10/14,32,Related software,HMMER http:/hmmer.wustl.edu/ SAM(Sequence Alignment and Modeling System) http:/www.soe.ucsc.edu/ HMMpro A windows version for HMMThe Division of Biomedical Informatics at Cincinnati Childrens Hospital Medical Cen

14、ter metaMEME: A motif based Hidden Markov Model,2018/10/14,33,HMMER,Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence familys consensus. HMMER is a freely distributable implementation of profile HMM software for pr

15、otein sequence analysis. The current version is HMMER 2.3.2 (3 Oct 2003), containing minor bugfixes and updates for the May 2003 release of HMMER 2.3.,2018/10/14,34,HMMER,2018/10/14,35,How to create a HMM,2018/10/14,36,Example: 1. Sequence selection,选取相关的序列,2018/10/14,37,2.Alignment,Save result as m

16、sf format,多序列比对,2018/10/14,38,模型建立,3.Hmmbuild4.Hmmt5.Hmmcalibrate,模型建立,用相关序列对模型进行训练,参数调整,2018/10/14,39,模型文件(1),HMMER2.0 2.3.2 NAME globins50 LENG 162 ALPH Amino RF no CS no MAP yes COM ./hmmbuild globins.hmm globins50.msf NSEQ 50 DATE Thu Sep 18 00:02:14 2008 CKSUM 4694 XT -8455 -4 -1000 -1000 -8455

17、 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644,模型文件(2),模型部分: HMM A C D E F G H I K L M N P Q R S T V W Ym-m m-i m-d i-m i-i d-m d-d b-m m-e-222 * -28071 -1412 -1712 -339 -321 -1729 113 -1457 261 -1493 -1591 1181 -1737 -32 -13

18、59 -1788 77 -1353 2620 -2119 -1697 4- -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249- -1909 -8804 -451 -894 -1115 -701 -1378 -110 *2 -1118 -1371 -1805 -1237 -1464 -2231 -889 2528 2067 -899 -510 -1267 -2325 -644 -266 -1422 -1057 -63 -1884 -1486 5- -149 -500 233

19、43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249- -18 -6914 -7956 -894 -1115 -3550 -129 * * ,2018/10/14,41,6.未知序列的搜索查询,Hmmsearch: search a sequence against the profile HMM 未知查询序列Artemia.fa Profile HMM:Globin.hmm Command: hmmsearch globin.hmm Artemia.fa,查询程序,查询的未知序列文件,所用模型,查询命令

20、,2018/10/14,42,查询结果,结果分为2个部分 1:说明部分(数据说明、选项、模型说明) 2: 结果序列部分,2018/10/14,43,Result1,第一部分:相关信息说明,软件信息:版本、权限等,HMM文件名称,查询的阈值等,HMM文件的一些描述信息,2018/10/14,44,Result 2.1,HIT序列分值,E值,domain数目,HIT domains分值、位置、E值等信息,2018/10/14,45,Result 2.2,高分匹配序列比对,2018/10/14,46,Result 2.3,所有序列HIT分值、E值的图形分布,2018/10/14,47,Result

21、2.4,结果统计数据,2018/10/14,48,Application of HMM:pfam,2018/10/14,49,Application of HMM,TMHMM:Prediction of transmembrane helices in proteins http:/www.cbs.dtu.dk/services/TMHMM/,2018/10/14,50,PFAM,Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is

22、 available on the World Wide Web in the UK at http:/www.sanger.ac.uk/Software/Pfam/, in Sweden at http:/www.cgb.ki.se/Pfam/, in France at http:/pfam.jouy.inra.fr/ and in the US at http:/pfam.wustl.edu/.,2018/10/14,51,Pfam Introduction,Pfam is a database of protein domain families. Pfam contains cura

23、ted multiple sequence alignments for each family, as well as profile hidden Markov models (profile HMMs) for finding these domains in new sequences. Pfam contains functional annotation, literature references and database links for each family.,2018/10/14,52,Pfam Introduction,Version 14.0, June 2004,

24、 7459 families 22336 unique Pfam-A domain architectures Two big families Pfam-A : A high-quality manual part of Pfam. Pfam-B : Low-quality automatically generated alignments of sequence clusters in SWISSPROT and TrEMBL that are not modelled in the curated part of Pfam.,2018/10/14,53,Pfam Introductio

25、n,There are two multiple alignments for each Pfam family, the seed alignment that contains a relatively small number of representative members of the family and the full alignment that contains all members in the database that can be detected. All alignments use sequences taken from pfamseq, which i

26、s a non-redundant protein set composed of SWISS-PROT and SP-TrEMBL. The profile HMM is built from the seed alignment using the HMMER package, which is then used to search the pfamseq sequence database,2018/10/14,54,Pfam Goals,One of the main goals of Pfam was to aid the annotation of the Caenorhabdi

27、tis elegans genome . Traditional approaches to large scale sequence annotation use a pairwise sequence comparison method such as BLAST to find similarity to proteins of known function. Annotations are then transferred from the protein of known function to the predicted protein. The pairwise similari

28、ty search does not give a clear indication of the domain structure of the proteins. Mistakes in annotation can result from not considering the domain organisation of proteins . For example a protein may be misannotated as an enzyme when the similarity is only to a regulatory domain. Since its incept

29、ion, Pfam has been developed to provide broad support for automated protein sequence classification and annotation. During the last year there have been significant changes and extensions to Pfam, which further this role.,2018/10/14,55,蛋白二级结构分析,The PredictProtein server http:/blast.ym.edu.tw/tools/predictprotein/predictprotein.html,

展开阅读全文
相关资源
猜你喜欢
相关搜索
资源标签

当前位置:首页 > 教学课件 > 大学教育

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1