ImageVerifierCode 换一换
格式:PPT , 页数:43 ,大小:630KB ,
资源ID:378973      下载积分:2000 积分
快捷下载
登录下载
邮箱/手机:
温馨提示:
如需开发票,请勿充值!快捷下载时,用户名和密码都是您填写的邮箱或者手机号,方便查询和重复下载(系统自动生成)。
如填写123,账号就是123,密码也是123。
特别说明:
请自助下载,系统不会自动发送文件的哦; 如果您已付费,想二次下载,请登录后访问:我的下载记录
支付方式: 支付宝扫码支付 微信扫码支付   
注意:如需开发票,请勿充值!
验证码:   换一换

加入VIP,免费下载
 

温馨提示:由于个人手机设置不同,如果发现不能下载,请复制以下地址【http://www.mydoc123.com/d-378973.html】到电脑端继续下载(重复下载不扣费)。

已注册用户请登录:
账号:
密码:
验证码:   换一换
  忘记密码?
三方登录: 微信登录  

下载须知

1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。
2: 试题试卷类文档,如果标题没有明确说明有答案则都视为没有答案,请知晓。
3: 文件的所有权益归上传用户所有。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 本站仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

版权提示 | 免责声明

本文(Bioinformatics Algorithms and Data Structures.ppt)为本站会员(bowdiet140)主动上传,麦多课文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知麦多课文库(发送邮件至master@mydoc123.com或直接QQ联系客服),我们立即给予删除!

Bioinformatics Algorithms and Data Structures.ppt

1、Bioinformatics Algorithms and Data Structures,Chapter 1: Exact String Matching Instructor: Dr. Rose January 14, 2003,Exact String Matching,Types of Abstract Problems: Pattern matching, i.e., find pattern P in string S Similarity comparison, i.e., what is the longest common substring in S1 and S2? Ca

2、n we find P P in S? We can think of P as a mutation of P. What are the regions of similarity in S1 and S2? We can also do this with mutations,Exact String Matching,Q: What is the underlying theme common to these abstract problems? A: Correlation, i.e., correlation between two signals, strings, etc.,

3、Exact String Matching,Q: What is the simplest way to compare two strings? A: Look for a mapping of one string into the other.,Exact String Matching,Given two strings S1 and S2, Where length(S1) = length(S2) : Start at the beginning of S1 and S2 Compare corresponding characters, i.e., S11 & S21, S12

4、& S22, etc Continue until either: All the characters in S1 have been matched or A mismatch is found,Exact String Matching,If there is a mismatch shift 1 character position along S2 and start over, e.g., compare S11& S22, S12& S23, etc Keep doing this until a match is found or the possible starting p

5、ositions in S2 are exhausted.,Exact String Matching,Example: S1 = adab, S2=abaracadabaraa b a r a c a d a b a r a 1: a d d != a 2: _ a a != b 3: _ a d d != r 4: _a a != r 5: _a d d != c 6: _ a a != c 7: _ a d a b Finally! Q: How many comparisons? A: 13, looks ok?,Exact String Matching,Example 2: S1

6、= aaaaab, length(S1) = 6 S2 = aaaaaaaaaaab, length(S2) = 12a a a a a a a a a a a b 1: a a a a a b b != a 2: _ a a a a a b b != a 3: _a a a a a b b != a 4: _ a a a a a b b != a 6: _a a a a a b b != a 7: _ a a a a a b b != a,Exact String Matching,Example 2 continued from previous slidea a a a a a a a

7、a a a b 8: _ a a a a a b Finally!Q: How many comparisons were made? A: 42 = 7 X 6 = (12 6 + 1) X 6= (N M + 1) X M Where length(S2) = N and length(S1) = M Q: Where did this come from? A: There are N M + 1 possible match positions in S2,Exact String Matching,Bottom line, the time complexity is Q(NM) O

8、bservation: Notice that many of the characters in S2 are involved in multiple comparisons. WHY? A: Because the nave approach doesnt learn from previous matches. By the time the first mismatch occurs, we know what the first 6 characters of S1 and S2 are.,Exact String Matching,Note: A smarter approach

9、 would not involve the first 6 characters of S2 in subsequent comparisons. Fast matching algorithms take advantage of this insight. Q: Where does this insight come from? A: Preprocessing either S1 or S2.,Exact String Matching,Insight: if a match fails1) dont make redundant comparisons2) skip to the

10、first next possible match position. Note: the next possible match position may not be 1 character shift away. Lets consider both of these ideas with respect to examples 1 and 2,Exact String Matching,Lets review example 2: a a a a a a a a a a a b S2 1: a a a a a b S1 b != a, we have seen the first 6

11、characters 2: _a a a a a b b != a, we already know the as match,we only need to try to match the b 3: _a a a a a b b != a, ditto 4: _ a a a a a b b != a, ditto 6: _a a a a a b b != a, ditto 7: _ a a a a a b b != a, ditto 8: _a a a a a b Finally! The number of comparisons is 12 instead of the previou

12、s 42,Exact String Matching,Lets review example 1: S1 = adab, S2=abaracadabaraa b a r a c a d a b a r a 1:a d d != b, we have seen the first 2 charactersThe next possible match must be at least two positions away 2: _ a d d != r, we have seen the first 4 chars of S2The next possible match must be at

13、least two positions away 3: _ a d d != c, we have seen the first 6 chars of S2The next possible match must be at least two positions away 4: _ a d a b Finally! Q: How many comparisons? A: 10. The previous approach took 13 comparisons,Preprocessing a String,Core IdeaFor each position i1 in string S,

14、identify the maximal substring that matches a prefix of S. Q: Why do we want to do this? A: We will use this information in two ways:1) This tells us how far to skip for the next possible match. (Recall example 1)2) Knowledge of prefix matches allows us to avoid redundant comparisons (Recall example

15、 2) Do we need to go back and review examples 1 and 2?,Preprocessing a String,Let M(Si) denote the maximal substring that matches a prefix of S at position i1 Example: S = aabcaabxaaz (from textbook) M(S2) = a M(S3) = M(S4) = M(S5) = aab,Preprocessing a String,Let Z(Si) denote the length of the maxi

16、mal substring M(Si) starting from position i1 that matches a prefix of S Example: S = aabcaabxaaz (from textbook) Z(S2) = 1, since M(S2) = a Z(S3) = 0, since M(S3) = Z(S4) = 0, since M(S4) = Z(S5) = 3, since M(S5) = aab,Preprocessing a String,Consider the figure above, depicting string S and two max

17、imal substrings a and b from positions j and k, respectively that match prefixes of S.Zj is the length of a, and Zk is the length of b. Gusfield refers to these boxes as Z-boxes.,Preprocessing a String,Lets look at a concrete instance of this abstraction,Preprocessing a String,For all i1, ri denotes

18、 the right-most endpoint of the Z-boxes containing i. Note that while i is in both a and b, the rightmost endpoint of these Z-boxes is the endpoint of a.,Preprocessing a String,Lets compare the abstract depiction with our concrete example.,Preprocessing a String,li is the left end of the Z-box endin

19、g at ri.,Preprocessing a String,Again, compare the abstract with the concrete.,Preprocessing a String,We will now consider how to find the Z-boxes in linear time, O(|S|).We can use this to find exact matches in linear time.,Preprocessing a String,We start by computing Z2, explicitly comparing charac

20、ters S1&S2, etc.If Z2 0, then let r = r2 and l = l2= 2, o/w let r = 0 and l = 0.,Preprocessing a String,Iterate to compute all subsequent Zk.When Zk is computed, all previous Zi, 1 i = k-1 are already known.,The Z Algorithm,If k r, then k is not in any Z-box that has yet been found.We must find Zk b

21、y comparing characters starting at position k with characters starting at position 1 in S.,The Z Algorithm,If k = r, then k is contained in a previously found Z-box, say a.,Then the substring b from k to r matches a substring of a from k to Zl.,The Z Algorithm,Here is a concrete example where k = r.

22、 We see that k is contained in a previously found Z-box a.,The substring b from k to r matches a substring of a from k to Zl.,The Z Algorithm,We need to check if the value of Zk is nonzero. Why? If Zk is nonzero, then there is a prefix of S starting k. This means that k must also be the start of a p

23、refix of S.,The Z Algorithm,Here is a concrete example where the value of Zk is zero. The substring starting at k is not a prefix of S, nor is the substring at k.,The Z Algorithm,If Zk is nonzero, how long is the prefix starting at k?Minimally, it is at least as long as the smaller of Zk and |b|.Of

24、course it may be longer.,The Z Algorithm,The prefix starting at k is at least the smaller of Zk and |b|.Case 2a: If Zk |b|, then its length is exactly Zk as depicted in the figure below.,In this case, r and l remain unchanged.,The Z Algorithm,Here is a concrete example where Zk |b|. In this case, 3

25、6.The length of the prefix starting at k is exactly Zk , i.e., 3.In this case, r and l remain unchanged.,The Z Algorithm,Case 2b: If Zk = |b|, then b is a prefix of S as depicted in the figure below.,It could be the case that Zk |b|. This can only be determined by extending the match past r.,The Z A

26、lgorithm,Here is a concrete example where Zk = |b|, i.e., 3 = 3. We can see that b is a prefix of S.,The Z Algorithm,Here is a concrete example where Zk |b|. We can see that b is a prefix of S and so is this longer substring starting at k.Only by extending the match past r are we able to distinguish

27、 between Zk = |b| and Zk |b|.,The Z Algorithm,In extending the match past r, say a mismatch occurs at q, q = r + 1. Set Zk = q k, r = q 1, and l = k as shown in the figure below.,The Z Algorithm,Using our concrete example: In extending the match past r, a mismatch occurs at q, q = r + 2.,Set Zk = q

28、k, r = q 1, and l = k.,The Z Algorithm,Continue to iterate through the entire string. Computing subsequent Zk will entail only the cases we discussed: Case 1: k r, k is not in a known Z-box. Find Zk by explicitly matching with the start of S. Set r & l accordingly. Case 2a: Zk = |b|, b is a prefix o

29、f S. Try to extend the match. Set l = k and r = q 1, where q is the position of the first mismatch.,The Z Algorithm,Theorem 1.4.1 Using algorithm Z, value Zk is correctly computed and variables r and l are correctly updated. Proof on page 9 of text.Theorem 1.4.2 All the Zk(S) values are computed by

30、algorithm Z in O(|S|), i.e., linear time. Proof on page 10 of text.,A Simple Linear-Time Exact String Matching Algorithm,We can use algorithm Z by itself as a simple linear-time string matching algorithm.Let S = P$T where: T is the target string, |T| = m P is the pattern string, |P| = n, n = m $ is

31、character not appearing in either P or T.Apply algorithm Z to S.,A Simple Linear-Time Exact String Matching Algorithm,Since $ does not appear in P or T, no prefix of S can be longer than n, i.e., |P|.We only need to consider Zi(S) for i in T, i.e., i n + 1Any value of i, such that i n + 1, where Zi(S) = n, indicates a match of P at position i (n+1) in T.All Zi(S) are computed in O(m+n) = O(m),

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1