ImageVerifierCode 换一换
格式:PPT , 页数:71 ,大小:1.26MB ,
资源ID:373181      下载积分:2000 积分
快捷下载
登录下载
邮箱/手机:
温馨提示:
如需开发票,请勿充值!快捷下载时,用户名和密码都是您填写的邮箱或者手机号,方便查询和重复下载(系统自动生成)。
如填写123,账号就是123,密码也是123。
特别说明:
请自助下载,系统不会自动发送文件的哦; 如果您已付费,想二次下载,请登录后访问:我的下载记录
支付方式: 支付宝扫码支付 微信扫码支付   
注意:如需开发票,请勿充值!
验证码:   换一换

加入VIP,免费下载
 

温馨提示:由于个人手机设置不同,如果发现不能下载,请复制以下地址【http://www.mydoc123.com/d-373181.html】到电脑端继续下载(重复下载不扣费)。

已注册用户请登录:
账号:
密码:
验证码:   换一换
  忘记密码?
三方登录: 微信登录  

下载须知

1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。
2: 试题试卷类文档,如果标题没有明确说明有答案则都视为没有答案,请知晓。
3: 文件的所有权益归上传用户所有。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 本站仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

版权提示 | 免责声明

本文(A Revealing Introduction to Hidden Markov Models.ppt)为本站会员(livefirmly316)主动上传,麦多课文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知麦多课文库(发送邮件至master@mydoc123.com或直接QQ联系客服),我们立即给予删除!

A Revealing Introduction to Hidden Markov Models.ppt

1、A Revealing Introduction to Hidden Markov Models,Mark Stamp,1,HMM,Hidden Markov Models,What is a hidden Markov model (HMM)? A machine learning technique A discrete hill climb technique Where are HMMs used? Speech recognition Malware detection, IDS, etc., etc. Why is it useful? Efficient algorithms,H

2、MM,2,Markov Chain,Markov chain is a “memoryless random process” Transitions depend only on current state and transition probabilities matrix Example on next slide,HMM,3,Markov Chain,We are interested in average annual temperature Only consider Hot and Cold From recorded history, we obtain probabilit

3、ies See diagram to the right,HMM,4,H,C,0.7,0.6,0.3,0.4,Markov Chain,Transition probability matrixMatrix is denoted as ANote, A is “row stochastic”,HMM,5,H,C,0.7,0.6,0.3,0.4,Markov Chain,Can also include begin, end states Begin state matrix is In this example,Note that is row stochastic,HMM,6,H,C,0.7

4、,0.6,0.3,0.4,begin,end,0.6,0.4,Hidden Markov Model,HMM includes a Markov chain But this Markov process is “hidden” Cannot observe the Markov process Instead, we observe something related to hidden states Its as if there is a “curtain” between Markov chain and observations Example on next slide,HMM,7

5、,HMM Example,Consider H/C temperature example Suppose we want to know H or C temperature in distant past Before humans (or thermometers) invented OK if we can just decide Hot versus Cold We assume transition between Hot and Cold years is same as today That is, the A matrix is same as today,HMM,8,HMM

6、 Example,Temp in past determined by Markov process But, we cannot observe temperature in past Instead, we note that tree ring size is related to temperature Look at historical data to see the connection We consider 3 tree ring sizes Small, Medium, Large (S, M, L, respectively) Measure tree ring size

7、s and recorded temperatures to determine relationship,HMM,9,HMM Example,We find that tree ring sizes and temperature related byThis is known as the B matrix:Note that B also row stochastic,HMM,10,HMM Example,Can we now find temps in distant past? We cannot measure (observe) temp But we can measure t

8、ree ring sizes and tree ring sizes related to temp By the B matrix So, we ought to be able to say something about temperature,HMM,11,HMM Notation,A lot of notation is required Notation may be the most difficult part,HMM,12,HMM Notation,To simplify notation, observations are taken from the set 0,1,M-

9、1 That is, The matrix A = aij is N x N, whereThe matrix B = bj(k) is N x M, where,HMM,13,HMM Example,Consider our temperature example What are the observations? V = 0,1,2, which corresponds to S,M,L What are states of Markov process? Q = H,C What are A,B, , and T? A,B, on previous slides T is number

10、 of tree rings measured What are N and M? N = 2 and M = 3,HMM,14,Generic HMM,Generic view of HMMHMM defined by A,B, and We denote HMM “model” as = (A,B,),HMM,15,HMM Example,Suppose that we observe tree ring sizes For 4 year period of interest: S,M,S,L Then = (0, 1, 0, 2) Most likely (hidden) state s

11、equence? We want most likely X = (x0, x1, x2, x3) Let x0 be prob. of starting in state x0 Note prob. of initial observation And ax0,x1 is prob. of transition x0 to x1 And so on,HMM,16,HMM Example,Bottom line? We can compute P(X) for any X For X = (x0, x1, x2, x3) we haveSuppose we observe (0,1,0,2),

12、 then what is probability of, say, HHCC? Plug into formula above to find,HMM,17,HMM Example,Do same for all 4-state sequences We find The winner is? CCCH Not so fast my friend,HMM,18,HMM Example,The path CCCH scores the highest In dynamic programming (DP), we find highest scoring path But, HMM maxim

13、izes expected number of correct states Sometimes called “EM algorithm” For “Expectation Maximization” How does HMM work in this example?,HMM,19,HMM Example,For first position Sum probabilities for all paths that have H in 1st position, compare to sum of probs for paths with C in 1st position - bigge

14、st wins Repeat for each position and we find:,HMM,20,HMM Example,So, HMM solution gives us CHCH While dynamic program solution is CCCH Which solution is better? Neither! Why is that? Different definitions of “best”,HMM,21,HMM Paradox?,HMM maximizes expected number of correct states Whereas DP choose

15、s “best” overall path Possible for HMM to choose “path” that is impossible Could be a transition probability of 0 Cannot get impossible path with DP Is this a flaw with HMM? No, its a feature,HMM,22,The Three Problems,HMMs used to solve 3 problems Problem 1: Given a model = (A,B,) and observation se

16、quence O, find P(O|) That is, we score an observation sequence to see how well it fits the given model Problem 2: Given = (A,B,) and O, find an optimal state sequence Uncover hidden part (as in previous example) Problem 3: Given O, N, and M, find the model that maximizes probability of O That is, tr

17、ain a model to fit the observations,HMM,23,HMMs in Practice,Typically, HMMs used as follows Given an observation sequence Assume a hidden Markov process exists Train a model based on observations Problem 3 (determine N by trial and error) Then given a sequence of observations, score it vs model from

18、 previous step Problem 1 (high score implies its similar to training data),HMM,24,HMMs in Practice,Previous slide gives sense in which HMM is a “machine learning” technique We do not need to specify anything except the parameter N And “best” N found by trial and error That is, we dont have to think

19、too much Just train HMM and then use it Best of all, efficient algorithms for HMMs,HMM,25,The Three Solutions,We give detailed solutions to the three problems Note: We must have efficient solutions Recall the three problems: Problem 1: Score an observation sequence versus a given model Problem 2: Gi

20、ven a model, “uncover” hidden part Problem 3: Given an observation sequence, train a model,HMM,26,Solution 1,Score observations versus a given model Given model = (A,B,) and observation sequence O=(O0,O1,OT-1), find P(O|) Denote hidden states as X = (x0, x1, . . . , xT-1) Then from definition of B,P

21、(O|X,)=bx0(O0) bx1(O1) bxT-1(OT-1) And from definition of A and ,P(X|)=x0 ax0,x1 ax1,x2 axT-2,xT-1,HMM,27,Solution 1,Elementary conditional probability fact:P(O,X|) = P(O|X,) P(X|) Sum over all possible state sequences X,P(O|) = P(O,X|) = P(O|X,) P(X|)= x0bx0(O0)ax0,x1bx1(O1)axT-2,xT-1bxT-1(OT-1) Th

22、is “works” but way too costly Requires about 2TNT multiplications Why? There better be a better way,HMM,28,Forward Algorithm,Instead of brute force: forward algorithm Or “alpha pass” For t = 0,1,T-1 and i=0,1,N-1, lett(i) = P(O0,O1,Ot,xt=qi|) Probability of “partial sum” to t, and Markov process is

23、in state qi at step t What the? Can be computed recursively, efficiently,HMM,29,Forward Algorithm,Let 0(i) = ibi(O0) for i = 0,1,N-1 For t = 1,2,T-1 and i=0,1,N-1, lett(i) = (t-1(j)aji)bi(Ot) Where the sum is from j = 0 to N-1 From definition of t(i) we seeP(O|) = T-1(i) Where the sum is from i = 0

24、to N-1 Note this requires only N2T multiplications,HMM,30,Solution 2,Given a model, find “most likely” hidden states: Given = (A,B,) and O, find an optimal state sequence Recall that optimal means “maximize expected number of correct states” In contrast, DP finds best scoring path For temp/tree ring

25、 example, solved this But hopelessly inefficient approach A better way: backward algorithm Or “beta pass”,HMM,31,Backward Algorithm,For t = 0,1,T-1 and i=0,1,N-1, lett(i) = P(Ot+1,Ot+2,OT-1|xt=qi,) Probability of partial sum from t to end and Markov process in state qi at step t Analogous to the for

26、ward algorithm As with forward algorithm, this can be computed recursively and efficiently,HMM,32,Backward Algorithm,Let T-1(i) = 1 for i = 0,1,N-1 For t = T-2,T-3, ,1 and i=0,1,N-1, lett(i) = ai,jbj(Ot+1)t+1(j) Where the sum is from j = 0 to N-1,HMM,33,Solution 2,For t = 1,2,T-1 and i=0,1,N-1 defin

27、et(i) = P(xt=qi|O,) Most likely state at t is qi that maximizes t(i) Note that t(i) = t(i)t(i)/P(O|) And recall P(O|) = T-1(i) The bottom line? Forward algorithm solves Problem 1 Forward/backward algorithms solve Problem 2,HMM,34,Solution 3,Train a model: Given O, N, and M, find that maximizes proba

28、bility of O Here, we iteratively adjust = (A,B,) to better fit the given observations O The size of matrices are fixed (N and M) But elements of matrices can change It is amazing that this works! And even more amazing that its efficient,HMM,35,Solution 3,For t=0,1,T-2 and i,j in 0,1,N-1, define “di-

29、gammas” ast(i,j) = P(xt=qi, xt+1=qj|O,) Note t(i,j) is prob of being in state qi at time t and transiting to state qj at t+1 Then t(i,j) = t(i)aijbj(Ot+1)t+1(j)/P(O|) And t(i) = t(i,j) Where sum is from j = 0 to N 1,HMM,36,Model Re-estimation,Given di-gammas and gammas For i = 0,1,N-1 let i = 0(i) F

30、or i = 0,1,N-1 and j = 0,1,N-1 aij = t(i,j)/t(i) Where both sums are from t = 0 to T-2 For j = 0,1,N-1 and k = 0,1,M-1 bj(k) = t(j)/t(j) Both sums from from t = 0 to T-2 but only t for which Ot = k are counted in numerator Why does this work?,HMM,37,Solution 3,To summarize Initialize = (A,B,) Comput

31、e t(i), t(i), t(i,j), t(i) Re-estimate the model = (A,B,) If P(O|) increases, goto 2,HMM,38,Solution 3,Some fine points Model initialization If we have a good guess for = (A,B,) then we can use it for initialization If not, let i 1/N, ai,j 1/N, bj(k) 1/M Subject to row stochastic conditions Note: Do

32、 not initialize to uniform values Stopping conditions Stop after some number of iterations Stop if increase in P(O|) is “small”,HMM,39,HMM as Discrete Hill Climb,Algorithm on previous slides shows that HMM is a “discrete hill climb” HMM consists of discrete parameters Specifically, the elements of t

33、he matrices And re-estimation process improves model by modifying parameters So, process “climbs” toward improved model This happens in a high-dimensional space,HMM,40,Dynamic Programming,Brief detour For = (A,B,) as above, its easy to define a dynamic program (DP) Executive summary: DP is forward a

34、lgorithm, with “sum” replaced by “max” Precise details on next slides,HMM,41,Dynamic Programming,Let 0(i) = i bi(O0) for i=0,1,N-1 For t=1,2,T-1 and i=0,1,N-1 computet(i) = max (t-1(j)aji)bi(Ot) Where the max is over j in 0,1,N-1 Note that at each t, the DP computes best path for each state, up to t

35、hat point So, probability of best path is max T-1(j) This max only gives best probability Not the best path, for that, see next slide,HMM,42,Dynamic Programming,To determine optimal path While computing optimal path, keep track of pointers to previous state When finished, construct optimal path by t

36、racing back points For example, consider temp example Probabilities for path of length 1:These are the only “paths” of length 1,HMM,43,Dynamic Programming,Probabilities for each path of length 2Best path of length 2 ending with H is CH Best path of length 2 ending with C is CC,HMM,44,Dynamic Program

37、,Continuing, we compute best path ending at H and C at each step And save pointers - why?,HMM,45,Dynamic Program,Best final score is .002822 And, thanks to pointers, best path is CCCH But what about underflow? A serious problem in bigger cases,HMM,46,Underflow Resistant DP,Common trick to prevent un

38、derflow Instead of multiplying probabilities we add logarithms of probabilities Why does this work? Because log(xy) = log x + log y And adding logs does not tend to 0 Note that we must avoid 0 probabilities,HMM,47,Underflow Resistant DP,Underflow resistant DP algorithm: Let 0(i) = log(i bi(O0) for i

39、=0,1,N-1 For t=1,2,T-1 and i=0,1,N-1 computet(i) = max (t-1(j) + log(aji) + log(bi(Ot) Where the max is over j in 0,1,N-1 And score of best path is max T-1(j) As before, must also keep track of paths,HMM,48,HMM Scaling,Trickier to prevent underflow in HMM We consider solution 3 Since it includes sol

40、utions 1 and 2 Recall for t = 1,2,T-1, i=0,1,N-1,t(i) = (t-1(j)aj,i)bi(Ot) The idea is to normalize alphas so that they sum to one Algorithm on next slide,HMM,49,HMM Scaling,Given t(i) = (t-1(j)aj,i)bi(Ot) Let a0(i) = 0(i) for i=0,1,N-1 Let c0 = 1/a0(j) For i = 0,1,N-1, let a0(i) = c0a0(i) This take

41、s care of t = 0 case Algorithm continued on next slide,HMM,50,HMM Scaling,For t = 1,2,T-1 do the following:For i = 0,1,N-1, at(i) = (at-1(j)aj,i)bi(Ot) Let ct = 1/at(j) For i = 0,1,N-1 let at(i) = ctat(i),HMM,51,HMM Scaling,Easy to show at(i) = c0c1ct t(i) () Simple proof by induction So, c0c1ct is

42、scaling factor at step t Also, easy to show thatat(i) = t(i)/t(j) Which implies aT-1(i) = 1 (),HMM,52,HMM Scaling,By combining () and (), we have1 = aT-1(i) = c0c1cT-1 T-1(i)= c0c1cT-1 P(O|) Therefore, P(O|) = 1 / c0c1cT-1 To avoid underflow, we computelog P(O|) = - log(cj) Where sum is from j = 0 t

43、o T-1,HMM,53,HMM Scaling,Similarly, scale betas as ctt(i) For re-estimation, Compute t(i,j) and t(i) using original formulas, but with scaled alphas and betas This gives us new values for = (A,B,) “Easy exercise” to show re-estimate is exact when scaled alphas and betas used Also, P(O|) cancels from

44、 formula Use log P(O|) = - log(cj) to decide if iterate improves,HMM,54,All Together Now,Complete pseudo code for Solution 3 Given: (O0,O1,OT-1) and N and M Initialize: = (A,B,) A is NxN, B is NxM and is 1xN i 1/N, aij 1/N, bj(k) 1/M, each matrix row stochastic, but not uniform Initialize: maxIters

45、= max number of re-estimation steps iters = 0 oldLogProb = -,HMM,55,Forward Algorithm,Forward algorithm With scaling,HMM,56,Backward Algorithm,Backward algorithm or “beta pass” With scaling Note: same scaling factor as alphas,HMM,57,Gammas,Here, use scaled alphas and betas So formulas unchanged,HMM,

46、58,Re-Estimation,Again, using scaled gammas So formulas unchanged,HMM,59,Stopping Criteria,Check that probability increases In practice, want logProb oldLogProb + And dont exceed max iterations,HMM,60,English Text Example,Suppose Martian arrives on earth Sees written English text Wants to learn some

47、thing about it Martians know about HMMs So, strip our all non-letters, make all letters lower-case 27 symbols (letters, plus word-space) Train HMM on long sequence of symbols,HMM,61,English Text,For first training case, initialize: N = 2 and M = 27 Elements of A and are about each Elements of B are

48、each about 1/27 We use 50,000 symbols for training After 1st iter: log P(O|) -165097 After 100th iter: log P(O|) -137305,HMM,62,English Text,Matrices A and converge:What does this tells us? Started in hidden state 1 (not state 0) And we know transition probabilities between hidden states Nothing too

49、 interesting here We dont care about hidden states,HMM,63,English Text,What about B matrix? This much more interesting Why?,HMM,64,A Security Application,Suppose we want to detect metamorphic computer viruses Such viruses vary their internal structure But function of malware stays same If sufficiently variable, standard signature detection will fail Can we use HMM for detection? What to use as observation sequence? Is there really a “hidden” Markov process? What about N, M, and T? How many Os needed for training, scoring?,

copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1