1、An introduction to Bayesian Networks and the Bayes Net Toolbox for Matlab,Kevin Murphy MIT AI Lab 19 May 2003,Outline,An introduction to Bayesian networks An overview of BNT,Qualitative part: Directed acyclic graph (DAG) Nodes - random vars. Edges - direct influence,Quantitative part: Set of conditi
2、onal probability distributions,Compact representation of joint probability distributions via conditional independence,Together: Define a unique distribution in a factored form,What is a Bayes (belief) net?,Figure from N. Friedman,What is a Bayes net?,C ? R,B,E | A,A node is conditionally independent
3、 of its ancestors given its parents, e.g.,Hence,From 25 1 = 31 parameters to 1+1+2+4+2=10,Why are Bayes nets useful?,- Graph structure supports - Modular representation of knowledge - Local, distributed algorithms for inference and learning - Intuitive (possibly causal) interpretation,- Factored rep
4、resentation may have exponentially fewer parameters than full joint P(X1,Xn) =lower sample complexity (less data for learning)lower time complexity (less time for inference),What can Bayes nets be used for?,Posterior probabilities Probability of any event given any evidence Most likely explanation S
5、cenario that explains evidence Rational decision making Maximize expected utility Value of Information Effect of intervention Causal analysis,Radio,Call,Figure from N. Friedman,Explaining away effect,Domain: Monitoring Intensive-Care Patients 37 variables 509 parametersinstead of 237,A real Bayes ne
6、t: Alarm,Figure from N. Friedman,More real-world BN applications,“Microsofts competitive advantage lies in its expertise in Bayesian networks” - Bill Gates, quoted in LA Times, 1996 MS Answer Wizards, (printer) troubleshooters Medical diagnosis Genetic pedigree analysis Speech recognition (HMMs) Gen
7、e sequence/expression analysis Turbocodes (channel coding),Dealing with time,In many systems, data arrives sequentially Dynamic Bayes nets (DBNs) can be used to model such time-series (sequence) data Special cases of DBNs include State-space models Hidden Markov models (HMMs),State-space model (SSM)
8、/ Linear Dynamical System (LDS),“True” state,Noisy observations,Example: LDS for 2D tracking,Sparse linear Gaussian systems ) sparse graphs,Hidden Markov model (HMM),Phones/ words,acoustic signal,transition matrix,Gaussian observations,Probabilistic graphical models,Probabilistic models,Directed,Und
9、irected,Graphical models,Alarm network State-space models HMMs Nave Bayes classifier PCA/ ICA,Markov Random Field Boltzmann machine Ising model Max-ent model Log-linear models,(Bayesian belief nets),(Markov nets),Toy example of a Markov net,X1,X2,X5,X3,X4,e.g, X1 ? X4, X5 | X2, X3,Xi ? Xrest | Xnbrs
10、,Potential functions,Partition function,A real Markov net,Estimate P(x1, , xn | y1, , yn)Y(xi, yi) = P(observe yi | xi): local evidenceY(xi, xj) / exp(-J(xi, xj): compatibility matrix c.f., Ising/Potts model,Observed pixels,Latent causes,Inference,Posterior probabilities Probability of any event giv
11、en any evidence Most likely explanation Scenario that explains evidence Rational decision making Maximize expected utility Value of Information Effect of intervention Causal analysis,Radio,Call,Figure from N. Friedman,Explaining away effect,Kalman filtering (recursive state estimation in an LDS),Est
12、imate P(Xt|y1:t) from P(Xt-1|y1:t-1) and yt Predict: P(Xt|y1:t-1) = sXt-1 P(Xt|Xt-1) P(Xt-1|y1:t-1) Update: P(Xt|y1:t) / P(yt|Xt) P(Xt|y1:t-1),Forwards algorithm for HMMs,Predict:,Update:,Message passing view of forwards algorithm,Forwards-backwards algorithm,Discrete analog of RTS smoother,Belief P
13、ropagation,Figure from P. Green,Generalization of forwards-backwards algo. /RTS smoother from chains to trees - linear time, two-pass algorithm,aka Pearls algorithm, sum-product algorithm,BP: parallel, distributed version,Stage 1.,Stage 2.,Representing potentials,For discrete variables, potentials c
14、an be represented as multi-dimensional arrays (vectors for single node potentials) For jointly Gaussian variables, we can use y(X) = (m, S) or y(X) = (S-1 m ,S-1) In general, we can use mixtures of Gaussians or non-parametric forms,Manipulating discrete potentials,Marginalization,Multiplication,80%
15、of time is spent manipulating such multi-dimensional arrays!,Manipulating Gaussian potentials,Closed-form formulae for marginalization and multiplication O(1)/O(n3) complexity per operation Mixtures of Gaussian potentials are not closed under marginalization, so need approximations (moment matching)
16、,Semi-rings,By redefining * and +, same code implements Kalman filter and forwards algorithm By replacing + with max, can convert from forwards (sum-product) to Viterbi algorithm (max-product) BP works on any commutative semi-ring!,Inference in general graphs,BP is only guaranteed to be correct for
17、trees A general graph should be converted to a junction tree, by clustering nodes Computationally complexity is exponential in size of the resulting clusters (NP-hard),Approximate inference,Why? to avoid exponential complexity of exact inference in discrete loopy graphs Because cannot compute messag
18、es in closed form (even for trees) in the non-linear/non-Gaussian case How? Deterministic approximations: loopy BP, mean field, structured variational, etc Stochastic approximations: MCMC (Gibbs sampling), likelihood weighting, particle filtering, etc,- Algorithms make different speed/accuracy trade
19、offs,- Should provide the user with a choice of algorithms,Learning,Parameter estimation Model selection (structure learning),Parameter learning,Figure from M. Jordan,Conditional Probability Tables (CPTs),iid data,If some values are missing (latent variables), we must use gradient descent or EM to c
20、ompute the (locally) maximum likelihood estimates,Structure learning (data mining),Gene expression data,Figure from N. Friedman,Genetic pathway,Structure learning,Learning the optimal structure is NP-hard (except for trees) Hence use heuristic search through space of DAGs or PDAGs or node orderings
21、Search algorithms: hill climbing, simulated annealing, GAs Scoring function is often marginal likelihood, or anapproximation like BIC/MDL or AIC,Structural complexity penalty,Summary: why are graphical models useful?,- Factored representation may have exponentially fewer parameters than full joint P
22、(X1,Xn) = lower time complexity (less time for inference) lower sample complexity (less data for learning)- Graph structure supports Modular representation of knowledge Local, distributed algorithms for inference and learning Intuitive (possibly causal) interpretation,The Bayes Net Toolbox for Matla
23、b,What is BNT? Why yet another BN toolbox? Why Matlab? An overview of BNTs design How to use BNT Other GM projects,What is BNT?,BNT is an open-source collection of matlab functions for inference and learning of (directed) graphical models Started in Summer 1997 (DEC CRL), development continued while
24、 at UCB Over 100,000 hits and about 30,000 downloads since May 2000 About 43,000 lines of code (of which 8,000 are comments),Why yet another BN toolbox?,In 1997, there were very few BN programs, and all failed to satisfy the following desiderata: Must support real-valued (vector) data Must support l
25、earning (params and struct) Must support time series Must support exact and approximate inference Must separate API from UI Must support MRFs as well as BNs Must be possible to add new models and algorithms Preferably free Preferably open-source Preferably easy to read/ modify Preferably fast,BNT me
26、ets all these criteria except for the last,A comparison of GM software,www.ai.mit.edu/murphyk/Software/Bayes/bnsoft.html,Summary of existing GM software,8 commercial products (Analytica, BayesiaLab, Bayesware, Business Navigator, Ergo, Hugin, MIM, Netica), focused on data mining and decision support
27、; most have free “student” versions 30 academic programs, of which 20 have source code (mostly Java, some C+/ Lisp) Most focus on exact inference in discrete, static, directed graphs (notable exceptions: BUGS and VIBES) Many have nice GUIs and database support,BNT contains more features than most of
28、 these packages combined!,Why Matlab?,Pros Excellent interactive development environment Excellent numerical algorithms (e.g., SVD) Excellent data visualization Many other toolboxes, e.g., netlab Code is high-level and easy to read (e.g., Kalman filter in 5 lines of code) Matlab is the lingua franca
29、 of engineers and NIPS Cons: Slow Commercial license is expensive Poor support for complex data structures Other languages I would consider in hindsight: Lush, R, Ocaml, Numpy, Lisp, Java,BNTs class structure,Models bnet, mnet, DBN, factor graph, influence (decision) diagram CPDs Gaussian, tabular,
30、softmax, etc Potentials discrete, Gaussian, mixed Inference engines Exact - junction tree, variable elimination Approximate - (loopy) belief propagation, sampling Learning engines Parameters EM, (conjugate gradient) Structure - MCMC over graphs, K2,Example: mixture of experts,softmax/logistic functi
31、on,1. Making the graph,X = 1; Q = 2; Y = 3; dag = zeros(3,3); dag(X, Q Y) = 1; dag(Q, Y) = 1;,Graphs are (sparse) adjacency matrices GUI would be useful for creating complex graphs Repetitive graph structure (e.g., chains, grids) is best created using a script (as above),2. Making the model,node_siz
32、es = 1 2 1; dnodes = 2; bnet = mk_bnet(dag, node_sizes, discrete, dnodes);,X is always observed input, hence only one effective value Q is a hidden binary node Y is a hidden scalar node bnet is a struct, but should be an object mk_bnet has many optional arguments, passed as string/value pairs,3. Spe
33、cifying the parameters,bnet.CPDX = root_CPD(bnet, X); bnet.CPDQ = softmax_CPD(bnet, Q); bnet.CPDY = gaussian_CPD(bnet, Y);,CPDs are objects which support various methods such as Convert_from_CPD_to_potential Maximize_params_given_expected_suff_stats Each CPD is created with random parameters Each CP
34、D constructor has many optional arguments,4. Training the model,load data ascii; ncases = size(data, 1); cases = cell(3, ncases); observed = X Y; cases(observed, :) = num2cell(data);,Training data is stored in cell arrays (slow!), to allow for variable-sized nodes and missing values casesi,t = value
35、 of node i in case t,engine = jtree_inf_engine(bnet, observed);,Any inference engine could be used for this trivial model,bnet2 = learn_params_em(engine, cases);,We use EM since the Q nodes are hidden during training learn_params_em is a function, but should be an object,X,Y,Q,Before training,After
36、training,5. Inference/ prediction,engine = jtree_inf_engine(bnet2); evidence = cell(1,3); evidenceX = 0.68; % Q and Y are hidden engine = enter_evidence(engine, evidence); m = marginal_nodes(engine, Y); m.mu % EY|X m.Sigma % CovY|X,Other kinds of models that BNT supports,Classification/ regression:
37、linear regression, logistic regression, cluster weighted regression, hierarchical mixtures of experts, nave Bayes Dimensionality reduction: probabilistic PCA, factor analysis, probabilistic ICA Density estimation: mixtures of Gaussians State-space models: LDS, switching LDS, tree-structured AR model
38、s HMM variants: input-output HMM, factorial HMM, coupled HMM, DBNs Probabilistic expert systems: QMR, Alarm, etc. Limited-memory influence diagrams (LIMID) Undirected graphical models (MRFs),Summary of BNT,Provides many different kinds of models/ CPDs lego brick philosophy Provides many inference al
39、gorithms, with different speed/ accuracy/ generality tradeoffs (to be chosen by user) Provides several learning algorithms (parameters and structure) Source code is easy to read and extend,What is wrong with BNT?,It is slow It has little support for undirected models Models are not bona fide objects
40、 Learning engines are not objects It does not support online inference/learning It does not support Bayesian estimation It has no GUI It has no file parser It is more complex than necessary,Some alternatives to BNT?,HUGIN: commercial Junction tree inference only, no support for DBNs PNL: Probabilist
41、ic Networks Library (Intel) Open-source C+, based on BNT, work in progress (due 12/03) GMTk: Graphical Models toolkit (Bilmes, Zweig/ UW) Open source C+, designed for ASR (HTK), binary avail now AutoBayes: code generator (Fischer, Buntine/NASA Ames) Prolog generates matlab/C, not avail. to public VI
42、BES: variational inference (Winn / Bishop, U. Cambridge) conjugate exponential models, work in progress BUGS: (Spiegelhalter et al., MRC UK) Gibbs sampling for Bayesian DAGs, binary avail. since 96,Why yet another GM toolbox?,In 2003, there are still very few GM programs that satisfy the following d
43、esiderata: Must support real-valued (vector) data Must support learning (params and struct) Must support time series Must support exact and approximate inference Must separate API from UI Must support MRFs as well as BNs Must be possible to add new models and algorithms Preferably free Preferably open-source Must be easy to read/ modify Must be fast (smarter algorithms, not better coding!) Must be integrated with data analysis environment,
copyright@ 2008-2019 麦多课文库(www.mydoc123.com)网站版权所有
备案/许可证编号:苏ICP备17064731号-1