Bayesian models of inductive learning
Tom Griffiths (UC Berkeley), Josh Tenenbaum (MIT), Charles Kemp (CMU)

Outline
Morning
- 9:00-10:30: Introduction: Why Bayes?; Basics of Bayesian inference (Josh)
- 11:00-12:30: How to build a Bayesian cognitive model (Tom)
Afternoon
- 1:30-3:00: Hierarchical Bayesian models and learning structured representations (Charles)
- 3:30-5:00: Monte Carlo methods and nonparametric Bayesian models (Tom)

What you will get out of this tutorial
- Our view of what Bayesian models have to offer cognitive science
- In-depth examples of basic and advanced models: how the math works & what it
  buys you
- A sense for how to go about the process of building Bayesian models
- Some (not extensive) comparison to other approaches
- Opportunities to ask questions

The big question
How does the mind get so much out of so little?
Our minds build rich models of the world and make strong generalizations from input data that is sparse, noisy, and ambiguous, and in many ways far too limited to support the inferences we make. How do we do it?

Learning words for objects
[Figure slides; images not recoverable]

The big question
How does the mind get so much out of so little?
- Perceiving the world from sense data
- Learning about kinds of objects and their properties
- Inferring causal relations
- Learning and using words, phrases, and sentences
- Learning and using intuitive theories of physics, psychology, biology, etc.
- Learning social structures, conventions, and rules
The goal: a general-purpose computational framework for understanding how people make these inferences, and how they can be successful.

The problem of induction
Abstract knowledge (constraints / inductive bias / priors).

The problems of induction
1. How does abstract knowledge guide inductive learning, inference, and decision-making from sparse, noisy, or ambiguous data?
2. What is the form and content of our abstract knowledge of the world?
3. What are the origins of our abstract knowledge? To what extent can it be acquired from experience?
4. How do our mental models grow over a lifetime, balancing simplicity versus data fit (Occam), accommodation versus assimilation (Piaget)?
5. How can learning and inference proceed efficiently and accurately, even in the presence of complex hypothesis spaces?

A toolkit for reverse-engineering induction
- Bayesian inference in probabilistic generative models
- Probabilities defined on a range of structured representations: spaces, graphs, grammars, predicate logic, schemas, programs
- Hierarchical probabilistic models, with inference at all levels of abstraction
- Models of unbounded complexity (“nonparametric Bayes” or “infinite models”), which can grow in complexity or change form as observed data dictate
- Approximate methods of learning and inference, such as belief propagation, expectation-maximization (EM), Markov chain Monte Carlo (MCMC), and sequential Monte Carlo (particle filtering)

[Figure: grammar G, phrase structure S, and utterance U, linked by P(S | G) and P(U | S); figure labels: bottom-up, top-down]
$$P(S \mid U, G) \propto P(U \mid S) \times P(S \mid G)$$

[Figure: levels labeled phrase structure, utterance, speech signal, grammar, “Universal Grammar”; hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)]

Vision as probabilistic parsing (Han and Zhu, 2006)
[Figure not recoverable]

Learning word meanings
[Figure: levels labeled principles, structure, data]
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias

Causal learning and reasoning
[Figure: levels labeled principles, structure, data]

Goal-directed action (production and comprehension) (Wolpert et al., 2003)

Why Bayesian models of cognition?
- A framework for understanding how the mind can solve fundamental problems of induction.
- Strong, principled quantitative
  models of human cognition.
- Tools for studying people's implicit knowledge of the world.
- Beyond classic limiting dichotomies: “rules vs. statistics”, “nature vs. nurture”, “domain-general vs. domain-specific”.
- A unifying mathematical language for all of the cognitive sciences: AI, machine learning and statistics, psychology, neuroscience, philosophy, linguistics.
- A bridge between engineering and “reverse-engineering”.
Why now? Much recent progress in computational resources, theoretical tools, and interdisciplinary connections.

Outline
Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian
  inference (Josh)
- How to build a Bayesian cognitive model (Tom)
Afternoon
- Hierarchical Bayesian models and learning structured representations (Charles)
- Monte Carlo methods and nonparametric Bayesian models (Tom)

Bayesian inference
Bayes' rule, for any hypothesis h and data d:
$$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\, P(h')}$$
The denominator sums over the space $\mathcal{H}$ of alternative hypotheses.

Bayes' rule: An example
Data: John is coughing
Some hypotheses:
1. John has a cold
2. John has lung cancer
3. John has a stomach flu
- Prior P(h) favors 1 and 3 over 2
- Likelihood P(d|h) favors 1 and 2 over 3
- Posterior P(h|d) favors 1 over 2 and 3

Plan for this lecture
- Some basic aspects of Bayesian statistics
  - Comparing two hypotheses
  - Model fitting
  - Model selection
- Two (very brief) case studies in modeling human inductive learning
  - Causal learning
  - Concept learning

Coin flipping
- Basic Bayes: data = HHTHT or HHHHH; compare two hypotheses: P(H) = 0.5 vs. P(H) = 1.0
- Parameter estimation (model fitting):
  compare many hypotheses in a parameterized family, P(H) = θ: infer θ
- Model selection: compare qualitatively different hypotheses, often varying in complexity: P(H) = 0.5 vs. P(H) = θ

Coin flipping
HHTHT
HHHHH
What process produced these sequences?

Comparing two hypotheses
Contrast simple hypotheses:
- h1: “fair coin”, P(H) = 0.5
- h2: “always heads”, P(H) = 1.0
Bayes' rule: with two hypotheses, use the odds form:
$$\frac{P(h_1 \mid D)}{P(h_2 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_2)} \times \frac{P(h_1)}{P(h_2)}$$

Comparing two hypotheses (h1: “fair coin”, h2: “always heads”)
- D = HHTHT: $P(D \mid h_1) = 1/2^5$, $P(h_1) = ?$; $P(D \mid h_2) = 0$, $P(h_2) = 1 - ?$
- D = HHTHT: $P(D \mid h_1) = 1/2^5$, $P(h_1) = 999/1000$; $P(D \mid h_2) = 0$, $P(h_2) = 1/1000$
- D = HHHHH: $P(D \mid h_1) = 1/2^5$, $P(h_1) = 999/1000$; $P(D \mid h_2) = 1$, $P(h_2) = 1/1000$
- D = HHHHHHHHHH: $P(D \mid h_1) = 1/2^{10}$, $P(h_1) = 999/1000$; $P(D \mid h_2) = 1$, $P(h_2) = 1/1000$

Measuring prior knowledge
1. The fact that HHHHH looks like a “mere coincidence”, without making us suspicious that the coin is unfair, while HHHHHHHHHH does begin to make us suspicious, measures the strength of our prior belief that the coin is fair.
   - If θ is the threshold for suspicion in the posterior odds, and D* is the shortest suspicious sequence, the prior odds for a fair coin are roughly θ / P(D* | “fair coin”).
   - If θ ≈ 1 and D* is between 10 and 20 heads, the prior odds in favor of a fair coin are roughly between 1,000 and 1,000,000 to 1 (equivalently, the prior odds of a trick coin are roughly between 1/1,000 and 1/1,000,000).
2. The fact that HHTHT looks representative of a fair coin, and HHHHH
   does not, reflects our prior knowledge about possible causal mechanisms in the world.
   - Easy to imagine how a trick all-heads coin could work: low (but not negligible) prior probability.
   - Hard to imagine how a trick “HHTHT” coin could work: extremely low (negligible) prior probability.
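
The two-hypothesis comparison (“fair coin” vs. “always heads”, with prior probabilities 999/1000 and 1/1000) can be checked numerically. The following is a minimal Python sketch, not code from the tutorial; the function name `posterior_fair` and its arguments are hypothetical, and it assumes each flip is written as the character H or T.

```python
from fractions import Fraction

def posterior_fair(data, prior_fair=Fraction(999, 1000)):
    """Posterior probability of "fair coin" versus "always heads" for a flip string."""
    prior_trick = 1 - prior_fair
    # Fair coin: every flip has probability 1/2, whatever the outcome.
    like_fair = Fraction(1, 2) ** len(data)
    # Always-heads coin: likelihood 1 if every flip is H, otherwise 0.
    like_trick = Fraction(int(set(data) <= {"H"}))
    # Bayes' rule with a two-hypothesis denominator.
    evidence = like_fair * prior_fair + like_trick * prior_trick
    return like_fair * prior_fair / evidence

for flips in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(flips, float(posterior_fair(flips)))
```

Under these assumptions a single tail rules out the trick coin entirely, five heads leave the fair coin strongly favored (999/1031, about 0.97), and ten heads bring the posterior probability of a fair coin just below one half (999/2023), which matches the intuition that ten heads in a row starts to look suspicious.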

Model fitting (parameter estimation)
Assume data are generated from a parameterized model. What is the value of θ?
- each value of θ is a hypothesis H
- requires inference over infinitely many hypotheses
[Figure: graphical model in which θ generates the data d1, d2, d3, d4, with P(H) = θ]

Model selection
Assume a hypothesis space of possible models. Which model generated the data?
- requires summing out hidden variables
- requires some form of Occam's razor to trade off complexity with fit to the data
[Figure: three candidate models of the data d1, d2, d3, d4: fair coin, P(H) = 0.5; P(H) = θ; hidden Markov model with states s_i ∈ {fair coin, trick coin}]

Parameter estimation vs. model selection across learning and development
- Causality: learning the strength of a relation vs. learning the existence and form of a relation
- Perception: learning the strength of a cue vs. learning the existence of a cue, in sensory cue combination
- Language acquisition: learning a speaker's accent, or frequencies of different words vs. learning a new tense or syntactic rule (or learning a new language,
  or the existence of different languages)
- Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)
- Intuitive physics: learning the mass of an object vs. learning about the existence of a force (e.g., gravity, magnetism)

A hierarchical learning framework
[Figure, built up over two slides: levels labeled model, data, model class, parameterized model]

Bayesian parameter estimation
Assume data are generated from a model. What is the value of θ?
- each value of θ is a hypothesis H
- requires inference over infinitely many hypotheses
[Figure: graphical model in which θ generates the data d1, d2, d3, d4, with P(H) = θ]

Some intuitions
- D = 10 flips, with 5 heads and 5 tails. θ = P(H) on next flip? 50%
- Why? 50% = 5 / (5+5) = 5/10.
- Why? “The future will be like the past”
- Suppose we had seen 4 heads and 6 tails. P(H) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.

Integrating prior knowledge and data
- Posterior distribution P(θ | D) is a probability density over θ = P(H)
- Need to specify likelihood P(D | θ) and prior distribution P(θ).

Likelihood and prior
Likelihood: Bernoulli distribution
$$P(D \mid \theta) = \theta^{N_H} (1-\theta)^{N_T}$$
N_H: number of heads; N_T: number of tails
Prior: P(θ) ∝ ?

Some intuitions
- D = 10 flips, with 5 heads and 5 tails. θ = P(H) on next flip? 50%
- Why? 50% = 5 / (5+5) = 5/10.
- Why? Maximum likelihood.
- Suppose we had seen 4 heads and 6 tails. P(H) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.

A simple method of specifying priors
- Imagine some fictitious trials, reflecting a set of previous experiences
  - strategy often used with neural networks or building invariance into machine vision.
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea.

Likelihood and prior
Likelihood: Bernoulli(θ) distribution
$$P(D \mid \theta) = \theta^{N_H} (1-\theta)^{N_T}$$
N_H: number of heads observed; N_T: number of tails observed
Prior: Beta(F_H, F_T) distribution
$$P(\theta) \propto \theta^{F_H - 1} (1-\theta)^{F_T - 1}$$
F_H: fictional observations of heads; F_T: fictional observations of tails

Shape of the Beta prior
[Figure: plots not recoverable]

Bayesian parameter estimation
$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{N_H + F_H - 1} (1-\theta)^{N_T + F_T - 1}$$
Posterior is Beta(N_H + F_H, N_T + F_T): same form as prior!

Conjugate priors
- A prior p(θ) is conjugate to a likelihood function p(D | θ) if the posterior has the same functional form as the prior.
- Parameter values in the prior can be thought of as a summary of “fictitious observations”.
- Different parameter values in the prior and posterior reflect the impact of observed data.
- Conjugate priors exist for many standard models (e.g., all exponential family models)

Bayesian parameter estimation
[Figure: F_H and F_T generate θ, which generates the data d1, d2, d3, d4 and the next observation d_n; D = N_H, N_T]
Posterior predictive distribution (“hypothesis averaging”):
$$P(d_n = H \mid D, F_H, F_T) = \int_0^1 P(H \mid \theta)\, P(\theta \mid D, F_H, F_T)\, d\theta = \frac{N_H + F_H}{N_H + F_H + N_T + F_T}$$

Example: coin fresh from bank
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
  - After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
- Compare: F = 3 heads, 3 tails: weak expectation that any new coin will be fair
  - After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%

Example: thumbtack
- e.g., F = 5 heads, 3 tails: weak expectation that tacks are slightly biased towards heads
  - After seeing 2 heads, 0 tails, P(H) on next flip = 7 / (7+3) = 70%
- Some prior knowledge is always necessary to avoid jumping to hasty conclusions.
  - Suppose F is empty (no fictitious observations): after seeing 1 heads, 0 tails, P(H) on next flip = 1 / (1+0) = 100%

Origin of prior knowledge
- Tempting answer: prior experience
- Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails

Problems with simple empiricism
- Haven't really seen 2000 coin flips, or any flips of a thumbtack
  - Prior knowledge is stronger than raw experience justifies
- Haven't seen exactly equal number of heads and tails
  - Prior knowledge is smoother than raw experience justifies
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins,
  or 1 flip each for 2000 coins
  - Prior knowledge is more structured than raw experience

A simple theory
“Coins are manufactured by a standardized procedure that is effective but not perfect, and not in principle biased toward heads or tails.”
- Justifies generalizing from previous coins to the present coin.
- Justifies smoother and stronger prior than raw experience alone.
- Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

A hierarchical Bayesian model
[Figure: background theory at the top; F_H, F_T; θ_1, θ_2, ..., θ_200 with θ ~ Beta(F_H, F_T), one per coin (Coin 1, Coin 2, ..., Coin 200, grouped under the label “Coins”); each coin generates its own data d1, d2, d3, d4]
- Qualitative prior knowledge (e.g., symmetry) can influence estimates of continuous parameters (F_H, F_T).
- Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about F_H, F_T.

Learning the parameters of a generative model as Bayesian inference.
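
The conjugate Beta-Bernoulli update and its posterior predictive, (N_H + F_H) / (N_H + F_H + N_T + F_T), can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the tutorial; the function names `beta_posterior` and `predictive_heads` are hypothetical, and exact fractions are used so the results can be compared directly with the coin and thumbtack examples.

```python
from fractions import Fraction

def beta_posterior(n_heads, n_tails, f_heads, f_tails):
    """Parameters of the Beta posterior: observed counts add to fictitious counts."""
    return n_heads + f_heads, n_tails + f_tails

def predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """Posterior predictive probability that the next flip is heads."""
    a, b = beta_posterior(n_heads, n_tails, f_heads, f_tails)
    return Fraction(a, a + b)

# Coin fresh from the bank, strong prior (F = 1000 heads, 1000 tails): 1004 / 2010
print(float(predictive_heads(4, 6, 1000, 1000)))
# Same data, weak prior (F = 3 heads, 3 tails): 7 / 16
print(float(predictive_heads(4, 6, 3, 3)))
# Thumbtack (F = 5 heads, 3 tails) after 2 heads, 0 tails: 7 / 10
print(float(predictive_heads(2, 0, 5, 3)))
```

The same data (4 heads, 6 tails) moves the prediction very little under the strong prior and much more under the weak one, which is the sense in which the fictitious counts measure the strength of prior knowledge.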

Summary: Bayesian parameter estimation
- Prediction by Bayesian hypothesis averaging.
- Conjugate priors: an elegant way to represent simple kinds of prior knowledge.
- Hierarchical Bayesian models integrate knowledge across instances of a system, or different systems within a domain, to explain the origins of priors.

A hierarchical learning framework
[Figure: levels labeled model, data, model class, parameterized model]

Stability versus Flexibility
- Can all domain knowledge be represented with conjugate priors?
- Suppose you flip a coin 25 times and get all heads. Something funny is going on.
- But with F = 1000 heads, 1000 tails, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
- How do we balance stability and flexibility?
  - Stability: 6 heads, 4 tails: θ ≈ 0.5
  - Flexibility: 25 heads, 0 tails: θ ≈ 1

Bayesian model selection
Which provides a better account of the data: the simple hypothesis of a fair coin,
or the complex hypothesis that P(H) = θ?

P(H) = θ is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = θ
- for any observed sequence D, we can choose θ such that D is at least as probable as under P(H) = 0.5 (and strictly more probable unless heads and tails are exactly balanced)

Comparing simple and complex hypotheses: the need for Occam's razor
[Figure: plot with an axis labeled “Probability”]
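
The claim that some value of θ always fits an observed sequence at least as well as θ = 0.5 can be illustrated directly. The sketch below is an assumption-labeled illustration, not the tutorial's code: it uses the standard fact that the Bernoulli likelihood θ^N_H (1-θ)^N_T is maximized at θ = N_H / (N_H + N_T), and the function names are hypothetical.

```python
def sequence_likelihood(data, theta):
    """Bernoulli likelihood of a flip string such as "HHTHT" for a given theta."""
    n_heads = data.count("H")
    n_tails = data.count("T")
    return theta ** n_heads * (1 - theta) ** n_tails

def max_likelihood_theta(data):
    """Value of theta that maximizes the likelihood: the observed fraction of heads."""
    return data.count("H") / len(data)

for flips in ["HHTHT", "HHHHH", "HTHT"]:
    best = max_likelihood_theta(flips)
    print(flips, best, sequence_likelihood(flips, best), sequence_likelihood(flips, 0.5))
```

Because the best-fitting θ can never do worse than the fair coin on the likelihood alone, raw fit to the data cannot be the whole basis for choosing between the two hypotheses, which is why some form of Occam's razor is needed.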