Week One - The Golem of Prague / The Garden of Forking Data / Sampling the Imaginary

Small World: The world of the golem’s assumptions. Bayesian golems are optimal, in the small world.

The small world is the self-contained logical world of the model. Within the small world, all possibilities are nominated (i.e. the sample space).

Within the small world of the model, it is important to be able to verify the model’s logic, making sure that it performs as expected under favorable assumptions. Bayesian models have some advantages in this regard, as they have reasonable claims to optimality: No alternative model could make better use of the information in the data and support better decisions, assuming the small world is an accurate description of the real world.

Large world: The real world. No Guarantee of optimal procedures.

The large world is the broader context in which one deploys a model. In the large world, there may be events that were not imagined in the small world. Moreover, the model is always an incomplete representation of the large world, and so will make mistakes, even if all kinds of events have been properly nominated.

We move between the small world and large world when modelling.

The way that Bayesian models learn from evidence is arguably optimal in the small world. When their assumptions approximate reality, they also perform well in the large world. But large world performance has to be demonstrated rather than logically deduced. Passing back and forth between these two worlds allows both formal methods, like Bayesian inference, and informal methods, like peer review, to play an indispensable role.

Bayesian Data Analysis

Count all the ways data can happen according to assumptions. Assumptions with more ways that are consistent with data are more plausible.

  • The Future:
    • Full of branching paths
    • Each choice closes some
  • The data:
    • Many possible events
    • Each observation eliminates some

There is a bag of four marbles that come in only two colours: blue and white. From multiple draws, what are the contents of the bag? How many blue marbles and how many white marbles? There are only five possibilities, based on what we know about the possible outcomes.

We draw with replacement from the bag to get observations. We count up all the possible ways that we could get the data (i.e. the sequence blue, white, blue). These are the ways that are consistent with the data.

We compare the results with other possible outcomes.

This is Bayesian inference: a Bayesian model, no matter how complicated, is just counting. There’s a set of assumptions, and it counts the marbles that could have appeared according to those assumptions. This is the posterior probability distribution.

These are the relative plausibilities of the different conjectures.

If we take another marble from the bag and it’s blue, then we update the posterior probability distribution to reflect the new data. This is done by multiplying the counts.
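This updating-by-multiplication can be sketched in R for the bag-of-four-marbles example (variable names here are illustrative):

```r
# conjectures: bags containing 0 to 4 blue marbles (out of 4 total)
blue  <- 0:4
white <- 4 - blue

# ways each conjectured bag could produce the draw sequence blue, white, blue
prior_ways <- blue * white * blue
prior_ways                 # 0 3 8 9 0

# a new draw comes up blue: multiply the prior counts by the ways to draw blue
new_ways <- prior_ways * blue
new_ways                   # 0 3 16 27 0
```

Note how the new counts preserve the ranking of the conjectures but sharpen the differences between them.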

Incorporating other information


We can also multiply the posterior probability by the new information:

From counts to probability

It is helpful to think of this strategy as adhering to a principle of honest ignorance: When we don’t know what caused the data, potential causes that may produce the data in more ways are more plausible.

It’s hard to use these counts though, so we almost always standardize them in a way that transforms them into probabilities.

There’s a mathematical way to compress all of this. Specifically, we define the updated plausibility of each possible composition of the bag, after seeing the data, as:

For any value p can take, we judge the plausibility of that value p as proportional to the number of ways it can get through the garden of forking data.

We normalise the values to between 0 and 1 to become probabilities.

This represents the weight of evidence for that possibility conditional on the evidence.

ways <- c(0, 3, 8, 9, 0)

ways / sum(ways)
## [1] 0.00 0.15 0.40 0.45 0.00

These plausibilities are also probabilities: they are non-negative (zero or positive) real numbers that sum to one. And all of the mathematical things you can do with probabilities you can also do with these values.

Building a Model

By working with probabilities instead of raw counts, Bayesian inference is made much easier, but it looks much harder.

We need to create a procedure by which we build up a model. We start with the known data and build a Bayesian model of how to explain that data from what we know about where the data came from:

We design the model using the scientific information we have about the data: what we know in general about the generative process before we see the exact values that have appeared. We then condition on the data using Bayesian inference, updating the plausibility of each conjecture by counting ways. Finally, we get critical about the results, asking what the model may be missing, and feed that critique back into the design (recursive model evaluation).

Suppose you have a globe representing our planet, the Earth. This version of the world is small enough to hold in your hands. You are curious how much of the surface is covered in water. You adopt the following strategy: You will toss the globe up in the air. When you catch it, you will record whether the surface under your right index finger is water or land. Then you toss the globe up in the air again and repeat the procedure. This strategy generates a sequence of surface samples from the globe. The first nine samples might look like:

W, L, W, W, W, L, W, L, W

where W indicates water and L indicates land.

A data story

Bayesian data analysis usually means producing a story for how the data came to be. This story may be descriptive, specifying associations that can be used to predict outcomes, given observations. Or it may be causal, a theory of how some events produce other events. Typically, any story you intend to be causal will also be descriptive, but many descriptive stories are hard to interpret causally. All data stories are complete, in the sense that they are sufficient for specifying an algorithm for simulating new data.

You can motivate your data story by trying to explain how each piece of data is born. This usually means describing aspects of the underlying reality as well as the sampling process. The data story in this case is simply a restatement of the sampling process:

  • The true proportion of water covering the globe is p.
  • A single toss of the globe has a probability p of producing a water (W) observation and a probability 1 − p of producing a land (L) observation.
  • Each toss of the globe is independent of the others.

The data story is then translated into a formal probability model. This probability model is easy to build, because the construction process can be usefully broken down into a series of component decisions.


Bayesian updating

Our problem is one of using the evidence-the sequence of globe tosses-to decide among different possible proportions of water on the globe. These proportions are like the conjectured marbles inside the bag, from earlier in the chapter. Each possible proportion may be more or less plausible, given the evidence. A Bayesian model begins with one set of plausibilities assigned to each of these possibilities. These are the prior plausibilities. Then it updates them in light of the data, to produce the posterior plausibilities.

Let’s program our Bayesian machine to initially assign the same plausibility to every proportion of water, every value of p.

The dashed horizontal line represents this initial plausibility of each possible value of p. After seeing the first toss, which is a “W,” the model updates the plausibilities to the solid line. The plausibility of p = 0 has now fallen to exactly zero, the equivalent of “impossible.” Why? Because we observed at least one speck of water on the globe, so now we know there is some water.

The additional samples from the globe are introduced to the model, one at a time. Each dashed curve is just the solid curve from the previous plot, moving left to right and top to bottom. Every time a “W” is seen, the peak of the plausibility curve moves to the right, towards larger values of p. Every time an “L” is seen, it moves the other direction. The maximum height of the curve increases with each sample, meaning that fewer values of p amass more plausibility as the amount of evidence increases. As each new observation is added, the curve is updated consistent with all previous observations.
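This one-observation-at-a-time updating can be sketched on a grid of candidate values of p (a sketch, not from the text; the grid size and variable names are illustrative):

```r
# the nine tosses from the text, coded W = 1, L = 0
tosses <- c(1, 0, 1, 1, 1, 0, 1, 0, 1)
p_grid <- seq(0, 1, length.out = 101)
plausibility <- rep(1, length(p_grid))   # flat initial plausibility

for (obs in tosses) {
  # multiply current plausibility by the probability of this observation
  plausibility <- plausibility * dbinom(obs, size = 1, prob = p_grid)
  plausibility <- plausibility / sum(plausibility)   # renormalize
}

p_grid[which.max(plausibility)]          # peak near 6/9
```

Reordering `tosses` leaves the final curve unchanged, which is exactly the order-irrelevance assumption discussed below.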


Evaluate

The Bayesian model learns in a way that is demonstrably optimal, provided that the real, large world is accurately described by the model. This is to say that your Bayesian machine guarantees perfect inference, within the small world. No other way of using the available information, and beginning with the same state of information, could do better.

However, if there are important differences between the model and reality, then there is no logical guarantee of large world performance. And even if the two worlds did match, any particular sample of data could still be misleading. So it’s worth keeping in mind at least two cautious principles:

First, the model’s certainty is no guarantee that the model is a good one. As the amount of data increases, the globe tossing model will grow increasingly sure of the proportion of water. This means that the posterior curves will become increasingly narrow and tall, restricting plausible values within a very narrow range. But models of all sorts, Bayesian or not, can be very confident about an inference, even when the model is seriously misleading. This is because the inferences are conditional on the model. What your model is telling you is that, given a commitment to this particular model, it can be very sure that the plausible values are in a narrow range. Under a different model, things might look different.

Second, it is important to supervise and critique your model’s work. Consider again the fact that the updating in the previous section works in any order of data arrival. We could shuffle the order of the observations, as long as six W’s and three L’s remain, and still end up with the same final plausibility curve. That is only true, however, because the model assumes that order is irrelevant to inference. When something is irrelevant to the machine, it won’t affect the inference directly. But it may affect it indirectly, because the data will depend upon order. So it is important to check the model’s inferences in light of aspects of the data it does not know about.

We know the model’s assumptions are never exactly right, in the sense of matching the true data generating process. Therefore there’s no point in checking if the model is true. Failure to conclude that a model is false must be a failure of our imagination, not a success of the model. Moreover, models do not need to be exactly true in order to produce highly precise and useful inferences. All manner of small world assumptions about error distributions and the like can be violated in the large world, but a model may still produce a perfectly useful estimate.

Instead, the objective is to check the model’s adequacy for some purpose. This usually means asking and answering additional questions, beyond those that originally constructed the model. Both the questions and answers will depend upon the scientific context.

Components of the model

# land = 0; water = 1
tosses <- c(1, 0, 1, 1, 1, 0, 1, 0, 1)

tosses
## [1] 1 0 1 1 1 0 1 0 1
# dbinom = binomial probability (density) function
# sum(tosses)    = number of water observations (W)
# length(tosses) = number of tosses (N)
# prob = 0.5 is an assumed value of p, the probability of water on each toss;
# it doesn't yet take into account any prior information about the proportion of water

water_likelihood <- dbinom(sum(tosses), size = length(tosses), prob = 0.5)

water_likelihood
## [1] 0.1640625

That number is the relative number of ways to get six W’s, holding p at 0.5 and N = W + L at nine. So it does the job of counting the relative number of paths through the garden.
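The counting interpretation can be verified directly: the binomial density is the number of distinct orderings times the probability of any one ordering.

```r
N <- 9; W <- 6; p <- 0.5
# choose(N, W) distinct paths, each with probability p^W * (1 - p)^(N - W)
choose(N, W) * p^W * (1 - p)^(N - W)    # 0.1640625
dbinom(W, size = N, prob = p)           # same value
```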

The binomial distribution is rather special, because it represents the maximum entropy way to count binary events. The distribution contains no additional information other than: there are two events, and the probabilities of each in each trial are p and 1 − p.

Unobserved variables

The distributions we assign to the observed variables typically have their own variables. In the binomial above, there is p, the probability of sampling water. Since p is not observed, we usually call it a parameter. Even though we cannot observe p, we still have to define it.

In statistical modeling, many of the most common questions we ask about data are answered directly by parameters:

  • What is the average difference between treatment groups?
  • How strong is the association between a treatment and an outcome?
  • Does the effect of the treatment depend upon a covariate?
  • How much variation is there among groups?

For every parameter you intend your Bayesian machine to consider, you must provide a distribution of prior plausibility, its prior. A Bayesian machine must have an initial plausibility assignment for each possible value of the parameter, and these initial assignments do useful work.

Priors are chosen to reflect what we know about a phenomenon. Within Bayesian data analysis in the natural and social sciences, the prior is considered to be just part of the model. As such it should be chosen, evaluated, and revised just like all of the other components of the model.

If you don’t have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. No one is required to swear an oath to the assumptions of a model, and no set of assumptions deserves our obedience.

A uniform (flat) prior is usually not the best choice. There is almost always something known that can improve inference (and decrease the risk of overfitting to the data).

It is typical to conceive of data and parameters as completely different kinds of entities. Data are measured and known; parameters are unknown and must be estimated from data. Usefully, in the Bayesian framework the distinction between a datum and a parameter is not so fundamental. Sometimes we observe a variable, but sometimes we do not. In that case, the same distribution function applies, even though we didn’t observe the variable. As a result, the same assumption can look like a “likelihood” or a “prior,” depending upon context, without any change to the model.

A model is born

We can now summarize our model. The observed variables W and L are given relative counts through the binomial distribution. So we can write, as a shortcut:

W ~ Binomial(N, p)

where N = W + L, the total number of tosses

The above is just a convention for communicating the assumption that the relative counts of ways to realize W in N trials with probability p on each trial comes from the binomial distribution.

The unobserved parameter p gets:

p ~ Uniform(0, 1)

This means that p has a uniform (flat) prior over its entire possible range, from zero to one. Note: Uniform isn’t the best choice here, as we know the Earth has more water than land, even if we don’t know the exact proportions yet.

Making the model go

Once you have named all the variables and chosen definitions for each, a Bayesian model can update all of the prior distributions to their purely logical consequences: the posterior distribution. For every unique combination of data, likelihood, parameters, and prior, there is a unique posterior distribution. This distribution contains the relative plausibility of different parameter values, conditional on the data and model. The posterior distribution takes the form of the probability of the parameters, conditional on the data. In this case, it would be Pr(p|W, L), the probability of each possible value of p, conditional on the specific W and L that we observed.

Posterior = Probability of the data × Prior / Average probability of the data. In symbols: Pr(p | W, L) = Pr(W, L | p) Pr(p) / Pr(W, L).

The denominator is to ensure that the posterior is standardised to one (integrates to one)

The key lesson is that the posterior is proportional to the product of the prior and the probability of the data. Why? Because for each specific value of p, the number of paths through the garden of forking data is the product of the prior number of paths and the new number of paths. Multiplication is just compressed counting. The average probability on the bottom just standardizes the counts so they sum to one.
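A numeric sketch of this proportionality: with a flat prior, the standardized product of likelihood and prior matches the analytic Beta(W + 1, L + 1) posterior (the grid and variable names here are illustrative):

```r
p_grid <- seq(0.001, 0.999, length.out = 200)
prior  <- rep(1, length(p_grid))
lik    <- dbinom(6, size = 9, prob = p_grid)

# posterior is proportional to likelihood * prior, standardized over the grid
post <- lik * prior / sum(lik * prior)

# analytic posterior under a flat prior: Beta(6 + 1, 3 + 1)
post_beta <- dbeta(p_grid, 7, 4)
post_beta <- post_beta / sum(post_beta)

max(abs(post - post_beta))   # numerically zero
```

The constant factor choose(9, 6) in the likelihood cancels in the standardization, which is why only the product's shape matters.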

Motors

We can compute the posterior distribution via:

  1. Grid approximation
  2. Quadratic approximation
  3. Markov chain Monte Carlo (MCMC)

Grid Approximation

While most parameters are continuous, capable of taking on an infinite number of values, it turns out that we can achieve an excellent approximation of the continuous posterior distribution by considering only a finite grid of parameter values.

But in most of your real modeling, grid approximation isn’t practical. The reason is that it scales very poorly, as the number of parameters increases. So in later chapters, grid approximation will fade away, to be replaced by other, more efficient techniques.

# define grid
p_grid <- seq( from=0 , to=1 , length.out=20 )

# define prior
prior <- rep( 1 , 20 )

# compute likelihood at each value in grid
likelihood <- dbinom( 6 , size=9 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)

plot( p_grid , posterior , type="b" ,
xlab="probability of water" , ylab="posterior probability" )
mtext( "20 points" )

Quadratic approximation

A useful approach is quadratic approximation. Under quite general conditions, the region near the peak of the posterior distribution will be nearly Gaussian, or “normal,” in shape. This means the posterior distribution can be usefully approximated by a Gaussian distribution. A Gaussian distribution is convenient, because it can be completely described by only two numbers: the location of its center (mean) and its spread (variance). A Gaussian approximation is called “quadratic approximation” because the logarithm of a Gaussian distribution forms a parabola. And a parabola is a quadratic function. So this approximation essentially represents any log-posterior with a parabola.
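The book uses a helper function for this; here is a hand-rolled sketch in base R for the globe data (6 W, 3 L) with a flat prior: find the posterior mode, then measure the curvature of the log-posterior there.

```r
# log-posterior (up to an additive constant) under a flat prior
log_post <- function(p) dbinom(6, size = 9, prob = p, log = TRUE)

# posterior mode (MAP), near 2/3
fit <- optimize(log_post, interval = c(0, 1), maximum = TRUE)
map <- fit$maximum

# numerical second derivative of the log-posterior at the mode
h <- 1e-4
curv <- (log_post(map + h) - 2 * log_post(map) + log_post(map - h)) / h^2

# Gaussian approximation: mean = mode, sd from the curvature (about 0.16)
sd_approx <- sqrt(-1 / curv)
c(mean = map, sd = sd_approx)
```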

This phenomenon, where the quadratic approximation improves with the amount of data, is very common. It’s one of the reasons that so many classical statistical procedures are nervous about small samples: Those procedures use quadratic (or other) approximations that are only known to be safe with infinite data. Often, these approximations are useful with less than infinite data, obviously. But the rate of improvement as sample size increases varies greatly depending upon the details. In some models, the quadratic approximation can remain terrible even with thousands of samples.

The quadratic approximation, either with a uniform prior or with a lot of data, is often equivalent to a maximum likelihood estimate (MLE) and its standard error. The MLE is a very common non-Bayesian parameter estimate. This correspondence between a Bayesian approximation and a common non-Bayesian estimator is both a blessing and a curse.

Markov chain Monte Carlo

There are lots of important model types, like multilevel (mixed-effects) models, for which neither grid approximation nor quadratic approximation is always satisfactory.

The most popular of these is Markov chain Monte Carlo (MCMC), which is a family of conditioning engines capable of handling highly complex models.

Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples.

n_samples <- 1000
p <- rep( NA , n_samples )
p[1] <- 0.5
W <- 6
L <- 3
for ( i in 2:n_samples ) {
    # propose a new value of p near the current one
    p_new <- rnorm( 1 , p[i-1] , 0.1 )
    # reflect proposals back into the interval [0, 1]
    if ( p_new < 0 ) p_new <- abs( p_new )
    if ( p_new > 1 ) p_new <- 2 - p_new
    # posterior (up to a constant) at the current and proposed values
    q0 <- dbinom( W , W+L , p[i-1] )
    q1 <- dbinom( W , W+L , p_new )
    # accept the proposal with probability min(1, q1/q0)
    p[i] <- ifelse( runif(1) < q1/q0 , p_new , p[i-1] )
}

Summary

The target of inference in Bayesian inference is a posterior probability distribution. Posterior probabilities state the relative numbers of ways each conjectured cause of the data could have produced the data. These relative numbers indicate plausibilities of the different conjectures. These plausibilities are updated in light of observations through Bayesian updating.

A Bayesian model is a composite of variables and distributional definitions for these variables. The probability of the data, often called the likelihood, provides the plausibility of an observation (data), given a fixed value for the parameters. The prior provides the plausibility of each possible value of the parameters, before accounting for the data. The rules of probability tell us that the logical way to compute the plausibilities, after accounting for the data, is to use Bayes’ theorem. This results in the posterior distribution.

Practice Questions

easy 1. Which of the expressions below correspond to the statement: the probability of rain on Monday?

answer: Pr(rain | Monday) and Pr(rain, Monday) / Pr(Monday)

easy 2. Which of the following statements corresponds to the expression: Pr(Monday|rain)?

answer: The probability that it is Monday, given that it is raining.

easy 3. Which of the expressions below correspond to the statement: the probability that it is Monday, given that it is raining?

answer: Pr(Monday | rain) and Pr(rain|Monday)*Pr(Monday)/Pr(rain)

the latter follows from Bayes’ rule (need to derive this to make sure I understand)
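The derivation asked for in the note is one application of the definition of conditional probability:

```latex
\Pr(\text{Monday} \mid \text{rain})
  = \frac{\Pr(\text{Monday}, \text{rain})}{\Pr(\text{rain})}
  = \frac{\Pr(\text{rain} \mid \text{Monday})\,\Pr(\text{Monday})}{\Pr(\text{rain})}
```

The second equality uses Pr(Monday, rain) = Pr(rain | Monday) Pr(Monday), the same definition applied in the other direction.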

easy 4. What does it mean to say “the probability of water is 0.7”?

From the Bayesian perspective, it is a statement about our state of knowledge: given the model and the evidence, we assign plausibility 0.7 to observing water on any given toss. It summarizes our uncertainty about the proportion of water; it is not a claim that the world itself is random.

medium 1. Compute and plot the grid approximate posterior distribution for each of the following sets of observations. In each case, assume a uniform prior for p

  1. W,W,W
# define grid
p_grid <- seq( from=0 , to=1 , length.out=100 )

# define prior
prior <- rep( 1 , 100 )

# compute likelihood at each value in grid
likelihood <- dbinom( 3 , size=3 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)

plot( p_grid , posterior , type="b" ,
xlab="probability of water" , ylab="posterior probability" )

  1. W,W,W,L
# define grid
p_grid <- seq( from=0 , to=1 , length.out=100 )

# define prior
prior <- rep( 1 , 100 )

# compute likelihood at each value in grid
likelihood <- dbinom( 3 , size=4 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)

plot( p_grid , posterior , type="b" ,
xlab="probability of water" , ylab="posterior probability" )

  1. L, W, W, L, W, W, W
# define grid
p_grid <- seq( from=0 , to=1 , length.out=100 )

# define prior
prior <- rep( 1 , 100 )

# compute likelihood at each value in grid
likelihood <- dbinom( 5 , size=7 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)

plot( p_grid , posterior , type="b" ,
xlab="probability of water" , ylab="posterior probability" )

Medium 2. Now assume a prior for p that is equal to zero when p < 0.5 and is a positive constant when p ≥ 0.5. Again compute and plot the grid approximate posterior distribution for each of the sets of observations in the problem just above.

# define grid
p_grid <- seq( from=0 , to=1 , length.out=20 )

# define prior
prior <- ifelse(p_grid < 0.5, 0, 1)

# compute likelihood at each value in grid
# (shown here for the third observation set, 5 W in 7 tosses;
#  change the counts to reproduce the other two posteriors)
likelihood <- dbinom( 5 , size=7 , prob=p_grid )

# compute product of likelihood and prior
unstd.posterior <- likelihood * prior

# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)

plot( p_grid , posterior , type="b" ,
xlab="probability of water" , ylab="posterior probability" )
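The block above only covers the third observation set. A small helper (hypothetical, not from the text) runs the same grid approximation for all three sets with the step prior:

```r
# grid approximation for w waters in n tosses, with an arbitrary prior function
grid_posterior <- function(w, n, prior_fn, n_grid = 100) {
  p_grid <- seq(0, 1, length.out = n_grid)
  unstd  <- dbinom(w, size = n, prob = p_grid) * prior_fn(p_grid)
  list(p = p_grid, posterior = unstd / sum(unstd))
}

# prior that is zero below 0.5 and a positive constant at or above it
step_prior <- function(p) ifelse(p < 0.5, 0, 1)

# the three observation sets: (3 W in 3), (3 W in 4), (5 W in 7)
fits <- mapply(grid_posterior, w = c(3, 3, 5), n = c(3, 4, 7),
               MoreArgs = list(prior_fn = step_prior), SIMPLIFY = FALSE)

# e.g. plot the third fit
plot(fits[[3]]$p, fits[[3]]$posterior, type = "b",
     xlab = "probability of water", ylab = "posterior probability")
```

In every fit, the posterior is exactly zero below p = 0.5, because multiplying by a zero prior count leaves zero paths through the garden.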

Week Two: Geocentric Models / Wiggly Orbits

Week Three: The Many Variables and The Spurious Waffles