Models


Inferential statistical methods help us decide what to believe in. With inferential statistics, we don’t just introspect to find the truth. Instead, we rely on data from observations. Based on the data, what should we believe in?

Data from scientific experiments, especially those involving humans or animals, are unmitigated heaps of variability. Theories in science tend to be rife with parameters of uncertain magnitude, and competing theories are numerous. In these situations, the mathematics of statistical inference provide precise numerical bounds on our uncertainty. The math allows us to determine accurately what the data imply for different possible beliefs. The math can tell us exactly how likely or unlikely each possibility is, even when there is an infinite spectrum of possibilities.

Usually, it is straightforward to calculate the probability of obtaining different data samples if we know the process that generated the data in the first place. For example, if we know that a coin is fair, then we can calculate the probability of it landing heads up (the probability equals 1/2). However, we typically do not have perfect knowledge of these processes, and it is the goal of statistical inference to derive estimates of the unknown characteristics, or parameters, of these mechanisms.

Bayesian statistics allows us to start from what is known - the data (here, the results of the coin throws) - and extrapolate backwards to make probabilistic statements about the parameters (the underlying bias of the coin) of the processes that were responsible for generating those data. In Bayesian statistics, this inversion is carried out by applying Bayes’ rule.

Models of observations and models of beliefs

A “formal” model uses mathematical formulas to precisely describe something. In the context of statistical models, the models are typically models of probabilities. Some models describe the probabilities of observable events; e.g., we can have a formula that describes the probability that a coin will come up heads. Other models describe the extent to which we believe in various underlying possibilities; e.g., we can have a formula that describes how much we believe in each possible bias of the coin.

Models have parameters

The probability of rain depends on many things, but in particular it might depend on elevation above sea level. Thus, the probability of rain, which is the output of the model, depends on the location’s elevation, which is a value that is input to the model. The exact relationship between input and output could be modulated by another value that governs exactly how much the input affects the output. This modulating value is called a parameter. The model formula specifies that the input does affect the output, but the parameters govern exactly how much.
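
As a toy illustration (not from the text), here is a minimal Python sketch of a model in which the output - the probability of rain - depends on an input - elevation - with a parameter governing how strongly the input affects the output; the logistic form and all the numbers are illustrative assumptions:

    import math

    def rain_probability(elevation_m, slope):
        # Toy model: P(rain) as a function of elevation. The `slope`
        # parameter governs how strongly the input (elevation) affects
        # the output (probability). The logistic form is illustrative.
        return 1.0 / (1.0 + math.exp(-slope * (elevation_m - 500) / 500))

    # The same input gives different outputs under different parameter values.
    for slope in (0.5, 1.0, 2.0):
        print(slope, round(rain_probability(800, slope), 3))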

We theorists might not know in advance exactly what value of the parameter to believe in, so we entertain a spectrum of possible values, each with a different degree of belief.

We can have a mathematical model of the probability that certain observable events happen. This mathematical model has parameters. The values of the parameters determine the exact probabilities generated by the model. Our beliefs regard the possible values of the parameters. We may believe strongly in some parameter values but less strongly in other values. The form of our beliefs about various parameter values can itself be expressed as a mathematical model, with its own (hyper-)parameters.

Prior and posterior beliefs

A prior belief is so called because it is our belief before taking into account some particular set of observations. After observing the flips of the coin, we have modified beliefs. These are called posterior beliefs because they are computed after taking into account a particular set of observations. Bayesian inference gets us from prior to posterior beliefs.

Three goals for inference from data

Statistical inference is the logical framework we can use to test our beliefs about the noisy world against data. We formalise our beliefs in models of probability. The models are probabilistic because we are ignorant of many of the interacting parts of a system, meaning we cannot say with certainty whether something will, or will not, occur. Suppose that we are evaluating the efficacy of a drug in a trial. Before we carry out the trial, we might believe that the drug will cure 10% of people with a particular ailment. We cannot say which 10% of people will be cured because we do not know enough about the disease or individual patient biology to say exactly whom. Statistical inference allows us to test this belief against the data we obtain in a clinical trial.

Estimation of parameter values

One goal we may have is deciding to what extent we should believe in each of the possible parameter values.

The posterior beliefs typically increase the magnitude of belief in some parameter values, while lessening the degree of belief in others. This process of shifting our beliefs across the various parameter values is called “estimation of parameter values.”

Prediction of data values

Another goal we may have is predicting other data values, given our current beliefs about the world. Prediction means inferring the values of some missing data based on some other included data.

An ability to make specific predictions is one of the primary uses of mathematical models. In Bayesian inference, to predict data values, we typically take a weighted average of our beliefs. We let each belief make its individual prediction, and then we weigh each of those predictions according to how strongly we believe in them.
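
As a minimal sketch of this weighted-average idea, with made-up candidate coin biases and made-up degrees of belief:

    # Each candidate coin bias makes its own prediction for P(heads);
    # we weight each prediction by how strongly we believe in that bias.
    biases = [0.25, 0.50, 0.75]    # hypothetical parameter values
    beliefs = [0.20, 0.60, 0.20]   # hypothetical degrees of belief (sum to 1)

    p_heads = sum(b * w for b, w in zip(biases, beliefs))
    print(p_heads)   # 0.25*0.2 + 0.5*0.6 + 0.75*0.2 = 0.5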

Model comparison

A third goal of statistical inference is model selection, a.k.a. model comparison. If we have two different models of how something might happen, then an observation of what really does happen can influence which model we believe in most. What Bayesian inference tells us is how to shift our magnitude of belief across the available models.

One of the nice qualities of Bayesian model comparison is that it intrinsically adjusts for model complexity. More complex models will fit data better than simple models, merely because the complex models have more flexibility. Unfortunately, more complex models will also fit random noise better than simpler models.


Probability


Inferential statistical techniques provide precision to our uncertainty about possibilities. Uncertainty is measured in terms of probability.

Bayesian inference uses probability theory to allow us to update our uncertain beliefs in light of data.

What is a probability?

First, think of some event where the outcome is uncertain. A probability is a numerical measure of the likelihood of the event. It is a number that we attach to the event. The outcome of the event is random, and the variable that represents the outcome is called a random variable (RV).

In probability theory, we describe the behaviour of random variables. This is a statistical term for variables that associate different numeric values with each of the possible outcomes of some random process. By random here we do not mean the colloquial use of this term to mean something that is entirely unpredictable. A random process is simply a process whose outcome cannot be perfectly known ahead of time (it may nonetheless be quite predictable). So for a coin flip, we may create a random variable X that takes on the value 1 if the coin lands heads up or 0 for tails up. Because the coin flip can produce only a countable number of outcomes (in this case two), X is a discrete random variable.

A probability is a number from 0 to 1. If we assign a probability of 0 to an event, this indicates that this event never will occur. A probability of 1 attached to a particular event indicates that this event always will occur.

Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of an event occurring. For example, if you toss a fair coin four times, the outcomes may not be two heads and two tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails. The expected theoretical probability of heads in any one toss is 0.5 (Law of Large Numbers).

Probability theory is “the doctrine of chances”. It’s a branch of mathematics that tells you how often different kinds of events will happen (e.g. what are the chances of a fair coin coming up heads 10 times in a row?).

The “truth of the world” is known, and the question concerns what kinds of events will happen. Probabilistic questions start with a known model of the world, and we use that model to do some calculations.

We are often interested in knowing the probability of a random variable taking a certain value. For example, what is the probability that when I roll a fair 6-sided die, it lands on 3?

The model is known, but the data are not

Probability deals with predicting the likelihood of future events

Probability: Given known parameters, find the probability of observing a particular set of data.

There are two basic interpretations, or ways of thinking, about probabilities. These are the relative frequency viewpoint and the subjective viewpoint.

The Relative Frequency Interpretation of Probability

We are interested in learning about the probability of some event in some process. In general, the probability of an event can be approximated by the relative frequency, or proportion of times that the event occurs.

In Frequentist (or Classical) statistics, we suppose that our sample of data is the result of one of an infinite number of exactly repeated experiments. The sample we see in this context is assumed to be the outcome of some probabilistic process. Any conclusions we draw from this approach are based on the supposition that events occur with probabilities, which represent the long-run frequencies with which those events occur in an infinite series of experimental repetitions.

For example, if we flip a coin, we take the proportion of heads observed in an infinite number of throws as defining the probability of obtaining heads. Frequentists suppose that this probability actually exists, and is fixed for each set of coin throws that we carry out.

In Frequentist statistics the data are assumed to be random and to result from sampling from a fixed and defined population distribution. For a Frequentist the noise that obscures the true signal of the real population process is attributable to sampling variation - the fact that each sample we pick is slightly different and not exactly representative of the population.

Frequentist statisticians, on the other hand, view the unseen part of the system - the parameters of the probability model - as being fixed and the known parts of the system - the data - as varying.

To a Frequentist, this is because we have picked a slightly odd sample from the population of infinitely many repeated throws. If we flip the coin another 10 times, we will likely get a different result because we then pick a different sample.

The Frequentist perspective is less flexible and assumes that these parameters are constant, or represent the average of a long run - typically an infinite number - of identical experiments. There are occasions when we might think that this is a reasonable assumption. For example, if our parameter represented the probability that an individual taken at random from the UK population has dyslexia, it is reasonable to assume that there is a true, or fixed, population value of the parameter in question.

The Subjective Interpretation of Probability

The relative frequency notion of probability is useful when the process of interest, say tossing a coin, can be repeated many times under similar conditions. But we often wish to deal with the uncertainty of events from processes that will occur only a single time.

In the case where the process will happen only one time, how do we view probabilities? You assign a number to this event (a probability) which reflects your personal belief in the likelihood of this event happening.

A subjective probability reflects a person’s opinion about the likelihood of an event

The numbers you assign must be proper probabilities. That is, they must satisfy some basic rules that all probabilities obey.

Bayesians do not imagine repetitions of an experiment in order to define and specify a probability. A probability is merely taken as a measure of certainty in a particular belief. For Bayesians the probability of throwing a ‘heads’ measures and quantifies our underlying belief that before we flip the coin it will land this way.

Bayesians do not view probabilities as underlying laws of cause and effect. They are merely abstractions which we use to help express our uncertainty. In this frame of reference, it is unnecessary for events to be repeatable in order to define a probability. We are thus equally able to say, ‘The probability of a heads is 0.5’ or ‘The probability of the Democrats winning the 2020 US presidential election is 0.75’. Probability is merely seen as a scale from 0, where we are certain an event will not happen, to 1, where we are certain it will.

For Bayesians, probabilities are seen as an expression of subjective beliefs, meaning that they can be updated in light of new data. The formula invented by the Reverend Thomas Bayes provides the only logical manner in which to carry out this updating process. Bayes’ rule is central to Bayesian inference whereby we use probabilities to express our uncertainty in parameter values after we observe data.

Bayesians assume that, since we are witness to the data, it is fixed, and therefore does not vary. We do not need to imagine that there are an infinite number of possible samples, or that our data are the undetermined outcome of some random process of sampling.

We never perfectly know the value of an unknown parameter (for example, the probability that a coin lands heads up). This epistemic uncertainty (namely, that relating to our lack of knowledge) means that in Bayesian inference the parameter is viewed as a quantity that is probabilistic in nature.

For Bayesians, the parameters of the system are taken to vary, whereas the known part of the system - the data - is taken as given.

In Bayesian statistics, parameters can be assumed fixed, but we are uncertain of their value (here, the true prevalence of dyslexia) before we measure them, and we use a probability distribution to reflect this uncertainty.

Assigning Probabilities

The collection of possible results is called the sample space of the experiment. The next step is to assign numbers called probabilities to the different outcomes.

In some cases, the different outcomes specified in the sample space are equally likely; in this case, it is relatively easy to assign probabilities. For many situations, outcomes will not be equally likely and this method will not work. In the situation where the random process is repeatable under similar conditions, one can simulate the process many times, and assign probabilities by computing proportions of outcomes.

Listing All Possible Outcomes (The Sample Space)

Rolling two dice and recording the sum: sample space = {2,3,4,5,6,7,8,9,10,11,12}

Once we understand what the collection of all possible outcomes looks like, we can think about assigning probabilities to the different outcomes.

The sample space is the set of possible outcomes of an experiment. Points in the sample space are called sample outcomes, realizations, or elements. Subsets of the sample space are called events.

An outcome is a result of a random experiment. The set of all possible outcomes is called the sample space.

e.g. rolling a die has a sample space: S = {1,2,3,4,5,6}

When we repeat a random experiment several times, we call each one of them a trial.

Probability Rules

A probability function is a function P that assigns a number P(A) to each event A and has the following properties:

  • P(A) >= 0 for every event A
  • P(S) = 1, where S is the sample space
  • If A and B are mutually exclusive (disjoint) events, then P(A or B) = P(A) + P(B)

Computing Probabilities With Equally Likely Outcomes

Before we can compute any probabilities for outcomes in a random process, we have to define the sample space, or collection of all possible outcomes. If we have listed all outcomes and it is reasonable to assume that the outcomes are equally likely, then it is easy to assign probabilities.

If there are N possible outcomes in an experiment and the outcomes are equally likely, then you should assign a probability of 1/N to each outcome.

Computing Probabilities by Simulation

In the case where all of the outcomes of an experiment are equally likely, then it is easy to assign probabilities. But, when outcomes are not equally likely, it can be hard to allocate probabilities to events. However, there is a general method which will give us probabilities when the random process can be repeated many times under similar conditions.
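
For example, a short Python sketch estimating the probability that two dice sum to 7 by repeating the process many times and computing the proportion of trials in which the event occurs:

    import random

    random.seed(1)
    trials = 100_000

    # Repeat the random process many times and count how often the event occurs.
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
               for _ in range(trials))
    print(hits / trials)   # close to the exact value 6/36 = 0.1667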

Probabilities of “OR” and “NOT” Events

The first property is useful for finding the probability of one event or another event. The second property tells us how to compute the probability that an event does not occur.

The addition property (for computing probabilities of “or” events): P(A or B) = P(A) + P(B) - P(A and B)

The complement property (for computing probabilities of “not” events): P(not A) = 1 - P(A)

Probability Distributions

A list of the possible numbers in the experiment and their corresponding probabilities is called a probability distribution.

What is a Probability Distribution?

We assume that the outcome of the random process is a number. We describe probabilities of numerical outcomes (such as a randomly chosen man’s height) by use of a probability distribution. These probabilities tell us the likelihood of finding different values when an element is randomly selected.

Probability distributions, just like probabilities, must satisfy some basic rules:

  • each value of the distribution must be non-negative
  • the values must sum (for discrete distributions) or integrate (for continuous distributions) to 1

The lottery example (a winning number drawn from 1 to 100, with each number equally likely) is a discrete probability distribution, since the variable we measure - the winning number - is confined to a finite set of values. However, we could similarly define a probability distribution where our variable may take one value from an infinite number of possible values across a spectrum. Imagine that, before test driving a second-hand car, we are uncertain about its value. From seeing pictures of the car, we might think that it is worth anywhere from £2000 to £4000, with all values being equally likely.

For the continuous case of the second-hand car example, the density p(v) = 1/2000 >= 0 for 2000 <= v <= 4000 satisfies the first requirement, but how do we determine whether this distribution satisfies the second requirement for a valid probability distribution? To do this we could do the continuous equivalent of summation, which is integration. However, we want to avoid doing this (difficult) maths if possible! Fortunately, since integration is essentially just working out an area underneath a curve, we can calculate the integral by appealing to the geometry of the graph. Since this is just a rectangular shape, we calculate the integral by multiplying the base by its height:

area = 1/2000 * 2000 = 1

Often theorists use probability mass to handle discrete distributions, where the distribution’s values are directly interpretable as probabilities, and probability densities to handle continuous distributions. Unlike their discrete sisters, continuous distributions need to be integrated to yield a probability.

Discrete distributions: Probability mass

When the sample space consists of discrete outcomes, we can talk about the probability of each distinct outcome (e.g. the sample space of a six-sided die has six discrete outcomes).

The probability of a discrete outcome, such as the probability of falling into an interval on a continuous scale, is referred to as a probability mass.

The discrete probability distribution for the lottery is straightforward to interpret. To calculate the probability that the winning number, X, is 3, we just read off the height of the relevant bar in its plot, and conclude that:

P(X = 3) = 1/100

In the discrete case, to calculate the probability that a random variable takes on any value within a range, we sum the individual probabilities corresponding to each of the values. In the lottery example, to calculate the probability that the winning number is 10 or less, we just sum the probabilities of it being {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}:

P(X <= 10) = P(X = 1) + P(X = 2) + … + P(X = 10)

= 1/100 + 1/100 + … + 1/100 = 10/100 = 1/10

Continuous distributions

Instead of talking about the infinitesimal probability mass of each infinitesimal interval, we will talk about the ratio of the probability mass to the interval width. That ratio is called the probability density. Density is the amount of stuff per unit of space. Because we are measuring the amount of stuff by its mass, density is the mass divided by the amount of space it occupies.

Continuous distributions arise when the sample space is an uncountable set (i.e. we cannot list the elements in the set).

Properties of probability density functions

For any continuous value that is split up into intervals, the sum of the probability masses of the intervals must be 1.

Any function that has only nonnegative values and integrates to 1 can be construed as a probability density function. Perhaps the most famous probability density function is the normal distribution, also known as the Gaussian distribution.

p(x) is the probability density in the infinitesimal interval around x.

When we consider p(unknown parameter) for a continuous random variable, it turns out we should interpret its values as probability densities, not probabilities.

We can use a continuous probability distribution to calculate the probability that a random variable lies within an interval of possible values. To do this, we use the continuous analogue of a sum, an integral. However, we recognise that calculating an integral is equivalent to calculating the area under a probability density curve. For the car example, we can calculate the probability that the car’s value lies between £2500 and £3000 by determining the rectangular area underneath the graph.

P(2500 <= value <= 3000) = 1/2000 * 500 = 1/4

1/2000 is the height and 500 is the base (i.e. the horizontal distance between 2500 and 3000).
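
As a check, we can perform the integration numerically; this sketch assumes SciPy is available and uses the uniform density from the car example:

    from scipy.integrate import quad

    # Uniform density for the car's value on [2000, 4000].
    def p(v):
        return 1 / 2000 if 2000 <= v <= 4000 else 0.0

    total, _ = quad(p, 2000, 4000)      # integrates to 1 (a valid distribution)
    interval, _ = quad(p, 2500, 3000)   # P(2500 <= value <= 3000)
    print(total, interval)              # 1.0 0.25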

While it is important to understand that probabilities and probability densities are not the same types of entity, the good news for us is that Bayes’ rule is the same for each. So we can readily write p(unknown parameter | data) = p(data | unknown parameter) * p(unknown parameter) / p(data), with densities in place of probabilities wherever the variable is continuous.

The normal probability density function

The integral of the normal density is approximated by summing the masses of all the tiny intervals.
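
A minimal Python sketch of this idea, approximating the integral of the standard normal density by summing the masses (density times width) of many tiny intervals:

    import math

    def normal_pdf(x, mu=0.0, sigma=1.0):
        # The normal (Gaussian) probability density function.
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # Sum the masses (density * width) of many tiny intervals on [-5, 5].
    dx = 0.001
    total = sum(normal_pdf(-5 + i * dx) * dx for i in range(int(10 / dx)))
    print(total)   # approximately 1.0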

Highest density interval (HDI)

A way of summarizing a distribution is the highest density interval, abbreviated HDI. The HDI indicates which points of a distribution are most credible, and which cover most of the distribution.

The HDI summarizes the distribution by specifying an interval that spans most of the distribution, say 95% of it, such that every point inside the interval has higher credibility than any point outside the interval.
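
One simple way to approximate an HDI from samples of a distribution is to slide a window covering 95% of the sorted samples and keep the narrowest such interval; a sketch, using draws from a standard normal purely for illustration:

    import random

    random.seed(0)
    samples = sorted(random.gauss(0, 1) for _ in range(10_000))

    # The 95% HDI is the narrowest interval containing 95% of the samples.
    n = len(samples)
    k = int(0.95 * n)
    lo, hi = min(((samples[i], samples[i + k - 1]) for i in range(n - k + 1)),
                 key=lambda iv: iv[1] - iv[0])
    print(lo, hi)   # roughly (-1.96, 1.96) for a standard normal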

The mean of a distribution

A popular way of summarising a distribution is by its mean, which is a measure of central tendency for a distribution. More intuitively, a mean, or expected value, of a distribution is the long-run average value that would be obtained if we sampled from it an infinite number of times.

The method to calculate the mean of a distribution depends on whether it is discrete or continuous in nature. However, the concept is essentially the same in both cases. The mean is calculated as a weighted sum (for discrete random variables) or integral (for continuous variables) across all potential values of the random variable where the weights are provided by the probability distribution.
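
A quick sketch of both calculations: a weighted sum for a fair six-sided die, and a tiny-interval approximation of the integral for the uniform car-value example:

    # Discrete: weighted sum over the values of a fair six-sided die.
    values = [1, 2, 3, 4, 5, 6]
    probs = [1 / 6] * 6
    print(sum(v * p for v, p in zip(values, probs)))   # 3.5

    # Continuous: integral of v * p(v) for the uniform density on [2000, 4000],
    # approximated by summing over tiny intervals of width dv.
    dv = 1.0
    print(sum(v * (1 / 2000) * dv for v in range(2000, 4000)))   # about 3000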

Two-Way Distributions

There are many situations in which we are interested in the conjunction of two outcomes. e.g. What is the probability of being dealt a card that is both a queen and a heart?

Conditional probability

We often want to know the probability of one outcome, given that we know another outcome is true.

We denote the conditional probability of hair color given eye color as p(h|e), which is spoken “the probability of h given e”, and it is computed as p(h|e) = p(e, h)/p(e).

When we know that B has occurred, every outcome that is outside B should be discarded. Thus, our sample space is reduced to the set B. Now the only way A can happen is when the outcome belongs to the set A and B (intersection). We divide P(A and B) by P(B) so that the conditional probability of the new sample space becomes 1.

The conditional probability P(A|B) is undefined when P(B) = 0. If event B never occurs (i.e. P(B) = 0), then it doesn’t make sense to talk about the probability of A given B.

For any two events A and B, the conditional probability of A given B or P(A|B), is the probability that event A will occur given that we already know that event B has occurred.

To obtain an algebraic formula for a conditional probability, we can begin with the multiplicative rule of probability, which says that the probability of the intersection of two events is the product of the marginal probability of the first and the conditional probability of the second given the first: P(A ^ B) = P(A) * P(B|A). Rearranging gives P(B|A) = P(A ^ B) / P(A).

The conditional probability is the probability that some event occurs given that we know other events have already occurred. If A and B are two events, then the conditional probability of A occurring given that B has occurred is written P(A|B). e.g. the probability that a card is a four given that we have drawn a red card is P(4|red) = 2/26 = 1/13. (There are 52 cards in the pack, 26 red and 26 black. We know that we’ve picked a red card, so there are only 26 cards to pick from. Of these, 2 have the value 4.)
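
We can verify this card calculation by enumerating the deck and counting; a small Python sketch:

    from itertools import product

    ranks = list(range(1, 14))   # 1 = ace, ..., 11-13 = jack, queen, king
    suits = ["hearts", "diamonds", "clubs", "spades"]
    deck = list(product(ranks, suits))

    red = [c for c in deck if c[1] in ("hearts", "diamonds")]
    four_and_red = [c for c in red if c[0] == 4]

    # P(4 | red) = P(4 and red) / P(red), here computed by counting.
    print(len(four_and_red) / len(red))   # 2/26 = 1/13, about 0.077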

Marginal Probability

If A is an event, then the marginal probability is the probability of that event occurring, P(A).

Independence

Let A be the event that it rains tomorrow, and suppose that P(A) = 1/3. Also suppose that I toss a fair coin; let B be the event that it lands heads up. We have P(B) = 1/2.

What is P(A|B)?

P(A|B) = P(A) = 1/3

The result of my coin toss does not have anything to do with tomorrow’s weather. Thus, whether or not B happens, the probability of A should not change.

If two events A and B are independent and P(B) is not equal to 0, then P(A|B) = P(A).

When events are independent, the joint probability is just the product of the individual marginal probabilities of the events: P(A ^ B) = P(A) * P(B). e.g. P(coin landing heads and rolling a 6) = P(A = heads, B = 6) = 1/2 * 1/6 = 1/12 (1/2 for flipping heads and 1/6 for rolling a 6).

Independence means we can multiply the probabilities of events to obtain the probability of their intersection.

Two events are independent if the occurrence (or nonoccurrence) of one of them does not affect the probability that the other one occurs: P(A|B) = P(A) and P(B|A) = P(B).

Joint Probability

The probability of the intersection of two or more events. It is the intersection of the circles of two events on a Venn diagram. If A and B are two events, then the joint probability of the two events is written P(A ^ B). e.g. the probability that a card drawn from a pack is red and has a value of 4 is P(red and 4) = 2/52 = 1/26 (there are two cards that are red and have the value 4).

Law of Total Probability

The law of total probability comes into play when you wish to know the marginal (unconditional) probability of some event, but you only know its probability under some conditions.

It’s used to find the probability of an event, A, when you don’t know enough about A’s probabilities to calculate it directly. Instead, you take a related event, B, and use that to calculate the probability for A: P(A) = P(A|B) * P(B) + P(A|not B) * P(not B).


Bayes’ Rule


Bayes’ rule allows us to use some knowledge or belief that we already have (i.e. a prior) to help calculate the probability of a related event. For example, if we want to find the probability of selling ice cream on a hot and sunny day, Bayes’ rule gives us the tools to use prior knowledge about the likelihood of selling ice cream on any other type of day (e.g. rainy, windy, snowy).

P(A|B) does not equal P(B|A). However, Bayes’ rule tells us the relationship between the two conditional probabilities.

If we know P(A|B) but we’re interested in the probability P(B|A), we can use the rules of conditional probability. The joint probability of A and B can be factored in two ways:

P(A ^ B) = P(A|B) * P(B) = P(B|A) * P(A)

Dividing by P(A), we obtain Bayes’ rule:

P(B|A) = P(A|B) * P(B) / P(A)

Here A and B are events, P(B|A) is the conditional probability that event B occurs given that event A has already occurred, P(A|B) has the same meaning but with the roles of A and B reversed, and P(B) and P(A) are the marginal probabilities of the respective events occurring.

The numerator of Bayes’ theorem uses the fact that the probability of the intersection of two events is the probability of the first event multiplied by the conditional probability of the second event given the first (i.e. this is the same as the probability of A AND B).

The denominator uses this same fact, plus the fact that an event and its complement together comprise the entire sample space and have a total probability of 1. So the sum of P(B|A) * P(A) and P(B|not A) * P(not A) equals P(B).

A crucial application of Bayes’ rule is to determine the probability of a model when given a set of data. What the model itself provides is the probability of the data, given specific parameter values and the model structure. We use Bayes’ rule to get from the probability of the data, given the model, to the probability of the model, given the data.

Derived from definitions of conditional probability

The definition simply says that the probability of y given x is the probability that they happen together relative to the probability that x happens at all: p(y|x) = p(x, y) / p(x).

Now we do some very simple algebraic manipulations. First, multiply both sides by p(x) to get p(y|x) p(x) = p(x, y). By symmetry, p(x|y) p(y) = p(x, y) as well; setting the two equal and dividing by p(x) yields Bayes’ rule: p(y|x) = p(x|y) p(y) / p(x).

The denominator as an integral over continuous values

Bayes’ rule also applies to continuous variables, but probability masses become probability densities and sums become integrals:

p(y|x) = p(x|y) p(y) / ∫ p(x|y) p(y) dy

The y in the numerator is a specific fixed value, whereas the y in the denominator is a variable that takes on all possible values of y over the integral.

Example 1

We have a prior probability that there’s a 4/52 (1/13) chance of getting a king from a deck of 52 cards (the kings of hearts, diamonds, clubs, and spades).

If we have data that the pulled card is a face card, what is the probability that it’s a king?

P(hypothesis = king) = 1/13

P(data = face card) = 12/52 = 3/13 (There are 3 face cards per suit and there are 4 suits)

P(face card | king) = 1 (since every king card has to be a face card)

Bayes Rule:

P(king | Face card) = (1 * 1/13) / (3/13)

P(king | Face card) = 1/3

Because we know the pulled card is a face card, it is now more likely to be a king: before this new information, the chance was 1/13; given that the card is a face card, the probability that it is a king rises to 1/3.

Example 2

It’s a typically hot morning in June in Durham. You look outside and see some dark clouds rolling in. Is it going to rain?

Historically, there is a 30% chance of rain on any given day in June. Furthermore, on days when it does in fact rain, 95% of the time there are dark clouds that roll in during the morning. But, on days when it does not rain, 25% of the time there are dark clouds that roll in during the morning.

Given that there are dark clouds rolling in, what is the chance that it will rain?

Let R = it rains. Let C = clouds roll in. We want Pr (R | C)

From the problem, we know that Pr (R) = 0.30. Pr(C | R) = 0.95 Pr(C | not R) = 0.25.

Hence, using Bayes rule, we have:

Pr (R | C) = Pr (R and C) / Pr (C) = Pr (C | R) Pr (R) / Pr (C) = (.95)(.30) / Pr(C).

Now, Pr (C) = Pr ( C and R) + Pr (C and not R) = (.95)(.30) + Pr (C | not R) Pr (not R) = (.95)(.30) + (.25)(.70) = 0.46

Hence, Pr (R | C ) = (.95)(.30) / .46 = .619.

There is a 61.9% chance that it will rain, given that clouds rolled in during the morning.
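
The same calculation expressed as a few lines of Python, combining the law of total probability (for the denominator) with Bayes’ rule:

    p_rain = 0.30                 # Pr(R)
    p_clouds_given_rain = 0.95    # Pr(C | R)
    p_clouds_given_dry = 0.25     # Pr(C | not R)

    # Law of total probability gives the denominator, Pr(C).
    p_clouds = p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)

    # Bayes' rule.
    p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
    print(p_clouds, p_rain_given_clouds)   # 0.46 and about 0.620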

Applied to models and data

A model specifies:

p( data values | parameter values and model structure )

We use Bayes’ rule to convert that to what we really want to know, which is how strongly we should believe in the model, given the data:

p( parameter values and model structure | data values )

OR

P( hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)

When we have observed some data, we use Bayes’ rule to determine our beliefs across competing parameter values in a model, and to determine our beliefs across competing models.

The prior probability of the parameter values is the marginal distribution, P(parameter value). This is simply the probability of each possible parameter value, collapsed across all possible values of the data.

When we observe a particular data value, D, we know it is true. Thinking of the joint probabilities as a table with one row per data value, the posterior distribution over the parameter is obtained by dividing the joint probabilities in the row for D by the row marginal, p(D). Thus, the posterior probability of each parameter value is just the joint probability in that row, normalized by p(D) to sum to 1.

What Are Models and Why Do We Need Them?

All models are wrong. They are idealised representations of reality that result from making assumptions which, if reasonable, may recapitulate some behaviours of a real system.

An explicitly stated model can be used:

  • To predict
  • To explain
  • To guide data collection
  • To discover new questions
  • To bound outcomes to plausible ranges
  • To illuminate uncertainties
  • To challenge the robustness of prevailing theory through perturbations
  • To reveal the apparently simple (complex) to be complex (simple)

Whenever we build a model, whether it is statistical, biological or sociological, we should ask: What do we hope to gain by building this model, and how can we judge its success? Only when we have answers to these basic questions should we proceed to model building.

How to Choose an Appropriate Likelihood

Before we use a model to make decisions in the real world, we require it to be able to explain key characteristics of the system’s past and present behaviour. With this in mind, the following examples illustrate a framework for building a model.

Examples

An individual’s disease status:

Suppose you work for the state as a healthcare analyst who wants to estimate the prevalence of a certain disease. Also, imagine (unrealistically) that we begin with a sample of only one person, for whom we have no prior information. Let the disease status of that individual be denoted by the binary random variable X, which equals:

x = 0 (no disease) or 1 (disease)

The goal of our analysis is to estimate a probability - the unknown parameter - that a randomly chosen individual has the disease. We now calculate the probability of each outcome for our sample of one individual:

P(x = 0| unknown parameter) = (1 - unknown parameter)

P(x = 1| unknown parameter) = unknown parameter

We can use the Bernoulli distribution to calculate these probabilities.
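
A one-line Bernoulli probability function captures both cases; the parameter value 0.1 below is purely illustrative:

    def bernoulli(x, theta):
        # P(X = x | theta) for x in {0, 1}.
        return theta ** x * (1 - theta) ** (1 - x)

    print(bernoulli(0, 0.1))   # P(no disease) = 1 - theta = 0.9
    print(bernoulli(1, 0.1))   # P(disease) = theta = 0.1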

Bayesian Inference Via Bayes’ Rule

Bayes’ rule tells us how to update our prior beliefs in order to derive better, more informed, beliefs about a situation in light of new data. In Bayesian inference, we test hypotheses about the real world using these posterior beliefs. As part of this process, we estimate characteristics that interest us, which we call parameters, that are then used to test such hypotheses.

The Bayesian inference process uses Bayes’ rule to estimate a probability distribution for those unknown parameters after we observe the data.

Bayes’ rule as used in statistical inference:

p(unknown parameter | data) = p(data | unknown parameter) * p(unknown parameter) / p(data)

In Bayesian inference, we describe uncertainty using probability distributions. Bayes’ rule describes how to convert the likelihood - which is itself not a valid probability distribution - into a posterior probability distribution.

To carry out this conversion we must specify a probability distribution known as a prior. This distribution is a measure of our pre-data beliefs about the parameters of the likelihood function.

The final part of the formula - the denominator - is fully determined by our choice of likelihood and prior.

The goal of Bayesian inference is to calculate the posterior probability distribution for the parameters which interest us.

Probability models are characterised by a set of parameters which, when varied, generate a range of different system behaviours. If the model choice is appropriate, we should be able to tune these parameters so that the model’s behaviour mimics the behaviour of the real-world system that we are investigating. In Bayesian inference, we wish to determine a posterior belief in each set of parameter values. This means that in Bayesian inference we instead hold the data constant, and vary the parameter values.

Likelihood:

Starting with the numerator on the right-hand side of expression, we come across the term Pr(data|unknown parameter), which we call the likelihood. This tells us the probability of generating the particular sample of data if the parameters in our statistical model were equal to unknown parameter. When we choose a statistical model, we can usually calculate the probability of particular outcomes, so this is easily obtained.

e.g. flipping the coin twice, we can calculate the probabilities of the four possible outcomes (HH, HT, TH, TT) by multiplying the probabilities of the individual flips; for a fair coin each outcome has probability 1/2 * 1/2 = 1/4.

Likelihood in Bayesian analysis

When starting out in statistical inference, it can seem bewildering to choose a likelihood function that is appropriate for a given situation. In this chapter, we use a number of case studies to explain how to choose a likelihood. This process should begin with the analyst listing the various assumptions about the data generating process. The analyst should then search through the list of probability distributions and select one (or a number) that satisfy these conditions. The model selection process does not stop here, however. After a model is fitted to the data it is important to check that the results are consistent with the actual data sample and, if necessary, adjust the likelihood.

In all statistical inference, we use an idealised model to approximate a real-world process that interests us. This model is then used to test hypotheses about the world. In Bayesian statistics, the evidence for a particular hypothesis is summarised in a posterior probability distribution.

Bayesians call p(data | unknown parameter) a likelihood. What does this mean in simple, everyday language? Imagine that we flip a coin and record its outcome. The simplest model to represent this outcome ignores the angle the coin was thrown at, and its height above the surface, along with any other details. Because of our ignorance, our model cannot perfectly predict the behaviour of the coin. This uncertainty means that our model is probabilistic rather than deterministic. We might also suppose that the coin is fair, so the probability of the coin landing heads up is given by:

unknown parameter = 1/2

Furthermore, if the coin is thrown twice, we assume that the result of the first flip does not affect the result of the second. This means that the results of the first and second coin flips are independent.

We can use our model to calculate the probability of obtaining two heads in a row: Pr(two heads | Model) = 1/2 * 1/2 = 1/4, where Model represents the set of assumptions that we make in our analysis.

We can also calculate the corresponding probabilities for all possible outcomes for two coin flips. The most heads that can occur is 2, and the least is 0 (if both flips land tails up). The most likely number of heads is 1 since this can occur in two different ways - either the first coin lands heads up and the second lands tails up, or vice versa - whereas the other possibilities (all heads or no heads) can occur in only one way.

Why do we call P(data|unknown parameter) a likelihood and not a probability?

This is because in Bayesian inference we do not keep the parameters of our model fixed. In Bayesian analysis, the data are fixed and the parameters vary. In particular, Bayes’ rule tells us how to calculate the posterior probability density for any value of the unknown parameter.

When we vary the unknown parameter while holding the data fixed, p(data|unknown parameter) does not sum (or integrate) to 1 over the parameter values, so it is not a valid probability distribution. We thus introduce the term likelihood to describe p(data|unknown parameter) when we vary the parameter.

In Bayesian inference, we always vary the parameter and hold the data fixed (we only obtain one sample). Thus, from a Bayesian perspective, we use the term likelihood to remind us that p(data|unknown parameter) is not a probability distribution.

Priors:

p(unknown parameter) is the most controversial part of the Bayesian formula; we call it the prior distribution. It is a probability distribution which represents our pre-data beliefs across different values of the parameters in our model.

Continuing the coin example, we might assume that we do not know beforehand whether the coin is fair or biased, so we suppose all possible values in [0,1] of the unknown parameter - which represents the probability of the coin falling heads up - are equally likely. We can represent these beliefs by a continuous uniform probability density on this interval. More sensibly, however, we might believe that coins are manufactured in a way such that their weight is fairly evenly distributed, meaning that we expect the majority of coins to be reasonably fair. These beliefs would be more adequately represented by a prior peaked around 1/2.

The denominator:

This represents the probability of obtaining our particular sample of data if we assume a particular model and prior. The denominator is fully determined by our choice of prior and likelihood function.

Posteriors: the goal of Bayesian inference

The posterior probability distribution p(unknown parameter|data) is the main goal of Bayesian inference. In general, the more data that is collected, the less impact the prior exerts on the posterior distribution.

We want to obtain p(unknown parameter|data) - the probability of the parameter or hypothesis under investigation, given the data set which has been observed.
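
To make the whole pipeline concrete, here is a minimal grid-approximation sketch for the coin example, assuming NumPy, a uniform prior, and an illustrative data set of 7 heads in 10 flips; the posterior is obtained exactly as Bayes’ rule prescribes, by normalizing likelihood times prior by the denominator:

    import numpy as np

    # Grid of candidate values for the coin's bias (the unknown parameter).
    theta = np.linspace(0, 1, 101)

    prior = np.ones_like(theta)   # uniform prior: all biases equally likely
    prior /= prior.sum()

    # Likelihood of 7 heads in 10 flips for each candidate bias.
    heads, flips = 7, 10
    likelihood = theta ** heads * (1 - theta) ** (flips - heads)

    # Bayes' rule: posterior = likelihood * prior, normalized by the denominator.
    unnormalized = likelihood * prior
    posterior = unnormalized / unnormalized.sum()

    print(theta[np.argmax(posterior)])   # most credible bias, 0.7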

Questions

Suppose that, in an idealised world, the ultimate fate of a thrown coin - heads or tails - is deterministically given by the angle at which you throw the coin and its height above a table. Also in this ideal world, the heights and angles are discrete. However, the system is chaotic (highly sensitive to initial conditions), so the result of throwing a coin at a given angle and height is, in practice, very hard to predict.

Q. Suppose that all combinations of angles and heights are equally likely to be chosen. What is the probability that the coin lands heads up?

P(heads) = 19/40 (since 19 of the 40 equally likely angle-and-height combinations result in heads)

Q. Now suppose that some combinations of angles and heights are more likely to be chosen than others, with the probabilities below. What are the new probabilities that the coin lands heads up?

P(heads) = 0.5 (we sum the probabilities of all the angle-and-height combinations whose result is heads)

Q. We force the coin-thrower to throw the coin at an angle of 45 degrees. What is the probability that the coin lands heads up?

We must now find the weighted average of the coin flip outcomes, given that we are constrained to the combinations where the angle is 45 degrees:

P(heads | 45 degrees) = 0.03 / (0.03 + 0.02 + 0.01 + 0.05 + 0.02) = 0.03 / 0.13

P(heads) = 0.231

Q. We force the coin-thrower to throw the coin at a height of 0.2m. What is the probability that the coin lands heads up?

P(heads | height = 0.2m) = (0.03 + 0.05 + 0.02 + 0.03 + 0.03) / (0.03 + 0.05 + 0.02 + 0.03 + 0.03 + 0.05 + 0.02) = 0.16 / 0.23

P(heads) = 0.696