Bayesian Inference
updated 2021-07-14
“Bayesian Inference” – sounds fancy, eh? So non-frequentist, machine-learny, does it not? Anyway, I hadn’t really heard too much about it before, but had to invest some time understanding it in order to work on my BSc thesis (proposed title: Reconstructing Nuclear Fuel Cycles Using Bayesian Inference: a first implementation), and now I seem to see it everywhere.
But let’s start with the Bayes in Bayesian inference. If you are like me, then you’ve probably heard about Bayes’ theorem, which is usually presented in the form
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$
A term such as $P(A \mid B)$ is called a conditional probability, and is to be read as “the probability of event $A$ occurring, given that we have observed event $B$”. This implies that the two events are linked somehow, and that we can use knowledge about the one occurring in order to gain information about the other one occurring at the same time. To understand this more intuitively, take a look at this visualization (thanks, asciiflow), the areas representing the probabilities of certain events $A$, $B$, and $C$:
P(A or B or C) = 1
+---------------------------------------+
|                           +------+    |
|                           |  C   |    |
|                           +------+    |
|   +------------+                      |
|   |  A     +---+--------+             |
|   |        |###|        |             |
|   |        |###|   B    |             |
|   +--------+---+        |             |
|            +------------+             |
+---------------------------------------+
The shaded area represents the event $A \wedge B$ ($\wedge$ means “and”). The first important insight is now that $$P(A \mid B) = \frac{P(A \wedge B)}{P(B)},$$ i.e. the ratio of the shaded area to the area $B$ is the conditional probability for “$A$ given $B$”. In simple words: If $B$ has occurred, the ratio of the shaded area and the area of $B$ is the probability that $A$ has also occurred. I hope that this makes sense to you. It also quickly leads to the proof of Bayes’ Theorem:
$$P(A \mid B) = \frac{P(A \wedge B)}{P(B)} \underbrace{=}_{\mathrm{Symmetry}} \frac{P(B \wedge A)}{P(B)} = \frac{P(B \mid A) P(A)}{P(B)}$$
The last step uses $P(B \mid A) = P(B \wedge A) / P(A)$ (the same identity as above, with $A$ and $B$ swapped), rearranged to $P(B \wedge A) = P(B \mid A) P(A)$.
Now, what Bayesian statistics is about is asking the question: “If we know that event $B$ has occurred, what is the probability that $A$ has also occurred?” This question is often hard to answer. But in many cases, the opposite question “What is the probability that $B$ has occurred, if we know that $A$ has?” is much easier to answer; and so are the questions for the so-called prior probabilities $P(A)$ and $P(B)$. (Don’t worry, I will show an example below.)
To reiterate in words instead of equations: the shaded area in the diagram is the joint probability $P(A \wedge B)$. The conditional probability $P(A \mid B)$ is the ratio of that shaded area to the area of $B$ (see the equation just above). But the shaded area is also equal to the probability $P(B \mid A)$ times the probability that $A$ occurs: $P(B \mid A)\,P(A)$. Dividing this by $P(B)$ again gives the ratio of the shaded area to the area of $B$, i.e. the conditional probability $P(A \mid B)$.
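If you like checking such identities with a few lines of code, here is a minimal sketch (my own toy example: a die roll, with events of my choosing). We estimate all probabilities by counting, and confirm that the direct ratio and the Bayes route give the same conditional probability:

```python
import random

random.seed(0)

# A joint experiment: roll a die; event A = "roll is even", B = "roll >= 4"
rolls = [random.randint(1, 6) for _ in range(100_000)]
n = len(rolls)

p_a = sum(r % 2 == 0 for r in rolls) / n
p_b = sum(r >= 4 for r in rolls) / n
p_a_and_b = sum(r % 2 == 0 and r >= 4 for r in rolls) / n

# "A given B": ratio of the shaded area to the area of B ...
p_a_given_b = p_a_and_b / p_b

# ... and the same number via Bayes' theorem, with P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
via_bayes = p_b_given_a * p_a / p_b

assert abs(p_a_given_b - via_bayes) < 1e-12  # the two routes agree
```

Here the true value is $P(A \mid B) = 2/3$ (two of the three outcomes $\geq 4$ are even), and the counting estimate lands close to it.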
A simple example.
This has been exceedingly theoretical. To improve the situation we got ourselves into, let’s take a look at a classic problem within Bayesian statistics, that is slightly less theoretical – but mainly shows why it can be very interesting to look at a problem from the Bayesian point of view.
+---------------+     +---------------+
|   COOKIES 1   |     |   COOKIES 2   |
|               |     |               |
| 10 CHOCOLATE  |     | 20 CHOCOLATE  |
| 30 PLAIN      |     | 20 PLAIN      |
+---------------+     +---------------+
We have two cookie jars, one with 10 chocolate cookies and 30 plain cookies; and one with half/half chocolate and plain cookies. Let’s assume our jars are not made of glass, so that we cannot see into them – and peeking at the top layer of cookies wouldn’t tell us the overall ratio inside a jar anyway.
How can Bayesian statistics help us here? This is where we can start to speak of “inference”. The problem is: if we take one cookie from a certain jar, what does the chocolate content of the cookie tell us about which jar we took it from?
Let us model this by assigning the event “We are looking at jar #1” to random variable (RV) $A$, and the event “We took a chocolate cookie” to random variable $B$:
- $P(A) = 1/2$ (because there are two equally likely jars)
- $P(B) = 30/80 = 3/8$ (because 30 of the 80 cookies overall are chocolate; pooling all cookies like this only works because both jars contain the same number of cookies)
- $P(A \mid B)$ is the probability of looking at jar 1 if we have previously found a chocolate cookie. This is the probability we’d like to compute.
- $P(B \mid A) = 10/40 = 1/4$ is the probability of drawing a chocolate cookie if we are looking at jar 1. This is trivial to calculate, as we know the composition of cookies in that jar!
Let’s apply Bayes’ formula, to see which jar we are probably looking at if we have drawn a chocolate cookie:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{\frac{1}{4} \frac{1}{2}}{\frac{3}{8}} = \frac{1}{3}$$
So we are probably looking at jar 2, with a probability of $2/3$. In this case, we can even cross-check the result for plausibility: given that $2/3$ of all chocolate cookies (20 out of 30) are in jar 2, it makes sense that our probability of looking at jar 2 is $2/3$.
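For the skeptics, here is the same computation as a small sketch, using exact fractions so no rounding gets in the way (the variable names are mine):

```python
from fractions import Fraction

p_jar1 = Fraction(1, 2)              # P(A): two equally likely jars
p_choc_jar1 = Fraction(10, 40)       # P(B | A): chocolate share in jar 1
p_choc_jar2 = Fraction(20, 40)       # P(B | not A): chocolate share in jar 2

# Marginal probability of drawing a chocolate cookie, P(B)
p_choc = p_jar1 * p_choc_jar1 + (1 - p_jar1) * p_choc_jar2

# Bayes' theorem: P(A | B)
p_jar1_given_choc = p_choc_jar1 * p_jar1 / p_choc
print(p_jar1_given_choc)  # 1/3
```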
Now what if we have drawn two chocolate cookies? Intuitively, it is now even more likely that we are looking at jar 2, because it contains more chocolate cookies! Let’s say we ate the first cookie, and thus didn’t put it back into the jar. What do the probabilities for $BB$ (two chocolate cookies) look like?
- $P(A) = 1/2$
- $P(BB) = \frac{1}{2} \cdot \frac{10 \cdot 9}{40 \cdot 39} + \frac{1}{2} \cdot \frac{20 \cdot 19}{40 \cdot 39} \approx 0.151$. Note that we can no longer pool all 80 cookies as we did for $P(B)$ – both draws come from the same jar – so we average the two-draw probability over the jars.
- $P(A \mid BB)$ is what we are looking for!
- $P(BB \mid A) = \frac{10}{40} \cdot \frac{9}{39}$ is the probability of drawing two consecutive chocolate cookies from jar 1.
Let’s put together the numbers again:
$$P(A \mid BB) = \frac{P(BB \mid A) P(A)}{P(BB)} = \frac{\frac{10 \cdot 9}{40 \cdot 39} \cdot \frac{1}{2}}{\frac{1}{2} \cdot \frac{10 \cdot 9}{40 \cdot 39} + \frac{1}{2} \cdot \frac{20 \cdot 19}{40 \cdot 39}} = \frac{90}{90 + 380} = \frac{9}{47} \approx 0.19$$
With two cookies like this we already have useful information about which jar we are looking at – roughly 81% that it’s jar 2 – but no certainty. You can continue with this, for example by asking what happens if we draw one chocolate and one plain cookie (in that case, chances are $4/7 \approx 57\%$ that we are looking at jar 2 – check whether you get the same result, and let me know if not!).
Inference?
This example was simple, and doesn’t have much to do with inference yet. “Inference” usually means predicting facts based on a model built for the purpose of prediction. For example, if you want to predict a random variable $y$ from an independent variable $x$, you could use a linear regression:
y ^
  |                      x
  |                 x  x
  |               x x
  |           x  x
  |         x x
  |      x x
  |    x  x
  |  x
  +----------------------------> x
The model in this case would be an equation such as $y = m x + b$. However, such models are not always that easily built. Bayesian inference can be used to build both very simple models such as this linear regression, and more complicated ones. What you should keep in mind is that a big part of statistics is not only finding appropriate models, but also finding the correct parameters of the models. These parameters then allow inference of the modeled dependent variable.
In order to stop the hand-waving, let’s first introduce a few important, but not very complex, definitions:
- We call $X$ a random variable. You will know this from stats 101: it is essentially a placeholder for a measurement, often with an attached assumed distribution, such as a Normal distribution $N(\mu, \sigma)$.
- A sample $X$ is a set of observations of reality, i.e. realizations of a random variable whose distribution we want to find for our model.
- We call $\theta$ the parameter vector. It contains all model parameters, depending on which model we are talking about. For a linear model, for example: $\theta = (m, b)^T$.
- In Bayesian inference, $\theta$ is not viewed as a vector of scalars, but as a vector of distributions. For the linear regression above, we don’t say “$m$ has value $m_0$”; instead, for the purpose of finding the model parameters, we say “$m$ is distributed according to a distribution $m \sim D(\alpha)$”. $\alpha$ is a hyperparameter; it could, for example, be the mean and standard deviation $(\mu, \sigma)$ of a normal distribution, if that is what we assume $m$ to be distributed like. “Hyperparameter” sounds very fancy, but in fact just means that we set those parameters manually instead of attempting to calculate them. It is important to keep in mind that $\alpha$ encodes our belief or assumption about the parameters of our model: in a linear model, this would comprise, for example, the expected spread of $m$ in the system we are trying to model.
Please also make sure that you’ve understood the last paragraph – it is one of the things that took me quite some time to understand: The parameters $\theta(\alpha)$ of a model are probability distributions initially parameterized by $\alpha$; we will see later what initially means.
In terms of the linear model above, our goal is to find the likely distributions of $m$ and $b$ (that’s $\theta$); the hyperparameters $\alpha$ specify our initial guesses for these distributions. For an imaginary linear model, we can write this down using the usual “distributed as” ($\sim$) notation:
$$(1)~m \sim N(\mu = 4, \sigma = 1.2)$$ $$(2)~b \sim N(\mu = 2, \sigma = 0.5)$$ $$(3)~\sigma \sim N(\mu = 1/10, \sigma = 1/15)$$
I use Normal distributions $N(\mu, \sigma)$ just as an example; any probability distribution can be used in their place. We will get to know these parameters more closely in the next paragraphs, and learn that $m$ and its friends are parameterized by the hyperparameters $\alpha$ and called “priors” (because we assume their characteristics prior to looking at samples – we can express our prior knowledge as probability distributions here). The steps that follow will show how a probabilistic programming framework helps us find the most likely actual distributions of the parameters (here $m, b, \sigma$), taking into account both the provided priors as well as the real-world observations we supply to it.
Please also note that the linear model itself is of no particular importance to Bayesian inference. I like using it here, though, because it is well known, and it is clear what it represents and what its parameters are! You can read a more fleshed-out description of one on the PyMC website.
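To make “parameters as distributions” concrete, here is a small numpy sketch (a “prior predictive” view, using the made-up priors (1) and (2) from above): each draw from the priors yields one plausible regression line, before any data has been seen.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1000

# Draw parameter samples from the priors of the imaginary linear model:
# m ~ N(4, 1.2) and b ~ N(2, 0.5)
m = rng.normal(4.0, 1.2, size=n_draws)
b = rng.normal(2.0, 0.5, size=n_draws)

# Each (m, b) pair yields one plausible line y = m*x + b
x = np.linspace(0.0, 1.0, 50)
lines = m[:, None] * x[None, :] + b[:, None]   # shape (1000, 50)

# The spread of these lines at each x visualizes our prior uncertainty
print(lines.mean())  # roughly E[m] * 0.5 + E[b] = 4 over this x range
```

Plotting a few hundred of these lines on top of each other is a quick sanity check that the priors are not absurd before any fitting happens.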
Bayes again
We are now not talking about events anymore – as we did above – but about samples and parameter vectors. We can insert the new terms into the Bayesian law:
$$P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)}$$
Let’s also give names to these terms. My promise to you at this point: If you understand (even just roughly) what these terms stand for, you will have understood the gist of Bayesian inference.
- $P(\theta \mid X)$ is the posterior distribution. It tells us how our model’s parameters are distributed after (Latin: post means after, behind) we take into account our observations (samples) $X$.
- $P(X \mid \theta)$ is the likelihood. To me, personally, this was the crucial concept to understand before I “got” Bayesian inference. “Likelihood” sounds very similar to “probability” (and, in everyday language, is a synonym); in statistics, however, it usually stands for a measure that says “how likely is it that we would have observed the sample $X$ if the true model had parameters $\theta$”. For example, if an observed random variable $A$ is assumed to follow the standard normal distribution, $A \sim N(\mu = 0, \sigma = 1)$, and we observe the sample $X = \{0.123\}$, then the likelihood of that event is $$N(0.123;\ \mu = 0, \sigma = 1) = 0.396 =: L(\{\mu = 0, \sigma = 1\} \mid \{0.123\}).$$ What is special about the likelihood is that by maximizing it for a given sample, we obtain the “best-fit” parameters; the famous least-squares method (minimizing $\chi^2$) is, in many cases, an example of this. I.e., we try to find those parameters that most plausibly explain our observations. Note that Bayesian inference is not only interested in the parameters with the maximum likelihood (like some other methods), but probes the entire parameter space to obtain the probable distributions of the parameters.
- $P(\theta)$ or $P(\theta \mid \alpha)$ is the prior distribution. It expresses what we assume the parameters $\theta$ to be distributed like. If we don’t have an assumption, we can use a flat (or uniform) distribution for this purpose, which encodes the least possible knowledge.
- $P(X)$ or $P(X \mid \alpha)$ is called the marginal likelihood, and is of the least importance for Bayesian inference. Coincidentally, it is also hard to calculate: it is the overall probability of observing our sample $X$, averaged over all parameter values weighted by the prior, $P(X) = \int P(X \mid \theta) P(\theta) \,\mathrm{d}\theta$. But: this is constant for a given observation and model, so we can simplify the problem of finding the posterior distribution to say
$$P(\theta \mid X) \propto P(X \mid \theta) P(\theta)$$
and because we know that the probability distribution must be normalized ($\int P(\theta \mid X) \,\mathrm{d}\theta = 1$), we don’t actually need to know the denominator.
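A hands-on way to see this proportionality at work is a brute-force grid approximation (a toy sketch of mine, with a one-dimensional $\theta$, a made-up sample, and a $N(0, 2)$ prior of my choosing): we evaluate likelihood times prior on a grid and normalize by the sum, never computing $P(X)$ analytically.

```python
import numpy as np
from scipy import stats

# A made-up sample, assumed to follow N(theta, 1) with unknown mean theta
X = np.array([0.8, 1.1, 0.5, 1.4, 0.9])

# A grid over the parameter space of theta
theta = np.linspace(-3.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

# Likelihood P(X | theta): product of densities of the independent observations
likelihood = stats.norm.pdf(X[:, None], loc=theta[None, :], scale=1.0).prod(axis=0)

# Prior P(theta): a wide normal, expressing weak prior knowledge
prior = stats.norm.pdf(theta, loc=0.0, scale=2.0)

# Unnormalized posterior; dividing by its sum plays the role of 1 / P(X)
unnorm = likelihood * prior
posterior = unnorm / (unnorm.sum() * dtheta)

# The posterior mode sits near the sample mean, pulled slightly toward the prior
print(theta[np.argmax(posterior)])
```

This grid trick only works for very low-dimensional $\theta$; it is exactly the limitation that sampling methods (below) get around.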
What?
What does this give us in practice? The main point we should take away is this: the probability distributions for our model parameters $\theta$ (which we are trying to find!) are proportional to the product of the likelihood $P(X \mid \theta)$ (which we can calculate for any given distribution $\theta(\alpha)$ using our observation $X$) and the prior probability $P(\theta)$ (our assumptions about what the model parameters should look like). Keep in mind that the hyperparameters $\alpha$ are implicitly carried everywhere.
I’ve mentioned before that the goal is finding the posterior distribution of parameters $P(\theta \mid X)$. This is where probabilistic programming comes into the picture, as we have to assume that there is usually no closedform solution to the posterior distribution. What probabilistic programming allows us to do: Find the actual distribution $P(\theta \mid X)$ of the model’s parameters.
There is a nice humility to this approach, in my opinion. We don’t say “let’s find the parameters of the model!” and then get a single value per parameter, maybe with an uncertainty bound; instead, we ask “what does the probability distribution of the parameters look like?”, which inherently shows the uncertainty in the statistics we are running.
Probabilistic Programming & Sampling
How do we go from a model and its priors to the estimates of its parameters and their distributions that we (well, mostly I) have talked so much about? The answer is sampling. We have already given up on the closed-form probability distribution. What we do instead is probe the parameter space, that is, the vector space containing $\theta$, and check how the likelihood function responds – i.e., whether a newly selected parameter vector “better explains” our observations, or doesn’t. Over the course of sampling a model, we expect to find distributions of samples for each parameter that are proportional to its individual probability distribution. Said another way: we want many samples for those specific parameter values that explain our observation well, and few samples for those parameter values that don’t explain our observation all that well.
This is easier said than done: there are some simple algorithms (one of which I will describe later on), but making this work in high-dimensional spaces, i.e. for complicated models, requires some deeper magic I cannot claim to understand yet.
But luckily, there are software packages encapsulating this functionality; examples are PyMC3 and Turing.jl. In the following section, I will show an example of how a problem can be modeled and solved using Bayesian inference, using the knowledge I have hopefully managed to convey to you so far.
Applying what we know in PyMC3.
This was a big step for me when I learned about Bayesian inference. I will attempt to explain the purpose and the usage of PyMC (and similar libraries) in the way I wish someone had explained it to me, because it’s certainly not an easy topic. While I want to revisit the cookies from above later, let’s first start with a scenario that is a bit more intuitively modeled in PyMC. After showing it, I will attempt to explain at a high level what PyMC does to solve the problem, and then in the next section show in greater detail (though not precisely) how the sampling and inference work.
I like Normal distributions for demonstrating or imagining statistical problems, because they are neither trivial nor complicated – just in the sweet spot to check assumptions while not going to weird places. Therefore, they will make an appearance in the following models. Usually they are also “good enough”, which is fortunate, as I won’t (be able to) formally show why we should use a Normal distribution.
Scenario: You are an archaeologist.
I read about this in an article on a German gas pipeline, where archaeologists had to do an emergency excavation on a field. The specific item I read about was only tangentially related, but bear with me here: imagine you find silver fragments scattered around, just below the surface, meaning that farmers and agriculture had quite some influence on this treasure. We assume that at some point the treasure was concentrated in one spot (that’s how treasures work), and are now mapping the fragments’ locations. Now you wonder: where was that original location? We could just calculate the mean location of the fragments, but that wouldn’t give us any sense of the uncertainty. Luckily, the scattering is a random process, and the sample (the fragments’ locations) just asks to be put into a Bayesian model!
First, let us “measure” our “locations”. We have the fortunate situation of being able to specify the true distributions of the artifacts, for which we will be using (you guessed it) Normal distributions:
from scipy import stats
# The actual location of the treasure
truex, truey = 1, 5
x_locs = stats.norm(truex, 5).rvs(100)
y_locs = stats.norm(truey, 3).rvs(100)
From this, we find the measured locations (our sample):
Looks realistic, doesn’t it? Now on to modeling. This is where your mind needs to adjust to the paradigm of probabilistic programming: We will specify variables not as values, but as distributions. Our model looks like this:
import pymc3 as pm

# Create a new model
with pm.Model() as locmod:
    # Introduce variables: one each for the x/y location and their respective
    # standard deviations. `Flat` and `HalfFlat` are the least interesting
    # prior distributions: they convey no information, assigning equal
    # probability density to every location. Note that realistically we could
    # put a normal distribution here, with an estimated mean and standard
    # deviation.
    x = pm.Flat('x')
    sx = pm.HalfFlat('sx')
    y = pm.Flat('y')
    sy = pm.HalfFlat('sy')

    # Declare our observations. `observed=` links each variable to the sample,
    # and indicates that this variable is not sampled, but instead used for
    # the likelihood calculation.
    obsx = pm.Normal('obsx', mu=x, sigma=sx, observed=x_locs)
    obsy = pm.Normal('obsy', mu=y, sigma=sy, observed=y_locs)
Let’s look at this snippet in detail. The imports are clear, and the with ... line is just how PyMC3 wants to be used. Inside that block, we introduce probabilistic variables x and y; those represent the distributions of the x/y means of the assumed Normal distributions that the fragment locations follow. We do the same for the respective sigmas, only that we declare them to be positive (the Half in HalfFlat). By using Flat distributions, we spare ourselves having to select hyperparameters $\alpha$, as described in the formal part above.
At the end of the block, we link the probabilistic variables to the observations. Despite looking very similar to the first declarations, the use of observed= tells PyMC3 that these variables are to be used for the likelihood calculation. This essentially means, as I’ve written earlier, “comparing how well the model fits”.
What follows now should be understood in this context: we have distributions x, y, sx, sy serving as the parameter vector $\vec \theta$, which in turn is the input to the likelihood calculation.
Sampling
All that remains now is to sample the model:
with locmod:
    trace = pm.sample(5000)
PyMC3 will now sample the model, and calculate the posterior distributions of the parameters x, y, sx, sy. “Posterior” again means after having taken into account the evidence (the sample). A simple way of understanding this part is the following algorithm:
1. Choose an initial parameter set x, y, sx, sy, and calculate its likelihood using the obsx/obsy information.
2. Sample new random numbers x, y, sx, sy according to the specified distributions. If a previous point was sampled, it can be useful to make the selection of the new point depend on the previous one – for example, by sampling from a Normal distribution centered at the previous point.
3. Calculate the likelihood using the obsx/obsy information, using the sampled parameters as distribution parameters, and compare it with the previously obtained likelihood. If the new likelihood is greater than the previous one, “accept” the new set of parameters.
4. If the new likelihood is less than the previous one, “accept” the new set with a probability equal to the ratio of the new and old likelihoods. I.e., if the new likelihood is 1/2 of the old, accept the new one with a probability of 50%.
5. Record the sample (either the newly accepted one, or the previous one, if the new one was rejected).
6. Go to step 2.
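The steps above can be sketched in a few lines of Python. This is a toy version with a single parameter (the mean mu of normally distributed data with known $\sigma = 1$) and an implicitly flat prior, so the likelihood alone drives the acceptance; all names and numbers here are made up for illustration:

```python
import math
import random

random.seed(1)

# Toy observations, assumed to follow N(mu, 1) with unknown mean mu
data = [0.9, 1.2, 0.7, 1.5, 1.0]

def log_likelihood(mu):
    # Logarithm of the product of N(x; mu, 1) densities (constant factors dropped)
    return -0.5 * sum((x - mu) ** 2 for x in data)

# Step 1: choose an initial parameter and compute its likelihood
mu = 0.0
current = log_likelihood(mu)
samples = []

for _ in range(20_000):
    # Step 2: propose a new point, centered at the previous one
    candidate = mu + random.gauss(0.0, 0.5)
    proposed = log_likelihood(candidate)
    # Steps 3 and 4: always accept a higher likelihood; accept a lower one
    # with probability equal to the likelihood ratio (a difference of logs)
    if math.log(random.random()) < proposed - current:
        mu, current = candidate, proposed
    # Step 5: record the sample (unchanged if the candidate was rejected)
    samples.append(mu)
    # Step 6: loop

posterior_mean = sum(samples) / len(samples)  # close to the data mean
```

Note how the bookkeeping matches steps 3 to 5: on rejection, the previous value of mu is simply recorded again, which is exactly what concentrates samples in high-likelihood regions.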
What will this algorithm lead to? In the best case, it will “walk” around the parameter space that $\theta = (x, y, sx, sy)$ lives in. Each new point depends on the previous point (through steps 2 and 4), forming a chain of points. Over time, the chain will cover the multi-dimensional parameter space, with more samples for those parameters that lead to a higher likelihood; remember: if a new sample leads to a higher likelihood, it is always accepted (so the chain extends into areas of higher likelihood), and a new sample leading to a smaller likelihood is accepted with probability equal to the ratio. Overall, these simple rules lead to two desirable properties:
- Firstly, as mentioned, the areas in parameter space which result in a “good fit” (high likelihood) will have many samples recorded, because here new samples will almost always be accepted. Less likely areas contain fewer samples.
- Less likely regions in parameter space will be reached too, because once in a while the chain will extend into a low-likelihood region; stochastically, other regions may then be reached from there.
Now I can finally lift the covers and tell you the names of these ideas: the algorithm above is called the “Metropolis-Hastings algorithm” (a generalization of the Metropolis algorithm used in statistical physics), and the “chain of samples” idea (or algorithm family) is called MCMC, or “Markov Chain Monte Carlo”. Markov chains also appear in computer science, and encapsulate the idea of a chain of, e.g., samples (here), with each sample depending on the previous one. Markov chains have some nice properties, but the (lack of) depth of my understanding prohibits me from writing more about them here. Now you also see where the MC in PyMC3 stems from!
This can be easily visualized. Assuming our likelihood function depends on parameters $x, y$, and looks like this:
Then a simulated walk with uniform sampling distribution of radius 1 looks like this, for the first 750 steps:
The behavior matches what I’ve described. Apparently the algorithm started in the lower-left corner (low $x$ and $y$), and then walked into the other three areas of high likelihood. The number of samples generally correlates well with the likelihood. You can already see the algorithm’s weakness, though: it barely overcomes the areas of very low likelihood (“+”-shaped, centered), and will not properly cover some areas. Not here, but hopefully in a future post, I will write about more advanced sampling algorithms that allow for better coverage. Another weakness only becomes apparent in higher dimensionality, i.e. with more than just two parameters: there, the “curse of dimensionality” would require extremely high numbers of samples for acceptable coverage.
The code for this simple example can be found on gist.github.com; I wrote it in Julia using Plots.jl.
Back to Archaeology
As a final note on sampling, let’s connect the ideas here to the previous archaeology example. There, we had four parameters instead of two ($\mu_x, \sigma_x, \mu_y, \sigma_y$), but otherwise the application of the sampling shown here is quite straightforward. By drawing samples from the four specified distributions (which were flat, i.e. unspecified; the details of this sampling entail some kind of magic, and we shall show gratitude to the PyMC3 developers for making it so easy for us), and calculating the likelihood (that’s the part with obsx and obsy), PyMC3 walked the parameter space, preferring regions where the evaluated likelihood was high over regions where it was low. Thus, there will be more samples of, e.g., x for those values of x more likely to lead to the observed distribution. I.e., as explained earlier, the number of samples is higher for those parameter values that more likely lead to our observation.
If you’ve understood this example and the previous paragraph, you can be glad and proud, because that is most of what there is to understand. Of course, there are many details that are more complicated than what I have presented (which is why I didn’t present them; I usually like to understand what I write about), but this is the gist. Also note that this example was relatively simple, as it essentially was just about reconstructing (or fitting) two Normal distributions. The same concept applies to much more complicated problems, however.
What is the result of our model? After all, I still haven’t shown you what we get out of it! For this, I wanted to wait until after I had talked a bit about the background, because I found the fancy graphics we obtain sometimes confusing and not that helpful in understanding the ideas behind the framework – they occasionally led me in the wrong direction when trying to think about what was going on.
For one thing, we can plot the trace of our model. In the case of PyMC3, the separate package arviz provides the function plot_trace() for this, to which we simply feed the return value of pm.sample():
This is a direct representation of the walk (on the right) and the sample distributions (on the left). The sample distributions – the marginal posterior distributions of the parameters – are proportional to the number of samples drawn for each specific parameter value. The nice property of Bayesian inference manifests itself here: we have a specific, tractable probability distribution for each parameter. We can also draw samples from these distributions; if you are interested, look for sample_posterior_predictive() in the pymc3 module.
Now you can imagine that in those regions where the parameters’ distributions are rather low, the likelihood recorded was low too. Around the peaks of the distributions, we observed a high likelihood for the fragments’ locations that we wrote down in the fields.
Also take notice of the peaks’ locations. Compare them to our original (“true”) distributions we used to “scatter the fragments”:
$$x \sim N(1, 5)$$ $$y \sim N(5, 3)$$
We find a rather good match:
in which mean/sd are the mean and standard deviation of the respective posterior distribution, and hdi stands for highest density interval – the narrowest interval that contains the true value with the stated probability, according to the posterior. The means mostly match, and are definitely within one $\sigma$ of the true value. Only sy isn’t, and this shows the limits of Bayesian inference (well, of any statistical inference): we can be wrong…
In the trace plot above, you will also have noticed that there are multiple graphs in each figure. This doesn’t have a fundamental meaning; it comes from the fact that PyMC3 by default runs multiple chains (we remember from the Metropolis-Hastings algorithm what a chain is), potentially starting at different points. Here, it doesn’t make a great difference at all, but for the complicated four-mode distribution shown as a toy example above, multiple chains starting in different areas of the support would have drastically improved the coverage of the sampled posterior distribution.
Addendum: log-likelihood
One phrase you may encounter repeatedly is the log-likelihood. It is nothing to be afraid of; as its name says, it’s simply the logarithm of the likelihood function. It is used for numerical reasons, as the likelihood function has a tendency to become very, very small (many distributions fall off exponentially). By using the log-likelihood, we have “handy” values, and can comfortably calculate ratios (which are important, e.g., in the Metropolis-Hastings algorithm) using subtraction:
$$\frac{L(A)}{L(B)} = \exp\left(\log L(A) - \log L(B)\right)$$
Internally, PyMC3 uses the log-likelihood almost everywhere, but this is transparent to us users.
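To see the numerical problem (and the fix) in action, here is a tiny sketch with made-up density values:

```python
import math

# 500 observations whose point densities are each around 0.05
densities = [0.05] * 500

# Multiplying directly underflows to exactly 0.0 in double precision ...
product = math.prod(densities)
print(product)  # 0.0

# ... while the sum of logs remains a perfectly ordinary number
log_l = 500 * math.log(0.05)
print(log_l)  # about -1497.9

# A likelihood ratio becomes a difference of log-likelihoods
other_log_l = log_l + 3.0
ratio = math.exp(log_l - other_log_l)
print(ratio)  # exp(-3), about 0.0498
```

The direct product is exactly zero (far below the smallest representable double), so any acceptance ratio computed from it would be meaningless; the log version has no such problem.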
Phew
That’s it for now! If you’ve managed to follow through this – not necessarily against technical complexity, but the burden of having to read my writing – you should have a good intuitive understanding of Bayesian inference. I can only encourage you to try it out on real-world problems once in a while (try modeling the cookie example, or how to find out whether a sports cards publisher is trying to scam you by putting certain players in their decks much less frequently!), and not to be discouraged by the sometimes-lacking documentation of PyMC3. Alternatively, try out Turing.jl, which is a very similar framework for Julia, or even implement Bayesian sampling from scratch! I’ve done the latter, and for simple problems using the Metropolis-Hastings algorithm, it is quite doable.
References
Most facts here were taken and then digested from various sources, such as en.wikipedia.org, towardsdatascience.com, and the PyMC3 and Turing.jl documentation.
If you have found mistakes, I would be honored (because that means somebody’s reading this) if you could inform me of them discreetly via one or another avenue of contact.