Weighted Means - a Bayesian View

Wed, May 25, 2022 tags: [ julia physics datascience ]

Every student in quantitative sciences – I will argue – sooner or later comes across the weighted mean, a tool to incorporate measurements with different uncertainties into a overall measurement of better uncertainty than any single measurement. This is commonly defined for measurements $x_i$ with uncertainties $\sigma_i$ as the following:

$$\overline x = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$ $$\overline\sigma^2 = \sum_i \frac{1}{\sigma_i^2}$$

It is straight-forward to interpret this as a weighted sum of measurements, with “better” (low-uncertainty/high-precision) measurements receiving a higher weight than less precise ones.

I saw today an interpretation of this in a Bayesian framework; however, not explained in any detail. I am a fan of Bayesian approaches, and wanted to see if this is something tractable. It turns out that it is; even though from a cursory internet search, I didn’t find any direct material on this.

Therefore, I will approach this problem here in a two-pronged way: first, by (amateurish) derivation from straight math; and second, by using a Monte Carlo approach in a Bayesian Inference framework (Turing.ml in Julia).


Using the above definition, we start from $N$ measurements $x_i = x_1, …, x_N$ with associated uncertainties $\sigma_i$. As lazy aspiring physicist, it is reasonable to assume a Gaussian distribution of the true value, i.e. $x_i \sim \mathcal{N}(x_i, \sigma_i)$. The goal is calculating the aggregate measurement $\overline x$ and its uncertainty $\overline \sigma$. Further, we assume that the aggregate measurement is also identical to the true value (this is a dangerous and probably wrong assumption – but it gets us to the right result. Sue me). Further, we use the symmetry of the Gaussian distribution, which we call $$pdf(x; \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

For each measurement, we assume now that

$$x_i \sim \mathcal{N}(\overline x, \sigma_i),$$

i.e., it falls within a $\sigma$-distribution around the true value (symmetry of the distribution!). Now the goal is finding $pdf(x; \overline T, \overline \sigma)$, i.e. the true value’s distribution. Note that we’re entering the Bayesian world by asking for a distribution instead of a point measurement; however, as the point measurement parameterizes a Gaussian distribution, this is somewhat analogous in the end.

Using Bayes’ theorem, we can state

$$p(\overline x \mid x_1, …, x_N) = \frac{p(x_1, …, x_N \mid \overline x) p(\overline x)}{p(x_1, …, x_N)}. ~~(*)$$

The problem is our ignorance of both $p(\overline x)$ and $p(x_1, …, x_N)$, the prior distribution and the normalization factor. The latter is usually inferred at the end (as it is given by the remaining parameters), and can be found by integration; the prior distribution may be known for some cases, but otherwise we have to assume it to be uniform. As a uniform distribution, it will have a certain constant probability density depending on the interval of the uniform distribution; it will turn out not to matter so much, thus we define a constant factor

$$c := \frac{ p(\overline x)}{p(x_1, …, x_N)}$$

The goal of this exercise is inferring $(*)$, which now is reduced to

$$p(\overline x \mid x_1, …, x_N) = c \cdot p(x_1, …, x_N \mid \overline x),$$

and we know that the LHS will be a Gaussian probability density (per assumption), whereas the RHS – likelihood times a constant factor – will be a product of Gaussian densities (assuming independence of measurements from each other):

$$p(\overline x \mid x_1, …, x_N) = c \cdot \prod_i p(x_i \mid \overline x) = c \cdot \prod_i pdf(x_i; \overline x, \sigma_i)$$

for which we fortunately have a closed form

$$p(\overline x \mid x_1, …, x_N)= c \frac{1}{\sqrt{2\pi} \prod_i \sigma_i} \exp\left(- \sum_i \frac{(x_i - \overline x)^2}{2\sigma_i^2}\right)$$

This is our non-normalized conditional distribution for the “true value”. We can attempt to recover its parameters and compare them with the goal set out at the beginning; to look for the maximum of the PDF, we set its derivative w.r.t. $\overline x$ to 0 and solve for $\overline x$ (the Gaussian PDF only has one point in which its derivative is zero, save for $\pm\infty$). This is not really Bayesian, it is just Maximum Likelihood Estimation (MLE) (so I tricked you a bit in the title…).

$$\frac{d}{d\overline x} pdf(\overline x \mid x_1, … x_N) = c \frac{1}{\sqrt{2\pi} \prod_i \sigma_i} \left(\sum_i \frac{(x_i - \overline x)}{\sigma_i^2}\right) \exp\left(-\sum_i \frac{(x_i - \overline x)^2}{2\sigma_i^2}\right) = 0$$

Obviously, only the sum will go to zero at a finite value for $\overline x$, and we simply solve for

$$\sum_i \frac{(x_i - \overline x)}{\sigma_i^2} = 0 \Leftrightarrow \sum_i \frac{x_i}{\sigma_i^2} = \overline x \sum_i \frac{1}{\sigma_i^2}$$

and therefore

$$\overline x = \frac{\sum_i {x_i}/{\sigma_i^2}}{\sum_i {1}/{\sigma_i^2}}.$$

Unfortunately, finding the variance of this distribution is not quite as straight-forward, analytically – which is why I haven’t done it. If you have an idea, I’d love to know! (But we know from above that it is supposed to be $\overline\sigma^2 = \sum_i 1/\sigma_i^2$.)

Monte Carlo

In the second approach, we can check if the presented math was at least empirically correct. For this, we’re using:

using LinearAlgebra
using Statistics
using StatsPlots
using Turing

with a few example measurements and uncertainties:

# True value: 4
meas, uncert = [4.1, 3.8, 4.3, 4.05], [.1, .2, .2, .3]

and a Turing model:

@model function weighted_model(meas::Vector{Tv}, unc::Vector{Ts}) where {Tv <: Real, Ts <: Real}
    m ~ Uniform(0, 10)
    meas ~ MvNormal(fill(m, length(vals)), diagm(unc.^2))

This warrants a little bit of explanation: This model takes in the values and uncertainties as observations, which serve to condition the distribution. m is the mean to be inferred, corresponding to $\overline x$. The first line declares it to conform to a uniform prior between 0 and 10 (our true value is 4!), although the bounds don’t have any special meaning.

The second line is used for the likelihood calculation, which is essential to the inference process: It declares that the measurements are expected to follow a multivariate Gaussian (Normal) distribution with $N$ means m (after all, every measurement is based on the same true value), and associated uncertainties. diagm constructs a covariance matrix from the measurement variances – this is using the symmetry again, as the most intuitive view is that a measurement is surrounded by its uncertainty-defined distribution, and not sampling from the true value’s distribution with the measurement device’s uncertainty; the symmetry makes this interchangeable.

The inference process will sample many different values for m and deliver an estimation of its probability distribution – which is what this is all about.

For comparison, we also calculate the aggregate measurement the old way:

function weighted_mean(val::Vector, std::Vector)
    stdsum = sum(1 ./ std.^2)
    m = sum(v/(std[i]^2) for (i, v) in enumerate(val))/stdsum
    s = sqrt(1/stdsum)
    m, s

weighted_mean(meas, uncert)
# (4.0810344827586205, 0.07878385971583354)

Now for the fun part, the Monte Carlo sampling; this is fairly trivial:

# Sample model with NUTS algorithm, using 1000 samples.
chain = sample(weighted_model(meas, uncert), NUTS(), 3000)

which results in

Density plot

    Chains MCMC chain (3000×13×1 Array{Float64, 3}):

    Iterations        = 1001:1:4000
    Number of chains  = 1
    Samples per chain = 3000
    Wall duration     = 0.24 seconds
    Compute duration  = 0.24 seconds
    parameters        = m
    internals         = lp, n_steps, is_accept, acceptance_rate, log_density, [...]

    Summary Statistics
      parameters      mean       std   naive_se      mcse         ess      rhat    ⋯
          Symbol   Float64   Float64    Float64   Float64     Float64   Float64    ⋯

               m    4.0828    0.0801     0.0015    0.0023   1142.9760    0.9998    ⋯
                                                                    1 column omitted

      parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
          Symbol   Float64   Float64   Float64   Float64   Float64 

               m    3.9261    4.0282    4.0815    4.1367    4.2369

Check out the “Summary Statistics” table: the mean and standard deviation for m, the “true value”, agree quite well with the explicit calculation above! You have to believe me (or try yourself) when I say that the agreement persists across different measurements.


I’m happy that I managed to translate this quite simple problem into the Bayesian domain, and that the results worked out well. The problem from classical statistics doesn’t translate very obviously into a Bayesian concept, but with a few assumptions this hasn’t been difficult at all. I’m always glad to use new tools (like Turing.ml) even for extremely simple applications like this one, where most of them are still very elegant and straight-forward to use. In this vein, I hope that you as reader also enjoy this kind of article, which doesn’t require any significant amount of prior knowledge.

Please let me know the different ways I have probably messed up in this article; I’m always looking forward to feedback.