Weighted Means - a Bayesian View
Every student in quantitative sciences – I will argue – sooner or later comes across the weighted mean, a tool to incorporate measurements with different uncertainties into an overall measurement with a smaller uncertainty than any single measurement. It is commonly defined for measurements $x_i$ with uncertainties $\sigma_i$ as follows:
$$\overline x = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$ $$\overline\sigma^2 = \left(\sum_i \frac{1}{\sigma_i^2}\right)^{-1}$$
It is straightforward to interpret this as a weighted sum of measurements, with “better” (low-uncertainty, high-precision) measurements receiving a higher weight than less precise ones.
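As a quick worked example with two made-up measurements, say $x_1 = 4.0 \pm 0.1$ and $x_2 = 4.2 \pm 0.2$:
$$\overline x = \frac{4.0/0.1^2 + 4.2/0.2^2}{1/0.1^2 + 1/0.2^2} = \frac{400 + 105}{125} = 4.04, \qquad \overline\sigma = \sqrt{\tfrac{1}{125}} \approx 0.089.$$
The more precise measurement pulls the result toward its value, and the combined uncertainty is smaller than either individual one.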
I saw today an interpretation of this in a Bayesian framework; however, it was not explained in any detail. I am a fan of Bayesian approaches, and wanted to see if this is something tractable. It turns out that it is, even though a cursory internet search didn’t turn up any direct material on this.
Therefore, I will approach this problem here in a two-pronged way: first, by an (amateurish) derivation from straight math; and second, by using a Monte Carlo approach in a Bayesian inference framework (Turing.jl in Julia).
Derivation
Using the above definition, we start from $N$ measurements $x_1, …, x_N$ with associated uncertainties $\sigma_i$. As lazy aspiring physicists, it is reasonable to assume that each measurement is Gaussian-distributed around the true value. The goal is calculating the aggregate measurement $\overline x$ and its uncertainty $\overline \sigma$. Further, we assume that the aggregate measurement is identical to the true value (this is a dangerous and probably wrong assumption – but it gets us to the right result. Sue me). We will also use the symmetry of the Gaussian distribution, whose density we call $$pdf(x; \mu, \sigma) := \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
For each measurement, we assume now that
$$x_i \sim \mathcal{N}(\overline x, \sigma_i),$$
i.e., it falls within a $\sigma_i$-wide distribution around the true value (symmetry of the distribution!). Now the goal is finding $pdf(x; \overline x, \overline \sigma)$, i.e. the true value’s distribution. Note that we’re entering the Bayesian world by asking for a distribution instead of a point measurement; however, as the point measurement parameterizes a Gaussian distribution, this is somewhat analogous in the end.
Using Bayes’ theorem, we can state
$$p(\overline x \mid x_1, …, x_N) = \frac{p(x_1, …, x_N \mid \overline x) p(\overline x)}{p(x_1, …, x_N)}. ~~(*)$$
The problem is our ignorance of both $p(\overline x)$ and $p(x_1, …, x_N)$, the prior distribution and the normalization factor. The latter is usually recovered at the end (it is fixed by the remaining quantities) and can be found by integration; the prior may be known in some cases, but otherwise we have to assume it to be uniform. A uniform prior has a constant probability density, determined by its interval; it will turn out not to matter much, so we define a constant factor
$$c := \frac{ p(\overline x)}{p(x_1, …, x_N)}$$
The goal of this exercise is inferring $(*)$, which now is reduced to
$$p(\overline x \mid x_1, …, x_N) = c \cdot p(x_1, …, x_N \mid \overline x),$$
and we know that the LHS will be a Gaussian probability density (per assumption), whereas the RHS – likelihood times a constant factor – will be a product of Gaussian densities (assuming independence of measurements from each other):
$$p(\overline x \mid x_1, …, x_N) = c \cdot \prod_i p(x_i \mid \overline x) = c \cdot \prod_i pdf(x_i; \overline x, \sigma_i)$$
for which we fortunately have a closed form
$$p(\overline x \mid x_1, …, x_N)= c\, \frac{1}{(2\pi)^{N/2} \prod_i \sigma_i} \exp\left(- \sum_i \frac{(x_i - \overline x)^2}{2\sigma_i^2}\right)$$
This is our non-normalized conditional distribution for the “true value”. We can attempt to recover its parameters and compare them with the goal set out at the beginning; to look for the maximum of the PDF, we set its derivative w.r.t. $\overline x$ to 0 and solve for $\overline x$ (as a function of $\overline x$, this density has only a single stationary point, apart from the derivative vanishing towards $\pm\infty$). This is not really Bayesian – with a flat prior, maximizing the posterior is just Maximum Likelihood Estimation (MLE) – so I tricked you a bit in the title…
$$\frac{d}{d\overline x} p(\overline x \mid x_1, …, x_N) = c\, \frac{1}{(2\pi)^{N/2} \prod_i \sigma_i} \left(\sum_i \frac{x_i - \overline x}{\sigma_i^2}\right) \exp\left(-\sum_i \frac{(x_i - \overline x)^2}{2\sigma_i^2}\right) = 0$$
Obviously, only the sum will go to zero at a finite value for $\overline x$, and we simply solve for
$$\sum_i \frac{(x_i - \overline x)}{\sigma_i^2} = 0 \Leftrightarrow \sum_i \frac{x_i}{\sigma_i^2} = \overline x \sum_i \frac{1}{\sigma_i^2}$$
and therefore
$$\overline x = \frac{\sum_i {x_i}/{\sigma_i^2}}{\sum_i {1}/{\sigma_i^2}}.$$
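Before moving on, a quick numerical plausibility check (a sketch with a handful of invented example measurements): maximizing the log of the unnormalized posterior on a fine grid should land on the same value as the closed-form weighted mean.
# Sketch: grid-maximize the log of the unnormalized posterior and compare
# with the closed-form weighted mean; the values are invented for illustration.
xs, ss = [4.1, 3.8, 4.3, 4.05], [0.1, 0.2, 0.2, 0.3]
logpost(m) = -sum((xs .- m).^2 ./ (2 .* ss.^2))
grid = 3.5:1e-4:4.5
grid[argmax(logpost.(grid))]          # ≈ 4.081
sum(xs ./ ss.^2) / sum(1 ./ ss.^2)    # ≈ 4.081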
Unfortunately, finding the variance of this distribution analytically is not quite as straightforward; it takes a small amount of extra algebra, sketched below. (We know from the definition at the top that it should come out as $\overline\sigma^2 = \left(\sum_i 1/\sigma_i^2\right)^{-1}$.)
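Collecting the terms in the exponent that depend on $\overline x$ and completing the square (everything independent of $\overline x$ is absorbed into the normalization constant) gives
$$-\sum_i \frac{(x_i - \overline x)^2}{2\sigma_i^2} = -\frac{1}{2}\left(\sum_i \frac{1}{\sigma_i^2}\right)\left(\overline x - \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}\right)^2 + \mathrm{const},$$
so as a function of $\overline x$ the posterior is itself a Gaussian, with mean equal to the weighted mean from above and variance $\overline\sigma^2 = \left(\sum_i 1/\sigma_i^2\right)^{-1}$.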
Monte Carlo
In the second approach, we check whether the math presented above is at least empirically correct. For this, we’re using:
using LinearAlgebra
using Statistics
using StatsPlots
using Turing
with a few example measurements and uncertainties:
# True value: 4
meas, uncert = [4.1, 3.8, 4.3, 4.05], [.1, .2, .2, .3]
and a Turing model:
@model function weighted_model(meas::Vector{Tv}, unc::Vector{Ts}) where {Tv <: Real, Ts <: Real}
    # Prior for the aggregate ("true") value.
    m ~ Uniform(0, 10)
    # Likelihood: every measurement scatters around the same m with its own uncertainty.
    meas ~ MvNormal(fill(m, length(meas)), diagm(unc.^2))
end
This warrants a little bit of explanation: the model takes the measured values and their uncertainties as observations, which serve to condition the distribution. m is the mean to be inferred, corresponding to $\overline x$. The first line declares a uniform prior for it between 0 and 10 (our true value is 4!), although the bounds don’t have any special meaning. The second line provides the likelihood, which is essential to the inference process: it declares that the measurements are expected to follow a multivariate Gaussian (normal) distribution whose $N$ means are all equal to m (after all, every measurement is based on the same true value), with the associated uncertainties. diagm constructs a diagonal covariance matrix from the measurement variances – this uses the symmetry again, as the most intuitive view is that a measurement is surrounded by its uncertainty-defined distribution, not that it is sampled from the true value’s distribution with the measurement device’s uncertainty; the symmetry makes the two views interchangeable.
The inference process will sample many different values for m and deliver an estimate of its probability distribution – which is what this is all about.
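As an aside, because the covariance matrix is diagonal, the likelihood factorizes, and the same model could equivalently be written with one univariate Normal per measurement – a quick sketch (weighted_model_loop is just an illustrative name):
# Sketch of an equivalent formulation: a diagonal covariance means independent
# measurements, so one univariate Normal per data point gives the same model.
@model function weighted_model_loop(meas::Vector{<:Real}, unc::Vector{<:Real})
    m ~ Uniform(0, 10)
    for i in eachindex(meas)
        meas[i] ~ Normal(m, unc[i])   # Normal takes a standard deviation
    end
end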
For comparison, we also calculate the aggregate measurement the old way:
function weighted_mean(val::Vector, std::Vector)
    # Sum of inverse variances, i.e. the total weight.
    stdsum = sum(1 ./ std.^2)
    # Weighted mean and its standard deviation.
    m = sum(val ./ std.^2) / stdsum
    s = sqrt(1/stdsum)
    m, s
end
weighted_mean(meas, uncert)
# (4.0810344827586205, 0.07878385971583354)
Now for the fun part, the Monte Carlo sampling; this is fairly trivial:
# Sample the model with the NUTS algorithm, drawing 3000 samples.
chain = sample(weighted_model(meas, uncert), NUTS(), 3000)
which results in
Chains MCMC chain (3000×13×1 Array{Float64, 3}):
Iterations = 1001:1:4000
Number of chains = 1
Samples per chain = 3000
Wall duration = 0.24 seconds
Compute duration = 0.24 seconds
parameters = m
internals = lp, n_steps, is_accept, acceptance_rate, log_density, [...]
Summary Statistics
parameters mean std naive_se mcse ess rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
m 4.0828 0.0801 0.0015 0.0023 1142.9760 0.9998 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
m 3.9261 4.0282 4.0815 4.1367 4.2369
Check out the “Summary Statistics” table: the mean and standard deviation for m, the “true value”, agree quite well with the explicit calculation above! You have to believe me (or try yourself) when I say that the agreement persists across different measurements.
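To make the comparison explicit, the posterior mean and standard deviation can also be pulled directly out of the chain and set side by side with the closed-form result; a small sketch (the numbers in the comments are the rounded values from above):
# Sketch: extract the posterior mean/std of m and compare them with the
# closed-form weighted mean; plot(chain) is where StatsPlots comes in.
mean(chain[:m]), std(chain[:m])   # ≈ (4.083, 0.080)
weighted_mean(meas, uncert)       # ≈ (4.081, 0.079)
plot(chain)                       # trace and posterior density of m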
Conclusion
I’m happy that I managed to translate this quite simple problem into the Bayesian domain, and that the results worked out well. The problem from classical statistics doesn’t translate very obviously into a Bayesian concept, but with a few assumptions it wasn’t difficult at all. I’m always glad to use new tools (like Turing.jl) even for extremely simple applications like this one, where they remain elegant and straightforward to use. In this vein, I hope that you as a reader also enjoy this kind of article, which doesn’t require any significant amount of prior knowledge.
Please let me know the different ways I have probably messed up in this article; I’m always looking forward to feedback.