Linear "Machine Learning"

Sat, Jun 12, 2021 tags: [ julia datascience ]

There is this meme floating around (you can google for it and find many instances) that machine learning is just statistics (and sometimes even just a linear regression). While this is obviously an oversimplification, the techniques for doing either are similar: you can “learn” a linear regression, and a linear regression can look from the outside like a machine learning algorithm (provided the problem is simple enough).

Personally, I’m more fascinated by fundamental techniques in machine learning. While deep learning is all fun and so on, going back to the roots is what really interests me. So I wanted to see how to easily implement a linear model using what can be described as “machine learning”.

What is a linear model?

A linear model is – in the easiest case of one independent and one dependent dimension – an equation like $y = m x + b$. This is often used for evaluating simple experiments; think of a car driving along a street, and us wanting to measure its speed.

A linear model can also depend on multiple independent variables, while still having one dependent variable. In that case, we can write the equation as $y = \vec x^\intercal \vec \beta + \varepsilon$. We have measurements of the independent variables $\vec x$, a parameter vector $\vec \beta$, and some random “disturbance” $\varepsilon$. As we can see, the prediction $y$ still depends on the components of $\vec x$ in a linear fashion – thus “linear model”.

This vector notation may be a bit weird (although it is nice and short); an equivalent way of stating this is $$y = x_1 \beta_1 + x_2 \beta_2 + … + x_N \beta_N + \varepsilon.$$

Note that this is the case for having exactly one measurement; if we have several measurements, we can apply this equation for each measurement, and then see that this is equivalent to a matrix equation $$\vec y = \mathbf X \vec \beta + \vec \varepsilon,$$ where the rows of $\mathbf X$ are the individual measurements, and we obtain not one $y$, but a whole vector of predictions. Note that here we assume that $y$ is a scalar; $y$ may also be a vector, meaning that we obtain more than one “measurement” (or prediction) from a set of input parameters. In that case, $\beta$ is a matrix.

Depending on what we are modelling, there may be a need to express a constant offset (or “bias”). We will show this below; here it can be expressed as an additional vector $+\gamma \vec 1$, which adds the same offset $\gamma$ to each prediction.

In the code further down, we will determine $\gamma$, but ignore $\varepsilon$. The per-measurement bias is in common use in e.g. neural networks, but for naive linear regressions we prefer a constant offset.

So far so good; this is all in Wikipedia, and so on. The essential problem is: given measurements $\mathbf X$, how do we find a good estimate of the parameter vector $\vec \beta$? The answer is conceptually simple: to find a minimum (or maximum) of a function, we can take the derivative (or gradient, for a function of multiple variables) and demand that it be zero. Where the gradient is zero, the function has a “stationary” point, and with some auxiliary techniques we can ensure that the stationary point is a minimum. In the case of “least squares”, which is used for linear models, we look for the minimum of the sum of squared differences $\chi^2 \sim \sum_i (Y_i - (\mathbf X \vec \beta)_i)^2$ for our measurements $Y_i$. The condition is then $\vec\nabla_{\beta} \chi^2 = \vec 0$, and we use the popular technique named “gradient descent” to look for minima, not maxima.

Due to the quadratic structure of what I inaccurately called $\chi^2$, this is fairly easy to calculate, and we can find a precise parameter vector minimizing the least squares sum.
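Concretely, writing the sum of squares in matrix form and taking the gradient with respect to $\vec\beta$ gives $$\vec\nabla_{\beta} \chi^2 = \vec\nabla_{\beta}\, (\vec y - \mathbf X \vec \beta)^\intercal (\vec y - \mathbf X \vec \beta) = -2\, \mathbf X^\intercal (\vec y - \mathbf X \vec \beta).$$ Setting this to $\vec 0$ yields the well-known normal equations, $\vec\beta = (\mathbf X^\intercal \mathbf X)^{-1} \mathbf X^\intercal \vec y$ (assuming $\mathbf X^\intercal \mathbf X$ is invertible); gradient descent finds the same minimum iteratively, without ever forming this inverse.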

^.~*~Machine Learning~*^

Taking derivatives is boring, though, so let’s make the computer do the work (notwithstanding that the derivatives are quite easy to calculate, but anyway…). Also, there is this amazing machine learning library called Flux in Julia, so I really don’t see a reason why we should not use that instead.

But – I tricked you. “Instead”? No – Flux does exactly that. Like most machine learning frameworks, Flux has gradient descent as its core mechanism (gradient? look in the previous section again!). And for gradient descent, we need a gradient first! The coolest thing about Flux (in my opinion) is that it is mostly a library for automatic differentiation, i.e. calculating derivatives of functions efficiently. On top of that, it builds all the abstractions common in ML, like layers and so on.
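As a tiny, self-contained illustration (the example function here is made up, but `Flux.gradient` is the real entry point), Flux can differentiate an ordinary Julia function directly:

```julia
using Flux

# d/dx (3x^2 + 2x) = 6x + 2; at x = 2 that is 14.
Flux.gradient(x -> 3x^2 + 2x, 2.0)
# -> (14.0,)
```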

I’ve kept the following steps deliberately simple. A linear model can be optimized in even shorter code using some more Flux facilities; for that, check out their Getting Started guide, for example.

As we’ve seen above, estimating the parameter vector is nothing but finding the minimum of our sum of squared differences. That sum of squared differences would usually be called “loss function” in machine learning circles: it tells us how “bad” our current estimation is, and is crucially the one function we aim to minimize.

But before waxing on more, let’s see how we can make Flux solve our linear model. First, it is useful to define again what we’re talking about: the linear model. It works with an arbitrary number of parameters; for our example, we’ll start using two parameters, because we can plot it in 3D.

function linear_model(x, p, e; eps=0.3)
  # Note: this is a matrix multiplication.
  r = x * p .+ e
  # Add uniform random noise in [-eps/2, eps/2] to each component.
  r + (rand(size(r)...) .- 1/2) .* eps
end

Here, our eps gives the magnitude of the random perturbations. This is useful for generating “measurements”. e is the “bias”, which corresponds to the y-axis intercept in the single-variable regression; there is one bias per prediction dimension (usually, there is only one prediction anyway). Otherwise, nothing new: this is the formula shown earlier. Next, our “loss function”. Again, a one-to-one translation from the formula above:

function linloss(pred, actual)
  # Sum of squared differences between predictions and measurements.
  sum((actual .- pred) .^ 2)
end

Now, we can “measure” some “stuff”. In an actual problem, we’d have some independent variables xs and measurements depending on these in a linear fashion. Unfortunately, I’m living by the motto a solution looking for a problem, so I am not bothered by the real world here. We set eps > 0 to introduce some randomness to our “measurements”:

# Our "actual" parameters. Note that (as briefly mentioned above), we could also
# have a parameter matrix. Each column then corresponds to an element in the
# "measurement" (prediction) vector `y`.
> params = [2.3, -1.8]
# Our independent variables: for ten measurements, two variables per measurement.
> xs = rand(10, 2) * 10
10×2 Matrix{Float64}:
 4.32756   6.24764
 0.665557  2.0053
 7.09192   8.30752
 1.62726   1.88415
 1.8145    0.740606
 8.11547   1.40716
 4.58845   2.66221
 8.15711   2.00101
 6.17103   0.00396418
 9.20409   7.43563

# Our "measurements", with a random perturbation of +/- 0.5 (eps = 1) and a
# constant offset of 5.
> measurements = linear_model(xs, params, 5, eps=1)
[-1.2855414261531637, -2.51499939089371, 1.7694635849375682, ...]
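As a quick sanity check, setting eps to 0 removes the random perturbation entirely, and the model reduces to the plain matrix product plus offset:

```julia
# With eps = 0 the perturbation term vanishes, so the output is exactly
# the linear formula from above.
exact = linear_model(xs, params, 5, eps=0)
@assert exact == xs * params .+ 5
```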

Now the task is: How do we go from our variables xs and the measurements back to the parameters defined in params? We are fortunate, because we defined params in the first place; but in an actual problem, we wouldn’t be so lucky.

The idea is now simple (and anyone who’s done machine learning before will already be far ahead, or have stopped reading long ago), and has been mentioned several times: minimize the result of linloss(). This is done more easily than I could explain it:

using Flux

# Goal: with measurements and independent variables, find the parameters used
# originally.
# `iters` is the number of steps we're willing to take; `eta` is the "learning
# rate", `loss` and `model` specify the details of our model. We've explained
# those above :)
# `off` enables the constant offset explained above. If it is `false`, the model
# will intersect the origin (0).
function find_params(measurements, xs; iters=5000, eta=4e-4,
		loss=linloss, model=linear_model, off=true)
  # How many parameters do we have? In our case, 2.
  nparams = size(xs)[2]
  # Generate some random parameters. We will start with them.
  init_params = rand(nparams)
  # The constant offset. If β is a matrix, then this would have as many columns as
  # β.
  init_off = off ? rand() : 0
  for i = 1:iters
    # Now, in each step, we take the gradient of the loss function with respect
    # to our parameter vector, `init_params`.
    # This gives us, in intuitive terms, for each parameter a number `g`. That
    # number tells us "if you vary this parameter by x, then the loss function will
    # vary by approximately g*x". We want to minimize the loss function, so we will
    # take this hint and change each parameter in the direction that the loss
    # function will be minimized. This is the famous "gradient descent" approach.
    # Note: we fit with eps=0, so that no random noise is added to the model's
    # own predictions during training.
    grad = Flux.gradient(p -> loss(model(xs, p[1], p[2], eps=0), measurements), (init_params, init_off))

    # Here, we implement that second step: take the feedback from the gradient
    # and adjust our parameters by a little bit. The learning rate `eta` controls
    # how much we do this.
    init_params .-= grad[1][1] .* eta
    init_off -= off ? grad[1][2] * eta : 0
  end
  # Hopefully, our parameters will have converged by now!
  (init_params, init_off)
end

That’s it! Let’s test it.

> inferred_parameters = find_params(measurements, xs, eta=1e-3)
([2.29674, -1.80167], 4.88093)

Yup! It worked! Let’s take a (literal) look at how our predictions do in comparison to the measurements:

Comparison of measurements (orange) and predictions (blue)
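A comparison plot like this one can be produced roughly as follows (a sketch; the exact plotting calls here are my assumption, not necessarily the original ones):

```julia
using Plots

(p, off) = inferred_parameters
# Noise-free predictions from the inferred parameters (eps = 0).
predictions = linear_model(xs, p, off, eps=0)

scatter(1:length(measurements), measurements, label="measurements")
scatter!(1:length(predictions), predictions, label="predictions")
```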

The deviations of $y$ originate in the random perturbations $\varepsilon$ we introduced when generating the data. If you set eps to 0, you will see an almost perfect result, which is of course not very surprising. This generalizes to arbitrarily many variables. Given that the core of the find_params() function is only a handful of lines, I’m still kind of amazed how well the mathematical idea translates to Julia code, and to actually useful results.

The simplest case of a univariate linear regression also shows why this constant offset (which, by the way, I haven’t generally found in the literature on multi-variable linear models) may be desirable. This is also not a problem for our code:

using Plots

# Arbitrary data. Reshape into matrices, because that's what find_params()
# desires.
linxs = reshape([1., 2, 3, 4, 5, 6], 6, 1)
linys = reshape([3., 4, 6, 9, 10, 11], 6, 1)

(m, b) = find_params(linys, linxs)
(m2, b2) = find_params(linys, linxs, off=false, eta=1e-3)

scatter(linxs, linys, label="data")
plot!(linxs, m .* linxs .+ b, xlim=(0,9), ylim=(0, 12), label="regression")
plot!(linxs, m2 .* linxs .+ b2, label="wrong! off = 0", color=:red)

This results in the following regression:

Linear regression with and without offset

If you have actual work to do, you should probably use a proper linear regression library, like Regression, which will provide you with more features that you can rely on for actual problems.