There are several sources of error that can affect the accuracy of machine learning models, including bias and variance. A fundamental concept in machine learning is the bias-variance tradeoff. This article discusses what's meant by bias and variance and how trading them off against one another can affect model accuracy.

Essentially, predictive machine learning involves trying to learn a function $f$ that captures the systematic information about a response variable $Y$ contained in a set of features $X$. The function is fit by minimising some cost function, i.e. the goal is to express $y$ as a function of $x$ plus an error term (which we want to minimise): $y = f(x) + \epsilon$. This typically involves splitting the data into a training set and a test set, fitting functions, or models, on the training set and then determining accuracy on the test set.

Predictive accuracy is very important in machine learning; it's the raison d'être of supervised learning. If we consider that real world datasets are a combination of signal, which we want to capture in our function, and noise, which we want to ignore, then the necessity of determining accuracy on different data from that which we used to fit the function becomes clear. It's also why functions that have low training error may show much higher test error.
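To make the gap between training and test error concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the "true" function `true_f`, the noise level and the sample sizes are hypothetical and are not taken from any dataset discussed in this article. A degree-9 polynomial has ten coefficients, so it can pass through all ten training points and ends up fitting the noise as well as the signal.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical signal, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def sample(n, sigma=0.3):
    # Draw n observations of signal plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, true_f(x) + rng.normal(0.0, sigma, n)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

# Ten coefficients for ten points: the fit can interpolate the training set.
model = Polynomial.fit(x_train, y_train, deg=9)

train_mse = mse(y_train, model(x_train))
test_mse = mse(y_test, model(x_test))
print(f"train MSE = {train_mse:.6f}, test MSE = {test_mse:.6f}")
```

The training error should come out close to zero, while the test error cannot drop much below the noise variance and is usually far larger, because the fitted curve has chased the noise in the training sample.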

### Decomposition of Error

A commonly used cost function is mean squared error (MSE). It represents the mean of the squared differences between observed values and estimates i.e. $MSE = E[(y-\hat{f}(x))^2]$. Below is a simplified decomposition of MSE of the test set into its constituent parts.
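As a quick illustration of the definition, the sketch below computes MSE for three made-up observations and estimates. The helper name `mean_squared_error` and the numbers are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    # Mean of the squared differences between observed values and estimates.
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

y_obs = [3.0, 5.0, 7.0]   # observed values (made up)
y_est = [2.0, 5.0, 9.0]   # estimates (made up)

# Squared differences are 1, 0 and 4, so the mean is 5/3.
mse_value = mean_squared_error(y_obs, y_est)
print(mse_value)
```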

$\mathbf{MSE=E[(y-\hat{f}(x))^2]}$

$\mathbf{=E[y^2-2y\hat{f}(x)+\hat{f}(x)^2]}$ since $(a-b)^2= a^2-2ab+b^2$

$\mathbf{=E[y^2]-E[2y\hat{f}(x)]+E[\hat{f}(x)^2]}$ because of linearity of expectation for random variables

$\mathbf{=Var[y]+E[y]^2-E[2y\hat{f}(x)]+Var[\hat{f}(x)]+E[\hat{f}(x)]^2}$ since $E[Z^2]=Var[Z]+E[Z]^2$ for any random variable $Z$

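$\mathbf{=Var[y]+E[y]^2+Var[\hat{f}(x)]+E[\hat{f}(x)]^2-2E[y]E[\hat{f}(x)]}$ because $E[y\hat{f}(x)]=E[y]E[\hat{f}(x)]$ when $y$ and $\hat{f}(x)$ are independent, as is the case when $y$ is a test observation that was not used to fit $\hat{f}$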

$\mathbf{=Var[y]+f(x)^2+Var[\hat{f}(x)]+E[\hat{f}(x)]^2-2f(x)E[\hat{f}(x)]}$ since $E[y]=E[f(x)+\epsilon]=f(x)$, where $f(x)$ is the true, or best, function that $\hat{f}(x)$ is estimating and $E[\epsilon]$ is zero.

$\mathbf{=Var[y]+Var[\hat{f}(x)]+f(x)^2-2f(x)E[\hat{f}(x)]+E[\hat{f}(x)]^2}$

$\mathbf{=Var[y]+\color{red}{Var[\hat{f}(x)]}\color{black}{+}\color{blue}{(f(x)-E[\hat{f}(x)])^2}}$

Thus we can see that MSE decomposes into irreducible error, variance and squared bias. $Var[y]$ above, which equals $\sigma^2$ (the variance of $\epsilon$, since $f(x)$ is fixed), corresponds to the irreducible error in the data, the part in red is the variance and the part in blue is the squared bias.
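The decomposition can be checked numerically. The sketch below is a simulation under assumed settings, not part of the original analysis: the true function `true_f`, the noise level `sigma`, the evaluation point `x0` and the sample sizes are all hypothetical. It repeatedly draws a fresh training set, fits a straight line by least squares, predicts at `x0`, and compares the test MSE at that point with the sum of irreducible error, variance and squared bias.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical true function; a straight line cannot match it exactly.
    return x ** 2

sigma = 0.5          # noise standard deviation, so irreducible error is 0.25
x0 = 0.8             # point at which the error is decomposed
n_train = 30         # size of each simulated training set
n_repeats = 20000    # number of simulated training sets

# One row per simulated training set.
x = rng.uniform(-1.0, 1.0, size=(n_repeats, n_train))
y = true_f(x) + rng.normal(0.0, sigma, size=x.shape)

# Closed-form simple linear regression, fitted row by row.
x_c = x - x.mean(axis=1, keepdims=True)
y_c = y - y.mean(axis=1, keepdims=True)
slope = (x_c * y_c).sum(axis=1) / (x_c ** 2).sum(axis=1)
intercept = y.mean(axis=1) - slope * x.mean(axis=1)
preds = intercept + slope * x0          # one prediction at x0 per training set

# Independent test observations at x0, never used for fitting.
y0 = true_f(x0) + rng.normal(0.0, sigma, size=n_repeats)

mse = np.mean((y0 - preds) ** 2)
noise = sigma ** 2
variance = np.var(preds)
bias_sq = (true_f(x0) - preds.mean()) ** 2

print(f"MSE                       = {mse:.4f}")
print(f"noise + variance + bias^2 = {noise + variance + bias_sq:.4f}")
```

The two printed values should agree up to simulation error. Note that the test observations are drawn independently of the training sets, which is the independence the derivation relies on when it splits the expectation of the product.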

##### Bias = $f(X)-E[\hat{f}(X)]$

As can be seen from the above, the bias of an estimator is the difference between its expected value and the true value of the quantity it is estimating. Essentially, bias is the error resulting from trying to approximate what may be a complex function using a simpler model.

##### Variance = $E[(\hat{f}(X)-E[\hat{f}(X)])^2]$

Variance is the expectation of the squared difference between the estimated value and expected value. It represents the extent to which the model is affected by changes in the data or how dependent the model is on a particular realisation of the data set. It’s a measure of how sensitive the estimator function is to changes in the data set.

One way to think about the bias-variance tradeoff is as the fundamental tension that exists between avoiding underfitting the data while at the same time taking care not to overfit the data. Let’s take a simplified example to demonstrate the concepts.

Consider the data set above. If it looks familiar, that's because I took it from a previous article on clustering, where it represents the first two principal components of the Acidosis dataset. Principal components analysis produces linearly uncorrelated components; however, in the case of this dataset it is apparent that there is some sort of curvilinear relationship between the components. In fact, if you run a second order polynomial regression on the data you will get something like the fit indicated by the line in the diagram above.

Principal components analysis produces orthogonal, and therefore uncorrelated, components, so a simple linear regression of one component on the other will produce a flat line (i.e. a slope of zero), as below.
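The flat line is easy to reproduce. The sketch below uses a synthetic stand-in, not the Acidosis data: two made-up features driven by a common variable `t` with a curved relationship. It computes principal component scores with an SVD, regresses the second component on the first, and compares the result with a second order polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-feature data with curved structure (illustrative only).
t = rng.uniform(-2.0, 2.0, 300)
raw = np.column_stack([
    t + 0.5 * t ** 2 + rng.normal(0.0, 0.2, t.size),
    t ** 2 - 0.3 * t + rng.normal(0.0, 0.2, t.size),
])

# PCA via SVD of the centred data; the scores are the projections
# of the centred data onto the principal axes.
centred = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt.T
pc1, pc2 = scores[:, 0], scores[:, 1]

# Simple linear regression of PC2 on PC1.
slope, intercept = np.polyfit(pc1, pc2, deg=1)
linear_mse = float(np.mean((pc2 - (slope * pc1 + intercept)) ** 2))

# Second order polynomial regression on the same scores.
quad_coefs = np.polyfit(pc1, pc2, deg=2)
quad_mse = float(np.mean((pc2 - np.polyval(quad_coefs, pc1)) ** 2))

print(f"linear slope  = {slope:.2e}")
print(f"linear MSE    = {linear_mse:.4f}")
print(f"quadratic MSE = {quad_mse:.4f}")
```

The slope is zero up to floating point error because the component scores are uncorrelated by construction, yet the quadratic fit can still reduce the error, which shows that uncorrelated does not mean unrelated.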

Linear regression assumes a linear relationship between the variables and so is a relatively inflexible learning method. In this case it performs badly as there is no simple linear relationship.

This is an extreme example because the data set is the result of principal components analysis. Often, even though variables may not be linearly related, linear regression can still be a useful tool since it simplifies the problem of estimating a function down to estimating a set of parameters. Results of linear regression will also be more interpretable than those from highly non-linear methods, so if interpretability is important, some accuracy might be sacrificed.

Linear regression can often do reasonably well even when working with data where the true function is somewhat non-linear, since variance will usually be low. However, if the true function is very non-linear (and remember we don't know the true function), then using a linear model will result in high bias, since we are assuming a functional form that does not fit the true one. Even with a large amount of training data, we will likely underfit the data.

So if using a simpler model like linear regression can result in high bias, maybe we should use a more flexible learning method, so that we can be sure of being able to fit a function to the data? Well, with a large enough training set and a non-parametric learning method, it's true that we can fit a function almost perfectly to the training set. The difficulty with such highly flexible methods, however, is that they may fit too closely to the training set and therefore won't perform well on new data. In other words, they show high variance.
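A small sketch of this behaviour uses k-nearest-neighbour regression, a non-parametric method I have picked for illustration; the article does not name a specific one. The data, the function `true_f` and the helper `knn_predict` are all hypothetical. With $k=1$ every training point is its own nearest neighbour, so the training error is exactly zero, but each prediction rests on a single noisy observation.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    # Hypothetical signal, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def sample(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, n)
    return x, true_f(x) + rng.normal(0.0, sigma, n)

def knn_predict(x_train, y_train, x_query, k):
    # Average the responses of the k training points closest to each query.
    dist = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = sample(100)
x_test, y_test = sample(1000)

results = {}
for k in (1, 10):
    train_mse = mse(y_train, knn_predict(x_train, y_train, x_train, k))
    test_mse = mse(y_test, knn_predict(x_train, y_train, x_test, k))
    results[k] = (train_mse, test_mse)
    print(f"k={k:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Averaging over more neighbours gives up the perfect training fit, but typically lowers the test error, because each prediction depends less on the noise in any one training observation.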

What is needed is a model that is accurate enough and generalisable enough. Essentially predictive machine learning is about balancing accuracy and generalisability by trading off the two sources of reducible error – bias and variance.

All the concepts discussed so far are illustrated in the diagram below, which plots model complexity against MSE. As model complexity increases, training error decreases, while test error decreases to a point and then starts to increase. Variance increases with model complexity while bias decreases. The point at which test error is lowest is the point at which the sum of squared bias and variance is minimised.
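The shape of these curves can be reproduced with a short simulation. This is a sketch under assumed settings rather than the data behind the diagram: polynomial degree stands in for model complexity, and the function `true_f`, noise level and sample sizes are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)

def true_f(x):
    # Hypothetical signal, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def sample(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, n)
    return x, true_f(x) + rng.normal(0.0, sigma, n)

x_train, y_train = sample(30)
x_test, y_test = sample(2000)

degrees = list(range(1, 11))   # polynomial degree as a proxy for complexity
train_errors, test_errors = [], []
for d in degrees:
    # Least-squares polynomial fit; a fixed domain keeps the fits comparable.
    model = Polynomial.fit(x_train, y_train, deg=d, domain=[0.0, 1.0])
    train_errors.append(float(np.mean((y_train - model(x_train)) ** 2)))
    test_errors.append(float(np.mean((y_test - model(x_test)) ** 2)))

best_degree = degrees[int(np.argmin(test_errors))]
for d, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {d:2d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
print(f"lowest test MSE at degree {best_degree}")
```

Training error can only fall as the degree rises, since each polynomial family contains the previous one. Test error typically falls at first as bias shrinks, then rises again as variance takes over, although exactly where it turns depends on the noise level and the amount of training data.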
