Why do we use Standard Deviation and is it Right?

It’s a fundamental question with knock-on effects for every algorithm used within data science. What’s interesting is that there is a history here: people haven’t always used variance and standard deviation as the de facto measures of spread. But first, what is it?

Standard Deviation

The Standard Deviation is used throughout statistics and data science as a measure of “spread” or “dispersion” of a feature. The standard deviation of a population is:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^{N}{ (x_i - \mu )^2 }} {N} }$$

Where $\mu$ is the mean of the population and $N$ is the total number of observations in the population. Let’s run through an example.

Assume you have the following array of values:

[6 4 6 9 4 4 9 7 3 6]

Then the mean is:

μ = 5.8

Then calculating each step of the equation above:

Deviations from the mean: [ 0.2 -1.8  0.2  3.2 -1.8 -1.8  3.2  1.2 -2.8  0.2]
Squared deviations from the mean: [  0.04   3.24   0.04  10.24   3.24   3.24  10.24   1.44   7.84   0.04]
Sum of squared deviations from the mean: 39.6
Mean of squared deviations from the mean: 3.96

Which results in:

σ = 1.98997487421
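The steps above can be reproduced in a few lines of NumPy (a sketch; the array is the example population from above):

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])

mu = x.mean()                    # mean of the population: 5.8
deviations = x - mu              # deviations from the mean
squared = deviations ** 2        # squared deviations
sigma = np.sqrt(squared.mean())  # population standard deviation

print(mu, squared.sum(), sigma)  # 5.8, 39.6, ~1.98997
```

Note that `np.std(x)` computes the same population standard deviation directly (its default `ddof=0` divides by $N$, not $N-1$).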

Why Squared Differences?

You might be asking: why square the differences, only to take the square root later on?

Intuitively, you can think of the squaring as taking the extreme values into account. If we were to sum the absolute deviations rather than the squared deviations, the measure of “spread” would be dominated by the most common values and would virtually ignore outliers.

Let’s repeat that calculation again, but this time we won’t perform the square:

Deviations from the mean: [ 0.2 -1.8  0.2  3.2 -1.8 -1.8  3.2  1.2 -2.8  0.2]
Absolute deviations from the mean: [ 0.2  1.8  0.2  3.2  1.8  1.8  3.2  1.2  2.8  0.2]
Average of absolute deviations from the mean: 1.64
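The same calculation in NumPy, using the example population from above:

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])

# average of the absolute deviations from the mean
mad = np.abs(x - x.mean()).mean()
print(mad)  # ~1.64
```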

This technique has a name.

Mean Absolute Deviation

The Mean Absolute Deviation (MAD - a great acronym) measures the average spread of each observation from the mean. Its formula is:

$$MAD = \frac{\sum_{i=1}^{N}{ |x_i - \mu| }} {N}$$

Now recall that for the population provided at the start, σ ≈ 2.0 and MAD ≈ 1.6. That’s a significant difference, considering that the population was generated with a notional standard deviation of 3.

The reason is that the MAD treats every observation equally: observations far from the mean count exactly the same as those close to it. The standard deviation squares the differences, so observations far from the mean have a much greater effect on the final value of σ.

Why Use Standard Deviation at All?

Here’s the thing. There are two very important properties of the variance (that’s just $\sigma^2$).

First, the squared term appears directly in the Gaussian probability density: the variance is one of the distribution’s two parameters. I won’t go into the mathematics at this point, but suffice to say that it is the natural metric for describing the spread of a Normal distribution.

Secondly, the square is continuously differentiable. The absolute value function is not: it has a kink at zero where the derivative is undefined. This is extremely important in all sorts of optimisation problems encountered during Data Science.
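A quick sketch of that kink: the gradient of a squared residual passes smoothly through zero, while the gradient of an absolute residual jumps from −1 to +1.

```python
import numpy as np

r = np.array([-1.0, -0.5, 0.5, 1.0])  # residuals either side of zero

grad_sq = 2 * r        # gradient of r**2: smooth through zero
grad_abs = np.sign(r)  # gradient of |r|: jumps at zero

print(grad_sq)   # [-2. -1.  1.  2.]
print(grad_abs)  # [-1. -1.  1.  1.]
```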

So When Shouldn’t you use Standard Deviation?

It comes back to the earlier point: values far from the mean that don’t truly represent your data are known as outliers.

If you include outliers in the standard deviation calculation, they will exaggerate the spread; the result will be far greater than the true standard deviation of the population.
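To illustrate, here is a hypothetical experiment: append a single bad reading of 100 to the earlier example population and compare how much it inflates σ versus the MAD.

```python
import numpy as np

clean = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])
with_outlier = np.append(clean, 100)  # one hypothetical bad reading

def sigma(a):
    return np.sqrt(((a - a.mean()) ** 2).mean())

def mad(a):
    return np.abs(a - a.mean()).mean()

# the single outlier inflates sigma proportionally more than the MAD
print(sigma(clean), sigma(with_outlier))  # ~1.99 vs ~27
print(mad(clean), mad(with_outlier))      # ~1.64 vs ~16
```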

So when you are choosing how to optimise your models, you’ll get the option of using the L1 or L2 Norm.

The L1 Norm is simply the MAD equation without the averaging division by $N$. It is also known as the taxicab Norm, since its geometric interpretation is the distance a car must travel through a square-block city. Use it when you know you have outliers (the L1 Norm is less sensitive to them) or when your data isn’t Normally distributed.

The L2 Norm, a.k.a. the Euclidean Norm (Pythagoras’ Theorem in geometric terms), is the same as the standard deviation, except that we again leave out the averaging division by $N$. Use the L2 Norm when your data has no outliers and is Normally distributed.
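Both norms are available via `np.linalg.norm`, and the relationship to MAD and σ is just the averaging factor, as sketched below on the earlier example population:

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])
residuals = x - x.mean()

l1 = np.linalg.norm(residuals, ord=1)  # sum of absolute deviations: 16.4
l2 = np.linalg.norm(residuals, ord=2)  # sqrt of sum of squared deviations

# dividing L1 by N recovers the MAD;
# dividing L2 by sqrt(N) recovers sigma
print(l1 / len(x), l2 / np.sqrt(len(x)))  # ~1.64, ~1.99
```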

Also note that the vast majority of algorithms used within Data Science use the Mean and the L2 Norm somewhere within their implementation. This implies an assumption that your data is Normally distributed; it probably isn’t!

Further Reading

[1]: Revisiting a 90-year-old debate: the advantages of the mean deviation