It’s a fundamental question, and it has knock-on effects for every algorithm used within data science. What is interesting is that there is a history here: people haven’t always used variance and standard deviation as the de facto measures of spread. But first, what is the standard deviation?

The Standard Deviation is used throughout statistics and data science as a measure of “spread” or “dispersion” of a feature. The standard deviation of a population is:

$$\sigma = \sqrt{ \frac{\sum_{i=1}^{N}{ (x_i - \mu )^2 }} {N} }$$

Where $\mu$ is the mean of the population and $N$ is the total number of observations in the population. Let’s run through an example.

Assume you have the following array of values:

```
[6 4 6 9 4 4 9 7 3 6]
```

The mean is:

```
μ = 5.8
```

Then, calculating each step of the equation above:

```
Deviations from the mean: [ 0.2 -1.8 0.2 3.2 -1.8 -1.8 3.2 1.2 -2.8 0.2]
Squared deviations from the mean: [ 0.04 3.24 0.04 10.24 3.24 3.24 10.24 1.44 7.84 0.04]
Sum of squared deviations from the mean: 39.6
Mean of squared deviations from the mean: 3.96
```

Which results in:

```
σ = 1.98997487421
```
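The whole calculation above can be reproduced in a few lines of NumPy, using the example array from the start:

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])

mu = x.mean()                            # 5.8
deviations = x - mu                      # deviations from the mean
squared = deviations ** 2                # squared deviations
sigma = np.sqrt(squared.sum() / len(x))  # population standard deviation

print(sigma)                             # 1.98997487421...
assert np.isclose(sigma, np.std(x))      # np.std defaults to the same population formula
```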

You might be asking: why do we square the differences, only to take the square root later on?

Intuitively, you can think of the squaring as taking the extreme values into account. If we were to sum the absolute deviations rather than the squared deviations, the measure of “spread” would be dominated by the values that are most common. It would virtually ignore outliers.

Let’s repeat that calculation again, but this time we won’t perform the square:

```
Deviations from the mean: [ 0.2 -1.8 0.2 3.2 -1.8 -1.8 3.2 1.2 -2.8 0.2]
Absolute deviations from the mean: [ 0.2 1.8 0.2 3.2 1.8 1.8 3.2 1.2 2.8 0.2]
Average of absolute deviations from the mean: 1.64
```

This technique has a name.

The Mean Absolute Deviation (MAD - a great acronym) measures the average spread of each observation from the mean. Its formula is:

$$MAD = \frac{\sum_{i=0}^{i=N}{ |x_i - \mu| }} {N}$$
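
Sketched in the same NumPy style, again with the example array:

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])
mad = np.abs(x - x.mean()).mean()  # mean absolute deviation from the mean
print(mad)                         # 1.64
```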

Now recall that for the population provided at the start, σ ≈ 2.0 and MAD ≈ 1.6. That’s a significant difference, considering that the population was created with a notional standard deviation of 3.

The reason is that the squaring introduces a form of *weighting*. The MAD treats observations
that are farther away from the mean in the same way as those close to the mean. For the
standard deviation, we square the difference, so observations far from the mean have a much
greater effect on the final value of σ.

Here’s the thing. There are two very important properties of the variance (that’s just $\sigma^2$).

First, the squared term arises naturally in the Gaussian probability density. I won’t go into
the mathematics at this point, but suffice to say that the variance is one of the two parameters
that completely define a Normal distribution, which makes it *the* natural metric for describing
its spread.

Secondly, the square is continuously differentiable. The absolute value is not: its derivative
has a discontinuity at zero. This is extremely important in all sorts of *optimisation* problems
encountered during Data Science.
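
To make the contrast concrete, compare the derivatives of the two penalty terms:

$$\frac{d}{dx}\,x^2 = 2x, \qquad \frac{d}{dx}\,|x| = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \\ \text{undefined} & x = 0 \end{cases}$$

The squared term has a smooth gradient everywhere, while the gradient of the absolute value jumps at zero, which gradient-based optimisers have to work around.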

It comes back to the earlier point. Values far away from the mean that don’t truly represent
your data are known as *outliers*.

If you include outliers in the standard deviation calculation, they will exaggerate the standard deviation. The result will be far greater than the true standard deviation of the population.
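To see this, here’s a small sketch that appends a single hypothetical outlier (the value 50) to the example array and compares the two measures:

```python
import numpy as np

def spread(data):
    """Return (population standard deviation, mean absolute deviation)."""
    return data.std(), np.abs(data - data.mean()).mean()

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])
x_out = np.append(x, 50)  # a single hypothetical outlier

sigma, mad = spread(x)
sigma_out, mad_out = spread(x_out)

print(f"clean:        sigma = {sigma:.2f}, MAD = {mad:.2f}")
print(f"with outlier: sigma = {sigma_out:.2f}, MAD = {mad_out:.2f}")
```

The single outlier inflates σ by a larger factor than it inflates the MAD, because its deviation is squared rather than taken in absolute value.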

So when you are choosing how to optimise your models, you’ll get the option of using the *L1* or
*L2* Norm.

The L1 Norm is simply the MAD equation without the averaging division by $N$. It is also known as the taxicab norm, since its geometric interpretation is the distance a car has to travel through a square-block city. Use it when you know you have outliers, since the L1 Norm is less sensitive to them, or when your data isn’t Normally distributed.

The L2 Norm, a.k.a. Euclidean Norm, a.k.a. Pythagoras’ Theorem, is the same as the Standard Deviation, except we leave out the averaging division by $N$ again. Use the L2 Norm when your data doesn’t have outliers and your data is Normally distributed.
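Both norms are available directly via `np.linalg.norm`; here is a quick sketch on the residuals from the example array:

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])
residuals = x - x.mean()

l1 = np.abs(residuals).sum()          # L1 / taxicab norm: sum of absolute residuals
l2 = np.sqrt((residuals ** 2).sum())  # L2 / Euclidean norm: root of summed squares

# np.linalg.norm gives the same values via its ord argument
assert np.isclose(l1, np.linalg.norm(residuals, 1))
assert np.isclose(l2, np.linalg.norm(residuals, 2))
print(l1, l2)  # 16.4 and ~6.29
```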

Also note that the vast majority of algorithms used within Data Science use the Mean and the L2 Norm somewhere within their implementation. This implies an assumption that your data is Normally distributed, and it probably isn’t!
