Welcome! This workshop is from TrainingDataScience.com. Sign up to receive more free workshops, training and videos.

This workshop is about two fundamental measures of data. I want you to start thinking about how you can best describe or summarise data. How can we take a set of data and describe it in as few numbers as possible? These numbers are called *summary statistics* because they summarise statistical data. In other words, this is your first model!

```
import numpy as np
```

The *mean*, also known as the average, is a measure of the central tendency of the data. If you had to pick a single value to represent a set of data, the mean is often the best choice.

The mean is calculated as:

$$\mu = \frac{\sum_{i=0}^{N-1}{ x_i }} {N}$$

The sum of all observations divided by the number of observations.

```
x = [6, 4, 6, 9, 4, 4, 9, 7, 3, 6]
```

```
N = len(x)
x_sum = 0
for i in range(N):
    x_sum = x_sum + x[i]
mu = x_sum / N
print("μ =", mu)
```

```
μ = 5.8
```

Of course, we should be using libraries to reduce the amount of code we have to write. For low-level numerical tasks such as this, the most common library is NumPy.

We can rewrite the above as:

```
N = len(x)
x_sum = np.sum(x)
mu = x_sum / N
print("μ =", mu)
```

```
μ = 5.8
```

We can take this even further and just use NumPy’s implementation of the mean:

```
print("μ =", np.mean(x))
```

```
μ = 5.8
```

The mean alone doesn’t provide enough information to describe our data. It tells us what value we should observe on average, but the observations could lie within ± 1 or ± 100 of that value. (± is shorthand for “plus or minus”, i.e. “could be greater than or less than this value”.)

To provide this information we need a measure of “spread” around the mean. The most common measure of “spread” is the *standard deviation*.
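To see why the mean alone isn’t enough, consider two made-up datasets (illustrative values, not the workshop data) that share the same mean but have very different spreads:

```
import numpy as np

a = np.array([5, 6, 5, 6, 5, 6])     # tightly clustered around 5.5
b = np.array([0, 11, 1, 10, 0, 11])  # widely scattered around 5.5

# Both have the same mean...
print("mean a =", a.mean(), " mean b =", b.mean())
# ...but very different standard deviations
print("std a  =", a.std(), " std b  =", b.std())
```

The mean can’t tell these two datasets apart; the standard deviation can.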

Read more about the standard deviation at TrainingDataScience.com: Why do we use Standard Deviation and is it Right?

The standard deviation of a population is:

$$\sigma = \sqrt{ \frac{\sum_{i=0}^{N-1}{ (x_i - \mu )^2 }} {N} }$$

```
x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])  # a NumPy array, so we can do element-wise arithmetic below
```

```
N = len(x)
mu = np.mean(x)
print("μ =", mu)
```

```
μ = 5.8
```

```
print("Deviations from the mean:", x - mu)
print("Squared deviations from the mean:", (x - mu)**2)
print("Sum of squared deviations from the mean:", ((x - mu)**2).sum() )
print("Mean of squared deviations from the mean:", ((x - mu)**2).sum() / N )
```

```
Deviations from the mean: [ 0.2 -1.8 0.2 3.2 -1.8 -1.8 3.2 1.2 -2.8 0.2]
Squared deviations from the mean: [ 0.04 3.24 0.04 10.24 3.24 3.24 10.24 1.44 7.84 0.04]
Sum of squared deviations from the mean: 39.6
Mean of squared deviations from the mean: 3.96
```

```
print("σ =", np.sqrt(((x - mu)**2).sum() / N ))
```

```
σ = 1.98997487421
```

Again, we don’t need to code this all up. The NumPy equivalent is:

```
print("σ =", np.std(x))
```

```
σ = 1.98997487421
```
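One detail worth knowing: `np.std` uses the population formula above by default, dividing by N. If your data is a *sample* drawn from a larger population, passing `ddof=1` applies Bessel’s correction and divides by N − 1 instead:

```
import numpy as np

x = [6, 4, 6, 9, 4, 4, 9, 7, 3, 6]

# Default: population standard deviation (divide by N)
print("population σ =", np.std(x))
# Sample standard deviation (divide by N - 1)
print("sample s     =", np.std(x, ddof=1))
```

With only ten observations the two values differ noticeably; as N grows, the difference shrinks.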

You knew there’d be a catch, right? ;-)

I didn’t mention it at the start, but these two measures of central tendency and spread are specific to a very special kind of data.

If the observations are *distributed* in a particular way, then these metrics perfectly *model* the underlying data. If not, then these metrics can be misleading.
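Here’s a quick hypothetical illustration (randomly generated data, not the workshop’s) of how the mean can mislead when data is heavily skewed:

```
import numpy as np

rng = np.random.default_rng(0)
# A heavily skewed dataset, e.g. something income-like
skewed = rng.exponential(scale=1.0, size=10000)

# The long tail pulls the mean above the bulk of the observations,
# so it is a poor "typical value" here; the median sits lower.
print("mean   =", skewed.mean())
print("median =", np.median(skewed))
```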

You probably said “huh?” to a few of those new words, so let’s go through them.