# Statistics

## Population

The entire group which we want to study. Usually, we do not have access to the data for the entirety of this group.

## Sample

A subset of the population. If the sample is *representative*, we can use it to make *inferences* about the population. A lot of classical statistics deals with this subject. There are various strategies to make a sample representative, the simplest of which is drawing it completely at random from the population. If the sample is not representative, we speak of sample *bias*. The rare case where the sample is actually the entire population is called a *census*.
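As a rough illustration of drawing a sample completely at random, and of how a non-random selection can introduce bias, here is a minimal sketch. The population of heights is synthetic, and all names (`population`, `sample`, `biased_sample`, `average`) are assumptions made for this example.

```python
import random

# Hypothetical population: heights in cm of 1000 people (synthetic data).
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(1000)]

# Simple random sample: every member is equally likely to be chosen.
sample = rng.sample(population, 50)

# A deliberately biased sample for comparison: only the 50 tallest people.
biased_sample = sorted(population)[-50:]


def average(values):
    """Plain arithmetic average of a non-empty list."""
    return sum(values) / len(values)


print("population:", average(population))
print("random sample:", average(sample))
print("biased sample:", average(biased_sample))
```

The random sample should typically land close to the population average, while the biased sample overestimates it by construction.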

## Mean

The *mean* is just another name for the average value. So the *sample mean* is simply the average over all data points. The mean of a quantity x is often denoted with a bar:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Here we have assumed that the sample consists of n data points $x_1, \dots, x_n$.
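The sample mean can be computed in a couple of lines. This is a minimal sketch; the function name `sample_mean` and the example heights are made up for illustration.

```python
def sample_mean(xs):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one data point")
    return sum(xs) / n


# Hypothetical heights in centimetres.
heights = [160.0, 170.0, 175.0, 180.0, 165.0]
print(sample_mean(heights))  # 170.0
```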

## Variance

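The *variance* is a measure of *spread* in the data. It is found by averaging over the square of the distance from the mean:

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

Note that "averaging over" is not strictly accurate, as there is n-1 in the denominator rather than n. Intuitively, we cannot estimate spread from one data point alone, so for n = 1 the expression is left undefined (zero divided by zero) rather than giving a misleading value of zero. More precisely, the sample mean is itself computed from the data, which uses up one degree of freedom. Dividing by n-1 (known as *Bessel's correction*) compensates for this and makes the sample variance an unbiased estimate of the population variance.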
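To make the n-1 denominator concrete, here is a small sketch that computes the sample variance by hand and compares it with `statistics.variance` from the Python standard library, which also divides by n-1. The function name `sample_variance` and the data are assumptions for this example.

```python
import statistics


def sample_variance(xs):
    """Sample variance: squared distances from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Hypothetical heights in centimetres.
heights = [160.0, 170.0, 175.0, 180.0, 165.0]
print(sample_variance(heights))      # 62.5
print(statistics.variance(heights))  # 62.5
```

Calling `sample_variance` with a single data point raises an error instead of returning a number, mirroring the fact that spread is not defined for one point.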

## Standard deviation

Variance measures spread, but has the disadvantage of having the units of x squared, which makes interpretation harder. For instance, when measuring the heights of a group of people, a measure of spread that is an area (length squared) is unintuitive. So the *standard deviation*, which is just the square root of the variance, is often used instead:

$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$
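The point about units can be sketched as follows, using hypothetical heights in centimetres. The variable names are illustrative; `statistics.variance` and `statistics.stdev` are standard-library functions that use the n-1 denominator.

```python
import math
import statistics

# Hypothetical heights in centimetres (illustrative values only).
heights_cm = [160.0, 170.0, 175.0, 180.0, 165.0]

variance_cm2 = statistics.variance(heights_cm)  # units: cm squared
std_dev_cm = math.sqrt(variance_cm2)            # units: cm, same as the data

print(variance_cm2)  # 62.5
print(std_dev_cm)    # roughly 7.9
```

A spread of about 7.9 cm can be compared directly with the heights themselves, which is not true of 62.5 cm squared.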

## Covariance

Often, we wish to get an idea of whether a change in one variable goes along with a change in another variable. Imagine we have a sample of n data points for which the variables x and y have both been measured. The *covariance* between the two is then:

$$
\mathrm{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

Note: This means the covariance of a variable with itself is just the variance of that variable.
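A minimal sketch of the sample covariance, including a check that the covariance of a variable with itself equals its variance. The function name `sample_covariance` and the data are assumptions for this example; the n-1 denominator matches the one used for the variance.

```python
def sample_covariance(xs, ys):
    """Sample covariance of paired observations, with n - 1 in the denominator."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print(sample_covariance(x, y))  # 5.0
print(sample_covariance(x, x))  # 2.5, which is the sample variance of x
```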

## Correlation

Often covariance itself is not as interesting as the *correlation*, which can be seen as a normalized version of covariance:

$$
r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x \, s_y}
$$

where $s_x$ and $s_y$ are the standard deviations of x and y. This quantity is sometimes known as the *Pearson correlation coefficient*. With this definition, the correlation is always between -1 and 1. Positive (negative) values mean that when one variable increases, the other tends to increase (decrease). The closer the absolute value is to one, the more pronounced the tendency is. The picture below graphs different datasets and associated correlation values.
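The normalization can be sketched as covariance divided by the product of the two standard deviations. The function name `pearson_correlation` and the small datasets are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if std_x == 0 or std_y == 0:
        # The correlation is undefined when one variable does not vary at all.
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (std_x * std_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly linear, increasing
falling = [10.0, 8.0, 6.0, 4.0, 2.0]  # perfectly linear, decreasing
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]     # increasing tendency with scatter

print(pearson_correlation(x, rising))   # close to 1
print(pearson_correlation(x, falling))  # close to -1
print(pearson_correlation(x, noisy))    # close to 0.8
```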

A few words of caution are in order here!

### Correlation does *not* imply causation

This is often stated, and is very important to remember. For instance, there might be a positive correlation between ice cream sales and the number of drowning accidents. This does not mean that people drown because they have eaten ice cream! There is a third factor, seasonality, which is the cause of both: people eat more ice cream in the summer, and more people go swimming in the summer too, which leads to more drowning accidents. Seasonality in this example is called a *confounder* or *hidden variable*. Causation is generally much harder to detect than correlation.
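The ice cream example can be mimicked with a small simulation. This is only a sketch on synthetic data: every number below is invented, and the names (`temperature`, `ice_cream_sales`, `drownings`, `pearson_correlation`) are assumptions for this example. Both quantities are generated from temperature alone, and neither is computed from the other.

```python
import math
import random


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (n - 1 cancels out)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)

# Confounder: a made-up seasonal temperature curve over one year, plus noise.
temperature = [
    15 + 10 * math.sin(2 * math.pi * day / 365) + rng.gauss(0, 2)
    for day in range(365)
]

# Both quantities depend on temperature only, never on each other.
ice_cream_sales = [20 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

# Strongly positive, even though there is no causal link between the two.
print(pearson_correlation(ice_cream_sales, drownings))
```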

### Correlation only measures linear tendency

Look at the lower row in the figure above. All have zero correlation, but not all are flat lines. This is because correlation only detects linear relationships between the variables. Symmetric patterns in particular, where an increasing part is cancelled by a decreasing part, will have zero correlation. But the two variables are not *independent*. Independence has a precise probability-theoretic definition, but for now, the important thing to note is that while independent variables are uncorrelated, not all uncorrelated variables are independent.
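A small sketch of a symmetric relationship with zero correlation: here y is completely determined by x (y = x squared), so the two are clearly not independent, yet the correlation vanishes. The data and the function name `pearson_correlation` are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences (n - 1 cancels out)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# x is symmetric around zero, and y depends on x exactly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [value ** 2 for value in x]

# The contributions from negative and positive x cancel, giving zero.
print(pearson_correlation(x, y))
```

Knowing x tells us y exactly in this example, which is about as far from independence as two variables can be.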