The entire group which we want to study. Usually, we do not have access to the data for the entirety of this group.
A subset of the population. If the sample is representative, we can use it to make inferences about the population. A lot of classical statistics deals with this subject. There are various strategies to make a sample representative, the simplest of which is drawing it completely at random from the population. If the sample is not representative, we speak of sample bias. The rare case where the sample is actually the entire population is called a census.
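To make the difference between a random and a biased sample concrete, here is a minimal sketch in Python. The population is synthetic (made-up heights drawn from a normal distribution), and the names `population`, `random_sample` and `biased_sample` are purely illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 synthetic heights in cm.
population = [rng.gauss(170.0, 10.0) for _ in range(10_000)]

# Simple random sample: every member is equally likely to be drawn.
random_sample = rng.sample(population, 100)

# A deliberately biased sample: only the 100 tallest members.
biased_sample = sorted(population)[-100:]

def mean(xs):
    """Average of a non-empty list of numbers."""
    return sum(xs) / len(xs)

population_mean = mean(population)
random_sample_mean = mean(random_sample)
biased_sample_mean = mean(biased_sample)
```

Under these assumptions the random sample's mean should land close to the population mean, while the biased sample's mean is far too high.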
The mean is just another name for the average value. So the sample mean is simply the average over all data points. The mean of a quantity x is often denoted with a bar:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Here we have assumed that the sample consists of n data points.
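As a small illustration, the sample mean could be computed as follows. The function name `sample_mean` and the height values are made up for this sketch.

```python
def sample_mean(xs):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    n = len(xs)
    if n == 0:
        raise ValueError("sample_mean needs at least one data point")
    return sum(xs) / n

heights_cm = [170.0, 165.0, 180.0, 175.0]  # made-up illustration data
mean_height = sample_mean(heights_cm)      # (170 + 165 + 180 + 175) / 4 = 172.5
```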
The variance is a measure of spread in the data. It is found by averaging the squared distances from the mean:
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
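Note that "averaging" is not technically accurate, as there is n-1 in the denominator rather than n. The reason is essentially that the mean is itself estimated from the same data, which uses up one degree of freedom (this is known as Bessel's correction). Intuitively, we cannot estimate the spread in data from one data point alone: with n = 1 the formula gives 0/0, which is undefined, so we need at least two points.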
Variance measures spread, but has the disadvantage of having the units of x squared. This makes interpretation harder. For instance, measuring the heights of a group of people and having the measure of spread be an area (length squared) seems unintuitive. So often, the standard deviation - which is just the square root of the variance - is used instead:
$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
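The sketch below computes the sample variance and standard deviation with the n-1 denominator and compares them against Python's standard `statistics` module. The helper names `sample_variance` and `sample_std` and the data are assumptions made for illustration.

```python
import math
import statistics

def sample_variance(xs):
    """Sample variance with n-1 in the denominator."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points to estimate spread")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

heights_cm = [170.0, 165.0, 180.0, 175.0]  # made-up illustration data
variance_cm2 = sample_variance(heights_cm)  # units: cm squared
std_cm = sample_std(heights_cm)             # units: cm, same as the data

# statistics.variance and statistics.stdev also use the n-1 denominator.
matches_stdlib = (
    math.isclose(variance_cm2, statistics.variance(heights_cm))
    and math.isclose(std_cm, statistics.stdev(heights_cm))
)
```

Note how the standard deviation comes out in centimetres, which is easier to relate to the heights themselves than a value in square centimetres.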
Often, we wish to get an idea of whether a change in one variable goes along with a change in another variable. Imagine we have a sample of n data points where the variables x and y have both been measured. The covariance between the two is then:
$$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Note: This means the covariance of a variable with itself is just the variance of the variable.
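A minimal sketch of the sample covariance, again assuming the n-1 denominator used for the variance. The function name `sample_covariance` and the data are illustrative. The last lines check the note above: the covariance of a variable with itself equals its variance.

```python
import math
import statistics

def sample_covariance(xs, ys):
    """Sample covariance of paired data, with n-1 in the denominator."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]  # made-up illustration data
y = [2.0, 4.0, 5.0, 9.0]

cov_xy = sample_covariance(x, y)
cov_xx = sample_covariance(x, x)

# The covariance of x with itself should equal the sample variance of x.
cov_xx_is_variance = math.isclose(cov_xx, statistics.variance(x))
```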
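Often covariance itself is not as interesting as the correlation, which can be seen as a normalized version of covariance:
$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x \, s_y}$$
where s_x and s_y are the standard deviations of x and y.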
This quantity is sometimes known as the Pearson correlation coefficient. With this definition, the correlation is always between -1 and 1. Positive (negative) values mean that as one variable increases, the other generally increases (decreases). The closer the absolute value is to one, the more pronounced the tendency is. The picture below graphs different datasets and associated correlation values.
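For illustration, the correlation coefficient could be computed like this. The function name `pearson_correlation` and the data are assumptions for this sketch. The n-1 factors in the covariance and the two standard deviations cancel, so they do not appear in the code.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0]              # made-up illustration data
y_up = [2.0 * v + 1.0 for v in x]     # perfectly linear, increasing
y_down = [-v for v in x]              # perfectly linear, decreasing

r_up = pearson_correlation(x, y_up)      # expected to be 1 (up to rounding)
r_down = pearson_correlation(x, y_down)  # expected to be -1 (up to rounding)
```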
A few words of caution are in order here!
Correlation does not imply causation
This is often stated, and is very important to remember. For instance, there might be a positive correlation between ice cream sales and the number of drowning accidents. This does not mean that people drown because they have eaten ice cream! There is a third factor - seasonality - which is the cause of both: people eat more ice cream in the summer, and more people go swimming in the summer too, which means more drowning accidents. Seasonality in this example is called a confounder or hidden variable. Causation is generally much harder to detect than correlation.
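The ice cream example can be mimicked with a small simulation. Everything here is a made-up toy model, not real data: a hypothetical `temperature` variable stands in for seasonality and drives both quantities, while neither quantity appears in the other's formula. The coefficients and noise levels are arbitrary assumptions.

```python
import math
import random

def pearson_correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Toy generating model: temperature (the confounder) drives both series.
temperature = [rng.uniform(0.0, 30.0) for _ in range(1000)]
ice_cream_sales = [10.0 * t + rng.gauss(0.0, 20.0) for t in temperature]
drownings = [0.1 * t + rng.gauss(0.0, 0.5) for t in temperature]

# Strongly positive, although ice cream never enters the drowning formula.
r_raw = pearson_correlation(ice_cream_sales, drownings)

# Subtract the temperature effect (known here by construction) from both.
ice_cream_rest = [s - 10.0 * t for s, t in zip(ice_cream_sales, temperature)]
drownings_rest = [d - 0.1 * t for d, t in zip(drownings, temperature)]

# What remains is independent noise, so this should be close to zero.
r_adjusted = pearson_correlation(ice_cream_rest, drownings_rest)
```

In real data the confounder's effect is not known in advance, which is part of why causation is so much harder to establish than correlation.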
Correlation only measures linear tendency
Look at the lower row in the figure above. All datasets have zero correlation, but not all are flat lines. This is because correlation only detects linear relationships between the variables. So symmetric patterns in particular will have zero correlation, even though the two variables are not independent. Independence has a precise probability-theoretic definition, but for now, the important thing to note is that while independent variables are uncorrelated, not all uncorrelated variables are independent.
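A tiny example of a symmetric relationship with zero correlation: below, y is completely determined by x (y = x squared), so the variables are clearly not independent, yet the correlation vanishes. The data and the helper name `pearson_correlation` are chosen for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around zero
y = [v ** 2 for v in x]          # y depends entirely on x: [4, 1, 0, 1, 4]

# Positive and negative contributions cancel exactly: -4 + 1 + 0 - 1 + 4 = 0.
r = pearson_correlation(x, y)
```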