Intro to Statistics
Statistics is all about data, and it is generally divided into two categories:
Descriptive statistics – Imagine we have a bunch of data and want to say something about it. Without handing people all of the data, can we describe it with a smaller set of numbers?
Inferential statistics – Once we have built our toolkit of descriptive statistics, we can start to make inferences about the data (e.g. draw conclusions or make judgments).
So descriptive statistics focus on describing the visible characteristics of a dataset (a population or a sample), while inferential statistics focus on making predictions or generalizations about a larger dataset based on a sample of that data.
Measuring center in Quantitative Data
What is the average?
The average is an attempt to represent the values in a dataset with a single number. We call this number the ‘average’, ‘typical’, ‘middle’ or ‘center’ value, so it is a measure of central tendency.
Some ways to get the average or measure of central tendency:
- Arithmetic mean – the sum of all values divided by the number of values
- Median – the middle value (after we sort the data points)
- Mode – the most common value
- Midrange – the arithmetic mean of the highest and lowest value
The formal way to write the population mean, where N is the number of data points in the population, is:
[math]mean=\mu=\frac{\sum_{i=1}^{N}x_{i}}{N}[/math]
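As a minimal sketch, the four measures listed above can be computed with Python's standard `statistics` module. The `values` list is made up purely for illustration and does not come from the original material.

```python
import statistics

# Hypothetical values chosen only for illustration (not from the text).
values = [2, 3, 3, 5, 7]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value
midrange = (min(values) + max(values)) / 2  # mean of the lowest and highest value

print(mean, median, mode, midrange)
```

The midrange has no dedicated function in the `statistics` module, so it is computed by hand here.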
Which one to choose: mean or median?
The median is a good choice when we have outliers. Imagine we have the following dataset, representing the scores of a team.
70, 72, 74, 76, 80, 114
Here the median (75) describes the team's scores better, because the mean (81) is higher than all but one of the scores in the dataset.
We can also think of the mean as the balancing point. It is another way of saying that the total distance from the mean to each data point that is below the mean is equal to the total distance from the mean to each data point that is above the mean.
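To make the balancing-point idea concrete, here is a small sketch that uses the team scores above. It computes the mean and median, then adds up the distances from the mean on each side. The variable names are only illustrative.

```python
import statistics

scores = [70, 72, 74, 76, 80, 114]

mean = statistics.mean(scores)
median = statistics.median(scores)

# Total distance from the mean for the points below it and the points above it.
below = sum(mean - x for x in scores if x < mean)
above = sum(x - mean for x in scores if x > mean)

print("mean:", mean, "median:", median)
print("distance below:", below, "distance above:", above)
```

Both totals come out to 33: the five scores below the mean together balance the single score of 114 above it.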
Measures of spread: Range, Variance & Standard Deviation
Apart from the central tendency, we also check the measures of spread in a dataset to understand it better.
Let’s look at the two datasets below to see why such measures are needed.
Data set 1: -10, 0, 10, 20, 30
Data set 2: 8, 9, 10, 11, 12
Mean of set 1: 10
Mean of set 2: 10
However,
Range of set 1: 40
Range of set 2: 4
The two means are identical, so the mean alone cannot tell these sets apart. The range is a good measure of dispersion here: it shows that set 1 is far more spread out than set 2.
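The same comparison can be sketched in a few lines of Python. The helper name `value_range` is just an illustrative choice.

```python
def value_range(data):
    """Range: the difference between the largest and the smallest value."""
    return max(data) - min(data)

set_1 = [-10, 0, 10, 20, 30]
set_2 = [8, 9, 10, 11, 12]

for name, data in (("Data set 1", set_1), ("Data set 2", set_2)):
    mean = sum(data) / len(data)
    print(name, "mean:", mean, "range:", value_range(data))
```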
A question with range and median
Imagine 11 firefighters save cats, and we have a record of how many cats each of them saved. We know that the median number of cats saved is 6 and the range is 4. Is the following statement true, false, or impossible to determine?
“At least one of the firefighters saved more than 10 cats.”
It is false. If the median is 6, the minimum must be less than or equal to 6. Since the range is 4, the maximum equals the minimum plus 4, so it can be at most 10. No firefighter can have saved more than 10 cats.
Interquartile range (IQR)
The IQR describes the middle 50% of values when ordered from lowest to highest.
To find the interquartile range (IQR), we find the median (middle value) of the lower and the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
It’s a measure of spread and the formula would be as simple as below.
IQR = Q3 – Q1
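Below is a minimal sketch of the median-of-halves procedure described above. It assumes that, when the number of values is odd, the overall median is left out of both halves. Other quartile conventions exist and can give slightly different results. The function names and the sample values are illustrative only.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median is
    excluded from both halves. Needs at least two values.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median position
    upper = ordered[len(ordered) - half:]   # values above the median position
    return statistics.median(lower), statistics.median(upper)

def iqr(data):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = quartiles(data)
    return q3 - q1

# Hypothetical values chosen only for illustration.
values = [1, 3, 4, 6, 8, 9, 11]
print(quartiles(values), iqr(values))
```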
But range is not always going to tell us the whole picture. We might have two datasets with the same range, while still having different distributions of their values. This leads us to the terms of variance and standard deviation.
Variance and Standard Deviation of a Population
A quick note: Here we’ll assume that we are looking at a population. In other words, we are not sampling, nor taking a subset.
Variance of a population
Population variance is a measure of how spread out a group of data points is. Specifically, it quantifies the average squared deviation from the mean. So, if all data points are very close to the mean, the variance will be small; if data points are spread out over a wide range, the variance will be larger.
[math]population\:variance=\sigma^{2}=\frac{\sum_{i=1}^{N}(x_{i}-\mu)^{2}}{N}[/math]
µ (mu) refers to the population mean, and N is the number of data points in the population
To get the variance, we take each data point and find its difference from the mean. Then we square those differences and take the average of the squares.
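Those steps can be sketched directly in Python. The example reuses data set 1 from earlier (-10, 0, 10, 20, 30) and treats it as a whole population. The result is cross-checked against `statistics.pvariance` from the standard library.

```python
import statistics

population = [-10, 0, 10, 20, 30]

mu = sum(population) / len(population)               # population mean
squared_devs = [(x - mu) ** 2 for x in population]   # squared differences from the mean
variance = sum(squared_devs) / len(population)       # average of those squares

print(mu, squared_devs, variance)

# The standard library computes the same population variance.
print(statistics.pvariance(population))
```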
Population standard deviation
The population standard deviation is a measure of how much variation there is among individual data points in a population. It’s a way of quantifying how spread out the data is from its mean. A small standard deviation means that the data points are generally close to the mean, while a large standard deviation means that the data is more dispersed.
[math]SD=\sigma=\sqrt{\frac{\sum|x-\mu|^{2}}{N}}[/math]
Σ (sigma) means sum of
x is a value in the dataset
μ is the population mean
N is the number of data points in the population
Note that the standard deviation can never be negative, because it is the square root of an average of squared values.
Variance and Standard Deviation
Variance gives us a number that is hard to interpret. For example, if we are dealing with units, like distances in meters, we get variance in terms of meters squared. It’s an odd set of units if we’re trying to visualize how dispersed the data is around the mean.
So people like to talk in terms of standard deviation, which is just the square root of the variance.
[math]population\:standard\:deviation=\sqrt{variance}=\sqrt{\sigma^{2}}=\sigma[/math]
To get a better understanding, let’s compare the below datasets side by side.
Measure | Dataset A | Dataset B |
Values | -10, 0, 10, 20, 30 | 8, 9, 10, 11, 12 |
Mean | 10 | 10 |
Range | 40 | 4 |
Variance | 200 | 2 |
Standard deviation | 10√2 ≈ 14.14 | √2 ≈ 1.41 |
Using standard deviation makes a little more sense now: dataset A’s standard deviation is 10 times that of dataset B.
Let’s elaborate on this a bit more:
Both have a mean of 10, but in set B, 9 is only 1 away from 10, while in set A, 0 is 10 away from 10. The standard deviation gives a much better sense of how far away we are from the mean on average.
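As a rough check of this comparison, the sketch below recomputes the mean, range, population variance and population standard deviation for both datasets. The standard deviations come out to about 14.14 (10√2) and 1.41 (√2). The `summarize` helper is a hypothetical name used only here.

```python
import math
import statistics

dataset_a = [-10, 0, 10, 20, 30]
dataset_b = [8, 9, 10, 11, 12]

def summarize(data):
    """Return mean, range, population variance and population standard deviation."""
    variance = statistics.pvariance(data)
    return {
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "variance": variance,
        "std_dev": math.sqrt(variance),  # standard deviation = square root of variance
    }

summary_a = summarize(dataset_a)
summary_b = summarize(dataset_b)

print(summary_a)
print(summary_b)
# Ratio of the two standard deviations (about 10).
print(summary_a["std_dev"] / summary_b["std_dev"])
```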
Why do we square and then take the square root?
We could have done other things, like taking the absolute value of each difference. We do it this way because the result has neat statistical properties that we can build on later. Hopefully, the example above makes this a little clearer.
Which measure to use for central tendency and the spread?
Imagine we have the following dataset, representing the salaries in thousands.
35, 50, 50, 50, 56, 60, 60, 75, 250
Descriptive stats would be as follows:
Mean: ~76.2
Std Dev: ~62.3
Median: 56
IQR: 17.5
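These figures can be reproduced with a short sketch. It assumes the standard deviation is the population version (`statistics.pstdev`), which is what matches the ~62.3 above. It also assumes the quartiles are the medians of the lower and upper halves, with the overall median excluded.

```python
import statistics

salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]  # in thousands

mean = statistics.mean(salaries)
std_dev = statistics.pstdev(salaries)   # population standard deviation
median = statistics.median(salaries)

# Quartiles as medians of the lower and upper halves. The overall median
# is excluded because the number of values is odd.
ordered = sorted(salaries)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[-half:])
iqr = q3 - q1

print(round(mean, 1), round(std_dev, 1), median, iqr)
```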
Which measure of central tendency is better?
If we pick the mean, our measure of central tendency is higher than all of the data points except one, because the data is skewed significantly by the last value (250), which is an outlier.
In such cases, the median is much more robust and a better indicator of central tendency.
What about the spread?
Well, the standard deviation is based on the mean. Our outlier will skew the standard deviation as well. That’s why the interquartile range is more robust.
We can also look at it this way: even if the outlier were 250,000 instead of 250, the median and IQR would remain the same, and they would still be better representatives of the central tendency and the spread.
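A quick way to see this robustness is to swap the outlier for a far larger value and recompute everything. In the sketch below, 250,000 is an arbitrary stand-in for "a much bigger outlier", and `describe` is a hypothetical helper. The quartiles again use the median-of-halves convention.

```python
import statistics

def describe(data):
    """Return (mean, population std dev, median, IQR) for a list of numbers."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return (statistics.mean(ordered), statistics.pstdev(ordered),
            statistics.median(ordered), q3 - q1)

original = [35, 50, 50, 50, 56, 60, 60, 75, 250]
extreme = original[:-1] + [250_000]   # same data with a far larger outlier

print(describe(original))
print(describe(extreme))
```

The mean and standard deviation change dramatically between the two runs, while the median (56) and the IQR (17.5) do not move.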
That’s why we mostly use the median when we work with data such as home prices or salaries: in most cases those datasets are skewed.
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Khan Academy’s Statistics and Probability series.