In this study, I’ll assume that I am a data professional working for the Department of Education of a large nation.
SciPy is an open-source software library for solving mathematical, scientific, engineering, and technical problems. It lets us manipulate and visualize data with a wide range of Python commands, and its scipy.stats module is designed specifically for statistics.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
education_districtwise = pd.read_csv('education_districtwise.csv')
education_districtwise = education_districtwise.dropna()
We use dropna() to remove rows with missing values from our data.
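The dataset itself isn't reproduced here, so as an illustration of what dropna() does, here is its behavior on a tiny synthetic frame (the rows and values below are made up, only the column names match the dataset):

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real dataset
df = pd.DataFrame({
    'DISTNAME': ['A', 'B', 'C', 'D'],
    'OVERALL_LI': [66.9, np.nan, 71.2, 58.0],
})

cleaned = df.dropna()  # drops every row that contains at least one NaN
print(len(df), len(cleaned))  # 4 3
```

In the real dataset this step is what reduces the row count from 680 to 634.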
Plotting a histogram
The first step in trying to model our data with a probability distribution is to plot a histogram. This will help us visualize the shape of our data and determine if it resembles the shape of a specific distribution.
education_districtwise['OVERALL_LI'].hist()
Normal Distribution
The histogram shows that the distribution of the literacy rate data is bell-shaped and symmetric about the mean. The mean literacy rate, which is around 73%, is located in the center of the plot.
The normal distribution is a continuous probability distribution that is bell-shaped and symmetrical on both sides of the mean. The shape of the histogram suggests that the normal distribution might be a good modeling option for the data.
Empirical rule
Since the normal distribution seems like a good fit for the district literacy rate data, we can expect the empirical rule to apply relatively well. In other words, we can expect that about 68 percent of literacy rates will fall within one standard deviation of the mean, 95 percent will fall within two standard deviations, and 99.7 percent will fall within three standard deviations.
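As a quick sketch of where the 68-95-99.7 figures come from, we can compute them directly from the standard normal CDF with scipy.stats.norm:

```python
from scipy import stats

# P(-k <= Z <= k) for a standard normal variable Z
for k in [1, 2, 3]:
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {coverage:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```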
Mean and the standard deviation
mean_overall_li = education_districtwise['OVERALL_LI'].mean()
mean_overall_li
73.39518927444797
The mean district literacy rate is about 73.4%.
std_overall_li = education_districtwise['OVERALL_LI'].std()
std_overall_li
10.098460413782469
The standard deviation is about 10 percentage points.
Now, let’s compute the actual percentage of district literacy rates that fall within +/- 1 SD from the mean.
lower_limit = mean_overall_li - 1 * std_overall_li
upper_limit = mean_overall_li + 1 * std_overall_li
((education_districtwise['OVERALL_LI'] >= lower_limit) & (education_districtwise['OVERALL_LI'] <= upper_limit)).mean()
0.6640378548895899
This is very close to the roughly 68 percent that the empirical rule suggests. Let’s use the same code structure to compute the actual percentage of district literacy rates that fall within +/- 2 SD and +/- 3 SD from the mean.
lower_limit = mean_overall_li - 2 * std_overall_li
upper_limit = mean_overall_li + 2 * std_overall_li
((education_districtwise['OVERALL_LI'] >= lower_limit) & (education_districtwise['OVERALL_LI'] <= upper_limit)).mean()
0.9542586750788643
lower_limit = mean_overall_li - 3 * std_overall_li
upper_limit = mean_overall_li + 3 * std_overall_li
((education_districtwise['OVERALL_LI'] >= lower_limit) & (education_districtwise['OVERALL_LI'] <= upper_limit)).mean()
0.9968454258675079
Our values of 66.4%, 95.4%, and 99.7% are very close to the values the empirical rule suggests: roughly 68%, 95%, and 99.7%.
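The three nearly identical blocks above can be collapsed into a single loop. Since the dataset isn't included here, this sketch generates normally distributed stand-in data with the observed mean and standard deviation; substituting the real education_districtwise['OVERALL_LI'] series for overall_li reproduces the values above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for the OVERALL_LI column: normal with the observed mean and SD
overall_li = pd.Series(rng.normal(loc=73.4, scale=10.1, size=634))

mean, std = overall_li.mean(), overall_li.std()
for k in [1, 2, 3]:
    lower, upper = mean - k * std, mean + k * std
    share = ((overall_li >= lower) & (overall_li <= upper)).mean()
    print(f"within {k} SD: {share:.3f}")  # roughly 0.68, 0.95, 0.997
```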
At this point, it’s reasonable to say our data approximately follows a normal distribution. Knowing that our data is normally distributed is useful for analysis because many statistical tests and machine learning models assume a normal distribution. Plus, when our data follows a normal distribution, we can use z-scores to measure the relative position of our values and find outliers in our data.
Compute z-scores to find outliers
A z-score is a measure of how many standard deviations below or above the population mean a data point is. A z-score is useful because it tells us where a value lies in a distribution. Typically, data professionals consider observations with a z-score smaller than -3 or larger than +3 as outliers.
For example, if someone tells us a literacy rate is 80 percent, this doesn’t give us much information about where the value lies in the distribution. However, if they also tell us the literacy rate has a z-score of two, then we know that the value is two standard deviations above the mean.
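To make the definition concrete, here is that 80 percent figure worked through with the mean and standard deviation computed earlier, using the z-score formula z = (x - mean) / std. (In this dataset, 80 percent turns out to sit only about 0.65 standard deviations above the mean; the "z-score of two" above is a hypothetical example.)

```python
mean_overall_li = 73.39518927444797  # mean computed above
std_overall_li = 10.098460413782469  # standard deviation computed above

x = 80.0  # a district literacy rate of 80 percent
z = (x - mean_overall_li) / std_overall_li
print(round(z, 2))  # 0.65
```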
We will compute the z-scores using the function scipy.stats.zscore(). (Note that by default, zscore() standardizes with the population standard deviation, ddof=0.)
education_districtwise['Z_SCORE'] = stats.zscore(education_districtwise['OVERALL_LI'])
education_districtwise
|     | DISTNAME    | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI | Z_SCORE   |
|-----|-------------|----------|--------|----------|----------|------------|------------|-----------|
| 0   | DISTRICT32  | STATE1   | 13     | 391      | 104      | 875564.0   | 66.92      | -0.641712 |
| 1   | DISTRICT649 | STATE1   | 18     | 678      | 144      | 1015503.0  | 66.93      | -0.640721 |
| 2   | DISTRICT229 | STATE1   | 8      | 94       | 65       | 1269751.0  | 71.21      | -0.216559 |
| 3   | DISTRICT259 | STATE1   | 13     | 523      | 104      | 735753.0   | 57.98      | -1.527694 |
| 4   | DISTRICT486 | STATE1   | 8      | 359      | 64       | 570060.0   | 65.00      | -0.831990 |
| …   | …           | …        | …      | …        | …        | …          | …          | …         |
| 675 | DISTRICT522 | STATE29  | 37     | 876      | 137      | 5296396.0  | 78.05      | 0.461307  |
| 676 | DISTRICT498 | STATE29  | 64     | 1458     | 230      | 4042191.0  | 56.06      | -1.717972 |
| 677 | DISTRICT343 | STATE29  | 59     | 1117     | 216      | 3483648.0  | 65.05      | -0.827035 |
| 678 | DISTRICT130 | STATE29  | 51     | 993      | 211      | 3522644.0  | 66.16      | -0.717030 |
| 679 | DISTRICT341 | STATE29  | 41     | 783      | 185      | 2798214.0  | 65.46      | -0.786403 |

634 rows × 8 columns
Now that we have computed z-scores for our dataset, we will write some code to identify outliers: districts whose z-scores are below -3 or above +3.
education_districtwise[(education_districtwise['Z_SCORE'] > 3) | (education_districtwise['Z_SCORE'] < -3)]
|     | DISTNAME    | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI | Z_SCORE   |
|-----|-------------|----------|--------|----------|----------|------------|------------|-----------|
| 434 | DISTRICT461 | STATE31  | 4      | 360      | 53       | 532791.0   | 42.67      | -3.044964 |
| 494 | DISTRICT429 | STATE22  | 6      | 612      | 62       | 728677.0   | 37.22      | -3.585076 |
Using z-scores, we can identify two outlying districts that have unusually low literacy rates: DISTRICT461 and DISTRICT429.
Our analysis gives us important information to share. The government may want to provide more funding and resources to these two districts in the hopes of significantly improving literacy.
Probability distributions are useful for modeling our data and help us determine which statistical test to use for an analysis. In addition to the normal distribution, Python supports a wide range of other probability distributions.
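As a closing sketch of that flexibility: scipy.stats exposes many distributions through a common interface. Here the normal distribution is parameterized with the (rounded) mean and standard deviation from earlier; this is an illustration of the API, not part of the analysis above:

```python
from scipy import stats

# Normal model with the mean and SD estimated earlier (rounded)
dist = stats.norm(loc=73.4, scale=10.1)

print(dist.cdf(80) - dist.cdf(60))  # P(60 <= literacy <= 80) under the model
print(dist.ppf(0.5))                # median of the model, equal to the mean
# Other distributions (stats.binom, stats.poisson, stats.expon, ...) share
# the same pdf/cdf/ppf interface.
```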