Cleaning Data: Outliers, Sample 1

Identifying and Dealing with Outliers in Python

Objective

We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) from 1987 through 2020. Because the raw data would be many millions of rows to read into the notebook, Google’s team preprocessed the data so it contains just the year and the number of strikes.

We will examine the range of total lightning strike counts for each year and identify outliers. Then we will plot the yearly totals on a scatterplot.

Imports and loading the data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# Read in data
df = pd.read_csv('eda_outliers_dataset1.csv')
df.head(10)
   year  number_of_strikes
0  2020           15620068
1  2019             209166
2  2018           44600989
3  2017           35095195
4  2016           41582229
5  2015           37894191
6  2014           34919173
7  2013           27600898
8  2012           28807552
9  2011           31392058
Converting the numbers into strings and checking the means & medians

Let’s convert the number of strikes value to a more readable format on the graph.

def readable_numbers(x):
    """Takes a large number and formats it into K,M to make it more readable."""
    if x >= 1e6:
        s = '{:1.1f}M'.format(x*1e-6)
    else:
        s = '{:1.0f}K'.format(x*1e-3)
    return s

# Use the readable_numbers() function to create a new column
df['number_of_strikes_readable'] = df['number_of_strikes'].apply(readable_numbers)

df.head(10)
   year  number_of_strikes number_of_strikes_readable
0  2020           15620068                      15.6M
1  2019             209166                       209K
2  2018           44600989                      44.6M
3  2017           35095195                      35.1M
4  2016           41582229                      41.6M
5  2015           37894191                      37.9M
6  2014           34919173                      34.9M
7  2013           27600898                      27.6M
8  2012           28807552                      28.8M
9  2011           31392058                      31.4M
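
The formatting can be checked directly on two raw values from the table above. This is a quick sanity check; the helper is repeated here so the cell runs on its own:

```python
def readable_numbers(x):
    """Takes a large number and formats it into K,M to make it more readable."""
    if x >= 1e6:
        return '{:1.1f}M'.format(x * 1e-6)
    return '{:1.0f}K'.format(x * 1e-3)

print(readable_numbers(15620068))  # 15.6M
print(readable_numbers(209166))    # 209K
```
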
print("Mean: " + readable_numbers(np.mean(df['number_of_strikes'])))
print("Median: " + readable_numbers(np.median(df['number_of_strikes'])))
Mean: 26.8M
Median: 28.3M
Visualizing the results

One effective way to visualize outliers is a boxplot.

# Create boxplot
box = sns.boxplot(x=df['number_of_strikes'])
g = plt.gca()
box.set_xticklabels(np.array([readable_numbers(x) for x in g.get_xticks()]))
plt.xlabel('Number of strikes')
plt.title('Yearly number of lightning strikes')

The points to the left of the left whisker are outliers: any observations more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers.

One important caveat, however: do not assume an outlier is erroneous unless there is an explanation or reason to believe so.

Defining the IQR limits

The next piece of code defines the IQR and the upper and lower limits.

# Calculate 25th percentile of annual strikes
percentile25 = df['number_of_strikes'].quantile(0.25)

# Calculate 75th percentile of annual strikes
percentile75 = df['number_of_strikes'].quantile(0.75)

# Calculate interquartile range
iqr = percentile75 - percentile25

# Calculate upper and lower thresholds for outliers
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

print('Lower limit is: '+ readable_numbers(lower_limit))
Lower limit is: 8.6M

Now we can use a Boolean mask to select only the rows of the dataframe where the number of strikes is less than the lower limit we calculated above. These rows are the outliers on the low end.

# Isolate outliers on low end
df[df['number_of_strikes'] < lower_limit]
    year  number_of_strikes number_of_strikes_readable
1   2019             209166                       209K
33  1987            7378836                       7.4M
Visualization of the outliers

A great way to see these outliers in relation to the rest of the data points is a visualization. For this plot, let’s use a scatter plot.

def addlabels(x,y):
    for i in range(len(x)):
        plt.text(x[i]-0.5, y[i]+500000, s=readable_numbers(y[i]))

colors = np.where(df['number_of_strikes'] < lower_limit, 'r', 'b')

fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df['year'], df['number_of_strikes'],c=colors)
ax.set_xlabel('Year')
ax.set_ylabel('Number of strikes')
ax.set_title('Number of lightning strikes by year')
addlabels(df['year'], df['number_of_strikes'])
for tick in ax.get_xticklabels():
    tick.set_rotation(45)
plt.show()

Here, ax refers to the Matplotlib Axes object, which holds the x-versus-y plot along with its labels, title, and ticks.

Investigating the outliers 2019 and 1987

Let’s examine the two outlier years a bit more closely. In the section above, we used a preprocessed dataset that didn’t include a lot of the information that we’re accustomed to having in this data. In order to further investigate the outlier years, we’ll need more information, so we’re going to import data from these years specifically.

Data for 2019
df_2019 = pd.read_csv('eda_outliers_dataset2.csv')
df_2019.head()
         date  number_of_strikes  center_point_geom
0  2019-12-01                  1  POINT(-79.7 35.3)
1  2019-12-01                  1  POINT(-84.7 39.3)
2  2019-12-01                  1  POINT(-83.4 38.9)
3  2019-12-01                  1  POINT(-71.5 35.2)
4  2019-12-01                  1  POINT(-87.8 41.6)

First, let’s convert the date column to datetime, as we did with the other lightning datasets, and then aggregate the data by month.

# Convert `date` column to datetime
df_2019['date']= pd.to_datetime(df_2019['date'])

# Create 2 new columns
df_2019['month'] = df_2019['date'].dt.month
df_2019['month_txt'] = df_2019['date'].dt.month_name().str.slice(stop=3)

# Group by `month` and `month_txt`, sum it, and sort. Assign result to new df
df_2019_by_month = df_2019.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()
df_2019_by_month
   month month_txt  number_of_strikes
0     12       Dec             209166

2019 appears to have data only for the month of December. The likelihood of there being no lightning at all from January through November 2019 is essentially zero, so this appears to be a case of missing data. We should probably exclude 2019 from the analysis (for most use cases).
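One quick way to confirm the gap programmatically is to compare the months present in the data against all twelve calendar months. This is a minimal sketch using a small toy stand-in for df_2019 (in the notebook, the real dataframe would be used):

```python
import pandas as pd

# Toy stand-in for df_2019: records exist only in December
df_2019 = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-01', '2019-12-02', '2019-12-03']),
    'number_of_strikes': [1, 1, 2],
})
df_2019['month'] = df_2019['date'].dt.month

# Calendar months with no records at all
missing_months = sorted(set(range(1, 13)) - set(df_2019['month'].unique()))
print(missing_months)  # months 1 through 11 (Jan-Nov) are absent
```
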

Data for 1987

Now let’s inspect the data from the other outlier year, 1987.

# Read in 1987 data
df_1987 = pd.read_csv('eda_outliers_dataset3.csv')

Now let’s do the same datetime conversions and groupings we did for the other datasets.

# Convert `date` column to datetime
df_1987['date'] = pd.to_datetime(df_1987['date'])

# Create 2 new columns
df_1987['month'] = df_1987['date'].dt.month
df_1987['month_txt'] = df_1987['date'].dt.month_name().str.slice(stop=3)

# Group by `month` and `month_txt`, sum it, and sort. Assign result to new df
df_1987_by_month = df_1987.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()
df_1987_by_month
    month month_txt  number_of_strikes
0       1       Jan              23044
1       2       Feb              61020
2       3       Mar             117877
3       4       Apr             157890
4       5       May             700910
5       6       Jun            1064166
6       7       Jul            2077619
7       8       Aug            2001899
8       9       Sep             869833
9      10       Oct             105627
10     11       Nov             155290
11     12       Dec              43661

The difference between the two years, that 2019 is missing data for most months while 1987 has data for all twelve, tells us how to handle each outlier. It makes sense to exclude 2019 from the analysis. 1987, on the other hand, is a genuine outlier, but it should not be excluded, because it has lightning strike totals for every month of the year.

Rechecking the Means and the Medians

Finally, let’s re-run the mean and median after removing the outliers.

# Create new df that removes outliers
df_without_outliers = df[df['number_of_strikes'] >= lower_limit]

# Recalculate mean and median values on data without outliers
print("Mean: " + readable_numbers(np.mean(df_without_outliers['number_of_strikes'])))
print("Median: " + readable_numbers(np.median(df_without_outliers['number_of_strikes'])))
Mean: 28.2M
Median: 28.8M

Both the mean and the median changed, but the mean changed much more. Outliers can substantially shift a dataset’s mean while leaving the median largely unaffected.
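
This sensitivity is easy to demonstrate with made-up numbers (not taken from the dataset): a single extreme low value drags the mean well below the typical level, while the median barely moves.

```python
import numpy as np

# Four typical yearly totals (in millions) plus one severe outlier
totals = [28, 29, 30, 31, 0.2]

print(np.mean(totals))    # 23.64 -- pulled far below the typical 28-31 range
print(np.median(totals))  # 29.0  -- unaffected by the single outlier
```
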

