Cleaning Data: Outliers, Sample 1

Identifying and Dealing with Outliers in Python

Objective

We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) from 1987 through 2020. Because the raw data would be many millions of rows to read into the notebook, Google’s team preprocessed the data so it contains just the year and the number of strikes.

We will examine the range of total lightning strike counts for each year and identify outliers. Then we will plot the yearly totals on a scatterplot.

Imports and loading the data
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# Read in data
df = pd.read_csv('eda_outliers_dataset1.csv')
df.head(10)
   year  number_of_strikes
0  2020           15620068
1  2019             209166
2  2018           44600989
3  2017           35095195
4  2016           41582229
5  2015           37894191
6  2014           34919173
7  2013           27600898
8  2012           28807552
9  2011           31392058
Converting the numbers into strings and checking the means & medians

Let’s convert the number of strikes value to a more readable format on the graph.

def readable_numbers(x):
    """Takes a large number and formats it into K,M to make it more readable."""
    if x >= 1e6:
        s = '{:1.1f}M'.format(x*1e-6)
    else:
        s = '{:1.0f}K'.format(x*1e-3)
    return s

# Use the readable_numbers() function to create a new column
df['number_of_strikes_readable'] = df['number_of_strikes'].apply(readable_numbers)

df.head(10)
   year  number_of_strikes number_of_strikes_readable
0  2020           15620068                      15.6M
1  2019             209166                       209K
2  2018           44600989                      44.6M
3  2017           35095195                      35.1M
4  2016           41582229                      41.6M
5  2015           37894191                      37.9M
6  2014           34919173                      34.9M
7  2013           27600898                      27.6M
8  2012           28807552                      28.8M
9  2011           31392058                      31.4M
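
The formatting can be checked directly on two raw values from the table above. This is a quick sanity check; the helper is repeated here so the cell runs on its own:

```python
def readable_numbers(x):
    """Takes a large number and formats it into K,M to make it more readable."""
    if x >= 1e6:
        return '{:1.1f}M'.format(x * 1e-6)
    return '{:1.0f}K'.format(x * 1e-3)

print(readable_numbers(15620068))  # 15.6M
print(readable_numbers(209166))    # 209K
```
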
print("Mean: " + readable_numbers(np.mean(df['number_of_strikes'])))
print("Median: " + readable_numbers(np.median(df['number_of_strikes'])))
Mean: 26.8M
Median: 28.3M
Visualizing the results

One effective way to visualize outliers is a boxplot.

# Create boxplot
box = sns.boxplot(x=df['number_of_strikes'])
g = plt.gca()
box.set_xticklabels(np.array([readable_numbers(x) for x in g.get_xticks()]))
plt.xlabel('Number of strikes')
plt.title('Yearly number of lightning strikes')

The points to the left of the left whisker are outliers: any observations more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers.

One important caveat, however: do not assume an outlier is erroneous unless there is an explanation or reason to believe so.

Defining the IQR limits

The next piece of code defines the IQR and the upper and lower limits.

# Calculate 25th percentile of annual strikes
percentile25 = df['number_of_strikes'].quantile(0.25)

# Calculate 75th percentile of annual strikes
percentile75 = df['number_of_strikes'].quantile(0.75)

# Calculate interquartile range
iqr = percentile75 - percentile25

# Calculate upper and lower thresholds for outliers
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

print('Lower limit is: '+ readable_numbers(lower_limit))
Lower limit is: 8.6M

Now we can use a Boolean mask to select only the rows of the dataframe where the number of strikes is less than the lower limit we calculated above. These rows are the outliers on the low end.

# Isolate outliers on low end
df[df['number_of_strikes'] < lower_limit]
    year  number_of_strikes number_of_strikes_readable
1   2019             209166                       209K
33  1987            7378836                       7.4M
Visualization of the outliers

A great way to see these outliers in relation to the rest of the data points is a visualization. For this plot, let’s use a scatter plot.

def addlabels(x,y):
    for i in range(len(x)):
        plt.text(x[i]-0.5, y[i]+500000, s=readable_numbers(y[i]))

colors = np.where(df['number_of_strikes'] < lower_limit, 'r', 'b')

fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df['year'], df['number_of_strikes'],c=colors)
ax.set_xlabel('Year')
ax.set_ylabel('Number of strikes')
ax.set_title('Number of lightning strikes by year')
addlabels(df['year'], df['number_of_strikes'])
for tick in ax.get_xticklabels():
    tick.set_rotation(45)
plt.show()

Here, ax refers to the Matplotlib Axes object, which holds the x-versus-y plot along with its labels, title, and ticks.

Investigating the outliers 2019 and 1987

Let’s examine the two outlier years a bit more closely. In the section above, we used a preprocessed dataset that didn’t include a lot of the information that we’re accustomed to having in this data. In order to further investigate the outlier years, we’ll need more information, so we’re going to import data from these years specifically.

Data for 2019
df_2019 = pd.read_csv('eda_outliers_dataset2.csv')
df_2019.head()
         date  number_of_strikes  center_point_geom
0  2019-12-01                  1  POINT(-79.7 35.3)
1  2019-12-01                  1  POINT(-84.7 39.3)
2  2019-12-01                  1  POINT(-83.4 38.9)
3  2019-12-01                  1  POINT(-71.5 35.2)
4  2019-12-01                  1  POINT(-87.8 41.6)

First, let’s convert the date column to datetime, as we did with the other lightning datasets, and then aggregate the data by month.

# Convert `date` column to datetime
df_2019['date']= pd.to_datetime(df_2019['date'])

# Create 2 new columns
df_2019['month'] = df_2019['date'].dt.month
df_2019['month_txt'] = df_2019['date'].dt.month_name().str.slice(stop=3)

# Group by `month` and `month_txt`, sum it, and sort. Assign result to new df
df_2019_by_month = df_2019.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()
df_2019_by_month
   month month_txt  number_of_strikes
0     12       Dec             209166

2019 appears to have data only for the month of December. The likelihood of there being no lightning at all from January through November 2019 is essentially zero, so this appears to be a case of missing data. We should probably exclude 2019 from the analysis (for most use cases).
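One quick way to confirm the gap programmatically is to compare the months present in the data against all twelve calendar months. This is a minimal sketch using a small toy stand-in for df_2019 (in the notebook, the real dataframe would be used):

```python
import pandas as pd

# Toy stand-in for df_2019: records exist only in December
df_2019 = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-01', '2019-12-02', '2019-12-03']),
    'number_of_strikes': [1, 1, 2],
})
df_2019['month'] = df_2019['date'].dt.month

# Calendar months with no records at all
missing_months = sorted(set(range(1, 13)) - set(df_2019['month'].unique()))
print(missing_months)  # months 1 through 11 (Jan-Nov) are absent
```
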

Data for 1987

Now let’s inspect the data from the other outlier year, 1987.

# Read in 1987 data
df_1987 = pd.read_csv('eda_outliers_dataset3.csv')

Now let’s do the same datetime conversions and groupings we did for the other datasets.

# Convert `date` column to datetime
df_1987['date'] = pd.to_datetime(df_1987['date'])

# Create 2 new columns
df_1987['month'] = df_1987['date'].dt.month
df_1987['month_txt'] = df_1987['date'].dt.month_name().str.slice(stop=3)

# Group by `month` and `month_txt`, sum it, and sort. Assign result to new df
df_1987_by_month = df_1987.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()
df_1987_by_month
    month month_txt  number_of_strikes
0       1       Jan              23044
1       2       Feb              61020
2       3       Mar             117877
3       4       Apr             157890
4       5       May             700910
5       6       Jun            1064166
6       7       Jul            2077619
7       8       Aug            2001899
8       9       Sep             869833
9      10       Oct             105627
10     11       Nov             155290
11     12       Dec              43661

The difference between the two years, that 2019 is missing data for most months while 1987 has data for all twelve, tells us how to handle each outlier. It makes sense to exclude 2019 from the analysis. 1987, on the other hand, is a genuine outlier, but it should not be excluded, because it has lightning strike totals for every month of the year.

Rechecking the Means and the Medians

Finally, let’s re-run the mean and median after removing the outliers.

# Create new df that removes outliers
df_without_outliers = df[df['number_of_strikes'] >= lower_limit]

# Recalculate mean and median values on data without outliers
print("Mean: " + readable_numbers(np.mean(df_without_outliers['number_of_strikes'])))
print("Median: " + readable_numbers(np.median(df_without_outliers['number_of_strikes'])))
Mean: 28.2M
Median: 28.8M

Both the mean and the median changed, but the mean changed much more. Outliers can substantially shift a dataset’s mean while leaving the median largely unaffected.
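
This sensitivity is easy to demonstrate with made-up numbers (not taken from the dataset): a single extreme low value drags the mean well below the typical level, while the median barely moves.

```python
import numpy as np

# Four typical yearly totals (in millions) plus one severe outlier
totals = [28, 29, 30, 31, 0.2]

print(np.mean(totals))    # 23.64 -- pulled far below the typical 28-31 range
print(np.median(totals))  # 29.0  -- unaffected by the single outlier
```
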

