Cleaning Data: Handling Outliers

How to Handle Outliers

In my previous post I gave a sample work to demonstrate how to identify and deal with the outliers. Here I’ll collect all related notes together and bring more of them for a future reference.

When we deal with the outliers, we can start with these general guidelines:

  • Delete them: If we are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then we are more likely to decide to delete outliers. Of the three choices, we’ll use this one the least. 
  • Reassign them: If the dataset is small and/or the data will be used for modeling or machine learning, we are more likely to choose a path of deriving new values to replace the outlier values. 
  • Leave them: For a dataset that we plan to do EDA/analysis on and nothing else, or for a dataset we are preparing for a model that is resistant to outliers, it is most likely that we are going to leave them in.

Now let’s check some techniques for deleting and reassigning outliers.

Deleting Outliers

In a previous post we checked a sample with the lightning dataset. Let’s get back to that dataframe.

box = sns.boxplot(x=df['number_of_strikes'])
g = plt.gca()
box.set_xticklabels(np.array([readable_numbers(x) for x in g.get_xticks()]))
plt.xlabel('Number of strikes')
plt.title('Yearly number of lightning strikes')

Then we found the lower limit.

# Calculate 25th percentile of annual strikes
percentile25 = df['number_of_strikes'].quantile(0.25)

# Calculate 75th percentile of annual strikes
percentile75 = df['number_of_strikes'].quantile(0.75)

# Calculate interquartile range
iqr = percentile75 - percentile25

# Calculate upper and lower thresholds for outliers
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

print('Lower limit is: ', lower_limit)
Lower limit is:  8585016.625

Next, we used a Boolean mask to filter the dataframe.

print(df[df['number_of_strikes'] < lower_limit])
    number_of_strikes  year
1 209166 2019
33 7378836 1987

Once we know the cutoff points for outliers, if we want to delete them, we can use a Boolean mask to select all rows such that: lower limit ≤ values ≤ upper limit.

mask = (df['number_of_strikes'] >= lower_limit) & (df['number_of_strikes'] <= upper_limit)

df = df[mask].copy()
print(df)
    number_of_strikes  year
0 15620068 2020
2 44600989 2018
3 35095195 2017
4 41582229 2016
5 37894191 2015
6 34919173 2014
7 27600898 2013
8 28807552 2012
9 31392058 2011
10 29068965 2010
11 30100585 2009
12 29790934 2008
13 30529064 2007
14 33292382 2006
15 38168699 2005
16 40023951 2004
17 39092327 2003
18 29916767 2002
19 25470095 2001
20 26276135 2000
21 27758681 1999
22 28802221 1998
23 26986915 1997
24 26190094 1996
25 22763540 1995
26 25094010 1994
27 24206929 1993
28 16371876 1992
29 16900934 1991
30 15839052 1990
31 14245186 1989
32 9150440 1988

Reassigning Outliers

Instead of deleting outliers, we can always reassign them, that is, change the values to ones that fit within the general distribution of the dataset. There are two common ways to do this, but many different ways can be used:

1. Create a floor and ceiling at a quantile: 

We could place walls at the 90th and 10th percentile of the distribution of data values. Any value above the 90% mark or below the 10% mark are changed to fit within the walls we set.

# Calculate 10th percentile
tenth_percentile = np.percentile(df['number_of_strikes'], 10)

# Calculate 90th percentile
ninetieth_percentile = np.percentile(df['number_of_strikes'], 90)

# Apply lambda function to replace outliers with thresholds defined above
df['number_of_strikes'] = df['number_of_strikes'].apply(lambda x: (
    tenth_percentile if x < tenth_percentile 
    else ninetieth_percentile if x > ninetieth_percentile 
    else x))
0     15620068.0
1 14657650.6
2 38815238.6
3 35095195.0
4 38815238.6
5 37894191.0
6 34919173.0
7 27600898.0
8 28807552.0
9 31392058.0
10 29068965.0
11 30100585.0
12 29790934.0
13 30529064.0
14 33292382.0
15 38168699.0
16 38815238.6
17 38815238.6
18 29916767.0
19 25470095.0
20 26276135.0
21 27758681.0
22 28802221.0
23 26986915.0
24 26190094.0
25 22763540.0
26 25094010.0
27 24206929.0
28 16371876.0
29 16900934.0
30 15839052.0
31 14657650.6
32 14657650.6
33 14657650.6
Name: number_of_strikes, dtype: float64
2. Impute the average: 

In some cases, it might be best to reassign all outlier values to match the median or mean value. This will ensure that our median and distribution are based solely on the non-outlier values, leaving the original outliers excluded. 

The following code block calculates the median of the values greater than the lower limit. Then it imputes the median where values are lower than the lower limit.

# Calculate median of all NON-OUTLIER values
median = np.median(df['number_of_strikes'][df['number_of_strikes'] >= lower_limit])

# Impute the median for all values < lower_limit
df['number_of_strikes'] = np.where(df['number_of_strikes'] < lower_limit, median, df['number_of_strikes'] )
[ 15620068.   28804886.5  44600989.   35095195.   41582229.   37894191.
34919173. 27600898. 28807552. 31392058. 29068965. 30100585.
29790934. 30529064. 33292382. 38168699. 40023951. 39092327.
29916767. 25470095. 26276135. 27758681. 28802221. 26986915.
26190094. 22763540. 25094010. 24206929. 16371876. 16900934.
15839052. 14245186. 9150440. 28804886.5]

Of course, there are more complex variations on dealing with outliers. These above ways are some common ones.


In

,