Validating Data: Input Validation

There are many different ways to validate a set of data; here, we’ll focus on input validation. 

Input validation

The practice of thoroughly analyzing and double-checking data to make sure it is complete, error-free, and high quality.

Objective

We will be examining monthly lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for 2018. The dataset includes five columns: date, number_of_strikes, center_point_geom, longitude, and latitude.

The objective is to inspect the data and validate the quality of its contents. We will check for: 

  • Null values 
  • Missing dates 
  • A plausible range of daily lightning strikes in a location 
  • A geographical range that aligns with expectations

Imports, Read, and Basic Check
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px 
import seaborn as sns

df = pd.read_csv('eda_input_validation_joining_dataset1.csv')
df.head()
         date  number_of_strikes center_point_geom  longitude  latitude
0  2018-01-03                194     POINT(-75 27)      -75.0      27.0
1  2018-01-03                 41   POINT(-78.4 29)      -78.4      29.0
2  2018-01-03                 33   POINT(-73.9 27)      -73.9      27.0
3  2018-01-03                 38   POINT(-73.8 27)      -73.8      27.0
4  2018-01-03                 92     POINT(-79 28)      -79.0      28.0
# Display the data types of the columns
print(df.dtypes)
date                          object
number_of_strikes              int64
center_point_geom             object
longitude                    float64
latitude                     float64
dtype: object

As we’ve done many times before, let’s convert the date column to datetime.

# Convert `date` column to datetime
df['date'] = pd.to_datetime(df['date'])
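
If the raw data were less trustworthy, a defensive variant of this conversion would use errors='coerce', so that any malformed date strings become NaT instead of raising an exception. A minimal sketch (not needed here, since this dataset parses cleanly):

# Coerce any unparseable date strings to NaT, then count them
coerced = pd.to_datetime(df['date'], errors='coerce')
print(coerced.isna().sum())  # 0 for this dataset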

Let’s begin our data validation by counting the number of missing values in each column.

df.isnull().sum()
date                 0
number_of_strikes    0
center_point_geom    0
longitude            0
latitude             0
dtype: int64
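
Every count is zero, so there’s nothing to handle here. Had any nulls appeared, a reasonable first step would be to inspect the affected rows before deciding whether to drop or impute them. A quick hypothetical sketch:

# Pull out any rows containing at least one null value
rows_with_nulls = df[df.isnull().any(axis=1)]
print(len(rows_with_nulls))  # 0 for this dataset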

Now let’s check ranges for all variables.

df.describe(include = 'all')
                       date  number_of_strikes  center_point_geom     longitude      latitude
count               3401012       3.401012e+06            3401012  3.401012e+06  3.401012e+06
unique                  357                NaN             170855           NaN           NaN
top     2018-09-01 00:00:00                NaN  POINT(-81.5 22.5)           NaN           NaN
freq                  31773                NaN                108           NaN           NaN
first   2018-01-01 00:00:00                NaN                NaN           NaN           NaN
last    2018-12-31 00:00:00                NaN                NaN           NaN           NaN
mean                    NaN       1.311403e+01                NaN -9.081778e+01  3.374688e+01
std                     NaN       3.212099e+01                NaN  1.296593e+01  7.838555e+00
min                     NaN       1.000000e+00                NaN -1.418000e+02  1.660000e+01
25%                     NaN       2.000000e+00                NaN -1.008000e+02  2.760000e+01
50%                     NaN       4.000000e+00                NaN -9.070000e+01  3.350000e+01
75%                     NaN       1.200000e+01                NaN -8.130000e+01  3.970000e+01
max                     NaN       2.211000e+03                NaN -4.320000e+01  5.170000e+01

Notice that the number of unique dates in the date column is 357. This means that eight days of 2018 are missing from the data, because 2018 had 365 days.
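
We can confirm that arithmetic directly:

# 365 days in 2018 minus the number of unique dates present
print(365 - df['date'].nunique())  # 8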

Validate date column

Let’s confirm that each day of the year is present by creating a full calendar index, which we’ll call full_date_range, and comparing it against the date column. This is how we’ll determine which dates are missing.

# Create datetime index of every date in 2018
full_date_range = pd.date_range(start='2018-01-01', end='2018-12-31')

# Determine which values are in `full_date_range` but not in `df['date']`
full_date_range.difference(df['date'])
DatetimeIndex(['2018-06-19', '2018-06-20', '2018-06-21', '2018-06-22',
               '2018-09-18', '2018-09-19', '2018-12-01', '2018-12-02'],
              dtype='datetime64[ns]', freq=None)

There are four consecutive days missing in June 2018, two consecutive days in September, and two consecutive days in December. This would be something to investigate further, or a question to raise with the owner of the data, to find out why those days are missing.

Given that the number of missing days is relatively small, we can proceed with our analysis and simply note the missing days in any visualization or presentation, so that anyone reviewing our work knows the data depicted doesn’t include those dates.
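
One way to make those gaps explicit for anyone reusing the data is to aggregate strikes by day and reindex onto the full calendar, so the missing dates surface as NaN rather than silently disappearing. A sketch, reusing the full_date_range index from above:

# Total strikes per day, reindexed so the eight missing dates appear as NaN
daily_strikes = df.groupby('date')['number_of_strikes'].sum()
daily_strikes = daily_strikes.reindex(full_date_range)
print(daily_strikes.isna().sum())  # 8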

Validate number_of_strikes column

Let’s make a boxplot to better understand the range of values in the data.

sns.boxplot(y = df['number_of_strikes'])
[Boxplot of number_of_strikes, dominated by a long tail of outliers]

It looks like the distribution is very skewed: most daily records have fewer than five lightning strikes at a given location, but some have more than 2,000. We can remove the outliers from the boxplot visualization to better understand where the majority of the data is distributed.

sns.boxplot(y = df['number_of_strikes'], showfliers=False)
[Boxplot of number_of_strikes with outliers hidden, showing the interquartile range]

The interquartile range is approximately 2–12 strikes, but we know from the previous boxplot that there are many outlier records with hundreds or even thousands of strikes. This exercise confirmed that most of the observations in our data have plausible values for the number of strikes.

If the bulk of the distribution sat in the thousands of strikes, we might be more skeptical.
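
We can also read the quartiles off numerically rather than estimating them from the plot:

# First and third quartiles of the per-record strike counts
q1, q3 = df['number_of_strikes'].quantile([0.25, 0.75])
print(q1, q3)  # 2.0 12.0, matching the boxplot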

Validate latitude and longitude columns 

Finally, let’s verify that all of the latitudes and longitudes included in this dataset are in the United States. We’ll plot the points on a map to make sure the points in the data are relevant and not in unexpected locations. 

Because this can be a computationally intensive process, we’ll reduce redundant computation by keeping only one row for each unique latitude–longitude combination. We can do this because the purpose here is to examine the locations that had lightning strikes; it doesn’t matter how many strikes they had or when.

# Create new df only of unique latitude and longitude combinations
df_points = df[['latitude', 'longitude']].drop_duplicates() 
df_points.head()
   latitude  longitude
0      27.0      -75.0
1      29.0      -78.4
2      27.0      -73.9
3      27.0      -73.8
4      28.0      -79.0

Working with hundreds of thousands of data points plotted on a map takes a lot of computing power. To keep the runtime short, we’ll use the Plotly Express package below, which is designed to keep runtimes as low as possible.

p = px.scatter_geo(df_points, lat = 'latitude', lon = 'longitude')
p.show()
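
As a complementary numeric check, we can count how many deduplicated points fall outside a rough bounding box. The bounds below are illustrative assumptions based on the minimum and maximum values from describe() above, not official geographic limits:

# Count points outside a rough bounding box for the region of interest
lat_ok = df_points['latitude'].between(16, 52)
lon_ok = df_points['longitude'].between(-142, -43)
print((~(lat_ok & lon_ok)).sum())  # 0 points fall outside the box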

The resulting map confirms that the points fall within the expected geographical range, in and around the United States. With that, all four of our validation checks are complete.