There are many ways to validate a set of data; here we’ll focus on input validation.
Input validation
The practice of thoroughly analyzing and double-checking data to make sure it is complete, error-free, and high quality.
Objective
We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for 2018. The dataset includes five columns: date, number_of_strikes, center_point_geom, longitude, and latitude.
The objective is to inspect the data and validate the quality of its contents. We will check for:
- Null values
- Missing dates
- A plausible range of daily lightning strikes in a location
- A geographical range that aligns with expectation
Imports, Read, and Basic Check
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
df = pd.read_csv('eda_input_validation_joining_dataset1.csv')
df.head()
| | date | number_of_strikes | center_point_geom | longitude | latitude |
|---|---|---|---|---|---|
| 0 | 2018-01-03 | 194 | POINT(-75 27) | -75.0 | 27.0 |
| 1 | 2018-01-03 | 41 | POINT(-78.4 29) | -78.4 | 29.0 |
| 2 | 2018-01-03 | 33 | POINT(-73.9 27) | -73.9 | 27.0 |
| 3 | 2018-01-03 | 38 | POINT(-73.8 27) | -73.8 | 27.0 |
| 4 | 2018-01-03 | 92 | POINT(-79 28) | -79.0 | 28.0 |
# Display the data types of the columns
print(df.dtypes)
date object
number_of_strikes int64
center_point_geom object
longitude float64
latitude float64
dtype: object
As we’ve done many times before, let’s convert the date column to datetime.
# Convert `date` column to datetime
df['date'] = pd.to_datetime(df['date'])
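The dates in this file all parse cleanly, but as a defensive habit (a sketch, not something this dataset requires), passing errors='coerce' converts any unparseable strings to NaT so the missing-value check below would catch them:
# Defensive variant: coerce unparseable date strings to NaT
# so they surface in the missing-value counts below
df['date'] = pd.to_datetime(df['date'], errors='coerce')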
Let’s begin our data validation by counting the number of missing values in each column.
df.isnull().sum()
date 0
number_of_strikes 0
center_point_geom 0
longitude 0
latitude 0
dtype: int64
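Every column reports zero missing values, so nothing needs cleaning here. Had any nulls appeared, one reasonable first step (a sketch only, not part of this analysis) would be to inspect the affected rows before deciding whether to drop or impute them:
# Only relevant if the counts above were nonzero:
# inspect rows containing any missing value before deciding how to handle them
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls.head())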
Now let’s check ranges for all variables.
df.describe(include = 'all')
| | date | number_of_strikes | center_point_geom | longitude | latitude |
|---|---|---|---|---|---|
| count | 3401012 | 3.401012e+06 | 3401012 | 3.401012e+06 | 3.401012e+06 |
| unique | 357 | NaN | 170855 | NaN | NaN |
| top | 2018-09-01 00:00:00 | NaN | POINT(-81.5 22.5) | NaN | NaN |
| freq | 31773 | NaN | 108 | NaN | NaN |
| first | 2018-01-01 00:00:00 | NaN | NaN | NaN | NaN |
| last | 2018-12-31 00:00:00 | NaN | NaN | NaN | NaN |
| mean | NaN | 1.311403e+01 | NaN | -9.081778e+01 | 3.374688e+01 |
| std | NaN | 3.212099e+01 | NaN | 1.296593e+01 | 7.838555e+00 |
| min | NaN | 1.000000e+00 | NaN | -1.418000e+02 | 1.660000e+01 |
| 25% | NaN | 2.000000e+00 | NaN | -1.008000e+02 | 2.760000e+01 |
| 50% | NaN | 4.000000e+00 | NaN | -9.070000e+01 | 3.350000e+01 |
| 75% | NaN | 1.200000e+01 | NaN | -8.130000e+01 | 3.970000e+01 |
| max | NaN | 2.211000e+03 | NaN | -4.320000e+01 | 5.170000e+01 |
Notice that the number of unique dates in the date column is 357. This means that eight days of 2018 are missing from the data, because 2018 had 365 days.
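We can verify that count directly rather than reading it off the summary table:
# Count distinct dates; 2018 had 365 days, so 357 implies 8 missing
print(df['date'].nunique())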
Validate date column
Let’s confirm that each day is listed by building a calendar index of every date in 2018, which we’ll call full_date_range. Taking its difference with the dates in the data will tell us which dates are missing.
# Create datetime index of every date in 2018
full_date_range = pd.date_range(start='2018-01-01', end='2018-12-31')
# Determine which values are in `full_date_range` but not in `df['date']`
full_date_range.difference(df['date'])
DatetimeIndex(['2018-06-19', '2018-06-20', '2018-06-21', '2018-06-22',
'2018-09-18', '2018-09-19', '2018-12-01', '2018-12-02'],
dtype='datetime64[ns]', freq=None)
Four consecutive days are missing in June 2018, two consecutive days in September, and two consecutive days in December. This would be something to investigate, or to raise with the owner of the data, to find out why.
Given that the number of missing days is relatively small, we can proceed with our analysis and simply note the missing days in our presentation. That way, anyone who reviews our visualizations or presentation will know that the data depicted doesn’t include those dates.
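One way to keep those gaps visible in later work (a sketch, assuming we aggregate strikes by day) is to reindex the daily totals against full_date_range, so the eight missing dates show up as NaN instead of silently disappearing:
# Sum strikes per day, then reindex against the full 2018 calendar;
# the eight missing dates become NaN, making the gaps explicit
daily_strikes = df.groupby('date')['number_of_strikes'].sum()
daily_strikes = daily_strikes.reindex(full_date_range)
print(daily_strikes.isnull().sum())   # 8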
Validate number_of_strikes column
Let’s make a boxplot to better understand the range of values in the data.
# Boxplot of the full distribution, outliers included
sns.boxplot(y = df['number_of_strikes'])
[Boxplot of number_of_strikes, full range with outliers]
The distribution is heavily right-skewed. Most records show fewer than five strikes at a given location on a given day, but some show more than 2,000. We can remove the outliers from the boxplot visualization to better understand where the majority of our data is distributed.
# Boxplot with outliers hidden, to show the interquartile range
sns.boxplot(y = df['number_of_strikes'], showfliers=False)
[Boxplot of number_of_strikes with outliers hidden]
The interquartile range is approximately 2–12 strikes. But we know from the previous boxplot that there are many outlier records with hundreds or even thousands of strikes. This exercise confirms that most of the records in our data have plausible values for the number of strikes.
If most of the values were up in the thousands, we might be more skeptical.
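To put numbers on what the boxplots show, a short check (a sketch using the conventional Q3 + 1.5 × IQR whisker rule) computes the upper bound and counts the records above it:
# Compute quartiles and the conventional upper whisker (Q3 + 1.5 * IQR),
# then count how many records exceed it
q1 = df['number_of_strikes'].quantile(0.25)
q3 = df['number_of_strikes'].quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
print(upper_bound)                                    # 27.0, given Q1 = 2 and Q3 = 12
print((df['number_of_strikes'] > upper_bound).sum())  # number of outlier records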
Validate latitude and longitude columns
Finally, let’s verify that all of the latitudes and longitudes included in this dataset are in the United States. We’ll plot the points on a map to make sure the points in the data are relevant and not in unexpected locations.
Because this can be a computationally intensive process, we’ll avoid redundant computation by dropping rows with duplicate latitude-longitude combinations. We can do this because the purpose here is to examine the locations that had lightning strikes; how many strikes they had, or when, doesn’t matter.
# Create new df only of unique latitude and longitude combinations
df_points = df[['latitude', 'longitude']].drop_duplicates()
df_points.head()
| | latitude | longitude |
|---|---|---|
| 0 | 27.0 | -75.0 |
| 1 | 29.0 | -78.4 |
| 2 | 27.0 | -73.9 |
| 3 | 27.0 | -73.8 |
| 4 | 28.0 | -79.0 |
Plotting hundreds of thousands of data points on a map takes a lot of computing power. To keep the runtime short, we’ll use the Plotly Express package, which is designed to keep runtimes as low as possible.
# Plot each unique strike location on a map
p = px.scatter_geo(df_points, lat = 'latitude', lon = 'longitude')
p.show()
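The map gives a visual confirmation. A complementary programmatic check (a sketch; the latitude and longitude bounds below are a rough window around the observed collection region, chosen for illustration rather than taken from the original analysis) flags any points that fall outside the expected area:
# Rough lat/lon window around the expected collection region
# (approximate bounds, an assumption for illustration)
outside = df_points[
    ~df_points['latitude'].between(16, 52)
    | ~df_points['longitude'].between(-142, -43)
]
print(len(outside))   # 0 if every point falls inside the window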