Cleaning Data: Missing data, Sample 1

Dealing with missing data in Python

Objective

We will be examining lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for the month of August 2018. There are two datasets. The first includes five columns:

date, center_point_geom, longitude, latitude, number_of_strikes

The second dataset contains seven columns:

date, zip_code, city, state, state_code, center_point_geom, number_of_strikes

We want to combine the two datasets into a single dataframe that has all of the information from both datasets. Ideally, both datasets will have the same number of entries for the same locations on the same dates. If they don’t, we’ll investigate which data is missing.
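If we later need to see exactly which date/location pairs fail to match between two dataframes, pandas can tag each merged row with its origin via the indicator parameter. A minimal sketch with tiny stand-in dataframes (the values are hypothetical, not from the NOAA files):

```python
import pandas as pd

# Two tiny stand-in dataframes keyed on date + location (hypothetical values)
df1 = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-01', '2018-08-02'],
    'center_point_geom': ['POINT(-81.6 22.6)', 'POINT(-71.7 43.1)', 'POINT(-73.2 41.5)'],
    'number_of_strikes': [48, 1, 3],
})
df2 = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-02'],
    'center_point_geom': ['POINT(-71.7 43.1)', 'POINT(-73.2 41.5)'],
    'state_code': ['NH', 'CT'],
})

# indicator=True adds a '_merge' column marking each row as
# 'both', 'left_only', or 'right_only'
check = df1.merge(df2, how='outer', on=['date', 'center_point_geom'], indicator=True)
print(check['_merge'].value_counts())
```

Filtering on `_merge == 'left_only'` would then isolate the unmatched rows directly.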

Imports and Loading data
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

# Read in first dataset
df = pd.read_csv('eda_missing_data_dataset1.csv')
df.head()
         date  center_point_geom  longitude  latitude  number_of_strikes
0  2018-08-01  POINT(-81.6 22.6)      -81.6      22.6                 48
1  2018-08-01  POINT(-81.1 22.6)      -81.1      22.6                 32
2  2018-08-01  POINT(-80.9 22.6)      -80.9      22.6                118
3  2018-08-01  POINT(-80.8 22.6)      -80.8      22.6                 69
4  2018-08-01  POINT(-98.4 22.8)      -98.4      22.8                 44
df.shape
(717530, 5)
# Read in second dataset
df_zip = pd.read_csv('eda_missing_data_dataset2.csv')
df_zip.head()
         date  zip_code                                 city          state  state_code   center_point_geom  number_of_strikes
0  2018-08-08      3281                                Weare  New Hampshire          NH   POINT(-71.7 43.1)                  1
1  2018-08-14      6488                 Heritage Village CDP    Connecticut          CT   POINT(-73.2 41.5)                  3
2  2018-08-16     97759  Sisters city, Black Butte Ranch CDP         Oregon          OR  POINT(-121.4 44.3)                  3
3  2018-08-18      6776                      New Milford CDP    Connecticut          CT   POINT(-73.4 41.6)                 48
4  2018-08-08      1077                            Southwick  Massachusetts          MA     POINT(-72.8 42)                  2
df_zip.shape
(323700, 7)
Joining data

The second dataset has less than half as many rows as the first one. Let's find out which rows in the first dataset have no match in the second by using the merge method.

# Left-join the two datasets
df_joined = df.merge(df_zip, how='left', on=['date','center_point_geom'])
df_joined.head()
         date  center_point_geom  longitude  latitude  number_of_strikes_x  zip_code  city  state  state_code  number_of_strikes_y
0  2018-08-01  POINT(-81.6 22.6)      -81.6      22.6                   48       NaN   NaN    NaN         NaN                  NaN
1  2018-08-01  POINT(-81.1 22.6)      -81.1      22.6                   32       NaN   NaN    NaN         NaN                  NaN
2  2018-08-01  POINT(-80.9 22.6)      -80.9      22.6                  118       NaN   NaN    NaN         NaN                  NaN
3  2018-08-01  POINT(-80.8 22.6)      -80.8      22.6                   69       NaN   NaN    NaN         NaN                  NaN
4  2018-08-01  POINT(-98.4 22.8)      -98.4      22.8                   44       NaN   NaN    NaN         NaN                  NaN

The unique columns of each original dataframe also appear in the merged dataframe. But both original dataframes had a column named number_of_strikes that was not used as a merge key. Pandas handles this by keeping both columns in the new dataframe, appending _x to the one from the left dataframe and _y to the one from the right.
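This suffix behavior is easy to see on a toy example (hypothetical values; the suffixes parameter shown at the end is an optional way to choose clearer names):

```python
import pandas as pd

# Two mini-frames sharing a non-key column name (hypothetical values)
left = pd.DataFrame({'key': [1, 2], 'number_of_strikes': [10, 20]})
right = pd.DataFrame({'key': [1, 2], 'number_of_strikes': [10, 25]})

# The overlapping column gets default _x / _y suffixes
merged = left.merge(right, how='left', on='key')
print(merged.columns.tolist())

# Custom suffixes can make the origin of each column clearer
merged2 = left.merge(right, how='left', on='key', suffixes=('_set1', '_set2'))
print(merged2.columns.tolist())
```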

# Get descriptive statistics of the joined dataframe
df_joined.describe()
           longitude       latitude  number_of_strikes_x       zip_code  number_of_strikes_y
count  717530.000000  717530.000000        717530.000000  323700.000000        323700.000000
mean      -90.875445      33.328572            21.637081   57931.958996            25.410587
std        13.648429       7.938831            48.029525   22277.327411            57.421824
min      -133.900000      16.600000             1.000000    1002.000000             1.000000
25%      -102.800000      26.900000             3.000000   38260.750000             3.000000
50%       -90.300000      33.200000             6.000000   59212.500000             8.000000
75%       -80.900000      39.400000            21.000000   78642.000000            24.000000
max       -43.800000      51.700000          2211.000000   99402.000000          2211.000000

The count information confirms that the new dataframe is missing some data.
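Simple arithmetic on the counts above already quantifies the gap:

```python
# Counts taken from the describe() output above
total_rows = 717_530        # rows in the joined dataframe
matched_rows = 323_700      # rows with zip/city/state info

missing_rows = total_rows - matched_rows
print(missing_rows)                          # rows lacking location details
print(f'{missing_rows / total_rows:.1%}')    # share of rows missing data
```

So more than half of the joined rows have no zip code, city, or state information.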

Checking missing data

Now let’s check how many missing state locations we have by using isnull() to create a Boolean mask that we’ll apply to df_joined.

# Create a new df of just the rows that are missing data
df_null_geo = df_joined[pd.isnull(df_joined.state_code)]
df_null_geo.shape
(393830, 10)

(Note that using the state_code column to create this mask is an arbitrary decision. We could have selected zip_code, city, or state instead and gotten the same results.)
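Another quick check that avoids picking any single column is isna().sum(), which counts missing values in every column at once. A small sketch on a stand-in frame (hypothetical values mimicking the joined structure):

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same kind of gaps as df_joined (hypothetical values)
df_demo = pd.DataFrame({
    'date': ['2018-08-01', '2018-08-08'],
    'state_code': [np.nan, 'NH'],
    'number_of_strikes_y': [np.nan, 1.0],
})

# isna().sum() returns the count of missing values per column
print(df_demo.isna().sum())
```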

Or we could quickly use the info() method to check the non-null counts of every column.

# Get non-null counts on merged dataframe
df_joined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 717530 entries, 0 to 717529
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 717530 non-null object
1 center_point_geom 717530 non-null object
2 longitude 717530 non-null float64
3 latitude 717530 non-null float64
4 number_of_strikes_x 717530 non-null int64
5 zip_code 323700 non-null float64
6 city 323700 non-null object
7 state 323700 non-null object
8 state_code 323700 non-null object
9 number_of_strikes_y 323700 non-null float64
dtypes: float64(4), int64(1), object(5)
memory usage: 60.2+ MB
df_null_geo.head()
         date  center_point_geom  longitude  latitude  number_of_strikes_x  zip_code  city  state  state_code  number_of_strikes_y
0  2018-08-01  POINT(-81.6 22.6)      -81.6      22.6                   48       NaN   NaN    NaN         NaN                  NaN
1  2018-08-01  POINT(-81.1 22.6)      -81.1      22.6                   32       NaN   NaN    NaN         NaN                  NaN
2  2018-08-01  POINT(-80.9 22.6)      -80.9      22.6                  118       NaN   NaN    NaN         NaN                  NaN
3  2018-08-01  POINT(-80.8 22.6)      -80.8      22.6                   69       NaN   NaN    NaN         NaN                  NaN
4  2018-08-01  POINT(-98.4 22.8)      -98.4      22.8                   44       NaN   NaN    NaN         NaN                  NaN
Understanding what data is missing

Now that we’ve merged all of our data together and isolated the rows with missing data, we can better understand what data is missing by plotting the longitude and latitude of locations that are missing city, state, and zip code data.

# Create new df of just latitude, longitude, and number of strikes and group by latitude and longitude
top_missing = (df_null_geo[['latitude','longitude','number_of_strikes_x']]
               .groupby(['latitude','longitude'])
               .sum()
               .sort_values('number_of_strikes_x', ascending=False)
               .reset_index())
top_missing.head(10)
   latitude  longitude  number_of_strikes_x
0      22.4      -84.2                 3841
1      22.9      -82.9                 3184
2      22.4      -84.3                 2999
3      22.9      -83.0                 2754
4      22.5      -84.1                 2746
5      22.5      -84.2                 2738
6      22.3      -81.0                 2680
7      22.9      -82.4                 2652
8      22.9      -82.3                 2618
9      22.3      -84.3                 2551
Plotting the missing data

Let’s use plotly to create a geographic scatter plot. Express is a helpful plotly module that speeds up coding by doing a lot of the back-end work for us. Because this dataset has hundreds of thousands of points, we’ll also filter the dataframe down before plotting; otherwise run times could be long or the plot could fail to render.

import plotly.express as px                         # Be sure to import express
# Filter the dataframe first; plotting every point could be slow or break
fig = px.scatter_geo(top_missing[top_missing.number_of_strikes_x>=300],  # Input Pandas DataFrame
                    lat="latitude",                 # DataFrame column with latitude
                    lon="longitude",                # DataFrame column with longitude
                    size="number_of_strikes_x")     # Size markers by number of strikes
fig.update_layout(
    title_text = 'Missing data',                    # Create a Title
)

fig.show()

(The image above is static, but when the code is run in a Python notebook such as Jupyter, it renders an interactive plot.)

Let’s scale the map down to only the geographic area that we are interested in, the United States.

import plotly.express as px                         # Be sure to import express
fig = px.scatter_geo(top_missing[top_missing.number_of_strikes_x>=300],  # Input Pandas DataFrame
                    lat="latitude",                 # DataFrame column with latitude
                    lon="longitude",                # DataFrame column with longitude
                    size="number_of_strikes_x")     # Size markers by number of strikes
fig.update_layout(
    title_text = 'Missing data',                    # Create a Title
    geo_scope='usa',                                # Plot only the USA instead of globe
)

fig.show()

The resulting map shows that a majority of these missing values crop up along the borders or over bodies of water such as lakes, the ocean, or the Gulf of Mexico. Given that the missing values are in the zip code, city, and state columns, it makes sense that locations outside U.S. zip-code areas were left blank.
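As a rough, optional check of that interpretation, we could flag points that fall outside a coarse bounding box for the contiguous United States. The box coordinates and sample points below are approximations chosen for illustration, not part of the original analysis:

```python
import pandas as pd

# Small sample of missing-location points (hypothetical, echoing the table above)
pts = pd.DataFrame({
    'latitude': [22.4, 43.1, 22.9],
    'longitude': [-84.2, -71.7, -82.9],
})

# Coarse contiguous-U.S. bounding box (approximate corner coordinates)
in_conus = (
    pts['latitude'].between(24.5, 49.5)
    & pts['longitude'].between(-125.0, -66.9)
)
pts['outside_conus'] = ~in_conus
print(pts)
```

Points flagged True here are unlikely to fall within any U.S. zip-code area, which is consistent with their location fields being blank.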

But we can also see some locations with missing data that are not over bodies of water. These missing data points, at latitudes and longitudes on land, are the kind we would want to reach out to NOAA about.

