Basic EDA: Sample 4

Sample 4: Lightning Strikes

Following my notes on Python’s basic tools for data work, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analytics Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as a future reference. Their content can be reached via Coursera; a free version is also available without claiming the certificate.

Introduction

As we did with sample 1 and sample 3, we will again work on lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for the year 2018. We will use our structuring tools to learn more about whether lightning strikes are more prevalent on some days than others.

We will follow these steps:

  • Find the locations with the greatest number of strikes within a single day 
  • Examine the locations that had the greatest number of days with at least one lightning strike 
  • Determine whether certain days of the week had more lightning strikes than others 
  • Add data from 2016 and 2017 and, for each month, calculate the percentage of total lightning strikes for that year that occurred in that month 
  • Plot this data on a bar graph

Imports

import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

# Read in the 2018 data. 
df = pd.read_csv('eda_structuring_with_python_dataset1.csv')
df.head()
         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

As we did earlier with the same dataset, let’s first convert the date column to datetime.

# Convert the 'date' column to datetime.
df['date'] = pd.to_datetime(df['date']) 

Let’s check the shape of the dataframe.

df.shape
(3401012, 3)

Now let's check for duplicates.

df.drop_duplicates().shape
(3401012, 3)

The code above returns the number of rows and columns remaining after duplicates are removed. Since the shape is unchanged, the dataframe contains no duplicates. (I usually don’t drop duplicates without first understanding what they are.)
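
For reference, here is a minimal sketch of how I would inspect duplicates before dropping them (with this dataset it returns an empty frame, since none exist):

# Show every row that has at least one duplicate elsewhere in the dataframe.
df[df.duplicated(keep=False)].sort_values(by='date').head()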

Locations with most strikes in a single day

# Sort by number of strikes in descending order.
df.sort_values(by='number_of_strikes', ascending=False).head(10)
              date  number_of_strikes  center_point_geom
302758  2018-08-20               2211  POINT(-92.5 35.5)
278383  2018-08-16               2142  POINT(-96.1 36.1)
280830  2018-08-17               2061  POINT(-90.2 36.1)
280453  2018-08-17               2031  POINT(-89.9 35.9)
278382  2018-08-16               1902  POINT(-96.2 36.1)
11517   2018-02-10               1899  POINT(-95.5 28.1)
277506  2018-08-16               1878  POINT(-89.7 31.5)
24906   2018-02-25               1833  POINT(-98.7 28.9)
284320  2018-08-17               1767  POINT(-90.1 36)
24825   2018-02-25               1741  POINT(-98 29)

Locations with most days with at least one lightning strike

# Identify the locations that appear most in the dataset.
df.center_point_geom.value_counts()
POINT(-81.5 22.5)     108
POINT(-84.1 22.4)     108
POINT(-82.5 22.9)     107
POINT(-82.7 22.9)     107
POINT(-82.5 22.8)     106
                     ...
POINT(-119.3 35.1)      1
POINT(-119.3 35)        1
POINT(-119.6 35.6)      1
POINT(-119.4 35.6)      1
POINT(-58.5 45.3)       1
Name: center_point_geom, Length: 170855, dtype: int64

Now let’s examine whether the values are evenly distributed, or whether 108 is an unusually high number of days with at least one lightning strike.
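
One quick way to gauge this is to summarize the counts themselves; this is just standard pandas describe() applied to the value counts:

# Summary statistics of days-with-lightning per location.
df.center_point_geom.value_counts().describe()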

# Identify the top 20 locations with most days of lightning.
df.center_point_geom.value_counts()[:20].rename_axis('unique_values').reset_index(name='counts').style.background_gradient()
        unique_values  counts
0   POINT(-81.5 22.5)     108
1   POINT(-84.1 22.4)     108
2   POINT(-82.5 22.9)     107
3   POINT(-82.7 22.9)     107
4   POINT(-82.5 22.8)     106
5   POINT(-84.2 22.3)     106
6     POINT(-76 20.5)     105
7   POINT(-75.9 20.4)     105
8   POINT(-82.2 22.9)     104
9     POINT(-78 18.2)     104
10  POINT(-83.9 22.5)     103
11    POINT(-84 22.4)     102
12    POINT(-82 22.8)     102
13    POINT(-82 22.4)     102
14  POINT(-82.3 22.9)     102
15    POINT(-78 18.3)     102
16  POINT(-84.1 22.5)     101
17  POINT(-75.5 20.6)     101
18  POINT(-84.2 22.4)     101
19    POINT(-76 20.4)     101

Lightning strikes by day of week 

Let’s check whether any particular day of the week had more or fewer lightning strikes than others. First, we’ll create two new columns using the two methods below.

(More about dt.isocalendar(): pandas.Series.dt.isocalendar documentation.
More about dt.day_name(): pandas.Series.dt.day_name documentation.)

# Create two new columns.
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()
         date  number_of_strikes center_point_geom  week    weekday
0  2018-01-03                194     POINT(-75 27)     1  Wednesday
1  2018-01-03                 41   POINT(-78.4 29)     1  Wednesday
2  2018-01-03                 33   POINT(-73.9 27)     1  Wednesday
3  2018-01-03                 38   POINT(-73.8 27)     1  Wednesday
4  2018-01-03                 92     POINT(-79 28)     1  Wednesday

Now, we can calculate the mean number of lightning strikes for each weekday of the year.

# Calculate the mean count of lightning strikes for each weekday.
df[['weekday','number_of_strikes']].groupby(['weekday']).mean()
           number_of_strikes
weekday
Friday             13.349972
Monday             13.152804
Saturday           12.732694
Sunday             12.324717
Thursday           13.240594
Tuesday            13.813599
Wednesday          13.224568

It seems that Saturday and Sunday have fewer lightning strikes on average than the other five days. To better understand what this data is telling us, let’s create a box plot.

A boxplot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles.

# Define order of days for the plot.
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']

# Create boxplots of strike counts for each day of week.
g = sns.boxplot(data=df, 
            x='weekday',
            y='number_of_strikes', 
            order=weekday_order, 
            showfliers=False            # outliers are left off the box plot
            )
g.set_title('Lightning distribution per weekday (2018)')
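
Alongside the plot, we can pull the per-weekday medians directly from the data:

# Median strike count per weekday.
df.groupby('weekday')['number_of_strikes'].median()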

Notice that the median remains the same on all of the days of the week. As for Saturday and Sunday, however, the distributions are both lower than they are during the rest of the week. We also know that the mean numbers of strikes that occurred on Saturday and Sunday were lower than on the other weekdays. Why might this be? 

Perhaps the aerosol particles emitted by factories and vehicles increase the likelihood of lightning strikes. In the U.S., Saturday and Sunday are days when many people don’t work, so there may be fewer factories operating and fewer cars on the road. This is only speculation, but it’s one possible path for further exploration. While we don’t know the cause for sure, the data clearly shows that the total number of weekend lightning strikes is lower than on weekdays.
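
One way to put a rough number on that weekend effect is to group by a weekend flag. This is a quick sketch; is_weekend here is a throwaway Series rather than a new column, so it won’t interfere with the concatenation we do later:

# Compare total strikes on weekends vs. weekdays.
is_weekend = df['weekday'].isin(['Saturday', 'Sunday'])
df.groupby(is_weekend)['number_of_strikes'].sum()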

Monthly lightning strikes 2016–2018

To examine this further, let’s look at our data across multiple years. We will calculate, for each year, the percentage of total lightning strikes that occurred in each month, and then plot this data on a bar graph.

# Import 2016–2017 data
df_2 = pd.read_csv('eda_structuring_with_python_dataset2.csv')
df_2.head()
         date  number_of_strikes  center_point_geom
0  2016-01-04                 55  POINT(-83.2 21.1)
1  2016-01-04                 33  POINT(-83.1 21.1)
2  2016-01-05                 46  POINT(-77.5 22.1)
3  2016-01-05                 28  POINT(-76.8 22.3)
4  2016-01-05                 28    POINT(-77 22.1)

As we did earlier, we will convert the date column to datetime.

# Convert `date` column to datetime.
df_2['date'] = pd.to_datetime(df_2['date'])

To merge three years of data together, we need to make sure each dataset is formatted the same. The new datasets do not have the extra columns week and weekday that we created earlier. There’s an easy way to merge the three years of data and remove the extra columns at the same time.

# Create a new dataframe combining 2016–2017 data with 2018 data.
union_df = pd.concat([df.drop(['weekday','week'],axis=1), df_2], ignore_index=True)
union_df.head()
         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

Note that the above code doesn’t permanently modify df. The columns drop only for this operation.
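
We can verify that quickly:

# drop() returns a new dataframe by default (inplace=False), so df keeps its columns.
print('week' in df.columns)        # True
print('week' in union_df.columns)  # False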

As we did earlier, we will create three new columns that isolate the year, month number, and month name.

# Add 3 new columns.
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head()
         date  number_of_strikes center_point_geom  year  month month_txt
0  2018-01-03                194     POINT(-75 27)  2018      1   January
1  2018-01-03                 41   POINT(-78.4 29)  2018      1   January
2  2018-01-03                 33   POINT(-73.9 27)  2018      1   January
3  2018-01-03                 38   POINT(-73.8 27)  2018      1   January
4  2018-01-03                 92     POINT(-79 28)  2018      1   January

Let’s check the overall lightning strike count for each year.

# Calculate total number of strikes per year
union_df[['year','number_of_strikes']].groupby(['year']).sum()
      number_of_strikes
year
2016           41582229
2017           35095195
2018           44600989

Because the totals are different, it might be interesting as part of our analysis to see lightning strike percentages by month of each year.

# Calculate total lightning strikes for each month of each year.
lightning_by_month = union_df.groupby(['month_txt','year']).agg(
    number_of_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
    ).reset_index()

lightning_by_month.head()
  month_txt  year  number_of_strikes
0     April  2016            2636427
1     April  2017            3819075
2     April  2018            1524339
3    August  2016            7250442
4    August  2017            6021702

Similarly, we can use the agg() function to calculate the same yearly totals we found before.

# Calculate total lightning strikes for each year.
lightning_by_year = union_df.groupby(['year']).agg(
  year_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_year.head()
   year  year_strikes
0  2016      41582229
1  2017      35095195
2  2018      44600989

We created these two dataframes, lightning_by_month and lightning_by_year, in order to derive the percentage of each year’s lightning strikes that occurred in each month.

# Combine 'lightning_by_month' and 'lightning_by_year' dataframes into single dataframe.
percentage_lightning = lightning_by_month.merge(lightning_by_year,on='year')
percentage_lightning.head()
  month_txt  year  number_of_strikes  year_strikes
0     April  2016            2636427      41582229
1    August  2016            7250442      41582229
2  December  2016             316450      41582229
3  February  2016             312676      41582229
4   January  2016             313595      41582229

Now we will create a new column in our new dataframe that represents the percentage of total lightning strikes that occurred during each month for each year.

# Create new `percentage_lightning_per_month` column.
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
                                                          percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()
  month_txt  year  number_of_strikes  year_strikes  percentage_lightning_per_month
0     April  2016            2636427      41582229                        6.340273
1    August  2016            7250442      41582229                       17.436396
2  December  2016             316450      41582229                        0.761022
3  February  2016             312676      41582229                        0.751946
4   January  2016             313595      41582229                        0.754156
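
As a quick sanity check, the monthly percentages within each year should sum to roughly 100:

# Each year's monthly percentages should total ~100.
percentage_lightning.groupby('year')['percentage_lightning_per_month'].sum()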

Now we can plot the percentages by month in a bar graph.

plt.figure(figsize=(10,6))
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

sns.barplot(
    data = percentage_lightning,
    x = 'month_txt',
    y = 'percentage_lightning_per_month',
    hue = 'year',
    order = month_order )
plt.xlabel("Month")
plt.ylabel("% of lightning strikes")
plt.title("% of lightning strikes each month (2016-2018)")

All three years show a clear seasonal pattern, and one month stands out: August. More than one third of the lightning strikes in 2018 happened in August.

The next step for a data professional trying to understand these findings might be to research storm and hurricane data, to learn whether those factors contributed to a greater number of lightning strikes for this particular month.
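
For example, if we had a table of monthly storm counts (the file storm_counts.csv and its columns below are hypothetical, just to illustrate the shape of such a join), a first pass might look like this:

# Hypothetical: join assumed monthly storm counts onto our percentages,
# then check how strongly the two series move together.
storms = pd.read_csv('storm_counts.csv')  # assumed columns: year, month_txt, storm_count
merged = percentage_lightning.merge(storms, on=['year', 'month_txt'])
merged['percentage_lightning_per_month'].corr(merged['storm_count'])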

