Basic EDA: Sample 4

Sample 4: Lightning Strikes

Following my notes on Python’s basic tools for data work, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analytics Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as a future reference. Their content can be reached via Coursera; a free version is also available without claiming the certificate.

Introduction

As we did with sample 1 and sample 3, we will again work on lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for the year 2018. We will use our structuring tools to learn more about whether lightning strikes are more prevalent on some days than others.

We will follow these steps:

  • Find the locations with the greatest number of strikes within a single day 
  • Examine the locations that had the greatest number of days with at least one lightning strike 
  • Determine whether certain days of the week had more lightning strikes than others 
  • Add data from 2016 and 2017 and, for each month, calculate the percentage of total lightning strikes for that year that occurred in that month 
  • Plot this data on a bar graph

Imports

import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

# Read in the 2018 data. 
df = pd.read_csv('eda_structuring_with_python_dataset1.csv')
df.head()
         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

As we did earlier with the same dataset, let’s first convert the date column to datetime.

# Convert the 'date' column to datetime.
df['date'] = pd.to_datetime(df['date']) 

Let’s check the shape of the dataframe.

df.shape
(3401012, 3)

Now let's check for duplicates.

df.drop_duplicates().shape
(3401012, 3)

The code above returns the number of rows and columns remaining after duplicates are removed. Since the shape is unchanged, the dataframe contains no duplicates. (I usually don’t drop duplicates without first understanding what they are.)
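
For reference, here is a minimal sketch of how I would inspect duplicates before dropping them (with this dataset it returns an empty frame, since none exist):

# Show every row that has at least one duplicate elsewhere in the dataframe.
df[df.duplicated(keep=False)].sort_values(by='date').head()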

Locations with most strikes in a single day

# Sort by number of strikes in descending order.
df.sort_values(by='number_of_strikes', ascending=False).head(10)
              date  number_of_strikes  center_point_geom
302758  2018-08-20               2211  POINT(-92.5 35.5)
278383  2018-08-16               2142  POINT(-96.1 36.1)
280830  2018-08-17               2061  POINT(-90.2 36.1)
280453  2018-08-17               2031  POINT(-89.9 35.9)
278382  2018-08-16               1902  POINT(-96.2 36.1)
11517   2018-02-10               1899  POINT(-95.5 28.1)
277506  2018-08-16               1878  POINT(-89.7 31.5)
24906   2018-02-25               1833  POINT(-98.7 28.9)
284320  2018-08-17               1767  POINT(-90.1 36)
24825   2018-02-25               1741  POINT(-98 29)

Locations with most days with at least one lightning strike

# Identify the locations that appear most in the dataset.
df.center_point_geom.value_counts()
POINT(-81.5 22.5)     108
POINT(-84.1 22.4)     108
POINT(-82.5 22.9)     107
POINT(-82.7 22.9)     107
POINT(-82.5 22.8)     106
                     ...
POINT(-119.3 35.1)      1
POINT(-119.3 35)        1
POINT(-119.6 35.6)      1
POINT(-119.4 35.6)      1
POINT(-58.5 45.3)       1
Name: center_point_geom, Length: 170855, dtype: int64

Now let’s examine whether the values are evenly distributed, or whether 108 is an unusually high number of days with at least one lightning strike.
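
One quick way to gauge this is to summarize the counts themselves; this is just standard pandas describe() applied to the value counts:

# Summary statistics of days-with-lightning per location.
df.center_point_geom.value_counts().describe()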

# Identify the top 20 locations with most days of lightning.
df.center_point_geom.value_counts()[:20].rename_axis('unique_values').reset_index(name='counts').style.background_gradient()
        unique_values  counts
0   POINT(-81.5 22.5)     108
1   POINT(-84.1 22.4)     108
2   POINT(-82.5 22.9)     107
3   POINT(-82.7 22.9)     107
4   POINT(-82.5 22.8)     106
5   POINT(-84.2 22.3)     106
6     POINT(-76 20.5)     105
7   POINT(-75.9 20.4)     105
8   POINT(-82.2 22.9)     104
9     POINT(-78 18.2)     104
10  POINT(-83.9 22.5)     103
11    POINT(-84 22.4)     102
12    POINT(-82 22.8)     102
13    POINT(-82 22.4)     102
14  POINT(-82.3 22.9)     102
15    POINT(-78 18.3)     102
16  POINT(-84.1 22.5)     101
17  POINT(-75.5 20.6)     101
18  POINT(-84.2 22.4)     101
19    POINT(-76 20.4)     101

Lightning strikes by day of week 

Let’s check whether any particular day of the week had more or fewer lightning strikes than others. First, we’ll create two new columns using the two methods below.

(More about dt.isocalendar(): pandas.Series.dt.isocalendar documentation.
More about dt.day_name(): pandas.Series.dt.day_name documentation.)

# Create two new columns.
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()
         date  number_of_strikes center_point_geom  week    weekday
0  2018-01-03                194     POINT(-75 27)     1  Wednesday
1  2018-01-03                 41   POINT(-78.4 29)     1  Wednesday
2  2018-01-03                 33   POINT(-73.9 27)     1  Wednesday
3  2018-01-03                 38   POINT(-73.8 27)     1  Wednesday
4  2018-01-03                 92     POINT(-79 28)     1  Wednesday

Now, we can calculate the mean number of lightning strikes for each weekday of the year.

# Calculate the mean count of lightning strikes for each weekday.
df[['weekday','number_of_strikes']].groupby(['weekday']).mean()
           number_of_strikes
weekday
Friday             13.349972
Monday             13.152804
Saturday           12.732694
Sunday             12.324717
Thursday           13.240594
Tuesday            13.813599
Wednesday          13.224568

It seems that Saturday and Sunday have fewer lightning strikes on average than the other five days. To better understand what this data is telling us, let’s create a box plot.

A boxplot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles.

# Define order of days for the plot.
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']

# Create boxplots of strike counts for each day of week.
g = sns.boxplot(data=df, 
            x='weekday',
            y='number_of_strikes', 
            order=weekday_order, 
            showfliers=False            # outliers are left off the box plot
            )
g.set_title('Lightning distribution per weekday (2018)')
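
Alongside the plot, we can pull the per-weekday medians directly from the data:

# Median strike count per weekday.
df.groupby('weekday')['number_of_strikes'].median()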

Notice that the median remains the same on all of the days of the week. As for Saturday and Sunday, however, the distributions are both lower than they are during the rest of the week. We also know that the mean numbers of strikes that occurred on Saturday and Sunday were lower than on the other weekdays. Why might this be? 

Perhaps the aerosol particles emitted by factories and vehicles increase the likelihood of lightning strikes. In the U.S., Saturday and Sunday are days when many people don’t work, so there may be fewer factories operating and fewer cars on the road. This is only speculation, but it’s one possible path for further exploration. While we don’t know the cause for sure, the data clearly shows that the total number of weekend lightning strikes is lower than on weekdays.
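
One way to put a rough number on that weekend effect is to group by a weekend flag. This is a quick sketch; is_weekend here is a throwaway Series rather than a new column, so it won’t interfere with the concatenation we do later:

# Compare total strikes on weekends vs. weekdays.
is_weekend = df['weekday'].isin(['Saturday', 'Sunday'])
df.groupby(is_weekend)['number_of_strikes'].sum()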

Monthly lightning strikes 2016–2018

To examine this further, let’s look at our data across multiple years. We will calculate, for each year, the percentage of total lightning strikes that occurred in each month, and then plot this data on a bar graph.

# Import 2016–2017 data
df_2 = pd.read_csv('eda_structuring_with_python_dataset2.csv')
df_2.head()
         date  number_of_strikes  center_point_geom
0  2016-01-04                 55  POINT(-83.2 21.1)
1  2016-01-04                 33  POINT(-83.1 21.1)
2  2016-01-05                 46  POINT(-77.5 22.1)
3  2016-01-05                 28  POINT(-76.8 22.3)
4  2016-01-05                 28    POINT(-77 22.1)

As we did earlier, we will convert the date column to datetime.

# Convert `date` column to datetime.
df_2['date'] = pd.to_datetime(df_2['date'])

To merge three years of data together, we need to make sure each dataset is formatted the same. The new datasets do not have the extra columns week and weekday that we created earlier. There’s an easy way to merge the three years of data and remove the extra columns at the same time.

# Create a new dataframe combining 2016–2017 data with 2018 data.
union_df = pd.concat([df.drop(['weekday','week'],axis=1), df_2], ignore_index=True)
union_df.head()
         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

Note that the above code doesn’t permanently modify df. The columns drop only for this operation.
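
We can verify that quickly:

# drop() returns a new dataframe by default (inplace=False), so df keeps its columns.
print('week' in df.columns)        # True
print('week' in union_df.columns)  # False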

As we did earlier, we will create three new columns that isolate the year, month number, and month name.

# Add 3 new columns.
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head()
         date  number_of_strikes center_point_geom  year  month month_txt
0  2018-01-03                194     POINT(-75 27)  2018      1   January
1  2018-01-03                 41   POINT(-78.4 29)  2018      1   January
2  2018-01-03                 33   POINT(-73.9 27)  2018      1   January
3  2018-01-03                 38   POINT(-73.8 27)  2018      1   January
4  2018-01-03                 92     POINT(-79 28)  2018      1   January

Let’s check the overall lightning strike count for each year.

# Calculate total number of strikes per year
union_df[['year','number_of_strikes']].groupby(['year']).sum()
      number_of_strikes
year
2016           41582229
2017           35095195
2018           44600989

Because the totals are different, it might be interesting as part of our analysis to see lightning strike percentages by month of each year.

# Calculate total lightning strikes for each month of each year.
lightning_by_month = union_df.groupby(['month_txt','year']).agg(
    number_of_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
    ).reset_index()

lightning_by_month.head()
  month_txt  year  number_of_strikes
0     April  2016            2636427
1     April  2017            3819075
2     April  2018            1524339
3    August  2016            7250442
4    August  2017            6021702

Similarly, we can use the agg() function to calculate the same yearly totals we found before.

# Calculate total lightning strikes for each year.
lightning_by_year = union_df.groupby(['year']).agg(
  year_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_year.head()
   year  year_strikes
0  2016      41582229
1  2017      35095195
2  2018      44600989

We created these two dataframes, lightning_by_month and lightning_by_year, in order to derive the percentage of each year’s lightning strikes that occurred in each month.

# Combine 'lightning_by_month' and 'lightning_by_year' dataframes into single dataframe.
percentage_lightning = lightning_by_month.merge(lightning_by_year,on='year')
percentage_lightning.head()
  month_txt  year  number_of_strikes  year_strikes
0     April  2016            2636427      41582229
1    August  2016            7250442      41582229
2  December  2016             316450      41582229
3  February  2016             312676      41582229
4   January  2016             313595      41582229

Now we will create a new column in our new dataframe that represents the percentage of total lightning strikes that occurred during each month for each year.

# Create new `percentage_lightning_per_month` column.
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
                                                          percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()
  month_txt  year  number_of_strikes  year_strikes  percentage_lightning_per_month
0     April  2016            2636427      41582229                        6.340273
1    August  2016            7250442      41582229                       17.436396
2  December  2016             316450      41582229                        0.761022
3  February  2016             312676      41582229                        0.751946
4   January  2016             313595      41582229                        0.754156
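
As a quick sanity check, the monthly percentages within each year should sum to roughly 100:

# Each year's monthly percentages should total ~100.
percentage_lightning.groupby('year')['percentage_lightning_per_month'].sum()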

Now we can plot the percentages by month in a bar graph.

plt.figure(figsize=(10,6))
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

sns.barplot(
    data = percentage_lightning,
    x = 'month_txt',
    y = 'percentage_lightning_per_month',
    hue = 'year',
    order = month_order )
plt.xlabel("Month")
plt.ylabel("% of lightning strikes")
plt.title("% of lightning strikes each month (2016-2018)")

All three years show a clear seasonal pattern, and one month stands out: August. More than one third of the lightning strikes in 2018 happened in August.

The next step for a data professional trying to understand these findings might be to research storm and hurricane data, to learn whether those factors contributed to a greater number of lightning strikes for this particular month.
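
For example, if we had a table of monthly storm counts (the file storm_counts.csv and its columns below are hypothetical, just to illustrate the shape of such a join), a first pass might look like this:

# Hypothetical: join assumed monthly storm counts onto our percentages,
# then check how strongly the two series move together.
storms = pd.read_csv('storm_counts.csv')  # assumed columns: year, month_txt, storm_count
merged = percentage_lightning.merge(storms, on=['year', 'month_txt'])
merged['percentage_lightning_per_month'].corr(merged['storm_count'])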

