Basic EDA: Sample 3 | Alper Kokcu

Sample 3: Lightning Strikes

Following my notes on Python’s basic tools for data works, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analysis Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as future reference. Their content can be reached via Coursera. A free version is also available without claiming the certificate.

Introduction

As we did with sample 1, we will work with 2016–2018 lightning strike data from the National Oceanic and Atmospheric Association (NOAA). But this time we will calculate weekly sums of lightning strikes and plot them on a bar graph. Then, we will calculate quarterly lightning strike totals and plot them on bar graphs.

Imports

import pandas as pd 
import datetime as dt 
import matplotlib.pyplot as plt
Import seaborn as sns

# Read in the data. 
df = pd.read_csv('eda_manipulate_date_strings_with_python.csv')
df.head()

	date	number_of_strikes	center_point_geom
0	2016-08-05	16	POINT(-101.5 24.7)
1	2016-08-05	16	POINT(-85 34.3)
2	2016-08-05	16	POINT(-89 41.4)
3	2016-08-05	16	POINT(-89.8 30.7)
4	2016-08-05	16	POINT(-86.2 37.9)

Create new time columns with string format time

When a date column data type is a string object, it limits what we can do with it. That’s why we’ll convert the column to datetime (once again).

# Convert the 'date' column to datetime.
df['date'] = pd.to_datetime(df['date'])

Now we can create new columns to represent week, month, quarter, and year by using datetime.strftime() method. We will use strftime format codes: %Y for year, %V for week number, %q for quarter.

# Create four new columns.
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

df.head(10)

	date	number_of_strikes	center_point_geom	week	month	quarter	year
0	2016-08-05	16	POINT(-101.5 24.7)	2016-W31	2016-08	2016-Q3	2016
1	2016-08-05	16	POINT(-85 34.3)	2016-W31	2016-08	2016-Q3	2016
2	2016-08-05	16	POINT(-89 41.4)	2016-W31	2016-08	2016-Q3	2016
3	2016-08-05	16	POINT(-89.8 30.7)	2016-W31	2016-08	2016-Q3	2016
4	2016-08-05	16	POINT(-86.2 37.9)	2016-W31	2016-08	2016-Q3	2016
5	2016-08-05	16	POINT(-97.8 38.9)	2016-W31	2016-08	2016-Q3	2016
6	2016-08-05	16	POINT(-81.9 36)	2016-W31	2016-08	2016-Q3	2016
7	2016-08-05	16	POINT(-90.9 36.7)	2016-W31	2016-08	2016-Q3	2016
8	2016-08-05	16	POINT(-106.6 26.1)	2016-W31	2016-08	2016-Q3	2016
9	2016-08-05	16	POINT(-108 31.6)	2016-W31	2016-08	2016-Q3	2016

Plot the number of weekly lightning strikes in 2018

First we’ll start by filtering the dataset.

# Create a new dataframe view of just 2018 data, summed by week.
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum().reset_index()
df_by_week_2018.head()

	week	number_of_strikes
0	2018-W01	34843
1	2018-W02	353425
2	2018-W03	37132
3	2018-W04	412772
4	2018-W05	34972

Note: In pandas v.2.X+ we must set ‘numeric_only=True’ in the sum() function or else it will throw an error.

Now we have the table to plot.

# Plot a bar graph of weekly strike totals in 2018.
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per week (2018)")

Because the x-axis labels are too crowded, we need to adjust the graph.

plt.figure(figsize = (20, 5))               # Increase output size.
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.plot()
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per week (2018)")
plt.xticks(rotation = 45, fontsize = 8)     # Rotate x-axis labels and decrease font size.
plt.show()

Plot the number of quarterly lightning strikes from 2016–2018

Now let’s work on the full date range of available data and use quarters as our smaller chunks to investigate.

# Group 2016-2018 data by quarter and sum.
df_by_quarter = df.groupby(['quarter']).sum().reset_index()

# Format as text, in millions.
df_by_quarter['number_of_strikes_formatted'] = df_by_quarter['number_of_strikes'].div(1000000).round(1).astype(str) + 'M'

df_by_quarter.head()

	quarter	number_of_strikes	number_of_strikes_formatted
0	2016-Q1	2683798	2.7M
1	2016-Q2	15084857	15.1M
2	2016-Q3	21843820	21.8M
3	2016-Q4	1969754	2.0M
4	2017-Q1	2444279	2.4M

Above, we created a new formatted column, divided its results by one million, rounded to one digit after the decimal point, and lastly converted it to a string to add “M” to the end, representing millions. This was for us to easily read the table and also use on the bar graph below as labels.

Add labels

To be able to use these new formatted column values as labels, first we need to define our own functions. The function uses plt.text(), which is a pyplot function whose positional arguments are x, y, and s.

x represents the x-axis coordinates, y represents the y-axis coordinates, and s represents the text that we want to appear at these coordinates. For more information: pyplot documentation.

def addlabels(x, y, labels):
    '''
    Iterates over data and plots text labels above each bar of bar graph.
    '''
    for i in range(len(x)):
        plt.text(i, y[i], labels[i], ha = 'center', va = 'bottom')

Now we can plot the bar.

plt.figure(figsize = (15, 5))
plt.bar(x = df_by_quarter['quarter'], height = df_by_quarter['number_of_strikes'])

addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])

plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()

Create a grouped bar chart

In most cases there will be some other ways to show the same data differently. For instance, now we can create a grouped bar chart to better compare year-over-year changes each quarter.

But again, before plotting we need to prepare our data for it. We’ll create two new columns that break out the quarter and year from the quarter column. We will use the quarter column and take the last two characters to get quarter_number, and take the first four characters to get the year.

# Create two new columns.
df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()

	quarter	number_of_strikes	number_of_strikes_formatted	quarter_number	year
0	2016-Q1	2683798	2.7M	Q1	2016
1	2016-Q2	15084857	15.1M	Q2	2016
2	2016-Q3	21843820	21.8M	Q3	2016
3	2016-Q4	1969754	2.0M	Q4	2016
4	2017-Q1	2444279	2.4M	Q1	2017

My Side Note: Before coming to this point (by this I mean, before creating df_by_quarter dataframe), I’d probably retrieve months, weeks, and quarters (or whatever I be in need of) from the date column by simply using dt.year, dt.quarter, etc.

Now we can create our bar chart with seaborn.

plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M', 
                   (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6), 
                   ha = 'center', va = 'bottom', 
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()

In the code above, Google uses a different method (much complicated one) to add labels on each bar. I need more time to grasp those.