Basic EDA: Sample 1 | Alper Kokcu

Sample 1: Lightning Strikes

Following my notes on Python’s basic tools for data works, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analysis Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as future reference. Their content can be reached via Coursera. A free version is also available without claiming the certificate.

Introduction

We will use pandas to examine 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA). Then, we will calculate the total number of strikes for each month and plot this information on a bar graph.

Imports

import pandas as pd 
import numpy as np 
import datetime as dt 
import matplotlib.pyplot as plt

# Read in the 2018 lightning strike dataset. 
df = pd.read_csv('eda_using_basic_data_functions_in_python_dataset1.csv')

First Inspection

# Inspect the first 10 rows.
df.head(10)

	date	number_of_strikes	center_point_geom
0	2018-01-03	194	POINT(-75 27)
1	2018-01-03	41	POINT(-78.4 29)
2	2018-01-03	33	POINT(-73.9 27)
3	2018-01-03	38	POINT(-73.8 27)
4	2018-01-03	92	POINT(-79 28)
5	2018-01-03	119	POINT(-78 28)
6	2018-01-03	35	POINT(-79.3 28)
7	2018-01-03	60	POINT(-79.1 28)
8	2018-01-03	41	POINT(-78.7 28)
9	2018-01-03	119	POINT(-78.6 28)

df.shape

(3401012, 3)

# Get more information about the data, including data types of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB

Convert the date column to datetime

Converting string dates to datetime will enable us to work with them much more easily.

# Convert date column to datetime
df['date']= pd.to_datetime(df['date'])

Calculate the days with the most strikes

We’ll try to get an idea of the highest data points by using a sum calculation. Using sum() performs a sum calculation on all other summable columns. In this case, we are summing all the lightning strikes that happened on each day.

Notice that the center_point_geom column is not included in the output below. That’s because, as a string object, this column is not summable.

# Calculate days with most lightning strikes.
df.groupby(['date']).sum().sort_values('number_of_strikes', ascending=False).head(10)

	number_of_strikes
date
2018-08-29	1070457
2018-08-17	969774
2018-08-28	917199
2018-08-27	824589
2018-08-30	802170
2018-08-19	786225
2018-08-18	741180
2018-08-16	734475
2018-08-31	723624
2018-08-15	673455

If we would use count instead of sum, it would return the number of occurrences of each date in the dataset, which is not the desired result.

Extract the month data

# Create a new `month` column
df['month'] = df['date'].dt.month
df.head()

	date	number_of_strikes	center_point_geom	month
0	2018-01-03	194	POINT(-75 27)	1
1	2018-01-03	41	POINT(-78.4 29)	1
2	2018-01-03	33	POINT(-73.9 27)	1
3	2018-01-03	38	POINT(-73.8 27)	1
4	2018-01-03	92	POINT(-79 28)	1

Calculate the number of strikes per month

# Calculate total number of strikes per month
df.groupby(['month']).sum().sort_values('number_of_strikes', ascending=False).head(12)

	number_of_strikes
month
8	15525255
7	8320400
6	6445083
5	4166726
9	3018336
2	2071315
4	1524339
10	1093962
1	860045
3	854168
11	409263
12	312097

Convert the month number to text

To help read the data more easily, we can convert the month number to text using the datetime function dt.month_name() and add this as a new column in the dataframe. And str.slice will omit the text after the first three letters.

# Create a new `month_txt` column.
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)

	date	number_of_strikes	center_point_geom	month	month_txt
0	2018-01-03	194	POINT(-75 27)	1	Jan
1	2018-01-03	41	POINT(-78.4 29)	1	Jan
2	2018-01-03	33	POINT(-73.9 27)	1	Jan
3	2018-01-03	38	POINT(-73.8 27)	1	Jan
4	2018-01-03	92	POINT(-79 28)	1	Jan

Create a new dataframe

Our objective here is to plot the total number of strikes per month as a bar graph. We will also sort them by month for us to read easily.

# Create a new helper dataframe for plotting.
df_by_month = df.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()

df_by_month

	month	month_txt	number_of_strikes
0	1	Jan	860045
1	2	Feb	2071315
2	3	Mar	854168
3	4	Apr	1524339
4	5	May	4166726
5	6	Jun	6445083
6	7	Jul	8320400
7	8	Aug	15525255
8	9	Sep	3018336
9	10	Oct	1093962
10	11	Nov	409263
11	12	Dec	312097

Make a bar chart

Pyplot’s plt.bar() function takes positional arguments of x and height, representing the data used for the x- and y- axes, respectively. Below, the x-axis will represent months, and the y-axis will represent strike count.

plt.bar(x=df_by_month['month_txt'],height= df_by_month['number_of_strikes'], label="Number of strikes")
plt.plot()
plt.xlabel("Months(2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by months")
plt.legend()
plt.show()