Basic EDA: Sample 1

Sample 1: Lightning Strikes

Following my notes on Python’s basic tools for data work, here I’ll work through some basic EDA projects. Disclaimer: the one below is based on Google’s Advanced Data Analytics program. My only intention in repeating their structure is to practice what I’ve learned and to keep these notes for future reference. Their content is available on Coursera, where the course can also be followed for free without claiming the certificate.

Introduction

We will use pandas to examine 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA). Then, we will calculate the total number of strikes for each month and plot this information on a bar graph.

Imports
import pandas as pd 
import numpy as np 
import datetime as dt 
import matplotlib.pyplot as plt

# Read in the 2018 lightning strike dataset. 
df = pd.read_csv('eda_using_basic_data_functions_in_python_dataset1.csv')
First Inspection
# Inspect the first 10 rows.
df.head(10)
         date  number_of_strikes  center_point_geom
0  2018-01-03                194      POINT(-75 27)
1  2018-01-03                 41    POINT(-78.4 29)
2  2018-01-03                 33    POINT(-73.9 27)
3  2018-01-03                 38    POINT(-73.8 27)
4  2018-01-03                 92      POINT(-79 28)
5  2018-01-03                119      POINT(-78 28)
6  2018-01-03                 35    POINT(-79.3 28)
7  2018-01-03                 60    POINT(-79.1 28)
8  2018-01-03                 41    POINT(-78.7 28)
9  2018-01-03                119    POINT(-78.6 28)
df.shape
(3401012, 3)
# Get more information about the data, including data types of each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype
---  ------             -----
 0   date               object
 1   number_of_strikes  int64
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
Convert the date column to datetime

The date column is currently stored as strings (dtype object). Converting it to datetime lets us extract components such as the month and work with the dates chronologically.

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
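
To see what the conversion changes, here is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the NOAA file). Once the column is a datetime type, the `.dt` accessor exposes calendar fields that plain strings do not have.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({"date": ["2018-01-03", "2018-08-29", "2018-12-01"]})

dtype_before = toy["date"].dtype            # a string/object dtype
toy["date"] = pd.to_datetime(toy["date"])
dtype_after = toy["date"].dtype             # a datetime64 dtype

# Calendar fields are now available through the .dt accessor.
months = toy["date"].dt.month.tolist()
weekdays = toy["date"].dt.day_name().tolist()

print(dtype_before, "->", dtype_after)
print(months)
print(weekdays)
```

The same accessor is what makes the `month` and `month_txt` columns later in these notes possible.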
Calculate the days with the most strikes 

To get a sense of the highest values, we group the rows by date and call sum(), which adds up every summable column within each group. Here, that gives the total number of lightning strikes recorded on each day.

Notice that the center_point_geom column is not included in the output below. It holds strings rather than numbers, so it is left out of the sum. Older pandas versions dropped such columns silently; from pandas 2.0 onward this requires numeric_only=True, which is why it is passed explicitly here.

# Calculate days with most lightning strikes.
# numeric_only=True keeps the string column out of the sum on pandas 2.0+.
df.groupby(['date']).sum(numeric_only=True).sort_values('number_of_strikes', ascending=False).head(10)
            number_of_strikes
date
2018-08-29            1070457
2018-08-17             969774
2018-08-28             917199
2018-08-27             824589
2018-08-30             802170
2018-08-19             786225
2018-08-18             741180
2018-08-16             734475
2018-08-31             723624
2018-08-15             673455

If we used count() instead of sum(), we would get the number of rows recorded for each date rather than the number of strikes, which is not the result we want.
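
The difference between the two aggregations is easy to see on a small example. The sketch below uses a made-up frame with two rows for one day and one row for another; selecting the `number_of_strikes` column before aggregating is one way to sidestep the non-numeric column entirely.

```python
import pandas as pd

# Hypothetical toy data: two locations on one day, one location on another.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-08-29", "2018-08-29", "2018-08-30"]),
    "number_of_strikes": [100, 50, 7],
    "center_point_geom": ["POINT(-75 27)", "POINT(-78 28)", "POINT(-79 28)"],
})

grouped = toy.groupby("date")["number_of_strikes"]

totals = grouped.sum()        # strikes per day
row_counts = grouped.count()  # rows (locations) per day

print(totals.tolist())        # [150, 7]
print(row_counts.tolist())    # [2, 1]
```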

Extract the month data
# Create a new `month` column
df['month'] = df['date'].dt.month
df.head()
         date  number_of_strikes  center_point_geom  month
0  2018-01-03                194      POINT(-75 27)      1
1  2018-01-03                 41    POINT(-78.4 29)      1
2  2018-01-03                 33    POINT(-73.9 27)      1
3  2018-01-03                 38    POINT(-73.8 27)      1
4  2018-01-03                 92      POINT(-79 28)      1
Calculate the number of strikes per month
# Calculate total number of strikes per month.
# numeric_only=True skips the datetime and string columns on pandas 2.0+.
df.groupby(['month']).sum(numeric_only=True).sort_values('number_of_strikes', ascending=False).head(12)
       number_of_strikes
month
8               15525255
7                8320400
6                6445083
5                4166726
9                3018336
2                2071315
4                1524339
10               1093962
1                 860045
3                 854168
11                409263
12                312097
Convert the month number to text

To make the data easier to read, we can convert the month number to text with the datetime accessor method dt.month_name() and store the result in a new column. Chaining str.slice(stop=3) keeps only the first three letters of each month name.

# Create a new `month_txt` column.
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df.head()
         date  number_of_strikes  center_point_geom  month  month_txt
0  2018-01-03                194      POINT(-75 27)      1        Jan
1  2018-01-03                 41    POINT(-78.4 29)      1        Jan
2  2018-01-03                 33    POINT(-73.9 27)      1        Jan
3  2018-01-03                 38    POINT(-73.8 27)      1        Jan
4  2018-01-03                 92      POINT(-79 28)      1        Jan
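
As a quick check of how the two chained calls behave, here is a small sketch on a made-up Series of dates (not taken from the dataset). It also shows that plain string indexing with `.str[:3]` gives the same abbreviations as `str.slice(stop=3)`.

```python
import pandas as pd

# Hypothetical dates, one each from January, June and September.
dates = pd.Series(pd.to_datetime(["2018-01-03", "2018-06-15", "2018-09-30"]))

full_names = dates.dt.month_name()           # January, June, September
short_names = full_names.str.slice(stop=3)   # first three letters only
short_names_alt = full_names.str[:3]         # equivalent slicing syntax

print(full_names.tolist())
print(short_names.tolist())   # ['Jan', 'Jun', 'Sep']
```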
Create a new dataframe

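Our objective here is to plot the total number of strikes per month as a bar graph. Grouping by both month and month_txt keeps the numeric month available, so we can sort the rows in calendar order, which makes the chart easier to read.

# Create a new helper dataframe for plotting.
# numeric_only=True skips the datetime and string columns on pandas 2.0+.
df_by_month = df.groupby(['month','month_txt']).sum(numeric_only=True).sort_values('month', ascending=True).head(12).reset_index()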

df_by_month
    month  month_txt  number_of_strikes
 0      1        Jan             860045
 1      2        Feb            2071315
 2      3        Mar             854168
 3      4        Apr            1524339
 4      5        May            4166726
 5      6        Jun            6445083
 6      7        Jul            8320400
 7      8        Aug           15525255
 8      9        Sep            3018336
 9     10        Oct            1093962
10     11        Nov             409263
11     12        Dec             312097
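
Why group by both columns instead of just month_txt? A minimal sketch with made-up numbers illustrates the point: grouping by the text label alone sorts the groups alphabetically, while including the numeric month keeps them in calendar order.

```python
import pandas as pd

# Hypothetical toy data covering four months.
toy = pd.DataFrame({
    "month": [1, 2, 4, 8],
    "month_txt": ["Jan", "Feb", "Apr", "Aug"],
    "number_of_strikes": [10, 20, 30, 40],
})

# Grouping by the label only: groups come back in alphabetical order.
alphabetical = toy.groupby("month_txt")["number_of_strikes"].sum().index.tolist()

# Grouping by (month, month_txt): groups follow the numeric month.
by_both = toy.groupby(["month", "month_txt"])["number_of_strikes"].sum().reset_index()
calendar = by_both["month_txt"].tolist()

print(alphabetical)  # ['Apr', 'Aug', 'Feb', 'Jan']
print(calendar)      # ['Jan', 'Feb', 'Apr', 'Aug']
```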
Make a bar chart

Pyplot’s plt.bar() function takes x and height as its first two arguments, giving the bar positions along the x-axis and the bar heights along the y-axis. Below, the x-axis represents the months and the y-axis represents the strike count.

plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.xlabel("Month (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.legend()
plt.show()
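
For reference, the same chart can also be drawn with matplotlib's object-oriented interface. The sketch below is one possible variant, not part of the original course material: it rebuilds df_by_month from the monthly totals listed earlier so that it runs on its own, selects the Agg backend on the assumption that no display is available, and formats the y-axis with thousands separators instead of scientific notation.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, so no window is opened
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import pandas as pd

# Monthly totals copied from the df_by_month table in these notes.
df_by_month = pd.DataFrame({
    "month_txt": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "number_of_strikes": [860045, 2071315, 854168, 1524339, 4166726, 6445083,
                          8320400, 15525255, 3018336, 1093962, 409263, 312097],
})

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(df_by_month["month_txt"], df_by_month["number_of_strikes"])

ax.set_xlabel("Month (2018)")
ax.set_ylabel("Number of lightning strikes")
ax.set_title("Number of lightning strikes in 2018 by month")

# Show tick labels such as 15,000,000 rather than 1.5 with a 1e7 offset.
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:,.0f}"))
fig.tight_layout()
```

In a notebook, dropping the `matplotlib.use("Agg")` line and ending with `plt.show()` displays the figure inline; `fig.savefig(...)` writes it to a file instead.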
