Merging and Joining Data: More Examples with pandas

Merging and Joining Data

Let’s look at how to add new data to existing dataframes. We’ll work with two pandas functions here: concat() and merge().

concat()

The pandas concat() function combines data either horizontally, adding it as new columns for existing rows, or vertically, adding it as new rows for existing columns.

A dataframe has two axes: axis 0, which runs vertically down the rows, and axis 1, which runs horizontally across the columns.
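To see the difference between the two axes, here is a minimal sketch with two small, hypothetical dataframes (they are not part of the planet example below, and pandas is assumed to be imported as pd).

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# Axis 0: stack vertically; rows are appended and missing columns are filled with NaN.
pd.concat([a, b], axis=0)

# Axis 1: combine horizontally; columns are placed side by side, matched on the row index.
pd.concat([a, b], axis=1)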

Imagine we have two dataframes as follows.

df1
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2

df2
    planet  radius_km  moons
0  Jupiter      69911     80
1   Saturn      58232     83
2   Uranus      25362     27
3  Neptune      24622     14
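For reference, the two dataframes above could be built like this (a sketch that reproduces the values shown, assuming pandas has been imported as pd).

df1 = pd.DataFrame({'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
                    'radius_km': [2440, 6052, 6371, 3390],
                    'moons': [0, 0, 1, 2]})

df2 = pd.DataFrame({'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
                    'radius_km': [69911, 58232, 25362, 24622],
                    'moons': [80, 83, 27, 14]})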
# The pd.concat() function can combine the two dataframes along axis 0,
# with the second dataframe being added as new rows to the first dataframe.
df3 = pd.concat([df1, df2], axis=0)
df3
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
0  Jupiter      69911     80
1   Saturn      58232     83
2   Uranus      25362     27
3  Neptune      24622     14

Notice that each row retains its index number from its original dataframe. Let’s reset that.

# Reset the row indices.
df3 = df3.reset_index(drop=True)
df3
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
4  Jupiter      69911     80
5   Saturn      58232     83
6   Uranus      25362     27
7  Neptune      24622     14
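As an aside, pd.concat() can also renumber the rows during the concatenation itself via its ignore_index parameter, which makes the separate reset step unnecessary; a minimal sketch:

# Equivalent result in one step: row labels run 0-7.
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)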

The concat() function works well when we have dataframes containing identically formatted data that simply need to be stacked vertically. If we want to add data horizontally, we turn to the merge() function.

merge()

The merge() function joins two dataframes together. Unlike concat(), it only combines data horizontally, extending along axis 1.

First, let’s conceptualize how data joins work. For two datasets to connect, they need a common point of reference; in other words, both datasets must contain values that can be matched between them.

These shared points of reference are known as keys: they tell the merge what to match on.

Now let’s consider the different ways that we can join the data: 

  • Inner join: we can join the data so that only the keys that appear in both dataframes are included in the merge.
  • Outer join: alternatively, we can join the data so that all of the keys from both dataframes are included in the merge.
  • Left join: we can also join the data so that all of the keys in the left dataframe are included, even if they aren’t in the right dataframe.
  • Right join: finally, we can join the data so that all of the keys in the right dataframe are included, even if they aren’t in the left dataframe.

Let’s work with the dataframe df3 from above and the dataframe df4 below.

df4
    planet         type rings  mean_temp_c magnetic_field  life
0    Earth  terrestrial    no         15.0            yes     1
1     Mars  terrestrial    no        -65.0             no     0
2  Jupiter    gas giant   yes       -110.0            yes     0
3   Saturn    gas giant   yes       -140.0            yes     0
4   Uranus    ice giant   yes       -195.0            yes     0
5  Neptune    ice giant   yes       -200.0            yes     0
6  Janssen  super earth    no          NaN           None     1
7   Tadmor    gas giant  None          NaN           None     1
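As with df1 and df2, df4 could be reconstructed from the table above (a sketch; np.nan and None stand in for the missing values shown, and numpy is assumed to be imported as np).

df4 = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune', 'Janssen', 'Tadmor'],
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant',
             'ice giant', 'ice giant', 'super earth', 'gas giant'],
    'rings': ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'no', None],
    'mean_temp_c': [15.0, -65.0, -110.0, -140.0, -195.0, -200.0, np.nan, np.nan],
    'magnetic_field': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', None, None],
    'life': [1, 0, 0, 0, 0, 0, 1, 1]
})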
Inner join
# Use pd.merge() to combine dataframes.
# Inner merge retains only keys that appear in both dataframes.
inner = pd.merge(df3, df4, on='planet', how='inner')
inner
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0    Earth       6371      1  terrestrial    no         15.0            yes     1
1     Mars       3390      2  terrestrial    no        -65.0             no     0
2  Jupiter      69911     80    gas giant   yes       -110.0            yes     0
3   Saturn      58232     83    gas giant   yes       -140.0            yes     0
4   Uranus      25362     27    ice giant   yes       -195.0            yes     0
5  Neptune      24622     14    ice giant   yes       -200.0            yes     0
Outer join
# Use pd.merge() to combine dataframes.
# Outer merge retains all keys from both dataframes.
outer = pd.merge(df3, df4, on='planet', how='outer')
outer
     planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0   Mercury     2440.0    0.0          NaN   NaN          NaN            NaN   NaN
1     Venus     6052.0    0.0          NaN   NaN          NaN            NaN   NaN
2     Earth     6371.0    1.0  terrestrial    no         15.0            yes   1.0
3      Mars     3390.0    2.0  terrestrial    no        -65.0             no   0.0
4   Jupiter    69911.0   80.0    gas giant   yes       -110.0            yes   0.0
5    Saturn    58232.0   83.0    gas giant   yes       -140.0            yes   0.0
6    Uranus    25362.0   27.0    ice giant   yes       -195.0            yes   0.0
7   Neptune    24622.0   14.0    ice giant   yes       -200.0            yes   0.0
8   Janssen        NaN    NaN  super earth    no          NaN           None   1.0
9    Tadmor        NaN    NaN    gas giant  None          NaN           None   1.0
Left join
# Use pd.merge() to combine dataframes.
# Left merge retains only keys that appear in the left dataframe.
left = pd.merge(df3, df4, on='planet', how='left')
left
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0  Mercury       2440      0          NaN   NaN          NaN            NaN   NaN
1    Venus       6052      0          NaN   NaN          NaN            NaN   NaN
2    Earth       6371      1  terrestrial    no         15.0            yes   1.0
3     Mars       3390      2  terrestrial    no        -65.0             no   0.0
4  Jupiter      69911     80    gas giant   yes       -110.0            yes   0.0
5   Saturn      58232     83    gas giant   yes       -140.0            yes   0.0
6   Uranus      25362     27    ice giant   yes       -195.0            yes   0.0
7  Neptune      24622     14    ice giant   yes       -200.0            yes   0.0
Right join
# Use pd.merge() to combine dataframes.
# Right merge retains only keys that appear in the right dataframe.
right = pd.merge(df3, df4, on='planet', how='right')
right
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0    Earth     6371.0    1.0  terrestrial    no         15.0            yes     1
1     Mars     3390.0    2.0  terrestrial    no        -65.0             no     0
2  Jupiter    69911.0   80.0    gas giant   yes       -110.0            yes     0
3   Saturn    58232.0   83.0    gas giant   yes       -140.0            yes     0
4   Uranus    25362.0   27.0    ice giant   yes       -195.0            yes     0
5  Neptune    24622.0   14.0    ice giant   yes       -200.0            yes     0
6  Janssen        NaN    NaN  super earth    no          NaN           None     1
7   Tadmor        NaN    NaN    gas giant  None          NaN           None     1

Dataframes with pandas

Let’s get back to our AQI table and practice some more.

From a CSV file into a pandas dataframe

import numpy as np
import pandas as pd

top3 = pd.read_csv('epa_ca_tx_pa.csv')
top3.head()
   state_code  state_name  county_code county_name   aqi
0           6  California            1     Alameda  11.0
1           6  California            7       Butte   6.0
2           6  California           19      Fresno  11.0
3           6  California           29        Kern   7.0
4           6  California           29        Kern   3.0

Summary information

Now let’s get some high-level summary information about our data.

Metadata
top3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   state_code   546 non-null    int64
 1   state_name   546 non-null    object
 2   county_code  546 non-null    int64
 3   county_name  546 non-null    object
 4   aqi          546 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 21.5+ KB
Summary statistics
top3.describe()
       state_code  county_code         aqi
count  546.000000   546.000000  546.000000
mean    20.593407    83.179487    8.906593
std     19.001484    92.240873    9.078479
min      6.000000     1.000000    0.000000
25%      6.000000    29.000000    3.000000
50%      6.000000    66.000000    6.000000
75%     42.000000    98.500000   11.000000
max     48.000000   479.000000   93.000000

Explore the data

Let’s explore our data further.

Rows per state
top3['state_name'].value_counts()
California      342
Texas           104
Pennsylvania    100
Name: state_name, dtype: int64
Sort by AQI
top3_sorted = top3.sort_values(by='aqi', ascending=False)
top3_sorted.head(10)
     state_code  state_name  county_code    county_name   aqi
76            6  California           37    Los Angeles  93.0
146           6  California           37    Los Angeles  59.0
41            6  California           83  Santa Barbara  47.0
122           6  California           59         Orange  47.0
184           6  California           59         Orange  47.0
51           48       Texas          141        El Paso  47.0
80            6  California           65      Riverside  43.0
136          48       Texas          141        El Paso  40.0
58            6  California           65      Riverside  40.0
91           48       Texas          141        El Paso  40.0
Use iloc to select rows
top3_sorted.iloc[10:12]
     state_code  state_name  county_code  county_name   aqi
186           6  California           73    San Diego  39.0
74            6  California           37  Los Angeles  38.0

Examine California data

We noticed that the rows with the highest AQI values come from California, so let’s examine the data for just that state.

Basic Boolean masking
mask = top3_sorted['state_name'] == 'California'
ca_df = top3_sorted[mask]
ca_df.head()
     state_code  state_name  county_code    county_name   aqi
76            6  California           37    Los Angeles  93.0
146           6  California           37    Los Angeles  59.0
41            6  California           83  Santa Barbara  47.0
122           6  California           59         Orange  47.0
184           6  California           59         Orange  47.0
Validate CA data

Does its row count match the number of California rows determined before?

ca_df.shape
(342, 5)
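We could also make the check explicit by comparing the row count against the count from value_counts() (a sketch; both numbers come from the outputs above).

ca_df.shape[0] == top3['state_name'].value_counts()['California']
True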
Rows per CA county

Let’s examine a list of the number of times each county is represented in the California data.

ca_df['county_name'].value_counts()
Los Angeles        55
Santa Barbara      26
San Bernardino     21
Orange             19
San Diego          19
Sacramento         17
Alameda            17
Fresno             16
Riverside          14
Contra Costa       13
Imperial           13
San Francisco       8
Monterey            8
Humboldt            8
Santa Clara         7
El Dorado           7
Placer              6
Butte               6
Kern                6
Mendocino           6
Solano              5
San Joaquin         5
Tulare              5
Ventura             5
Sutter              4
San Mateo           4
Marin               3
Sonoma              3
Stanislaus          3
San Luis Obispo     2
Napa                2
Santa Cruz          2
Calaveras           2
Shasta              1
Tuolumne            1
Inyo                1
Yolo                1
Mono                1
Name: county_name, dtype: int64
Calculate mean AQI for Los Angeles county

We noticed that Los Angeles county has more than twice the number of rows of the next-most-represented county in California, and now we want to learn more about it.

mask = ca_df['county_name'] == 'Los Angeles'
ca_df[mask]['aqi'].mean()
13.4

Groupby

Let’s group the original dataframe (top3) by state and calculate the mean AQI for each state.

# Select the aqi column before aggregating so the non-numeric columns aren't averaged.
top3.groupby('state_name')[['aqi']].mean()
                   aqi
state_name
California    9.412281
Pennsylvania  6.690000
Texas         9.375000

Add more data

Let’s add more data from a second file.

Read in the second file
other_states = pd.read_csv('epa_others.csv')
other_states.head()
   state_code state_name  county_code county_name   aqi
0           4    Arizona           13    Maricopa  18.0
1           4    Arizona           13    Maricopa   9.0
2           4    Arizona           19        Pima  20.0
3           8   Colorado           41     El Paso   9.0
4          12    Florida           31       Duval  15.0
Concatenate the data

The data from other_states is in the same format as the data from top3. It has the same columns in the same order.
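Before concatenating, we could confirm that assumption with a quick check (a sketch comparing the two column lists).

list(top3.columns) == list(other_states.columns)
True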

combined_df = pd.concat([top3, other_states], axis=0)
len(combined_df) == len(top3) + len(other_states)
True

Complex Boolean masking

According to the EPA (the U.S. Environmental Protection Agency), AQI values of 51-100 are considered of “Moderate” concern. With that threshold in mind, let’s examine some data for the state of Washington.

mask = (combined_df['state_name'] == 'Washington') & (combined_df['aqi'] >= 51)
combined_df[mask]
     state_code  state_name  county_code county_name   aqi
40           53  Washington           33        King  55.0
82           53  Washington           61   Snohomish  76.0
121          53  Washington           77      Yakima  58.0
122          53  Washington           77      Yakima  57.0

In summary:

  • Pandas comes with many built-in functions and tools specifically designed for use with tabular data to simplify common tasks such as:
    • Reading and writing data to/from files
    • Quickly computing summary statistics about your data
    • Manipulating, selecting, and filtering data
    • Grouping and aggregating data
    • Adding new data to existing data
  • It’s powered by NumPy, which relies on fast array operations for performance.
  • Its interface makes working with tabular data easy because it allows us to visualize our data in rows and columns.
