Merging and Joining Data: More Examples with pandas

Merging and Joining Data

Let’s look at how to add new data to existing dataframes. We’ll work with two pandas functions here: concat() and merge().

concat()

The pandas concat() function combines data either horizontally, adding it as new columns for existing rows, or vertically, adding it as new rows for existing columns.

A dataframe has two axes: axis 0, which runs vertically down the rows, and axis 1, which runs horizontally across the columns.
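To see the difference between the two axes, here is a minimal sketch with two small, hypothetical dataframes (they are not part of the planet example below, and pandas is assumed to be imported as pd).

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# Axis 0: stack vertically; rows are appended and missing columns are filled with NaN.
pd.concat([a, b], axis=0)

# Axis 1: combine horizontally; columns are placed side by side, matched on the row index.
pd.concat([a, b], axis=1)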

Imagine we have two dataframes as follows.

df1
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2

df2
    planet  radius_km  moons
0  Jupiter      69911     80
1   Saturn      58232     83
2   Uranus      25362     27
3  Neptune      24622     14
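For reference, the two dataframes above could be built like this (a sketch that reproduces the values shown, assuming pandas has been imported as pd).

df1 = pd.DataFrame({'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
                    'radius_km': [2440, 6052, 6371, 3390],
                    'moons': [0, 0, 1, 2]})

df2 = pd.DataFrame({'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
                    'radius_km': [69911, 58232, 25362, 24622],
                    'moons': [80, 83, 27, 14]})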
# The pd.concat() function can combine the two dataframes along axis 0,
# with the second dataframe being added as new rows to the first dataframe.
df3 = pd.concat([df1, df2], axis=0)
df3
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
0  Jupiter      69911     80
1   Saturn      58232     83
2   Uranus      25362     27
3  Neptune      24622     14

Notice that each row retains its index number from its original dataframe. Let’s reset that.

# Reset the row indices.
df3 = df3.reset_index(drop=True)
df3
    planet  radius_km  moons
0  Mercury       2440      0
1    Venus       6052      0
2    Earth       6371      1
3     Mars       3390      2
4  Jupiter      69911     80
5   Saturn      58232     83
6   Uranus      25362     27
7  Neptune      24622     14
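As an aside, pd.concat() can also renumber the rows during the concatenation itself via its ignore_index parameter, which makes the separate reset step unnecessary; a minimal sketch:

# Equivalent result in one step: row labels run 0-7.
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)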

The concat() function works well when we have dataframes containing identically formatted data that simply need to be stacked vertically. If we want to add data horizontally, we turn to the merge() function.

merge()

The merge() function joins two dataframes together. Unlike concat(), it only combines data horizontally, extending along axis 1.

First, let’s conceptualize how data joins work. For two datasets to connect, they need a common point of reference; in other words, both datasets must contain values that can be matched between them.

These shared points of reference are known as keys: they tell the merge what to match on.

Now let’s consider the different ways that we can join the data: 

  • Inner join: we can join the data so that only the keys that appear in both dataframes are included in the merge.
  • Outer join: alternatively, we can join the data so that all of the keys from both dataframes are included in the merge.
  • Left join: we can also join the data so that all of the keys in the left dataframe are included, even if they aren’t in the right dataframe.
  • Right join: finally, we can join the data so that all of the keys in the right dataframe are included, even if they aren’t in the left dataframe.

Let’s work with the dataframe df3 from above and the dataframe df4 below.

df4
    planet         type rings  mean_temp_c magnetic_field  life
0    Earth  terrestrial    no         15.0            yes     1
1     Mars  terrestrial    no        -65.0             no     0
2  Jupiter    gas giant   yes       -110.0            yes     0
3   Saturn    gas giant   yes       -140.0            yes     0
4   Uranus    ice giant   yes       -195.0            yes     0
5  Neptune    ice giant   yes       -200.0            yes     0
6  Janssen  super earth    no          NaN           None     1
7   Tadmor    gas giant  None          NaN           None     1
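As with df1 and df2, df4 could be reconstructed from the table above (a sketch; np.nan and None stand in for the missing values shown, and numpy is assumed to be imported as np).

df4 = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune', 'Janssen', 'Tadmor'],
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant',
             'ice giant', 'ice giant', 'super earth', 'gas giant'],
    'rings': ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'no', None],
    'mean_temp_c': [15.0, -65.0, -110.0, -140.0, -195.0, -200.0, np.nan, np.nan],
    'magnetic_field': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', None, None],
    'life': [1, 0, 0, 0, 0, 0, 1, 1]
})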
Inner join
# Use pd.merge() to combine dataframes.
# Inner merge retains only keys that appear in both dataframes.
inner = pd.merge(df3, df4, on='planet', how='inner')
inner
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0    Earth       6371      1  terrestrial    no         15.0            yes     1
1     Mars       3390      2  terrestrial    no        -65.0             no     0
2  Jupiter      69911     80    gas giant   yes       -110.0            yes     0
3   Saturn      58232     83    gas giant   yes       -140.0            yes     0
4   Uranus      25362     27    ice giant   yes       -195.0            yes     0
5  Neptune      24622     14    ice giant   yes       -200.0            yes     0
Outer join
# Use pd.merge() to combine dataframes.
# Outer merge retains all keys from both dataframes.
outer = pd.merge(df3, df4, on='planet', how='outer')
outer
     planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0   Mercury     2440.0    0.0          NaN   NaN          NaN            NaN   NaN
1     Venus     6052.0    0.0          NaN   NaN          NaN            NaN   NaN
2     Earth     6371.0    1.0  terrestrial    no         15.0            yes   1.0
3      Mars     3390.0    2.0  terrestrial    no        -65.0             no   0.0
4   Jupiter    69911.0   80.0    gas giant   yes       -110.0            yes   0.0
5    Saturn    58232.0   83.0    gas giant   yes       -140.0            yes   0.0
6    Uranus    25362.0   27.0    ice giant   yes       -195.0            yes   0.0
7   Neptune    24622.0   14.0    ice giant   yes       -200.0            yes   0.0
8   Janssen        NaN    NaN  super earth    no          NaN           None   1.0
9    Tadmor        NaN    NaN    gas giant  None          NaN           None   1.0
Left join
# Use pd.merge() to combine dataframes.
# Left merge retains only keys that appear in the left dataframe.
left = pd.merge(df3, df4, on='planet', how='left')
left
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0  Mercury       2440      0          NaN   NaN          NaN            NaN   NaN
1    Venus       6052      0          NaN   NaN          NaN            NaN   NaN
2    Earth       6371      1  terrestrial    no         15.0            yes   1.0
3     Mars       3390      2  terrestrial    no        -65.0             no   0.0
4  Jupiter      69911     80    gas giant   yes       -110.0            yes   0.0
5   Saturn      58232     83    gas giant   yes       -140.0            yes   0.0
6   Uranus      25362     27    ice giant   yes       -195.0            yes   0.0
7  Neptune      24622     14    ice giant   yes       -200.0            yes   0.0
Right join
# Use pd.merge() to combine dataframes.
# Right merge retains only keys that appear in the right dataframe.
right = pd.merge(df3, df4, on='planet', how='right')
right
    planet  radius_km  moons         type rings  mean_temp_c magnetic_field  life
0    Earth     6371.0    1.0  terrestrial    no         15.0            yes     1
1     Mars     3390.0    2.0  terrestrial    no        -65.0             no     0
2  Jupiter    69911.0   80.0    gas giant   yes       -110.0            yes     0
3   Saturn    58232.0   83.0    gas giant   yes       -140.0            yes     0
4   Uranus    25362.0   27.0    ice giant   yes       -195.0            yes     0
5  Neptune    24622.0   14.0    ice giant   yes       -200.0            yes     0
6  Janssen        NaN    NaN  super earth    no          NaN           None     1
7   Tadmor        NaN    NaN    gas giant  None          NaN           None     1

Dataframes with pandas

Let’s get back to our AQI table and practice some more.

From a CSV file into a pandas dataframe

import numpy as np
import pandas as pd

top3 = pd.read_csv('epa_ca_tx_pa.csv')
top3.head()
   state_code  state_name  county_code county_name   aqi
0           6  California            1     Alameda  11.0
1           6  California            7       Butte   6.0
2           6  California           19      Fresno  11.0
3           6  California           29        Kern   7.0
4           6  California           29        Kern   3.0

Summary information

Now let’s get some high-level summary information about our data.

Metadata
top3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   state_code   546 non-null    int64
 1   state_name   546 non-null    object
 2   county_code  546 non-null    int64
 3   county_name  546 non-null    object
 4   aqi          546 non-null    float64
dtypes: float64(1), int64(2), object(2)
memory usage: 21.5+ KB
Summary statistics
top3.describe()
       state_code  county_code         aqi
count  546.000000   546.000000  546.000000
mean    20.593407    83.179487    8.906593
std     19.001484    92.240873    9.078479
min      6.000000     1.000000    0.000000
25%      6.000000    29.000000    3.000000
50%      6.000000    66.000000    6.000000
75%     42.000000    98.500000   11.000000
max     48.000000   479.000000   93.000000

Explore the data

Let’s explore our data further.

Rows per state
top3['state_name'].value_counts()
California      342
Texas           104
Pennsylvania    100
Name: state_name, dtype: int64
Sort by AQI
top3_sorted = top3.sort_values(by='aqi', ascending=False)
top3_sorted.head(10)
     state_code  state_name  county_code    county_name   aqi
76            6  California           37    Los Angeles  93.0
146           6  California           37    Los Angeles  59.0
41            6  California           83  Santa Barbara  47.0
122           6  California           59         Orange  47.0
184           6  California           59         Orange  47.0
51           48       Texas          141        El Paso  47.0
80            6  California           65      Riverside  43.0
136          48       Texas          141        El Paso  40.0
58            6  California           65      Riverside  40.0
91           48       Texas          141        El Paso  40.0
Use iloc to select rows
top3_sorted.iloc[10:12]
     state_code  state_name  county_code  county_name   aqi
186           6  California           73    San Diego  39.0
74            6  California           37  Los Angeles  38.0

Examine California data

We noticed that the rows with the highest AQI values come from California, so let’s examine the data for just that state.

Basic Boolean masking
mask = top3_sorted['state_name'] == 'California'
ca_df = top3_sorted[mask]
ca_df.head()
     state_code  state_name  county_code    county_name   aqi
76            6  California           37    Los Angeles  93.0
146           6  California           37    Los Angeles  59.0
41            6  California           83  Santa Barbara  47.0
122           6  California           59         Orange  47.0
184           6  California           59         Orange  47.0
Validate CA data

Does its row count match the number of California rows determined before?

ca_df.shape
(342, 5)
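We could also make the check explicit by comparing the row count against the count from value_counts() (a sketch; both numbers come from the outputs above).

ca_df.shape[0] == top3['state_name'].value_counts()['California']
True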
Rows per CA county

Let’s examine a list of the number of times each county is represented in the California data.

ca_df['county_name'].value_counts()
Los Angeles        55
Santa Barbara      26
San Bernardino     21
Orange             19
San Diego          19
Sacramento         17
Alameda            17
Fresno             16
Riverside          14
Contra Costa       13
Imperial           13
San Francisco       8
Monterey            8
Humboldt            8
Santa Clara         7
El Dorado           7
Placer              6
Butte               6
Kern                6
Mendocino           6
Solano              5
San Joaquin         5
Tulare              5
Ventura             5
Sutter              4
San Mateo           4
Marin               3
Sonoma              3
Stanislaus          3
San Luis Obispo     2
Napa                2
Santa Cruz          2
Calaveras           2
Shasta              1
Tuolumne            1
Inyo                1
Yolo                1
Mono                1
Name: county_name, dtype: int64
Calculate mean AQI for Los Angeles county

We noticed that Los Angeles county has more than twice the number of rows of the next-most-represented county in California, and now we want to learn more about it.

mask = ca_df['county_name'] == 'Los Angeles'
ca_df[mask]['aqi'].mean()
13.4

Groupby

Let’s group the original dataframe (top3) by state and calculate the mean AQI for each state.

# Select the aqi column before aggregating so the non-numeric columns aren't averaged.
top3.groupby('state_name')[['aqi']].mean()
                   aqi
state_name
California    9.412281
Pennsylvania  6.690000
Texas         9.375000

Add more data

Let’s add more data from a second file.

Read in the second file
other_states = pd.read_csv('epa_others.csv')
other_states.head()
   state_code state_name  county_code county_name   aqi
0           4    Arizona           13    Maricopa  18.0
1           4    Arizona           13    Maricopa   9.0
2           4    Arizona           19        Pima  20.0
3           8   Colorado           41     El Paso   9.0
4          12    Florida           31       Duval  15.0
Concatenate the data

The data from other_states is in the same format as the data from top3. It has the same columns in the same order.
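Before concatenating, we could confirm that assumption with a quick check (a sketch comparing the two column lists).

list(top3.columns) == list(other_states.columns)
True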

combined_df = pd.concat([top3, other_states], axis=0)
len(combined_df) == len(top3) + len(other_states)
True

Complex Boolean masking

According to the EPA (the U.S. Environmental Protection Agency), AQI values of 51-100 are considered of “Moderate” concern. With that threshold in mind, let’s examine some data for the state of Washington.

mask = (combined_df['state_name'] == 'Washington') & (combined_df['aqi'] >= 51)
combined_df[mask]
     state_code  state_name  county_code county_name   aqi
40           53  Washington           33        King  55.0
82           53  Washington           61   Snohomish  76.0
121          53  Washington           77      Yakima  58.0
122          53  Washington           77      Yakima  57.0

In summary:

  • Pandas comes with many built-in functions and tools specifically designed for use with tabular data to simplify common tasks such as:
    • Reading and writing data to/from files
    • Quickly computing summary statistics about your data
    • Manipulating, selecting, and filtering data
    • Grouping and aggregating data
    • Adding new data to existing data
  • It’s powered by NumPy, which relies on fast array operations for performance.
  • Its interface makes working with tabular data easy because it allows us to visualize our data in rows and columns.
