Merging and Joining Data
Let’s look at how to add new data to existing dataframes. We’ll work with two pandas functions here: concat() and merge().
concat()
The pandas concat function combines data either vertically, adding new rows to existing columns, or horizontally, adding new columns to existing rows.
A dataframe has two axes: axis 0 runs vertically, down the rows, and axis 1 runs horizontally, across the columns.
Imagine we have two dataframes as follows.
df1
|   | planet | radius_km | moons |
|---|---|---|---|
| 0 | Mercury | 2440 | 0 |
| 1 | Venus | 6052 | 0 |
| 2 | Earth | 6371 | 1 |
| 3 | Mars | 3390 | 2 |
df2
|   | planet | radius_km | moons |
|---|---|---|---|
| 0 | Jupiter | 69911 | 80 |
| 1 | Saturn | 58232 | 83 |
| 2 | Uranus | 25362 | 27 |
| 3 | Neptune | 24622 | 14 |
# The pd.concat() function can combine the two dataframes along axis 0,
# with the second dataframe being added as new rows to the first dataframe.
df3 = pd.concat([df1, df2], axis=0)
df3
|   | planet | radius_km | moons |
|---|---|---|---|
| 0 | Mercury | 2440 | 0 |
| 1 | Venus | 6052 | 0 |
| 2 | Earth | 6371 | 1 |
| 3 | Mars | 3390 | 2 |
| 0 | Jupiter | 69911 | 80 |
| 1 | Saturn | 58232 | 83 |
| 2 | Uranus | 25362 | 27 |
| 3 | Neptune | 24622 | 14 |
Notice that each row retains its index number from its original dataframe. Let’s reset that.
# Reset the row indices.
df3 = df3.reset_index(drop=True)
df3
|   | planet | radius_km | moons |
|---|---|---|---|
| 0 | Mercury | 2440 | 0 |
| 1 | Venus | 6052 | 0 |
| 2 | Earth | 6371 | 1 |
| 3 | Mars | 3390 | 2 |
| 4 | Jupiter | 69911 | 80 |
| 5 | Saturn | 58232 | 83 |
| 6 | Uranus | 25362 | 27 |
| 7 | Neptune | 24622 | 14 |
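As a shortcut, concat can renumber the combined rows for us with ignore_index=True, skipping the separate reset_index step. A minimal sketch using two small stand-in dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({'planet': ['Mercury', 'Venus'], 'moons': [0, 0]})
df2 = pd.DataFrame({'planet': ['Jupiter', 'Saturn'], 'moons': [80, 83]})

# ignore_index=True discards the original row labels and assigns
# a fresh RangeIndex (0..n-1) to the combined result.
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df3.index.tolist())  # → [0, 1, 2, 3]
```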
The concat function is ideal when we have dataframes containing identically formatted data that simply need to be combined vertically. If we want to add data horizontally, we turn to the merge function.
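That said, concat can also place dataframes side by side along axis 1 when their row indices line up, though merge is usually the better tool for joining on data values. A quick sketch with illustrative dataframes:

```python
import pandas as pd

names = pd.DataFrame({'planet': ['Mercury', 'Venus']})
sizes = pd.DataFrame({'radius_km': [2440, 6052]})

# axis=1 aligns rows on the index and appends the columns.
side_by_side = pd.concat([names, sizes], axis=1)
print(side_by_side.columns.tolist())  # → ['planet', 'radius_km']
```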
merge()
The merge function is a pandas function that joins two dataframes together. It combines data only horizontally, extending along axis 1.
First, let’s conceptualize how data joins work. For two datasets to connect, they need to share a common point of reference: some column whose values appear in both.
These shared points of reference between dataframes are known as keys. Keys determine which rows are matched with which.
Now let’s consider the different ways that we can join the data:
- Inner join: We can join it so only the keys that are in both dataframes get included in the merge.
- Outer join: Alternatively, we can join the data so all of the keys from both dataframes get included in the merge.
- Left join: We can also join the data so all of the keys in the left dataframe are included, even if they aren’t in the right dataframe.
- Right join: Finally, we can join the data so all the keys in the right dataframe are included, even if they aren’t in the left dataframe.
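The four join types can be compared side by side with two tiny dataframes (the keys and values here are illustrative, not the planet data):

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['A', 'B'], 'x': [1, 2]})
right_df = pd.DataFrame({'key': ['B', 'C'], 'y': [3, 4]})

inner = pd.merge(left_df, right_df, on='key', how='inner')  # keys in both: B
outer = pd.merge(left_df, right_df, on='key', how='outer')  # all keys: A, B, C
left = pd.merge(left_df, right_df, on='key', how='left')    # left keys: A, B
right = pd.merge(left_df, right_df, on='key', how='right')  # right keys: B, C

print(sorted(inner['key']))  # → ['B']
print(sorted(outer['key']))  # → ['A', 'B', 'C']
```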
Let’s work with the dataframe df3 from above and df4 below.
df4
|   | planet | type | rings | mean_temp_c | magnetic_field | life |
|---|---|---|---|---|---|---|
| 0 | Earth | terrestrial | no | 15.0 | yes | 1 |
| 1 | Mars | terrestrial | no | -65.0 | no | 0 |
| 2 | Jupiter | gas giant | yes | -110.0 | yes | 0 |
| 3 | Saturn | gas giant | yes | -140.0 | yes | 0 |
| 4 | Uranus | ice giant | yes | -195.0 | yes | 0 |
| 5 | Neptune | ice giant | yes | -200.0 | yes | 0 |
| 6 | Janssen | super earth | no | NaN | None | 1 |
| 7 | Tadmor | gas giant | None | NaN | None | 1 |
Inner join
# Use pd.merge() to combine dataframes.
# Inner merge retains only keys that appear in both dataframes.
inner = pd.merge(df3, df4, on='planet', how='inner')
inner
|   | planet | radius_km | moons | type | rings | mean_temp_c | magnetic_field | life |
|---|---|---|---|---|---|---|---|---|
| 0 | Earth | 6371 | 1 | terrestrial | no | 15.0 | yes | 1 |
| 1 | Mars | 3390 | 2 | terrestrial | no | -65.0 | no | 0 |
| 2 | Jupiter | 69911 | 80 | gas giant | yes | -110.0 | yes | 0 |
| 3 | Saturn | 58232 | 83 | gas giant | yes | -140.0 | yes | 0 |
| 4 | Uranus | 25362 | 27 | ice giant | yes | -195.0 | yes | 0 |
| 5 | Neptune | 24622 | 14 | ice giant | yes | -200.0 | yes | 0 |
Outer join
# Use pd.merge() to combine dataframes.
# Outer merge retains all keys from both dataframes.
outer = pd.merge(df3, df4, on='planet', how='outer')
outer
|   | planet | radius_km | moons | type | rings | mean_temp_c | magnetic_field | life |
|---|---|---|---|---|---|---|---|---|
| 0 | Mercury | 2440.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | Venus | 6052.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | Earth | 6371.0 | 1.0 | terrestrial | no | 15.0 | yes | 1.0 |
| 3 | Mars | 3390.0 | 2.0 | terrestrial | no | -65.0 | no | 0.0 |
| 4 | Jupiter | 69911.0 | 80.0 | gas giant | yes | -110.0 | yes | 0.0 |
| 5 | Saturn | 58232.0 | 83.0 | gas giant | yes | -140.0 | yes | 0.0 |
| 6 | Uranus | 25362.0 | 27.0 | ice giant | yes | -195.0 | yes | 0.0 |
| 7 | Neptune | 24622.0 | 14.0 | ice giant | yes | -200.0 | yes | 0.0 |
| 8 | Janssen | NaN | NaN | super earth | no | NaN | None | 1.0 |
| 9 | Tadmor | NaN | NaN | gas giant | None | NaN | None | 1.0 |
Left join
# Use pd.merge() to combine dataframes.
# Left merge retains only keys that appear in the left dataframe.
left = pd.merge(df3, df4, on='planet', how='left')
left
|   | planet | radius_km | moons | type | rings | mean_temp_c | magnetic_field | life |
|---|---|---|---|---|---|---|---|---|
| 0 | Mercury | 2440 | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | Venus | 6052 | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | Earth | 6371 | 1 | terrestrial | no | 15.0 | yes | 1.0 |
| 3 | Mars | 3390 | 2 | terrestrial | no | -65.0 | no | 0.0 |
| 4 | Jupiter | 69911 | 80 | gas giant | yes | -110.0 | yes | 0.0 |
| 5 | Saturn | 58232 | 83 | gas giant | yes | -140.0 | yes | 0.0 |
| 6 | Uranus | 25362 | 27 | ice giant | yes | -195.0 | yes | 0.0 |
| 7 | Neptune | 24622 | 14 | ice giant | yes | -200.0 | yes | 0.0 |
Right join
# Use pd.merge() to combine dataframes.
# Right merge retains only keys that appear in the right dataframe.
right = pd.merge(df3, df4, on='planet', how='right')
right
|   | planet | radius_km | moons | type | rings | mean_temp_c | magnetic_field | life |
|---|---|---|---|---|---|---|---|---|
| 0 | Earth | 6371.0 | 1.0 | terrestrial | no | 15.0 | yes | 1 |
| 1 | Mars | 3390.0 | 2.0 | terrestrial | no | -65.0 | no | 0 |
| 2 | Jupiter | 69911.0 | 80.0 | gas giant | yes | -110.0 | yes | 0 |
| 3 | Saturn | 58232.0 | 83.0 | gas giant | yes | -140.0 | yes | 0 |
| 4 | Uranus | 25362.0 | 27.0 | ice giant | yes | -195.0 | yes | 0 |
| 5 | Neptune | 24622.0 | 14.0 | ice giant | yes | -200.0 | yes | 0 |
| 6 | Janssen | NaN | NaN | super earth | no | NaN | None | 1 |
| 7 | Tadmor | NaN | NaN | gas giant | None | NaN | None | 1 |
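When validating a merge, the indicator=True argument of pd.merge() is handy: it appends a _merge column recording whether each row's key came from the left dataframe only, the right only, or both. A small sketch with stand-in dataframes:

```python
import pandas as pd

df_a = pd.DataFrame({'planet': ['Earth', 'Pluto'], 'moons': [1, 5]})
df_b = pd.DataFrame({'planet': ['Earth', 'Janssen'], 'rings': ['no', 'no']})

# indicator=True adds a `_merge` column with values
# 'left_only', 'right_only', or 'both'.
checked = pd.merge(df_a, df_b, on='planet', how='outer', indicator=True)
print(sorted(checked['_merge'].tolist()))  # → ['both', 'left_only', 'right_only']
```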
Dataframes with pandas
Let’s get back to our AQI data and practice some more.
From CSV file into a pandas dataframe
import numpy as np
import pandas as pd
top3 = pd.read_csv('epa_ca_tx_pa.csv')
top3.head()
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 0 | 6 | California | 1 | Alameda | 11.0 |
| 1 | 6 | California | 7 | Butte | 6.0 |
| 2 | 6 | California | 19 | Fresno | 11.0 |
| 3 | 6 | California | 29 | Kern | 7.0 |
| 4 | 6 | California | 29 | Kern | 3.0 |
Summary information
Now let’s get some high-level summary information about our data.
Metadata
top3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state_code 546 non-null int64
1 state_name 546 non-null object
2 county_code 546 non-null int64
3 county_name 546 non-null object
4 aqi 546 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 21.5+ KB
Summary statistics
top3.describe()
|   | state_code | county_code | aqi |
|---|---|---|---|
| count | 546.000000 | 546.000000 | 546.000000 |
| mean | 20.593407 | 83.179487 | 8.906593 |
| std | 19.001484 | 92.240873 | 9.078479 |
| min | 6.000000 | 1.000000 | 0.000000 |
| 25% | 6.000000 | 29.000000 | 3.000000 |
| 50% | 6.000000 | 66.000000 | 6.000000 |
| 75% | 42.000000 | 98.500000 | 11.000000 |
| max | 48.000000 | 479.000000 | 93.000000 |
Explore the data
Let’s explore our data further.
Rows per state
top3['state_name'].value_counts()
California 342
Texas 104
Pennsylvania 100
Name: state_name, dtype: int64
Sort by AQI
top3_sorted = top3.sort_values(by='aqi', ascending=False)
top3_sorted.head(10)
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 76 | 6 | California | 37 | Los Angeles | 93.0 |
| 146 | 6 | California | 37 | Los Angeles | 59.0 |
| 41 | 6 | California | 83 | Santa Barbara | 47.0 |
| 122 | 6 | California | 59 | Orange | 47.0 |
| 184 | 6 | California | 59 | Orange | 47.0 |
| 51 | 48 | Texas | 141 | El Paso | 47.0 |
| 80 | 6 | California | 65 | Riverside | 43.0 |
| 136 | 48 | Texas | 141 | El Paso | 40.0 |
| 58 | 6 | California | 65 | Riverside | 40.0 |
| 91 | 48 | Texas | 141 | El Paso | 40.0 |
Use iloc to select rows
top3_sorted.iloc[10:12]
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 186 | 6 | California | 73 | San Diego | 39.0 |
| 74 | 6 | California | 37 | Los Angeles | 38.0 |
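Note the distinction between iloc, which selects by integer position in the current row order, and loc, which selects by index label. After sorting, position and label no longer coincide. A hedged sketch with a stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({'aqi': [11.0, 6.0, 93.0]}, index=[0, 1, 2])
df_sorted = df.sort_values(by='aqi', ascending=False)  # row labels now: 2, 0, 1

print(df_sorted.iloc[0]['aqi'])  # position 0 → 93.0 (top of the sorted order)
print(df_sorted.loc[0]['aqi'])   # label 0    → 11.0 (the original first row)
```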
Examine California data
We noticed that the rows with the highest AQI represent data from California, so let’s examine the data for just that state.
Basic Boolean masking
mask = top3_sorted['state_name'] == 'California'
ca_df = top3_sorted[mask]
ca_df.head()
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 76 | 6 | California | 37 | Los Angeles | 93.0 |
| 146 | 6 | California | 37 | Los Angeles | 59.0 |
| 41 | 6 | California | 83 | Santa Barbara | 47.0 |
| 122 | 6 | California | 59 | Orange | 47.0 |
| 184 | 6 | California | 59 | Orange | 47.0 |
Validate CA data
Does its row count match the number of California rows determined before?
ca_df.shape
(342, 5)
Rows per CA county
Let’s examine a list of the number of times each county is represented in the California data.
ca_df['county_name'].value_counts()
Los Angeles 55
Santa Barbara 26
San Bernardino 21
Orange 19
San Diego 19
Sacramento 17
Alameda 17
Fresno 16
Riverside 14
Contra Costa 13
Imperial 13
San Francisco 8
Monterey 8
Humboldt 8
Santa Clara 7
El Dorado 7
Placer 6
Butte 6
Kern 6
Mendocino 6
Solano 5
San Joaquin 5
Tulare 5
Ventura 5
Sutter 4
San Mateo 4
Marin 3
Sonoma 3
Stanislaus 3
San Luis Obispo 2
Napa 2
Santa Cruz 2
Calaveras 2
Shasta 1
Tuolumne 1
Inyo 1
Yolo 1
Mono 1
Name: county_name, dtype: int64
Calculate mean AQI for Los Angeles county
We noticed that Los Angeles county has more than twice the number of rows of the next-most-represented county in California, and now we want to learn more about it.
mask = ca_df['county_name'] == 'Los Angeles'
ca_df[mask]['aqi'].mean()
13.4
Groupby
Let’s group the original dataframe (top3) by state and calculate the mean AQI for each state.
# Select the aqi column before aggregating, so non-numeric columns are excluded.
top3.groupby('state_name')[['aqi']].mean()
| state_name | aqi |
|---|---|
| California | 9.412281 |
| Pennsylvania | 6.690000 |
| Texas | 9.375000 |
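groupby also supports several aggregations in one pass via .agg(). A minimal sketch with stand-in data (the values are illustrative, not the full AQI dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'state_name': ['California', 'California', 'Texas'],
    'aqi': [11.0, 7.0, 9.0],
})

# Compute mean, max, and row count of AQI per state in a single call.
summary = df.groupby('state_name')['aqi'].agg(['mean', 'max', 'count'])
print(summary.loc['California', 'mean'])  # → 9.0
```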
Add more data
Let’s add more data from a second file.
Read in the second file
other_states = pd.read_csv('epa_others.csv')
other_states.head()
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 0 | 4 | Arizona | 13 | Maricopa | 18.0 |
| 1 | 4 | Arizona | 13 | Maricopa | 9.0 |
| 2 | 4 | Arizona | 19 | Pima | 20.0 |
| 3 | 8 | Colorado | 41 | El Paso | 9.0 |
| 4 | 12 | Florida | 31 | Duval | 15.0 |
Concatenate the data
The data from other_states is in the same format as the data from top3. It has the same columns in the same order.
combined_df = pd.concat([top3, other_states], axis=0)
len(combined_df) == len(top3) + len(other_states)
True
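Before concatenating, it’s worth confirming that the two dataframes really do share the same columns in the same order. A quick sketch of that check, using small stand-in frames in place of the CSV data:

```python
import pandas as pd

top3 = pd.DataFrame({'state_name': ['California'], 'aqi': [11.0]})
other_states = pd.DataFrame({'state_name': ['Arizona'], 'aqi': [18.0]})

# Column names and order must match for a clean vertical concat.
same_columns = list(top3.columns) == list(other_states.columns)
print(same_columns)  # → True
```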
Complex Boolean masking
According to the U.S. Environmental Protection Agency (EPA), AQI values of 51-100 are of “Moderate” concern. With that threshold in mind, let’s examine the data for the state of Washington.
mask = (combined_df['state_name'] == 'Washington') & (combined_df['aqi'] >= 51)
combined_df[mask]
|   | state_code | state_name | county_code | county_name | aqi |
|---|---|---|---|---|---|
| 40 | 53 | Washington | 33 | King | 55.0 |
| 82 | 53 | Washington | 61 | Snohomish | 76.0 |
| 121 | 53 | Washington | 77 | Yakima | 58.0 |
| 122 | 53 | Washington | 77 | Yakima | 57.0 |
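Conditions are combined with & (and), | (or), and ~ (not), with each sub-condition in parentheses. For range checks like the Moderate band, Series.between() is a readable alternative to chained comparisons. A small sketch with stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    'state_name': ['Washington', 'Washington', 'Oregon'],
    'aqi': [55.0, 20.0, 76.0],
})

# between(51, 100) is an inclusive range test: 51 <= aqi <= 100.
mask = (df['state_name'] == 'Washington') & (df['aqi'].between(51, 100))
print(df[mask]['aqi'].tolist())  # → [55.0]
```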
In summary:
- Pandas comes with many built-in functions and tools designed specifically for tabular data, simplifying common tasks such as:
  - Reading and writing data to/from files
  - Quickly computing summary statistics about your data
  - Manipulating, selecting, and filtering data
  - Grouping and aggregating data
  - Adding new data to existing data
- It’s powered by NumPy, whose fast array operations give it its performance.
- Its interface makes working with tabular data easy, presenting data in familiar rows and columns.