In my previous post, we checked several different ways of selecting data in a dataframe using name-based and integer-based indexing. Here we’ll filter the data in the dataframe based on value-based conditions.
Boolean Masking
Boolean masking (also called Boolean indexing) is a filtering technique that overlays a Boolean grid onto a dataframe’s index in order to select only the values in the dataframe that align with the True values of the grid.
The result is that any rows in the dataframe that are indicated as True in the Boolean mask remain in the dataframe, and any rows that are indicated as False get filtered out.
# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
25362, 24622],
'moons': [0, 0, 1, 2, 80, 83, 27, 14]
}
# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planets
planet | radius_km | moons | |
0 | Mercury | 2440 | 0 |
1 | Venus | 6052 | 0 |
2 | Earth | 6371 | 1 |
3 | Mars | 3390 | 2 |
4 | Jupiter | 69911 | 80 |
5 | Saturn | 58232 | 83 |
6 | Uranus | 25362 | 27 |
7 | Neptune | 24622 | 14 |
# Create a Boolean mask of planets with fewer than 20 moons.
mask = planets['moons'] < 20
mask
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: moons, dtype: bool
Applying Mask to a DataFrame
The above one is the boolean mask. To apply this mask to the dataframe, we simply insert it into selector brackets and apply it to the dataframe.
# Apply the Boolean mask to the dataframe to filter it so it contains
# only the planets with fewer than 20 moons.
planets[mask]
planet | radius_km | moons | |
0 | Mercury | 2440 | 0 |
1 | Venus | 6052 | 0 |
2 | Earth | 6371 | 1 |
3 | Mars | 3390 | 2 |
7 | Neptune | 24622 | 14 |
Or we could define the Boolean mask and apply it in a single line, the result would be the same.
planets[planets['moons'] < 20]
# Boolean masks don't change the data. They're just views.
planets
planet | radius_km | moons | |
0 | Mercury | 2440 | 0 |
1 | Venus | 6052 | 0 |
2 | Earth | 6371 | 1 |
3 | Mars | 3390 | 2 |
4 | Jupiter | 69911 | 80 |
5 | Saturn | 58232 | 83 |
6 | Uranus | 25362 | 27 |
7 | Neptune | 24622 | 14 |
However, we can assign the result to a named variable. This may be useful if we’ll need to reference the list of planets with moons under 20 again later.
moons_under_20 = planets[mask]
moons_under_20
Filtering Data with multiple conditions
Sometimes we’ll need to filter data based on multiple conditions.
# Create a Boolean mask of planets with fewer than 10 moons OR more than 50 moons.
mask = (planets['moons'] < 10) | (planets['moons'] > 50)
# Apply the Boolean mask to filter the data.
planets[mask]
planet | radius_km | moons | |
0 | Mercury | 2440 | 0 |
1 | Venus | 6052 | 0 |
2 | Earth | 6371 | 1 |
3 | Mars | 3390 | 2 |
4 | Jupiter | 69911 | 80 |
5 | Saturn | 58232 | 83 |
# Create a Boolean mask of planets with more than 20 moons,
# excluding them if they have 80 moons
# or if their radius is less than 50,000 km.
mask = (planets['moons'] > 20) & ~(planets['moons'] == 80) & ~(planets['radius_km'] < 50000)
# Apply the mask
planets[mask]
planet | radius_km | moons | |
5 | Saturn | 58232 | 83 |
We could write the above code slightly different as below:
mask = (planets['moons'] > 20) & (planets['moons'] != 80) & (planets['radius_km'] >= 50000)
planets[mask]
And if we want to select just the planet column as a Series object (as we did similarly in previous post), we can use regular selection tools like loc[]:
mask = df['moons'] < 20
df.loc[mask, 'planet']
0 Mercury
1 Venus
2 Earth
3 Mars
7 Neptune
Name: planet, dtype: object
Grouping and Aggregation
groupby() Method
The groupby() method is a pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups.
import numpy as np
import pandas as pd
# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
25362, 24622],
'moons': [0, 0, 1, 2, 80, 83, 27, 14],
'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
'gas giant', 'gas giant', 'ice giant', 'ice giant'],
'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']
}
# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planet | radius_km | moons | type | rings | mean_temp_c | magnetic_field | |
0 | Mercury | 2440 | 0 | terrestrial | no | 167 | yes |
1 | Venus | 6052 | 0 | terrestrial | no | 464 | no |
2 | Earth | 6371 | 1 | terrestrial | no | 15 | yes |
3 | Mars | 3390 | 2 | terrestrial | no | -65 | no |
4 | Jupiter | 69911 | 80 | gas giant | yes | -110 | yes |
5 | Saturn | 58232 | 83 | gas giant | yes | -140 | yes |
6 | Uranus | 25362 | 27 | ice giant | yes | -195 | yes |
7 | Neptune | 24622 | 14 | ice giant | yes | -200 | yes |
# The groupby() function returns a groupby object.
planets.groupby(['type'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7027b117b0d0>
When we call the groupby method on a dataframe, it creates a groupby object. If we do nothing else, the groupby object isn’t very helpful. We’ll basically get a statement saying, “Here’s your object. It’s stored at this address in the computer’s memory.“
# Apply the sum() function to the groupby object to get the sum
# of the values in each numerical column for each group.
planets.groupby(['type']).sum()
radius_km | moons | mean_temp_c | |
type | |||
gas giant | 128143 | 163 | -250 |
ice giant | 49984 | 41 | -395 |
terrestrial | 18253 | 3 | 581 |
Only the numerical columns are returned because the sum method only works on numerical data. The “type” column is an index of this dataframe. This information can be interpreted as the sum of all the values in each group at these respective columns.
# Apply the sum function to the groupby object
# and select only the 'moons' column.
planets.groupby(['type']).sum()[['moons']]
moons | |
type | |
gas giant | 163 |
ice giant | 41 |
terrestrial | 3 |
# Group by type and magnetic_field and get the mean of the values
# in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).mean()
radius_km | moons | mean_temp_c | ||
type | magnetic_field | |||
gas giant | yes | 64071.5 | 81.5 | -125.0 |
ice giant | yes | 24992.0 | 20.5 | -197.5 |
terrestrial | no | 4721.0 | 1.0 | 199.5 |
yes | 4405.5 | 0.5 | 91.0 |
agg() Method
Agg (short for aggregate) method allows us to apply multiple calculations to groups of data.
# Group by type, then use the agg() function to get the mean and median
# of the values in the numeric columns for each group.
planets.groupby(['type']).agg(['mean', 'median'])
radius_km | moons | mean_temp_c | ||||
mean | median | mean | median | mean | median | |
type | ||||||
gas giant | 64071.5 | 64071.5 | 81.5 | 81.5 | -125 | -125 |
ice giant | 24992 | 24992 | 20.5 | 20.5 | -197.5 | -197.5 |
terrestrial | 4563.25 | 4721 | 0.75 | 0.5 | 145.25 | 91 |
# Group by type and magnetic_field, then use the agg() function to get the
# mean and max of the values in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', 'min', 'max'])
radius_km | moons | mean_temp_c | ||||||||
mean | min | max | mean | min | max | mean | min | max | ||
type | magnetic_field | |||||||||
gas giant | yes | 64071.5 | 58232 | 69911 | 81.5 | 80 | 83 | -125 | -140 | -110 |
ice giant | yes | 24992 | 24622 | 25362 | 20.5 | 14 | 27 | -197.5 | -200 | -195 |
terrestrial | no | 4721 | 3390 | 6052 | 1 | 0 | 2 | 199.5 | -65 | 464 |
yes | 4405.5 | 2440 | 6371 | 0.5 | 0 | 1 | 91 | 15 | 167 |
We can define our functions too. The one below will return the 90 percentile of an array.
# Define a function that returns the 90 percentile of an array.
def percentile_90(x):
return x.quantile(0.9)
Then we can call this custom function in our aggregation.
# Group by type and magnetic_field, then use the agg() function
# to apply the mean and the custom-defined `percentile_90()` function
# to the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', percentile_90])
radius_km | moons | mean_temp_c | |||||
mean | percentile_90 | mean | percentile_90 | mean | percentile_90 | ||
type | magnetic_field | ||||||
gas giant | yes | 64071.5 | 68743.1 | 81.5 | 82.7 | -125 | -113 |
ice giant | yes | 24992 | 25288 | 20.5 | 25.7 | -197.5 | -195.5 |
terrestrial | no | 4721 | 5785.8 | 1 | 1.8 | 199.5 | 411.1 |
yes | 4405.5 | 5977.9 | 0.5 | 0.9 | 91 | 151.8 |
Notice that we can enter “mean” as a string because it’s an existing method of groupby objects, but we type the “percentile 90” function as an object because it’s custom-defined.
We can specify where to use agg and also differentiate it based on our needs. Imagine we have the following dataframe.
clothes
color mass_g price_usd type
0 red 125 20 pants
1 blue 440 35 shirt
2 green 680 50 shirt
3 blue 200 40 pants
4 green 395 100 shirt
5 red 485 75 pants
clothes.groupby('color').agg({'price_usd': ['mean', 'max'],
'mass_g': ['min', 'max']})
price_usd mass_g mean max min max color blue 37.5 40 200 440 green 75.0 100 395 680 red 47.5 75 125 485