Boolean Masking and Grouping

In my previous post, we checked several different ways of selecting data in a dataframe using name-based and integer-based indexing. Here we’ll filter the data in the dataframe based on value-based conditions.

Boolean Masking

Boolean masking (also called Boolean indexing) is a filtering technique that overlays a Boolean grid onto a dataframe’s index in order to select only the values in the dataframe that align with the True values of the grid.

The result is that any rows in the dataframe that are indicated as True in the Boolean mask remain in the dataframe, and any rows that are indicated as False get filtered out.

# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
       'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
       'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }
# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planets
planetradius_kmmoons
0Mercury24400
1Venus60520
2Earth63711
3Mars33902
4Jupiter6991180
5Saturn5823283
6Uranus2536227
7Neptune2462214
# Create a Boolean mask of planets with fewer than 20 moons.
mask = planets['moons'] < 20
mask
0     True
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: moons, dtype: bool

Applying Mask to a DataFrame

The above one is the boolean mask. To apply this mask to the dataframe, we simply insert it into selector brackets and apply it to the dataframe.

# Apply the Boolean mask to the dataframe to filter it so it contains
# only the planets with fewer than 20 moons.
planets[mask]
planetradius_kmmoons
0Mercury24400
1Venus60520
2Earth63711
3Mars33902
7Neptune2462214

Or we could define the Boolean mask and apply it in a single line, the result would be the same.

planets[planets['moons'] < 20]
# Boolean masks don't change the data. They're just views.
planets
planetradius_kmmoons
0Mercury24400
1Venus60520
2Earth63711
3Mars33902
4Jupiter6991180
5Saturn5823283
6Uranus2536227
7Neptune2462214

However, we can assign the result to a named variable. This may be useful if we’ll need to reference the list of planets with moons under 20 again later.

moons_under_20 = planets[mask]
moons_under_20

Filtering Data with multiple conditions

Sometimes we’ll need to filter data based on multiple conditions.

# Create a Boolean mask of planets with fewer than 10 moons OR more than 50 moons.
mask = (planets['moons'] < 10) | (planets['moons'] > 50)
# Apply the Boolean mask to filter the data.
planets[mask]
planetradius_kmmoons
0Mercury24400
1Venus60520
2Earth63711
3Mars33902
4Jupiter6991180
5Saturn5823283
# Create a Boolean mask of planets with more than 20 moons, 
# excluding them if they have 80 moons 
# or if their radius is less than 50,000 km.
mask = (planets['moons'] > 20) & ~(planets['moons'] == 80) & ~(planets['radius_km'] < 50000)
# Apply the mask
planets[mask]
planetradius_kmmoons
5Saturn5823283

We could write the above code slightly different as below:

mask = (planets['moons'] > 20) & (planets['moons'] != 80) & (planets['radius_km'] >= 50000)
planets[mask]

And if we want to select just the planet column as a Series object (as we did similarly in previous post), we can use regular selection tools like loc[]:

mask = df['moons'] < 20
df.loc[mask, 'planet']
0    Mercury
1 Venus
2 Earth
3 Mars
7 Neptune
Name: planet, dtype: object

Grouping and Aggregation

groupby() Method

The groupby() method is a pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups.

import numpy as np
import pandas as pd

# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']
        }

# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planetradius_kmmoonstyperingsmean_temp_cmagnetic_field
0Mercury24400terrestrialno167yes
1Venus60520terrestrialno464no
2Earth63711terrestrialno15yes
3Mars33902terrestrialno-65no
4Jupiter6991180gas giantyes-110yes
5Saturn5823283gas giantyes-140yes
6Uranus2536227ice giantyes-195yes
7Neptune2462214ice giantyes-200yes
# The groupby() function returns a groupby object.
planets.groupby(['type'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7027b117b0d0>

When we call the groupby method on a dataframe, it creates a groupby object. If we do nothing else, the groupby object isn’t very helpful. We’ll basically get a statement saying, “Here’s your object. It’s stored at this address in the computer’s memory.

# Apply the sum() function to the groupby object to get the sum
# of the values in each numerical column for each group.
planets.groupby(['type']).sum()
radius_kmmoonsmean_temp_c
type
gas giant128143163-250
ice giant4998441-395
terrestrial182533581

Only the numerical columns are returned because the sum method only works on numerical data. The “type” column is an index of this dataframe. This information can be interpreted as the sum of all the values in each group at these respective columns.

# Apply the sum function to the groupby object 
# and select only the 'moons' column.
planets.groupby(['type']).sum()[['moons']]
moons
type
gas giant163
ice giant41
terrestrial3
# Group by type and magnetic_field and get the mean of the values
# in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).mean()
radius_kmmoonsmean_temp_c
typemagnetic_field
gas giantyes64071.581.5-125.0
ice giantyes24992.020.5-197.5
terrestrialno4721.01.0199.5
yes4405.50.591.0

agg() Method

Agg (short for aggregate) method allows us to apply multiple calculations to groups of data.

# Group by type, then use the agg() function to get the mean and median
# of the values in the numeric columns for each group.
planets.groupby(['type']).agg(['mean', 'median'])
radius_kmmoonsmean_temp_c
meanmedianmeanmedianmeanmedian
type
gas giant64071.564071.581.581.5-125-125
ice giant249922499220.520.5-197.5-197.5
terrestrial4563.2547210.750.5145.2591
# Group by type and magnetic_field, then use the agg() function to get the
# mean and max of the values in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', 'min', 'max'])
radius_kmmoonsmean_temp_c
meanminmaxmeanminmaxmeanminmax
typemagnetic_field
gas giantyes64071.5582326991181.58083-125-140-110
ice giantyes24992246222536220.51427-197.5-200-195
terrestrialno472133906052102199.5-65464
yes4405.5244063710.5019115167

We can define our functions too. The one below will return the 90 percentile of an array.

# Define a function that returns the 90 percentile of an array.
def percentile_90(x):
    return x.quantile(0.9)

Then we can call this custom function in our aggregation.

# Group by type and magnetic_field, then use the agg() function 
# to apply the mean and the custom-defined `percentile_90()` function 
# to the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', percentile_90])
radius_kmmoonsmean_temp_c
meanpercentile_90meanpercentile_90meanpercentile_90
typemagnetic_field
gas giantyes64071.568743.181.582.7-125-113
ice giantyes249922528820.525.7-197.5-195.5
terrestrialno47215785.811.8199.5411.1
yes4405.55977.90.50.991151.8

Notice that we can enter “mean” as a string because it’s an existing method of groupby objects, but we type the “percentile 90” function as an object because it’s custom-defined.

We can specify where to use agg and also differentiate it based on our needs. Imagine we have the following dataframe.

clothes
   color  mass_g  price_usd   type
0 red 125 20 pants
1 blue 440 35 shirt
2 green 680 50 shirt
3 blue 200 40 pants
4 green 395 100 shirt
5 red 485 75 pants
clothes.groupby('color').agg({'price_usd': ['mean', 'max'],
                             'mass_g': ['min', 'max']})
           price_usd      mass_g     
           mean  max    min  max
color                           
blue       37.5   40    200  440
green      75.0  100    395  680
red        47.5   75    125  485

In