Boolean Masking and Grouping

In my previous post, we checked several different ways of selecting data in a dataframe using name-based and integer-based indexing. Here we’ll filter the data in the dataframe based on value-based conditions.

Boolean Masking

Boolean masking (also called Boolean indexing) is a filtering technique that overlays a Boolean grid onto a dataframe’s index in order to select only the values in the dataframe that align with the True values of the grid.

The result is that any rows in the dataframe that are indicated as True in the Boolean mask remain in the dataframe, and any rows that are indicated as False get filtered out.

# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
       'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
       'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }
# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)
planets

	planet	radius_km	moons
0	Mercury	2440	0
1	Venus	6052	0
2	Earth	6371	1
3	Mars	3390	2
4	Jupiter	69911	80
5	Saturn	58232	83
6	Uranus	25362	27
7	Neptune	24622	14

# Create a Boolean mask of planets with fewer than 20 moons.
mask = planets['moons'] < 20
mask

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: moons, dtype: bool

Applying Mask to a DataFrame

The above one is the boolean mask. To apply this mask to the dataframe, we simply insert it into selector brackets and apply it to the dataframe.

# Apply the Boolean mask to the dataframe to filter it so it contains
# only the planets with fewer than 20 moons.
planets[mask]

	planet	radius_km	moons
0	Mercury	2440	0
1	Venus	6052	0
2	Earth	6371	1
3	Mars	3390	2
7	Neptune	24622	14

Or we could define the Boolean mask and apply it in a single line, the result would be the same.

planets[planets['moons'] < 20]

# Boolean masks don't change the data. They're just views.
planets

	planet	radius_km	moons
0	Mercury	2440	0
1	Venus	6052	0
2	Earth	6371	1
3	Mars	3390	2
4	Jupiter	69911	80
5	Saturn	58232	83
6	Uranus	25362	27
7	Neptune	24622	14

However, we can assign the result to a named variable. This may be useful if we’ll need to reference the list of planets with moons under 20 again later.

moons_under_20 = planets[mask]
moons_under_20

Filtering Data with multiple conditions

Sometimes we’ll need to filter data based on multiple conditions.

# Create a Boolean mask of planets with fewer than 10 moons OR more than 50 moons.
mask = (planets['moons'] < 10) | (planets['moons'] > 50)
# Apply the Boolean mask to filter the data.
planets[mask]

	planet	radius_km	moons
0	Mercury	2440	0
1	Venus	6052	0
2	Earth	6371	1
3	Mars	3390	2
4	Jupiter	69911	80
5	Saturn	58232	83

# Create a Boolean mask of planets with more than 20 moons, 
# excluding them if they have 80 moons 
# or if their radius is less than 50,000 km.
mask = (planets['moons'] > 20) & ~(planets['moons'] == 80) & ~(planets['radius_km'] < 50000)
# Apply the mask
planets[mask]

	planet	radius_km	moons
5	Saturn	58232	83

We could write the above code slightly different as below:

mask = (planets['moons'] > 20) & (planets['moons'] != 80) & (planets['radius_km'] >= 50000)
planets[mask]

And if we want to select just the planet column as a Series object (as we did similarly in previous post), we can use regular selection tools like loc[]:

mask = df['moons'] < 20
df.loc[mask, 'planet']

0    Mercury
1      Venus
2      Earth
3       Mars
7    Neptune
Name: planet, dtype: object

Grouping and Aggregation

groupby() Method

The groupby() method is a pandas DataFrame method that groups rows of the dataframe together based on their values at one or more columns, which allows further analysis of the groups.

import numpy as np
import pandas as pd

# Instantiate a dictionary of planetary data.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']
        }

# Use pd.DataFrame() function to convert dictionary to dataframe.
planets = pd.DataFrame(data)

	planet	radius_km	moons	type	rings	mean_temp_c	magnetic_field
0	Mercury	2440	0	terrestrial	no	167	yes
1	Venus	6052	0	terrestrial	no	464	no
2	Earth	6371	1	terrestrial	no	15	yes
3	Mars	3390	2	terrestrial	no	-65	no
4	Jupiter	69911	80	gas giant	yes	-110	yes
5	Saturn	58232	83	gas giant	yes	-140	yes
6	Uranus	25362	27	ice giant	yes	-195	yes
7	Neptune	24622	14	ice giant	yes	-200	yes

# The groupby() function returns a groupby object.
planets.groupby(['type'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7027b117b0d0>

When we call the groupby method on a dataframe, it creates a groupby object. If we do nothing else, the groupby object isn’t very helpful. We’ll basically get a statement saying, “Here’s your object. It’s stored at this address in the computer’s memory.“

# Apply the sum() function to the groupby object to get the sum
# of the values in each numerical column for each group.
planets.groupby(['type']).sum()

	radius_km	moons	mean_temp_c
type
gas giant	128143	163	-250
ice giant	49984	41	-395
terrestrial	18253	3	581

Only the numerical columns are returned because the sum method only works on numerical data. The “type” column is an index of this dataframe. This information can be interpreted as the sum of all the values in each group at these respective columns.

# Apply the sum function to the groupby object 
# and select only the 'moons' column.
planets.groupby(['type']).sum()[['moons']]

	moons
type
gas giant	163
ice giant	41
terrestrial	3

# Group by type and magnetic_field and get the mean of the values
# in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).mean()

		radius_km	moons	mean_temp_c
type	magnetic_field
gas giant	yes	64071.5	81.5	-125.0
ice giant	yes	24992.0	20.5	-197.5
terrestrial	no	4721.0	1.0	199.5
terrestrial	yes	4405.5	0.5	91.0

agg() Method

Agg (short for aggregate) method allows us to apply multiple calculations to groups of data.

# Group by type, then use the agg() function to get the mean and median
# of the values in the numeric columns for each group.
planets.groupby(['type']).agg(['mean', 'median'])

	radius_km		moons		mean_temp_c
	mean	median	mean	median	mean	median
type
gas giant	64071.5	64071.5	81.5	81.5	-125	-125
ice giant	24992	24992	20.5	20.5	-197.5	-197.5
terrestrial	4563.25	4721	0.75	0.5	145.25	91

# Group by type and magnetic_field, then use the agg() function to get the
# mean and max of the values in the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', 'min', 'max'])

		radius_km			moons			mean_temp_c
		mean	min	max	mean	min	max	mean	min	max
type	magnetic_field
gas giant	yes	64071.5	58232	69911	81.5	80	83	-125	-140	-110
ice giant	yes	24992	24622	25362	20.5	14	27	-197.5	-200	-195
terrestrial	no	4721	3390	6052	1	0	2	199.5	-65	464
terrestrial	yes	4405.5	2440	6371	0.5	0	1	91	15	167

We can define our functions too. The one below will return the 90 percentile of an array.

# Define a function that returns the 90 percentile of an array.
def percentile_90(x):
    return x.quantile(0.9)

Then we can call this custom function in our aggregation.

# Group by type and magnetic_field, then use the agg() function 
# to apply the mean and the custom-defined `percentile_90()` function 
# to the numeric columns for each group.
planets.groupby(['type', 'magnetic_field']).agg(['mean', percentile_90])

		radius_km		moons		mean_temp_c
		mean	percentile_90	mean	percentile_90	mean	percentile_90
type	magnetic_field
gas giant	yes	64071.5	68743.1	81.5	82.7	-125	-113
ice giant	yes	24992	25288	20.5	25.7	-197.5	-195.5
terrestrial	no	4721	5785.8	1	1.8	199.5	411.1
terrestrial	yes	4405.5	5977.9	0.5	0.9	91	151.8

Notice that we can enter “mean” as a string because it’s an existing method of groupby objects, but we type the “percentile 90” function as an object because it’s custom-defined.

We can specify where to use agg and also differentiate it based on our needs. Imagine we have the following dataframe.

clothes

   color  mass_g  price_usd   type
0    red     125         20  pants
1   blue     440         35  shirt
2  green     680         50  shirt
3   blue     200         40  pants
4  green     395        100  shirt
5    red     485         75  pants

clothes.groupby('color').agg({'price_usd': ['mean', 'max'],
                             'mass_g': ['min', 'max']})

           price_usd      mass_g     
           mean  max    min  max
color                           
blue       37.5   40    200  440
green      75.0  100    395  680
red        47.5   75    125  485