Introduction
In this activity, we act as a consultant for a scientific organization that works to support and sustain penguin colonies. We are tasked with helping other staff members learn more about penguins in order to achieve this mission.
The data for this activity is a spreadsheet of measurements for a sample of 344 penguins, including species, island, and sex. We will use a K-means clustering model to group this data and identify patterns that provide important insights about penguins.
Step 1: Imports
# Import standard operational packages.
import numpy as np
import pandas as pd
# Import tools for modeling and evaluation.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Import visualization packages.
import matplotlib.pyplot as plt
import seaborn as sns
# Save the `pandas` DataFrame in variable `penguins`
penguins = pd.read_csv("penguins.csv")
# Review the first 10 rows
penguins.head(n = 10)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN |
Step 2: Data exploration
After loading the dataset, the next step is to prepare the data to be suitable for clustering. This includes:
- Exploring data
- Checking for missing values
- Encoding data
- Dropping a column
- Scaling the features using `StandardScaler`
Explore data
To cluster penguins of multiple different species, we need to determine how many different types of penguin species are in the dataset.
# Find out how many penguin types there are
penguins['species'].unique()
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
# Find the count of each species type
penguins['species'].value_counts(dropna = False)
Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64
- There are three species. Note that Chinstrap penguins are less common than the other two species. This may affect the K-means clustering, since K-means performs best with similarly sized groupings.
- For the purposes of clustering, we’ll pretend we don’t know that there are three species. Then we can explore whether the algorithm discovers the species on its own. We might even find other relationships in the data.
Check for missing values
An assumption of K-means is that there are no missing values.
# Check for missing values
penguins.isnull().sum()
species              0
island               0
bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
sex                 11
dtype: int64
Now, we’ll drop the rows with missing values and save the resulting pandas DataFrame in a variable named `penguins_subset`.
# Drop rows with missing values
# Save DataFrame in variable `penguins_subset`
penguins_subset = penguins.dropna(axis=0).reset_index(drop = True)
# Check for missing values
penguins_subset.isna().sum()
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
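As a quick check of how many rows remain (in the standard Palmer Penguins data, 333 of the 344 rows are complete cases, so this should return `(333, 7)`):
# Confirm the number of remaining rows and columns
penguins_subset.shape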
# View first 10 rows
penguins_subset.head(10)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male |
| 5 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female |
| 6 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male |
| 7 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | female |
| 8 | Adelie | Torgersen | 38.6 | 21.2 | 191.0 | 3800.0 | male |
| 9 | Adelie | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | male |
Encode data
Some versions of the penguins dataset encode the values in the sex column as ‘Male’ and ‘Female’ instead of ‘MALE’ and ‘FEMALE’. The code below standardizes all values to uppercase.
penguins_subset['sex'] = penguins_subset['sex'].str.upper()
K-means needs numeric columns for clustering. We’ll convert the categorical column `'sex'` into numeric. There is no need to convert the `'species'` column because it isn’t being used as a feature in the clustering algorithm.
# Convert `sex` column from categorical to numeric
penguins_subset = pd.get_dummies(penguins_subset, drop_first = True, columns=['sex'])
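With `drop_first=True`, `get_dummies` replaces `'sex'` with a single indicator column named `sex_MALE` (1 for male, 0 for female). As a quick sanity check of the encoding:
# Confirm the new indicator column and its value counts
penguins_subset['sex_MALE'].value_counts()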
Drop a column
Drop the categorical column `island` from the dataset. While it has value, this notebook is trying to confirm whether penguins of the same species exhibit different physical characteristics based on sex, which doesn’t involve location.
Note that the `'species'` column is not numeric. We won’t drop the `'species'` column for now, because it could help us interpret the clusters later.
# Drop the island column
penguins_subset = penguins_subset.drop(['island'], axis=1)
Scale the features
Because K-means uses distance between observations as its measure of similarity, it’s important to scale the data before modeling. `StandardScaler` scales each value xᵢ of a feature X by subtracting the feature’s mean and dividing by its standard deviation:

x_scaled = (xᵢ − mean(X)) / σ

This ensures that every feature has a mean of 0 and a standard deviation of 1.
Note: Because the species column isn’t a feature, it doesn’t need to be scaled.
First, we’ll copy all the features except the 'species'
column to a DataFrame X
.
# Exclude `species` variable from X
X = penguins_subset.drop(['species'], axis=1)
#Scale the features
#Assign the scaled data to variable `X_scaled`
X_scaled = StandardScaler().fit_transform(X)
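To confirm that the formula above matches what `StandardScaler` computes, we can standardize manually and compare (a quick sanity check; note that `StandardScaler` uses the population standard deviation, i.e. `ddof=0`):
# Standardize manually: subtract each column's mean, divide by its std (ddof=0)
X_num = X.astype(float)
X_manual = (X_num - X_num.mean(axis=0)) / X_num.std(axis=0, ddof=0)
print(np.allclose(X_scaled, X_manual))  # Should print True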
Step 3: Data modeling
Now, we’ll fit K-means and evaluate inertia for different values of k. Because we may not know how many clusters exist in the data, we’ll examine the inertia across a range of k values. To do this, we’ll write a function called `kmeans_inertia` that takes in `num_clusters` and `x_vals` (`X_scaled`) and returns a list of each k-value’s inertia.
# Fit K-means and evaluate inertia for different values of k
num_clusters = [i for i in range(2, 11)]

def kmeans_inertia(num_clusters, x_vals):
    """
    Accepts as arguments list of ints and data array.
    Fits a KMeans model where k = each value in the list of ints.
    Returns each k-value's inertia appended to a list.
    """
    inertia = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, random_state=42)
        kms.fit(x_vals)
        inertia.append(kms.inertia_)
    return inertia
# Return a list of inertia for k=2 to 10
inertia = kmeans_inertia(num_clusters, X_scaled)
inertia
[885.6224143652249, 578.8284278107235, 386.14534424773285, 284.5464837898288, 217.92858573807678, 201.39287843423264, 186.82270634899209, 173.47283154242746, 164.55854201979943]
Next, we’ll create a line plot that shows the relationship between `num_clusters` and `inertia`.
# Create a line plot
plot = sns.lineplot(x=num_clusters, y=inertia, marker = 'o')
plot.set_xlabel("Number of clusters")
plot.set_ylabel("Inertia")

The plot seems to depict an elbow at six clusters, but there isn’t a clear way to confirm from inertia alone that a six-cluster model is optimal. Therefore, the silhouette scores should also be checked.
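Before moving on, a rough numeric check of the elbow: the percentage drop in inertia at each increase in k shrinks sharply once the elbow is passed (a simple heuristic sketch, not part of the original notebook):
# Percentage decrease in inertia from each k to the next
pct_drop = [100 * (inertia[i - 1] - inertia[i]) / inertia[i - 1]
            for i in range(1, len(inertia))]
for k, drop in zip(num_clusters[1:], pct_drop):
    print(f"k={k}: {drop:.1f}% drop")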
Step 4: Results and evaluation
Now, we’ll evaluate the clusterings using the `silhouette_score()` function. Silhouette scores measure how well each observation fits its assigned cluster relative to the nearest neighboring cluster.
Then, we’ll compare the silhouette score for each value of k from 2 through 10. To do this, we’ll write a function called `kmeans_sil` that takes in `num_clusters` and `x_vals` (`X_scaled`) and returns a list of each k-value’s silhouette score.
# Evaluate silhouette score
# Write a function to return a list of each k-value's score
def kmeans_sil(num_clusters, x_vals):
    """
    Accepts as arguments list of ints and data array.
    Fits a KMeans model where k = each value in the list of ints.
    Calculates a silhouette score for each k value.
    Returns each k-value's silhouette score appended to a list.
    """
    sil_score = []
    for num in num_clusters:
        kms = KMeans(n_clusters=num, random_state=42)
        kms.fit(x_vals)
        sil_score.append(silhouette_score(x_vals, kms.labels_))
    return sil_score
sil_score = kmeans_sil(num_clusters, X_scaled)
sil_score
[0.44398088353055243, 0.45101024097188364, 0.5080140996630784, 0.519998574860868, 0.5263224884981607, 0.47774022332151733, 0.42680523270292947, 0.35977478703657334, 0.3589883410610364]
Next, we’ll create a line plot that shows the relationship between `num_clusters` and `sil_score`.
# Create a line plot
plot = sns.lineplot(x=num_clusters, y=sil_score, marker = 'o')
plot.set_xlabel("# of clusters")
plot.set_ylabel("Silhouette Score")

Silhouette scores near 1 indicate that samples are far away from neighboring clusters. Scores close to 0 indicate that samples are on or very close to the decision boundary between two neighboring clusters.
The plot indicates that the silhouette score is closest to 1 when the data is partitioned into six clusters, although five clusters also yield a relatively good silhouette score.
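The best k according to silhouette score can also be read off programmatically (a one-line check using the lists computed above):
# k value with the highest silhouette score
best_k = num_clusters[np.argmax(sil_score)]
print(best_k)  # 6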
Optimal k-value
Based on the silhouette scores (supported by the elbow plot), we’ll choose k = 6 and fit a six-cluster model to the dataset.
# Fit a 6-cluster model
kmeans6 = KMeans(n_clusters=6, random_state=42)
kmeans6.fit(X_scaled)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto', random_state=42, tol=0.0001, verbose=0)
# Print unique labels
print('Unique labels:', np.unique(kmeans6.labels_))
Unique labels: [0 1 2 3 4 5]
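As a side note, to assign clusters to new observations later, the same scaling must be applied before calling `predict`. A minimal sketch (the scaler is refit on `X`, and `new_penguin` is a hypothetical observation, not from the dataset):
# Keep a fitted scaler so new observations can be transformed consistently
scaler = StandardScaler().fit(X)
# Hypothetical penguin: bill length/depth (mm), flipper length (mm), body mass (g), sex_MALE
new_penguin = pd.DataFrame([[45.0, 17.0, 200.0, 4200.0, 1]], columns=X.columns)
print(kmeans6.predict(scaler.transform(new_penguin)))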
Now, we’ll create a new column `cluster` in the DataFrame `penguins_subset` that indicates each row’s cluster assignment. Understanding what each cluster label represents will help us decide whether the clustering makes sense.
Note: This task is done using `penguins_subset` because it is often easier to interpret unscaled data.
# Create a new column `cluster`
penguins_subset['cluster'] = kmeans6.labels_
penguins_subset.head()
| | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex_MALE | cluster |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 1 | 0 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 0 | 2 |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 0 | 2 |
| 3 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 0 | 2 |
| 4 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 1 | 0 |
Let’s use `groupby` to verify whether any `'cluster'` can be differentiated by `'species'`.
# Verify if any `cluster` can be differentiated by `species`
penguins_subset.groupby(by=['cluster', 'species']).size()
cluster  species
0        Adelie       71
1        Gentoo       58
2        Adelie       73
         Chinstrap     5
3        Gentoo       61
4        Adelie        2
         Chinstrap    34
5        Chinstrap    29
dtype: int64
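The same breakdown can also be viewed as a cross-tabulation, which can be easier to scan (an equivalent alternative to the `groupby` above):
# Rows are clusters, columns are species; cells are counts
pd.crosstab(penguins_subset['cluster'], penguins_subset['species'])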
Next, we’ll interpret the `groupby` output. Although it already shows that the clusters are largely differentiated by `'species'`, it is useful to visualize these results; the graph below confirms the pattern.
penguins_subset.groupby(by=['cluster', 'species']).size().plot.bar(
    title='Clusters differentiated by species',
    figsize=(6, 5),
    ylabel='Size',
    xlabel='(Cluster, Species)')

Now let’s use `groupby` to verify whether each `'cluster'` can be differentiated by `'species'` and `'sex_MALE'`.
# Verify if each `cluster` can be differentiated by `species` AND `sex_MALE`
penguins_subset.groupby(by=['cluster','species', 'sex_MALE']).size().sort_values(ascending = False)
cluster  species    sex_MALE
2        Adelie     0           73
0        Adelie     1           71
3        Gentoo     1           61
1        Gentoo     0           58
4        Chinstrap  1           34
5        Chinstrap  0           29
2        Chinstrap  0            5
4        Adelie     1            2
dtype: int64
- Even though clusters 2 and 4 weren’t composed purely of one species, the `groupby` output indicates that the algorithm produced clusters mostly differentiated by species and sex.
Finally, let’s visualize these results. The graph shows that each `'cluster'` is differentiated by `'species'` and `'sex_MALE'`: each cluster is mostly composed of one sex and one species.
penguins_subset.groupby(by=['cluster', 'species', 'sex_MALE']).size().unstack(
    level='species', fill_value=0).plot.bar(
    title='Clusters differentiated by species and sex',
    figsize=(6, 5),
    ylabel='Size',
    xlabel='(Cluster, Sex)')
plt.legend(bbox_to_anchor=(1.3, 1.0))

Considerations
Here are some key takeaways:
- Much of a machine learning workflow consists of cleaning, encoding, and scaling data.
- Inertia and silhouette scores can be used to find the optimal number of clusters.
- Clustering can find natural groupings in data.
- The clusters in this lab are mostly differentiated by species and sex as shown by the groupby results and corresponding graphs.
- The elbow plot and, especially, the silhouette scores suggest that six clusters are optimal for this data.
- Having six clusters makes sense because there is sexual dimorphism (differences between the sexes) in each of the three species (2 sexes × 3 species = 6 clusters); the quick check below illustrates this.
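For instance, mean body mass by species and sex shows the dimorphism directly (a quick illustrative check on `penguins_subset` as built above):
# Mean body mass per (species, sex) group; within each species,
# males are on average heavier, consistent with sexual dimorphism
penguins_subset.groupby(['species', 'sex_MALE'])['body_mass_g'].mean()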
What summary could we provide to stakeholders?
- K-means clustering effectively grouped this data and helped identify patterns that can educate team members about penguins.
- The success of these cluster results suggests that the organization can apply clustering to other projects and continue to build employee education.
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.