Clustering beyond K-means: DBSCAN and Agglomerative clustering

Here we will review the strengths of K-means and demonstrate some of its weaknesses. We will also discuss two alternative clustering methodologies: DBSCAN and agglomerative clustering.

Why use other algorithms? 

In the K-means demonstration code, we saw how the K-means algorithm clustered the pixels of a photograph of some tulips. Recall that the color of every pixel is represented by a vector with three values, each with a range of [0, 255]. These three values correspond to the amounts of red, green, and blue (RGB) that are combined to make the color of each pixel. The RGB values can be plotted in a three-dimensional space in order to visualize how the colors in the photograph are related to each other. 
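
To make this concrete, here is a minimal sketch (not the original demonstration code) of how an image's pixels can be reshaped into an array of RGB vectors and plotted in three dimensions. The file name tulips.jpg and the choice of Pillow and Matplotlib are assumptions for illustration.

```python
# A minimal sketch: load an image, reshape it into an (n_pixels, 3) array of
# RGB values, and scatter a sample of those values in 3-D color space.
# "tulips.jpg" is a placeholder file name, not the original demo's file.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

img = np.asarray(Image.open("tulips.jpg").convert("RGB"))  # shape: (height, width, 3)
pixels = img.reshape(-1, 3)                                # one row per pixel: [R, G, B], each in [0, 255]

# Plot a random sample of pixels in RGB space, colored by their own color
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), size=2000, replace=False)]
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(sample[:, 0], sample[:, 1], sample[:, 2], c=sample / 255.0, s=2)
ax.set_xlabel("R"); ax.set_ylabel("G"); ax.set_zlabel("B")
plt.show()
```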

In this way, we could go from this:

To this:

From this graph, it is clear that all of the pixels fall within a space that resembles a lopsided pyramid. In the same demonstration, the pixels were grouped into three clusters using K-means. When the RGB values for each pixel were replaced with those of its nearest centroid, the result looked like this:

We might have clustered this data differently ourselves. After all, there do appear to be denser regions of points along the vertices of the pyramid:

The K-means clusters were so boxy because K-means works by minimizing intra-cluster (within-cluster) variance. In other words, it aims to minimize the distance between points and their centroids. This means that K-means works best when the clusters are roughly spherical. 
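
Here is a hedged sketch of the K-means step just described, continuing from the previous code block: fit a three-cluster model on the pixel array, recolor every pixel with its nearest centroid, and verify the quantity K-means minimizes (the inertia). Parameter choices such as random_state are illustrative, not taken from the original demo.

```python
# A hedged sketch continuing from the previous block: fit K-means with three
# clusters, recolor each pixel with its nearest centroid, and inspect the
# quantity K-means minimizes (inertia, the sum of squared distances from
# points to their assigned centroids).
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(pixels)

# Replace every pixel's RGB value with the RGB value of its assigned centroid
recolored = kmeans.cluster_centers_[kmeans.labels_]          # (n_pixels, 3)
recolored_img = recolored.reshape(img.shape).astype("uint8")

# Inertia is the sum of squared point-to-centroid distances
sq_dists = np.sum((pixels - recolored) ** 2, axis=1)
print(np.isclose(sq_dists.sum(), kmeans.inertia_))           # True (up to floating-point tolerance)
```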

If we aren’t satisfied with the way K-means is clustering our data, there are many other clustering methods available to choose from. 

DBSCAN 

DBSCAN stands for density-based spatial clustering of applications with noise. Instead of trying to minimize variance between points in each cluster, DBSCAN searches our data space for continuous regions of high density. Here’s how it works.

DBSCAN’s workflow:

1. Start at a random point.
2. Examine the radius ε around that point.
3. If at least min_samples points (including the point itself) fall within radius ε, the point is a core instance, and all samples in its ε-neighborhood are assigned to the same cluster (this rule is sketched in code just after this list). In this example, min_samples = 3.
4. Repeat the check for each point added to the cluster. A point that falls within ε of a core instance but does not itself have min_samples neighbors is density-reachable and joins the cluster, but it is not a core instance.
5. When no more points can be added to the cluster, move to another unvisited point and begin a new cluster, repeating the steps above.
6. Points that have no core instance in their ε-neighborhood are labeled noise and are not assigned to any cluster.
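
The core-instance rule in step 3 can be written out directly. Below is a minimal NumPy sketch on toy data; the eps and min_samples values are illustrative assumptions, not values from the original demo.

```python
# A minimal sketch of the core-instance rule: a point is a core instance if at
# least `min_samples` points (including itself) lie within radius eps of it.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy 3-D data standing in for pixel RGB values

eps, min_samples = 0.5, 3

# Pairwise Euclidean distances between all points
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbor_counts = (dists <= eps).sum(axis=1)   # counts include the point itself
is_core = neighbor_counts >= min_samples

print(f"{is_core.sum()} of {len(X)} points are core instances")
```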

Hyperparameters 

The most important hyperparameters for DBSCAN in scikit-learn are: 

  • eps: Epsilon (ε) – The radius of our search area from any given point. 
  • min_samples: The number of samples in an ε-neighborhood for a point to be considered a core point (including itself). 
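
Here is a hedged usage sketch of scikit-learn's DBSCAN with these two hyperparameters. The eps and min_samples values are illustrative guesses rather than tuned values, the pixel array comes from the earlier sketches, and the data is scaled and subsampled to keep the neighborhood search manageable.

```python
# A hedged usage sketch of scikit-learn's DBSCAN on the pixel data.
import numpy as np
from sklearn.cluster import DBSCAN

X = pixels[::20] / 255.0               # every 20th pixel, scaled to [0, 1]

db = DBSCAN(eps=0.05, min_samples=10).fit(X)

labels = db.labels_                    # one label per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))
print("core instances:", len(db.core_sample_indices_))
```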

Since DBSCAN is designed to find clusters based on density, the shape of the cluster isn’t as important as it is for K-means. Here’s what the tulips data looks like when clustered using DBSCAN compared to K-means:

DBSCAN

K-means

Notice that the clusters aren’t as block-like as they were for K-means, and they correspond with the vertices of the pyramid much more closely. Also, different clustering arrangements result in different representative colors for each of the three clusters. 

Agglomerative clustering 

Another way to cluster data is by using a technique called agglomerative clustering. Agglomerative clustering works by first assigning every point to its own cluster, then progressively combining clusters based on intercluster distance. Here’s how it works.

Agglomerative clustering’s workflow:

1. Start with each point in its own cluster.
2. Merge the closest pair of points into a new cluster.
3. Repeat, always merging the closest pair. If one point in the next closest pair already belongs to a larger cluster, merge the unassigned point into that cluster.
4. If the two members of the closest pair each belong to a larger cluster, merge the two clusters.
5. Continue until every point belongs to a single cluster, or until a stopping condition (described below) is met.
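
One way to see this merge order concretely is SciPy's hierarchical clustering, which records every merge in a linkage matrix. The toy data below is an illustrative assumption, not the tulip data.

```python
# A minimal sketch of the merge process on a tiny toy dataset: SciPy's linkage
# matrix records, for each step, which two clusters were merged, the distance
# at which they were merged, and the size of the resulting cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))            # 8 toy points, each starting in its own cluster

Z = linkage(X, method="ward")          # 7 merges take 8 clusters down to 1
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"merge {step}: cluster {int(a)} + cluster {int(b)} "
          f"at distance {dist:.2f} -> new cluster of size {int(size)}")
```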

Agglomerative clustering requires that we specify either a desired number of clusters or a distance threshold, which is the linkage distance above which clusters will not be merged. If we do not specify a number of clusters, the distance threshold becomes an important parameter, because without any stopping criterion the model would converge into a single cluster every time. 

Linkage 

There are different ways to measure the distances that determine whether or not to merge the clusters. This is known as the linkage. Here are some of the most common.

Note that these linkage strategies are not exclusive to agglomerative clustering.

Single:
The minimum pairwise distance between clusters.

Complete:
The maximum pairwise distance between clusters.

Average:
The average of all pairwise distances between points in one cluster and points in the other. (Measuring the distance between cluster centroids is a separate strategy, usually called centroid linkage.)

Ward:
This is not a distance measurement in the usual sense. Instead, it merges the two clusters whose merging results in the smallest increase in total within-cluster variance (inertia). 
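
These definitions are easy to compute by hand. The sketch below works out single, complete, and average linkage for two small toy clusters (the points are illustrative assumptions).

```python
# A small sketch of the single, complete, and average linkage definitions,
# computed directly for two toy clusters A and B.
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

# All pairwise distances between a point in A and a point in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

print("single  :", pairwise.min())     # minimum pairwise distance
print("complete:", pairwise.max())     # maximum pairwise distance
print("average :", pairwise.mean())    # mean of all pairwise distances
```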

When does it stop? 

The agglomerative clustering algorithm will stop when one of the following conditions is met: 

  • We reach a specified number of clusters. 
  • We reach an intercluster distance threshold: clusters separated by more than this distance are considered too far apart and will not be merged.

Hyperparameters 

There are numerous hyperparameters available for agglomerative clustering in scikit-learn. These are some of the most important ones: 

  • n_clusters: The number of clusters we want in our final model. 
  • linkage: The linkage method to use to determine which clusters to merge. 
  • affinity: The metric used to calculate the distance between clusters (renamed to metric in newer versions of scikit-learn). Default = Euclidean distance. 
  • distance_threshold: The distance above which clusters will not be merged.
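
Here is a hedged usage sketch showing these hyperparameters in scikit-learn, continuing from the earlier pixel array. The parameter values are illustrative guesses; note again that newer scikit-learn versions use metric in place of affinity.

```python
# A hedged usage sketch of scikit-learn's AgglomerativeClustering, showing the
# two ways to stop merging: a fixed number of clusters or a distance threshold.
from sklearn.cluster import AgglomerativeClustering

# Aggressive subsample: plain agglomerative clustering needs roughly O(n^2) memory
X = pixels[::500] / 255.0

# Option 1: ask for a fixed number of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

# Option 2: let a distance threshold decide how many clusters survive
agg_thresh = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, linkage="ward"
).fit(X)

print("fixed:", agg.n_clusters_, "| threshold-based:", agg_thresh.n_clusters_)
```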

Agglomerative clustering can be very effective. It scales reasonably well (particularly when connectivity constraints are used), and it can detect clusters of various shapes. Here's how agglomerative clustering performs on the tulip data compared with DBSCAN and K-means: 

Agglomerative clustering

DBSCAN

K-means

Agglomerative clustering gives even more definition to the data clustered along the vertices of the pyramid, which most closely represents the clusters that appear to the eye. 

Other clustering algorithms 

There are many other ways to cluster data than what is covered here. Scikit-learn’s documentation provides a helpful reference that illustrates some of the strengths and weaknesses of each methodology by running them on a series of toy datasets.

A question to practice

We want to group the pixels in an image of the red and white flag of Denmark into clusters using their RGB values. Which of the following statements are true? 

Running a k-means model with k=1 would result in pixels being assigned to a centroid that represents the color pink. 

Correct! Because the image has two colors (red and white), a one-cluster model would have a single centroid positioned somewhere between the locations of the red pixels and the white pixels. This location would be an RGB value that encodes the color pink.

Running a k-means model with k=2 would result in an overall inertia of 0. 

Correct! The image has two colors: red and white, so there are only two unique RGB values, and each pixel must be one of them. Therefore, a two-cluster model would result in two centroids, each at the exact location of all of its assigned points, resulting in an inertia of zero.

Running a k-means model with k=2 would yield a lower overall inertia than a k-means model with k=1. 

Correct! As explained above, the two-cluster model has an inertia of zero, while the one-cluster model leaves every pixel at some distance from the single pink centroid, so its inertia is greater than zero.
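
For anyone who wants to verify this reasoning numerically, here is a small sketch using a toy stand-in for the flag: a handful of idealized pure-red and pure-white pixels. The actual flag's RGB values would differ, so this is purely illustrative.

```python
# A quick numerical check of the reasoning above on an idealized two-color image.
import numpy as np
from sklearn.cluster import KMeans

red, white = [255, 0, 0], [255, 255, 255]
flag_pixels = np.array([red] * 6 + [white] * 4, dtype=float)

k1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(flag_pixels)
k2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(flag_pixels)

print(k1.cluster_centers_)   # a single pinkish centroid between red and white
print(k1.inertia_)           # positive: every pixel is some distance from that centroid
print(k2.inertia_)           # 0.0: each centroid sits exactly on one of the two colors
```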


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate program.