K-means

Introduction to K-means

K-means is an unsupervised partitioning algorithm used to organize unlabeled data into groups, or clusters, giving us a logical scheme for making sense of the data. With K-means, each cluster is defined by a central point called a centroid. A centroid's position marks the center of its cluster and is calculated as the mathematical mean of all the points assigned to that cluster.

There are four steps to building a K-means model:

Step 1: Choose the number of centroids and place them in the data space. K represents the number of centroids in our model, which is how many clusters we’ll have. This is a decision that we make.

Step 2: Assign each data point to its nearest centroid. The nearest centroid is the one closest to the point in the data space, typically measured with Euclidean distance.

Step 3: Recalculate the centroid of each cluster. Again, each centroid's location is the mean of all the points currently assigned to its cluster, so every centroid moves to the center of its cluster. This happens on every iteration until the algorithm reaches convergence.

Step 4: Repeat steps 2 and 3 until the algorithm converges.
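
As a rough illustration, here is a minimal NumPy sketch of these four steps. The function and variable names are my own, and it skips edge cases such as empty clusters:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place k centroids by picking k distinct points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: recalculate each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 4: repeat until every centroid moves less than the tolerance
        if np.all(np.linalg.norm(new_centroids - centroids, axis=1) < tol):
            centroids = new_centroids
            break
        centroids = new_centroids

    return centroids, labels
```

Calling kmeans(X, k=3) on a two-column array of points would return the final centroid locations and each point's cluster label.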

Note that the centroids may still shift slightly on the final iteration, but if so, all movement is below a convergence tolerance that the modeler can set. The default in scikit-learn is 0.0001. If every centroid moves less than this tolerance between iterations, the model is declared converged.
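
For reference, this tolerance is exposed as the tol parameter of scikit-learn's KMeans. A small sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data for illustration

# tol=0.0001 is the default convergence tolerance
model = KMeans(n_clusters=3, tol=1e-4, n_init=10, random_state=42)
model.fit(X)

print(model.cluster_centers_)  # final centroid locations
print(model.n_iter_)           # number of iterations run before convergence
```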

If we had a lot more data, the centroids would settle increasingly close to the centers of their clusters. We might also find that some data points' cluster assignments change as the centroid locations move from one iteration to the next.

A note on K-means++

Something to be mindful of is that it's important to run the model with different starting positions for the centroids. This helps avoid poor clustering caused by local minima, that is, solutions where the clusters aren't separated in a way that reflects the structure of the data. Check the sample below.

The above clustering isn't wrong. It's still a valid solution of the model; it just doesn't make much sense in this context. After all, the goal is to find a clustering scheme that makes sense of our data.

In scikit-learn, the model is run multiple times with different initial centroid positions (controlled by the n_init parameter) to help avoid a solution that gets stuck in a local minimum, and each run places its initial centroids using a scheme called K-means++. K-means++ still initializes centroids randomly in the data, but it does so based on a probability calibration.

Basically, it randomly chooses one point within the data to be the first centroid, then selects the remaining centroids from the other data points pseudo-randomly. The probability that a point will be selected as the next centroid increases the farther it is from the centroids already chosen. This helps ensure that the initial centroids aren't placed very close together, which is when convergence to a local minimum is most likely to occur.
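
A rough sketch of that selection rule, where the probability of picking a point is proportional to its squared distance from the nearest centroid chosen so far (my own illustration, not scikit-learn's internal code):

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose the first centroid uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]

    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Points farther from the existing centroids get a higher selection probability
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])

    return np.array(centroids)
```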

Note: K-means++ is the default initialization when we instantiate K-means in scikit-learn. If we'd prefer to start with truly random centroids instead, we can set the "init" parameter to "random", but we would rarely want to do this.
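
In code, that choice is the init parameter when instantiating KMeans (the parameter values below are just for illustration):

```python
from sklearn.cluster import KMeans

# Default behavior: K-means++ initialization, repeated for n_init different seeds
km_default = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)

# Truly random initial centroids (valid, but rarely what we want)
km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=42)
```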

Clustering vs Partitioning 

Finally, note that even though K-means is a partitioning algorithm, data professionals typically talk about it as a clustering algorithm. The difference is that outlying points in clustering algorithms can exist outside of the clusters. However, for partitioning algorithms, all points must be assigned to a cluster. In other words, K-means does not allow unassigned outliers.

When there are no correct answers 

Often we won’t have any class labels to determine how “good” our model is. For instance, we may want to perform a market segmentation analysis on data. In this scenario, there is no “correct” answer. We cannot verify our model’s clustering against a baseline truth. It’s possible that we could make a case to use two, three, four, or more clusters. We might not have any idea how many to choose. 

K-means can suffer from the “curse of dimensionality,” whereby having too many dimensions in our data increases the distance between observations to such a degree that they are all too far away from each other to assign them to meaningful clusters.
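
A quick toy demonstration of this effect, assuming nothing more than random data: as the number of dimensions grows, pairwise distances get larger and increasingly similar to one another, leaving little contrast for a clustering algorithm to work with.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dims in (2, 10, 100, 1000):
    X = rng.random((300, dims))   # 300 random points in the unit hypercube
    d = pdist(X)                  # all pairwise Euclidean distances
    # In high dimensions, the nearest and farthest pairs end up nearly the same
    # distance apart, so "closest centroid" becomes a much weaker signal
    print(f"{dims:>4} dims: min={d.min():.2f}  max={d.max():.2f}  min/max={d.min() / d.max():.2f}")
```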


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.