This blog entry’s going to show you how to implement K-means clustering a bit more efficiently than the naive algorithm with the help of a little algebra and sparse vector manipulation. The current implementation of k-means clustering in LingPipe is woefully slow:

- LingPipe Javadoc: `cluster.KMeansClusterer`

We’re working on a medium-size clustering problem for a customer, with a few hundred thousand short-ish text messages being clustered. It was taking roughly *forever* (OK, several hours per epoch).

I knew it was slow when I wrote it. You can tell by the apologetic implementation note in the javadoc:

> **Implementation Note:** The current implementation is an inefficient brute-force approach that computes the distance between every element and every centroid on every iteration. These distances are computed between maps, which is itself not very efficient. In the future, we may substitute a more efficient version of k-means implemented to this same interface.
>
> More efficient implementations could use kd-trees to index the points, could cache scores of input feature vectors against clusters, and would only compute differential updates when elements change cluster. Threading could also be used to compute the distances.

All true, but there’s a simpler approach than kd-trees or multiple threads or caching that I just implemented and now the epochs are taking roughly no time at all (OK, about a minute).

K-means clustering is very simple. You start with a set of items and a feature extractor to convert them to feature vectors. These feature vectors are maps from strings to numbers. In the first step, items are assigned to clusters somehow, for instance randomly (though there are better ways to do this for some purposes). Then, on each iteration until a maximum number of iterations is reached or the assignments converge, the mean (a.k.a. centroid) of each cluster is computed by averaging all of its vectors dimensionwise, each item’s Euclidean distance to each centroid is computed, and each item is reassigned to the closest centroid at the end of the epoch.
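In code, one epoch of the naive algorithm looks roughly like the following sketch over dense vectors. All the names here are illustrative, not LingPipe’s actual API:

```java
import java.util.Arrays;

// Minimal sketch of one epoch of naive k-means over dense vectors.
// Names are illustrative, not LingPipe's API.
public class NaiveKMeans {

    // Recompute centroids from the current assignment, then reassign
    // each item to its closest centroid. Returns true if any assignment changed.
    static boolean epoch(double[][] items, int[] assignment, double[][] centroids) {
        int k = centroids.length;
        int dims = items[0].length;

        // 1. Centroids are dimensionwise means of their assigned items.
        int[] counts = new int[k];
        for (double[] c : centroids)
            Arrays.fill(c, 0.0);
        for (int n = 0; n < items.length; ++n) {
            counts[assignment[n]]++;
            for (int i = 0; i < dims; ++i)
                centroids[assignment[n]][i] += items[n][i];
        }
        for (int c = 0; c < k; ++c)
            if (counts[c] > 0)
                for (int i = 0; i < dims; ++i)
                    centroids[c][i] /= counts[c];

        // 2. Reassign each item to the nearest centroid (squared distance
        //    suffices, since sqrt is monotonic).
        boolean changed = false;
        for (int n = 0; n < items.length; ++n) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; ++c) {
                double d = 0.0;
                for (int i = 0; i < dims; ++i) {
                    double diff = items[n][i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (best != assignment[n]) { assignment[n] = best; changed = true; }
        }
        return changed;
    }
}
```

The brute-force cost is visible in the nested loops: every item against every centroid against every dimension, every epoch.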

Typically, when I do heavy lifting with features, I extract the features once and then re-use the vectors. I did that with k-means, but didn’t consider the downstream computations closely enough. When you have a 100,000-feature centroid vector and a 100-feature item vector, it’s rather inefficient to compute Euclidean distance, which is defined as:

$$\mathrm{dist}(c, elt) = \sqrt{\sum_i \bigl(c[i] - elt[i]\bigr)^2}$$

Of course, we don’t care about distance per se, so we can save the square root operation and work with squared distances:

$$\mathrm{dist}^2(c, elt) = \sum_i \bigl(c[i] - elt[i]\bigr)^2$$

There are a couple of problems. Hash maps in Java with numeric values are wasteful in space, and in the time it takes to box and unbox primitives for arithmetic. They’re also slow to iterate (even linked hash maps). That’s an obvious bottleneck. But the real bottleneck is iterating at all. Remember, the fastest code is the code that doesn’t execute at all (as I noted in a comment on point 6 of John Langford’s excellent blog entry on machine learning code optimization).

### The Algebra

The key to the optimization is noting that most of the `elt[i]` values are zero. So what we do is compute the squared length of each centroid and store it at the beginning of each epoch:

$$\mathrm{len}^2(c) = \sum_i c[i]^2$$

The second part of the following equation is key, as it shows how the length of the centroid is related to the distance contributions of dimensions with zero values. Here’s the formula:

$$\mathrm{dist}^2(c, elt) = \sum_i \bigl(c[i] - elt[i]\bigr)^2 = \sum_i \bigl(c[i]^2 - 2\,c[i]\,elt[i] + elt[i]^2\bigr)$$

Now note that when `elt[i] = 0`, the terms $-2\,c[i]\,elt[i] + elt[i]^2$ inside the sum vanish, leaving just $c[i]^2$. So we keep going with:

$$\mathrm{dist}^2(c, elt) = \mathrm{len}^2(c) + \sum_{i \,:\, elt[i] \neq 0} \bigl(elt[i]^2 - 2\,c[i]\,elt[i]\bigr)$$

This simple algebra means that after computing the length of each centroid, each distance computation only requires a number of operations proportional to the number of non-zero elements in the item’s feature vector.

### Sparse Vectors

Combining this with more reasonable sparse vector representations (parallel `int[]` index and `double[]` value arrays), and reasonable dense vector representations (i.e., `double[]`), k-means is now hundreds of times faster for the kinds of problems we care about: large sets of sparse feature vectors. I know I can squeeze out at least another factor of 2 or 3 on my dual quad-core machine by multi-threading; k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent.

### Memory Locality

The next obvious speedup would be to improve memory locality (Langford’s tip number 3, and the heart of a lot of the speedups in Jon Bentley’s wonderful book *Programming Pearls*, the pearls in which were extracted from his *Dr. Dobbs* columns).

As it is, I iterate over the items, then for each item I iterate over the cluster centroids computing distances, which involves iterating over the non-zero values in the item’s feature vector. It’d be much better for memory locality to invert this: iterate over the non-zero values in the item’s feature vector and, for each one, pull back an array indexed by cluster. That way, the number of non-contiguous memory lookups would equal the number of non-zero features, and everything else would be local.
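The inverted loop might look like the following sketch, where centroid values are stored transposed, indexed first by dimension and then by cluster (again, the names are made up for illustration):

```java
// Sketch of the locality-friendly inverted loop: centroid values are stored
// transposed, indexed by [dimension][cluster]. Names are illustrative.
public class TransposedDistance {

    // Computes dist^2 from one sparse item to every cluster in a single pass
    // over the item's non-zero features:
    //   dist2[c] = len^2(c) + sum over non-zero elt[i] of
    //              (elt[i]^2 - 2 * centroid_c[i] * elt[i])
    static double[] squaredDistances(double[][] valuesByDimension, // [dimension][cluster]
                                     double[] centroidLen2s,       // [cluster]
                                     int[] indexes, double[] values) {
        double[] dist2 = centroidLen2s.clone();
        for (int n = 0; n < indexes.length; ++n) {
            double v = values[n];
            // One potentially non-contiguous lookup per non-zero feature...
            double[] column = valuesByDimension[indexes[n]];
            // ...then a tight, cache-friendly scan over the clusters.
            for (int c = 0; c < column.length; ++c)
                dist2[c] += v * v - 2.0 * column[c] * v;
        }
        return dist2;
    }
}
```
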

March 12, 2009 at 10:07 pm |

Since it’s hard to write formulas in comments, I’m gonna write the idea in MATLAB.

In each iteration, you want to find the samples that are closest to centers.

Assume you have two matrices C and S. Each column of C is a center vector and each column of S is a sample vector. Then the distance matrix is:

```
D = sum(C.^2,1)' * ones(1,n) - 2*C'*S + ones(k,1) * sum(S.^2,1);  % (1)
```

D is a k x n matrix in which each element is the squared distance between a sample and a center.

Since you also want to find the minimum distances, the first term on the right-hand side of (1) can be dropped, and the third term (the lengths of all the samples) can be precomputed. Then you only need to compute C'*S in each iteration. This will be efficient even for dense matrices. A naive implementation costs O(n*k*d) per iteration; matrix multiplication can certainly do better than that, not to mention if the vectors are sparse.

March 13, 2009 at 12:15 am |

The third term of (1) can be dropped, not the first one. Sorry for that error, but the idea is the same.

March 13, 2009 at 10:35 am |

Don’t forget the triangle inequality speed-up: if you pre-compute distances between centroids d(c,c') before each iteration, then you don’t have to compute many distances between examples and centroids d(x,c) at all, because d(x,c) >= d(x,c') - d(c,c')
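In its simplest form (this is just the basic bound, not Elkan’s full algorithm), the pruning might look like this sketch: once we have the distance to the best centroid found so far, any centroid whose lower bound already exceeds it can be skipped without touching its vector.

```java
// Sketch of triangle-inequality pruning in the assignment step;
// simplest form of the bound, not Elkan's full algorithm.
public class TrianglePruning {

    static double distance(double[] a, double[] b) {
        double d2 = 0.0;
        for (int i = 0; i < a.length; ++i) {
            double diff = a[i] - b[i];
            d2 += diff * diff;
        }
        return Math.sqrt(d2);
    }

    // Assign x to its nearest centroid. centroidDists[i][j] = d(centroid i,
    // centroid j), precomputed once per iteration. By the triangle inequality,
    //   d(x, c) >= d(best, c) - d(x, best),
    // so if that lower bound is already >= d(x, best), centroid c can't win.
    static int assign(double[] x, double[][] centroids, double[][] centroidDists) {
        int best = 0;
        double bestDist = distance(x, centroids[0]);
        for (int c = 1; c < centroids.length; ++c) {
            if (centroidDists[best][c] - bestDist >= bestDist)
                continue; // pruned: can't be closer than the current best
            double d = distance(x, centroids[c]);
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```
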

March 13, 2009 at 1:53 pm |

@sth4nth: Cool! I should’ve kept going with the algebra. It’s obvious the constants don’t matter, but I’m always surprised when you can replace distances with products. The key to the approach I outlined in the entry was using sparse multiplications, because the S vectors (in your notation) are sparse at around 1% or less of the C vectors in the applications we’re considering.

@dz: Thanks! I keep forgetting that Euclidean distance is a real metric. I’m so used to having to generalize to pseudo-metrics where the triangle inequality fails.

@myself: The other thing I really need to do now is cache distances vs. centroids when the centroids don’t change. After a few dozen epochs with K=1000, lots of the clusters don’t change at all between epochs, so that whole rows of sth4nth’s C' * S matrix are constant.

December 9, 2009 at 1:17 pm |

Check out:

C. Elkan. Using the Triangle Inequality to Accelerate k-Means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 147–153.

December 9, 2009 at 2:57 pm |

Thanks for the link. I know that paper and anyone doing k-means should read it. Even if you don’t use the algorithm, it provides great insight into the geometry of the problem.

The main reason I didn’t use it was the extra memory overhead.

I would have sworn I’d discussed exactly Elkan’s algorithm on this blog entry, but can’t find any mention of it here or elsewhere.