Jan 2, 2012

BSP k-means Clustering Benchmark

Hey all,

In my last post I wrote about k-means clustering with Apache Hama and BSP.
Now we have a detailed benchmark of my algorithm.
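
For anyone who missed that post, here is a rough sketch of how one k-means iteration maps onto BSP-style supersteps. It is not the Hama implementation: it is plain Python that simulates the peers in a single process, and every name in it (`local_step`, `superstep`, `kmeans`, the partition layout) is made up for illustration. Each "peer" assigns its local points to the nearest center and builds partial sums, the partial sums play the role of the exchanged messages, and after the "barrier" the new centers are computed from them.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centers: List[Point]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def local_step(points: List[Point], centers: List[Point]):
    """Compute phase of one simulated peer: per-center partial sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for pt in points:
        i = nearest(pt, centers)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += pt[d]
    return sums, counts


def superstep(partitions: List[List[Point]], centers: List[Point]) -> List[Point]:
    """One simulated superstep: local compute, exchange, then center update."""
    # The list of partial results stands in for the messages peers would send.
    partials = [local_step(part, centers) for part in partitions]
    new_centers = []
    for i, old in enumerate(centers):
        total = sum(counts[i] for _, counts in partials)
        if total == 0:
            new_centers.append(old)  # leave an empty cluster where it is
            continue
        new_centers.append(tuple(
            sum(sums[i][d] for sums, _ in partials) / total
            for d in range(len(old))
        ))
    return new_centers


def kmeans(partitions, centers, max_iterations=100):
    """Repeat supersteps until the centers stop moving or the limit is hit."""
    for _ in range(max_iterations):
        updated = superstep(partitions, centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

Calling `kmeans` with a list of point partitions (one list per simulated peer) and a list of initial centers returns the final centers. Only the small per-center sums cross the "network" here, never the points themselves, which is the part that makes this style of iteration cheap to synchronize.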

The current state of the benchmark is on the Hama wiki: http://wiki.apache.org/hama/Benchmarks

Because that page will change during the lifetime of Apache Hama, I took a screenshot of the very first benchmark. Maybe it will help to document performance improvements later ;)

Have a look here:

Is it faster than MapReduce?
Yes! I recently read in the new "Taming Text" by Grant S. Ingersoll that the same workload takes about the same number on MapReduce, only measured in minutes instead of seconds.

However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.

Future Work
Besides the benchmark against MapReduce and Mahout, I want to show the guys from Mahout that it is reasonable to use BSP as an alternative to MapReduce. I hope that within the next year they will use Apache Hama and BSP as an alternative to their MapReduce implementations for various tasks.

Thanks to Edward who made this possible!


  1. Great job! BTW, do you have any plans to contribute your work to the Mahout or Hama examples package?

  2. Hi, may I suggest something? I looked at your code, and I think Hama could have some kind of cache (one that can swap to disk). I am worried that the JVM's memory will blow up when we compute on large data :-)

  3. Hi,

    there is a distinction between heap caching (which only works for really small datasets) and OS caching, which usually takes place anyway.

    For larger datasets, the algorithm will normally benefit from OS caching.