Jan 2, 2012

BSP k-means Clustering Benchmark

Hey all,

In my last post I wrote about k-means clustering with Apache Hama and BSP.
Now we have a detailed benchmark of my algorithm.

The current state of the benchmark is on the Apache Hama wiki: http://wiki.apache.org/hama/Benchmarks

Because that page will change over the lifetime of Apache Hama, I took a screenshot of the very first benchmark, maybe to document performance improvements later ;)

Have a look here:

[Screenshot of the first benchmark results]

Is it faster than MapReduce?
Yes! I recently read in the new "Taming Text" by Grant S. Ingersoll that the same workload takes about the same time with MapReduce, except measured in minutes rather than seconds.

However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.

Future Work
Besides the benchmark against MapReduce and Mahout, I want to show the Mahout folks that BSP is a reasonable alternative to MapReduce. I hope that within the next year they will use Apache Hama and BSP as an alternative to MapReduce implementations for various tasks.

Thanks to Edward who made this possible!

4 comments:

  1. Great job! BTW, do you have any plans to contribute your work to the Mahout or Hama examples package?

  2. Hi, may I suggest something? I looked at your code, and I think Hama could have some kind of cache (one that could swap to disk). I am worried that the JVM's memory would blow up when we compute on large data :-)

  3. Hi,

    there is a distinction between heap caching (which is only for really small datasets) and OS caching, which usually takes place.

    Normally, for larger datasets the algorithm will benefit from OS caching.
