Thomas Jungblut's Blog: BSP k-means Clustering Benchmark

Jan 2, 2012

BSP k-means Clustering Benchmark

Hey all,

In my last post I wrote about k-means clustering with Apache Hama and BSP.
Now we have a detailed benchmark of my algorithm.

The current state of the benchmark is kept on the Hama wiki: http://wiki.apache.org/hama/Benchmarks

Because that page will change over the lifetime of Apache Hama, I took a screenshot of the very first benchmark, maybe to document performance improvements later ;)

[Screenshot: results of the first benchmark]

Is it faster than MapReduce?
Yes! I recently read in the new book "Taming Text" by Grant S. Ingersoll that the same amount of workload takes about the same time with MapReduce, only measured in minutes instead of seconds.

However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.

Future Work
Besides the benchmark against MapReduce and Mahout, I want to show the guys from Mahout that it is reasonable to use BSP as an alternative to MapReduce. I look forward to them using Apache Hama and BSP within the next year as an alternative to MapReduce implementations for various tasks.

Thanks to Edward who made this possible!

4 comments:

Edward J. Yoon, January 3, 2012 at 1:35 AM
Great job! By the way, do you have any plans to contribute your work to the Mahout or Hama examples package?
Thomas Jungblut, January 3, 2012 at 9:19 AM
Yes, both.
Unknown, September 15, 2012 at 10:04 AM
Hi, may I suggest something? I looked at your code, and I think Hama could have some kind of cache (one that can swap to disk). I am worried that the JVM's memory would blow up when we compute on large data :-)
Thomas Jungblut, September 15, 2012 at 10:08 AM
Hi,

There is a distinction between heap caching (which is only suitable for really small datasets) and OS caching, which usually takes place anyway.

Normally, the algorithm will benefit from OS caching when it runs on larger datasets.