it has been quite a long time since my last blog post. Thanks to my work, to keep me busy all day and don't let me research on cool new things. However, over the few holidays and weekends over the last weeks I came across a very interesting algorithm called DBSCAN.

It is abbreviated for "

*density-based spatial clustering of applications with noise*", it is a unsupervised clustering algorithm just like k-means, besides that it is much smarter in many aspects.

Another objective I'd like to solve is the parallelization of this algorithm. I've seen just some ancient papers and what buffels me is that I've seen no implementation in Mahout (for MapReduce) or other distributed frameworks.

As you may know, I'm working for Apache Hama. It is a framework for distributed computing with the BSP (bulk synchronous parallel) model. I always searching for new algorithms that could fit into the model of BSP computing, e.G. graph algorithms of all sorts, strongly iterative algorithms, real-time algorithms.

And I think that DBSCAN also fits into the BSP model, I tell you why a bit later in this post.

First off, just a little introduction of the DBSCAN algorithm itself...

**The algorithm**

The algorithm is very easy to understand. Actually you have a bunch of points (or vectors in higher dimensionalities) as input, then you have to parameters and some fancy output.

The two parameters are called

*"epsilon"*and

*"minpoints",*epsilon is the minimum distance between two vectors to connect two points

**strongly**and minpoints is the number of points that are at least needed to build a cluster out of strongly connected vectors.

Now you are going through the graph, point by point, marking visited vectors and adding points to a cluster while they are not violating the rules defined by epsilon and minpoints.

You can read on wikipedia about how the sequential version works in detail, however I am going to propose a much more easier to understand version of the algorithm.

**Distributed algorithm steps**

Instead of defining a big distributed algorithm that translates the sequential version into some distributed programming model, I have assembled three main steps to get the same result as the sequential version.

However each of these steps are strongly parallelizable in every major programming model (at least I know how it works in MapReduce, BSP and MPI).

Here are the three steps:

- compute a distance matrix between the vectors with a given distance measurement
- trivial step to parallelize, can also be merged with the next point
- extract adjacent points via the epsilon threshold and the minpoints restriction
- This step creates an adjacency list/matrix representing a graph
- Noise is filtered at this step of the algorithm
- run a connected component algorithm on the resulting graph of the previous step

These three simple steps will give you the same result as a DBSCAN. Normally you can merge step 1 with step two, you can simply extract the adjacents points while computing the distances.

In the end, you will receive n-connected components, every of them will represent a cluster.

The delta to the points of your original input would be the noise cluster.

**Note**that the initial step is O(n²) which is obviously pretty bad and not scalable. So think about techniques like Similarity Hashing to speed this step up.
Pretty easy right? I think it is even more easier than the pseudocode on wikipedia.

Of course I put up a sample version (although sequential) on my github:

There is a nice plot I received when running it:

To make the noise more easy to spot, I have made horrible yellow circles arround them with Paint, please forgive me ;)

**Stay tuned for an implementation with Apache Hama!**

**Update:**

So far I haven't found the time to implement this whole system with Apache Hama. However, if you want to practically use this here are some advices:

- For the distance matrix to compute, better use a heuristic to find close vectors
- Mahout has a MinHashing implementation of such a clustering
- Once you obtained "mini" clusters, you can compute more expensive distance measurements and extract your graph (step two in the above list)

Great work! Could you please post (if available) a matrix multiplication process code and/or a conjugate gradient method code, implemented @ Apache Hama?

ReplyDeleteYou have convinced me for the superiority of this FW over Hadoop, concerning iterative large-scale computations, but I would like to learn more through actual code.Thanks in advance!

Thanks. I have written a matrix/matrix multiplication, however I don't think it is scalable.

ReplyDeletehttps://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/MatrixMultiplicationBSP.java

It just uses two supersteps and too much memory, but I haven't too much time to take care about making it scalable.

In case of conjugate gradient, we are currently trying to develop one, (http://markmail.org/message/bdcpxo5wew7upafi) but no concrete plans were made yet.

If you have some interested in helping us with one of both drop by the dev mailing list and will can discuss and sort some details out.

This comment has been removed by the author.

ReplyDeleteI'm pretty sure these are the same results as plain DBSCAN, I checked against R's fpc package and it was the same result.

ReplyDeleteYou don't have to store the immediate distance matrix between step 1 and 2, you can directly extract the adjacency matrix.

This still needs you to compute O(n^2) distances, but will save you the O(n^2) data.

Do you have a paper link for "parallel DBSCAN" for me?

Thanks for your comment!

This comment has been removed by the author.

ReplyDeleteIsn't computing the distance matrix scalable in map-reduce paradigm? I mean without using LSH techniques, can't we compute the full matrix in scalable fashion using map-reduce?

ReplyDelete@Unknown no there is not a way to make this happen. You can try, but it will be certainly very slow.

ReplyDeleteI found this after a google search:

ReplyDeletehttp://stackoverflow.com/questions/3380135/mapreduce-distance-calculation-in-hadoop

Please let me know if above method should work...

In our project, we have a set of ~1 million xml documents.

We need to do unsupervised clustering as we don't know the number of clusters before hand...So looking for a parallel clustering method.

BTW, excellent coding style!! Helped me a lot in clearing things up...Keep up the good work...

Sure try it out. I would first bring the data into a binary sequence file and set this file's replication to a higher number to avoid long distance reads.

ReplyDelete