Thomas Jungblut's Blog: 2012

Oct 6, 2012

Java XOR Swap Performance

Hey,
just quick post, because I was working to polish my min heap and checked about different swap techniques.

From my experience I know there are two different versions to swap two integers in an array.

XOR swap algorithm
Swap using temporary variable

Since swapping is used pretty much everywhere, I decided to micro benchmark these against each other using Caliper.

I've seen many people using the XOR algorithm lately, I don't know if they know that this is inefficient. Many people seem to do not care about this, because bitshifting makes them kinda look cool probably?

However, here is my code so we can sort out the performance of both pretty easy:

public class SwapBenchmark extends SimpleBenchmark {

  @Param({ "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  private int size;

  @Param
  SwapType type;

  public enum SwapType {
    XOR, TMP
  };

  int[] array;

  @Override
  protected void setUp() throws Exception {
    array = new int[size];
    Random r = new Random();
    for (int i = 0; i < size; i++) {
      array[i] = r.nextInt();
    }
  }

  public void timeSwap(int reps) {
    for (int rep = 0; rep < reps; rep++) {
      int sum = 0;
      for (int i = 0; i < size; i++) {
        final int x = i;
        final int y = size - i - 1;
        if (type == SwapType.XOR) {
          array[x] ^= array[y];
          array[x] ^= (array[y] ^= array[x]);
          sum += i;
        } else {
          int tmpIndex = array[x];
          array[x] = array[y];
          array[y] = tmpIndex;
          sum += i;
        }
      }
      System.out.println(sum);
    }

  }

  public static void main(String[] args) {
    Runner.main(SwapBenchmark.class, args);
  }
}

So basically I'm swapping n-times in an n-size array, where n ranges from ten to 100,000,000 which is really big.

The result isn't very exciting, I used my Nehalem i7 with 3,3GHZ and latest Java7u7:

      size type        us linear runtime
       10  XOR      2,90 =
       10  TMP      2,84 =
      100  XOR      3,06 =
      100  TMP      2,85 =
     1000  XOR      3,92 =
     1000  TMP      3,52 =
    10000  XOR     20,21 =
    10000  TMP     14,22 =
   100000  XOR    183,33 =
   100000  TMP    118,57 =
  1000000  XOR   1822,98 =
  1000000  TMP   1192,19 =
 10000000  XOR  19401,65 ==
 10000000  TMP  13266,78 ==
100000000  XOR 194173,73 ==============================
100000000  TMP 134364,67 ====================

The TMP swap is much more efficient in every case. Why is this?
Wikipedia states the following:

On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via instruction pipelines. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary variable swapping on the target architecture.

There is actually nothing more to add. So please don't do any pre-mature optimization!
Even if it looks cool to do a bit-shifting trick ;)

Sep 1, 2012

Particle Swarm Optimization

Hey there,

I recently started to attend a artificial intelligence lecture (a real course, not online) and we have gotten a small intro to NetLogo yesterday. Although I'm not a real fan of it (mainly because I was tortured with LOGO as a kid, you don't know how painful it is if you already know C++), I liked the idea of plotting agents and scripting all kinds of logic in a compressed way.

So I used to came across an example in their modules library called "Particle Swarm Optimization" (PSO), which was really amazing. I can give you a small picture of it:

Particle Swarm Optimization plot Source

You can see a lot of colored particles (or agents) and a terrain that is greyscaled according to its height. Your job is to find a pretty good minimum (whitened areas) in this terrain via swarm intelligence.

A short summary is, that if you have a function and want to minimize the cost by a given parameter set "Theta", you can use the PSO algorithm to find a good minimum of your function. You can compare this for example with Gradient Descent or Conjugate Gradient, but there is a subtle difference between them, for PSO no differentiable function is needed.

Intuition

Of course I will provide you with the basic intuition of this algorithm. So I told that there is a certain number of agents in this algorithm. They are initialized uniformly distributed in our search space, ideally you want to fill the map to get a good spread of your agents so your chance is higher to find a global minimum.

For a fixed number of iterations, you are going to change the position of each particle randomized with a weight on three objectives which are parameters of this meta-optimizer.

Alpha (in literature also called φ_p/ Phi_personal) is the weight of an agent following its personal memory (based on prior found minimas)
Beta (in literature also called φ_g/ Phi_global) is the weight of an agent following the global memory of the swarm (based on the currently best minimum from the swarm)
Phi (in literature also called ω / Omega) is the inertia/velocity of an agent.

From the parameters you can tell that each particle has its own memory (the best personal found solution), also there is a swarm memory, which basically is just a record of the currently best found solution from all particles.

Math

To understand the weighted formula of the movement of the particles, we have to have a look at the formula:

Formula to calculate the new position of a particle, based on the current position "v"

You can see, to calculate the movement of a particle, you have a weighted radomized linear combination of the above parameters. Non-mathematically interpreted, the agent is on the move and it is attracted to either its personal best solution or the global best solution. By setting the parameters you can fine tune the behaviour of each particle in this regard.

Algorithm

The algorithm is quite easy, because there is some really good pseudo code on Wikipedia.
However here is my small summary of it:

Initialize...

the particle positions to some random positions in our search space
the memorys of the agents and the swarm memory

In each iteration until we haven't hit our maximum iterations do...

loop over all particles
calculate the new position of a particle
check the cost of this position, if its smaller than the particles memory then update, do the same update with the global memory if its smaller.

Profit!

This is a nice and easy algorithm. To further visualize it, I have found a very good youtube video about it, implemented with MatLab: http://www.youtube.com/watch?v=bbbUwyMr1W8

There you can see the movement of the particles very well for a number of functions.

In Java

Of course I have done this in Java with my math library. So let's dive into my normal testcase of the function

Which is a good testcase, because it is convex so it has a global minimum which can be good tested.

A nice looking plot of it:

Convex function f(x,y) = x^2 + y^2

Let's sneak into the Java code:

    DoubleVector start = new DenseDoubleVector(new double[] { 2, 5 });

    // our function is f(x,y) = x^2+y^2
    CostFunction inlineFunction = new CostFunction() {
      @Override
      public Tuple<Double, DoubleVector> evaluateCost(DoubleVector input) {

        double cost = Math.pow(input.get(0), 2) + Math.pow(input.get(1), 2);
        return new Tuple<Double, DoubleVector>(cost, null);
      }
    };

    DoubleVector theta = ParticleSwarmOptimization.minimizeFunction(
        inlineFunction, start, 1000, 0.1, 0.2, 0.4, 100, false);
    // retrieve the optimal parameters from theta

We run 1000 particles for 100 iterations, with a weight of 0.1 to the personal best solution, 0.2 to the swarm solution and 0.4 for its own exploration. Basically this was the usage of the algorithm, you can see that we don't need a gradient in this computation (it is nulled). Theta in this case should contain a value close to zero.

Of course you can find the source code here:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/ParticleSwarmOptimization.java

And the testcase above here:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/math/minimize/ParticleSwarmOptimizationTest.java

And BTW, I just trained a XOR neural network with a single hidden layer with it. This is a very very cool algorithm ;)
A small parameterization tip is to set alpha to 2.8 so it does not fall into local minimas too fast (beta was 0.2 and phi 0.4).
You can have a look how it works in my testcase of the MultiLayerPerceptron:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/classification/nn/MultiLayerPerceptronTest.java#L40

Aug 4, 2012

Creating a semantic graph

This is a small follow-up post to the similarity aggregation (short SEISA, set expansion by iterative similarity aggregation) post before, it is just a small idea about creating a graph from data collected by this algorithm. So don't expect the master plan how to create a semantic graph from your data, it is just some brainstorming post.

I was very much into semantic relations in the last few months and I think the best way to express them is a graph.

Semantic graph arround Canon

You may be reminded of the Knowledge Graph by Google, yes it actually was inspired by it. However it is much more computationally heavy to use it (and even build it) than you might think of it first.

In any case, an algorithm that builds this can be seeded from the SEISA algorithm. Think of a generic starting point that starts in wikipedia and reads the facts about Carl Friedrich Gauss, keywords like "mathematician" will yield to other mathematicians very fast.

Google is (MAYBE) mainly using the semantic annotations to create a relation between the nodes, which is mainly depth first search (because you get a lot of information through Gauß). Whereas the SEISA information could yield to more breadth results like other mathematicians in his century like Riemann or Poincaré.

Creating this semantic graph for (at least) products, brands and other product related features will be my personal ambition for the next few years. Of course Google aims at the most generic way, they have the computational resources to do it for all the web.

Another problem they have is actually how to cut off information to be valuable to the user. Not everybody searching for "Gauß" is interested in his whole life story or other mathematicians, but maybe more into the algorithm for solving equations.

Also understanding queries and parsing the graph accordingly is another interesting new research field.
For example in the above graph arround Canon, what if a user wants to know about other brands LIKE Canon (SQL like for example: SELECT brands LIKE "Canon"). In this case, an engine would traverse the graph breadth first search starting by Canon to find nodes with the attribute "Brand". It would find their brand PIXMA and EOS. Distance is also a very important topic in this relation, how far are the relevant nodes away from their starting point?

Frameworks building such a graph is currently a very important aspect. There are a few helpful ones:

Graph databases, e.G. Titan https://github.com/thinkaurelius/titan
Graph only focused processing engines, e.G. Giraph http://giraph.apache.org/
Of course Hama, to use sophisticated machine learning or mathematical algorithms that can't be expressed as a graph http://hama.apache.org/
Good old Hadoop MapReduce is of course a help for various data crunchings as well.

Beeing part of that, is one of the things I most enjoy.

We will see how far Google integrates the graph into their search, it will be a driving factor if they manage to query a large scale graph successfully and within milliseconds.

Set Expansion by Iterative Similarity Aggregation

Hey all,

been a very long time since my last post. I was very busy with Apache Hama, we have graduated from the Apache incubator and call us now a top level project and of course made another release 0.5.0. I think this was just the beginning of a rising open source project and I am really excited to be a part of it.
However, responsibility comes with spending a lot of time. Time is always sparse, especially if you are young and have a lot of stuff to do.

So blog posts like this will be very sparse in the future, but today it should not matter and I will tell you about a semi-supervised learning algorithm with the very phonetic name called SEISA.
SEISA itself is a acronym for "Set Expansion by Iterative Similarity Aggregation", not an invention by me rather than a wrap-up and implementation of the paper I found here by Yeye He and Dong Xin.

What is Set Expansion?

Set expansion is used to find similar words to an a priori given list of items in a given "artificial environment". For example you want to find additional digitial camera brands to a given seed set like {Canon, Nikon} and the algorithm should provide you with additional brands like Olympus, Kodak, Sony and/or Leica.
"Artificial environment" is actually a generic term for a context that you have to provide. In the referenced paper it is majorly the internet (crawled pages) and search query logs.

In a real world scenario, set expansion might be used to create a semantic database of related items, improve relevance of search querys, leverage the performance of other named entity recognition algorithms, or maybe also for SEO things (Dear SEOs, I won't help you on that, even if you pay me a lot of $$$).

Data Model

The main idea behind the algorithm is the data modelling. How do we represent this "artificial environment" (will call it context from now on) I talked about above?
The idea is like in any sophisticated algorithm: as a graph! Actually as a so called bipartite graph.

Example of a bipartite graph from list data

On the left side of this bipartite graph are the so called "candidate terms", the seedset we provide should be a subset of them. The nodes on the right are called the context nodes, they are representing the link between the candidate terms. In the case of the paper, they have crawled the internet for list items and created an edge to a candidate term if it exists in a given list. The neigbours of a context nodes, are the actual list items in this case. In the case of the web, the picture should also contain words that are not digital camera brands, also think about this simple rule: "more data wins". As in every normal graph, you can also weight the edges, for example if you trust the list data from wikipedia more than of craigslist, then you should assign a higher weight the edges between the context nodes and their candidate term nodes.

I have implemented it by a String array and a DoubleMatrix, where the string array is representing the term nodes and the matrix (NxM, where n is the number of entity tokens and m is the number of context nodes) is representing the weight of the edges. In my experience this matrix is always very sparse, even on very small datasets.

Iterative Similarity Aggregation (the algorithm)

Time to describe the algorithm, now we know how we represented the data.
It is rather simply to explain and I don't know why they complicated it so much in their paper.

There are three main methods, first the relevance score computation, the method that filters out newly top ranked items and at last the ranking function.

The relevance score computation is working like this:

Loop over all the terms in your matrix while looping over all terms in your seedset
Get the column vector of your seed element and the column vector of the candidate term
Measure the similarity between both and sum this similarity over all seedset terms, in the paper it is normalized by the length of the seedset, so it is actually the average similarity

What you now get is a vector (length is the number of term nodes you have), and in each index is the average relevance score to the seedset tokens.

Of course you want to pick the highest relevance nodes of this vector, that's where the static thresholding comes in. You pass the algorithm a similarity threshold, this threshold affects how fast the algorithm converges to a certain solution and how much false positives it will have in the end. Of course a higher threshold will yield to less false positives and faster convergence, while the exact opposite will yield to the other extremum. The truth (as always) lies anywhere between, so it is up to you to choose this correctly.

Since the algorithm always converges quite fast, you can experiment with other thresholds and see how it affects your output. Sure we are going to filter by the threshold and get a number of most relevant new candidate nodes.

Now we can enter the main loop and compute the relevance score again for the newly found relevant candidates (this is now called the similarity score). Based on this we are going to rank these two scores with some simple vecor multiplications:

So where does this alpha come from? That is what the paper names as "weighting factor", as you can see it is used to weight the scores. It is mainly used for the convergence of the algorithm, they say, halfing the scores (alpha = 0.5) yields to the best result.

Once we have a ranked scores, we can now filter again by the similarity threshold and get our new candidate terms for the next iteration.

In each iteration we do:

Compute similarity score by newly found candidate terms (in the beginning this is the seed set)
Rank the calculated similarity score against the newly found candidate terms
Apply the static threshold and get new candidate terms

Until it converges: by means the relevant tokens does not change their order anymore- they are equal. Note that you can also equalize on their ranks, I haven't tried it yet but it should converge also quite fast.

At the end you can now optain your expanded set by getting the candidate terms from the last iteration.

Results

Since I'm working in the e-commerce industry, I wanted to try it out to find similar camera brands like in the paper. It works, but it has very noisy result, partly because the algorithm is just iterative and not parallelized thus I can't work on huge datasets and partly because I had no real clean dataset.

But it really worked, I seeded "Canon" and "Olympus" and got all the other camera brands, sadly a few lenses as well.

On colors there are some similar behaviours, I seeded "blue", "red" and "yellow". Funnyly the most relevant token was "X" and "XS", by means that especially t-shirt products have the color in their name. It became very noise afterwards, found Ronaldo jerseys and other fancy clothes, but a bunch of new colors also.

The paper mentions a really nice trick for it: using the internet!

They have collected millions of list items from websites and let the algorithm run, in my opinion in every machine learning algorithm more data wins. The noise on small dataset will be vanished by the sheer size of the data. So I can very much think that they got really good result.

Parallelization strategy

The easiest way to parallelize this is using BSP and Apache Hama (in my opinion).

Let me tell you why:

Algorithm is strongly iterative
Needs communication between a master and the slaves

Here is my strategy to implement this in BSP:

The bipartite graph is split among the tasks, partitioned by the term nodes
A slave only partially computes the similarity between the seed tokens and the candidate terms
A master will be used to put all the similarities back into a single vector, rank it and distribute the newly expanded set again as well as checking convergence.

This is pretty much work, so I guess the follow-up post will take a few weeks. But I promise, I will code this algorithm!

So thank you very much for reading, you can find my sequential code on github, as well as the testcase:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/ner/IterativeSimilarityAggregation.java

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/ner/IterativeSimilarityAggregationTest.java

May 6, 2012

Distributed DBSCAN (Intuition)

Hey all,

it has been quite a long time since my last blog post. Thanks to my work, to keep me busy all day and don't let me research on cool new things. However, over the few holidays and weekends over the last weeks I came across a very interesting algorithm called DBSCAN.
It is abbreviated for "density-based spatial clustering of applications with noise", it is a unsupervised clustering algorithm just like k-means, besides that it is much smarter in many aspects.
Another objective I'd like to solve is the parallelization of this algorithm. I've seen just some ancient papers and what buffels me is that I've seen no implementation in Mahout (for MapReduce) or other distributed frameworks.

As you may know, I'm working for Apache Hama. It is a framework for distributed computing with the BSP (bulk synchronous parallel) model. I always searching for new algorithms that could fit into the model of BSP computing, e.G. graph algorithms of all sorts, strongly iterative algorithms, real-time algorithms.
And I think that DBSCAN also fits into the BSP model, I tell you why a bit later in this post.
First off, just a little introduction of the DBSCAN algorithm itself...

The algorithm

The algorithm is very easy to understand. Actually you have a bunch of points (or vectors in higher dimensionalities) as input, then you have to parameters and some fancy output.
The two parameters are called "epsilon" and "minpoints", epsilon is the minimum distance between two vectors to connect two points strongly and minpoints is the number of points that are at least needed to build a cluster out of strongly connected vectors.
Now you are going through the graph, point by point, marking visited vectors and adding points to a cluster while they are not violating the rules defined by epsilon and minpoints.

You can read on wikipedia about how the sequential version works in detail, however I am going to propose a much more easier to understand version of the algorithm.

Distributed algorithm steps

Instead of defining a big distributed algorithm that translates the sequential version into some distributed programming model, I have assembled three main steps to get the same result as the sequential version.
However each of these steps are strongly parallelizable in every major programming model (at least I know how it works in MapReduce, BSP and MPI).

Here are the three steps:

compute a distance matrix between the vectors with a given distance measurement

trivial step to parallelize, can also be merged with the next point

extract adjacent points via the epsilon threshold and the minpoints restriction

This step creates an adjacency list/matrix representing a graph
Noise is filtered at this step of the algorithm

run a connected component algorithm on the resulting graph of the previous step

Already done that in MapReduce and BSP, the last BSP version will be updated shortly after Apache Hama 0.5.0 comes out.

These three simple steps will give you the same result as a DBSCAN. Normally you can merge step 1 with step two, you can simply extract the adjacents points while computing the distances.

In the end, you will receive n-connected components, every of them will represent a cluster.

The delta to the points of your original input would be the noise cluster.

Note that the initial step is O(n²) which is obviously pretty bad and not scalable. So think about techniques like Similarity Hashing to speed this step up.

Pretty easy right? I think it is even more easier than the pseudocode on wikipedia.

Of course I put up a sample version (although sequential) on my github:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/DBSCAN.java

There is a nice plot I received when running it:

To make the noise more easy to spot, I have made horrible yellow circles arround them with Paint, please forgive me ;)

Stay tuned for an implementation with Apache Hama!

Update:

So far I haven't found the time to implement this whole system with Apache Hama. However, if you want to practically use this here are some advices:

For the distance matrix to compute, better use a heuristic to find close vectors

Mahout has a MinHashing implementation of such a clustering

Once you obtained "mini" clusters, you can compute more expensive distance measurements and extract your graph (step two in the above list)

Feb 27, 2012

Nonlinear conjugate gradient method in Java

Hi all,

I have been a bit busy for the last few weeks with some projects. But fortunately I found some time on the last weekend to implement a recommandation engine for myself based on what we have done in ml-class.org lessons in octave and translate it to a Java version.

If you attended ml-class you might be familiar that you need to minimize a rather complex cost function based on what the user likes in terms of movies.
However I haven't found a simple and not ancient Java library containing a fully working conjugate gradient method. Shorthand I decided to translate it from Octave to Java. It took me 5-6 hours to build a Octave-like vector library arround it to translate it quite 1:1. But it was really worth it.

And here it is:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/Fmincg.java

Fmincg btw stands for Function minimize nonlinear conjugant gradient. It wasn't clear for me in the first place and I really started to understand the algorithm when I translated it.

It works quite like the version in octave, you pass an input vector (which is used as a starting point) and a costfunction along with a number of iterations to do.
Since I'm a hacker by heart, I want to give you a sample code to try it out for yourself.

Example

I have prepared a simple quadratic function for you
f(x) = (4-x)^2+10

Obviously since this is quadratic this has a global minimum which is easy to spot because I used the binomial version of the function, we will see if fmincg finds it.

For the algorithm we constantly need to calculate the gradients of the input in our cost function. Therefore we need the derivative which is for our function
f(x)' = 2x-8

If you're a math crack then you know that the f(x) hits (int our case) the minimum where the derivative cut's the x-axis or y=0.

Since this is quite hard to visualize, I have made a picture for you:

Function f(x) and its derivative

Not difficult to spot, you see the black function is our f(x) whereas the green line is our derivative. And it hits the x-axis right at the minimum of the function. Great!

How do we code this?

This is quite simple, I show you the code first:

int startPoint = -5;
    // start at x=-5
    DenseDoubleVector start = new DenseDoubleVector(new double[] { startPoint });
  
    CostFunction inlineFunction = new CostFunction() {
      @Override
      public Tuple<Double, DenseDoubleVector> evaluateCost(
          DenseDoubleVector input) {
        
        // our function is f(x) = (4-x)^2+10, so we can calculate the cost
        double cost = Math.pow(4-input.get(0),2)+10;
        // the derivative is f(x)' = 2x-8 which is our gradient value
        DenseDoubleVector gradient = new DenseDoubleVector(new double[] {2*input.get(0)-8});
        
        return new Tuple<Double, DenseDoubleVector>(cost, gradient);
      }
    };
    
    DenseDoubleVector minimizeFunction = Fmincg.minimizeFunction(inlineFunction, start, 100, true);
    // should return 4
    System.out.println("Found a minimum at: " + minimizeFunction);

As you can see we have to allocate the vector which is containing our start "point". You can set this arbitrary randomly, but you have know that it might not hit the global minimum but rather a local minimum. So different random starting points can yield to different results.

But not in our case. Now you can implement the cost function the algorithm needs the cost of your function at the given point of the input and the value of the derivative at this point.
So we return this after we put the input "x" in our two equations as a tuple.

The whole application prints:

Interation 1 | Cost: 10,000000
Interation 2 | Cost: 10,000000
[4.0]

Works! And it outputs the x-value of our minima and the cost is 10, which should be the y-value. Very cool!
You can find this example on Github as well.

Please use it if you need it, I have already implemented the collaborative filtering algorithm with it and I guess a backprop neural network will follow soon.

I really have to admit that I am not a math crack although I am studying computer sciences, but math really makes many things easier and if you take a deeper look behind what you've learned in school, it is really beautiful.

Thanks and bye!

Jan 24, 2012

German Stop Words

Hey all,

I'm doing some text mining in the last time, so I needed a reliable list of german stop words.
The only real advanced version I have found was the lucene "GermanAnalyzer". That is the seed of the following list I wanted to share with you.

I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).

public final static HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(
     Arrays.asList(new String[] { "and", "the", "of", "to", "einer",
      "eine", "eines", "einem", "einen", "der", "die", "das",
      "dass", "daß", "du", "er", "sie", "es", "was", "wer",
      "wie", "wir", "und", "oder", "ohne", "mit", "am", "im",
      "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",
      "ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",
      "mit", "dich", "dir", "mich", "mir", "mein", "sein",
      "kein", "durch", "wegen", "wird", "sich", "bei", "beim",
      "noch", "den", "dem", "zu", "zur", "zum", "auf", "ein",
      "auch", "werden", "an", "des", "sein", "sind", "vor",
      "nicht", "sehr", "um", "unsere", "ohne", "so", "da", "nur",
      "diese", "dieser", "diesem", "dieses", "nach", "über",
      "mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",
      "unsers", "euch", "euers", "euer", "eurem", "ihr", "ihres",
      "ihrer", "ihrem", "alle", "vom" }));

Note that there are some english words as well, if you don't need them, they are just in the first section of the array. So you can easily remove them ;)

If you have a good stemmer, you can remove other words as well.

How did I extract them?

These words are the words that had the highest word frequency in a large set (> 10 Mio.) of text and html documents.

Have fun and good luck!

Jan 2, 2012

BSP k-means Clustering Benchmark

Hey all,

in my last post I already wrote about the kmeans clustering with Apache Hama and BSP.
Now we have a detailed benchmark of my algorithm.

Have a look here for the current state taken from here: http://wiki.apache.org/hama/Benchmarks

Because it will change during the lifetime of Apache Hama, I made a screenshot from the very first benchmark. Maybe to document performance improvements ;)

Have a look here:

Is it faster than MapReduce?
Yes! I recently read in the new "Taming Text" by Grant S. Ingersoll that the same amount of workload takes the same time, but not in seconds, but in minutes.

However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.

Future Work
Besides the benchmark against MapReduce and Mahout, I want to show the guys from Mahout that it is reasonable to use BSP as an alternative to MapReduce. I look forward that they use Apache Hama and BSP within the next year as an alternative to MapReduce implementations for various tasks.

Thanks to Edward who made this possible!