Thomas Jungblut's Blog: January 2012

Jan 24, 2012

German Stop Words

Hey all,

I've been doing some text mining lately, so I needed a reliable list of German stop words.
The only reasonably advanced list I found was the one in Lucene's "GermanAnalyzer". It is the seed of the following list, which I wanted to share with you.

I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).

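import java.util.Arrays;
import java.util.HashSet;

// The first four entries are English. Duplicate entries in the array are
// harmless, because the HashSet collapses them.
public final static HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(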
     Arrays.asList(new String[] { "and", "the", "of", "to", "einer",
      "eine", "eines", "einem", "einen", "der", "die", "das",
      "dass", "daß", "du", "er", "sie", "es", "was", "wer",
      "wie", "wir", "und", "oder", "ohne", "mit", "am", "im",
      "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",
      "ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",
      "mit", "dich", "dir", "mich", "mir", "mein", "sein",
      "kein", "durch", "wegen", "wird", "sich", "bei", "beim",
      "noch", "den", "dem", "zu", "zur", "zum", "auf", "ein",
      "auch", "werden", "an", "des", "sein", "sind", "vor",
      "nicht", "sehr", "um", "unsere", "ohne", "so", "da", "nur",
      "diese", "dieser", "diesem", "dieses", "nach", "über",
      "mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",
      "unsers", "euch", "euers", "euer", "eurem", "ihr", "ihres",
      "ihrer", "ihrem", "alle", "vom" }));

Note that the list contains a few English words as well. If you don't need them, they are all at the very beginning of the array, so you can easily remove them ;)

If you have a good stemmer, you can remove other words as well.
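
To make the HashSet#contains(token) idea concrete, here is a minimal sketch of filtering a sentence. The class name StopWordFilter, the method removeStopWords and the shortened STOP_WORDS set are just assumptions for this example (in real code you would use the full GERMAN_STOP_WORDS set from above), and the tokenizer is a naive split on non-letter characters, not what Lucene does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {

    // Shortened stand-in for GERMAN_STOP_WORDS, only for this example.
    static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("der", "die", "das", "und", "ist", "nicht"));

    // Lowercases the text, splits it on anything that is not a letter
    // and keeps only the tokens that are not in the stop word set.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<String>();
        String[] tokens = text.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+");
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [hund, katze]
        System.out.println(removeStopWords("Der Hund und die Katze"));
    }
}
```

The tokens are lowercased before the lookup because all entries in the list are lowercase.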

How did I extract them?

These are the words that had the highest frequency in a large set (more than 10 million) of text and HTML documents.
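
The frequency counting itself can be sketched in a few lines. This is not the code I ran on the real data set, just a small in-memory illustration; the names TopWords and mostFrequent are made up for this example, and for millions of documents you would want to distribute the counting instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TopWords {

    // Counts every token over all documents and returns the n most
    // frequent ones. Ties are broken alphabetically so that the result
    // is deterministic.
    static List<String> mostFrequent(List<String> documents, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        entries.sort(new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });

        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The top of such a ranking is a candidate list only; you still have to look through it by hand before using it as stop words.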

Have fun and good luck!

Jan 2, 2012

BSP k-means Clustering Benchmark

Hey all,

In my last post I wrote about k-means clustering with Apache Hama and BSP.
Now we have a detailed benchmark of my algorithm.

The current state of the benchmark is on the Hama wiki: http://wiki.apache.org/hama/Benchmarks

Because that page will change over the lifetime of Apache Hama, I took a screenshot of the very first benchmark, maybe to document performance improvements later ;)

[Screenshot: first BSP k-means benchmark from the Apache Hama wiki]

Is it faster than MapReduce?
Yes! I recently read in the new book "Taming Text" by Grant S. Ingersoll that a comparable workload takes about the same time with MapReduce, only measured in minutes rather than seconds.

However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.

Future Work
Besides the benchmark against MapReduce and Mahout, I want to show the guys from Mahout that BSP is a reasonable alternative to MapReduce. I hope that within the next year they will use Apache Hama and BSP as an alternative to their MapReduce implementations for various tasks.

Thanks to Edward who made this possible!