Jan 24, 2012

German Stop Words

Hey all,

I've been doing some text mining lately, so I needed a reliable list of German stop words.
The only reasonably complete list I found was the one in Lucene's "GermanAnalyzer". That is the seed of the following list, which I wanted to share with you.

I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).

import java.util.Arrays;
import java.util.HashSet;

// Place this field inside one of your own classes.
// Duplicate entries were removed; a HashSet ignores them anyway.
public static final HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(
     Arrays.asList(
      // English words (delete this line if you don't need them)
      "and", "the", "of", "to",
      // German words
      "einer", "eine", "eines", "einem", "einen", "der", "die", "das",
      "dass", "daß", "du", "er", "sie", "es", "was", "wer",
      "wie", "wir", "und", "oder", "ohne", "mit", "am", "im",
      "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",
      "ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",
      "dich", "dir", "mich", "mir", "mein", "kein", "durch",
      "wegen", "sich", "bei", "beim", "noch", "den", "dem",
      "zu", "zur", "zum", "ein", "auch", "werden", "an", "des",
      "sind", "vor", "nicht", "sehr", "um", "unsere", "so", "da",
      "nur", "diese", "dieser", "diesem", "dieses", "nach", "über",
      "mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",
      "unsers", "euch", "euers", "euer", "eurem", "ihrem", "alle",
      "vom"));

Note that the list also contains a few English words. If you don't need them, they are right at the start of the array, so you can easily remove them ;)
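
To make the HashSet#contains(token) idea a bit more concrete, here is a minimal sketch of how filtering could look. The class name, the method name, the abbreviated sample set and the very naive tokenizer (split on everything that is not a letter) are just assumptions for this example; in real code you would pass in the full GERMAN_STOP_WORDS set and probably a proper tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilterSketch {

    // Abbreviated sample for the demo only; use the full GERMAN_STOP_WORDS set in practice.
    static final Set<String> SAMPLE_STOP_WORDS = new HashSet<String>(
            Arrays.asList("der", "die", "das", "dem", "und", "ist", "ein", "auf"));

    // Splits the text on anything that is not a letter, lower-cases each token
    // and keeps only the tokens that are not contained in the stop word set.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}]+")) {
            String normalized = token.toLowerCase(Locale.GERMAN);
            if (!normalized.isEmpty() && !stopWords.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [hund, sofa, bellt]
        System.out.println(removeStopWords(
                "Der Hund ist auf dem Sofa und bellt.", SAMPLE_STOP_WORDS));
    }
}
```

The tokens are lower-cased before the lookup because all entries in the list are lower case, while German text capitalizes nouns and sentence starts.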

If you run a good stemmer before filtering, you can remove other words as well, since several inflected forms then collapse into a single stem.

How did I extract them?

These are the words that had the highest frequency in a large set (more than 10 million) of text and HTML documents.
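
I can't paste the original extraction code here, so take the following only as a rough sketch of the general idea: count how often each word occurs over all documents and keep the most frequent ones. The class and method names are made up for the example, the tokenizer is naive, and it assumes the documents fit into memory and that HTML markup has already been stripped. For millions of documents you would need something distributed or streaming instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FrequentWordsSketch {

    // Counts lower-cased tokens over all documents and returns the n most frequent words.
    static List<String> mostFrequentWords(List<String> documents, int n) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : documents) {
            for (String token : doc.split("[^\\p{L}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String word = token.toLowerCase(Locale.GERMAN);
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        List<String> words = new ArrayList<String>(counts.keySet());
        // Sort by descending count; ties are broken alphabetically to keep the result deterministic.
        Collections.sort(words, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                int byCount = counts.get(b).compareTo(counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        return new ArrayList<String>(words.subList(0, Math.min(n, words.size())));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Der Hund und die Katze",
                "Der Hund bellt und der Vogel singt");
        // prints [der, hund, und]
        System.out.println(mostFrequentWords(docs, 3));
    }
}
```

The top of such a ranking still needs a manual look, because frequent content words of the corpus end up there as well ("hund" in the toy example above).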

Have fun and good luck!

2 comments:

  1. Hello Thomas,

    Why don't you use the Solr stop word list, which is included in the download?

  2. It is pretty much the same list, but the Solr one comes with a translation and shitty formatting.
