<h3>XGBoost bayesian hyperparameter tuning with bayes_opt in Python</h3>
Hey guys,<br />
<br />
I just wanted to quickly share how I was optimizing hyperparameters in XGBoost using <a href="https://github.com/fmfn/BayesianOptimization">bayes_opt</a>.<br />
<br />
<script src="https://gist.github.com/thomasjungblut/b58d70d260abf0eff1a8c447f3d07389.js"></script>
<br />
It does a k-fold cross validation while optimizing for stable parameters.<br />
Keep in mind that bayes_opt maximizes the objective function, so adjust the hardcoded values accordingly to fit your problem. It's pretty compact, so I thought I'd just leave it here as a gist for your convenience.<br />
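Since the gist does the heavy lifting, here is a rough sketch of the same idea so you can see the moving parts without clicking through - the parameter bounds, the AUC metric and the <i>dtrain</i> DMatrix are assumptions you would swap for your own data:<br />
<pre><code>import xgboost as xgb
from bayes_opt import BayesianOptimization

# dtrain is assumed to be an xgb.DMatrix built from your training data
def xgb_cv_score(max_depth, min_child_weight, subsample, colsample_bytree):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'eta': 0.1,
        'max_depth': int(max_depth),
        'min_child_weight': int(min_child_weight),
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
    }
    # k-fold cross validation; bayes_opt maximizes the returned mean AUC
    cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=5, seed=42)
    return cv['test-auc-mean'].iloc[-1]

optimizer = BayesianOptimization(xgb_cv_score, {
    'max_depth': (3, 10),
    'min_child_weight': (1, 20),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
})
optimizer.maximize(init_points=5, n_iter=25)
# recent bayes_opt versions store the best result in optimizer.max
print(optimizer.max)</code></pre>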
<br />
Cheers,<br />
Thomas
<h3>XGBoost Validation and Early Stopping in R</h3>
Hey people,<br />
<br />
While using XGBoost in R for some Kaggle competitions, I always come to a stage where I want to do early stopping of the training based on a held-out validation set.<br />
<br />
There are very few code snippets out there that actually do it in R, so I wanted to share my quite generic code here on the blog.<br />
<br />
Let's say you have a training set in some CSV and you want to split it into a 9:1 training:validation split. Here's the naive (not stratified) way of doing it:<br />
<br />
<pre><code>train <- read.csv("train.csv")
# shuffle the rows, then take the first 90% for training and the rest for validation
bound <- floor(nrow(train) * 0.9)
train <- train[sample(nrow(train)), ]
df.train <- train[1:bound, ]
df.validation <- train[(bound+1):nrow(train), ]</code></pre>
<br />
Now before feeding it back into XGBoost, we need to create an xgb.DMatrix and remove the targets so we don't spoil the classifier. This can be done with the following code:<br />
<br />
<pre><code>train.y <- df.train$TARGET
validation.y <- df.validation$TARGET
# drop the target column and convert to a numeric matrix before building the DMatrix
dtrain <- xgb.DMatrix(data=as.matrix(df.train[, names(df.train) != "TARGET"]), label=train.y)
dvalidation <- xgb.DMatrix(data=as.matrix(df.validation[, names(df.validation) != "TARGET"]), label=validation.y)</code></pre>
<br />
So now we can go and set up a watchlist and actually start the training. Here's some simple sample code to get you started:
<br />
<pre><code>watchlist <- list(validation=dvalidation, train=dtrain)
param <- list(
  objective = "binary:logistic",
  eta = 0.3,
  max_depth = 8
)
clf <- xgb.train(
  params = param,
  data = dtrain,
  nrounds = 500,
  watchlist = watchlist,
  maximize = FALSE,
  early.stop.round = 20
)</code></pre>
<br />
Here we set up a watchlist with the validation set as the first element of the list and the training set as the second. The reason you need to put the validation set first is that early stopping only works on one metric - and we obviously want that to be the validation set.<br />
<br />
The rest is a straightforward setup of the xgb tree itself. Keep in mind that when you use early stopping, you also need to supply whether or not to maximize the chosen evaluation metric - otherwise you might find yourself stopping very early!<br />
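Once training has stopped you can sanity-check the model on the held-out data; a minimal sketch, assuming the binary:logistic objective from above:<br />
<pre><code># probabilities for the held-out validation rows
preds <- predict(clf, dvalidation)
# binary log loss as a quick sanity check
logloss <- -mean(validation.y * log(preds) + (1 - validation.y) * log(1 - preds))
print(logloss)</code></pre>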
<br />
Here's the full snippet as a gist:<br />
<br />
<script src="https://gist.github.com/thomasjungblut/e60217c5b7609e4dfef3.js"></script>
<br />
<a href="https://gist.github.com/thomasjungblut/e60217c5b7609e4dfef3">https://gist.github.com/thomasjungblut/e60217c5b7609e4dfef3</a><br />
<br />
Thanks for reading,<br />
Thomas
<h3>MinHashing for Beginners</h3>
Due to popular demand for the MinHashing class in my common library, I decided to do a quick and dirty walkthrough of how to vectorize, MinHash and look up your data!<br />
<br />
<b>What is MinHashing good for?</b><br />
<b><br /></b>
MinHashing is primarily used to speed up nearest neighbour lookups. Think of having a huge, terabyte-sized dataset of movie reviews in which you want to find similar reviews (either for deduplication or for recommendation).<br />
<br />
MinHashing will give you a very small and dense representation of your data by hashing the values of its vectorized form. <a href="http://en.wikipedia.org/wiki/MinHash" target="_blank">You can read more about the theory and some details behind it on Wikipedia.</a><br />
<br />
<b>Example Data</b><br />
<br />
To get some baseline data, let's grab a dataset from Kaggle. <a href="http://www.kaggle.com/c/word2vec-nlp-tutorial/data" target="_blank">This example uses the 50k movie reviews from IMDB</a>, namely the "unlabeledTrainData.tsv". Sign in and get the data from there - alternatively, any data will do as long as it has the same columnar TSV-style formatting, here is an example:<br />
<br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">id review
"9999_0" "Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \"Hey, let's pool our money together and make a really bad movie!\" Or something like that. What ever they said, they still ended up making a really bad movie--dull story, bad script, lame acting, poor cinematography, bottom of the barrel stock music, etc. All corners were cut, except the one that would have prevented this film's release. Life's like that."
"45057_0" "I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women.<br /><br />It is in black and white but saves the colour for one shocking shot.<br /><br />At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.<br /><br />Avoid."
</pre>
</div>
<br />
We need to do some massaging here, since there are some HTML parts in it, but for this example we can simply get away with lower casing and removing everything that isn't alphanumeric.<br />
<br />
<b>Tokenization and Vectorization</b><br />
<br />
The main part here is to normalize and tokenize the data into bigrams, so that we can use our vectorizer to build TF-IDF or bag-of-word-frequency vectors out of them. Using my common library and Java 8 streams, this looks like this:<br />
<pre class="brush: java">
// read the raw lines into a list, so we can print the original reviews later on
List<String> lines = Files.lines(
    Paths.get("C:/Users/thomas.jungblut/Downloads/unlabeledTrainData.tsv"))
    .skip(1) // skip the header
    .collect(Collectors.toList());
// normalization and bigram tokenization
List<String[]> documents = lines.stream()
    .map((line) -> TokenizerUtils.normalizeString(line))
    .map((normalized) -> TokenizerUtils.whiteSpaceTokenizeNGrams(normalized, 2))
    .collect(Collectors.toList());
// every token that occurs in more than 50% of the corpus or less than twice will be discarded
String[] dict = VectorizerUtils.buildDictionary(documents.stream(), 0.5f, 2);
List<DoubleVector> vectors = VectorizerUtils.wordFrequencyVectorize(
    documents.stream().parallel(), dict).collect(Collectors.toList());
</pre>
<br />
<b>MinHashing</b><br />
<br />
Now that we have the vectors, we can MinHash them.<br />
<br />
<pre class="brush: java">
MinHash hasher = MinHash.create(20, HashType.MURMUR128);
List<int[]> minHashes = vectors.stream().map((v) -> hasher.minHashVector(v)).collect(Collectors.toList());
</pre>
<br />
In this case, we go for 20 MinHashes and a rather strong 128-bit Murmur hash. This is usually the default I try for various text tasks.<br />
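Conceptually, the similarity we measure below is just the fraction of signature positions on which two documents agree, which approximates the Jaccard similarity of the underlying token sets. A minimal sketch of that idea (a hypothetical helper, not the library's exact <i>measureSimilarity</i> implementation):<br />
<pre class="brush: java">
// fraction of matching minhash slots ~ Jaccard similarity of the two documents
static double estimateSimilarity(int[] left, int[] right) {
  int matches = 0;
  for (int i = 0; i < left.length; i++) {
    if (left[i] == right[i]) {
      matches++;
    }
  }
  return matches / (double) left.length;
}
</pre>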
<br />
<b>Finding Similar Documents</b><br />
<br />
To find similar documents, we have two options: brute force or using an inverted index. Let's start with brute force by finding the five most similar elements to the third (index = 2) movie review in our list.<br />
<br />
<pre class="brush: java">
final int sourceDocIndex = 2; // document index to search similar documents for
LimitedPriorityQueue<Integer> queue = new LimitedPriorityQueue<>(5);
int[] first = minHashes.get(sourceDocIndex);
Stopwatch watch = Stopwatch.createStarted();
for (int i = 0; i < minHashes.size(); i++) {
  if (i != sourceDocIndex) {
    int[] reference = minHashes.get(i);
    double similarity = hasher.measureSimilarity(first, reference);
    if (similarity > 0.1) {
      queue.add(i, similarity);
    }
  }
}
System.out.println("Done finding similar docs in "
    + watch.elapsed(TimeUnit.MILLISECONDS) + "ms!");
System.out.println("document:");
System.out.println(lines.get(sourceDocIndex));
while (!queue.isEmpty()) {
  double similarity = queue.getMaximumPriority();
  int index = queue.poll();
  System.out.println(similarity + " -> " + lines.get(index));
}
</pre>
<br />
This will output a very interesting pattern:<br />
<br />
<br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">Done finding similar docs in 146ms!
document:
"15561_0" "Minor Spoilers<br /><br />In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. If you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. My vote is two.<br /><br />Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
0.9047619047619048 -> "15556_0" "Minor Spoilers<br /><br />In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archaeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible and screaming performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. My suggestion is that if you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. My vote is two.<br /><br />Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
0.9047619047619048 -> "15555_0" "Minor Spoilers<br /><br />In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible and screaming performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. If you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. AVOID IT! My vote is two.<br /><br />Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
</pre>
</div>
<br />
<br />
It found two near-duplicates of the source document we provided :-)<br />
<br />
Please find the full code here:<br />
<br />
<a href="https://gist.github.com/thomasjungblut/e4759797f5a52d78e06d">https://gist.github.com/thomasjungblut/e4759797f5a52d78e06d</a><br />
<br />
<br />
In a later edit I will cover the inverted index version and how the performance actually improves.<br />
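For the curious, the rough idea behind the inverted index variant is to map every minhash value to the documents containing it, so you only compare against documents sharing at least one minhash. A small sketch of that idea (my assumption of how it could look, not the follow-up implementation):<br />
<pre class="brush: java">
// build the inverted index: minhash value -> documents containing it
Map<Integer, List<Integer>> index = new HashMap<>();
for (int doc = 0; doc < minHashes.size(); doc++) {
  for (int hash : minHashes.get(doc)) {
    index.computeIfAbsent(hash, (k) -> new ArrayList<>()).add(doc);
  }
}
// candidate lookup: only documents that share at least one minhash with the source
Set<Integer> candidates = new HashSet<>();
for (int hash : minHashes.get(sourceDocIndex)) {
  candidates.addAll(index.getOrDefault(hash, Collections.emptyList()));
}
candidates.remove(sourceDocIndex);
</pre>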
<br />
Thanks for reading,<br />
Thomas
<h3>Fixing Alienware's MX14r2 Audio on Ubuntu 14.04 LTS</h3>
Today I was using my Alienware MX14r2 again for the first time in a while, and after the upgrade to Ubuntu 14.04 my speakers worked, but my headphone jack didn't respond anymore.<br />
<br />
So if you see the following symptoms, I think I have a fix for you.<br />
<ul>
<li>Speakers work and headphones don't (I experienced it the other way around already last year)</li>
<li>When the headphones are plugged in, you hear sound from the speaker</li>
</ul>
<div>
<br /></div>
<div>
After going through numerous threads, multiple config hacks and reinstalls of ALSA, I arrived at the surprisingly simple fix of changing a few things in alsamixer.</div>
<div>
<br /></div>
<div>
So open a terminal (CTRL+ALT+T) and type "alsamixer":</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi39hlrOaeKgQODxfh3kC0aHKVZCljtoA0Uw_NwXOHzu_ytfSi1l9anAh0HbMjWZAeTYYjasVbzsGNsDi40eJYxvf9IbQwqz5mwwcTkUqo8Veu0AvmJphMrCul0k2yQnEV9sv3C8RRid8E/s1600/alsamixer_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi39hlrOaeKgQODxfh3kC0aHKVZCljtoA0Uw_NwXOHzu_ytfSi1l9anAh0HbMjWZAeTYYjasVbzsGNsDi40eJYxvf9IbQwqz5mwwcTkUqo8Veu0AvmJphMrCul0k2yQnEV9sv3C8RRid8E/s1600/alsamixer_1.png" height="200" width="320" /></a></div>
<br />
Make it fullscreen, so you can see every input / output codec:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUnGnogmCxG7gpjR1KbAU_uRSNnQ9pEMcYUrr2jsOEML95ULj0CdvMmlMRWeppEXjorI_mSL5D2W2DrBvPVhwtm-1_Hm0MnPe3aESyICU9F7rDFKnOaK_5N2h0_aG-lM9heLZhzznJ1mc/s1600/alsamixer_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUnGnogmCxG7gpjR1KbAU_uRSNnQ9pEMcYUrr2jsOEML95ULj0CdvMmlMRWeppEXjorI_mSL5D2W2DrBvPVhwtm-1_Hm0MnPe3aESyICU9F7rDFKnOaK_5N2h0_aG-lM9heLZhzznJ1mc/s1600/alsamixer_2.png" height="175" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The culprit really was that both HP/Speaker were set to "MM" (muted) while they should be set to "00" (enabled):</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiah8FissnY-7ILtaQZI7pyuTZ_ia_HECk08UxGcSjspOLCAmGTfUUVfx7FBY6tg9L7fQrE32hqqcO2HQgxYKOejphQGVTJ72j8xQNoq-TPW_M7wJADPbAfZApFA4nI0hWNSPZ74T-bdEc/s1600/alsamixer_3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiah8FissnY-7ILtaQZI7pyuTZ_ia_HECk08UxGcSjspOLCAmGTfUUVfx7FBY6tg9L7fQrE32hqqcO2HQgxYKOejphQGVTJ72j8xQNoq-TPW_M7wJADPbAfZApFA4nI0hWNSPZ74T-bdEc/s1600/alsamixer_3.png" height="175" width="320" /></a></div>
<div>
<br /></div>
<div>
So navigate to these with your arrow keys and press "m" until they show up as "00".</div>
<div>
<br /></div>
<div>
Surprisingly, enabling only one of them will enable either only the speakers or only the headphones. Switching both to "00" will make them work like you're expecting.</div>
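<div>
If you prefer the command line, the same unmuting can usually be done with amixer - the control names ("Speaker", "Headphone") may differ on your machine, so list them first:</div>
<pre><code>amixer scontrols             # list the available mixer controls
amixer set Speaker unmute    # the "MM" -> "00" toggle from alsamixer
amixer set Headphone unmute
sudo alsactl store           # persist the settings across reboots</code></pre>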
<div>
<br /></div>
<div>
Thanks for reading and I hope it helped,</div>
<div>
Thomas</div>
<h3>Stochastic Logistic Regression in F#</h3>
Hello!<br />
<br />
I'm back with some exciting new hacking. At the beginning of the week, <a href="http://www.ffconsultancy.com/" target="_blank">Jon Harrop</a> came to the Microsoft London office and gave a two-day training on F#. The training was nice, I enjoyed it very much and we all laughed a lot :-)<br />
<br />
Here I want to share a small project that I came up with during the training: Stochastic Logistic Regression, i.e. Logistic Regression "learning" its weights using Stochastic Gradient Descent.<br />
This is just a small write-up of what I've learned, explaining some of the language features.<br />
<br />
<h4>
Learning Rule</h4>
In (more or less) mathematical terms, we will execute the following algorithm:<br />
<pre class="brush: fsharp">
Let n = number of iterations
Let a = learning rate
Let w = weights that need to be learned (including a bias term)
For i = 1 to n:
    Let f = get a random feature
    Let c = the class of that feature (either 0 or 1)
    Let prediction = Sigmoid ( f dot w )
    Let loss = prediction - c
    w := w - a * f * loss
</pre>
A pretty easy algorithm, and it can be rewritten to use a convergence criterion very easily since there is only a single global minimum. If you took Andrew Ng's ML course you'll recognize some similarity, because the full-batch algorithm used there takes the whole feature matrix and updates the gradient by the average over all losses.<br />
<br />
Let's jump into F#.<br />
<br />
<h4>
Training in F#</h4>
<div>
Since we are basically executing a math equation over and over again, it is handy to have a math library at hand. I've chosen <a href="http://numerics.mathdotnet.com/" target="_blank">Math.Net Numerics</a>, because it has some nice F# bindings. Let's start by looking at the training loop of the whole program:</div>
<pre class="brush: fsharp">
let Train(numIterations, dim, learningRate, rnd) =
    // new vector including the bias
    let mutable start = DenseVector.CreateRandom(dim + 1, Normal())
    for i = 0 to numIterations do
        let feature = SampleNewFeature(dim, rnd)
        start <- OnlineLogisticRegression.GradientDescentStep(learningRate, start, FeatureVector(feature), FeatureClass(feature))
    Console.WriteLine("Learned Weights: {0}", System.String.Join(", ", start.ToArray()))
    start</pre>
<br />
As you can see, F# reads pretty much like the (more or less) mathematical notation I used in the previous section. The first notable thing is that F# is indentation-sensitive, just like Python.<br />
<br />
In the first line we define the function "<i>Train</i>", which declares some arguments and no explicit return type. F# is quite smart about type inference, so it can infer the types of the parameters as well as the return type. In this method we return the learned weights "<i>start</i>" - thus the compiler knows that this method returns a <i>DenseVector</i>. Like in most functional languages, there is always a return value - in case there is none, there is a "<i>unit</i>" return (literally written as () ), which indicates the absence of a value.<br />
<br />
Just like in the raw algorithm, we use a for loop over all iterations, sample a new feature in each one and pass it to another function defined in the <i>module </i>"<i>OnlineLogisticRegression</i>", which contains the update and calculation logic. At the end, we simply print the vector (using the standard Console printing you know from C#) and return it to the outside. You can seamlessly use F# and C# with each other in a program as they compile to the same IL.<br />
<br />
Let's step into the gradient descent function for a while:<br />
<pre class="brush: fsharp">
/// Does a stochastic gradient descent step. Returns a new vector with the updated weights.
let GradientDescentStep(learningRate, currentParameters, currentFeatureVector, outcome) =
    let biasedFeature = CreateBiasVector currentFeatureVector
    let prediction = Predict(biasedFeature, currentParameters)
    let loss = prediction - outcome
    // do a gradient descent step into the gradient direction
    DenseVector.OfVector(currentParameters - (biasedFeature * (loss * learningRate)))
</pre>
<br />
Yet another example of type inference, but here I want to draw your attention to the operator overloading. Yes, in F# you can overload operators: as you can see in the parameter update, "<i>currentParameters</i>" and "<i>biasedFeature</i>" are DenseVectors, while loss and learningRate are floats (floats in F# are 64-bit doubles, float32 is the "normal" float). The compiler has one small limitation here: you can't leave the brackets out, because it can't determine the precedence correctly otherwise.<br />
<br />
In any case, the logic around that is pretty simple, very similar to a definition in maths.<br />
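The <i>Predict</i> helper used above isn't shown in the snippet; conceptually it is just the sigmoid of the dot product between feature and weights. A minimal sketch of how it might look (the actual version lives in the repository linked below):<br />
<pre class="brush: fsharp">
// sketch of the prediction helpers assumed above
let Sigmoid z = 1.0 / (1.0 + exp(-z))

let Predict(feature: DenseVector, weights: DenseVector) =
    Sigmoid(feature.DotProduct(weights))
</pre>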
<br />
<h4>
Testing the classifier in F#</h4>
Once the classifier is trained, we want to assess its learned weights. Usually we would use a hold-out test set to measure metrics like accuracy or precision/recall; in this case I settled for sampling new features from the same random distribution we created the training features with. So what does that look like in F#?<br />
<pre class="brush: fsharp">
let testSetSource = List.init testSetSize (fun (x) -> SampleNewFeature(dim, rnd))
let testSet =
    Seq.ofList testSetSource
    |> Seq.map (fun(feat) -> feat, OnlineLogisticRegression.Predict(FeatureVector(feat), weights))
    |> Seq.map (fun(feat, prediction) -> feat, prediction, abs(FeatureClass(feat) - prediction) < 0.5)
let countCorrectPredictions = Seq.sumBy (fun(feat, prediction, correct) -> if correct then 1 else 0)
let numCorrect = countCorrectPredictions testSet
</pre>
<br />
As you can see, I have used the pipeline operator ( |> ) to chain operations together. First we create a new list containing the newly sampled test features, then we turn it into a sequence (a lazily evaluated structure similar to IEnumerable in C#).<br />
Into that sequence we map pairs of feature and prediction, and then use another map stage to assess whether a classification was correct or not (the sigmoid threshold here is 0.5), giving us a three-tuple. The partially applied <i>Seq.sumBy</i> is basically what currying is about: we fix some arguments of a function and obtain a new function.<br />
In the end, we sum the number of correct predictions by piping the sequence through the defined pipeline.<br />
All of this is lazily evaluated; the whole computation is only executed in the last line.<br />
<br />
<h4>
Code</h4>
<div>
The code can be found on GitHub, Apache 2.0 licensed:</div>
<div>
<a href="https://github.com/thomasjungblut/FSharpLogisticRegression">https://github.com/thomasjungblut/FSharpLogisticRegression</a><br />
<br /></div>
<h4>
Result</h4>
<div>
Executing the code in the GitHub repository yields the following plot.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://imagizer.imageshack.us/a/img837/9014/qziy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://imagizer.imageshack.us/a/img837/9014/qziy.png" height="285" width="400" /></a></div>
<div>
<br /></div>
<div>
Thanks for reading!</div>
<div>
<br /></div>
<div>
Next time (maybe), I'll write an online logistic regression service that can be trained on Microsoft Azure using a real-time data stream.</div>
<div>
<br /></div>
<h3>Greatest common divisor using F#</h3>
Hey guys,<br />
<br />
I played with F# on the weekend. Since I'm working at Microsoft now and mainly working with C#, I thought F# would be a nice counterpart to my Scala work in the past.<br />
<br />
<b>A few personal updates</b><br />
<br />
The start at Microsoft has been quite a lot of work so far, and I find less time and motivation to code much in the evenings. You can mainly see this in the commit activity of <a href="https://github.com/thomasjungblut/thomasjungblut-common" target="_blank">my common library on Github</a>, which looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim4Bgd9dNnH7BY8n-Nr6weDK_R-YOrl1KYZlQiLWwCgfjmNzm2_ULER__l-xfFLom-_pUqexw4tyJ_DGReNXPxZjwYBaAkAoJLuwEpQrBYsBPHP0OiaqRO8LU7wch-S_7UtWEFndoLqFk/s1600/commit_activity.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim4Bgd9dNnH7BY8n-Nr6weDK_R-YOrl1KYZlQiLWwCgfjmNzm2_ULER__l-xfFLom-_pUqexw4tyJ_DGReNXPxZjwYBaAkAoJLuwEpQrBYsBPHP0OiaqRO8LU7wch-S_7UtWEFndoLqFk/s1600/commit_activity.PNG" height="104" width="640" /></a></div>
<br />
Flatline since Christmas... Not very proud of it: due to my relocation to London I was two months without my development desktop PC, and in addition, context-switching between C# and Java isn't easy. Personally I find it very hard to go back to Java, as I'm missing a ton of functionality from C#! But I promise that I will find my way back, especially when it comes to awesome machine learning algorithms.<br />
<br />
<b>Back to the topic</b><br />
<br />
So, like most of the time, I was lurking around on Stackoverflow and found a poor guy using <a href="http://stackoverflow.com/questions/22920894/java-program-not-executing-code">Java8 lambdas to make the greatest common divisor work</a>. Leaving aside that his algorithm is quite strange and that by the time you read this the question is likely to be deleted, I found it quite an interesting exercise to practice my (still poor) weekend F# skills.
<br />
<br />
Just a few words about the <a href="http://en.wikipedia.org/wiki/Greatest_common_divisor">greatest common divisor definition</a>: given two integers, we want the largest number that divides both without a remainder.<br />
<br />
I chose the simplest set-theory approach, which is just calculating the two sets of divisors and intersecting them. Then I take the maximum of the intersection.<br />
<br />
Here is what I came up in F# (even using the pipelining operators!):<br />
<pre class="brush: fsharp">
let gcd = fun (a, b) ->
    let set1 = Set.ofList [ 1 .. a ]
               |> Set.filter (fun x -> a % x = 0)
    let inter = Set.ofList [ 1 .. b ]
                |> Set.filter (fun x -> b % x = 0)
                |> Set.intersect set1
    inter.MaximumElement
</pre>
<br />
Rather easy binding of a function, not much to say here.<br />
Small obligatory testcase:<br />
<pre class="brush: fsharp">
let a = 12
let b = 8
Console.WriteLine("GCD: {0}", gcd(a, b))
// yields to result
// GCD: 4</pre>
<br />
<b>How is the language so far?</b><br />
<br />
So far I have found F# quite succinct, especially compared to C#, and .NET integrates very nicely, too.<br />
However, sometimes the type system makes me want to punch a wall.<br />
<u>But the most disappointing experience so far:</u> the .fs file that contains your main method needs to be the last one in the project. Yes, you heard correctly, it must be the lowest entry in the project's file list.<br />
<br />
<u><b>How dumb is that?!</b></u> Obviously, this was hiding behind a typical MSFT error message:<br />
<blockquote class="tr_bq">
error FS0039: The namespace or module '_my_namespace' is not defined</blockquote>
That cost me a few hours to figure out. -.-<br />
<br />
<b>Complexity</b><br />
<br />
Let's talk about the complexity a bit. <a href="http://en.wikibooks.org/wiki/F_Sharp_Programming/Sets_and_Maps" target="_blank">As far as I read here</a>, a <i>Set </i>in F# is unordered, most likely a <i>HashSet</i>. Thus constructing and filtering both sets is O(A) and O(B) respectively, and the intersection of two unordered sets is linear in time as well. So what we end up with is something like this:<br />
<blockquote class="tr_bq">
2 * (O(A) + O(B)) + O(A+B) = O(n)</blockquote>
Still linear time, but very bad constants. Also the space complexity is rather bad, it is linear as well, as we are creating two quite large sets.<br />
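For comparison, the textbook Euclidean algorithm needs only logarithmically many steps and no intermediate sets; a minimal recursive sketch:<br />
<pre class="brush: fsharp">
// classic Euclidean algorithm, O(log(min(a, b))) steps and constant space
let rec euclidGcd a b =
    if b = 0 then a else euclidGcd b (a % b)
</pre>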
<br />
But hey! It is just a naive method to show off my newly acquired F# knowledge ;-)<br />
<br />
Thanks for reading, see you next time with a more interesting Big Data topic I hope,<br />
Thomas
<h3>Wiring services together using the actor model</h3>
Hey folks,<br />
<br />
I had a lot of stuff to do in my last semester, thus I couldn't completely write up the post about news clustering yet. The stress will stay constant until my studies are finally over, I guess until around the beginning of October. Then I will have more time to explain this topic more thoroughly, of course with lots of pictures ;)<br />
<br />
But today I have a bit of time and wanted to share a smart idea of joining two cool things together: <b>Service Oriented Architectures</b> (SOA) and the <b>Actor Model</b>. We will go through a small definition of both, why we should join these technologies and at the end have a small look at how to implement such an architecture with plain Java.<br />
<br />
<h3>
Why Service Oriented Architectures?</h3>
<div>
<br /></div>
<div>
Service Oriented Architecture is a design pattern based on chunking a large software system into smaller and discrete modules called services. The goal is to design services solely based on their functionality and thus decouple them from other services. The result should be an ensemble of services that are only defined by their simplistic interfaces that offer functionality to the outside world. </div>
<div>
<br /></div>
<div>
A pretty intuitive example is a checkout process in e-commerce systems: it is a large and complicated process that can be chunked into simpler parts. At the beginning of your internet shopping trip you are likely to visit a few products. Retrieving products and their information is also a good candidate for a service, because it has a defined behaviour and its functionality can be reused very well for other purposes. The corresponding interface could be something like this:</div>
<pre class="brush: java">
// could directly retrieve objects from a database
public Product getProduct(long productId);
// could be a proxy to another service
public Opinions getUserOpinions(long productId);
// could be a filesystem call
public Images getProductImages(long productId);
</pre>
<br />
For many of you, this might look like a data access object (DAO) that is going to ask an underlying database implementation for the concrete values. This is <b>not </b>what the goal of the service itself should be: a service defines just the <i>functionality </i>and not how it transports the information (whether there is an RPC call or an Oracle database in the back shouldn't be part of your interface/service design).<br />
Thus the user should never care about the underlying complexity or the implementation of the system. That is a statement similar to the one in <i>Object Oriented Programming </i>that naturally leads to polymorphism (multiple implementations of an interface).<br />
<br />
<b>But how do Services wire up together? </b><br />
<b><br /></b>
Imagine a computing cluster where a scheduler is a service that looks at the resources in the cluster and makes decisions on where to place your application best. How does it communicate the result of its computation to the next service- say the service that handles allocation of those resources?<br />
<br />
There are basically few ways to handle this:<br />
<ol>
<li>Just call the next service via its interface methods (synchronous call)</li>
<li>Send a message to the other service and continue (asynchronous call)</li>
<li>A superviser/manager that calls services after each other (supervised call)</li>
</ol>
<a href="http://www.youtube.com/watch?v=bzkRVzciAZg" target="_blank">I'm no async evangelist</a>, but I will try to tell you about my experiences and why I think that asynchronous messaging is a much more viable way in highly concurrent scenarios.<br />
<br />
Consider the following most simplistic scenario when dealing with services:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMmaNRLMjJA7_JlWM0_2ARF6ti7eP2ux8Ux3n9cl6rddNDQ_ZYrd4D8rQvWfLEXej5qr006QcMBOb8JxOg5G0dt3AJJWLk6Bf8PJUHp7iagfwfy7YhomyGO6Vvl_3cEtcJVDVi3XWQOFM/s1600/super_simple_service_chain.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="61" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMmaNRLMjJA7_JlWM0_2ARF6ti7eP2ux8Ux3n9cl6rddNDQ_ZYrd4D8rQvWfLEXej5qr006QcMBOb8JxOg5G0dt3AJJWLk6Bf8PJUHp7iagfwfy7YhomyGO6Vvl_3cEtcJVDVi3XWQOFM/s400/super_simple_service_chain.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Super simple service chain</td></tr>
</tbody></table>
In this case the services form a simple chain, so calls from Service A can only reach B, and B can only reach C. Clearly, there is no need for anything spectacular - if I were in that situation I would put those services in a list and call them one after another:<br />
<br />
<pre class="brush: java"> List<Service> services = ... some list ...;
Result lastResult = null;
for(Service service : services){
lastResult = service.call(lastResult);
}
// go further down the road with the final service result
</pre>
<br />
This is what is called the Pipeline pattern, because you feed the result of the stage before to the next one and enhance/filter/modify the results. This is a supervised architecture, because you are controlling how the data flows by hardcoding the control flow in the above loop.<br />
<br />
<b>But what happens when we want to process requests through the services concurrently?</b><br />
<br />
Now that is where the problems usually begin. The above architecture will work in a concurrent environment without any problems, as long as the code that is calling the services in sequence is thread-safe <b>and</b> all the services are designed to be thread-safe. That means, if your service has state (for example our scheduler that has information about the current cluster resources), it needs to lock all access to it while modifying it. This is not a big deal for someone who has worked with threads before and is familiar with standard libraries like Java's (see for example a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html" target="_blank">ReadWriteLock</a>).<br />
<br />
However, think of what complexity you are imposing on your software in that moment:<br />
<ul>
<li>Every service needs to handle the locks for itself and must be thread-safe</li>
<li>Even with standard library support you clutter your code with try/finally unlock statements</li>
<li>The performance is likely to suffer in high concurrency/throughput environments</li>
</ul>
<div>
Overall, this is a complexity nightmare (have you ever traced down a race condition?) and exactly what we wanted to avoid when choosing a SOA.</div>
<div>
<br /></div>
<div>
It just begins to get worse:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4Whv9K2K-N80ujU5HcPTV4CZWcWgL0MsaaliaN4lqBvc8KQRAA63M-jyGz8rDyj4UpfF201psqWEf286jBfRoDTpYjE7qMCEYAJGgmowVUycHP5errqm-HpFReUraqfNYFUNIDFVHelg/s1600/slightly_more_complex_service_chain.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4Whv9K2K-N80ujU5HcPTV4CZWcWgL0MsaaliaN4lqBvc8KQRAA63M-jyGz8rDyj4UpfF201psqWEf286jBfRoDTpYjE7qMCEYAJGgmowVUycHP5errqm-HpFReUraqfNYFUNIDFVHelg/s400/slightly_more_complex_service_chain.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Slightly more complex service chain</td></tr>
</tbody></table>
<div>
<br /></div>
What do you do when Service B locks its state for a long time (e.g. our scheduler just received a big update from a rack that just got back online)? Clearly other services will have to wait, and throughput and responsiveness start to suffer severely. You can spin this even a tick further: what if you're in a distributed environment and Service B just doesn't exist anymore (server goes down, netlink breaks)? Services A_[1-n] will have to wait until B comes back online and can't do anything but wait. Always note that those are the easiest service architectures! In reality your call graph looks much more connected across all services.<br />
<br />
All that is an issue if you're relying on synchronous communication between services. What we need is to define a decoupling between the services- not of their functionality, but this time of the communication between them.<br />
<br />
<h3>
The Actor Model</h3>
<div>
<br /></div>
<div>
The most intuitive way to make asynchronous communication happen is to send a message!</div>
<div>
If I want Bob to work on issue X in our bug tracker, I write him an email that he should have a look at issue X soon. Now Bob can decide on his own when he looks into his mailbox (for example when he is finished with the current task) and also when he wants to start working on issue X. Transferred to computer science: you don't disturb the service in doing its job as you would with locking or interrupts.</div>
<div>
<br /></div>
<div>
The intuition behind the actor model is the same: here Bob would be the actor and emails would be the messages that land in an actor's inbox. Normally we want to have many more actors that can interact with each other and provide functionality. That's where we come back to services: actors and services both provide functionality / behaviour, and messaging between actors helps us solve the synchronous communication problems.</div>
<br />
While you can use a framework like <a href="http://akka.io/" target="_blank">Akka</a> for the actor model, it is very easy to implement in Java using the standard API:<br />
<br />
<pre class="brush: java">public class SimpleActor<MSG_TYPE> implements Runnable {
public static interface Service<MSG_TYPE> {
void onMessage(MSG_TYPE message);
}
private final LinkedBlockingQueue<MSG_TYPE> inbox = new LinkedBlockingQueue<>();
private Service<MSG_TYPE> messageListener;
public SimpleActor(Service<MSG_TYPE> listener) {
this.messageListener = listener;
}
public void message(MSG_TYPE message) {
inbox.add(message);
}
@Override
public void run() {
while (!Thread.currentThread().isInterrupted()) {
// blocks until we have a new message
MSG_TYPE take = inbox.take();
messageListener.onMessage(take);
// interrupted exception omitted
}
}
}</pre>
<br />
As you can see, it is super easy to set up a producer/consumer inbox within a thread and use the service as a callback listener. All the concurrency and signalling is the problem of the underlying inbox implementation, here a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/LinkedBlockingQueue.html" target="_blank">LinkedBlockingQueue</a> is used.<br />
<br />
Now your Service can easily implement the callback, with the guarantee that every message that arrives will be processed sequentially (because your run method takes only one message at a time from the queue). So you will never have to worry about explicit locking in your code, you just have to react to events that happen.<br />
<br />
A simplistic and fictitious variant of a scheduler that reacts can look like this:<br />
<br />
<pre class="brush: java"> Service<SchedulingEvent> scheduler = new Service<SchedulingEvent>() {
@Override
public void onMessage(SchedulingEvent message) {
if(message.isSchedulingMessage()){
if(cluster.getFreeMemory() > message.memoryNeeded()){
// tell the allocation actor to run that
message(Allocator.class, new Allocation(message));
}
} else if(message.isUpdateMessage()){
cluster.update(message.getFreeMemory());
}
}
};
</pre>
<br />
You can see the logic is very clean, no locking is needed and you can react to specific events - or ignore the ones you don't care about. In a real-life scenario I would add an <i>ActorManager </i>that helps with messaging by a defined name or class, or you can design actors as singletons and directly access their messaging methods.<br />
<br />
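Such an <i>ActorManager</i> can be as small as a class-keyed registry of singleton actors. Here is a hypothetical sketch (the names and signatures are my assumptions, not code from this article's project):<br />
<pre class="brush: java">
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// hypothetical class-keyed registry that routes messages to singleton actors
public final class ActorManager {

  private static final Map<Class<?>, SimpleActor<Object>> ACTORS = new ConcurrentHashMap<>();

  @SuppressWarnings("unchecked")
  public static void register(Class<?> address, SimpleActor<?> actor) {
    ACTORS.put(address, (SimpleActor<Object>) actor);
  }

  public static void message(Class<?> address, Object msg) {
    SimpleActor<Object> actor = ACTORS.get(address);
    if (actor != null) {
      actor.message(msg);
    }
  }
}
</pre>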
Let's get back to our problems we had with the synchronous and supervised calls and see if we solved them:<br />
<br />
<ul>
<li>Locking complexity</li>
<ul>
<li>Best case: no locking anymore in the service code itself</li>
</ul>
<li>Code clutter</li>
<ul>
<li>Everything looks very clean and tied to the service's functionality</li>
</ul>
<li>Performance</li>
<ul>
<li>Every service can work at its own speed/pace, no polling is involved</li>
<ul>
<li>What if the<b> inbox fills up</b> faster than the messages can be consumed?</li>
<li><b>Is it really faster?</b></li>
</ul>
</ul>
<li>Availability</li>
<ul>
<li>When a service goes down, it is up to the messaging implementation to buffer those messages in a secondary storage so they can be retrieved after a crash. But certainly this is now easier to implement and maintain.</li>
</ul>
</ul>
Seems we have a few open questions that definitely must be addressed by the engineer. To make a good decision you will need architecture knowledge on how the services interact with each other, but in the end it looks like a very nice model for the communication between services.<br />
<br />
<b>But what are the clear disadvantages of this actor model?</b><br />
<br />
Of course <u>there is no silver bullet in this technology</u>. The actor model also has drawbacks; here are a few that I have observed when working with event-driven actor architectures:<br />
<ul>
<li>You have no explicit returns, e.g. if an exception happens you will be notified a long time afterwards via a message that comes back</li>
<li>Debugging is hell if you don't optimize for readability</li>
</ul>
The first bullet point is problematic. Yet another example: what if you want to get a return value for a query that is part of our service functionality? It sounds like a huge detour to send messages when all you could do is call a function. <u>Always keep your goal in mind: </u><br />
Do you want to create a service for functionality? Or do you want to create services that interact with each other? Both are (by definition) service oriented architectures and both can be used in conjunction with each other - choose wisely which one you need.<br />
<br />
The second bullet point is something that will drive developers nuts in their daily lives. When writing an actor model, be sure that your actors are named according to their use case. Nobody wants to send a message without knowing whose inbox it will reach. So <u>make it clear to which destination you're sending a message</u>.<br />
Something that I have employed to mitigate this was to use class names as the address and make all services singletons. This helps to write code like this:<br />
<pre class="brush: java">
// class name based routing
message(Allocator.class, new Allocation(message));
// singleton based routing
message(Allocator.getInstance(), new Allocation(message));
// singleton based direct messaging, NOTE getInstance() is a
// convention, not a defined interface!
Allocator.getInstance().message(new Allocation(message));</pre>
<br />
People working with that will immediately know that they can click on the class entry in their IDE and get to the implementation fast, and they will always know where the message will end up.<br />
Still, the amount of scrolling to be done is too damn high! I hope that the IDEs will soon catch up on those paradigms (especially when lambdas and function pointers come with Java8) and make it easy to navigate to callback/listener methods.<br />
<br />
<br />
So thank you very much for reading, you've definitely won a cookie for reading the whole article.
<h3>Named Entity Recognition in News Articles</h3>
Hello,<br />
<br />
It's been a while since the last post. I was buried in a lot of work and couldn't really get around to writing about named entity recognition in news articles on the weekends. But today we can finally talk about it.<br />
<br />
This post is about a few things:<br />
<ol>
<li>What is named entity recognition?</li>
<li>How do we model it as a machine learning problem?</li>
<li>What features to extract for our learner?</li>
</ol>
<div>
So let's dive in. If you have taken the Natural Language Processing (NLP) class on Coursera, you will already be familiar with the topic and should start with the features we use in the third section.<br />
<br />
<b>What is named entity recognition (NER)? </b><br />
<b><br /></b>
The easiest explanation is: finding word-level concepts like a location or a person in unstructured text. Let's say we have the following snippet of text, shamelessly stolen from Wikipedia:<br />
<blockquote class="tr_bq">
<i>Jim bought 300 shares of Acme Corp. in 2006.</i></blockquote>
</div>
The idea is to tag parts of this sentence with tuples of concepts and their value, such that we get this:<br />
<blockquote class="tr_bq">
<i><b><PERSON, "Jim"></b> bought 300 shares of <b><CORP,"Acme Corp"></b> in 2006.</i></blockquote>
So we detected <i>Jim </i>as a person and the <i>Acme Corp</i> as a corporation in this sentence.<br />
But why do we need this for our news aggregation engine? A very simple assumption: news is about people, what they did and where they did it. A simple example would be:<br />
<blockquote class="tr_bq">
<i>"David Cameron talks in Davos about the EU"</i></blockquote>
The topic is clearly about the person <i>David Cameron</i>, the action of talking and a location, namely <i>Davos </i>in Switzerland. This is needed for our news aggregation engine to cluster topics together, even if their content is slightly different. We will talk about this in one of the following blog posts.<br />
<br />
<b>How do we model this as a machine learning problem?</b><br />
<br />
Basically, it is nothing more than a (multiclass) classification problem: namely classifying whether a token belongs to the class PERSON, LOCATION - or O, which is none of the above.<br />
<br />
The main difference to other NLP tasks is that we need the context of a token, because the meaning of the current token depends on the previous or the following token/class. So let's have a look at another example:<br />
<blockquote class="tr_bq">
<i>It did not, and most of Mr. Cameron’s European partners do not wish to revisit that fundamental question.</i></blockquote>
How do we recognize Cameron in this case? There are two clues that are representative for English. First, the "Mr." is a strong indicator that a name follows; second, the "'s" is a strong indicator that the previous token was a name. Also, prepositions like "of", "to", "in" or "for" are likely indicators for names or locations. The trick that I learned from the NLP class on Coursera was encoding the text as a sequence of unigrams. The previous text would look like this:<br />
<blockquote class="tr_bq">
most O<br />
of O<br />
Mr. O<br />
Cameron PERSON<br />
's O</blockquote>
So what we do is predict the label of the current unigram by looking at the previous and following unigram, and maybe also at the previous label. The reason we need to look at the previous label is that names can be composed of a first name and a surname like <i>David Cameron</i>. So if the last class was a person, in many cases the current class might also be a person.<br />
<br />
So what kind of classifier do we use? I used a self-written version of the <b>Maximum Entropy Markov Model</b> from week four of the NLP class exercise, optimizable with normal Gradient Descent or Conjugate Gradient (or even with the quasi-Newton minimizer supplied in the class). I have also written some utilities to extract sparse feature vectors as conveniently as in the NLP class.<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/ner" target="_blank">You can browse some code in my common repository's NER package</a> and see how it looks like to use with the data supplied from NLP class <a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/ner/NamedEntityRecognitionTest.java" target="_blank">in my testcase</a>.<br />
<br />
<b>What features to extract for our learner?</b><br />
<br />
Features are pretty important, they must cover structural as well as dictionary features. Here is my feature set for dictionary features:<br />
<br />
<ul>
<li>current word</li>
<li>last word</li>
<li>next word</li>
</ul>
<div>
And for structural features:</div>
<div>
<ul>
<li>previous class label</li>
<li>current word upper case start character</li>
<li>last word upper case start character</li>
<li>current word length</li>
<li>current word containing only alphabetic characters (1=true or 0=false)</li>
<li>next word containing only alphabetic characters (1=true or 0=false)</li>
<li>was the last word a preposition</li>
<li>was the last word a special character like dot, comma or question mark</li>
<li>last word ending with "ing"</li>
<li>previous/current/next words POS tag</li>
</ul>
POS tags are pretty important, as nouns are more likely to be a person or a location. Other POS tags in the previous or next word might also point to one of those classes, e.g. verbs are pretty likely to follow a name. All these features are pretty sparse, thus we build a dictionary of all the features we observed during training time.<br />
For example, the feature <i>prevPosTag </i>could have the value <i>"prevPosTag=NN"</i>. Each such feature occurs rarely, as we encode each unigram as its own feature vector, so it totally makes sense to encode them as sparse vectors.<br />
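To make the encoding a bit more concrete, here is a hypothetical sketch of how such feature strings could be collected for a single token before being mapped into a sparse vector (all names are illustrative, this is not the API of my repository):<br />
<pre class="brush: java">
// collects the dictionary/structural feature strings for token position i;
// each distinct string later becomes one dimension of a sparse vector
static List<String> extractFeatures(String[] words, String[] posTags, int i, String prevLabel) {
  List<String> features = new ArrayList<>();
  features.add("word=" + words[i]);
  features.add("prevWord=" + (i > 0 ? words[i - 1] : "BOS"));
  features.add("nextWord=" + (i < words.length - 1 ? words[i + 1] : "EOS"));
  features.add("prevLabel=" + prevLabel);
  features.add("startsUpperCase=" + Character.isUpperCase(words[i].charAt(0)));
  features.add("wordLength=" + words[i].length());
  features.add("prevPosTag=" + (i > 0 ? posTags[i - 1] : "BOS"));
  return features;
}
</pre>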
<br />
Now that we have our text encoded as a list of vectors (a sparse vector for each unigram we observe), we can optimize the vectors and their outcomes by minimizing a conditional likelihood cost function to be used in the Markov Model. This learns conditional probabilities between features and outcomes, describing how likely e.g. the PERSON class is when we observe a given feature - for math lovers this can be written as P( class | features ). I optimized my model for 1k iterations using conjugate gradient and obtained a very low training error of around 5%. To obtain the class for a feature vector we do a Viterbi decoding on the learned weights. The trick here is that you need to encode the feature vector for all the possible classes; only that way can the Viterbi decode the probabilities correctly.<br />
<br />
So yeah, that is basically what named entity recognition is all about. The next blog post will most probably be about how to cluster the news together using the information we gathered in this post.<br />
<br />
Bye!<br />
<br /></div>
<h3>Extracting articles from crawled HTML sites</h3>
So welcome to our second part of <a href="http://codingwiththomas.blogspot.de/2013/01/building-news-aggregation-engine.html">Building a news aggregation engine</a>!<br />
If you don't know how crawlers work, have a look into the last part of the series: <a href="http://codingwiththomas.blogspot.de/2013/01/build-basic-crawler.html">Build a basic crawler</a>.<br />
<br />
This time we will talk about how to get the actual raw content of a site. What we humans see on a news page isn't visible to an algorithm, because it only looks at the code - not the actual rendered page. Therefore I will introduce you to a technique called <b>boilerplate removal</b>.<br />
<br />
<b>What is this "Boilerplate"?</b><br />
<br />
Boilerplate is everything but the content of the page you are looking for. Some people call it "design" or "navigation", and in their first attempts they try to parse websites with XPath expressions to get their content. But that is not necessary and an unusual pain once the design shifts.<br />
<br />
So the lesson learned (like in the last blog post) is: IGNORE. Yes, you have to ignore the parts you don't want to have. That doesn't mean we don't use those parts to make decisions - in fact, we need them to decide whether a block of HTML code belongs to the ignore part or the content part.<br />
Therefore I want to introduce to you the <a href="http://code.google.com/p/boilerpipe/">boilerpipe framework</a>. Boilerpipe is written in Java by Christian Kohlschütter and it is licensed with Apache 2.0.<br />
<br />
It uses a really simple machine learning algorithm (a small decision tree) to classify whether a given block of HTML is content or not. You can read the details in his research paper <a href="http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf">"Boilerplate Detection using Shallow Text Features"</a>, which is a really good read.<br />
In short, it analyses the number of tokens and links in the previous, current and next block of HTML. It is the same idea we will use later in sequence learning when we deal with <b>Named Entity Recognition </b>in order to get people and events from the article.<br />
<br />
In Java, you can use it like this:<br />
<br />
<pre class="brush: java">final BoilerpipeExtractor extractor = ArticleExtractor.getInstance();
String text = extractor.getText(html);
</pre>
<br />
It is as simple as that, and you have the content of a webpage in a string. Of course, this only works well for news articles, since they are shaped really consistently across many news sites- hence the name <b>Article</b>Extractor.<br />
<br />
But in the real world, if we crawl the web, most of the sites we examine with our crawlers won't be news- so the content in the resulting text string might not be a news article. Maybe it is just the imprint or privacy statement that <b>looks </b>like an article, but isn't.<br />
<br />
<b>Classifying News sites</b><br />
<br />
Since I faced this issue while crawling, I had to train some machine learning algorithm on the output of boilerpipe to detect news accurately.<br />
<br />
Here is what I did:<br />
<ul>
<li>Let the crawler run on the top 40 news sites in Germany (hand-seeded list) for 100k sites</li>
<li>Write a small Python application to loop over all files and ask me if it was news or not</li>
<li>After 1k hand-classified items (it was just 1h of work!), train a classifier.</li>
</ul>
I found out that training a Multilayer Perceptron with a single hidden layer gives around 93% accuracy with a very small number of features; since this is enough for my purposes I stopped there. But I believe that you can get a lot more (99% should really be doable) with ensembling and better features.<br />
<br />
But many people don't tell what kind of features they used, so here are mine:<br />
<ul>
<li>Length of the extracted text</li>
<li>URL ends with '/'</li>
<li>Length of the extracted title</li>
<li>Number of '\n' in the extracted text</li>
<li>Text mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)</li>
<li>Title mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)</li>
<li>Number of upper case letters in the text</li>
</ul>
It is a mixture of text-level features that can be expanded and meta features like lengths.<br />
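<br />
If you want to rebuild something like this, here is a minimal sketch of how such a feature vector could be assembled. The method and its layout are made up for illustration- my actual extraction code looks different, but the seven features are exactly the ones from the list above.<br />
<br />
<pre class="brush: java">// hypothetical sketch: build the 7-dimensional feature vector described above
static final String[] IMPRINT_WORDS = { "impressum", "haftung", "agb",
    "datenschutz", "nutzungsbedingungen" };

public static double[] extractFeatures(String url, String title, String text) {
  String lowerText = text.toLowerCase();
  String lowerTitle = title.toLowerCase();
  double textMention = 0d;
  double titleMention = 0d;
  for (String word : IMPRINT_WORDS) {
    if (lowerText.contains(word)) textMention = 1d;
    if (lowerTitle.contains(word)) titleMention = 1d;
  }
  int newLines = 0;
  int upperCase = 0;
  for (int i = 0; i < text.length(); i++) {
    char c = text.charAt(i);
    if (c == '\n') newLines++;
    if (Character.isUpperCase(c)) upperCase++;
  }
  return new double[] { text.length(), url.endsWith("/") ? 1d : 0d,
      title.length(), newLines, textMention, titleMention, upperCase };
}
</pre>
<br />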
I trained a neural net (7-35-1) with sigmoid activations (and 1.0 as regularization) for 10k epochs with Conjugate Gradient. Here are the average results from a 10-fold cross-validation:<br />
<blockquote class="tr_bq">
Accuracy: 0.9259259259259259<br />
Precision: 0.9173553719008265<br />
Recall: 0.9823008849557522<br />
F1 Score: 0.9487179487179487</blockquote>
That is pretty good for such simple methods! And I didn't even use HTML features like boilerpipe does ;-)<br />
<br />
If you want to have some data of my crawl: I have ~1500 classified news articles and 16.5k unclassified- so if you need a bit of German news data for whatever research, let me know via email!<br />
<br />
Congratz! We can now crawl the world wide web and classify news sites very accurately. Our next step will be to develop a named entity recognition engine that allows us to extract the keywords we need from our text in order to group them efficiently.Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com1tag:blogger.com,1999:blog-7181711759016870742.post-86431900542182263922013-01-01T16:37:00.001+01:002013-01-01T16:37:39.199+01:00Build a basic crawlerSo welcome to our first part of <a href="http://codingwiththomas.blogspot.de/2013/01/building-news-aggregation-engine.html">Building a news aggregation engine</a>!<br />
<div>
<br /></div>
<div>
This time we talk about how to build a really simple crawler that crawls some sites for us. </div>
<div>
The (more or less) <a href="http://en.wikipedia.org/wiki/Web_crawler">formal definition of a crawler is stated in Wikipedia</a>:</div>
<blockquote class="tr_bq">
A <b>Web crawler</b> is a computer program that browses the <a href="http://en.wikipedia.org/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a> in a methodical, automated manner or in an orderly fashion.</blockquote>
<div>
The basic workflow looks like that:<br />
<ul>
<li>Seed some URLs and <b>queue </b>them up</li>
<li>Keep a <b>set</b> of the URLs we have already visited</li>
<li>While our <b>queue </b>is not empty (or we reached some maximum amount of sites)</li>
<ul>
<li>Get the first URL from the <b>queue</b>, put it into the visited <b>set</b></li>
<li>Query the URL, obtain some HTML</li>
<li>Extract new URLs from the HTML and <b>queue </b>them up if they are not in the visited <b>set </b>yet</li>
</ul>
</ul>
<div>
A small Java program could look like this:</div>
<div>
<br /></div>
<pre class="brush: java">String[] seedUrl = new String[] { "http://www.wikipedia.com/" };
final Deque<String> linksToCrawl = new ArrayDeque<>();
final HashSet<String> visited = new HashSet<>();
linksToCrawl.addAll(Arrays.asList(seedUrl));
visited.addAll(Arrays.asList(seedUrl));
while (fetches < 100 && !linksToCrawl.isEmpty()) {
  String urlToCrawl = linksToCrawl.poll();
  // open a connection and parse HTML
  ...
  // loop over all links we found on that page
  for (String outlink : extractedResult.outlinks) {
    if (visited.add(outlink))
      linksToCrawl.add(outlink);
  }
  fetches++;
}
</pre>
<div>
<br /></div>
It looks really simple, but tell you what? It is more difficult than it looks.<br />
Once you start with it, you wish you never had- the web is ugly. I'm working in the backend team at work and am surrounded by a lot of garbage from various data sources, but the web is a whole new level. Here is just a small excerpt of what you need to deal with:<br />
<ul>
<li>Encoding issues (we will fix them later on in this post)</li>
<li>Link expansions (relative vs. absolute URLs vs. JavaScript URLs like void(0); ) </li>
<li>Not parsable stuff like videos or images</li>
</ul>
<div>
So for example, how do you deal with data you can't handle (which doesn't contain HTML)? You IGNORE it. For this purpose I've put together a bunch of suffixes that can occur in links; they guard against running into non-parsable binary data:<br />
<br /></div>
<pre class="brush: java">Pattern IGNORE_SUFFIX_PATTERN = Pattern
.compile(".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf|iso|rm|smil|wmv|swf|wma|zip|rar|gz))$");</pre>
<br />
So as you can see, I'm guarding against pretty much anything here. Of course these suffixes are completely useless if the link does not contain a filetype suffix at all. In that case, you will need to look at the stream and check for an html or body tag to verify it is really a website (which is the worst case, because you're wasting bandwidth and time the crawler could use to do something else).<br />
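<br />
A minimal sketch of that fallback check could look like the following- the buffer size and the tags I check for are my own assumptions, and a production crawler would of course also respect the Content-Type header and push the peeked bytes back (e.g. via mark/reset or a PushbackInputStream), since this read consumes them:<br />
<br />
<pre class="brush: java">// hypothetical sketch: peek at the first few KB and look for an html/body tag
public static boolean looksLikeHtml(InputStream stream) throws IOException {
  byte[] peek = new byte[4096];
  int read = stream.read(peek);
  if (read <= 0) {
    return false;
  }
  // decode with a single-byte charset just for the tag check
  String head = new String(peek, 0, read, "ISO-8859-1").toLowerCase();
  return head.contains("<html") || head.contains("<body");
}
</pre>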
<br />
Something that bugged me for quite a while were encoding issues. As a German, I can tell you that umlauts like <i>öäüß </i>are completely garbled if you read them with the wrong encoding. So most of the time, German news looked really bad and you could throw them directly into the next trash bin.<br />
<br />
I ran across a project of the Mozilla foundation called <b>universalchardet </b>(abbrev. for universal charset detector) and <a href="http://code.google.com/p/juniversalchardet/">its Java descendant called <b>juniversalchardet</b></a>. It detects encodings with really high accuracy and helps you get the content of your crawl right, just like you would see it in the browser.<br />
<br />
In Java you have to obtain the site via streams, so let me show you a small example of <b>juniversalchardet </b>and how to read a stream into a string of HTML with NIO.<br />
<br />
<pre class="brush: java">String someURLAsString = "http://www.facebook.com";
URL url = new URL(someURLAsString);
InputStream stream = url.openStream();
String html = consumeStream(stream);

// the helper methods
public static String consumeStream(InputStream stream) throws IOException {
  try {
    // setup the universal detector for charsets
    UniversalDetector detector = new UniversalDetector(null);
    ReadableByteChannel bc = Channels.newChannel(stream);
    // allocate a byte buffer of BUFFER_SIZE size
    // 1mb is enough for every usual webpage
    ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
    int read = 0;
    while ((read = bc.read(buffer)) != -1) {
      // let the detector work on the downloaded chunk
      detector.handleData(buffer.array(), buffer.position() - read, read);
      // check if we found a larger site, then resize the buffer
      buffer = resizeBuffer(buffer);
    }
    // finish the detection sequence
    detector.dataEnd();
    // obtain the detected encoding, if null fall back to UTF-8
    String encoding = detector.getDetectedCharset();
    // copy the buffer contents back into a string
    return new String(buffer.array(), 0, buffer.position(),
        encoding == null ? "UTF-8" : encoding);
  } finally {
    if (stream != null) {
      stream.close();
    }
  }
}
// basic resize operation when 90% of the buffer is occupied</pre>
<pre class="brush: java">// simply double the size and copy the buffer
private static ByteBuffer resizeBuffer(ByteBuffer buffer) {
  ByteBuffer result = buffer;
  // double the size if we have only 10% capacity left
  if (buffer.remaining() < (int) (buffer.capacity() * 0.1f)) {
    result = ByteBuffer.allocate(buffer.capacity() * 2);
    buffer.flip();
    result.put(buffer);
  }
  return result;
}
</pre>
<br />
That is actually everything you need to know about getting HTML from a raw URL.<br />
<br />
<b>But, how do you extract the outlinks from a HTML page?</b><br />
<b><br /></b>
Many of you will now go ahead and say: let's compile some RegEx. You will <b>FAIL</b>.<br />
As a computer scientist it is enough to say that HTML is a context-free grammar (Chomsky type 2), while RegEx can only handle regular languages (type 3). Type 2 languages are way more complex and can't be parsed with a regular expression. So please have a look at the <a href="http://stackoverflow.com/a/1732454/540873">funny rage answer on Stack Overflow</a> or read the other very informative answers below it to learn why you shouldn't do this. Don't get me wrong: you will find URLs that you can parse with RegEx, but I don't think it is worth the stress. I always use the <a href="http://htmlparser.sourceforge.net/">htmlparser on SourceForge</a>; it is clean, well tested and pretty fast.<br />
<br />
To end this post, let me show you how to extract the outlinks from an HTML page given as a string:<br />
<br />
<pre class="brush: java">static final NodeFilter LINK_FILTER = new NodeClassFilter(
    LinkTag.class);

Parser parser = new Parser(html);
NodeList matches = parser.extractAllNodesThatMatch(LINK_FILTER);
SimpleNodeIterator it = matches.elements();
while (it.hasMoreNodes()) {
  LinkTag node = (LinkTag) it.nextNode();
  String link = node.getLink().trim();
  // now expand for relative urls and store somewhere
}
</pre>
<br />
It is as simple as that. How the expansion of relative URLs is done is another part of the story- but I leave that up to you ;-) <a href="http://docs.oracle.com/javase/7/docs/api/java/net/URI.html">Java's URI may help you with that.</a><br />
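<br />
As a small hint of what I mean, here is a minimal sketch using java.net.URI to resolve a relative outlink against the URL of the page it was found on (the URLs are made up, and error handling is left to you):<br />
<br />
<pre class="brush: java">// resolve a relative outlink against the page it was found on
URI base = URI.create("http://www.example.com/news/article.html");
String outlink = "../sports/item.html";
String absolute = base.resolve(outlink).toString();
// absolute is now "http://www.example.com/sports/item.html"
</pre>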
<br />
So thanks for attending, my next post is about how to extract actual text content (news) from pure HTML code.</div>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-10523700337055609352013-01-01T15:34:00.000+01:002013-01-01T15:34:46.081+01:00Building a news aggregation engineHey all,<br />
<br />
first off- happy new year!<br />
It has been a few months now since my last blog post. It wasn't just the stress (open source projects, work, study) that prevented me from writing, but more that I hadn't found a really good topic to blog about.<br />
<br />
Personally, what I always found interesting are these news aggregation sites. You know them: <a href="https://news.google.com/">Google News</a>, <a href="http://news.yahoo.com/">Yahoo News</a> etc. They crawl the web, get fresh articles about what happens in the world, group them together and present them. They also require a lot of machine learning and natural language processing- topics that most of my blog posts are already about.<br />
<br />
Since my last posts were more about distributing such algorithms, I want to focus a lot more on the application of several algorithms and ideas in order to build up such a news aggregation engine. I know this is a topic you could write several books about, but I think we can build a working application within three to five blog posts. My personal goal is a small Java application that you just start with a seed of a few URLs, and it groups the incoming pages in realtime- so you can watch the result in an embedded Jetty web application by constantly refreshing the site. I currently have some simple parts of it clamped together, but they don't act as a single application yet, so I will rewrite this for the public ;-)<br />
<br />
Here is a rough outline of the topics we need to cover in order to plug the system together:<br />
<br />
<ol>
<li>Crawling and extraction of news articles</li>
<ol>
<li>Detect text encodings while crawling (otherwise you'll get low quality results)</li>
<li>Article Classification (you have to detect that a site contains an article)</li>
<li>Boilerplate Removal (you don't care about the design, so remove it!)</li>
<li>Page Deduplication (Archives or mirrors host the same article, we want to sort those out fast)</li>
</ol>
<li>Topic modelling of news articles</li>
<ol>
<li>Named Entity Recognition (News are about people and events, so we have to detect them)</li>
</ol>
<li>Grouping News</li>
<ol>
<li>Hierarchical clustering with realtime extension</li>
</ol>
</ol>
<div>
<br /></div>
<div>
We will use frameworks for some parts of the task, and some coursework from online courses, but most things are written by myself. </div>
<div>
<br /></div>
<div>
So stay tuned, I will start with basic crawling in the next post and how to detect encodings efficiently.</div>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com1tag:blogger.com,1999:blog-7181711759016870742.post-84477571054306433912012-10-06T23:12:00.001+02:002012-10-06T23:24:52.988+02:00Java XOR Swap PerformanceHey,<br />
just a quick post, because I was polishing my min heap and looked into different swap techniques.
<br />
<br />
From my experience I know there are two different ways to swap two integers in an array (both are shown in the small snippet below).
<br />
<br />
<ol>
<li>XOR swap algorithm</li>
<li>Swap using temporary variable</li>
</ol>
<div>
<br /></div>
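<div>
To make clear what these two versions look like in isolation, here is a small snippet- the method names are just mine for illustration (and beware that the XOR variant silently zeroes the slot if both indices are the same):</div>
<div>
<br /></div>
<pre class="brush: java">// variant 1: XOR swap, no temporary variable needed
static void xorSwap(int[] a, int i, int j) {
  a[i] ^= a[j];
  a[j] ^= a[i];
  a[i] ^= a[j];
}

// variant 2: swap using a temporary variable
static void tmpSwap(int[] a, int i, int j) {
  int tmp = a[i];
  a[i] = a[j];
  a[j] = tmp;
}
</pre>
<div>
<br /></div>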
<div>
Since swapping is used pretty much everywhere, I decided to micro benchmark these against each other using <a href="http://code.google.com/p/caliper/">Caliper</a>.<br />
<br />
I've seen many people using the XOR algorithm lately, and I don't know if they are aware that it is inefficient. Many people seem not to care about this, probably because bit shifting makes them look kinda cool?</div>
<div>
<br /></div>
<div>
However, here is my code, so we can sort out the performance of both pretty easily:</div>
<div>
<br /></div>
<div>
<br /></div>
<pre class="brush: java">public class SwapBenchmark extends SimpleBenchmark {

  @Param({ "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  private int size;

  @Param
  SwapType type;

  public enum SwapType {
    XOR, TMP
  };

  int[] array;

  @Override
  protected void setUp() throws Exception {
    array = new int[size];
    Random r = new Random();
    for (int i = 0; i < size; i++) {
      array[i] = r.nextInt();
    }
  }

  public void timeSwap(int reps) {
    for (int rep = 0; rep < reps; rep++) {
      int sum = 0;
      for (int i = 0; i < size; i++) {
        final int x = i;
        final int y = size - i - 1;
        if (type == SwapType.XOR) {
          array[x] ^= array[y];
          array[x] ^= (array[y] ^= array[x]);
          sum += i;
        } else {
          int tmpIndex = array[x];
          array[x] = array[y];
          array[y] = tmpIndex;
          sum += i;
        }
      }
      System.out.println(sum);
    }
  }

  public static void main(String[] args) {
    Runner.main(SwapBenchmark.class, args);
  }
}
</pre>
<div>
<br /></div>
So basically I'm swapping n-times in an n-size array, where n ranges from ten to 100,000,000 which is really big.<br />
<br />
The result isn't very exciting; I used my Nehalem i7 at 3.3GHz and the latest Java 7u7:<br />
<br />
<pre> size type us linear runtime
10 XOR 2,90 =
10 TMP 2,84 =
100 XOR 3,06 =
100 TMP 2,85 =
1000 XOR 3,92 =
1000 TMP 3,52 =
10000 XOR 20,21 =
10000 TMP 14,22 =
100000 XOR 183,33 =
100000 TMP 118,57 =
1000000 XOR 1822,98 =
1000000 TMP 1192,19 =
10000000 XOR 19401,65 ==
10000000 TMP 13266,78 ==
100000000 XOR 194173,73 ==============================
100000000 TMP 134364,67 ====================
</pre>
<div>
<br /></div>
<div>
The TMP swap is much more efficient in every case. Why is this?<br />
<a href="http://en.wikipedia.org/wiki/XOR_swap_algorithm">Wikipedia states the following:</a><br />
<br />
<blockquote class="tr_bq">
On modern CPU architectures, the XOR technique is considerably slower than using a temporary variable to do swapping. One reason is that modern CPUs strive to execute instructions in parallel via <a href="http://en.wikipedia.org/wiki/Instruction_pipeline" title="Instruction pipeline">instruction pipelines</a>. In the XOR technique, the inputs to each operation depend on the results of the previous operation, so they must be executed in strictly sequential order. If efficiency is of tremendous concern, it is advised to test the speeds of both the XOR technique and temporary variable swapping on the target architecture.</blockquote>
<div>
<br />
There is actually nothing more to add. <b>So please don't do any premature optimization!</b><br />
Even if it looks cool to do a bit-shifting trick ;)</div>
</div>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-28333807239748582272012-09-01T21:13:00.000+02:002012-09-02T11:11:29.850+02:00Particle Swarm Optimization<br />
Hey there,<br />
<br />
I recently started to attend an artificial intelligence lecture (a real course, not online) and we got a small intro to <a href="http://ccl.northwestern.edu/netlogo/">NetLogo</a> yesterday. Although I'm not a real fan of it (mainly because I was tortured with LOGO as a kid- you don't know how painful it is if you already know C++), I liked the idea of plotting agents and scripting all kinds of logic in a compressed way.<br />
<br />
So I came across an example in their models library called "Particle Swarm Optimization" (PSO), which was really amazing. Here is a small picture of it:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://ccl.northwestern.edu/netlogo/models/models/Sample%20Models/Computer%20Science/Particle%20Swarm%20Optimization.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="http://ccl.northwestern.edu/netlogo/models/models/Sample%20Models/Computer%20Science/Particle%20Swarm%20Optimization.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Particle Swarm Optimization plot <a href="http://ccl.northwestern.edu/netlogo/models/ParticleSwarmOptimization">Source</a> </td></tr>
</tbody></table>
<br />
You can see a lot of colored particles (or agents) and a terrain that is greyscaled according to its height. Your job is to find a pretty good minimum (whitened areas) in this terrain via swarm intelligence.<br />
<br />
A short summary: if you have a function and want to minimize its cost over a given parameter set "Theta", you can use the PSO algorithm to find a good minimum of that function. You can compare this for example with <a href="http://en.wikipedia.org/wiki/Gradient_descent">Gradient Descent</a> or <a href="http://codingwiththomas.blogspot.de/2012/02/nonlinear-conjugate-gradient-method-in.html">Conjugate Gradient</a>, but there is a subtle difference between them: PSO needs <b>no </b>differentiable function.<br />
<br />
<b>Intuition</b><br />
<b><br /></b>
Of course I will provide you with the basic intuition of this algorithm. As I said, there is a certain number of agents in this algorithm. They are initialized uniformly distributed across our search space; ideally you want to fill the map to get a good spread of your agents, so your chance of finding a global minimum is higher.<br />
<br />
For a fixed number of iterations, you change the position of each particle in a randomized way, weighted by three objectives which are the parameters of this meta-optimizer.<br />
<ul>
<li>Alpha (in literature also called φ<sub style="line-height: 1em;">p </sub>/ Phi_personal) is the weight of an agent following its personal memory (based on prior found minimas)</li>
<li>Beta (in literature also called
φ<sub style="background-color: white; font-family: sans-serif; line-height: 1em;">g </sub>/ Phi_global) is the weight of an agent following the global memory of the swarm (based on the currently best minimum from the swarm)</li>
<li>Phi (in literature also called ω / Omega) is the inertia/velocity of an agent.</li>
</ul>
<br />
From the parameters you can tell that each particle has its own memory (the best personally found solution); there is also a swarm memory, which is basically just a record of the currently best solution found by all particles.<br />
<br />
<b>Math</b><br />
<br />
To understand the weighted movement of the particles, we have to have a look at the formula:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNurfQrUoXUUJmbgRfpBaGVS4R0P98SwGd3lutEwwwmWQHIj6HDuYiGpxFNeedechcdyyldhvbhzvBtMrJ71l6HQfa8omFBMwSX5kayh2mVYSwrQTFZChOvKAy4LE6NnVPLhkwYfxakSU/s1600/PSO_formula.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNurfQrUoXUUJmbgRfpBaGVS4R0P98SwGd3lutEwwwmWQHIj6HDuYiGpxFNeedechcdyyldhvbhzvBtMrJ71l6HQfa8omFBMwSX5kayh2mVYSwrQTFZChOvKAy4LE6NnVPLhkwYfxakSU/s1600/PSO_formula.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Formula to calculate the new position of a particle, based on the current position "v"</td></tr>
</tbody></table>
You can see that to calculate the movement of a particle, you have a weighted, randomized linear combination of the above parameters. Non-mathematically interpreted, the agent is on the move and it is attracted to either its personal best solution or the global best solution. By setting the parameters you can fine-tune the behaviour of each particle in this regard.<br />
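<br />
To make that linear combination concrete, here is a minimal one-dimensional sketch of the update step in the standard PSO formulation. The names follow the parameter list above, but this is not the code from my library:<br />
<br />
<pre class="brush: java">// one PSO update step for a single particle in one dimension:
// inertia term + pull towards the personal best + pull towards the global best
static double[] updateParticle(double position, double velocity,
    double personalBest, double globalBest,
    double alpha, double beta, double phi, java.util.Random random) {
  double newVelocity = phi * velocity
      + alpha * random.nextDouble() * (personalBest - position)
      + beta * random.nextDouble() * (globalBest - position);
  double newPosition = position + newVelocity;
  return new double[] { newPosition, newVelocity };
}
</pre>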
<br />
<b>Algorithm</b><br />
<br />
The algorithm is quite easy, because there is some really good <a href="http://en.wikipedia.org/wiki/Particle_swarm_optimization#Algorithm">pseudo code on Wikipedia</a>.<br />
However here is my small summary of it:<br />
<br />
<ul>
<li>Initialize...</li>
<ul>
<li>the particle positions to some random positions in our search space
</li>
<li>the memories of the agents and the swarm memory</li>
</ul>
<li>In each iteration, until we hit our maximum number of iterations, do...</li>
<ul>
<li>loop over all particles</li>
<li>calculate the new position of a particle</li>
<li>check the cost of this position; if it is smaller than the particle's memory then update it, and do the same update with the global memory if it is smaller.</li>
</ul>
<li>Profit!</li>
</ul>
<div>
This is a nice and easy algorithm. To further visualize it, I have found a very good YouTube video about it, implemented with MATLAB: <a href="http://www.youtube.com/watch?v=bbbUwyMr1W8">http://www.youtube.com/watch?v=bbbUwyMr1W8</a></div>
<div>
There you can see the movement of the particles very well for a number of functions.</div>
<div>
<br /></div>
<div>
<b>In Java</b></div>
<div>
<br /></div>
<div>
Of course I have done this in Java with my math library. So let's dive into my normal testcase of the function </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikT5ujnIiGaw46_wtSoMxKJMS2QcUXePKuBtv7VRsJyT1ViV5zOuocxCMvgiXqA_8vgka4cAf-3FHhwsq_OkF7nbBhr0eOzhscNbHncYvpc-IzlKVAj6cyksdlUNV6zwb9CuYCLIwL8Vw/s1600/chart.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikT5ujnIiGaw46_wtSoMxKJMS2QcUXePKuBtv7VRsJyT1ViV5zOuocxCMvgiXqA_8vgka4cAf-3FHhwsq_OkF7nbBhr0eOzhscNbHncYvpc-IzlKVAj6cyksdlUNV6zwb9CuYCLIwL8Vw/s1600/chart.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This is a good testcase, because it is convex, so it has a single global minimum that can be tested well.</div>
<div class="separator" style="clear: both; text-align: left;">
A nice looking plot of it:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKQcJHFTip1Ostqw7Tsc_JbQIvsLtGo19gy6sZ_FCIsquvjDtNhM-ATaErleyUkGZUmoKCcR3x7bcPj-ICTq0UATVswzWun0-uWstA1l2BAiYs0TrwPKS-NkfoAYeF_oRFklXwXGlrrqk/s1600/plot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="162" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKQcJHFTip1Ostqw7Tsc_JbQIvsLtGo19gy6sZ_FCIsquvjDtNhM-ATaErleyUkGZUmoKCcR3x7bcPj-ICTq0UATVswzWun0-uWstA1l2BAiYs0TrwPKS-NkfoAYeF_oRFklXwXGlrrqk/s320/plot.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Convex function f(x,y) = x^2 + y^2</td></tr>
</tbody></table>
<div>
Let's sneak into the Java code:<br />
<br /></div>
<pre class="brush: java">DoubleVector start = new DenseDoubleVector(new double[] { 2, 5 });

// our function is f(x,y) = x^2+y^2
CostFunction inlineFunction = new CostFunction() {
  @Override
  public Tuple<Double, DoubleVector> evaluateCost(DoubleVector input) {
    double cost = Math.pow(input.get(0), 2) + Math.pow(input.get(1), 2);
    return new Tuple<Double, DoubleVector>(cost, null);
  }
};

DoubleVector theta = ParticleSwarmOptimization.minimizeFunction(
    inlineFunction, start, 1000, 0.1, 0.2, 0.4, 100, false);
// retrieve the optimal parameters from theta
</pre>
We run 1000 particles for 100 iterations, with a weight of 0.1 on the personal best solution, 0.2 on the swarm solution and 0.4 for its own exploration. That's basically the usage of the algorithm; you can see that we don't need a gradient in this computation (it is nulled). Theta in this case should contain values close to zero.<br />
<br />
Of course you can find the source code here:<br />
<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/ParticleSwarmOptimization.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/ParticleSwarmOptimization.java</a><br />
<br />
And the testcase above here:<br />
<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/math/minimize/ParticleSwarmOptimizationTest.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/math/minimize/ParticleSwarmOptimizationTest.java</a>
<br />
<br />
And BTW, I just trained an XOR neural network with a single hidden layer with it. This is a very, very cool algorithm ;)<br />
A small parameterization tip is to set alpha to 2.8 so it does not fall into local minima too fast (beta was 0.2 and phi 0.4).<br />
You can have a look how it works in my testcase of the MultiLayerPerceptron:<br />
<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/classification/nn/MultiLayerPerceptronTest.java#L40">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/classification/nn/MultiLayerPerceptronTest.java#L40</a>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-77434678680987545372012-08-04T12:16:00.000+02:002012-08-06T09:58:07.437+02:00Creating a semantic graph<br />
<blockquote class="tr_bq">
This is a small follow-up post to the similarity aggregation (short SEISA, set expansion by iterative similarity aggregation) post before; it is just a small idea about creating a graph from data collected by this algorithm. So don't expect the master plan for creating a semantic graph from your data- it is just a brainstorming post.</blockquote>
<br />
I was very much into semantic relations in the last few months and I think the best way to express them is a graph.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7TOacL1M5yFI-atF2knH3ican90msWPMKFb2nZJZ2yGjfWhf3jsXisHzUoMfj60iWnmdK5CQWhpRkxTiI-9C_1mt8IbF9krd7UFlHwB5FxLYAA94oYDO1Fz9fNS5EKd_fB8LZptgltwo/s1600/semanticgraph.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7TOacL1M5yFI-atF2knH3ican90msWPMKFb2nZJZ2yGjfWhf3jsXisHzUoMfj60iWnmdK5CQWhpRkxTiI-9C_1mt8IbF9krd7UFlHwB5FxLYAA94oYDO1Fz9fNS5EKd_fB8LZptgltwo/s320/semanticgraph.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Semantic graph arround Canon</td></tr>
</tbody></table>
<br />
<br />
You may be reminded of the <a href="http://www.google.com/insidesearch/features/search/knowledge.html">Knowledge Graph by Google</a>- yes, it actually was inspired by it. However, it is much more computationally heavy to use it (and even build it) than you might think at first.<br />
<br />
In any case, an algorithm that builds this can be seeded from the SEISA algorithm. Think of a generic starting point in Wikipedia that reads the facts about <a href="http://en.wikipedia.org/wiki/Gau%C3%9F">Carl Friedrich Gauss</a>; keywords like "mathematician" will lead to other mathematicians very fast.<br />
<br />
Google is (MAYBE) mainly using the semantic annotations to create a relation between the nodes, which is basically a depth-first search (because you get a lot of information through Gauß), whereas the SEISA information could yield broader results, like other mathematicians of his century such as Riemann or Poincaré.<br />
<br />
Creating this semantic graph for (at least) products, brands and other product-related features will be my personal ambition for the next few years. Of course Google aims at the most generic way; they have the computational resources to do it for the whole web.<br />
<br />
Another problem they have is actually how to cut off information to be valuable to the user. Not everybody searching for "Gauß" is interested in his whole life story or other mathematicians, but maybe more into the algorithm for solving equations.<br />
<br />
Also, understanding queries and parsing the graph accordingly is another interesting new research field.<br />
For example, in the above graph around Canon: what if a user wants to know about other brands LIKE Canon (in SQL for example: SELECT brands LIKE "Canon")? In this case, an engine would traverse the graph breadth-first, starting at Canon, to find nodes with the attribute "Brand". It would find their brands PIXMA and EOS. Distance is also a very important topic in this relation: how far away are the relevant nodes from their starting point?<br />
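<br />
A minimal sketch of such a breadth-first traversal could look like this- the Node class, the attribute check and the distance cut-off are of course made up for illustration, a real engine would need edge weights and an actual graph store (imports from java.util are omitted):<br />
<br />
<pre class="brush: java">// hypothetical node of the semantic graph
static class Node {
  String name;
  String attribute; // e.g. "Brand"
  List<Node> neighbours = new ArrayList<>();
}

// breadth-first search from a start node, collecting nodes with a given attribute
static List<Node> findByAttribute(Node start, String attribute, int maxDistance) {
  List<Node> result = new ArrayList<>();
  Set<Node> visited = new HashSet<>();
  Deque<Node> queue = new ArrayDeque<>();
  queue.add(start);
  visited.add(start);
  int distance = 0;
  while (!queue.isEmpty() && distance <= maxDistance) {
    int levelSize = queue.size();
    for (int i = 0; i < levelSize; i++) {
      Node current = queue.poll();
      if (current != start && attribute.equals(current.attribute)) {
        result.add(current);
      }
      for (Node neighbour : current.neighbours) {
        if (visited.add(neighbour)) {
          queue.add(neighbour);
        }
      }
    }
    distance++;
  }
  return result;
}
</pre>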
<br />
Frameworks for building and processing such a graph are currently a very important aspect. There are a few helpful ones:<br />
<ul>
<li>Graph databases, e.G. Titan
<a href="https://github.com/thinkaurelius/titan">https://github.com/thinkaurelius/titan</a></li>
<li>Graph only focused processing engines, e.G. Giraph
<a href="http://giraph.apache.org/">http://giraph.apache.org/</a></li>
<li>Of course Hama, to use sophisticated machine learning or mathematical algorithms that can't be expressed as a graph
<a href="http://hama.apache.org/">http://hama.apache.org/</a></li>
<li>Good old Hadoop MapReduce is of course a help for various data crunchings as well.</li>
</ul>
<br />
Being part of that is one of the things I enjoy most.<br />
<br />
We will see how far Google integrates the graph into their search, it will be a driving factor if they manage to query a large scale graph successfully and within milliseconds.Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com1tag:blogger.com,1999:blog-7181711759016870742.post-61806850798550093812012-08-04T11:47:00.004+02:002012-08-04T12:17:14.459+02:00Set Expansion by Iterative Similarity AggregationHey all,<br />
<br />
been a very long time since my last post. I was very busy with <a href="http://hama.apache.org/" target="_blank">Apache Hama</a>: we have graduated from the Apache Incubator, are now a top-level project and of course made another release, 0.5.0. I think this was just the beginning of a rising open source project and I am really excited to be a part of it.<br />
However, responsibility comes with spending a lot of time, and time is always scarce, especially if you are young and have a lot of stuff to do.<br />
<br />
So blog posts like this will be rarer in the future, but today that should not matter and I will tell you about a semi-supervised learning algorithm with the very phonetic name SEISA.<br />
SEISA itself is an acronym for "Set Expansion by Iterative Similarity Aggregation"- not an invention of mine, but rather a wrap-up and implementation of the <a href="http://pages.cs.wisc.edu/~heyeye/paper/Set-expansion.pdf" target="_blank">paper I found here</a> by Yeye He and Dong Xin.<br />
<br />
<b>What is Set Expansion?</b><br />
<br />
Set expansion is used to find words similar to an a priori given list of items in a given "artificial environment". For example, you want to find additional digital camera brands for a given seed set like {Canon, Nikon}, and the algorithm should provide you with additional brands like Olympus, Kodak, Sony and/or Leica.<br />
"Artificial environment" is actually a generic term for a context that you have to provide. In the referenced paper it is mainly the internet (crawled pages) and search query logs.<br />
<br />
In a real-world scenario, set expansion might be used to create a semantic database of related items, improve the relevance of search queries, boost the performance of other named entity recognition algorithms, or maybe also for SEO things (dear SEOs, I won't help you with that, even if you pay me a lot of $$$).<br />
<br />
<b>Data Model</b><br />
<br />
The main idea behind the algorithm is the data modelling. How do we represent this "artificial environment" (I will call it context from now on) I talked about above?<br />
The idea is, like in any sophisticated algorithm: as a graph! Actually as a so-called <a href="http://en.wikipedia.org/wiki/Bipartite_graph" target="_blank">bipartite graph</a>.<br />
<span style="background-color: white;"><br /></span><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiBn3yyc2jagH3a4CpoRl8map36sUOO8q6gyM0obIycpDcisjMjM8aFghST1jN4LjyKK7PkzNJyvuTxHvwzotUQ5IVq1BebiwLmoQuixN4XvF9zJbXvsZJPfwSwhaJNi7LkJjZKPvN6wM/s1600/seisa.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="281" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiBn3yyc2jagH3a4CpoRl8map36sUOO8q6gyM0obIycpDcisjMjM8aFghST1jN4LjyKK7PkzNJyvuTxHvwzotUQ5IVq1BebiwLmoQuixN4XvF9zJbXvsZJPfwSwhaJNi7LkJjZKPvN6wM/s320/seisa.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of a bipartite graph from list data</td></tr>
</tbody></table>
<span style="background-color: white;"><br /></span><br />
On the left side of this bipartite graph are the so-called "candidate terms"; the seed set we provide should be a subset of them. The nodes on the right are called the context nodes; they represent the link between the candidate terms. In the case of the paper, they have crawled the internet for <a href="http://www.w3schools.com/html/html_lists.asp" target="_blank">list items</a> and created an edge to a candidate term if it exists in a given list. The neighbours of a context node are the actual list items in this case. In the case of the web, the picture should also contain words that are not digital camera brands- also think about this simple rule: "more data wins". As in every normal graph, you can also weight the edges; for example, if you trust the list data from Wikipedia more than that of Craigslist, then you should assign a higher weight to the edges between those context nodes and their candidate term nodes.<br />
<span style="background-color: white;"><br /></span><br />
I have implemented it with a String array and a DoubleMatrix, where the string array represents the term nodes and the matrix (NxM, where N is the number of entity tokens and M is the number of context nodes) holds the weights of the edges. In my experience this matrix is always very sparse, even on very small datasets.<br />
<span style="background-color: white;"><br /></span><br />
<b>Iterative Similarity Aggregation (the algorithm)</b><br />
<br />
Time to describe the algorithm, now that we know how we represent the data.<br />
It is rather simple to explain and I don't know why they complicated it so much in their paper.<br />
<br />
There are three main methods: first the relevance score computation, then the method that filters out newly top-ranked items and finally the ranking function.<br />
<span style="background-color: white;"><br /></span><br />
The relevance score computation works like this:<br />
<br />
<ul>
<li>Loop over all the terms in your matrix while looping over all terms in your seedset</li>
<li>Get the column vector of your seed element and the column vector of the candidate term</li>
<li>Measure the similarity between both and sum this similarity over all seedset terms, in the paper it is normalized by the length of the seedset, so it is actually the average similarity</li>
</ul>
<div>
What you now get is a vector (its length is the number of term nodes you have), where each index holds the average relevance score to the seed set tokens.</div>
<div>
<br /></div>
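<div>
To make this more concrete, here is a minimal sketch of the relevance score computation. I'm assuming matrix[term] is the context vector (the edge weights) of that term and I'm using cosine similarity as an example measure- this is not the code from my repository:</div>
<div>
<br /></div>
<pre class="brush: java">// minimal sketch: for every candidate term, average the similarity between its
// context vector and the context vectors of the seed terms
static double[] relevanceScores(double[][] matrix, int[] seedIndices) {
  double[] scores = new double[matrix.length];
  for (int term = 0; term < matrix.length; term++) {
    double sum = 0d;
    for (int seed : seedIndices) {
      sum += cosineSimilarity(matrix[term], matrix[seed]);
    }
    // normalize by the seed set length, so it is the average similarity
    scores[term] = sum / seedIndices.length;
  }
  return scores;
}

static double cosineSimilarity(double[] a, double[] b) {
  double dot = 0d, normA = 0d, normB = 0d;
  for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-10);
}
</pre>
<div>
<br /></div>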
<div>
Of course you want to pick the highest-relevance nodes of this vector; that's where the static thresholding comes in. You pass the algorithm a similarity threshold, and this threshold affects how fast the algorithm converges to a certain solution and how many false positives it will have in the end. Of course a higher threshold will yield fewer false positives and faster convergence, while the exact opposite will yield the other extreme. The truth (as always) lies somewhere in between, so it is up to you to choose this correctly.</div>
<div>
Since the algorithm always converges quite fast, you can experiment with other thresholds and see how they affect your output. So we filter by the threshold and get a number of highly relevant new candidate terms.</div>
<div>
<br /></div>
<div>
Now we can enter the main loop and compute the relevance score again for the newly found relevant candidates (this is now called the similarity score). Based on this we are going to rank these two scores with some simple vector multiplications:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinTV9r2LLZFRp8BsgDAUoNQ8moRe_SBXgodiAHwzgk9Wdl1suAr9ZWn8xK1NCnQ1f7f4nNOfezS6zS31GqN4RfpIFj8KT4KbMti4HFECeVP-0oG9ZatQfuJ3KMu9ijBjPP5tP6nQ8wYfM/s1600/ranks.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="26" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinTV9r2LLZFRp8BsgDAUoNQ8moRe_SBXgodiAHwzgk9Wdl1suAr9ZWn8xK1NCnQ1f7f4nNOfezS6zS31GqN4RfpIFj8KT4KbMti4HFECeVP-0oG9ZatQfuJ3KMu9ijBjPP5tP6nQ8wYfM/s400/ranks.JPG" width="400" /></a></div>
<div>
<br /></div>
<div>
So where does this alpha come from? That is what the paper names the "weighting factor"; as you can see it is used to weight the scores. It is mainly used for the convergence of the algorithm- they say that halving the scores (alpha = 0.5) yields the best result.</div>
<div>
<br /></div>
<div>
Once we have the ranked scores, we can filter again by the similarity threshold and get our new candidate terms for the next iteration. </div>
<div>
<br /></div>
<div>
In each iteration we do:</div>
<div>
<ul>
<li>Compute similarity score by newly found candidate terms (in the beginning this is the seed set)</li>
<li>Rank the calculated similarity score against the newly found candidate terms</li>
<li>Apply the static threshold and get new candidate terms</li>
</ul>
<div>
Until it converges, meaning the relevant tokens do not change their order anymore- they are equal. Note that you can also compare on their ranks; I haven't tried it yet, but it should also converge quite fast.</div>
</div>
<div>
<br /></div>
<div>
At the end you can obtain your expanded set by getting the candidate terms from the last iteration.</div>
<div>
<br /></div>
<div>
<b>Results</b></div>
<div>
<b><br /></b></div>
<div>
Since I'm working in the e-commerce industry, I wanted to try it out to find similar camera brands like in the paper. It works, but it gives very noisy results, partly because the algorithm is just iterative and not parallelized- so I can't work on huge datasets- and partly because I had no really clean dataset.</div>
<div>
But it really worked, I seeded "Canon" and "Olympus" and got all the other camera brands, sadly a few lenses as well.</div>
<div>
<br /></div>
<div>
On colors there is a similar behaviour; I seeded "blue", "red" and "yellow". Funnily, the most relevant tokens were "X" and "XS", which means that especially t-shirt products have the color in their name. It became very noisy afterwards- it found Ronaldo jerseys and other fancy clothes, but also a bunch of new colors.</div>
<div>
<br /></div>
<div>
The paper mentions a really nice trick for it: using the internet!</div>
<div>
They have collected millions of list items from websites and let the algorithm run; in my opinion, in every machine learning algorithm more data wins. The noise of a small dataset vanishes with the sheer size of the data. So I can very well imagine that they got really good results.</div>
<div>
<br /></div>
<div>
<b>Parallelization strategy</b></div>
<div>
<b><br /></b></div>
<div>
The easiest way to parallelize this is using BSP and Apache Hama (in my opinion). </div>
<div>
Let me tell you why:</div>
<div>
<ul>
<li>Algorithm is strongly iterative</li>
<li>Needs communication between a master and the slaves</li>
</ul>
<div>
Here is my strategy to implement this in BSP:</div>
</div>
<div>
<ul>
<li>The bipartite graph is split among the tasks, partitioned by the term nodes</li>
<li>A slave only partially computes the similarity between the seed tokens and the candidate terms</li>
<li>A master will be used to put all the similarities back into a single vector, rank it and distribute the newly expanded set again as well as checking convergence.</li>
</ul>
</div>
<div>
This is quite a lot of work, so I guess the follow-up post will take a few weeks. But I promise, I will code this algorithm!</div>
<div>
<br /></div>
<div>
So thank you very much for reading, you can find my sequential code on github, as well as the testcase:</div>
<div>
<br /></div>
<div>
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/ner/IterativeSimilarityAggregation.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/ner/IterativeSimilarityAggregation.java</a>
</div>
<div>
<br /></div>
<div>
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/ner/IterativeSimilarityAggregationTest.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/ner/IterativeSimilarityAggregationTest.java</a>
</div>
<div>
<br /></div>Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-79144730615970435342012-05-06T18:49:00.001+02:002013-04-07T16:56:53.424+02:00Distributed DBSCAN (Intuition)Hey all,<br />
<br />
it has been quite a long time since my last blog post- thanks to my work, which keeps me busy all day and doesn't let me research cool new things. However, over the few holidays and weekends of the last weeks I came across a very interesting algorithm called DBSCAN.<br />
It is short for "<i>density-based spatial clustering of applications with noise</i>"; it is an unsupervised clustering algorithm just like k-means, but much smarter in many aspects.<br />
Another objective I'd like to tackle is the parallelization of this algorithm. I've only seen some ancient papers, and what baffles me is that I've seen no implementation in Mahout (for MapReduce) or other distributed frameworks.<br />
<br />
As you may know, I'm working on Apache Hama. It is a framework for distributed computing with the BSP (bulk synchronous parallel) model. I am always searching for new algorithms that could fit into the model of BSP computing, e.g. graph algorithms of all sorts, strongly iterative algorithms, real-time algorithms.<br />
And I think that DBSCAN also fits into the BSP model; I'll tell you why a bit later in this post.<br />
First off, just a little introduction to the DBSCAN algorithm itself...<br />
<br />
<b>The algorithm</b><br />
<br />
The algorithm is very easy to understand. You have a bunch of points (or vectors in higher dimensions) as input, then you have two parameters and some fancy output.<br />
The two parameters are called <i>"epsilon"</i> and <i>"minpoints"</i>: epsilon is the maximum distance between two vectors to connect two points <b>strongly</b>, and minpoints is the number of points that are at least needed to build a cluster out of strongly connected vectors.<br />
Now you go through the points one by one, marking visited vectors and adding points to a cluster as long as they do not violate the rules defined by epsilon and minpoints.<br />
<br />
You can read on <a href="http://en.wikipedia.org/wiki/DBSCAN" target="_blank">wikipedia</a> about how the sequential version works in detail, however I am going to propose a much more easier to understand version of the algorithm.<br />
<br />
<b>Distributed algorithm steps</b><br />
<br />
Instead of defining a big distributed algorithm that translates the sequential version into some distributed programming model, I have assembled three main steps to get the same result as the sequential version.<br />
However, each of these steps is strongly parallelizable in every major programming model (at least I know how it works in MapReduce, BSP and MPI).<br />
<br />
Here are the three steps:<br />
<ol>
<li>compute a distance matrix between the vectors with a given distance measurement</li>
<ol>
<li>trivial step to parallelize, can also be merged with the next point</li>
</ol>
<li>extract adjacent points via the epsilon threshold and the minpoints restriction</li>
<ol>
<li>This step creates an adjacency list/matrix representing a graph</li>
<li>Noise is filtered at this step of the algorithm</li>
</ol>
<li>run a connected component algorithm on the resulting graph of the previous step</li>
<ol>
<li>Already done that in <a href="http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html" target="_blank">MapReduce</a> and <a href="http://codingwiththomas.blogspot.de/2011/04/graph-exploration-using-apache-hama-and.html" target="_blank">BSP</a>, the last BSP version will be updated shortly after Apache Hama 0.5.0 comes out.</li>
</ol>
</ol>
<div>
These three simple steps will give you the same result as DBSCAN. Normally you can merge step one with step two: you can simply extract the adjacent points while computing the distances. </div>
<div>
In the end, you will receive n connected components, each of them representing a cluster.</div>
<div>
The delta between those points and your original input would be the noise cluster.<br />
<br />
<b>Note </b>that the initial step is O(n²) which is obviously pretty bad and not scalable. So think about techniques like <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing">Similarity Hashing</a> to speed this step up.</div>
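<div>
<br /></div>
<div>
To illustrate steps one and two, here is a minimal sketch that computes the pairwise distances and extracts the adjacency list in one pass. The Euclidean distance and the plain O(n²) double loop are just for clarity- this is not what you would run on a big dataset, and it simply treats points with too few neighbours as noise (imports from java.util are omitted):</div>
<pre class="brush: java">// minimal sketch of steps 1+2: build an adjacency list from the raw points,
// an edge is added if the distance is <= epsilon; points with fewer than
// minPoints neighbours keep an empty list and therefore end up as noise
static List<List<Integer>> buildAdjacency(double[][] points, double epsilon, int minPoints) {
  List<List<Integer>> adjacency = new ArrayList<>();
  for (int i = 0; i < points.length; i++) {
    List<Integer> neighbours = new ArrayList<>();
    for (int j = 0; j < points.length; j++) {
      if (i != j && euclideanDistance(points[i], points[j]) <= epsilon) {
        neighbours.add(j);
      }
    }
    adjacency.add(neighbours.size() >= minPoints ? neighbours : new ArrayList<Integer>());
  }
  return adjacency;
}

static double euclideanDistance(double[] a, double[] b) {
  double sum = 0d;
  for (int i = 0; i < a.length; i++) {
    double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}
</pre>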
<div>
<br /></div>
<div>
Pretty easy, right? I think it is even easier than the pseudocode on Wikipedia.</div>
<div>
<br /></div>
<div>
Of course I put up a sample version (although sequential) on my github:</div>
<div>
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/DBSCAN.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/DBSCAN.java</a>
</div>
<div>
<br /></div>
<div>
There is a nice plot I received when running it:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWs9U-yUGNnf3_xI9syxz1blP0fwqnXKmLZoltpSwIUjsj6rGXER6ONb911j7Uh-oOCOIbCYQCNf9iWBfxF6r1WxbQL9Vu1ATP0cISWu3iGBIWCNGIxqOv7ZMX4bIrF4Um-xi2tWnm9Rs/s1600/dbscan.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="164" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWs9U-yUGNnf3_xI9syxz1blP0fwqnXKmLZoltpSwIUjsj6rGXER6ONb911j7Uh-oOCOIbCYQCNf9iWBfxF6r1WxbQL9Vu1ATP0cISWu3iGBIWCNGIxqOv7ZMX4bIrF4Um-xi2tWnm9Rs/s320/dbscan.JPG" width="320" /></a></div>
<div>
<br /></div>
<div>
To make the noise easier to spot, I have drawn horrible yellow circles around it with Paint- please forgive me ;)</div>
<div>
<br /></div>
<div>
<b>Stay tuned for an implementation with Apache Hama!</b><br />
<b><br /></b>
<b>Update:</b><br />
<br />
So far I haven't found the time to implement this whole system with Apache Hama. <a href="http://stackoverflow.com/questions/15863566/need-assistance-with-implementing-dbscan-on-map-reduce/15863699#15863699" target="_blank">However, if you want to practically use this, here is some advice</a>:<br />
<br />
<br />
<ul>
<li>Instead of computing the full distance matrix, better use a heuristic to find close vectors</li>
<ul>
<li>Mahout has a MinHashing implementation of such a clustering</li>
</ul>
<li>Once you have obtained "mini" clusters, you can compute more expensive distance measurements and extract your graph (step two in the above list)</li>
</ul>
</div>
<div>
<br /></div>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com9tag:blogger.com,1999:blog-7181711759016870742.post-72370968271057903522012-02-27T23:30:00.005+01:002012-07-08T09:40:19.451+02:00Nonlinear conjugate gradient method in JavaHi all,<br />
<br />
I have been a bit busy for the last few weeks with some projects. But fortunately I found some time last weekend to implement a recommendation engine for myself, based on what we have done in the <a href="http://ml-class.org/">ml-class.org</a> lessons in Octave, and translate it into a Java version.<br />
<br />
If you attended ml-class you might know that you need to minimize a rather complex cost function based on what the user likes in terms of movies.<br />
However, I haven't found a <b>simple and not ancient</b> Java library containing a fully working conjugate gradient method. So I decided to translate it from Octave to Java on short notice. It took me 5-6 hours to build an Octave-like vector library around it to translate it almost 1:1. But it was really worth it.<br />
<br />
And here it is:<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/Fmincg.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/minimize/Fmincg.java</a> <br />
<br />
Fmincg btw stands for <b>F</b>unction <b>mi</b>nimize <b>n</b>onlinear <b>c</b>onjugate <b>g</b>radient. It wasn't clear to me in the first place and I really started to understand the algorithm when I translated it.<br />
<br />
It works quite like the version in Octave: you pass an input vector (which is used as a starting point) and a cost function along with a number of iterations to do.<br />
Since I'm a hacker at heart, I want to give you some sample code to try it out for yourself.<br />
<br />
<b>Example</b><br />
<b><br />
</b><br />
I have prepared a simple quadratic function for you<br />
f(x) = (4-x)^2+10<br />
<br />
Since this is quadratic, it has a single global minimum, which is easy to spot because I used the binomial form of the function- we will see if fmincg finds it.<br />
<br />
The algorithm constantly needs to calculate the gradient of the cost function at the current input. Therefore we need the derivative, which for our function is<br />
f'(x) = 2x-8<br />
<br />
If you're a math crack, then you know that f(x) (in our case) hits its minimum where the derivative cuts the x-axis, i.e. where y=0.<br />
<br />
Since this is quite hard to visualize, I have made a picture for you:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9YiEACyFUtTWyj6uhiFtaIOu4X1vbgkeNF9UJWjfPw8WG7pXSfC9C_eJ8BKajgetBjmXF9v_Fq-nanKwWdUOMG5UNXK1m4v1snIfNNyVLrJ2XV-hpYFPNvwuKyckBGHxrcjCtFH57hCY/s1600/graph.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9YiEACyFUtTWyj6uhiFtaIOu4X1vbgkeNF9UJWjfPw8WG7pXSfC9C_eJ8BKajgetBjmXF9v_Fq-nanKwWdUOMG5UNXK1m4v1snIfNNyVLrJ2XV-hpYFPNvwuKyckBGHxrcjCtFH57hCY/s400/graph.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Function f(x) and its derivative</td></tr>
</tbody></table>Not difficult to spot, you see the black function is our f(x) whereas the green line is our derivative. And it hits the x-axis right at the minimum of the function. Great!<br />
<br />
<b>How do we code this?</b><br />
<b><br />
</b><br />
This is quite simple, I show you the code first:<br />
<br />
<pre class="brush: java">int startPoint = -5;
// start at x=-5
DenseDoubleVector start = new DenseDoubleVector(new double[] { startPoint });
CostFunction inlineFunction = new CostFunction() {
  @Override
  public Tuple<Double, DenseDoubleVector> evaluateCost(
      DenseDoubleVector input) {
    // our function is f(x) = (4-x)^2+10, so we can calculate the cost
    double cost = Math.pow(4 - input.get(0), 2) + 10;
    // the derivative is f'(x) = 2x-8 which is our gradient value
    DenseDoubleVector gradient = new DenseDoubleVector(
        new double[] { 2 * input.get(0) - 8 });
    return new Tuple<Double, DenseDoubleVector>(cost, gradient);
  }
};

DenseDoubleVector minimizeFunction = Fmincg.minimizeFunction(inlineFunction,
    start, 100, true);
// should return 4
System.out.println("Found a minimum at: " + minimizeFunction);
</pre>
As you can see, we have to allocate the vector containing our start "point". You can set this arbitrarily, but you have to know that the algorithm might then not hit the global minimum but rather a local one. So different random starting points can yield different results.<br />
<br />
But not in our case. Now you can implement the cost function: the algorithm needs the cost of your function at the given input and the value of the derivative at that point.<br />
So we plug the input "x" into our two equations and return both as a tuple.<br />
<br />
The whole application prints:<br />
<br />
<pre class="brush: java">Interation 1 | Cost: 10,000000
Interation 2 | Cost: 10,000000
[4.0]
</pre><br />
Works! It outputs the x-value of our minimum, and the cost is 10, which is exactly the y-value. Very cool!<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/math/minimize/FmincgTest.java">You can find this example on Github as well.</a><br />
<br />
Please use it if you need it. I have already implemented the collaborative filtering algorithm with it, and I guess a backprop neural network will follow soon.<br />
<br />
I really have to admit that I am not a math crack although I am studying computer science, but math really makes many things easier, and if you take a deeper look behind what you've learned in school, it is really beautiful.<br />
<br />
Thanks and bye!Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com9tag:blogger.com,1999:blog-7181711759016870742.post-47064251180022936282012-01-24T12:56:00.003+01:002012-01-24T13:02:07.911+01:00German Stop WordsHey all,<br />
<br />
I've been doing some text mining lately, so I needed a reliable list of German stop words.<br />
The only really advanced version I found was the Lucene "<a href="http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/analysis/de/GermanAnalyzer.html" target="_blank">GermanAnalyzer</a>". That is the seed of the following list I wanted to share with you.<br />
<br />
I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).<br />
<br />
<pre class="brush: java">public final static HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(
Arrays.asList(new String[] { "and", "the", "of", "to", "einer",
"eine", "eines", "einem", "einen", "der", "die", "das",
"dass", "daß", "du", "er", "sie", "es", "was", "wer",
"wie", "wir", "und", "oder", "ohne", "mit", "am", "im",
"in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",
"ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",
"mit", "dich", "dir", "mich", "mir", "mein", "sein",
"kein", "durch", "wegen", "wird", "sich", "bei", "beim",
"noch", "den", "dem", "zu", "zur", "zum", "auf", "ein",
"auch", "werden", "an", "des", "sein", "sind", "vor",
"nicht", "sehr", "um", "unsere", "ohne", "so", "da", "nur",
"diese", "dieser", "diesem", "dieses", "nach", "über",
"mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",
"unsers", "euch", "euers", "euer", "eurem", "ihr", "ihres",
"ihrer", "ihrem", "alle", "vom" }));
</pre><br />
Note that there are some English words in there as well; if you don't need them, they are just in the first section of the array, so you can easily remove them ;)<br />
<br />
If you have a good stemmer, you can remove other words as well.<br />
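Here's a minimal usage sketch of the set; the tokens are just made-up examples:<br />
<br />
<pre class="brush: java">// filter the stop words out of a tokenized text
List<String> tokens = Arrays.asList("die", "daten", "und", "der", "algorithmus");
List<String> filtered = new ArrayList<String>();
for (String token : tokens) {
  if (!GERMAN_STOP_WORDS.contains(token.toLowerCase())) {
    filtered.add(token);
  }
}
// filtered now contains [daten, algorithmus]
</pre><br />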
<br />
<b>How did I extract them?</b><br />
<br />
These words are the words that had the highest word frequency in a large set (> 10 Mio.) of text and html documents.<br />
<br />
Have fun and good luck!Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com2tag:blogger.com,1999:blog-7181711759016870742.post-55563936444940076592012-01-02T19:23:00.001+01:002012-01-02T19:24:54.415+01:00BSP k-means Clustering BenchmarkHey all,<br />
<br />
in my last post I already wrote about k-means clustering with Apache Hama and BSP.<br />
Now we have a detailed benchmark of my algorithm.<br />
<br />
Have a look at the current state here: <a href="http://wiki.apache.org/hama/Benchmarks">http://wiki.apache.org/hama/Benchmarks</a><br />
<br />
Because it will change during the lifetime of Apache Hama, I took a screenshot of the very first benchmark. Maybe to document performance improvements ;)<br />
<br />
Have a look here:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-_eRLxCTCbsM/TwH1LK9Y5XI/AAAAAAAAAXE/NLgNNl6FNZo/s1600/bench.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="195" src="http://1.bp.blogspot.com/-_eRLxCTCbsM/TwH1LK9Y5XI/AAAAAAAAAXE/NLgNNl6FNZo/s400/bench.JPG" width="400" /></a></div><br />
<br />
<br />
<b>Is it faster than MapReduce?</b><br />
Yes! I recently read in the new <a href="http://www.manning.com/ingersoll/">"Taming Text" by Grant S. Ingersoll</a> that the same workload takes about the same figure, just measured in minutes rather than in seconds.<br />
<br />
However, I want to benchmark it against the same dataset and on the same machines to get a fully comparable result.<br />
<br />
<b>Future Work</b><br />
Besides the benchmark against MapReduce and Mahout, I want to show the guys from Mahout that it is reasonable to use BSP as an alternative to MapReduce. I hope that within the next year they will use Apache Hama and BSP as an alternative to MapReduce implementations for various tasks.<br />
<br />
Thanks to <a href="https://twitter.com/#!/eddieyoon">Edward </a>who made this possible!Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com4tag:blogger.com,1999:blog-7181711759016870742.post-34931559059459207102011-12-10T10:05:00.001+01:002015-04-06T13:05:45.429+02:00k-Means Clustering with BSP (Intuition)Hey all,<br />
<br />
I had a bit of time to get my k-means clustering running with BSP and Apache Hama, and I wanted to share this with you.<br />
Actually this is a sequel to a series that I have announced long ago: <a href="http://codingwiththomas.blogspot.com/2011/05/series-k-means-clustering-mapreduce-bsp.html">Series: K-Means Clustering (MapReduce | BSP)</a>.<br />
You may remember that I already made a post about k-means and MapReduce; now I want to show you that k-means clustering with BSP is much faster than the MapReduce version.<br />
Don't expect a benchmark in this post, but I will give you a real comparison a bit later once Apache Hama 0.4.0 rolls out.<br />
So this post is mainly about the algorithm itself and what my ideas were to make it run and scale.<br />
I'm going to write a follow-up post containing more code and explaining more clearly how it works. But the code is still changing, so this will take a bit more time.<br />
<br />
<b>Quick note right at the beginning:</b> As mentioned, I am currently developing against an Apache Hama 0.4.0 SNAPSHOT, since it hasn't been released yet.<br />
<br />
Let's step into it!<br />
<br />
<div>
<b>Model Classes</b></div>
<div>
<br /></div>
<div>
We need some kind of <i>Vector </i>and something that is going to represent our <i>Center</i>.</div>
<div>
These are very basic; actually a <i>Vector </i>is just a wrapper object around an array of doubles with some convenience methods. A <i>Center </i>is basically the same as a <i>Vector</i>, but it adds some more methods, e.g. to detect whether it has converged or to average two centers.</div>
<div>
<br /></div>
<div>
You can have a look on Github at them if you're interested.</div>
<div>
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/model/Vector.java">Vector.java</a></div>
<div>
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/model/ClusterCenter.java">ClusterCenter.java</a></div>
<div>
<br /></div>
<div>
<b>Idea behind k-means with Apache Hama's BSP</b></div>
<div>
<br /></div>
<div>
Before really showing you the algorithm in the source code, I'd like to tell you what my plan and intention was. </div>
<div>
In a typical clustering scenario you will have many more points than centers/means. So n >> k (">>" meaning "much larger than"), where n is our number of vectors and k is our number of centers. </div>
<div>
Since Apache Hama 0.4.0 will provide us with a new I/O system that makes it really easy to iterate over a chunk of the input on disk over and over again, we can exploit this and put all our centers into RAM.</div>
<div>
<br /></div>
<div>
The trick in this case is that unlike in graph algorithms, we do not split the centers over the tasks, but every task holds all k-centers in memory. </div>
<div>
So each task gets a part of the big input file and every task has all centers. </div>
<div>
Now we can easily do our assignment step: we just iterate over all input vectors and measure the distance against every center. </div>
<div>
<br /></div>
<div>
While iterating we find the nearest center for each of the n vectors. To save memory we are going to average our new center "on-the-fly". Follow the <a href="http://en.wikipedia.org/wiki/Average#In_data_streams">Average in data streams</a> article on Wikipedia, if you're not familiar with it. </div>
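<div>
Here is a minimal sketch of that assignment step with the "on-the-fly" average; every name in it (inputChunk, centers, euclidianDistance and so on) is made up for illustration and is not the actual code from the repository:</div>
<pre class="brush: java">double[][] newMeans = new double[k][dim];
int[] assignedCounts = new int[k];

for (double[] vector : inputChunk) { // this task's share of the input
  // find the nearest center for this vector
  int nearest = 0;
  double best = Double.MAX_VALUE;
  for (int c = 0; c < k; c++) {
    double dist = euclidianDistance(centers[c], vector);
    if (dist < best) {
      best = dist;
      nearest = c;
    }
  }
  // "on-the-fly" average: mean = mean + (x - mean) / n
  assignedCounts[nearest]++;
  for (int d = 0; d < dim; d++) {
    newMeans[nearest][d] += (vector[d] - newMeans[nearest][d]) / assignedCounts[nearest];
  }
}
</pre>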
<div>
<br /></div>
<div>
At the end of the assignment step, each task holds its "on-the-fly" computed new average centers. </div>
<div>
Now we are going to broadcast each of these computed averages to the other tasks.</div>
<div>
Then we are going to sync so all messages can be delivered.</div>
<div>
Afterwards each task iterates through the messages and averages all incoming centers that belong to the same "old" mean. </div>
<div>
<br />
I know this is difficult to explain, but please consult this picture; it shows the exchange of the locally computed means for two tasks.<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjByVhgONhk72sj_pTLR7fgWTqEdg7COsAUGKntyCEk1j7qTtcAkL48kmF_1-gvjbKb1lJb1FKwOWjwzJ67EImZ26aRNdgx59t2ENg88lF6RQN0NrNYwWWtUWhEJ-sXqXFtzvw-ljcagM4/s1600/messageExchange.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjByVhgONhk72sj_pTLR7fgWTqEdg7COsAUGKntyCEk1j7qTtcAkL48kmF_1-gvjbKb1lJb1FKwOWjwzJ67EImZ26aRNdgx59t2ENg88lF6RQN0NrNYwWWtUWhEJ-sXqXFtzvw-ljcagM4/s320/messageExchange.png" height="210" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Message exchange with mean calculation</td></tr>
</tbody></table>
<br /></div>
<div>
As you can see on this picture, we have two tasks which have calculated "their version" of a new mean. Since this isn't the "global mean" we have to calculate a new mean that will be consistent across all the tasks and is still the correct mathematical mean.<br />
We apply the "Average on data streams" strategy. Each task is going to get the local computed averages from each other task and is reconstructing the global mean.<br />
<br />
<br />
Since this calculation is the same on every task, the means stay globally consistent across the tasks, at the cost of just one global synchronization. Fine!<br />
Actually this is the whole intuition behind it. As the algorithm moves forward, this whole computation runs over and over again until the centers have converged, i.e. don't change anymore.<br />
This is much faster than MapReduce, because you don't have to submit a new job for each computation step.<br />
In BSP a superstep is less costly than running a whole MapReduce job, therefore it is faster for this kind of task.<br />
<br />
<br />
When you plot the result, you come up with something that looks like this:<br />
<div>
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXsIzKYm5e37t4qIxPHMpfNV65QtqZhW0DhFra9-AjD2-o_Zs022AbMe5euKa6zcv6FPkhGHLSIgOsUYvwtCdAYz3gGPuraxczoDrB_-PLiy9d68LS7hYMt1NvLvpYeCQ3wD1ffc7JBAc/s1600/kmeans_2.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXsIzKYm5e37t4qIxPHMpfNV65QtqZhW0DhFra9-AjD2-o_Zs022AbMe5euKa6zcv6FPkhGHLSIgOsUYvwtCdAYz3gGPuraxczoDrB_-PLiy9d68LS7hYMt1NvLvpYeCQ3wD1ffc7JBAc/s320/kmeans_2.PNG" height="203" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">K-Means BSP Clustering with K=3</td></tr>
</tbody></table>
<br />
In one of the upcoming posts, I'm going to explain the code to you in detail.<br />
<br />
If you are interested in the code and can't wait, you can have a look at it on my GitHub here:<br />
<a href="https://github.com/apache/hama/tree/trunk/ml/src/main/java/org/apache/hama/ml/kmeans">K-Means BSP Code</a><br />
<br />
The code will randomly create some vectors and assign the k initial centers to the first k created records. You can run it via a main method in your IDE or on your local-mode Hama cluster.<br />
<br />
See you!</div>
Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com5tag:blogger.com,1999:blog-7181711759016870742.post-350255502511761752011-10-24T19:56:00.007+02:002011-11-18T16:33:48.077+01:00Apache Hama realtime processingHi there,<br />
<br />
today is about realtime processing with Apache Hama.<br />
One day, Edward told me about a guy who told him that he uses Hama for realtime processing.<br />
<br />
At first this sounds like quite a strange/new thing, because BSP is inherently used (just like MapReduce) for batch processing. It has several advantages over MapReduce, especially in graph and mathematical use cases.<br />
<br />
I think this new "feature" is the greatest advantage over MapReduce.<br />
Let me clarify a bit how it works.<br />
<br />
At first you will have some tasks which are going to be our so-called <b>event collectors</b>. In my example this will be a single master task.<br />
Anyways, the trick is that the event collectors wait for new events to arrive, or even poll for events that happened, and they do this in a while loop. (Something which is possible in MapReduce too, by the way.)<br />
<br />
Now you can build a <a href="http://zone.ni.com/devzone/cda/tut/p/id/3023">producer/consumer pattern</a> on top of this, which just means that your<b> event collectors</b> are messaging <b>computation tasks</b> to do some computation on the data we have just sent. This will allow you to do more complex stream analysis in near-realtime.<br />
We will see this in an example a bit later.<br />
<br />
<b>Why is this better than a MapReduce job?</b><br />
If you run a MapReduce job, you can poll for available data inside a while loop, too. But without a messaging system between the tasks, you are forced to write your data into HDFS to make it available to a larger number of tasks to parallelize your workload.<br />
Since Hadoop has lots of job scheduling and setup overhead, this is not realtime anymore. It degenerates to batch processing.<br />
For those of you who are familiar with <a href="http://incubator.apache.org/giraph/">Giraph</a>: it is similar to a MapReduce job where tasks message other MapReduce tasks. Sadly they just focused on graph computing and are strongly dependent on input from the filesystem.<br />
<br />
<b>Example: Realtime Twitter Message Processing </b><br />
Yes, we can analyse Twitter data streams in BSP in realtime!<br />
What do we need?<br />
<ul><li>Twitter Library, in this case Twitter4J</li>
<li>Latest Hama, in this case a 0.4.0 snapshot. You can use 0.3.0 as well, with minor code changes.</li>
</ul>Let's dive directly into it and look at how to set up the job.<br />
<br />
<pre class="brush: java">HamaConfiguration conf = new HamaConfiguration();
// set the user we want to analyse
conf.set("twitter.user.name", "tjungblut");
// I'm always testing in localmode so I use 2 tasks.
conf.set("bsp.local.tasks.maximum", "2");
BSPJob bsp = new BSPJob(conf);
bsp.setJobName("Twitter stream processing");
bsp.setBspClass(DataStreamProcessing.class);
bsp.waitForCompletion(true);
</pre><br />
I think this is pretty standard; the trick here is to set the desired username of the person you want to analyse.<br />
In my case this is my Twitter nick "tjungblut".<br />
<br />
I omit the setup method and the fields here; if you have questions about what I've done there, feel free to comment on this post.<br />
<br />
<b>The real (time) processing</b><br />
Let's step directly into the processing and the mimicking of the producer/consumer pattern.<br />
The idea is simple: a master task polls for new "tweets" and sends them directly to our computation tasks (the field name is otherPeers, which contains all tasks but the master task).<br />
This happens while our computation tasks are waiting for new "food" to arrive.<br />
Once our computation tasks get a message, they can directly start with their computation.<br />
<br />
Let's see how the master task is doing the job:<br />
<br />
<pre class="brush: java">@Override
public void bsp(BSPPeer bspPeer) throws IOException, KeeperException,
InterruptedException {
if (isMaster) {
while (true) {
// this should get us the latest 20 tweets of this user
List<Status> statuses = twitter.getUserTimeline(userName);
for (Status s : statuses) {
// deduplicate
if (alreadyProcessedStatusses.add(s)) {
System.out.println("Got new status from: "
+ s.getUser().getName() + " with message " + s.getText());
// we distribute messages to the other peers for
// processing via user id partitioning
// so a task gets all messages for a user
bspPeer.send(
otherPeers[(int) (s.getUser().getId() % otherPeers.length)],
new LongMessage(s.getUser().getId(), s.getText()));
}
}
// sync before we get new statuses again.
bspPeer.sync();
... // computation task stuff
</pre>Note: I've omitted a lot of details (try/catches) and premature optimizations which can be found in the code.<br />
<br />
As you can see, the event collector (aka master task) is polling the Twitter API to get the newest tweets of a given user.<br />
Now the master is sending the new messages to our computation task. <br />
Note that there is a simple trick to distribute the work equally to the tasks. In our case we have just a single user we are listening on, and two tasks. So this won't do anything but send the messages directly to the other task.<br />
You can change this behaviour by either listening to the public timeline or changing the distribution of the messages by using the message id instead of the user id. I hope you get the gist ;)<br />
<br />
In short: We are listening to a specific user and therefore every message goes from the collector directly to one computation task. In our case we have only 2 tasks, so increasing the number of tasks will just cause the additional tasks to be idle the whole time.<br />
<br />
Let's have a look at the slave task (aka computation task).<br />
<br />
This is very simple:<br />
<pre class="brush: java">// we are not the master task... so lets do this:
} else {
while (true) {
// wait for some work...
bspPeer.sync();
LongMessage message = null;
while ((message = (LongMessage) bspPeer.getCurrentMessage()) != null) {
System.out.println("Got work in form of text: " + message.getData()
+ " for the userid: " + message.getTag().longValue());
}
}
}
</pre><br />
As you can see, this is a pretty simple consumer.<br />
You could now add some logic to it, for example to track the communication between a person and others: how often, how much and what content.<br />
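A hedged sketch of what such logic could look like inside the consumer loop from above; the map and the per-user counting are made up for illustration:<br />
<br />
<pre class="brush: java">// count how many messages we have seen per user id
HashMap<Long, Integer> messageCountPerUser = new HashMap<Long, Integer>();

LongMessage message = null;
while ((message = (LongMessage) bspPeer.getCurrentMessage()) != null) {
  long userId = message.getTag().longValue();
  Integer count = messageCountPerUser.get(userId);
  messageCountPerUser.put(userId, count == null ? 1 : count + 1);
  // message.getData() could additionally be tokenized here to count words,
  // track mentions and so on
}
</pre><br />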
<br />
In my case, the console output of the running job looks like this:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7MtGAssd2RuEsEKCLk3Kf3iNDkq8FvQxkcUNxR229hzYnVkhvpD47xXpcsor4tnk_SovFOMZFAIv-RNo6IPOg_67EwJqtvUfhLgAzCAAvCJKv9bX3j6a_OMHCAFqBXjRi_joF44VTLb4/s1600/hama-realtime.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="146" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7MtGAssd2RuEsEKCLk3Kf3iNDkq8FvQxkcUNxR229hzYnVkhvpD47xXpcsor4tnk_SovFOMZFAIv-RNo6IPOg_67EwJqtvUfhLgAzCAAvCJKv9bX3j6a_OMHCAFqBXjRi_joF44VTLb4/s320/hama-realtime.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Console output of the realtime processing job</td></tr>
</tbody></table><br />
Note that it came up directly after it had been sent.<br />
Now, this is a really cool thing!<br />
<br />
<b>Imagine:</b><br />
If you had unlimited access to the public timeline (sadly this is capped at 150 requests/h) and enough computational power in your cluster, you could compute your own trending topics!<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtca0fqNpokGVXVORf-pDIZIWSWeIPOnEdaPJhtow0bJwPFsSx_VWRci5eGspN6klmFR0VvWo6Tc6UVM-YU-1ItZLDujT7ew0sNOOtnaI6tMIpxVEw9KUHHhe5t_MV3GJO4OKl-FBP2ik/s1600/twittertrends.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtca0fqNpokGVXVORf-pDIZIWSWeIPOnEdaPJhtow0bJwPFsSx_VWRci5eGspN6klmFR0VvWo6Tc6UVM-YU-1ItZLDujT7ew0sNOOtnaI6tMIpxVEw9KUHHhe5t_MV3GJO4OKl-FBP2ik/s1600/twittertrends.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Twitters worldwide trending topics</td></tr>
</tbody></table><br />
Of course you can do everything else you want to.<br />
<br />
I hope this has been quite "illuminating" for you and shows you how to enable realtime processing if you have Hama.<br />
<br />
Of course you can check out my source code on my GitHub. The class we just talked about is available here:<br />
<br />
<a href="https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/DataStreamProcessing.java">https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/DataStreamProcessing.java</a><br />
<br />
Have fun and good luck!Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com5tag:blogger.com,1999:blog-7181711759016870742.post-67771080865373063142011-10-24T18:59:00.001+02:002011-10-24T18:59:18.314+02:00Apache Hama upcoming featuresHi all,<br />
<br />
for me it is a pleasure to bring you a couple of new things and announcements in this blog post.<br />
<br />
Apache Hama 0.4.0 is on its way, and I want to introduce several pieces of fanciness before we dive into the realtime processing (which will be the follow-up blog post).<br />
<ol><li>Revised BSP execution flow</li>
<li>Multiple Tasks per groom</li>
<li>YARN integration </li>
</ol><div><b>Revised BSP execution flow</b><br />
<br />
The first point is a very good improvement. Writing a BSP is totally convenient now.</div><div>Let's take a look at the implementation of a BSP in Hama 0.3.0:</div><div><br />
<pre class="brush: java">class OldBSP extends BSP{
@Override
public void bsp(BSPPeerProtocol arg0) throws IOException, KeeperException,
InterruptedException {
// TODO Auto-generated method stub
}
@Override
public void setConf(Configuration conf) {
// TODO Auto-generated method stub
}
@Override
public Configuration getConf() {
// TODO Auto-generated method stub
return null;
}
}
</pre><br />
You see it in my Eclipse-generated subclass. You <b>have to</b> override plenty of methods.<br />
Two of them (if you are not familiar with Hadoop) seem very strange. What is that configuration? And why do I need to set it in my code?<br />
<br />
Well, this is now history. We have revised the design and are now shipping default implementations of every method in the BSP class.<br />
<br />
Additionally we have added a setup and a cleanup method. Setup is now called before the computation starts, cleanup after your computation has been done.<br />
<br />
Let's see:<br />
<br />
<pre class="brush: java">public class NewBSP extends BSP{
@Override
public void setup(BSPPeer peer) throws IOException, KeeperException,
InterruptedException {
super.setup(peer);
}
@Override
public void bsp(BSPPeer peer) throws IOException, KeeperException,
InterruptedException {
super.bsp(peer);
}
@Override
public void cleanup(BSPPeer peer) {
super.cleanup(peer);
}
}
</pre><br />
It is a lot more intuitive, isn't it? Now YOU control which methods you need to override. And it is fully transparent when the methods are called.<br />
<br />
And the best side effect is that you can send messages and trigger a barrier sync while in setup!<br />
This enables you to send initial messages to other tasks and to distribute information which hasn't been set in the configuration.<br />
BTW: Your job's configuration can now be obtained via peer.getConfiguration().<br />
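Here is a small, hedged sketch of what that could look like; it only uses the send/sync primitives shown in this post, and the peer name and configuration key are purely illustrative:<br />
<br />
<pre class="brush: java">@Override
public void setup(BSPPeer peer) throws IOException, KeeperException,
    InterruptedException {
  // read something from the job's configuration (illustrative key)
  String initialValue = peer.getConfiguration().get("my.initial.value");
  // broadcast some information that wasn't set in the other tasks'
  // configuration and wait at the barrier so it gets delivered
  // (otherPeerName is just a placeholder for a real peer name)
  peer.send(otherPeerName, new LongMessage(1L, initialValue));
  peer.sync();
}
</pre><br />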
<br />
<b>Multiple Tasks per groom</b><br />
<br />
Yeah, we made the step to multitasking. In Hama 0.3.0 we only had a single task inside each groom.<br />
This didn't really utilize the machines, because while executing a single BSP, other cores might be unused.<br />
Like in Hadoop, this is now configurable per host. So you can set the number of tasks which should be executed on a machine.<br />
<br />
<b>YARN integration</b><br />
<br />
If you don't know what YARN actually is, let me clarify a bit. YARN stands for <b>Y</b>et <b>A</b>nother <b>R</b>esource <b>N</b>egotiator.<br />
This is Hadoop's new architecture. If you want to learn more, have a look at <a href="http://twitter.com/#!/acmurthy">Arun's </a>slides <a href="http://www.slideshare.net/hortonworks/nextgen-apache-hadoop-mapreduce">here</a>.<br />
<br />
Now that you know what the new Hadoop is, I'm proud to tell you that Apache Hama will be a part of it.<br />
We implemented our own application to fit with the new YARN module and bring Hama to your Hadoop 0.23 cluster.<br />
No more setup and configuration of additional daemons!<br />
<br />
We managed to get a first BSP (Serialize Printing) running on a YARN cluster.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTRnqHb3rDanUwmWvX4PL2FenOzyA9RpxlSvA4jh1e4fukqAp6J-Bqja-Cfb3mB3Pgz212y3f-jRjUYGojqZu5o3ViqN9I77mJDySMQigHM55LQp97fStppd36DRKH_Nw7x3zUjRzLR6o/s1600/yarn.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTRnqHb3rDanUwmWvX4PL2FenOzyA9RpxlSvA4jh1e4fukqAp6J-Bqja-Cfb3mB3Pgz212y3f-jRjUYGojqZu5o3ViqN9I77mJDySMQigHM55LQp97fStppd36DRKH_Nw7x3zUjRzLR6o/s400/yarn.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Serialize Printing on YARN</td></tr>
</tbody></table>That is soo great!<br />
<br />
We are still in development, so please follow <a href="https://issues.apache.org/jira/browse/HAMA-431">HAMA-431</a> if you are interested.<br />
<br />
Thanks for your attention, and please follow us on the <a href="http://incubator.apache.org/hama/mail-lists.html">mailing list</a>! We are happy to answer your questions if you have one.</div>Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-22277590572488552302011-08-30T19:40:00.004+02:002015-03-14T10:45:07.501+01:00Ant Colony Optimization for TSP ProblemsCode can be found under (<a antcolonyopt="" code.google.com="" href="https://github.com/thomasjungblut/antcolonyopt" http:="" p="">https://github.com/thomasjungblut/antcolonyopt</a>)<br />
<br />
Hi there,<br />
<br />
I recently focused on playing around with ant colony AI (artificial intelligence) methods as part of my second-semester undergraduate project in Software Engineering.<br />
<br />
Wikipedia's definition for Ant Colony Optimization is the following:<br />
<blockquote>
In computer science and operations research, the <b>ant colony optimization</b> algorithm <b>(ACO)</b> is a probabilistic technique for solving computational problems which can be reduced to finding good paths through graphs.</blockquote>
Unlike in previous posts, this is a multithreaded solution which runs best on a single machine and not in a cluster, and this post is about the TSP (<a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">Traveling Salesman Problem</a>) and not about finding shortest paths in a graph.<br />
<br />
<b>The Input</b><br />
<br />
At first, we have to decide which input we take. In my case this task was a university assignment, so I had to use the given one. It is the Berlin52 problem, which asks for the least costly roundtrip along the 52 best known / famous places in Berlin.<br />
You can download a euclidian 2D coordinate version of it <a href="http://www2.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp/">here</a>. Just look for the "berlin52.tsp.gz" file. There is also an already calculated optimal solution <a href="http://www2.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp/berlin52.opt.tour.gz">here</a>.<br />
<br />
Plotting the file is going to show a cool roundtrip:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwjCLXuuDUuLhcbqCetovX5uD0rNvCV4uT0x6MdFoW-B41yaxznIoTipTjvVWXMgVDFnPqD-gUcxzZ66xoDPBM4RBuoIVxX4fdikUWuKKXrsmhHTavXCuleK7pKZupD_CoJFD2ynq5Vyk/s1600/tsp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwjCLXuuDUuLhcbqCetovX5uD0rNvCV4uT0x6MdFoW-B41yaxznIoTipTjvVWXMgVDFnPqD-gUcxzZ66xoDPBM4RBuoIVxX4fdikUWuKKXrsmhHTavXCuleK7pKZupD_CoJFD2ynq5Vyk/s400/tsp.png" height="507" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">GnuPlot of the Berlin 52 Tour</td></tr>
</tbody></table>
<br />
<br />
The file content looks quite like this:<br />
<blockquote>
1 565.0 575.0<br />
2 25.0 185.0<br />
3 345.0 750.0<br />
4 945.0 685.0</blockquote>
On the leftmost side is the ID of the vertex, in the middle is the x-coordinate and on the rightmost side is the y-coordinate in the euclidian plane.<br />
<br />
<b>The Algorithm</b> <br />
<br />
This algorithm is a mimicry of the real-life behaviour of ants. As you might know, ants are quite smart in finding their way between their colony and a food source. A lot of workers walk through the proximate environment, and if they find some food, they leave a pheromone trail.<br />
Some of the other ants keep searching for other paths, but most of them follow the pheromone trail, making this path even more attractive. But over time the pheromone trail starts to decay / evaporate, so it loses its attractiveness. Due to this time component, a long path has a very low pheromone density, because the trail along the longer path evaporates faster. Thus a very short path has a higher density of pheromones and is more attractive to the workers, which converge onto an approximately optimal path for our TSP problem.<br />
<br />
For the problem, we are dropping ants on random vertices of our graph. Each ant is going to evaluate the next best destination vertex based on the following formula:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://upload.wikimedia.org/math/e/1/3/e1320f5f72b21e5766dfa7e29b536883.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://upload.wikimedia.org/math/e/1/3/e1320f5f72b21e5766dfa7e29b536883.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Wikipedia.org</td></tr>
</tbody></table>
It is about the probability "p" for an ant called "k" to move from vertex "x" to vertex "y". <br />
The variable "Tau" is the amount of pheromone deposited on the edge between "x" and "y". It gets raised to the power of "alpha", a heuristic parameter describing how greedy the algorithm is in finding its path across the graph. This is multiplied by our a-priori knowledge of how "good" the edge is, in our case the inverted distance (1 / distance) between x and y, raised to the power of "beta", which is also a heuristic parameter and describes how fast the ants converge to a steady solution. To get a transition probability to a vertex, each numerator is divided by the sum of the numerators over all remaining possible destination vertices.<br />
<br />
The next equation is about adjusting the pheromone matrix, which is described by the following formula:<br />
<blockquote>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://upload.wikimedia.org/math/1/7/b/17b189b13928502c7a2e5fd7fbdc6184.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://upload.wikimedia.org/math/1/7/b/17b189b13928502c7a2e5fd7fbdc6184.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Wikipedia.org</td></tr>
</tbody></table>
</blockquote>
"Tau" is the amount of absolute pheromone which gets deposited for worker "k" on the "xy" edge. "Rho" is a factor between 0-1 which represents the decay of the pheromone. This gets multiplied by the current amount of pheromone deposited and we just add updated new pheromone to it (which is the delta "Tau"). Delta "Tau" is an equatation too:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://upload.wikimedia.org/math/6/d/b/6db065218c956a4a7af6da99aaeca5d1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://upload.wikimedia.org/math/6/d/b/6db065218c956a4a7af6da99aaeca5d1.png" height="44" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Wikipedia.org</td></tr>
</tbody></table>
It is just "Q" by the accumulated distance of the path so far.<br />
<br />
Finish! That is all we need to start the implementation!<br />
<br />
The whole thing works like this:<br />
<ol>
<li>Initialize the best distance to infinity</li>
<li>Choose a random vertex in the graph at which to plant your ant</li>
<li>Let the ant work out its path using the formulas from above</li>
<ol>
<li>Let the ant update the pheromones on our graph</li>
</ol>
<li>If the worker returned with a new best distance, update our currently best distance </li>
<li>Repeat from 2. until we have found our best path or we have reached our maximum number of workers.</li>
<li>Output the best distance </li>
</ol>
The single worker will now return with the path it has taken and the corresponding distance. Now you can decide whether you are confident with the result, or let more workers calculate further paths.<br />
<br />
So let's see how we can multithread this.<br />
<br />
<b>How to Multithread</b><br />
<br />
In this case, multithreading is very easy. Each worker unit (in my repository called "Agent") is a single thread. In Java we have a cool thread pool construct called <i>ExecutorService </i>and a completion service which tells us when workers have finished.<br />
<br />
We submit a few workers to the pool; they work on the same resources, and once one has completed we get a response from the completion service.<br />
<br />
Woot? Shared resources? Yes, we need some sort of synchronization, in our case when writing to the pheromone matrix. In my latest implementation I used lock-free updates via Guava's AtomicDouble.<br />
<br />
The whole algorithm does not change; we are just working in parallel.<br />
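Here is a hedged sketch of that wiring; the Agent constructor arguments and the WalkedPath result type are made up for illustration, and it assumes Agent implements Callable&lt;WalkedPath&gt;:<br />
<br />
<pre class="brush: java">double runAgents(int numberOfAgents, int numberOfThreads) throws Exception {
  ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
  ExecutorCompletionService<WalkedPath> completionService =
      new ExecutorCompletionService<WalkedPath>(pool);

  // submit the agents, they all work on the same shared pheromone matrix
  for (int i = 0; i < numberOfAgents; i++) {
    completionService.submit(new Agent(graph, pheromoneMatrix));
  }

  double bestDistance = Double.MAX_VALUE;
  for (int i = 0; i < numberOfAgents; i++) {
    // take() blocks until the next agent has finished its tour
    WalkedPath path = completionService.take().get();
    if (path.getDistance() < bestDistance) {
      bestDistance = path.getDistance();
    }
  }
  pool.shutdown();
  return bestDistance;
}
</pre><br />
The nice property of the completion service is that results arrive in the order the agents finish, not in the order they were submitted.<br />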
<br />
<b>The Result</b><br />
<br />
After a lot of parameter testing you can find a very good solution:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQFxI_fIobCrnFngZozFbZm5pEaHogsx32IHykjDBQceyiEZm4vn9jKSF4k9PMtaeNYklyePBidY_uZ3wY2h8-xk_c3_PxfOkEBRg1dh6oZ5LiV43NVm4ofQb8k58LQ_ufxyn7XGXlevw/s1600/ant.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQFxI_fIobCrnFngZozFbZm5pEaHogsx32IHykjDBQceyiEZm4vn9jKSF4k9PMtaeNYklyePBidY_uZ3wY2h8-xk_c3_PxfOkEBRg1dh6oZ5LiV43NVm4ofQb8k58LQ_ufxyn7XGXlevw/s1600/ant.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Console output of my algorithm, finding a very close solution to the optimal distance for Berlin52</td></tr>
</tbody></table>
<br />
<b>Parameter Testing</b><br />
<br />
Because there are a hell of a lot of parameters (5) in there, we have to write a testing utility which calculates the best parameters for our problem (aka GridSearch).<br />
<br />
We only care about the distance to the optimal solution:<br />
In the grid search we only want to keep the lowest possible mean distance to the optimal solution (measured over multiple repetitions) along with a very low variance.<br />
<br />
For Berlin52, the best parameters I found using 200k agents are:<br />
<br />
Alpha = 0.1<br />
Beta = 9<br />
Q = 1e-4<br />
P = 0.8<br />
<br />
So feel free to check out the code in the repository and let it run.<br />
And please feel free to use this as a hint for your own implementation, or even improve it via a pull request.<br />
<br />
Thanks for your interest and have fun :-)<br />
Bye! <br />
<br />
<b>Repository</b> <a href="https://github.com/thomasjungblut/antcolonyopt">https://github.com/thomasjungblut/antcolonyopt</a>Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com35tag:blogger.com,1999:blog-7181711759016870742.post-70799889877473862262011-08-20T12:36:00.001+02:002011-10-04T20:25:55.644+02:00Apache Hama Partitioning ImprovedHey guys,<br />
<br />
my work phase is over and I'm back to university next week, so I have a bit more time to write here. <br />
<br />
Yesterday I spent the first free time on generating random data for Apache Hama / Apache Whirr and for my pagerank algorithm. <a href="https://issues.apache.org/jira/browse/HAMA-420">HAMA-420</a> is the related issue on Jira.<br />
<br />
After a lot of stress with the broken DMOZ file, I noticed that partitioning takes far too long for a few (1-3) million vertices. I already knew this issue, because it always took nearly 10 minutes to partition the sequencefile (1.26 GB) of my SSSP (Single Source Shortest Paths) algorithm.<br />
<br />
<b>Where were the problems?</b><br />
<br />
Actually, there were two major problems I saw.<br />
<ul>
<li> Structure of the in- and output File</li>
<li> Zlib Compression</li>
</ul>
<br />
So first off, the structure; for SSSP it looked like this:<br />
<br />
<pre>K / V
Vertex[Text] / AdjacentVertex : Weight [Text]
</pre>
<br />
Every line contained one(!) vertex and its adjacent vertex. Some of you might notice that this is quite verbose: yes it is!<br />
We used this structure for both the input and the partitioned files.<br />
<br />
The second thing is the standard compression that comes with SequenceFiles: it is the zlib codec, which you have to explicitly turn off. Otherwise compression takes nearly 80% of the writing time without any real gain in file size.<br />
<br />
<b>How to deal with it?</b><br />
<br />
These two problems were quite easy to solve. The first thing was to change the structure and put all adjacent vertices along with their origin vertex into one line of the input textfile.<br />
<br />
To stay quite general about inputs I stuck with textfiles (they are very common and readable in every editor) and write directly into a sequencefile whose key/value pair is composed of a vertex and an array of vertices.<br />
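A hedged sketch of creating such a sequencefile writer with the compression explicitly turned off (Hadoop 0.20 API); the key/value classes are just placeholders for the vertex and the array of adjacent vertices, not the actual classes:<br />
<br />
<pre class="brush: java">SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
    VertexWritable.class, VertexArrayWritable.class,
    SequenceFile.CompressionType.NONE);
writer.append(vertex, adjacentVertices);
writer.close();
</pre><br />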
<br />
<b>Processing</b><br />
<br />
The new architecture works on the input textfile, reads each line and passes it into an abstract method.<br />
So everyone can implement their own input-textfile parser and put the data into the vertex the BSP needs. The vertex then gets directly partitioned via hashing and written into the sequencefile of the groom.<br />
<br />
Later on we have to extend this with various things: compression, input formats, record readers and other ways of partitioning.<br />
<br />
The groom will then just read the sequencefile objects and put them into RAM. So no more text deserialization has to be done: CHEERS!<br />
<br />
<b>Comparision new vs old</b><br />
<br />
The result is overwhelming! See it in my Calc Sheet:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7EmCB9An-tStAziXWvyH-baCMlBQXEuCRNBcHvVWOMO63oiIveaiKvNf5yB5rbQPWnWN_cs10GzS1gtce57ahtY3zTEPV9NQRDt7dIvSL-hmviWGUbaZ3nb2aJHne00jezk8wFtTJQo/s1600/improvement.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc7EmCB9An-tStAziXWvyH-baCMlBQXEuCRNBcHvVWOMO63oiIveaiKvNf5yB5rbQPWnWN_cs10GzS1gtce57ahtY3zTEPV9NQRDt7dIvSL-hmviWGUbaZ3nb2aJHne00jezk8wFtTJQo/s1600/improvement.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Improvement of Apache Hamas Example Partitioning</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
This is just a totally sick improvement. It gets committed as <a href="https://issues.apache.org/jira/browse/HAMA-423">HAMA-423</a>.<br />
<br />
<br />
Thanks and Bye!Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com0tag:blogger.com,1999:blog-7181711759016870742.post-65508820136641219472011-07-09T12:51:00.000+02:002011-07-09T12:51:33.916+02:00Dealing with "OutOfMemoryError" in HadoopHey all,<br />
<br />
some of you may have seen some kind of OutOfMemoryError in your Hadoop jobs, which looks like this:<br />
<blockquote><pre><b>java.lang.OutOfMemoryError:
unable to create new native thread</b>
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)</pre></blockquote><br />
This is the theme of today's blog post: I'll be writing about what causes this to happen and how to solve it.<br />
<br />
As it is currently July 2011, I refer to Hadoop 0.20.2 since this is the latest stable release.<br />
<br />
<b>The Cause</b><br />
<br />
I've seen many OutOfMemoryErrors in my short life. But this error isn't a "my JVM ran out of heap" error. You'll usually see the actual cause deeper in the stacktrace; in my case it was <i>FileSystem.create()</i>. So this error is causing an OOM (I'll use that as shorthand for OutOfMemory), but not in your JVM: the process that gets started by Hadoop to execute your task can't allocate more memory on your host system.<br />
<br />
What does <i>FileSystem.create()</i> do to make your host system run out of memory?<br />
Well, I laughed at it first, too. It was setting the permissions with CHMOD.<br />
<br />
The first time I saw this error in the logs, I googled it. I came across a blog post which rants about several concerns with Hadoop that I run into every day (e.g. zombie tasks, XCievers), but it tells the truth.<br />
<br />
Let's see what they wrote on this kind of error:<br />
<br />
<blockquote><b>Terrible with memory usage</b><br />
<br />
We used to have problems with Hadoop running out of memory in various contexts. Sometimes our tasks would randomly die with out of memory exceptions, and sometimes reducers would error during the shuffle phase with "Cannot allocate memory" errors. These errors didn't make a lot of sense to us, since the memory we allocated for the TaskTrackers, DataNodes, and tasks was well under the total memory for each machine.<br />
<br />
We traced the problem to a sloppy implementation detail of Hadoop. It turns out that Hadoop sometimes shells out to do various things with the local filesystem. <br />
[...]</blockquote>Source: <a href="http://tech.backtype.com/the-dark-side-of-hadoop">http://tech.backtype.com/the-dark-side-of-hadoop</a><br />
<br />
What I've seen is that the <i>RawLocalFileSystem</i> is going to create a file on disk and set the permission on it with a given FsPermission object, which represents an equivalent call to chmod "XXX" on your system's shell. <br />
In fact, Hadoop is going to launch a process using the ProcessBuilder just to chmod the file it has just created.<br />
<br />
I guess you are now asking yourself why this causes an OOM error. If you haven't already followed the link I provided above, I recommend you do so. <br />
But I'm going to clarify this a bit more. <br />
<br />
The ProcessBuilder in Java under Linux will fork a child process. This child process initially occupies the same amount of memory as its parent process did. So if, for example, you configured your Hadoop job to allocate 1G of heap per task, you'll temporarily end up using 2G on your host system when calling <i>FileSystem.create()</i>. <br />
<br />
Let me explain a bit more about fork() in Linux. <br />
When calling fork(), Linux sets up a new task structure which is a full copy of the parent process. The process gets a new process id and uses the same memory as its parent process. Everything seems fine, but as soon as one of them modifies or writes something in that memory, the affected memory gets duplicated. This is called <b>copy on write</b>.<br />
<br />
If you're able to read German, I recommend the book "C und Linux" by Martin Gräfe. It is very well explained there (although it is in a C context). <br />
<br />
I had a job which downloads lots (millions) of files and creates a new file for each of them inside a mapper. This is parallelized, so we have multiple tasks per host machine. The funny thing is that each task shells out to chmod the file it just created; if other jobs are going to run at that moment, they simply fail because they cannot allocate enough memory for their JVM. So did the tasks themselves. <br />
<br />
The worst thing is to confuse this with JVM OOMs; those depend on what your task is doing. So if you have a 2GB HashMap to compute things faster in RAM, but your JVM has only 1GB of heap, this is clearly a JVM OOM, since no more heap can be allocated INSIDE the JVM.<br />
The usual fix for that is to increase the heap size of the JVM with -Xmx2G or -Xmx2048M. <b>Don't do this in the case of host OOMs!</b> This will make the problem even worse. Especially in this specific case, a child process will then occupy even more RAM, which probably leads to faster failure.<br />
<br />
Now let's see how to solve this.<br />
<br />
<b>Solve the problem</b><br />
<br />
"tech.backtype.com" is saying, that adding a lot of swap space is going to solve the problem. I am not a fan of lot's of swap space on Hadoop nodes, mainly because there is a high chance that tasks are getting swapped (due to misconfiguration, other processes etc). And JVM's are not constructed to work correctly when swapped. So you can think of really degrading performance once a task gets swapped. I was suprised that they told that everything got far more stable afterwards. Hell they must have the worst problems on their cluster.<br />
<br />
I've backtracked the problem to FileSystem.create(). Actually it was <i>create(FileSystem fs, Path file, FsPermission permission)</i>.<br />
<br />
Actually, this will cause chmod to be forked twice. So the first fix was to use the non-static methods. <br />
But even the methods that don't set an FsPermission, like <i>create(Path path)</i>, will cause a process to be forked anyway.<br />
So you have to call the abstract method in FileSystem directly. It has this signature:<br />
<br />
<blockquote>create(Path f,<br />
FsPermission permission,<br />
boolean overwrite,<br />
int bufferSize,<br />
short replication,<br />
long blockSize,<br />
Progressable progress)</blockquote><br />
The way I prevent Hadoop from forking the process is to set the permission to <i>null</i>. <br />
This will cause a NullPointerException in the NameNode while setting the permission in the distributed filesystem (the permissions you can see in the web frontend). This is because the DFSClient RPCs the NameNode to register the newly created file there.<br />
So you have to patch your NameNode by adding a null check to the setPermission method. <br />
Like this:<br />
<pre class="brush: java">public void setPermission(String src, FsPermission permissions
) throws IOException {
if(permissions == null){
permissions = FsPermission.getDefault();
}
namesystem.setPermission(src, permissions);
}
</pre><br />
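On the client side, the call that avoids the fork then looks roughly like this; a hedged sketch using the create() signature quoted above, with the buffer size, replication and block size just taken from the usual defaults:<br />
<br />
<pre class="brush: java">// pass null as the FsPermission so no chmod process gets forked
FSDataOutputStream out = fs.create(path, null, true,
    conf.getInt("io.file.buffer.size", 4096),
    fs.getDefaultReplication(), fs.getDefaultBlockSize(), null);
</pre><br />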
<b>Note</b> that this is not going to work if you're running Hadoop with lots of different users. I have just one user, which is also the user on the host system. So this won't cause any problems in my case.<br />
<br />
And be aware that adding this nifty "hack" to your code will cause your job to not run on the LocalFileSystem, since that isn't null-safe either.<br />
<br />
Now each task is not going to fork any process at all.Thomas Jungbluthttp://www.blogger.com/profile/07157841886768146088noreply@blogger.com8