Jun 8, 2013

Wiring services together using the actor model

Hey folks,

I had a lot of work to do in my last semester, so I still haven't been able to finish the post about news clustering. The stress will stay constant until my studies are finally over, I guess around the beginning of October. Then I will have more time to explain this topic more thoroughly, of course with lots of pictures ;)

But today I have a bit of time and want to share a smart idea: joining two cool things together, Service Oriented Architectures (SOA) and the Actor Model. We will go through a small definition of both, see why we should join these technologies, and at the end take a quick look at how to implement such an architecture in plain Java.

Why Service Oriented Architectures?


Service Oriented Architecture is a design pattern based on chunking a large software system into smaller, discrete modules called services. The goal is to design services solely based on their functionality and thus decouple them from other services. The result should be an ensemble of services defined only by the simple interfaces through which they offer functionality to the outside world.

A pretty intuitive example is the checkout process in e-commerce systems: it is a large and complicated process that can be chunked into simpler parts. At the beginning of your internet shopping trip you are likely to visit a few products. Retrieving products and their information is a good candidate for a service, because it has defined behaviour and its functionality can be reused very well for other purposes. The corresponding interface could look something like this (the interface name is just for illustration):
 
    // a product information service (interface name chosen for illustration)
    public interface ProductService {
      // could directly retrieve objects from a database
      Product getProduct(long productId);
      // could be a proxy to another service
      Opinions getUserOpinions(long productId);
      // could be a filesystem call
      Images getProductImages(long productId);
    }

For many of you, this might look like a data access object (DAO) that asks an underlying database implementation for the concrete values. But that is not the goal of the service itself: a service defines just the functionality, not how the information is transported (whether there is an RPC call or an Oracle database behind it shouldn't be part of your interface/service design).
Thus the user should never care about the underlying complexity or the implementation of the system. That is similar to Object Oriented Programming, where the same idea naturally leads to polymorphism (multiple implementations of an interface).

But how do services wire up together?

Imagine a computing cluster where a scheduler is a service that looks at the resources in the cluster and makes decisions on where to place your application best. How does it communicate the result of its computation to the next service- say the service that handles allocation of those resources?

There are basically three ways to handle this:
  1. Just call the next service via its interface methods (synchronous call)
  2. Send a message to the other service and continue (asynchronous call)
  3. A supervisor/manager that calls the services one after another (supervised call)
I'm no async evangelist, but I will try to tell you about my experiences and why I think that asynchronous messaging is a much more viable way in highly concurrent scenarios.

Consider the following most simplistic scenario when dealing with services:

[Figure: Super simple service chain]
In this case the services form a simple chain, so calls from Service A can only reach B, and B can only reach C. Clearly, there is no need for anything spectacular- if I were in that situation I would put those services in a list and call them one after another:

  List<Service> services = ... some list ...;
  Result lastResult = null;
  for(Service service : services){
     lastResult = service.call(lastResult);
  }
  // go further down the road with the final service result

This is what is called the Pipeline pattern, because you feed the result of the previous stage into the next one and enhance/filter/modify it along the way. This is a supervised architecture, because you control how the data flows by hardcoding the control flow in the loop above.

But what happens when we want to process requests through the services concurrently?

Now that is where the problems usually begin. The above architecture will work in a concurrent environment without any problems, as long as the code that calls the services in sequence is thread-safe and all the services are designed to be thread-safe. That means that if your service has state (for example our scheduler, which has information about the current cluster resources), it needs to lock all access to it while modifying it. This is not a big deal for someone who has worked with threads before and is familiar with standard libraries like Java's (see for example ReadWriteLock).

However, think of the complexity you are imposing on your software at that moment:
  • Every service needs to handle its own locks and must be thread-safe
  • Even with standard library support you clutter your code with try/finally unlock statements (see the sketch below)
  • Performance is likely to suffer in high-concurrency/high-throughput environments
Overall, this is a complexity nightmare (have you ever tracked down a race condition?) and exactly what we wanted to avoid when choosing a SOA.
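To illustrate the try/finally clutter, here is a minimal sketch of a stateful, thread-safe scheduler service guarded by a ReadWriteLock (the class, field and method names are made up for this example):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// hypothetical stateful scheduler in the synchronous world: every access
// to the shared state has to be guarded explicitly
public class LockingScheduler {

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private long freeMemory;

  public void update(long newFreeMemory) {
    lock.writeLock().lock();
    try {
      freeMemory = newFreeMemory;
    } finally {
      lock.writeLock().unlock();
    }
  }

  public boolean canSchedule(long memoryNeeded) {
    lock.readLock().lock();
    try {
      return freeMemory >= memoryNeeded;
    } finally {
      lock.readLock().unlock();
    }
  }
}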

It just begins to get worse:

[Figure: Slightly more complex service chain]

What do you want to do when Service B locks its state for a long time (e.g. our scheduler just received a big update from a rack that came back online)? Clearly the other services will have to wait, and throughput and responsiveness start to suffer severely. You can spin this even a tick further: what if you're in a distributed environment and Service B just doesn't exist anymore (a server goes down, a network link breaks)? Services A_[1-n] will have to wait until B comes back online and can't do anything else but wait. And note that these are the simplest possible service architectures! In reality your call graph looks much more connected across all services.

All of that is an issue if you're relying on synchronous communication between services. What we need is to decouple the services once more- not in their functionality, but this time in the communication between them.

The Actor Model


The most intuitive way to make asynchronous communication happen is to send a message!
If I want Bob to work on issue X in our bug tracker, I write him an email telling him to have a look at issue X soon. Now Bob can decide on his own when he looks into his mailbox (for example when he is finished with his current task) and also when he wants to start working on issue X. Transferred to computer science: you don't disturb the service while it's doing its job, as you would with locking or interrupts.

The intuition behind the actor model is the same: here Bob would be the actor and the emails would be messages that land in an actor's inbox. Normally we want many more actors that can interact with each other and provide functionality. That's where we come back to services: actors and services both provide functionality/behaviour, and messaging between actors helps us solve the problems of synchronous communication.

While you can use a framework like Akka for the actor model, it is very easy to implement in Java using the standard API:

import java.util.concurrent.LinkedBlockingQueue;

public class SimpleActor<MSG_TYPE> implements Runnable {

  public static interface Service<MSG_TYPE> {

    void onMessage(MSG_TYPE message);

  }

  private final LinkedBlockingQueue<MSG_TYPE> inbox = new LinkedBlockingQueue<>();
  private Service<MSG_TYPE> messageListener;

  public SimpleActor(Service<MSG_TYPE> listener) {
    this.messageListener = listener;
  }

  public void message(MSG_TYPE message) {
    inbox.add(message);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // blocks until we have a new message
        MSG_TYPE take = inbox.take();
        messageListener.onMessage(take);
      } catch (InterruptedException e) {
        // restore the interrupt flag so the loop condition can see it and exit
        Thread.currentThread().interrupt();
      }
    }
  }

}

As you can see, it is super easy to set up a producer/consumer inbox within a thread and use the service as a callback listener. All concurrency and signalling is handled by the underlying inbox implementation, here a LinkedBlockingQueue.

Now your Service can easily implement the callback, with the guarantee that every message that arrives will be processed sequentially (because the run method takes only one message at a time from the queue). So you will never have to worry about explicit locking in your code; you just have to react to the events that happen.
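Wiring it up is just as short. A minimal sketch of starting such an actor on its own thread (the String message type and the echo service are only for illustration):

// e.g. inside a main method
SimpleActor.Service<String> echoService = new SimpleActor.Service<String>() {
  @Override
  public void onMessage(String message) {
    System.out.println("received: " + message);
  }
};

SimpleActor<String> actor = new SimpleActor<>(echoService);
Thread actorThread = new Thread(actor, "echo-actor");
actorThread.start();

// from any other thread: just drop a message into the inbox and move on
actor.message("hello actor");

// shutting down: interrupt the thread, the run() loop will exit
actorThread.interrupt();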

A simplistic and fictitious variant of a scheduler that reacts to such messages could look like this:

    Service<SchedulingEvent> scheduler = new Service<SchedulingEvent>() {
      
      @Override
      public void onMessage(SchedulingEvent message) {
        if(message.isSchedulingMessage()){
          if(cluster.getFreeMemory() > message.memoryNeeded()){
            // tell the allocation actor to run that 
            message(Allocator.class, new Allocation(message));
          }
        } else if(message.isUpdateMessage()){
          cluster.update(message.getFreeMemory());
        }
        
      }
    };

As you can see, the logic is very clean: no locking is needed and you can react to specific events- or ignore them if you don't care about them. In a real-life scenario I would add an ActorManager that helps with messaging by a defined name or class, or you could design actors as singletons and access their messaging methods directly.
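Such an ActorManager can be as small as a static registry. A rough sketch of the idea (this is not the code from my project, just an illustration of class-based routing):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// minimal registry that routes messages to actors by a class address;
// a real implementation would also manage the actor threads and their lifecycle
public class ActorManager {

  private static final ConcurrentMap<Class<?>, SimpleActor<Object>> actors =
      new ConcurrentHashMap<>();

  public static void register(Class<?> address, SimpleActor<Object> actor) {
    actors.put(address, actor);
  }

  public static void message(Class<?> address, Object message) {
    SimpleActor<Object> actor = actors.get(address);
    if (actor != null) {
      actor.message(message);
    }
  }
}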

Let's get back to the problems we had with the synchronous and supervised calls and see if we solved them:

  • Locking complexity
    • Best case: no locking in the service code itself anymore
  • Code clutter
    • Everything looks very clean and is tied to the service's functionality
  • Performance
    • Every service can work at its own pace, no polling is involved
      • What if the inbox fills up faster than the messages can be consumed?
      • Is it really faster?
  • Availability
    • When a service goes down, it is up to the messaging implementation to buffer those messages in secondary storage so they can be retrieved after a crash (see the rough sketch below). But certainly this is now easier to implement and maintain.
Seems we have a few open questions that definitely must be addressed by the engineer. To make a good decision you will need architectural knowledge of how the services interact with each other, but in the end it looks like a very nice model for the communication between services.
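For the availability point, here is a very rough sketch of what such buffering could look like: every message is appended to a journal file before it is handed to the in-memory inbox, so it can be replayed after a crash (String messages only; the file format, replay and error handling are all left out, and the class name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// wraps the in-memory actor inbox with a simple append-only journal
public class JournalingInbox {

  private final SimpleActor<String> actor;
  private final Path journal;

  public JournalingInbox(SimpleActor<String> actor, Path journal) {
    this.actor = actor;
    this.journal = journal;
  }

  public synchronized void message(String message) throws IOException {
    // write the message to disk first, then hand it to the actor
    Files.write(journal, (message + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    actor.message(message);
  }
}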

But what are the clear disadvantages of this actor model?

Of course there is no silver bullet in such a technology. The actor model also has drawbacks; here are a few that I have observed when working with event-driven actor architectures:
  • You have no explicit return values, e.g. if an exception happens you will only be notified long afterwards by a message that comes back
  • Debugging is hell if you don't optimize for readability
The first bullet point is problematic. Another example: what if you want a return value for a query that is part of your service's functionality? It sounds like a huge detour to send messages when all you could do is call a function. Always keep your goal in mind:
Do you want to create a service for functionality? Or do you want to create services that interact with each other? Both are (by definition) service oriented architectures and both can be used in conjunction with each other - choose wisely which one you need to use.
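If you do need a return value for a single query, one workaround is to ship a reply channel inside the request message, so the caller blocks on an answer only where it really has to. Just a sketch with made-up names, not part of the SimpleActor above:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// a hypothetical request that carries its own one-shot reply channel
public class QueryRequest {

  final long productId;
  final BlockingQueue<String> replyChannel = new ArrayBlockingQueue<>(1);

  public QueryRequest(long productId) {
    this.productId = productId;
  }
}

// caller side: send the message, then block until the actor answers
//   QueryRequest request = new QueryRequest(42L);
//   queryActor.message(request);
//   String answer = request.replyChannel.take();
//
// actor side: put the result into the channel that came with the request
//   request.replyChannel.put(computeAnswer(request.productId));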

The second bullet point is something that will drive developers nuts in their daily lives. When writing an actor model, make sure your actors are named according to their use case. Nobody wants to send a message without knowing whose inbox it will reach, so make it clear to which destination you're sending a message.
Something that I have employed to mitigate this was to use class names as addresses and make all services singletons. This makes it possible to write code like this:
            
    // class name based routing
    message(Allocator.class, new Allocation(message));
    // singleton based routing
    message(Allocator.getInstance(), new Allocation(message));
    // singleton based direct messaging, NOTE getInstance() is a
    // convention, not a defined interface!
    Allocator.getInstance().message(new Allocation(message));

People working with that will immediately know that they can click on the class entry in their IDE, get to the implementation fast, and always know where the message will end up.
Still, the amount of scrolling to be done is too damn high! I hope that the IDEs will soon catch up with these paradigms (especially when lambdas and method references arrive with Java 8) and make it easy to navigate to callback/listener methods.


So thank you very much for reading, you've definitely won a cookie for reading the whole article.

Jan 27, 2013

Named Entity Recognition in News Articles

Hello,

It's been a while since the last post. I was buried in work and couldn't really sit down on the weekend to write about named entity recognition in news articles. But today we can finally talk about it.

This post is about a few things:
  1. What is named entity recognition?
  2. How do we model it as a machine learning problem?
  3. What features to extract for our learner?
So let's dive in. If you have taken the Natural Language Processing (NLP) class on Coursera, you will be familiar with the topic already and can skip ahead to the features in the third section.

What is named entity recognition (NER)? 

The easiest explanation is: finding word-level concepts like a location or a person in an unstructured text. Let's say we have the following snippet of text, shamelessly stolen from Wikipedia:
Jim bought 300 shares of Acme Corp. in 2006.
The idea is to tag parts of this sentence with tuples of concepts and their value, such that we get this:
<PERSON, "Jim"> bought 300 shares of <CORP,"Acme Corp"> in 2006.
So we detected Jim as a person and Acme Corp as a corporation in this sentence.
But why do we need this for our news aggregation engine? A very simple assumption: news is about people, what they did and where they did it. A simple example would be:
"David Cameron talks in Davos about the EU"
The topic is clearly the person David Cameron, the action of talking, and a location, namely Davos in Switzerland. Our news aggregation engine needs this to cluster topics together, even if their content is slightly different. We will talk about this in one of the following blog posts.

How do we model this as a machine learning problem?

Basically, it is nothing more than a (multiclass) classification problem: classifying whether a token belongs to the class PERSON, LOCATION, or O, which is none of the former.

The main difference to other NLP tasks is that we need the context of a token, because the meaning of the current token depends on the previous or the following token/class. So let's have a look at another example:
It did not, and most of Mr. Cameron’s European partners do not wish to revisit that fundamental question.
How do we recognize Cameron in this case? There are two clues that are representative for English. First, the "Mr." is a strong indicator that a name follows; second, the "'s" is a strong indicator that the previous token was a name. Prepositions like "of", "to", "in" or "for" are also likely indicators for names or locations. The trick I learned from the NLP class on Coursera was to encode the text as a sequence of unigrams. The previous text would look like this:
most O
of O
Mr. O
Cameron PERSON
's O
So what we do is predict the label of the current unigram by looking at the previous and following unigrams, and maybe also at the previous label. The reason we need to look at the previous label is that names can be composed of first name and surname, like David Cameron. So if the last class was a person, in many cases the current class will also be a person.

So what kind of classifier do we use? I used a self-written version of the Maximum Entropy Markov Model from week four of the NLP class exercises, optimizable with plain Gradient Descent or Conjugate Gradient (or even with the quasi-Newton minimizer supplied in the class). I also wrote some utilities to extract sparse feature vectors as conveniently as in the NLP class.
You can browse some code in my common repository's NER package and see what it looks like to use with the data supplied in the NLP class in my test case.

What features to extract for our learner?

Features are pretty important; they must cover structural as well as dictionary information. Here are my dictionary features:

  • current word
  • last word
  • next word
And for structural features:
  • previous class label
  • current word upper case start character
  • last word upper case start character
  • current word length
  • current word contains only alphabetic characters (1=true or 0=false)
  • next word contains only alphabetic characters (1=true or 0=false)
  • was the last word a preposition
  • was the last word a special character like dot, comma or question mark
  • last word ending with "ing"
  • previous/current/next word's POS tag
POS tags are pretty important, as nouns are more likely to be a person or a location. Other POS tags can also hint at one of those classes via the previous or next word, e.g. verbs are pretty likely to follow a name. All these features are very sparse, so we build a dictionary of all features observed during training time and encode each unigram against it.
For example, the feature prevPosTag could have the value "prevPosTag=NN". Each unigram only activates a handful of such features, so it totally makes sense to encode them in a sparse vector.
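To make the sparse encoding a bit more tangible, here is a rough sketch (not my actual utility code) of how string features like "prevPosTag=NN" can be mapped to indices of a sparse vector:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// builds a dictionary of all string features observed during training and
// encodes a unigram's features as the list of their indices (a sparse vector)
public class FeatureDictionary {

  private final Map<String, Integer> featureToIndex = new HashMap<>();

  // during training: every unseen feature gets the next free index
  public int index(String feature) {
    Integer idx = featureToIndex.get(feature);
    if (idx == null) {
      idx = featureToIndex.size();
      featureToIndex.put(feature, idx);
    }
    return idx;
  }

  // encode one unigram, e.g. ["word=Cameron", "prevPosTag=NN", "upperCaseStart=1"]
  public List<Integer> encode(List<String> features) {
    List<Integer> sparse = new ArrayList<>();
    for (String feature : features) {
      sparse.add(index(feature));
    }
    return sparse;
  }
}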

Now that we have our text encoded as a list of vectors (a sparse vector for each unigram we observe), we can optimize the weights by minimizing a conditional likelihood cost function used in the Markov Model. This learns conditional probabilities between features and outcomes, describing how likely the PERSON class is when we observe a given feature; for math lovers, this can be written as P(class | features). I optimized my model for 1k iterations using Conjugate Gradient and obtained a very low training error of around 5%. To obtain the class for a feature vector we do a Viterbi decoding over the learned weights. The trick here is that you need to encode the feature vector for all the possible classes; only that way can the Viterbi decode the probabilities correctly.

So yeah, that is basically all that named entity recognition is about. The next blog post will most probably be about how to cluster the news together using the information we gathered in this post.

Bye!

Jan 1, 2013

Extracting articles from crawled HTML sites

So welcome to our second part of Building a news aggregation engine!
If you don't know how crawlers work, have a look into the last part of the series: Build a basic crawler.

This time we will talk about how to get the actual raw content of a site. What we humans see on a news page isn't visible to an algorithm, because it only looks at the code, not the actual rendered page. Therefore I will introduce you to a technique called boilerplate removal.

What is this "Boilerplate"?

Boilerplate is everything but the content of the page you are actually looking for. Some people call it "design" or "navigation", and in their first attempts they try to parse websites with XPath expressions to get to the content. But that is not necessary and becomes a real pain once the design changes.

So the lesson learned is (like in the last blog post as well): IGNORE. Yes, you have to ignore the parts you don't want. That doesn't mean we don't use those parts for decisions- in fact we need them to decide whether a block of HTML code belongs to the ignored part or to the content.
Therefore I want to introduce you to the boilerpipe framework. Boilerpipe is written in Java by Christian Kohlschütter and is licensed under Apache 2.0.

It uses a really simple machine learning algorithm (a small decision tree) to classify whether a given block of HTML is content or not. You can read the details in his research paper "Boilerplate Detection using Shallow Text Features", which is a really good read.
In short, it analyses the number of tokens and links in the previous, current and next block of HTML. It is the same idea we will use later in sequence learning, when we deal with Named Entity Recognition in order to get people and events out of an article.
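Just to make the idea graspable, here is a toy version of such a shallow-feature rule (the thresholds are made up; the real boilerpipe classifier is a learned decision tree over more features):

// blocks with many words and few linked words are probably content
public class ToyBlockClassifier {

  public static boolean isContent(int numWords, int numLinkedWords) {
    double linkDensity = numWords == 0 ? 0.0 : numLinkedWords / (double) numWords;
    return numWords > 20 && linkDensity < 0.3;
  }
}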

In Java, you can use it like this:

final BoilerpipeExtractor extractor = ArticleExtractor.getInstance();
// getText throws a checked BoilerpipeProcessingException, so catch or declare it
String text = extractor.getText(html);

It is sooo simple, and you have the content of a webpage in a string. Of course, this works best for news articles, since they are shaped fairly consistently across many news sites- hence the name ArticleExtractor.

But in the real world, if we crawl the web, most of the sites we examine with our crawlers won't be news- so the content in the resulting text string might not be a news article. Maybe it is just the imprint or a privacy statement that looks like an article, but isn't.

Classifying News sites

Since I faced this issue while crawling, I had to train a machine learning algorithm on the output of boilerpipe to detect news accurately.

Here is what I did:
  • Let the crawler run on the top 40 news sites in Germany (hand-seeded list) for 100k sites
  • Write a small Python application that loops over all files and asks me whether each one is news or not
  • After 1k hand-classified items (just 1h of work!), train a classifier.
I found that training a Multilayer Perceptron with a single hidden layer gives around 93% accuracy with a very small number of features; since this is enough for my purposes, I stopped there. But I believe you can get a lot more (99% should really be doable) with ensembling and better features.

But many people don't tell you which features they used, so here are mine:
  • Length of the extracted text
  • URL ends with '/'
  • Length of the extracted title
  • Number of '\n' in the extracted text
  • Text mention of "impressum", "haftung", "agb", "datenschutz", "nutzungsbedingungen" (imprint, liability, terms, privacy, terms of use)
  • Title mention of "impressum", "haftung", "agb", "datenschutz", "nutzungsbedingungen" (imprint, liability, terms, privacy, terms of use)
  • Number of upper case letters in the text
It is a mixture of text-level features that can be expanded and meta features like lengths.
I trained a neural net (7-35-1) with sigmoid activations (and 1.0 as regularization) for 10k epochs with Conjugate Gradient. Here are the averaged results of a 10-fold crossvalidation:
Accuracy: 0.9259259259259259
Precision: 0.9173553719008265
Recall: 0.9823008849557522
F1 Score: 0.9487179487179487
That is pretty good for such simple methods! And I didn't even use HTML features like boilerpipe does ;-)
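For reference, precision, recall and F1 here are computed on the positive (news) class; a tiny sketch of the formulas from the confusion counts:

// tp = true positives, fp = false positives, fn = false negatives of the news class
public static double precision(int tp, int fp) {
  return tp / (double) (tp + fp);
}

public static double recall(int tp, int fn) {
  return tp / (double) (tp + fn);
}

// F1 is the harmonic mean of precision and recall
public static double f1(double precision, double recall) {
  return 2 * precision * recall / (precision + recall);
}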

If you want some data from my crawl: I have ~1500 classified news articles and 16.5k unclassified ones- so if you need a bit of German news data for whatever research, let me know via email!

Congratz! We can now crawl the world wide web and classify news sites very accurately. Our next step will be to develop a named entity recognition engine that allows us to extract the keywords we need from the text in order to group articles efficiently.

Build a basic crawler

So welcome to our first part of Building a news aggregation engine!

This time we talk about how to build a really simple crawler that crawls some sites for us.
A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.
The basic workflow looks like this:
  • Seed some URLs and queue them up
  • Keep a set about what URLs were visited
  • While our queue is not empty (or we reached some maximum amounts of sites)
    • Get the first URL from the queue, put it into the visited set
    • Query the URL, obtain some HTML
    • Extract new URLs from the HTML and queue them up if they are not in the visited set yet
A small Java program could look like this:

String[] seedUrl = new String[]{"http://www.wikipedia.com/"};
final Deque<String> linksToCrawl = new ArrayDeque<>();
final HashSet<String> visited = new HashSet<>();

linksToCrawl.addAll(Arrays.asList(seedUrl));
visited.addAll(Arrays.asList(seedUrl));

int fetches = 0;
while (fetches < 100 && !linksToCrawl.isEmpty()) {
      String urlToCrawl = linksToCrawl.poll();
      // open a connection and parse HTML
      ...
      // loop over all links we found on that page
      for (String outlink : extractedResult.outlinks) {
         if (visited.add(outlink))
            linksToCrawl.add(outlink);
      }
      fetches++;
}

It looks really simple, but let me tell you: it is more difficult than it looks.
Once you have started with it, you wish you never had- the web is ugly. I'm working in the backend team at work and I'm surrounded by a lot of garbage from various data sources, but the web is a whole new level. Just a small excerpt of what you need to deal with:
  • Encoding issues (we will fix them later on in this post)
  • Link expansions (relative vs. absolute URLs vs. JavaScript URLs like void(0); ) 
  • Not parsable stuff like videos or images
So for example, how do you deal with data you can't handle (that doesn't contain HTML)? You IGNORE it. For this purpose I've clamped together a bunch of suffixes that can occur in links; they guard against running into unparsable binary data:

Pattern IGNORE_SUFFIX_PATTERN = Pattern
      .compile(".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf|iso|rm|smil|wmv|swf|wma|zip|rar|gz))$");

So as you can see, I'm guarding against pretty much everything here. Of course this is completely useless if someone does not supply a file-type suffix at all. In that case you will need to look at the stream and check for an html or body tag to verify it is really a website (which is the worst case, because you're wasting bandwidth and time the crawler could use to do something else).
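Hooked into the outlink loop of the crawler snippet above, the guard could be applied like this (a sketch, assuming the same variables as in that snippet):

// inside the outlink loop: skip anything that looks like binary or non-HTML content
for (String outlink : extractedResult.outlinks) {
  if (IGNORE_SUFFIX_PATTERN.matcher(outlink).matches()) {
    continue; // css, images, videos etc. are not worth a fetch
  }
  if (visited.add(outlink)) {
    linksToCrawl.add(outlink);
  }
}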

Something that bugged me for quite a while were encoding issues. As a German, umlauts like öäüß are completely garbled if you read them with the wrong encoding. So most of the time German news look really bad and you can throw them directly into the next trash bin.

I ran across a project of the Mozilla foundation called universalchardet (short for universal charset detector) and its Java descendant called juniversalchardet. It detects encodings with really high accuracy and helps you get the content of your crawl right, just like you would see it in the browser.

In Java you have to obtain the site via streams, so let me show you a small example of juniversalchardet and how to read a stream into a string of HTML with NIO.

    String someURLAsString = "http://www.facebook.com";
    URL url = new URL(someURLAsString);
    InputStream stream = url.openStream();
    String html = consumeStream(stream);
    
  // the helper methods

  // initial buffer size, 1mb is enough for every usual webpage
  private static final int BUFFER_SIZE = 1024 * 1024;

  public static String consumeStream(InputStream stream) throws IOException {
    try {
      // setup the universal detector for charsets
      UniversalDetector detector = new UniversalDetector(null);
      ReadableByteChannel bc = Channels.newChannel(stream);
      // allocate the initial byte buffer of BUFFER_SIZE bytes
      ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
      int read = 0;
      while ((read = bc.read(buffer)) != -1) {
        // let the detector work on the downloaded chunk
        detector.handleData(buffer.array(), buffer.position() - read, read);
        // check if we found a larger site, then resize the buffer
        buffer = resizeBuffer(buffer);
      }
      // finish the detection sequence
      detector.dataEnd();
      // obtain the detected encoding, if null fall back to UTF-8
      String encoding = detector.getDetectedCharset();
      // decode everything read so far with that encoding
      return new String(buffer.array(), 0, buffer.position(),
          encoding == null ? "UTF-8" : encoding);
    } finally {
      if (stream != null) {
        stream.close();
      }
    }
  }
  // basic resize operation when 90% of the buffer is occupied:
  // simply double the current size and copy the buffer over
  private static ByteBuffer resizeBuffer(ByteBuffer buffer) {
    ByteBuffer result = buffer;
    // double the size if we have only 10% capacity left
    if (buffer.remaining() < (int) (buffer.capacity() * 0.1f)) {
      result = ByteBuffer.allocate(buffer.capacity() * 2);
      buffer.flip();
      result.put(buffer);
    }
    return result;
  }

That is actually everything to know about getting HTML from a raw URL.

But how do you extract the outlinks from an HTML page?

Many of you will now go ahead and say: let's compile some RegEx. You will FAIL.
As a computer scientist it is enough to know that HTML is a context-free grammar (Chomsky type 2) while RegEx needs a regular language (type 3) to operate properly. Type 2 languages are way more complex and can't be parsed with a regular expression. So please have a look at the famous rage answer on Stack Overflow, or read the other very informative answers below it, to see why you shouldn't do this. Don't get me wrong: you will find URLs that you can parse with RegEx, but I don't think it is worth the stress. I always use the htmlparser on SourceForge; it is clean, well tested and pretty fast.

To end this post, here is how to extract the outlinks from an HTML page as strings:

static final NodeFilter LINK_FILTER = new NodeClassFilter(
      LinkTag.class);

    Parser parser = new Parser(html);
    NodeList matches = parser.extractAllNodesThatMatch(LINK_FILTER);
    SimpleNodeIterator it = matches.elements();
    while (it.hasMoreNodes()) {
      LinkTag node = (LinkTag) it.nextNode();
      String link = node.getLink().trim();
      // now expand for relative urls and store somewhere
    }

It is as simple as that. How the expansion of relative URLs can be done is another story- but I leave that up to you ;-) Java's URI may help you with that.

So thanks for attending, my next post is about how to extract actual text content (news) from pure HTML code.

Building a news aggregation engine

Hey all,

first off- happy new year!
It has been a few months now since my last blog post. It wasn't just the stress (open source projects, work, studying) that kept me from writing, but more that I hadn't found a really good topic to blog about.

Personally, what I have always found interesting are those news aggregation sites. You know them: Google News, Yahoo News etc. They crawl the web, fetch fresh articles about what happens in the world, group them together and present them. They also require a lot of machine learning and natural language processing- topics that most of my blog posts are already about.

Since my last posts were more about distributing such algorithms, I want to focus a lot more on the application of several algorithms and ideas in order to build up such a news aggregation engine. I know this is a topic you could write several books about, but I think we can build a working application within three to five blog posts. My personal goal is to end up with a small Java application that you just start with a seed of a few URLs, and it groups the incoming pages in realtime- so you can watch the result in an embedded Jetty web application by refreshing the page constantly. I currently have some simple parts of it clamped together, but they don't act as a single application yet, so I will rewrite this for the public ;-)

Here is a rough list of the topics we need to cover in order to plug the system together:

  1. Crawling and extraction of news articles
    1. Detect text encodings while crawling (otherwise you'll get low quality results)
    2. Article Classification (you have to detect that a site contains an article)
    3. Boilerplate Removal (you don't care about the design, so remove it!)
    4. Page Deduplication (archives or mirrors host the same article, we want to sort those out fast)
  2. Topic modelling of news articles
    1. Named Entity Recognition (News are about people and events, so we have to detect them)
  3. Grouping News
    1. Hierarchical clustering with realtime extension

We will use frameworks for some parts of the task and some coursework from online courses, but most things are written by myself.

So stay tuned, I will start with basic crawling in the next post and how to detect encodings efficiently.