Jan 27, 2013

Named Entity Recognition in News Articles


It has been a while since the last post. I was buried in work and couldn't find the time on the weekends to write about named entity recognition in news articles. But today we can finally talk about it.

This post is about a few things:
  1. What is named entity recognition?
  2. How do we model it as a machine learning problem?
  3. What features to extract for our learner?
So let's dive in. If you have taken the Natural Language Processing (NLP) class on Coursera, you will already be familiar with the topic and can skip ahead to the features we use in the third section.

What is named entity recognition (NER)? 

The simplest explanation: find word-level concepts like a location or a person in unstructured text. Let's say we have the following snippet of text, shamelessly stolen from Wikipedia:
Jim bought 300 shares of Acme Corp. in 2006.
The idea is to tag parts of this sentence with tuples of concepts and their value, such that we get this:
<PERSON, "Jim"> bought 300 shares of <CORP, "Acme Corp"> in 2006.
So we detected Jim as a person and Acme Corp as a corporation in this sentence.
But why do we need this for our news aggregation engine? A very simple assumption: news is about people, what they did and where they did it. A simple example would be:
"David Cameron talks in Davos about the EU"
The topic is clearly about the person David Cameron, the action of talking, and a location, namely Davos in Switzerland. Our news aggregation engine needs this information to cluster topics together, even if their content differs slightly. We will talk about this in one of the following blog posts.

How do we model this as a machine learning problem?

Basically, it is nothing more than a (multiclass) classification problem: classifying whether a token belongs to the class PERSON, LOCATION, or O, which is none of the above.

The main difference to other NLP tasks is that we need the context of a token, because the meaning of the current token depends on the previous or following token/class. So let's have a look at another example:
It did not, and most of Mr. Cameron’s European partners do not wish to revisit that fundamental question.
How do we recognize Cameron in this case? There are two clues that are representative for English. First, the "Mr." is a strong indicator that a name follows; second, the "'s" is a strong indicator that the previous token was a name. Prepositions like "of", "to", "in" or "for" are also likely indicators for names or locations. The trick I learned from the NLP class on Coursera was to encode the text as a sequence of unigrams. The previous text would look like this:
most O
of O
Mr. O
Cameron PERSON
's O
So what we do is predict the label of the current unigram by looking at the previous and following unigrams, and maybe also at the previous label. The reason we need the previous label is that names can be composed of first name and surname, like David Cameron. So if the last class was PERSON, in many cases the current class will also be PERSON.
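To make this concrete, here is a minimal sketch of extracting such context features for one unigram. This is not the original class code; the feature names and the sentence-boundary padding tokens are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class ContextFeatures {

  // extract simple context features for the token at position i;
  // "<S>" and "</S>" pad the sentence boundaries
  public static List<String> extract(String[] tokens, int i, String prevLabel) {
    List<String> features = new ArrayList<>();
    features.add("word=" + tokens[i]);
    features.add("prevWord=" + (i > 0 ? tokens[i - 1] : "<S>"));
    features.add("nextWord=" + (i < tokens.length - 1 ? tokens[i + 1] : "</S>"));
    // the previously predicted class, e.g. PERSON or O
    features.add("prevLabel=" + prevLabel);
    return features;
  }

  public static void main(String[] args) {
    String[] tokens = {"most", "of", "Mr.", "Cameron", "'s"};
    // features for "Cameron", assuming "Mr." was labeled O
    System.out.println(extract(tokens, 3, "O"));
  }
}
```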

So what kind of classifier do we use? I used a self-written version of the Maximum Entropy Markov Model supplied in week four of the NLP class exercises, optimizable with plain gradient descent, conjugate gradient, or even the quasi-Newton minimizer given in the class. I also wrote some utilities to extract sparse feature vectors as conveniently as in the NLP class.
You can browse some code in my common repository's NER package and see in my test case how to use it with the data supplied from the NLP class.

What features to extract for our learner?

Features are pretty important; they must cover structural as well as dictionary features. Here is my feature set for dictionary features:

  • current word
  • last word
  • next word
And for structural features:
  • previous class label
  • current word upper case start character
  • last word upper case start character
  • current word length
  • current word containing only alphabetic characters  (1=true or 0=false)
  • next word containing only alphabetic characters  (1=true or 0=false)
  • was the last word a preposition
  • was the last word a special character like dot, comma or question mark
  • last word ending with "ing"
  • previous/current/next word's POS tag
POS tags are pretty important, as nouns are more likely to be a person or location. Other POS tags on the previous or next word can also point to one of those classes, e.g. verbs are pretty likely to follow a name. All these features are pretty sparse, so we build a dictionary of all features we observed during training time.
For example, the feature prevPosTag could take the value "prevPosTag=NN". Each such value is a rare occasion, as every unigram is encoded as its own feature vector, so it totally makes sense to encode them as sparse vectors.
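As an illustration (a sketch, not the actual utilities from my repository), such a feature dictionary could map every feature string to a stable index of the sparse vector:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureDictionary {

  private final Map<String, Integer> index = new HashMap<>();

  // map a feature string like "prevPosTag=NN" to a stable integer id,
  // adding it to the dictionary on first sight (training time)
  public int idOf(String feature) {
    return index.computeIfAbsent(feature, f -> index.size());
  }

  // encode a unigram's features as the list of active indices
  // of a sparse binary vector
  public List<Integer> encode(List<String> features) {
    List<Integer> active = new ArrayList<>();
    for (String f : features) {
      active.add(idOf(f));
    }
    return active;
  }

  public int size() {
    return index.size();
  }

  public static void main(String[] args) {
    FeatureDictionary dict = new FeatureDictionary();
    System.out.println(dict.encode(Arrays.asList("word=Jim", "prevPosTag=NN")));
  }
}
```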

Now that we have our text encoded as a list of vectors (a sparse vector for each unigram we observe), we can learn the weights by minimizing a conditional likelihood cost function used in the Markov model. This learns conditional probabilities between features and outcomes: given that we observe a feature, how likely is the PERSON class? For math lovers, this can be written as P( class | features ). I optimized my model for 1k iterations using conjugate gradient and obtained a quite low training error of around 5%. To obtain the class for a feature vector, we do a Viterbi decoding on the learned weights. The trick here is that you need to encode the feature vector for all possible classes; only that way can the Viterbi algorithm decode the probabilities correctly.
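For intuition, the conditional probability such a maximum entropy model computes has the usual softmax form, P(c | x) = exp(w_c · x) / sum over all classes c' of exp(w_c' · x). A minimal sketch with made-up weights (one weight vector per class, sparse binary features):

```java
public class MaxEnt {

  // P(class | features) for a sparse binary feature vector, given one
  // weight vector per class: a softmax over the per-class scores
  public static double[] classProbabilities(double[][] weights, int[] activeFeatures) {
    double[] scores = new double[weights.length];
    double sum = 0d;
    for (int c = 0; c < weights.length; c++) {
      double dot = 0d;
      for (int f : activeFeatures) {
        dot += weights[c][f];
      }
      scores[c] = Math.exp(dot);
      sum += scores[c];
    }
    // normalize so the probabilities sum to one
    for (int c = 0; c < scores.length; c++) {
      scores[c] /= sum;
    }
    return scores;
  }

  public static void main(String[] args) {
    // two classes, two features; feature 0 is active
    double[][] weights = {{1.0, 0.0}, {0.0, 1.0}};
    double[] p = classProbabilities(weights, new int[]{0});
    System.out.println(p[0] + " " + p[1]);
  }
}
```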

So yeah, that is basically what named entity recognition is about. The next blog post will most probably be about how to cluster news together using the information we gathered in this post.


Jan 1, 2013

Extracting articles from crawled HTML sites

So welcome to our second part of Building a news aggregation engine!
If you don't know how crawlers work, have a look into the last part of the series: Build a basic crawler.

This time we will talk about how to get the actual raw content of a site. What we humans see on a news page isn't visible to an algorithm, because it only looks at the code, not at the rendered page. Therefore I will introduce you to a technique called boilerplate removal.

What is this "Boilerplate"?

Boilerplate is everything but the content of the page you are looking for. Some people call it "design" or "navigation", and in their first attempts they try to parse websites with XPath expressions to get at the content. But that is not necessary and a real pain once the design shifts.

So the lesson learned is (like in the last blog post as well): IGNORE. Yes, you have to ignore the parts you don't want. That doesn't mean we don't use those parts to make decisions; in fact we need them to decide whether a block of HTML code belongs to the IGNORE part or the content part.
Therefore I want to introduce you to the boilerpipe framework. Boilerpipe is written in Java by Christian Kohlschütter and is licensed under Apache 2.0.

It uses a really simple machine learning algorithm (a small decision tree) to classify whether a given block of HTML is content or not. You can read the details in his research paper "Boilerplate Detection using Shallow Text Features", which is a really good read.
In short, it analyses the number of tokens and links in the previous, current, and next block of HTML. It is the same idea we will use later in sequence learning, when we deal with named entity recognition in order to get people and events from the article.

In Java, you can use it like this:

final BoilerpipeExtractor extractor = ArticleExtractor.getInstance();
String text = extractor.getText(html);

It is sooo simple, and you have the content of a web page in a string. Of course, this only applies to news articles, since they are shaped really consistently across many news sites; hence the name ArticleExtractor.

But in the real world, if we crawl the web, most of the sites our crawlers examine won't be news, so the content in the resulting text string might not be a news article. Maybe it is just the imprint or privacy statement that looks like an article, but isn't.

Classifying News sites

Since I faced this issue while crawling, I had to train a machine learning algorithm on the output of boilerpipe to detect news accurately.

Here is what I did:
  • Let the crawler run on the top 40 news sites in Germany (a hand-seeded list) for 100k sites
  • Write a small Python application to loop over all files and ask me whether each was news or not
  • After 1k hand-classified items (was just 1h of work!), train a classifier.
I found that training a multilayer perceptron with a single hidden layer gives around 93% accuracy with a very small number of features; since this is enough for my purposes, I stopped there. But I believe you can get a lot more (99% should really be doable) with ensembling and better features.

But many people don't tell you what kind of features they used, so here are mine:
  • Length of the extracted text
  • URL ends with '/'
  • Length of the extracted title
  • Number of '\n' in the extracted text
  • Text mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)
  • Title mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)
  • Number of upper case letters in the text
It is a mixture of text-level features that can be expanded, and meta features like lengths.
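A hedged sketch of how these seven features could be assembled into the input vector for the net; the method and field names are illustrative, not my actual code, and the keyword matching is a plain substring check:

```java
import java.util.Arrays;

public class NewsFeatures {

  // German legalese terms (imprint, liability, terms, privacy, ...)
  private static final String[] KEYWORDS =
      {"impressum", "haftung", "agb", "datenschutz", "nutzungsbedingungen"};

  // builds the 7-dimensional feature vector described above;
  // text and title come from the boilerpipe extraction
  public static double[] extract(String url, String title, String text) {
    return new double[] {
        text.length(),
        url.endsWith("/") ? 1d : 0d,
        title.length(),
        count(text, '\n'),
        mentionsKeyword(text.toLowerCase()) ? 1d : 0d,
        mentionsKeyword(title.toLowerCase()) ? 1d : 0d,
        upperCaseLetters(text)
    };
  }

  private static boolean mentionsKeyword(String s) {
    return Arrays.stream(KEYWORDS).anyMatch(s::contains);
  }

  private static int count(String s, char c) {
    int n = 0;
    for (int i = 0; i < s.length(); i++) if (s.charAt(i) == c) n++;
    return n;
  }

  private static int upperCaseLetters(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); i++) if (Character.isUpperCase(s.charAt(i))) n++;
    return n;
  }

  public static void main(String[] args) {
    double[] v = extract("http://example.com/impressum/", "Impressum", "Some Text\nwith lines");
    System.out.println(Arrays.toString(v));
  }
}
```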
I trained a neural net (7-35-1) with sigmoid activations (and 1.0 as regularization) for 10k epochs with conjugate gradient. Here are the average results from a 10-fold cross-validation:
Accuracy: 0.9259259259259259
Precision: 0.9173553719008265
Recall: 0.9823008849557522
F1 Score: 0.9487179487179487
That is pretty good for such simple methods! And I didn't even use HTML features like boilerpipe does ;-)
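As a quick sanity check, the F1 score above is indeed the harmonic mean of the precision and recall values:

```java
public class F1 {

  // F1 is the harmonic mean of precision and recall
  public static double f1(double precision, double recall) {
    return 2 * precision * recall / (precision + recall);
  }

  public static void main(String[] args) {
    // reproduces the cross-validation numbers from above
    System.out.println(f1(0.9173553719008265, 0.9823008849557522));
  }
}
```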

If you want some data from my crawl: I have ~1500 classified news articles and 16.5k unclassified ones, so if you need a bit of German news data for whatever research, let me know via email!

Congratz! We can now crawl the world wide web and classify news sites very accurately. Our next step will be to develop a named entity recognition engine that allows us to extract the keywords we need from our texts in order to group them efficiently.

Build a basic crawler

So welcome to our first part of Building a news aggregation engine!

This time we talk about how to build a really simple crawler that crawls us some sites.
A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.
The basic workflow looks like that:
  • Seed some URLs and queue them up
  • Keep a set about what URLs were visited
  • While our queue is not empty (or we reached some maximum amounts of sites)
    • Get the first URL from the queue, put it into the visited set
    • Query the URL, obtain some HTML
    • Extract new URLs from the HTML and queue them up if they are not in the visited set yet
A small Java program could look like this:

String[] seedUrl = new String[]{"http://www.wikipedia.com/"};
final Deque<String> linksToCrawl = new ArrayDeque<>(Arrays.asList(seedUrl));
final HashSet<String> visited = new HashSet<>(Arrays.asList(seedUrl));

int fetches = 0;
while (fetches < 100 && !linksToCrawl.isEmpty()) {
  String urlToCrawl = linksToCrawl.poll();
  fetches++;
  // open a connection and parse the HTML, yielding extractedResult
  // loop over all links we found on that page
  for (String outlink : extractedResult.outlinks) {
    // queue the link up if it is not in the visited set yet
    if (visited.add(outlink)) {
      linksToCrawl.offer(outlink);
    }
  }
}

It looks really simple, but you know what? It is more difficult than it looks.
Once you've started with it, you'll wish you never had: the web is ugly. I'm working in the backend team at work and I'm surrounded by a lot of garbage from various data sources, but the web is a whole new level. Just a small excerpt of what you need to deal with:
  • Encoding issues (we will fix them later on in this post)
  • Link expansion (relative vs. absolute URLs vs. JavaScript URLs like void(0);)
  • Unparsable stuff like videos or images
So for example, how do you deal with data you can't handle (which doesn't contain HTML)? You IGNORE it. For this purpose I've clamped together a bunch of suffixes that can appear in links; they guard against running into unparsable binary data:
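A minimal sketch of such a guard, with a few illustrative suffixes (not my full list), could look like this:

```java
import java.util.Arrays;
import java.util.List;

public class SuffixGuard {

  // illustrative subset of binary/file suffixes to skip
  private static final List<String> IGNORED_SUFFIXES = Arrays.asList(
      ".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".mp3", ".mp4", ".avi", ".exe");

  // true if the URL looks like it points to unparsable binary data
  public static boolean isIgnored(String url) {
    String lower = url.toLowerCase();
    return IGNORED_SUFFIXES.stream().anyMatch(lower::endsWith);
  }

  public static void main(String[] args) {
    System.out.println(isIgnored("http://example.com/video.MP4"));
    System.out.println(isIgnored("http://example.com/index.html"));
  }
}
```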


So as you can see, I'm guarding against anything here. Of course these checks are completely useless if someone does not supply a file-type suffix. In that case, you will need to get on the stream and look for an html or body tag to verify it really is a website (which is the worst case, because you're wasting bandwidth and time the crawler could use to do something else).

Something that bugged me for quite a while were encoding issues. As a German, umlauts like öäüß are completely garbled if you read them with the wrong encoding. So most of the time, German news looked really bad and you could throw them straight into the next trash bin.

I ran across a project by the Mozilla foundation called universalchardet (short for universal charset detector) and its Java descendant juniversalchardet. It detects encodings with really high accuracy and helps you get the content of your crawl as correct as if you browsed the site yourself.

In Java you have to obtain the site via streams, so let me show you a small example of juniversalchardet and how to read a stream into a string of HTML with NIO.

    String someURLAsString = "http://www.facebook.com";
    URL url = new URL(someURLAsString);
    InputStream stream = url.openStream();
    String html = consumeStream(stream);

  // the helper methods
  public static String consumeStream(InputStream stream) throws IOException {
    try {
      // setup the universal detector for charsets
      UniversalDetector detector = new UniversalDetector(null);
      ReadableByteChannel bc = Channels.newChannel(stream);
      // allocate a byte buffer of BUFFER_SIZE size,
      // 1mb is enough for every usual webpage
      ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
      int read = 0;
      while ((read = bc.read(buffer)) != -1) {
        // let the detector work on the downloaded chunk
        detector.handleData(buffer.array(), buffer.position() - read, read);
        // check if we found a larger site, then resize the buffer
        buffer = resizeBuffer(buffer);
      }
      // finish the detection sequence
      detector.dataEnd();
      // obtain the encoding, if null fall back to UTF-8
      String encoding = detector.getDetectedCharset();
      return new String(buffer.array(), 0, buffer.position(),
          encoding == null ? "UTF-8" : encoding);
    } finally {
      if (stream != null) {
        stream.close();
      }
    }
  }

  // basic resize operation when 90% of the buffer is occupied:
  // simply double the capacity and copy the content over
  private static ByteBuffer resizeBuffer(ByteBuffer buffer) {
    ByteBuffer result = buffer;
    // double the size if we have only 10% capacity left
    if (buffer.remaining() < (int) (buffer.capacity() * 0.1f)) {
      result = ByteBuffer.allocate(buffer.capacity() * 2);
      buffer.flip();
      result.put(buffer);
    }
    return result;
  }

That is actually everything you need to know about getting HTML from a raw URL.

But, how do you extract the outlinks from a HTML page?

Many of you will now go ahead and say: let's compile some RegEx. You will FAIL.
As a computer scientist, it suffices to note that HTML is a context-free grammar (Chomsky type 2), while RegEx can only handle regular languages (type 3). Type 2 languages are more complex and can't be parsed with a regular expression. So please have a look at the famous rage answer on Stack Overflow, or read the other very informative answers below it, to see why you shouldn't do this. Don't get me wrong: you will find URLs that you can parse with RegEx, but I don't think it is worth the stress. I always use the htmlparser on SourceForge; it is clean, well tested and pretty fast.

To end this post, let me show you how to extract the outlinks from an HTML page given as a string:

static final NodeFilter LINK_FILTER = new NodeClassFilter(LinkTag.class);

    Parser parser = new Parser(html);
    NodeList matches = parser.extractAllNodesThatMatch(LINK_FILTER);
    SimpleNodeIterator it = matches.elements();
    while (it.hasMoreNodes()) {
      LinkTag node = (LinkTag) it.nextNode();
      String link = node.getLink().trim();
      // now expand relative URLs and store the link somewhere
    }

It is as simple as that. How the expansion of relative URLs can be done is another story, but I leave that up to you ;-) Java's URI may help you with that.
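As a hint, a minimal sketch of relative URL expansion with java.net.URI; the method name is illustrative, and absolute outlinks pass through unchanged:

```java
import java.net.URI;

public class LinkExpander {

  // expand a possibly relative outlink against the URL of the page
  // it was found on; absolute outlinks are returned unchanged
  public static String expand(String pageUrl, String outlink) {
    return URI.create(pageUrl).resolve(outlink).toString();
  }

  public static void main(String[] args) {
    System.out.println(expand("http://www.example.com/news/index.html", "article.html"));
    // -> http://www.example.com/news/article.html
  }
}
```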

So thanks for attending; my next post is about how to extract the actual text content (news) from pure HTML code.

Building a news aggregation engine

Hey all,

first off- happy new year!
It has been a few months since my last blog post. It wasn't just stress (open source projects, work, studies) that kept me from writing, but rather that I found no really good topic to blog about.

Personally, I have always found news aggregation sites interesting. You know them: Google News, Yahoo News etc. They crawl the web, fetch fresh articles about what happens in the world, group them together and present them. They also require a lot of machine learning and natural language processing, topics that most of my blog posts are already about.

Since my last posts were more about distributing such algorithms, I want to focus a lot more on applying several algorithms and ideas in order to build up such a news aggregation engine. I know this is a topic you could write several books about, but I think we can build a working application within three to five blog posts. My personal goal is a small Java application that you start with a seed of a few URLs, and that groups the incoming pages in real time, so you can watch the result in an embedded Jetty web application by refreshing the page constantly. I currently have some simple parts of it clamped together, but they don't act as a single application yet, so I will rewrite this for the public ;-)

Here is a rough topic grind what we need to cover in order to plug the system together:

  1. Crawling and extraction of news articles
    1. Detect text encodings while crawling (otherwise you'll get low quality results)
    2. Article Classification (you have to detect that a site contains an article)
    3. Boilerplate Removal (you don't care about the design, so remove it!)
    4. Page Deduplication (archives or mirrors host the same article; we want to sort those out fast)
  2. Topic modelling of news articles
    1. Named Entity Recognition (News are about people and events, so we have to detect them)
  3. Grouping News
    1. Hierarchical clustering with a real-time extension

We will use frameworks for some parts of the task, and some coursework from online courses, but most things are written by myself.

So stay tuned, I will start with basic crawling in the next post and how to detect encodings efficiently.