Jan 27, 2013

Named Entity Recognition in News Articles

Hello,

been a while since the last post. Was into a lot of work- couldn't really get up on the weekend to write about   named entity recognition in news articles. But today we can finally talk about it.

This post is about a few things:
  1. What is named entity recognition?
  2. How do we model it as a machine learning problem?
  3. What features to extract for our learner?
So let's dive in, if you have taken the Natural Language Processing (NLP) class on Coursera, you will be familiar with the topic already and should start with the features we use in the third paragraph.

What is named entity recognition (NER)? 

The most easiest explanation is to find word level concepts like a location or person in an unstructured text file. Let's say we have the following snippet of text, shamelessly stolen from Wikipedia:
Jim bought 300 shares of Acme Corp. in 2006.
The idea is to tag parts of this sentence with tuples of concepts and their value, such that we get this:
<PERSON, "Jim"> bought 300 shares of <CORP,"Acme Corp"> in 2006.
So we detected Jim as a person and the Acme Corp as a corporation in this sentence.
But how do we need this for our news aggregation engine? A very simple assumption: News are about people, what they did and where they did it. A simple example would be:
"David Cameron talks in Davos about the EU"
The topic is clearly about the person David Cameron- the action of talking and a location namely Davos in Switzerland. This is needed for our news aggregation engine to cluster topics together, even if their content is slightly different. We will talk about this in one of the following blog posts.

How do we model this as a machine learning problem?

Basically, it is nothing else than a (multiclass) classification problem. Namely classifying if a token belongs to the class PERSON, LOCATION- or O which is none of the ones before.

The main difference to other NLP tasks is that we need the context of a token, because the meaning of the current token is depending on the previous or the following token/class. So let's have a look at another example:
It did not, and most of Mr. Cameron’s European partners do not wish to revisit that fundamental question.
How do we recognize Cameron in this case here? There are two cases that are representative for english. First the "Mr." is a strong indicator that there is following a name, second the "'s" is a strong indicator that the previous token was a name. Also prepositions like "of", "to", "in" or "for" are likely indicators for names or locations. The trick that I learned from the NLP class on coursera was the encoding of the text as a sequence of unigrams. The previous text would look like this:
most O
of O
Mr. O
Cameron PERSON
's O
So what we do is predicting the label of the current unigram, by looking at the previous and following unigram and maybe also at the previous label. The reason we need to look at the previous label is that names could be composed of name and surname like David Cameron. So if the last class was a person, in many cases the current class might also be a person.

So what kind of classifier do we use? I used a self written version of the Maximum Entropy Markov Model supplied in week four of the NLP class exercise optimizable with normal Gradient Descent or Conjugate Gradient (or even with the given quasi newton minimizer supplied in the class). Also I written some utilities to extract sparse feature vectors as conveniently as in NLP class.
You can browse some code in my common repository's NER package and see how it looks like to use with the data supplied from NLP class in my testcase.

What features to extract for our learner?

Features are pretty important, they must cover structural as well as dictionary features. Here is my feature set for dictionary features:

  • current word
  • last word
  • next word
And for structural features:
  • previous class label
  • current word upper case start character
  • last word upper case start character
  • current word length
  • current word containing only alphabetic characters  (1=true or 0=false)
  • next word containing only alphabetic characters  (1=true or 0=false)
  • was the last word a preposition
  • was the last word a special character like dot, comma or question mark
  • last word ending with "ing"
  • previous/current/next words POS tag
POS tags are pretty important as nouns are more likely to be a person or location. Also other POS tags might lead to one of those classes by the previous word or next word, e.G. verbs are pretty likely to follow a name. All these features are pretty sparse, thus we are building a dictionary of all possible features we observed during training time and encode a dictionary of features we have seen.
For example the feature prevPosTag could have value "prevPosTag=NN". This is a rare occation as we have a unigram encoded as a feature vector, so it totally makes sense to encode them with a sparse vector.

Now that we have our text encoded as a list of vectors (a sparse vector for each unigram we observe) we can optimize the vectors and their outcome by minimizing a conditional likelyhood costfunction to be used in the Markov Model. This will learn conditional probabilities between features and the outcomes, describing when we are observing a feature- how likely is that the PERSON class occurs, for math lovers this can be described as P( class | features ). I optimized my model for 1k iterations using conjugate gradient and obtained a very low training error of arround 5%. To obtain the class for a feature vector we are doing a viterbi decoding on the learned weights. The trick here is that you need to encode the feature vector for all the possible classes, only that way the viterbi can decode the probabilities correctly.

So yeah, that is basically what it's all that named entity recognition is about. The next blog post will most probably be about how to cluster the news together using the information we gathered through this post.

Bye!

Jan 1, 2013

Extracting articles from crawled HTML sites

So welcome to our second part of Building a news aggregation engine!
If you don't know how crawlers work, have a look into the last part of the series: Build a basic crawler.

This time we will talk about how to get the actual raw content of a site. Something that we humans see on a newspage isn't visible for an algorithm, because it just has to look at the code- not the actual rendered page. Therefore I will introduce you with a technique called Boilerplate removal.

What is this "Boilerplate"?

Boilerplate is everything, but the content of the page you are seeking for. Some people call it: "design" or "nagivation" and in their first attempts they try to parse websites with XPath expressions to get their content. But that is not necessary and an unsual pain if the design really shifts.

So the lesson learned is (like in the last blog post as well): IGNORE. Yes you have to ignore the parts you don't want to have. That doesn't mean that we don't take the parts to make decisions- in fact we need those to decide whether a block of html code belongs to the IGNORE- or the content-part.
Therefore I want to introduce to you the boilerpipe framework. Boilerpipe is written in Java by Christian Kohlschütter and it is licensed with Apache 2.0.

It uses some really simple machine learning algorithm (a small decision tree) to classify whether a given block of html is content or not. You can read the details in his research paper "Boilerplate Detection using Shallow Text Features" which is a really good read.
In short, it analyses the amount of tokens and links in the previous, current and next block of HTML. It is the same idea we will use later in sequence-learning when we deal with Named Entity Recognition in order to get people and events from the article.

In Java, you can use it like this:

final BoilerpipeExtractor extractor = ArticleExtractor.getInstance();
String text = extractor.getText(html);

It is sooo simple and you have the content of a webpage in a string. Of course, this only applies for news articles, since they are shaped really consistent over many news sites therefore the name ArticleExtractor.

But in real world, if we crawl the web, most of the sites we examine with our crawlers won't be news- so the content in the resulting text string might not be a news article. Maybe it is just the imprint or privacy statement that looks like an article, but isn't.

Classifying News sites

Since I faced this issue while crawling, I had to train some machine learning algorithm on the output of boilerpipe to detect news accurately.

Here is what I did:
  • Let the crawler run on the top 40 news sites in germany (hand seeded list) for 100k sites
  • Write a small python application to loop over all files and ask me if it was news or not
  • After 1k hand-classified items (was just 1h of work!), train a classifier.
I found out that training a Multilayer Perceptron with a single hidden layer gives arround 93% accuracy with very small number of features, since this is enough for my purposes I stopped there. But I believe that you can get alot more (99% should be really do-able) with ensembling and better features.

But many people don't tell what kind of features they used, so here are mine:
  • Length of the extracted text
  • URL ends with '/'
  • Length of the extracted title
  • Number of '\n' in the extracted text
  • Text mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)
  • Title mention of "impressum","haftung","agb","datenschutz","nutzungsbedingungen" (imprint, authors etc.)
  • Number of upper case letters in the text
It is a mixture of text level features that can be expanded and meta features like lengths.
I trained a neural net (7-35-1) with sigmoid activations (and 1.0 as regularization) for 10k epochs with Conjugate Gradient. Here are the average results from a 10 fold crossvalidation:
Accuracy: 0.9259259259259259
Precision: 0.9173553719008265
Recall: 0.9823008849557522
F1 Score: 0.9487179487179487
That is pretty good for such simple methods! And I didn't even used HTML features like boilerpipe does ;-)

If you want to have some data of my crawl, I have ~1500 classified news articles and 16.5k unclassified- so if you need a bit of german news data for whatever research: let me know via email!

Congratz! We can now crawl the world wide web and classify news sites very accurately. Our next step will be to develop a named entity recognition engine that allows us to extract the keywords we need from our text in order to group them efficiently.