Thomas Jungblut's Blog: named entity recognition

Showing posts with label named entity recognition. Show all posts

Jan 27, 2013

Named Entity Recognition in News Articles

Hello,

been a while since the last post. Was into a lot of work- couldn't really get up on the weekend to write about named entity recognition in news articles. But today we can finally talk about it.

This post is about a few things:

What is named entity recognition?
How do we model it as a machine learning problem?
What features to extract for our learner?

So let's dive in, if you have taken the Natural Language Processing (NLP) class on Coursera, you will be familiar with the topic already and should start with the features we use in the third paragraph.

What is named entity recognition (NER)?

The most easiest explanation is to find word level concepts like a location or person in an unstructured text file. Let's say we have the following snippet of text, shamelessly stolen from Wikipedia:

Jim bought 300 shares of Acme Corp. in 2006.

The idea is to tag parts of this sentence with tuples of concepts and their value, such that we get this:

<PERSON, "Jim"> bought 300 shares of <CORP,"Acme Corp"> in 2006.

So we detected Jim as a person and the Acme Corp as a corporation in this sentence.
But how do we need this for our news aggregation engine? A very simple assumption: News are about people, what they did and where they did it. A simple example would be:

"David Cameron talks in Davos about the EU"

The topic is clearly about the person David Cameron- the action of talking and a location namely Davos in Switzerland. This is needed for our news aggregation engine to cluster topics together, even if their content is slightly different. We will talk about this in one of the following blog posts.

How do we model this as a machine learning problem?

Basically, it is nothing else than a (multiclass) classification problem. Namely classifying if a token belongs to the class PERSON, LOCATION- or O which is none of the ones before.

The main difference to other NLP tasks is that we need the context of a token, because the meaning of the current token is depending on the previous or the following token/class. So let's have a look at another example:

It did not, and most of Mr. Cameron’s European partners do not wish to revisit that fundamental question.

How do we recognize Cameron in this case here? There are two cases that are representative for english. First the "Mr." is a strong indicator that there is following a name, second the "'s" is a strong indicator that the previous token was a name. Also prepositions like "of", "to", "in" or "for" are likely indicators for names or locations. The trick that I learned from the NLP class on coursera was the encoding of the text as a sequence of unigrams. The previous text would look like this:

most O
of O
Mr. O
Cameron PERSON
's O

So what we do is predicting the label of the current unigram, by looking at the previous and following unigram and maybe also at the previous label. The reason we need to look at the previous label is that names could be composed of name and surname like David Cameron. So if the last class was a person, in many cases the current class might also be a person.

So what kind of classifier do we use? I used a self written version of the Maximum Entropy Markov Model supplied in week four of the NLP class exercise optimizable with normal Gradient Descent or Conjugate Gradient (or even with the given quasi newton minimizer supplied in the class). Also I written some utilities to extract sparse feature vectors as conveniently as in NLP class.
You can browse some code in my common repository's NER package and see how it looks like to use with the data supplied from NLP class in my testcase.

What features to extract for our learner?

Features are pretty important, they must cover structural as well as dictionary features. Here is my feature set for dictionary features:

current word
last word
next word

And for structural features:

previous class label
current word upper case start character
last word upper case start character
current word length
current word containing only alphabetic characters (1=true or 0=false)
next word containing only alphabetic characters (1=true or 0=false)
was the last word a preposition
was the last word a special character like dot, comma or question mark
last word ending with "ing"
previous/current/next words POS tag

POS tags are pretty important as nouns are more likely to be a person or location. Also other POS tags might lead to one of those classes by the previous word or next word, e.G. verbs are pretty likely to follow a name. All these features are pretty sparse, thus we are building a dictionary of all possible features we observed during training time and encode a dictionary of features we have seen.
For example the feature prevPosTag could have value "prevPosTag=NN". This is a rare occation as we have a unigram encoded as a feature vector, so it totally makes sense to encode them with a sparse vector.

Now that we have our text encoded as a list of vectors (a sparse vector for each unigram we observe) we can optimize the vectors and their outcome by minimizing a conditional likelyhood costfunction to be used in the Markov Model. This will learn conditional probabilities between features and the outcomes, describing when we are observing a feature- how likely is that the PERSON class occurs, for math lovers this can be described as P( class | features ). I optimized my model for 1k iterations using conjugate gradient and obtained a very low training error of arround 5%. To obtain the class for a feature vector we are doing a viterbi decoding on the learned weights. The trick here is that you need to encode the feature vector for all the possible classes, only that way the viterbi can decode the probabilities correctly.

So yeah, that is basically what it's all that named entity recognition is about. The next blog post will most probably be about how to cluster the news together using the information we gathered through this post.

Bye!

Jan 1, 2013

Building a news aggregation engine

Hey all,

first off- happy new year!
It has been a few months now since I posted my last blog post. It wasn't just the stress (open source projects, work, study) that prevented me from writing, but more that I've found no really good topic to blog about.

Personally, what I always found to be interesting are these news aggregation sites. You know them: Google News, Yahoo News etc. They are crawling the web, getting fresh articles about what happens in the world, group them together and present them. Also, they require a lot of machine learning and natural language processing- topics that most of my blogposts are already about.

Since my last posts are more about distributing such algorithms, I want to focus alot more on the application of several algorithms and intentions in order to build up such a news aggregation engine. I know this is a topic where you can write several books about, but I think we can build a working application within three to five blog posts. My personal goal would be to derive a small Java application, that you just start with a seed of a few URLs and it is grouping the incoming pages in realtime- so you can watch the result in an embedded jetty web application by refreshing the site constantly. I currently have some simple parts of it clamped together, but they don't act as a single application so I will rewrite this for the public ;-)

Here is a rough topic grind what we need to cover in order to plug the system together:

Crawling and extraction of news articles

Detect text encodings while crawling (otherwise you'll get low quality results)
Article Classification (you have to detect that a site contains an article)
Boilerplate Removal (you don't care about the design, so remove it!)
Page Deduplication (Archieves or mirrors host the same article, we want to sort out those fast)

Topic modelling of news articles

Named Entity Recognition (News are about people and events, so we have to detect them)

Grouping News

Hierarchical clustering with realtime extension

We will use frameworks for some parts of the task, some coursework from online courses, but the most things are written by myself.

So stay tuned, I will start with basic crawling in the next post and how to detect encodings efficiently.

Aug 4, 2012

Set Expansion by Iterative Similarity Aggregation

Hey all,

been a very long time since my last post. I was very busy with Apache Hama, we have graduated from the Apache incubator and call us now a top level project and of course made another release 0.5.0. I think this was just the beginning of a rising open source project and I am really excited to be a part of it.
However, responsibility comes with spending a lot of time. Time is always sparse, especially if you are young and have a lot of stuff to do.

So blog posts like this will be very sparse in the future, but today it should not matter and I will tell you about a semi-supervised learning algorithm with the very phonetic name called SEISA.
SEISA itself is a acronym for "Set Expansion by Iterative Similarity Aggregation", not an invention by me rather than a wrap-up and implementation of the paper I found here by Yeye He and Dong Xin.

What is Set Expansion?

Set expansion is used to find similar words to an a priori given list of items in a given "artificial environment". For example you want to find additional digitial camera brands to a given seed set like {Canon, Nikon} and the algorithm should provide you with additional brands like Olympus, Kodak, Sony and/or Leica.
"Artificial environment" is actually a generic term for a context that you have to provide. In the referenced paper it is majorly the internet (crawled pages) and search query logs.

In a real world scenario, set expansion might be used to create a semantic database of related items, improve relevance of search querys, leverage the performance of other named entity recognition algorithms, or maybe also for SEO things (Dear SEOs, I won't help you on that, even if you pay me a lot of $$$).

Data Model

The main idea behind the algorithm is the data modelling. How do we represent this "artificial environment" (will call it context from now on) I talked about above?
The idea is like in any sophisticated algorithm: as a graph! Actually as a so called bipartite graph.

Example of a bipartite graph from list data

On the left side of this bipartite graph are the so called "candidate terms", the seedset we provide should be a subset of them. The nodes on the right are called the context nodes, they are representing the link between the candidate terms. In the case of the paper, they have crawled the internet for list items and created an edge to a candidate term if it exists in a given list. The neigbours of a context nodes, are the actual list items in this case. In the case of the web, the picture should also contain words that are not digital camera brands, also think about this simple rule: "more data wins". As in every normal graph, you can also weight the edges, for example if you trust the list data from wikipedia more than of craigslist, then you should assign a higher weight the edges between the context nodes and their candidate term nodes.

I have implemented it by a String array and a DoubleMatrix, where the string array is representing the term nodes and the matrix (NxM, where n is the number of entity tokens and m is the number of context nodes) is representing the weight of the edges. In my experience this matrix is always very sparse, even on very small datasets.

Iterative Similarity Aggregation (the algorithm)

Time to describe the algorithm, now we know how we represented the data.
It is rather simply to explain and I don't know why they complicated it so much in their paper.

There are three main methods, first the relevance score computation, the method that filters out newly top ranked items and at last the ranking function.

The relevance score computation is working like this:

Loop over all the terms in your matrix while looping over all terms in your seedset
Get the column vector of your seed element and the column vector of the candidate term
Measure the similarity between both and sum this similarity over all seedset terms, in the paper it is normalized by the length of the seedset, so it is actually the average similarity

What you now get is a vector (length is the number of term nodes you have), and in each index is the average relevance score to the seedset tokens.

Of course you want to pick the highest relevance nodes of this vector, that's where the static thresholding comes in. You pass the algorithm a similarity threshold, this threshold affects how fast the algorithm converges to a certain solution and how much false positives it will have in the end. Of course a higher threshold will yield to less false positives and faster convergence, while the exact opposite will yield to the other extremum. The truth (as always) lies anywhere between, so it is up to you to choose this correctly.

Since the algorithm always converges quite fast, you can experiment with other thresholds and see how it affects your output. Sure we are going to filter by the threshold and get a number of most relevant new candidate nodes.

Now we can enter the main loop and compute the relevance score again for the newly found relevant candidates (this is now called the similarity score). Based on this we are going to rank these two scores with some simple vecor multiplications:

So where does this alpha come from? That is what the paper names as "weighting factor", as you can see it is used to weight the scores. It is mainly used for the convergence of the algorithm, they say, halfing the scores (alpha = 0.5) yields to the best result.

Once we have a ranked scores, we can now filter again by the similarity threshold and get our new candidate terms for the next iteration.

In each iteration we do:

Compute similarity score by newly found candidate terms (in the beginning this is the seed set)
Rank the calculated similarity score against the newly found candidate terms
Apply the static threshold and get new candidate terms

Until it converges: by means the relevant tokens does not change their order anymore- they are equal. Note that you can also equalize on their ranks, I haven't tried it yet but it should converge also quite fast.

At the end you can now optain your expanded set by getting the candidate terms from the last iteration.

Results

Since I'm working in the e-commerce industry, I wanted to try it out to find similar camera brands like in the paper. It works, but it has very noisy result, partly because the algorithm is just iterative and not parallelized thus I can't work on huge datasets and partly because I had no real clean dataset.

But it really worked, I seeded "Canon" and "Olympus" and got all the other camera brands, sadly a few lenses as well.

On colors there are some similar behaviours, I seeded "blue", "red" and "yellow". Funnyly the most relevant token was "X" and "XS", by means that especially t-shirt products have the color in their name. It became very noise afterwards, found Ronaldo jerseys and other fancy clothes, but a bunch of new colors also.

The paper mentions a really nice trick for it: using the internet!

They have collected millions of list items from websites and let the algorithm run, in my opinion in every machine learning algorithm more data wins. The noise on small dataset will be vanished by the sheer size of the data. So I can very much think that they got really good result.

Parallelization strategy

The easiest way to parallelize this is using BSP and Apache Hama (in my opinion).

Let me tell you why:

Algorithm is strongly iterative
Needs communication between a master and the slaves

Here is my strategy to implement this in BSP:

The bipartite graph is split among the tasks, partitioned by the term nodes
A slave only partially computes the similarity between the seed tokens and the candidate terms
A master will be used to put all the similarities back into a single vector, rank it and distribute the newly expanded set again as well as checking convergence.

This is pretty much work, so I guess the follow-up post will take a few weeks. But I promise, I will code this algorithm!

So thank you very much for reading, you can find my sequential code on github, as well as the testcase:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/ner/IterativeSimilarityAggregation.java

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/test/de/jungblut/ner/IterativeSimilarityAggregationTest.java