As part of a project I’m working on (more on that later), I wanted to be able to take some text (probably in the form of a web page) and get a list of the important entities/keywords/phrases.
It turns out that there are actually quite a few companies that offer a service like this available freely (at least somewhat) through an API, so I set out to try them all out and assess their quality and suitability for my project. Most of these APIs are provided by companies that do various things in the NLP (natural language processing) realm and/or work with large semantic datasets. Many of the APIs provide a variety of information, only some of which is the set of entities that I’m looking for, so they may have good features that are excluded from my narrow comparison.
Using the APIs
To evaluate the APIs I wrote a script to make use of each one (scroll to the bottom to see it in action). They were fairly similar but the code to handle each one is slightly different. In many cases they offered multiple response formats but I opted for XML for each of them which made things simple enough, and I got used to using SimpleXML in PHP. The main difference between them all is simply the XPath expression needed to pick out the entities. For each API I grab the entity, any available synonyms (minus some de-duplication), and the self-reported relevance score for that entity, if available. If not already sorted by that relevance, I sort them.
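The per-API handling described above can be sketched roughly as follows. This is a Python illustration rather than the PHP/SimpleXML script itself, and the XML layout, element names, and the `extract_entities` helper are all hypothetical; each real API has its own response structure and therefore its own path expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical response. Real element and attribute names differ for every API.
SAMPLE_XML = """
<results>
  <entity relevance="0.41"><text>Ottawa</text></entity>
  <entity relevance="0.93">
    <text>John Smith</text>
    <synonym>Smith</synonym>
    <synonym>smith</synonym>
  </entity>
  <entity><text>Parliament</text></entity>
</results>
"""


def extract_entities(xml_text, entity_path, name_path, synonym_path, relevance_attr):
    """Pull entities, de-duplicated synonyms, and relevance out of one response."""
    root = ET.fromstring(xml_text.strip())
    entities = []
    for node in root.findall(entity_path):
        name = node.findtext(name_path, default="").strip()
        if not name:
            continue
        seen = {name.lower()}  # used to drop case-insensitive duplicates
        synonyms = []
        for syn in node.findall(synonym_path):
            value = (syn.text or "").strip()
            if value and value.lower() not in seen:
                seen.add(value.lower())
                synonyms.append(value)
        raw = node.get(relevance_attr)
        relevance = float(raw) if raw is not None else None
        entities.append({"name": name, "synonyms": synonyms, "relevance": relevance})
    # Highest relevance first; entities without a score go last.
    entities.sort(key=lambda e: (e["relevance"] is None, -(e["relevance"] or 0.0)))
    return entities


entities = extract_entities(SAMPLE_XML, ".//entity", "text", "synonym", "relevance")
```

Only the four path arguments would change from one API to the next, which matches the observation that the XPath expression is the main difference between them.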
An additional issue was that although most of the APIs accepted a URL as input, some required the actual content, in either HTML or plain text. When accepting content from a web page, the service needs to be smart about ignoring web navigation, ads, etc. when determining what is important, and they vary in ability to do that. Alchemy (one of the APIs tested here) also has a web page cleaning API which can be accessed on its own. Results from the Yahoo API were of such low quality that I actually ran the input web page through the web page cleaning API before sending it to Yahoo, and it is those results which are evaluated here.
Most of the analysis here is based off a sample of eight web pages including Wikipedia articles, news articles, and other pages with a lot of text content from a variety of subjects. I have not yet done any analysis of how the quality of the response for each API is affected by the length of the input document.
The APIs I tested were, roughly in order of increasing quality:
- Yahoo: Term Extraction
- OpenCalais: API
- BeliefNetworks: Recommend Concepts
- OpenAmplify: API
- AlchemyAPI: Named Entity Extraction
- Evri: REST API: Get entity network about some text
Comment on this post to let me know if I’m missing any.
Most APIs today have limits on both how much they can be called, and what you can use them for. Here they are ranked roughly by “most usable terms and limits” to least:
- Evri: currently no API limit; essentially no requirements
- AlchemyAPI: 30,000 calls per day (although more may be available); commercial use is definitely okay
- OpenCalais: 50,000 calls per day, 4 calls per second; one must display their logo as-is; if you are syndicating the data it needs to preserve their GUIDs; see details
- BeliefNetworks: 2,000 calls per day, 1 call per second; essentially no requirements
- Yahoo: 5,000 calls per IP per day; non-commercial use only
- OpenAmplify: 1,000 “transactions” per day; note that one call is 1-4 transactions depending on the input type and whether you want all or a subset of the output; commercial use is definitely okay
Please note the standard I-am-not-a-lawyer and that this is just a summary. Please read the terms of service yourself.
Although my project is English-only for now, ideally there would be support for other languages. All the samples I used were in English so that is what is being used to evaluate the quality, but here is the full list of what languages each API claims to handle:
- English, French, Spanish
- A white paper on their website claims that their technology can support multiple languages, but most likely only English is currently supported
- English, French, German, Italian, Portuguese, Russian, Spanish, Swedish; however, some important features are only available for English
I have not done any research into similar APIs which are not available for English at all, but if you find one, please let me know and I’ll make a note of it here.
The response from OpenAmplify and AlchemyAPI will also include the language of the input document. For AlchemyAPI this includes 97 languages, not just the ones that the API can handle. If you’re just looking for good language identification, there are other resources for that, some of them open source. My ancient Language Identification page still has some useful links.
Number of Entities and Relevance Scores
For the purposes of my project, I see the APIs which return more entities from the document to be more useful, all else being equal. BeliefNetworks allows you to specify the number of entities returned (supposedly up to 75 but it actually returns 76), and as such always returns that number, which is almost always more than any other API. Yahoo returns up to 20 entities (which isn’t documented), which is often the fewest of any API. Here I list the APIs sorted by number of entities returned from most to least:
There are a couple of important caveats here, however. This is based on a very small sample, so other than BeliefNetworks returning the most, the list could be off. Beyond that, OpenCalais has a fairly small limit on text length (100,000 characters, presumably including the HTML tags) and if the input is too long, it returns no entities at all, just an error message. The ranking above excludes those examples. OpenAmplify has a limit of 2.5K; however, they just truncate the document instead of failing (although this counts as an additional “transaction”). Oddly, Evri returned an error of “rejected by content filter” for this news article and returned no entities. Evri’s ranking in the list above is unchanged with or without the inclusion of that example.
All of the APIs, with the exception of Yahoo’s, include some metric with each entity rating its relevance to the input document. This is important as every user of any of these APIs would most likely want to establish a minimum relevance threshold for actually making use of the entities. The number of entities comparison above is based on no threshold at all; obviously changing the threshold would affect the comparison. AlchemyAPI and OpenCalais use scores from zero to one; however, Evri, OpenAmplify, and BeliefNetworks each have their own scale. I haven’t yet done any work to normalize all these scores, and I think that the best practice would most likely be to independently determine your own threshold on a per-API basis depending on your own needs.
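A per-API threshold could be applied along these lines. This is only a sketch in Python: the `MIN_RELEVANCE` values are made-up placeholders rather than numbers measured in this comparison, and the entity dictionaries are a hypothetical format.

```python
# Hypothetical minimum relevance per API. These numbers are placeholders, not
# measured values; each API uses its own scale, so each needs its own tuning.
MIN_RELEVANCE = {
    "AlchemyAPI": 0.30,  # scores run from zero to one
    "OpenCalais": 0.30,  # scores run from zero to one
    "Evri": 2.0,         # own scale; the value here is invented
}


def apply_threshold(entities, api_name, min_relevance=MIN_RELEVANCE):
    """Keep only entities at or above the threshold configured for this API."""
    minimum = min_relevance.get(api_name)
    if minimum is None:
        # No threshold configured (for example, Yahoo reports no scores at all).
        return list(entities)
    return [
        e for e in entities
        if e.get("relevance") is not None and e["relevance"] >= minimum
    ]


sample = [
    {"name": "John Smith", "relevance": 0.93},
    {"name": "Ottawa", "relevance": 0.12},
]
kept = apply_threshold(sample, "AlchemyAPI")
unfiltered = apply_threshold(sample, "Yahoo")
```

Keeping the thresholds in one table makes it easy to adjust them independently as you learn how each API's scores behave on your own documents.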
By semantic links I simply mean that the entities returned have some sort of links or references to additional information about those entities. Although not necessarily required for my project, this may be very useful. Two of the APIs, Evri and AlchemyAPI, include this information when they successfully map a found entity to an entity in their own database. Evri provides a reference to the entity in Evri’s own system, whereas AlchemyAPI links to a variety of other sources: the website for the entity, Freebase, MusicBrainz, and others.
In addition to or instead of these semantic links, Evri, AlchemyAPI, and OpenCalais have their own systems of classification and label entities with things like “Person” and “Religion”. See Evri’s most popular ‘facets’, AlchemyAPI Entity Types, and OpenCalais: Metadata Element Simple Output Value for specifics of each. OpenAmplify is even more basic but provides broad categories such as “Locations” and “Proper Nouns”, and entities may be listed in more than one of these broad categories. Yahoo and BeliefNetworks provide no additional context.
Some of these APIs provide a wealth of information that I disregard entirely but could be useful to others. For example, OpenAmplify returns a lot of information about the sentiment being expressed about each entity, information about the person who authored the document (e.g. gender, education), the style of the document (e.g. slang usage), and actions expressed within the document. OpenCalais also extracts events and facts from a document, as well as other details per entity such as the ticker symbol for entities which are public companies. AlchemyAPI can extract quotations from the document. Note that this is a summary and not a complete list of all the data that these APIs return.
Synonyms vs. Duplicates
The better APIs here, at least as far as I’ll be using them, succeed at recognizing that “Smith” referred to throughout an article is the same as “John Smith” mentioned in the first sentence. I want duplicates minimized, and for each entity to have as many valid names/synonyms as possible. The APIs differ here significantly.
Evri is definitely the best, followed by AlchemyAPI. Unfortunately AlchemyAPI sometimes misdisambiguates (ooh, no results on Google or Bing for that word yet) which results in incorrect synonyms, however that isn’t a huge problem for me. An example is the article I referred to earlier where AlchemyAPI confuses a Canadian military unit for the British monarch it was named after. Yahoo and OpenCalais fall into the middle. OpenAmplify and BeliefNetworks have a fair number of unmerged duplicate entities. For my purposes, I don’t care if the synonyms come from the input document or an external database, which is what Evri and AlchemyAPI probably use.
Taking a look at each API
Yahoo’s Term Extraction was the only API that I was aware of until recently, and I’ve blogged about it before. The input format is plain text, so since I’m using URLs as input, I have to first extract the text, strip the HTML, and send that. As I mentioned above, the quality was so poor when using web pages as input that the text must first be scrubbed of web page navigation, etc., and I used AlchemyAPI to do that. Even then, the quality was still poor and the API returned things that I would describe more as long phrases than as entities. Given that, not to mention the maximum of 20 entities and the non-commercial restriction, I don’t see myself making use of this API.
OpenCalais also accepted content rather than a URL, and the content format must be specified in the API call. I simply retrieved the URL and passed all of its content (with HTML) on to OpenCalais. They suggest removing web page navigation first, but skipping that step didn’t cause a problem. What was a problem was the short maximum document length. To actually use OpenCalais you should truncate documents before making the request, which is work that I haven’t yet implemented myself. Even when results were returned, the overall quality was mediocre.
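The truncation step could look something like the Python sketch below. The `truncate_for_api` helper is hypothetical, and it assumes the limit is counted in characters (it may really be bytes). It also ignores the fact that cutting raw HTML can leave an unclosed tag, so stripping or cleaning the markup first would be safer.

```python
def truncate_for_api(content: str, max_chars: int = 100_000) -> str:
    """Shorten content to at most max_chars, preferring to cut at a space."""
    if len(content) <= max_chars:
        return content
    head = content[:max_chars]
    last_space = head.rfind(" ")
    # Cut at the last space so a word is not split; if there is no space at
    # all, fall back to a hard cut.
    return head[:last_space] if last_space > 0 else head


short = truncate_for_api("alpha beta gamma", max_chars=12)
```

The default of 100,000 mirrors the OpenCalais limit mentioned earlier; for an API that truncates on its own, the same helper would just make the cut-off point explicit.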
The default output format is RDF, which is very verbose and includes a lot more information than I needed. I opted for the Text/Simple format which is actually XML.
For free API users sometimes the response comes back as “server busy” and I experienced this myself sometimes while trying it out.
BeliefNetworks has a simple output, and is somewhat unusual in its functionality. Unlike all the other APIs, where the entities are extracted from the input document, BeliefNetworks seems to find entities which are related to the document but not necessarily actually in it. This produces some interesting results that are sometimes good but overall less related than I’d expect, and in one of my examples, completely unrelated and bizarre. Given that, and the frequency of duplicate entities, as mentioned, I would describe the overall quality as mediocre, although usually better than OpenCalais.
The most notable feature of OpenAmplify is all the additional information they provide, as described above.
They take input either as a URL or the content itself, and “charge” an additional transaction if you call them using a URL. I used URL input but also tried submitting the HTML, submitting the text (HTML stripped), and submitting the text after putting through the AlchemyAPI web page cleaning, and in all cases the results were about the same or worse.
OpenAmplify notes that they may not be able to follow all URL redirects (although I didn’t test this with any of the APIs), but this issue can be avoided by following the redirects yourself before making the request. As mentioned earlier, they only look at the first 2.5K of input. They also accept RSS/Atom as input, which is a nice feature.
Although I’ve set up my script to remove duplicates it currently misses removing some duplicate entities from OpenAmplify as the entity may be listed several times in the response but with different relevance scores.
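One way to handle that case is to merge repeated entities and keep the highest score for each. The Python sketch below assumes a hypothetical list-of-dictionaries format, and keeping the maximum is just one possible policy; averaging or summing the scores would be equally defensible.

```python
def merge_duplicates(entities):
    """Collapse entities with the same name (ignoring case), keeping the best score."""
    best = {}
    for entity in entities:
        key = entity["name"].casefold()
        score = entity.get("relevance") or 0.0
        if key not in best or score > (best[key].get("relevance") or 0.0):
            best[key] = entity
    # Re-sort, since merging can change the relative order.
    return sorted(best.values(), key=lambda e: e.get("relevance") or 0.0, reverse=True)


sample = [
    {"name": "Ottawa", "relevance": 3.0},
    {"name": "ottawa", "relevance": 7.5},
    {"name": "Parliament", "relevance": 5.0},
]
merged = merge_duplicates(sample)
```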
One problem I found was that the entities returned usually consisted of a single token (one word) which just made them less useful. Overall, the quality was okay, generally better than BeliefNetworks.
Other than the occasional misdisambiguation, AlchemyAPI is quite good.
Evri’s API is also quite good, with the biggest flaw being that it doesn’t return very many entities.
Overall Quality Summary
Overall, Evri and AlchemyAPI were definitely the best and most suited for my purposes. The quality of Evri’s was the best across the small sample, although not in all instances, and it didn’t return as many entities as AlchemyAPI. Interestingly, these two APIs are also the two which include semantic links, and they have the fewest restrictions and high API limits.
OpenAmplify and BeliefNetworks are the runners up. OpenCalais fared poorly in my evaluation, but I suspect it would do better when looking at everything else that their API offers. Yahoo’s API unfortunately just wasn’t good enough to use when any of the other APIs are available.
I’m convinced that trying to build a similar service myself is not worth it at all. One thing that I haven’t tried yet is combining these APIs together in some way, although that could potentially improve the results quite a bit.
You can see the script in action (until I take it down) at http://faganm.com/test/get_entities.php?u=[any URL].
Not exactly named entity extraction, but we do have an API for extracting semantic classifications for a URL or a piece of text
Ping me if you want to try out the API.
I have used Yahoo’s Term Extraction API before; it was quite good. Why didn’t you include uClassify.com in this list? I have heard about AlchemyAPI but haven’t had the time to use it in a project. The others I haven’t heard about! Thanks for the collection! Bookmarked for later reference!
I really appreciate the nuts and bolts approach of this post.
Very detailed analysis. Thanks.
We tried OpenCalais. The costs are pretty steep when you exceed the 50k limit even though they have some good deals for startups. It is worth checking the pricing models of all offerings, if you are planning to include them in a product.
One thing that surprised me is that they don’t seem to build and use a reference database (of entities). Do you have any idea why?
very interesting comparison, and I have to say this is the first one that is a bit wider that I see. Finally people are doing proper comparisons of the contextual/semantic APIs!
Is there any special reason how come you left out the Zemanta API?
Andraz Tori, CTO at Zemanta
Thanks for the feedback and new info. I wasn’t aware of either uClassify.com or Zemanta, they certainly weren’t excluded intentionally. I will take a look and add them into this analysis if that makes sense.
Dorai, as to why OpenCalais does or doesn’t do anything, you’ll have to bug them ;-). And thanks for the comments on pricing; given the usage I expect for myself in the near future it’s not a huge deal, but in the long run definitely important.
1. EVRI API output can only be stored/kept for 14 days, after which time it must be deleted from your application (section 2.4)
2. The EVRI API may limit/rate limit you without advance notice (section 2.3)
3. No commercial use, resale, etc (section B.5)
4. No storing EVRI API results in a database where others can see the results (section B.4)
I was aware of #1, #2, and #4 (frankly, I expect that any API offering free services may take them away at any time). #1 and #4 don’t matter to me at all so I disregarded them, but that was probably a bad idea since they will affect some others. As to #3, I completely missed that. I will update this today/tomorrow with the info.
Tom Tague from OpenCalais here.
Wow – you covered a lot of territory here and I could probably spend at least as much space responding to some of the points you’ve raised – but I’ll try to stay focused on just a few key items.
First of course is the use case – which you don’t reveal. And of course it’s difficult to evaluate tools without the intended use well understood. Are you “tagging” news? Analyzing a large corpus of documents for network effects?
For example – if your use case is simply for entity extraction then the volume of entities is rarely the goal – but rather a mixture of recall and precision. You can have perfect recall and low precision – or the reverse. The goal is to find the appropriate balance. If your use case requires named (e.g. typed) entities – then tools such as Yahoo should not be in the mix – they are term and not entity extraction engines.
Facts & Events: I was also a little surprised that you stopped at entity/term extraction. Most real world use cases want to understand what’s happening in the text and are heavily dependent on facts and events – an area in which OpenCalais shines. Facts and events reveal the relationships between entities, and make up the core elements of “aboutness,” which are key values / benefits that many use cases for semantic technology seek to derive.
Semantic Links: It appears you missed our connection to the Linked Data Cloud on your “Semantic Links” section. For a growing number of entities that we return – we also return links to a rapidly growing set of Calais-provided, Thomson Reuters information assets that follow the semantic Linked Data standard. These dynamically generated pages also provide relevant ‘sameAs’ links to key resources in the Linking Open Data cloud. You can see these by entering the text of a news article into our demonstration tool at http://viewer.opencalais.com, copy and paste in a news story that features a number of company names, hit submit, and view the extracted entities, facts and events in the left hand rail then expand on the companies, and click on one to find the Calais asset. (For instance, see the Bank of America asset here: http://d.opencalais.com/er/company/ralg-tr1r/e80e12df-622c-3c3e-86dc-a3ffdcc39e25.html) (Traditionally, we have also included sameAs links to DBPedia and Freebase, and those will be back. Right now we are adapting to a new format.)
RDF: While the general developer population may lack familiarity with RDF at this time, as you note you do, developers that work with large textual content sets are moving to learn it now. While a variety of alternate representation ideas are available – RDF is the W3C standard and provides the right transport layer for rich knowledge representation. The text/simple format you chose is designed to support simple tagging / entity extraction use cases and leaves much of the richness extracted behind.
Length: OpenCalais supports entry of text documents up to 100K in length. We’ve found this supports the vast majority of our users well while conserving systems resources.
Usage: We welcome the use of OpenCalais for commercial or non-commercial purposes and allow users to submit up to 50,000 documents per day at no charge. After that we need to discuss some sort of value exchange.
We could probably name several other tools – but you absolutely should include Zemanta in any entity extraction test case.
Again – thanks for putting in the time to compare tools. I’d encourage you to come back and revisit the subject in the future.
excellent post! Great to see an overview of existing approaches, and as others suggest, adding other tools into the survey would be even better.
Your main purpose is “take some text (probably in the form of a web page) and get a list of the important entities/keywords/phrases.” However, I noticed that each tool has slightly different goals. For example, Evri and AlchemyAPI extract Named Entities rather than important keywords. Which of the APIs suits your particular task the best, if you’d assume they were all equally accurate?
Another question, to you and other readers of this post: Do you think the API is a suitable model for distributing such tools?
Really useful post. 🙂 I looked at something similar for use with Twitter term extraction, but at a lower level. Someone mentioned Zemanta above and I think it’s worth looking at, just based on the initial quality of the results it gives. Even though the following 2 don’t have an API they may also be of use to you too
A very thorough experiment indeed. Could it be possible for us to see the data you used and the results each of these APIs returned?
Responding to Tom:
* Regarding my use case, I’ll explain that more in a few weeks or so. I fully realize the semantic difference between term/phrase and entity extraction, but for my purposes the latter is closer but both may work out.
* facts and events: I tried to make clear that OpenCalais has plenty of good features that I was neither making use of nor comparing here. In the future it’s possible I will end up using some of those.
* semantic links: I will take a look at that
* rdf: definitely a good thing that you have RDF, it’s just that I was able to get what I needed more easily using your simple format; sounds like I may need to change that to get the semantic links though
* usage: so I think you’re saying that I’ve got the details right 🙂
* max length: hey, I’m just reporting the numbers ;-)… given that I’m using the service for free, I can’t really complain about it
Responding to Alyona:
In terms of deciding which API to use, I don’t think it’s possible for me to separate out the quality of the APIs from the exact goals of the API. I think what I want is closer to entities than to keywords, and (thus far) I’m more interested in entities that are *in* the document rather than *related* to the document as in BeliefNetworks. What I will probably use is either Evri or AlchemyAPI, using the other as backup when the first one returns a small number of entities.
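That primary-plus-backup idea could be sketched like this in Python. The `primary` and `backup` arguments are hypothetical callables wrapping whichever two APIs are chosen, the `min_entities` cut-off of 5 is an arbitrary placeholder, and returning whichever list is longer is an assumption on top of the plan described above.

```python
def entities_with_backup(url, primary, backup, min_entities=5):
    """Use the primary extractor, falling back when it returns too few entities."""
    results = primary(url)
    if len(results) >= min_entities:
        return results
    backup_results = backup(url)
    # Assumption: prefer whichever service found more entities.
    return backup_results if len(backup_results) > len(results) else results


# Stand-ins for real API wrappers, used only to exercise the function.
def fake_primary(url):
    return ["John Smith"]


def fake_backup(url):
    return ["John Smith", "Ottawa", "Parliament"]


chosen = entities_with_backup("http://example.com/article", fake_primary, fake_backup)
```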
thanks for the links, I will check them out
Thanks for the detailed comparison.
Two major factors in surveying any named entity tagger are of course (1) the accuracy of the results (best measured in terms of Precision and Recall) and (2) the speed (e.g. in MB/s on a reference machine).
Do you have any plans to quantify the quality of the services you tested, in addition to the technical/legal/conceptual comparison? Obviously results will vary a lot depending on text type (e.g. Web page versus newspaper prose).
PS: The Wikipedia page on Named Entity Recognition ( http://en.wikipedia.org/wiki/Named_entity_recognition ) has quite a large collection of evaluation resources as well as free and commercial taggers, which goes beyond listing just those available as Web services (but they can easily be wrapped by SOAP).
Thanks for the detailed comparison.
Apart from important criteria like licensing and API quality, there are also two other very important criteria, namely (1) the linguistics quality (in terms of error rate) of the service and (2) the speed (throughput in MB/s on a reference machine). Do you have plans to evaluate these as well?
Error rate was generally taken into account in my overall opinion of each API, although I wasn’t explicitly talking about linguistics quality. Some of the duplicate entities I describe above include having plural and singular versions of the same entity.
As to speed, it wasn’t an issue for me, but if anyone does such an analysis I’d be happy to refer to it from here.
Thank you Gary
Thank you Tom too. OpenCalais does brilliant work with entities, facts and events.
As a European living and working within a rich world of languages, I would like to bring one more aspect to this discussion. I have personally been looking for a set of tools / APIs where you can do more or less the same as OpenCalais or OpenAmplify can do, but only after you have first translated the text brutally from any source language to a target language with an automated translation (like the Google translation API). I have tested for example web-scraping –> Google translator –> OpenCalais and sometimes I get good results, sometimes not. I expect the results will improve dramatically within 2-3 years because of some methodological improvements that have already been public and in use for some months.
One of the discussants here – Alyona Medelyan – introduced in her PhD a new method for automatic topic indexing (at least for me it was a new method). There you use Wikipedia articles to support entity extraction. I must again congratulate Ms. Alyona on her PhD. It gives a great tool to those of us who want to develop services for smaller languages.
@mfagan just a comment: since you say you are more interested in entities, keywords and concepts mentioned inside the text (in contrast to related keywords), be sure to look at “markup” part of the Zemanta response, not “keywords” (keywords are mix of found & related).
Also if you are looking to get more entities back, just use the “markup_limit” parameter to raise the number of entities returned until you hit your target signal to noise ratio.
You might want to calculate markup_limit depending on length or ‘entity density’ of your documents.
Looking forward to your results and how they compare with other solutions!
Andraz Tori, Zemanta
Maybe you’ll be interested in looking at Apache UIMA incubator project.
Would love for you to know our API for your next comparison test. We are happy to have our technology matched up with any of those mentioned here. Expert System has 20 years and 200 person-years of investment in our semantic processing API. You can see a simple sample of this here –> http://expert1.expertsystem.us.com/essex
Thanks for an excellent overview. If someone is ever going to do an update, it would also be nice to know if any languages other than English are supported.
You can give a look at Complexity Intelligence web platform (http://www.complexityintelligence.com), they offer an artificial intelligence web services platform, including Part of Speech Tagging and Named Entity Recognition through API with Free subscription. A Java and PHP quick start are also included so you can start coding and work, for free, in less than 1 minute.
Excellent work in comparing all the APIs. It was about time someone did, and I think that your work will lead to more detailed comparisons, which is great. I look forward to reading the update. I believe that if you keep up with this, by running and refining the test sets, you could have a reference page that would be tremendously beneficial. Good luck with your project.
This was a great article — one of the few comparisons of its kind and still relevant a year later.
A Quora post which I have linked to this blogpost is:
I’d also like to add Extractiv to the list, which hit the scene in late 2010. Extractiv’s novelty is that it combines NLP with web crawling to provide an integrated solution for transforming the open web into structured semantic metadata. It also provides a REST-based On-Demand API for processing your own documents, like the services mentioned above. Metadata output includes over 150 entity types, synonyms (coreference chains), semantic links (linked entities), and relations (facts and events), and output formats such as JSON and RDF. More including a live demo is at http://www.extractiv.com.