The Race for Timbuktu: In Search of Africa’s City of Gold by Frank T. Kryza | LibraryThing

I mentioned a little over a year ago that I’m now keeping track of what I read (using LibraryThing). I enter the book, the date I finished it, and a rating out of five stars, but I’ve been thinking for a while that perhaps I should do some reviews as well, especially for the good ones.

I recently read The Race for Timbuktu: In Search of Africa’s City of Gold by Frank T. Kryza, about some British expeditions in the early 1800s. Overall it was definitely interesting, and I was surprised to realize just how recently Europe had zero clue about what lay within most of Africa, given how shortly after these expeditions most of Africa was colonized.

The main flaw of the book was its jumping back and forth between different expeditions in different years, especially in the first half. There are some excellent maps included, but most of them are printed far too small to actually read, so I’ve located some of them online:

  1. Africae nova descriptio by Willem Janszoon Blaeu, from page xix
  2. Africa by Sidney Hall, from page xx

I would also have liked to find the third map from the book; I contacted the British Library (mentioned in the book as the source) for help, but they were not able to find it. I’m sure I could get it eventually, but that was the limit of how much work I really wanted to put into this.

Aside from the maps being too small, I would have appreciated having more maps that showed the kingdoms and cultural groups with their boundaries, as it becomes difficult to follow this well given the number of times and places described.

Comparing NLP APIs for Entity Extraction

Update: a number of people have pointed out some small errors and some additional APIs that I should look at; until I get this post updated, please check out some of the great user comments at the bottom.

As part of a project I’m working on (more on that later), I wanted to be able to take some text (probably in the form of a web page) and get a list of the important entities/keywords/phrases.

It turns out that there are actually quite a few companies that offer a service like this available freely (at least somewhat) through an API, so I set out to try them all out and assess their quality and suitability for my project. Most of these APIs are provided by companies that do various things in the NLP (natural language processing) realm and/or work with large semantic datasets. Many of the APIs provide a variety of information, only some of which is the set of entities that I’m looking for, so they may have good features that are excluded from my narrow comparison.

Using the APIs

To evaluate the APIs I wrote a script to make use of each one (scroll to the bottom to see it in action). They were fairly similar but the code to handle each one is slightly different. In many cases they offered multiple response formats but I opted for XML for each of them which made things simple enough, and I got used to using SimpleXML in PHP. The main difference between them all is simply the XPath expression needed to pick out the entities. For each API I grab the entity, any available synonyms (minus some de-duplication), and the self-reported relevance score for that entity, if available. If not already sorted by that relevance, I sort them.
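As a rough illustration of that per-API XPath step, here is a minimal sketch in Python (the original script was PHP with SimpleXML and is not shown here). The XML shape, the element names, and the `extract_entities` helper are all made up for the example; each real API has its own response format.

```python
import xml.etree.ElementTree as ET

# Hypothetical response shape; every real API uses its own element names.
SAMPLE_XML = """
<results>
  <entity><text>Niger River</text><relevance>0.55</relevance></entity>
  <entity><text>Timbuktu</text><relevance>0.91</relevance></entity>
  <entity><text>Sahara</text></entity>
</results>
"""

def extract_entities(xml_text, entity_path, name_tag, score_tag=None):
    """Return (name, relevance) pairs sorted by relevance, highest first.

    entity_path is the per-API XPath expression that selects entity nodes;
    entities without a score are given 0.0.
    """
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.findall(entity_path):
        name = (node.findtext(name_tag) or "").strip()
        if not name:
            continue
        raw = node.findtext(score_tag) if score_tag else None
        entities.append((name, float(raw) if raw else 0.0))
    # sort() is stable, so entities with equal scores keep document order.
    entities.sort(key=lambda pair: pair[1], reverse=True)
    return entities
```

With this shape, supporting another API would mostly mean passing a different `entity_path` and different tag names.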

An additional issue was that although most of the APIs accepted a URL as input, some required the actual content, in either HTML or plain text. When accepting content from a web page, the service needs to be smart about ignoring web navigation, ads, etc. when determining what is important, and they vary in ability to do that. Alchemy (one of the APIs tested here) also has a web page cleaning API which can be accessed on its own. Results from the Yahoo API were of such low quality that I actually ran the input web page through the web page cleaning API before sending it to Yahoo, and it is those results which are evaluated here.

Most of the analysis here is based off a sample of eight web pages including Wikipedia articles, news articles, and other pages with a lot of text content from a variety of subjects. I have not yet done any analysis of how the quality of the response for each API is affected by the length of the input document.

The APIs

The APIs I tested were, roughly in order of increasing quality:

  1. Yahoo
  2. OpenCalais
  3. BeliefNetworks
  4. OpenAmplify
  5. AlchemyAPI
  6. Evri

Comment on this post to let me know if I’m missing any.

API terms

Most APIs today have limits on both how often they can be called and what you can use them for. Here they are ranked roughly from most to least usable terms and limits:

  • Evri: currently no API limit; essentially no requirements
  • AlchemyAPI: 30,000 calls per day (although more may be available); commercial use is definitely okay
  • OpenCalais: 50,000 calls per day, 4 calls per second; one must display their logo as-is; if you are syndicating the data it needs to preserve their GUIDs; see details
  • BeliefNetworks: 2,000 calls per day, 1 call per second; essentially no requirements
  • Yahoo: 5,000 calls per IP per day; non-commercial use only
  • OpenAmplify: 1,000 “transactions” per day; note that one call is 1-4 transactions depending on the input type and whether you want all or a subset of the output; commercial use is definitely okay

Please note the standard I-am-not-a-lawyer and that this is just a summary. Please read the terms of service yourself.
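For anyone scripting against limits like these, a small client-side throttle helps avoid tripping them. The sketch below is an assumption about how one might do that, not something any of these services require; `CallLimiter` is a hypothetical helper, and the daily counter never resets, so a long-running program would need to create a new limiter each day.

```python
import time

class CallLimiter:
    """Client-side throttle for a per-second and a per-day call limit."""

    def __init__(self, calls_per_second, calls_per_day,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self.calls_per_day = calls_per_day
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.last_call = None
        self.calls_made = 0     # never reset in this sketch

    def wait(self):
        """Block until the next call is allowed, or raise if the quota is spent."""
        if self.calls_made >= self.calls_per_day:
            raise RuntimeError("daily quota used up")
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
        self.calls_made += 1
```

For the OpenCalais numbers quoted above, that would be `CallLimiter(4, 50000)`, with `wait()` called before each request.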

Languages

Although my project is English-only for now, ideally there would be support for other languages. All the samples I used were in English so that is what is being used to evaluate the quality, but here is the full list of what languages each API claims to handle:

  • Yahoo: English
  • OpenCalais: English, French, Spanish
  • BeliefNetworks: a white paper on their website claims that their technology can support multiple languages, but most likely only English is currently supported
  • OpenAmplify: English
  • AlchemyAPI: English, French, German, Italian, Portuguese, Russian, Spanish, Swedish, however some important features are only available for English
  • Evri: English

I have not done any research into similar APIs which are not available for English at all, but if you find one, please let me know and I’ll make a note of it here.

The response from OpenAmplify and AlchemyAPI will also include the language of the input document. For AlchemyAPI this includes 97 languages, not just the ones that the API can handle. If you’re just looking for good language identification, there are other resources for that, some open source. My ancient Language Identification page still has some useful links there.

Number of Entities and Relevance Scores

For the purposes of my project, I consider the APIs which return more entities from the document to be more useful, all else being equal. BeliefNetworks allows you to specify the number of entities returned (supposedly up to 75 but it actually returns 76), and as such always returns that number, which is almost always more than any other API. Yahoo returns up to 20 entities (a limit which isn’t documented), which is often the fewest of any API. Here I list the APIs sorted by number of entities returned from most to least:

  1. BeliefNetworks
  2. AlchemyAPI
  3. OpenAmplify
  4. OpenCalais
  5. Yahoo
  6. Evri

There are a couple of important caveats here, however. This is based off of a very small sample, so other than BeliefNetworks returning the most, the list could be off. Beyond that, OpenCalais has a fairly small limit on text length (100,000 characters, presumably including the HTML tags) and if the input is too long, it returns no entities at all, just an error message. The ranking above excludes those examples. OpenAmplify has a limit of 2.5K, however they just truncate the document instead of failing (although this counts as an additional “transaction”). Oddly, Evri returned an error of “rejected by content filter” for this news article and returned no entities. Evri’s ranking in the list above is unchanged with or without the inclusion of that example.

Relevance Scores

All of the APIs, with the exception of Yahoo’s, include some metric with each entity rating its relevance to the input document. This is important as every user of any of these APIs would most likely want to establish a minimum relevance threshold for actually making use of the entities. The number of entities comparison above is based on no threshold at all; obviously changing the threshold would affect the comparison. AlchemyAPI and OpenCalais use scores from zero to one, however Evri, OpenAmplify, and BeliefNetworks each have their own scale. I haven’t yet done any work to normalize all these scores and I think that most likely the best practice would be to independently determine your own threshold on a per-API basis depending on your own needs.
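A per-API threshold can be as simple as a lookup table. The sketch below assumes entities arrive as (name, score) pairs; the cutoff values are placeholders invented for illustration, not recommendations, and an API with no entry (such as Yahoo, which reports no score) is passed through unfiltered.

```python
# Placeholder cutoffs, one per API, each on that API's own scale.
# These numbers are invented for illustration only.
THRESHOLDS = {
    "AlchemyAPI": 0.3,   # zero-to-one scale
    "OpenCalais": 0.3,   # zero-to-one scale
    "Evri": 10.0,        # hypothetical value on Evri's own scale
}

def filter_by_relevance(api_name, entities, thresholds=THRESHOLDS):
    """Keep (name, score) pairs at or above the cutoff for this API.

    APIs without a configured cutoff are returned unchanged.
    """
    cutoff = thresholds.get(api_name)
    if cutoff is None:
        return list(entities)
    return [(name, score) for name, score in entities if score >= cutoff]
```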

Semantic Links

By semantic links I simply mean that the entities returned have some sort of links or references to additional information about those entities. Although not necessarily required for my project, this may be very useful. Two of the APIs, Evri and AlchemyAPI, include this information when they successfully map a found entity to an entity in their own database. Evri provides a reference to the entity in Evri’s own system, whereas AlchemyAPI links to a variety of other sources: the website for the entity, Freebase, MusicBrainz, and others.

In addition to or instead of these semantic links, Evri, AlchemyAPI, and OpenCalais have their own systems of classification and label entities with things like “Person” and “Religion”. See Evri’s most popular ‘facets’, AlchemyAPI Entity Types, and OpenCalais: Metadata Element Simple Output Value for specifics of each. OpenAmplify is even more basic but provides broad categories such as “Locations” and “Proper Nouns”, and entities may be listed in more than one of these broad categories. Yahoo and BeliefNetworks provide no additional context.

Additional Information

Some of these APIs provide a wealth of information that I disregard entirely but could be useful to others. For example, OpenAmplify returns a lot of information about the sentiment being expressed about each entity, information about the person who authored the document (e.g. gender, education), the style of the document (e.g. slang usage), and actions expressed within the document. OpenCalais also extracts events and facts from a document, as well as other details per entity such as the ticker symbol for entities which are public companies. AlchemyAPI can extract quotations from the document. Note that this is a summary and not a complete list of all the data that these APIs return.

Synonyms vs. Duplicates

The better APIs here, at least as far as I’ll be using them, succeed at recognizing that “Smith” referred to throughout an article is the same as “John Smith” mentioned in the first sentence. I want duplicates minimized, and for each entity to have as many valid names/synonyms as possible. The APIs differ here significantly.

Evri is definitely the best, followed by AlchemyAPI. Unfortunately AlchemyAPI sometimes misdisambiguates (ooh, no results on Google or Bing for that word yet) which results in incorrect synonyms, however that isn’t a huge problem for me. An example is the article I referred to earlier where AlchemyAPI confuses a Canadian military unit for the British monarch it was named after. Yahoo and OpenCalais fall into the middle. OpenAmplify and BeliefNetworks have a fair number of unmerged duplicate entities. For my purposes, I don’t care if the synonyms come from the input document or an external database, which is what Evri and AlchemyAPI probably use.
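To make the “Smith” versus “John Smith” idea concrete, here is a deliberately naive sketch. It is not how any of these APIs work (they presumably use far richer context or external databases); `merge_short_names` is a hypothetical helper that folds a one-word name into a longer name only when exactly one longer name ends with that word. Ambiguous cases are left alone, which avoids the kind of wrong merge described above.

```python
def merge_short_names(names):
    """Map each kept entity name to a list of shorter names merged into it.

    A one-word name is merged into a multi-word name only if exactly one
    multi-word name ends with that word (case-insensitive).
    """
    unique = list(dict.fromkeys(names))          # drop exact repeats, keep order
    multiword = [n for n in unique if len(n.split()) > 1]
    merged = {n: [] for n in multiword}
    for name in unique:
        if name in merged:
            continue
        owners = [m for m in multiword if m.lower().split()[-1] == name.lower()]
        if len(owners) == 1:
            merged[owners[0]].append(name)       # unambiguous: treat as synonym
        else:
            merged[name] = []                    # no match, or ambiguous
    return merged
```

For example, `["John Smith", "Smith", "Jane Doe", "Alice Doe", "Doe"]` merges “Smith” into “John Smith” but keeps “Doe” separate, since it could refer to either Doe.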

Taking a look at each API

Yahoo

This was the only API that I was aware of until recently, and I’ve blogged about it before. The input format is plain text, so since I’m using URLs as input, I have to first extract the text, strip the HTML, and send that. As I mentioned above, the quality was so poor when using web pages as input that the text must first be scrubbed of web page navigation, etc., and I used AlchemyAPI to do that. Even then, the quality was still poor and the API returned things that I would describe more as long phrases than as entities. Given that, not to mention the maximum of 20 entities, and the non-commercial restriction, I don’t see myself making use of this API.
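The “extract the text, strip the HTML” step can be sketched with Python’s standard library alone. This is only the tag-stripping part: it drops script and style content but makes no attempt to remove navigation or ads, which is the harder problem that the web page cleaning API handles. `TextExtractor` and `html_to_text` are names made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks from running together,
    # at the cost of occasionally splitting a word styled mid-way.
    return " ".join(" ".join(parser.parts).split())
```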

OpenCalais

This API also accepted content rather than a URL. The content format must be specified in the API call. I simply retrieved the URL, and passed all of its content (with HTML) on to OpenCalais. They suggest making sure to remove web page navigation, but without me doing this, that didn’t present a problem. What was a problem was the short maximum document length. To actually use OpenCalais you should make sure to truncate documents before making the request, which is work that I haven’t yet implemented myself. Even when results were returned, the overall quality was mediocre.
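Truncating before the request could look something like the sketch below. The 100,000-character default comes from the limit mentioned earlier; cutting at the last space is just one reasonable choice, and `truncate_document` is a hypothetical helper. Note that cutting raw HTML this way can leave a tag unclosed, so stripping the markup first may be safer.

```python
def truncate_document(text, max_chars=100_000):
    """Shorten text to at most max_chars, preferring to cut at a space."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before the limit.
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        cut = max_chars  # no usable space: fall back to a hard cut
    return text[:cut]
```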

The default output format is RDF, which is very verbose and includes a lot more information than I needed. I opted for the Text/Simple format which is actually XML.

For free API users sometimes the response comes back as “server busy” and I experienced this myself sometimes while trying it out.
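A busy response is usually worth retrying after a short pause. The sketch below shows one common pattern, retrying with exponential backoff; it is an assumption about how a client might cope, not anything documented by OpenCalais. `ServerBusy` is a hypothetical exception that the caller’s own request function would raise when it sees a busy response.

```python
import time

class ServerBusy(Exception):
    """Raised by the caller's request function when the API reports it is busy."""

def call_with_retry(request, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request(), retrying on ServerBusy with doubling delays.

    The last failure is re-raised once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return request()
        except ServerBusy:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```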

BeliefNetworks

This API has a simple output, and is somewhat unusual in its functionality. Unlike all the other APIs, where the entities are extracted from the input document, with BeliefNetworks it seems they find entities which are related to the document but not necessarily actually in it. This produces some interesting results that are sometimes good but overall less related than I’d expect, and in one of my examples, completely unrelated and bizarre. Given that, and the frequency of duplicate entities, as mentioned, I would describe the overall quality as mediocre, although usually better than OpenCalais.

OpenAmplify

The most notable feature of OpenAmplify is all the additional information they provide, as described above.

They take input either as a URL or the content itself, and “charge” an additional transaction if you call them using a URL. I used URL input but also tried submitting the HTML, submitting the text (HTML stripped), and submitting the text after putting it through the AlchemyAPI web page cleaning, and in all cases the results were about the same or worse.

OpenAmplify notes that they may not be able to follow all URL redirects (although I didn’t test this with any of the APIs), but this issue can be avoided by following the redirects yourself before making the request. As mentioned earlier, they only look at the first 2.5K of input. They also accept RSS/Atom as input, which is a nice feature.

Although I’ve set up my script to remove duplicates, it currently misses some duplicate entities from OpenAmplify, because the same entity may be listed several times in the response with different relevance scores.
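One way to handle repeats that differ only in score is to key on the normalized name and keep the highest score seen. This is a sketch of that idea, not the fix used in the original script; `dedupe_keep_best` is a hypothetical helper operating on (name, score) pairs.

```python
def dedupe_keep_best(entities):
    """Collapse repeated names, keeping the highest-scoring occurrence.

    Names are compared case-insensitively with whitespace collapsed.
    The result is sorted by score, highest first.
    """
    best = {}
    for name, score in entities:
        key = " ".join(name.lower().split())
        if key not in best or score > best[key][1]:
            best[key] = (name, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)
```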

One problem I found was that the entities returned usually consisted of a single token (one word) which just made them less useful. Overall, the quality was okay, generally better than BeliefNetworks.

AlchemyAPI

Other than the occasional misdisambiguation, AlchemyAPI is quite good.

Evri

Evri’s API is also quite good, with the biggest flaw being that it doesn’t return very many entities.

Overall Quality Summary

Overall, Evri and AlchemyAPI were definitely the best and most suited for my purposes. The quality of Evri’s results was the best across the small sample, although not in all instances, and it didn’t return as many entities as AlchemyAPI. Interestingly, these two APIs are also the two which include semantic links, and they have the fewest restrictions and high API limits.

OpenAmplify and BeliefNetworks are the runners up. OpenCalais fared poorly in my evaluation, but I suspect it would do better when looking at everything else that their API offers. Yahoo’s API unfortunately just wasn’t good enough to use when any of the other APIs are available.

I’m convinced that trying to build a similar service myself is not worth it at all. One thing that I haven’t tried yet is combining these APIs together in some way, although that could potentially improve the results quite a bit.
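One simple way to combine the APIs, offered purely as an untested idea, is to keep only the entities that several of them agree on. The sketch below assumes each API’s results have already been reduced to a list of names; `combine_by_agreement` and the sample data in the usage note are made up for illustration.

```python
from collections import defaultdict

def combine_by_agreement(results_by_api, min_apis=2):
    """Return (name, api_count) for entities found by at least min_apis APIs.

    Names are matched case-insensitively with whitespace collapsed; the
    first spelling seen is the one reported. Sorted by count, then name.
    """
    seen = defaultdict(set)
    display = {}
    for api, names in results_by_api.items():
        for name in names:
            key = " ".join(name.lower().split())
            seen[key].add(api)
            display.setdefault(key, name)
    agreed = [(display[key], len(apis))
              for key, apis in seen.items() if len(apis) >= min_apis]
    return sorted(agreed, key=lambda pair: (-pair[1], pair[0].lower()))
```

For instance, if three APIs return “Timbuktu” and two return “Mali”, both survive with `min_apis=2`, while an entity seen by only one API is dropped. Whether this actually improves quality would need to be measured.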

You can see the script in action (until I take it down) at http://faganm.com/test/get_entities.php?u=[any URL].

geoupdater

geoupdater - does something very similar to what I was working on.

I’ve got my location set in FireEagle, which I update via the site and Dopplr. I also found an app that turns FireEagle into geoRSS which allows me to include the updates in friendfeed and plot my location easily on a map, which I will shortly be adding to my personal homepage.

This tool (via Ogle Earth) will read from FireEagle and post updates to services like Facebook, and via an RSS feed that includes past locations, allows you to pull in updates to friendfeed, etc. Not bad.

more from the Evening Standard: The DIY lunch break

Page 17 features an article about Cook, Eat, and Run. Basically, during your lunch break, you visit their kitchen, learn how to cook a meal, and eat it, all in around an hour. I believe we can design solutions that are parsimonious, win-win-win, etc., and I think this is a great one. It accomplishes:

  • people get out and have a real break for lunch
  • people are eating fresh food
  • people learn how to cook new things. nobody knows how to cook anything any more, and this is a big problem.
  • the hosting company makes money
  • you get to meet new people, and shared participation is the best way to really meet people properly, in my opinion.

In other words, brilliant. I believe there is room for a whole host of other similar ideas, such as perhaps workplaces hiring a chef once a week to organize everyone into making lunch in the company’s kitchen, for instance.

And people call me crazy when they see me slicing meat and vegetables at my desk. But nobody questions how great my sandwiches are ;-)

Gin, Television, and Social Surplus - Here Comes Everybody

Gin, Television, and Social Surplus - Here Comes Everybody - couple of things here I agree with, couple that I disagree with.

To start with, I thought “journalists” calling non-fad web things fads was going to die years ago. The internet is not a fad. Blogging is not a fad. Sharing stuff on the web, clearly not a fad. Lolcats are a fad. Sharing pictures of your cat is not a fad. This is hardly an opinion. Were people still calling radio a fad when it had as much penetration as blogs do today?

Once upon a time in the ancient 1940s, people wondered what humans would do with all the free time afforded by machines doing most of their work, like cleaning, cooking, etc. Try to find someone today who says they have enough free time.

So what happened? Why didn’t we get our free time? Well we did. But the available options with which to occupy our time have increased much faster than our free time has. People no longer need to invent things to do with their free time, they have to spend extra time deciding which things to try to do in their free time, as they can only do a small subset of what they want to do. The difficulty today is to look at the millions of options out there, and eliminate virtually all of them from your life. Realizing that removing things from your life will make it more full; that is tricky.

Saying that you do not have time for something is always a lie. You have time for whatever it is you decide to have time for.

Oh, and to tie that back in, it is nice that an increasing amount of what people are spending their time on is contributing to the public good, such as Wikipedia, as discussed in the article.

Casulo - Your Apartment in a Box « Gaya, Ruang dan Kepelbagaian

Casulo - Your Apartment in a Box « Gaya, Ruang dan Kepelbagaian - this is great. although more furniture than I use ;-)

mySociety » Blog Archive » Please Donate to help us expand TheyWorkForYou

mySociety » Blog Archive » Please Donate to help us expand TheyWorkForYou - mySociety does fantastic work. Unfortunately for me, it is mostly UK- and Eurocentric, but hopefully they’ll get some volunteers from elsewhere over time.

where are you reading this?

so I think I have semi-solved my question of where to post things. now everything I tag with “forfb” on this blog will also show up as a note on Facebook. That way I can post things on one, the other, or both, without having to write them twice. Of course, there will still be comment splitting…

You Are What You Grow

You Are What You Grow - this is a very true and depressing story of the massive implications of the US’s farm bill, which everyone should read. Via del.icio.us.

Kids, the Internet, and the End of Privacy: The Greatest Generation Gap Since Rock and Roll — New York Magazine

Kids, the Internet, and the End of Privacy: The Greatest Generation Gap Since Rock and Roll — New York Magazine - a much-needed article, although I’m still reading it. On the one hand, privacy is now in crisis mode, with everything publicly available; on the other hand, the younger generation is paradigm shifting, and embraces the privacylessness wholeheartedly.

If a tree falls, and it doesn’t have a permalink, did it really happen? How would we know if it did, there’s not even a video of it…

Via Jeremy Zawodny’s linkblog.

Puzzlepieces down again?

argh. I noticed today that my blog appeared to be completely down, in the sense that all the pages were blank.. hmn… the Wordpress admin interface still worked…

Took me about half an hour to track down the problem, which was solved by disabling a plugin that I don’t use anyway. I have no idea why, but can hardly be bothered to figure that out now.

Puzzlepieces

Puzzlepieces - so I think I’ve successfully moved this blog over to its new home… everything should look the same. I believe the post IDs have not changed, which is what should happen… except that it’s not because I’ve done it on purpose, it’s because for some reason I seem to have previously hardcoded them, before I learned the best way to make unique IDs for items in Atom feeds. ah well

Puzzlepieces lives!

So for all of you (~2 people) who complained that Puzzlepieces was dead, I’m happy to say it is back and doing well.

During this between-semesters holiday I’ve finally gotten myself a new host, and have (almost) successfully moved all of my websites over. My University of Waterloo search engine (UWhub) doesn’t work, but that will be fixed once I change all my file request function calls to use CURL instead. There is also a minor problem with Fagan Finder’s older .shtml pages, but I’ll worry about that later.

So, congratulations to me. Also, I’ve finally gotten myself a website, faganm.com, and so I intend to move this blog over to somewhere there instead of hijacking my old website’s domain.

IEBlog : Search in IE7 RC1

IEBlog : Search in IE7 RC1 - they support the newish “referrer” extension - yay. let’s hope everybody else starts supporting this too

OpenSearch Update

I’ve been fairly busy at Microsoft, working, and hanging out with other interns and so I’m way behind on blogging about OpenSearch.

Internet Explorer 7 beta 2 and Firefox Bon Echo are out, both with some degree of OpenSearch support. Both support autodiscovery of Description files. IE7 (not sure about Firefox) supports search results in OpenSearch Response (RSS/Atom) as well as HTML. IE7 (and I suspect Firefox) do not support extended search parameters (those beyond searchTerms, startPage, etc.), but that’s to be expected at this stage.
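For readers who haven’t looked at a Description file, the core of it is a URL template with placeholders such as {searchTerms} and {startPage?}, where a trailing question mark marks the parameter as optional. The sketch below shows roughly how a client might fill one in; the example.com URL and the `fill_template` helper are made up, and this covers only the basic substitution, not the whole specification.

```python
import re
from urllib.parse import quote

def fill_template(template, **params):
    """Substitute {name} and {name?} placeholders in an OpenSearch URL template.

    Values are percent-encoded. A missing optional parameter becomes an
    empty string; a missing required one raises KeyError.
    """
    def replace(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""
        raise KeyError(name)

    # Names may carry a namespace prefix (e.g. geo:lat) for extended parameters.
    return re.sub(r"\{([A-Za-z][A-Za-z0-9:]*)(\??)\}", replace, template)
```

A client that doesn’t understand an optional extended parameter can simply leave it empty, which is why missing extended-parameter support is tolerable at this stage.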

Firefox support is a little odd, in that they also support some odd pseudoOpenSearch format. So please, developers, use real OpenSearch, it’ll work equally well in all readers, not just Firefox.

Firefox’s beta also has support for “search suggestions” when using Google or Yahoo. DeWitt has shown how (see draft document) these suggestions can be implemented in a way that is completely compatible with OpenSearch, without changing the existing format (JSON) at all. And it also opens the door to allowing suggestions themselves in OpenSearch; the Query element is ideal for this purpose.

From a webmaster perspective, the OpenSearch referrer extension (draft) is really great, allowing search sites to see where their searches are coming from. I’ve wanted this for a while, and it’s great to see it happening.

Perhaps more interesting than any of this is moving forward on adding structured data into OpenSearch, and DeWitt’s draft OpenSearch and Microformats is a great step in that direction. Personally I like data to be in XML more directly (rather than embedding it within atom:content, for example), but hopefully that approach can work in tandem, still using microformats. I’ll be looking into it, as I unofficially advise my university on how to create an API for their people search. Others have been looking at this too.

These are just some of the major happenings in OpenSearch. There are a variety of new software libraries, such as in Java and Ruby. An increasing number of organizations are basing their APIs and other things on OpenSearch. A9.com’s listing of OpenSearch providers is now well over 300. It’s hard to believe how far OpenSearch has come and how far it looks like it may go.

Puzzlepieces » tree-climbing vines (April 17, 2005)

Puzzlepieces » tree-climbing vines (April 17, 2005) - yes, I’m linking to a blog post of mine from last year. Most of my posts generate about zero comments, so this one, at 5, is a lot, the last one just added today.

All from people I don’t know, it seems that people are searching for “tree climbing vine” and variations, and I’m ranking quite well for those. So if you have vines like those and want to get rid of them, check that post for comments ;-)

iSpecies.org

iSpecies.org is a mashup for a specific discipline, which is where it’s at ;-) . Not too much to speak of yet, but it’s a good start. How about adding Wikipedia data? Via Tara.

IEBlog : Hello from LA!

IEBlog : Hello from LA! - placeholder until tomorrow when I comment on this

quadspot » waterloo

quadspot » waterloo - looks like there’s now a craigslist clone for my university. finally, this is basically what I’ve been wanting and hoping someone would do.

the website seems to have a ton of school sites, but it all appears to be new. no about page, and I don’t immediately see other websites mentioning it. interesting. via uw.forsale, presumably posted by someone associated with it

all categories have rss feeds, but not search results

Update June 13: SLC comments on this too. And wonderfully, the creator comments on this post and adds feeds for search results :-)

Intro to Twin Peaks

I did see a number of butterflies (not to mention birds) at Twin Peaks, but I didn’t read the description, so I’m not sure if they were ‘mission blue’ butterflies. Considering how incredibly windy it was, I’m surprised that butterflies could stay up there. If I was a little lighter I might have just been blown far away.
