Michael Fagan's blog

egosurfing on delicious.com

Posted on February 19, 2010 by mfagan

Via Alf, I learn that apparently delicious has long supported lookup by domain and path, not just absolute URL.

I took a look through all the bookmarks of my old website, Fagan Finder, and it turned up a few interesting things. The most popular pages, with some commentary:

URLinfo (688). This makes sense because it is (well, was) a useful tool for web developers, bloggers, etc., the typical audience of delicious; however, it is definitely not one of the most popular pages by the site’s own statistics.
All About RSS (568), with bookmarks spread over 34 different URLs. Aside from the bookmarks to specific sections of the page, this shows the results of moving the URL by changing the file extension, although I did use a 301 redirect. This page is no longer one of the popular pages in terms of traffic.
Google Ultimate Interface (377). Aside from the oddity of this not-that-useful page being so popular, what’s interesting is that a long time ago I created a second version that failed to work in Mozilla, so I had it redirect to the old version for those users, who I thought (at the time) would be rare… yet there are almost 6 times as many bookmarks for the Mozilla/Firefox version.
Image Search Engines (322) – finally we come to a page that is actually popular among the general public; it’s also the only page on my old site that I updated “recently”
Translation Wizard (201) – sadly this hardly works any more but I loved the idea and spent an insane amount of time to build this

Another thing I noticed is that a tool I’d once built but never referred to anywhere, which could only be found by going to a tag page on this blog and clicking “more” somewhere in the sidebar, somehow has 10 bookmarks.

Tagged delicious, egosurfing, faganfinder | Comments Off

Announcing Quizify

Posted on February 18, 2010 by mfagan

Back in early 2005 I hacked up a quick web app to help me study for the Arthropod Zoology course I was taking in university. It helped me so much that in 2006 I decided to remake it in a non-ugly and usable way and I demoed it at BarCampWaterloo in 2007.

There it rested, my “current” yet abandoned project until around September 2009 when my friend Ben began to refactor all the code.

Lately I’ve had the time to work on it more seriously. I’ve moved it to Quizify.com and it is now ready for the general public.

Functionality today is fairly simple but quite useful, at least in my opinion :-). Input a URL that includes a definition list (as in a <dl> in HTML) and it creates a flashcard-like quiz with the data.
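As a rough illustration of the idea (not Quizify's actual code), a definition list can be turned into question/answer pairs with Python's standard html.parser module. The class and function names are made up for this sketch, and it assumes each <dt> is followed by at least one <dd>.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collects (term, definition) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.cards = []        # finished (term, definition) pairs
        self._current = None   # "dt", "dd", or None when outside both
        self._term = ""
        self._definition = ""

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._flush()      # a new term ends the previous pair
            self._current = "dt"
        elif tag == "dd":
            self._current = "dd"

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._current = None
        elif tag == "dl":
            self._flush()

    def handle_data(self, data):
        if self._current == "dt":
            self._term += data
        elif self._current == "dd":
            self._definition += data

    def _flush(self):
        # Store the pair gathered so far, but only if it is complete.
        if self._term.strip() and self._definition.strip():
            self.cards.append((self._term.strip(), self._definition.strip()))
        self._term = ""
        self._definition = ""


def flashcards_from_html(html):
    """Return a list of (term, definition) tuples found in the HTML."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing pair if </dl> was missing
    return parser.cards
```

Each tuple can then be shown as a flashcard: the term as the question and the definition as the answer, or the other way around.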

I plan to continue improving it, but in the meantime, feedback is welcome. Oh, and the NLP APIs I was blogging about recently were related to this project, but for a feature that won’t be ready for some time.

Tagged quizify, semanticweb | Comments Off

Aardvark

Posted on February 13, 2010 by mfagan

I’ve had this draft post about Aardvark for about two weeks now. Now that they’ve been acquired by Google, I guess it’s about time to finally publish it.

I first heard about Aardvark via the Seattle Tech Startups mailing list and eventually got around to trying it. Few things get past my initial attempt, but I’ve still got Aardvark. It’s a question-and-answer service where you can ask questions yourself and answer questions of others.

What I’ve enjoyed the most about Aardvark (beyond its ability to send questions to the right people) is how easy to use and friendly it is. I interact with it via instant messenger, and every message it sends me includes all the instructions I need, in a friendly way, without being too verbose either. It’s impossible to not understand how to use it.

Recently they published a paper – Anatomy of a Large-Scale Social Search Engine (the name is a reference to a famous Google paper) – which I found quite interesting. I was expecting more statistics about the usefulness and value of Aardvark than the paper had; however, the interesting part is that Aardvark turned out to be far more sophisticated than I’d realized. As I read it I’d think of a way to make it even better, and later on in the paper, find that they’d already done that. One astonishing graphic in the paper is their graph of users over time; that’s some impressive growth.

Now that Google’s bought them, I only hope that they’ll allow the founders to keep doing the good job they’ve been doing… I’ve seen too many excellent products wither after acquisition (e.g. dodgeball and jotspot).

Tagged aardvark, acquisition, google, qna, socialsearch | Comments Off

Frozen Lizards in Florida

Posted on February 10, 2010 by mfagan

A few weeks ago I was in Florida, around the Fort Lauderdale area, and for the first couple of days, it was very cold, for Florida. Too cold for many lizards, that’s for sure. Here are a couple of my photos of deceased lizards taken after it warmed up. These are mostly iguanas and other large lizards, as I didn’t take photos of any small ones.

In the second photo you can see that something has removed the lizard’s tail, presumably after it died. The second-last photo shows three large iguanas lying on floating plant material, all of whom presumably froze and fell out of a large tree that reached over the water; the last photo is a cropped zoom-in of one of those three.

Tagged florida, fortlauderdale, iguanas, lizards, myphotos | Comments Off

The Race for Timbuktu: In Search of Africa’s City of Gold by Frank T. Kryza | LibraryThing

Posted on January 28, 2010 by mfagan

I mentioned a little over a year ago that I’m now keeping track of what I read (using LibraryThing). I enter the book, the date I finished it, and a rating out of five stars, but I’ve been thinking for a while that perhaps I should do some reviews as well, especially for the good ones.

I recently read The Race for Timbuktu: In Search of Africa’s City of Gold by Frank T. Kryza, about some British expeditions in the early 1800s. Overall definitely interesting, and I was surprised to realize just how recently Europe had zero clue about what lay within most of Africa, given how shortly after these expeditions most of Africa was colonized.

The main flaws of the book were jumping back and forth between different expeditions in different years, especially in the first half of the book. There are some excellent maps included, but most of them are far too small to actually read much, so I’ve located some of them online:

The third map from the book I would have also liked to have found; I contacted the British Library for help (mentioned in the book as the source) but they were not able to find it. I’m sure I could get it eventually, but that was the limit of how much work I really wanted to put into this.

Aside from the maps being too small, I would have appreciated having more maps that showed the kingdoms and cultural groups with their boundaries, as it becomes difficult to follow this well given the number of times and places described.

Tagged bookreview, britishempire, colonialism, frankkryza, timbuktu, tripoli | Comments Off

Some notes on Map Kibera mapping – Mikel Maron

Posted on January 20, 2010 by mfagan

It occurs to me that I’ve hardly mentioned OpenStreetMap on this blog, even though it’s often an obsession of mine, as people who’ve met me in person would quickly confirm. As the Wikipedia of maps (no other explanation works nearly as well), it is open, easy to contribute to, and, I believe, will eventually be the source used for most general mapping applications. Even today it gets quite a bit of use, and that is growing.

Anyhow, Mikel Maron posts on the Map Kibera blog (Some notes on Map Kibera mapping) about some of the amazing work he organized mapping Kibera, Nairobi, one of the largest slums in the world. It’s interesting how a project that began as a counter to the high-priced Ordnance Survey maps in London has become (among many other things) among the best maps of the developing world, and an important resource in humanitarian efforts such as Haiti.

I myself have contributed to the project wherever I am living (or have lived), with lots of contributions around Bellevue, WA, a bit in several places in Toronto, and last week a ton of very detailed and localized mapping in a small section of Florida.

Tagged haiti, kibera, mapkibera, mapping, mikelmaron, openstreetmap | 2 Comments

The Loudness Wars (and leaf blowers)

Posted on January 18, 2010 by mfagan

In an earlier post complaining about excessive noise I briefly mentioned the trends within music. The Loudness Wars: Why Music Sounds Worse : NPR has more interesting details on this, including some real stats (sadly using only the top song from each year rather than some sort of average). There’s a Wikipedia article now, Loudness war. Some more anecdotes can be found in an article from the Times Online.

The comfort for me in all this is that people seem to be realizing that this is a problem that is ruining the quality of music, and so maybe we are at, or soon will hit, a peak and then trend back to something more reasonable. Don’t even get me started on people using earbuds whose music can still be heard by others.

Leaf Blowers

Last week I was also reminded of something that’s been bothering me for years, leaf blowers. I’d describe a leaf blower as something that accomplishes the same thing as a rake, but with the added benefit of costing more, taking much longer, requiring more effort, using gas or electricity (cost and pollution), and making a ton of noise. A quick search online for ban “leaf blowers” shows that I am very far from being the only one with this opinion. It appears that a lot of people are working to get them banned, and in some places have succeeded. Let’s hope that spreads.

I did my usual Facebook check and there are tons of anti-leaf-blower groups, with fewer than a couple hundred people in most of them. The Clean Air California website seems to be aiming to ban them statewide or even US-wide. Here (sorta) in Toronto I see an article on a motion to ban them in 2007 which failed for what seems like pathetic reasons including industry lobbying, 6 years after the previous attempt to ban them.

Looking at the leaf blower issue turned up the Noise Pollution Clearinghouse, which seems to be perhaps the American equivalent of the Canadian advocacy site I linked to in my previous post. Last year I attended a lecture at Town Hall Seattle by Gordon Hempton, who started the One Square Inch of Silence project, not a bad idea but more significant for its symbolism than the one particular place. I don’t have any real conclusion here, I’m just complaining in my usual, not-really-doing-anything-about-the-problem way.

Tagged gordonhempton, leafblowers, music, noise, noisepollution, onesquareinchofsilence, sound, volume | 2 Comments

Comparing NLP APIs for Entity Extraction

Posted on January 2, 2010 by mfagan

Update: a number of people have pointed out some small errors and some additional APIs that I should look at. See my half-hearted followup: Entity Extraction APIs, once again.

As part of a project I’m working on (more on that later), I wanted to be able to take some text (probably in the form of a web page) and get a list of the important entities/keywords/phrases.

It turns out that there are actually quite a few companies that offer a service like this available freely (at least somewhat) through an API, so I set out to try them all out and assess their quality and suitability for my project. Most of these APIs are provided by companies that do various things in the NLP (natural language processing) realm and/or work with large semantic datasets. Many of the APIs provide a variety of information, only some of which is the set of entities that I’m looking for, so they may have good features that are excluded from my narrow comparison.

Using the APIs

To evaluate the APIs I wrote a script to make use of each one (scroll to the bottom to see it in action). They were fairly similar but the code to handle each one is slightly different. In many cases they offered multiple response formats but I opted for XML for each of them which made things simple enough, and I got used to using SimpleXML in PHP. The main difference between them all is simply the XPath expression needed to pick out the entities. For each API I grab the entity, any available synonyms (minus some de-duplication), and the self-reported relevance score for that entity, if available. If not already sorted by that relevance, I sort them.
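A minimal sketch of that approach, written in Python rather than the PHP/SimpleXML script described above. The API keys ("api_one", "api_two"), the XML layouts, the element names, and the XPath expressions are all invented for illustration and do not match any of the real APIs' response formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-API settings: where entities live in the response and
# which child elements hold the name and the relevance score.
API_PATHS = {
    "api_one": {"entity": "./entities/entity", "name": "text", "score": "relevance"},
    "api_two": {"entity": ".//Topic", "name": "Name", "score": None},
}


def extract_entities(api, xml_text):
    """Return (name, relevance) pairs, most relevant first."""
    paths = API_PATHS[api]
    root = ET.fromstring(xml_text)
    results = []
    for node in root.findall(paths["entity"]):
        name = node.findtext(paths["name"], default="").strip()
        if not name:
            continue
        score = None
        if paths["score"] is not None:
            raw = node.findtext(paths["score"])
            score = float(raw) if raw is not None else None
        results.append((name, score))
    # Sort by self-reported relevance; entities without a score go last.
    results.sort(key=lambda r: (r[1] is None, -(r[1] or 0.0)))
    return results
```

With this layout, supporting another API mostly means adding one more entry to the path table, which matches the observation that the XPath expression is the main difference between them.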

An additional issue was that although most of the APIs accepted a URL as input, some required the actual content, in either HTML or plain text. When accepting content from a web page, the service needs to be smart about ignoring web navigation, ads, etc. when determining what is important, and they vary in ability to do that. Alchemy (one of the APIs tested here) also has a web page cleaning API which can be accessed on its own. Results from the Yahoo API were of such low quality that I actually ran the input web page through the web page cleaning API before sending it to Yahoo, and it is those results which are evaluated here.

Most of the analysis here is based off a sample of eight web pages including Wikipedia articles, news articles, and other pages with a lot of text content from a variety of subjects. I have not yet done any analysis of how the quality of the response for each API is affected by the length of the input document.

The APIs

The APIs I tested were, roughly in order of increasing quality: Yahoo, OpenCalais, BeliefNetworks, OpenAmplify, AlchemyAPI, and Evri.

Comment on this post to let me know if I’m missing any.

API terms

Most APIs today have limits on both how much they can be called, and what you can use them for. Here they are ranked roughly by “most usable terms and limits” to least:

Evri: Currently no API limit; essentially no requirements
AlchemyAPI: 30,000 calls per day (although more may be available); commercial use is definitely okay
OpenCalais: 50,000 calls per day, 4 calls per second; one must display their logo as-is; if you are syndicating the data it needs to preserve their GUIDs; see details
BeliefNetworks: 2,000 calls per day, 1 call per second; essentially no requirements
Yahoo: 5,000 calls per IP per day; non-commercial use only
OpenAmplify: 1,000 “transactions” per day; note that one call is 1-4 transactions depending on the input type and whether you want all or a subset of the output; commercial use is definitely okay

Please note the standard I-am-not-a-lawyer and that this is just a summary. Please read the terms of service yourself.
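Staying inside limits like these is mostly bookkeeping on the client side. Here is a minimal sketch of a limiter with a per-day cap and a minimum gap between calls. The class name and the injectable clock are assumptions for illustration, and it treats "per day" as a rolling 24-hour window, which may not be how any of these services actually count.

```python
import time


class CallBudget:
    """Tracks a daily call cap and a minimum interval between calls."""

    def __init__(self, per_day, min_interval=0.0, clock=time.monotonic):
        self.per_day = per_day
        self.min_interval = min_interval
        self.clock = clock
        self.window_start = clock()
        self.calls_in_window = 0
        self.last_call = None

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        if now - self.window_start >= 86400:
            # A new 24-hour window has started; reset the counter.
            self.window_start = now
            self.calls_in_window = 0
        if self.calls_in_window >= self.per_day:
            return self.window_start + 86400 - now
        if self.last_call is not None:
            return max(0.0, self.last_call + self.min_interval - now)
        return 0.0

    def record_call(self):
        """Note that a call was just made."""
        self.last_call = self.clock()
        self.calls_in_window += 1
```

For a limit such as 2,000 calls per day at 1 call per second, this would be used as CallBudget(per_day=2000, min_interval=1.0): sleep for wait_time() seconds, make the request, then call record_call().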

Languages

Although my project is English-only for now, ideally there would be support for other languages. All the samples I used were in English so that is what is being used to evaluate the quality, but here is the full list of what languages each API claims to handle:

Yahoo: English
OpenCalais: English, French, Spanish
BeliefNetworks: A white paper on their website claims that their technology can support multiple languages, but most likely only English is currently supported
OpenAmplify: English
AlchemyAPI: English, French, German, Italian, Portuguese, Russian, Spanish, Swedish, however some important features are only available for English
Evri: English

I have not done any research into similar APIs which are not available for English at all, but if you find one, please let me know and I’ll make a note of it here.

The response from OpenAmplify and AlchemyAPI will also include the language of the input document. For AlchemyAPI this includes 97 languages, not just the ones that the API can handle. If you’re just looking for good language identification, there are other resources for that, some open source. My ancient Language Identification page still has some useful links there.

Number of Entities and Relevance Scores

For the purposes of my project, I see the APIs which return more entities from the document to be more useful, all else being equal. BeliefNetworks allows you to specify the number of entities returned (supposedly up to 75 but it actually returns 76), and as such always returns that number, which is almost always more than any other API. Yahoo returns up to 20 entities (which isn’t documented), which is often the fewest of any API. Here I list the APIs sorted by number of entities returned from most to least:

BeliefNetworks
AlchemyAPI
OpenAmplify
OpenCalais
Yahoo
Evri

There are a couple of important caveats here, however. This is based off of a very small sample, so other than BeliefNetworks returning the most, the list could be off. Beyond that, OpenCalais has a fairly small limit on text length (100,000 characters, presumably including the HTML tags) and if the input is too long, it returns no entities at all, just an error message. The ranking above excludes those examples. OpenAmplify has a limit of 2.5K, however they just truncate the document instead of failing (although this counts as an additional “transaction”). Oddly, Evri returned an error of “rejected by content filter” for this news article and returned no entities. Evri’s ranking in the list above is unchanged with or without the inclusion of that example.

Relevance Scores

All of the APIs, with the exception of Yahoo’s, include some metric with each entity rating its relevance to the input document. This is important as every user of any of these APIs would most likely want to establish a minimum relevance threshold for actually making use of the entities. The number of entities comparison above is based on no threshold at all; obviously changing the threshold would affect the comparison. AlchemyAPI and OpenCalais use scores from zero to one; however, Evri, OpenAmplify, and BeliefNetworks each have their own scale. I haven’t yet done any work to normalize all these scores and I think that most likely the best practice would be to independently determine your own threshold on a per-API basis depending on your own needs.
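A per-API threshold can be as simple as a lookup table. In this sketch the cut-off values are placeholders picked for illustration, not recommendations, and the function name is hypothetical. Entities are assumed to be (name, score) pairs.

```python
# Hypothetical cut-offs; each would need tuning against real responses,
# since the APIs report relevance on different scales.
THRESHOLDS = {
    "AlchemyAPI": 0.3,   # zero-to-one scale
    "OpenCalais": 0.2,   # zero-to-one scale
    "Yahoo": None,       # no relevance score, so nothing to filter on
}


def filter_by_relevance(api, entities, thresholds=THRESHOLDS):
    """Keep (name, score) pairs at or above the API's own threshold."""
    cutoff = thresholds.get(api)
    if cutoff is None:
        # No threshold configured for this API: keep everything.
        return list(entities)
    return [(name, score) for name, score in entities
            if score is not None and score >= cutoff]
```

Keeping the thresholds separate per API avoids having to normalize the different scales against each other.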

Semantic Links

By semantic links I simply mean that the entities returned have some sort of links or references to additional information about those entities. Although not necessarily required for my project, this may be very useful. Two of the APIs, Evri and AlchemyAPI include this information when they successfully map a found entity to an entity in their own database. Evri provides a reference to the entity in Evri’s own system, whereas AlchemyAPI links to a variety of other sources: the website for the entity, Freebase, MusicBrainz, and others.

In addition to or instead of these semantic links, Evri, AlchemyAPI, and OpenCalais have their own systems of classification and label entities with things like “Person” and “Religion”. See Evri’s most popular ‘facets’, AlchemyAPI Entity Types, and OpenCalais: Metadata Element Simple Output Value for specifics of each. OpenAmplify is even more basic but provides broad categories such as “Locations” and “Proper Nouns”, and entities may be listed in more than one of these broad categories. Yahoo and BeliefNetworks provide no additional context.

Additional Information

Some of these APIs provide a wealth of information that I disregard entirely but could be useful to others. For example, OpenAmplify returns a lot of information about the sentiment being expressed about each entity, information about the person who authored the document (e.g. gender, education), the style of the document (e.g. slang usage), and actions expressed within the document. OpenCalais also extracts events and facts from a document, as well as other details per entity such as the ticker symbol for entities which are public companies. AlchemyAPI can extract quotations from the document. Note that this is a summary and not a complete list of all the data that these APIs return.

Synonyms vs. Duplicates

The better APIs here, at least as far as I’ll be using them, succeed at recognizing that “Smith” referred to throughout an article is the same as “John Smith” mentioned in the first sentence. I want duplicates minimized, and for each entity to have as many valid names/synonyms as possible. The APIs differ here significantly.

Evri is definitely the best, followed by AlchemyAPI. Unfortunately AlchemyAPI sometimes misdisambiguates (ooh, no results on Google or Bing for that word yet) which results in incorrect synonyms, however that isn’t a huge problem for me. An example is the article I referred to earlier where AlchemyAPI confuses a Canadian military unit for the British monarch it was named after. Yahoo and OpenCalais fall into the middle. OpenAmplify and BeliefNetworks have a fair number of unmerged duplicate entities. For my purposes, I don’t care if the synonyms come from the input document or an external database, which is what Evri and AlchemyAPI probably use.
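To make the “Smith” vs. “John Smith” case concrete, here is a deliberately naive sketch that folds a shorter name into a longer one when all of its words appear there. This is not how any of the APIs do it (they presumably rely on much richer data), and the function name is invented for this example.

```python
def merge_partial_names(names):
    """Group names whose words all appear in a longer name.

    Returns a dict mapping each canonical (longest) name to its synonyms.
    """
    # Names with the most words first, so they become the canonical entries.
    ordered = sorted(set(names), key=lambda n: (-len(n.split()), n))
    groups = {}
    for name in ordered:
        words = set(name.lower().split())
        for canonical in groups:
            if words <= set(canonical.lower().split()):
                groups[canonical].append(name)
                break
        else:
            # No existing canonical name contains all of these words.
            groups[name] = []
    return groups
```

The weakness is easy to see: with both “John Smith” and “Jane Smith” in a document, a bare “Smith” is attached to whichever comes first, which is exactly the kind of misdisambiguation described above.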

Taking a look at each API

Yahoo

This was the only API that I was aware of until recently, and I’ve blogged about it before. The input format is plain text, so since I’m using URLs as input, I have to first extract the text, strip the HTML, and send that. As I mentioned above, the quality was so poor when using web pages as input that the text must first be scrubbed of web page navigation, etc., and I used AlchemyAPI to do that. Even then, the quality was still poor and the API returned things that I would describe more as long phrases than as entities. Given that, not to mention the maximum of 20 entities, and the non-commercial restriction, I don’t see myself making use of this API.
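The “strip the HTML” step can be sketched with Python's standard html.parser module. This only removes markup plus script and style contents. It does nothing about navigation or ads, which is the part that needed the separate cleaning API. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())
```

Because the pieces are joined with spaces, text split by inline tags can pick up an extra space; that is usually harmless for keyword extraction.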

OpenCalais

This API also accepted content rather than a URL. The content format must be specified in the API call. I simply retrieved the URL, and passed all of its content (with HTML) on to OpenCalais. They suggest making sure to remove web page navigation, but without me doing this, that didn’t present a problem. What was a problem was the short maximum document length. To actually use OpenCalais you should make sure to truncate documents before making the request, which is work that I haven’t yet implemented myself. Even when results were returned, the overall quality was mediocre.
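A truncation step along those lines might look like the following. The 100,000 figure is the limit quoted earlier in this post. Whether the service counts characters or bytes is an assumption here, and the function name is hypothetical.

```python
def truncate_for_api(content, max_chars=100_000):
    """Cut content to at most max_chars, preferring a whitespace boundary."""
    if len(content) <= max_chars:
        return content
    # Look for the last space at or before the limit.
    cut = content.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        cut = max_chars   # no usable space found; fall back to a hard cut
    return content[:cut]
```

Cutting raw HTML this way can leave an unclosed tag at the end, so stripping the markup first and truncating the plain text may be the safer order.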

The default output format is RDF, which is very verbose and includes a lot more information than I needed. I opted for the Text/Simple format which is actually XML.

For free API users sometimes the response comes back as “server busy” and I experienced this myself sometimes while trying it out.

BeliefNetworks

This API has a simple output, and is somewhat unusual in its functionality. Unlike all the other APIs, where the entities are extracted from the input document, with BeliefNetworks it seems they find entities which are related to the document but not necessarily actually in it. This produces some interesting results that are sometimes good but overall less related than I’d expect, and in one of my examples, completely unrelated and bizarre. Given that, and the frequency of duplicate entities, as mentioned, I would describe the overall quality as mediocre, although usually better than OpenCalais.

OpenAmplify

The most notable feature of OpenAmplify is all the additional information they provide, as described above.

They take input either as a URL or the content itself, and “charge” an additional transaction if you call them using a URL. I used URL input but also tried submitting the HTML, submitting the text (HTML stripped), and submitting the text after putting it through the AlchemyAPI web page cleaning, and in all cases the results were about the same or worse.

OpenAmplify notes that they may not be able to follow all URL redirects (although I didn’t test this with any of the APIs), but this issue can be avoided by following the redirects yourself before making the request. As mentioned earlier, they only look at the first 2.5K of input. They also accept RSS/Atom as input, which is a nice feature.

Although I’ve set up my script to remove duplicates it currently misses removing some duplicate entities from OpenAmplify as the entity may be listed several times in the response but with different relevance scores.
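One way to handle that case is to key the entities on a normalized name and keep only the highest relevance score for each. This is a sketch of the idea rather than the script's actual code. It assumes entities arrive as (name, score) pairs with numeric scores, and keeping the highest score (rather than, say, the average) is an arbitrary choice.

```python
def dedupe_keep_best(entities):
    """Collapse repeated names, keeping the highest relevance score for each."""
    best = {}
    for name, score in entities:
        key = name.strip().lower()   # treat case and padding as insignificant
        if key not in best or score > best[key][1]:
            best[key] = (name.strip(), score)
    # Most relevant first.
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)
```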

One problem I found was that the entities returned usually consisted of a single token (one word) which just made them less useful. Overall, the quality was okay, generally better than BeliefNetworks.

AlchemyAPI

Other than the occasional misdisambiguation, AlchemyAPI is quite good.

Evri

Evri’s API is also quite good, with the biggest flaw being that it doesn’t return very many entities.

Overall Quality Summary

Overall, Evri and AlchemyAPI were definitely the best and most suited for my purposes. The quality of Evri’s was the best across the small sample, although not in all instances, and it didn’t return as many entities as AlchemyAPI. Interestingly these two APIs are also the two which include semantic links and have the fewest restrictions and high API limits.

OpenAmplify and BeliefNetworks are the runners up. OpenCalais fared poorly in my evaluation, but I suspect it would do better when looking at all the rest that their API offers. Yahoo’s API unfortunately just wasn’t good enough to use when any of the other APIs are available.

I’m convinced that trying to build a similar service myself is not worth it at all. One thing that I haven’t tried yet is combining these APIs together in some way, although that could potentially improve the results quite a bit.
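As an untested thought on what combining might look like, the simplest scheme is probably a vote: keep an entity only if enough of the APIs agree on it. The function name and the min_votes parameter are invented for this sketch, and it matches names by exact (case-insensitive) text only, so synonyms are not merged.

```python
from collections import Counter


def combine_by_votes(results_by_api, min_votes=2):
    """Return (name, votes) for entities reported by at least min_votes APIs."""
    votes = Counter()
    display = {}
    for api, names in results_by_api.items():
        # Count each API at most once per entity.
        for key in {n.strip().lower() for n in names}:
            votes[key] += 1
        # Remember the first spelling seen for display purposes.
        for n in names:
            display.setdefault(n.strip().lower(), n.strip())
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(display[key], count) for key, count in ranked if count >= min_votes]
```

Whether this would actually improve the results is an open question; it trades the number of entities returned for agreement between services.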

You can see the script in action (until I take it down) at http://faganm.com/test/get_entities.php?u=[any URL].

Posted in Uncategorized | Tagged alchemyapi, api, apis, beliefnetworks, entityextraction, evri, naturallanguageprocessing, nlp, openamplify, opencalais, webservices, yahoo | 25 Comments

On the US Gov’t “Going Google”

Posted on December 25, 2009 by mfagan

On the US Gov’t “Going Google” – very true. Google has had great success, and thus they are a problem.

Tagged antitrust, google, government, publicvsprivate | Comments Off

Yahoo! Will Kill MyBlogLog Next Month

Posted on December 23, 2009 by mfagan

Yahoo! Will Kill MyBlogLog Next Month – of all the services Yahoo’s been killing, this one is just sad. MyBlogLog was pretty innovative and just as useful today as it always has been, despite years of nothing new. At the moment, you can still see it on the side of this blog’s homepage. Via Scott Rafer

Tagged mybloglog, yahoo | Comments Off