Some notes on Map Kibera mapping - Mikel Maron

It occurs to me that I’ve hardly mentioned OpenStreetMap on this blog, even though it’s often an obsession of mine, as people who’ve met me in person would quickly confirm. As the Wikipedia of maps (no other explanation works nearly as well), it is open, easy to contribute to, and I believe, will eventually be the source used for most general mapping applications. Even today it gets quite a bit of use, and that use is growing.

Anyhow, Mikel Maron posts on the Map Kibera blog (Some notes on Map Kibera mapping) about some of the amazing work he organized mapping Kibera, Nairobi, one of the largest slums in the world. It’s interesting how a project that began as a counter to the high-priced Ordnance Survey maps in London has become, among many other things, one of the best sources of maps of the developing world, and an important resource in humanitarian efforts such as Haiti.

I myself have contributed to the project wherever I am living (or have lived), with lots of contributions around Bellevue, WA, a bit in several places in Toronto, and last week a ton of very detailed and localized mapping in a small section of Florida.

Comparing NLP APIs for Entity Extraction

Update: a number of people have pointed out some small errors and some additional APIs that I should look at; until I get this post updated, please check out some of the great user comments at the bottom.

As part of a project I’m working on (more on that later), I wanted to be able to take some text (probably in the form of a web page) and get a list of the important entities/keywords/phrases.

It turns out that there are actually quite a few companies that offer a service like this available freely (at least somewhat) through an API, so I set out to try them all out and assess their quality and suitability for my project. Most of these APIs are provided by companies that do various things in the NLP (natural language processing) realm and/or work with large semantic datasets. Many of the APIs provide a variety of information, only some of which is the set of entities that I’m looking for, so they may have good features that are excluded from my narrow comparison.

Using the APIs

To evaluate the APIs I wrote a script to make use of each one (scroll to the bottom to see it in action). They were fairly similar but the code to handle each one is slightly different. In many cases they offered multiple response formats but I opted for XML for each of them which made things simple enough, and I got used to using SimpleXML in PHP. The main difference between them all is simply the XPath expression needed to pick out the entities. For each API I grab the entity, any available synonyms (minus some de-duplication), and the self-reported relevance score for that entity, if available. If not already sorted by that relevance, I sort them.
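As a rough illustration of the post-processing described above (this is not the script itself, which used PHP and SimpleXML), here is a minimal JavaScript sketch. The entity shape `{ name, synonyms, relevance }` and both function names are assumptions made up for the example.

```javascript
// Hypothetical entity shape: { name: string, synonyms: string[], relevance: number|null }

// Drop synonyms that repeat the entity name or each other (case-insensitive).
function uniqueSynonyms(name, synonyms) {
  var seen = new Set([name.trim().toLowerCase()]);
  var out = [];
  synonyms.forEach(function (s) {
    var trimmed = s.trim();
    var key = trimmed.toLowerCase();
    if (key && !seen.has(key)) {
      seen.add(key);
      out.push(trimmed);
    }
  });
  return out;
}

// Sort by self-reported relevance, highest first.
// Entities without a score (e.g. from an API that reports none) go last.
function sortByRelevance(entities) {
  function score(e) {
    return typeof e.relevance === 'number' ? e.relevance : -Infinity;
  }
  // slice() copies the array so the caller's list is not reordered in place.
  return entities.slice().sort(function (a, b) {
    var ra = score(a);
    var rb = score(b);
    if (ra === rb) return 0;
    return rb > ra ? 1 : -1;
  });
}
```

Comparing with `>` rather than subtracting avoids the `NaN` that subtracting two missing scores would produce.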

An additional issue was that although most of the APIs accepted a URL as input, some required the actual content, in either HTML or plain text. When accepting content from a web page, the service needs to be smart about ignoring web navigation, ads, etc. when determining what is important, and they vary in ability to do that. Alchemy (one of the APIs tested here) also has a web page cleaning API which can be accessed on its own. Results from the Yahoo API were of such low quality that I actually ran the input web page through the web page cleaning API before sending it to Yahoo, and it is those results which are evaluated here.

Most of the analysis here is based off a sample of eight web pages including Wikipedia articles, news articles, and other pages with a lot of text content from a variety of subjects. I have not yet done any analysis of how the quality of the response for each API is affected by the length of the input document.

The APIs

The APIs I tested were, roughly in order of increasing quality: Yahoo, OpenCalais, BeliefNetworks, OpenAmplify, AlchemyAPI, and Evri.

Comment on this post to let me know if I’m missing any.

API terms

Most APIs today have limits on both how much they can be called, and what you can use them for. Here they are ranked roughly by “most usable terms and limits” to least:

Evri
Currently no API limit; essentially no requirements
AlchemyAPI
30,000 calls per day (although more may be available); commercial use is definitely okay
OpenCalais
50,000 calls per day, 4 calls per second; one must display their logo as-is; if you are syndicating the data it needs to preserve their GUIDs; see details
BeliefNetworks
2,000 calls per day, 1 call per second; essentially no requirements
Yahoo
5,000 calls per IP per day; non-commercial use only
OpenAmplify
1,000 “transactions” per day; note that one call is 1-4 transactions depending on the input type and whether you want all or a subset of the output; commercial use is definitely okay

Please note the standard I-am-not-a-lawyer and that this is just a summary. Please read the terms of service yourself.

Languages

Although my project is English-only for now, ideally there would be support for other languages. All the samples I used were in English so that is what is being used to evaluate the quality, but here is the full list of what languages each API claims to handle:

Yahoo
English
OpenCalais
English, French, Spanish
BeliefNetworks
A white paper on their website claims that their technology can support multiple languages, but most likely only English is currently supported
OpenAmplify
English
AlchemyAPI
English, French, German, Italian, Portuguese, Russian, Spanish, Swedish, however some important features are only available for English
Evri
English

I have not done any research into similar APIs which are not available for English at all, but if you find one, please let me know and I’ll make a note of it here.

The response from OpenAmplify and AlchemyAPI will also include the language of the input document. For AlchemyAPI this includes 97 languages, not just the ones that the API can handle. If you’re just looking for good language identification, there are other resources for that, some open source. My ancient Language Identification page still has some useful links there.

Number of Entities and Relevance Scores

For the purposes of my project, I see the APIs which return more entities from the document to be more useful, all else being equal. BeliefNetworks allows you to specify the number of entities returned (supposedly up to 75 but it actually returns 76), and as such always returns that number, which is almost always more than any other API. Yahoo returns up to 20 entities (which isn’t documented), which is often the least of any API. Here I list the APIs sorted by number of entities returned from most to least:

  1. BeliefNetworks
  2. AlchemyAPI
  3. OpenAmplify
  4. OpenCalais
  5. Yahoo
  6. Evri

There are a couple of important caveats here, however. This is based off of a very small sample, so other than BeliefNetworks returning the most, the list could be off. Beyond that, OpenCalais has a fairly small limit on text length (100,000 characters, presumably including the HTML tags) and if the input is too long, it returns no entities at all, just an error message. The ranking above excludes those examples. OpenAmplify has a limit of 2.5K, however they just truncate the document instead of failing (although this counts as an additional “transaction”). Oddly, Evri returned an error of “rejected by content filter” for this news article and returned no entities. Evri’s ranking in the list above is unchanged with or without the inclusion of that example.

Relevance Scores

All of the APIs, with the exception of Yahoo’s, include some metric with each entity rating its relevance to the input document. This is important as every user of any of these APIs would most likely want to establish a minimum relevance threshold for actually making use of the entities. The number of entities comparison above is based on no threshold at all; obviously changing the threshold would affect the comparison. AlchemyAPI and OpenCalais use scores from zero to one, whereas Evri, OpenAmplify, and BeliefNetworks each have their own scale. I haven’t yet done any work to normalize all these scores and I think that most likely the best practice would be to independently determine your own threshold on a per-API basis depending on your own needs.
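A per-API threshold could be applied along these lines. This is only a sketch: the API keys and the cut-off values in `MIN_RELEVANCE` are invented for illustration and would need to be tuned by hand against real output from each service.

```javascript
// Hypothetical cut-offs, one per API. The numbers are placeholders, not
// recommendations; each service uses its own relevance scale.
var MIN_RELEVANCE = {
  alchemyapi: 0.3,
  opencalais: 0.3
};

// Keep only entities at or above the threshold configured for this API.
// If no threshold is configured (e.g. the API reports no score), keep everything.
function aboveThreshold(apiName, entities, thresholds) {
  var min = thresholds[apiName];
  if (typeof min !== 'number') {
    return entities.slice();
  }
  return entities.filter(function (e) {
    return typeof e.relevance === 'number' && e.relevance >= min;
  });
}
```

Keeping the thresholds in a separate table means each one can be adjusted independently without touching the filtering code.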

Semantic Links

By semantic links I simply mean that the entities returned have some sort of links or references to additional information about those entities. Although not necessarily required for my project, this may be very useful. Two of the APIs, Evri and AlchemyAPI include this information when they successfully map a found entity to an entity in their own database. Evri provides a reference to the entity in Evri’s own system, whereas AlchemyAPI links to a variety of other sources: the website for the entity, Freebase, MusicBrainz, and others.

In addition to or instead of these semantic links, Evri, AlchemyAPI, and OpenCalais have their own systems of classification and label entities with things like “Person” and “Religion”. See Evri’s most popular ‘facets’, AlchemyAPI Entity Types, and OpenCalais: Metadata Element Simple Output Value for specifics of each. OpenAmplify is even more basic but provides broad categories such as “Locations” and “Proper Nouns”, and entities may be listed in more than one of these broad categories. Yahoo and BeliefNetworks provide no additional context.

Additional Information

Some of these APIs provide a wealth of information that I disregard entirely but could be useful to others. For example OpenAmplify returns a lot of information about the sentiment being expressed about each entity, information about the person who authored the document (e.g. gender, education), the style of the document (e.g. slang usage), and actions expressed within the document. OpenCalais also extracts events and facts from a document, as well as other details per entity such as the ticker symbol for entities which are public companies. AlchemyAPI can extract quotations from the document. Note that this is a summary and not a complete list of all the data that these APIs return.

Synonyms vs. Duplicates

The better APIs here, at least as far as I’ll be using them, succeed at recognizing that “Smith” referred to throughout an article is the same as “John Smith” mentioned in the first sentence. I want duplicates minimized, and for each entity to have as many valid names/synonyms as possible. The APIs differ here significantly.

Evri is definitely the best, followed by AlchemyAPI. Unfortunately AlchemyAPI sometimes misdisambiguates (ooh, no results on Google or Bing for that word yet) which results in incorrect synonyms, however that isn’t a huge problem for me. An example is the article I referred to earlier where AlchemyAPI confuses a Canadian military unit for the British monarch it was named after. Yahoo and OpenCalais fall into the middle. OpenAmplify and BeliefNetworks have a fair number of unmerged duplicate entities. For my purposes, I don’t care if the synonyms come from the input document or an external database, which is what Evri and AlchemyAPI probably use.

Taking a look at each API

Yahoo

This was the only API that I was aware of until recently, and I’ve blogged about it before. The input format is plain text, so since I’m using URLs as input, I have to first extract the text, strip the HTML, and send that. As I mentioned above, the quality was so poor when using web pages as input that the text must first be scrubbed of web page navigation, etc., and I used AlchemyAPI to do that. Even then, the quality was still poor and the API returned things that I would describe more as long phrases than as entities. Given that, not to mention the maximum of 20 entities, and the non-commercial restriction, I don’t see myself making use of this API.
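For the "strip the HTML" step, a deliberately naive sketch is shown below. It is regex-based, the function name `htmlToText` is made up, and it does nothing about navigation or ads, which is the part that the web page cleaning API was needed for.

```javascript
// Naive HTML-to-text conversion: good enough for a quick experiment,
// not a substitute for a real HTML parser or a page-cleaning service.
function htmlToText(html) {
  return html
    // Remove script and style blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Remove comments, then any remaining tags.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    // Decode a handful of common entities (&amp; last, to avoid double decoding).
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace left behind by the removed markup.
    .replace(/\s+/g, ' ')
    .trim();
}
```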

OpenCalais

This API also accepted content rather than a URL. The content format must be specified in the API call. I simply retrieved the URL, and passed all of its content (with HTML) on to OpenCalais. They suggest making sure to remove web page navigation, but without me doing this, that didn’t present a problem. What was a problem was the short maximum document length. To actually use OpenCalais you should make sure to truncate documents before making the request, which is work that I haven’t yet implemented myself. Even when results were returned, the overall quality was mediocre.
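The truncation step mentioned above could look something like this. It is a sketch under assumptions: the 100,000 figure is the limit quoted earlier in this post, "characters" is taken to mean JavaScript string length, and the function name is hypothetical.

```javascript
// Limit quoted earlier in this post; check the current terms before relying on it.
var OPENCALAIS_MAX_CHARS = 100000;

// Cut content down to at most maxChars characters before sending it.
function truncateForApi(content, maxChars) {
  if (content.length <= maxChars) {
    return content;
  }
  var cut = content.slice(0, maxChars);
  // Back up to the last space so a word is less likely to be cut in half.
  var lastSpace = cut.lastIndexOf(' ');
  return lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
}
```

Called as `truncateForApi(pageContent, OPENCALAIS_MAX_CHARS)`, this trades the tail of long documents for getting some entities back instead of an error message.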

The default output format is RDF, which is very verbose and includes a lot more information than I needed. I opted for the Text/Simple format which is actually XML.

For free API users sometimes the response comes back as “server busy” and I experienced this myself sometimes while trying it out.

BeliefNetworks

This API has a simple output, and is somewhat unusual in its functionality. Unlike all the other APIs, where the entities are extracted from the input document, with BeliefNetworks it seems they find entities which are related to the document but not necessarily actually in it. This produces some interesting results that are sometimes good but overall less related than I’d expect, and in one of my examples, completely unrelated and bizarre. Given that, and the frequency of duplicate entities, as mentioned, I would describe the overall quality as mediocre, although usually better than OpenCalais.

OpenAmplify

The most notable feature of OpenAmplify is all the additional information they provide, as described above.

They take input either as a URL or the content itself, and “charge” an additional transaction if you call them using a URL. I used URL input but also tried submitting the HTML, submitting the text (HTML stripped), and submitting the text after putting through the AlchemyAPI web page cleaning, and in all cases the results were about the same or worse.

OpenAmplify notes that they may not be able to follow all URL redirects (although I didn’t test this with any of the APIs), but this issue can be avoided by following the redirects yourself before making the request. As mentioned earlier, they only look at the first 2.5K of input. They also accept RSS/Atom as input, which is a nice feature.

Although I’ve set up my script to remove duplicates it currently misses removing some duplicate entities from OpenAmplify as the entity may be listed several times in the response but with different relevance scores.
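One way to handle that case would be to key on the entity name alone and keep whichever copy has the highest relevance. This is a sketch of that idea, not what the script currently does; the entity shape and function name are assumptions.

```javascript
// Collapse entities that share a name (case-insensitive), keeping the copy
// with the highest relevance score.
function dedupeKeepHighest(entities) {
  var best = new Map();
  entities.forEach(function (e) {
    var key = e.name.trim().toLowerCase();
    var prev = best.get(key);
    if (!prev || e.relevance > prev.relevance) {
      best.set(key, e);
    }
  });
  // Map preserves the order in which each name was first seen.
  return Array.from(best.values());
}
```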

One problem I found was that the entities returned usually consisted of a single token (one word) which just made them less useful. Overall, the quality was okay, generally better than BeliefNetworks.

AlchemyAPI

Other than the occasional misdisambiguation, AlchemyAPI is quite good.

Evri

Evri’s API is also quite good, with the biggest flaw being that it doesn’t return very many entities.

Overall Quality Summary

Overall, Evri and AlchemyAPI were definitely the best and most suited for my purposes. The quality of Evri’s was the best across the small sample, although not in all instances, and it didn’t return as many entities as AlchemyAPI. Interestingly these two APIs are also the two which include semantic links and have the least restrictions and high API limits.

OpenAmplify and BeliefNetworks are the runners up. OpenCalais fared poorly in my evaluation, but I suspect it would do better when looking at all the rest that their API offers. Yahoo’s API unfortunately just wasn’t good enough to use when any of the other APIs are available.

I’m convinced that trying to build a similar service myself is not worth it at all. One thing that I haven’t tried yet is combining these APIs together in some way, although that could potentially improve the results quite a bit.
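Purely as a thought experiment, one simple way to combine the APIs would be to count how many of them return each entity. The sketch below is untested against the real services; the input shape (a map from API name to a list of entity names) and the function name are invented for the example, and it sidesteps the relevance scores entirely because the scales are not comparable.

```javascript
// resultsByApi: { apiName: [entityName, ...], ... } (hypothetical shape)
// Returns entities ordered by how many APIs reported them.
function combineByVotes(resultsByApi) {
  var votes = new Map();
  Object.keys(resultsByApi).forEach(function (api) {
    var seenHere = new Set();
    resultsByApi[api].forEach(function (name) {
      var key = name.trim().toLowerCase();
      if (seenHere.has(key)) return; // at most one vote per API
      seenHere.add(key);
      votes.set(key, (votes.get(key) || 0) + 1);
    });
  });
  return Array.from(votes.entries())
    .map(function (pair) { return { name: pair[0], votes: pair[1] }; })
    .sort(function (a, b) { return b.votes - a.votes; });
}
```

Matching on lowercased names is crude; it would not recognize that “Smith” and “John Smith” are the same entity, so the synonym lists would need to feed into the matching for this to work well.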

You can see the script in action (until I take it down) at http://faganm.com/test/get_entities.php?u=[any URL].

Brain Off » Open Source Geo Stack :: Mikel Maron :: Building Digital Technology for Our Planet

Brain Off » Open Source Geo Stack :: Mikel Maron :: Building Digital Technology for Our Planet - since I emailed Mikel asking for a list like this about a year and a half ago, I can take credit for it, right?

Anyhow this is a great list that includes everything except the actual sources of data. The best one for that is often OpenStreetMap, but it really depends where and what you’re mapping.

Pattern Finder: Standards Work Doesn’t Have to Be Contentious

Pattern Finder: Standards Work Doesn’t Have to Be Contentious - I’ll second that. DeWitt has been doing great work with OpenSearch.

Why Facebook Shouldn’t Fear OpenSocial

Why Facebook Shouldn’t Fear OpenSocial - I’m supposed to be studying, so of course it’s a good time to do some blogging.

Anyhow, I agree with Josh that the idea that the competition now being Facebook vs OpenSocial is silly. Facebook is doing an absolutely amazingly fantastic job pleasing users, developers, being innovative, and soon, generating profit. Their upcoming “Beacon” plans seem as brilliant as their previous ones. The only bad thing I have to say about them (from a business perspective), is that they have been way too slow getting their advertising products out. In the long run, that may not make much difference.

OpenSocial is not competition in any sense of the word. It’s just a little specification to standardize some web services, which is a good thing. And assuming it gains the traction it is expected to (the supporters actually follow through), then Facebook will just join it too, and they haven’t lost anything, really. In fact they’ll have gained additional developers and applications.

Facebook would have to be really stupid to act any other way, and from what I’ve seen, they are anything but. Except their HR, I’m not so in love with that.

Is it just me, or is MySpace resting on their laurels? Just copying Facebook isn’t going to do it, and besides, they don’t seem to be copying them very well or quickly. I thought being the major player was supposed to count for something, like having resources.

One last comment on OpenSocial… while it is certainly good for developers that there will be a common API, let’s not forget that this simply means it will be easy to have an application run on multiple websites… separately. Having an application that seamlessly uses more than one social website simultaneously will still be an enormous headache. So there’s plenty more to be done there.

Update Nov 5. After reading a few things elsewhere, maybe MySpace isn’t doing nothing, they just decided to let Google deal with all their advertising, and hope to make enough from that. But since that will likely be almost all of their revenue, might that not be a bad idea?

DeWitt Clinton » Blog Archive » Yelp search API

DeWitt Clinton » Blog Archive » Yelp search API - yesterday I decided I didn’t have the time to read through the whole Yelp spec and make my suggestions on how they could be using OpenSearch. So naturally, DeWitt (independently) did exactly that.

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions - I was working on something like this back when I was working on OpenSearch, but the industry and formats weren’t at the right stage. It’s great that the community is working on this now, there is so much potential. I like Andrew’s point “that is almost too easy.”

As always, DeWitt is involved. Here’s the latest draft.

A big OpenSearch roundup

A big OpenSearch roundup - it’s been a while since I posted an OpenSearch update; fortunately DeWitt has kept up with things better than I have. One thing seems clear: OpenSearch is gaining adoption everywhere and moving into the future along with other standards, as it should. And where would we be without lolcats…

DeWitt Clinton » Blog Archive » What a day!

DeWitt Clinton » Blog Archive » What a day!

Microsoft is supporting OpenID and Apple is denouncing DRM.

no comment by me is even needed here

Naming websites

Once upon a time (okay, 1995), Ward Cunningham invented WikiWikiWebs. They spread all over, even slowly creeping into the commercial world. In 2001 it was thought that using one would help speed up article-writing for Nupedia. Today they are known as wikis, and that particular one has grown so popular that it is not only known by virtually every internet user, its popularity relative to other specific wikis is so much greater that to almost everyone, Wikipedia = Wiki = Wikipedia.

Wikis are a very useful type of website for many applications. Not all, of course, but many. When thinking up a name for a website using a wiki, a convenient name is “Wiki”+”topic”, e.g. Wikitravel. Of course, doing so presumably makes your website the definitive wiki for that topic, despite the reality of others e.g. World66. [hmn, upon writing this I find that these two examples have now decided to work together… cool]. You can see their relative popularity on Alexa; how much of the greater popularity of Wikitravel do you think can be attributed to its name online?

The strategy these days seems to be (1) pick a topic, make a wiki for it and (2) call it Wiki[topic]. The difference, I think, is that now the naming is very deliberate, rather than convenient. I think it is working very well, too.

Open[Topic] is the other name I wanted to mention. While there are many open sourceish projects around, it seems that people are getting better at marketing and are calling just about everything Open Something. I’m personally a strong advocate of OpenSearch, and I do think that some of its success can be attributed to the name.

Identity systems have been proposed and built for years. Marc Canter will tell you how great things would be today if we’d been supporting the Sxip technology years ago. Today it seems like the momentum behind OpenID is really going forward, and that it may indeed be poised to succeed more than any previous system of its kind. How much of its success do you think can be attributed to the name?

This post was provoked by the mention of Wikileaks as I listen to the radio.

Live Search’s WebLog : Create your own search engine (an update to Live Search Macros)

Live Search’s WebLog : Create your own search engine (an update to Live Search Macros) - congrats to Live.com and Zach for getting this out. It would be cool even if it didn’t have OpenSearch support ;-)

Sam Ruby: OpenSearch Description Validation

Sam Ruby: OpenSearch Description Validation - it’s so nice to have Sam on board the OpenSearch train :-)

DeWitt Clinton’s Unto.net » Blog Archive » Introducing OpenSearch.org

DeWitt Clinton’s Unto.net » Blog Archive » Introducing OpenSearch.org - this is excellent news. more when I get some time…

IEBlog : Search in IE7 RC1

IEBlog : Search in IE7 RC1 - they support the newish “referrer” extension - yay. let’s hope everybody else starts supporting this too

Google Code - Updates: New GData API: Google Base

Google Code - Updates: New GData API: Google Base - two of my biggest hopes - Google opening up Google Base, and more wide adoption of APIs based around OpenSearch - all at once. This could be big.

randomness

work seems to be keeping me rather busy

Yesterday I got around to fixing a several-month-old bug with my University of Waterloo search engine. Turns out the problem was Yahoo having changed their query parser. The query I was sending used to be

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com)

however example.com wasn’t showing up in the results… the fix was adding a space before the closing parenthesis.

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com )

I wish Yahoo would publicly document all of their advanced search syntax, including the maximum query length.
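For what it’s worth, building that query string with the workaround baked in is only a few lines. A minimal sketch (the function name is made up; the only thing taken from the fix above is the space before the closing parenthesis):

```javascript
// Build a query restricted to a list of sites, e.g.
// "search terms (site:example.com OR site:example2.com )"
function buildSiteQuery(terms, sites) {
  var clauses = sites.map(function (site) {
    return 'site:' + site;
  }).join(' OR ');
  // The space before ")" is deliberate: without it the last site was not matched.
  return terms + ' (' + clauses + ' )';
}
```

Since the maximum query length is undocumented, a long site list may still need to be split across several requests.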

I’ve been meaning to do another OpenSearch Update post. I’ve recently started adding some of these to del.icio.us. Noticing lots of non-English blog posts on OpenSearch lately, which is very cool. Today someone asked about including thumbnails. I’ve replied suggesting Media RSS but asking for consensus (although my email still needs to be moderated).

Lots of neat stuff in the mapping space lately. Thanks to Mikel Maron, Virtual Earth now has georss feeds.

So for years I’ve been largely ignoring the social networking websites. Or to be more accurate, reading up on them a lot, but not actually using them. Among other things, I don’t want to waste my time, nor provide a lot of my personal data to some walled garden. Regarding the latter, PeopleAggregator has been out for a while, and I hadn’t gotten around to congratulating Marc and Phillip. Anyhow, Facebook came to my school (this year I believe) and I’ve found that I’m actually using it. Not much, but more than I’ve ever used another similar site. Unlike the first generation of these websites, it actually has a point to it. I’m still resisting uploading photos to it (if I annotate those photos, am I ever going to be able to export that? highly unlikely) and I don’t like using it for messaging, because it won’t be searchable and integrated with my email or instant messaging services. Amusingly enough, I do think Facebook will actually succeed in making money. Hmn.. I guess I don’t have any major point to make here..

OpenSearch Update

I’ve been fairly busy at Microsoft, working, and hanging out with other interns and so I’m way behind on blogging about OpenSearch.

Internet Explorer 7 beta 2 and Firefox Bon Echo are out, both with some degree of OpenSearch support. Both support autodiscovery of Description files. IE7 (not sure about Firefox) supports search results in OpenSearch Response (RSS/Atom) as well as HTML. IE7 (and I suspect Firefox) do not support extended search parameters (those beyond searchTerms, startPage, etc.), but that’s to be expected at this stage.

Firefox support is a little odd, in that they also support some odd pseudoOpenSearch format. So please, developers, use real OpenSearch, it’ll work equally well in all readers, not just Firefox.

Firefox’s beta also has support for “search suggestions” when using Google or Yahoo. DeWitt has shown how (see draft document) these suggestions can be implemented in a way that is completely compatible with OpenSearch, without changing the existing format (JSON) at all. And it also opens the door to allowing suggestions themselves in OpenSearch; the Query element is ideal for this purpose.

From a webmaster perspective, the OpenSearch referrer extension (draft) is really great, allowing search sites to see where their searches are coming from. I’ve wanted this for a while, and it’s great to see it happening.

Perhaps more interesting than any of this is moving forward on adding structured data into OpenSearch, and DeWitt’s draft OpenSearch and Microformats is a great step in that direction. Personally I like data to be in XML more directly (rather than embedding it within atom:content, for example), but hopefully that approach can work in tandem, still using microformats. I’ll be looking into it, as I unofficially advise my university on how to create an API for their people search. Others have been looking at this too.

These are just some of the major happenings in OpenSearch. There are a variety of new software libraries, such as in Java and Ruby. An increasing number of organizations are basing their APIs and other things on OpenSearch. A9.com’s listing of OpenSearch providers is now well over 300. It’s hard to believe how far OpenSearch has come and how far it looks like it may go.

Google Data APIs Protocol

Google Data APIs Protocol - interesting move from Google. I (and others) have thought for a while that combining OpenSearch’s read capabilities with the Atom Publishing Protocol’s write capabilities would create a very powerful API, and that’s roughly what Google is doing here.

It’s great to see the OpenSearch support (a bit - they’re using startIndex, totalResults and itemsPerPage), but I’d like to see them using it more. Some of what they’re doing is contrary to how OpenSearch works (that’s not a problem per se), as they’re using predefined query names such as q and max-results (and a folder for categories) rather than allowing people to use whichever they want and then specify them in an OpenSearch Description file.
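To make the contrast concrete: in OpenSearch the site chooses its own query parameter names and publishes them in a URL template, and the client only fills in placeholders such as {searchTerms}. Here is a minimal sketch of a client-side template filler; the function name and the example URLs are made up, and it assumes the template syntax where a trailing “?” marks a placeholder as optional.

```javascript
// Fill an OpenSearch-style URL template such as
//   http://example.com/search?q={searchTerms}&p={startPage?}
// params maps placeholder names to values.
function fillTemplate(template, params) {
  return template.replace(/\{([A-Za-z:]+)(\?)?\}/g, function (match, name, optional) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      return encodeURIComponent(String(params[name]));
    }
    if (optional) {
      return ''; // optional placeholder with no value: leave it empty
    }
    throw new Error('Missing required parameter: ' + name);
  });
}
```

The same client code then works whether a site calls its query parameter q, query, or anything else, because the name lives in the template rather than in the client.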

In that same vein, it would be nice to see them make use of autodiscovery, as Atom, RSS, OpenSearch, and others do. Upon first inspection I would say these autodiscovered documents could be OpenSearch Descriptions, but I may be wrong about that.

One interesting thing to note is that they mention how startIndex is 1-based (which is true), and then display an example with a value of “0”. Sounds like DeWitt is right, it does need to handle 0-based numbers too; even Google is making that mistake.
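The arithmetic is easy to get wrong, so here is a small sketch. Both function names are hypothetical, and treating an incoming 0 as “first result” is just one lenient way a server might cope with clients that make this mistake, not something any specification prescribes.

```javascript
// startIndex is 1-based: the first result on page 1 has index 1.
function startIndexForPage(page, itemsPerPage) {
  return (page - 1) * itemsPerPage + 1;
}

// Lenient parsing (an assumption, not spec behavior): anything missing,
// non-numeric, or below 1, including "0", is treated as the first result.
function normalizeStartIndex(value) {
  var n = parseInt(value, 10);
  return isNaN(n) || n < 1 ? 1 : n;
}
```

For example, with 10 items per page, page 3 starts at index 21, not 20.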

DeWitt brings up some other good points as well.

Via Niall.

Update: Joe Gregorio weighs in

Update 2: Marc Canter (one of my favourite bloggers) finds this linkworthy ;-) although I’m always amazed at the spellings my name gets.

Google Toolbar Button API Follow-up

My last post was my initial reaction to this new API from Google. It’s not surprising that I’m worried about Google’s plans here, as their record on XML cooperation hasn’t been all that stellar. I haven’t fully looked into it yet, but I had noticed Google’s absence from a new standardization effort; Retailers, Engines Want Standard for Product Description (via Gary) lists MSN, Yahoo!, and others.

Anyhow, getting down to the real point, I’ve decided to completely skip over “What Google Should Have Done,” and go right ahead to “What Google Should Now Do.” Save myself the wasted keystrokes.

Step 1: Fix Feed Refresh Interval

Remove the refresh-interval attribute from <feed>. Add it to RSS/Atom in a namespace. This shouldn’t really change anything. This has nothing to do with OpenSearch by the way, it’s just my general opinion on XML - extend an existing format rather than creating a new one.

After I started writing this, DeWitt posted his take on it all: Google Toolbar, Custom Buttons, and OpenSearch. It includes a lot of what I was going to say, so I will continue my comments as a reply to his post.

A final note, for anyone that’s counting… this makes at least four different Google products that are RSS/Atom readers (Google Reader, Google Toolbar, Google Personalized Homepage, Google Desktop). I hope they’re all using the API that the Google Reader team has been developing.

Google Toolbar API - Guide to Making Custom Button

Google Toolbar API - Guide to Making Custom Button - aaaargh. I see Google’s recreated the OpenSearch Description format. Nice job guys. Oh yeah, and it also functions as an RSS feed information thingy…. which as far as I can tell, only provides refresh rate…. if they need that so badly they could make that element an extension to RSS/Atom.

It seems like Google’s attitude nowadays is “developers like APIs, and they like XML, so let’s create lots and lots of little tiny APIs and new XML formats.” How about a new search API, like for images? The web search API was last updated years ago… Oh, in case we’re counting, Google now has created XML formats for sitemaps (but they accept RSS and Atom, so what was the point?), homepage modules (why not use HTML, as I’ve written before?), “buttons” (Google Toolbar), 50 (exaggeration) kinds of microcontent (Google Base), etc.

More later when I get back from school and have time to look into this more fully.

Windows Search Guide

Windows Search Guide - a (very beta) page is up for adding search engines to the search box in Internet Explorer 7. Note that at the time of this writing the OpenSearch Description files they’re using are in v1.1 draft 1, which they’ll hopefully upgrade appropriately. Also they’re declaring that the results are in RSS, when they are actually in HTML.

Anyhow, now that there are two browsers (okay, so IE7 hasn’t been released yet…) that support adding search engines via javascript, here’s a single javascript function that handles both of them. It assumes you have three files: a .src plugin file, a 16x16 icon, and an OpenSearch Description file. There’s a .src to OpenSearch Description file converter I wrote on A9.com.

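function addEngine() {
  try {
    // Firefox, Mozilla, Netscape 6+, Camino: install the .src plugin and its icon.
    window.sidebar.addSearchEngine('http://example.com/plugin.src',
      'http://example.com/plugin.png', 'Example Search Engine', 'Category Name');
  }
  catch (e) {
    try {
      // Internet Explorer 7: install from the OpenSearch Description file.
      window.external.AddSearchProvider('http://example.com/opensearch.xml');
    }
    catch (e2) {
      // Neither install API is available; the message is split with + because
      // a JavaScript string literal cannot span two lines.
      alert('Internet Explorer 7, Firefox, Mozilla, Netscape 6 or higher, ' +
        'or Camino is needed to install a search engine.');
    }
  }
}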
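
As an alternative to the nested try/catch, here is a minimal sketch that picks the install method by feature detection instead. The names detectSearchInstaller and installEngine, and the shape of the engine object, are hypothetical and only for illustration; the browser object is passed in as win so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: reports which install API a browser-like object offers.
// `win` stands in for the global `window`.
function detectSearchInstaller(win) {
  if (win.sidebar && typeof win.sidebar.addSearchEngine === 'function') {
    return 'sherlock';    // Firefox, Mozilla, Netscape 6+, Camino
  }
  if (win.external && 'AddSearchProvider' in win.external) {
    return 'opensearch';  // Internet Explorer 7
  }
  return null;            // no known install API
}

// Hypothetical wrapper: installs the engine with whichever API was detected.
// Returns true if an install call was made, false otherwise.
function installEngine(win, engine) {
  switch (detectSearchInstaller(win)) {
    case 'sherlock':
      win.sidebar.addSearchEngine(engine.src, engine.icon, engine.name, engine.category);
      return true;
    case 'opensearch':
      win.external.AddSearchProvider(engine.description);
      return true;
    default:
      return false;
  }
}
```

In a page this would be called as installEngine(window, {...}), and a false return is the cue to show the "browser not supported" message.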

personal notes for later:

Opera:
Manually Editing Opera Searches using search.ini
Opera Search.ini Editor 1.25

Safari
Add Mozilla-like keyword functionality to Safari’s search bar (a hack)
AcidSearch 0.61

Major OpenSearch upgrade

I’ve been hard at work at A9.com, working on the OpenSearch website.

Here’s some of what’s new:

OpenSearch 1.1 Draft 2
The first draft went up in September - hopefully the second draft will become final within a few weeks. A9.com already supports it. The biggest change since the first draft? A fourth component to the specification: OpenSearch Query, which allows you to reference a query. It may not sound like a big deal, but I think it is. Right now you can use it to provide spelling suggestions, related searches, etc. to A9. It also allows any search parameter to be used (though A9 doesn’t support this yet), so it can establish a dialog between an OpenSearch producer and consumer using extended search parameters, even if the consumer doesn’t know anything about them. Another change is the addition of autodiscovery - imagine doing that with search tools!
Improved documentation and developer resources
New and/or improved: General FAQ, Developer FAQ, Developer How-to, specification changelog, guide to upgrading from 1.0 to 1.1, an index of elements and attributes, general tips. There are also listings of tools/software for producing and consuming OpenSearch feeds. This includes an OpenSearch-to-XHTML stylesheet (XSLT - very comprehensive), a converter for any XML into OpenSearch, and a converter from Sherlock plugins (used in Firefox).
Mailing List
OpenSearch isn’t called “open” for no reason. And to further that cause there is now a mailing list for discussing the specification, software for reading and writing it, etc.

That’s the gist of it. Although it isn’t there yet, I think OpenSearch is very much on the road to becoming ubiquitous, just as RSS/Atom is. The support by Internet Explorer 7 gives that a huge push.

It’s amazing that I’ve been given the opportunity to put so much work into an open format that benefits the entire industry, not just A9.com. You can be sure I’ll be saying more about OpenSearch in the future - if not in this blog, then on the mailing list, on other blogs, etc.

And for those who have no idea what I’m talking about - the homepage of the OpenSearch website is hopefully much clearer now at explaining it :-)

From metasearch to distributed information environments

From metasearch to distributed information environments (Lorcan Dempsey) is a good overview of metasearch in the academic environment, and of search/metadata APIs.

I looked at a number of the documents, including the first two PowerPoint files and the information on MXG. All worth looking at.

In terms of meta/federated search, those schools (first two PowerPoints) are definitely making leaps forward. The commercial and academic worlds are beginning to learn from each other. The improvements are great, but need to be much greater.

The MXG (.doc file) proposal looks to me like an attempt to make a simpler, though less powerful, version of SRU, just as SRU tried to do for Z39.50. That is good news; the authors seem to have the right attitude. I also like how they’ve split the specification into levels, each more complicated than the one before and thus closer to SRU (the last level is SRU).

If I were them I’d think hard about OpenSearch. It is a much simpler specification (clearly not originating from the academic world) which accomplishes less than even MXG Level 1. But not that much less, considering how much easier it is to use.

One specific thing that OpenSearch does that the other specifications don’t is allow search engines to use their own URL variables instead of predefined ones. It looks fairly trivial to me for this concept to be integrated into the SRU/MXG specifications.
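
To make the URL-variable point concrete, here is a rough sketch of how a consumer might fill in an OpenSearch-style URL template. The function name fillTemplate and the example.com URLs are made up for illustration, and treating a trailing "?" as "optional" follows the 1.1 template syntax as I understand it, so take it as an assumption rather than a reference implementation. The consumer only needs to know the placeholder names; the engine is free to call its own URL variables whatever it likes (query, q, page, n, ...).

```javascript
// Fills an OpenSearch-style URL template such as
// "http://example.com/find?query={searchTerms}&page={startPage?}".
// Placeholders ending in "?" are treated as optional (an assumption, see above)
// and become an empty string when no value is supplied.
function fillTemplate(template, values) {
  return template.replace(/\{([A-Za-z:]+)(\?)?\}/g, function (match, name, optional) {
    if (Object.prototype.hasOwnProperty.call(values, name)) {
      // URL-encode the value so spaces and punctuation are safe in a query string.
      return encodeURIComponent(String(values[name]));
    }
    if (optional) {
      return '';
    }
    throw new Error('Missing required template parameter: ' + name);
  });
}
```

For example, fillTemplate('http://example.com/find?query={searchTerms}&page={startPage?}', { searchTerms: 'open search' }) returns 'http://example.com/find?query=open%20search&page=', while a second engine could publish 'http://example.org/s?q={searchTerms}&n={count}' and be handled by exactly the same code.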

Back to academic ‘multi’ search tools, there is UWhub, my personal project. Right now it does web search and image search (just added that this week), but I would definitely like to expand this to include searching within the school’s library, among other things.

IEBlog : Hello from LA!

IEBlog : Hello from LA! - placeholder until tomorrow when I comment on this

this is a category-free zone

I installed WordPress two days ago and have had a little time to work on it. I’ve never used WordPress before and I’m extremely pleased with how easy it was to work with it and the plugin I’m using. That plugin would be Jerome’s Keywords.

After activating it I made modifications to my templates so that tags are displayed instead of categories for individual posts. I also modified the plugin to add a ‘related tags’ feature to the tag pages and search results pages, and a ‘common recent tags’ feature to the home page. All of these lists are ordered by frequency. The tags are also output in the meta keywords tag on individual post pages, tag pages, and search results pages, the latter two using the ‘related tags’. Lastly, I got the tags to show up in my Atom feed in dc:subject, but not in the HTML.
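
For the curious, the ‘related tags’ idea can be sketched roughly like this. This is not the plugin’s code (the plugin is PHP); it is a JavaScript illustration with a made-up data shape, where each post is an object with a tags array, and relatedTags is a hypothetical name. It counts which other tags appear on posts carrying a given tag and lists them most frequent first.

```javascript
// Returns the tags that co-occur with `tag`, most frequent first.
// `posts` is assumed to be an array of objects like { tags: ['opensearch', 'rss'] }.
function relatedTags(posts, tag) {
  var counts = Object.create(null);  // no inherited keys, so any tag name is safe
  posts.forEach(function (post) {
    if (post.tags.indexOf(tag) === -1) { return; }  // post does not carry the tag
    post.tags.forEach(function (other) {
      if (other !== tag) { counts[other] = (counts[other] || 0) + 1; }
    });
  });
  // Sort by descending count; break ties alphabetically so the order is stable.
  return Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0);
  });
}
```

The same counting, limited to the most recent posts and without the filter on a single tag, would give something like the ‘common recent tags’ list.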

So, that’s that, and I’m very happy with it. The only thing left to do, really, is an ‘all tags’ page, but as I don’t know much about either WordPress or this plugin ;-) I think I’ll wait until the next version of the plugin for that.

Oh, and I intend to roll this back into the plugin eventually.

Update April 10 - I have added links to ‘this tag elsewhere’ from tag pages, added Atom feeds for each tag and for any search, and added a ‘search within this tag’ option to tag pages.