Blogdigger Dev Blog: Blogdigger Acquired by Odeo

Blogdigger Dev Blog: Blogdigger Acquired by Odeo - very interesting news. Back in March 2003 Dave Winer blogged that perhaps there should be a search engine based off data in RSS feeds. Three people started coding that weekend, Greg Gershman, Scott Johnson, and François Schiettecatte, forming BlogDigger, Roogle, and RSS-Search.

Scott’s “Roogle” launched by Mondayish, was Slashdotted to death, and became Feedster. Scott and François both decided to make a company out of their ventures, and realizing that they both lived in Boston, merged into Feedster. Greg later incorporated BlogDigger, but didn’t take quite the same route. I became online friends with them all to varying degrees, got to intern at Feedster in 2005, and also met Greg during that time. Although it never really attained the prestige that Feedster did, BlogDigger was always cool to me. Greg added searching by category (generally called tags these days) thanks to my suggestion. His geo-based search is pretty nice, despite the fairly small number of geocoded blogs.

Feedster is 404ing these days, and PubSub died too. Technorati eventually added searching to their backlink features, but they are struggling to some extent these days. The likes of Google Blogsearch hardly help. At any rate, Greg hung in there, and it is great to see BlogDigger living on even further. I wish him the best at his new job.

Yahoo Embraces The Semantic Web - Expect The Internet To Organize Itself In A Hurry

Yahoo Embraces The Semantic Web - Expect The Internet To Organize Itself In A Hurry - wow. Watching things grow sloooowly for a long time, and then it finally seems like things are picking up… very exciting.

Update: link is The Yahoo! Search Open Ecosystem

Pattern Finder: Standards Work Doesn’t Have to Be Contentious

Pattern Finder: Standards Work Doesn’t Have to Be Contentious - I’ll second that. DeWitt has been doing great work with OpenSearch.

DeWitt Clinton » Blog Archive » Yelp search API

DeWitt Clinton » Blog Archive » Yelp search API - yesterday I decided I didn’t have the time to read through the whole Yelp spec and make my suggestions on how they could be using OpenSearch. So naturally, DeWitt (independently) did exactly that.

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions - I was working on something like this back when I was working on OpenSearch, but the industry and formats weren’t at the right stage. It’s great that the community is working on this now, there is so much potential. I like Andrew’s point “that is almost too easy.”

As always, DeWitt is involved. Here’s the latest draft.

A big OpenSearch roundup

A big OpenSearch roundup - it’s been a while since I posted an OpenSearch update, fortunately DeWitt has kept up with things better than I have. One thing seems clear: that OpenSearch is gaining adoption everywhere and moving into the future along with other standards, as it should. And where would we be without lolcats…

Image Search Engines || Fagan Finder

Image Search Engines || Fagan Finder - virtually all of Fagan Finder has stagnated since 2005 (if not 2002/2003), and I definitely do not have the time to give it the upgrade it deserves. Anyhow, I decided to update the image search page… taking out all the dead links, etc., and putting in some newer stuff. You’d be surprised how long that takes, even though I didn’t bother adding descriptions for the new additions like I used to.

The image search page has been among the most popular pages on Fagan Finder since it was created in May 2003, and I often get requests from people who want to advertise on the page. So now I have updated it (first time since June 2003) and added an ad spot (right now it shows Google Adsense 50% of the time and Adbrite 50% of the time, but I’m still playing with that). For the latter, I have it set so that I must approve all ads.

Somehow, despite removing dead and crummy websites, it has gone from 42 search tools and 42 external links to 65 search tools and 42 external links. The page is seriously getting crowded. Anyhow, let me know if you find any bugs or if I’m missing anything.

Live Search’s WebLog : Create your own search engine (an update to Live Search Macros)

Live Search’s WebLog : Create your own search engine (an update to Live Search Macros) - congrats to Live.com and Zach for getting this out. It would be cool even if it didn’t have OpenSearch support ;-)

Sam Ruby: OpenSearch Description Validation

Sam Ruby: OpenSearch Description Validation - it’s so nice to have Sam on board the OpenSearch train :-)

DeWitt Clinton’s Unto.net » Blog Archive » Introducing OpenSearch.org

DeWitt Clinton’s Unto.net » Blog Archive » Introducing OpenSearch.org - this is excellent news. more when I get some time…

IEBlog : Search in IE7 RC1

IEBlog : Search in IE7 RC1 - they support the newish “referrer” extension - yay. let’s hope everybody else starts supporting this too

Google Code - Updates: New GData API: Google Base

Google Code - Updates: New GData API: Google Base - two of my biggest hopes - Google opening up Google Base, and more wide adoption of APIs based around OpenSearch - all at once. This could be big.

randomness

work seems to be keeping me rather busy

Yesterday I got around to fixing a several-month-old bug with my University of Waterloo search engine. Turns out the problem was Yahoo having changed their query parser. The query I was sending used to be

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com)

however example.com wasn’t showing up in the results… the fix was adding a space before the closing parenthesis.

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com )

I wish Yahoo would publicly document all of their advanced search syntax, including the maximum query length.
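For what it’s worth, here is a minimal sketch of how a query like that could be assembled, including the extra space before the closing parenthesis. The function name buildSiteQuery and its arguments are made up for illustration; this is not the actual code behind my search engine, and Yahoo’s parser may well change again.

```javascript
// Hypothetical helper: builds a query restricted to a list of sites.
function buildSiteQuery(terms, sites) {
  var clauses = sites.map(function (site) {
    return 'site:' + site;
  });
  // Note the space before the closing parenthesis (the workaround
  // described above).
  return terms + ' (' + clauses.join(' OR ') + ' )';
}

var query = buildSiteQuery('search terms', ['example.com', 'example2.com']);
// query is: search terms (site:example.com OR site:example2.com )
```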

I’ve been meaning to do another OpenSearch Update post. I’ve recently started adding some of these to del.icio.us. Noticing lots of non-English blog posts on OpenSearch lately, which is very cool. Today someone asked about including thumbnails. I’ve replied suggesting Media RSS but asking for consensus (although my email still needs to be moderated).

Lots of neat stuff in the mapping space lately. Thanks to Mikel Maron, Virtual Earth now has georss feeds.

So for years I’ve been largely ignoring the social networking websites. Or to be more accurate, reading up on them a lot, but not actually using them. Among other things, I don’t want to waste my time, nor provide a lot of my personal data to some walled garden. Regarding the latter, PeopleAggregator has been out for a while, and I hadn’t gotten around to congratulating Marc and Phillip. Anyhow, Facebook came to my school (this year I believe) and I’ve found that I’m actually using it. Not much, but more than I’ve ever used another similar site. Unlike the first generation of these websites, it actually has a point to it. I’m still resisting uploading photos to it (if I annotate those photos, am I ever going to be able to export that? highly unlikely) and I don’t like using it for messaging, because it won’t be searchable and integrated with my email or instant messaging services. Amusingly enough, I do think Facebook will actually succeed in making money. Hmm… I guess I don’t have any major point to make here.

OpenSearch Update

I’ve been fairly busy at Microsoft, working, and hanging out with other interns and so I’m way behind on blogging about OpenSearch.

Internet Explorer 7 beta 2 and Firefox Bon Echo are out, both with some degree of OpenSearch support. Both support autodiscovery of Description files. IE7 (not sure about Firefox) supports search results in OpenSearch Response (RSS/Atom) as well as HTML. IE7 (and I suspect Firefox) do not support extended search parameters (those beyond searchTerms, startPage, etc.), but that’s to be expected at this stage.

Firefox support is a little odd, in that they also support some odd pseudoOpenSearch format. So please, developers, use real OpenSearch, it’ll work equally well in all readers, not just Firefox.

Firefox’s beta also has support for “search suggestions” when using Google or Yahoo. DeWitt has shown how (see draft document) these suggestions can be implemented in a way that is completely compatible with OpenSearch, without changing the existing format (JSON) at all. And it also opens the door to allowing suggestions themselves in OpenSearch; the Query element is ideal for this purpose.

From a webmaster perspective, the OpenSearch referrer extension (draft) is really great, allowing search sites to see where their searches are coming from. I’ve wanted this for a while, and it’s great to see it happening.

Perhaps more interesting than any of this is moving forward on adding structured data into OpenSearch, and DeWitt’s draft OpenSearch and Microformats is a great step in that direction. Personally I like data to be in XML more directly (rather than embedding it within atom:content, for example), but hopefully that approach can work in tandem, still using microformats. I’ll be looking into it, as I unofficially advise my university on how to create an API for their people search. Others have been looking at this too.

These are just some of the major happenings in OpenSearch. There are a variety of new software libraries, such as in Java and Ruby. An increasing number of organizations are basing their APIs and other things on OpenSearch. A9.com’s listing of OpenSearch providers is now well over 300. It’s hard to believe how far OpenSearch has come and how far it looks like it may go.

Google Data APIs Protocol

Google Data APIs Protocol - interesting move from Google. I (and others) have thought for a while that combining OpenSearch’s read capabilities with the Atom Publishing Protocol’s write capabilities would create a very powerful API, and that’s roughly what Google is doing here.

It’s great to see the OpenSearch support (a bit - they’re using startIndex, totalResults and itemsPerPage), but I’d like to see them using it more. Some of what they’re doing is contrary to how OpenSearch works (that’s not a problem per se), as they’re using predefined query names such as q and max-results (and a folder for categories) rather than allowing people to use whichever they want and then specify them in an OpenSearch Description file.

In that same vein, it would be nice to see them make use of autodiscovery, as Atom, RSS, OpenSearch, and others do. Upon first inspection I would say these autodiscovered documents could be OpenSearch Descriptions, but I may be wrong about that.

One interesting thing to note is that they mention how startIndex is 1-based (which is true), and then display an example with a value of “0”. Sounds like DeWitt is right, it does need to handle 0-based numbers too; even Google is making that mistake.
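To make the off-by-one issue concrete, here is a tiny sketch of converting an index between a 0-based and a 1-based convention. The function name and its offset arguments are my own invention for illustration, not something taken from the OpenSearch or GData documents.

```javascript
// Hypothetical helper: convert a start index from a source whose first
// result is numbered `fromOffset` to a target whose first result is
// numbered `toOffset`.
function convertStartIndex(startIndex, fromOffset, toOffset) {
  return startIndex - fromOffset + toOffset;
}

// A client that counts from 0 asks for its first result;
// a 1-based service should see startIndex = 1.
var forOneBasedService = convertStartIndex(0, 0, 1);

// A 1-based startIndex of 11 is position 10 when counting from 0.
var forZeroBasedService = convertStartIndex(11, 1, 0);
```

A consumer that knows which convention each side uses can apply this before building the request, instead of silently skipping or repeating a result.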

DeWitt brings up some other good points as well.

Via Niall.

Update: Joe Gregorio weighs in

Update 2: Marc Canter (one of my favourite bloggers) finds this linkworthy ;-) although I’m always amazed at the spellings my name gets.

Google Toolbar Button API Follow-up

My last post was my initial reaction to this new API from Google. It’s not surprising that I’m worried about Google’s plans here, as their record on XML cooperation hasn’t been all that stellar. I haven’t fully looked into it yet, but I had noticed Google’s absence from a new standardization effort; Retailers, Engines Want Standard for Product Description (via Gary) lists MSN, Yahoo!, and others.

Anyhow, getting down to the real point, I’ve decided to completely skip over “What Google Should Have Done,” and go right ahead to “What Google Should Now Do.” Save myself the wasted keystrokes.

Step 1: Fix Feed Refresh Interval

Remove the refresh-interval attribute from <feed>. Add it to RSS/Atom in a namespace. This shouldn’t really change anything. This has nothing to do with OpenSearch by the way, it’s just my general opinion on XML - extend an existing format rather than creating a new one.

After I started writing this, DeWitt posted his take on it all: Google Toolbar, Custom Buttons, and OpenSearch. It includes a lot of what I was going to say, so I will continue my comments as a reply to his post.

A final note, for anyone that’s counting… this makes at least four different Google products that are RSS/Atom readers (Google Reader, Google Toolbar, Google Personalized Homepage, Google Desktop). I hope they’re all using the API that the Google Reader team has been developing.

Google Toolbar API - Guide to Making Custom Button

Google Toolbar API - Guide to Making Custom Button - aaaargh. I see Google’s recreated the OpenSearch Description format. Nice job guys. Oh yeah, and it also functions as an RSS feed information thingy…. which as far as I can tell, only provides refresh rate…. if they need that so badly they could make that element an extension to RSS/Atom.

It seems like Google’s attitude nowadays is “developers like APIs, and they like XML, so let’s create lots and lots of little tiny APIs and new XML formats.” How about a new search API, like for images? The web search API was last updated years ago… Oh, in case we’re counting, Google now has created XML formats for sitemaps (but they accept RSS and Atom, so what was the point?), homepage modules (why not use HTML, as I’ve written before?), “buttons” (Google Toolbar), 50 (exaggeration) kinds of microcontent (Google Base), etc.

More later when I get back from school and have time to look into this more fully.

Puzzlepieces » tree-climbing vines (April 17, 2005)

Puzzlepieces » tree-climbing vines (April 17, 2005) - yes, I’m linking to a blog post of mine from last year. Most of my posts generate about zero comments, so this one, at 5, is a lot, the last one just added today.

All from people I don’t know, it seems that people are searching for “tree climbing vine” and variations, and I’m ranking quite well for those. So if you have vines like those and want to get rid of them, check that post for comments ;-)

Windows Search Guide

Windows Search Guide - a (very beta) page is up for adding search engines to the search box in Internet Explorer 7. Note that at the time of this writing the OpenSearch Description files they’re using are in v1.1 draft 1, which they’ll hopefully upgrade appropriately. Also they’re declaring that the results are in RSS, when they are actually in HTML.

Anyhow, now that there are two browsers (okay, so IE7 hasn’t been released yet…) that support adding search engines via javascript, here’s a single javascript function that handles both of them. It assumes you have three files: a .src plugin file, a 16x16 icon, and an OpenSearch Description file. There’s a .src to OpenSearch Description file converter I wrote on A9.com.

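function addEngine() {
  try {
    // Firefox, Mozilla, Netscape 6+, Camino: .src plugin file plus icon
    window.sidebar.addSearchEngine('http://example.com/plugin.src',
      'http://example.com/plugin.png', 'Example Search Engine', 'Category Name');
  }
  catch (e) {
    try {
      // Internet Explorer 7: OpenSearch Description file
      window.external.AddSearchProvider('http://example.com/opensearch.xml');
    }
    catch (e2) {
      // Neither API is available; the message is split with + because a
      // string literal cannot span two lines
      alert('Internet Explorer 7, Firefox, Mozilla, Netscape 6 or higher, ' +
        'or Camino is needed to install a search engine.');
    }
  }
}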

personal notes for later:

Opera:
Manually Editing Opera Searches using search.ini
Opera Search.ini Editor 1.25

Safari
Add Mozilla-like keyword functionality to Safari’s search bar (a hack)
AcidSearch 0.61

Major OpenSearch upgrade

I’ve been hard at work at A9.com, working on the OpenSearch website.

Here’s some of what’s new:

OpenSearch 1.1 Draft 2
The first draft went up in September - hopefully the second draft will become final within a few weeks. A9.com already supports it. The biggest change since the first draft? A fourth component to the specification: OpenSearch Query, which allows you to reference a query. It may not sound like a big deal, but I think it is. Right now you can use it to provide spelling suggestions, related searches, etc. to A9. While not yet supported by A9, it allows for any search parameters to be used - so it can establish a dialog between an OpenSearch producer and consumer using extended search parameters, even if the consumer doesn’t know anything about them. Another change is the addition of autodiscovery - imagine doing that with search tools!
Improved documentation and developer resources
New and/or improved: General FAQ, Developer FAQ, Developer How-to, specification changelog, guide to upgrading from 1.0 to 1.1, an index of elements and attributes, general tips. There are also listings of tools/software for producing and consuming OpenSearch feeds. This includes an OpenSearch-to-XHTML stylesheet (XSLT - very comprehensive), a converter for any XML into OpenSearch, and a converter from Sherlock plugins (used in Firefox).
Mailing List
OpenSearch isn’t called “open” for no reason. And to further that cause there is now a mailing list for discussing the specification, software for reading and writing it, etc.

That’s the gist of it. Although it isn’t there yet, I think OpenSearch is very much on the road to becoming ubiquitous, just as RSS/Atom is. The support by Internet Explorer 7 gives that a huge push.

It’s amazing that I’ve been given the opportunity to put so much work into an open format, that benefits the entire industry, not just A9.com. You can be sure I’ll be saying more about OpenSearch in the future - if not in this blog, then on the mailing list, on other blogs, etc.

And for those who have no idea what I’m talking about - the homepage of the OpenSearch website is hopefully much clearer now at explaining it :-)

From metasearch to distributed information environments

From metasearch to distributed information environments (Lorcan Dempsey) is a good overview of metasearch in the academic environment, and search/metadata APIs.

I looked at a number of the documents, including the first two PowerPoint files and the information on MXG. All worth looking at.

In terms of meta/federated search, those schools (first two PowerPoints) are definitely making leaps forward. The commercial and academic worlds are beginning to learn from each other. The improvements are great, but need to be much greater.

The MXG (.doc file) proposal looks to me like an attempt to make a simpler but not as great version of SRU, which tried to do the same for Z39.50. Which is good news; the authors seem to have the right attitude. I also like how they’ve made levels of the specification, each of which is more complicated, and thus closer to SRU (the last is SRU).

If I were them I’d think hard about OpenSearch. It is a much simpler specification (clearly not originating from the academic world) which accomplishes less than even MXG Level 1. But not that much less, considering how much easier it is to use.

One specific thing that OpenSearch does that the other specifications don’t, is allow search engines to use their own URL variables instead of predefined ones. It looks fairly trivial to me for this concept to be integrated into the SRU/MXG specifications.
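To illustrate the idea, here is a rough sketch of how a consumer might fill in a search engine’s own URL template, where the engine picks its variable names and only marks which OpenSearch parameter goes where. The function name fillTemplate, the example URL, and the handling of unknown or optional ({name?}) parameters are assumptions for this sketch, not wording from any of the specifications.

```javascript
// Hypothetical helper: substitute OpenSearch-style placeholders such as
// {searchTerms} or {startPage} into an engine-supplied URL template.
function fillTemplate(template, params) {
  return template.replace(/\{(\w+)\??\}/g, function (match, name) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      return encodeURIComponent(String(params[name]));
    }
    // In this sketch, placeholders without a value become empty strings.
    return '';
  });
}

// The engine calls its variables "keywords" and "pg"; the consumer only
// needs to know the placeholder names.
var url = fillTemplate(
  'http://example.com/find?keywords={searchTerms}&pg={startPage}',
  { searchTerms: 'tree climbing vine', startPage: 2 }
);
// url is: http://example.com/find?keywords=tree%20climbing%20vine&pg=2
```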

Back to academic ‘multi’ search tools, there is UWhub, my personal project. Right now it does web search and image search (just added that this week), but I would definitely like to expand this to include searching within the school’s library, among other things.

IEBlog : Hello from LA!

IEBlog : Hello from LA! - placeholder until tomorrow when I comment on this

What’s wrong with MSN’s RSS search

News from Luigi about RSS search from MSN leads me to think MSN Search knows what they’re doing. Or not.

They are putting RSS/Atom search integrated right in with their web search. This is good. But… they’re displaying RSS feeds as regular search results, without modification. That means that when you click on an RSS feed result, you are taken to (surprise) the RSS feed, which, most of the time, is not in a human-readable format. Hello usability? This is acceptable for a major engine to put out for average web users?? Additionally, the ‘cache’ link for RSS feed results displays a somewhat more human-readable display, but it could definitely be improved.

Virtually all, if not all RSS feeds today are representations of existing web pages. It would make a little more sense to point to those, and provide an additional link to the actual RSS feed. This is essentially what all the major RSS search engines are smart enough to do, including Feedster, Blogdigger, and Bloglines.

Actually those engines are all smarter still, since they’re indexing individual RSS items rather than whole RSS feeds as if they were a single document. That’s a huge benefit of RSS; that the individual items have been separated, and usually come with important metadata, like the date. MSN doesn’t seem to make use of this at all, although admittedly their implementation is new.
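As a rough illustration of item-level indexing, the sketch below splits an RSS 2.0 document into one record per item, keeping each item’s title, link and date. It is only a sketch with made-up function names: it uses regular expressions rather than a real XML parser, so it ignores CDATA, namespaces, attributes on the field elements, and Atom entirely, and it says nothing about how Feedster, Blogdigger or Bloglines actually do it.

```javascript
// Hypothetical helper: pull the text of a simple child element out of a
// chunk of XML, or null if the element is absent.
function field(xml, name) {
  var m = new RegExp('<' + name + '>([\\s\\S]*?)</' + name + '>').exec(xml);
  return m ? m[1].trim() : null;
}

// Hypothetical helper: turn one RSS feed into a list of per-item records,
// so that each item can be indexed as its own document.
function splitFeedIntoItems(rssText) {
  var items = [];
  var itemPattern = /<item[\s>][\s\S]*?<\/item>/g;
  var match;
  while ((match = itemPattern.exec(rssText)) !== null) {
    items.push({
      title: field(match[0], 'title'),
      link: field(match[0], 'link'),      // the web page, not the feed URL
      pubDate: field(match[0], 'pubDate') // metadata a whole-feed index loses
    });
  }
  return items;
}
```

Each record carries the link to the human-readable page, which is what a search result should point to, with the feed URL offered as an extra.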

It does appear that Yahoo has got some of this right, linking to web pages (and sometimes the web pages of the individual items). However, the same does not apply to their search API, which does use RSS feed URLs as the main link for each search result, and it does not provide the web page alternative. Which leads me to the news today of Yahoo Weather in RSS. They’re even including some excellent data in there, but, they’ve defined a new namespace for some of this data, which points to http://xml.weather.yahoo.com/ns/rss/1.0, which returns a 404 now. Also it’d be nice if they labeled their namespace ‘weather,’ rather than ‘yweather.’ And I strongly suspect that there are existing weather vocabularies they may have been able to use instead.

Anyway, back to MSN Search, they’ve introduced two new syntaxes, feed:, to specify to look for RSS feeds, and hasfeed: to specify that the results are web pages that have RSS feeds. That seems okay, but the way to use the syntax is odd. For example feed: site:bbc.co.uk. It has been semi-standard for a while to use syntax like syntax:foo, as in the site: keyword used, however the new syntax seems to be syntax: by itself. Confusing. Let’s just assume that this is temporary, until there’s a web-based interface for choosing to find RSS feeds.

</rant>

Search Interface Protocols and Specifications

Search Interface Protocols and Specifications - I haven’t actually read this (yet) but I’ve been thinking along these lines lately. To add to the list of protocols/specifications discussed, I would also mention Sherlock and Mycroft, the format used for Dave’s Quick Search Deskbar (examples), “quick searches” such as those used in Mozilla products and the virtually identical method used for the Google Deskbar, and the Yahoo! Search Web Services. And I’m sure there are others that I’m not aware of or can’t think of right now. Via Lorcan Dempsey.

yahoo site searching syntax

Here’s a summary of what I’ve learned when restricting Yahoo! search to specific websites.

  • always use brackets

    prevents errors, especially with boolean. not really necessary in this example, but nevertheless: (search terms) (site:example.com)

  • use capitals for boolean

    site:example.com OR site:example2.com

  • specify field names always

    Use site:example.com OR site:example2.com not site:(example.com OR example2.com)

  • how to specify paths

    to specify a website that isn’t a (sub)domain use site:example.com inurl:folder/folder2 for the website example.com/folder/folder2/

    One problem is that if you are specifying a domain name and a website with a path, results for the latter will be ranked higher, because they match both site: and inurl:. To compensate for that, you could use a different method: inurl:example_com/folder/folder2. Note the use of the underscore instead of a dot for the last (and only the last) dot in the domain name. Also, in rare circumstances, this will find pages that are not in example.com, but have those terms in the URL somewhere.

  • specifying multiple folders in a site

    site:example.com (inurl:folder OR inurl:folder2)
    or
    inurl:example_com/folder OR inurl:example_com/folder2

  • specifying multiple sites with paths

    this can be derived from previous points, but here goes: site:example.com OR (site:example2.com inurl:folder) or more advanced: site:example.com OR site:example2.com OR (site:example3.com (inurl:folder OR inurl:folder2 OR inurl:folder3))

  • Use OR for multiple exclusion

    NOT (site:example.com OR site:example2.com)

  • putting it all together

    (search terms) (site:example.com OR (site:example2.com inurl:folder)) NOT (site:sub.example.com OR (site:example.com inurl:somefolder)) note that this is restricting to two websites (one with a path) but excluding pages from that first website that are in a specific subdomain or folder
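The rules above are mechanical enough to put into a small function. Here is a minimal sketch, with hypothetical names (siteClause, buildQuery) and a made-up input shape of { domain, folders }; it only covers the site: plus inurl: method, not the underscore trick, and it assumes Yahoo keeps parsing these queries the way I’ve observed.

```javascript
// Hypothetical helper: one site, optionally restricted to folders.
//   siteClause('example.com')             -> site:example.com
//   siteClause('example.com', ['folder']) -> (site:example.com inurl:folder)
function siteClause(domain, folders) {
  if (!folders || folders.length === 0) {
    return 'site:' + domain;
  }
  var paths = folders.map(function (f) { return 'inurl:' + f; });
  // Several folders are ORed together inside their own brackets.
  var pathPart = paths.length === 1 ? paths[0] : '(' + paths.join(' OR ') + ')';
  return '(site:' + domain + ' ' + pathPart + ')';
}

// Hypothetical helper: bracketed terms, ORed inclusions, and a single
// NOT group for exclusions.
function buildQuery(terms, include, exclude) {
  var toClause = function (s) { return siteClause(s.domain, s.folders); };
  var query = '(' + terms + ') (' + include.map(toClause).join(' OR ') + ')';
  if (exclude && exclude.length > 0) {
    query += ' NOT (' + exclude.map(toClause).join(' OR ') + ')';
  }
  return query;
}

var combined = buildQuery(
  'search terms',
  [{ domain: 'example.com' }, { domain: 'example2.com', folders: ['folder'] }],
  [{ domain: 'sub.example.com' }, { domain: 'example.com', folders: ['somefolder'] }]
);
// combined is the same query as the "putting it all together" example.
```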