Blogdigger Dev Blog: Blogdigger Acquired by Odeo

Blogdigger Dev Blog: Blogdigger Acquired by Odeo - very interesting news. Back in March 2003 Dave Winer blogged that perhaps there should be a search engine based off data in RSS feeds. Three people started coding that weekend, Greg Gershman, Scott Johnson, and François Schiettecatte, forming BlogDigger, Roogle, and RSS-Search.

Scott’s “Roogle” launched by Mondayish, was Slashdotted to death, and became Feedster. Scott and François both decided to make a company out of their ventures and, realizing that they both lived in Boston, merged into Feedster. Greg later incorporated BlogDigger, but didn’t take quite the same route. I became online friends with them all to varying degrees, got to intern at Feedster in 2005, and also met Greg during that time. Although it never really attained the prestige that Feedster did, BlogDigger was always cool to me. Greg added searching by category (generally called tags these days) thanks to my suggestion. His geo-based search is pretty nice, despite the fairly small number of geocoded blogs.

Feedster is 404ing these days, and PubSub later died as well. Technorati eventually added searching to their backlink features, but they are struggling to some extent these days. The likes of Google Blogsearch hardly help. At any rate, Greg hung in there, and it is great to see BlogDigger living on even further. I wish him the best at his new job.

dougmccune.com » Blog Archive » Not Your Mamma’s Maps

dougmccune.com » Blog Archive » Not Your Mamma’s Maps - a very slick app for browsing geospatial data. Their screencast shows input through a geoRSS feed, but I am guessing that will not be freely available. Via Mikel.

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions

High Earth Orbit » Blog Archive » OpenSearch Geo and Time extensions - I was working on something like this back when I was working on OpenSearch, but the industry and formats weren’t at the right stage. It’s great that the community is working on this now, there is so much potential. I like Andrew’s point “that is almost too easy.”

As always, DeWitt is involved. Here’s the latest draft.

randomness

work seems to be keeping me rather busy

Yesterday I got around to fixing a several-month-old bug with my University of Waterloo search engine. Turns out the problem was Yahoo having changed their query parser. The query I was sending used to be

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com)

however example.com wasn’t showing up in the results… the fix was adding a space before the closing parenthesis.

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com )

I wish Yahoo would publicly document all of their advanced search syntax, including the maximum query length.
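For what it’s worth, the workaround can be sketched in a few lines of Python. This is only an illustration: the function name `build_site_query` is made up for this example, and the space before the closing parenthesis reflects what I observed rather than any documented Yahoo behaviour.

```python
def build_site_query(terms, sites):
    """Combine search terms with an OR-ed list of site: restrictions."""
    if not sites:
        return terms
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    # Note the space before the closing parenthesis: without it, one of
    # the site: restrictions was silently dropped from the results.
    return f"{terms} ({site_clause} )"


query = build_site_query("search terms", ["example.com", "example2.com"])
print(query)
```

Keeping the query construction in one place like this at least means the next parser change only needs fixing once.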

I’ve been meaning to do another OpenSearch Update post. I’ve recently started adding some of these to del.icio.us. Noticing lots of non-English blog posts on OpenSearch lately, which is very cool. Today someone asked about including thumbnails. I’ve replied suggesting Media RSS but asking for consensus (although my email still needs to be moderated).

Lots of neat stuff in the mapping space lately. Thanks to Mikel Maron, Virtual Earth now has georss feeds.

So for years I’ve been largely ignoring the social networking websites. Or to be more accurate, reading up on them a lot, but not actually using them. Among other things, I don’t want to waste my time, nor provide a lot of my personal data to some walled garden. Regarding the latter, PeopleAggregator has been out for a while, and I hadn’t gotten around to congratulating Marc and Phillip. Anyhow, Facebook came to my school (this year I believe) and I’ve found that I’m actually using it. Not much, but more than I’ve ever used another similar site. Unlike the first generation of these websites, it actually has a point to it. I’m still resisting uploading photos to it (if I annotate those photos, am I ever going to be able to export that? highly unlikely) and I don’t like using it for messaging, because it won’t be searchable and integrated with my email or instant messaging services. Amusingly enough, I do think Facebook will actually succeed in making money. Hmm.. I guess I don’t have any major point to make here..

There is no XML without namespaces

Yes, this makes two blog posts today, and yes, I’m going to talk about XML again.

I’ve suspected this for a while, but hadn’t looked into it. Thanks to Sam Ruby, I see that someone has: Who knows an XML document from a hole in the ground? shows that indeed, a lot of RSS/Atom parsers are not reading XML as XML… or at least, they’re not understanding the namespaces.

This wasn’t a problem when most feeds were bare-bones, and before Atom. Now, only a couple of years after I expected, all sorts of data and metadata is starting to be put into feeds, with lots of different namespaces.

This is one of those things where if you’re a feed reader, and you don’t understand namespaces, you are broken, and need to be fixed. There’s no way around it, end of story.

That being said, I’m much more optimistic now than I was about those fixes actually happening. Phil Ringnalda’s Atom title tests really did help and pushed a lot more readers into supporting it properly. Now let’s see some real XML parsing.
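To make the point concrete, here is a minimal sketch using Python’s standard `xml.etree.ElementTree`. The two sample feeds and the helper name `feed_title` are invented for illustration. Both documents mean exactly the same thing as XML; a parser that matches on the literal string `title` or on a hard-coded prefix will only handle one of them.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Two feeds that are identical as XML: one uses the default namespace,
# the other binds the Atom namespace to an arbitrary prefix.
FEED_DEFAULT = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<title>Example</title></feed>"
)
FEED_PREFIXED = (
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom">'
    "<a:title>Example</a:title></a:feed>"
)


def feed_title(xml_text):
    """Look up the Atom title by namespace URI, not by prefix."""
    root = ET.fromstring(xml_text)
    title = root.find(f"{{{ATOM_NS}}}title")
    return None if title is None else title.text


titles = [feed_title(FEED_DEFAULT), feed_title(FEED_PREFIXED)]
print(titles)
```

The lookup keys on the namespace URI, so whatever prefix the publisher happened to pick makes no difference.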

FeedBurner is cool, but…

FeedBurner offers a very attractive service, and their new FeedFlare is just one part of that. But please, FeedBurner… when a user changes some settings, record the time of that change and only allow that change to affect new items. Not that it isn’t fun to see a whole lot of my subscriptions suddenly all marked as unread.
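Roughly what I have in mind, as a sketch only: I have no idea how FeedBurner is built internally, so the `Item` class, the `render` function, and the footer string below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Item:
    guid: str
    body: str
    published: datetime


def render(item, footer, settings_changed_at):
    """Append the footer only to items published after the settings change."""
    if item.published >= settings_changed_at:
        return item.body + footer
    # Older items are served exactly as before, so feed readers have no
    # reason to treat them as changed and mark them unread.
    return item.body


changed_at = datetime(2006, 3, 1, tzinfo=timezone.utc)
old_item = Item("1", "old post", datetime(2006, 2, 1, tzinfo=timezone.utc))
new_item = Item("2", "new post", datetime(2006, 3, 2, tzinfo=timezone.utc))

old_out = render(old_item, " [flare]", changed_at)
new_out = render(new_item, " [flare]", changed_at)
print(old_out)
print(new_out)
```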

What’s wrong with MSN’s RSS search

News from Luigi about RSS search from MSN leads me to think MSN Search knows what they’re doing. Or not.

They are putting RSS/Atom search integrated right in with their web search. This is good. But… they’re displaying RSS feeds as regular search results, without modification. That means that when you click on an RSS feed result, you are taken to (surprise) the RSS feed, which, most of the time, is not in a human-readable format. Hello usability? This is acceptable for a major engine to put out for average web users?? Additionally, the ‘cache’ link for RSS feed results shows a somewhat more human-readable view, but it could definitely be improved.

Virtually all, if not all, RSS feeds today are representations of existing web pages. It would make a little more sense to point to those, and provide an additional link to the actual RSS feed. This is essentially what all the major RSS search engines are smart enough to do, including Feedster, Blogdigger, and Bloglines.

Actually those engines are all smarter still, since they’re indexing individual RSS items rather than whole RSS feeds as if they were a single document. That’s a huge benefit of RSS: the individual items have been separated, and usually come with important metadata, like the date. MSN doesn’t seem to make use of this at all, although admittedly their implementation is new.
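As a rough sketch of what item-level indexing looks like (not how any of those engines actually do it), the following splits an RSS 2.0 feed into one record per item, keeps the date, and links to the web page with the feed URL as a secondary link. The sample feed, the URLs, and the function name `item_documents` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

RSS = """<rss version="2.0"><channel>
<title>Example Blog</title>
<link>http://example.com/</link>
<item><title>First post</title><link>http://example.com/1</link>
<pubDate>Mon, 06 Mar 2006 10:00:00 GMT</pubDate></item>
<item><title>Second post</title><link>http://example.com/2</link>
<pubDate>Tue, 07 Mar 2006 12:30:00 GMT</pubDate></item>
</channel></rss>"""


def item_documents(rss_text, feed_url):
    """Turn one feed into one indexable record per item."""
    channel = ET.fromstring(rss_text).find("channel")
    site_link = channel.findtext("link")
    docs = []
    for item in channel.findall("item"):
        pub = item.findtext("pubDate")
        docs.append({
            "title": item.findtext("title"),
            # Main link: the item's web page, falling back to the site.
            "url": item.findtext("link") or site_link,
            # The raw feed is kept only as an additional link.
            "feed_url": feed_url,
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return docs


docs = item_documents(RSS, "http://example.com/feed.xml")
for doc in docs:
    print(doc["published"], doc["title"], doc["url"])
```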

It does appear that Yahoo has got some of this right, linking to web pages (and sometimes the web pages of the individual items). However, the same does not apply to their search API, which does use RSS feed URLs as the main link for each search result, and it does not provide the web page alternative. Which leads me to the news today of Yahoo Weather in RSS. They’re even including some excellent data in there, but, they’ve defined a new namespace for some of this data, which points to http://xml.weather.yahoo.com/ns/rss/1.0, which returns a 404 now. Also it’d be nice if they labeled their namespace ‘weather,’ rather than ‘yweather.’ And I strongly suspect that there are existing weather vocabularies they may have been able to use instead.

Anyway, back to MSN Search: they’ve introduced two new operators, feed:, which restricts results to RSS feeds, and hasfeed:, which restricts results to web pages that have RSS feeds. That seems okay, but the way to use the syntax is odd. For example feed: site:bbc.co.uk. It has been semi-standard for a while to use syntax like syntax:foo, as in the site: keyword used, however the new syntax seems to be syntax: by itself. Confusing. Let’s just assume that this is temporary, until there’s a web-based interface for choosing to find RSS feeds.
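The oddity is easy to see if you try to parse such a query. The toy tokenizer below is purely my own sketch and says nothing about how MSN actually parses queries; `parse_query` is a made-up name, and it naively treats any token containing a colon as an operator.

```python
def parse_query(query):
    """Split a query into operators (keyword:value) and plain terms."""
    operators, terms = {}, []
    for token in query.split():
        keyword, sep, value = token.partition(":")
        if sep and keyword:
            # A bare "feed:" ends up with an empty value, so it acts as
            # a flag rather than a keyword:value restriction.
            operators[keyword] = value
        else:
            terms.append(token)
    return operators, terms


parsed = parse_query("feed: site:bbc.co.uk news")
print(parsed)
```

Here site: carries a value while feed: carries none, which is exactly the inconsistency.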

</rant>