Term Extraction Documentation for Yahoo! Search Web Services - YDN

Yahoo!’s Term Extraction Web Search is about to be discontinued. very sad

wait, nevermind

Email and Newsreader management; thanks to Mailbucket and Yahoo Pipes

I’ve been working (for a long time, but more lately) on reducing what I see in my email and newsreader, and on differentiating the two.

I’ve decided that my email (Gmail) is for personal communication, as in mail that people specifically send to me, and important computer-generated emails like bill payment reminders. My newsreader (Google Reader) is for news and alerts that, while interesting to me, are not crucial. This aligns with the fact that I always check my email first, and that if I had to, I could just mark newsreader items as read without any real consequences.

I am using three things to do this: one policy and two web-based tools.

Policy-wise, I examine more closely what emails I get. In the past I might have deleted an email from some company or website that I wasn’t interested in, but now I take the extra time to go to their website and either unsubscribe completely or uncheck certain parts of what they send me.

I already use my newsreader to subscribe to feeds when possible, but a lot of sites still only have email newsletters. There are a few services which will allow you to convert emails into RSS feeds and I’m finding MailBucket to be the best. I create a filter in Gmail so that all mail from the newsletter is automatically marked as read, moved to my archive (so it doesn’t show up in my inbox) and forwarded to that-newsletter-name@mailbucket.org. That way, I don’t have to give these websites an alternate email address; they still use my Gmail address, and I still archive all those emails in Gmail, but I will never see them there unless I want to. Instead, I subscribe to http://mailbucket.org/that-newsletter-name.xml in my newsreader, and I see all the content there. Perfect.
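
To make the naming convention concrete, here is a tiny Python sketch that derives both the forwarding address and the feed URL from one newsletter name. The function name is made up for illustration, and it only assumes the two URL patterns mentioned above.

```python
def mailbucket_addresses(name):
    """Return (forwarding address, feed URL) for one newsletter name.

    Hypothetical helper: it only formats the two patterns described
    in the post and does not talk to MailBucket at all.
    """
    forward_to = f"{name}@mailbucket.org"
    feed_url = f"http://mailbucket.org/{name}.xml"
    return forward_to, feed_url


forward_to, feed_url = mailbucket_addresses("that-newsletter-name")
print(forward_to)  # address to use in the Gmail filter's "forward to" field
print(feed_url)    # feed to subscribe to in the newsreader
```

The same name is used on both sides, so picking a distinct name per newsletter keeps each one in its own feed.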

Getting items from my email to my newsreader is one step, but there are also many feeds where I am only interested in some of the items, not all. For these, I create a Yahoo Pipe that takes in a particular feed and filters it, either excluding items that match various criteria or including only items that match. I unsubscribe from the original feed and subscribe to the filtered version, immediately reducing how much stuff gets into my newsreader. Very nice. There are some minor drawbacks: anything that relies on my newsreader’s data becomes less accurate, such as Google Reader recommending feeds to me based on what I already read, or Google Reader’s crawler telling websites how many people have subscribed to their feeds. Either way, these are tiny problems compared to the great benefits.
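
Pipes itself is a visual tool, so there is no code to show for it, but the include/exclude idea can be sketched in a few lines of Python. This is only an illustration: the item dictionaries, the keyword matching on titles, and the function name are all assumptions, not how Pipes works internally.

```python
def filter_items(items, include=(), exclude=()):
    """Keep items that match at least one include keyword (when any are
    given) and none of the exclude keywords. Matching is a simple
    case-insensitive substring test on the item title."""
    kept = []
    for item in items:
        title = item["title"].lower()
        if include and not any(word.lower() in title for word in include):
            continue  # an include list was given and nothing matched
        if any(word.lower() in title for word in exclude):
            continue  # matched something I never want to see
        kept.append(item)
    return kept


# Hypothetical feed items, standing in for a parsed feed.
items = [
    {"title": "Waterloo events this week"},
    {"title": "Sponsored: buy now"},
    {"title": "Campus news"},
]
for item in filter_items(items, exclude=["sponsored"]):
    print(item["title"])
```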

Pipes Blog » Blog Archive » Introducing iCal and CSV Support

Pipes Blog » Blog Archive » Introducing iCal and CSV Support - so this news is way old… somehow I never knew that Yahoo Pipes supported iCal for both input and output. Pipes is such an amazingly powerful tool, I definitely need to play with it more, especially for events.

WordPress › MyBlogLog: Just for you « WordPress Plugins

WordPress › MyBlogLog: Just for you « WordPress Plugins - I doubt this will get much uptake, but it is actually really neat. Anyone with a mybloglog cookie, when viewing a blog with this plugin, will see a list of posts on that blog that specifically match their interests.

Yahoo Embraces The Semantic Web - Expect The Internet To Organize Itself In A Hurry

Yahoo Embraces The Semantic Web - Expect The Internet To Organize Itself In A Hurry - wow. Watching things grow sloooowly for a long time, and then it finally seems like things are picking up… very exciting.

Update: link is The Yahoo! Search Open Ecosystem

Popfly

Popfly - I heard about this first through email, but it’s all over the web as well. Microsoft has done an amazing job of making it really easy to combine web services, and I only hope that the output itself (something in an iframe?) is just as web malleable as the services it uses.

randomness

work seems to be keeping me rather busy

Yesterday I got around to fixing a several-month-old bug with my University of Waterloo search engine. Turns out the problem was Yahoo having changed their query parser. The query I was sending used to be

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com)

However, example.com wasn’t showing up in the results… the fix was adding a space before the closing parenthesis.

search terms (site:example.com OR site:example2.com OR ... site:exampleN.com )
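
For illustration, here is roughly how such a query string could be assembled in Python, including the extra space before the closing parenthesis. The function name is hypothetical and this is not the actual code behind my search engine.

```python
def build_query(terms, sites):
    """Join the sites into one OR group and append it to the search terms.

    The space before the closing parenthesis is deliberate: it is the
    workaround described above for the changed query parser.
    """
    site_group = " OR ".join(f"site:{site}" for site in sites)
    return f"{terms} ({site_group} )"


print(build_query("search terms", ["example.com", "example2.com"]))
```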

I wish Yahoo would publicly document all of their advanced search syntax, including the maximum query length.

I’ve been meaning to do another OpenSearch Update post. I’ve recently started adding some of these to del.icio.us. Noticing lots of non-English blog posts on OpenSearch lately, which is very cool. Today someone asked about including thumbnails. I’ve replied suggesting Media RSS but asking for consensus (although my email still needs to be moderated).

Lots of neat stuff in the mapping space lately. Thanks to Mikel Maron, Virtual Earth now has georss feeds.

So for years I’ve been largely ignoring the social networking websites. Or to be more accurate, reading up on them a lot, but not actually using them. Among other things, I don’t want to waste my time, nor provide a lot of my personal data to some walled garden. Regarding the latter, PeopleAggregator has been out for a while, and I hadn’t gotten around to congratulating Marc and Phillip. Anyhow, Facebook came to my school (this year I believe) and I’ve found that I’m actually using it. Not much, but more than I’ve ever used another similar site. Unlike the first generation of these websites, it actually has a point to it. I’m still resisting uploading photos to it (if I annotate those photos, am I ever going to be able to export that? highly unlikely) and I don’t like using it for messaging, because it won’t be searchable and integrated with my email or instant messaging services. Amusingly enough, I do think Facebook will actually succeed in making money. Hmm… I guess I don’t have any major point to make here…

TechCrunch » Rumor: Yahoo Acquired Jotspot

TechCrunch » Rumor: Yahoo Acquired Jotspot - I don’t usually report on breaking news rumours but this one is very interesting, because Jotspot is such an amazing company/product.

Yahoo! buys Upcoming.org

This is a very significant move. Need I say more? Events and calendaring are gearing up to be huge. Does this deal have anything to do with the Google Calendar rumours? Via Software Only.

What’s wrong with MSN’s RSS search

News from Luigi about RSS search from MSN leads me to think MSN Search knows what they’re doing. Or not.

They are integrating RSS/Atom search right in with their web search. This is good. But… they’re displaying RSS feeds as regular search results, without modification. That means that when you click on an RSS feed result, you are taken to (surprise) the RSS feed, which, most of the time, is not in a human-readable format. Hello usability? This is acceptable for a major engine to put out for average web users?? Additionally, the ‘cache’ link for RSS feed results shows a somewhat more human-readable version, but it could definitely be improved.

Virtually all, if not all RSS feeds today are representations of existing web pages. It would make a little more sense to point to those, and provide an additional link to the actual RSS feed. This is essentially what all the major RSS search engines are smart enough to do, including Feedster, Blogdigger, and Bloglines.

Actually those engines are all smarter still, since they’re indexing individual RSS items rather than whole RSS feeds as if they were a single document. That’s a huge benefit of RSS; that the individual items have been separated, and usually come with important metadata, like the date. MSN doesn’t seem to make use of this at all, although admittedly their implementation is new.
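
To show what treating items as separate documents means, here is a minimal Python sketch that splits a small RSS 2.0 feed into one record per item, keeping the date. The sample feed and the function name are invented for illustration; this is not how any of the engines above actually index feeds.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with two items.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<link>http://example.com/</link>
<item><title>First post</title><link>http://example.com/1</link>
<pubDate>Mon, 06 Jun 2005 10:00:00 GMT</pubDate></item>
<item><title>Second post</title><link>http://example.com/2</link>
<pubDate>Tue, 07 Jun 2005 10:00:00 GMT</pubDate></item>
</channel></rss>"""


def items_as_documents(feed_xml):
    """Return one small record per <item>, rather than one per feed."""
    channel = ET.fromstring(feed_xml).find("channel")
    documents = []
    for item in channel.findall("item"):
        documents.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),   # the web page, not the feed URL
            "date": item.findtext("pubDate"),
        })
    return documents


for doc in items_as_documents(SAMPLE_FEED):
    print(doc["date"], doc["link"], doc["title"])
```

Each record links to the item’s own web page, which is also what a search result should point at.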

It does appear that Yahoo has got some of this right, linking to web pages (and sometimes the web pages of the individual items). However, the same does not apply to their search API, which does use RSS feed URLs as the main link for each search result, and it does not provide the web page alternative. Which leads me to the news today of Yahoo Weather in RSS. They’re even including some excellent data in there, but, they’ve defined a new namespace for some of this data, which points to http://xml.weather.yahoo.com/ns/rss/1.0, which returns a 404 now. Also it’d be nice if they labeled their namespace ‘weather,’ rather than ‘yweather.’ And I strongly suspect that there are existing weather vocabularies they may have been able to use instead.

Anyway, back to MSN Search, they’ve introduced two new syntaxes, feed:, to specify to look for RSS feeds, and hasfeed: to specify that the results are web pages that have RSS feeds. That seems okay, but the way to use the syntax is odd. For example feed: site:bbc.co.uk. It has been semi-standard for a while to use syntax like syntax:foo, as in the site: keyword used, however the new syntax seems to be syntax: by itself. Confusing. Let’s just assume that this is temporary, until there’s a web-based interface for choosing to find RSS feeds.

</rant>

Google Maps API

Google Maps API - finally, and thank goodness. I haven’t looked at the API yet, hopefully it still leaves a place for Mikel’s fantastic worldKit. Via Google Blog.

Update later June 29: also today, the Yahoo! Maps Web Service. All because it’s the beginning of the Where 2.0 conference. I’m still not sure how I feel about the industry practice of launching things to coincide with conferences…

yahoo site searching syntax

Here’s a summary of what I’ve learned when restricting Yahoo! search to specific websites.

  • always use brackets

    prevents errors, especially with boolean. not really necessary in this example, but nevertheless: (search terms) (site:example.com)

  • use capitals for boolean

    site:example.com OR site:example2.com

  • specify field names always

    Use site:example.com OR site:example2.com not site:(example.com OR example2.com)

  • how to specify paths

    to specify a website that isn’t a (sub)domain use site:example.com inurl:folder/folder2 for the website example.com/folder/folder2/

    One problem is that if you are specifying a domain name and a website with a path, results for the latter will be ranked higher, because they match both site: and inurl:. To compensate for that, you could use a different method: inurl:example_com/folder/folder2. Note the use of the underscore instead of a dot for the last (and only the last) dot in the domain name. Also, in rare circumstances, this will find pages that are not in example.com, but have those terms in the URL somewhere.

  • specifying multiple folders in a site

    site:example.com (inurl:folder OR inurl:folder2)
    or
    inurl:example_com/folder OR inurl:example_com/folder2

  • specifying multiple sites with paths

    this can be derived from previous points, but here goes: site:example.com OR (site:example2.com inurl:folder) or more advanced: site:example.com OR site:example2.com OR (site:example3.com (inurl:folder OR inurl:folder2 OR inurl:folder3))

  • Use OR for multiple exclusion

    NOT (site:example.com OR site:example2.com)

  • putting it all together

    (search terms) (site:example.com OR (site:example2.com inurl:folder)) NOT (site:sub.example.com OR (site:example.com inurl:somefolder))
    Note that this restricts results to two websites (one with a path) while excluding a specific subdomain and a specific folder of the first website.
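
The rules above can be sketched as a small Python query builder. This is a rough sketch only: the function names and the (site, folders) pair format are made up for illustration, and it covers just the site: plus inurl: method, not the underscore variant.

```python
def site_clause(site, folders=()):
    """Build the clause for one website, optionally narrowed to folders."""
    if not folders:
        return f"site:{site}"
    if len(folders) == 1:
        return f"(site:{site} inurl:{folders[0]})"
    inurls = " OR ".join(f"inurl:{folder}" for folder in folders)
    return f"(site:{site} ({inurls}))"


def restricted_query(terms, include, exclude=()):
    """include and exclude are lists of (site, folders) pairs.

    Brackets are always used, OR and NOT are capitalised, and the
    field name is repeated for every site, as in the list above.
    """
    included = " OR ".join(site_clause(s, f) for s, f in include)
    query = f"({terms}) ({included})"
    if exclude:
        excluded = " OR ".join(site_clause(s, f) for s, f in exclude)
        query += f" NOT ({excluded})"
    return query


print(restricted_query(
    "search terms",
    include=[("example.com", []), ("example2.com", ["folder"])],
    exclude=[("sub.example.com", []), ("example.com", ["somefolder"])],
))
```

Called this way, it reproduces the combined query from the last bullet.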