My personal site, epcostello.net, has been up since 2003. Over the past five years I've redesigned, reorganized, remodeled and removed a lot of the site. When it first launched, my personal blog was epcostello.net/journal, my link and commentary blog was epcostello.net/epicrisis, and yet another blog had long-form essays. Since then I've consolidated everything under epcostello.net/epicrisis.
Each time I've remodeled, I've tried to be good and redirect old URLs to the appropriate new URL.
In a few cases, specifically with web feeds, I've turned off the redirects and issued 410 Gone responses instead.
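For anyone who wants to do the same and is curious what that looks like in practice: the sketch below has nothing to do with how this site is actually served, it just shows the two responses side by side using Python's standard-library HTTP server, with made-up paths.

    # A sketch only -- not how this site is served. It just shows the two
    # responses discussed above: a permanent redirect, then "gone".
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Illustrative paths: one feed retired for good, one still redirected.
    GONE = {"/journal/rss.xml"}                         # answer 410
    MOVED = {"/old/feed.xml": "/epicrisis/index.rss"}   # answer 301

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in GONE:
                # 410: the resource existed, is gone, and is not coming back.
                self.send_response(410)
                self.end_headers()
            elif self.path in MOVED:
                # 301: permanently moved; a client should update its stored URL.
                self.send_response(301)
                self.send_header("Location", MOVED[self.path])
                self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), FeedHandler).serve_forever()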
As far as I can tell, few feed readers, slurpers, indexers, and what-not ever remove a feed. They ignore 404 File Not Found, and they also appear to ignore 410 (which is as explicit as you can get: the file existed, it's gone, it's not expected to return, now go away).
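For contrast, here's roughly what I'd expect a well-behaved fetcher to do with those responses. It's only a sketch, and the subscription store and names are hypothetical, but the point stands: a 301 should update the stored URL, and a 410 should end the subscription outright.

    # Sketch of polite feed fetching; the subscription store is hypothetical.
    import urllib.error
    import urllib.request

    def poll_feed(subscriptions, url):
        """Fetch one feed and react sensibly to redirects and 'gone' responses."""
        request = urllib.request.Request(url, headers={"User-Agent": "polite-reader/0.1"})
        try:
            with urllib.request.urlopen(request) as response:
                final_url = response.geturl()
                if final_url != url:
                    # Redirected: remember the new location instead of hitting
                    # the old URL again on every poll.
                    subscriptions[final_url] = subscriptions.pop(url)
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code == 410:
                # Gone means gone: drop the subscription, don't ask again.
                subscriptions.pop(url, None)
            elif err.code == 404:
                # Not found: count the failure and give up after a few of them.
                subscriptions[url]["failures"] = subscriptions[url].get("failures", 0) + 1
                if subscriptions[url]["failures"] >= 5:
                    subscriptions.pop(url, None)
            return None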
So, while looking through the raw logs for my personal site for May, I came across a flurry of hits on epcostello.net/journal/rss.xml from an agent identifying itself as BuzzTracker/1.02.
Now, thanks to the wonder of cheap disk, I can tell you the following about that URL:

- A 301 permanently moved redirect went in on July 25, 2005.
- That was replaced with a 410 on April 17, 2007.

It does not cost me anything, really, to serve this, but BuzzTracker is just one of many user-agents out there so poorly written that it continues to fetch a URL it has been told is permanently gone, over and over again. And that all adds up to wasted bandwidth and processor use on my part.
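If you keep your raw logs around and want the same kind of tally for a retired URL of your own, something along these lines will pull it out of Apache's combined log format (the log file name and target path are just examples):

    # Tally who keeps requesting a retired URL, per user-agent and status,
    # from an Apache combined-format access log. Paths are illustrative.
    import collections
    import re

    LOG_LINE = re.compile(
        r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ '
        r'"[^"]*" "(?P<agent>[^"]*)"'
    )

    def tally(log_path, target="/journal/rss.xml"):
        counts = collections.Counter()
        with open(log_path, errors="replace") as log:
            for line in log:
                match = LOG_LINE.search(line)
                if match and match.group("path") == target:
                    counts[(match.group("agent"), match.group("status"))] += 1
        return counts

    for (agent, status), hits in tally("access_log").most_common(20):
        print(f"{hits:6d}  {status}  {agent}")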
I had not heard of BuzzTracker, so I looked around on the site, intending to send feedback asking that the feed be removed from their cache so that they stop requesting it.
The feedback page reads "Page not found" (though it returns a "200" and not a "404").
Further tooling around reveals that the site was bought by Yahoo! in 2007 (a year later, it has no Yahoo! branding and not much else to indicate any integration with Yahoo!). So you get this blog post instead.
Now, I should point out that there are other agents hitting the same URL, getting the same 410, and continuing to do so on an hourly and daily basis.
This is just sloppy programming, and I find it really frustrating: it just adds noise and burden on the server side. You might think "Oh, look, it's only one hit every couple of hours," but there's no limit. The 410 is supposed to be that limiting factor; it's an intentional statement on the server administrator's behalf that the resource is gone, gone, gone. Go away. Really. Requesting it again in an hour will not make it return.
I follow, and occasionally post to, the twitter-development list; I even have a couple of Twitter-related projects sitting on the side waiting for the launch of OAuth (or something comparable) for accessing Twitter. There are a lot of great ideas there, but there's a lot of dumb programming as well. Inasmuch as Twitter has its own stability problems, I wonder (and believe) whether many of those problems are caused by gut-wrenchingly bad programming in some of the tools people are writing against the Twitter API. Many don't seem to do any caching at all; they pound away at the API making requests that could be computed on the client side, and instead of backing off they keep retrying a failed command until the account gets locked out for exceeding the API request limit.
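None of this is hard to get right. Here's a sketch of client-side caching plus exponential backoff against a rate-limited HTTP API; the cache lifetime and retry counts are made up for illustration, and this is not Twitter's actual API or anyone's shipping code:

    # Sketch of client-side caching plus exponential backoff for a rate-limited
    # HTTP API. The cache lifetime and retry limits are arbitrary examples.
    import time
    import urllib.error
    import urllib.request

    _cache = {}            # url -> (fetched_at, body)
    CACHE_SECONDS = 300    # reuse a response for five minutes instead of re-fetching

    def fetch(url, max_attempts=5):
        """Return the body for url, reusing a recent copy and backing off on errors."""
        cached = _cache.get(url)
        if cached and time.time() - cached[0] < CACHE_SECONDS:
            return cached[1]                  # don't hit the API again so soon

        delay = 1.0
        for attempt in range(max_attempts):
            try:
                with urllib.request.urlopen(url) as response:
                    body = response.read()
                    _cache[url] = (time.time(), body)
                    return body
            except urllib.error.HTTPError as err:
                if err.code in (400, 401, 403, 404, 410):
                    raise                     # retrying won't change the answer
                time.sleep(delay)             # transient error or rate limit: wait...
                delay *= 2                    # ...and wait twice as long next time
        raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")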
Having been on the wrong end of the Internet firehose many times myself, can I just ask that developers give more than 30 seconds of thought, before unleashing these nifty gadgets onto the world, to what the impact will be on the (likely free) services they're beating the crap out of?
Posted in Web Services