202: Accepted

A collection of informal posts about web and internet technology.

How I'd Fix Twitter (and why that won't happen)

I started playing with twitter in January 2007, well before that year's SXSW breakout. I didn't start playing around with the twitter API until recently, joining the Twitter Development Talk Google Group, writing first my own command line update client (a very cunning combination of bash and curl which until this week totally failed to URL encode parameters. Doh.) and more recently a followers history tool (which isn't public, and won't be until I either get over accepting people's twitter credentials or some sort of third party authentication scheme is rolled out). My other posts about twitter are collected here: Topics/Web Services/Twitter.

I'll just preface the rest of this with: I have no inside insight, and have no idea what is wrong with twitter beyond the general "it's not scaling particularly well, is it?"

So, this is all just supposition.

I just wanted to make that clear.

The biggest problem with twitter? It's free.

  • There's no cost to posting updates to twitter.
  • There's no cost to pulling updates from twitter.
  • There's no cost∗ to using the twitter API.
  • There's no cost∗ to writing incredibly poor client code to use the twitter APIs.
  • There's no cost∗ to distributing client applications which amplify twitter's usage dramatically.

I know, you're thinking "WTF is that ∗ character there for?" and "There is a cost, you bozo, there's a rate limit!"

I concede your point, there is a rate limit, which applies on a per-user basis.

But that's not a cost. That's not a price. The per-user limit doesn't cause a user to sit back and think: "Ooh, do I really want to tweet 'Taking the dogs for a walk to the Telectroscope on Fulton Ferry'?"

I'm old school. I think it's the right, even the duty, of anyone running a web site to protect that site from abusive behavior, whatever that may be. I regularly rant on our nextNY list that people can and should take proactive measures to protect their sites (blocking users, bots, what-not). I don't think that we, as site managers (webmasters, whatever you want to call us these days), have to suck up all of the traffic thrown at a site just because, as a general principle, we're open to user generated content, APIs into our services, and means of extending whatever it is we're providing (and, in theory, profiting from in some way).

So, I think twitter's problems come down to three separate areas which are intersecting:

  • There's no cost to using twitter.
  • Twitter wasn't designed for the way it's being used.
  • Unintended side effects of network effects.

How I'd fix twitter

Stabilize the current system but limit its capabilities.
What? Yeah, I'd lock down the system, perhaps even in its current state: requests limited to 30, no Jabber. Stabilize it in that configuration, declare victory, and move on.
Create an operations team with no development responsibilities.
When you're fighting a technology fire, the same people who are deep in the bowels of the problem are usually also expected to keep the system running while they figure out how to solve it. It doesn't work. Over time at ibm.com, we evolved to have a "duty webmaster" who had the run of the site, while everyone who was "off-duty" worked on application development or bug fixing. Asking the programmer to solve the problem while running the site is like asking a firefighter to battle the fire while working with the architect and fire marshal to investigate the fire and prevent it from happening again.
Segregate API traffic to api.twitter.com
This has been mentioned on the mailing list several times, but I suspect that the twitter team is so busy keeping their heads above water that they just can't focus on this. After years of trying to consolidate IBM web sites into www.ibm.com, I've come to accept that there are certain traffic and application patterns which best serve the overall audience by being segregated off into their own play area.
Roll out some sort of third party authentication scheme
Whether it's oAuth, something comparable, or completely RYO, separating out third party API actions from individual users' accounts would go a long way to establishing an accurate accounting of just what's going on within the twitter system.
Create an "open" API sandbox
Currently anyone can create a twitter client and that has been one of the great factors in its growth. I think that should continue with a couple of minor modifications. This sandbox is one of them: basically move the current API setup to a sandbox with a reasonable rate limit for developers. Make it easy for developers to register. But make it a pain in the ass to use as a production API.
Lock down the public API
In parallel with the sandbox, make api.twitter.com a locked down site: you can't go into production without some sort of approval by twitter. Now, this review could be a technology review (how will this new client impact our service?) or a business review (how much will this client cost our service?). But simply saying "Stop! You have to be reviewed." imposes a cost to using the API, and that is a good thing. If you're just screwing around, use the sandbox. If you want to distribute code built on the twitter API, then there's a cost. Whether there's a financial cost or not is up to twitter.
Clean up the API
I'm not going to go into an API-by-API breakdown here (perhaps later), but many of the API functions spit back a lot of redundant data, meaning many more database calls, JOINs, UNIONs, etc. than necessary. That redundancy does make it much easier to deploy twitter clients: for example, the routine to get a list of my followers returns not only my followers, but their profiles and most recent status. All of that information is available through other API calls, but given the current imposition of API limits, I could quickly burn through the current request count simply trying to get the latest status for each follower (I have ~150 followers at present). So this API cleanup would have to be concurrent with rollout of oAuth or something comparable.
Tell Your Critics to Chill
I find some of the venom spewing throughout the blogosphere to be just shocking and disgusting. Twitter's a free service. Various pundits have written about writing a distributed twitter (Hey, let's not just slander them, let's take out what chance they have at a successful business model too!), or claimed that it's some sort of public resource. Just chill. Twitter's problems are solvable. Not necessarily easily solvable, but solvable.
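
To put rough numbers on the "Clean up the API" point above, here is a minimal sketch. It assumes the 30-request limit mentioned earlier applies per rate-limit window (the window length isn't stated here) and that a slimmed-down followers call would cost one request for the list plus one status request per follower. The function name and the cost model are illustrative assumptions, not twitter's actual accounting.

```python
import math


def windows_needed(followers: int, limit: int, per_follower_calls: int = 1) -> int:
    """Rate-limit windows needed to list followers and fetch each one's status."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # One call for the follower list, then per_follower_calls for each follower.
    total_requests = 1 + followers * per_follower_calls
    return math.ceil(total_requests / limit)


# Lean API (assumed): 1 list call + 150 status calls = 151 requests.
lean = windows_needed(followers=150, limit=30)
# Current "fat" call: the single followers request already carries each status.
fat = windows_needed(followers=150, limit=30, per_follower_calls=0)
print(lean, fat)  # 6 1
```

Under those assumptions a client with ~150 followers goes from one request to six full windows' worth, which is why trimming the responses only makes sense alongside a different way of accounting for API use.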

Why this won't happen…

.net culture is fickle. The amount of time it would take to rebuild twitter on the fly as I've outlined isn't very long: maybe weeks if one starts with a clear hand and blank sheet of paper, months under the current firestorm setup. But the various .net glitterati would sooner shove twitter under the water than see it succeed.

Imposing a cost to using twitter would have its own cost of course: what's made twitter grow is the API and the complete flexibility to develop applications and mashups against that API. Many of the mashups are only possible when the data and API use are financially free. Many of the clients are only possible when API use is transactionally free (or close enough).

Someone has to pay to use these services. If so many people find them to be useful, why hasn't a successful business model appeared for recouping the cost?

Watching twitter's struggles has been an education for me, not just for the technical issues they face, but also the network effects of the API usage, and the reaction and criticism they've received as they work through the technical issues.

Followup: Kee Hinckley has written up an excellent post to the twitter development group: An odd request for Twitter - Please stop fixing bugs in the API.

BuzzTracker: no one home?

My personal site, epcostello.net, has been up since 2003. Over five years I've redesigned, reorganized, remodeled and removed a lot of the site. When it first launched, my personal blog was epcostello.net/journal, my link and commentary blog was epcostello.net/epicrisis, and yet another blog had long form essays. Since then I've consolidated everything under epcostello.net/epicrisis.

Each time I've remodeled, I've tried to be good and redirect old URLs to the appropriate new URL. In a few cases, specifically with web feeds, I've turned off the redirects and now issue 410 Gone responses instead.

As far as I can tell, few feed readers, slurpers, indexers, what-not remove a feed, ever. They ignore 404 File Not Found, and also appear to ignore 410 (which is as explicit as you can get: The file existed, now it's gone, it's not expected to return, now go away).

So, while looking through the raw logs for my personal site for May I came across a flurry of hits on epcostello.net/journal/rss.xml from an agent identifying itself as BuzzTracker/1.02.

Now, thanks to the wonder of cheap disk, I can tell you the following:

  • That feed went live on September 12, 2003.
  • I redirected it with a 301 Moved Permanently redirect on July 25, 2005.
  • I marked the feed as Gone with a 410 on April 17, 2007.
  • BuzzTracker first hit the feed on July 20, 2007 (three months after it started returning a 410).
  • BuzzTracker hit the feed a second time in August 2007, then went silent.
  • Starting April 8, 2008 BuzzTracker (well, something identifying itself as BuzzTracker) started hitting the feed daily, some days once, some days many times.

Now, it does not cost me anything, really, to serve this, but this is just one of many user-agents out there that is so poorly written that it continues to fetch a URL it's been told is permanently gone over and over again. And that all adds up to wasted bandwidth and processor use on my part.

I had not heard of BuzzTracker, so I looked around on the site, intending to send feedback asking that the feed be removed from their cache so that they stop requesting it.

The feedback page reads "Page not found" (though it returns a "200" and not a "404").

Further tooling around reveals that the site was bought by Yahoo! in 2007 (a year later it has no Yahoo! branding and not much else to indicate any integration with Yahoo!). So, you get this blog post instead.

Now, I should point out that there are other agents which are hitting the same URL, getting the same 410, and continuing to do so on an hourly and daily basis:

  • Blogdigger/2.0
  • Feedfetcher-Google (feed-id: 17286166821941995468)
  • Syndic8/1.0
  • Mozilla/4.0 (Webclipping.com)

This is just sloppy programming, and I find it really frustrating: it just adds noise and burden to the server side. You might think "Oh, look, it's only one hit every couple of hours", but there's no limit to how long it goes on. The 410 is supposed to be that limiting factor: it's an intentional statement on the server administrator's behalf that the resource is gone, gone, gone. Go away. Really. Requesting it again in an hour will not make it return.
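
For what it's worth, honoring those status codes on the client side takes very little code. The following is a minimal sketch, not any particular reader's implementation: the Subscription class, apply_response function and MAX_404S threshold are hypothetical names, and the policy of dropping a feed after five consecutive 404s is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed policy: give up on a feed after this many consecutive 404s.
MAX_404S = 5


@dataclass
class Subscription:
    url: str
    active: bool = True
    failures: int = 0


def apply_response(sub: Subscription, status: int,
                   location: Optional[str] = None) -> Subscription:
    """Update a feed subscription according to the HTTP status of the last fetch."""
    if status == 410:
        # Gone: the server says the feed is permanently removed, so stop polling.
        sub.active = False
    elif status in (301, 308) and location:
        # Permanent redirect: remember the new URL for all future fetches.
        sub.url = location
        sub.failures = 0
    elif status == 404:
        # Not Found may be temporary, so count it and give up after repeats.
        sub.failures += 1
        if sub.failures >= MAX_404S:
            sub.active = False
    elif 200 <= status < 400:
        # Success, 304, or a temporary redirect: keep the stored URL as-is.
        sub.failures = 0
    return sub
```

A poller would call apply_response after every fetch and simply skip any subscription whose active flag is False.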

I follow and occasionally post to the twitter-development list; I even have a couple of twitter related projects sitting on the side waiting for the launch of oAuth (or something comparable) to access twitter. There are a lot of great ideas there, but there's a lot of dumb programming as well. Twitter has its own stability problems, but I suspect that many of their problems are caused by just gut-wrenchingly bad programming on the part of some of the tools people are writing against the twitter API. Many don't seem to do any caching at all: they pound away on the API making requests that could be computed on the client side, and instead of backing off they keep retrying the failed command until the account gets locked out for exceeding the API request limit.
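
"Backing off" here can be as simple as waiting longer after each failure. A minimal sketch of exponential backoff follows; the function name, the default values and the cap are all assumptions for illustration, and a real client would probably add some random jitter on top.

```python
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 300.0, attempts: int = 6) -> Iterator[float]:
    """Yield the wait (in seconds) before each retry, growing up to max_delay."""
    delay = base
    for _ in range(attempts):
        # Never wait longer than the cap, however many failures pile up.
        yield min(delay, max_delay)
        delay *= factor


print(list(backoff_delays(base=2.0, max_delay=30.0, attempts=6)))
# [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The client sleeps for each yielded delay before retrying and gives up when the generator is exhausted, instead of hammering the API until the account is locked out.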

Having been on the wrong end of the Internet firehose many times myself, can I just ask that developers give more than 30 seconds of thought before unleashing some of these nifty gadgets out onto the world, contemplating what the impact will be on the (likely free) services they're beating the crap out of?

EOY 2007 Data Analysis

It's a new year and time for some dumb data analysis.

The most interesting thing to me this year is that most of the traffic to this site and my other sites (notably epcostello.net) is from automated agents: search engines, random webcrawlers, SEO's link injectors.

HTTP Status Codes:

 Count  Status
155127  200
 64662  304
 11400  301
  6352  302
  5705  404
  5308  202
  2784  401
  1609  405
   126  400
    63  414
    31  500
    30  403
     3  501
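
A tally like the one above can be produced with a few lines of scripting. This is a minimal sketch, assuming the raw logs are in Apache Common or Combined Log Format (the post doesn't say which format or tool was actually used); the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# In Common/Combined Log Format the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def tally_statuses(lines: Iterable[str]) -> Counter:
    """Count HTTP status codes across access-log lines, skipping unparsable ones."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not taken from the real logs.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2008:00:00:01 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jan/2008:00:05:01 -0500] "GET /robots.txt HTTP/1.1" 304 - "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [01/Jan/2008:00:06:12 -0500] "GET /journal/rss.xml HTTP/1.1" 410 - "-" "ExampleReader/0.1"',
    '192.0.2.1 - - [01/Jan/2008:00:10:01 -0500] "GET /robots.txt HTTP/1.1" 304 - "-" "ExampleBot/1.0"',
]

for status, count in tally_statuses(SAMPLE).most_common():
    print(count, status)
```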

Top Ten Hosts

Hits  IP address       Hostname
4551  66.249.73.200    crawl-66-249-73-200.googlebot.com
2234  81.52.143.16     natcrawlbloc03.net.m1.fti.net
1890  81.52.143.15     natcrawlbloc01.net.m1.fti.net
1492  64.152.34.36     jfk-lv3-n4.panthercdn.com
1380  38.99.203.110    Panscient_Data_Services.demarc.cogentco.com
 991  128.194.135.94   web-crawler.irl.cs.tamu.edu
 885  216.240.154.103
 884  66.249.73.148    crawl-66-249-73-148.googlebot.com
 800  64.92.162.210
 773  72.30.177.225    wm509310.inktomisearch.com

Raw Top Ten Requests

 Hits  Request
21547  /robots.txt
 8030  /favicon.ico
 6801  /articles/2005/12/27/a_practically_u/
 6062  /articles/nav-commenters.gif
 6055  /g/Google_logo_transparent.png
 5582  /d/4/js/ajax/
 5290  /202/2006/06/disabling_trackbacks_in_movabl/
 4445  /
 4217  /g/feed-icon-16x16.png
 3993  /g/by-sa-3.0-88x31.png

Filtered Top Ten Requests

 Hits  Request
21547  /robots.txt
 6801  /articles/2005/12/27/a_practically_u/
 5290  /202/2006/06/disabling_trackbacks_in_movabl/
 4445  /
 2812  /202/
 2664  /202/2006/12/google_reader_annoyances/
 2121  /202/2006/10/social-bookmarking-and-attention/
 1278  /202/2006/11/bloglines_new_features_playlis/
  912  /articles/
  890  /202/2006/07/yet_another_spam_retaliation_t/

Top 20 non-caching requestors of Robots.txt:

Hits  Hostname [IP address]
2353  crawl-66-249-73-200.googlebot.com [66.249.73.200]
1404  natcrawlbloc03.net.m1.fti.net [81.52.143.16]
1192  natcrawlbloc01.net.m1.fti.net [81.52.143.15]
 770  wm509310.inktomisearch.com [72.30.177.225]
 736  ct501085.crawl.yahoo.net [74.6.86.230]
 688  wm509458.inktomisearch.com [74.6.74.202]
 498  crawl-66-249-73-148.googlebot.com [66.249.73.148]
 496  livebot-65-55-213-74.search.live.com [65.55.213.74]
 491  wm508816.inktomisearch.com [74.6.69.173]
 342  lj512274.crawl.yahoo.net [74.6.19.77]
 342  wm511001.inktomisearch.com [72.30.252.135]
 327  natcrawlbloc02.net.s1.fti.net [193.252.149.15]
 288  lm502044.crawl.yahoo.net [72.30.226.173]
 262  ct501101.crawl.yahoo.net [74.6.86.207]
 233  67.110.56.45.ptr.us.xo.net [67.110.56.45]
 224  crawl-66-249-73-132.googlebot.com [66.249.73.132]
 208  wm511565.inktomisearch.com [72.30.226.209]
 206  c02.entireweb.com [89.150.197.130]
 199  wm509426.inktomisearch.com [74.6.75.46]
 197  ip67-95-51-86.z51-95-67.customer.algx.net [67.95.51.86]

Last time robots.txt changed: 23 March 2007
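
For a file that last changed in March, thousands of fetches per crawler is a lot. A crawler-side cache doesn't need to be elaborate; the sketch below is a minimal in-memory version with an injectable clock. The RobotsCache class and the 24-hour default lifetime are my assumptions, not something any of the crawlers listed above is known to use.

```python
import time
from typing import Dict, Optional, Tuple


class RobotsCache:
    """Remember each host's robots.txt body for ttl_seconds before refetching."""

    def __init__(self, ttl_seconds: float = 86400.0) -> None:
        # Assumed default: treat a cached copy as fresh for 24 hours.
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, host: str, body: str, now: Optional[float] = None) -> None:
        """Store a freshly fetched robots.txt body for host."""
        fetched_at = time.time() if now is None else now
        self._entries[host] = (fetched_at, body)

    def get(self, host: str, now: Optional[float] = None) -> Optional[str]:
        """Return the cached body, or None if it is missing or has expired."""
        current = time.time() if now is None else now
        entry = self._entries.get(host)
        if entry is None:
            return None
        fetched_at, body = entry
        if current - fetched_at >= self.ttl:
            return None
        return body
```

A crawler would call get() before every crawl of a host and only go back to the network, ideally with a conditional request, when it returns None.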

Top Ten Referrers (filtered):

 Hits  Referrer
97488  "-"
  533  "http://neworder.box.sk/forum.php?page=last&did=multSecurity%20and%20Networking&thread=251392"
  244  "http://www.zenatode.org.uk/ian/internet/hotmail.xhtml"
  212  "http://my.yahoo.com/"
  142  "http://www.google.com/search?hl=en&q=phx.gbl"
  122  "http://www.stumbleupon.com/refer.php?url=http%3A%2F%2Fartific.com%2Farticles%2F2005%2F12%2F27%2Fa_practically_u%2F"
  117  "http://www.google.com/search?q=phx.gbl&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
   88  "http://www.google.com/search?hl=en&q=phx.gbl&btnG=Google+Search"
   68  "http://neworder.box.sk/forum.php?did=multSecurity%20and%20Networking&thread=251392"
   55  "http://www.dslreports.com/shownews/DNS-Hacks-Phishing-20-90182"

Internal referrers and obviously junk referrers have been filtered out.

Top Ten Raw Search Requests:

Hits  Query
1474  q=phx.gbl
 260  q=phx.gbl"
 122  q=.phx.gbl
 107  q=phx.gbl%3A1863
  98  q=phx%2egbl"
  76  q=crawler.bloglines.com
  62  q=phx.gbl+domain
  59  q=gbl+domain
  53  q=gbl+tld
  42  q=%22phx.gbl%22

Top 20 phx.gbl searches:

phx.gbl is a pseudo-domain used by Microsoft for a variety of services. I wrote about it in On The Importance of Reverse DNS, which I now realize is still using the previous design system for this site.

Hits  Query
1474  q=phx.gbl
 260  q=phx.gbl"
 122  q=.phx.gbl
 107  q=phx.gbl%3A1863
  62  q=phx.gbl+domain
  42  q=%22phx.gbl%22
  30  q=.phx.gbl%3A1863
  29  q=@phx.gbl
  26  q=%40phx.gbl
  23  q=phx.gbl+netstat
  22  q=phx.gbl+1863
  22  q=.phx.gbl"
  19  q=phx.gbl%3A1863"
  18  q=by2msg2204708.phx.gbl
  17  q=what+is+phx.gbl
  15  q=netstat+phx.gbl
  15  q=by1msg4176104.phx.gbl
  14  q=phx.gbl+msn
  13  q=by2msg2204912.phx.gbl
  13  q=%22phx.gbl%22"

Top 20 Non-phx.gbl searches:

Hits  Query
  76  q=crawler.bloglines.com
  38  q=artific
  16  q=infobackground
  15  q=ed+costello
  15  q=207.46.108.36
  15  q=202+Accepted
  13  q=crawler.bloglines.com"
  13  q=207.46.111.86
  13  q=202+accepted
  11  q=spam+retaliation
  11  q=InfoBackground
  10  q=google+reader+rename+folder
  10  q=Reverse+DNS
   9  q=kb05474
   9  q=importance+of+reverse+dns
   9  q=crawler.bloglines.com+
   8  q=nokia+espionage
   8  q=iab+ad+units
   8  q=hotmail+reverse+dns
   7  q=tvpath.com

In March 2007 I wrote my own trackback endpoint in PHP which logs all of the trackback data to a file instead of beating up my MovableType installation and MySQL database.
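
The idea of a log-only endpoint is simple enough to sketch. The real endpoint was written in PHP; what follows is a Python illustration of the same idea, not a port of it. The function name, the JSON-lines log format and the choice to answer every well-formed ping with success are assumptions; the parameter names (url, title, excerpt, blog_name) and the response shape follow the TrackBack specification as I understand it.

```python
import json
import time
from typing import Mapping, Optional

# Response bodies in the shape the TrackBack specification describes:
# <error>0</error> for success, <error>1</error> plus a message for failure.
OK_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<response><error>0</error></response>\n"
)
ERROR_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<response><error>1</error>"
    "<message>Missing url parameter</message></response>\n"
)


def log_trackback(params: Mapping[str, str], log_path: str,
                  remote_addr: str = "", now: Optional[float] = None) -> str:
    """Append one trackback ping to a log file and return the XML reply."""
    url = params.get("url", "").strip()
    if not url:
        # The url parameter is the only required field of a ping.
        return ERROR_RESPONSE
    record = {
        "received": time.time() if now is None else now,
        "remote_addr": remote_addr,
        "url": url,
        "title": params.get("title", ""),
        "excerpt": params.get("excerpt", ""),
        "blog_name": params.get("blog_name", ""),
    }
    # One JSON object per line keeps the log trivial to tally later.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return OK_RESPONSE
```

Nothing here touches the blog software or its database, which is the whole point: the pings land in a flat file that can be counted at leisure.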

Trackbacks Received since 21 March 2007: 10258

Number of Valid Trackbacks: 0

Top Ten Trackback Sources:

Hits  Hostname [IP address]
 447  207-234-131-237.ptr.primarydns.com [207.234.131.237]
 156  movinglabs.com [195.242.99.80]
 150  u15250532.onlinehome-server.com [74.208.14.63]
 144  218.189.232.72.static.reverse.ltdomains.com [72.232.189.218]
 124  [206.123.73.15] [206.123.73.15]
 122  server.camelotwealthcreation.com [69.50.210.8]
 113  giantlogic.net [208.101.35.52]
  99  89-149-195-161.internetserviceteam.com [89.149.195.161]
  96  u15251680.onlinehome-server.com [74.208.14.215]
  95  210.219.232.72.static.reverse.ltdomains.com [72.232.219.210]

Top Fifteen Trackback Titles:

Count  Title
  169  "Tramadol."
  151  "Phentermine."
  119  "Xanax."
   94  "Cialis."
   62  "Lexapro."
   56  "Ephedra."
   52  "Valium."
   52  "Ultram."
   47  "Zoloft."
   43  "Ambien."
   42  "Fioricet."
   37  "Percocet."
   37  "Cheapphentermine."
   37  "Adderall."
   34  "Soma."

End of year domain cleanup

I have the following domains available for sale through Sedo:

Acquire them through Sedo or contact me at sales @ artific.com.

Reboot

I realized I sort of fell off the blog beat here. I've rebuilt the site (again) using the Yahoo! User Interface Library, which I've been getting to know and use for various sites over the past year. I'm currently at the Defrag 2007 conference in Denver, CO and am enjoying it; it's serving (for me) as an introduction to the intersection of the current crop of social tools and enterprises.

I'll post some notes from Defrag later today.

Feedburner's Migration to Google notice

I just attempted to log into my FeedBurner account and got the following notice (in addition to the login information):

NOTE: Service of FeedBurner publisher accounts will not be interrupted as a result of the acquisition by Google. You will have a 14-day interim period ending June 15, 2007 to opt-out of allowing Google to service your account. If you take no action by June 15, 2007, the rights to your data will transfer from FeedBurner to Google. Opting out will terminate your user agreement with FeedBurner, permanently delete your FeedBurner account, feeds, and all related statistical data and history, and prevent the transfer of your data rights to Google. To opt-out, contact us via accountx@feedburner.com, provide your FeedBurner account Username, and request to have your FeedBurner account deleted. We will contact you at your registered email address to confirm your deletion request before completing it.

While I don't object to the sale or to the company merging into the Google Borg, 14 days seems to be awfully short to give notice to those who don't want to continue using FeedBurner after it becomes part of Google. I have feeds which I ceased advertising years ago, which either return 301 redirects or 410 gone messages and they still get tens of hits per day. If you delete your FeedBurner account you can't redirect the subscribers using that account (unless you either redirect using a 302 or 307 redirect from a URL you control, or you used the feeds.yoursitename.tld service, which you could simply point back to a site you control). With summer vacations and the vagaries of feed updates my guess is that many people or organizations who do opt-out of the FeedBurner-Google migration will lose many readers, who will just get dropped.

FeedBurner should allow, for a limited time, say 60 days, the ability to opt-out in some way and redirect the feed elsewhere.

An interesting exercise would be to track the people who drop their FeedBurner feeds (and accounts), picking out the feeds with the highest traffic volume and registering them for yourself (unless FB has changed to blocking re-registration of a feed URL).

TwitterSense

I noticed yesterday that twitter has added Google Adsense ads to the individual status message pages, but only if you are not logged into twitter. Here's two examples:

  • Unauthenticated: Thumbnail of screenshot of twitter.com
  • Authenticated (no AdSense): Thumbnail of screenshot of twitter.com

Twitter have also added a tab to your twitter home page showing "replies" (messages sent to @username).

Update: fixed images and thumbnails.

How Staples.com lost my business today

My wife and I are in the process of moving. In the scheme of things, we're not moving all that far (approximately 660 meters if Google Earth can be trusted). In New York City you only need to move an avenue or two to be in a completely different neighborhood.

So, we're moving, and I want to get various things lined up. I used the U.S.P.S. online address change form, am forwarding all of our mail to a P.O. Box to "cleanse" our data trail in Corporate America's databanks, and am trying to figure out our broadband solution (the new building has a T1, we're told, with no limit on the exclamations. That might have been cool in 1998, but we have and use a 7 Mbps DSL line today, which makes a piddling T1 look downright like dialup. I digress.).

I have found it handy to have a stamp with our address on it for the rare times we actually send postal mail, so I went to Staples.com to order a stamp online. I have an account there, which is probably the only reason I thought to go there. After tooling around the site for 30 seconds I got asked if I wanted to go to the Staples Custom Printing Shop. On clicking "ok", I ended up at http://www.staples.marktheworld.com/browsercheck.asp, which is apparently where Staples has outsourced their custom stamp printing to. The browsercheck.asp in the URL should give away what happened next:

We are sorry for the inconvenience. Our site currently supports only Internet Explorer version 4.0 or higher. This is due to the advanced features used in the product customization process.

Come on. I mean, sure, they probably used an ActiveX control written in 1999 to show what the stamp would look like. And MSIE is used by, what, 80% of the worldwide marketplace? And they probably don't want to waste the precious investment in the ActiveX and ASP coding. The net result is that they lost me as a customer. There is no reason, today, to be designing web applications solely for one browser platform. None. I will accept that if you're on a tightly controlled intranet you might consider it, but really there's just no reason for this.

202: Accepted Archives

Feeds

We use Feedburner to distribute our web feeds:

Copyright 2002–2008 Artific Consulting LLC.

Unless otherwise noted, content is licensed for reuse under the Creative Commons Attribution-ShareAlike 3.0 License. Please read and understand the license before repurposing content from this site.