A collection of informal posts about web and internet technology.
I started playing with twitter in January 2007, well before that year's SXSW breakout. I didn't start playing around with the twitter API until recently, joining the Twitter Development Talk Google Group, writing first my own command line update client (a very cunning combination of bash and curl which until this week totally failed to URL encode parameters. Doh.) and more recently a followers history tool (which isn't public, and won't be until I either get over accepting people's twitter credentials or some sort of third party authentication scheme is rolled out). My other posts about twitter are collected here: Topics/Web Services/Twitter.
I'll just preface the rest of this with:
I have no inside insight, and have no idea what is wrong with twitter other than the general "it's not scaling particularly well, is it?".
So, this is all just supposition.
I just wanted to make that clear.
The biggest problem with twitter? It's free.
I know, you're thinking "WTF is that ∗ character there for? And there is a cost you bozo, there's a rate limit!".
I concede your point, there is a rate limit, which applies on a per-user basis.
But that's not a cost.
That's not a price.
The per-user limit doesn't cause a user to sit back and think: Ooh, do I really want to tweet "Taking the dogs for a walk to the Telectroscope on Fulton Ferry."
I'm old school. I think it's the right, the duty of anyone running a web site to protect that site from abusive behavior, whatever that may be. I regularly rant on our nextNY list that people can and should take proactive measures to protect their sites (blocking users, bots, what-not). I don't think that we, as site managers (webmasters, whatever you want to call us these days), have to suck up all of the traffic thrown at a site just because as a general principle we're open to user-generated content, APIs into our services, means of extending whatever it is we're providing (and, in theory, profiting from in some way).
So, I think twitter's problems come down to three separate areas which are intersecting:
"Stop! You have to be reviewed." imposes a cost to using the API, and that is a good thing. If you're just screwing around, use the sandbox. If you want to distribute code built on the twitter API, then there's a cost. Whether there's a financial cost or not is up to twitter.
Hey, let's not just slander them, let's take out what chance they have at a successful business model too!), or claimed that it's some sort of public resource. Just chill. Twitter's problems are solvable. Not necessarily easily solvable, but solvable.
.net culture is fickle. The amount of time it would take to rebuild twitter on-the-fly as I've outlined isn't very long, maybe weeks if one starts with a clear hand and blank sheet of paper, months under the current firestorm setup. But the various .net glitterati would just as soon shove twitter under the water as see it succeed.
Imposing a cost to using twitter would have its own cost of course: what's made twitter grow is the API and the complete flexibility to develop applications and mashups against that API. Many of the mashups are only possible when the data and API use are financially free. Many of the clients are only possible when API use is transactionally free (or close enough).
Someone has to pay to use these services. If so many people find them to be useful, why hasn't a successful business model appeared for recouping the cost?
Watching twitter's struggles has been an education for me, not just for the technical issues they face, but also the network effects of the API usage, and the reaction and criticism they've received as they work through the technical issues.
Followup: Kee Hinckley has written up an excellent post to the twitter development group: An odd request for Twitter - Please stop fixing bugs in the API.
My personal site, epcostello.net, has been up since 2003. Over five years I've redesigned, reorganized, remodeled and removed a lot of the site. When it first launched, my personal blog was epcostello.net/journal, my link and commentary blog was epcostello.net/epicrisis, and yet another blog had long form essays. Since then I've consolidated everything under epcostello.net/epicrisis.
Each time I've remodeled, I've tried to be good and redirect old URLs to the appropriate new URL.
In a few cases, specifically with web feeds, I've turned off the redirects and issue 410 Gone messages instead.
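A minimal sketch of that arrangement, in Python rather than whatever the site actually runs: moved URLs answer with a 301 and a new location, retired feeds answer with a 410. The two lookup tables and the `resolve` helper are hypothetical, made up for illustration; the real redirect rules aren't shown in this post.

```python
from http import HTTPStatus

# Hypothetical tables, for illustration only.
MOVED = {"/journal/": "/epicrisis/"}   # old path -> new path
GONE = {"/journal/rss.xml"}            # feeds retired for good


def resolve(path: str) -> tuple[int, dict[str, str]]:
    """Return (status code, extra response headers) for a requested path."""
    if path in GONE:
        # 410: the resource existed, is gone, and is not expected to return.
        return HTTPStatus.GONE, {}
    if path in MOVED:
        # 301: clients should replace their stored URL with the new location.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    return HTTPStatus.OK, {}
```

Checking the gone list first means a retired feed never falls through to a redirect by accident.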
As far as I can tell, few feed readers, slurpers, indexers, what-not remove a feed, ever. They ignore 404 File Not Found, and also appear to ignore 410 (which is as explicit as you can get: "The file existed, now it's gone, it's not expected to return, now go away").
So, while looking through the raw logs for my personal site for May I came across a flurry of hits on epcostello.net/journal/rss.xml from an agent identifying itself as BuzzTracker/1.02.
Now, thanks to the wonder of cheap disk, I can tell you the following:
301 permanently moved redirect on July 25, 2005.
410 on April 17, 2007.
Now, it does not cost me anything, really, to serve this, but this is just one of many user-agents out there that is so poorly written that it continues to fetch a URL it's been told is permanently gone over and over again. And that all adds up to wasted bandwidth and processor use on my part.
I had not heard of BuzzTracker, so I looked around on the site, intending to send feedback asking that the feed be removed from their cache so that they stop requesting it.
The feedback page reads "Page not found" (though it returns a "200" and not a "404").
Further tooling around reveals that the site was bought by Yahoo! in 2007 (a year later it has no Yahoo! branding and not much else to indicate any integration with Yahoo!). So, you get this blog post instead.
Now, I should point out that there are other agents which are hitting the same URL, getting the same 410, and continuing to do so on an hourly and daily basis:
This is just sloppy programming. And I find it really frustrating; it just adds noise and burden to the server side. You might think "Oh, look, it's only one hit every couple of hours" but there's no limit (the 410 is supposed to be that limiting factor, it's an intentional statement on the server administrator's behalf that the resource is gone, gone, gone. Go away. Really. Requesting it again in an hour will not make it return.)
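For contrast, honoring these status codes on the client side takes only a few lines. A minimal sketch in Python, assuming a hypothetical `Subscription` record and `apply_response` helper (neither comes from any real feed reader):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscription:
    """Hypothetical record a feed fetcher keeps for each feed it polls."""
    url: str
    active: bool = True


def apply_response(sub: Subscription, status: int,
                   location: Optional[str] = None) -> Subscription:
    """Update a stored subscription based on the status the server returned."""
    if status == 410:
        # Gone: stop polling this URL entirely.
        sub.active = False
    elif status in (301, 308) and location:
        # Permanent redirect: remember the new URL instead of the old one.
        sub.url = location
    # Anything else (200, 304, temporary redirects, errors) leaves the record alone.
    return sub
```

A scheduler would then simply skip any subscription whose `active` flag is false.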
I follow and occasionally post to the twitter-development list, and I even have a couple of twitter-related projects sitting on the side waiting for the launch of OAuth (or something comparable) to access twitter. There are a lot of great ideas there, but there's a lot of dumb programming as well. In as much as twitter has its own stability problems, I wonder if (and believe that) many of their problems are caused by just gut-wrenchingly bad programming on the part of some of the tools people are writing against the twitter API. Many don't seem to do any caching at all; they pound away on the API making requests that could be computed on the client side, and instead of backing off they keep retrying the failed command until the account gets locked out for exceeding the API request limit.
Having been on the wrong end of the Internet firehose many times myself, can I just ask that developers give more than 30 seconds of thought before unleashing some of these nifty gadgets out onto the world, contemplating what the impact will be on the (likely free) services they're beating the crap out of?
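The caching and backing off asked for here don't take much code. A rough sketch in Python, under stated assumptions: `fetch` is a hypothetical callable that performs the actual request and raises a hypothetical `RateLimited` exception when the service refuses it, and the TTL and delay values are placeholders, not twitter's documented limits.

```python
import time
from typing import Callable, Dict, Iterator, Tuple


class RateLimited(Exception):
    """Hypothetical error raised by `fetch` when the service refuses a request."""


def backoff_delays(retries: int, base: float = 1.0,
                   cap: float = 300.0) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... seconds, never more than cap."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)


class PoliteClient:
    """Wraps a fetch function with a small TTL cache and exponential backoff."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, retries: int = 5) -> str:
        cached = self._cache.get(url)
        if cached is not None and self._clock() - cached[0] < self._ttl:
            return cached[1]  # fresh enough: no request is made at all
        for delay in backoff_delays(retries):
            try:
                body = self._fetch(url)
            except RateLimited:
                self._sleep(delay)  # wait longer after each refusal
                continue
            self._cache[url] = (self._clock(), body)
            return body
        raise RateLimited(f"gave up on {url} after {retries} attempts")
```

The `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting; in normal use the defaults are fine.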
It's a new year and time for some dumb data analysis.
Most interesting thing to me this year is that most of the traffic to this site and my other sites (notably epcostello.net) is from automated agents: search engines, random webcrawlers, SEO's link injectors.
| Hits | Status code |
| 155127 | 200 |
| 64662 | 304 |
| 11400 | 301 |
| 6352 | 302 |
| 5705 | 404 |
| 5308 | 202 |
| 2784 | 401 |
| 1609 | 405 |
| 126 | 400 |
| 63 | 414 |
| 31 | 500 |
| 30 | 403 |
| 3 | 501 |
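A tally like the one above takes only a few lines to produce. This is a sketch in Python that assumes the logs are in Apache common/combined format, which the post doesn't confirm; the sample lines are invented for illustration and are not taken from the real logs.

```python
import re
from collections import Counter
from typing import Iterable

# In common/combined log format the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_statuses(lines: Iterable[str]) -> Counter:
    """Count how many log lines carry each HTTP status code."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, for illustration only.
sample = [
    '66.249.73.200 - - [01/Jan/2008:00:00:01 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "bot"',
    '81.52.143.16 - - [01/Jan/2008:00:00:02 -0500] "GET /old HTTP/1.1" 301 0 "-" "bot"',
    '66.249.73.200 - - [01/Jan/2008:00:00:03 -0500] "GET / HTTP/1.1" 200 5120 "-" "bot"',
]
for status, hits in count_statuses(sample).most_common():
    print(f"| {hits} | {status} |")
```

The regular expression is deliberately loose; a request line that itself contains a quote followed by three digits would confuse it, so a real report would want a stricter parser.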
| Hits | IP address | Reverse DNS |
| 4551 | 66.249.73.200 | crawl-66-249-73-200.googlebot.com |
| 2234 | 81.52.143.16 | natcrawlbloc03.net.m1.fti.net |
| 1890 | 81.52.143.15 | natcrawlbloc01.net.m1.fti.net. |
| 1492 | 64.152.34.36 | jfk-lv3-n4.panthercdn.com |
| 1380 | 38.99.203.110 | Panscient_Data_Services.demarc.cogentco.com |
| 991 | 128.194.135.94 | web-crawler.irl.cs.tamu.edu |
| 885 | 216.240.154.103 | |
| 884 | 66.249.73.148 | crawl-66-249-73-148.googlebot.com |
| 800 | 64.92.162.210 | |
| 773 | 72.30.177.225 | wm509310.inktomisearch.com. |
| Hits | Path |
| 21547 | /robots.txt |
| 8030 | /favicon.ico |
| 6801 | /articles/2005/12/27/a_practically_u/ |
| 6062 | /articles/nav-commenters.gif |
| 6055 | /g/Google_logo_transparent.png |
| 5582 | /d/4/js/ajax/ |
| 5290 | /202/2006/06/disabling_trackbacks_in_movabl/ |
| 4445 | / |
| 4217 | /g/feed-icon-16x16.png |
| 3993 | /g/by-sa-3.0-88x31.png |
| Hits | Path |
| 21547 | /robots.txt |
| 6801 | /articles/2005/12/27/a_practically_u/ |
| 5290 | /202/2006/06/disabling_trackbacks_in_movabl/ |
| 4445 | / |
| 2812 | /202/ |
| 2664 | /202/2006/12/google_reader_annoyances/ |
| 2121 | /202/2006/10/social-bookmarking-and-attention/ |
| 1278 | /202/2006/11/bloglines_new_features_playlis/ |
| 912 | /articles/ |
| 890 | /202/2006/07/yet_another_spam_retaliation_t/ |
| Hits | Hostname | IP address | |
| 2353 | crawl-66-249-73-200.googlebot.com | [66.249.73.200] | |
| 1404 | natcrawlbloc03.net.m1.fti.net | [81.52.143.16] | |
| 1192 | natcrawlbloc01.net.m1.fti.net | [81.52.143.15] | |
| 770 | wm509310.inktomisearch.com | [72.30.177.225] | |
| 736 | ct501085.crawl.yahoo.net | [74.6.86.230] | |
| 688 | wm509458.inktomisearch.com | [74.6.74.202] | |
| 498 | crawl-66-249-73-148.googlebot.com | [66.249.73.148] | |
| 496 | livebot-65-55-213-74.search.live.com | [65.55.213.74] | |
| 491 | wm508816.inktomisearch.com | [74.6.69.173] | |
| 342 | lj512274.crawl.yahoo.net | [74.6.19.77] | |
| 342 | wm511001.inktomisearch.com | [72.30.252.135] | |
| 327 | natcrawlbloc02.net.s1.fti.net | [193.252.149.15] | |
| 288 | lm502044.crawl.yahoo.net | [72.30.226.173] | |
| 262 | ct501101.crawl.yahoo.net | [74.6.86.207] | |
| 233 | 67.110.56.45.ptr.us.xo.net | [67.110.56.45] | |
| 224 | crawl-66-249-73-132.googlebot.com | [66.249.73.132] | |
| 208 | wm511565.inktomisearch.com | [72.30.226.209] | |
| 206 | c02.entireweb.com | [89.150.197.130] | |
| 199 | wm509426.inktomisearch.com | [74.6.75.46] | |
| 197 | ip67-95-51-86.z51-95-67.customer.algx.net | [67.95.51.86] | |
Last time robots.txt changed: 23 March 2007
| Hits | Referrer |
| 97488 | "-" |
| 533 | "http://neworder.box.sk/forum.php?page=last&did=multSecurity%20and%20Networking&thread=251392" |
| 244 | "http://www.zenatode.org.uk/ian/internet/hotmail.xhtml" |
| 212 | "http://my.yahoo.com/" |
| 142 | "http://www.google.com/search?hl=en&q=phx.gbl" |
| 122 | "http://www.stumbleupon.com/refer.php?url=http%3A%2F%2Fartific.com%2Farticles%2F2005%2F12%2F27%2Fa_practically_u%2F" |
| 117 | "http://www.google.com/search?q=phx.gbl&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" |
| 88 | "http://www.google.com/search?hl=en&q=phx.gbl&btnG=Google+Search" |
| 68 | "http://neworder.box.sk/forum.php?did=multSecurity%20and%20Networking&thread=251392" |
| 55 | "http://www.dslreports.com/shownews/DNS-Hacks-Phishing-20-90182" |
Internal referrers and obviously junk referrers have been filtered out.
| Hits | Search query |
| 1474 | q=phx.gbl |
| 260 | q=phx.gbl" |
| 122 | q=.phx.gbl |
| 107 | q=phx.gbl%3A1863 |
| 98 | q=phx%2egbl" |
| 76 | q=crawler.bloglines.com |
| 62 | q=phx.gbl+domain |
| 59 | q=gbl+domain |
| 53 | q=gbl+tld |
| 42 | q=%22phx.gbl%22 |
phx.gbl is a pseudo-domain used by Microsoft for a variety of services. I wrote about it in On The Importance of Reverse DNS, which I now realize is still using the previous design system for this site.
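Query counts like these come from picking the `q` parameter out of each referrer URL. A minimal sketch in Python, with one difference from the tables here: `parse_qs` decodes percent-escapes, so variants such as `phx%2egbl` and `phx.gbl` collapse into one term instead of being listed separately. The `search_terms` helper name is made up for this example.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import parse_qs, urlsplit


def search_terms(referrers: Iterable[str]) -> Counter:
    """Count decoded values of the q= parameter across referrer URLs."""
    counts: Counter = Counter()
    for referrer in referrers:
        query = parse_qs(urlsplit(referrer).query)
        for term in query.get("q", []):
            counts[term] += 1
    return counts


referrers = [
    "http://www.google.com/search?hl=en&q=phx.gbl",
    "http://www.google.com/search?hl=en&q=phx.gbl&btnG=Google+Search",
    "http://www.google.com/search?hl=en&q=phx%2egbl",
    "http://my.yahoo.com/",
]
for term, hits in search_terms(referrers).most_common():
    print(f"| {hits} | q={term} |")
```

Referrers without a `q` parameter, like the my.yahoo.com one, simply contribute nothing.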
| Hits | Search query |
| 1474 | q=phx.gbl |
| 260 | q=phx.gbl" |
| 122 | q=.phx.gbl |
| 107 | q=phx.gbl%3A1863 |
| 62 | q=phx.gbl+domain |
| 42 | q=%22phx.gbl%22 |
| 30 | q=.phx.gbl%3A1863 |
| 29 | q=@phx.gbl |
| 26 | q=%40phx.gbl |
| 23 | q=phx.gbl+netstat |
| 22 | q=phx.gbl+1863 |
| 22 | q=.phx.gbl" |
| 19 | q=phx.gbl%3A1863" |
| 18 | q=by2msg2204708.phx.gbl |
| 17 | q=what+is+phx.gbl |
| 15 | q=netstat+phx.gbl |
| 15 | q=by1msg4176104.phx.gbl |
| 14 | q=phx.gbl+msn |
| 13 | q=by2msg2204912.phx.gbl |
| 13 | q=%22phx.gbl%22" |
| Hits | Search query |
| 76 | q=crawler.bloglines.com |
| 38 | q=artific |
| 16 | q=infobackground |
| 15 | q=ed+costello |
| 15 | q=207.46.108.36 |
| 15 | q=202+Accepted |
| 13 | q=crawler.bloglines.com" |
| 13 | q=207.46.111.86 |
| 13 | q=202+accepted |
| 11 | q=spam+retaliation |
| 11 | q=InfoBackground |
| 10 | q=google+reader+rename+folder |
| 10 | q=Reverse+DNS |
| 9 | q=kb05474 |
| 9 | q=importance+of+reverse+dns |
| 9 | q=crawler.bloglines.com+ |
| 8 | q=nokia+espionage |
| 8 | q=iab+ad+units |
| 8 | q=hotmail+reverse+dns |
| 7 | q=tvpath.com |
In March 2007 I wrote my own trackback endpoint in PHP, which logs all of the trackback data to a file instead of beating up my Movable Type installation and MySQL database.
Trackbacks Received since 21 March 2007: 10258
Number of Valid Trackbacks: 0
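The endpoint described above is in PHP and isn't shown here. As a rough idea of how little such a thing needs to do, here is a sketch in Python: it parses the form-encoded ping body, appends one JSON line per ping to a log file, and answers with the small XML document the TrackBack protocol expects. The `handle_ping` name, the JSON-lines log format, and the `remote_addr` field are assumptions for this example.

```python
import json
import time
from urllib.parse import parse_qs

OK_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<response>\n<error>0</error>\n</response>\n"
)
ERROR_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<response>\n<error>1</error>\n<message>{}</message>\n</response>\n"
)


def handle_ping(body: str, remote_addr: str, log_path: str) -> str:
    """Log one trackback ping to a file and return the XML reply."""
    # Pings arrive as form-encoded fields: url, title, excerpt, blog_name.
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    record = {"time": time.time(), "remote_addr": remote_addr, **fields}
    # Append-only log: no blog software or database is touched.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    if "url" not in fields:
        # The url field is the one required parameter of a ping.
        return ERROR_RESPONSE.format("url is required")
    return OK_RESPONSE
```

Everything gets logged, including malformed pings, which is what makes tallies like the ones that follow possible.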
| Count | Hostname | IP address | |
| 447 | 207-234-131-237.ptr.primarydns.com | [207.234.131.237] | |
| 156 | movinglabs.com | [195.242.99.80] | |
| 150 | u15250532.onlinehome-server.com | [74.208.14.63] | |
| 144 | 218.189.232.72.static.reverse.ltdomains.com | [72.232.189.218] | |
| 124 | [206.123.73.15] | [206.123.73.15] | |
| 122 | server.camelotwealthcreation.com | [69.50.210.8] | |
| 113 | giantlogic.net | [208.101.35.52] | |
| 99 | 89-149-195-161.internetserviceteam.com | [89.149.195.161] | |
| 96 | u15251680.onlinehome-server.com | [74.208.14.215] | |
| 95 | 210.219.232.72.static.reverse.ltdomains.com | [72.232.219.210] | |
| 169 | "Tramadol." |
| 151 | "Phentermine." |
| 119 | "Xanax." |
| 94 | "Cialis." |
| 62 | "Lexapro." |
| 56 | "Ephedra." |
| 52 | "Valium." |
| 52 | "Ultram." |
| 47 | "Zoloft." |
| 43 | "Ambien." |
| 42 | "Fioricet." |
| 37 | "Percocet." |
| 37 | "Cheapphentermine." |
| 37 | "Adderall." |
| 34 | "Soma." |
Copyright 2002–2008 Artific Consulting LLC.
Unless otherwise noted, content is licensed for reuse under the Creative Commons Attribution-ShareAlike 3.0 License.
Please read and understand the license before repurposing content from this site.