p. Earlier this year I wrote of my "frustrations":http://artific.com/202/2006/06/back_from_vacation/ in returning from vacation and feeling "behind" on my feeds, and wishing for some sort of magical blend between "Technorati":http://technorati.com/ and "Bloglines":http://bloglines.com/. That magical blend never appeared but I've continued to look into the problem and possible solutions.
p. To be clear: my goal is to try to cobble something together from what's out there, not to build something new from scratch, not to create yet another silo of information.
p. So, along those lines, I have been thinking about how one would build such a tool: what data would I gather, and what data could I gather? My hypothesis has been that the act of bookmarking a URL indicates interest in that URL, and that this could be used to create a weighted rank which could then be leveraged by something else, perhaps a Greasemonkey script for Bloglines.
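p. To make the weighted rank idea a little more concrete, here is a minimal sketch. It assumes the bookmark counts per service have already been gathered somehow; the service weights, the sample URLs, and the counts are all invented for illustration and are not measurements of anything.

```python
# Hypothetical weights: how much one bookmark on each service should count.
SERVICE_WEIGHTS = {"del.icio.us": 1.0, "bloglines": 0.5, "rawsugar": 0.25}


def weighted_rank(counts, weights=SERVICE_WEIGHTS):
    """Sum each service's bookmark count scaled by that service's weight.

    Services without a configured weight contribute nothing.
    """
    return sum(weights.get(service, 0.0) * n for service, n in counts.items())


def rank_urls(counts_by_url, weights=SERVICE_WEIGHTS):
    """Return URLs ordered from highest to lowest weighted rank."""
    return sorted(
        counts_by_url,
        key=lambda url: weighted_rank(counts_by_url[url], weights),
        reverse=True,
    )


# Made-up sample data: bookmark counts per URL per service.
sample = {
    "http://example.com/a": {"del.icio.us": 3, "bloglines": 4},
    "http://example.com/b": {"del.icio.us": 10},
    "http://example.com/c": {"rawsugar": 8, "unknown": 100},
}
print(rank_urls(sample))
```

p. Choosing the weights is the hard part, and I have no data to justify any particular values; the point is only that a simple per-service multiplier would be enough to produce an ordering that something like a Greasemonkey script could consume.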
p. So the problem for today's post is this: is it possible to extract data about a URL from a social bookmarking site? If it is possible, is the data of any actual use? Bonus points for being able to extract data using an API or feed without requiring authentication or site registration.
p. Ten hours ago I set aside four hours to answer the question. By the end of four hours I was barely into the list of social bookmarking sites I'd cobbled together. Eventually I trimmed the list of sites to 36. I am not going to go through and evaluate each site, though I'll list them at the end of this article.
p. The key to answering the question was the presence and support of APIs into the various services. I was floored at how many of the services provide no documented API whatsoever. Many provide JavaScript bookmarklets to add data to the service. Many provide genuine web service APIs (using REST, XML-RPC or a special blend of RDF or XML over HTTP), but only to add or manipulate data within the service.
p. I found that very few social bookmarking services provided a means of extracting data about a URL from the service, let alone a formally supported API for doing so.
p. Furthermore: the services which do provide APIs, of any kind, tend to hide the links to documentation in obscure places. Few place a link to "developer tools" or "support" or "APIs" or "tools" on their home page. I tended to find links buried in About pages, occasionally a Help page, and frequently casually tossed into a blog post or news post announcing the availability of the API or tool.
p. A number of the services appear to be pretty much dead. Either content has not been updated (I figure if it hasn't been updated in 90 days, it's dead, Jim), or the sole content appears to come from SEO spammers.
p. Because finding APIs was such a pain, I also checked into the various feeds available. Many of the sites provide a per-bookmark or per-URL feed (typically RSS 2.0-ish).
h2. Findings
p. Almost all social bookmarking services require some sort of authentication to access the API, if one is available. The form of authentication varied: HTTP "Basic" authentication (with or without a realm), WSSE authentication, cookie-based authentication, or application identifiers.
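p. For reference, HTTP "Basic" authentication is the simplest of these: the client sends the username and password, joined by a colon and base64 encoded, in an Authorization header. A minimal sketch follows; the function name is my own and the credentials are placeholders, not a real account on any service.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials, not a real account.
print(basic_auth_header("Aladdin", "open sesame"))
```

p. Note that base64 is an encoding, not encryption, so anyone who can see the request can recover the password unless the connection itself is encrypted.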
p. A number of the services mimic the del.icio.us API, but require accounts on the individual services to use the API.
p. del.icio.us, Bloglines, and RawSugar provide the most useful information without requiring an existing account with the service.
p. del.icio.us provides metainformation about a URL with either of these queries: http://del.icio.us/url/checkurl?url=[URL] or http://del.icio.us/rss/url/?url=[URL].
p. Bloglines can provide similar information through http://www.bloglines.com/search?q=bcite[URL] (add &format=rss to generate an RSS feed).
p. Note that neither service provides this function through web services, only as unauthenticated GET requests. del.icio.us does provide a URL history service through the API, but only returns data if the authenticated user has bookmarked the given URL. This is inconsistent with the public, unauthenticated behaviour.
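p. The two unauthenticated query patterns above are easy to generate. This is only a sketch: the function names are mine, and whether either service expects the target URL to be percent-encoded is an assumption on my part (I encode it here because that is the safe default for a query string).

```python
from urllib.parse import quote, urlencode


def delicious_url_feed(url):
    """Unauthenticated del.icio.us feed for a URL (pattern quoted in this post)."""
    return "http://del.icio.us/rss/url/?" + urlencode({"url": url})


def bloglines_citations(url, as_rss=True):
    """Bloglines citation search for a URL; format=rss asks for a feed."""
    # Assumption: the target URL is percent-encoded directly after "bcite".
    query = "http://www.bloglines.com/search?q=bcite" + quote(url, safe="")
    return query + "&format=rss" if as_rss else query


print(delicious_url_feed("http://example.com/post?id=1"))
print(bloglines_citations("http://example.com/post?id=1"))
```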
p. RawSugar provides this information through its standard, unauthenticated APIs.
p. Of the rest of the services with APIs, most required some form of authentication (usually a user account, but occasionally some sort of application token, with or without a user account).
p. Digg's RSS feed could be quite useful but only for items currently being "dugg". Digg does not appear to provide per-digg (or per-URL) feeds.
p. Many services provide URL specific feeds, or feeds following various patterns (most recent, most rated, by tag), and use RSS/RDF elements to include tag information (usually as a category for the post containing the URL).
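p. Where tags do show up as category elements, pulling them out of a feed takes very little code. The sketch below assumes a plain RSS 2.0 style feed with un-namespaced item, link, and category elements; the sample feed is invented, and RDF-flavoured feeds that use namespaced elements would need the namespaces spelled out.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; real feeds differ in namespaces and element choice.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>sample bookmarks</title>
    <item>
      <title>An article</title>
      <link>http://example.com/article</link>
      <category>feeds</category>
      <category>attention</category>
    </item>
    <item>
      <title>Untagged</title>
      <link>http://example.com/untagged</link>
    </item>
  </channel>
</rss>"""


def tags_by_link(rss_text):
    """Map each item's link to the list of its <category> values."""
    root = ET.fromstring(rss_text)
    result = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        if link is None:
            continue  # skip items that carry no link at all
        result[link] = [c.text.strip() for c in item.findall("category") if c.text]
    return result


print(tags_by_link(SAMPLE_RSS))
```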
h2. Observations & Next steps
p. I think it is impractical to attempt to build a tool which would take a given URL and attempt to retrieve data about that URL from the many social bookmarking sites. It is conceivable that you could build something focused solely on those sites which expose data, with the caveat that it would not be representative of the most popular sites.
p. I found it interesting how many of these sites provide many ways of adding information to the site, but make it difficult to retrieve information (and make it nearly impossible if you're not a registered user of the site). There was some irony in reading various manifestoes posted on sites which were going to be "better than del.icio.us" yet provide no methods of exporting data in a usable format other than for backups.
p. I plan to take what I've learned from Bloglines, Technorati, del.icio.us, and RawSugar and do something with it, though I'm not sure exactly what. I see potential for a MovableType/WordPress plugin (show the mashed-up cosmos for a post). My original goal was to use data like this to influence my feed reading, either by creating a custom feed of the top-ranked posts, or manipulating something like Bloglines or Google Reader with a Greasemonkey hack.
p. If you provide an API to your service, link to the documentation off your homepage, your help page, your support page. Make sure some sort of result appears in your site search results (if you even have a site search).
p. If you do not provide an API to your service, reconsider that decision or you will be road kill. You are doing a disservice to your consumers by keeping all of the information penned up.
p. I noticed a number of services use MD5 hashes of the URL as a key. I had tried doing this many years ago when I wrote a spider and got roundly beaten up over the possibility of collisions (this was well before collisions in MD5 were documented). I don't know enough about MD5 or collisions in MD5 to know whether using MD5 hashes as the key for URLs is going to be a significant problem or not. At the time of my pummeling no one offered up an alternative. (In the end, the spider died because it was deemed competitive to a planned software product. Said product never actually saw the light of day.)
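p. For what it's worth, here is a back-of-the-envelope sketch of both halves of that question: computing the key, and estimating how likely an accidental collision is using the standard birthday approximation. The function names are mine, and I am assuming the key is the hex digest of the URL's UTF-8 bytes; I don't know exactly which byte string each service hashes. The estimate treats MD5 output as uniformly random, so it says nothing about deliberately constructed collisions.

```python
import hashlib


def url_key(url):
    """Hex MD5 digest of the URL's UTF-8 bytes, usable as a fixed-length key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def accidental_collision_estimate(n, bits=128):
    """Birthday approximation: chance that n random keys of `bits` bits collide.

    Uses p ~= n * (n - 1) / 2**(bits + 1), valid only while p is small.
    """
    return n * (n - 1) / 2 ** (bits + 1)


print(url_key("http://example.com/"))
print(accidental_collision_estimate(10**9))  # a billion distinct URLs
```

p. One practical wrinkle: two spellings of what a person would call the same URL (with and without a trailing slash, say) produce unrelated keys, so any normalization has to happen before hashing.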
p. The proliferation of API keys and methods of accessing APIs is going to kill us. I don't have a ready-made solution; perhaps something using OpenID/SXIP/etc. is in order. I had been thinking that using PGP or X.509 keys would help, but the overhead of transmitting the public key component is a bit too much to ask (my GPG public key is 1700 bytes, my X.509 PEM file is nearly 4k). Perhaps the answer is some sort of well known service URL (I need to check into WSDL and see if that answers this question: "How do I discover all of the services available on this server?").
p. "Attention" is more than what I'm paying attention to. As a publisher I'd like to know who's paying attention to me and what I publish. As a reader I'd like to know who's paying attention to what I read. Most of these services, whether they're bookmark oriented or feed oriented or something else entirely, focus primarily on their registered users and rarely make data available to the publisher, even more rarely to the public at large.
h2. Services
p. These are my raw, mostly unedited notes and observations about the services I looked at. This was not intended to be a thorough review of all social bookmarking services. If I could find no information about an API and could not check out feeds on a per-URL basis, I moved on to the next site in my list. I did not create accounts on any services for this review; however, I have existing accounts at Bloglines, Yahoo!, Digg, and del.icio.us and am more familiar with their offerings. When I found a web service, API, or feed which had promise, I used the curl tool running under cygwin on Windows XP to invoke the service or request the feed, to verify whether or not the data would be useful.
h4. "Blinklist":http://www.blinklist.com/
* no public, documented api
h4. "Bloglines":http://bloglines.com/
* http://www.bloglines.com/search?q=bcite[URL]&s=fr&pop=l&news=m&format=rss to generate rss of links to an article
* API requires basic authentication
* API does not support URL queries
h4. "Blogmarks":http://Blogmarks.net/
* No public, documented api
* Appear to be able to parse some information about tags from Atom file generated for each blogmark
* Ok, they do have an api but only discoverable by reading the blog
* Must be authenticated to use api
* http://dev.blogmarks.net/wiki/AtomApiSpec
h4. "Blogmemes":http://Blogmemes.net/
* Api/tools link on bottom of main page
* Tags for urls exposed in feedburner feed
* Uses the akarru opensource engine
h4. "Citeulike":http://Citeulike.org/
* API to publish but not apparently to retrieve
h4. "Connotea":http://www.connotea.org/
* clear link to api on homepage
* http://www.connotea.org/wiki/WebAPI
* all api requests require authentication
* can retrieve # bookmarks and tags
h4. Blinkbits.com
* It's dead, Jim.
h4. "del.icio.us":http://del.icio.us/
* Can find out who has bookmarked a url using http://del.icio.us/url/checkurl?url=[URL] or http://del.icio.us/rss/url/?url=[URL] (this produces an RDF feed, not quite RSS)
* Alternately, compute md5 hash of URL and specify after /url/
* Also discoverable through API but only if authenticated user has tagged link!
* appeared to be only service utilizing SSL to encrypt connections
h4. "de.lirio.us":http://de.lirio.us/
* seems to have morphed into simpy
h4. "Digg":http://digg.com/
* Does not (yet) have an API to query URLs
* RSS feed includes the digg category, # diggs, and number of comments on digg, as well as who submitted the digg
* Has useless Expiration header on RSS feed (Thu, 19 Nov 1981 08:52:00 GMT)
h4. "Feedmarker":http://Feedmarker.com/
* seems inactive
h4. "Furl":http://Furl.net/
* No obvious API to retrieve data
h4. "Gravee":http://www.gravee.com/
* No api/developer links
* Seems to be dead
h4. "Jots":http://jots.com/
* Api discovered off "about" page
* Categories for links exposed in RSS
* XMLRPC api
* All API requests require username & password for account
* Does not appear to have a method to query a URL and get back meta information across jots.com
h4. "Kinja":http://kinja.com
* No useful information about tools or services could be found
h4. "Lilisto":http://Lilisto.com
* Seems to be dead based on circular loop trying to click on "getlister" information
* API uses Javascript to export an HTML div (according to the lilisto wiki)
h4. "Linkroll":http://www.linkroll.com/
* no meta information in rss
h4. "Lookmarks":http://Lookmarks.net
* Rss does not expose categories or other metadata
h4. "ma.gnolia.com":http://ma.gnolia.com
* clear "support & tools" link
* requires member's application access key
* given userid & password can retrieve key
* observation: Some API error codes are the same as HTTP status codes but have different meanings.
* Little data available w/o authentication
* Can retrieve tags and rating for URL
h4. "Maple":http://www.maple.nu/
* Nothing obvious re: api
h4. "MyWeb":http://myweb.yahoo.com/
* API, requires Application ID identification
* Can retrieve tags by url, urls by tag, and tag frequency
h4. Newsgator
* Provides SOAP methods to retrieve and set information about URLs (see http://www.newsgator.com/ngs/api/Post.aspx)
* Must have Newsgator account and API key
h4. "Newsvine":http://newsvine.com
* No obvious apis, developer paths
* Rss feeds feature only newsvine links
h4. "RawSugar":http://Rawsugar.com
* Api documentation linked off help
* Multi level authentication (userid/password in URL, userid/password in WWW-Authenticate header, Rawsugar.com auth cookie)
* http://www.rawsugar.com/api/url?url=[URL] returns information about a URL, unauthenticated, 404 if URL not bookmarked
* http://www.rawsugar.com/api/user?username=[username] returns information about a user, unauthenticated
* api at http://www.rawsugar.com/doc/api
h4. "Reddit":http://Reddit.com/
* No metainformation in feed.
* No apparent API.
h4. "Rojo":http://Rojo.com/
* Set of tools but all user oriented for adding to service
* No metainformation in feed.
h4. "Scuttle":http://scuttle.org/
* Claims to duplicate del.icio.us API
* Use http://scuttle.org/history/$md5hash instead of http://del.icio.us/url/$md5
* No rss feed for URL histories
* Tags as categories in recent rss feed
h4. "Shadows":http://Shadows.com/
* Claims to duplicate del.icio.us API
* Requires userid/password
* Site hung while investigating
h4. "Simpy":http://Simpy.com/
* Kudos for having "tools" and "api" in prominent location
* Requires HTTP basic userid/password authentication
* Can get information about urls *you* have tagged but not general information about URL
* Can get HTML "history" for URL with http://www.simpy.com/link/info/[URL]
h4. "Spurl":http://Spurl.net/
* No apparent API.
* Received 404's off http://stream.spurl.net/list.php when trying to browse any stream
* RSS/Atom feeds of recently added items, no meta information
h4. "Squidoo":http://www.squidoo.com/
* No apparent API.
* No metainformation in feed.
h4. "Stumbleupon":http://Stumbleupon.com/
* No apparent API.
* No metainformation in feed.
h4. "Tailrank":http://Tailrank.com/
* No metainformation in feed
* Link to tools is for displaying tailrank data
* Found link to api off "about" page
* One method
* Does not return any meta information in RSS
* Cannot query URL
h4. "Techmeme":http://Techmeme.com/
* No apparent API.
* No metainformation in feed.
h4. "Technorati":http://Technorati.com/
* Clear link to developer's information and APIs
* API requires API Key to access.
* Provides "cosmos" information about a URL (see http://technorati.com/developers/api/cosmos.html)
h4. "Wink":http://Wink.com
* Link to developers information
* Links to specific apis failed with php errors
h4. "Wists":http://Wists.com/
* No apparent API.
* Category/tag exposed in RSS
* Wait, api linked off sidebar of blog
* API does not appear to support query by URL