Several years ago a friend posed this problem: her corporate intranet team was mandated to use the in-house search engine. This search engine had a fatal flaw: it had no way to index entitled or authenticated content, so much of the company's intranet was invisible to search. Some solutions had been discussed: dual-path the authentication code to permit access by the crawler, hardcode userids and passwords into the crawler, or temporarily turn off authentication whenever the crawler did an index run. None were really practical: the company had a distributed intranet with hundreds of web sites using different sorts of authentication, and hardcoding credentials is just asking for trouble, as is temporarily turning off security.
Although my friend has since left that company, I have been thinking about the problem since then, toying with solutions, thinking up variations, but never really putting ink to paper so to speak.
What started as a question about allowing robots access to entitled content digressed into issues about automatically registering and authenticating any user agent to such content, and in turn authenticating agents between web services and authenticating agents as proxies for users.
Although identity and authentication are of interest to me, they're not my specialty and I readily concede not being up to the minute with the latest information about OpenID or Cardspace. So I offer the following ideas with this disclaimer: they're just ideas, I do not claim any originality, I just would like to see if we can push solutions a bit quicker.
The ideas are briefly:
I outline the first two ideas below and plan to expand on them in future posts. The last two (using public keys for API requests) need some more thought put into them. I think where my head is at is using the flickr API as a starting point, but standardizing on a PGP or S/MIME signature to sign requests rather than have each service come up with a different signature method.
My initial solution to my friend's problem was to update the Robot Exclusion Protocol. Move from a flat ASCII text file format to an XML format, add in some elements to designate protected URLs and other elements to designate authentication methods, mix in fixes for a few complaints which have arisen over the years about robots.txt, simmer with a few demonstration APIs and then try to convince every automated agent to adopt a new REP.
Having thought about it more, what I would propose instead is an extension to Google's Sitemaps protocol that adds elements to denote which URLs are protected (perhaps <authenticate> or <entitle>?) and where the agent should go to register and request an authentication token. I would add a few additional stanzas or elements requesting (and presenting) contact and support information, something sorely lacking in the relationship between robots and sites today. Finally, you'd want an element describing what the robot can do with the content (index? cache?).
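To make the shape of such an extension a little more concrete, here is a minimal sketch of what a robot might do with it. Everything beyond the standard <urlset>, <url> and <loc> elements is an assumption made up for illustration: the auth namespace URI, the <auth:authenticate> and <auth:contact> elements, and the register and allow attributes are not part of the Sitemaps protocol or any published proposal.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Hypothetical extension namespace, invented for this sketch.
AUTH_NS = "http://example.com/schemas/sitemap-auth/0.1"

SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:auth="http://example.com/schemas/sitemap-auth/0.1">
  <auth:contact>webmaster@example.com</auth:contact>
  <url>
    <loc>http://example.com/public/index.html</loc>
  </url>
  <url>
    <loc>http://example.com/reports/q3.html</loc>
    <auth:authenticate register="http://example.com/robots/register"
                       allow="index"/>
  </url>
</urlset>
"""


def protected_urls(xml_text):
    """Map each protected URL to its registration endpoint and allowed uses."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc")
        auth = url.find(f"{{{AUTH_NS}}}authenticate")
        # Only URLs carrying the hypothetical <auth:authenticate> element count.
        if loc and auth is not None:
            result[loc.strip()] = {
                "register": auth.get("register"),
                "allow": (auth.get("allow") or "").split(),
            }
    return result
```

Calling protected_urls(SAMPLE) returns a single entry for the q3.html report, telling the robot where to register and that it may index (but was not told it may cache) the page.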
A complying server would respond to a request for an authenticated document with a standard HTTP 403 status, but include a reference to the robots.xml or sitemaps file covering the URL in a <link> element. Perhaps something like:
<link rel="authentication" href="http://example.com/sitemap.xml">
A complying user agent would not request the sitemap file until it received a 403 status response. Only then would it look for a sitemap file (relying either on a Link: HTTP header or a <link> element in the response).
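The discovery step on the agent side could be sketched roughly as follows. This is only an illustration under stated assumptions: the rel value "authentication" is the hypothetical one used above, the function name find_auth_sitemap is mine, and the Link header parsing is deliberately naive (it would be confused by commas inside a URL, for example).

```python
import re
from html.parser import HTMLParser

# Matches one link-value such as:
#   <http://example.com/sitemap.xml>; rel="authentication"
LINK_VALUE = re.compile(r'<([^>]*)>\s*;[^,]*?\brel\s*=\s*"?([^";,]+)"?', re.I)


class LinkFinder(HTMLParser):
    """Records the href of the first <link> element carrying the wanted rel."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if self.rel in rels and attributes.get("href"):
            self.href = attributes["href"]


def find_auth_sitemap(status, headers, body, rel="authentication"):
    """Return the sitemap URL advertised by a 403 response, or None."""
    if status != 403:
        return None  # a complying agent only looks after being refused
    lowered = {name.lower(): value for name, value in headers.items()}
    # Prefer the Link: HTTP header when the server sent one.
    for target, rels in LINK_VALUE.findall(lowered.get("link", "")):
        if rel in rels.lower().split():
            return target
    # Otherwise fall back to a <link> element in the response body.
    finder = LinkFinder(rel)
    finder.feed(body)
    return finder.href
```

Checking the header before the body is a design choice of this sketch, not something the proposal requires; it simply lets an agent avoid parsing the error page when the server already said where to look.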
I will write about this in more detail in a separate post. My goal with this proposal would be to create a standard, XML based, method and process for granting access to authenticated content.
Once you have an automated means to request and grant access to authenticated content for use by robots, why not extend it to users as well? There are a couple of differences to account for. First, you likely want to collect different information from a robot than from a user. Second, users are more concerned about their privacy and may not want to be automatically registered or logged into a web site.
Many browsers today support semi-automated authentication: they offer to store a "userid" and "password" if they encounter a form with an input element of type password. However, in some testing I did last year I discovered that browsers will typically take the field immediately "next to" the password field, without regard to whether it really is or is not the "userid" that matches the password.
What I propose then is a variation on the extended sitemaps.xml file, call it agents.xml if you need a name.
A complying server would respond to a request for authenticated content with a 403 status. In addition to the <link> element provided for robots, we add a second <link> element for non-robot user agents.
A complying user agent would retrieve the agents.xml file, parse it, and determine whether it has registered for the site previously. If so, it would use the cached identification information to attempt to log into the site. If it hasn't registered before, it should prompt the user to confirm whether it should register automatically, disclosing what information the site is requesting and submitting a registration request once confirmed.
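The decision a user agent makes at this point can be sketched in a few lines. The sketch assumes agents.xml has already been fetched and parsed into a small record; since the file format is not defined yet, the SiteDescription fields, the AccountStore class, and the next_step function are all hypothetical names. What happens when the user declines is also my assumption (fall back to the site's own error page).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SiteDescription:
    """What the agent learned from a (hypothetical) parsed agents.xml."""
    site: str
    register_url: str
    requested_fields: List[str]


@dataclass
class AccountStore:
    """Accounts the agent has registered before, keyed by site."""
    accounts: Dict[str, Dict[str, str]] = field(default_factory=dict)


def next_step(
    desc: SiteDescription,
    store: AccountStore,
    confirm: Callable[[str, List[str]], bool],
) -> Tuple[str, Optional[dict]]:
    """Decide whether to log in, register, or show the site's error page."""
    cached = store.accounts.get(desc.site)
    if cached is not None:
        # Registered previously: try the cached identification information.
        return ("login", cached)
    # Never register silently: disclose the requested fields and ask first.
    if confirm(desc.site, desc.requested_fields):
        return ("register", {"url": desc.register_url,
                             "fields": desc.requested_fields})
    return ("show_error_page", None)
```

The confirm callback stands in for whatever prompt the browser would show; passing it in keeps the privacy decision with the user rather than burying it in the agent.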
This is very similar to what Kim Cameron of Microsoft has proposed with Cardspace. I need to delve into the Cardspace documents because I think that there is a there there that I've ignored due to an aversion to all things WS-* and SOAP.
One difference is that I personally think it is bad web server form to use redirects to deny access to a resource. Where Cardspace uses redirects to a login form, I would use a 403 status.
A second difference is that I think we should provide a way to automate registration and authentication for sites which are not ready to jump to third-party authentication. If a site wishes to use userid/password authentication (granting access either through Basic/Digest authentication or cookie-based authentication) then so be it. Yes, it is a pain, and it does not relieve us of the multiple userid/password problem, but if a site could cogently describe the endpoints for authentication and the rules for the userid, password, and other registration fields, browsers could intelligently manage one's account space and offer smarter browser-based account managers.
Nothing in this solution should preclude using OpenID, Cardspace, or whatever else comes along. The server should indicate what authentication methods it supports. The user agent should determine which methods it supports and either automatically authenticate or display the site's error page which presumably would contain links to a non-automated registration scheme.
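The negotiation itself is simple enough to sketch. This assumes the server advertises its methods as a list of names in preference order, which is my assumption rather than part of the proposal; the method names and the choose_method function are illustrative only.

```python
def choose_method(server_methods, agent_methods):
    """Return the first server-listed method the agent supports, else None.

    A result of None means the agent cannot authenticate automatically and
    should display the site's error page instead.
    """
    supported = {method.lower() for method in agent_methods}
    for method in server_methods:  # assumed to be in server preference order
        if method.lower() in supported:
            return method
    return None


# Example: the site prefers OpenID but also accepts plain userid/password.
site_offers = ["openid", "cardspace", "password"]
print(choose_method(site_offers, ["password", "openid"]))
print(choose_method(site_offers, ["kerberos"]))
```

Letting the server's ordering win is one possible policy; an agent could just as reasonably apply the user's own preferences first.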
Again, I plan to write this up in more detail and work on some prototype implementations. I would like to read some more about Cardspace and the simple registration extensions to OpenID.
Copyright 2002–2011 Artific Consulting LLC.
Unless otherwise noted, content is licensed for reuse under the Creative Commons Attribution-ShareAlike 3.0 License. Please read and understand the license before repurposing content from this site.