URL vs. URI: URLs as Queries and URIs as Identities
Continuing my rant about the URL vs. URI bit, I thought I’d carry on given my renewed interest in this topic thanks to Ora. In our LEDP position paper, we made the observation that URLs represent queries while URIs are identifiers. If you’re wondering why you should care about this subtle distinction, please read on.
URLs as Queries
We’ve stated that URLs are queries, but what does that really mean? Those of you familiar with blog software such as WordPress know that the default URL pattern might go something like this:
Here, the URL forms a query for a blog post using its internal identifier. In this case, the URL is asking the WordPress database for a post and related items using the primary key of a row that represents the post. For most, it’s pretty obvious that the query parameter “p” refers to the internal identity of the post.
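To make the "URL as query" idea concrete, here's a rough sketch in Python. The host name blog.example.com, the POSTS dictionary (standing in for the WordPress database table), and the post title are all assumptions for illustration; only the "p" parameter and the ID 399 come from the discussion here.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical stand-in for the posts table, keyed by primary key.
POSTS = {399: "An example post"}

def resolve_post(url):
    """Treat the URL as a query: read "p" and look the post up by key."""
    params = parse_qs(urlsplit(url).query)
    post_id = int(params["p"][0])
    return POSTS.get(post_id)  # None when no row has that key

print(resolve_post("http://blog.example.com/?p=399"))
```

The server never treats "399" as anything more than a lookup key; the URL is just the way the client phrases the question.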
As we mention in the paper, there are many other ways to construct URLs to the same post. For example, we can embed the ID into a path segment:
In all of these cases, the server application is interpreting the URL and using elements of the URL to internally resolve the information the client requested. While I’ve only singled out WordPress, this pattern is quite common among several web application frameworks.
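As a sketch of that interpretation step, the function below accepts two URL shapes and reduces both to the same internal ID. The "/archives/399" path layout and the example host are assumptions made up for this illustration, not a documented WordPress permalink format.

```python
import re
from urllib.parse import urlsplit, parse_qs

def extract_post_id(url):
    """Return the internal post ID from either URL shape, or None."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "p" in query:
        # Query-parameter form, e.g. /?p=399
        return int(query["p"][0])
    # Path-segment form (hypothetical layout), e.g. /archives/399
    match = re.search(r"/(\d+)/?$", parts.path)
    if match:
        return int(match.group(1))
    return None

print(extract_post_id("http://blog.example.com/?p=399"))
print(extract_post_id("http://blog.example.com/archives/399"))
```

Both calls resolve to the same internal key, which is the point: the URL shape is a presentation choice, and the server decides how to read it.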
Internal Identity vs. Global Identity
When folks put information on the web, the content they publish usually has two identities:
- An internal or local identity. This may be the name of a file (e.g. “me.jpg”) or the primary key of a row in a database.
- An external identity, which is the ID of the information you’ve published. On the web, this is the “global identity” exposed by the URL of the content.
Often, people don’t think much about either. The global identity of a resource is usually an afterthought, determined by the underlying framework driving the application. With web servers serving up documents, we’re usually exposing the local file name of the document. With database-driven applications, we’re exposing the primary key, or some alternate key, of a row in a database. Quite often, when a web application changes frameworks (e.g. .NET to Java’s JSP, to Ruby, etc.), the URL patterns change and the global identity changes with them.
Using the previous WordPress example, we know that the internal blog post ID is “399”, but this internal ID really isn’t suitable as a globally unique, unambiguous identifier. Another blog using WordPress, running the exact same version of the software, could also have a blog post ID of “399”. This does not mean that the two sites have the same content; it only means that the two instances happen to have a post with an internal ID of “399”. On its own, the value “399” isn’t a suitable web-scale identifier. We need something else.
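A small sketch of the collision, assuming two made-up blogs (blog-one.example and blog-two.example) that both expose a post through the "p" parameter. The local ID is the same on both; only the pairing with the host tells them apart.

```python
from urllib.parse import urlsplit, parse_qs

def local_id(url):
    """The internal key as the application sees it."""
    return parse_qs(urlsplit(url).query)["p"][0]

def global_id(url):
    """The same key, qualified by the host that minted it."""
    return (urlsplit(url).netloc, local_id(url))

# Hypothetical blogs running the same software.
a = "http://blog-one.example/?p=399"
b = "http://blog-two.example/?p=399"

print(local_id(a) == local_id(b))    # True: both are "399"
print(global_id(a) == global_id(b))  # False: different hosts
```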
Subbu Allamaraju had an interesting post a while back about Resource Identity and Cool URIs. Subbu asserts that there is a distinction between the identity of a resource and its location, and his key point is spot on:
URIs uniquely identify resources but a URI used to fetch something is not always a good candidate to serve as a unique identifier in client applications.
And this is where I feel the core confusion about URLs and URIs lies: identity vs. location. If we look at his initial example, he describes the following:
<accounts xmlns="urn:org:bank:accounts">
  <account>
    <id>AZA12093</id>
    <link href="http://bank.org/account/AZA12093" rel="self"/>
    ...
  </account>
  <account>
    <id>ADK31242</id>
    <link href="http://bank.org/account/ADK31242" rel="self"/>
    ...
  </account>
</accounts>
In his example, the internal identity of each account is being expressed through a path segment in the href attribute of the link element. This approach is functional and is similar to that of the previous WordPress example.
The problem with this approach is that the ID values are only unique within the domain “bank.org”. There’s no reliable way to assert that two sites are referring to the same account if we have to rely on the value of a path segment or query parameter. As stated earlier, if we take WordPress as an example again, blog post “399” might talk about the Kardashians, or something else. There’s no guarantee that two URLs refer to the same information just because they share the same internal identity. Most likely, they don’t.
We could make the link and the ID one and the same:
<accounts xmlns="urn:org:bank:accounts">
  <account>
    <id>http://bank.org/account/AZA12093</id>
    <link href="http://bank.org/account/AZA12093" rel="self"/>
    ...
  </account>
  <account>
    <id>http://bank.org/account/ADK31242</id>
    <link href="http://bank.org/account/ADK31242" rel="self"/>
    ...
  </account>
</accounts>
You might wonder what the hell is going on here, since it looks pretty much the same as the first example. The difference is that we’re saying that the ID and the link are identical. That is, the identity of the account is the URI. That URI also happens to be a URL that can be dereferenced. This works, and it’s considered a basic principle of Linked Data.
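Here's a minimal sketch of what "the ID and the link are identical" looks like to a client, using Python's standard ElementTree parser. The document is a trimmed copy of the bank example with the elided "..." content left out, and the helper name id_matches_self_link is my own invention.

```python
import xml.etree.ElementTree as ET

NS = {"b": "urn:org:bank:accounts"}

DOC = """<accounts xmlns="urn:org:bank:accounts">
  <account>
    <id>http://bank.org/account/AZA12093</id>
    <link href="http://bank.org/account/AZA12093" rel="self"/>
  </account>
  <account>
    <id>http://bank.org/account/ADK31242</id>
    <link href="http://bank.org/account/ADK31242" rel="self"/>
  </account>
</accounts>"""

def id_matches_self_link(account):
    """True when the account's <id> is the same URI as its rel="self" link."""
    ident = account.find("b:id", NS).text.strip()
    link = account.find("b:link[@rel='self']", NS)
    return link is not None and link.get("href") == ident

root = ET.fromstring(DOC)
results = [id_matches_self_link(acct) for acct in root.findall("b:account", NS)]
print(results)
```

A client that sees the two values agree can store the URI as its key and dereference the very same string when it needs the data.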
However, there are some problems with this approach too. As Subbu rightly points out, URIs are not always cool URIs. We are all aware that URIs do change at some point. If Bank.org is acquired by BiggerBank.com, what happens to the ID since we tied the ID to a host name that is likely to be retired soon?
One solution is to follow good web practices: maintain the bank.org domain and redirect requests for the older URLs to the new ones. Adobe does this with links to FutureSplash and Macromedia Flash locations. These URLs all resolve to the Adobe Flash product pages.
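As a rough sketch of the redirect idea, the function below maps an old bank.org URL to a permanent redirect on the new host. It assumes the acquiring bank keeps the same path layout, which is purely my assumption for illustration; a real migration would likely need a lookup table.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "bank.org"
NEW_HOST = "biggerbank.com"  # assumption: same paths are kept on the new host

def redirect_for(old_url):
    """Return (status, Location) for a retired URL, or None if not ours."""
    parts = urlsplit(old_url)
    if parts.netloc != OLD_HOST:
        return None
    location = urlunsplit(
        (parts.scheme, NEW_HOST, parts.path, parts.query, parts.fragment)
    )
    return 301, location  # 301 = moved permanently

print(redirect_for("http://bank.org/account/AZA12093"))
```

The old URI keeps working as an identifier because the old host still answers for it, even though the content now lives elsewhere.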
This strategy allows us to keep the original identity but the link is changed to accommodate the new domain. We can expand on this strategy and change the value of the link:
<accounts xmlns="urn:org:bank:accounts">
  <account>
    <id>http://bank.org/account/AZA12093</id>
    <link href="http://biggerbank.com/accounts/?id=http%3A//bank.org/account/AZA12093" rel="self"/>
    ...
  </account>
  <account>
    <id>http://bank.org/account/ADK31242</id>
    <link href="http://biggerbank.com/accounts/?id=http%3A//bank.org/account/ADK31242" rel="self"/>
    ...
  </account>
</accounts>
Yes, it looks ugly and weird, but it’s valid and it works. If you look closely, it’s not much different from the initial WordPress URL. The only difference is that we’ve replaced a numeric identifier with a URI. It’s a URI that references another URI, but it is valid. For some reason, people just don’t like URLs that look like this. DBpedia does this since they’re describing data on Wikipedia:
The big difference with this approach is that it’s clear that the identifier is globally unique. There’s significantly less ambiguity about the ID http://en.wikipedia.org/wiki/BMW_7_Series_(E23) than about the ID BMW_7_Series_(E23). Because no one else can mint valid URLs within the Wikipedia domain, you can have greater confidence that multiple applications are referring to the same thing. URIs as identifiers are globally unique.
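The URI-inside-a-URL pattern from the bank links can be sketched with Python's standard library. The function names link_for and identity_from are hypothetical; the biggerbank.com base and the "id" parameter come from the bank example.

```python
from urllib.parse import quote, urlsplit, parse_qs

def link_for(account_uri, base="http://biggerbank.com/accounts/"):
    """Build a query URL that carries the full identity URI as its "id"."""
    # quote() leaves "/" alone by default and escapes ":" as %3A.
    return f"{base}?id={quote(account_uri)}"

def identity_from(link):
    """Recover the original identity URI from the link."""
    return parse_qs(urlsplit(link).query)["id"][0]

uri = "http://bank.org/account/AZA12093"
link = link_for(uri)
print(link)                 # http://biggerbank.com/accounts/?id=http%3A//bank.org/account/AZA12093
print(identity_from(link))  # http://bank.org/account/AZA12093
```

The identity survives the round trip unchanged, so the location can move to a new domain without the ID moving with it.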
The global identity of information resources shouldn’t change as frequently as it does. It drives my wife apeshit that all of her recipe bookmarks change every time MarthaStewart.com updates their site. Part of the problem, I believe, is that most folks doing Information Architecture don’t take identity into consideration, and that a fair number of web frameworks do very little to assist in quality URI/URL design. But this post is long enough, so I’ll save that for another post.