Over the past few days I have been reading up on the State Chart XML (SCXML) spec. Ever since reading some of Stu Charlton’s ideas on a RESTful Hypermedia Agent and listening to his WS-REST keynote presentation, I’ve taken more of an interest in hierarchical state machines and began taking a more in-depth look at SCXML.

The design of SCXML is interesting. I like the ECMAScript functionality, and the XML structure feels clean at first. It all seems fine until you get to the section on executable content. There’s a bit of debate around “executable XML” and executable XML frameworks like Jelly, and even folks poking fun at the idea of an XML programming language. As a Java guy, I have grown to dislike build tools such as Ant because one can express relatively complex conditional logic in XML. I like tools like Maven better since its XML model is declarative rather than executable (profiles being the conditional exception). When I need more complex build operations, I’d look at tools like Gradle. Don’t get me started on XSLT.

I really like the concept of SCXML, but I’m not sold on the design. I don’t really have an issue with the use of XML in general, I get it. However, the executable XML content bit is really hard to get past, especially when a scripting environment is available to the SCXML environment. I can debug JavaScript code with a number of tools. Executable XML content? Not so much. For me, the executable content bit is the technical equivalent of a two-bagger.
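To illustrate what I mean by executable content, here’s a rough sketch of the kind of markup the spec defines. The element names (if, else, assign, log, raise) come from the SCXML draft; the condition and variable names here are made up for illustration:

```xml
<onentry>
  <!-- conditional logic expressed as markup rather than script -->
  <if cond="retries &gt; 3">
    <assign location="status" expr="'failed'"/>
  <else/>
    <assign location="retries" expr="retries + 1"/>
    <raise event="retry"/>
  </if>
  <log expr="'retries is now ' + retries"/>
</onentry>
```

The same logic is a three-line if/else in JavaScript, which is exactly what I’d rather write, and more importantly, debug.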

A year has passed since my last post on URIs and URLs, and it would seem that some of the concepts are still lost on some folks. With that said, I figured I’d throw up another post in which I could try to address some of the questions raised in the comments of both posts.

URLs and URNs are both URIs

This is one point that can’t be stated enough. A URL is a URI and a URN is a URI, plain and simple. It’s really quite challenging to phrase it any other way. It’s like trying to explain that a Human is a Mammal but a Mammal is not always a Human.

Examples of URLs and URNs:

People have also suggested that these posts could have been more helpful if I had provided some examples that illustrate the difference between a URL and a URI. Based on the previous point, one should be able to conclude that a URL is a URI, and therefore there’s no reasonable way to present an example that distinguishes the two. However, we can provide examples that distinguish a URL from a URN:

All of these examples are URIs.

Examples of URLs:

  • mailto:someone@example.com
  • http://www.damnhandy.com/
  • https://github.com/afs/TDB-BDB.git
  • file:///home/someuser/somefile.txt

Examples of URNs:

  • urn:mpeg:mpeg7:schema:2001
  • urn:isbn:0451450523
  • urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C
  • urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66

Again, all of the examples above are valid examples of URIs. You’ll note, of course, that all of the URNs are prefixed with “urn:”.
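You can check this with nothing more than the JDK. The sketch below (just java.net.URI, no third-party libraries) parses examples from both lists; every one of them parses as a URI, and the scheme is what tells the URNs apart:

```java
import java.net.URI;

public class UriKinds {

    // Every string passed here must be a valid URI, or URI.create()
    // throws IllegalArgumentException. The scheme is all that
    // distinguishes a URN from the other examples.
    public static String schemeOf(String s) {
        return URI.create(s).getScheme();
    }

    public static boolean isUrn(String s) {
        return "urn".equals(schemeOf(s));
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("http://www.damnhandy.com/"));       // http
        System.out.println(schemeOf("mailto:someone@example.com"));      // mailto
        System.out.println(isUrn("urn:isbn:0451450523"));                // true
        System.out.println(isUrn("file:///home/someuser/somefile.txt")); // false
    }
}
```

If any of these strings were not a valid URI, URI.create() would throw, which is the point: URLs and URNs alike all parse as URIs.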
URIs are Opaque Identifiers

There’s a very informative page by Tim Berners-Lee that provides a lot of good details on Uniform Resource Identifiers. One very important point is the notion of URI opacity, which states:

“The only thing you can use an identifier for is to refer to an object. When you are not dereferencing, you should not look at the contents of the URI string to gain other information.”

When you followed the link to this page, you didn’t have to do anything other than click it. Your browser only had to look at the URI scheme and the domain in order to resolve this particular document. Everything after “damnhandy.com” is defined to be opaque to the client. This point may seem orthogonal to the original post, but it’s a very important aspect of URIs. I bring this point up because the following question was asked in the comments:

Can we say that:

“http://www.domain.tld/somepath/file.php?mykey=somevalue”

is an URI

and that the “http://www.domain.tld/somepath/file.php” part is an URL?

No. No you cannot. Both are URLs, which are also URIs. The idea behind URI Opacity is that you should not look at the string to make any inferences as to what is at the other end. The presence of a query string does not distinguish a URL from a URI. Both strings are URIs and URLs.
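If it helps, here’s a small check (again nothing beyond java.net.URI in the JDK) showing that both of the commenter’s strings parse identically, apart from one having a query component:

```java
import java.net.URI;

public class QueryStrings {

    // Both strings below satisfy this test; the query string has no
    // bearing on whether something is a URI or a URL.
    public static boolean isHttpUri(String s) {
        URI u = URI.create(s);
        return u.isAbsolute() && "http".equals(u.getScheme());
    }

    public static void main(String[] args) {
        URI withQuery = URI.create("http://www.domain.tld/somepath/file.php?mykey=somevalue");
        URI withoutQuery = URI.create("http://www.domain.tld/somepath/file.php");

        System.out.println(isHttpUri(withQuery.toString()));    // true
        System.out.println(isHttpUri(withoutQuery.toString())); // true
        System.out.println(withQuery.getQuery());               // mykey=somevalue
        System.out.println(withoutQuery.getQuery());            // null
    }
}
```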

Another commenter also asserted the following:

URI is whole address of a resource but resource extension is not mentioned. in URL we also mention the extension.
for example:

URI: www.abc/home
URL: www.abc/home.html

This is also incorrect. Being pedantic, neither is a syntactically correct URI or URL. But that aside, the presence of a file extension also does not distinguish a URI from a URL. Again, they are both valid URLs as well as URIs. Going back to URI opacity, you also cannot conclude that the two URLs reference the same resource or the same representation. A URI is an opaque string that identifies a resource. They’re still both URIs.

HTTP URIs Can Identify Non-Document Resources

This is the one point that I think hits at the crux of everyone’s confusion on the matter. Things get all meta-meta when we start using URIs to identify things that are not documents on the web. It is perfectly acceptable to use URIs to identify concepts, or non-document resources. If a URI happens to express a scheme that suggests it can be dereferenced, there is no requirement that a document resource actually be available at that URI.

Most people’s expectation of HTTP URIs is that they can always be dereferenced. There is a common assumption that an HTTP URI must point to a document, and that if a document resource is not available at that URI — that is, you get a 404 error — then the URI is somehow bad. But just because the server could not locate a document or a representation for the URI does not mean that the HTTP URI is an invalid identifier.

The finer bits of this issue are summed up in httpRange-14 as well as this article by Xiaoshu Wang. The concepts around httpRange-14 are deep, but I think it’s these types of ideas that trip people up a lot. IMHO, it’s one of those concepts that gets muddled in the great internet telephone game, causing more confusion. For example, it seems that people are under the impression that if a URL does not express a file extension, then it represents a concept and therefore is a URI, but it just doesn’t work that way.

A URL is a URI, and a URI can be used to identify anything. Furthermore, each URI is unique. This is why you cannot assume that http://www.abc/home is the same as http://www.abc/home.html just by looking at the URIs. These are two distinct URIs that may or may not identify the same resource. Because URIs are opaque, you as the client should not be making any decisions based on the content of the URI string.

Without a doubt, the URL vs. URI post is by far the most visited page on this blog. Even so, there’s still a lot of confusion on the topic, so I thought I’d break it down in fewer words. The original post was slightly misleading in that I attempted to compare URI to URL, when in fact it should have defined the relationship between URI, URL, and URN. In this post, I hope to clear that up in more concise terms. But first, here’s a pretty picture:

uri_class_diagram


Remember that URI stands for Uniform Resource Identifier, which is used to identify some “thing”, or resource, on the web. Both URLs and URNs are specializations (or subclasses, if you will) of URI. You’d be correct in referring to either a URL or a URN as a URI. In application terms: if your application only calls for an identifier, you should be free to use either a URL or a URN. For brevity, you can state that the application simply requires a URI, and the use of a URL or URN would satisfy that requirement.

Now, if your application needs a URI that is dereferencable over the web, you should be aware of the difference between URN and URL. A URL is location bound and defines the mechanism for retrieving the resource over the web. A URN is just a name and isn’t bound to a network location. For example, you may have a URN for a book’s ISBN in the form of urn:isbn:0451450523. The URN is still a valid URI, but you cannot dereference it; it’s just a name used to provide identity. So to put it in simpler terms:

  • A URI is used to define the identity of some thing on the web
  • Both URLs and URNs are URIs
  • A URN only defines a name; it provides no details about how to get the resource over a network
  • A URL defines how to retrieve the resource over the web
  • You can get a “thing” via a URL; you can’t get anything with a URN
  • Both URLs and URNs are URIs, as they both identify a resource
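The JDK’s own types mirror this split nicely. In the sketch below (standard java.net classes only), a URN parses as a perfectly valid URI, but asking for a URL out of it fails, because there is no retrieval mechanism behind the name:

```java
import java.net.MalformedURLException;
import java.net.URI;

public class UrnVsUrl {

    // True if the JDK can build a java.net.URL for this URI, i.e. it
    // knows a protocol for dereferencing it over the network.
    public static boolean dereferencable(String s) {
        try {
            URI.create(s).toURL();
            return true;
        } catch (MalformedURLException e) {
            // Still a valid URI -- just a name, not a locator.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(dereferencable("http://www.damnhandy.com/")); // true
        System.out.println(dereferencable("urn:isbn:0451450523"));       // false
    }
}
```

Both strings get through URI.create() without complaint; only the location-bound one survives toURL().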

There are some other items that need clarification based on some comments I’ve received on the original post:

  • Elements of a URI such as the query string, file extension, etc. have no bearing on whether or not a URI is a URL. If the URI defines how to get a resource over the web, it’s a URL.
  • A URL is not limited to HTTP. There are many other protocol schemes that can be plugged into a URL.
  • If a URL defines a scheme other than HTTP, it does not magically stop being a URL. Whether the URI defines retrieval via HTTP, FTP, SMB, etc., it’s still a URL. And because the URL identifies a resource, it’s a URI as well.

Yeah, I’ve probably repeated myself a few times, but I wanted to stress a few points.

There’s also been some confusion about when to use the term URI. As I stated in the original post and explained above, it depends on what you’re doing. If everything your application does involves accessing data over the web, you’re most likely using URLs exclusively. In that case, it wouldn’t be a bad thing to use the term URL. Now, if the application can use either a network location or a name, then URI is the proper term. For example, XML namespaces are usually declared using a URI. The namespace may be just a name, or a URL that references a DTD or XML Schema. So when an identifier might serve identity alone as well as retrieval, it’s probably best to use the term URI.

When it comes to manipulating photographs, I live in Photoshop. One feature of all Adobe products that I like is the ability to annotate images and other documents using their eXtensible Metadata Platform, or XMP. XMP is a collection of RDF statements embedded into a document that describe many facets of that document. I’ve always wanted to be able to somehow get that data out of these files and do something with it for application purposes.

There are projects like JempBox, which works on manipulating XMP data but offers no facilities to extract the XMP packet from image files. Apache XML Graphics Commons is more the ticket I was looking for. It includes an XMP parser that works by scanning a file for the XMP header. The approach works quite well and supports pretty much every format covered by the XMP specification. The downside of XML Graphics Commons is that it doesn’t properly read all of the RDF statements; some of the data is skipped or missed completely. To top it off, neither framework allows you to get at the raw RDF data.

What I really wanted to do was to get the XMP packet in its entirety and load it into a triple store like Sesame or Virtuoso. This of course means that you want to have the data available as RDF. Rather than inventing my own framework to do all of this, I found the Aperture Framework. Aperture is a simply amazing framework that can extract RDF statements from just about anything. Of course, the one thing that’s missing is XMP support. So, I set out to implement my own Extractor that can suck out the entire XMP packet as RDF. It’s based on the work started in the XML Graphics Commons project, but modified significantly so that it pulls out the RDF data. Once extracted, it’s very easy to store the statements in a triple store and execute SPARQL queries against them.
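The header-scanning idea itself is simple. The sketch below is a toy version of the approach — my own hypothetical helper, not the actual XML Graphics Commons or Aperture API: scan the raw bytes for the <?xpacket begin= marker and slice out everything up to the end of the closing <?xpacket end=…?> processing instruction, leaving the RDF payload intact:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class XmpScanner {

    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END   = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    // Returns the raw XMP packet (including the xpacket PIs), or null if absent.
    public static String extractPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) return null;
        int end = indexOf(data, END, start);
        if (end < 0) return null;
        int close = indexOf(data, CLOSE, end); // the end PI terminates with "?>"
        if (close < 0) return null;
        return new String(Arrays.copyOfRange(data, start, close + CLOSE.length),
                          StandardCharsets.UTF_8);
    }

    // Naive byte-array search; fine for a sketch, not for huge video files.
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] fake = ("JUNK<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"></x:xmpmeta>"
                + "<?xpacket end=\"w\"?>TRAILING").getBytes(StandardCharsets.UTF_8);
        System.out.println(extractPacket(fake));
    }
}
```

A real extractor has to cope with encodings other than UTF-8 and with formats that chunk the packet (JPEG APP1 segments, for instance), which is exactly the kind of detail the XML Graphics Commons code handles for you.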

Right now, this XMPExtractor can read XMP from the following formats:

  • JPEG Images (image/jpeg)
  • TIFF Images (image/tiff)
  • Adobe DNG (image/x-adobe-dng)
  • Portable Network Graphic (image/png)
  • PDF (application/pdf)
  • EPS, PostScript, and Adobe Illustrator files (application/postscript)
  • Quicktime (video/quicktime)
  • AVI (video/x-msvideo)
  • MPEG-4 (video/mp4)
  • MPEG-2 (video/mpeg)
  • MP3 (audio/mpeg)
  • WAV Audio (audio/x-wav)

On the downside, I’ve found that if you use the XMPExtractor with a Crawler, you’ll run into some problems with Adobe Illustrator files. The problem is that the PDFExtractor mistakes these files for PDFs and then fails. But as long as you’re not using Illustrator files, you should be okay. There are also a few nitpicks with JPEG files and the JpgExtractor, in that the sample files included in the XMP SDK are flagged as invalid JPEG files. However, every JPEG file I created with Photoshop and iPhoto seems to work fine. After a little more testing, I’ll look at offering it up as a contribution to the project.

Over the past few weeks, I’ve been taking a deep dive into the Semantic Web. As some will tell you, a number of scalability and performance issues with Semantic Web frameworks have not been fully addressed. While true to some extent, there’s a large amount of quality research out there, but it is actually somewhat difficult to find. Part of the reason is that much of this research is published on the web as PDF. To add insult to injury, these are PDFs with neither associated metadata nor hyperlinks to associated research (which is not to say that prior research isn’t properly cited).

Does this seem slightly ironic? The Semantic Web is built on the hypertext-driven nature of the web. Even though PDF is a readable and relatively open format, it’s still a rather sucktastic hypertext and on-screen format (especially when it’s published in a two-column layout). PDFs are an agnostic means of publishing a document that was authored in something like Microsoft Word or LaTeX. They are generated as an afterthought, and most authors do not take the time to properly hyperlink these PDFs to external sources. Why is this bad? Given that we’re using the document web (or rather the “PageRanked” web) of today, this makes it a bit more challenging to locate this kind of information. In a sense, this is akin to not eating your own dog food. If knowledge isn’t properly linked into the web as it works today, it effectively doesn’t exist. Guys like Roy T. Fielding get this, and it’s most likely why his dissertation is available as HTML as well as PDF.

As a friendly suggestion to the Semantic Web research community: please consider using XHTML to publish your research, or even better, XHTML with RDFa. Additionally, leverage hyperlinks to cite related or relevant works. It’s not enough anymore to use a purely textual bibliography or footnotes. The document web needs hyperlinks.

There’s a lot of cool and important stuff being worked on, but this knowledge is not being properly disseminated. No doubt, in some cases publishing to two different formats can be a lot of work. But in the long term, the payoff is that the information is widely available and you’re leveraging the Semantic Web as it was meant to be.