When it comes to manipulating photographs, I live in Photoshop. One feature of all Adobe products that I like is the ability to annotate images and other documents using their eXtensible Metadata Platform, or XMP. XMP is a collection of RDF statements, embedded in a document, that describe many facets of that document. I’ve always wanted to be able to get that data out of these files and do something with it in my own applications.

There are projects like Jempbox, which works on manipulating XMP data but offers no facilities to extract the XMP packet from image files. The Apache XML Graphics Commons is closer to what I was looking for. The library includes an XMP parser that works by scanning a file for the XMP header. The approach works quite well and supports pretty much every format supported by the XMP specification. The downside of XML Graphics Commons is that it doesn’t properly read all of the RDF statements: some of the data is skipped or missed completely. To top it off, neither framework allows you to get at the raw RDF data.
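
To give a feel for what "scanning a file for the XMP header" means, here is a minimal sketch in plain Java. It is not the XML Graphics Commons code; the class and method names are made up for illustration. It relies only on the fact that an XMP packet is wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, and it assumes the packet is UTF-8, uncompressed, and stored in one contiguous piece.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of any of the libraries mentioned above.
final class XmpPacketScanner {
    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    /** Returns the first XMP packet (wrapper included), or null if none is found. */
    static String findPacket(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so string
        // indexes found below are also valid byte offsets into data.
        String raw = new String(data, StandardCharsets.ISO_8859_1);
        int start = raw.indexOf(BEGIN);
        if (start < 0) return null;
        int endMarker = raw.indexOf(END, start);
        if (endMarker < 0) return null;
        int close = raw.indexOf("?>", endMarker);
        if (close < 0) return null;
        int stop = close + 2; // include the closing "?>"
        // Assumption: the packet itself is UTF-8 encoded.
        return new String(data, start, stop - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        String packet = findPacket(data);
        System.out.println(packet == null ? "No XMP packet found" : packet);
    }
}
```

A brute-force scan like this is format-agnostic, which is why the approach covers so many file types. The trade-off is that it will miss packets that are UTF-16 encoded, compressed, or split across several segments of the file.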

What I really wanted to do was to get the XMP packet in its entirety and load it into a triple store like Sesame or Virtuoso. This of course means that you want to have the data available as RDF. Rather than inventing my own framework to do all of this, I found the Aperture Framework. Aperture is a simply amazing framework that can extract RDF statements from just about anything. Of course, the one thing that is missing is XMP support. So, I set out to implement my own Extractor that can suck out the entire XMP packet as RDF. It’s based on the work started in the XML Graphics Commons project, but modified significantly so that it pulls out the RDF data. Once extracted, it’s very easy to store the statements in a triple store and execute SPARQL queries against them.
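
To make "the XMP packet as RDF" a bit more concrete, here is a rough sketch that turns the simple properties of a packet into subject/predicate/object triples using only the JDK's XML parser. This is not the XMPExtractor and not Aperture's API; the class and method names are hypothetical. It is also far from a complete RDF/XML parser: it only handles top-level `rdf:Description` elements and plain literal values, and it skips structured values such as `rdf:Bag`, `rdf:Seq` and `rdf:Alt`.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class XmpStatements {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns {subject, predicate, object} triples for simple literal properties. */
    static List<String[]> simpleStatements(String packet) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Packets come from untrusted files, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(packet)));
            List<String[]> out = new ArrayList<>();
            NodeList roots = doc.getElementsByTagNameNS(RDF_NS, "RDF");
            for (int r = 0; r < roots.getLength(); r++) {
                for (Node n = roots.item(r).getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (n instanceof Element && RDF_NS.equals(n.getNamespaceURI())
                            && "Description".equals(n.getLocalName())) {
                        collect((Element) n, out);
                    }
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    private static void collect(Element desc, List<String[]> out) {
        String subject = desc.getAttributeNS(RDF_NS, "about"); // "" means "this document"
        // Properties written in attribute form, e.g. xmp:Rating="5".
        NamedNodeMap attrs = desc.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            String ns = a.getNamespaceURI();
            if (ns == null || ns.equals(RDF_NS) || a.getNodeName().startsWith("xmlns")) continue;
            out.add(new String[] {subject, ns + a.getLocalName(), a.getNodeValue()});
        }
        // Properties written in element form with plain text content.
        for (Node c = desc.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!(c instanceof Element) || c.getNamespaceURI() == null) continue;
            if (hasElementChild(c)) continue; // structured value: skipped in this sketch
            out.add(new String[] {subject, c.getNamespaceURI() + c.getLocalName(),
                    c.getTextContent().trim()});
        }
    }

    private static boolean hasElementChild(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        String packet = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String[] t : simpleStatements(packet)) {
            System.out.println("<" + t[0] + "> <" + t[1] + "> \"" + t[2] + "\"");
        }
    }
}
```

In a real setup you would hand the packet to a proper RDF/XML parser and let the triple store ingest the result; the sketch is only meant to show that the statements are already sitting in the packet, waiting to be read.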

Right now, the XMPExtractor can read XMP from the following formats:

  • JPEG Images (image/jpeg)
  • TIFF Images (image/tiff)
  • Adobe DNG (image/x-adobe-dng)
  • Portable Network Graphic (image/png)
  • PDF (application/pdf)
  • EPS, PostScript, and Adobe Illustrator files (application/postscript)
  • QuickTime (video/quicktime)
  • AVI (video/x-msvideo)
  • MPEG-4 (video/mp4)
  • MPEG-2 (video/mpeg)
  • MP3 (audio/mpeg)
  • WAV Audio (audio/x-wav)

On the downside, I’ve found that if you use the XMPExtractor with a Crawler, you’ll run into some problems with Adobe Illustrator files. The problem is that the PDFExtractor mistakes these files for PDFs and then fails. But as long as you’re not using Illustrator files, you should be OK. There are also a few nitpicks with JPEG files and the JpgExtractor, in that the sample files included in the XMP SDK are flagged as invalid JPEG files. However, every JPEG file I created from Photoshop and iPhoto seems to work fine. After a little more testing, I’ll look at offering the extractor up as a contribution to the project.