My Linked Data publishing ‘platform’

Among the goals I had in mind for Companjen.name were to publish (parts of) my family tree so that others can benefit from it (without being bound to specific collaborative genealogy websites), and to play around with linked data (i.e. having a webspace to publish my own ‘minted’ URIs with data). I believe the second goal has been completed (and that the first can be achieved using the second).

Linked Data

Linked Data is based on using Uniform Resource Identifiers (URIs) for online and offline resources. These URIs are dereferenceable via HTTP, so that if the resource itself cannot be returned, at least useful information (i.e. metadata) about it is. The machine-readable data format of choice is RDF, which should be serialized as RDF/XML (because that is the one serialization every RDF parser is expected to read) and any other serialization I wish. For human agents it may be nice to have a data representation in HTML.

URI design

Because every URI is an identifier, I want to make sure none of them break. I want the URIs I use to identify resources to be recognizable as such, and they need to be in my domain. Therefore I chose to have all URIs that may be used in my Linked Data start with “http://companjen.name/id/”. (Resources can have many identifiers, so I can easily add another one to resources that already have URIs.)

What comes after the namespace prefix can take many forms; I haven’t decided yet. I do think it is nice to reserve filetype extensions for the associated data representations, i.e. “.html” for HTML, “.rdf” for RDF/XML and “.ttl” for Turtle documents.
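To make the naming scheme concrete, here is a minimal sketch in Python (not the PHP the site actually runs). The function name representation_uris is made up for illustration, and the media types are the usual ones for these formats rather than something the platform is known to send.

```python
# Namespace prefix chosen for all identifiers (from the text above).
BASE = "http://companjen.name/id/"

# Reserved extensions and the media type each one is assumed to stand for.
EXTENSIONS = {
    ".html": "text/html",
    ".rdf": "application/rdf+xml",
    ".ttl": "text/turtle",
}


def representation_uris(local_name):
    """Return the resource URI and a {document URI: media type} mapping."""
    resource = BASE + local_name
    documents = {resource + ext: mime for ext, mime in EXTENSIONS.items()}
    return resource, documents


resource, documents = representation_uris("BC")
for document_uri, media_type in documents.items():
    print(document_uri, media_type)
```

The resource URI itself never carries an extension; only the documents that describe it do.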

How it works

My hosting provider allows me to use PHP, .htaccess files and MySQL, all of which I used to create the “platform”. It is composed of the PHP Content Negotiation library from ptlis.net, the PHP RDF library ARC2, two custom PHP scripts and a .htaccess file.

All URIs that I want to use share the path “/id/”, but I don’t want to keep HTML, RDF/XML and Turtle files for every resource. So I wrote some RewriteRules (helped by looking at Neil Crosby’s beginner’s guide) in the .htaccess file in the document root that hand each request to a content-negotiating PHP script. That script lets the Content Negotiation library determine the best content type based on the Accept header in the HTTP request, and then sends the client to the URI appended with .rdf, .ttl or .html via HTTP 303 See Other.
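The negotiation step can be sketched roughly as follows. This is a simplified stand-in written in Python, not the ptlis.net library or the actual script: it only looks at q-values and exact media types, ignores partial wildcards such as text/*, and the HTML fallback and the Vary header are my own assumptions.

```python
# Media types the platform can serve, mapped to the URI suffix to redirect to.
SUFFIX_FOR_TYPE = {
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "text/html": ".html",
}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header, default="text/html"):
    """Pick the best supported media type; fall back to HTML (assumption)."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in SUFFIX_FOR_TYPE:
            return media_type
        if media_type == "*/*":
            return default
    return default


def see_other(resource_uri, accept_header):
    """Return (status, headers) for the 303 redirect to the chosen document."""
    suffix = SUFFIX_FOR_TYPE[negotiate(accept_header)]
    return 303, {"Location": resource_uri + suffix, "Vary": "Accept"}


print(see_other("http://companjen.name/id/BC", "text/turtle, text/html;q=0.5"))
```

A client that asks for Turtle is sent to the .ttl document, while a browser that lists text/html first ends up at the .html document.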

The HTTP client will then look up the new URI. Since the requested path still contains “/id/”, mod_rewrite catches this request too, but another rule points it to a PHP script that queries the ARC triplestore and puts the result in the requested format (RDF/XML and Turtle are created by ARC itself, HTML is created by filling a template).
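The second script's job can be outlined like this. It is only a sketch in Python of the flow described above: describe and serializers are placeholders standing in for the triplestore query and for ARC's serializers, and the 404 for an unknown extension is my own assumption.

```python
# Extension to Content-Type mapping for the three document flavours.
CONTENT_TYPES = {
    ".rdf": "application/rdf+xml",
    ".ttl": "text/turtle",
    ".html": "text/html",
}


def split_document_uri(uri):
    """Split a document URI into (resource URI, content type or None)."""
    for extension, content_type in CONTENT_TYPES.items():
        if uri.endswith(extension):
            return uri[: -len(extension)], content_type
    return uri, None


def handle_document_request(uri, describe, serializers):
    """Return (status, headers, body) for a request for a document URI.

    describe: callable taking a resource URI and returning its triples.
    serializers: dict mapping a content type to a callable that turns
    triples into a response body.
    """
    resource, content_type = split_document_uri(uri)
    if content_type is None:
        return 404, {}, ""
    triples = describe(resource)
    body = serializers[content_type](triples)
    return 200, {"Content-Type": content_type}, body


print(split_document_uri("http://companjen.name/id/BC.ttl"))
```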

What you get when you look up something in the /id/ space is the result of a simple “DESCRIBE <URI>” query to the triplestore, which is somewhat limited: it will only return triples with <URI> as subject. This gives some context (one of the principles of Linked Data), but it may also be very interesting to know in which triples the resource is used as object or property (if applicable).
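For illustration, here is how the two queries could be built as strings in Python. The CONSTRUCT variant is only one possible way to also fetch triples that mention the resource as property or object; I have not tested whether ARC accepts it, and the function names are made up.

```python
# Characters that may not appear inside <...> in a SPARQL query.
FORBIDDEN = set('<>"{}|^`\\ ')


def check_uri(uri):
    """Reject URIs that would break out of the <...> brackets."""
    if any(ch in FORBIDDEN or ord(ch) < 0x20 for ch in uri):
        raise ValueError("unsafe character in URI: %r" % uri)
    return uri


def describe_query(uri):
    """The query the platform currently sends, per the text above."""
    return "DESCRIBE <%s>" % check_uri(uri)


def mentions_query(uri):
    """Sketch: also fetch triples with the URI as object or property."""
    u = "<%s>" % check_uri(uri)
    return (
        "CONSTRUCT { %(u)s ?p1 ?o1 . ?s2 ?p2 %(u)s . ?s3 %(u)s ?o3 }\n"
        "WHERE {\n"
        "  { %(u)s ?p1 ?o1 }\n"
        "  UNION { ?s2 ?p2 %(u)s }\n"
        "  UNION { ?s3 %(u)s ?o3 }\n"
        "}" % {"u": u}
    )


print(describe_query("http://companjen.name/id/BC"))
print(mentions_query("http://companjen.name/id/BC"))
```

Each solution binds the variables of only one UNION branch, so a standards-following SPARQL engine should emit just the matching triple pattern from the template for that solution.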

Future work

Apart from making the results more interesting by returning triples that have the URI in the property or object part, there is more to do to mature the platform.

First and foremost: fill the triplestore. There are things that I’d like to publish myself, instead of giving them away to commercial parties from whom I can only access them through controlled APIs. I already mentioned my family tree, but another example is concerts I visit. Let Last.fm, Songkick, Resident Advisor get that info from my triplestore, so that I only have to create the info once and keep control over it. Or maybe the concert venue will find my data on Sindice and display my review on the concert’s page. Oh, the possibilities of the Semantic Web…

As more data becomes available in the triplestore, it makes sense to describe the different datasets using the Vocabulary of Interlinked Datasets (VoID) and put a link to the VoID document at the .well-known location. My family tree will be a nameable dataset, for example, with links to DBpedia, perhaps GeoNames and perhaps eventually online birth, marriage and death records.

The current HTML template is a table with columns Subject, Property and Object. A templating engine that has templates for different resource types would be a nice start, so that e.g. a person in my family tree will be displayed with a photo, birth and death dates, and the symbols that genealogy websites usually use (e.g. “⚭” for marriage). Maybe there are browsers/editors for linked data family trees already, but looking for them is also future work.
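Choosing a template by resource type could look something like the sketch below, in Python and with triples as plain (subject, property, object) string tuples. Using foaf:Person and foaf:name for people is purely an assumption on my part, as are all function names; the generic fallback mirrors the current three-column table.

```python
from html import escape

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed vocabulary for people; the real data may well use something else.
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def render_table(uri, triples):
    """Generic template: one table row per triple."""
    rows = "".join(
        "<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
        % tuple(escape(term) for term in triple)
        for triple in triples
    )
    header = "<tr><th>Subject</th><th>Property</th><th>Object</th></tr>"
    return "<table>" + header + rows + "</table>"


def render_person(uri, triples):
    """Person template: the name as heading, then the generic table."""
    names = [o for s, p, o in triples if s == uri and p == FOAF_NAME]
    heading = names[0] if names else uri
    return "<h1>%s</h1>" % escape(heading) + render_table(uri, triples)


# Registry of type-specific templates; anything else gets the table.
TEMPLATES = {FOAF_PERSON: render_person}


def render(uri, triples):
    """Pick the first template registered for one of the resource's types."""
    for s, p, o in triples:
        if s == uri and p == RDF_TYPE and o in TEMPLATES:
            return TEMPLATES[o](uri, triples)
    return render_table(uri, triples)
```

Adding a template for another resource type then only means registering one more function in TEMPLATES.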

Now to ‘mint’ a URI for myself: http://companjen.name/id/BC. Look it up if you like!