Terms of Semantics

From the “just throwing it out there” dept. in cooperation with the “I’m too lazy to do some research into existing efforts” dept.

It is generally known that people rarely read the terms of use and privacy statements of all the parties involved in providing a service. South Park used that knowledge as a storyline in The Human Cent-iPad.

One of the reasons for ignoring the Terms of Use (ToU) is their excessive length and use of legalese: they contain too much incomprehensible language, and you need a lawyer to fully understand the rights and obligations that come with the service.

Yet many services share common characteristics, like the jurisdiction whose laws govern the service conditions, the definitions of service provider and consumer, and definitions of content, ownership and other rights to the content.

Isn’t it possible to encode these definitions in a standardised set of terms?

If various (similar) services provide the definitions of their services in standardised terms, they could more easily be understood and compared. It would help non-human agents to select the best services for themselves and their human controllers.

More thought is needed.

Can you please remove ‘meaningful punctuation’ from field contents, librarians?

Dear Cataloguing librarians,

It is time to realise that using punctuation to mark sub-field boundaries is bad practice. You should not want to put the title and the “responsible entity” in one field and then try to split the field contents on punctuation like “ / ”. You should not want to use an author’s full name in reverse order, plus year of birth and year of death (if applicable), to identify a person – certainly not if you also allow an optional “.”.

You need to understand: the machines are not smart enough yet to understand your cataloguing rules and therefore they don’t get the meaning of what you put in the fields. Even the ones at OCLC are not smart enough yet.

What drove me to write this was this example: Linked Data about Works published by OCLC. It is buzzing and – I agree with Ed Summers – pretty cool. The data structure and semantics can be improved, as Richard Wallis of OCLC says in a blog post. The example that Ed used in his blog post, “Weaving the Web” by Tim Berners-Lee, demonstrates my issue (which is not touched upon by Ed or Richard).

The work’s title is shown as:

Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor /

Yes, Tim himself said he would have gotten rid of the two forward slashes after http: in URIs, had he had the chance to start over, but the slash at the end of the title was not Tim’s intent. I bet you put “Tim Berners-Lee” or even “by Tim Berners-Lee” after that slash in the 245 field of your MARC record.
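To make the machine's problem concrete, here is a minimal sketch of what a data consumer ends up doing: splitting a 245 field value on ISBD punctuation and stripping a dangling terminator. The parsing rules below are my guess at common practice, not an official algorithm; real records have many more edge cases.

```python
# Sketch: undoing ISBD punctuation in a MARC 245 field value.
# These rules are an illustration, not a complete or official parser.

def split_245(field: str) -> dict:
    """Split a 245 field value on the ISBD ' / ' and ' : ' markers."""
    title_part, _, responsibility = field.partition(" / ")
    title, _, subtitle = title_part.partition(" : ")
    return {
        "title": title.strip(),
        "subtitle": subtitle.strip() or None,
        "responsibility": responsibility.strip() or None,
    }

def strip_isbd_tail(value: str) -> str:
    """Drop a dangling ISBD terminator ('/' or ':') left at the end."""
    return value.rstrip("/:").strip()

record = ("Weaving the Web : the original design and ultimate destiny "
          "of the World Wide Web by its inventor / Tim Berners-Lee")
shown = ("Weaving the Web : the original design and ultimate destiny "
         "of the World Wide Web by its inventor /")

print(split_245(record))
print(strip_isbd_tail(shown))
```

None of this guessing would be needed if the title, subtitle and statement of responsibility were stored in separate fields to begin with.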

Second point, from the same example, the authors. And contributors, and creators. I know the temporary URIs will be replaced by VIAF URIs, but OCLC will still need to map…

"creator" : [ "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim", "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim_1955" ]

… to the one and only Tim Berners-Lee (who co-authored the book). In this example that should be easy, as there aren’t many people called Tim Berners-Lee on the planet and there is only one with a very strong connection to “the Web”, but the general case is not that simple. (You need context for that, and even then there is a chance that you make incorrect matches. You may find some context for this in my thesis.)
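A naive normaliser shows why these two URIs look like duplicates: they differ only in an appended birth year. To be clear, this is a sketch of the symptom, not a safe matching strategy – as argued above, real entity resolution needs context.

```python
import re

# Sketch only: collapse name-based entity URIs that differ only in an
# appended birth year. NOT a safe general matching strategy.

def local_key(uri: str) -> str:
    name = uri.rsplit("/", 1)[-1]        # e.g. 'berners_lee_tim_1955'
    return re.sub(r"_\d{4}$", "", name)  # drop a trailing year

uris = [
    "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim",
    "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim_1955",
]
print({local_key(u) for u in uris})
```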

I’ll come back to you in some time to see how you’re getting on with fixing all of this. I’m counting on you!



Short review of the CSV ontology

This is interesting: the CSV ontology to describe the columns of a CSV file and the file itself. I can definitely see the value in rich descriptions of CSV files, or spreadsheets in general. But I’m also really tempted to ask: “if you use RDF for the file ‘header’, why not for the ‘body’ too?” There are various csv2rdf tools (although I haven’t used any), but TabLinker is the only one I know of that my colleagues are working on 🙂

Then I thought of file size: even Turtle files can easily grow larger than CSV files containing the same values. Moreover, support for CSV is more widely available, isn’t it?

The example use case (labeling and describing a file and its columns) also reminded me of ARFF, which embeds some metadata (comments on file level and field name & data type on column level) and allows sparse data, which could save bytes. But allowing only ASCII in the file makes the format pretty outdated. The XML-based XRFF allows the use of other encodings.

The CSV ontology itself needs a little revision, as some of the definitions are unclear (to me, at least), and the example CSV document contains spaces after commas (leading spaces are part of the field, according to RFC 4180). As an example of the unclarity: the definition of the property mapsTo is “Which RDF class values in the column map to”. This suggests the range of the property is rdfs:Class, but the examples all have a property as the object of mapsTo. If the range is in fact rdfs:Property, and if I am right that you could create triples following the pattern <[subject]> <[column mapped property]> <[value in cell]>, it is still unclear what the subject of such a triple would be: there is no property that defines a column as the subject of the triple pattern. I guess that is not trivial to define.
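The triple pattern I have in mind can be sketched as follows. Since the ontology does not say where the subject comes from, this sketch simply invents one (a URI per row number); the mapping and all URIs below are made-up illustrations, not part of the CSV ontology.

```python
import csv
import io

# Sketch of the triple pattern <subject> <column-mapped property> <cell value>.
# The per-row subject URI is an assumption: the CSV ontology leaves it undefined.

data = "name,homepage\nTim,https://www.w3.org/People/Berners-Lee/"
mapping = {                      # column header -> property URI (assumed)
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage",
}

def rows_to_triples(text, mapping, base="http://example.org/row/"):
    reader = csv.DictReader(io.StringIO(text))
    for i, row in enumerate(reader, start=1):
        subject = f"<{base}{i}>"            # invented subject per row
        for column, value in row.items():
            yield (subject, f"<{mapping[column]}>", f'"{value}"')

for t in rows_to_triples(data, mapping):
    print(" ".join(t), ".")
```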

Suddenly I’m reminded of Karma, which can interactively create mappings from the columns of a CSV file to RDF, supported by machine learning. I wonder if its mappings can be mapped to the CSV ontology?

Response to “Three reasons why the Semantic Web has failed”

Posted on http://gigaom.com/2013/11/03/three-reasons-why-the-semantic-web-has-failed/ as a comment (but at the time of posting it is still awaiting moderation).

I’d like to disagree with most of the article. Your argument “the Semantic Web has failed” does not follow from your “reasons”.
Sure, I’m pretty familiar with the Semantic Web and able to understand RDF (really, it’s not impossible to understand) and (most of) OWL, but that is not why I think a Synaptic Web can live next to a Semantic Web. To start: wouldn’t it be great for your streaming web interpreters to be presented with structured information next to unstructured text? Let it live on top of the Semantic Web (and the rest of the Web).

Do you want to exclude facts from knowledge? I, too, couldn’t care less about Leonardo da Vinci’s height, but if I see the Mona Lisa in Paris, I might want to know what else he painted and did and where I can see that. You need boring facts for that. Boring, but useful facts.
For human consumption “messages” are only part of knowledge. Take science for example. Science doesn’t only live in conversation; loads of scientific knowledge is transferred in documents.

The Semantic Web doesn’t depend on XML. Or JSON – although JSON-LD is gaining lots of ground. Human end users shouldn’t need to see raw facts in any text format, only developers. Turtle is the easiest to read and write by hand, I think, but eventually programmers will do that just as rarely as they read and write JSON.

We’re still a long way from having phones that measure brain activity to decipher our thoughts before they become pieces of knowledge consisting of concepts and, err, facts about things we do, want, and feel. In light of my privacy, I’d like my phone to not push my thoughts and activities to the Synaptic Web. It could ask specific questions to the Web that I would like answered, but those questions are likely to be based around concepts, time and place (“what museums are open around here tomorrow?”). That almost works and looks like keyword search.

I like the vision of a Synaptic Web (I heard the term for the first time today), but the claim that the Semantic Web has failed because people actually want a Synaptic Web was not proven today.

Linked Data at the University of Twente?

Of course the University of Twente should offer their data as Linked Open Data! It could be the first Dutch university to be a member of Linked Universities, although the data themselves are of course more important than being listed as a member of some website.

Data are all around the University, and some of them are already published. The event calendar is available in iCalendar format, the phone book is accessible through LDAP and a web form, and bibliographic data about research publications are searchable via a webpage and can be aggregated using OAI-PMH. The RKB Explorer already offers harvested data from both the University repository and the repository of the Faculty of Electrical Engineering, Mathematics and Computer Science as Linked Data, but those harvests are from 2009 and before.

Here’s a list of things that I think can be easily described using Linked Data:

  • buildings (including historical buildings, like Charlie)
  • rooms (lecture halls, offices, labs, dressing rooms, theatres, boiler rooms, etc.)
  • opening hours
  • squares, fields
  • streets
  • bus stops
  • parking spaces
  • points of interest (coffee machines, candy vending machines, fire extinguishers, etc.)
  • artworks (paintings, outside artworks)
  • people (staff, student body representatives, etc.)
  • study statistics (number of enrolled students, per programme, etc.)
  • organisational structures (faculties, research groups, research institutes, spin-off companies, facilities)
  • associations (student union, cultural and sports clubs, alumni associations, etc.)
  • scientific publications
  • scientific data
  • library catalogue
  • archive catalogue
  • study programmes (courses, requirements)
  • contracts (EU funding, cleaning, catering, coffee and candy vending machines, etc.)
  • events (lectures, meetings, concerts, dissertations, performances)

The university and the university library tend to think of “scientific data linked to the publications that are based on the data” when they hear Linked Data, but if all these data are available as RDF (or at least in some open data format, e.g. CSV), they can allow many more useful applications to be developed. Think of programme checks (do you need more or other courses?), appointment schedulers that account for walking distances between rooms, or bus stop and room, and visualisations of parts of buildings that produce the most publications. Integration of the phone book entries of people with their publications is easy, and creating filters for the event calendar is just as easy: it’s a matter of adjusting a SPARQL query.
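As an impression of how small such a calendar filter could be, here is a hypothetical SPARQL query for upcoming events in one lecture hall. Everything in it is an assumption for illustration: the schema.org modelling, the room name and the existence of an endpoint are not the University’s actual data.

```sparql
# Hypothetical event-calendar filter; vocabulary and data are assumed.
PREFIX schema: <http://schema.org/>

SELECT ?event ?name ?start WHERE {
  ?event a schema:Event ;
         schema:name ?name ;
         schema:startDate ?start ;
         schema:location ?room .
  ?room schema:name "Waaier 2" .
  FILTER (?start > NOW())
}
ORDER BY ?start
```

Swapping the room for an organiser, a date range or an event type is a one-line change – that is the kind of flexibility flat calendar feeds don’t give you.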

We’ll see what happens.

My Linked Data publishing ‘platform’

Among the goals I had in mind for Companjen.name were to publish (parts of) my family tree so that others can benefit from it (without being bound to specific collaborative genealogy websites), and to play around with linked data (i.e. having a webspace to publish my own ‘minted’ URIs with data). I believe the second goal has been completed (and that the first can be achieved using the second).

Linked Data

Linked Data is based on using Uniform Resource Identifiers (URIs) for online and offline resources that are dereferenceable via HTTP, so that at least useful information (i.e. metadata) about the resource is returned if the resource itself cannot be returned. The machine-readable data format of choice is RDF, which should be serialized as RDF/XML (because all RDF parsers must be able to read that) and in any other serialization I wish. For human agents it may be nice to have a data representation in HTML.

URI design

Because every URI is an identifier, we want to make sure they don’t break. I want the URIs I use to identify resources to be recognizable as such, and they need to be in my domain. Therefore I chose to have all URIs that may be used in my Linked Data to start with “http://companjen.name/id/”. (Resources can have many identifiers, so I can easily add another one to resources that already have URIs.)

What comes after the namespace prefix can take many forms; I haven’t decided yet. I do think it is nice to reserve filetype extensions for the associated data representations, i.e. “.html” for HTML, “.rdf” for RDF/XML and “.ttl” for Turtle documents.

How it works

My hosting provider allows me to use PHP, .htaccess files and MySQL, all of which I used to create the “platform”. It is composed of the PHP Content Negotiation library from ptlis.net, the PHP RDF library ARC2, two custom PHP scripts and a .htaccess file.

Since all URIs that I want to use have the same path “/id/”, but I don’t want to keep HTML, RDF/XML and Turtle files of every resource, I wrote some RewriteRules (helped by looking at Neil Crosby’s beginner’s guide) in the .htaccess file in the document root to redirect the request to a content negotiating PHP script. That script lets the Content Negotiation library determine the best content type based on the Accept header in the HTTP request and sends the user to the URI appended with .rdf, .ttl or .html via HTTP 303 See Other.
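The rewrite set-up can be sketched roughly as below. The script names are placeholders, not my actual files, and the patterns are simplified:

```apache
RewriteEngine On

# Bare /id/ URIs (no extension) go to the content-negotiation script...
RewriteRule ^id/([^.]+)$ /conneg.php?uri=$1 [L]

# ...which 303-redirects to /id/<resource>.rdf, .ttl or .html; those
# requests are answered by the script that queries the triplestore.
RewriteRule ^id/(.+)\.(rdf|ttl|html)$ /describe.php?uri=$1&format=$2 [L]
```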

The HTTP client will then look up the new URI. Since the requested path will still contain “/id/”, mod_rewrite will catch the request, but another rule points to a PHP script that queries the ARC triplestore and puts it in the requested format (RDF/XML and Turtle are created by ARC itself, HTML is created by filling a template).

What you get when you look up something in the /id/ space, is the result of a simple “DESCRIBE <URI>” request to the triplestore, which is somewhat limited: it will only return triples with <URI> as subject. This gives some context (one of the principles of Linked Data), but it may be very interesting to know in what triples the resource is used as object or property (if applicable).
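One way to widen the result would be to replace the DESCRIBE with a CONSTRUCT that also fetches the triples pointing at the resource. A sketch, not what the platform currently runs:

```sparql
# Return both outgoing and incoming triples for <URI>.
CONSTRUCT {
  <URI> ?p ?o .
  ?s ?q <URI> .
} WHERE {
  { <URI> ?p ?o . } UNION { ?s ?q <URI> . }
}
```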

Future work

Apart from making the results more interesting by returning triples that have the URI in the property or object part, there is more to do to mature the platform.

First and foremost: fill the triplestore. There are things that I’d like to publish myself, instead of giving them away to commercial parties from whom I can only access them through controlled APIs. I already mentioned my family tree, but another example is the concerts I visit. Let Last.fm, Songkick and Resident Advisor get that info from my triplestore, so that I only have to create the info once and keep control over it. Or maybe the concert venue will find my data on Sindice and display my review on the concert’s page. Oh, the possibilities of the Semantic Web…

As more data will become available in the triplestore, it makes sense to describe the different datasets using the Vocabulary of Interlinked Datasets (VoID) and put a link to the VoID document at the .well-known location. My family tree will be a nameable dataset, for example, with links to DBpedia, perhaps GeoNames and perhaps eventually online birth, marriage and death records.
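A VoID description of that dataset might start out like this. All URIs below are placeholders I have not minted yet:

```turtle
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://companjen.name/id/dataset/familytree>
    a void:Dataset ;
    dcterms:title "Companjen family tree" ;
    void:uriSpace "http://companjen.name/id/" ;
    void:sparqlEndpoint <http://companjen.name/sparql> .
```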

The current HTML template is a table with columns Subject, Property and Object. A templating engine with templates for different resource types would be a nice start, so that e.g. a person in my family tree will be displayed with a photo and birth and death dates, like genealogy websites usually do (e.g. “⚭” for marriage). Maybe there are browsers/editors for linked data family trees already, but looking for them is also future work.

Now to ‘mint’ a URI for myself: http://companjen.name/id/BC. Look it up if you like!