Categories
General

Can you really refine DC elements in HTML as explained in the Metadata MOOC?

I originally posted this as a question in the discussion forums of the Coursera Metadata MOOC on 2014-07-30. So far I have received no written responses at all, but I did get 5 points (upvotes) for posting. I hope someone outside the MOOC may be able to answer, and please correct me if I'm wrong. If my concerns are justified, perhaps this may trigger a bit of a change in the course materials. I'm wary that in the coming video lecture series on Linked Data Dr. Pomerantz will still confuse subject and object, but I'll wait and see if anything has changed since last year's session.

I'm having trouble understanding why using DC.date.modified as the name in a <meta> element is the correct way of refining the date element in HTML (see the example after the list below), for two reasons:

  1. I cannot find anything explaining it in the current specifications.
  2. I would be using elements in someone else's namespace that have not been defined in that namespace (or anywhere else), wouldn't I?
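
This is the construct in question (the date value is just an illustration):

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" >
<meta name="DC.date.modified" content="2014-07-30" >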

Longer version:

It appears that the document Element Refinement in Dublin Core Metadata (linked next to the 2-7 lecture about Qualified Dublin Core) only describes element refinement using the RDF model. I know the semantics of rdfs:subPropertyOf and it makes sense to me to express refinements this way.

For learning how to express Dublin Core elements in HTML, the document refers to the obsolete http://dublincore.org/documents/dcq-html/, which has a single reference to refining DC elements by appending a period plus the refinement, namely in the section that describes how the spec is (in)compatible with previous and other versions:

Note that previous versions of this document (and other DCMI HTML-encoding documents) made some different recommendations to those found in this document, as follows:
(…)
Previous recommendations specified prefixing element refinements by the element being refined, for example ‘DC.Date.modified’ rather than ‘DCTERMS.modified’.
(…)
These forms of encoding are acceptable but are no longer considered the preferred form.
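
Encoded in HTML, the contrast between the two forms mentioned in that quote would look roughly like this (my own sketch, not an example taken from the spec):

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" >
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" >
<meta name="DC.Date.modified" content="2014-07-30" >   <!-- older recommendation -->
<meta name="DCTERMS.modified" content="2014-07-30" >   <!-- preferred form -->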

The document that replaces it, Expressing Dublin Core metadata using HTML/XHTML meta and link elements, has no references to or examples of refined elements at all. It does make clear how to interpret the properties as URIs:

In a DC-HTML Prefixed Name, the prefix is the part of the name preceding the first period character; the remainder of the string following the period is treated as the "local name" and appended to the "namespace URI". If a DC-HTML Prefixed Name contains more than one period character, the prefix is the part preceding the first period, and the local name is the remainder of the name following the first period, and any subsequent period characters simply form part of the local name.

In the following example the DC-HTML Prefixed Name "XX.date.removed" corresponds to the URI http://example.org/terms/date.removed

<link rel="schema.XX" href="http://example.org/terms/" >
<meta name="XX.date.removed" content="2007-05-05" >

If I were to apply this to a property whose namespace is the DC Element Set namespace, I'd get a URI that is in the DC namespace, but that is not defined in that namespace.

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" >
<meta name="DC.title.serial" content="Services to Government" >

With this markup I state that the value of the property http://purl.org/dc/elements/1.1/title.serial of the described resource is "Services to Government". That property is not defined anywhere. The same goes for the examples in the lectures and homework: by just appending some string that may or may not make sense to other humans, you can't really expect to refine a property, can you?

So I’m puzzled. What document(s) explain this way of refining (well-)defined elements?

After continuing my search: Is it RFC 2731 from 1999 by Kunze, Encoding Dublin Core Metadata in HTML? Section 6 indeed explains the use of

<meta name    = "PREFIX.ELEMENT_NAME.SUBELEMENT_NAME" ... >

for "qualifying" elements, but a few lines below, it says:

Note that the qualifier syntax and label suffixes (which follow an element name and a period) used in examples in this document merely reflect current trends in the HTML encoding of qualifiers. Use of this syntax and these suffixes is neither a standard nor a recommendation.

Categories
Linked (Open) Data

Bibliogs

How about a(nother) community-built database of bibliographic data?

One that is not limited to books or scientific journal articles, but also aims to identify the related people, organisations, and perhaps even the rooms in which conferences were held. One that aims to connect the dots rather than re-mint URIs for everything that already has a URI, but that recognises the possibility that referenced items may change over time.

I’m sure I can’t pull this off overnight by myself, but let me try to at least see if there is interest. Comments and ideas are welcome.

Oh, I already paid for Bibliogs.info, which may make clear to Discogs.com users what I sort-of have in mind, or may make Discogs.com send me a “don’t steal our name” notice.

Categories
Technology

More Open Library

After some initial playing, I have started work on a VacuumBot for Open Library. It is supposed to be a general bot that can clean up some of the mess I found in the data dump of January 2012.

By now I have compiled several lists of dirty data and key counts in this Gist (a sketch of the kind of pipeline behind them follows after the list). Among them are:

  • count of keys found in author records*
  • count of keys found in edition records*
  • count of keys found in work records
  • count of identifiers found in edition records
  • count of classifications found in edition records

*: there are a few edition records that have the type “/type/author”. Hence they were treated as author records. The Open Library website also tries to render them as authors, resulting in “name missing”. It’s possible that there are authors who are stored and treated as editions, but I haven’t looked for them yet.
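
The counts were produced with simple command-line tools. As a rough sketch, counting the keys in author records can be done along these lines, assuming the tab-separated dump layout (record type in the first field, the JSON record in the last) described in an earlier post; this is an approximation, not the actual VacuumBot code:

$ awk -F'\t' '$1 == "/type/author" { print $5 }' ol_dump_2012-01-31.txt \
    | grep -oE '"[A-Za-z0-9_]+":' | sort | uniq -c | sort -rn > author-key-counts.txt

The grep pattern also picks up the keys of nested objects, so the counts are rough, but good enough to spot the mess.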

Besides records being mixed up, there are also classifications stored as identifiers and identifiers stored as classifications. Without proper documentation, non-librarians like me cannot tell which identifier or classification to choose when there is both an identifier "LC Control Number" and a classification "LCCN permalink"; the same holds for a classification "Library of Congress" and an identifier "LC Classification (LCC)". From the compiled lists it is clear that this is a hard problem. Below are identifiers found in the dump that have something to do with the LoC, with their total number of occurrences and the number of records they appear in:

  • library_of_congress (2 / 2)
  • library_of_congress_catalog_card_no. (4 / 4)
  • library_of_congress_catalog_card_number (4 / 4)
  • library_of_congress_catalog_no. (5 / 5)
  • library_of_congress_catalogue_number (6 / 6)
  • library_of_congress_classification_(lcc) (62 / 61)

The numbers aren't that high, but they show that no cleaning has been performed yet.

One problem in the OL data is the appearance of organizations as authors. Somehow a batch of bad records was imported, which created these author records. There is no information in those records that allows them to be treated differently (like a flag saying that the author is a corporate entity, so that adding a birth date makes no sense). Or is there? Some author records have an "entity_type" key. The most popular value for that key is "org", the second most popular is "person" (followed by "author", "Writer" and "Pseudonym"). There are 977 different values in total, so the key must have played a different role than just disambiguating between organizations and persons. I also found three cases of spam: the entity_type values "jerk", "c*nt" (censored) and "die" (a death threat?).
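
Counting the entity_type values is a one-liner too; a rough sketch (not necessarily the exact command I used):

$ grep -oE '"entity_type": "[^"]*"' ol_dump_2012-01-31.txt | sort | uniq -c | sort -rn > entity-types.txt
$ wc -l entity-types.txt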

Did I say this is a work in progress yet? If you like, follow or fork the work on my GitHub fork.

Categories
Technology

Playing with the Open Library

Lately I've been playing (it's not 'working' yet) with Open Library and its data. I'm even on the discuss and tech mailing lists, have forked the GitHub repository and submitted a pull request.

Why?

Just like I use Discogs.com to keep track of my CDs, I thought I could use an online editable catalog for my books. And if I need to add data to the online catalog, it should be open. Discogs releases its data into the public domain, which is a good thing in case the Discogs company should ever cease to exist and the website go down. Moreover, we all benefit from open catalogs. Because I was thinking of building my own catalog as linked data, integrating the two data sets using Linked Data shouldn't be too hard. Open Library offers its data in the public domain as RDF (and JSON and OPDS), so this could work. My catalog should at least have URIs for the items on my shelves, properties linking those items to the manifestations on Open Library, and any information that I want to add myself, such as the date an item entered my collection, and perhaps its cost and source.

“Improving Open Library RDF output”

There were some things in the RDF that didn't make sense to me. So I filed a support request, and quickly received a request in return: to join the OL tech mailing list and start a discussion about changes to the RDF output. That sounded kind of exciting: talking to and with the developers of Open Library and perhaps even getting my ideas into an open source and open data project. I thought my ideas were good and clear, and thus easily implemented. In reality it has been taking some time…

One of the most important concepts in Linked Data is the Uniform Resource Identifier (URI). Every resource should get a unique URI, so that anyone can refer to that resource using its URI. In Open Library, there was no consistent URI structure for authors, works and editions. That is, sometimes the URI references had a "/" at the end, and sometimes they didn't. I got confused and asked whether these were different identifiers, and indeed they are. So a software agent could not tell that both URIs referred to the same resource.

For URIs to be useful as identifiers, you need to treat them as identifiers and not just as strings. In the RDF output of a work, each edition is connected to the work via the work's URI. But instead of using the work URI as a URI reference, it is printed with extra quotes, as a literal. When I brought this up on the tech list, the discussion quickly moved from "using the right RDF" to "using the right FRBR relationship between edition and work" (a discussion that apparently hasn't been settled yet).
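
To illustrate the difference, in Turtle and with made-up keys and a made-up property name (this is not the actual template output): in the first statement the work is just a string of characters, in the second it is a resource that a software agent can follow and reason about.

@prefix ex: <http://example.org/terms/> .

<http://openlibrary.org/books/OL0000001M> ex:work "http://openlibrary.org/works/OL0000001W" .
<http://openlibrary.org/books/OL0000001M> ex:work <http://openlibrary.org/works/OL0000001W> .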

I don't want to list all the issues I had with the RDF templates; there are just too many… On the content level, however, there are some interesting issues, probably caused by the underlying data model. For example: the RDF template writes the contents of both the "contributors" field and the "contributions" field to the output. Why are there two fields for apparently the same content? (I think I know the answer, but I only have my experimental results below to back it up.)

Digging deeper

So to improve the RDF output, I needed a better understanding of the data model that feeds the templates. After surfing the Open Library help and developer pages looking for documentation on the data model, using DuckDuckGo and Google to try to find a database schema, and noticing that the schemas I did find carried a little "out of date" label, I finally found /type. This is the current schema that the Infogami and Infobase 'layers' of Open Library use. It is pretty straightforward, although not complete (there is no field in /type/edition that matches the contributors field used in the RDF template, for example). Most fields and types make sense, but not all. And when trying to understand those, you realize how much you miss proper documentation.

A big drawback of the schema is that lots of data, like dates, physical dimensions and locations, are stored as plain strings instead of structured fields. I guess this is because most of the data comes in as strings from MARC records. (Other people don't like this either.) The field type for storing URIs, /type/uri, apparently is not used; URIs are stored as strings too.

I sent an email to the tech list asking for clarification on /type/edition.contributions[] and /type/edition.contributors[], which seem to be meant for the same information; on /type/edition.distributors[], which doesn't appear in the edit form; and on /type/author.uris[], which is not documented, but could serve as a container for other URIs for the resource, i.e. the URIs that are <owl:sameAs> the OL URI. I haven't had a reply yet, so I thought I'd have to find out for myself.

A little data wrangling

I downloaded the 8 GB gzipped dump file of January 2012, which turned out to be 35 GB after unzipping. Every line consists of five tab-separated fields: record type, record key, record revision, datetime of last modification, and the record itself as JSON. The streaming editor GNU sed was the tool of choice on my moderate home computer, as it can handle the .txt file easily. It loads line after line, processes it, and prints it if I tell it to, using regular expressions as search patterns. That makes it perfect for extracting records from the dump.
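
As a first sanity check, counting the records per type only needs the first field of each line; something like this will do (a sketch, not the exact command I ran):

$ cut -f1 ol_dump_2012-01-31.txt | sort | uniq -c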

Distributors?

$ sed -nre '/"distributors": \[/p' ol_dump_2012-01-31.txt

(i.e. print lines that contain '"distributors": [') yielded nothing, so no records contain a distributors field. Was this field meant to be used only programmatically? I don't know.

Contributions / contributors

I used similar commands to extract records containing the contributions and/or contributors fields. The contributors.txt file contains 14996 records in about 24 MB. The contributions.txt file is 8 GB, so Notepad++ would have choked on it had I tried to open it to count lines…

$ wc -l contributions.txt
7439623 contributions.txt

So 7439623 records contain a contributions field. /type/edition.contributions[] is defined as an array of strings. Looking at the contributors, I see JSON objects containing a 'role' and a 'name' field. So that explains how name and role are stored: structured! It still doesn't explain why the contributors field is not listed under /type/edition, though.
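
Side by side, the two flavours look something like this (illustrative values, not actual records from the dump):

"contributions": ["Translated by John Doe"]
"contributors": [{"role": "Translator", "name": "John Doe"}]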

The JSON records can be extracted by throwing away everything up to, but excluding, the first curly bracket:

$ sed -nre 's/^[^{]*//p' contributors.txt > contributors.json
$ sed -nre 's/^[^{]*//p' contributions.txt > contributions.json

Google Refine with the RDF extension is a nice tool to 'wrangle' contributors.json, so I loaded the file and found just under 300 different roles in the 14996 records, which had 25643 contributors in total. That number of roles may seem large to someone not involved with book publishing, but it is not a problem per se. However, the list is messy: several roles were just translations, typos or other variations of one another. With some effort I managed to reconcile about 24000 contributors' roles to roles on the Library of Congress MARC code list for relators (which can already be used in a Linked Data environment). This may not be the best way forward, as the MARC list is limited, but it was the only documented source of relators that I could find.
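
For a quick look at the distinct roles outside Refine, the command line works too; a rough sketch (the actual reconciliation was done in Google Refine):

$ grep -oE '"role": "[^"]*"' contributors.json | sort | uniq -c | sort -rn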

Links to other websites / (linked) data sources

The author record contains a field labeled "wikipedia", which may be a relic of a time in which only the English Wikipedia was thought relevant (this is pure speculation), but which can no longer be edited manually in the edit form. All links now go into a "links" array, with a label for each link. Links to the English Wikipedia all have URLs starting with "http://en.wikipedia.org", so they can be filtered easily.

Also useful are links that point to the Virtual International Authority File (VIAF), which contains authority records from quite a few national libraries, so that authors can be identified quite accurately. Since many 'authors' in Open Library are, by mistake, organizations instead of people, matching these 'authors' against their 'corporate entity' records in VIAF may be a way to separate them from the human authors.

$ sed -nre '/"wikipedia": "/p' ol_dump_2012-01-31.txt > wikipedia-field.txt
$ sed -nre '/"url": "http:\/\/en\.wikipedia\.org/p' ol_dump_2012-01-31.txt > wikipedia-english.txt
$ sed -nre '/http:\/\/viaf\.org/p' ol_dump_2012-01-31.txt > viaf.txt

What does all this yield?

$ wc -l *.txt
 7439623 contributions.txt
   14996 contributors.txt
47796615 ol_dump_2012-01-31.txt
      38 viaf.txt
    1294 wikipedia-english.txt
    1864 wikipedia-field.txt
55254430 total

So January's data dump contains almost 48 million records in total (authors, editions and works). There are many, many more strings containing some unstructured form of contribution attribution than there are structured contribution attributions (of which one is an ISBN and one a single character). There are still more links in the old(?) "wikipedia" field than there are general links to the English Wikipedia in the new "links" array, and there are only 38 links to VIAF.

Stop playing, start working

It seems there is a lot of work (not just playing) to do if Open Library is to sit between VIAF, DBpedia and other authoritative linked data sources in the LOD Cloud. By the way: should it? Leave your comments, please!

Categories
Meta

Now with Dublin Core and translations!

As a fan of semantics on the web, I couldn't let my own blog posts go without at least some basic metadata. Thanks to the Dublin Core for WordPress plugin, basic metadata (title, author, publication date, language of the post, etc.) is now automatically inserted into all blog posts. See for yourself under "View Page Info".

I was proud to see the metadata all there. Except for the language: even for Dutch posts it was "en-US" (American English). There didn't appear to be any option to change that language or to turn off the inclusion of language metadata, let alone to set the language per post.
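
In Dublin Core terms, the culprit is a meta element like this one (reconstructed by hand, not a literal copy of the plugin's output):

<meta name="DC.language" content="en-US" >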

But plugins exist for that too. I chose Polylang after reading its description. For each post I now set its language and which other post contains its translation (if there is one). Category names and the blog's tagline now also have translations. I'm still getting used to the new navigation around the blog (on your first visit, the blog language is probably determined by your preferred language settings) and to having to write posts twice. Don't expect me to add a third language.

Categories
Meta

Now with Dublin Core and translations!

As a fan of semantics on the web, I cannot let my own pieces of text go without at least some general metadata. Thanks to the WordPress plugin Dublin Core for WordPress, basic metadata (title, author, date of publication, language of the post, etc.) is now added to every post. Have a look under "Page info" (in an English-language Firefox) and see the data there.

I was proud, too, when I found that data there. Except for the language: for a Dutch post, the language given was "en-US", i.e. American English. There seemed to be no setting to change that language, to keep the language from being inserted into the pages at all, let alone to indicate per article in which language it was written.

But there are plugins for that too. Based on its description, I chose Polylang. For each article I now indicate in which language it is written and which other article is its translation (provided that article exists). Categories and the blog's tagline are now bilingual as well. I still have to get used to the navigation between the languages (on your first visit, the language is probably based on the language you have set as your preferred one) and especially to the fact that I now have to write pieces twice. So don't expect a third language option.