
Playing with the Open Library

Lately I’ve been playing (it’s not ‘working’ yet) with Open Library and its data. I’m even on the discuss and tech mailing lists, have forked the GitHub repository and have submitted a pull request.

Why?

Just like I use Discogs.com to keep track of my CDs, I thought I could use an online editable catalog for my books. And if I need to add data to the online catalog, it should be open. Discogs releases its data into the public domain, which is a good thing should the Discogs company ever cease to exist and the website go down. Moreover, we all benefit from open catalogs. Because I was thinking of using linked data for my catalog, integrating the two datasets using Linked Data shouldn’t be too hard. Open Library releases its data into the public domain as RDF (and JSON and OPDS), so this could work. My catalog should at least have URIs for the items on my shelves, properties linking those items to the manifestations on Open Library, and any information I want to add myself, such as the date an item entered my collection and perhaps its cost and source.
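To make this concrete, here is a minimal sketch of what such a catalog record could look like, using Python with rdflib. The namespace, property names and the Open Library edition key are all invented for illustration; only the general shape – an item URI pointing at an Open Library manifestation, plus some acquisition metadata – follows from the plan above.

import json  # not needed here, just rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

# Hypothetical namespace for my own shelf catalog (illustration only).
SHELF = Namespace("http://example.org/myshelf/")

g = Graph()
item = SHELF["item/42"]  # a URI for one physical book on my shelf

# Link the physical item to a (made-up) Open Library edition.
g.add((item, SHELF.manifestationOf,
       URIRef("http://openlibrary.org/books/OL1234567M")))

# My own additions: acquisition date, cost and source.
g.add((item, SHELF.dateAcquired, Literal("2012-01-15", datatype=XSD.date)))
g.add((item, SHELF.cost, Literal("9.95", datatype=XSD.decimal)))
g.add((item, SHELF.source, Literal("local second-hand bookshop")))

print(g.serialize(format="turtle"))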

“Improving Open Library RDF output”

There were some things in the RDF that didn’t make sense to me. So I filed a support request, and quickly received a request in return: to join the OL tech mailing list and start a discussion about changes to the RDF output. That sounded kind of exciting – talking to and with the developers of Open Library, and perhaps even getting my ideas into an open source and open data project. I thought my ideas were good and clear, and thus easily implemented. In reality it has been taking some time…

One of the most important concepts in Linked Data is the Uniform Resource Identifier (URI). Every resource should get a unique URI, so that anyone can refer to the resource using that URI. In Open Library, there was no consistent URI structure for authors, works and editions: sometimes the URI references had a “/” at the end, and sometimes they didn’t. I got confused and asked whether these were different identifiers – and indeed they are. So a software agent could not tell that both URIs referred to the same resource.

For URIs to be useful as identifiers, you need to treat them as identifiers and not just as strings. In the RDF output of a work, each edition is connected to the work via the work’s URI. But instead of emitting the work URI as a URI reference, the template prints it with extra quotes, as a Literal. When I brought this up on the tech list, the discussion quickly moved from “using the right RDF” to “using the right FRBR relationship between edition and work” (a discussion that apparently hasn’t been settled yet).
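The difference matters to machines. In rdflib terms (a convenient way to demonstrate – Open Library itself doesn’t use rdflib, as far as I know), a quoted string and a URI reference are simply different kinds of node, and the trailing-slash variants are different URIs altogether:

from rdflib import Literal, URIRef

work = "http://openlibrary.org/works/OL123W"  # a made-up work key

# Serialized as N3/Turtle, a URI reference and a literal look (and are) different:
print(URIRef(work).n3())   # <http://openlibrary.org/works/OL123W>
print(Literal(work).n3())  # "http://openlibrary.org/works/OL123W"

# URIs are compared as opaque identifiers, so the trailing slash matters too:
assert URIRef(work) != URIRef(work + "/")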

I don’t want to list all the issues I had with the RDF templates – there are just too many… On the content level, however, there are some interesting issues, probably caused by the underlying data model. For example: the RDF template writes the contents of both the “contributors” field and the “contributions” field to the output. Why are there two fields for apparently the same content? (I think I know the answer, but I only have my experimental results below to back it up.)

Digging deeper

So to improve the RDF output, I needed a better understanding of the data model that feeds the templates. After surfing the Open Library help and developer pages looking for documentation, and after using DuckDuckGo and Google to hunt for a database schema only to notice that every schema I found carried a little “out of date” label, I finally found /type. This is the current schema that the Infogami and Infobase ‘layers’ of Open Library use. It is pretty straightforward, although not complete (there is no field in /type/edition that matches the contributors field used in the RDF template, for example). Most fields and types make sense, but not all. And when trying to understand those, you realize how much you miss the documentation.

A big drawback of the schema is that lots of data – dates, physical dimensions, locations – is stored as plain strings instead of structured fields. I guess this is because most of the data comes straight from strings in MARC records. (Other people don’t like this either.) The field type meant for storing URIs, /type/uri, apparently is not used: URIs are stored as strings too.
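To illustrate the difference with a made-up record fragment (the field names publish_date and physical_dimensions do, as far as I can tell, exist on editions, but the values and the structured alternative are invented):

# What the schema stores today: plain strings, straight from MARC.
edition_as_stored = {
    "publish_date": "March 1998",
    "physical_dimensions": "24 x 16 x 2 centimeters",
}

# What structured fields could look like instead (pure illustration):
edition_structured = {
    "publish_date": {"year": 1998, "month": 3},
    "physical_dimensions": {"height": 24, "width": 16, "depth": 2,
                            "unit": "centimeters"},
}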

I sent an email to the tech list asking for clarification on a few fields: /type/edition.contributions[] and /type/edition.contributors[], which seem to be meant for the same information; /type/edition.distributors[], which doesn’t appear in the edit form; and /type/author.uris[], which is not documented but could serve as a container for other URIs for the resource, i.e. the URIs that are <owl:sameAs> the OL URI. I haven’t had a reply yet, so I thought I’d find out for myself.

A little data wrangling

I downloaded the 8 GB gzipped dump file of January 2012, which turned out to be 35 GB after unzipping. Every line consists of five tab-separated fields: record type, record key, record revision, datetime of last modification and the record itself as JSON. The streaming editor GNU sed was the tool of choice on my moderate home computer, as it can handle the .txt file easily: it loads one line at a time, processes it and prints it only if I tell it to, using regular expressions as search patterns. This makes it perfect for extracting records from the dump.
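For those who prefer a scripting language over sed, here is a minimal sketch of the same streaming approach in Python, assuming the five-field layout just described (the filter on a structured contributors field is just an example):

import json

# Stream the dump line by line; each line has five tab-separated fields:
# record type, key, revision, last-modified datetime, JSON record.
with open("ol_dump_2012-01-31.txt", encoding="utf-8") as dump:
    for line in dump:
        rtype, key, revision, modified, raw = line.rstrip("\n").split("\t", 4)
        if rtype != "/type/edition":
            continue
        record = json.loads(raw)
        # Print the key of every edition with a structured contributors field.
        if "contributors" in record:
            print(key)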

Distributors?

$ sed -nre '/"distributors": [/p' ol_dump_2012-01-31.txt

(i.e. print lines that contain '"distributors": [') yielded nothing, so no records contain a distributors field. Was this field meant to be used only programmatically? I don’t know.

Contributions / contributors

I used similar commands to extract records containing the contributions and/or contributors fields. The contributors.txt file contains 14996 records in about 24 MB. The contributions.txt file is 8 GB, so Notepad++ would have choked on it had I tried to open it just to count the lines…

$ wc -l contributions.txt
7439623 contributions.txt

So 7439623 records contain a contributions field. /type/edition.contributions[] is defined as an array of strings. Looking at the contributors records, I see JSON objects containing a ‘role’ and a ‘name’ field. So that explains how name and role are stored there: structured! It still doesn’t explain why the contributors field is not listed under /type/edition, though.
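Side by side, the two shapes look roughly like this (names and roles are invented; only the structure – plain strings versus role/name objects – is what the dump actually shows):

# contributions: an array of plain strings, role and name mashed together.
contributions = ["Jane Doe (translator)", "Illustrated by John Doe"]

# contributors: an array of objects with separate role and name fields.
contributors = [
    {"role": "Translator", "name": "Jane Doe"},
    {"role": "Illustrator", "name": "John Doe"},
]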

The JSON records can be extracted by throwing away everything up to, but not including, the first curly bracket:

$ sed -nre 's/^[^{]*//p' contributors.txt > contributors.json
$ sed -nre 's/^[^{]*//p' contributions.txt > contributions.json

Google Refine with the RDF extension is a nice tool for ‘wrangling’ contributors.json, so I loaded the file and found just under 300 different roles among the 25643 contributors in those 14996 records. That number may seem large to someone not involved with book publishing, but it is not a problem per se. The list, however, is messy: many entries were just translations, typos or other variations of one another. With some effort I managed to reconcile about 24000 of the contributors’ roles to roles on the MARC code list for relators of the Library of Congress (which can already be used in a Linked Data environment). This may not be the best way forward, as the MARC list is limited, but it was the only documented source of relators that I could find.
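Outside Google Refine, the reconciliation step could be sketched like this. The mapping table is a made-up fragment – the real MARC relator list has hundreds of entries, and the messy variants are invented examples of the kind of noise described above:

import json
from collections import Counter

# Tiny, made-up fragment of a mapping from messy role strings to
# MARC relator codes (see http://id.loc.gov/vocabulary/relators).
RELATOR_MAP = {
    "translator": "trl",
    "translated by": "trl",
    "vertaler": "trl",       # a translated variant
    "illustrator": "ill",
    "ilustrator": "ill",     # a typo variant
    "editor": "edt",
}

def reconcile(role):
    """Map a messy role string to a MARC relator code, if known."""
    return RELATOR_MAP.get(role.strip().lower())

unmatched = Counter()
with open("contributors.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for c in record.get("contributors", []):
            if reconcile(c.get("role", "")) is None:
                unmatched[c.get("role", "")] += 1

# The most frequent roles that still need a manual mapping:
print(unmatched.most_common(10))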

Links to other websites / (linked) data sources

The author record contains a field labeled “wikipedia”, which may be a relic of a time when only the English Wikipedia was thought relevant (this is pure speculation); it can no longer be edited manually in the edit form. All links now go into a “links” array, with a label for each link. Links to the English Wikipedia all have URLs starting with “http://en.wikipedia.org”, so they can easily be filtered.

Also useful are the links that point to the Virtual International Authority File (VIAF), which contains authority records from quite a few national libraries, so that authors can be identified quite accurately. Since many authors in Open Library are, by mistake, organizations instead of people, matching these ‘authors’ against their ‘corporate entity’ records in VIAF may be a way to separate them from the human authors.

$ sed -nre '/"wikipedia": "/p' ol_dump_2012-01-31.txt > wikipedia-field.txt
$ sed -nre '/"url": "http://en.wikipedia.org/p' ol_dump_2012-01-31.txt > wikipedia-english.txt
$ sed -nre '/http://viaf.org/p' ol_dump_2012-01-31.txt > viaf.txt

What does all this yield?

$ wc -l *.txt
 7439623 contributions.txt
   14996 contributors.txt
47796615 ol_dump_2012-01-31.txt
      38 viaf.txt
    1294 wikipedia-english.txt
    1864 wikipedia-field.txt
55254430 total

So January’s data dump contains almost 48 million records in total (authors, editions and works). There are vastly more strings containing some unstructured form of contribution attribution than there are structured contribution attributions (of which, incidentally, one is an ISBN and one a single character). There are still more old(?) Wikipedia links than there are new general links to the English Wikipedia, and there are only 38 links to VIAF.

Stop playing, start working

It seems there is a lot of work (not just playing) to do if Open Library is to take its place between VIAF, DBpedia and other authoritative linked data sources in the LOD Cloud. BTW: should it? Leave your comments, please!
