After some initial experimenting, I have started work on VacuumBot, a bot for Open Library. It is meant to be a general-purpose bot that can clean up some of the mess I found in the data dump of January 2012.
By now I have compiled several lists of dirty data and key counts in this Gist. Among these are:
- count of keys found in author records*
- count of keys found in edition records*
- count of keys found in work records
- count of identifiers found in edition records
- count of classifications found in edition records
*: There are a few edition records that have the type “/type/author”, so they were counted as author records here. The Open Library website also tries to render them as authors, resulting in “name missing”. It’s possible that there are also authors that are stored and treated as editions, but I haven’t looked for them yet.
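A minimal sketch of how such key counts can be produced, not the actual VacuumBot code. It assumes that each dump line is tab-separated, with the record type in the first column and the JSON record in the last one; the function name `count_keys` and the sample lines are made up for illustration.

```python
import json
from collections import Counter, defaultdict


def count_keys(lines):
    """Tally top-level JSON keys per record type.

    Assumption: every line is tab-separated, the first column holds the
    record type (e.g. "/type/author") and the last column holds the JSON.
    """
    counts = defaultdict(Counter)
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue  # blank or malformed line
        record_type = columns[0]
        try:
            record = json.loads(columns[-1])
        except ValueError:
            continue  # last column is not valid JSON
        if isinstance(record, dict):
            counts[record_type].update(record.keys())
    return counts


# Hypothetical sample lines, not taken from the real dump.
sample = [
    '/type/author\t/authors/OL1A\t1\t2012-01-01\t{"name": "A", "entity_type": "org"}',
    '/type/author\t/books/OL1M\t1\t2012-01-01\t{"title": "Looks like an edition"}',
    '/type/edition\t/books/OL2M\t1\t2012-01-01\t{"title": "T", "identifiers": {}}',
]

for record_type, keys in sorted(count_keys(sample).items()):
    print(record_type, dict(keys))
```

Because the sketch trusts the type column, a mislabelled edition such as the second sample line ends up in the author counts, which is the effect described in the footnote.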
Besides records being mixed up, some classifications are stored as identifiers and some identifiers as classifications. Without proper documentation, non-librarians like me cannot tell which one to choose when there is both an identifier “LC Control Number” and a classification “LCCN permalink”; the same holds for the classification “Library of Congress” and the identifier “LC Classification (LCC)”. The compiled lists make it clear that this is a hard problem. These are the identifiers in the dump that have something to do with the Library of Congress (LoC), with total occurrences and the number of records they appear in:
library_of_congress (2 / 2)
library_of_congress_catalog_card_no. (4 / 4)
library_of_congress_catalog_card_number (4 / 4)
library_of_congress_catalog_no. (5 / 5)
library_of_congress_catalogue_number (6 / 6)
library_of_congress_classification_(lcc) (62 / 61)
The numbers aren’t that high, but they show that no cleaning has been performed yet.
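The two numbers per identifier can be computed roughly as in the sketch below. It assumes that an edition record stores its identifiers as a mapping from identifier name to a list of values; the function name `identifier_stats` and the sample records are hypothetical.

```python
from collections import Counter


def identifier_stats(editions, field="identifiers"):
    """Return (occurrences, records) counters keyed by identifier name.

    occurrences counts every stored value, records counts each edition
    at most once per name. Assumption: edition[field] maps a name to a
    list of values (a single value is treated as a one-item list).
    """
    occurrences = Counter()
    records = Counter()
    for edition in editions:
        for name, values in (edition.get(field) or {}).items():
            if not isinstance(values, list):
                values = [values]
            occurrences[name] += len(values)
            records[name] += 1
    return occurrences, records


# Hypothetical editions; the second one holds two values under one name.
editions = [
    {"identifiers": {"library_of_congress": ["x1"]}},
    {"identifiers": {"library_of_congress_classification_(lcc)": ["a", "b"]}},
    {"title": "no identifiers at all"},
]

occurrences, records = identifier_stats(editions)
for name in sorted(occurrences):
    if name.startswith("library_of_congress"):
        print("%s (%d / %d)" % (name, occurrences[name], records[name]))
```

A name whose two numbers differ, like 62 / 61 above, has at least one record holding more than one value for it. Passing `field="classifications"` would produce the same kind of list for classifications, under the same assumption about the record layout.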
One problem in the data of OL is that organizations appear as authors. Somehow a batch of bad records was imported, which created these author records. There is no information in those records that allows them to be treated differently (like a flag that says the author is a corporate entity, so that adding a birth date doesn’t make sense). Or is there? Some author records have an “entity_type” key. The most popular value for that key is “org”, the second most popular is “person” (followed by “author”, “Writer” and “Pseudonym”). There are 977 different values, so the key must have served some other purpose than just disambiguating between organizations and persons. I also found three cases of spam: entity_type : “jerk”, “c*nt” (censored) and “die” (a death threat?).
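A ranking like this can be obtained with a simple tally, sketched below under the assumption that author records are already parsed into dictionaries. The helper names `entity_type_counts` and `case_variants` are my own, and grouping by letter case is just one possible way to start looking at the 977 values.

```python
from collections import Counter


def entity_type_counts(authors):
    """Count how often each "entity_type" value occurs in author records."""
    counts = Counter()
    for author in authors:
        value = author.get("entity_type")
        if value is not None:
            counts[str(value)] += 1
    return counts


def case_variants(counts):
    """Group values that differ only in letter case or outer whitespace."""
    groups = {}
    for value in counts:
        groups.setdefault(value.strip().lower(), []).append(value)
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


# Hypothetical author records, not taken from the real dump.
authors = [
    {"name": "Some Institute", "entity_type": "org"},
    {"name": "Another Institute", "entity_type": "org"},
    {"name": "Jane Doe", "entity_type": "person"},
    {"name": "J. D.", "entity_type": "Writer"},
    {"name": "John Roe", "entity_type": "writer"},
    {"name": "No entity type"},
]

counts = entity_type_counts(authors)
print(counts.most_common(1))
print(case_variants(counts))
```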
Did I say this is a work in progress yet? If you like, follow or fork the work on my GitHub fork.