Finding bad edits in the Open Library catalogue: ideas

Note: this post may be updated to accommodate new ideas or to replace existing ones.

The Open Library catalogue can be edited by anyone. That became a problem when a wave of spam bots found it last year. Now only users who can decipher the captcha can edit.

There has always been some spam. I have reported spam accounts and I think they have been removed and blocked, but in the end something else is needed to recognise spam automatically.

But spam is not the only thing that can be tracked: erroneous edits and good edits can be tracked too, as examples to (machine) learn from.

I accidentally made VacuumBot break its pattern of changing strange or inconsistently used formats to better formats, by letting it remove formats that should have been improved. For instance, “Ebook” should have become “E-book” (that’s the Van Dale spelling), but it became “” (an empty value).

On the other hand, I have used the same VacuumBot to remove “ ;” (space semicolon) from the pagination field and to move bits of information to other fields in the record. I see a pattern here (“… but you created it, Ben!” Right.)
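
To make the two VacuumBot examples concrete, here is a minimal sketch in Python. It is not VacuumBot's actual code: the function names and the FORMAT_FIXES table are hypothetical, and only the “Ebook” to “E-book” mapping and the trailing “ ;” come from the examples above. The guard shows one way a clean-up could refuse to turn a filled-in format into an empty one.

```python
import re

# Hypothetical fix table; only this one mapping comes from the post.
FORMAT_FIXES = {"Ebook": "E-book"}


def fix_format(value, fixes=FORMAT_FIXES):
    """Return an improved format string, never an emptied one."""
    cleaned = value.strip()
    fixed = fixes.get(cleaned, cleaned)
    # Guard: a fix should improve a value, not delete it.
    if cleaned and not fixed.strip():
        return cleaned
    return fixed


def fix_pagination(value):
    """Remove a trailing ' ;' (space semicolon) from a pagination value."""
    return re.sub(r"\s*;\s*$", "", value).strip()


print(fix_format("Ebook"))         # E-book
print(fix_pagination("345 p. ;"))  # 345 p.
```

With a faulty table such as {"Ebook": ""}, fix_format("Ebook", {"Ebook": ""}) returns “Ebook” unchanged instead of an empty value.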


Bad edits break patterns (assumption 1).

Good edits create patterns (assumption 2).

Edit patterns can be discovered via the Open Library APIs (assumption 3, with anecdotal evidence).

Possible independent variables/inputs for pattern recognition or edit classification:

  • user
  • field
  • number of bytes changed in the whole record
  • combinations of characters replaced
  • combinations of characters removed
  • combinations of characters added
  • edit distances between old and new values
  • combinations of characters moved between fields
  • revision
  • field newly filled out
  • validation of values
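
Several of the inputs listed above can be computed from two versions of one record. The sketch below assumes each version is available as a flat Python dict of field name to string value; that shape, the function names and the feature names are assumptions for illustration, not the Open Library record format. The byte count is the size difference of the JSON-encoded records, which is only a rough proxy for "bytes changed", and moved characters and value validation are left out.

```python
import json
from collections import Counter


def levenshtein(a, b):
    """Classic edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def record_size(record):
    """Size in bytes of the record when encoded as UTF-8 JSON."""
    return len(json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8"))


def edit_features(old, new, user, revision):
    """Describe one edit as a dict of features (hypothetical feature names)."""
    features = {
        "user": user,
        "revision": revision,
        "bytes_changed": record_size(new) - record_size(old),
        "fields": {},
    }
    for field in sorted(set(old) | set(new)):
        before = old.get(field) or ""
        after = new.get(field) or ""
        if before == after:
            continue
        # Multiset differences: characters that disappeared or appeared.
        removed = Counter(before) - Counter(after)
        added = Counter(after) - Counter(before)
        features["fields"][field] = {
            "newly_filled": not before and bool(after),
            "emptied": bool(before) and not after,
            "edit_distance": levenshtein(before, after),
            "chars_added": "".join(sorted(added.elements())),
            "chars_removed": "".join(sorted(removed.elements())),
        }
    return features


old = {"physical_format": "Ebook", "pagination": "345 p. ;"}
new = {"physical_format": "", "pagination": "345 p."}
feats = edit_features(old, new, user="VacuumBot", revision=2)
print(feats["fields"]["pagination"]["chars_removed"])  # " ;"
print(feats["fields"]["physical_format"]["emptied"])   # True
```

Feature dicts like these could then be fed to whatever classifier or pattern miner is chosen; that choice is left open here.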

Perhaps even nicer patterns can be detected when complete edit histories are used. When a certain field is changed back and forth several times, that may be vandalism.
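
The back-and-forth idea could be sketched as follows, assuming the successive values of one field have already been collected from a record's history, oldest first. The function names and the threshold of 2 are hypothetical; a real threshold would have to be tuned on actual edit histories.

```python
def count_reversions(values):
    """Count how often a field returns to a value it held earlier."""
    # Keep only real changes: drop revisions that left the field untouched.
    changes = [v for i, v in enumerate(values) if i == 0 or v != values[i - 1]]
    seen = set()
    reversions = 0
    for value in changes:
        if value in seen:
            reversions += 1
        seen.add(value)
    return reversions


def looks_like_back_and_forth(values, threshold=2):
    """Flag a field whose value was restored at least `threshold` times."""
    return count_reversions(values) >= threshold


history = ["E-book", "Ebook", "E-book", "Ebook", "E-book"]
print(count_reversions(history))           # 3
print(looks_like_back_and_forth(history))  # True
```

A flagged field is only a hint: two well-meaning editors who disagree about a spelling would produce the same pattern as a vandal.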


Addition 1:
Can missing dimensions or weight be guessed from format, number of pages, and whichever of weight or dimensions is known?
Can language be guessed from the title words?
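
A very rough sketch of the title-word idea, assuming a small hand-made list of common function words per language. The word lists, the three-letter codes used as keys and the function name are all illustrative assumptions; a serious attempt would need far larger lists or a trained language identifier, and many titles contain no function words at all.

```python
import re

# Tiny, hand-picked word lists, for illustration only.
COMMON_WORDS = {
    "eng": {"the", "of", "and", "a", "to", "with"},
    "dut": {"de", "het", "een", "van", "en", "met"},
    "ger": {"der", "die", "das", "und", "ein", "von"},
    "fre": {"le", "la", "les", "des", "et", "un"},
}


def guess_language(title):
    """Return the best-matching language code, or None when unsure."""
    words = re.findall(r"[^\W\d_]+", title.lower())
    scores = {lang: sum(word in common for word in words)
              for lang, common in COMMON_WORDS.items()}
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # No known word at all, or a tie between languages: do not guess.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


print(guess_language("The Origin of Species"))       # eng
print(guess_language("De ontdekking van de hemel"))  # dut
print(guess_language("Ulysses"))                     # None
```

Returning None for ties and unknown titles keeps the guess from being written into a record when the evidence is thin.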