A new form of Twitter spam?

For a couple of weeks now I have been attracting a new form of spambot. It took me a couple of minutes to realise that the accounts were probably not controlled by humans. Here’s what I get:
I receive an email saying “Aurora Santee (@Ritanbrj) favorited one of your Tweets!”. The tweet was in Dutch and nothing indicated that Aurora Santee could understand it.

It’s becoming a pattern:

  • the username (@Ritanbrj) doesn’t have anything in common with the real name (Aurora Santee)
  • the real names aren’t what I think of as common Western names, but they are feminine
  • the account has a bio, which may include a URL (don’t click it, of course)
  • the few accounts that I looked at had about 16 tweets, most of them some sort of quotes; also about 50 following and 20-30 followers
  • all account activity before the favouriting of the not-so-random tweets (the five bots favourited just two tweets between them) happened on the same day

With that said, there are similar bots that favourite tweets and just advertise “buy Twitter followers” in their timelines. And there are actual people favouriting some of my tweets… 🙂

    Finding bad edits in the Open Library catalogue: ideas

    Note: this post may be updated to accommodate other ideas or to replace ideas.

    The Open Library catalogue can be edited by anyone. That became a problem when a wave of spam bots found it last year. Now only users capable of deciphering the captcha can edit.

    There has always been some spam. I have reported spam accounts and I think they have been removed and blocked, but in the end something else is needed to recognise spam automatically.

    But spam is not the only thing that can be tracked: erroneous edits and good edits may be tracked too, so that a machine can learn from them.

    I accidentally made VacuumBot break its pattern of changing strange or inconsistently used formats to better formats, by letting it remove formats that should have been improved. For instance, “Ebook” should have become “E-book” (that’s the Van Dale spelling), but it became “”.

    On the other hand, I have used the same VacuumBot to remove ” ;” (space semicolon) from the pagination field and moved bits of information to other fields in the record. I see a pattern here (“… but you created it, Ben!” Right.)
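    The two VacuumBot patterns above can be sketched roughly as follows. This is not VacuumBot’s actual code: the mapping `FORMAT_FIXES` and the function names are made up for illustration, and I assume the ” ;” sits at the end of the pagination value. The point of the guard is that a clean-up should improve a value, never blank it.

```python
# Hypothetical clean-up rules; the mapping and function names are illustrative only.
FORMAT_FIXES = {"ebook": "E-book", "e book": "E-book"}


def normalise_format(value: str) -> str:
    """Return the preferred spelling, or the value unchanged if no fix is known."""
    fixed = FORMAT_FIXES.get(value.strip().lower(), value)
    # Guard against the mistake described above: never blank out a non-empty value.
    return fixed if fixed.strip() else value


def clean_pagination(value: str) -> str:
    """Strip a trailing " ;" (space semicolon) from a pagination string."""
    value = value.rstrip()
    if value.endswith(";"):
        value = value[:-1].rstrip()
    return value
```

    With these, `normalise_format("Ebook")` gives “E-book” and an unknown format such as “Paperback” is left alone.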

    Ideas

    Bad edits break patterns (assumption 1).

    Good edits create patterns (assumption 2).

    Edit patterns can be discovered via the Open Library APIs (assumption 3, with anecdotal evidence).

    Possible independent variables/inputs for pattern recognition or edit classification:

    • user
    • field
    • number of bytes changed in the whole record
    • combinations of characters replaced
    • combinations of characters removed
    • combinations of characters added
    • edit distances between old and new values
    • combinations of characters moved between fields
    • revision
    • field newly filled out
    • validation of values
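
    A few of these inputs can be computed from nothing more than two revisions of a record. The sketch below assumes each revision is already available as a plain dict of field name to string; fetching the revisions is left out. `edit_features` and `levenshtein` are names I made up, and the field names in the usage note are just examples.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def edit_features(old: dict, new: dict) -> dict:
    """Describe what changed per field between two revisions (dicts of str -> str)."""
    features = {}
    for field in sorted(set(old) | set(new)):
        before, after = old.get(field, ""), new.get(field, "")
        if before == after:
            continue  # field untouched by this edit
        added, removed = [], []
        matcher = SequenceMatcher(None, before, after, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("delete", "replace"):
                removed.append(before[i1:i2])
            if tag in ("insert", "replace"):
                added.append(after[j1:j2])
        features[field] = {
            "newly_filled": before == "",
            "emptied": after == "",
            "byte_delta": len(after.encode("utf-8")) - len(before.encode("utf-8")),
            "distance": levenshtein(before, after),
            "added": added,
            "removed": removed,
        }
    return features
```

    For example, comparing `{"physical_format": "Ebook", "pagination": "120 p. ;"}` with `{"physical_format": "E-book", "pagination": "120 p."}` reports an added “-” in one field and a removed ” ;” in the other. Character combinations that show up often across many edits would be candidates for “patterns”.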

    Perhaps even nicer patterns can be detected when complete edit histories are used. When a certain field is changed back and forth several times, that may be vandalism.
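
    A minimal sketch of the back-and-forth check, assuming the successive values of one field have already been collected into a list, oldest first. `count_reversals` is a made-up name, and the threshold at which a count becomes suspicious would have to be found by experiment.

```python
def count_reversals(values: list) -> int:
    """Count how often a field returns to a value it held earlier in its history."""
    seen = set()
    reversals = 0
    previous = None
    for value in values:
        if value == previous:
            continue  # field was not changed in this revision
        if value in seen:
            reversals += 1  # changed back to something it was before
        seen.add(value)
        previous = value
    return reversals
```

    A history like “A”, “B”, “A”, “B” counts two reversals; a field that only ever moves on to new values counts none.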

    Etc.

    Addition 1:
    Can missing dimensions or weight be guessed from the format, the number of pages, and whichever of weight or dimensions is present?
    Can language be guessed from the title words?
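
    One naive way to try the language guess is to count very common function words per language in the title. The word lists below are tiny and hand-picked purely for illustration, the language labels are just dict keys, and short titles will often give no answer at all, so treat this as a sketch rather than a working classifier.

```python
import re

# Tiny, hand-picked word lists for illustration only; real lists would be much longer.
STOPWORDS = {
    "eng": {"the", "of", "and", "a", "in", "to"},
    "dut": {"de", "het", "een", "van", "en", "in"},
    "ger": {"der", "die", "das", "und", "ein", "von"},
}


def guess_language(title: str):
    """Return the best-scoring language label, or None on a tie or no match."""
    words = re.findall(r"[^\W\d_]+", title.lower())  # runs of letters only
    scores = {lang: sum(word in stop for word in words)
              for lang, stop in STOPWORDS.items()}
    top = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == top]
    if top == 0 or len(winners) > 1:
        return None  # not enough evidence to guess
    return winners[0]
```

    A guess that disagrees with the language already in the record could then be flagged for a human to look at, rather than corrected automatically.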