Short review of “Spamming in Scholarly Publishing: A Case Study”

Interesting: a researcher, Marcin Kozak, gets a lot of unsolicited email (spam) trying to convince him to publish in particular journals or with particular publishers, and decides to check out these journals and publishers.

Kozak, M., Iefremova, O. and Hartley, J. (2015), Spamming in scholarly publishing: A case study. Journal of the Association for Information Science and Technology. doi: 10.1002/asi.23521

The abstract covers it well:

Spam has become an issue of concern in almost all areas where the Internet is involved, and many people today have become victims of spam from publishers and individual journals. We studied this phenomenon in the field of scholarly publishing from the perspective of a single author. We examined 1,024 such spam e-mails received by Marcin Kozak from publishers and journals over a period of 391 days, asking him to submit an article to their journal. We collected the following information: where the request came from; publishing model applied; fees charged; inclusion or not in the Directory of Open Access Journals (DOAJ); and presence or not in Beall’s (2014) listing of dubious journals. Our research showed that most of the publishers that sent e-mails inviting manuscripts were (i) using the open access model, (ii) using article-processing charges to fund their journal’s operations; (iii) offering very short peer-review times, (iv) on Beall’s list, and (v) misrepresenting the location of their headquarters. Some years ago, a letter of invitation to submit an article to a particular journal was considered a kind of distinction. Today, e-mails inviting submissions are generally spam, something that misleads young researchers and irritates experienced ones.

Some details were missing, however. I think good methodologies for assessing a publisher’s or journal’s trustworthiness are necessary, so it would be great if people researching these methodologies got the details right.

The location of the headquarters was determined via various means, one of these being a lookup of the domain name holder’s (or registrant’s) country in a WHOIS system. The authors conclude this is not a reliable method, but do not explain why. A few sentences earlier they do suggest that the registrant’s country is the country the publisher/journal is based in, or that WHOIS shows the location of the server. Exactly which information from WHOIS was used is not described.

Another way of determining the headquarters’ location was to look up the information on the website. How they determined whether that information was present or missing is not mentioned.

One of the conclusions is that “the average time claimed for peer review was 4 weeks or less.” I don’t see how this follows from the summary table of claimed times for peer review: the table contains N/A values, and nearly all of the claimed times are 4 weeks or less, so no meaningful average can be computed from it and the statement is phrased incorrectly.

Finally, I would have liked to see a reason for not including the dataset. I can only guess why the authors deliberately did not provide the names of journals and publishers.

I think the conclusions hold (except for the one mentioned above), and that work should be performed to improve the methodology for judging journal quality. Eventually, such work could be automated and easily replicated over time. Results from such automated checks could be added to the DOAJ.

Terms of Semantics

From the “just throwing it out there” dept. in cooperation with the “I’m too lazy to do some research into existing efforts” dept.

It is generally known that people rarely read all terms of use and privacy statements of all involved parties providing a service. South Park used that knowledge as a storyline in The Human Cent-iPad.

One of the reasons for ignoring the Terms of Use (ToU) is their excessive length and use of legalese. They contain too much incomprehensible language. You need a lawyer to fully understand the rights and obligations that come with the service.

But many services share many characteristics: the jurisdiction whose laws govern the service conditions, the definitions of service provider and consumer, and definitions of content, ownership and other rights to that content.

Isn’t it possible to encode these characteristics and definitions in a standardised set of terms?

If various (similar) services provide the definitions of their services in standardised terms, they could more easily be understood and compared. It would help non-human agents to select the best services for themselves and their human controllers.
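To make the idea a bit more concrete, here is a minimal sketch (in Python, not an existing standard or vocabulary) of how a few such characteristics could be captured as structured terms and compared between two services. Every field name and value below is made up for illustration.

```python
# Minimal sketch: a handful of Terms-of-Use characteristics as structured,
# comparable terms. All field names and values are invented for illustration.
from dataclasses import dataclass


@dataclass
class TermsOfUse:
    service: str
    jurisdiction: str                 # whose laws govern the agreement
    provider: str
    content_ownership: str            # e.g. "user", "provider" or "shared"
    licence_to_provider: str          # e.g. "non-exclusive, worldwide"
    can_terminate_without_notice: bool


def differences(a: TermsOfUse, b: TermsOfUse) -> dict:
    """Return the terms on which two services differ."""
    return {
        name: (getattr(a, name), getattr(b, name))
        for name in a.__dataclass_fields__
        if name != "service" and getattr(a, name) != getattr(b, name)
    }


service_a = TermsOfUse("PhotoShareA", "US-CA", "A Inc.", "user",
                       "non-exclusive, worldwide", True)
service_b = TermsOfUse("PhotoShareB", "NL", "B BV", "user",
                       "non-exclusive, display only", False)
print(differences(service_a, service_b))
```

With something like this, an agent (human or not) could compare two photo-sharing services on the terms that actually matter, instead of reading two walls of legalese.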

More thought is needed.

Email wish list

Things I might find interesting in an email client or environment. Raw thoughts.

  • fast search
  • search inside encrypted emails
  • get related and linked stuff easily at any time during reading
  • easily organise windows and content, e.g.
    • when you select to reply to an email, put the draft next to the original so that you don’t need to switch windows
    • use a consistent style when commenting inline
  • easily create tasks and notes from parts of emails
  • organise / order following user’s workflow
    • by people, topic, tone of content, project, task, event
    • integrate with business process(es)
  • suggest while typing
    • spelling and grammar
    • references to other emails / notes / events / conversations
    • people to include as recipient
    • named entities
    • expansions of abbreviations
    • different tone of text by using templates
    • replacement text
  • find and use embedded machine-actionable information
  • explain what the client did to your experience
    • log observations and following actions by the mail user agent
    • show the reasons for suggesting stuff

New hobby: “You’re vulnerable as in CVE-2009-3555”

I started a new hobby: pointing out to website owners that they are vulnerable to a particular man-in-the-middle attack. It is described in CVE-2009-3555:

The TLS protocol, and the SSL protocol 3.0 and possibly earlier, as used in Microsoft Internet Information Services (IIS) 7.0, mod_ssl in the Apache HTTP Server 2.2.14 and earlier, OpenSSL before 0.9.8l, GnuTLS 2.8.5 and earlier, Mozilla Network Security Services (NSS) 3.12.4 and earlier, multiple Cisco products, and other products, does not properly associate renegotiation handshakes with an existing connection, which allows man-in-the-middle attackers to insert data into HTTPS sessions, and possibly other types of sessions protected by TLS or SSL, by sending an unauthenticated request that is processed retroactively by a server in a post-renegotiation context, related to a “plaintext injection” attack, aka the “Project Mogul” issue.

I bet I am not the only one practising this activity, because the vulnerability was described in 2009. And the fix for this vulnerability has been around for quite some years too. After the release of the various fixes for this vulnerability, the quest for KAMITMA (Knights Against Man-In-The-Middle Attacks) began. It is easy to become a KAMITMA in the context of CVE-2009-3555. You just need to point out this vulnerability to the owners of vulnerable TLS/SSL-protected websites, politely asking that they update their software. You might mention that this is an old vulnerability (2009 is like a century ago in computer development).

I started this hobby because I found that my own website (and ownCloud and email) is vulnerable. I asked my hosting provider and they responded by offering to move me to a ‘new’ hosting environment. ‘New’ does not mean the newest version of Apache httpd, but I am planning to move to this new hosting environment soon. If that doesn’t fix the vulnerability, I’ll have to move elsewhere and suggest others do so too, because it would show that my hosting provider doesn’t care about security.

The easiest way of spotting this vulnerability is to use Firefox and make it block connections that are made using the old, vulnerable protocol. Open a tab in Firefox, enter about:config in the address bar and press Enter. Search for security.ssl.require_safe_negotiation and double-click the row to set it to true. The next time you try to visit a vulnerable website you’ll see this:

[Image: bitly-fail]

When you see this, it’s hobby time: send a message to the owner of the website, asking them to update and become more secure.
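If you prefer the command line, openssl s_client reports in its handshake summary whether a server supports secure renegotiation. Here is a minimal sketch wrapping that in Python; it assumes the openssl binary is on the PATH, and the exact output wording may differ between OpenSSL versions.

```python
# Minimal sketch: ask openssl s_client whether a server supports secure
# renegotiation (RFC 5746). Assumes the `openssl` binary is on the PATH;
# the output wording may differ between OpenSSL versions.
import subprocess


def supports_secure_renegotiation(host: str, port: int = 443) -> bool:
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}"],
        input="",                # close stdin so s_client exits after the handshake
        capture_output=True, text=True, timeout=30,
    )
    # openssl prints e.g. "Secure Renegotiation IS supported" or
    # "Secure Renegotiation IS NOT supported" in the handshake summary.
    return "Secure Renegotiation IS supported" in result.stdout


if __name__ == "__main__":
    for site in ["example.org"]:   # replace with the sites you want to check
        print(site, supports_secure_renegotiation(site))
```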

I found these websites and try to keep this list updated when I find something changes:

Addition, 2014-08-28: Tweakers.net, a news and price comparison website for electronics, hosts its images on a different domain (tweakimg.net and the subdomain ic.tweakimg.net). Apparently these hosts are vulnerable, because all colour and layout are lost when I look at the Pricewatch. The browser console shows what is going on:

[Image: tweakers-fail]

Finding bad edits in the Open Library catalogue: ideas

Note: this post may be updated to accommodate other ideas or replace ideas.

The Open Library catalogue can be edited by anyone. That became a problem when a wave of spam bots found it last year. Now only users capable of deciphering the captcha can edit.

There has always been some spam. I have reported spam accounts and I think they have been removed and blocked, but in the end something else is needed to recognise spam automatically.

But not only spam can be tracked: erroneous edits and good edits can be tracked too, to (machine) learn from.

I accidentally made VacuumBot break its pattern of changing strange or inconsistently used formats to better formats, by letting it remove formats that should have been improved. For instance, “Ebook” should have become “E-book” (that’s the Van Dale spelling), but it became “”.

On the other hand, I have used the same VacuumBot to remove “ ;” (space semicolon) from the pagination field and to move bits of information to other fields in the record. I see a pattern here (“… but you created it, Ben!” Right.)

Ideas

Bad edits break patterns (assumption 1).

Good edits create patterns (assumption 2).

Edit patterns can be discovered via the Open Library APIs (assumption 3, with anecdotal evidence).

Possible independent variables/inputs for pattern recognition or edit classification (a rough feature-extraction sketch follows the list):

  • user
  • field
  • number of bytes changed in the whole record
  • combinations of characters replaced
  • combinations of characters removed
  • combinations of characters added
  • edit distances between old and new values
  • combinations of characters moved between fields
  • revision
  • field newly filled out
  • validation of values
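As a minimal sketch of how such features could be computed, here is some Python along the lines of the list above. The record dictionaries, field names and the user/revision parameters are hypothetical; real Open Library records would have to be mapped to this shape first, and the distance measure is only a rough difflib-based approximation.

```python
# Minimal sketch: turn an (old, new) pair of record versions into a feature
# dictionary along the lines of the list above. The record shape and field
# names are hypothetical, not the actual Open Library record format.
import difflib
import json


def edit_distance(old: str, new: str) -> int:
    """Rough Levenshtein-like distance based on difflib opcodes."""
    sm = difflib.SequenceMatcher(None, old, new)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")


def edit_features(old: dict, new: dict, user: str, revision: int) -> dict:
    features = {
        "user": user,
        "revision": revision,
        "bytes_changed": abs(len(json.dumps(new)) - len(json.dumps(old))),
        "fields_changed": [],
        "fields_newly_filled": [],
        "distances": {},
    }
    for field in set(old) | set(new):
        before, after = str(old.get(field, "")), str(new.get(field, ""))
        if before != after:
            features["fields_changed"].append(field)
            features["distances"][field] = edit_distance(before, after)
            if not before and after:
                features["fields_newly_filled"].append(field)
    return features


old = {"physical_format": "Ebook", "pagination": "123 p. ;"}
new = {"physical_format": "E-book", "pagination": "123 p."}
print(edit_features(old, new, user="VacuumBot", revision=4))
```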

Perhaps even nicer patterns can be detected when complete edit histories are used. When a certain field is changed back and forth several times, that may be vandalism.

Etc.

Addition 1:
Can dimensions or weight be guessed from format, number of pages, and weight or dimensions?
Can language be guessed from the title words?
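For the language question, a very naive sketch of the idea: count matches against tiny per-language stopword lists. The word lists and language codes below are illustrative only; a real solution would use a proper language-identification library and more fields than just the title.

```python
# Naive sketch of "guess the language from the title words": count hits
# against tiny stopword lists. Word lists and language codes are illustrative.
from typing import Optional

STOPWORDS = {
    "eng": {"the", "and", "of", "a", "in", "to"},
    "dut": {"de", "het", "een", "en", "van", "voor"},
    "fre": {"le", "la", "les", "des", "et", "un"},
}


def guess_language(title: str) -> Optional[str]:
    words = set(title.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("De geschiedenis van het boek"))   # dut
print(guess_language("A history of the book"))          # eng
```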

“Kennis over publiceren” converted to EPUB

At a panel discussion about publishing cultures in academia on the 18th of December 2012 (which unfortunately I didn’t attend), De Jonge Akademie published a little book on the topic [zotpressInText item=”2GMXVGV6″].

Although the book’s paper size is almost the same as my Sony (PRS-T2) e-reader’s screen size, the PDF version isn’t really readable on the device. The letters are too small, even when most whitespace is removed. Because I wanted to read it, and preferably on my e-reader, I converted it to EPUB myself. Here are some observations about the process.

My first attempt was fully manual: I opened the PDF in PDF-XChange Viewer and copied the text from the document into Sigil. This introduced anomalies, as markup (headings, line and paragraph breaks, etc.) and formatting (e.g. italic text, superscripts) were lost. It was a lot of work to restore, even for this PDF of just 86 pages. I quit while processing the second chapter.

The second attempt still took some work, but the first step was already easier: Calibre was able to convert the PDF and create an EPUB file, saving most of the markup and formatting and even the cover.
There was a lot to tweak, though:

  • soft hyphens at the end of lines are not removed in the conversion process (see the sketch after this list);
  • most of the uppercase letters were stored as (and hence copied as) lowercase, including chapter titles, quotes in ‘small caps’ and “de jonge akademie”;
  • text in footers ended up in the middle of the text (although this also happened when manually copying from the source document);
  • tables were torn apart (but this may have been an option in the conversion process that I should have turned off);
  • front and back cover were apparently stored as one image with the cutting marks in the PDF, and had to be cut out by hand, stored as separate JPEGs and linked to in the EPUB;
  • in some phrases that were in italics, each word had its own set of <i></i> tags;
  • I recreated the box around one paragraph in the introduction;
  • I added as much metadata as I could find in the original to the EPUB;
  • the interviews with members of De Jonge Akademie had no markup, just formatting – I made them ‘real’ chapters by putting the title in <h1></h1>;
  • I moved one of the interviews from the middle of a chapter to the end of the chapter, to not confuse the table of contents creator.
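Regarding the first point, a minimal clean-up sketch: strip the soft hyphens from the content files of the unzipped EPUB. This assumes the leftover characters really are soft hyphens (U+00AD) and that the XHTML files live under the path shown; both are assumptions about Calibre’s output, not something I verified for this book.

```python
# Minimal sketch: remove soft hyphens (U+00AD) and rejoin words broken across
# line ends in the XHTML files of an unzipped EPUB. Assumes the leftover
# characters really are soft hyphens; plain hyphens need a different pattern.
import pathlib
import re

SOFT_HYPHEN = "\u00ad"


def clean_file(path: pathlib.Path) -> None:
    text = path.read_text(encoding="utf-8")
    # "exam-\nple" -> "example" when the hyphen is a soft hyphen
    text = re.sub(SOFT_HYPHEN + r"\s*", "", text)
    path.write_text(text, encoding="utf-8")


# Placeholder path: wherever the EPUB was unzipped.
for xhtml in pathlib.Path("unzipped_epub/OEBPS").glob("*.xhtml"):
    clean_file(xhtml)
```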

There is probably more that can be done, but this seems enough for now. I accept suggestions for improvement of the result and the process (though I probably will not do this again soon).

The resulting EPUB file can be downloaded. This derived work is available under the original licence (Creative Commons Attribution 3.0 NL), so you can (e.g.) improve it without asking.

Source:

[zotpressInTextBib]

Hosting my own Twitter images on WordPress using Tweet Images

As you may know, I use Twitter. I sometimes use it to post images from my smartphone to the world too. Those images are first taken by the TweetDeck app, then uploaded to a photo hosting service (YFrog by default). The photo host returns a link to the media, which is usually a landing page that prominently shows the picture or video, plus advertisements and other stuff that may or may not be of interest.

Why host my own pictures?

It is easy to have a free large service host your pictures, but there are some drawbacks. They make money using your pictures, and they may even have claimed ownership of them when you uploaded them. Besides that, I usually lose track of the pictures I uploaded, because the tweets in which they appear vanish after some time. And the copies that are stored on my phone lack the context of the tweet.

Since I already have webhosting, a WordPress install and space to add pictures, why not host my own pictures? Well, it turned out to be not so simple – mostly because the major image hosters are the only services you can choose from in the Twitter clients. For example, TweetDeck on Android only lets you choose between YFrog and Twitpic. More on this later.

Make your WordPress install your Twitter image host

Although I am not the only person to have done this, this story reflects the steps I took to make my WordPress install my image host.

There’s a plugin for that

Sure enough, on the first result page of a search for “twitter images” I found Tweet Images, which promised to be pretty much what I envisioned: no OAuth fuss, but a secret URI that only the authorised person(’s Twitter client) knows. It separates the tweeting and the image uploading, whereas the major image hosts require access to your Twitter account (reading tweets, sending tweets, following accounts) before accepting anything. Tweet Images takes (just) the image and the tweet text, creates an image post and returns the URI of the new post. It uses the shortest possible URIs to return to the Twitter client: <blog address>?p=xxx, which apparently only works when that is not already the blog’s permalink format.
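For a rough idea of what the client side of this looks like, here is a sketch of an upload as a multipart POST of image plus message to the secret URI. The endpoint URL is a placeholder and the field names (“media”, “message”) are guesses on my part; the plugin’s included test script shows the actual parameters it expects.

```python
# Minimal sketch of what a client upload to Tweet Images might look like:
# a multipart POST of the image plus the tweet text to the secret URI.
# The URL is a placeholder and the field names ("media", "message") are
# guesses -- check the plugin's test script for the real parameter names.
import requests

SECRET_UPLOAD_URI = "https://example.org/tweetimage/<secret>"  # placeholder


def upload(image_path: str, message: str) -> str:
    with open(image_path, "rb") as image:
        response = requests.post(
            SECRET_UPLOAD_URI,
            data={"message": message},
            files={"media": image},
            timeout=30,
        )
    response.raise_for_status()
    # The plugin returns the URI of the new post (e.g. <blog address>?p=xxx).
    return response.text


print(upload("photo.jpg", "Testing my self-hosted image upload"))
```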

The permalinks created for the image posts contain a somewhat ugly hash, instead of the usual slugs that WordPress creates. But again, there’s a plugin for that: Clean URLs for Tweet Images. Installed flawlessly.

(I should have another plugin that adds semantic markup to the picture, i.e. <picture> foaf:depicts <person/thing> based on Twitter handles included in the tweet. Perhaps I should first install a plugin that adds Open Graph metadata to the post.)

Trumpet, the only Twitter client that supports a custom image host

After updating my WordPress user profile – you have to check a box before Tweet Images accepts pictures for the user – I thought I was good to go. The included test script correctly uploaded an image with a message and returned the URI of the post containing the photo.

As I had noticed, TweetDeck doesn’t allow setting a custom image host (Twitter, the owner of TweetDeck, doesn’t provide support pages for the TweetDeck app either). So after some searching and installing several Twitter clients, I had to conclude that Trumpet is the only Android app that supports custom image upload URIs. Although it is still called a beta version, it seems to be pretty complete, stable and nice. I’ve had notifications of 20 new mentions when there were no new mentions at all, but that is the only flaw I discovered. Oh, and you have to take the picture outside Trumpet, as it doesn’t ‘connect’ to the Camera app. From the Gallery, however, you can share images using Trumpet like you can with other apps via the Share command.

The real test…

… is of course to upload an image from my phone. I copied and pasted the upload URI into Trumpet’s settings and tried. “Unable to upload, please try again.” So I tried again, but the result was the same. I checked the server error log, which showed that Mod Security blocked the upload:

[Wed Apr 18 15:03:50 2012] [error] [client <ip address>] mod_security: Access denied with code 500. Error processing request body: (null) [hostname "ben.companjen.name"] [uri "/tweetimage/<secret>"]

I’m not completely sure what “(null)” means in this context.

My webhosting provider will not make an exception in Mod Security for me or my specific client. After a second email to customer service I received a useful answer: that (null) may mean that there is a null character somewhere in the request. I have yet to find it, but that would probably mean that the Twitter4J library (used by Trumpet) creates bad requests.

The workaround I use now is lowering the level of security (i.e. using some directives in .htaccess to turn off Mod Security for POST requests). I should be able to turn it off for a specific path, but figuring out how will take a little more time. It seems my hosting provider is very restrictive in which directives can be used in .htaccess files, which makes it hard to try out several options without making the whole site unavailable. Or that may just be me 🙂

Image previews in clients

One of the disadvantages of building an image host yourself is that there are no standards or even best practices for exposing its services to clients. Apparently Trumpet is okay with uploading to any URI that returns something like <mediaurl>http://…</mediaurl>, and so are a couple of clients for iOS. But Tweet Images returns the URI of the post, not of the image itself (or one of the resized copies). This is just like any of the major hosters. However, it appears that each hoster has its own way of deriving the (preview) image’s URI from the post’s URI. That means clients wanting to support image previews have to know how to interact with each host they want to support. In this light: how can any client know from the URIs I tweet where to find the preview images?

I have yet to try and find out if copying e.g. YFrog’s behaviour can work, although I’m afraid all these behaviours are hardcoded and only work with the right URIs. Some standardisation should be possible here, shouldn’t it?

If you have any ideas for improvement, please leave a comment.

Linked Data at the University of Twente?

Of course the University of Twente should offer their data as Linked Open Data! It could be the first Dutch university to be a member of Linked Universities, although the data themselves are of course more important than being listed as a member of some website.

Data are all around the University, and some of them are already published. The event calendar is available in iCalendar format, the phone book is accessible through LDAP and a web form, and bibliographic data about research publications are searchable via a webpage and can be harvested using OAI-PMH. The RKB Explorer already offers harvested data from both the University repository and the repository of the Faculty of Electrical Engineering, Mathematics and Computer Science as Linked Data, but those harvests are from 2009 and before.

Here’s a list of things that I think can be easily described using Linked Data:

  • buildings (including historical buildings, like Charlie)
  • rooms (lecture halls, offices, labs, dressing rooms, theatres, boiler rooms, etc.)
  • opening hours
  • squares, fields
  • streets
  • bus stops
  • parking spaces
  • points of interest (coffee machines, candy vending machines, fire extinguishers, etc.)
  • artworks (paintings, outside artworks)
  • people (staff, student body representatives, etc.)
  • study statistics (number of enrolled students, per programme, etc.)
  • organisational structures (faculties, research groups, research institutes, spin off companies, facilities)
  • associations (student union, cultural and sports clubs, alumni associations, etc.)
  • scientific publications
  • scientific data
  • library catalogue
  • archive catalogue
  • study programmes (courses, requirements)
  • contracts (EU funding, cleaning, catering, coffee and candy vending machines, etc.)
  • events (lectures, meetings, concerts, dissertations, performances)

The university and the university library tend to think of “scientific data linked to the publications that are based on the data” when they hear Linked Data, but if all these data are available as RDF (or at least in some open data format, e.g. CSV), they allow many more useful applications to be developed. Think of programme checks (do you need more or other courses?), appointment schedulers that account for walking distances between rooms, or between a bus stop and a room, and visualisations of which parts of buildings produce the most publications. Integration of people’s phone book entries with their publications is easy, and creating filters for the event calendar is just as easy: it’s a matter of adjusting a SPARQL query.
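To illustrate that last point, a minimal sketch with rdflib: put a few event triples in a graph and filter them with a SPARQL query. The URIs and the use of schema.org terms below are placeholders of my own, not the university’s actual data model; filtering by organiser or date instead is just a matter of editing the WHERE clause.

```python
# Minimal sketch of "filtering the event calendar is a matter of adjusting a
# SPARQL query", using rdflib. The ut: URIs and schema.org terms are
# placeholders for illustration, not the university's actual data model.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

UT = Namespace("http://data.example.utwente.nl/")
SCHEMA = Namespace("http://schema.org/")

g = Graph()
lecture = UT["event/123"]
g.add((lecture, RDF.type, SCHEMA.Event))
g.add((lecture, SCHEMA.name, Literal("Guest lecture on Linked Data")))
g.add((lecture, SCHEMA.location, UT["room/RA-1501"]))

# "Adjusting the filter" means editing the WHERE clause, e.g. filtering by
# organiser or date instead of location.
query = """
    PREFIX schema: <http://schema.org/>
    SELECT ?event ?name WHERE {
        ?event a schema:Event ;
               schema:name ?name ;
               schema:location <http://data.example.utwente.nl/room/RA-1501> .
    }
"""
for row in g.query(query):
    print(row.event, row.name)
```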

We’ll see what happens.

Discogs data, MonetDB/XQuery and Gephi

Wow, this was a draft from February of 2011. Time to freshen it up and post 🙂

I love Discogs for letting me catalogue my music while building a database of all musical releases worldwide. Sometimes I wished it were easier to get a high-level overview of the trees of labels and their sublabels. Finally I tried to create such a tree myself. Here is something about how I tried; another idea I have for the release data is near the end of this post.

Discogs release their data as monthly data dumps into the public domain, which allows anyone to do anything with them. It’s too bad that not all information is exported (barcodes are not included, amongst other things), but there are many things you can do without those fields.

The Database group of the University of Twente contributed to MonetDB/XQuery, so during the course XML & Databases 1 students are told they can use it to test their XQueries. (Promotional words aside, it does work, and when you spot a bug in the XQuery or PF/Tijah full-text index part, support is within walking distance.)

So, my plan was to create a graph of the parent label-sublabel structure from the XML dump. It’s not too hard: download and unzip data, load it into MonetDB, use XQuery to create an XML graph file.

Installing Gephi is easy, as it installs just like any other piece of Windows software. It supports an XML graph format called GEXF, which allows sizing nodes and setting the thickness of edges.

Installing MonetDB/XQuery was easy, even on a 64-bit Windows 7 machine. Starting and stopping is just slightly more complicated than starting from the Start menu, because I wanted to use ‘discogs’ as the database name and also wanted to use some options with the mclient command-line client.

The labels dataset is 31.2 MB when unpacked and loads easily. A simple XQuery shows there are 189,252 labels and 17,789 sublabel relations. In Discogs, all entities are distinguished by name, possibly followed by a number in parentheses. Sublabel relations are designated using label names within a <sublabel> element, but those sublabels have a <parentLabel> element as well (when applicable, of course). The connection between parent label and sublabel can thus be found both ways.
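As an aside, the same edge list can be pulled out with a few lines of Python instead of XQuery. This is only a sketch: the element names follow the description above, and the structure of current Discogs dumps may well differ.

```python
# Minimal sketch: extract parent-sublabel edges from the labels dump with
# Python instead of XQuery. Element names (<label>, <name>, <parentLabel>)
# follow the description in the post; current dumps may be structured
# differently.
import xml.etree.ElementTree as ET


def sublabel_edges(path: str):
    """Yield (parent_name, label_name) pairs from a labels XML dump."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "label":
            name = elem.findtext("name")
            parent = elem.findtext("parentLabel")
            if name and parent:
                yield parent, name
            elem.clear()   # keep memory use low on a large dump


edges = list(sublabel_edges("discogs_labels.xml"))
print(len(edges), "sublabel relations")
```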

If you thought that BMG or the Warner Music Group had the most sublabels, you were wrong. Although BMG has 78 direct sublabels and Warner has about the same number, the ‘biggest’ parent label is Not On Label. All releases without a clear label go on that label or one of its ~1800 sublabels (like “Not On Label (Metallica Self-released)”). This is something to take into account when creating XQueries or rendering a graph.

This is about as far as I got then. I may have gotten distracted by a deadline. So where was I?

Looking at the files, I think my goals expanded to creating a graph of my complete collection, including artists, labels and releases. But even for a subset of 60 releases that doesn’t make a pretty graph. So for starters, here are the labels in my collection (as of February 2011), with the size of each node relative to the number of releases on that label in the collection. For releases with multiple labels, all labels are counted. (Sonar Kollektiv is the largest node.)

[Image: Graph of record labels and sublabels in my Discogs collection by February 11, 2011]

I’ll think about posting more images, but I need to find out what more I have first 🙂