Short review of “Spamming in Scholarly Publishing: A Case Study”

Interesting: a researcher, Marcin Kozak, receives a lot of unsolicited email (spam) trying to convince him to publish in a journal or with a publisher, and decides to check out these journals and publishers.

Kozak, M., Iefremova, O. and Hartley, J. (2015), Spamming in scholarly publishing: A case study. Journal of the Association for Information Science and Technology. doi: 10.1002/asi.23521

The abstract covers it well:

Spam has become an issue of concern in almost all areas where the Internet is involved, and many people today have become victims of spam from publishers and individual journals. We studied this phenomenon in the field of scholarly publishing from the perspective of a single author. We examined 1,024 such spam e-mails received by Marcin Kozak from publishers and journals over a period of 391 days, asking him to submit an article to their journal. We collected the following information: where the request came from; publishing model applied; fees charged; inclusion or not in the Directory of Open Access Journals (DOAJ); and presence or not in Beall’s (2014) listing of dubious journals. Our research showed that most of the publishers that sent e-mails inviting manuscripts were (i) using the open access model, (ii) using article-processing charges to fund their journal’s operations; (iii) offering very short peer-review times, (iv) on Beall’s list, and (v) misrepresenting the location of their headquarters. Some years ago, a letter of invitation to submit an article to a particular journal was considered a kind of distinction. Today, e-mails inviting submissions are generally spam, something that misleads young researchers and irritates experienced ones.

Some details were missing, however. I think good methodologies for assessing a publisher’s or journal’s trustworthiness are necessary, so it would be great if people researching these methodologies get the details correct.

The location of the headquarters was determined by various means, one of these being a lookup of the domain name holder’s (or registrant’s) country in a WHOIS system. The authors conclude this is not a reliable method, but do not explain why. A few sentences earlier they do suggest that the registrant’s country is the country the publisher/journal is based in, or that WHOIS shows the location of the server. Exactly which information from WHOIS was used is not described.
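For illustration, here is a minimal sketch of what such a lookup might amount to: scanning a raw WHOIS response for a “Registrant Country” field. The domain name and country code in the sample are made up, and field names vary between registries, which is itself one reason the method is unreliable.

```python
def registrant_country(whois_text):
    """Extract the registrant's country from a raw WHOIS response.

    Assumes the common "Key: Value" layout; registries use
    differing field names, so a miss does not mean the field
    is truly absent from the registry's records.
    """
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "registrant country":
            return value.strip()
    return None  # field not found: location cannot be determined this way


# Hypothetical response fragment (domain and country are invented):
sample = """Domain Name: EXAMPLE-JOURNAL.COM
Registrant Name: REDACTED FOR PRIVACY
Registrant Country: PK
"""
print(registrant_country(sample))  # PK
```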

Another way of determining the headquarters’ location was to look up the information on the website. How the authors decided whether that information was present or missing is not mentioned.

One of the conclusions is that “the average time claimed for peer review was 4 weeks or less.” I don’t see how this follows from the summary table of claimed peer-review times: the table contains N/A values, so no average can be computed from it, even though nearly all claimed times are indeed 4 weeks or less. The form of the statement is wrong.

Finally, I would have liked to see a reason for not including the dataset. I can only guess why the authors deliberately did not provide the names of journals and publishers.

I think the conclusions hold (except for the one mentioned above), and that work should be performed to improve the methodology for judging journal quality. Eventually, this work could be automated and easily replicated over time. Results from such automated checks could be added to the DOAJ.

Short review of the International Journal of Digital Library Services

If you do not like Elsevier’s misinterpretation of the Creative Commons licences, then stay away from the International Journal of Digital Library Services.

Reviewing this journal was easy. (I was partially inspired by Jeffrey Beall’s list of things to look for to determine ‘predatoriness’.) The website features animated GIF images and other very generic images as ‘context’ on its homepage, capitalises titles, uses the ISSN in title references, and contains many spelling and grammar mistakes. But most importantly, and the only real reason to recommend not doing business with this journal, is that the copyright page reads:

Articles which are published in IJODLS are under the terms and conditions of the Creative Commons [Attribution] License. Aim of IJODLS is to disseminate information of scholarly research published in related to library and information science.
The submission of the manuscript means that the authors automatically agree to assign exclusive copyright to Editor-in-Chief of IJODLS for printed and electronic versions of IJODLS, if the manuscript is accepted for publication. The work shall not then be published elsewhere in any language without the written consent of the publisher. The articles published in IJODLS are protected by copyright, which covers translation rights and the exclusive right to reproduce and distribute all of the articles printed in the journal.

In other words: this journal (“intellectual property rights” being one of its keywords in DOAJ) doesn’t get licences right. Any journal that requires transfer of copyright for publication will not get my recommendation, but this copyright statement makes me distrust the publisher altogether.

Short review of the CSV ontology

This is interesting: the CSV ontology to describe the columns of a CSV file and the file itself. I can definitely see the value in rich descriptions of CSV files, or spreadsheets in general. But I’m also really tempted to ask “if you use RDF for the file ‘header’, why not the ‘body’ too?” There are various csv2rdf tools (although I haven’t used any), but TabLinker is the only one I know my colleagues are working on 🙂

Then I thought of file size: even Turtle files can easily grow larger than CSV files containing the same values. Moreover, support for CSV is more widely available, isn’t it?

The example use case (labeling and describing a file and its columns) also reminded me of ARFF, which embeds some metadata (comments on file level and field name & data type on column level) and allows sparse data, which could save bytes. But allowing only ASCII in the file makes the format pretty outdated. The XML-based XRFF allows the use of other encodings.

The CSV ontology itself needs a little revision, as some of the definitions are unclear (to me, at least), and the example CSV document contains spaces after commas (leading spaces are part of the field, according to RFC 4180). As an example of unclarity: the definition of the property mapsTo is “Which RDF class values in the column map to”, which may suggest the range of the property is rdfs:Class, but the examples all have a property as the object of mapsTo. If this actually means the range is rdfs:Property, and if my understanding is correct that you could create triples following the pattern <[subject]> <[column mapped property]> <[value in cell]>, it is still unclear what the subject of the triple would be. No property is defined that could designate a column as the subject of the triple pattern. I guess that is not trivial to define.
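My reading of that triple pattern can be sketched as follows. The column-to-property mapping and the vocabulary are made up, and minting one subject per row number is only one possible answer to the open question of what the subject should be.

```python
import csv
import io

# Hypothetical column-name -> RDF-property mapping, in the spirit
# of the ontology's mapsTo (vocabulary invented for illustration).
maps_to = {"name": "foaf:name", "height": "ex:height"}

# No spaces after commas, per RFC 4180.
data = "name,height\nLeonardo,1.75\n"

triples = []
for i, row in enumerate(csv.DictReader(io.StringIO(data))):
    # The ontology leaves the subject undefined; here we simply
    # mint one identifier per row, which is just one choice.
    subject = f"ex:row{i}"
    for column, prop in maps_to.items():
        triples.append((subject, prop, row[column]))

for t in triples:
    print(t)
```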

Suddenly I’m reminded of Karma, which interactively, supported by machine learning, can create mappings for the columns of a CSV file to RDF. Wonder if its mappings can be mapped to the CSV ontology?

Response to “Three reasons why the Semantic Web has failed”

Posted as a comment (but at the time of posting it is still awaiting moderation).

I’d like to disagree with most of the article. Your argument “the Semantic Web has failed” does not follow from your “reasons”.
Sure, I’m pretty familiar with the Semantic Web and able to understand RDF (really, it’s not impossible to understand) and (most of) OWL, but that is not why I think a Synaptic Web can live next to a Semantic Web. To start: wouldn’t it be great for your streaming web interpreters to be presented with structured information next to unstructured text? Let it live on top of the Semantic Web (and the rest of the Web).

Do you want to exclude facts from knowledge? I, too, couldn’t care less about Leonardo da Vinci’s height, but if I see the Mona Lisa in Paris, I might want to know what else he painted and did and where I can see that. You need boring facts for that. Boring, but useful facts.
For human consumption, “messages” are only part of knowledge. Take science, for example: science doesn’t only live in conversation; loads of scientific knowledge is transferred in documents.

The Semantic Web doesn’t depend on XML. Or JSON – although JSON-LD is gaining a lot of ground. Human end users shouldn’t need to see raw facts in any text format; only developers should. Turtle is the easiest to read and write by hand, I think, but eventually programmers will do that just as rarely as they read and write JSON.

We’re still a long way from having phones that measure brain activity to decipher our thoughts before they become pieces of knowledge consisting of concepts and, err, facts about things we do, want, and feel. For the sake of my privacy, I’d like my phone not to push my thoughts and activities to the Synaptic Web. It could ask specific questions of the Web that I would like answered, but those questions are likely to be based around concepts, time, and place (“what museums are open around here tomorrow?”). That almost works already, and looks like keyword search.

I like the vision of a Synaptic Web (I heard the term for the first time today), but the claim that the Semantic Web has failed because people actually want a Synaptic Web was not proven today.