Hot topic: trust & quality in science, science publishing et al.

While I’m preparing a presentation on standards for and certification of Trustworthy Digital Repositories – which is all about demonstrating the trustworthiness of DRs – for a workshop on preservation metadata, others are discussing trust and quality too. (This is not an extensive or necessarily balanced review – just what caught my attention.)

Richard Smith asks how researchers can be judged on the quality of their work, rather than its supposed impact. Neither Impact Factor nor Altmetrics should be used as a metric to judge a researcher’s performance, he argues.

Jeffrey Beall doesn’t trust the intentions (and with them, the quality) of another publisher, but Peter Murray-Rust disputes Beall’s conclusions because the quality of the reasoning is sub-par. This is a good debate to have in general – I think the trust in and quality of reviews are important enough to discuss in the context of science publishing.

After a sting that showed many Open Access academic journals were keen on publishing bogus science for money, recently two major academic publishers removed bogus papers from their collections. Was there peer review in these cases? If there was, its quality was far too low.

Therefore you should be able to review the reviews too. SciRev lets researchers do just that: the quality and speed of a journal’s review process can be rated, together with the outcome of the review (accepted, rejected, withdrawn). Alternatively, the quality of peer review can be expressed in a number called preSCORE, according to preSCORE (Inc.?). I’m not sure either method is sufficient for judging a journal.

Finally, for now, some are scrutinising the whole system of academia and/or science publishing. Sydney Brenner talks to Elizabeth Dzeng in King’s Review about this, Michael White writes about it in Pacific Standard, and Robbert Dijkgraaf, in his (paywalled) column in NRC Handelsblad, compared publishers’ Big Deals to a hypothetical supermarket forcing customers to buy the store’s entire contents.

Will things change now?

Update, 2014-03-05: yesterday my (now former) colleague Frank van der Most presented some of his results from the Europe-funded research project ACUMEN (Academic Careers Understood through Measurement and Norms). He interviewed academics at different levels of seniority, as well as deans and HR managers, about the role of research data sharing in evaluations of researchers. One thing is certain: it is not yet part of standard evaluations.

Can you please remove ‘meaningful punctuation’ from field contents, librarians?

Dear Cataloguing librarians,

It is time to realise that using punctuation to mark sub-field boundaries is bad practice. Do not put the title and “responsible entity” in one field and then try to split the field contents on punctuation like “ / ”. Do not use an author’s full name in reverse order plus year of birth plus year of death (if applicable) to identify a person – certainly not if you also allow an optional “.”.

You need to understand: the machines are not smart enough yet to understand your cataloguing rules and therefore they don’t get the meaning of what you put in the fields. Even the ones at OCLC are not smart enough yet.

What drove me to write this was this example: Linked Data about Works published by OCLC. It is buzzing and – I agree with Ed Summers – pretty cool. The data structure and semantics can be improved, as Richard Wallis of OCLC says in a blog post. The example that Ed used in his blog post, “Weaving the Web” by Tim Berners-Lee, demonstrates my issue (which is not touched upon by Ed or Richard).

The work’s title is shown as:

Weaving the Web : the original design and ultimate destiny of the World Wide Web by its inventor /

Yes, Tim himself said he would have gotten rid of the two forward slashes after http: in URIs, had he had the chance to start over, but the slash at the end of the title was not Tim’s intent. I bet you put “Tim Berners-Lee” or even “by Tim Berners-Lee” after that slash in the 245 field of your MARC record.
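To make the point concrete, here is a toy sketch (with made-up field strings – this is not OCLC’s actual code) of what happens when software tries to split a 245-style field on ISBD punctuation:

```python
def split_245(field: str):
    """Naively split a 245-style string into (title, statement of
    responsibility) on the first ISBD " / " separator."""
    title, _, responsibility = field.partition(" / ")
    return title.strip(), responsibility.strip() or None

# The common case happens to work:
split_245("Weaving the Web / by Tim Berners-Lee")
# ... but a title that itself contains " / " is cut in the wrong place:
split_245("Read / write : a web memoir / by A. Author")
```

The second call returns “Read” as the title – the machine cannot tell a separator slash from a slash that belongs to the title.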

Second point, from the same example, the authors. And contributors, and creators. I know the temporary URIs will be replaced by VIAF URIs, but OCLC will still need to map…

"creator" : [ "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim", "http://experiment.worldcat.org/entity/work/data/27331745#Person/berners_lee_tim_1955" ]

… to the one and only Tim Berners-Lee (who co-authored the book). In this example that should be easy, as there aren’t many people called Tim Berners-Lee on the planet and there is only one with a very strong connection to “the Web”, but the general case is not that simple. (You need context for that, and even then there is a chance that you make incorrect matches. You may find some context for this in my thesis.)

I’ll come back to you in some time to see how you’re getting on with fixing all of this. I’m counting on you!

Cheers,

Ben

Short review of the CSV ontology

This is interesting: the CSV ontology to describe the columns of a CSV file and the file itself. I can definitely see the value in rich descriptions of CSV files, or spreadsheets in general. But I’m also really tempted to ask: “if you use RDF for the file ‘header’, why not for the ‘body’ too?” There are various csv2rdf tools (although I haven’t used any), but TabLinker is the only one I know my colleagues are working on 🙂

Then I thought of file size: even Turtle files can easily grow larger than CSV files containing the same values. Moreover, support for CSV is more widely available, isn’t it?

The example use case (labeling and describing a file and its columns) also reminded me of ARFF, which embeds some metadata (comments on file level and field name & data type on column level) and allows sparse data, which could save bytes. But allowing only ASCII in the file makes the format pretty outdated. The XML-based XRFF allows the use of other encodings.

The CSV ontology itself needs a little revision, as some of the definitions are unclear (to me, at least), and the example CSV document contains spaces after commas (leading spaces are part of the field, according to RFC 4180). As an example of unclarity: the definition of the property mapsTo is “Which RDF class values in the column map to”. This suggests the range of the property is rdfs:Class, yet the examples all have a property as the object of mapsTo. Assuming the intended range is rdf:Property, and assuming my understanding is correct that you could create triples following the pattern <[subject]> <[column mapped property]> <[value in cell]>, it is still unclear what the subject of the triple would be. There is no property defined that lets you specify the subject of the triple pattern for a column. I guess it is not trivial to define.
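To illustrate the triple pattern above, here is a minimal sketch that turns CSV rows into N-Triples. The column-to-property mapping and the per-row subject URIs are my own assumptions (the subject-minting step is exactly the piece the ontology leaves undefined), and literal escaping is omitted for brevity:

```python
import csv
import io

# Hypothetical mapping from column name to an RDF property, in the
# spirit of the mapsTo examples (a property, not a class).
COLUMN_MAP = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage",
}

def rows_to_ntriples(text: str, base: str = "http://example.org/row/"):
    """Emit one N-Triples line per mapped cell, minting a subject
    URI per row from an assumed base URI."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        subject = f"<{base}{i}>"
        for column, prop in COLUMN_MAP.items():
            if row.get(column):
                triples.append(f'{subject} <{prop}> "{row[column]}" .')
    return triples
```

For a two-column file this yields two triples per row, but the choice of `base` is arbitrary – which is precisely the gap in the ontology.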

Suddenly I’m reminded of Karma, which can interactively create mappings from the columns of a CSV file to RDF, supported by machine learning. I wonder whether its mappings could be expressed in the CSV ontology?

Bye, LinkedIn!

After an email saying “Your contact, X, has just joined LinkedIn” (with “X” unknown to me), I went digging through my address book. Well, what do you know: X, someone who once enquired about a room, is listed under Other contacts in Gmail. We emailed at the end of 2008, and somewhere in between I handed LinkedIn my address book once. I feel dirty.

But I can no longer remove my once-imported address book from LinkedIn. The links that seem to lead to pages where that is possible (“imported contacts”, “organize contacts”) redirect me to the general “contacts” page. That leaves only one other way.

Bye, LinkedIn!

 

(Update: added “(with ‘X’ unknown to me)” to the first sentence and changed “contacts” to “the general ‘contacts’ page”.)

The pain of plain (text emails)

Since the introduction of HTML email (which must have been before my first encounter with email), and especially now that internet speeds have gone up and computers are fast enough to render HTML, email has effectively become HTML.

Since I switched to only looking at the plain text version of multipart/alternative emails, I see that HTML has become the default for many corporate emails.

I see HTML entities: &copy; and &nbsp;, for which Unicode code points and UTF-8 representations exist.
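Those entities stand for real characters that a plain text part should simply contain. A quick sketch with Python’s standard library (the company name is made up) shows the conversion these senders skip:

```python
from html import unescape

# unescape() turns HTML entities into the characters they name:
# "&copy;" becomes U+00A9 and "&nbsp;" becomes U+00A0.
text = unescape("&copy; 2013 Example Corp&nbsp;B.V.")
print(text)
```

The plain text part should carry the resulting string, UTF-8 encoded, not the entities.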

I see:

Click here for more.

… without any URL that my email client could make clickable.

I see the opposite:

More URL than text
More URL than text

I see CSS, with hacks:

/* Client-specific Styles */
#outlook a{padding:0;} /* Force Outlook to provide a "view in browser" button. */
body{width:100% !important;} .ReadMsgBody{width:100%;} .ExternalClass{width:100%;} /* Force Hotmail to display emails at full width */
body{-webkit-text-size-adjust:none;} /* Prevent Webkit platforms from changing default text sizes. */
/* Reset Styles */
body{margin:0; padding:0;}
img{border:0; height:auto; line-height:100%; outline:none; text-decoration:none;}
table td{border-collapse:collapse;}
#backgroundTable{height:100% !important; margin:0; padding:0; width:100% !important;}
p {margin-top: 14px; margin-bottom: 14px;}
/* ////////// STANDARD STYLING: PREHEADER ////////// */
.preheaderContent div a:link, .preheaderContent div a:visited, /* Yahoo! Mail Override */ .preheaderContent div a .yshortcuts /* Yahoo! Mail Override */{
color: #3b6e8f;
font-weight:normal;
text-decoration:underline;
}
.mainContent a:link, a:visited{
color:#336699;
}
/* ////////// STANDARD STYLING: FOOTER LINKS ////////// */
.footerContent div a:link, .footerContent div a:visited, /* Yahoo! Mail Override */ .footerContent div a .yshortcuts /* Yahoo! Mail Override */{
/*@editable*/ color:#336699;
/*@editable*/ font-weight:normal;
/*@editable*/ text-decoration:underline;
}

And I see HTML:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><title>Uw bestelling bij Ticketscript</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<table width="468">Beste Ben Companjen,

Come on, corporations, I am in business with you (I get these emails because I use your services)! Please put some effort into serving your plain-text email readers.
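It really is little effort. A minimal sketch with Python’s standard email library (addresses and content are hypothetical): set_content() creates the text/plain part and add_alternative() adds the HTML part, producing a proper multipart/alternative message with a readable plain text version.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Your order"
msg["From"] = "shop@example.com"
msg["To"] = "reader@example.com"
# A plain text part with the actual URL, readable in any client:
msg.set_content("Dear reader,\n\nYour tickets: https://example.com/t/123\n")
# The styled HTML version as an alternative, not a replacement:
msg.add_alternative(
    '<p>Dear reader,</p>'
    '<p><a href="https://example.com/t/123">Your tickets</a></p>',
    subtype="html",
)
```

Any mailing toolkit I know of offers the same two-part structure; there is no excuse for an empty or broken text/plain part.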

Universiteit Twente wants to do more with its academic heritage, but doesn’t know how yet

In the November issue (PDF) of UT Nieuws, several employees of the Universiteit Twente talk about the academic heritage of the now 52-year-old university. The article Een atlas uit 1650 en een computer van ƒ100.000,-: De universiteit inventariseert haar academisch erfgoed (“An atlas from 1650 and a computer that cost ƒ100,000: the university takes stock of its academic heritage”, pp. 26-27) covers the Boot collection (15,000 old books, including a Blaeu atlas from 1650) and the study collection of old computers and (measuring) equipment. The UT’s list of academic heritage also includes “the digital photo and film collection of the Beeldbank, the photo archive and some old devices of the ITC faculty [and] digital photos and films of the GW faculty”.

Opening up the collection is not straightforward. The Boot collection is too valuable and fragile to exhibit publicly. And exhibiting the equipment costs money, which would come at the expense of teaching and research.

The university is considering joining the Stichting Academisch Erfgoed, in which several universities exchange knowledge. But that, too, costs money.

Brinksma: ‘The foundation brings rules and costs with it. We have to weigh whether participation offers us enough benefits. We have not made a decision on this yet.’ In any case, Brinksma believes, the UT has a responsibility to preserve its own academic heritage, whether it is on the foundation’s list or not. ‘Academic heritage represents ideas and developments that matter to the university. I think we should be the natural guardian of these traces of history in our own fields. They allow us to tell the story to later generations.’

This story reminds me more than a little of the visitor centre/museum that I blogged about and was interviewed about (ah well, I wasn’t the first with ideas, was I?). Unlike my proposal, this article is only about academic heritage, and solar cars apparently don’t count as such.

I’m curious: are we a step closer to a museum?

Response to “Three reasons why the Semantic Web has failed”

Posted on http://gigaom.com/2013/11/03/three-reasons-why-the-semantic-web-has-failed/ as a comment (but at the time of posting it is still awaiting moderation).

I’d like to disagree with most of the article. Your argument “the Semantic Web has failed” does not follow from your “reasons”.
Sure, I’m pretty familiar with the Semantic Web and able to understand RDF (really, it’s not impossible to understand) and (most of) OWL, but that is not why I think a Synaptic Web can live next to a Semantic Web. To start: wouldn’t it be great for your streaming web interpreters to be presented with structured information next to unstructured text? Let it live on top of the Semantic Web (and the rest of the Web).

Do you want to exclude facts from knowledge? I, too, couldn’t care less about Leonardo da Vinci’s height, but if I see the Mona Lisa in Paris, I might want to know what else he painted and did and where I can see that. You need boring facts for that. Boring, but useful facts.
For human consumption “messages” are only part of knowledge. Take science for example. Science doesn’t only live in conversation; loads of scientific knowledge is transferred in documents.

The Semantic Web doesn’t depend on XML. Or JSON – although JSON-LD is gaining lots of ground. Human end users shouldn’t need to see raw facts in any text format, only developers. Turtle is the easiest to read and write by hand, I think, but eventually programmers will do that just as rarely as they read and write JSON.

We’re still a long way from having phones that measure brain activity to decipher our thoughts before they become pieces of knowledge consisting of concepts and, err, facts about things we do, want, and feel. In light of my privacy, I’d like my phone to not push my thoughts and activities to the Synaptic Web. It could ask specific questions to the Web that I would like answered, but those questions are likely to be based around concepts, time and place (“what museums are open around here tomorrow?”). That almost works and looks like keyword search.

I like the vision of a Synaptic Web (I heard the term for the first time today), but the claim that the Semantic Web has failed because people actually want a Synaptic Web was not proven today.

Alternatives for “blog post”

If you run or write for a blog, or talk about blogs, you’ll want to refer to the content of blogs. Publications on blogs (perhaps mostly outside news organisation blogs) are commonly referred to as “blog posts” (I guess news organisations may call online-only content “articles” too).

Because I don’t particularly like this phrase, I had a little brainstorm and wrote down some alternatives. In order of thought:

  • bulletin
  • alert
  • post
  • publication
  • letter
  • essay
  • entry
  • issue
  • report
  • note
  • proceeding
  • paper
  • review
  • record
  • part
  • article
  • brief
  • column
  • opinion
  • memo
  • memorandum
  • view
  • announcement
  • document
  • thing
  • story
  • text
  • section
  • chapter
  • blog
  • message
  • line
  • piece
  • memoir(s)
  • item
  • journal entry

I am still working (a bit) on a redesign and custom theme for this blog, but have already decided to go with “message” in the abbreviated form “Mess.”. Lots of “mess.-es” create a mess, which translates to Dutch as “Bende”. This concludes another mess. for this mess.

Workflow for saving song ratings from iTunes to MusicBrainz

If you, like me, a) have a large collection of music files on your computer, b) manage (and play) them through iTunes and c) have a MusicBrainz account to do some metadata normalisation, have you thought of this?

Suppose you want to export your ratings to the MusicBrainz server so that you can eventually import them into another music player, let MusicBrainz Picard embed your ratings in the files, or just share them (having them on the server but not public is not possible). Perhaps this could work:

  1. Use MusicBrainz Picard to tag MP3 files with MusicBrainz recording (previously track) identifiers
  2. Export the iTunes library – it will now be a plist XML file
  3. Optional(?): Convert the exported iTunes library to normal XML
  4. Extract a list of file names and MusicBrainz identifiers from the files
  5. Extract a list of file names and ratings from the (converted, if necessary) iTunes library
  6. For each MusicBrainz identifier that has a rating, send the rating to the MusicBrainz server

Why haven’t I done this yet? Well, priorities lie elsewhere. I’m not convinced I want all my ratings online, although I may want to use my rating information in other applications in the future. Perhaps they’re of use during the Public Broadcasting hackathon on November 9 – what if an application could match me to a radio station playing the music I like?

Step 1 will take a lot of time, because my collection is far from fully covered by MusicBrainz. Step 2 is easy. Step 3 has at least a potential solution. Step 4 can be handled by a shell script that I don’t have yet. Step 5 can be performed by an XQuery or XSLT transformation. Step 6 needs a script that calls the MB REST service with input.
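For step 5, an alternative to XQuery or XSLT would be Python’s standard plistlib. A minimal sketch, assuming the exported library follows the usual layout of a top-level "Tracks" dictionary whose entries carry "Location" (a file URL) and, when set, a "Rating" from 0 to 100:

```python
import plistlib

def extract_ratings(library_xml: bytes) -> dict:
    """Map file locations to ratings from an exported iTunes
    library plist; tracks without a rating are skipped."""
    library = plistlib.loads(library_xml)
    ratings = {}
    for track in library.get("Tracks", {}).values():
        if "Location" in track and "Rating" in track:
            ratings[track["Location"]] = track["Rating"]
    return ratings
```

Joining this dictionary with the file-to-identifier list from step 4 would then give the identifier-to-rating pairs that step 6 needs.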

Or are there better ways?

Police online, but don’t hack me

I wanted to write a piece in response to Rosanne Hertzberger’s column, with somewhat substantiated arguments for not being entirely in favour of Minister Opstelten’s hacking proposal. In last weekend’s NRC she argued that the police should be allowed to hack in order to find evidence of, for example, possession of child pornography. According to her, Bits of Freedom is also too quick to shout “no, the hacking proposal is stupid!” without offering alternatives.

Child abuse and child pornography must be fought – I agree. The police should also be able to gather and follow up leads online, just as they can offline. And Bits of Freedom does shout loudly, and in its loudest statements I too sometimes miss the nuance.

There are analogies to be drawn to convey the impact of police hacking, and in my head I was already working on invisible officers in your home and seemingly normal but secretly forced locks. But it seems to me that the force of Rosanne’s point – that the police must be able to hack in order to catch possessors of child pornography – is rather weakened by the news that 144 suspects were visited so that their storage media could be seized, without any hacking. With that, the urgency of an in-depth response to the column is gone for now as well. So, for the reader, and because I don’t have the answers ready either:

  • How does child pornography compare to other ‘misery’ in society?
  • How strong must the suspicion be before the police pay you a visit or intervene?
  • Are the police’s resources, with the Team High Tech Crime and eleven teams fighting child pornography, not strong enough?
  • What do we make of the principles of due process these days?

(Note: I support the work of Bits of Freedom, which is why I have donated to the foundation before.)