Discogs data, MonetDB/XQuery and Gephi

Wow, this was a draft from February of 2011. Time to freshen it up and post 🙂

I love Discogs for letting me catalogue my music while building a database of all musical releases worldwide. I have sometimes wished it were easier to get a high-level overview of the trees of labels and their sublabels, so I finally tried to create such a tree myself. This post describes how I went about it; another idea I have for the release data comes near the end.

Discogs releases its data into the public domain as monthly dumps, which allows anyone to do anything with it. It’s too bad that not all information is exported (barcodes are not included, amongst other things), but there is plenty you can do without those fields.

The Database group of the University of Twente contributed to MonetDB/XQuery, so students in the XML & Databases 1 course are told they can use it to test their XQueries. (Promotional words aside, it does work, and when you spot a bug in the XQuery or PF/Tijah full-text index part, support is within walking distance.)

So, my plan was to create a graph of the parent label-sublabel structure from the XML dump. It’s not too hard: download and unzip the data, load it into MonetDB, and use XQuery to create an XML graph file.

Installing Gephi is easy, as it installs just like any other piece of Windows software. It reads an XML graph format called GEXF, which can specify the size of nodes and the thickness of edges.
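To give an idea of what such a GEXF file looks like, here is a minimal sketch in Python rather than XQuery. It assumes the GEXF 1.2 draft namespaces and uses the `viz:size` element for node size and the `weight` attribute for edges; the function name `build_gexf` and the sample data are made up for illustration and are not what I used back then.

```python
import xml.etree.ElementTree as ET

# Namespaces of the GEXF 1.2 draft format and its "viz" extension.
GEXF_NS = "http://www.gexf.net/1.2draft"
VIZ_NS = "http://www.gexf.net/1.2draft/viz"


def build_gexf(nodes, edges):
    """Return a GEXF document as a string.

    nodes: dict mapping node id -> (label, size)
    edges: list of (source id, target id, weight) tuples
    """
    ET.register_namespace("", GEXF_NS)
    ET.register_namespace("viz", VIZ_NS)

    root = ET.Element(f"{{{GEXF_NS}}}gexf", {"version": "1.2"})
    graph = ET.SubElement(root, f"{{{GEXF_NS}}}graph", {"defaultedgetype": "directed"})

    nodes_el = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    for node_id, (label, size) in nodes.items():
        node = ET.SubElement(
            nodes_el, f"{{{GEXF_NS}}}node", {"id": str(node_id), "label": label}
        )
        # Node size goes into the viz extension element.
        ET.SubElement(node, f"{{{VIZ_NS}}}size", {"value": str(size)})

    edges_el = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for index, (source, target, weight) in enumerate(edges):
        ET.SubElement(
            edges_el,
            f"{{{GEXF_NS}}}edge",
            {
                "id": str(index),
                "source": str(source),
                "target": str(target),
                "weight": str(weight),
            },
        )

    return ET.tostring(root, encoding="unicode")


# Made-up example: one parent label with one sublabel.
print(build_gexf({"a": ("Parent", 3.0), "b": ("Child", 1.0)}, [("a", "b", 2.0)]))
```

Saving the printed output to a file with a .gexf extension should give something Gephi can open, although I have only checked this shape against the format description, not against every Gephi version.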

Installing MonetDB/XQuery was easy, even on a 64-bit Windows 7 machine. Starting and stopping the server is only slightly more complicated than using the Start menu, because I wanted to use ‘discogs’ as the database name and also wanted to pass some options to the mclient command line client.

The labels dataset is 31.2 MB when unpacked and loads easily. A simple XQuery shows there are 189252 labels and 17789 sublabel relations. In Discogs, all entities are distinguished by name, possibly followed by a number in parentheses. A parent label lists its sublabels by name in <sublabel> elements, and each of those sublabels names its parent in a <parentLabel> element as well (when applicable, of course). The connection between parent label and sublabel can thus be found both ways.
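Since the relation can be read from either side, collecting it from both and merging the results is a cheap consistency check. Below is a small Python sketch of that idea (not the XQuery I ran). The sample XML is invented, and its exact nesting is an assumption based on the element names mentioned above; the real dump may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Invented sample; the nesting is an assumption, not a copy of the real dump.
SAMPLE = """<labels>
  <label>
    <name>Parent Records</name>
    <sublabels><sublabel>Child Records</sublabel></sublabels>
  </label>
  <label>
    <name>Child Records</name>
    <parentLabel>Parent Records</parentLabel>
  </label>
  <label>
    <name>Orphan Records (2)</name>
  </label>
</labels>"""


def parent_child_edges(xml_text):
    """Return (number of labels, set of (parent name, sublabel name) pairs)."""
    root = ET.fromstring(xml_text)
    edges = set()
    label_count = 0
    for label in root.findall("label"):
        label_count += 1
        name = label.findtext("name")
        # Direction 1: the parent lists its sublabels by name.
        for sub in label.iter("sublabel"):
            edges.add((name, sub.text))
        # Direction 2: the sublabel names its parent.
        parent = label.findtext("parentLabel")
        if parent:
            edges.add((parent, name))
    return label_count, edges


count, edges = parent_child_edges(SAMPLE)
print(count, sorted(edges))
```

Because the pairs go into a set, a relation recorded on both sides is counted once. For the full 31.2 MB file a streaming parser such as `ET.iterparse` would probably be kinder on memory than parsing everything at once.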

If you thought that BMG or the Warner Music Group had the most sublabels, you were wrong. Although BMG has 78 direct sublabels and Warner has about the same number, the ‘biggest’ parent label is Not On Label. All releases without a clear label go under it or one of its ~1800 sublabels (like “Not On Label (Metallica Self-released)”). This is something to take into account when writing XQueries or rendering a graph.
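One way to take it into account is to drop every relation that touches Not On Label before ranking parents by their number of direct sublabels. The following Python sketch shows the idea on invented pairs; the function name `top_parents` and all label names except the Not On Label ones are hypothetical, and matching on a name prefix is my assumption about how to recognise those entries.

```python
from collections import Counter


def top_parents(edges, skip_prefix="Not On Label", n=3):
    """Rank parent labels by number of direct sublabels.

    edges: iterable of (parent name, sublabel name) pairs.
    Pairs where either name starts with skip_prefix are ignored.
    """
    counts = Counter(
        parent
        for parent, child in edges
        if not parent.startswith(skip_prefix) and not child.startswith(skip_prefix)
    )
    return counts.most_common(n)


# Invented data, apart from the "Not On Label" naming pattern.
edges = [
    ("Not On Label", "Not On Label (Metallica Self-released)"),
    ("Not On Label", "Not On Label (Example Self-released)"),
    ("Not On Label", "Not On Label (Another Self-released)"),
    ("Parent A", "Child A1"),
    ("Parent A", "Child A2"),
    ("Parent B", "Child B1"),
]

print(top_parents(edges))
```

Without the filter, Not On Label would top this list with three sublabels; with it, only the two ordinary parents remain.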

This is about as far as I got then. I may have gotten distracted by a deadline. So where was I?

Looking at the files, I think my goals expanded to creating a graph of my complete collection, including artists, labels and releases. But even for a subset of 60 releases that doesn’t make a pretty graph. So for starters, here are the labels in my collection (as of February 2011), with each node sized relative to the number of releases on that label in the collection. On releases with multiple labels, all labels are counted. (Sonar Kollektiv is the largest node.)
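The counting behind the node sizes can be sketched in a few lines of Python. This is a reconstruction of the idea, not the query I actually used: the function names, the linear scaling between a minimum and maximum size, and the choice to count a label once per release even if it appears twice on it are all assumptions.

```python
from collections import Counter


def label_release_counts(releases):
    """Count, per label, the releases it appears on.

    releases: iterable of lists of label names, one list per release.
    Every label on a multi-label release is counted; a label listed
    twice on the same release counts once (an assumption).
    """
    counts = Counter()
    for labels in releases:
        counts.update(set(labels))
    return counts


def node_sizes(counts, min_size=5.0, max_size=50.0):
    """Scale counts linearly so the most frequent label gets max_size."""
    if not counts:
        return {}
    biggest = max(counts.values())
    return {
        label: min_size + (max_size - min_size) * n / biggest
        for label, n in counts.items()
    }


# Invented collection of four releases with hypothetical label names.
releases = [["Label X"], ["Label X", "Label Y"], ["Label Y", "Label Y"], ["Label X"]]
counts = label_release_counts(releases)
print(counts, node_sizes(counts))
```

The resulting sizes could then be written into the graph file as the size of each label node.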

Graph of record labels and sublabels in my Discogs collection by February 11, 2011

I’ll think about posting more images, but I need to find out what else I have first 🙂