I Can See The Music - The TasteMapper Lab Experiment
By Kevin Moot
My name is Kevin Moot and I am a Senior Software Developer at The Nerdery. I was fortunate to be involved in one of the first projects of the Nerdery Labs program.
The Nerdery Labs program is in the same vein as Google's famous "20% time" (in which Google developers are offered an opportunity to invest 20% of their time at work fostering their own personal side-projects). Our Nerdery Labs program presents a great opportunity for developers to bring their own personal projects to life and play with new technologies, creating experiments ranging from software "toys" to projects with potential business applications.
Our concept was ambitious: create a diagram of the entire musical universe.
I teamed up with Andrew Golaszewski, one of the Nerds in our User Experience department, to conceptualize how we could visualize a musical universe in which users could explore an interconnected set of musical artists and genres, jumping from node to node in much the same way that your curiosity might start you on a Wikipedia article on "Socioeconomics of the Reformation Era" and somehow land you on a video clip of "The Funniest Baby Sloth Video Ever!"
Assuming there was some huge data set out there which would give us insights into the composition of people's personal musical libraries and playlists, could we find out which genres and artists are most commonly associated with one another? Are fans of Ozzy Osbourne likely to see Simon and Garfunkel popping up in their listening history? What other artists should a loyal follower of Arcade Fire be listening to? Do certain genres display more listener "stickiness" than others – that is, do fans of pop music statistically branch out more often to other genres than devotees of death metal?
To narrow this ambitious plan down to a more reasonable, bite-sized problem set, we decided to concentrate on depicting a single central artist at a time as a "root" node. Connected to this root node would be a set of the most closely related artists.
Thus was born the TasteMapper experiment.
Our first task was to find a suitable data source. We needed a large, public data set that would clue us in to the public's listening habits. On top of that, the data needed to be clean and complete – spotty data with missing, duplicated, or incorrect information would not meet our standards.
Our first instinct was to turn to popular streaming music services such as Pandora, Spotify, and Last.fm. Of these, it turned out that only Last.fm has a public-facing API which developers can use to extract data about listening habits. A beam of light spontaneously opened up from the heavens. Here was a source that could tell us about the listening habits of 5 million Last.fm users. Thanks to Last.fm's popular "Scrobbling" feature, every track Last.fm users have added to their libraries was freely exposed to us, along with the number of times each track has been played!
I promptly spun up a database instance and created a .NET process to download data from Last.fm's official API, grabbing detailed information about users, tracks, and artists. The total time to transfer all of the data I needed was limited by the speed of Last.fm's network, which turned out to be about 14 users' worth of listening data per minute.
I left the download process running day and night continuously for four weeks, ending up with a list of 1 million tracks taken from the libraries of 560,602 users. This funneled down into a total of 245,234 unique artists.
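As a rough sanity check, the download rate and run time quoted above are consistent with the final user count. The arithmetic below assumes a steady 14 users per minute for exactly 28 days, which is a simplification; the real process presumably had pauses and retries.

```python
# Back-of-the-envelope check of the download figures quoted in the text.
USERS_PER_MINUTE = 14      # rate reported in the article
DAYS = 4 * 7               # "four weeks", assumed here to be exactly 28 days
REPORTED_USERS = 560_602   # user count reported in the article

minutes = DAYS * 24 * 60
estimated_users = USERS_PER_MINUTE * minutes

print(minutes)           # 40320
print(estimated_users)   # 564480
```

The estimate lands within about one percent of the reported 560,602 users.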
The next challenge was: how can we process this data into something usable?
We first tackled the issue of categorizing each of the artists by genre. Interestingly, Last.fm does not label each artist by genre; rather, we had to rely on Last.fm's unique system of artist tagging. Each Last.fm user can arbitrarily tag an artist with any label they would like. For instance, Eminem might be tagged as rap, hip-hop, and male vocalist. Since Last.fm's system of tagging artists is essentially crowdsourced, we came across some interesting anomalies: tags such as "Russian funeral doom" and "vegetarian progressive grindcore" randomly pervaded a number of artists in the data set.
We assigned each artist to one of 14 major musical genres based on an intelligent search through that artist's tags for matching keywords. The aforementioned oddball tags were either pigeonholed into one of our 14 main genres or ignored by rulesets which threw out nonsense tags.
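The keyword search described above might look something like the following sketch. The genre names, keyword lists, and the `assign_genre` function are all hypothetical; the article does not list the actual 14 genres or the rulesets, so this only illustrates the general idea of voting on a genre from free-form tags.

```python
from collections import Counter

# Hypothetical keyword rules. The real project used 14 genres whose
# keyword lists are not given in the article.
GENRE_KEYWORDS = {
    "hip-hop": ["rap", "hip-hop", "hip hop"],
    "metal": ["metal", "doom", "grindcore"],
    "rock": ["rock", "indie", "punk"],
}

def assign_genre(tags):
    """Return the genre whose keywords match the most tags, or None."""
    votes = Counter()
    for tag in tags:
        lowered = tag.lower()
        for genre, keywords in GENRE_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                votes[genre] += 1
    if not votes:
        return None  # tags that match no keyword are simply ignored
    return votes.most_common(1)[0][0]

print(assign_genre(["rap", "hip-hop", "male vocalist"]))  # hip-hop
print(assign_genre(["Russian funeral doom"]))             # metal
print(assign_genre(["seen live"]))                        # None
```

Plain substring matching is crude (a tag like "trap" would also match "rap"), so a real ruleset would likely need word boundaries or an explicit ignore list.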
Now that we had the data, we needed to create a process which would output the percentage of listeners to artist A who also had artist B in their library – for the set of all artists. That is, each of the 245,234 artists was compared with every other artist in the set.
This is a big set of information – with 245,234 artists, there are roughly 30 billion possible pairs. To compound matters, the relationship needed to be bi-directional: for example, 40% of Beatles listeners also have Michael Jackson in their library, but 59% of Michael Jackson listeners have the Beatles in their library. So, we are actually dealing with something on the order of 60 billion combinations.
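To make the bi-directional relationship concrete, here is a minimal sketch of the calculation on a handful of made-up user libraries. The data and the `overlap_percentages` function are illustrative assumptions, not the project's actual code, which ran as database queries.

```python
from collections import defaultdict
from itertools import permutations

# Toy libraries: user -> set of artists. Entirely made-up data.
libraries = {
    "user1": {"The Beatles", "Michael Jackson"},
    "user2": {"The Beatles", "Michael Jackson"},
    "user3": {"The Beatles"},
    "user4": {"The Beatles", "Coldplay"},
    "user5": {"Michael Jackson", "Coldplay"},
}

def overlap_percentages(libraries):
    """Map (a, b) -> percentage of a's listeners who also have b."""
    listeners = defaultdict(int)  # artist -> number of users with that artist
    shared = defaultdict(int)     # (a, b) -> number of users with both
    for artists in libraries.values():
        for artist in artists:
            listeners[artist] += 1
        # permutations yields both (a, b) and (b, a)
        for a, b in permutations(sorted(artists), 2):
            shared[(a, b)] += 1
    return {(a, b): 100.0 * n / listeners[a] for (a, b), n in shared.items()}

pct = overlap_percentages(libraries)
print(round(pct[("The Beatles", "Michael Jackson")], 1))  # 50.0
print(round(pct[("Michael Jackson", "The Beatles")], 1))  # 66.7
```

The same two shared listeners produce different percentages in each direction because the denominators differ: four users have the Beatles, but only three have Michael Jackson.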
The sheer volume of data presented a performance bottleneck. Early database queries threatened to take over a week to complete, so the first order of business was to incorporate some best practices for optimal database performance, such as properly indexing each table and ensuring queries were extracting only the minimum amount of necessary data into temporary tables along each step of processing. This helped considerably, but solid days of processing time still faced us.
Is this approaching Big Data territory? In the spirit of Big Data, we ideally did not want to limit the data we were considering (e.g. by taking a smaller random sampling of users) – it was important for us to look at "the whole" of the data in order to extract the correlations we were looking for.
In order to efficiently process this data, we could have certainly rethought our approach entirely and utilized specialized software for processing Big Data, such as Apache Hadoop. This would've enabled us to work with a very large dataset by breaking it down into smaller, manageable chunks that could be processed by hundreds of servers running in parallel.
While this approach would be great for a robust production environment, it was beyond the scope of this experiment. We instead opted to reduce the data set to manageable levels by purging the data set of very unpopular artists and/or fringe artists that a user had listened to only a very small number of times. In the end, the omissions allowed us to process several million artist combinations using a traditional SQL Server database in a matter of hours rather than days or weeks.
With our finely-finessed data set, we now faced the prospect of creating an appealing visualization. Before even one line of code was written, we approached the UX issues head-on: in what way could we depict a strong artist correlation versus a weak correlation? How could we visually encode the genre of each artist? How could we play with the meaning of colors and sizes? We clearly needed a strong graphics platform to bring our directed graph to life.
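The pruning step can be sketched as two filters. This is a Python stand-in for what the article did in SQL Server; the thresholds (`min_plays`, `min_listeners`) and the `prune` function are hypothetical, since the article does not state the actual cut-offs.

```python
from collections import Counter

def prune(plays, min_plays=5, min_listeners=2):
    """Filter a list of (user, artist, play_count) rows.

    Thresholds are illustrative assumptions, not the project's values.
    """
    # 1. Drop artists that a given user has barely listened to.
    kept = [(u, a, n) for (u, a, n) in plays if n >= min_plays]
    # 2. Count distinct listeners per artist among the remaining rows.
    pairs = {(u, a) for (u, a, _) in kept}
    listeners = Counter(a for (_, a) in pairs)
    # 3. Drop artists with too few listeners overall.
    return [(u, a, n) for (u, a, n) in kept if listeners[a] >= min_listeners]

plays = [
    ("u1", "Coldplay", 40),
    ("u2", "Coldplay", 12),
    ("u1", "Obscure Band", 30),  # only one listener: dropped
    ("u2", "Arcade Fire", 1),    # a single play: dropped
    ("u3", "Arcade Fire", 9),    # now the only Arcade Fire listener: dropped
    ("u3", "Coldplay", 2),       # too few plays: dropped
]
print(prune(plays))
```

Because the number of pairs grows with the square of the number of artists, even a modest cut in the artist count shrinks the pairwise comparison work considerably.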
On top of this, we wanted an easy way to navigate through the cloud of artists, and a simple search feature to allow users to find a specific artist.
To implement the search functionality, we used the new HTML5
In practice, it has a few flaws: it is not supported by all modern browsers, and it is not especially robust, in that it requires the user to type the first letters of the term exactly (e.g., typing "Beatles" would not find any matching term, but "The Beatles" would work).
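The limitation described above comes down to prefix matching versus substring matching. The sketch below does not reproduce any browser's behavior; it only contrasts the two strategies on a made-up artist list, with hypothetical function names.

```python
ARTISTS = ["The Beatles", "Beastie Boys", "Coldplay", "Arcade Fire"]

def prefix_matches(query, names):
    """Names that start with the query (case-insensitive)."""
    q = query.casefold()
    return [n for n in names if n.casefold().startswith(q)]

def substring_matches(query, names):
    """Names that contain the query anywhere (case-insensitive)."""
    q = query.casefold()
    return [n for n in names if q in n.casefold()]

print(prefix_matches("Beatles", ARTISTS))     # []
print(prefix_matches("The Bea", ARTISTS))     # ['The Beatles']
print(substring_matches("Beatles", ARTISTS))  # ['The Beatles']
```

A custom search box backed by substring matching would sidestep the "The Beatles" problem, at the cost of writing and maintaining the suggestion list yourself.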
For the visualization of artist relationships, we ended up going with a directed graph, with the root artist appearing in the middle of the page branching out to the 30 most strongly correlated artists. The user could navigate and explore additional artists by clicking on any of the ancillary artist branches.
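One simple way to arrange such a graph is to pick the strongest neighbours and space them evenly on a circle around the root. The sketch below assumes that layout; the article does not say how TasteMapper actually positioned its nodes, and `radial_layout` and its parameters are hypothetical.

```python
import math

def radial_layout(correlations, top_n=30, radius=300.0, center=(400.0, 400.0)):
    """Place the top_n strongest neighbours evenly on a circle.

    correlations: dict mapping artist name -> correlation percentage.
    Returns a list of (name, x, y, strength) tuples, strongest first.
    """
    strongest = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    cx, cy = center
    placed = []
    for i, (name, strength) in enumerate(strongest):
        angle = 2 * math.pi * i / len(strongest)
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        placed.append((name, x, y, strength))
    return placed

# Made-up correlation percentages for a single root artist.
layout = radial_layout({"A": 59.0, "B": 40.0, "C": 12.0, "D": 5.0},
                       top_n=2, radius=100.0, center=(0.0, 0.0))
for name, x, y, strength in layout:
    print(name, round(x, 1), round(y, 1), strength)
```

Clicking a neighbour would then amount to re-running the layout with that artist as the new root.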
Sometimes overshadowed by the buzz of HTML5, Scalable Vector Graphics (SVG) has been around a long time, and is a great way of drawing lines and shapes in the browser. This underutilized and underappreciated tool turned out to be perfect for this application. All modern browsers support SVG. Solving issues is a breeze, since each of the graphical SVG elements can be easily viewed and modified via the web browser's built-in debugging tools. SVG also offers a set of interesting filter effects which we could utilize if we so wished.
We initially researched some SVG-based visualization and charting libraries such as d3, which offers some great functionality, but in the end I felt it was a bit overkill for depicting a relatively simple collection of primitive shapes and lines. If we wanted to support older browsers (such as Internet Explorer 8), a library such as Raphaël would be attractive, but in the end we didn't want the experience to be weighed down by the least common denominator of browsers.
We proceeded to put together some early prototypes of the directed graph using "plain old" native SVG without any additional libraries. The results were encouraging, but not without some caveats. The most significant downside to SVG was performance – once we cranked up the number of artists displayed on-screen at once from 30 nodes up to hundreds of nodes, the framerate bogged down considerably. Although I would consider using HTML5 Canvas as a stronger solution to achieve the fastest flat-out performance possible, SVG still made sense as a pragmatic approach to our more focused model.
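To give a sense of how little markup a "plain old" SVG graph needs, here is a sketch that builds the SVG document as a string. The original prototype was written for the browser; this Python version, the `render_svg` function, and the colors, radii, and line-width scale are all illustrative assumptions.

```python
from xml.sax.saxutils import escape

def render_svg(root, neighbours, size=800):
    """Build SVG markup for a root node and its neighbours.

    neighbours: list of (name, x, y, strength), strength in 0..100.
    """
    c = size / 2  # the root sits in the middle of the canvas
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for name, x, y, strength in neighbours:
        # Stronger correlation -> thicker line (arbitrary scale).
        width = 1 + strength / 20
        parts.append(f'<line x1="{c}" y1="{c}" x2="{x}" y2="{y}" '
                     f'stroke="#999" stroke-width="{width}"/>')
        parts.append(f'<circle cx="{x}" cy="{y}" r="12" fill="#69c">'
                     f'<title>{escape(name)}</title></circle>')
    # Draw the root last so it sits on top of the connecting lines.
    parts.append(f'<circle cx="{c}" cy="{c}" r="20" fill="#c33">'
                 f'<title>{escape(root)}</title></circle>')
    parts.append('</svg>')
    return "\n".join(parts)

svg = render_svg("Coldplay", [("Simon & Garfunkel", 700.0, 400.0, 40.0)])
print(svg)
```

Each artist is an ordinary element in the document, which is what makes SVG easy to inspect in browser tools, and also why hundreds of nodes start to weigh on the framerate.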
To add icing to the cake, we added some animated dazzle to the whole experience by bringing in the GreenSock Animation Platform, which offers an excellent tweening engine. Just like one would animate a native DOM element, each SVG element exposes a number of properties (x, y, radius, opacity, etc.) which are extremely easy to hook into an animation/tweening engine.
During the conception of this experiment, our initial hypothesis was that artists would be most strongly correlated in terms of their similarity. That is, artists with similar genres, attitudes, and styles would appear most strongly connected with one another.
Our findings instead revealed that popularity trumps similarity, which actually makes a lot of sense – the mere fact that an artist has hit mega super-star status means that they will appear strongly correlated with just about every other artist. For example, a sobering number of artists were strongly connected with the popular artist Coldplay, regardless of whether or not Coldplay had any musical similarity to the other artist in question.
This also brings up an interesting point – is there inherent bias in the data? Although we strove to avoid a small random sample, we are still only considering listening data from users of the Last.fm service, whose listeners tend to skew younger and more tech-savvy. And, just like the great migration of social networkers from MySpace to Facebook, the introduction of newer music services such as Spotify (launched in 2008) means that many users of Last.fm (launched waaay back in 2002) have probably already moved on to the "next big thing" and no longer have active/current playlist information in their Last.fm records.
We've barely scratched the surface with this experiment – with this rich Last.fm data set on hand, the most exciting possibilities are still to come. We can analyze trends from different angles, perhaps zooming out to a 10,000-foot "genre-centric" view of the musical universe. We could also cross-reference artist and genre relationships using user demographics, such as age and location.