Friday, September 21, 2012

Don't Do Journey: Karaoke and a Data Analysis Musing

"DON'T DO JOURNEY!!" The look of terror and disbelief in her eyes was both sudden and palpable.

What can I say? People feel very strongly about karaoke. Every since this joy/terror was gifted/unleashed upon the world, it seems that there is no shortage of people who have very strong feelings about it.

It's kind of a love/hate relationship. People love it. Or they hate it. Or they love to hate it. Or they hate the fact that they love it. Either way, it's kind of surprising how polarizing it can be.

There's a place here in Toronto that's quite popular for it. Well, actually I don't know how popular it is, but they do have it five nights a week. As I was looking at their website one day, I had one of these oh, neat moments - the contents of their entire karaoke songbook, a list of all 32,636 songs, is available in PDF format.

Slam that into a PDF to CSV converter.... tidy up a little, and we've got data!

So what's the most available to sing at the Fox if you happen to be feeling courageous enough? The Top 10:

Hail to The King, baby.

Traditional? Standard? What the heck? I've never even heard of those artists! Are those some 70's rock bands like The Eagles or.... oh, right. That makes sense. Really, traditional and standard should be the same category.

After traditional songs, no one can touch The King, followed by Ol' Blue Eyes with about half as many songs. Just in case you were wondering, the next 10 spots after Celine Dion are a lot of country followed by The Stones.

And that, unfortunately, is it. Which brings us to my musing on data analysis.

On a very simplistic high level, you could say that there are 3 steps to data analysis:

1. Get the data
2. Make with the analysis
3. Write up report/article/paper/post for management/news outlet/academic journal/blog

And like I said, that is a massive oversimplification. Because really, you can break each step into many sub-steps, which don't necessarily flow in order and could be iterative. For example, Step 1:

1a. Get the data
1b. Decide if there are any other data you need
1c. Get that data 
1d. Clean and process data in usable format
1e. ....

Et cetera. My roommate and I were having a discussion on these matters, and he quite astutely pointed out that many people take Step 1 for granted. Worse yet, some don't appreciate that there is more to Step 1 than 1a.

And that is why this is another short post with only one graph. Because there's only so much analysis you can do with Artist, Title and Song ID. There's options, to pull a whole bunch of data: Gracenote (but they appear to be a bit stingy with their API), freedb, MusicBrainz, and Discogs. But I'm not going to set up a local SQL server or write a bunch of code right now; though it would be interesting to see an in-depth analysis taking into consideration many things like song length, year, genre, and lyric content to name a few.

As my roommate and I were talking, he pointed out that if you had a karaoke machine (actually I think it's computers with iTunes now) which kept track of all the songs picked, there'd be something more interesting to analyze: What is the distribution of the popularity of songs? How frequently are different songs of different genres and years picked?

We agreed that it's most likely exponential (as many things are) - Don't Stop Believin' probably gets picked almost once a night, but there are likely many, many other songs that have never have been (and probably never will be) picked. And lastly, I'm always left wondering, how many singers are actually in tune for more than half the song?

No comments:

Post a Comment