Friday, September 21, 2012

Don't Do Journey: Karaoke and a Data Analysis Musing

"DON'T DO JOURNEY!!" The look of terror and disbelief in her eyes was both sudden and palpable.

What can I say? People feel very strongly about karaoke. Ever since this joy/terror was gifted/unleashed upon the world, it seems that there is no shortage of people who have very strong feelings about it.

It's kind of a love/hate relationship. People love it. Or they hate it. Or they love to hate it. Or they hate the fact that they love it. Either way, it's kind of surprising how polarizing it can be.

There's a place here in Toronto that's quite popular for it. Well, actually I don't know how popular it is, but they do have it five nights a week. As I was looking at their website one day, I had one of those "oh, neat" moments - the contents of their entire karaoke songbook, a list of all 32,636 songs, is available in PDF format.

Slam that into a PDF to CSV converter.... tidy up a little, and we've got data!

So which artists have the most songs available to sing at the Fox, if you happen to be feeling courageous enough? The Top 10:

Hail to The King, baby.

Traditional? Standard? What the heck? I've never even heard of those artists! Are those some 70's rock bands like The Eagles or.... oh, right. That makes sense. Really, traditional and standard should be the same category.

After traditional songs, no one can touch The King, followed by Ol' Blue Eyes with about half as many songs. Just in case you were wondering, the next 10 spots after Celine Dion are a lot of country followed by The Stones.
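For anyone who wants to reproduce the ranking, here is a minimal Python sketch of the counting step. It assumes the converted songbook ended up as a CSV with Artist, Title and Song ID columns; the column names and the sample rows are made up for illustration (the real file has 32,636 rows).

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the converted songbook CSV.
# The column names are an assumption based on the three fields in the songbook.
SAMPLE_CSV = """Artist,Title,Song ID
Traditional,Example Song A,0001
Traditional,Example Song B,0002
Traditional,Example Song C,0003
Elvis Presley,Example Song D,0004
Elvis Presley,Example Song E,0005
Frank Sinatra,Example Song F,0006
"""

def top_artists(csv_file, n=10):
    """Return the n artists with the most songs as (artist, count) pairs."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row["Artist"].strip() for row in reader)
    return counts.most_common(n)

top = top_artists(io.StringIO(SAMPLE_CSV))
for artist, count in top:
    print(f"{artist}: {count}")
```

With the real file, you would pass an open file handle instead of the `io.StringIO` wrapper.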

And that, unfortunately, is it. Which brings us to my musing on data analysis.

On a very simplistic high level, you could say that there are 3 steps to data analysis:

1. Get the data
2. Make with the analysis
3. Write up report/article/paper/post for management/news outlet/academic journal/blog

And like I said, that is a massive oversimplification. Because really, you can break each step into many sub-steps, which don't necessarily flow in order and could be iterative. For example, Step 1:

1a. Get the data
1b. Decide if there are any other data you need
1c. Get that data 
1d. Clean and process data in usable format
1e. ....

Et cetera. My roommate and I were having a discussion on these matters, and he quite astutely pointed out that many people take Step 1 for granted. Worse yet, some don't appreciate that there is more to Step 1 than 1a.

And that is why this is another short post with only one graph. Because there's only so much analysis you can do with Artist, Title and Song ID. There are options for pulling in a whole bunch more data: Gracenote (but they appear to be a bit stingy with their API), freedb, MusicBrainz, and Discogs. But I'm not going to set up a local SQL server or write a bunch of code right now, though it would be interesting to see an in-depth analysis that takes into account things like song length, year, genre, and lyric content, to name a few.

As my roommate and I were talking, he pointed out that if you had a karaoke machine (actually I think it's computers with iTunes now) which kept track of all the songs picked, there'd be something more interesting to analyze: What is the distribution of the popularity of songs? How frequently are different songs of different genres and years picked?

We agreed that it's most likely exponential (as many things are) - Don't Stop Believin' probably gets picked almost once a night, but there are likely many, many other songs that have never been (and probably never will be) picked. And lastly, I'm always left wondering, how many singers are actually in tune for more than half the song?

Tuesday, September 4, 2012

FBI iPhone Leak Breakdown

Don't know if you heard, but something that is making the news today is that hacker group AntiSec purportedly gained control of an FBI agent's laptop and got a hold of 12 million UDIDs which were apparently being tracked.

A UDID is Apple's unique identifier for each of its 'iDevices', and if known could be used to get a lot of personally identifiable information about the owner of each product.

The hackers released the data on pastebin here. In the interests of protecting the privacy of the users, they removed all said personally identifiable information from the data. This is kind of a shame in a way, as it would have been interesting to do an analysis of the geographic distribution of the devices which were (allegedly) being tracked, amongst other things. I suppose they released the data for more (allegedly) altruistic purposes - i.e. to let people find out if the FBI was tracking them, not to have the data analyzed.

The one useful column that was left was the device type. Surprisingly, the majority of devices were iPads. Of course, this could just be unique to the million and one records of the 12 million which the group chose to release.

Breakdown:
iPhone: 345,384 (34.5%)
iPad: 589,720 (59%)
iPod touch: 63,724 (6.4%)
Undetermined: 1,173 (0.1%)
Total: 1,000,001
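
The percentages are simple shares of the total, and can be double-checked with a few lines of Python. The counts are the ones listed above; nothing else is assumed.

```python
# Device counts as listed in the breakdown.
device_counts = {
    "iPhone": 345_384,
    "iPad": 589_720,
    "iPod touch": 63_724,
    "Undetermined": 1_173,
}

total = sum(device_counts.values())
# Share of each device type, as a percentage of all released records.
shares = {device: 100 * count / total for device, count in device_counts.items()}

for device, share in shares.items():
    print(f"{device}: {device_counts[device]:,} ({share:.1f}%)")
print(f"Total: {total:,}")
```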

Forgive me Edward Tufte, for using a pie chart.

omg lol brb txt l8r - Text Message Analysis, 2011-2012


Introduction

I will confess, I don't really like texting. I communicate through text messages, because it does afford many conveniences, and occupies a sort of middle ground between actual conversation and email, but that doesn't mean that I like it.

Even though I would say I text a fair bit, more than some other Luddites I know, I'm not a serial texter. I'm not like one of these 14-year-old girls who sends thousands of text messages a day (about what, exactly?).

I recall reading about one such girl in the UK who sent in excess of 100,000 text messages one month. Unfortunately her poor parents received a rather hefty phone bill, as she did this without knowing she did not have an unlimited texting plan. But seriously, what the hell did she write? Even if she only wrote one word per text message, 100,000 words is ~200 pages of text. She typed all that out on a mobile phone keyboard (or even worse, a touch screen)? That would be a sizeable book.

If you do the math it's even crazier in terms of time. There are only 24 hours in the day, so assuming little Miss Teen Texter of the Year did not sleep, she still would have to send 100,000 messages in a 24 * 30 = 720 hour period, which averages out to about one message every 26 seconds. I think by that point there is really no value added to the conversations you are having. I'm pretty sure I have friends I haven't said 100,000 words to over all the time that we've known each other.
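
The arithmetic, as a quick sketch (assuming a 30-day month and zero sleep, as above):

```python
messages = 100_000
hours = 24 * 30  # a 30-day month with no sleep at all

# Convert the window to seconds and divide by the number of messages.
seconds_per_message = hours * 3600 / messages
print(f"{hours} hours, one message every {seconds_per_message:.1f} seconds")
```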

But I digress.

Background

Actually getting all the data out turned out to be much easier than I anticipated. There exists an Android App which will not only back up all your texts (with the option of emailing it to you), but conveniently does so in an XML file with human-readable dates and a provided stylesheet (!). Import the XML file into Excel or other software and boom! You've got time series data for every single text message you've ever sent.
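
If you would rather skip Excel, the XML can also be read directly. The following is a minimal Python sketch, not what I actually did. It assumes the backup holds one `sms` element per message, with a `date` attribute in milliseconds since the Unix epoch, a `type` attribute (1 = received, 2 = sent) and a `body` attribute. Treat those names as assumptions and check them against your own file; the two sample messages are made up.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny made-up backup in the layout this sketch assumes: one <sms> element per
# message, "date" in milliseconds since the Unix epoch, "type" 1 = received
# and 2 = sent. Check these attribute names against your own backup file.
SAMPLE_XML = """<smses count="2">
  <sms address="5550100" date="1315065600000" type="2" body="We still going at 8?" />
  <sms address="5550100" date="1315065660000" type="1" body="Yup, you know it" />
</smses>"""

DIRECTIONS = {"1": "Received", "2": "Sent"}

def parse_backup(xml_text):
    """Return a list of (timestamp, direction, body) tuples, oldest first."""
    root = ET.fromstring(xml_text)
    messages = []
    for sms in root.iter("sms"):
        ts = datetime.fromtimestamp(int(sms.get("date")) / 1000, tz=timezone.utc)
        direction = DIRECTIONS.get(sms.get("type"), "Other")
        messages.append((ts, direction, sms.get("body", "")))
    return sorted(messages, key=lambda m: m[0])

messages = parse_backup(SAMPLE_XML)
for ts, direction, body in messages:
    print(ts.isoformat(), direction, body)
```

For a real backup, `ET.parse(path).getroot()` would replace the `ET.fromstring` call.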

My data set spans the time from when I first started using an Android phone (July 2011) up to approximately the present, when I last created the backup (August 13th).

In total over this time period (405 days) I sent 3655 messages (~46.8%) and received 4151 (~53.2%) for a grand total of 7806 messages. This averages out to approximately 19 messages / day total, or about 0.8 messages per hour. As I said, I'm not a serial texter. Also I should probably work on responding to messages.

Analysis

First we can get a 'bird's eye view' of the data by plotting a colour-coded data point for each message, with time of day on the y-axis and the date on the x-axis:



Looks like the majority of my texting occurs between 8 AM and midnight, which is not surprising. As was established in my earlier post on my sleeping patterns, I do enjoy the night life, as you can see from the intermittent activity in the range outside of these hours (midnight to 4 AM). As Dr. Wolfram commented in his personal analytics posting, it was interesting to look at the plot and think 'What does this feature correspond to?' then go back and say 'Ah, I remember that day!'.

It's also interesting to see the back and forth nature of the messaging. As I mentioned before, the split in Sent and Received is almost 50/50. This is not surprising - we humans call these 'conversations'.

We can cross-tabulate the data to produce a graph of the total daily volume in SMS: 

Interesting to note here the spiking phenomenon, in what appears to be a somewhat periodic fashion. This corresponds to the fact that there are some days where I do a lot of texting (i.e. carry on several day-long conversations) contrasted with days where I might have one smaller conversation, or just send one message or so to confirm something ('We still going to the restaurant at 8?' - 'Yup, you know it' - 'Cool. I'm going to eat more crab than they hauled in on the latest episode of Deadliest Catch!').

I appeared to be texting more back in the Fall, and my overall volume of text diminished slightly into the New Year. Looking back at some of the spikes, some corresponded to noteworthy events (birthday, Christmas, New Year's), whereas others did not. For example, the largest spike, which occurred on September 3rd, just happened to be a day where I had a lot of conversations at once not related to anything in particular.
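
The cross-tabulation behind the daily volume graph boils down to counting messages per day and direction. A minimal Python sketch of that step, using made-up (date, direction) pairs in place of the real messages:

```python
from collections import Counter
from datetime import date

# Made-up (date, direction) pairs standing in for the parsed messages.
messages = [
    (date(2011, 9, 3), "Sent"),
    (date(2011, 9, 3), "Received"),
    (date(2011, 9, 3), "Sent"),
    (date(2011, 9, 4), "Received"),
]

def daily_crosstab(messages):
    """Count messages per day, split by direction."""
    table = {}
    for day, direction in messages:
        table.setdefault(day, Counter())[direction] += 1
    return table

table = daily_crosstab(messages)
for day in sorted(table):
    counts = table[day]
    # Columns: day, sent, received, total.
    print(day, counts["Sent"], counts["Received"], sum(counts.values()))
```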

Lastly, through the magic of a Tableau dashboard (pa-zow!) we can combine these two interactive graphs for some data visualization goodness:



Next we make a histogram of the data to look at the distribution of the daily message volume. The spiking behaviour and variation in volume previously evident can be seen in the tail of the histogram dropping off exponentially:

Note that the black line is the density, not a fitted theoretical distribution.
The daily volume follows what appears to be an exponential-type distribution (log-normal?). This is really neat to see, as I did not know what to expect (when in doubt, guess Gaussian), but it is not entirely shocking - other communication phenomena (e.g. phone calls) have been shown to follow a Poisson process. Someone correct me if I am way out of line here.
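
One rough way to probe the Poisson question, offered as a sketch rather than as something done for this post: a Poisson distribution has a variance equal to its mean, so daily counts whose variance is far larger than their mean are overdispersed and point towards something heavier-tailed. The daily counts below are made up.

```python
from statistics import mean, pvariance

# Made-up daily message counts: mostly quiet days with a few big spikes.
daily_counts = [3, 5, 2, 40, 8, 1, 12, 4, 65, 6, 9, 2, 30, 7]

def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return pvariance(counts) / mean(counts)

ratio = dispersion_index(daily_counts)
print(f"mean = {mean(daily_counts):.1f}, variance/mean = {ratio:.1f}")
```

A ratio well above 1 would not settle which distribution fits, only that a plain Poisson model of the daily counts is a poor match.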

Lastly we can analyze the volume of text messages per day of the week, by making a box plot:

Something's not quite right here...

As we saw in the histogram, the data are of an exponential nature. Correcting the y-axis in this regard, the box plot looks a little more how one would expect:

Ahhhh.

We can see that overall there tends to be a greater volume of texts Thursday to Sunday. Hmmm, can you guess why this is? :)

This can be further broken down with a heat map of the total hourly volume per day of week:

This is way easier to make in Tableau than in R.


As seen previously in the scatterplot, the majority of messages are concentrated between 8 AM (here it looks more like 10 AM) and midnight. In line with the boxplot just above, most of that traffic is towards the weekend. In particular, the majority of the messages were mid-to-late afternoon on Fridays.
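
Underneath the heat map is just a count of messages per (day of week, hour) cell. A minimal Python sketch of that tally, with made-up timestamps standing in for the real ones:

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up message timestamps standing in for the parsed backup.
timestamps = [
    datetime(2011, 9, 2, 15, 5),   # a Friday afternoon
    datetime(2011, 9, 2, 15, 40),
    datetime(2011, 9, 2, 23, 10),
    datetime(2011, 9, 5, 10, 30),  # a Monday morning
]

def hourly_volume(timestamps):
    """Count messages per (weekday name, hour of day) cell."""
    return Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts in timestamps)

grid = hourly_volume(timestamps)
for (weekday, hour), count in grid.most_common():
    print(f"{weekday} {hour:02d}:00  {count}")
```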

We have thus far mainly been looking at my text messages as time series data. What about the content of the texts I send and receive?

Let's compare the distribution of message lengths, sent versus received. Since there are an unequal number of Sent and Received messages, I stuck with a density plot:

Line graphs are pretty.


Interestingly, again, the data are distributed in an exponential fashion.

You can see distinctive humps at the 160 character mark. This is due to longer messages being broken down into multiple messages under the max length. Some carriers (or phones?) don't break up the messages, and so there are a small number of messages longer than the 'official' limit.

Comparing the blue and red lines, you can see that in general I tend to be wordier than my friends and acquaintances.
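
The reason for using densities rather than raw counts can be sketched in a few lines of Python. This is a normalized histogram rather than the smoothed density in the plot, and the messages and the bin width of 10 characters are made up for illustration. Dividing each bin by the number of messages in that direction puts Sent and Received on the same scale even though the totals differ.

```python
from collections import Counter

# Made-up (direction, body) pairs standing in for the parsed messages.
messages = [
    ("Sent", "We still going to the restaurant at 8?"),
    ("Sent", "Cool."),
    ("Received", "Yup, you know it"),
    ("Received", "ok"),
    ("Received", "brb"),
]

def length_density(messages, bin_width=10):
    """Per direction, the fraction of messages in each length bin."""
    bins = {}
    for direction, body in messages:
        bin_start = (len(body) // bin_width) * bin_width
        bins.setdefault(direction, Counter())[bin_start] += 1
    return {
        direction: {b: n / sum(counts.values()) for b, n in sorted(counts.items())}
        for direction, counts in bins.items()
    }

density = length_density(messages)
for direction, fractions in density.items():
    print(direction, fractions)
```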

Lastly, we can look at the written content. I do enjoy a good wordcloud, so we can plunk the message contents into R and create one:
Names blurred to protect the innocent (except me!).

What can we gather from this representation of the text? Well, nothing I didn't already know.... my phone isn't exactly a work Blackberry.
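
A wordcloud is really just a word-frequency table with fancy typesetting. Here is a minimal Python sketch of the counting step (not the R code used for the image above); the message bodies and the tiny stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Made-up message bodies; a real run would use the bodies from the backup.
bodies = [
    "We still going to the restaurant at 8?",
    "Yup, you know it",
    "Cool. See you at the restaurant",
]

# A tiny, arbitrary stop-word list for illustration only.
STOP_WORDS = {"the", "to", "at", "it", "we", "you"}

def word_frequencies(bodies):
    """Lower-case, split into words, drop stop words, and count."""
    words = re.findall(r"[a-z']+", " ".join(bodies).lower())
    return Counter(w for w in words if w not in STOP_WORDS)

freqs = word_frequencies(bodies)
print(freqs.most_common(5))
```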

Conclusions

  • Majority of text message volume is between 10 AM and midnight
  • Text messages split approximately 50/50 between sent and received due to conversations
  • Daily volume is distributed in an exponential fashion (Poisson?)
  • Majority of volume is towards the end of the week, especially Friday afternoon
  • I should be less wordy (isn't that the point of the medium?)
  • Everybody's working for the weekend

References & Resources

SMS Backup and Restore @ Google Play
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en

Tableau Public
http://www.tableausoftware.com/public/community