Sunday, December 29, 2013

What's in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Which of us hasn't had a long, convoluted, back-and-forth email thread drag on for days (if not weeks) to settle an issue that could have been resolved with a simple five-minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell's workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes - if the message you want is not in the top 1% (or 0.01%) of your inbox and you don't know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels was revolutionary - and it has done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication - most notably in my mind Inbox Zero, the simply-titled smartphone app Mailbox, and MIT's recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here to stay. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.

But then I thought, hey, why not do the same thing for my work email? I recently discovered that it's quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.

I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or 'All Mail') - I organize my mail into folders and wanted an export of all of it.

That being said, the bulk probably does remain in my inbox (the 4,217 items in my inbox resulted in a CSV of ~15 MB) and we can still get a rough look using what's available. The data cover the period from February 27th, 2013 to November 16th, 2013.

Email by Contact
First let's look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:


In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again, a lot of the latter ended up sorted into folders, so the true count is probably higher). Still, I don't interact directly with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.


Here you can see that I interacted most with my immediate colleague and manager, then other management; the remainder further down the line are a mix which includes email to myself and from office operations. Also of note - I don't actually receive that much email (I'm more of an "in the weeds" type of guy) or, as I said, much of it has gone into the appropriate folders.

Time-Series Analysis
The above graphs give a very simplistic, high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick analysis of patterns over time in the volume of email I received - and I'm pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible, in order to get "a feel" for it and avoid drawing erroneous conclusions without that overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words, then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatter plot of all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a "strip-like" appearance (three guesses what that periodic nature is...)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7-day moving average:


We can see the patterns noted above - the increase in daily volume after late July and the marked decrease in mid-October. Though I racked my brain and looked thoroughly, I couldn't find a specific reason why there was an increase over the summer - this was just a busy time for projects (and probably not a busy time for sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.
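For the curious, the daily series and its moving average can be computed in R in a few lines - a minimal sketch only, assuming the export has a 'Received' date-time column (adjust the column name and format string to match your own file):

# Read the export and parse the received date/time
# (column name and date format are assumptions - adjust to suit)
emails <- read.csv("inbox.csv", stringsAsFactors = FALSE)
received <- as.POSIXct(emails$Received, format = "%m/%d/%Y %H:%M:%S")

# Count emails per day
daily <- as.data.frame(table(date = as.Date(received)))
daily$date <- as.Date(as.character(daily$date))

# 7-day centered moving average of the daily volume
daily$ma7 <- as.numeric(stats::filter(daily$Freq, rep(1/7, 7), sides = 2))

plot(daily$date, daily$Freq, type = "h", xlab = "Date",
     ylab = "Emails received per day")
lines(daily$date, daily$ma7, lwd = 2)   # the trend line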

As I noted previously in analyzing communications data, this type of data is heavily skewed and usually follows a log-normal distribution. As such, a moving average is not the greatest measure of central tendency, but it is a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:


Lastly, we consider the periodic nature of the emails noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and the hour of day as the row, as below:


You can clearly see that the majority of work email occurs between the hours of 9 and 5 (shocking!). Some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don't really receive that much email, or I have spirited a lot of it away into folders, as I noted at the beginning of the article (this analysis does not include things like automated emails and reports, etc.).
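Building this kind of heatmap in R is straightforward - a minimal sketch, where the 'Received' column name and date format are again assumptions about the export:

# Tabulate emails by hour of day and weekday
emails <- read.csv("inbox.csv", stringsAsFactors = FALSE)
received <- as.POSIXct(emails$Received, format = "%m/%d/%Y %H:%M:%S")

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
          "Saturday", "Sunday")
counts <- table(hour = format(received, "%H"),
                weekday = factor(weekdays(received), levels = days))

# Hour of day as rows, weekday as columns; no clustering or rescaling
heatmap(unclass(counts), Rowv = NA, Colv = NA, scale = "none",
        xlab = "Weekday", ylab = "Hour of day")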

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as cleaning up large emails is a semi-regular task for the office worker (not many of us have the luxury of unlimited inbox capacity - even executives), so I would expect values at the high end to have largely been removed. Nevertheless, it still provides a rough approximation of how email sizes are distributed and what proportion include attachments.

First we look at the overall proportion of email left in my inbox which has attachments - of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes their attachments) in a histogram, we see a familiar-looking distribution, which here I have further expanded by making it into a Pareto chart (note that the scale on the left y-axis is logarithmic):


Here we can see that, of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250 KB or less. In fact, 99% of the email was less than 1,750 KB, and 99.9% less than 6 MB.
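For completeness, here is a rough sketch of how such a Pareto-style view can be put together in R, assuming the export has a 'Size' column in bytes (the column name is an assumption):

# Histogram of message sizes plus a cumulative-percentage overlay
emails <- read.csv("inbox.csv", stringsAsFactors = FALSE)
size_kb <- emails$Size / 1024

h <- hist(size_kb, breaks = 50, plot = FALSE)
cum_pct <- cumsum(h$counts) / sum(h$counts) * 100

keep <- h$counts > 0   # a log axis can't show empty bins
plot(h$mids[keep], h$counts[keep], type = "h", log = "y",
     xlab = "Email size (KB)", ylab = "Count (log scale)")
lines(h$mids, cum_pct / 100 * max(h$counts))   # cumulative %, rescaled

quantile(size_kb, c(0.99, 0.999))   # the 99% / 99.9% figures quoted above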

Conclusion

This was a very quick analysis of what was in my inbox, however we saw some interesting points of note, some of which confirm what one would expect - in particular:
  • the vast majority of email is received between the hours of 9-5, Monday to Friday
  • the majority of email I received was from the two managers & colleagues I work closest with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is heavily skewed, roughly log-normal

If I wanted to take this analysis further, we could look at trends by contact and do some content analysis (the latter not being done here, for obvious reasons).

This was an interesting exercise because it made me mindful again of what everyday analytics is all about - analyzing the rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows

Sunday, November 17, 2013

How to Export Your Outlook Inbox to CSV for Data Analysis

So one of my colleagues at work showed me this cool script he wrote in Visual Basic to pull all the data from Outlook for analysis.

Cool, I thought - I'd like to do that, but don't want to muck about in VB.

Well, I was surprised to discover that Outlook has the ability to export email to CSV built in! Follow the simple steps below (here demonstrated in Outlook 2010) and you can analyze your emails yourself and do some cool quantified-self type analysis.

How to Export Outlook Email to CSV (from Outlook)

1. Open Outlook and click File then Options to bring up the options dialog:


2. Select Advanced, then click the Export button:


3. Click Export to a file and then the Next button:


4. Select Comma Separated Values (Windows) and click Next.


5. Unless you want to export a different folder, select Inbox and click Next.


6. Browse to a folder and/or type a filename for your export.


7. Choose Map Custom Fields... if you want to customize which fields to export. Otherwise click the Finish button.


8. Sit tight while Outlook does its thing.



You should now have a CSV file of your inbox data!

How to Export Outlook Email to CSV (from Access)

This is all very well and good, but unfortunately exporting to CSV from Outlook does not provide the option to include date and time as fields, which makes the export useless if you'd like to do time-series (or other temporal) analysis.

To get the date and time data, you can pull the data from Outlook into Access and then export it, as noted in this MetaFilter thread.

Import from Outlook into Access
1. Fire up Access and create a new database. Select External Data, More... and then Outlook Folder.


2. Select Import the source data into a new table in the current database and click OK


3. Select the email account and folder you'd like to import and click Next 


4. Change the field settings if you'd like. Otherwise accept the defaults by clicking Next


5. Let Access add the primary key or not (you don't need it). Click Next 


6. Click Finish and wait. When the process is done you should have a new table called 'Inbox'.



Export Data from Access to a CSV
1. Make sure the Inbox table is selected and click External Data then Text File.


2. Pick or type a filename and click OK


3. Select Delimited and click Next


4. Select Comma as the delimiter and tick the box which says Include Field Names on First Row. Click Next.


5. Pick or type a filename and click Finish


You should now have your Inbox data exported as CSV (including date/time data!) and ready for analysis. Of course, you can repeat this process and append to the Access database folder by folder to analyze all the mail you have in Outlook.

Sunday, November 10, 2013

What's in my Pocket? (Part II) - Analysis of Pocket App Article Tagging

Introduction

You know what's still awesome? Pocket.

As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my reading behavior ever since.

Lately I've been thinking a lot about quantified self and how I'm not really tracking anything anymore. Something which was noted at one of the Meetups is that data collection is really the hurdle: like anything in life - voting, marketing, dating, whatever - you have to make it easy, otherwise most people probably won't bother to do it. I'm pretty sure there's a psychological term for this - something involving the word 'threshold'.

That's where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don't, as I'm willingly putting myself on display here on the blog) but, that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data, you can analyze it (and hence yourself). I did this previously, for instance, with my text messages and also with the data from Pocket collected up to that time.

So let's give it a go again, but this time with a different focus for the analysis.

Background

This time I wasn't so interested in when I read articles and from where, but more so in the types of articles I was reading. In the earlier analysis, I summarized what I was reading by top-level domain of the site - and what resulted was a high-level overview of my online reading behavior.

Pocket added the ability to tag your articles. The tags are similar to labels in Gmail, so the relationships can be many-to-one. This provides a way to organize your reading list (and archive) by category and, for the purposes of this analysis, to analyze the articles accordingly.

First and foremost, we need the data (again). Unfortunately, over the course of the development of the Pocket application, the amount of data you can easily get via export (without using the API) has diminished. Originally the export was available as either XML or JSON, but those formats are no longer available.

However, you can still export your reading list as an HTML file, which contains attributes in the link elements for the time the article was added and the tags it has attached.

Basically the export is quasi-XML, so it's a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV):
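Something along these lines does the trick (a sketch only - the href, time_added and tags attribute names follow the description of the export above, and the file name is a placeholder):

# Parse the Pocket HTML export and pull out each link's attributes
library(XML)

doc   <- htmlParse("ril_export.html")
links <- getNodeSet(doc, "//a")

articles <- data.frame(
  title      = sapply(links, xmlValue),
  url        = sapply(links, function(a) xmlGetAttr(a, "href")),
  time_added = as.POSIXct(as.numeric(sapply(links, function(a)
                   xmlGetAttr(a, "time_added"))), origin = "1970-01-01"),
  tags       = sapply(links, function(a) xmlGetAttr(a, "tags", default = "")),
  stringsAsFactors = FALSE
)

# One binary column per tag name
tag_lists <- strsplit(articles$tags, ",")
all_tags  <- setdiff(unique(unlist(tag_lists)), "")
for (t in all_tags) {
  articles[[t]] <- as.integer(sapply(tag_lists, function(x) t %in% x))
}

write.csv(articles, "pocket.csv", row.names = FALSE)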


Here I extract the attributes and also create a column for each tag name, with a binary value indicating whether the article has that tag (one of my associates at work would call this a 'classifier', though it's not the data science-y kind). Because I wrote this in a general enough fashion, you should be able to run the code on your own Pocket export and get the same results.

Now that we have some data we can plunk it into Excel and do some data visualization.

Analysis

First we examine the state of articles over time - what is the proportion of articles added over time which were tagged versus not?

Tagged vs. Untagged

You can see that initially I resisted tagging articles but, starting in November, adopted it and began tagging almost all articles added. And because stacked area graphs are not especially good data visualization, here is a line graph of the number of articles tagged per month:


This better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked from November of last year to May of this year, after which the number of articles added on a monthly basis decreased significantly (hence the previous graph being proportional).

Next we examine the number of articles by subject area. I've collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.

Psych & Other Soft Topics
As I noted previously in the other post, when starting to use Pocket I initially read a very large number of psych articles.

I also read a fair number of "personal development" articles (read: self-helpish - mainly from The Art of Manliness), which have decreased greatly as of late. The purple areas are articles on communications; the light blue is "parapsych", my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it's all nonsense, but hey, it's good conversation for dinner parties and the next category).

The big recent spike was a cool site I found with lots of articles on the zodiac (see: The Barnum Effect). Most of these later got deleted.

Dating & Sex
Now that I have your attention... what, you don't read articles on sex? The Globe and Mail's Life section has a surprising number of them. Also, if you read men's magazines online, there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).

News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.


News is just that. Tech is mostly the internet and gadgets. Jobs is anything career-related. Finance is both in the news (macro) and personal. Marketing is a newcomer.

Web & Data

The data tag relates to anything data-centric - as of late applied more to big data, data science and analytics. Interestingly, my reading on web analytics preceded my new career in it (January 2013), just as my reading on marketing did - which is kind of cool. It also goes to show that if you read enough about analytics in general, you'll eventually read about web analytics.

Data visualization is a tag I created recently, so it has very few articles - many of which I would previously have tagged with 'data'.

Life & Humanities



If that other graph was a little too busy, this one is definitely so, but I'm not going to bother breaking it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. 'Living' refers mainly to articles on city life (mostly from The Globe, as well as the odd one from blogTO).

Work
And finally some new-comers, making up the minority, related to work:


SEO is search engine optimization and dev refers to development, web and otherwise.

Gee, that was fun, and kind of enlightening. But tagging in Pocket is like labeling in Gmail - it is not one-to-one but many-to-one. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?

To do this we again turn to R and the following code snippet, on top of that previous, does the trick:
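A minimal sketch of what that snippet does, continuing from the extraction sketch above:

# Keep only articles with at least one tag, then correlate the
# binary tag columns with each other
tag_mat <- articles[rowSums(articles[, all_tags]) > 0, all_tags]
tag_mat <- tag_mat[, sapply(tag_mat, sd) > 0]   # drop any constant columns

# Pearson correlation on binary variables reduces to the phi coefficient
tag_cor <- cor(tag_mat)

# Visualize the matrix (red = negative, green = positive)
heatmap(tag_cor, Rowv = NA, Colv = NA, scale = "none",
        col = colorRampPalette(c("red", "white", "green"))(25))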

All this does is remove the untagged articles from the tag frame and then run a correlation between each pair of columns of the tag matrix. I'm no expert on exotic correlation coefficients, so I simply used the standard (Pearson's). In the case of simple binary variables (true/false, as here), the internet informs me that this reduces to the phi coefficient.

Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:


Redder is negative, greener is positive. I neglected to add a legend here, as it's kind of a pain when not using ggplot or a custom function, but some interesting relationships can still immediately be seen. Most notably, food and health articles are the most strongly positively correlated, while data and psych articles are the most strongly negatively correlated.

Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.

Conclusion

All in all this was a fun exercise and I also learned some things about my reading habits which I already suspected - the amount I read (or at least save to read later) has changed over time as well as the sorts of topics I read about. Also some types of topics are far more likely to go together than others.

If I had a lot more time, I could see taking this code and standing it up as some sort of generalized analytics web service for Pocket users (perhaps using Shiny if I were being really lazy), if there were sufficient interest in that sort of thing.

Though it was still relatively easy to get the data out, I do wish that the XML/JSON export would be restored to provide easier access for people who want their data but are not necessarily developers. Not being a developer, my attempts to use the new API for extraction purposes were somewhat frustrating (and ultimately unsuccessful).

Though apps often make our lives easier with passive data collection, all this information being "in the cloud" does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.

Because at the end of the day, it is ultimately our data that we are producing - and it's the things it can tell us about ourselves that makes it valuable to us.

Resources

Pocket - Export Reading List to HTML

Pocket - Developer API

Phi Coefficient

The Barnum (Forer) Effect

code on github

Friday, October 18, 2013

Bananagrams!!!

It was nice to be home with the family for Thanksgiving, and to finally take some time off.

A fun little activity which took up a lot of our time over the past weekend was Bananagrams, which, if you don't already know, is sort of like a more action-packed version of Scrabble without the board.

Being the type of guy that I am, I started to think about the distribution of letters in the game. A little Googling led to some prior art for this post.

The author did something neat (which I wouldn't have thought of) by making a sort of bar chart using the game pieces. Strangely though, they chose not to graph the different distributions of letters in Bananagrams and Scrabble but instead listed them in a table.

So, assuming the data from the post are accurate, here is a quick breakdown of said distributions below. As an added bonus, I've also included that trendy digital game that everyone plays on Facebook and their iDevices:
Bar graph of letter frequencies of Scrabble, Bananagrams and Words with Friends

Looking at the graph, it's clear that Bananagrams has more tiles than the other games (the total counts are 144, 104 and 100 for Bananagrams, Words with Friends and Scrabble respectively) and notably it does not have blank tiles, of which the other games have 2 each. Besides the obvious prevalence of vowels in all 3 games, T, S, R, N, L and D also have high occurrence.

We can also compare the relative frequencies of the different letters in each game with respect to Scrabble. Here I took the letter frequency for each game (as a percent) then divided it by the frequency of the same letter in Scrabble. The results are below:

Bar chart of Bananagrams and Words with Friends letter frequencies relative to Scrabble
Here it is interesting to note that the relative frequency of H in Words with Friends is nearly double that in Scrabble (~192%). D, S and T also have greater relative frequencies. The remaining letters are fairly consistent, with the exception of I and N, which are notably less frequent.
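As a quick sanity check on that figure for H (Scrabble has 2 H tiles out of 100 in total; Words with Friends has 4 out of 104, consistent with the ~192% above):

scrabble_h <- 2 / 100 * 100   # H as a percent of all Scrabble tiles: 2.00%
wwf_h      <- 4 / 104 * 100   # H as a percent of all WWF tiles: ~3.85%

wwf_h / scrabble_h * 100      # relative frequency: ~192%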

Bananagrams' relative letter frequency is fairly consistent overall, with the exception of J, K, Q, X, and Z, which are all around the 140% mark. I guess the creator of the game felt there weren't enough of the "difficult" letters in Scrabble.

There's more analysis that could be done here (looking at the number of points per letter in WWF & Scrabble versus their relative frequency immediately comes to mind) but that should do for now. Hope you found this post "a-peeling".

Saturday, September 14, 2013

Analysis of the TTC Open Data - Ridership & Revenue 2009-2012

Introduction

I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can't stop complaining about how bad it is. The TTC wants to raise fare prices. Or it doesn't. It's complicated.

I personally can't say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto's public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 

I rely extensively on public transit - in fact, I used to ride it every day to get to work. All things considered, for what you're paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.

The TTC's ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.

There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we'll stick to this little analysis exercise.

Background

The data set comprises numbers for: average weekly ridership (in 000's), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000's), and monthly revenue, actual and budgeted (in millions $). More info here [XLS doc].

Analysis

First we consider the simplest data: the peak and off-peak ridership. Looking at this simple line graph, you can see that off-peak ridership has increased more than peak ridership since 2009 - peak and off-peak ridership increasing by 4.59% and 12.78% respectively. Total ridership over the period has increased by 9.08%.



Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system that we saw summarized yearly above. Unfortunately Google Docs doesn't have trendlines built in like Excel does (hint hint, Google), but unsurprisingly, if you add a regression line the trend is highly significant (> 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.


Next we come to the ridership by month. If you look at the plot over the whole period, you can see that there is a distinct periodic behavior:


Taking the monthly averages we can better see the periodicity - there are peaks in March, June & September, and a mini-peak in the last month of the year:


This periodicity is also present in both the revenue (as one would expect) and the monthly budget (which means the TTC is aware of it). As to why it occurs, I can't immediately discern, though I am curious to know the answer. This is where it would be great to have finer-grained data (daily or hourly), or data per geographic area or station, to look for interesting outliers and patterns.

Alternatively, if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware - but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I'm not exactly sure what to make of it, so I will do what I normally do and attribute it to students.


Lastly, we come to the matter of the financials. As I said, the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from the actual, you can see that this quantity decreases over time:
Again, if you do a linear regression, the trend is highly significant (> 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I'm not an economist, and what's been done here is likely a gross oversimplification of the financials of something as massive as the TTC.
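For illustration, here is the equivalent significance check in R - with made-up numbers of roughly the same shape as the series above, since the actual analysis was done in a spreadsheet:

# Hypothetical illustration: a noisy, declining actual-minus-budget series
set.seed(42)
month    <- 1:48                                    # Jan 2009 .. Dec 2012
diff_rev <- 2 - 0.05 * month + rnorm(48, sd = 0.8)  # $ millions, made up

fit <- lm(diff_rev ~ month)
summary(fit)$coefficients   # the slope is negative and highly significant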

That being said, the city itself acknowledges [warning - large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which it attributes to increases in wages and fuel prices. Operating public transit is also apparently more expensive here in Toronto than in other cities in the province, because we have things like streetcars and the subway whereas most other cities only have buses. Either way, as I said before, it's complicated.

Conclusion

I always enjoy working with open data, and I definitely appreciate the city's initiative to be more transparent and accountable by providing the data for public use.

This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and growth of ~9% from 2009 to 2012
  • There is periodic behavior in the actual ridership per month over the course of the year
  • There is a different periodicity in average weekday ridership per month, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully will be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which others have used to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership - Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal - 2011 Performance Measurement and Benchmarking Report

Monday, August 12, 2013

The Top 0.1% - Keeping Pace with the Data Explosion?

I've been thinking a lot lately, as I do, about data.

When you work with something so closely, it is hard not to have the way you think about what you work with impact the way you think about other things in other parts of your life.

For instance, cooks don't think about food the same way once they've worked in the kitchen; bankers don't think about money the same way once they've seen a vault full of it; and analysts don't think about data the same way, once they start constantly analyzing it.

The difference being, of course, that not everything is related to food or money, but everything is data if you know how to think like an analyst.

I remember when I was in elementary school as a young child and a friend of mine described to me the things of which he was afraid. We sat in the field behind the school and stared down at the gravel on the track.

"Think about counting every single pebble of gravel on the track," he said. "That's the sort of thing that really scares me." He did seemed genuinely concerned. "Or think about counting every grain of sand on a beach, and then think about how many beaches there are in the whole world, and counting every grain of sand on every single one of those beaches. That's the sort of thing that frightens me."

The thing that scared my childhood friend was infinity; or perhaps not the infinite, but just very very large numbers - the quantities of the magnitude relating to that thing everyone is talking about these days called Big Data.

And that's the sort of thing I've been thinking about lately.

I don't remember the exact figure, but if you scour the internet to read up on our information age, and in particular our age of "Big Data", you will find statements similar to the one below:

... there has been more information created in the past year than there was in all of recorded history before it.

Which brings me to my point about the Top 0.1%.

Or, perhaps, to be more fair, probably something more like the Top 0.01%.

There is so much information out there. Every day around the world, every milliliter of gas pumped, every point-of-sale transaction, and every mouse click on millions of websites is being recorded, creating more data.

Our capacity to record and store information has exploded exponentially.

But, perhaps somewhat critically, our ability to work with it and analyze it has not.

In the same way that The Ingenuity Gap describes how the complexity of problems facing society is ever increasing while our ability to implement solutions is not matching that pace, we may similarly find that the amount of information being recorded and stored in our modern world is increasing exponentially while our ability to manage and analyze it is not. This is true not only from a technological perspective but also from a human one - there is only so much information one person can handle working with and keep "in mind".

I know that many other people are thinking this way as well. This is my crude, data-centric take on what people have been talking about since the 1970s - information overload. Many other authors have touched on this point recently, as it is of legitimate concern - for instance, the acknowledgement that the skill set needed to work with and make sense of these data sets of exponentially increasing size is so specialized that data scientists will not scale.

Will our technology and ability to manage data be able to keep up with the ever increasing explosion of it? Will software and platforms develop and match pace such that those able to make sense of these large data sets are not just a select group of specialists? How will the analyst of tomorrow handle working with and thinking about the analysis of such massive and diverse data sets?

Only time will answer these questions, but the one thing that seems certain is that the data deluge will only continue as storage becomes ever cheaper and of greater density while the volume, velocity and variety of data collected worldwide continues to explode.

In other words: there's more where that came from.

Saturday, June 22, 2013

Everything in Its Right Place: Visualization and Content Analysis of Radiohead Lyrics

Introduction

I am not a huge Radiohead fan.

To be honest, the Radiohead I know and love and remember is the one that was a rock band without a lot of 'experimental' tracks - a band you discovered on Big Shiny Tunes 2, or because your friends told you about it, or because it was playing in the background of a bar you were at sometime in the 90's.

But I really do like their music; I've become familiar with more of it, and overall it does possess a certain unique character in its entirety. Their range is so diverse and has changed so much over the years that it would be really hard not to find at least one track that someone will like. In this way they are very much like the Beatles, I suppose.

I was interested in doing some more content analysis type work and text mining in R, so I thought I'd try song lyrics and Radiohead immediately came to mind.

Background

In order to do the analysis, we first need the data (shocking, I know). Somewhat surprisingly, putting 'radiohead data' into Google turns up little except many, many links to the video and project for House of Cards, which was made using LIDAR technology and had its data set publicly released.

So once again we are in the situation where we are responsible not only for analyzing the data and communicating the findings, but also for getting it in the first place. Such is the life of an analyst, everyday and otherwise (see my previous musing on this point).

The lyrics data was taken from the listing of Radiohead lyrics at Green Plastic Radiohead.

Normally it would simply be a matter of throwing something together in Python using Beautiful Soup, as I have done previously. Unfortunately, due to the way these particular pages were coded, that proved to be a bit more difficult than expected.

As a result the extraction process ended up being a convoluted ad-hoc data wrangling exercise involving the use of wget, sed and Beautiful Soup - a process which was neither enjoyable nor something I would care to repeat.

In retrospect, two points:

Getting the data is not always easy.
Sometimes sitting down beforehand and looking at where you are getting the data from, the format it is in, and how best to get it into the format you need will save you a lot of wasted time and frustration in the long run. Ask questions before you begin: What format is the data in now? What format do I need (or would like) it to be in to do the analysis? What steps are required to get from one to the other (i.e. what is the data transformation or mapping process)?

That being said, my methods got me where I needed to be; however, there were most likely easier, more straightforward approaches which would have saved a lot of frustration on my part.

If you're going to code a website, use a sane page structure and give important page elements ids.
Make it easy on other developers (and the rest of the world in general) by labeling your <div> containers and other elements with ids (which are unique!!) or at least classes. Otherwise how are people going to scrape all your data and steal it for their own ends? I joke... kind of.

In this case my frustrations actually stemmed mainly from some questionable code for a cache-buster. But even once I got past that, the contents of the main page containers were somewhat inconsistent. Such is life, and the internet.

The remaining data - album and track lengths - were taken from the Wikipedia pages for each album and later merged with the calculations done on the text data in R.

Okay, enough whinging - we have the data - let's check it out.

Analysis

I stuck with what I consider to be the 'canonical' Radiohead albums - that is, the big releases you've probably heard about even if, like me, you're not a hardcore Radiohead fan - 8 albums in total (Pablo Honey, The Bends, OK Computer, Kid A, Amnesiac, Hail to the Thief, In Rainbows, and The King of Limbs).

Unstructured (and non-quantitative) data always lends itself to more interesting analysis - with something like text, how do we analyze it? How do we quantify it? Let's start with the easily quantifiable parts and go from there.

Track Length
Below is a boxplot of the track lengths per album, with the points overplotted.

Distribution of Radiohead track lengths by album

Interestingly, Pablo Honey and Kid A have the largest ranges of track length (from 2:12 to 4:40 and from 3:42 to 7:01 respectively) - if you ignore the single tracks around 2 minutes on Amnesiac and Hail to the Thief, the variance of their track lengths is more in line with all the other albums. Ignoring its single outlier, The King of Limbs appears to be special given its narrow range of track lengths.

Word Count
Next we look at the number of words (lyrics) per album:

Distribution of number of words per Radiohead album

There is a large range of word counts, from the two truly instrumental tracks (Treefingers on Kid A and Hunting Bears on Amnesiac) to the wordier tracks (Dollars and Cents and A Wolf at the Door). Pablo Honey almost looks like it has two categories of songs, with a split around the 80-word mark.

Okay, interesting and all, but again these are small amounts of data, and only so much can be drawn from them.

Going forward we examine two calculated quantities.

Calculated Quantities - Lexical Density and 'Lyrical Density'

In the realm of content analysis there is a measure known as lexical density: the number of content words as a proportion of the total number of words, a value which ranges from 0 to 100. In general, the greater the lexical density of a text, the more content-heavy it is and the more 'unpacking' it takes to understand - texts with low lexical density are easier to understand.

According to Wikipedia, the formula is as follows:

Ld = (Nlex / N) × 100

where Ld is the analysed text's lexical density, Nlex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text, and N is the number of all tokens (total number of words) in the analysed text.

Now, I am not a linguist; however, it sounds like this is just the ratio of words which are not stopwords to the total number of words - or could at least be approximated by it. That's what I went with in the calculations in R using the tm package (because I'm not going to write a package to calculate lexical density by myself).

On a related note, I completely made up a quantity which I am calling 'lyrical density', which is much easier to calculate and understand - this is just the number of words in a song divided by the track length, and is measured in words per second. An instrumental track would have a lyrical density of zero, and a song with one word per second for the whole track would have a lyrical density of 1.
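A minimal sketch of both calculations in R, using the tm package's stopword list (the toy data frame and its column names are hypothetical stand-ins for the real lyrics table):

library(tm)   # for stopwords()

# Approximate lexical density: share of words that are not stopwords
lexical_density <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  words <- words[words != ""]
  if (length(words) == 0) return(0)   # instrumental tracks
  sum(!(words %in% stopwords("en"))) / length(words) * 100
}

# Toy data - two made-up rows standing in for the real lyrics table
songs <- data.frame(
  lyrics        = c("I might be wrong I might be wrong", ""),
  track_seconds = c(290, 222),
  stringsAsFactors = FALSE
)

songs$word_count      <- sapply(strsplit(songs$lyrics, "\\W+"),
                                function(w) sum(w != ""))
songs$lexical_density <- sapply(songs$lyrics, lexical_density)
songs$lyrical_density <- songs$word_count / songs$track_seconds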

Lexical Density
Distribution of lexical density of Radiohead songs by album

Looking at the calculated lexical density per album, we can see that the majority of songs have a lexical density between about 30 and 70. The two instrumental songs have a lexical density of 0 (as they have no words), and the distribution appears most even on OK Computer. The most content-word-heavy song, I Will (No Man's Land), is on Hail to the Thief.

If you could imagine extending the number of songs Radiohead have written to infinity, you might get a density function something like the one below, with the bulk of songs having a density between 30 and 70 (which I imagine is a reasonable range for any text) and a little bump at 0 for their instrumental songs:
Histogram of lexical density of Radiohead tracks with overplotted density function
Lyrical Density
Next we come to my calculated quantity, lyrical density - or the number of words per second on each track.

Distribution of lyrical density of Radiohead tracks by album

Interestingly, there are outlying tracks at the high end where the ratio of words to song length is greater than 1 (Fitter Happier, A Wolf at the Door, and Faust Arp). Fitter Happier shouldn't even really count, as it is essentially an instrumental track with a synthesized voice dubbed over top. If you listen to A Wolf at the Door, it is clear why the lyrical density is so high - Thom is practically rapping at points. Otherwise, Kid A and The King of Limbs seem to have less quickly sung lyrics than the other albums on average.

Lexical Density + Lyrical Density
Putting it all together, we can examine the quantities for all of the Radiohead songs in one data visualization. You can examine different albums by clicking the color legend at the right, and compare multiple albums by holding CTRL and clicking more than one.


The songs are colour-coded by album. The points are plotted with lexical density along the y-axis against lyrical density along the x-axis, and are sized by the total number of words in the song. As such, the position of a point gives an idea of the rate of lyrical content in the track - a song like I Might Be Wrong fits fewer content words in at a slower rate than a track like A Wolf at the Door, which is packed much tighter with both lyrics and meaning.

Conclusion

This was an interesting project and it was fascinating to take something everyday like song lyrics and analyze them as data (though some Radiohead fans might argue that there is nothing 'everyday' about Radiohead lyrics).

All in all, I feel that a lot of the analysis has to be taken with a grain of salt (or a shaker or two), given the size of the data set (n = 89). 

That being said, I still feel it is proof positive that you can take something typically thought of as very artistic and qualitative, like a song, and classify it in a meaningful way in quantitative fashion. I had never listened to the song Fitter Happier, yet it is a clear outlier on several measures - and listening to the song I discovered why: it is a track with a robot-like voiceover, not containing sung lyrics at all.

A more interesting and ambitious project would be to take a much larger data set, where the measures examined here would be more reliable given the larger n, and look at things such as trends in time (the evolution of American rock lyrics) or by genre / style of music. This sort of thing already exists out there to an extent - for example, work done with The Million Song Data Set, which I came across in some of the Google searches I made for this project.

But as I said, this would be a large and ambitious amount of work, which is perhaps more suited for something like a research paper or thesis - I am just one (everyday) analyst. 

References & Resources

Radiohead Lyrics at Green Plastic Radiohead

The Million Song Data Set

Measuring the Evolution of Contemporary Western Popular Music [PDF]

Radiohead "House of Cards" by Aaron Koblin

code, data & plots on github