Tuesday, January 22, 2013

Finer Points Regarding Data Visualization Choices

The human mind is limited.

We can only process so much information at one time. Numerals are text that communicates quantity. However, unlike other text, it's a lot harder to read a whole bunch of numbers and get a high-level understanding of what is being communicated. There are sentences of numbers and quantities (these are called equations, but not everyone is as literate in them), but simply looking at a pile of data and understanding the 'big picture' is not something most people can do. This is especially true once the amount of information grows beyond a table with a few categories and values.

If you're a market research, business, data, financial, or (insert other prefix here) analyst, part of your job is taking a lot of information and making sense of that information, so that other people don't have to. Let's face it - your Senior Manager or The VP doesn't have time to wade through all the data - that's why they hired you.

Ever since Descartes' epiphany (and even before that) people have been realizing that there are other, more effective ways to communicate information than having to look at all the details. You can communicate the shape of the data without knowing exactly how many Twitter followers were gained each day. You can see what the data look like without having to know the exact dollar value for sales each and every day. You can feel what the data are like, and get an intuitive understanding of what's going on, without having to look at all the raw information.

Enter data visualization.

Like any practice, data visualization (depicting quantitative relationships visually) can be done poorly or can be done well. I'm sure you've seen examples of the former, whether in a presentation or other report, or perhaps floating around the Internet. And the latter, like so many good things, is not always so plentiful, nor appreciated. Here I present some finer points between data visualization choices, in the hope that you will always find yourself depicting data well.

Pie (and Doughnut) Chart

Ah, the pie chart. The go-to the world over when most people seek to communicate data, and one both loved and loathed by many.

The pie chart should be used to compare quantities of different categories where the proportion of the whole is important, not the absolute values (though these can be added with labelling as well). It's important that the number of categories being compared remain small: depending on the values, the readability of the chart decreases greatly as the number of categories increases. You can see this below. The second example is a case where an alternate representation should be considered, as the chart's readability and usefulness are lower given the larger number of proportions being compared:



Doughnut charts are the same as pie charts but with a hole in the center. They may be used in the place of multiple pie charts by nesting the rings:

Hmm.

Though again, as the number of quantities being compared increases, the readability and visual utility generally decrease, and you are better served by a bar chart in these cases. There is also the issue that the area of a ring segment will be different for the same angle, depending upon which ring it is in.
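To put a number on the ring-area issue, here is a minimal sketch. It assumes two nested rings of equal thickness (the radii 1, 2 and 3 are made up for illustration) and uses the standard formula for the area of an annular sector, (angle / 2) * (R^2 - r^2):

```python
import math

def ring_segment_area(inner_radius, outer_radius, angle_degrees):
    """Area of one doughnut-chart segment (an annular sector)."""
    angle = math.radians(angle_degrees)
    return 0.5 * angle * (outer_radius ** 2 - inner_radius ** 2)

# Hypothetical chart: two rings of equal thickness, each with a
# segment showing the same 25% share (a 90 degree angle).
inner = ring_segment_area(1.0, 2.0, 90)
outer = ring_segment_area(2.0, 3.0, 90)

# Same angle, same share, but the outer segment covers more ink.
print(round(outer / inner, 2))  # 1.67
```

Under these assumptions the outer segment is about two thirds larger than the inner one even though both encode the same proportion, so a reader comparing areas across rings would be misled.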

With circular charts it is best to avoid legends, as they cause the eye to flit back and forth between the segments and the legend. However, when abiding by this practice for doughnut charts, labeling becomes a problem, as you can see above.

Tufte contends that a bar chart will always serve better than a pie chart (though some others disagree). The debate is about how the human mind processes comparisons of angles versus those depicted linearly or by area. I tend to agree with Tufte and find the chart below a much better data visualization than the one we saw previously:

Isn't that much better?

From a practical perspective - a pie chart is useful because of its simplicity and familiarity, and is a way to communicate proportion of quantities when the number of categories being compared is small. 

Bonus question:
Q. When is it a good idea to use a 3-D pie chart?
A. Never. Only as an example of bad data visualization!

Bar Charts

Bar charts are used to depict the values of a quantity or quantities across categories. For example, to depict sales by department, or per product type.

This type of chart can be (and is) used to depict values over time; however, the chunks of time should be discrete (e.g. quarters, years) and few in number. When a comparison is to be done over time and the number of periods / data points is larger, it is better visualized using a line chart.


As the number of categories becomes large, an alternative to the usual arrangement ('column' chart) is to arrange the categories vertically and bars horizontally. Note this is best done only for categorical / nominal data as data with an implied order (ordinal, interval, or ratio type data) should be displayed left-to-right in increasing order to be consistent with reading left to right.

Bar charts may also be stacked in order to depict both the values between categories as well as the total across them. If the absolute values are not important, then stacked bar charts may be used in this way in the place of several pie charts, with all bars having a maximum height of 100%:


Stephen Few contends that this still makes it difficult to compare proportions, similar to the problem with pie charts, and has other suggestions [PDF], though I think it is fine on some occasions, depending on the nature of the data being depicted.
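The 100% stacked bar described above amounts to rescaling each bar's segments so they sum to 100. A minimal sketch, assuming made-up sales figures for three departments over two quarters (the names and numbers are hypothetical):

```python
def to_percent_stack(values):
    """Rescale one bar's segment values so they sum to 100 (percent)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize a bar whose segments sum to zero")
    return [100.0 * value / total for value in values]

# Hypothetical data: sales for three departments in two quarters.
sales = {
    "Q1": [30, 50, 20],
    "Q2": [45, 90, 45],
}

stacks = {period: to_percent_stack(values) for period, values in sales.items()}
print(stacks["Q1"])  # [30.0, 50.0, 20.0]
print(stacks["Q2"])  # [25.0, 50.0, 25.0]
```

Note what is lost: Q2 sold far more in total than Q1, but after normalization both bars are the same height, which is why this form only suits cases where the absolute values are not important.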

When creating bar charts it is important to always start the y-axis from zero so as not to produce a misleading graph.
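A quick worked example of why a truncated axis misleads: a bar's drawn height is its value minus the axis starting point, so the visual ratio between two bars changes with the baseline. The values below (95 and 100, with the axis starting at 90) are invented for illustration:

```python
def drawn_height_ratio(smaller, larger, axis_start=0.0):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    return (larger - axis_start) / (smaller - axis_start)

# Hypothetical values: two bars of 95 and 100.
honest = drawn_height_ratio(95, 100)                    # axis starts at zero
truncated = drawn_height_ratio(95, 100, axis_start=90)  # axis starts at 90

print(round(honest, 3))  # 1.053
print(truncated)         # 2.0
```

With the axis starting at 90, a difference of about 5% is drawn as one bar being twice the height of the other.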

A column chart may also be combined with a line graph of the cumulative total across categories in a type of combo chart known as a Pareto chart.
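In a Pareto chart the bars are conventionally sorted from largest to smallest, and the line shows the running (cumulative) share of the total. Here is a minimal sketch of preparing those two series; the defect categories and counts are hypothetical:

```python
def pareto_series(counts):
    """Return (labels, values, cumulative_percent), sorted by descending value."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ordered)
    labels, values, cumulative = [], [], []
    running = 0
    for label, value in ordered:
        running += value
        labels.append(label)
        values.append(value)
        cumulative.append(100.0 * running / total)
    return labels, values, cumulative

# Hypothetical data: counts of defects by type.
defects = {"Scratches": 10, "Dents": 50, "Misprints": 25, "Other": 15}

labels, values, cumulative = pareto_series(defects)
print(labels)      # ['Dents', 'Misprints', 'Other', 'Scratches']
print(cumulative)  # [50.0, 75.0, 90.0, 100.0]
```

The values list would be drawn as columns and the cumulative list as the line, usually against a secondary axis running from 0 to 100%.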

Scatterplot (and Bubble Graphs)

Scatterplots are used to depict a relationship between two quantitative variables. The value pairs for the variables are plotted against each other, as below:


To depict relationships occurring over time, we instead use a special type of scatterplot known as a line graph (next section).

A bubble chart is a type of scatterplot used to compare relationships between three variables, where the points are sized by area according to a third value. Care should be taken to ensure that the points are sized correctly in this type of chart, so as not to incorrectly depict the relative proportion of quantities.
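Sizing by area means the radius should scale with the square root of the value; scaling the radius directly with the value exaggerates differences, because area grows with the square of the radius. A minimal sketch, where the function name and the 20-unit maximum radius are assumptions for illustration:

```python
import math

def bubble_radius(value, max_value, max_radius):
    """Radius that makes the bubble's AREA proportional to value."""
    return max_radius * math.sqrt(value / max_value)

# Hypothetical: the largest value (100) gets a radius of 20 units.
r_large = bubble_radius(100, 100, 20.0)
r_small = bubble_radius(25, 100, 20.0)

print(r_small)  # 10.0

# Check: the area ratio matches the value ratio (25 / 100).
area_ratio = (math.pi * r_small ** 2) / (math.pi * r_large ** 2)
print(round(area_ratio, 2))  # 0.25
```

Had the radius been scaled linearly, the smaller bubble would have a radius of 5 and only one sixteenth of the area, understating a value that is really a quarter of the larger one.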

Relationships between four variables may also be visualized by colouring each point according to the value of a fourth variable, though this may be a lot of information to depict all at once, depending upon the nature of the data. When animated to include a fifth variable (usually time) it is known as a motion chart, which is perhaps most famously demonstrated in Hans Rosling's landmark TED Talk which has become somewhat of a legend.

Line Graphs

Line graphs are usually used to depict quantities changing over time. They may also be used to depict relationships between two (numeric) quantities when there is continuity in both.

For example, it makes sense to compare sales over time with a line graph, as time is a numerical quantity that varies continuously:


However it would not make sense to use a line graph to compare sales across departments as that is categorical / nominal. Note that there is one exception to this rule and that is the aforementioned Pareto chart.

Omitting the points on the line graph and using a smooth curve instead of line segments creates an impression of more data being plotted, and hence a greater continuity. Compare the plot above with the one below:


So practically speaking, save the smooth line graphs for when you have a lot of data and the points would just be visual clutter. Otherwise, it's best to overplot the points to be clear about what quantities are being communicated.

Also note that unlike a bar chart, it is acceptable to have a non-zero starting point for the y-axis of a line graph as the change in values is being depicted, not their absolute values.

Now Go Be Great!

This is just a sample of some of the finer differences between the choices for visualizing data. There are of course many more ways to depict data, and I would argue that the possibilities for data visualization are limited only by the imagination of the visualizer. However, when sticking with the tried, true and familiar, keep these points in mind to be great at what you do and get your point across quantitatively and visually.

Go, visualize the data, and be amazing!

Friday, January 4, 2013

What The Smeg? Some Text Analysis of the Red Dwarf Scripts

Introduction

Just as Pocket fundamentally changed my reading behaviour, I am finding that now having Netflix (and even before that, other downloadable or streaming digital content) is really changing my behaviour as far as television is concerned.

Where watching TV used to be an affair of browsing through 500 channels and complaining there was nothing on, now with the advent of on-demand digital services there is choice. Instead of flipping through hundreds of channels (is that a linear search or a random walk?), most of which have nothing whatsoever that interests you, now you can search for exactly the show you are looking for and watch it when you want. Without commercials.

Wait, what? That's amazing! No wonder people are 'cutting the cord' and media corporations are concerned about the future of their business model.

True, you can still browse. People complain that the selection on Netflix is bad for Canada, but for 8 dollars a month, really it's pretty good what you're getting. And given the.... eclectic nature of the selection I sometimes find myself watching something I would never think to look for directly, or give a second chance if I just caught 5 minutes of the middle of it on cable.

Such is the case with Red Dwarf. Red Dwarf is one of those shows that gained a cult following, and, despite its many flaws, for me has a certain charm and some great moments. This despite my not being able to understand all of the jokes (or dialogue!) as it is a show from the BBC.

The point is that before Netflix, I probably wouldn't have come across something like this, and I definitely wouldn't have watched all of it if that option weren't so easily laid out.

So I watched a lot of this show and got to thinking, why not take this as an opportunity to do some more everyday analytics?

Background

If you're not familiar with the show or a fan, I'll briefly summarize here so you're not totally lost.

The series centers around Dave Lister, an underachieving chicken-soup vending machine repairman aboard the intergalactic mining ship Red Dwarf. Lister inadvertently becomes the last human being alive after being put into stasis for 3 million years by the ship's computer, Holly, when there is a radiation leak aboard the ship. The remainder of the ship's crew are Arnold J. Rimmer, a hologram of Lister's now-deceased bunkmate and superior officer; The Cat, a humanoid evolved from Lister's pet cat; Kryten, a neurotic sanitation droid; and later Kristine Kochanski, a love interest who gets brought back to life from another dimension.

Conveniently, the Red Dwarf scripts are available online, transcribed by dedicated fans of the program. This just goes to show that the series truly does have a cult following, when there are fans who love the show so much as to sit and transcribe episodes just for its own sake! But then again, I am doing data analysis and visualization on that same show....

Analysis

Of the ten seasons and 61 episodes of the series, the data set covers Seasons 1-8 and comprises 51 of those 52 episodes (S08E03 - Back In The Red (Part III) is missing).

I did some text analysis of the data with the tm package for R. 
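For readers who don't use R, the core idea (counting how often each term appears in each episode) can be sketched in a few lines of Python. This is only an illustration of the technique, not the tm-based code used for the analysis, and the two snippets of dialogue are invented placeholders rather than lines from the actual scripts:

```python
import re
from collections import Counter

def count_terms(script_text, terms):
    """Count occurrences of each term (case-insensitive, whole words)."""
    words = re.findall(r"[a-z']+", script_text.lower())
    counts = Counter(words)
    # A Counter returns 0 for words that never appear.
    return {term: counts[term] for term in terms}

# Placeholder text standing in for two episode scripts (not real dialogue).
episodes = {
    "S01E01": "LISTER: Have you seen the cat? RIMMER: Ask Holly, Lister.",
    "S03E01": "KRYTEN: Mr Lister, sir! LISTER: Not now, Kryten.",
}

terms = ["lister", "rimmer", "holly", "kryten"]
term_counts = {name: count_terms(text, terms) for name, text in episodes.items()}

print(term_counts["S01E01"])  # {'lister': 2, 'rimmer': 1, 'holly': 1, 'kryten': 0}
print(term_counts["S03E01"])  # {'lister': 2, 'rimmer': 0, 'holly': 0, 'kryten': 2}
```

Because the speaker labels are counted along with the dialogue, each count mixes lines spoken by a character with mentions of their name. Plotting one term's counts across episodes in order gives a per-character series over the run of the show.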

First we can see the prevalence of different characters within the show over the course of the series. I've omitted the x-axis labels as they made the chart appear cluttered; you can see them by interacting with the chart.



Lister and Rimmer, the two main characters, have the highest number of mentions overall. Kryten appears in the eponymous S02E01 and is then later introduced as one of the core characters at the beginning of Season 3. The Cat remains fairly constant throughout the whole series as he appears or speaks mainly for comedic value. In S01E06, Rimmer makes a duplicate of himself, which explains the high number of lines by his character and mentions of his name in the script. You can see he disappears after Episode 2 of Season 7, in which his character is written out, until re-appearing in Season 8 (he appears in S07E05 as there is an episode dedicated to the rest of the crew reminiscing about him).



Holly, the ship's computer, appears consistently at the beginning of the program until disappearing with the Red Dwarf towards the beginning of Season 6. He is later reintroduced when it returns at the beginning of Season 8.

Lister wants to bring back Kochanski as a hologram in S01E03, and she also appears in S02E04, as it is a time travel episode. She is introduced as one of the core cast members in Episode 3 of Season 7 and continues to be so until the end of the series.

Ace is Rimmer's macho alter-ego from another dimension. He appears a couple of times in the series before S07E02, in which he is used as a plot device to write Rimmer out of the show for that season.



Appearance and mentions of other crew members of the Dwarf correspond to the beginning of the series and the end (Season 8) when they are reintroduced. The Captain, Hollister, appears much more frequently towards the end of the show.



Robots appear mainly as one-offs who are the focus of a single episode. The exceptions are the Scutters (Red Dwarf's utility droids), whose appearances coincide with the parts of the show where the Dwarf exists, and simulants, which are mentioned occasionally as villains / plot devices. The toaster and snarky dispensing machine also appear towards the beginning and end, with the former also having speaking parts in S04E04.



As mentioned before, the Dwarf gets destroyed towards the end of Season 5 until being reintroduced at the beginning of Season 8. During this time, the crew live in one of the ship's shuttlecraft, Starbug. You can also see that Starbug is mentioned more frequently in episodes when the crew go on excursions (e.g. Season 3, Episodes 1 and 2).



One of the recurring themes of the show is how much Lister really enjoys Indian food, particularly chicken vindaloo. That and how he'd much rather just drink beer at the pub than do anything. S04E02 (spike 1) features a monster, a Chicken Vindaloo man (don't ask), and the whole premise of S07E01 (spike 2) is Lister wanting to go back in time to get poppadoms.



Thought this would be fun. Space is a consistent theme of the show, obviously. S07E01 is a time travel episode, and the episodes with Pete (Season 8, 6-7) at the end feature a time-altering device.

Conclusions

I recall talking to an associate of mine who recounted his experiences in a data analysis and programming workshop where the data set used was the Enron emails. As he quite rightly pointed out, he knew nothing about the Enron emails, so doing the analysis was difficult - he wasn't quite sure what he was looking at, or what he should be expecting. He said he later used the Seinfeld scripts as a starting point, as this was at least something he was familiar with.

And that's an excellent point. You don't necessarily need to be a subject matter expert to be an analyst, but it sure helps to have some idea what exactly you are analyzing. Also, I would think you are more likely to care about what you are analyzing if you know something about it.

On that note, it was enjoyable to analyze the scripts in this manner, and see something so familiar as a television show visualized as data like any other. I think the major themes and changes in the plotlines of the show were well represented in this way.

In terms of future directions, I tried looking at the correlation between terms using the findAssocs() function but got strange results, which I believe is due to the small number of documents. At a later point I'd like to do that properly, with a larger number of documents (perhaps tweets). Also this would work better if synonym replacement for the characters was handled in the original corpus, instead of ad-hoc and after the fact (see code).
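As a rough idea of what handling synonym replacement in the original corpus could look like, here is a minimal Python sketch (not the R code from the repository). It rewrites alternate names to one canonical name per character before any counting is done; the alias table is a hypothetical example built from the character names mentioned above:

```python
import re

# Hypothetical alias table: alternate name -> canonical (lowercase) name.
ALIASES = {
    "dave lister": "lister",
    "dave": "lister",
    "arnold": "rimmer",
}

def normalize_names(text, aliases):
    """Replace every alias in text with its canonical name."""
    # Try longer aliases first so "Dave Lister" is not rewritten
    # to "lister Lister" and counted twice.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: aliases[match.group(0).lower()], text)

print(normalize_names("Dave Lister met Arnold. Dave sighed.", ALIASES))
# lister met rimmer. lister sighed.
```

Doing this once, up front, means every later step (term counts, associations) sees a single consistent name per character instead of needing after-the-fact merging.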

Lastly, another thing I took away from all this is that cult TV shows have very, very devoted fan-bases. Probably due to its systemic bias, there is an awful lot about Red Dwarf on Wikipedia, and elsewhere on the internet.

Resources

code and data on github
https://github.com/mylesmharrison/reddwarf

Red Dwarf Scripts (Lady of the Cake)