Monday, October 19, 2015

Good Data Visualization Should Be Boring

So I'm going to make a statement that I'm sure some people are going to disagree with: good data visualization should be boring.

Well, at least kind of boring.

I've had a lot of conversations with a lot of people over the last few years or so about data visualization: why it's important, what constitutes good and bad, and examples of its application in both problematic and very effective ways.

A salient point someone once made to me is that part of the problem with the practice of data visualization is that it isn't viewed as a standalone discipline; it's simply done, in high school math classes, university courses, or even in the workplace by professionals, and it's usually assumed that people will just pick it up without any discussion of it or its proper application.

I think this is gradually starting to change: with all the talk (or hype, depending on your point of view) around “Big Data”, analytics is becoming more mainstream, and data visualization along with it. I also think dataviz is beginning, gradually, very gradually, to be viewed as a standalone discipline, with courses now being offered in it, and with the “data visualization evangelism” of academics such as Edward Tufte and Alberto Cairo and the work of practitioners like Stephen Few and Mike Bostock helping to raise awareness of what doing it wrong and doing it right look like. All of this, along with others creating visualizations which go viral or delivering inspirational TED talks, is doing a lot for visualization as a practice.

The thing I found when I first started to get into dataviz is that even if you're good with data that doesn't necessarily mean you're good at visualizing it. This is because, in addition to working with data, doing proper visualization involves questions of design and also the psychology of perception.

Less is More

I'm a minimalist, and therefore take what I call a functionalist perspective on data visualization. That is to say, the purpose of visualization is to represent the data so that the audience can understand it as quickly and easily as possible.

As such, I feel that good data visualization should be somewhat dull, or at least somewhat dry; in terms of depicting information and people perceiving it, it is usually the case that simpler is better. This is illustrated in principles like Tufte's data-ink ratio.

So, look at the charts below. Which is more visually appealing to you? Which is simpler? Which one depicts the quantities such that you are able to interpret them the most quickly, accurately and with the most clarity?

If you're like me, you'll say the one on the right, which is a better visualization, even though it may not be as visually appealing to some. Most often you're better served by a simpler, cleaner visualization (or perhaps several of them) than a lot of complexity and visual noise that doesn't add to the reader's understanding.

Never say always

That being said, as I mentioned, choices around data visualization are ultimately ones of design. I do believe that there are some hard and fast rules that should never be broken (e.g. always start the y-axis at 0 for bar charts of strictly positive values, don't represent data with the same units on a secondary y-axis, never use a line chart for categorical data). However, I also believe there are some that are more flexible, depending on what you want to accomplish and who your audience is. Should you never, ever, use a pie chart? No. Some people are more comfortable with pie charts just from their familiarity with them. Is a bar chart a better choice in terms of representing the data? Yes. But that doesn't mean there aren't exceptions (just don't make a 3D one).

The same individual who made the observation about dataviz not being taught also pointed out to me another factor that can influence design choices: what she called chart fatigue. Is the bar chart the best way to plot a single metric across a categorical variable? Almost always, yes. But show a room full of businesspeople bar chart after bar chart after bar chart and anyone can tell you that they're all going to start to look the same, and interpretation of them is going to suffer as a result. Plus you're probably going to lose the interest of your audience.

Practice makes perfect

In conclusion, I think that awareness of data visualization is only going to get better as companies (and the average consumer) become more “data savvy”. It is my sincere hope that people will give more and more emphasis, not only to the importance of visualization as a tool, but also to the design choices around it, and what constitutes good and bad depictions of data.

For now, just remember that data visualization is ultimately all about communicating and having your reader understand, not necessarily wowing them (though both together are not impossible). And sometimes, that means boring is better.

Thursday, October 1, 2015

Visual Analytics of Every PS4 and Xbox One Game by Install Size


I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about "Big Data", predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to - after all, that's what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a console owner myself - I'm a PS4 guy, but no fanboy or hardcore gamer by any means - I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of 'everyday analytics'. So let's take a look, shall we?


Very little background required here - the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.


Okay, if I gave this data to your average Excel jockey what's the first thing they'd do? A high level summary of the data broken apart by categorical variables and summarized by quantitative? You got it!
We can see that far more PS4 games have been released than Xbox One games (462 vs. 336), and that the relative proportions of major and indie releases are reversed between the two platforms.

A small aside here on data visualization: it's worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I've also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I'm okay with doing it for relative proportions though - as in the below, which further illustrates the difference in release type proportion between the two consoles:
Finally, how does the install size differ between the consoles and game types? If I'm an average analyst and just trying to get a grip on the data, I'd take an average to start:
We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major.
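The summaries so far (counts and average install size, broken out by console and release type) can be sketched in a few lines of Python. This is only a minimal sketch: the rows below are made up for illustration and are not values from the actual data set, and the `summarize` helper is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; NOT values from the real data set.
# Each row: (title, console, release_type, size_gib)
games = [
    ("Game A", "PS4", "Major", 40.0),
    ("Game B", "PS4", "Indie", 1.5),
    ("Game C", "PS4", "Indie", 0.5),
    ("Game A", "Xbox One", "Major", 44.0),
    ("Game B", "Xbox One", "Indie", 2.5),
]

def summarize(rows):
    """Return {(console, release_type): (count, mean size in GiB)}."""
    groups = defaultdict(list)
    for _title, console, release_type, size_gib in rows:
        groups[(console, release_type)].append(size_gib)
    return {key: (len(sizes), mean(sizes)) for key, sizes in groups.items()}

summary = summarize(games)
for (console, release_type), (count, avg) in sorted(summary.items()):
    print(f"{console:9} {release_type:6} n={count} mean={avg:.2f} GiB")
```

This is the same thing a pivot table does in a spreadsheet: group by the categorical columns, then count and average the quantitative one.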

Okay, that's interesting. But if you're like me, you'll be thinking about how 99% of the phenomena in the universe are distributed by a power law or have some kind of non-Gaussian based distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see here in this combination histogram / cumulative distribution (in the form of a Pareto chart) that the games follow a power law, with approximately 55 and 65 percent being under 5 GiB for PS4 and Xbox games respectively.
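For anyone who wants to build the numbers behind a chart like this, here is a rough sketch of a histogram paired with a running cumulative percentage. The 5 GiB bin width matches the threshold mentioned above, but the sizes are made up for illustration and `pareto_table` is a hypothetical helper, not something from the original spreadsheet.

```python
def pareto_table(sizes, bin_width=5.0):
    """Bin sizes into [0, w), [w, 2w), ... and return a list of
    (bin_start, count, cumulative_percent) rows, one per bin."""
    if not sizes:
        return []
    counts = {}
    for s in sizes:
        b = int(s // bin_width)          # index of the bin this size falls into
        counts[b] = counts.get(b, 0) + 1
    rows = []
    running = 0
    total = len(sizes)
    for b in range(max(counts) + 1):     # include empty bins so the axis is continuous
        n = counts.get(b, 0)
        running += n
        rows.append((b * bin_width, n, 100.0 * running / total))
    return rows

# Made-up install sizes in GiB, for illustration only.
sizes = [0.3, 0.8, 1.2, 2.5, 4.9, 6.0, 12.0, 18.5, 27.0, 41.0]
table = pareto_table(sizes)
for start, count, cum_pct in table:
    print(f"{start:5.1f}+ GiB  count={count}  cumulative={cum_pct:.0f}%")
```

The counts give the bars and the cumulative percentage gives the line; the first row of the table is where a figure like "about 55 percent under 5 GiB" would come from.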

But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

No, we can see that even when broken apart by type of release, the power-law-like distribution of install sizes persists. I compared the averages to the medians and found the averages to still be decent representations of central tendency, not too affected by outliers.
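The mean-versus-median check is quick to reproduce. A minimal sketch, using made-up sizes rather than the real data, with `central_tendency` as a hypothetical helper name:

```python
from statistics import mean, median

def central_tendency(sizes):
    """Compare mean and median; a ratio well above 1 signals a long right tail."""
    m, md = mean(sizes), median(sizes)
    return {"mean": m, "median": md, "mean_to_median": m / md}

# Made-up sizes in GiB with one large outlier, for illustration only.
result = central_tendency([1.0, 2.0, 3.0, 4.0, 40.0])
print(result)
```

In this toy example the single 40 GiB value drags the mean far above the median, which is exactly the kind of distortion worth checking for in skewed data.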

Finally, we can look at the distribution of the install sizes using another type of visualization suited to this task, the boxplot. While it is at least possible to jury-rig a boxplot in Excel (see this excellent how-to over at Peltier Tech), Google Sheets doesn't give us as much to work with. I did my best, though (the data label is at the maximum point, and the value is the difference between the max and Q3):
The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures.
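One common way to fake a boxplot in a spreadsheet is to stack the gaps between the five summary statistics on top of an invisible base. Whether the chart here was built exactly that way is an assumption on my part, though the "max minus Q3" label suggests something similar. A sketch of the calculation, using made-up values and a hypothetical `boxplot_segments` helper (whiskers run to the min and max, with no special outlier handling):

```python
from statistics import quantiles

def boxplot_segments(values):
    """Return the stacked-bar segments needed to draw a simple boxplot."""
    data = sorted(values)
    # 'inclusive' treats the data as the full population (min and max included).
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    lo, hi = data[0], data[-1]
    return {
        "base (min)": lo,          # hidden segment that lifts the box off the axis
        "min to Q1": q1 - lo,      # lower whisker
        "Q1 to median": q2 - q1,   # lower half of the box
        "median to Q3": q3 - q2,   # upper half of the box
        "Q3 to max": hi - q3,      # upper whisker (the labelled value)
    }

# Made-up sizes in GiB, for illustration only.
segments = boxplot_segments([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
for name, height in segments.items():
    print(f"{name:14} {height:.1f}")
```

Because the segments are differences, they always sum back to the maximum, which is why the top of the stack lands on the max value.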

Okay, that's all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ?

Difference in Install Size by Console Type
Because we've seen that the Xbox install sizes are generally larger than the PlayStation ones, here I take the PS4 size to be the baseline for games which appear on both (that is, differences are of the form Xbox size - PS4 size).

Of the 618 unique titles in the whole data set (798 titles if you double count across platforms), 179 (~29%) were available on both - so a little under a third of games are released for both major consoles.
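Finding the cross-platform titles is a set intersection. A minimal sketch with placeholder titles (not real entries from the data set), assuming titles are spelled identically on both lists; `overlap_summary` is a hypothetical helper:

```python
def overlap_summary(ps4_titles, xbox_titles):
    """Count unique titles, titles on both consoles, and the share on both."""
    both = ps4_titles & xbox_titles      # titles released on both consoles
    unique = ps4_titles | xbox_titles    # every distinct title
    return {
        "unique": len(unique),
        "both": len(both),
        "share_both_pct": 100.0 * len(both) / len(unique),
    }

# Placeholder titles, for illustration only.
ps4_titles = {"Game A", "Game B", "Game C", "Game D"}
xbox_titles = {"Game A", "Game B", "Game E"}
overlap = overlap_summary(ps4_titles, xbox_titles)
print(overlap)

# The share quoted in the post, from its own counts: 179 of 618 titles.
print(round(100 * 179 / 618))  # 29
```

In practice the hard part is usually the matching itself, since the same game can be named slightly differently on each platform's list.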

Let's take a look at the difference in install sizes - do those games which appear for both reflect what we saw earlier?
Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft's console (152/179).

Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:
Unsurprisingly, on average indie games tend to have larger differences proportionally, because they're generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4, while major releases are about one and a quarter times as large. When games are larger on PS4, there's not as big a disparity, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former).

Finally, just to ground this a bit more I thought I'd look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:
For major releases, the worst offender for being larger on PS4 is Batman: Arkham Knight (~6.1 GiB difference) while on the opposite end, The Elder Scrolls Online has a ~19 GiB difference. Wow.

For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!).
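To make the percentages concrete: with the PS4 size as the baseline, the percentage difference is (Xbox size - PS4 size) / PS4 size. Given an absolute difference and a percentage, you can back out roughly what the two install sizes must have been. A small sketch; the function names are hypothetical, and the implied sizes are only approximate because they are derived from the rounded figures quoted above rather than read from the data set:

```python
def pct_difference(xbox_gib, ps4_gib):
    """Percentage difference with the PS4 size as the baseline."""
    return 100.0 * (xbox_gib - ps4_gib) / ps4_gib

def implied_sizes(diff_gib, pct_diff):
    """Back out (ps4_gib, xbox_gib) from an absolute and a percentage difference."""
    ps4 = diff_gib / (pct_diff / 100.0)
    return ps4, ps4 + diff_gib

# Warframe: 19.6 GiB bigger on Xbox, quoted as 503% larger.
ps4, xbox = implied_sizes(19.6, 503.0)
print(f"PS4 ~{ps4:.1f} GiB, Xbox ~{xbox:.1f} GiB")  # PS4 ~3.9 GiB, Xbox ~23.5 GiB
```

So a 503% difference means the Xbox install is about six times the size of the PS4 one, which is why a small baseline makes indie percentages look so dramatic.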

Finally, I've visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.


It's been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the 'everyday analytics' category.

And, as a result, we've learned:

  • Xbox games generally tend to have larger install sizes than PS4 ones, even for the same title
  • Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)
  • What the heck a GiB is

Until next time then, don't fail to keep looking for the simple beauty in data all around you.

References & Resources

Complete List of Xbox One Install Sizes:

Complete List of PlayStation 4 Install Sizes:

Compiled data set (Google Sheets):

Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech: