Monday, May 4, 2015

Data Visualization Fundamentals with Skittles

So I have a shocking confession to make: I love Skittles.

This post is not sponsored, endorsed, compensated, or paid for in any way, shape or form, by Skittles Candy. I'm not particular - I like other types of candy that are similar - you know, those ones that are chocolate covered in a hard shell, whether they be the kind where you eat the red ones last or not.

Anyhow, I got to thinking about how, abstractly, each individual candy can be viewed like a pixel of a different color. So you can make art using candy, just like artists make a mosaic. There's lots of this on the internet you can already see: in fact, Skittles has done print advertising this way.

But... each individual candy can also represent something else: a unit of measurement. I thought it would be cool to go through some data visualization fundamentals using the candy in this way. So let's dive in.

Data Visualization using only 1 bag of Skittles

So, what would your average first grader do with a bag of Skittles if you asked them to sort it? Probably something like below, the physical equivalent of a bubble chart depicting the quantities of each colour by area, assuming each Skittle is approximately the same size.


A perhaps more useful way to do the same would be to organize each colour in rows, with each row a set number (like tally marks). Here it's not only easy to see the relative proportions of the different colours in the bag, but also count them as each row and group is a set number (5 & 10, respectively). This is equivalent to a pictogram, with each Skittle representing, well, 1 Skittle:
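If you don't have a bag handy, the same layout can be roughed out in text. This is only a minimal sketch: the counts are made up for illustration (they are not the bag from this post), and the `pictogram` function and its parameters are hypothetical names of my own. Each symbol stands for one Skittle, rows hold 5, and a gap appears after every full group of 10.

```python
# Hypothetical counts for illustration only; not the actual bag from this post.
counts = {"orange": 12, "red": 7}


def pictogram(counts, row_size=5, group_size=10, symbol="o"):
    """Draw one symbol per unit, in rows of row_size, with a blank line
    after each complete group (group_size should be a multiple of row_size)."""
    lines = []
    for colour, n in counts.items():
        lines.append(f"{colour} ({n})")
        for start in range(0, n, row_size):
            row_len = min(row_size, n - start)
            lines.append("  " + symbol * row_len)
            drawn = start + row_len
            # Leave a visual gap after a full group, unless it was the last row.
            if drawn % group_size == 0 and drawn < n:
                lines.append("")
    return "\n".join(lines)


print(pictogram(counts))
```

Because every row and group is a fixed size, you can count by fives and tens instead of one at a time, which is the whole point of the tally layout.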



It's not a big stretch of the imagination to collapse those groups together into groups of a set height. So here we have a proportional bar chart, where the length of each bar represents the percentage of the bag that is each colour. Note that because I didn't slice Skittles in half, the physical analogue is not exactly the same as what you'd put down on paper or in Excel (there is one additional unit for yellow and orange):
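The whole-Skittle problem is really a rounding problem, and a short sketch shows one way an extra unit can creep in. The counts and the bar length below are hypothetical (not the data from this post), `whole_units` is a name I made up, and round-to-nearest is an assumption about how you might choose to lay the candy out.

```python
# Hypothetical counts for illustration only; not the actual bag from this post.
counts = {"orange": 8, "yellow": 8, "red": 5, "green": 5, "purple": 4}


def whole_units(counts, bar_units=10):
    """Scale each count to a share of bar_units, rounded to the nearest
    whole unit (halves round up), using integer arithmetic throughout."""
    total = sum(counts.values())
    return {
        colour: (2 * n * bar_units + total) // (2 * total)
        for colour, n in counts.items()
    }


units = whole_units(counts)
for colour, u in units.items():
    exact = 100 * counts[colour] / sum(counts.values())
    print(f"{colour:<7} {exact:5.1f}%  -> {u} whole units")

# The rounded pieces need not add up to the intended bar length.
print("intended length:", 10, "actual length:", sum(units.values()))
```

With these made-up numbers the rounded bars add up to 11 units rather than 10, which is the same kind of mismatch you get between the candy version and a chart drawn on paper or in Excel.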



And, since I often have to remind people of this rule (and see plenty of people not following it): it is best practice to sort the bars in descending order for maximum clarity and comparative value, assuming there is not another, more important ordering:
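In code, the sorting rule is a one-liner. Here is a minimal sketch with hypothetical counts (again, not the bag from this post) and a made-up helper name, `sorted_bars`, which draws text bars from largest to smallest and appends the count as a data label.

```python
# Hypothetical counts for illustration only; not the actual bag from this post.
counts = {"red": 3, "orange": 5, "green": 4}


def sorted_bars(counts, symbol="#"):
    """Return one text bar per category, largest first, with the count
    appended as a data label. Ties keep their original order."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    width = max(len(colour) for colour in counts)
    return [f"{colour:<{width}} {symbol * n} {n}" for colour, n in ordered]


for line in sorted_bars(counts):
    print(line)
```

If the categories have a natural order of their own (months, age groups and so on), skip the sort and keep that order instead.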



And, if we want to transform our proportional bar chart into one comparing absolute quantities, it is not a giant stretch of the imagination to break apart the different bars so they are only one 'pixel' high:



Here it's much easier to get an idea of the absolute number of each colour in the bag, but harder to tally those numbers exactly - for that we'd need to add an axis or data labels.

More bags please

Okay, I have another shocking confession to make: I lied. I really like Skittles. So I actually bought a whole bunch of bags.



So let's look at some more visualization fundamentals, where we need to compare not only across a categorical variable (colour) but also between groups.

Here is the equivalent to our first graph from before, only showing the different numbers of Skittles in each bag. You can see there's actually a fair amount of variance; the smallest bag had 89 pieces of candy, whereas the largest had 110.


Now let's make a bubble graph which not only compares the sizes between the different bags, but also their makeups by colour. The end result is actually closer to a collection of pie charts:


We can also group by colour only to see the overall makeup for the whole group of bags. Whereas orange dominated in the first bag we looked at, you can see here that orange and yellow are approximately at parity overall.


Now let's look at the tally mark / pictograph method. Here each row represents a bag:


You can see there's a fair bit of variance in the different colours. I also tried rearranging things so the result was less like a pictograph and more like a treemap:


Really the best way to compare would be a bar graph. Here's a stacked bar graph. I didn't bother sorting by length, because at this point I was pretty tired of shuffling Skittles around:


To get a better idea of the different makeups of each bag by colour, we can break this out into a grouped bar graph, first by bag, then by colour:


And, of course, we can reverse the order if we want to more directly compare the colour makeups. The columns are in numerical order by bag. And just for fun, we'll make this one a column chart:


There. That's better! Clearly Bag 1 was an outlier as far as the number of purple went, and Bag 3 had a lot of yellow. 
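Switching a grouped chart from "by bag, then by colour" to "by colour, then by bag" is just a regrouping of the same numbers. Here is a minimal sketch of that step, assuming the data is held as a nested dictionary; the counts are hypothetical (not the bags from this post) and `by_colour` is a name I made up.

```python
# Hypothetical counts for illustration only; not the actual bags from this post.
bags = {
    "Bag 1": {"purple": 5, "yellow": 3},
    "Bag 2": {"purple": 2, "yellow": 4},
}


def by_colour(bags):
    """Regroup bag -> colour -> count into colour -> bag -> count."""
    grouped = {}
    for bag, counts in bags.items():
        for colour, n in counts.items():
            grouped.setdefault(colour, {})[bag] = n
    return grouped


for colour, per_bag in by_colour(bags).items():
    print(colour, per_bag)
```

Grouping by colour puts the bars you want to compare right next to each other, which is why differences between bags are easier to spot in that arrangement.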

Concluding Remark

I thought it'd be cool to mix things up a bit, and try doing some data visualization using a physical medium. The end result ended up being something more like an exercise for an elementary school mathematics class (indeed, there are many examples of this online), but I think it still drives home some of the fundamental strengths and weaknesses of different visualization types, as well as showing how they can be depicted using different media.

If you're really interested, you can download the data yourself and slice and dice visualizations to your heart's content. And I'm sure if you bought enough bags of Skittles you could learn something of a statistical nature about their manufacturing and packaging process - but perhaps that's for a different day. Until then I'll just enjoy good candy and data visualization.