Sunday, February 23, 2014

Toronto Data Science Group - A Survey of Data Visualization Techniques and Practice

Recently I spoke at the Toronto Data Science group. The folks at Mozilla were kind enough to record it and put it on Air Mozilla, so here it is for your viewing pleasure (and critique):


Overall it was quite well received. Aside from the usual "omg, does my voice really sound like that??" (which is to be expected), here are a few thoughts on the business of giving presentations that were quite salient here:

  • Talk slower and enunciate
  • Gesture, but not too much
  • Tailor sizing and colouring of visuals, depending on projection & audience size

I've reproduced the R code used to create the figures (including the bubble chart example, with code and data from FlowingData), which regrettably I neglected to save at the time. Here it is in a gist:

The visuals are also available on Slideshare.

Lessons learned: talk slower, always save your code, and Google stuff before starting - because somebody's probably already done it before you.

Sunday, February 16, 2014

In Critique of Slopegraphs

I've been doing more research into less common types of data visualization techniques recently, and was reading up on slopegraphs.

Andy Kirk wrote a piece praising slopegraphs last December, which goes over the construction of a slopegraph with some example data very nicely. However, I've seen some bad examples of data visualization using them across the web, and just thought I'd put in my two cents.

Introductory remarks

I tend to think of slopegraphs as a very boiled-down version of a normal line chart, in which you have only two values for your independent variable and strip away all the non-data ink. This works because if you label all the individual components, you no longer need the legend or axes, so all that cruft can go. Below is a before-and-after example, using the soccer data from Andy's post.

First as a line graph:


Hmm, that's not very enlightening, is it? There are so many values for the categorical variable (team) that the graph requires a plethora of colours in the legend, and a considerable amount of back-and-forth to interpret. Contrast this with the slopegraph, which is much easier to interpret as the individual values can be read off, and which also ditches the non-data ink of the axes:


Here it is much easier to read off values for the individual teams, it feels less cluttered, and more data have been encoded both in colour (orange for a decrease between the two years, and blue for an increase) as well as the thickness of the lines (thicker lines for change of > 25%).
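The two extra encodings can be sketched in a few lines of Python. This is only a rough illustration, not the code behind the figure: the team names and point totals are made-up placeholders, `line_style` is a hypothetical helper, and I'm assuming the 25% threshold refers to change relative to the first year's points (with grey for no change, which the figure does not specify).

```python
def line_style(points_start, points_end, threshold=0.25):
    """Return (colour, width) for one slopegraph line.

    Assumption: the threshold applies to the change relative to the
    first year's value. Grey for "no change" is also an assumption.
    """
    change = points_end - points_start
    if points_start == 0:
        # Relative change is undefined from zero; treat any change as large.
        relative = float("inf") if change else 0.0
    else:
        relative = abs(change) / points_start
    colour = "blue" if change > 0 else "orange" if change < 0 else "grey"
    width = "thick" if relative > threshold else "thin"
    return colour, width


# Placeholder data (not the real league table): team -> (year 1, year 2)
teams = {"Team A": (80, 89), "Team B": (60, 40), "Team C": (40, 52)}

for name, (start, end) in teams.items():
    print(name, line_style(start, end))
```

Keeping the styling rule in one small function makes it easy to reuse the same encoding in the alternative charts further down.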

Pros and Cons

In my opinion, the slope graph should be viewed as an extension of the line graph, and so even though traditional chart elements like the y-axis have been stripped away, consistency should be kept with the regular conventions of data visualization.

In the above example, Andy has correctly honoured vertical position, so that each team appears on either side of the graph at the correct height according to the number of points it has. This is the same as one of Dr. Tufte's original graphs (from The Visual Display of Quantitative Information), which follows the same practice and which I quite like:


Brilliant. However, when you no longer honour the vertical position to encode value, you lose the ability to truly compare across the categorical variable, which I tend to disagree with. This is usually done for legibility's sake (to "uncrowd" the graph when there are a lot of lines); however, I feel it could still be avoided in most cases. See below for an example.



Here the vertical position is not honoured, as some values which are smaller appear above those which are larger, so that the lines do not cross and the graph is uncluttered.

Also, it should be noted that in this case there are more than two values of the independent variable. As long as the scale in the vertical direction is still consistent, the changes in quantity can still be compared by the slope of the lines, even if the exact values cannot be compared because the vertical position no longer corresponds directly to quantity.

Either way, this type of slopegraph is closer to a group of sparklines (as Tufte originally noted), as it allows comparison of the changes in the dependent variable across values of the independent for each value of the categorical variable, but not the exact quantities.

Where things really start to fall apart, though, is when slopegraphs are used to connect values from two different variables. Charlie Park has some examples of this in his blog post on the subject, such as the one from Ben Fry below:


So here's the question - what, exactly, does the slope of the different lines correspond to? The variable on the left is win-loss record and on the right is total salary. Park correctly notes that in this case, the slopegraph is an extension of a parallel coordinates graph, which requires some further discussion.

A parallel coordinates graph is all very well and good for doing exploratory data analysis, and finding patterns in data with a large number of variables. However, I would avoid graphs like the one above in general - because the variables on the left and the right are not the same, the slope of the line is essentially meaningless.
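To make the "meaningless slope" point concrete, here is a small Python sketch for a parallel coordinates layout in which each axis is scaled independently. The numbers are placeholders rather than real baseball data, and `scaled_position` and `slope` are hypothetical helpers. The same data point yields a falling or a rising line depending only on the range chosen for the right-hand axis.

```python
def scaled_position(value, axis_min, axis_max):
    """Map a value to a 0-1 height on an axis spanning axis_min..axis_max."""
    return (value - axis_min) / (axis_max - axis_min)


def slope(left_height, right_height):
    """Rise of a line drawn between two axes placed one unit apart."""
    return right_height - left_height


# Placeholder values for a single team (not real data).
wins, payroll = 90, 100  # wins in games, payroll in millions of dollars

left = scaled_position(wins, 60, 100)  # height on the wins axis

# Same data, two equally defensible choices of payroll axis range.
right_a = scaled_position(payroll, 0, 200)
right_b = scaled_position(payroll, 50, 110)

print(slope(left, right_a))  # negative: the line falls
print(slope(left, right_b))  # positive: the line rises
```

When both axes measure the same variable on the same scale, as in a proper slopegraph, this ambiguity disappears.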

In the case of the baseball data, why not just display the information in a regular scatterplot, as below? Simple and clear. You can then include additional information using colour and size if desired, and make a bubble chart.


Was the disproportionately large payroll of the Yankees as obvious in the previous visualization? Maybe, but not as saliently. The relative size of the payroll was encoded in the thickness of the line, but quantity is not interpreted as quickly and accurately when encoded using area/thickness as it is when using position. Also, because the previous data were ranked (vertical position did not portray quantity), the much smaller number of wins by Kansas relative to the other teams was not as apparent as it is here.

Fry notes that he chose not to use a scatterplot as he wanted ranking for both quantities, which I suppose is the advantage of the original treatment, and something which is not depicted in the alternative I've presented. Also, Park correctly notes in the examples in his post that different visualizations draw the eye to different features of the data, and some people have more difficulty interpreting a visualization like a bubble chart than a slopegraph. Still, I remain a skeptical functionalist as far as visualization is concerned, and prefer the treatment above to the former.

Alternatives

I've presented some criticism of the slopegraphs here, but are there alternatives? Yes. In addition to the above, let's explore some others, using the data from the soccer example.

Really what we are interested in is the change in the quantity over the two values of the independent variable (year). So we can instead look at that quantity (change between the two years), and visualize it as a bar graph with a baseline of zero. Here the bars are again coloured by whether the change is positive or negative.
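The data preparation for such a bar graph is simple. Here is a rough Python sketch with placeholder teams and points (not the real table); `change_bars` is a hypothetical helper, and sorting the bars from largest gain to largest loss is my own assumption about a sensible ordering.

```python
def change_bars(points):
    """Build one bar per team from {team: (year1, year2)}.

    Bar height is the change in points (baseline of zero); colour
    encodes the direction of the change.
    """
    bars = []
    for team, (start, end) in points.items():
        change = end - start
        bars.append({
            "team": team,
            "height": change,
            "colour": "blue" if change >= 0 else "orange",
        })
    # Assumption: order bars from largest gain to largest loss.
    return sorted(bars, key=lambda bar: bar["height"], reverse=True)


# Placeholder data: team -> (points in year 1, points in year 2)
points = {"Team A": (80, 89), "Team B": (60, 40), "Team C": (40, 52)}

for bar in change_bars(points):
    print(bar)
```

Each dictionary holds everything needed to draw one bar in whatever plotting tool you prefer.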


This is fine; however we lost the information encoded in the thickness of the lines. We can encode that using the lightness (intensity) of the different bars. Dark for > 25% change, light for the others:



Hmm, not bad. However we've still lost the information about the absolute value of points each year. So let's make that the value along the horizontal axis instead.


Okay fine, now the length of the bars corresponds to the magnitude of the change in points across the two years, with positive changes being coloured blue and negative orange, and the shading corresponding to whether the change was greater or less than 25%.

However, even if I put in a legend and told you what the colours correspond to, it's pretty common for people to think of things as progressing from left to right (at least in Western cultures). The graph is difficult to interpret because for bars in orange the score for the first year is on the right, whereas for those in blue it's on the left. That is to say, we have the absolute values, but the direction of the change is not depicted well. Changing the bars to arrows solves this, as below:


Now we have the absolute values of the points in each year for each team, and the direction of the change is displayed better than just with colour. Adding the gridlines allows the viewer to read off the individual values of points more easily. Lastly, we encode the other categorical variable of interest (change greater/less than 25%) as the thickness of the line.


Like so. After creating the above independently, I discovered visualization consultant Naomi Robbins had already written about this type of chart on Forbes, as an alternative to using multiple pie charts. Jon Peltier also has an excellent in-depth description of how to make these types of charts in Excel, as well as showing another alternative visualization option to slopegraphs, using a dot plot.
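For anyone wanting to build this kind of arrow chart programmatically, the per-team geometry can be sketched in Python as follows. As before, the data are placeholders, `arrow_spec` is a hypothetical helper, and the 25% threshold is assumed to be relative to the first year's points.

```python
def arrow_spec(team, start, end, threshold=0.25):
    """Describe one arrow: tail at year-1 points, head at year-2 points."""
    change = end - start
    big_change = start != 0 and abs(change) / start > threshold
    return {
        "team": team,
        "tail": start,  # horizontal position of the first year's points
        "head": end,    # horizontal position of the second year's points
        "direction": "right" if change > 0 else "left" if change < 0 else "none",
        "colour": "blue" if change > 0 else "orange" if change < 0 else "grey",
        "width": "thick" if big_change else "thin",
    }


# Placeholder data: team -> (points in year 1, points in year 2)
points = {"Team A": (80, 89), "Team B": (60, 40)}

for team, (start, end) in points.items():
    print(arrow_spec(team, start, end))
```

Because the arrow head carries the direction, the colour becomes a redundant (but still helpful) encoding.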

Of course, providing the usual fixings for a graph such as a legend, title and proper axis labels would complete the above, which brings me to my last point. Though I think it's a good alternative to slopegraphs, it can in no way compete in simplicity with Dr. Tufte's example of a slopegraph, which had zero non-data ink. And, of course, this type of graph will not work when there are more than two values of the independent variable to compare across.

Closing Remarks

It is easy to tell who the true thought leaders in data visualization are, because they often take it upon themselves to find special cases for visualization where people struggle or visualize data poorly, and then invent new visualization types to fill the need (Tufte with the slopegraph, and Few with the bullet graph, which he came up with to supplant god-awful gauges on dashboards).

As I discussed, there are certain cases when slopegraphs should not be used, and I feel you would be better served by other types of graphs; in particular, cases where the slopegraph is a variation of the parallel coordinates chart rather than the line graph, or where quantity is not encoded in vertical position and comparing quantities for each value of the independent variable is important.

That being said, it is (as always) very important when making choices regarding data visualization to consider the pros and cons of different visualization types, the properties of the data you are trying to communicate, and, of course, the target audience.

Judiciously used, slopegraphs provide a highly efficient way in terms of data-ink ratio to visualize change in quantity across a categorical variable with a large number of values. Their appeal lies both in this and their elegant simplicity.

References & Resources

Slopegraphs discussion on Edward Tufte forum
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk

In Praise of Slopegraphs, by Andy Kirk

Edward Tufte's "Slopegraphs" by Charlie Park
http://charliepark.org/slopegraphs/

Peltier Tech: How to Make Arrow Charts in Excel
http://peltiertech.com/WordPress/arrow-charts-in-excel/

Salary vs. Performance of MLB Teams by Ben Fry
http://fathom.info/salaryper/

salary vs performance scatterplot (Tableau Public)

Saturday, February 8, 2014

Creepypasta - Votes vs. Rating (& learning ggplot2)

Excel:


R, base package:


R, ggplot:


Am I overfitting? Probably.


Code:

More fun stuff to come....

References

Source data at Creepypasta.com:

Code on gist:
http://gist.github.com/mylesmharrison/8886272

Creepypasta in list of Internet phenomena (Wikipedia):
http://en.wikipedia.org/wiki/Creepypasta#Other_phenomena