Tuesday, December 16, 2014

The Mandelbrot Set in R

Introduction

I was having a conversation with someone I know about weather forecasts the other day and it went something like this:

"Yes, their forecasts are really not very good. They really need to work on improving them."
"Well, yes, I think they're okay, considering what they're trying to do is impossible."

The thought that a relatively simple set of equations can give rise to infinitely complex behavior is something with intrigue and appeal well beyond the mathematically minded and academic. Fractals became a part of pop culture, and the idea of chaos is not unknown to those outside mathematical research, popularized as it was by the term "The Butterfly Effect" (though unfortunately I can't say the movie starring Ashton Kutcher was anything to write home about).

When I was finishing high school and starting my undergraduate degree I got very interested in fractals and chaos. What really spurred this interest was James Gleick's Chaos, which outlined the history of the "mathematics of chaos" in a highly readable, narrative way. So much so that I later went on to take a pure mathematics course in chaos during my degree, and wrote a program in MATLAB for exploring the Mandelbrot set.

So I thought I'd give it a shot in R.

Background

Unquestionably the most well known fractal of all is the Mandelbrot fractal. Without going into laborious mathematical detail, the Mandelbrot set stems from complex numbers, which are numbers of the form z = x + yi. The number i, the "imaginary unit", is the special quantity such that i² = -1. As a result, complex arithmetic can behave quite differently from arithmetic on the real number line: squaring a complex number rotates it as well as scales it, so repeatedly applying a rule as simple as "square, then add a constant" does not necessarily send a point flying off to infinity.

The Mandelbrot set is the set of numbers produced by the following procedure:
1. Start with z = 0 at every point c of the complex plane.
2. Iterate the formula

zₙ₊₁ = zₙ² + c

where c is the complex number corresponding to each point in the plane.
3. The Mandelbrot set is the set of points c for which z remains bounded for all n.

Mathematically it can be shown that if, under iteration, the magnitude of z ever exceeds 2, then c is not in the set.

In practice one keeps track of the number of iterations it takes z to diverge for a given c, and then colours the points according to their "speed of divergence".

Execution

In R this is straightforward code. Pick some parameters about the space (range and resolution for x and y, number of iterations, etc.) and then iterate. Naively, the code could be executed as below:
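Something along these lines, looping explicitly over every pixel and every iteration (the original listing isn't reproduced here, so the region, resolution and iteration cap below are just illustrative values, not the original settings):

# Naive approach: explicit loops over every pixel and every iteration
xmin <- -2; xmax <- 1; ymin <- -1.5; ymax <- 1.5   # region of the complex plane
res <- 300                                          # grid resolution
max_iter <- 50                                      # iteration cap

x <- seq(xmin, xmax, length.out=res)
y <- seq(ymin, ymax, length.out=res)
counts <- matrix(0, res, res)   # iterations until divergence (0 = never diverged)

for (i in 1:res) {
  for (j in 1:res) {
    c0 <- complex(real=x[i], imaginary=y[j])
    z <- 0+0i
    for (k in 1:max_iter) {
      z <- z^2 + c0
      if (Mod(z) > 2) { counts[i, j] <- k; break }
    }
  }
}

image(x, y, counts, col=c("black", heat.colors(max_iter)))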

However, if you know anything about writing good code in R, you know that for loops are bad, and it's better to rely upon the highly optimized underpinnings of the language by using array-based functions like which and lapply.  So a more efficient version of the same code is below:
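Again as a sketch rather than the original listing, the vectorized version iterates the whole grid of points at once and uses which() to update only the points that have not yet escaped:

# Vectorized approach: iterate the whole complex grid at once, using which()
# to keep updating only the points that have not yet escaped
res <- 300
max_iter <- 50
x <- seq(-2, 1, length.out=res)
y <- seq(-1.5, 1.5, length.out=res)

c_grid <- outer(x, y*1i, "+")      # every point c in the plane
z      <- matrix(0+0i, res, res)   # z starts at zero everywhere
counts <- matrix(0, res, res)      # iterations before divergence

for (k in 1:max_iter) {
  alive <- which(Mod(z) <= 2)      # indices of points still bounded
  z[alive] <- z[alive]^2 + c_grid[alive]
  counts[alive] <- k
}

image(x, y, counts, col=c(heat.colors(max_iter - 1), "black"))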

We can then plunk these into functions where the different criteria for the rendered fractal are parameters:
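For example (the function names, signatures and defaults here are my own guesses at how the original was structured, not the original code), the two versions above can be wrapped as functions, with a small helper to time one against the other:

# The same two approaches wrapped as functions, with the rendered region,
# resolution and iteration cap as parameters (names and defaults are guesses)
naive_mandelbrot <- function(xmin=-2, xmax=1, ymin=-1.5, ymax=1.5,
                             res=500, max_iter=50) {
  x <- seq(xmin, xmax, length.out=res)
  y <- seq(ymin, ymax, length.out=res)
  counts <- matrix(0, res, res)
  for (i in 1:res) for (j in 1:res) {
    c0 <- complex(real=x[i], imaginary=y[j])
    z <- 0+0i
    for (k in 1:max_iter) {
      z <- z^2 + c0
      if (Mod(z) > 2) { counts[i, j] <- k; break }
    }
  }
  list(x=x, y=y, counts=counts)
}

vectorized_mandelbrot <- function(xmin=-2, xmax=1, ymin=-1.5, ymax=1.5,
                                  res=500, max_iter=50) {
  x <- seq(xmin, xmax, length.out=res)
  y <- seq(ymin, ymax, length.out=res)
  c_grid <- outer(x, y*1i, "+")
  z      <- matrix(0+0i, res, res)
  counts <- matrix(0, res, res)
  for (k in 1:max_iter) {
    alive <- which(Mod(z) <= 2)
    z[alive] <- z[alive]^2 + c_grid[alive]
    counts[alive] <- k
  }
  list(x=x, y=y, counts=counts)
}

# Time the naive version against the vectorized one on the same settings
compare_runtimes <- function(...) {
  print(system.time(naive_mandelbrot(...)))
  print(system.time(vectorized_mandelbrot(...)))
}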


Let's compare the runtimes of the two, shall we? For the base settings, how does the naive version compare to the vectorized one?

> compare_runtimes()
   user  system elapsed 
 37.713   0.064  37.773 
   user  system elapsed 
  0.482   0.094   0.576 

The results speak for themselves: a ~65x decrease in runtime by using R indexing functions instead of for loops! This definitely speaks to the importance of writing optimized code taking advantage of R's vectorized functions.

You can tweak the different parameters for the function to return different parts of the complex space. Below are some pretty example plots of what the script can do with different parameters:

Conclusion

What does all this have to do with my conversation about the weather? Why, everything, of course! It's where chaos theory came from. Enjoy the pretty pictures. It was fun getting R to churn out some nice fractals and good to take a trip down memory lane.

References and Resources

The Mandelbrot Set:

code on github:

Sunday, November 9, 2014

Toronto Cats and Dogs II - Top 25 Names of 2014

I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.

I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.

First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.




Next let's take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:
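As a rough sketch of that calculation (the data frame layout here is a toy stand-in for the open data file, and the counts are made up purely for illustration):

# Toy example of pooling name counts by name length (layout and counts are
# illustrative placeholders, not the real Toronto data)
pets <- data.frame(
  name   = c("Charlie", "Max", "Bella", "Princess"),
  count  = c(25, 24, 20, 9),        # made-up counts
  animal = c("dog", "dog", "cat", "cat")
)

pets$name_length <- nchar(pets$name)
by_length <- xtabs(count ~ animal + name_length, data=pets)
barplot(by_length, beside=TRUE, legend=TRUE,
        xlab="Name length", ylab="Count")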

You can see that, proportionally, the top dog names are a bit shorter (the distribution is positively / right-skewed) compared to the cat names (slightly negatively / left-skewed). Also note that both are centered around names of length 5, and the one cat name of length 8 (Princess).

Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:
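A sketch of that check, reusing the toy pets data frame from above (and counting 'y' as a vowel here):

# Flag names ending in a vowel and compare the proportions by animal
pets$ends_in_vowel <- grepl("[aeiouy]$", tolower(pets$name))
prop.table(xtabs(count ~ animal + ends_in_vowel, data=pets), margin=1)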

Which, to me, seems to indicate that more dogs than cats tend to have "cutesy" names, usually ending in 'y'.

Fun stuff, but one thing really bothers me... no "Fido" or "Boots"? I guess some once popular names have gone to the dogs.

References & Resources

Licensed Dog and Cat Names (Toronto Open Data)

Monday, October 20, 2014

Twitter Pop-up Analytics

Introduction


So I've been thinking a lot lately. Well, that's always true. I should say, I've been thinking a lot lately about the blog. When I started this blog I was very much into the whole quantified self thing, because it was new to me, I liked the data collection and analysis aspect, and I had a lot of time to play around with these little side projects.

When I started the blog I called it "everyday analytics" because that's what I saw it always being; analysis of data on topics that were part of everyday life, the ordinary viewed under the analytical lens, things that everyone can relate to. You can see this in my original about page for the blog which has remained the same since inception.

I've been thinking a lot lately about how my interest in data analysis, visualization and analytics has matured, and how that means this isn't really the case so much anymore. The content of everyday analytics has become a lot less everyday. Analyzing the relative nutritional value of different items on the McDonald's menu (yeesh, looking back now those graphs are pretty bad) is very much something to which most everyone could relate. 2-D Histograms in R? PCA and K-means clustering? Not so much.

So, along this line of thinking, I thought it was high time to get back to the original spirit of the site. I thought I'd do some quick quantified-self type analysis, about something everyone can relate to, nothing fancy.

Let's look at my Twitter feed.

Background

It wasn't always easy to get data out of Twitter. If you look back at how Twitter's API has changed over the years, there has been considerable uproar about the restrictions they've made in updates; however, they're entitled to do so, as they do hold the keys to the kingdom after all (it is their product). In fact, I thought it'd be easiest to do this analysis just using the twitteR package, but it appears to be broken since Twitter made said updates to their API.

Luckily I am not a developer. My data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy for anyone to do: just go to your settings page:


And then select the 'Download archive' button under 'Your Twitter Archive' (here it shows a prompt to resend mine, as I took the screenshot afterwards):


And boom! A CSV of all your tweets is in your inbox ready for analysis. After all this talk about working with "Big Data" and trawling through large datasets, it's nice to take a breather and work with something small and simple.

Analysis

So, as I said, nothing fancy here, just wrote some intentionally hacky R code to do some "pop-up" analytics given Twitter's output CSV. Why did I do it this way, which results in 1990ish looking graphs, instead of in Excel and making it all pretty? Why, for you, of course. Reproducibility. You can take my same R code and run it on your twitter archive (which is probably a lot larger and more interesting than mine) and get the same graphs.

The data set comprises 328 tweets sent by myself between 2012-06-03 and 2014-10-02. The fields I examined were the datetime field (time parting analysis), the tweet source and the text / content.

Time Parting
First let's look at the time trending of my tweeting behaviour:
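The gist of the time-parting piece is something like the following sketch (the column names timestamp, source and text are how I recall the archive CSV being laid out; check your own export):

# Load the archive and count tweets per month (column names assumed from the
# tweets.csv that Twitter provides in the archive download)
tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE)

tweets$datetime <- as.POSIXct(tweets$timestamp, tz="UTC",
                              format="%Y-%m-%d %H:%M:%S")
tweets$month <- format(tweets$datetime, "%Y-%m")

monthly <- table(tweets$month)
plot(as.vector(monthly), type="l", xlab="Month", ylab="Tweets sent")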

We can see there is some kind of periodicity, with peaks and valleys in how many tweets I send. The sharp decline near the end is because there are only 2 days of data for October. Also, compared to your average Twitter user, I'd say I don't tweet a lot, generally only once every two days or so on average:

> summary(as.vector(monthly))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   12.00   11.31   15.00   21.00 

Let's take a look and see if there is any rhyme or reason to these peaks and valleys:

Looking at the total counts per month, it looks like I've tweeted less often in March, July and December for whatever reason (for all of this, pardon my eyeballing...)

What about by day of week?

Looks like I've tweeted quite a bit more on Tuesdays, and markedly less on the weekend. Now, how does that look over the course of the day?

My peak tweeting time seems to be around 4 PM. Apparently I have sent tweets even in the wee hours of the morning - this was a surprise to me. I took a stab at making a heatmap, but it was quite sparse; however the 4-6 PM peak does persist across the days of the week.

Tweets by Source
Okay, that was interesting. Where am I tweeting from?

Looks like the majority of my tweets are actually sent from the desktop site, followed by my phone, and then sharing from other sites. I attribute this to the fact that I mainly use twitter to share articles, which isn't easy to do on my smartphone.

Content Analysis
Ah, now on to the interesting stuff! What's actually in those tweets?

First let's look at the length of my tweets in a simple histogram:


Looks like generally my tweets are above 70 characters or so, with a large peak close to the absolute limit of 140 characters.

Okay, but what am I actually tweeting about? Using the very awesome tm package it's easy to do some simple text mining and pull out both the top frequent terms and the hashtags.
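A sketch of that step (assuming the tweets data frame loaded above, with its text column):

# Top frequent terms via tm, plus hashtags pulled out with a regex
library(tm)

corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
head(freq, 10)   # most frequent terms

# Hashtags come straight out of the raw text
hashtags <- unlist(regmatches(tweets$text, gregexpr("#\\w+", tweets$text)))
sort(table(tolower(hashtags)), decreasing=TRUE)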

So apparently I tweet a lot about data, analysis, Toronto and visualization. To anyone who's read the blog this shouldn't be overly surprising. Also you can see I pass along articles and interact with others, as "via" and "thanks" are in there too. Too bad about that garbage ampersand.


Overwhelmingly the top hashtag I use is #dataviz, followed of course by #rstats. Again, for anyone who knows me (or has seen one of my talks) this should not come as a surprise. You can also see my use of Toronto Open Data in the #opendata and #dataeh hashtags.

Conclusion

That's all for now. As I said, this was just a fun exercise to write some quick, easy R code to do some simple personal analytics on a small dataset. On the plus side the code is generalized, so I invite you to take it and look at your own twitter archive.

Or, you could pull all of someone else's tweets, but that would, of course, require a little more work.

References

code at github

Twitter Help Center: Downloading Your Archive

The R Text Mining (tm) package at CRAN

twitteR package at CRAN

Monday, September 1, 2014

5 Ways to Do 2D Histograms in R

Introduction

Lately I was trying to put together some 2D histograms in R and found that there are many ways to do it, with directions on how to do so scattered across the internet in blogs, forums and of course, Stackoverflow.

As such I thought I'd give each a go and put them all together here for easy reference, while also highlighting their differences.

For those not "in the know" a 2D histogram is an extensions of the regular old histogram, showing the distribution of values in a data set across the range of two quantitative variables. It can be considered a special case of the heat map, where the intensity values are just the count of observations in the data set within a particular area of the 2D space (bucket or bin).

So, quickly, here are 5 ways to make 2D histograms in R, plus one additional figure which is pretty neat.

First and foremost I get the palette looking all pretty using RColorBrewer, and then chuck some normally distributed data into a data frame (because I'm lazy). Also one scatterplot to justify the use of histograms.

# Color housekeeping
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
r <- rf(32)

# Create normally distributed data for plotting
x <- rnorm(mean=1.5, 5000)
y <- rnorm(mean=1.6, 5000)
df <- data.frame(x,y)

# Plot
plot(df, pch=16, col='black', cex=0.5)

Option 1: hexbin

The hexbin package slices the space into 2D hexagons and then counts the number of points in each hexagon. The nice thing about hexbin is that it provides a legend for you, which adding manually in R is always a pain. The default invocation provides a pretty sparse looking monochrome figure. Adding the colramp parameter with a suitable vector produced from colorRampPalette makes things nicer. The legend placement is a bit strange - I adjusted it after the fact, though you could just as well do so in the R code.
##### OPTION 1: hexbin from package 'hexbin' #######
library(hexbin)
# Create hexbin object and plot
h <- hexbin(df)
plot(h)
plot(h, colramp=rf)

Using the hexbinplot function provides greater flexibility, allowing specification of endpoints for the bin counting, and also allowing the provision of a transformation function. Here I did log scaling. Also it appears to handle the legend placement better; no adjustment was required for these figures.
# hexbinplot function allows greater flexibility
hexbinplot(y~x, data=df, colramp=rf)
# Setting max and mins
hexbinplot(y~x, data=df, colramp=rf, mincnt=2, maxcnt=60)

# Scaling of legend - must provide both trans and inv functions
hexbinplot(y~x, data=df, colramp=rf, trans=log, inv=exp)

Option 2: hist2d

Another simple way to get a quick 2D histogram is to use the hist2d function from the gplots package. Again, the default invocation leaves a lot to be desired:
##### OPTION 2: hist2d from package 'gplots' #######
library(gplots)

# Default call
h2 <- hist2d(df)
Setting the colors and adjusting the bin sizing to be coarser yields a more desirable result. We can also scale so that the intensity is logarithmic as before.
# Coarser binsizing and add colouring
h2 <- hist2d(df, nbins=25, col=r)

# Scaling with log as before
h2 <- hist2d(df, nbins=25, col=r, FUN=function(x) log(length(x)))

Option 3: stat_bin2d from ggplot

And of course, where would a good R article be without reference to the ggplot way to do things? Here we can use the stat_bin2d function, either added to a ggplot object or as a type of geometry in the call to qplot.
##### OPTION 3: stat_bin2d from package 'ggplot2' #######
library(ggplot2)

# Default call (as object)
p <- ggplot(df, aes(x,y))
h3 <- p + stat_bin2d()
h3

# Default call (using qplot)
qplot(x,y,data=df, geom='bin2d')
Again, we probably want to adjust the bin sizes to a desired number, and also ensure that ggplot uses our colours that we created before. The latter is done by adding the scale_fill_gradientn function with our colour vector as the colours argument. Log scaling is also easy to add using the trans parameter.
# Add colouring and change bins
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r)
h3

# Log scaling
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r, trans="log")
h3

Option 4: kde2d

Option #4 is to do kernel density estimation using kde2d from the MASS library. Here we are actually starting to stray from discrete bucketing of histograms to true density estimation, as this function does interpolation.

The default invocation uses n = 25 which is actually what we've been going with in this case. You can then plot the output using image().

Setting n higher does more interpolation, and we are firmly in the realm of kernel density estimation, as you can set your "bin size" finer than the granularity of the data itself. Hadley Wickham notes that in R there are over 20 packages [PDF] with which to do density estimation, so we'll keep that for a separate discussion.

##### OPTION 4: kde2d from package 'MASS' #######
# Not a true heatmap as interpolated (kernel density estimation)
library(MASS)

# Default call 
k <- kde2d(df$x, df$y)
image(k, col=r)

# Adjust binning (interpolate - can be computationally intensive for large datasets)
k <- kde2d(df$x, df$y, n=200)
image(k, col=r)

Option 5: The Hard Way

Lastly, an intrepid R user was nice enough to show on Stackoverflow how to do it "the hard way" using base packages.
##### OPTION 5: The Hard Way (DIY) #######
# http://stackoverflow.com/questions/18089752/r-generate-2d-histogram-from-raw-data
nbins <- 25
x.bin <- seq(floor(min(df[,1])), ceiling(max(df[,1])), length=nbins)
y.bin <- seq(floor(min(df[,2])), ceiling(max(df[,2])), length=nbins)

freq <-  as.data.frame(table(findInterval(df[,1], x.bin),findInterval(df[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])

freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]

# Normal
image(x.bin, y.bin, freq2D, col=r)

# Log
image(x.bin, y.bin, log(freq2D), col=r)

Not the way I would do it, given all the other options available; however, if you want things "just so", maybe it's for you.

Bonus Figure


Lastly I thought I would include this one very cool figure from Computational Actuarial Science with R, which is not often seen: a 2D histogram bordered by regular 1D histograms showing the density across each dimension.
##### Addendum: 2D Histogram + 1D on sides (from Computational ActSci w R) #######
#http://books.google.ca/books?id=YWcLBAAAQBAJ&pg=PA60&lpg=PA60&dq=kde2d+log&source=bl&ots=7AB-RAoMqY&sig=gFaHSoQCoGMXrR9BTaLOdCs198U&hl=en&sa=X&ei=8mQDVPqtMsi4ggSRnILQDw&redir_esc=y#v=onepage&q=kde2d%20log&f=false

h1 <- hist(df$x, breaks=25, plot=F)
h2 <- hist(df$y, breaks=25, plot=F)
top <- max(h1$counts, h2$counts)
k <- kde2d(df$x, df$y, n=25)

# margins
oldpar <- par()
par(mar=c(3,3,1,1))
layout(matrix(c(2,0,1,3),2,2,byrow=T),c(3,1), c(1,3))
image(k, col=r) #plot the image
par(mar=c(0,2,1,0))
barplot(h1$counts, axes=F, ylim=c(0, top), space=0, col='red')
par(mar=c(2,0,0.5,1))
barplot(h2$counts, axes=F, xlim=c(0, top), space=0, col='red', horiz=T)

Conclusion

So there you have it! 5 ways to create 2D histograms in R, plus some additional code to create a really snappy looking figure which incorporates the regular variety. I leave it to you to write (or find) some good code for creating legends for those functions which do not include them. Hopefully other R users will find this a helpful reference.

References

code on github

R generate 2D histogram from raw data (Stackoverflow)

Computational Actuarial Science with R (Google Books)

Wickham, Hadley. Density Estimation in R [PDF]

Tuesday, August 12, 2014

Stacked Area Graphs Are Not Your Friend

Stacked area graphs are not your friend. Seriously. I want to make this abundantly clear.

I'm going to expound on some of the work of Stephen Few here and lay out what stacked area graphs are, why they are a poor type of data visualization, and what are some good alternatives.

What is a stacked area graph?

A stacked area graph depicts a quantitative variable against another quantitative variable (usually time as the independent variable, i.e. on the x-axis), broken up across more than one categorical variable (or into different "data series" in MS Excel's parlance) which make up the whole. The different shaded areas are stacked on top of one another, so that the height of each shaded area represents the value for each particular categorical variable, and the total height is their sum.

For example, you can depict a quantity of interest, Y, across four groups, creatively entitled A, B, C, and D, with the combined height being the total:
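For reference, a chart of this kind can be mocked up in R on some toy data; this is not the data or code behind the figures in this post, just a sketch:

# Toy stacked area graph: quantity y over time for groups A-D (simulated data)
library(ggplot2)

set.seed(42)
dates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by="day")
df <- data.frame(
  date  = rep(dates, 4),
  group = rep(c("A", "B", "C", "D"), each=length(dates)),
  y     = abs(rnorm(4 * length(dates), mean=10, sd=3))
)

ggplot(df, aes(x=date, y=y, fill=group)) +
  geom_area(position="stack")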

Pretty, no?
So what's wrong with that? A good looking graph right? Shows all the relevant quantities as well as their total in the same figure. Maybe. Let's look in detail at some of the problems with interpreting this type of graph.

Shortcomings of Stacked Area Graphs

The problem with stacked area graphs is one of baselining. When we compare multiple lines in a line graph, we are comparing them from the same baseline: the value of the y-axis where it intersects the x-axis.

With a stacked area graph, it is easy to accurately interpret both the relative values (as graphs are not meant for reading off exact values - that is a job for tables) as well as the overall trending of the total across all four groups.

It is also easy to interpret this for the data which happens to be at the bottom of the "stack" as it has the x-axis as its base (in this case, the value for Group A).

The problem arises for the stacked areas. Their baselines are the curve of the top of the areas below. Ideally one should be able to interpret each individual series by its height, but unfortunately this is not usually the case - most interpret the curve of the top of the area as indicating quantity (as one would in a line graph). Because these lines follow the baseline of those below, they make it appear that those above have the same characteristics as below.

For instance, in the example graph I produced above, it can be easy to think there are very well-defined peaks in all the series around Jan 9 and Jan 22. This is because of the effect just mentioned. Look at the same graph if I selectively shuffle the order of the stacking of the areas:


We still see those peaks at the times mentioned, because those are the peaks for the total, but look at the series for Group D (in purple). Do you still feel the same about how it fluctuates between the dates of the 8th and the 22nd as you did before, in the first figure?

Because of the inclination to interpret the top of the area as quantity, interpreting the trend in the different areas of a stacked area graph is usually quite difficult.

Alternative Approaches

So what is the optimal alternative approach? What should you use instead of a stacked area graph? Data visualization expert Stephen Few recommends individual line charts, with an additional line in a stark color (black) for the total. I've created that below for our example:


You can see that the overall trend line of the total follows the top of the stacked area graph (it is unchanged) but the individual series look quite different, and while a bit noisy, it is easier to pick out their individual trending behaviors.

When the graph gets a bit noisy like this it might also be a good idea to thin the lines.


Okay, that's better. But as the number of values of the categorical variable increases the graph is going to get increasingly noisy. What do we do in those cases?

Well, as I often have to remind myself, nowhere does it say that you have to tell your story all in one graph. There's nothing stopping us from breaking up this one graph into smaller individual graphs, one for each group and one for the total. The disadvantage here is that it's not as easy to compare between the different groups; however, we can make it easier by using the same axis scaling for each individual group's graph.


Here there were an odd number of graphs, so I chose to keep the graph for the total larger (giving it emphasis) and maintain the original aspect ratios. You could just as easily make a panel of 6 with equal sizes if you had a different number of graphs, or put them all in a tall or wide graphic in a column or row.

Also, now that each individual graph depicts the value for a different group, we don't need the colours on the figures on the right anymore; that information is in each individual plot title. So we can ditch the color. I'll keep the total black to differentiate it from the values for the individual groups.


As the number of values for the categorical variable gets very large you go from multiple figures into true small multiple (trellis plot) territory, like in the figure below:


Another option, if you have the benefit of more dynamic visualization tools available, would be to use interactivity and gray out series in the background, such as in this amazing visualization of housing prices from the New York Times:

Click me for dataviz goodness.

So what do stacked area graphs have going for them over the approaches I laid out above? The one thing that none of the alternatives allow as easily is the comparison of the relative proportions of the whole.

However, this can also be accomplished by using relative quantities, that is, calculating the percentages of each categorical variable and plotting those, as below. 
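A sketch of that calculation, continuing with the toy data frame from the earlier sketch:

# Convert to percentages of the per-date total and plot as lines
df$pct <- 100 * df$y / ave(df$y, df$date, FUN=sum)

ggplot(df, aes(x=date, y=pct, colour=group)) +
  geom_line() +
  ylab("Percent of total")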



This approach also does not suffer from the aforementioned baseline issue, unlike proportional stacked area graphs (where the y-axis runs up to 100%), which do. These types of figures are also best avoided.

Attempts to address the fundamental issue with stacked area graphs have been made with a different type of visualization, the streamgraph, however I believe this type of visualization introduces more additional problems in interpretation than it solves.

Concluding Remarks

Though I do like to put together some thoughts on data visualization practice occasionally, it is not my intent to be overly critical of poor visualization choices, as, now that I think about it, my other post was also framed somewhat negatively.

In data visualization, there is no 'right' answer; only some visualization techniques that display the data better than others. Different ways of visualizing the data have different strengths and weaknesses; the goal here is to apply critical thought to different types of visualization, so that we can make informed visualization choices that best represent the data without it being misinterpreted due to our perceptual biases.

In my opinion, and my experience working with data visualization, you are almost always better served by the simpler, more minimalistic types of visualizations (the fundamental three being the bar chart, line graph and scatterplot) than more complicated ones. This has been an example of that, as stacked area graphs are really just a combination of area graphs, which are, in turn, an extension of the line graph.

Though the stacked area graph allows depiction of a total quantity as well as the proportions across a categorical variable making up its whole, I think this quality is not of sufficient benefit given the issues it introduces, as I have noted here. This is especially the case as there are other types of visualizations which accomplish the same goals without as much room for misinterpretation.

References

Few, Stephen. Quantitative Displays for Combining Time-Series and Part-to-Whole Relationships.

Sunday, June 22, 2014

PCA and K-means Clustering of Delta Aircraft


Introduction

I work in consulting. If you're a consultant at a certain type of company, agency, organization, consultancy, whatever, this can sometimes mean travelling a lot.

Many business travellers 'in the know' have heard the old joke that if you want to stay at any type of hotel anywhere in the world and get a great rate, all you have to do is say that you work for IBM.


The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney's character in Up in the Air were made (ironically I started to read that book, then left it on a plane in a seatback pocket), requests about favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I've got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).

But I digress.

Background

The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent flier miles is that it's best just to pick one airline and stick with it - and this also worked out well as most companies, including my employer, have preferred airlines, so you often don't have much of a choice in the matter.

In my case this means flying Delta.

So I happened to notice in one of my many visits to Delta's website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta's fleet, coming at it with 'fresh eyes'? Which planes are similar? Which are dissimilar?

Aircraft data card from Delta.com
The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, and categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.
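That recoding is just a one-liner per column; for example, on a hypothetical raw frame (not the actual preprocessing code):

# Hypothetical example of the yes/no to 1/0 recoding described above
raw <- data.frame(Wifi  = c("yes", "no", "yes"),
                  Video = c("no", "yes", "yes"),
                  stringsAsFactors = FALSE)
raw[] <- lapply(raw, function(col) ifelse(col == "yes", 1, 0))
raw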

Analysis

As this is a data set of many variables (33) I thought this would be an interesting opportunity to practice using a dimensionality reduction method to make the information easier to visualize and analyze.

First let's just look at the intermediary quantitative variables related to the aircraft's physical characteristics: cruising speed, total accommodation, and other quantities like length and wingspan. These variables are about in the middle of the data frame, so we can visualize all of them at once using a scatterplot matrix, which is the default for R's output if plot() is called on a data frame.
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1)

# scatterplot matrix of intermediary (size/non-categorical) variables
plot(data[,16:22])

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft's overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.

The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier [Boeing 747-400 (74S)] which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which differentiate the aircraft from one another far more strongly.

How do we more easily visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.

Principal Components Analysis

Next let's say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:
# Naively apply principal components analysis to raw data and plot
pc <- princomp(data)
plot(pc)
Taking that approach we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.

# First component dominates greatly. What are the loadings?
summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles 
Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4
Standard deviation     2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance    0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion     0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
            

                         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club.                                    -0.144 -0.110              
Seat.Pitch..Club.                                    -0.327 -0.248         0.189
Seat..Club.                                                                     
Seat.Width..First.Class.                0.250        -0.160        -0.156  0.136
Seat.Pitch..First.Class.                0.515 -0.110 -0.386  0.112 -0.130  0.183
Seats..First.Class.                     0.258 -0.124 -0.307 -0.109  0.160  0.149
Seat.Width..Business.                  -0.154  0.142 -0.108                     
Seat.Pitch..Business.                  -0.514  0.446 -0.298  0.154 -0.172  0.379
Seats..Business.                       -0.225  0.187                            
Seat.Width..Eco.Comfort.                                     0.285 -0.224       
Seat.Pitch..Eco.Comfort.                0.159                0.544 -0.442       
Seats..Eco.Comfort.                                          0.200 -0.160       
Seat.Width..Economy.                                  0.125  0.110              
Seat.Pitch..Economy.                                  0.227  0.190        -0.130
Seats..Economy.                  0.597        -0.136  0.345 -0.165         0.168
Accommodation                    0.697               -0.104                0.233
Cruising.Speed..mph.                    0.463  0.809  0.289 -0.144  0.115       
Range..miles.             0.999                                                 
Engines                                                                         
Wingspan..ft.                    0.215         0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft.                                     -0.100        -0.187       
Length..ft.                      0.275         0.118 -0.318  0.467  0.582 -0.418
Wifi                                                                            
Video                                                                           
Power                                                                           
Satellite                                                                       
Flat.bed                                                                        
Sleeper                                                                         
Club                                                                            
First.Class                                                                     
Business                                                                        
Eco.Comfort
Economy

This is because the scale of the different variables in the data set varies widely; we can see this by plotting the variance of the different columns in the data frame (regular scaling on the left, logarithmic on the right):
# verify by plotting variance of columns
mar <- par()$mar
par(mar=mar+c(0,5,0,0))
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x')
par(mar=mar)


We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.
# Scale
data2 <- data.frame(scale(data))
# Verify variance is uniform
plot(sapply(data2, var))
After applying the scale() function the variance is now constant across variables

Now we can apply principal components to the scaled data. Note that this can also be done automatically in the call to the prcomp() function by setting the parameter scale. = TRUE. Now we see a result which is more along the lines of something we would expect:
# Proceed with principal components
pc <- princomp(data2)
plot(pc)
plot(pc, type='l')
summary(pc) # 4 components is both 'elbow' and explains >85% variance
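Equivalently, as noted above, the scaling can be left to prcomp() itself (the argument there is spelled scale.):

# Same thing, letting prcomp() centre and scale the raw data for us
pc_alt <- prcomp(data, center=TRUE, scale.=TRUE)
summary(pc_alt)
plot(pc_alt)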


Great, so now we're in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type; two that I've read about are:
  1. Pick the number of components which explain 85% or greater of the variation
  2. Use the 'elbow' method of the scree plot (on right)
Here we are fortunate in that these two are the same, so we will retain the first four principal components. We put these into a new data frame and plot.
# Get principal component vectors using prcomp instead of princomp
pc <- prcomp(data2)

# First four principal components
comp <- data.frame(pc$x[,1:4])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))

So what we are looking at here are twelve 2-D projections of data which are in a 4-D space. You can see there's a clear outlier in all the dimensions, as well as some bunching together in the different projections.

Normally, I am a staunch opponent of 3D visualization, as I've spoken strongly about previously. The one exception to this rule is when the visualization is interactive, which allows the user to explore the space and not lose meaning due to three dimensions being collapsed into a 2D image. Plus, in this particular case, it's a good excuse to use the very cool, very awesome rgl package.

Click on the images to view the interactive 3D versions (requires a modern browser). You can better see in the 3D projections that the data are confined mainly to the one plane on the left (components 1-3), with the exception of the outlier, and that there is also bunching in the other dimensions (components 1, 3, 4 on right).
library(rgl)
# Multi 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3)
plot3d(comp$PC1, comp$PC3, comp$PC4)
So, now that we've simplified the complex data set into a lower-dimensional space we can visualize and work with, how do we find patterns in the data - in our case, the aircraft which are most similar? We can use a simple unsupervised machine learning technique like clustering.

Cluster Analysis

Here because I'm not a data scientist extraordinaire, I'll stick to the simplest technique and do a simple k-means - this is pretty straightforward to do in R.

First, how do we determine the number of clusters? The simplest method is to look at the within-groups sum of squares and pick the 'elbow' in the plot, similar to the scree plot we did for the PCA previously. Here I used the code from R in Action, pointed at the principal component scores:
# Determine number of clusters
# Determine number of clusters from the within-groups sum of squares
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

However, it should be noted that it is very important to set the nstart and iter.max parameters (I've found 25 and 1000, respectively, to be okay values to use), which the example in Quick-R fails to do; otherwise you can get very different results each time you run the algorithm, as below.

Clustering without the nstart parameter can lead to variable results for each run
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot
So here we can see that the "elbow" in the scree plot is at k=4, so we apply the k-means clustering function with k = 4 and plot.
# From scree plot elbow occurs at k = 4
# Apply k-means with k=4
k <- kmeans(comp, 4, nstart=25, iter.max=1000)
library(RColorBrewer)
library(scales)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)


We can see that the one outlier is in its own cluster, there are a handful in another, and the remainder are split into two clusters of greater size. We visualize in 3D below, as before (click for interactive versions):
# 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust)
plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)


We look at the exact clusters below, in order of increasing size:
# Cluster sizes
sort(table(k$clust))
clust <- names(sort(table(k$clust)))

# First cluster
row.names(data[k$clust==clust[1],])
# Second Cluster
row.names(data[k$clust==clust[2],])
# Third Cluster
row.names(data[k$clust==clust[3],])
# Fourth Cluster
row.names(data[k$clust==clust[4],])

[1] "Airbus A319 VIP"

[1] "CRJ 100/200 Pinnacle/SkyWest" "CRJ 100/200 ExpressJet"    
[3] "E120"                         "ERJ-145"                  

 [1] "Airbus A330-200"          "Airbus A330-200 (3L2)"
 [3] "Airbus A330-200 (3L3)"    "Airbus A330-300"      
 [5] "Boeing 747-400 (74S)"     "Boeing 757-200 (75E)"  
 [7] "Boeing 757-200 (75X)"     "Boeing 767-300 (76G)"  
 [9] "Boeing 767-300 (76L)"     "Boeing 767-300 (76T)"  
[11] "Boeing 767-300 (76Z V.1)" "Boeing 767-300 (76Z V.2)"
[13] "Boeing 767-400 (76D)"     "Boeing 777-200ER"      
[15] "Boeing 777-200LR"      

 [1] "Airbus A319"            "Airbus A320"            "Airbus A320 32-R"    
 [4] "Boeing 717"             "Boeing 737-700 (73W)"   "Boeing 737-800 (738)"
 [7] "Boeing 737-800 (73H)"   "Boeing 737-900ER (739)" "Boeing 757-200 (75A)"
[10] "Boeing 757-200 (75M)"   "Boeing 757-200 (75N)"   "Boeing 757-200 (757)"
[13] "Boeing 757-200 (75V)"   "Boeing 757-300"         "Boeing 767-300 (76P)"
[16] "Boeing 767-300 (76Q)"   "Boeing 767-300 (76U)"   "CRJ 700"            
[19] "CRJ 900"                "E170"                   "E175"                
[22] "MD-88"                  "MD-90"                  "MD-DC9-50"  

The first cluster contains a single aircraft, the Airbus A319 VIP. This plane is on its own and rightly so - it is not part of Delta's regular fleet but one of Airbus' corporate jets. This is a plane for people with money, for private charter. It includes "club seats" around tables for working (or not). Below is a picture of the inside of the A319 VIP:



Ahhh, that's the way to fly (some day, some day...). This is apparently the plane professional sports teams and the American military often charter to fly - this article in the Sydney Morning Herald has more details.

The second cluster contains four aircraft - the two CRJ 100/200's and the Embraer E120 and ERJ-145. These are the smallest passenger aircraft, with the smallest accommodations - 28 for the E120 and 50 for the remaining craft. As such, there is only economy seating in these planes which is what distinguishes them from the remainder of the fleet. The E120 also has the distinction of being the only plane in the fleet with turboprops. Photos below.

Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145.

I've flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as "little metal tubes of suffering."

The other two clusters comprise the remainder of the fleet, the planes with which most commercial air travellers are familiar - your Boeing 7-whatever-7's and other Airbus and McDonnell-Douglas planes.

These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.
# Compare accommodation by cluster in boxplot
boxplot(data$Accommodation ~ k$cluster,
        xlab='Cluster', ylab='Accommodation',
        main='Plane Accommodation by Cluster')

# Compare presence of seat classes in largest clusters
data[k$clust==clust[3],30:33]
data[k$clust==clust[4],30:33]

First.Class Business Eco.Comfort Economy
Airbus A330-200 0 1 1 1
Airbus A330-200 (3L2) 0 1 1 1
Airbus A330-200 (3L3) 0 1 1 1
Airbus A330-300 0 1 1 1
Boeing 747-400 (74S) 0 1 1 1
Boeing 757-200 (75E) 0 1 1 1
Boeing 757-200 (75X) 0 1 1 1
Boeing 767-300 (76G) 0 1 1 1
Boeing 767-300 (76L) 0 1 1 1
Boeing 767-300 (76T) 0 1 1 1
Boeing 767-300 (76Z V.1) 0 1 1 1
Boeing 767-300 (76Z V.2) 0 1 1 1
Boeing 767-400 (76D) 0 1 1 1
Boeing 777-200ER 0 1 1 1
Boeing 777-200LR 0 1 1 1
First.Class Business Eco.Comfort Economy
Airbus A319 1 0 1 1
Airbus A320 1 0 1 1
Airbus A320 32-R 1 0 1 1
Boeing 717 1 0 1 1
Boeing 737-700 (73W) 1 0 1 1
Boeing 737-800 (738) 1 0 1 1
Boeing 737-800 (73H) 1 0 1 1
Boeing 737-900ER (739) 1 0 1 1
Boeing 757-200 (75A) 1 0 1 1
Boeing 757-200 (75M) 1 0 1 1
Boeing 757-200 (75N) 1 0 1 1
Boeing 757-200 (757) 1 0 1 1
Boeing 757-200 (75V) 1 0 1 1
Boeing 757-300 1 0 1 1
Boeing 767-300 (76P) 1 0 1 1
Boeing 767-300 (76Q) 1 0 1 1
Boeing 767-300 (76U) 0 1 1 1
CRJ 700 1 0 1 1
CRJ 900 1 0 1 1
E170 1 0 1 1
E175 1 0 1 1
MD-88 1 0 1 1
MD-90 1 0 1 1
MD-DC9-50 1 0 1 1


Looking at the raw data, the difference I can ascertain between the largest two clusters is that all the aircraft in one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].

Conclusions

This was a little analysis which not only allowed me to explore my interest in commercial aircraft, but was also educational about the finer points of what to look out for when using more advanced data science techniques like principal components, clustering and advanced visualization.

All in all, the techniques did a pretty admirable job of separating out the different types of aircraft into distinct categories. However I believe the way I structured the data may have biased it towards categorizing the aircraft by seating class, as that quality was replicated in the data set relative to other variables, being represented both in quantitative variables (seat pitch & width, number of seats in class) and in categorical ones (class presence). So really the different seating classes were represented in triplicate within the data set compared to other variables, which is why the methods separated the aircraft in this way.

If I did this again, I would structure the data differently and see what relationships such an analysis could draw out using only select parts of the data (e.g. aircraft measurements only). The interesting lesson here is that when using techniques like dimensionality reduction and clustering it is not only important to be mindful of applying them correctly, but also of what variables are in your data set and how they are represented.

For now I'll just keep on flying, collecting the miles, and counting down the days until I finally get that seat in first class.


References & Resources

Delta Fleet at Delta.com

Principal Components Analysis (Wikipedia):
http://en.wikipedia.org/wiki/Principal_components_analysis

The Little Book of R for Multivariate Analysis

Quick R: Cluster Analysis

Plane Luxury: how US sports stars fly (Sydney Morning Herald)