Monday, May 27, 2013

The heat is on.... or is it? Trend Analysis of Toronto Climate Data

The following is a guest post from Joel Harrison, PhD, consulting Aquatic Scientist.

For a luddite like me, this is a big step – posting something on the inter-web.  I’m not on Facebook.  I don’t know what Twitter is.  Hell, I don’t even own a smartphone.  But, I’ve been a devoted follower of Myles’ blog for some time, and he was kind enough to let his fellow-geek-of-a-brother contribute something to everyday analytics, so who was I to pass up such an opportunity?

The impetus for my choice of analysis was this:  in celebration of Earth Day, my colleagues and I watched a film about global climate change, which was a nice excuse to eat pizza and slouch in an office chair while sipping Dr. Pepper instead of doing other, presumably useful, things.  Anyway, a good chunk of the film centred on the evidence for anthropogenic greenhouse gas emissions altering the global climate system.

While I’ve seen lots of evidence for recent increases in air temperature in the mid-latitude areas of the planet, there’s nothing quite so convincing as doing your own analysis.  So, I downloaded climate data from Environment Canada and did my own climate change analysis.  I’m an aquatic scientist, not a climate scientist, so if I’ve made any egregious mistakes here, perhaps someone well-versed in climatology will show me the error of my ways, and I’ll learn something.  Anyway, here we go.

Let’s start with mean monthly temperatures from daily means (the average, for each month of the year, of the daily mean temperatures) for the city of Toronto, for which a fairly good record exists (1940 to 2012).  Here’s what the data look like:

So, you can see the clear trend in the data, can’t you?  Trend analysis is a tricky undertaking for a number of reasons, one of which is that variation can exist on a number of temporal scales.  We’re looking at temperatures here, so obviously we would expect significant seasonality in the data, and we are not disappointed:
One method of controlling for the variation related to seasonality is to ‘deseasonalize’ the data by subtracting the corresponding monthly median from each datum.  Let’s look at a boxplot of the deseasonalized data (in part to ensure I’ve done the deseasonalizing correctly!):
Whew, looks good, my R skills are not completely lacking, apparently.  Here are what the deseasonalized data look like, as a time series plot:
Things were clear as mud when we originally viewed the time series plot of all of the data, but after removing the variation related to seasonality, a pattern has emerged:  an increase in temperature from 1940 to 1950, relatively stable temperatures from 1950 to 1960, then a decrease in temperature from 1960 to 1970, and a fairly consistent increase from 1970 to 2012.  Viewing the annual mean temperatures makes this pattern even more conspicuous:
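As an aside, the deseasonalizing step itself is simple enough to sketch in code. The analysis here was done in R, but the same idea in Python, with made-up numbers, looks something like this:

```python
from statistics import median

def deseasonalize(temps_by_month):
    """Subtract each month's median from that month's values.

    temps_by_month maps month number (1-12) to a list of mean
    temperatures for that month, one per year.
    """
    return {
        month: [t - median(vals) for t in vals]
        for month, vals in temps_by_month.items()
    }

# Toy example: two Januaries and two Julys
toy = {1: [-6.0, -4.0], 7: [21.0, 23.0]}
deseasonalized = deseasonalize(toy)
# Each month is now centred on zero, so the seasonal cycle is gone
# but the year-to-year variation is preserved.
```

Using the median rather than the mean makes the adjustment robust to the occasional freakishly warm or cold month.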
Hold on, you say, why bother going to the trouble of deseasonalizing the data when you could just calculate annual means and perform linear regression to test for a trend?  This is an intuitively attractive way to proceed, but there is a catch: if, say, temperatures were getting colder over time during the summer months but proportionately warmer during the winter, the annual mean would not change over time; the two opposing trends would effectively cancel each other out.  Apparently that is not the case here, as the deseasonalized data and the annual means show a similar pattern, but caution must be exercised for this reason (especially when you have little theoretical understanding of the phenomenon you are investigating!).

So, this is all nice and good from a data visualization standpoint, but we need to perform some statistics in order to quantify the rate of change, and to decide if the change is significant in the statistical sense.  Below are the results from linear regression analyses of temperature vs. year using the original monthly means, the deseasonalized data, and the annual means.

Dependent (Response) Variable          n    slope (°C/yr)  R²     p-value
Monthly Mean Temperature              876   0.022          0.001  0.17
Deseasonalized Monthly Temperatures   876   0.022          0.05   5.82 × 10⁻¹²
Annual Mean Temperature                73   0.022          0.20   4.65 × 10⁻⁵

All three analyses yielded a slope of 0.022 °C/yr, which works out to a total change of roughly 1.6 °C over the 73-year record.  The regression based on monthly mean temperatures had a very low goodness of fit (R² = 0.001) and was not significant at the conventional cut-off of p < 0.05.  This is not surprising given the scatter we observed in the data due to seasonality.  Nor is it a surprise that the deseasonalized data had a much better goodness of fit (R² = 0.05), as did the annual mean temperatures (R² = 0.20).  The much higher statistical significance of the regression on the deseasonalized data than on the annual means is likely a function of the greater power of the analysis (876 data points vs. only 73).
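For the record, the slope being reported is just the ordinary least-squares slope of temperature against year. The analysis was done in R, but the calculation itself is short enough to write out in Python with illustrative numbers (not the actual Toronto data):

```python
def ols(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical annual means rising at exactly 0.022 degrees C per year
years = list(range(1940, 2013))
temps = [7.0 + 0.022 * (yr - 1940) for yr in years]
slope, intercept = ols(years, temps)
```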

Before we get too carried away here interpreting these results, is there anything we’re forgetting?  Right, those annoying underlying assumptions of the statistical test we just used.  According to Zar (1999), for simple linear regression these are:

  1. For any value of X there exists in the population a normal distribution of Y values.  This also means that, for each value of X there exists in the population a normal distribution of Ɛ’s.
  2. Must assume homogeneity of variances; that is, the variances of these population distributions of Y values (and of Ɛ’s) must all be equal to one another.
  3. The actual relationship is linear.
  4. The values of Y are to have come at random from the sampled population and are to be independent of one another.
  5. The measurements of X are obtained without error.  This…requirement…is typically impossible; so what we are doing in practice is assuming that the errors in the X data are negligible, or at least are small compared with the measurement errors in Y.

Hmm, this suddenly became a lot more complicated.   Let’s check the validity of these assumptions for the regression of the deseasonalized monthly temperatures vs. year.  Well, we can safely say that number 5 is not a concern – the dates, at least, were recorded without error – but what about the others?  Arguably, the relationship is not actually linear, because of the fall in temperature between 1960 and 1970, so this is something of a concern.  The Shapiro-Wilk test tells us that the residuals are not significantly non-normal (assumption 1), but only just (p = 0.056).  We can visualize this via a Q-Q (Quantile-Quantile) plot of the residuals:
For the most part the data fall right on the line, but a few points fall below and above the line at the extremes, suggestive of a somewhat ‘heavy tailed’ distribution.  Additionally, let’s inspect the histogram:
Again, there is some slight deviation from normality, as evidenced by the distance of the first and last bars from the rest, but it’s pretty minor.  So, there is some evidence of non-normality, but it appears negligible based on visual inspection of the Q-Q plot and histogram, and it is not statistically significant according to the Shapiro-Wilk test.  So, we’re good as far as normality goes.  Check.
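Incidentally, there’s no magic behind a Q-Q plot: sort the residuals and pair them with the matching quantiles of a normal distribution fitted to them. A small Python sketch of the computation (the plots in this post came from R):

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """(theoretical quantile, observed residual) pairs for a normal Q-Q plot."""
    fitted = NormalDist(mean(residuals), stdev(residuals))
    observed = sorted(residuals)
    n = len(observed)
    # standard plotting positions (i + 0.5) / n
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))
```

If the residuals really are normal, the points fall on the 1:1 line; heavy tails show up as points peeling away at both ends, exactly as in the plot above.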

What about assumption 2, homogeneity of variances?  This is typically assessed by plotting the residuals against the fitted values, like so:


There does not appear to be a systematic change in the magnitude of the residuals as a function of the predicted values, or at least nothing overly worrisome, so we’re good here, too.

Last, but certainly not least, do our data represent independent measurements?  This last assumption is frequently a problem in trend analysis.  While each temperature was presumably measured on a different day, in the statistical sense this does not necessarily imply that the measurements are not autocorrelated.   Several years of data could be influenced by an external factor which influences temperature over a multi-year timescale (El Niño?) which would cause the data from sequential years to be strongly correlated.  Such temporal autocorrelation (serial dependence) can be visualized using an autocorrelation function (ACF):

The plot tells us that at a variety of lag periods (differences in years) the level of autocorrelation is significant (i.e., the ACF rises above the blue line).  The Durbin-Watson test confirms that the overall level of autocorrelation in the residuals is highly significant (p = 4.04 × 10⁻¹³).
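For anyone who wants to roll their own, the sample ACF at lag k is just the lag-k autocovariance divided by the variance. A quick Python sketch (the plot above came from R):

```python
def acf(x, max_lag):
    """Sample autocorrelation of series x at lags 1..max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((xi - m) ** 2 for xi in x)  # n times the variance
    return [
        sum((x[i] - m) * (x[i + k] - m) for i in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]

# A trending series is strongly autocorrelated at short lags:
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
r = acf(trend, 3)
```

The blue significance band in an ACF plot sits at roughly ±1.96/√n; any bar beyond it flags significant autocorrelation at that lag.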

So, strictly speaking, linear regression is not appropriate for our data due to the presence of nonlinearity and serial correlation, which violate two of the five assumptions of linear regression analysis.  Now, don’t get me wrong, people violate these assumptions all the time.  Hell, you may have already violated them earlier today if you’re anything like I was in my early days of grad school.  But, as I said, this is my first blog post ever, and I don’t want to come across as some sloppy, apathetic, slap-dash, get-away-with-whatever-the-peer-reviewers-don’t-call-me-out-on type scientist - so let’s shoot for real statistical rigour here!

Fortunately, this is not too onerous a task, as there is a test that was tailor-made for trend analysis, and doesn’t have the somewhat strict requirements of linear regression.  Enter the Hirsch-Slack Test, a variation of the Seasonal Kendall Trend Test, which corrects for both seasonality and temporal autocorrelation.  I could get into more explanation as to how the test works, but this post is getting to be a little long, and hopefully you trust me by now.  So, drum roll please….

The Hirsch-Slack test gives very similar results to those obtained using linear regression: it indicates a highly significant (p = 1.48 × 10⁻⁴) increasing trend in temperature of 0.020 °C/yr, very close to the slope of 0.022 °C/yr obtained by linear regression.
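I won’t reproduce the Hirsch-Slack machinery here, but the Mann-Kendall statistic it builds on, and the Theil-Sen slope estimator such tests are typically paired with, are simple to write down. A bare-bones Python sketch of the uncorrected versions (no adjustment for seasonality or serial dependence, so this is the plain Mann-Kendall test, not Hirsch-Slack itself):

```python
from itertools import combinations
from statistics import median

def mann_kendall_s(y):
    """Mann-Kendall S: number of increasing pairs minus decreasing pairs."""
    sign = lambda d: (d > 0) - (d < 0)
    return sum(sign(yj - yi) for yi, yj in combinations(y, 2))

def theil_sen_slope(x, y):
    """Median of all pairwise slopes - a robust, nonparametric trend slope."""
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
        if xj != xi
    ]
    return median(slopes)
```

A strongly positive S indicates an increasing trend; the seasonal and autocorrelation-corrected variants adjust the variance of S before computing the p-value.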

So, no matter which way you slice it, there was a significant increase in Toronto’s temperature over the past seven decades.  I’m curious about what caused the dip in temperature between ~1960 and ~1970, and have a feeling it may reflect changes in aerosols and other aspects of air quality related to urbanization, but I don’t feel comfortable speculating too much.  Perhaps it reflects some regional or global variation related to volcanic activity; I really have no idea.  Obviously, if we’d performed the analysis on the years 1970 to 2012 only, the slope (i.e., the rate of temperature increase) would have been much higher than for the entire period of record.

I was also curious if Toronto was a good model for the rest of Canada given that it is a large, rapidly growing city, and changes in temperature there could have been related to urban factors, such as the changes in air quality I already speculated about.  For this reason, I performed the same analysis on data from rural Coldwater (near where Myles and I grew up) and obtained very similar results, which suggests the trend is not unique to the city of Toronto.

In case you’re wondering, the vast majority (98%) of Canadians believe the global climate is changing, according to a recent poll by Insightrix Research (though note that far fewer believe human activity is solely to blame).  So, perhaps the results of this analysis won’t surprise very many people, but I did find it satisfying to perform the analysis myself, and with local data.

Well, that's all for now - time to brace ourselves for the coming heat of summer.  I think I need a nice, cold beer.

References & Resources

Zar, J.H. (1999) Biostatistical Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice Hall.
http://books.google.com/books/about/Biostatistical_analysis.html?id=LCRFAQAAIAAJ

Hirsch, R.M. & Slack, J.R. (1984). A Nonparametric Trend Test for Seasonal Data With Serial Dependence. Water Resources Research 20(6), 727-732. doi: 10.1029/WR020i006p00727
http://onlinelibrary.wiley.com/doi/10.1029/WR020i006p00727/abstract

National Post: Climate Change is real, Canadians say, but they can't agree on the cause
http://news.nationalpost.com/2012/08/16/climate-change-is-real-canadians-say-while-disagreeing-on-the-causes

Climate Data at Canadian National Climate Data and Information Archive
http://climate.weatheroffice.gc.ca/climateData/canada_e.html

Joel Harrison, PhD, Aquatic Scientist
http://www.environmentalsciences.ca/newsite/staff/#harrison

Monday, May 6, 2013

xkcd: Visualized

Introduction

It's been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it. That if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.

Such is the case with Randall Munroe – any nerd worth their salt knows what xkcd is.

What started as simply a hobby and posting some sketches online turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.

Though I would say that it’s gone beyond that now, and even the less nerdy and techie know what xkcd means – it’s become such a key part of internet popular culture. Indeed, Mr. Munroe’s work carries real sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet – take, for example, Google adding a comment read-aloud feature in 2008, quite possibly because of a certain comic.

As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.

Background

Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.

But Randall, being the tech-savvy (and Creative Commons abiding) guy that he is, was nice enough to provide an API to return all the comic metadata in JSON format (thanks Randall!).

That being the case, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.
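For anyone wanting to follow along, the metadata lives at https://xkcd.com/info.0.json for the current comic and https://xkcd.com/<number>/info.0.json for a specific one. My code used the Python 2-era urllib2; a present-day Python 3 sketch of the same fetch looks like this:

```python
import json
from urllib.request import urlopen

def metadata_url(num=None):
    """JSON metadata URL for comic `num`, or the current comic if None."""
    if num is None:
        return "https://xkcd.com/info.0.json"
    return f"https://xkcd.com/{num}/info.0.json"

def fetch_comic(num=None):
    """Download and parse one comic's metadata (requires network access)."""
    with urlopen(metadata_url(num)) as resp:
        return json.load(resp)

# The returned dict includes num, title, img, alt, transcript,
# and the year/month/day of posting.
```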

Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.

But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.

The data include the 1203 comics from the very beginning (#1, Barrel – Part 1, posted on Jan 1, 2006) to #1204, Detail, posted on April 26, 2013. (There are 1203 rather than 1204 because, fittingly, comic #404 does not exist.)

As well as the data provided via the JSON (comic #, URL, title, date, transcript and alt text), I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue; regrettably, however, that turned out to be a somewhat more complicated process than my image-processing chops were immediately up to, so I deferred on that point.
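Here’s roughly what that PIL extraction looks like – a simplified sketch rather than my exact script, and `image_metadata` is just an illustrative helper name:

```python
import os
from PIL import Image, ImageStat

def image_metadata(path):
    """Pull the image-derived fields used in this analysis from one comic."""
    im = Image.open(path)
    width, height = im.size
    return {
        "format": im.format,            # PNG / JPEG / GIF
        "mode": im.mode,                # L, P, RGB, LA, RGBA
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        # filesize only applies when given a path on disk
        "filesize": os.path.getsize(path) if isinstance(path, str) else None,
        # mean pixel value of the greyscale version: 0 = black, 255 = white
        "luminosity": ImageStat.Stat(im.convert("L")).mean[0],
    }
```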

Analysis

File type
Ring chart of xkcd comics by file type Bar chart of xkcd comics by file type


You can see that out of the 1203 comics, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG, and only 2 (~0.17%) were in GIF (#961, Eternal Flame and #1116, Stoplight) – these being the only two comics which are animated.

Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).
strip chart of xkcd comics by file type
Image Mode
Next we can look at the image mode of all the xkcd images. For a little context, the image modes are roughly as follows:
  • L - 8 bit black & white
  • P - 8 bit colour
  • RGB - colour
  • LA, RGBA - black & white with alpha channel (transparency), colour with alpha channel

The breakdown for all the comics is depicted below.

ring chart of xkcd comics by image mode / bar chart of xkcd comics by image mode



You can see that the majority are in image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%); a tiny number are in P (8, ~0.67%), and the remaining two are in the L and RGB modes with alpha channel (LA & RGBA).

Regular readers will know that the bulk of xkcd comics are simple black-and-white images with stick figures, and you can see this reflected in the roughly ¾-to-¼ ratio of monochrome to coloured images.

The two images with alpha channel are Helping (#383) and Click and Drag (#1110), most likely because of the soft image effect and interactivity, respectively.

Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.
strip chart of xkcd comics by image mode
Luminosity
You can see in the figure on the left that given the black-and-white nature of xkcd the luminosity of each image is usually quite high (the maximum is 255). We can see the distribution better summarized on the right in a histogram:

scatterplot of luminosity of xkcd comics / histogram of luminosity of xkcd comics

Luminosity was the only quality of the images which changed significantly over the years that Randall has created the comic. Doing an analysis of variance, we can see there is a statistically significant year-on-year difference in average comic brightness (significant at the 99% level):

> aov(data$lumen ~ data$year)
Call:
   aov(formula = data$lumen ~ data$year)

Terms:
                data$year Residuals
Sum of Squares     5762.0  829314.2
Deg. of Freedom         1      1201

Residual standard error: 26.27774
Estimated effects may be unbalanced
> summary(aov(data$lumen ~ data$year))
              Df Sum Sq Mean Sq F value  Pr(>F)
data$year      1   5762    5762   8.344 0.00394 **
Residuals   1201 829314     691
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


True, there are currently fewer data points for 2013; however, repeating the analysis with this year excluded still gives a result significant at the 99% level.

The average luminosity decreases year over year, as seen in the plot below, which shows a clear downward trend:

line plot of average luminosity of xkcd per year
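Computing those yearly averages is trivial once you have (year, luminosity) pairs from the earlier extraction step – a quick sketch using only the standard library:

```python
from collections import defaultdict

def yearly_means(records):
    """Average luminosity per year from (year, luminosity) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for year, lum in records:
        totals[year][0] += lum
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in totals.items()}
```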

Image Dimensions
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.
scatterplot of height vs. width for xkcd comics
Looking at the height of each comic versus the width, you can see that there appear to be several standard widths at which Randall produces the comics (not so with heights). These standard widths are 400, 500, 600, 640, and 740.
distribution of image heights of xkcd comic
We can see these reflected in the distribution of all image widths: 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to follow a more logarithmic-like distribution.

histogram of width of xkcd comics / histogram of height of xkcd comics
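Spotting those canonical widths is a one-liner with a Counter, assuming you have the list of widths pulled out earlier:

```python
from collections import Counter

def top_widths(widths, k=5):
    """The k most common image widths and their counts."""
    return Counter(widths).most_common(k)

# e.g. top_widths(all_comic_widths) would put 740 at the top of the list
```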

Interestingly, the ‘canonical’ widths are not constant over time – several widths were used frequently near the beginning, after which the more common standard of 740px took over. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.
scatterplot of width of xkcd comics
The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.

Looking at the aspect ratio of the comics over time, you can see that there appear to be two classes of comics – a larger number (about 60%) tightly clustered around an even 1:1 aspect ratio, and a second class more evenly distributed with aspect ratios of 2 and above. There are also small peaks around 1.5 and 1.75.

scatterplot of aspect ratio of xkcd comics histogram of aspect ratio of xkcd comics
In case you were wondering the comic with an aspect ratio of ~8 is Tags (#1144) and the tallest comic proportionally is Future Timeline (#887).

Filesize
As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.
distribution of file size of xkcd comics
You can see that the majority of the images are below 100K in size – in general the xkcd comics are quite small as the majority are simple PNGs displaying very little visual information.

We can also look at the comic size (area in square pixels) versus the filesize:
scatterplot of file size versus image size of xkcd comics / scatterplot of file size versus image size of xkcd comics (with trend line)


There is clearly a relationship here, as illustrated by the log-log plot on the right with the trend line. Of course, I am just stating the obvious – this relationship is not unique to the comics and exists as a property of the image formats in general.

If we separated the images out by file type (JPEG and PNG), I believe we would see different parameters for the relationship, as a result of the particularities of each format’s compression technique.
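A straight line on a log-log plot corresponds to a power law, filesize ≈ c × area^b, and c and b can be recovered by running least squares on the logged values. A stdlib sketch with toy numbers:

```python
import math

def power_law_fit(areas, sizes):
    """Fit sizes ~ c * areas**b via least squares in log-log space."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - b * mx)
    return c, b

# Toy data following sizes = 2 * area**0.5 exactly
c, b = power_law_fit([100.0, 10000.0], [20.0, 200.0])
```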

Conclusions

I have this theory that how funny a joke is to someone who gets it is inversely proportional to the number of other people who would get it. That is to say, the more esoteric and niche the comedy is, the funnier and more appealing it is to those who actually get the punch line. It’s a feeling of being special – a feeling that someone else understands and that the joke was made just for you, and others like you, and that you’re not alone in thinking that comics involving Avogadro’s number, Kurt Godel or Turing Completeness can be hilarious.

As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.

That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom – and now we know how much he likes using 740px as the canvas width, and that xkcd has gradually been using fewer white pixels over the years.

And let's hope there will be many more years to come.

Resources

xkcd

xkcd - JSON API

xkcd - Wikipedia

code on github