Monday, June 15, 2015

Visualization and Analysis of Reddit's "The Button" Data

Introduction

People are weird. And if there's anything that's greater collective proof of this fact than Reddit, you'd be hard pressed to find it.

I tend to put reddit in the same bucket as companies like Google, Amazon and Netflix, where they have enough money, or freedom, or both, to say something like "wouldn't it be cool if....?" and then they do it simply because they can.

Enter "the button" (/r/thebutton), reddit's great social experiment that appeared on April Fool's Day of this year. An enticing blue rectangle with a timer that counts down from 60 to zero that's reset when the button is pushed, with no explanation as to what happens when the time is allowed to run out. Sound familiar? The catch here being that it was an experience shared by anyone who visited the site, and each user also only got one press (though many made attempts to game the system, at least initially).

Finally, the timer reached zero, the last button press being at 2015-06-05 21:49:53.069000UTC, and the game (rather anti-climactically I might offer) ended.

What does this have to do with people being weird? Well, an entire mythology was built up around the button, amongst other things. Okay, maybe interesting is a better word. And maybe we're just talking about your average redditor.

Either way, what interests me is that when the experiment ended, all the data were made available. So let's have a look shall we?

Background

The dataset consists of simply four fields: 
press time, the date and time the button was pressed
flair, the flair the user was assigned given at what the timer was at when they pushed the button
css, the flair class given to the user
and lastly outage press, a Boolean indicator as to if the press occurred during a site outage.

The data span a time period from 2015-04-01 16:10:04.468000 to 2015-06-05 21:49:53.069000, with a total of 1,008,316 rows (unique presses).

I found there was css missing for some rows, and a lot of of "non presser" flair (users who were not eligible to press the button as their account was created after the event started). For these I used a "missing" value of -1 for the number of seconds remaining when the button was pushed; otherwise it could be stripped from the css field.

Analysis

With this data set, we're looking at a pretty straightforward categorical time series.

Overall Activity in Time
First we can just look at the total number of button presses, regardless of what the clock said (when they occurred in the countdown) by plotting the raw number of presses per day:


Hmmm... you can see there is a massive spike at the beginning of the graph and there's much, much fewer for the rest of the duration of the experiment. In fact, nearly 32% of all clicks occurred in the first day, and over half (51.3%) in the first two days. 

I think has something to do with both the initial interest in the experiment when it first was announced, and also with the fact that the higher the counter is kept at, the more people can press the button in the same time period (more on this later).

Perhaps a logarithmic graph for the y-axis would be more suitable?
That's better. We can see the big drop-off in the first two days or so, and also that little blip around the 18th of May is more apparent. This is likely tied to one of several technical glitches which are noted in the button wiki,

For a more granular look, let's do the hourly presses as well (with a log scale):
Cool. The spike on the 18th seems to be mainly around one hour with about a thousand presses, and we can see too that perhaps there's some kind of periodic behavior in the data on an hourly basis? If we exclude some of the earlier data we can also go back to not using a log scale for the y-axis:
Let's look more into the hours of the day when the button presses occur. We can create a simple bar plot of the count of button presses by hour overall:

You can see that the vast majority occurred around 5 PM and then there is a drop-off after that, with the lows being in the morning hours between about 7 and noon. Note that all the timestamps for the button pushes are in Universal Time. Unfortunately there is no geo data, but assuming most redditors who pushed the button are within the continental United States (a rather fair assumption) the high between 5-7 PM would be 11 AM to 1 PM (so, around your lunch hour at work).

But wait, that was just the overall sum of hours over the whole time period. Is there a daily pattern? What about by hour and day of week? Are most redditors pushing the button on the weekend or are they doing it at work (or during school)? We should look into this in more detail.

Hmm, nope! The majority of the clicks occurred Wednesday-Thursday night. But as we know from the previous graphs, the vast majority also occurred within the first two days, which happened to be a Wednesday and Thursday. So the figures above aren't really that insightful, and perhaps it would make more sense to look at the trending in time across both day and hour? That would give us the figure as below:
As we saw before, there is a huge amount of clicks in the first few days (the first few hours even) so even with log scaling it's hard to pick out a clear pattern. But most of the presses appear to be present in the bands after 15:00 and before 07:00. You can see the clicks around the outage on the 18th of May were in the same high period, around 18:00 and into the next day.

Maybe alternate colouring would help?
That's better. Also if we exclude the flurry of activity in the first few days or so, we can drop the logarithmic scaling and see the other data in more detail:


To get a more normalized view, we can also look at the percentage of daily clicks per hour for each day, which yields a much more interesting view, and really shows the gap in the middle and the outage on the 18th:


Activity by Seconds Remaining
So far we've only looked at the button press activity by the counts in time. What about the time remaining for the presses? That's what determined each individual reddit user's flair, and was the basis for all the discussion around the button.

The reddit code granted flairs which were specific to the time remaining when the button was pushed.  For example, if there were 34 seconds remaining, then the css would be "34s", so it was easy to strip these and convert into numeric data. There were also those that did not press the button who were given the "non presser" flair (6957 rows, ~0.69%), as well as a small number of entries missing flair (67, <0.01%), which I gave the placeholder value of -1.

The remaining flair classes served as a bucketing which functioned very much like a histogram:
ColorHave they pressed?Can they press?Timer number when pressed
Grey/GrayNYNA
PurpleYN60.00 ~ 51.01
BlueYN51.00 ~ 41.01
GreenYN41.00 ~ 31.01
YellowYN31.00 ~ 21.01
OrangeYN21.00 ~ 11.01
RedYN11.00 ~ 00.00
Silver/WhiteNNNA
We can see this if we plot a histogram of the button presses by using the CSS class which gives the more granular seconds remaining, and use breaks the same as above:
We can see there is much greater proportion of those who pressed within 51-60s left, and there is falloff from there (power law). This is in line with what we saw in the time series graphs: the more the button was pressed, the more presses could occur in a given interval of time, and so we expect that most of those presses occurred during the peak activity at the beginning of the experiment (which we'll soon examine).

What's different from the documentation above from the button wiki is the "cheater" class, which was given to those who tried to game the system by doing things like disconnecting their internet and pressing the button multiple times (as far as I can tell). You can see that plotting a bar graph is similar to the above histogram with the difference being contained in the "cheater" class:
Furthermore, looking over the time period, how are the presses distributed in each class? What about in the cheater class? We can plot a more granular histogram:
Here we can more clearly see the exponential nature of the distribution, as well as little 'bumps' around the 10, 20, 30 and 45 second marks. Unfortunately this doesn't tell us anything about the cheater class as it still has valid second values. So let's do a boxplot by css class as well, showing both the classes (buckets) as well as their distributions:
Obviously each class has to fit into a certain range given their definition, but we can see some are more skewed than others (e.g. class for 51-60s is highly negatively skewed, whereas the class for 41-50 has median around 45). Also we can see that the majority of the cheater class is right near the 60 mark.

If we want to be fancier we can also plot the boxplot using just the points themselves and adding jitter:
This shows the skew of the distributions per class/bucket (focus around "round" times like 10, 30, 45s, etc.) as before, as well as how the vast majority of the cheater class appears to be at 59s mark.

Presses by seconds remaining and in time
Lastly we can combine the analyses above and look at how the quantity and proportion of button presses varies in time by the class and number of seconds remaining.

First we can look at the raw count of presses per css type per day as a line graph. Note again the scale on the y-axis is logarithmic:
This is a bit noisy, but we can see that the press-6 class (presses with 51-60s remaining) dominate at the beginning, then taper off toward the end. Presses in the 0-10 class did not appear until after April 15, then eventually overtook the quicker presses, as would have to be the case in order for the timer to run out. The cheater class starts very high with the press-6 class, then drops off significantly and continues to decrease. I would have like to break this out into small multiples for more clarity, but it's not the easiest to do using ggplot.

Another way to look at it would be to look at the percent of presses by class per day. I've written previously about how stacked area graphs are not your friend, but in this case it's actually not too bad (plus I wanted to learn how to do it in ggplot). If anything it shows the increase presses in the 51-60 range right after the outage on May 18, and the increase in the 0-10 range toward the end (green):
This is all very well and good, but let's get more granular. We can easily visualize the data more granularly using heatmaps with the second values taken from the user flair to get a much more detailed picture. First we'll look at a heatmap of this by hour over the time period:
Again, the scaling is logarithmic for the counts (here the fill colour). We can see some interesting patterns emerging, but it's a little too sparse as there are a lot of hours without presses for a particular second value. Let's really get granular and use all the data on the per second level!

On the left is the data for the whole period with a logartihmic scale, whereas the figure on the right excludes some of the earlier data and uses a linear scale. We can see the beginning peak activity in the upper lefthand corner, and then these interesting bands around the 5, 10, 20, 30, and 45 marks forming and gaining strength over time (particular toward the end). Interestingly in addition the resurgence in near-instantaneous presses after the outage around May 18, there was also a hotspot of presses around the 45s mark close to the end of April. Alternate colouring below:

Finally, we can divide by the number of presses per day and calculate the percent each number of seconds remaining made up over the time period. That gives the figures below:
Here the flurry of activity at the beginning continues to be prominent, but the bands also stand out a little more on a daily basis. We can also see how the proportion of clicks for the smaller number of seconds remaining continues to increase until finally the timer is allowed to run out.

Conclusion

The button experiment is over. In the end there was no momentous meaning to it all, no grand scheme or plan, no hatch exploding into the jungle, just an announcement that the thread would be archived. Again, somewhat anti-climactic.

But, it was an interesting experiment. This was an interesting data set, given the relationship between the amount of data that could exist in the same interval of time because of the nature of it. 
And I think it really says something about what the internet allows us to do (both in terms of creating something simply for the sake of it, and collecting and analyzing data), and also about people's desire to find patterns and create meaning in things, no matter what they are. If you'd asked me, I never would have guessed religions would have sprung up around something as simple as pushing a button. But then again, religions have sprung up around stranger things.

You can read and discuss in the button aftermath thread, and if you want to have a go at it yourself, the code and data are below. Until next time I'll just keep pressing on.

References & Resources

the button press data (from reddit's github)

R code for plots

/r/thebutton