Monday, October 20, 2014

Twitter Pop-up Analytics

Introduction


So I've been thinking a lot lately. Well, that's always true. I should say, I've been thinking a lot lately about the blog. When I started this blog I was very much into the whole quantified self thing, because it was new to me, I liked the data collection and analysis aspect, and I had a lot of time to play around with these little side projects.

When I started the blog I called it "everyday analytics" because that's what I always saw it being: analysis of data on topics that were part of everyday life, the ordinary viewed under the analytical lens, things that everyone can relate to. You can see this in my original about page for the blog, which has remained the same since inception.

I've been thinking lately about how, as my interest in data analysis, visualization and analytics has matured, that's not really the case so much anymore. The content of everyday analytics has become a lot less everyday. Analyzing the relative nutritional value of different items on the McDonald's menu (yeesh, looking back now those graphs are pretty bad) is very much something to which most everyone could relate. 2-D histograms in R? PCA and K-means clustering? Not so much.

So, along this line of thinking, I thought it was high time to get back to the original spirit of the site. I thought I'd do some quick quantified-self type analysis, about something everyone can relate to, nothing fancy.

Let's look at my Twitter feed.

Background

It wasn't always easy to get data out of Twitter. If you look back at how Twitter's API has changed over the years, there has been considerable uproar about the restrictions they've made in updates; however, they're entitled to do so, as they do hold the keys to the kingdom after all (it is their product). In fact, I thought it'd be easiest to do this analysis just using the twitteR package, but it appears to be broken since Twitter made said updates to their API.

Luckily I am not a developer. My data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy for anyone to do; just go to your settings page:


And then select the 'Download archive' button under 'Your Twitter Archive' (here it is a prompt to resend mine, as I took the screenshot after):


And boom! A CSV of all your tweets is in your inbox ready for analysis. After all this talk about working with "Big Data" and trawling through large datasets, it's nice to take a breather and work with something small and simple.

Analysis

So, as I said, nothing fancy here, I just wrote some intentionally hacky R code to do some "pop-up" analytics given Twitter's output CSV. Why did I do it this way, which results in 1990s-looking graphs, instead of in Excel and making it all pretty? Why, for you, of course. Reproducibility. You can take my same R code and run it on your Twitter archive (which is probably a lot larger and more interesting than mine) and get the same graphs.

The data set comprises 328 tweets I sent between 2012-06-03 and 2014-10-02. The fields I examined were the datetime field (time parting analysis), the tweet source and the text / content.
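If you want to follow along with your own archive, loading the file and checking the date range might look something like the sketch below. This is not the code from the GitHub repo. The helper name `read_tweets`, the column names (`timestamp`, `source`, `text`) and the timestamp layout are all assumptions, so check the header of your own CSV first. The tiny stand-in file is only there so the example runs without a real archive.

```r
# Hypothetical helper: read the archive CSV and parse its datetime column.
# Assumes columns named "timestamp", "source" and "text", and timestamps
# shaped like "2014-10-02 14:23:11 +0000" (trailing offset is ignored).
read_tweets <- function(path) {
  tweets <- read.csv(path, stringsAsFactors = FALSE)
  tweets$timestamp <- as.POSIXct(tweets$timestamp,
                                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  tweets
}

# Tiny stand-in for the real archive file (made-up rows)
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "timestamp,source,text",
  "2014-10-02 14:23:11 +0000,web,\"Hello, world #rstats\"",
  "2012-06-03 18:02:40 +0000,phone,First tweet"
), sample_path)

tweets <- read_tweets(sample_path)
nrow(tweets)              # number of tweets in the file
range(tweets$timestamp)   # first and last tweet
```

With a real archive you would pass the path to the downloaded CSV instead of `sample_path`.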

Time Parting
First let's look at the time trending of my tweeting behaviour:

We can see there is some kind of periodicity, with peaks and valleys in how many tweets I send. The sharp decline near the end is because there are only 2 days of data for October. Also, compared to your average Twitter user, I'd say I don't tweet a lot, only once every two to three days or so on average:

> summary(as.vector(monthly))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   12.00   11.31   15.00   21.00 
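The post doesn't show how the `monthly` object summarized above was built. One plausible way to get a tweets-per-month vector is sketched here, assuming the timestamps are already parsed as POSIXct. The helper name `tweets_per_month` is hypothetical, and filling in months with no tweets as zero is my own choice rather than something the original states.

```r
# Hypothetical helper: tweets per calendar month, counting empty months as 0
tweets_per_month <- function(timestamps) {
  month_labels <- format(timestamps, "%Y-%m")
  # Build every month between the first and last tweet so gaps show up
  first <- as.Date(paste0(min(month_labels), "-01"))
  last  <- as.Date(paste0(max(month_labels), "-01"))
  all_months <- format(seq(first, last, by = "month"), "%Y-%m")
  table(factor(month_labels, levels = all_months))
}

# Made-up timestamps standing in for the archive's datetime field
stamps <- as.POSIXct(c("2014-07-03 16:10:00", "2014-07-21 09:30:00",
                       "2014-09-12 17:45:00"), tz = "UTC")
monthly <- tweets_per_month(stamps)
summary(as.vector(monthly))

# Draw the trend line only when running interactively
if (interactive()) plot(as.vector(monthly), type = "l",
                        xlab = "Month", ylab = "Tweets")
```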

Let's take a look and see if there is any rhyme or reason to these peaks and valleys:

Looking at the total counts per month, it looks like I've tweeted less often in March, July and December for whatever reason (for all of this, pardon my eyeballing).
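Totals by month of the year can be tallied with a few lines of base R. This is a minimal sketch with made-up timestamps; `tweets_by_month_of_year` is a hypothetical name, and it uses the numeric month from `as.POSIXlt` so the labels don't depend on your locale.

```r
# Hypothetical helper: total tweets for each month of the year (Jan..Dec)
tweets_by_month_of_year <- function(timestamps) {
  month_index <- as.POSIXlt(timestamps)$mon + 1   # $mon runs from 0 to 11
  table(factor(month.abb[month_index], levels = month.abb))
}

# Made-up timestamps spanning two years
stamps <- as.POSIXct(c("2013-03-05 12:00:00", "2014-03-18 12:00:00",
                       "2014-08-01 12:00:00"), tz = "UTC")
by_month <- tweets_by_month_of_year(stamps)
by_month

if (interactive()) barplot(by_month, ylab = "Tweets")
```

Keep in mind that with an archive that starts and ends mid-year, some months simply occur more often than others, so raw totals are only a rough guide.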

What about by day of week?

Looks like I've tweeted quite a bit more on Tuesday, and markedly less on the weekend. Now, how does that look over the course of the day?

My peak tweeting time seems to be around 4 PM. Apparently I have sent tweets even in the wee hours of the morning - this was a surprise to me. I took a stab at making a heatmap, but it was quite sparse; however the 4-6 PM peak does persist across the days of the week.
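The day-of-week and hour-of-day counts, and the table behind a heatmap, can all come from one cross-tabulation. Here's a rough sketch, again with made-up timestamps and a hypothetical helper name. One assumption worth flagging: if your archive timestamps are in UTC, you'd want to pass your own time zone (for me that would be `"America/Toronto"`) so the hours reflect local time.

```r
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Hypothetical helper: 7 x 24 table of tweet counts by weekday and hour
weekday_hour_table <- function(timestamps, tz = "UTC") {
  local_time <- as.POSIXlt(timestamps, tz = tz)
  table(day  = factor(day_names[local_time$wday + 1], levels = day_names),
        hour = factor(local_time$hour, levels = 0:23))
}

# Made-up timestamps: a Tuesday, a Wednesday and a Saturday
stamps <- as.POSIXct(c("2014-09-30 16:05:00", "2014-10-01 16:40:00",
                       "2014-10-04 02:15:00"), tz = "UTC")
grid <- weekday_hour_table(stamps)

rowSums(grid)   # tweets by day of week
colSums(grid)   # tweets by hour of day
```

The full `grid` is what you would hand to a heatmap function; with only a few hundred tweets spread over 168 cells it will be sparse, as noted above.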

Tweets by Source
Okay, that was interesting. Where am I tweeting from?

Looks like the majority of my tweets are actually sent from the desktop site, followed by my phone, and then sharing on sites. I attribute this to the fact that I mainly use Twitter to share articles, which isn't easy to do on my smartphone.
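Counting tweets by source is a one-liner once the field is cleaned up. In the sketch below I'm assuming the source field holds an HTML link wrapped around the client name, so the tags get stripped first. The sample values and the helper name `clean_source` are made up for illustration.

```r
# Hypothetical helper: strip HTML tags, leaving just the client name
clean_source <- function(source) {
  gsub("<[^>]+>", "", source)
}

# Made-up source values in the assumed format
sources <- c(
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  '<a href="http://twitter.com/download/android">Twitter for Android</a>'
)

source_counts <- sort(table(clean_source(sources)), decreasing = TRUE)
source_counts
```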

Content Analysis
Ah, now on to the interesting stuff! What's actually in those tweets?

First let's look at the length of my tweets in a simple histogram:


Looks like generally my tweets are above 70 characters or so, with a large peak close to the absolute limit of 140 characters.
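Tweet lengths are just `nchar()` on the text field. One wrinkle to watch for: if the archive stores characters like `&` as HTML entities (`&amp;`), the raw count will make a tweet look longer than it really was. The sketch below assumes that is the case and decodes three common entities before counting; `decode_entities` is a hypothetical helper and the tweets are made up.

```r
# Hypothetical helper: decode a few common HTML entities
decode_entities <- function(text) {
  text <- gsub("&lt;", "<", text, fixed = TRUE)
  text <- gsub("&gt;", ">", text, fixed = TRUE)
  gsub("&amp;", "&", text, fixed = TRUE)   # decode &amp; last
}

# Made-up tweets
texts <- c("Charts &amp; maps of Toronto #dataviz", "Short one")

lengths_raw   <- nchar(texts)                    # counts "&amp;" as 5 characters
lengths_clean <- nchar(decode_entities(texts))   # counts "&" as 1 character

if (interactive()) hist(lengths_clean, breaks = seq(0, 160, by = 10),
                        xlab = "Characters", main = "Tweet length")
```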

Okay, but what am I actually tweeting about? Using the very awesome tm package it's easy to do some simple text mining and pull out both the most frequent terms and the hashtags.

So apparently I tweet a lot about data, analysis, Toronto and visualization. To anyone who's read the blog this shouldn't be overly surprising. Also you can see I pass along articles and interact with others, as "via" and "thanks" are in there too. Too bad about that garbage ampersand.
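If you don't have tm installed, a rough version of the term count can be done in base R. To be clear, this swaps tm for plain string handling, so it is not the approach I used above; the stop word list is a short made-up one and `term_counts` is a hypothetical name. It also drops the `&amp;` entity before splitting, which is one way to avoid the stray "amp" term.

```r
# Short, made-up stop word list (tm ships a much longer one)
stop_words <- c("a", "an", "the", "and", "of", "to", "in", "on", "for",
                "is", "my", "more")

# Hypothetical helper: word frequencies across a vector of tweets
term_counts <- function(texts, stop_words) {
  cleaned <- tolower(texts)
  cleaned <- gsub("https?://[^ ]+", " ", cleaned)       # drop links
  cleaned <- gsub("&amp;", " ", cleaned, fixed = TRUE)  # drop the ampersand entity
  cleaned <- gsub("[^a-z ]", " ", cleaned)              # keep letters only
  words <- unlist(strsplit(cleaned, " +"))
  words <- words[nchar(words) > 1 & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

# Made-up tweets
texts <- c("Data analysis of Toronto data via @someone",
           "Thanks! More data &amp; visualization http://example.com/x")
counts <- term_counts(texts, stop_words)
head(counts, 5)
```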


Overwhelmingly the top hashtag I use is #dataviz, followed of course by #rstats. Again, for anyone who knows me (or has seen one of my talks) this should not come as a surprise. You can also see my use of Toronto Open Data in the #opendata and #dataeh hashtags.
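Hashtags are easy to pull out with a regular expression. This is a minimal base R sketch rather than the tm-based code; the pattern only matches ASCII letters, digits and underscores, which is a simplifying assumption, and `extract_hashtags` and the tweets are made up.

```r
# Hypothetical helper: all hashtags in a vector of tweets, lower-cased
extract_hashtags <- function(texts) {
  matches <- regmatches(texts, gregexpr("#[A-Za-z0-9_]+", texts))
  tolower(unlist(matches))
}

# Made-up tweets
texts <- c("New post #dataviz #rstats",
           "Toronto budget #opendata #DataViz",
           "no tags here")

hashtag_counts <- sort(table(extract_hashtags(texts)), decreasing = TRUE)
hashtag_counts
```

Lower-casing before counting means `#DataViz` and `#dataviz` land in the same bucket.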

Conclusion

That's all for now. As I said, this was just a fun exercise to write some quick, easy R code to do some simple personal analytics on a small dataset. On the plus side the code is generalized, so I invite you to take it and look at your own Twitter archive.

Or, you could pull all of someone else's tweets, but that would, of course, require a little more work.

References

Code at GitHub

Twitter Help Center: Downloading Your Archive

The R Text Mining (tm) package at CRAN

twitteR package at CRAN