Monday, October 22, 2012

Top 5 Tips for Communicating Data

Properly communicating a message with data is not always easy.

If it were, everyone could do it, and there wouldn't be questions at the end of presentations, discussions around the best way to tweak a scatterplot, or results to a Google Images search for chartjunk.

Much has been written on the subject of how to properly communicate data, and there's a real art and science to it. Many fail to appreciate this, which can result in confusion - about the message being conveyed, the salience of various features of the data being presented, or why the information is important.

There's a lot to be said on the subject, but keep these 5 tips for communicating data in mind, and when you have a data-driven message to get across they will help you do so with clarity and precision.

1. Plan: Know What You Want to Say

Just like you wouldn't expect an author to write a book without a plot, or an entrepreneur to launch a new venture without a business plan, you can't expect to march blindly into creating a report or article using data without knowing what you want to say.

Sometimes all the analysis will have already been done, and so you just need to think about how to best present it to get your message across. What variables and relationships are most important? What is the best way to depict them? Why oh why am I using aquamarine in this bar chart?

Other times figuring out your exact message will come together with the analysis, and so you would instead start with a question you want to answer, like "How effective has our new marketing initiative been over the last quarter?" or "How has the size of the middle class in Canada changed over the last 15 years?"

2. Prepare: Be Ready

As I reflected upon in a previous post, sometimes people fail to recognize that just getting the information and putting it in the proper shape is a part of the process that should not be overlooked.

Before you even begin to think about communicating your message, you need to make sure you have the data available and in a format (or formats) that you can comfortably work with. You should also consider what data are most important and how to treat them accordingly, and if any other sets should also be included (see Tip #3).

On this same note, before launching into the analysis or creation of the end product (article, report, slidedeck, etc.) it is important to think about if you are ready in terms of tools. What software packages or analysis environments will be used for the data analysis? What applications will be used to create the end product, whatever it may be?

3. Frame: Context is Key

Another important tip to remember is to properly frame your message by placing the data in context.

Failure to follow this tip results in simply serving up information - data are being presented but there is no message being communicated. Context answers the questions "Why is this important?" and "How is this related to x, y, and z?"

Placing the data in context allows the audience to see how it relates to other data, and why it matters. Do not forget about context, or you will have people asking why they should care about what you are trying to communicate.

4. Simplify: Less is More

Let me be incredibly clear about this: more is not always better. If you want to get a message across, simpler is better. Incredibly complicated relationships can be discussed, depicted, and dissected, but that doesn't mean that your article, slide or infographic needs to look like a spreadsheet application threw up all over it.

Keep the amount of information that your audience has to process at a time (per slide, paragraph, or figure) small. Relationships and changes should be clearly depicted and key differences highlighted with differences in colour or shape. The amount of text on graphs should be kept to a minimum, and if this is not possible, then perhaps the information should be presented in a different way.

The last thing you want to do is muddle your message with information overload and end up confusing your audience.

5. Engage: It's Useless If No One Knows It Exists

In the world of business, when creating a report or presenting some data, the audience is often predefined. You create a slidedeck to present to the VP and if your data are communicated properly (because you've followed Tips 1-4, wink wink) then all is well and you're on your way to the top. You email the report and it gets delivered to the client and your dazzling data analysis skills make them an even greater believer in your product. And so on.

In other cases though, like when writing a blog post or news article, your audience may not be picked out for you and so it's also your job to engage them. All your dazzling data analysis and beautiful visual work will contribute nothing if no eyeballs are laid upon it. For this reason, another tip to remember is to engage interested parties, either directly or indirectly through channels such as social media.

What Are You Waiting For?

So there are your Top 5 Tips for Communicating Data. Like I said, it's not always easy. Keep these tips in mind, and you'll ask yourself the right questions before you give all the answers.

Go. Explore the data, and be great. Happy communicating.

Saturday, October 20, 2012

Quantified Self Toronto #15 - Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.



Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and so much interesting discussion about communications was had, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:
  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)
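
A couple of these suggestions can be sketched in a few lines of Python. This is only an illustration, assuming the messages are available as (contact, datetime) pairs; the function names and the sample data are made up for the example and are not from the original analysis.

```python
from collections import Counter
from datetime import datetime


def weekly_counts(messages):
    """Count messages per contact per ISO week.

    messages: iterable of (contact, sent_datetime) pairs (assumed shape).
    Returns a Counter keyed by (contact, iso_year, iso_week).
    """
    counts = Counter()
    for contact, sent in messages:
        iso = sent.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(contact, iso[0], iso[1])] += 1
    return counts


def gaps_in_seconds(timestamps):
    """Return the intervals between consecutive messages, in seconds."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical sample data.
messages = [
    ("alice", datetime(2012, 10, 1, 9, 0)),
    ("alice", datetime(2012, 10, 3, 9, 0)),
    ("bob", datetime(2012, 10, 8, 9, 0)),
]
per_week = weekly_counts(messages)
gaps = gaps_in_seconds([datetime(2012, 10, 1, 9, 0), datetime(2012, 10, 1, 9, 5)])
```

The interval calculation here treats all timestamps as one stream. Splitting messages into separate conversations first is the non-trivial part, and is left out.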

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it's what you want to know and what you want to say that help determine how to best look at it.

It's late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.


Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca.

Tuesday, October 9, 2012

What's in My Pocket? Read it now! (or Read It Later)

Introduction

You know what's awesome? Pocket.

I mean, sure, it's not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It's pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also procrastination) this was a really great application for me to discover, and I'm quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don't have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which allows you to dump all your data for everything you've ever added to your reading list into HTML.

Having the URL of every article you've ever read in Pocket is handy, as you can revisit all the articles you've saved. But there's more to it than that. The HTML export also contains the time each article was added (in UNIX epoch). Combine this with an XML or JSON dump from the API, and now we've got some data to work with.
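
As a rough sketch of reading the export, the snippet below pulls the URL, time added and tags out of the HTML using only the Python standard library. It assumes each saved article appears as an <a> element with href, time_added (UNIX epoch seconds) and tags (comma-separated) attributes; check your own export file, as the attribute names here are an assumption and the sample line is invented.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class PocketExportParser(HTMLParser):
    """Collect url, time added and tags from <a> elements in an export file."""

    def __init__(self):
        super().__init__()
        self.articles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Assumed attribute names: href, time_added, tags.
        if "href" in attrs and "time_added" in attrs:
            added = datetime.fromtimestamp(int(attrs["time_added"]), tz=timezone.utc)
            tags = [t for t in (attrs.get("tags") or "").split(",") if t]
            self.articles.append({"url": attrs["href"], "added": added, "tags": tags})


def parse_export(html_text):
    """Return a list of article dicts found in the export HTML."""
    parser = PocketExportParser()
    parser.feed(html_text)
    return parser.articles


# Invented sample line in the assumed format.
sample = ('<ul><li><a href="http://example.com/a" time_added="1342224000" '
          'tags="psych,work">A</a></li></ul>')
articles = parse_export(sample)
```

The timestamps are converted to UTC here; for time-of-day analysis they would need to be shifted to local time.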

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 - 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionalities, I wrote a simple Python script using webarticle2text, which is available on github. This script downloaded all the text from each article URL and continually added it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).
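
The word count and site extraction steps might look something like the sketch below. It is not the original script: it assumes the article text has already been extracted (by webarticle2text or otherwise), it takes the site to be the URL's host name with any leading "www." removed, and the sample records are invented.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Return the host name of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def word_count(text):
    """Count whitespace-separated tokens in already-extracted article text."""
    return len(text.split())


# Hypothetical records: (url, extracted_text) pairs.
records = [
    ("http://www.psychologytoday.com/blog/example", "People are interesting to read about"),
    ("http://lifehacker.com/example", "Five quick tips"),
    ("http://www.psychologytoday.com/blog/another", "More about people"),
]

by_site = Counter(site_of(url) for url, _ in records)  # articles saved per site
lengths = [word_count(text) for _, text in records]    # words per article
```

Counting whitespace-separated tokens is crude, but it is enough to spot the very long and very short articles that signal a parsing failure.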

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:
And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:
Head and shoulders above all other websites at nearly half of all articles saved is Psychology Today. I would just like to be on the record as saying - don't hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women; however, I find the majority of articles to be quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people's heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men's magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two categories: time added and time 'updated'. Looking at the data, I believe this 'updated' definition applies to multiple actions on the article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done. For example, one could look at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.
The most salient features are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.
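
For anyone wanting to build this kind of scatterplot, the key step is turning each timestamp into a (date, time of day) pair. The sketch below shows one way to do that, assuming UNIX epoch seconds as input; the function name and the tz parameter are illustrative, and UTC is used by default only to keep the example self-contained (local time is what you would want for spotting commute bands).

```python
from datetime import datetime, timezone


def scatter_point(epoch_seconds, tz=timezone.utc):
    """Convert an epoch timestamp to (date, fractional hour of day).

    The date goes on the x-axis and the hour (0 to 24) on the y-axis.
    """
    dt = datetime.fromtimestamp(epoch_seconds, tz=tz)
    hour = dt.hour + dt.minute / 60 + dt.second / 3600
    return dt.date(), hour


# 2012-07-14 00:00:00 UTC plus 8 hours 30 minutes.
day, hour = scatter_point(1342224000 + 8 * 3600 + 30 * 60)
```

Plotting the resulting pairs with any charting tool gives one dot per article, with horizontal bands wherever activity clusters at the same time each day.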

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours in the evening. Although I sometimes do read articles (usually through the browser) in these hours, I think this may correspond to things like adding tags or also a delay in synching between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can't see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.
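
The data behind a day-of-week by hour heatmap is just a table of counts. Here is a minimal sketch, again assuming epoch seconds as input and using UTC by default for simplicity; the names are illustrative rather than taken from the original analysis.

```python
from collections import Counter
from datetime import datetime, timezone


def weekday_hour_grid(epochs, tz=timezone.utc):
    """Count timestamps per (weekday, hour) cell; Monday is weekday 0."""
    grid = Counter()
    for ts in epochs:
        dt = datetime.fromtimestamp(ts, tz=tz)
        grid[(dt.weekday(), dt.hour)] += 1
    return grid


def weekday_share(grid):
    """Fraction of all events that fall on Monday to Friday."""
    total = sum(grid.values())
    weekdays = sum(n for (day, _), n in grid.items() if day < 5)
    return weekdays / total if total else 0.0


base = 1342224000  # 2012-07-14 00:00:00 UTC, a Saturday
epochs = [
    base,                          # Saturday 00:00
    base + 2 * 86400 + 8 * 3600,   # Monday 08:00
    base + 2 * 86400 + 17 * 3600,  # Monday 17:00
]
grid = weekday_hour_grid(epochs)
share = weekday_share(grid)
```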

We can also look at the daily volume of articles added:


This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential type distribution:
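
A minimal sketch of how the daily volume and its distribution could be computed, assuming a list with one date per article added. Days with no articles are filled in as zero so that they show up in the distribution; the function names and sample dates are made up for the example.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(dates):
    """Return {day: number of articles added}, including zero-count days."""
    per_day = Counter(dates)
    day, last = min(per_day), max(per_day)
    series = {}
    while day <= last:
        series[day] = per_day.get(day, 0)
        day += timedelta(days=1)
    return series


def count_distribution(series):
    """Return {articles per day: number of days with that count}."""
    return Counter(series.values())


# Hypothetical sample: three articles on one day, none the next, one after.
dates = [date(2012, 1, 1)] * 3 + [date(2012, 1, 3)]
series = daily_counts(dates)
distribution = count_distribution(series)
```

An exponential-type shape would show up as many days with low counts and a rapidly shrinking number of days with high counts.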




Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I've been reading:

Hmmmmm.

Well, that doesn't look quite right. Did I really read an article 40,000 words long? That's about 64 pages, isn't it? Looking at the URLs for the articles with tens of thousands of words, I could see that these were the result of malfunctions of the Pocket article parser, the webarticle2text script, or both. For example, the 40,000 word article was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:


This is a little more what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes in the low end also correspond to failures of the article parsers.
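
Zooming in like this amounts to binning the word counts and ignoring anything above a display limit, without changing the underlying data. A small sketch, where the bin width, the limit and the sample values are all arbitrary choices for illustration:

```python
from collections import Counter


def histogram(word_counts, bin_width=100, max_words=None):
    """Bin word counts; keys are the lower edge of each bin.

    Counts above max_words are left out of the view, not removed from the data.
    """
    bins = Counter()
    for n in word_counts:
        if max_words is not None and n > max_words:
            continue
        bins[(n // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Hypothetical word counts, including one parser failure at 40,000 words.
word_counts = [50, 120, 180, 1500, 40000]
zoomed = histogram(word_counts, bin_width=100, max_words=5000)
```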

Now what about the text content of the articles? I really do enjoy a good wordcloud; however, I know that some people tend to look down upon them. This is because there are alternate ways of depicting the same data which are more informative. However, as I said, I do enjoy them as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:


As you can see, there are issues with the download script as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for "don't" and "are", or possibly "you're"). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today and so the number one word we see is "people". I also read a lot of articles from men's magazines, and so we see words which I suspect primarily come from there ("women", "social", "sex", "job"), as well as the psych articles.
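
A word frequency table like this can be produced with a few lines of Python. The sketch below is an illustration only: the stop word list is a tiny placeholder (a real one would be much longer), and the tokenizer keeps only alphabetic words, which would also drop stray tokens such as years.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def top_words(text, n=25, stop_words=STOP_WORDS):
    """Return the n most common words in text, ignoring stop words."""
    # Lowercase alphabetic words, allowing one internal apostrophe ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Invented sample text.
result = top_words("People like people and the data of 2012", n=3)
```

Markup remnants such as "div" would still get through this tokenizer, so they would have to be added to the stop list or stripped out before counting.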

And now the pretty visualization:


Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I'm glad data is in there as a 'big word' (just above 'person'), though maybe not as big as some of the others. I've just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I'll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: "The more that you read, the more things you will know. The more that you learn, the more places you'll go!"

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What's In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text