Wednesday, June 27, 2012

Sunday, June 24, 2012

How much do I weigh? - Quantified Self Toronto #12

Recently I spoke at the Quantified Self Toronto group (you can find the article on the other talk here).

It was in late November of last year that I decided I wanted to lose a few pounds. I read most of The Hacker's Diet, then began tracking my weight using the excellent Libra Android application. Though the drastic reductions in my caloric intake are no more (and so my weight is now fairly steady), I continue to track my weight day-to-day and build the dataset. Perhaps later I can do an analysis of the patterns in fluctuations in my weight separate from the goal of weight loss.

What follows is a rough transcription of the talk I gave, illustrated by the accompanying slides.

Hello Everyone, I'm Myles Harrison and today I'd like to present my first experiment in quantified self and self-tracking. And the name of that experiment is "How Much Do I Weigh?"
So I want to say two things. First of all, at this point you are probably saying to yourself, "How much do I weigh? Well, geez, that's kind of a stupid question... why don't you just step on a scale and find out?" And one of the things I discovered as a result of doing this is that sometimes it's not necessarily that simple. But I'll get to that later in the presentation.

The second thing I want to say is that I am not fat.
However, there are not many people whom I know where if you ask them, "Hey, would you like to lose 5 or 10 pounds?" the answer would be no. The same is true for myself. So late last November I decided that I wanted to lose some weight and perhaps get into slightly better shape. Being the sort of person I am, I didn't go to the gym, I didn't go to a personal trainer, and I didn't meet with my doctor to discuss my diet. I just Googled stuff. And that's what led me to this:



The Hacker's Diet, by John Walker. Walker was one of the co-founders of the company Autodesk, which created the popular AutoCAD software and later went on to become a giant multinational company. Mr. Walker woke up one day and had a realization. He realized that he was very successful, very wealthy, and had a very attractive wife, but he was fat. Really fat. And so John Walker thought, "I've used my intelligence and analytical thinking to get all these other great things in my life, why can't I apply my intelligence to the problem of weight, and solve it the same way?" So that's exactly what he did. And he lost 70 pounds.

Walker's method was this. He said, let's forget all about making this too complicated. Let's look at the problem of health and weight loss as an engineering problem. So there's just you:
and your body is the entire system, and all this system has, the only things we're going to think about are inputs and outputs. I don't care if you're eating McDonald's, or Subway, or spaghetti 3 times a day. We're just talking about the amount of input - how much? Therefore, from this incredibly simplified model of the human body, the way to lose weight is just to ensure that the inputs are less than the outputs.


IN < OUT

Walker realized that this 'advice' is so simple and obvious that it is nearly useless in itself. He compared it to the wise financial guru, on being asked how to make money on the stock market by an apprentice, giving the advice: "It's simple, buy low and sell high." Still, this is the framework we have as a starting point, so we proceed from here.

So now this raises the question, "Okay well how do we do that?" Well, this is a Quantified Self meet up, so as you've probably guessed, we do it by measuring.


We can measure our inputs by counting calories and keeping track of how much we eat. Measuring output is a little more difficult. It is possible to approximate the number of calories burned when exercising, but actually measuring how much energy you are using on a day-to-day basis, just walking around, sitting, going to work, sleeping, etc. is more complicated, and likely not practically possible. So instead, we measure weight as a proxy for output, since this is what we are really concerned with in the first place anyhow. i.e. Are we losing weight or not?

Okay, so we know now what we've got to do. How are we going to keep track of all this? Walker, being a technical guy, suggests entering all the information into a piece of computer software, oh, say, I don't know, like a certain spreadsheet application. This way we can make all kinds of graphs and find the weighted moving average, and do all kinds of other analysis. But I didn't do that. Now don't get me wrong, I love data and I love analyzing it, and so I would love doing all those different types of things. However, why would I use a piece of software that I hate (and am forced to use on a regular basis) any more than I already have to? Especially when this is the 21st century and I have a perfectly good smartphone and somebody already wrote the software to do it for me!

So, I'm good! Starting in late November of last year I followed the Hacker's Diet directions and weighed myself every day (or nearly every day, as often as I could) at approximately the same time of day. And along the way, I discovered some things.
One day I was at work and I got a text from my roommate, and it said "Myles, did you draw a square on the bathroom floor in black permanent marker?" To which I responded, "Why yes I did." To which the response was "Okay, good." And the reason that I drew a square on the tiles of the bathroom floor in black permanent marker was because of observational error. More specifically, measurement error.
If you know anything about your typical drugstore bathroom scale you probably know that they are not really that accurate. If you put the same scale on an uneven surface (say, like tiles on a bathroom floor) you can make the same measurement back-to-back and get wildly different values. That is to say the scales have a lot of random error in their measurement. And that's why I drew that square on the bathroom floor. That was my attempt to control measurement error, by placing the scale in as close to the same position I could every morning when I weighed myself. Otherwise you get into this sort of bizarre situation where you start thinking, "Okay, so is the scale measuring me or am I measuring the scale?" And if we are attempting to collect some meaningful data and do a quantified self experiment, that is not the sort of situation we want to be in.
So I continued to collect data from last November up until today. And this is what it looks like.


As you can see, like most dieters, I was very ambitious at the start and lost approximately 5 pounds between late November and the tail end of December. That data gap, followed by a large upswing, corresponds to the Christmas holidays, when I went off my diet. After that I continued to lose weight, albeit somewhat more gradually, up until about mid-March, and since then I have ever-so-slowly been gaining it back, mostly due to the fact that I have not been watching my input as much as I was before.

So, what can we take away from this graph? Well, from my simple '1-D' analysis, we can see a couple of things. The first thing, which should be a surprise to no one, is that it is a lot easier to gain weight than it is to lose it. I think most everyone here (and all past dieters) already knew that. 

Secondly, my diet aside, it is remarkable to see how much variability there is in the daily measurements. True, some of this may be due to the aforementioned measurement error, however in my readings online I also found that a person's weight can vary by as much as 1 to 3 pounds on a day-to-day basis, due to various biological factors and processes.

Walker comments on this variability in the Hacker's Diet. It is one of his reasons as to why looking at the moving average and weighing oneself every day is important, if you want to be able to really track whether or not a diet is working. And that's why doing things like Quantified Self are important, and also what I was alluding to earlier when I said that the question of "How much do I weigh?" is not so simple. It's not simply a matter of stepping on the scale and looking at a number to see how much you weigh. Because that number you see varies on a daily basis and isn't a truly accurate measurement of how much you 'really' weigh.
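To make the moving-average idea concrete, here is a minimal sketch in Python of an exponentially smoothed trend line over daily weigh-ins. The function name `smoothed_trend`, the 10% smoothing factor and the sample weights are all assumptions chosen for illustration; they are not values taken from the dataset discussed here.

```python
def smoothed_trend(weights, alpha=0.1):
    """Exponentially smoothed moving average of daily weights.

    alpha is the fraction of each day's deviation that is folded into
    the trend (0.1 is an assumed value, used here for illustration).
    """
    if not weights:
        return []
    trend = [weights[0]]  # seed the trend with the first reading
    for w in weights[1:]:
        # Move the trend a small step toward today's scale reading
        trend.append(trend[-1] + alpha * (w - trend[-1]))
    return trend


# Made-up daily readings in pounds, with typical day-to-day noise
daily = [172.0, 173.5, 171.0, 172.5, 170.5, 171.5, 170.0]
trend = smoothed_trend(daily)

for w, t in zip(daily, trend):
    print(f"scale: {w:6.1f}   trend: {t:6.2f}")
```

The scale readings jump around by a pound or more from one day to the next, while the trend value moves only a fraction of that, which is why the smoothed number is a better answer to "how much do I weigh?" than any single reading.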






This ties into the third point that I wanted to draw from the data. That point is that the human body is not like a light switch, it's more like a thermostat. I remember reading about a study which psychologists did to measure people's understanding of delayed feedback. They gave people a room with a thermostat, but there was a delay in the thermostat, and it was set to something very very high, on the order of several hours. The participants were tasked with getting the room to stay at a set temperature, however none of them could. Because people (or most people, anyhow) do not intuitively understand things like delayed feedback. The participants in the study kept fiddling with the thermostat and setting it higher and lower because they thought it wasn't working, and so the temperature in the room always ended up fluctuating wildly. The participants in the study were responding to what they saw the temperature to be when they should have been responding to what the temperature was going to be.
And I think this is a good analogy for the problem with dieting and why it can be so hard. This is why it can be easy to become frustrated and difficult to tell if a diet is working or not. Because if you just step on the scale every day and look at that one number, you don't see the overall picture, and it can be hard to tell whether you're losing weight or not. And if you just see that one number you'd never realize that though I can eat a pizza today and I will weigh the same tomorrow, it's not until 3 days later that I have gained 2 pounds. It's a problem of delayed feedback. And that's one of the really interesting conclusions I came to as a result of performing this experiment.

So where does this leave us for the future?


Well, I think I did a pretty good job of measuring my weight almost every day and was able to make some interesting conclusions from my simple '1-D' analysis. However, though I did very well tracking all the output, I did not track any of my inputs whatsoever. In the future if I kept track of this as well (for instance by counting calories) I would have more data and be able to draw some more meaningful conclusions about how my diet is impacting my weight.

Secondly, I did not do one other thing at all. I didn't exercise. This is something Walker gets to later in his book too (like most diet/health books) however I did not implement any kind of exercise routine or measurement thereof.

In the future I think if I implement these two things, as well as continuing with my consistent measurement of my weight, then perhaps I could 'get all the way there'

 

|---------------| 100%


 
That was my presentation, thank you for listening. If you have any questions I will be happy to answer them.




References / Resources

Libra Weight Manager for Android
https://play.google.com/store/apps/details?id=net.cachapa.libra 

The Hacker's Diet
http://www.fourmilab.ch/hackdiet/www/hackdiet.html 

Quantified Self Toronto
http://quantifiedself.ca/ 

Tuesday, June 12, 2012

Wanna go ride bikes?

Wow, with Google Fusion Tables now it's a snap to produce a quick map. Upload, click, click, geocode and boom! Your data are on a map!

City of Toronto open data for new bicycle posts and rings installed in 2011.




Legend:
blue: 1, green: 2-3, yellow: 4-5, red: 6+

Monday, June 11, 2012

Google Domestic Trends

Google's mission is to organize all the world's information and make it universally accessible and useful. In following their mission, the company has produced some amazing tools which allow any internet user to do some data visualization without so much as having to open a spreadsheet.

One of these tools which I stumbled across the other day (which apparently has existed for some time) is Google Domestic Trends.

I was previously aware of Google Trends, which allows a user to compare the popularity of different search terms, whether it be for serious reasons (e.g. Android vs. iPhone) or say, for something less serious. In Domestic Trends, Google has aggregated relevant search terms across different sectors of the economy, with the results presumably providing insight into market trends by sector (or at least the popularity of those market sectors with respect to time).

I am not an economist, but data are data, so here goes with the pithy commentary and observations.

Air Travel
It's seasonal, unsurprisingly. Looks like there might be some deals over the holidays I was unaware of. Or that might be a really bad time to buy tickets.

Link
Auto Buyers
As Google notes on the Domestic Trends frontpage, July 2009 was when the U.S. Government instituted its "Cash for Clunkers" program. However, it was also when Toyota recalled almost half a million vehicles due to defective airbags. Oh yeah, and that spike in 2005 is related to the outrageous change in the gas prices of the time.

Link
Bankruptcy
New record. I'm glad I rent.

Link
Computers and Electronics
Seriously, who buys desktops anymore?

Link
Credit Cards
A poignant portrait of the changing state of the American economy and personal debt.

Link
Durable Goods
Merry Christmas honey, I got you a Roomba.

Link
Education
School's out for summer.

Link
Jobs
I want to say that the little spike later in 2011 has nothing to do with employment and is due to Mr. Jobs retiring, however then I would expect a much larger one to be in October.

Link
Mobile and Wireless
The iPhone was revealed to the public on January 9th, 2007 and went on sale in June of the same year. The iPhone 3G and 3GS came out in July 2008 and June 2009 respectively. The 4S was released in October 2011. Not sure about mid-2010. The Blackberry Torch came out in June but that would hardly warrant what we see here.

Link
Rental & Real Estate
Apparently it is quite seasonal. Peaks drop off around late July and early August. Students, I would guess.

Link

Shopping
We've seen this before. No surprises here.

Link
Unemployment
I know the word you're thinking of. It's on the tip of your tongue and it starts with 'R'.

Link

See also: Google NGram Experiments.

Monday, June 4, 2012

rhok (n' roll)

This past weekend was rhok Toronto which was a fun, exhausting, educational, and all around amazing weekend which I was honoured to be involved in.

The team I was fortunate enough to be a part of produced a prototype web-service to promote fair housing, and improve the ease of the submission process for investigations into housing by-law violations. An added bonus was that this resulted in this nice visualization of more City of Toronto data.

You can learn more about rhok here.


Saturday, June 2, 2012

11 Million Yellow Slips - City of Toronto Parking Tickets, 2008-2011

Introduction

I don't know about you, but I really hate getting parking tickets. Sometimes I feel like it's all just a giant cash grab. Really? I can't park there between the hours of 11 and 3, but every other time is okay? Well, why the hell not?

But ah, such is life. Rules must be in place to keep civil order, keep the engines of city life running and prevent total chaos in the downtown core. However knowing this does not make coming out to the street to find that bright yellow slip of paper under your windshield wiper any easier.

Like everything else in the universe, parking tickets are a source of data. The great people at Open Data Toronto (@Open_TO) have provided all the data from every parking ticket issued in Toronto from 2008 to the end of last year.

So, let us dive in and have a look. We might just discover why we keep getting all these tickets, or at least ease the collective pain a little in realizing how many others are sharing in it.

Background

The data set is an anonymized record of every parking ticket issued in the city of Toronto from the period 01/01/2008 - 12/31/2011. The fields provided are: the anonymized ticket #, date of infraction, infraction code, description, fine amount, time of infraction, and location (address).

The data set and more information can be found in Open Data Toronto's data catalogue here.

Originally I had this brilliant idea to geocode every data point, and then create an awesome heat map of the geographical distribution of parking tickets issued. However, given the fact that there are ~11 million records and the Google Maps API has a limit of 2,500 geocoding requests per day, even if I was completely diligent and performed the task daily it would still take approximately 4,400 days or about 12 years to complete. And no, I am not paying to use the API for Business (which at a limit of 100,000 requests per day would still take ~3.5 months).
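For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch in Python. The record count and the two daily limits are the figures quoted above; the helper name `days_needed` and the average month length of 30.44 days are assumptions made for this illustration.

```python
import math

RECORDS = 11_000_000            # approximate number of tickets in the data set
FREE_DAILY_LIMIT = 2_500        # free geocoding requests per day
BUSINESS_DAILY_LIMIT = 100_000  # API for Business requests per day


def days_needed(records, per_day):
    """Whole days required to geocode every record at a given daily limit."""
    return math.ceil(records / per_day)


free_days = days_needed(RECORDS, FREE_DAILY_LIMIT)
business_days = days_needed(RECORDS, BUSINESS_DAILY_LIMIT)

print(free_days, "days, or about", round(free_days / 365.25, 1), "years")
print(business_days, "days, or about", round(business_days / 30.44, 1), "months")
```

At the free limit this works out to 4,400 days, and at the business limit to 110 days, which is in line with the rough estimates given above.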

If anyone knows a way around this, please drop me an email and fill me in.

Otherwise, you can check out prior art. Patrick Cain at Global News created an awesome interactive map of aggregated parking ticket data from 2010 for locations in the city where over 500 tickets were issued. This turns out to be mainly hospitals, and unsurprisingly, tickets are clustered in the downtown core. Mr. Cain did a similar analysis while at the Toronto Star back in 2009, using data from the previous year.

I just don't like throwing out data points.

Analysis

Parking Infractions by Type 
Next we consider the parking tickets for the period by infraction type. A simple bar chart outlines the most common parking ticket types:



We will consider those codes which stick out most on the bar chart (the top 10):

> sort(codeTable, decreasing=TRUE)[1:10]
    005     029     210     003     207     009     002     008     006     015
2336433 1822690 1366945 1354671  933478  718692  496283  443706  369079  173078

Putting that into more human-readable format, the most commonly issued types of parking infractions were:

1. 005 - Park on Highway at Prohibited Time of Day
2. 029 - Park Prohibited Place/Time - No Permit
3. 210 - Park Fail to Display Receipt
4. 003 - Park on Private Property w/o Consent
5. 207 - Park w/o ticket from machine
6. 009 - Stop on Highway at Prohibited Time/Day
7. 002 - Park Longer than 3 Hours
8. 008 - Vehicle Standing Prohibited Time/Day
9. 006 - Park on Highway - Excess of Permitted Time
10. 015 - Park within 3M of Fire Hydrant
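The tabulation above was done in R. As a rough sketch of the same idea in Python, the snippet below counts tickets per infraction code and keeps the most frequent ones. The short list of codes is a made-up sample, and `top_codes` is a hypothetical helper name; neither comes from the actual analysis.

```python
from collections import Counter

# Made-up sample: one infraction code per ticket (kept as strings so that
# leading zeros such as "005" are preserved)
codes = ["005", "029", "005", "210", "003", "005", "029", "207"]


def top_codes(codes, n=10):
    """Return the n most frequent infraction codes with their counts."""
    return Counter(codes).most_common(n)


for code, count in top_codes(codes, n=3):
    print(code, count)
```

Keeping the codes as strings rather than integers matters here, since a code like 005 would otherwise lose its leading zeros and no longer match the city's infraction code listing.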

In case you were wondering, the most expensive tickets (in the range of 100's of dollars, the max being $450 [!!] ) are all related to handicapped parking spaces.

Time Distribution of Parking Infractions
Let us now consider the parking ticket information with regards to time. First and foremost, we consider the ticket data as a simple time series and plot the data for exploratory purposes:

Cool.
Most strikingly, there are clearly defined dips in the total number of tickets over the holiday season each year. There also appears to be some kind of periodic variation in the number of tickets issued over time (the downward spikes). A good first guess would be that this is likely related to the day of the week, due to the cycle of the work week related to the volume of cars parked, vehicles in the city, et cetera.

Quickly whipping up a box plot for the data, we can see that a significantly smaller proportion of the tickets is issued on Sunday. Also, for some reason, there are many outliers on the low end. I suspect these are in the aforementioned dips around the holiday season, though I did not investigate this.
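The grouping behind a box plot like this can be sketched in a few lines of Python: bucket the daily ticket totals by day of the week, then compare the buckets. The dates, counts, ISO date format and the helper name `tickets_per_weekday` below are all assumptions for illustration, not figures from the real data set.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def tickets_per_weekday(daily_counts):
    """Group (date string, ticket count) pairs by day of the week."""
    groups = defaultdict(list)
    for date_str, count in daily_counts:
        # weekday() returns 0 for Monday through 6 for Sunday
        day = DAYS[datetime.strptime(date_str, "%Y-%m-%d").weekday()]
        groups[day].append(count)
    return dict(groups)


# Made-up daily totals: two Sundays and two Mondays
sample = [
    ("2011-06-05", 3100), ("2011-06-06", 8200),
    ("2011-06-12", 2900), ("2011-06-13", 8000),
]

groups = tickets_per_weekday(sample)
for day in DAYS:
    if day in groups:
        print(day, median(groups[day]))
```

Each list in `groups` is what one box in the plot would summarize, so very low values inside a weekday's list are the ones that would show up as outliers.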


Conclusions

Performing a quick analysis of many different aspects of the data was not as easy as I had hoped, given the size of the set. Still, it is interesting to see the most common types of violations and the distribution of the majority of the parking tickets with respect to time.

Interesting general points of note:
  • The most common parking infractions are wrong place / wrong time, followed by various types of failing to display a permit / buy a ticket
  • Significantly reduced number of parking violations during the Christmas holiday season
  • More tickets issued during the work week

For Part II, I plan to create some heat maps / 2D histograms of the ticket data with respect to time, and I may yet create a geospatial representation of the data, albeit in aggregated form.