Wednesday, January 29, 2014

Looking for Your Lens: 3 Tips on How to Be a Great Analyst

The other day as I was walking to work, all of a sudden, "pop!" one of the lenses in my glasses decided to free itself from the prison of its metal frame and take flight.

Well, damn.

The sidewalk was wet and partially covered in snow, with little islands of ice here and there. Finding a transparent piece of glass was not going to be easy.

So there I was, wandering about a small patch of sidewalk next to Toronto City Hall, squatting on my haunches, peering down at the sidewalk and awkwardly searching for my special missing piece of glass. I was not optimistic about my chances.

Most people walked on by and paid me no notice, but one kind soul, a woman with black, curly hair, stopped to help.

"Did you lose something?" she asked.

"Yeah," I said, defeated, and held up my half-empty black frames.

"Can I help?" she kindly offered. "I'm good at this sort of thing."

"I guess," I said, having already given up on restoring my headgear to completeness.

We scoured the sidewalk while passersby hurried along, giving us the occasional puzzled look.

"Ah!" she said and, amazingly, picked up my lens. It had been hiding on a small patch of snow near a planter.

"WOW!" I was genuinely impressed. "You are good at this. Thanks so much."

"No problem," she said. "Have a great day!" She then promptly disappeared down the street, leaving me standing there on the sidewalk, bewildered.

That single small episode, a tiny vignette of a single life in a giant city amongst millions of others, was quite profound for me. This was because it got me thinking about two things: one, the kindness of strangers, and the other, of course, what I am always thinking about - the business of doing analysis.

Because as it turns out, those few statements that kind stranger made are equally important in being a great analyst.

"Did you lose something?"

A problem that a lot of analysts deal with on a regular basis is one of communication. The business, the stakeholder, the client, whoever it may be, comes to the analyst for help. They want to find out something about their business because they have data, and it's the job of the analyst to turn that data (information) into insights (knowledge).

But here's the problem - you can't find something if you don't know what you're looking for.

Just as the kind passerby wouldn't have been able to help me find my missing lens if she didn't know what to look for, if you don't know what kinds of insights you want to pull out of the data you have, then you won't be able to find what you're looking for either.

"We want to know how our people are connecting with our brand."

It is the job of the analyst to turn these (often vague) desires of the business into specific questions that can be answered by analyzing data.

What people? (everyone, purchasers only, Boomers, Gen X, Gen Y, single mothers between the ages of 22 and 32 in urban centers?) What does connecting with the brand mean? (viewing an ad, purchase, visits to the website, app downloads, posts on social media, all of the above?)

So remember that a very large part of the job of the analyst is communication - not just about data - but working with others to determine exactly what it is they want to know. Once you know that, you can determine how to best do analysis to find the answers that are being sought after - hiding in plain sight in the data, like a piece of glass on a snowy patch of sidewalk.

"Can I help?"

Here's something that I think a lot of analytical-type thinkers (this author included) often need to be reminded of: you can't know everything. Even if you really, really want to. I'm sorry, but you just can't.

And that's why, once you know what it is you're looking for and what you need, you'll need to ask for help (and that's okay, that's why we have meetings!). Sometimes the mere process of tracking down the data is a considerable task in itself. Sometimes no one really has a great overall understanding of how a really large, complicated system works - that kind of knowledge is often very distributed. These sorts of situations may require the help of many others in your company (or another business, vendor or client) who all have varying knowledge bases and skill sets.

It's the job of the analyst to connect with the people they need to, get the data that they need, and do analysis to find the answers which are desired. Also, if you're a good analyst, you'll probably provide some kind of context around the impact (i.e. business implications) of your answer, and what parties would need to be involved to take the most beneficial actions as a result.

So even if you're a data rock star, don't ever be afraid to ask for help; and conversely, don't hesitate to let others know who could help them too.

"I'm good at this sort of thing."

Getting the analysis done requires not only being unafraid to ask for help, but also knowing the strengths and weaknesses of yourself, your team, and any others you may be working with.

It's hard, but in my opinion, it takes a bigger person to be honest and admit when they are out of their depth than to say they can do something they clearly cannot.

When you're out of your depth you have three options, which are really just three different ways of finishing the statement "I'm not an expert." They go something like this: "I'm not an expert..."
  1.  "... so I'm not going to do it because: I don't know how / wouldn't be able to figure it out / it's not in my job description."
  2.  "... but I can: learn quickly / give it a try / do my best / become one in 5 days."
  3.  "... but I know <colleague> is and could: provide context to the problem / definitely help do it / teach us how."
And the difference between answer #1 and the last two is what separates the office drones from the thought leaders, the reporting monkeys from the truly great analysts, and the unsuccessful from the successful in the world of data.

As I noted in the section above, you should never be afraid to ask for help, because there are going to be others out there that are better at things than you, and if you're good you'll recognize this fact and both of you will benefit. Hey, you might even learn something too, so next time you will be the expert.

Just remember that you can do analysis without crunching every number personally. You can work in data science without building the predictive model all by yourself. And you can work with data without writing every line of code alone. No analyst is an island.

"No problem! Have a great day!"

I hope that my little story and these points will help, or at least help you think about, the business of working with data and doing analysis, and what it means to be a great analyst.

This last point is perhaps as important as the others, or even more so - always be kind to the people you work with; always make it look easy, no matter how hard it was; and always be happy to help. That, above all, is what will make you a truly great analyst.

Saturday, January 11, 2014

The Mathematics of Wind Chill

Introduction

Holy crap, it was cold out.

If you haven't been reading the news, don't live in the American Midwest or Canada, or do and didn't go outside over the last couple of weeks (for which I don't blame you), there has been some mighty cold weather lately due to something called a polar vortex.

Meteorologists stated that a lot of people (those in younger generations) would never have experienced anything like this before - cold from the freezing temperatures and high winds the likes of which parts of the US and Canada haven't seen for 40 years.

It was really cold. So cold that weird stuff happened, including the ground exploding here in Ontario due to frost quakes, or cryoseisms, as they are technically known (or as my sister suggested they should be called, "frosted quakes" - get it?).

With all this talk of a polar vortex, all I could think of was a particularly ridiculous TV movie that came out recently, and that this is our Northern equivalent, which probably looked something like the artist's depiction below:

Scientific depiction of polar vortex phenomena (not to scale)

But I digress. The real point is that all this cold weather got me thinking about windchill - what is it exactly? How is it determined? Let's do some everyday analysis.

Background

Wind chill hasn't always been calculated the same way, and there is some controversy over how scientific the calculation really is.

Wind chill depends upon only two variables - air temperature and wind speed - and the formula was derived not from physical models of the atmosphere but from experiments with participants under simulated conditions in a laboratory.

Also, the old formula was replaced in 2001 by a new one, with Canada largely leading the effort, since there was some concern that the old formula gave values that were too low and that people would think they could safely withstand colder temperatures than they actually could.

The old formula had strange units but I found this page at University of Carleton which provides it in degrees Fahrenheit, so we can compare the old and new systems directly.

Analysis

Since the wind chill index is a function of two variables (a surface), we can calculate it using vectors in R and visually depict the results as an image (filled contour). The code for this is below:
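
As a rough stand-in for that listing, here is a minimal sketch of the two formulas in Python (not the R code used for the plots). It assumes the 2001 National Weather Service formula for the new index and the commonly quoted Fahrenheit form of the pre-2001 formula for the old one; the function names and the small grid are illustrative only.

```python
def wind_chill_new(temp_f, wind_mph):
    """2001 NWS wind chill index (temperature in deg F, wind speed in mph)."""
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


def wind_chill_old(temp_f, wind_mph):
    """Pre-2001 index in its commonly quoted Fahrenheit form (an assumption here)."""
    factor = 0.0817 * (3.71 * wind_mph ** 0.5 + 5.81 - 0.25 * wind_mph)
    return factor * (temp_f - 91.4) + 91.4


# Evaluate both indices on a small temperature x wind speed grid.
temps = range(-40, 41, 20)   # deg F
winds = range(5, 46, 10)     # mph
grid = {(t, w): (wind_chill_new(t, w), wind_chill_old(t, w))
        for t in temps for w in winds}

for (t, w), (new, old) in sorted(grid.items()):
    print(f"T={t:>4} F  V={w:>2} mph  new={new:7.1f}  old={old:7.1f}  diff={new - old:6.1f}")
```

A finer grid of the same values is what a filled contour plot would be drawn from; the diff column is the new index minus the old one.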

Which results in the following plots:

And the absolute difference between the two:

For low wind speeds (around 5 mph - wind chill is only defined when wind speed is greater than 5 mph) you can see that the new system is colder, but for wind speeds greater than 10 mph the opposite is true, especially so in the bitter, bitter cold (high winds and very cold temperatures). This is in line with the desire to correct the old system for giving values which were felt to be too low.

If you're a really visual person, here is the last contour plot as a surface:


Which, despite some of the limitations of 3-D visualization, shows the non-linear nature of the two systems and the difference between them.

Conclusion

This was a pretty interesting exercise and shows again how mathematics permeates many of our everyday notions - even if we're not necessarily aware of it being the case. 

For me the takeaway here is that wind chill is not an exact metric based on the physical laws of the atmosphere, but instead a more subjective one based upon people's reaction to cold and wind (an inanimate object cannot "feel" wind chill).

Despite the difficulty of the problem of trying to exactly quantify how much colder the blustery arctic winds make it feel outside, saying "-32F with the wind chill" will still always be better than saying "dude, it's really really cold outside."

Either way, be sure to wear a hat.

References & Resources

Windchill (at Wikipedia)

National Weather Service - Windchill Calculator

National Weather Service - Windchill Terms & Definitions 

Environment Canada - Canada's Windchill Index

Thursday, January 2, 2014

Snapchat Database Leak - Visualized

Introduction

In case you don't read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users' details leaked due to a security flaw which was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users' details with the intent to "raise public awareness on how reckless many internet companies are with user information."

Which is awesome - kudos to those guys, once for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and twice for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you're a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not "OMG am I safe?!" but "let's do some analysis!"

Analysis

Area Code
As I have noted in a previous musing, it's difficult to do any sort of in-depth analysis if you have limited dimensionality of your data - here only 3 fields - the phone number with last two digits obscured, the username, and the area.

Fortunately, because some of the data here is geographic, we can do some cool visualization with mapping. First we look at the high-level view by state. California had the most accounts compromised overall, with just shy of 1.4M details leaked. New York State was next at just over a million.


Because the accounts weren't spread evenly across the states, below is a more detailed view by area code. You can see that it's mainly Southern California and the Bay Area where the accounts are concentrated.
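
For the curious, the counting behind a view like this can be sketched in a few lines of Python. This is only an illustration: the rows are made up, the (phone, username, area) layout is an assumption about the file, and it assumes 10-digit North American numbers so that the area code is the first three digits.

```python
from collections import Counter

# Made-up sample rows in an assumed (phone, username, area) layout.
# The last two digits of each number are masked, as in the leak.
rows = [
    ("31055501XX", "user_one", "Los Angeles"),
    ("31055502XX", "user_two", "Los Angeles"),
    ("41555503XX", "user_three", "San Francisco"),
    ("71855504XX", "user_four", "New York City"),
]

# Assumes 10-digit numbers, so the area code is the first three digits.
accounts_by_area_code = Counter(phone[:3] for phone, _username, _area in rows)

for area_code, n_accounts in accounts_by_area_code.most_common():
    print(area_code, n_accounts)
```

Rolling the same counts up by state would need a separate area-code-to-state lookup, which is not shown here.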


Usernames
Well, that covers the geographic component. Which leaves only the usernames and phone numbers. I'm not going to look into the phone numbers (I mean what really can you do, other than look at the distribution of numbers - which I have a strong hypothesis about already).

Looking at the number of accounts which include numerals versus those that do not, the split is fairly even - 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
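
A split like this takes only a couple of lines to compute. Here is a minimal Python sketch on a handful of made-up usernames (not taken from the leak), followed by a quick arithmetic check of the percentages quoted above.

```python
# Made-up usernames for illustration only.
usernames = ["snapfan", "jenny8675", "cooldude99", "m.smith", "x1"]

with_digits = sum(any(ch.isdigit() for ch in name) for name in usernames)
without_digits = len(usernames) - with_digits
purely_numeric = sum(name.isdigit() for name in usernames)
print(with_digits, without_digits, purely_numeric)

# Sanity check of the quoted percentages, using the counts from the post.
total = 2_586_281 + 2_023_340
share_without = 100 * 2_586_281 / total
share_with = 100 * 2_023_340 / total
print(f"{share_without:.1f}% without digits, {share_with:.1f}% with digits")
```

The two counts add up to the 4,609,621 accounts in the file, so the split accounts for every username.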

Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:

The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective as there was one username in the file of length 29.
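
The length distribution itself is just a tally of username lengths. A minimal Python sketch, again on made-up usernames and with a text histogram standing in for the plot, could look like this:

```python
from collections import Counter
from statistics import mean

# Made-up usernames for illustration only.
usernames = ["sam", "snapfan", "jenny8675", "cooldude99", "m.smith", "thequickbrownfox"]

lengths = [len(name) for name in usernames]
length_counts = Counter(lengths)

# Axis stretched to 30, mirroring the plot described above.
for length in range(1, 31):
    print(f"{length:>2} | {'#' * length_counts[length]}")

print(f"mean length: {mean(lengths):.1f}")
```

With the full file, the bars would trace out the skewed bell shape described above rather than the few scattered marks this toy sample gives.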

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I'm not in the leak, so I'm not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators' end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and more open collaboration with security experts.