Thursday, January 2, 2014

Snapchat Database Leak - Visualized

Introduction

In case you don't read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users' details leaked due to a security flaw which was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users' details with the intent to "raise public awareness on how reckless many internet companies are with user information."

Which is awesome - kudos to those guys, once for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and twice for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you're a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not "OMG am I safe?!" but "let's do some analysis!"

Analysis

Area Code
As I have noted in a previous musing, it's difficult to do any sort of in-depth analysis if your data has limited dimensionality - here there are only 3 fields: the phone number with the last two digits obscured, the username, and the area.

Fortunately, because some of the data here is geographic, we can do some cool visualization with mapping. First we look at the high-level view by state, and then at those states by area code. California had the most accounts compromised overall, with just shy of 1.4M accounts leaked. New York State was next at just over a million.


Because the accounts weren't spread evenly across the states, below is a more detailed view by area code. You can see that it's mainly Southern California and the Bay Area where the accounts are concentrated.
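For anyone who wants to reproduce the area code breakdown, here is a minimal Python sketch. It is not necessarily how the maps in this post were made, and it assumes each phone number is a 10-character string whose first three characters are the area code and whose last two are masked (the `XX` masking and the sample numbers are made up for illustration).

```python
from collections import Counter

def count_by_area_code(phone_numbers):
    """Count accounts per area code (assumed: first three characters)."""
    counts = Counter()
    for number in phone_numbers:
        area_code = number[:3]
        # Skip anything that does not start with three digits.
        if len(area_code) == 3 and area_code.isdigit():
            counts[area_code] += 1
    return counts

# Hypothetical sample rows; the real file layout may differ.
sample = ["31055512XX", "31055598XX", "41555501XX", "21255577XX"]
counts = count_by_area_code(sample)

# List area codes from most to least accounts.
for code, n in counts.most_common():
    print(code, n)
```

Rolling these counts up to the state level would need a separate area code to state lookup table, which is not part of the leaked data.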


Usernames
Well, that covers the geographic component. Which leaves only the username and phone numbers. I'm not going to look into the phone numbers (I mean what really can you do, other than look at the distribution of numbers - which I have a strong hypothesis about already).

Looking at the number of accounts which include numerals versus those that do not, the split is fairly even - 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
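The split above is easy to compute. Here is a rough Python sketch, assuming the usernames have already been loaded into a list of ASCII strings (the sample names are invented). The last few lines just re-check the percentages quoted above from the two counts.

```python
import string

DIGITS = set(string.digits)

def digit_split(usernames):
    """Return (without_digits, with_digits, purely_numeric) counts."""
    with_digits = 0
    purely_numeric = 0
    for name in usernames:
        flags = [ch in DIGITS for ch in name]
        if any(flags):
            with_digits += 1
            if all(flags):
                purely_numeric += 1
    return len(usernames) - with_digits, with_digits, purely_numeric

# Hypothetical sample usernames.
sample = ["alice", "bob99", "carol_smith", "dave2014"]
without, with_, numeric = digit_split(sample)
print(without, with_, numeric)

# Arithmetic check of the figures quoted in the post.
total_accounts = 2_586_281 + 2_023_340
share_without = round(100 * 2_586_281 / total_accounts, 1)
share_with = round(100 * 2_023_340 / total_accounts, 1)
print(total_accounts, share_without, share_with)
```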

Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:

The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective as there was one username in the file of length 29.
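A length distribution like this can be tallied in a few lines. The Python sketch below is only an illustration with invented usernames; it counts how many names have each length and prints a crude text histogram, which is enough to eyeball the shape before reaching for a plotting tool.

```python
from collections import Counter

def length_distribution(usernames):
    """Map each username length to the number of usernames of that length."""
    return Counter(len(name) for name in usernames)

# Hypothetical sample usernames.
sample = ["alice", "bob_smith", "charlie99", "dee", "someone_longer"]
dist = length_distribution(sample)

# Average length across all usernames.
mean_length = sum(length * n for length, n in dist.items()) / sum(dist.values())
print(f"mean length: {mean_length:.1f}")

# One row per length, from 1 up to the longest username seen.
for length in range(1, max(dist) + 1):
    print(f"{length:2d} {'#' * dist[length]}")
```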

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I'm not in the leak, so I'm not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators' end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and more open collaboration with security experts.

2 comments:

  1. Cool post! I made another visualization with this data as well

    http://algorithmshop.com/20140102-snapchat-leak.html#8683539695368214636

    1. Good stuff! I made this visualization very quickly and was wondering how to easily analyze the username field.