The Facebook data breach wasn’t a hack. It was a wake-up call.

Facebook expected its user data to be harvested. It just didn’t expect Cambridge Analytica to do it millions of times.

Cambridge Analytica CEO Alexander Nix speaks at the 2016 Concordia Summit. Nix has since been suspended.
Bryan Bedder/Getty Images for Concordia Summit
Aja Romano writes about pop culture, media, and ethics. Before joining Vox in 2016, they were a staff reporter at the Daily Dot. A 2019 fellow of the National Critics Institute, they’re considered an authority on fandom, the internet, and the culture wars.

Update: The claim that 50 million Facebook accounts had been affected by the Cambridge Analytica breach has been revised. The official total, as revealed by Facebook in April, stands at 87 million users.

News broke over the weekend of March 17 that Cambridge Analytica (CA), a data analytics firm that worked with Donald Trump’s election campaign, had extracted Facebook data from 50 million user accounts. Or, more accurately, we might say that “news” “broke.”

In fact, we’ve known most of the details concerning CA’s massive data research, and the use of that research in political campaigns, for several years thanks to a 2015 Guardian article, a viral 2016 article in Das Magazin (later published by Vice), and a March 2017 article by the Intercept. Not even the specific number of 50 million accounts is new: Cambridge Analytica’s chief researcher has been boasting about having a 50 million-person sample size in his data sets since 2014, at least.

And we’ve even known, since 2015 or so, that, as TechCrunch put it, “it was always kind of shady that Facebook let you volunteer your friends’ status updates, check-ins, location, interests and more to third-party apps.”

What is new is that, essentially, major news outlets have taken two stories (what Cambridge Analytica did, and what Facebook knew about what CA did) and pieced them together into a report that spawned immediate concern from the public and a swift response from Facebook. As a result, we now have even less room for plausible deniability about a problem we keep confronting: a failure to anticipate how technology meant to work on an individual level might be repurposed or exploited when scaled up to apply to millions.

The Facebook breach wasn’t a hack

Between 2013 and 2015, Cambridge Analytica harvested profile data from millions of Facebook users, without those users’ permission, and used that data to build a massive targeted marketing database based on each user’s individual likes and interests. Using a personality profiling methodology, the company — formed by high-powered right-wing investors for just this purpose — began offering its profiling system to dozens of political campaigns.

CA was able to procure this data in the first place thanks to a loophole in Facebook’s API that allowed third-party developers to collect data not only from users of their apps but from all of the people in those users’ friends network on Facebook. This access came with the stipulation that such data couldn’t be marketed or sold — a rule CA promptly violated.
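To make the mechanics of that loophole concrete, here is a minimal toy sketch, assuming a made-up friend graph. It is not Facebook's actual API; the function name `reachable_profiles` and every user name in it are hypothetical, chosen only for illustration. It shows how an app that can read the profiles of its installers and of their friends ends up seeing far more people than ever installed it.

```python
from typing import Dict, Set


def reachable_profiles(friends: Dict[str, Set[str]], installers: Set[str]) -> Set[str]:
    """Return every profile visible to an app that can read its
    installers' profiles plus the profiles of each installer's friends."""
    reached = set(installers)
    for user in installers:
        # Friends never installed the app, but their data is exposed anyway.
        reached |= friends.get(user, set())
    return reached


# Hypothetical five-person network (illustrative names only).
toy_graph = {
    "ana": {"ben", "cho", "dev"},
    "ben": {"ana", "eli"},
    "cho": {"ana"},
    "dev": {"ana", "eli"},
    "eli": {"ben", "dev"},
}

one_installer = reachable_profiles(toy_graph, {"ana"})
print(f"1 installer exposes {len(one_installer)} of {len(toy_graph)} profiles")
```

In this toy network a single installer exposes four of the five profiles, which is the same dynamic, at tiny scale, that the article describes at a scale of millions.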

Facebook bears a huge amount of culpability for allowing CA to get its data to begin with. However, reports calling CA’s data harvesting a “leak,” a “hack,” or a serious violation of Facebook policy are all incorrect. All of the information collected by the company was information that Facebook had freely allowed mobile developers to access.

And technically, anyone who used third-party Facebook apps also could have found out that they were allowing those apps to see data from their friends’ profiles. As a Facebook spokesperson reiterated to the New York Times, “No systems were infiltrated, and no passwords or sensitive pieces of information were stolen or hacked.”

Essentially, the data that CA took from Facebook — mainly information gleaned from user profiles and interests — wasn’t private to begin with, not really. It’s just that the vast majority of users either didn’t really know it wasn’t private or didn’t really care. Generally, apps aren’t placing their fine print front and center when we’re using them, nor are they priming us to think too hard about what the long-term or large-scale ramifications of providing our data — and in this case, our friends’ data — could be.

And Facebook didn’t care either — at least not until 2015, when it finally updated its third-party API to block access to the kind of massive data sets that Cambridge Analytica was collecting. That year, without publicly alerting users that its API had been exploited, it drastically limited what features third-party apps could access. It also instituted a review of any third-party app that asked for more than the usual amount of data — public profile, list of friends, and email address — from its users.

By that point, of course, CA had already gotten the bulk of its user data from Facebook users — most notably from their profile pages, where user interests and likes provided the company with the building blocks of personality profiles it created to help determine whether users would be susceptible to different kinds of political messaging.

Arguably, all of that user data should have been private from day one, a “privacy by default” experience that some internet advocacy groups have been begging Facebook to implement for years. The fact that it wasn’t suggests that in the beginning, at least, the exploitability of Facebook’s API was seen as a feature, not a bug — because no one was thinking that a third-party app might utilize its access to user data at the scale Cambridge Analytica did.

The story of Cambridge Analytica is a story of underanticipating the extrapolation of data at a scale of millions

Most of what we know about Cambridge Analytica originates from a blockbuster article originally published in the German publication Das Magazin in December 2016 by reporters Hannes Grassegger and Mikael Krogerus. The article was translated into English, went viral, and was ultimately republished at Vice just weeks after Trump took office in January 2017.

But reporting on Cambridge Analytica’s use of Facebook data to influence elections for specific campaigns dates back for years — including details about how the data was being collected.

In 2007, Cambridge psychology student David Stillwell, who was pursuing a PhD on the process of decision-making, launched a mobile app called myPersonality. Initially, the app was just supposed to be a fun side project, but within a year, millions of Facebook users had downloaded it and were volunteering their data.

Stillwell and Michal Kosinski, a fellow student at Cambridge’s Psychometric Centre, realized they could use the data the app was gathering from its users for “serious research.” In 2012, having spent several years refining its methodology, Stillwell’s team began to publish its research, including the ominous-in-retrospect “myPersonality project: Example of successful utilization of online social networks for large-scale social research.”

The publication of the research from Stillwell and Kosinski and their team generated significant press in 2013, as well as interest from Facebook, which seems to have been aware of the potential for this kind of large-scale implementation of its user data when it reportedly reached out to Kosinski with both “the threat of a lawsuit and a job offer,” as he told Das Magazin.

It was at that point, according to the New York Times report, that the team at the Psychometric Centre was approached by Christopher Wylie, who worked at a company called Strategic Communication Laboratories, self-billed as a data analytics company “to governments and military organizations worldwide.” Wary of this shadow-puppet descriptor, Kosinski’s team declined to join forces. But Wylie, who would later leave the company in disgust and reveal much of this information to the press, simply turned to another psychology researcher at Cambridge, Aleksandr Kogan.

According to Kosinski, Kogan found a way to mirror the data and methodology being used for the myPersonality experiment, which he referred to as “psychographics.” It most likely wasn’t hard: Stillwell had accidentally stumbled across a treasure trove of Facebook users who seemed willing, by the millions, to give up their data to his third-party app, along with all the data of people in their Facebook friend networks.

So in 2014 Kogan simply made another app, offering similar features to myPersonality, with similar data-scraping technology. Ultimately, Kogan’s app, “thisisyourdigitallife,” gained 270,000 users. That’s a paltry number in terms of successful internet apps, but in terms of the spiraling network trees each of those individual users gave Kogan access to, it was huge: Originally reported as 30 million Facebook users in total, the number was actually closer to 50 million.
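A quick back-of-the-envelope calculation shows how lopsided that ratio is. The sketch below uses only the figures cited in this article (270,000 installers, the 30 million and 50 million estimates, and Facebook's revised total of 87 million); the helper name `profiles_per_installer` is an illustrative assumption. The result is an average of profiles reached per installer, not a true friend count, since friend lists overlap and the totals are themselves estimates.

```python
APP_USERS = 270_000  # people who installed "thisisyourdigitallife"

# Totals cited in the article, from earliest estimate to Facebook's revision.
ESTIMATES = {
    "originally reported": 30_000_000,
    "reported in March": 50_000_000,
    "revised by Facebook": 87_000_000,
}


def profiles_per_installer(total_profiles: int, installers: int) -> float:
    """Average number of profiles reached for each person who installed the app."""
    return total_profiles / installers


for label, total in ESTIMATES.items():
    ratio = profiles_per_installer(total, APP_USERS)
    print(f"{label}: about {ratio:.0f} profiles per installer")
```

Even at the lowest estimate, each installer opened the door to more than a hundred other people's profiles.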

Kogan marketed that data to SCL, which created a US-based company for the sole purpose of using all this research data. Longtime SCL director Alexander Nix was named its CEO (he has since been suspended); Republican hedge fund manager and Breitbart investor Robert Mercer reportedly bought into CA, while Steve Bannon, also an investor, joined its board of directors. Bannon chose the name for the company: Cambridge Analytica.

A 2015 Guardian report revealed that Cambridge Analytica received more than $2.5 million from conservative Super PACs funded by Mercer. Within a year, Kogan was boasting of having a data set of “50+ million individuals for whom we have the capacity to predict virtually any trait.”

Throughout 2015, CA worked with the Ben Carson and Ted Cruz campaigns as well as dozens of others, including the Brexit “Leave.EU” campaign, before moving on to work with the Trump presidential campaign. Former National Security Adviser Michael Flynn would later be revealed to have had a brief role as an adviser to the company.

That same year, a job service called Global Research Report on the online task site Mechanical Turk was shut down by host Amazon after a report in the Guardian revealed that the service was another offshoot of SCL, and that the data it asked workers to provide was being harvested by Cambridge Analytica. According to the Guardian report, at that point, CA claimed to have compiled “a massive data pool of 40+ million individuals across the United States — for each of whom we have generated detailed characteristic and trait profiles.”

A year later, that number had ballooned: At the Concordia Summit in New York in September 2016, Nix announced that CA had “profiled the personality of every adult in the United States of America — 220 million people.” Nix claimed a variety of sources were being used to glean this data — everything from Facebook data to phone surveys and voting history.

At the end of 2016, the Das Magazin article put all of the pieces together and raised the alarm, sparking a year of reporting into CA’s research and methodology. In December 2017, the US House Intelligence Committee questioned Nix on an unrelated issue regarding Hillary Clinton’s emails. That same month, special counsel Robert Mueller requested documents from the company regarding its work on the Trump campaign.

Given recent reporting on the company’s international business dealings and apparently strategic spread of misinformation using the internet, untangling CA’s political maneuvers may turn out to be an entirely separate, and daunting, task for the Justice Department.

Facebook is taking action now because of the scale of the bad press it’s receiving

As reported over the weekend by the Observer, Facebook knew in 2015 that its user information had been harvested “on an unprecedented scale.” In addition to changing its API, which the Observer reports as a direct response to CA’s exploitation of user data, the company also demanded that CA certify that it had destroyed all remnants of the data set. The Times noted that as of this month, however, that doesn’t seem to have happened, and most of the data is still in CA’s possession.

Yet Facebook, despite undergoing its own grilling by Congress and despite vowing to undertake a self-reckoning in response to its unwitting influence over the past two years of geopolitics, did not take further action against CA until this past Friday, when it reported to the New York Times that it had suspended CA’s Facebook account, along with the accounts of CA’s original researchers, Kogan and Wylie. It also reportedly scheduled an internal meeting with employees on Tuesday to explain what happened with Cambridge Analytica and field questions about the situation.

So what changed?

Again, everything that’s been reported this weekend has essentially already been known to us — just as all of the information about CA’s role in the election was publicly available before Das Magazin’s post-election article brought it to widespread public attention. What does seem to be new is Wylie’s personal account of his experience with the company, as well as reports of Facebook’s ongoing attempts to get CA to destroy the data it harvested.

The question of timing is also a crucial one. Cambridge Analytica has occupied a recurring role as a bit player in international political headlines for the better part of a year, while the aftermath of the 2016 election has been brutal for Facebook. As Mueller digs into CA’s political ties, and Facebook promises to restructure itself to better handle the maelstrom of problems that formed through its platform and factored into the election results, both companies are experiencing an intense amount of scrutiny, from more eyes than ever before.

Ironically, it’s not really new information that’s prompting Facebook to take action now, but rather the spread of that information at a massive scale — which is how we got here to begin with.

The factors that allowed Cambridge Analytica to hijack Facebook user data boiled down to one thing: no one involved in the development and deployment of this technology stopped to consider what the results might look like at scale. As the original creators of the methodology that Cambridge Analytica ultimately poached, Kosinski, Stillwell, and the rest of their research team failed to anticipate how the ability to harvest millions of samples of user data might be manipulated or exploited once their methodology was made public.

Cambridge Analytica members like Wylie, who recruited Kogan to work with the company, and Kogan himself, failed to anticipate that handing so much personal data to powerful people might provide them a new way to saturate Facebook with misleading or downright false information. And Facebook, in designing its API, at least initially failed to anticipate that a third-party developer might seek to harvest individual user profiles by the millions.

Facebook’s party line is that Kogan, in representing his app to Facebook, lied to them and stated the data would be used for academic research. But according to the Times, Facebook took no action to verify his claim, which raises the question of how many other “academic research” apps stole user data up until 2015, and what that theft might be used for. (Kogan reportedly retained his own copy of the data for his own personal research.)

For its part, CA has hedged about whether it even used Kogan’s research: It’s repeatedly claimed that “psychographics” — the original Cambridge personality profile data — played no significant part in its work for the Trump campaign. And there are plenty of reasons not to put too much emphasis on CA’s role in the election over a host of other contributing factors.

But speaking to the Observer, Wylie minced no words about the role psychographics played in the company: It “was the basis the entire company was built on.”

Just as with other recent large-scale data manipulations, from the recent Strava app fiasco to the widescale distribution and spread of fake news on social media, Cambridge Analytica didn’t “hack” our internet usage and our Facebook information so much as exploit the way the system was naturally designed to work. On one level, it might be unsurprising that a company guided by Steve Bannon would turn out to be using user data unethically and without permission.

But on another level, what happened to Facebook with Cambridge Analytica is a microcosm of an increasingly obvious problem that affects all social media platforms, with effects that could reach every internet user. Whether these effects end up yielding an algorithmically botched election or just more creepy fake celebrity porn, it seems clear that we’ve entered an unprecedented era of massive online data manipulation.