Researchers just released profile data on 70,000 OkCupid users without permission

Brian Resnick is Vox’s science and health editor, and is the co-creator of Unexplainable, Vox's podcast about unanswered questions in science. Previously, Brian was a reporter at Vox and at National Journal.

Update: The Open Science Framework removed the OkCupid data posting after OkCupid filed a Digital Millennium Copyright Act (DMCA) complaint on May 13.


A group of researchers has released a data set on nearly 70,000 users of the online dating site OkCupid. The data dump breaks the cardinal rule of social science research ethics: It took identifiable personal data without permission.

The information — while publicly available to OkCupid users — was collected by Danish researchers who never contacted OkCupid or its clientele about using it.

The data, collected from November 2014 to March 2015, includes user names, ages, gender, religion, and personality traits, as well as answers to the personal questions the site asks to help match potential mates. The users hail from a few dozen countries around the world.

Why did the researchers want the data?

The researchers, Emil Kirkegaard and Julius Daugbjerg Bjerrekær, ran software to "scrape" the information off OkCupid's website and then uploaded the data onto the Open Science Framework, an online forum where researchers are encouraged to share raw data to increase transparency and collaboration across social science. Kirkegaard, the lead author, is a graduate student at Aarhus University in Denmark. (The university notes Kirkegaard was not working on behalf of the university, and that "his actions are entirely his own responsibility.")

(Update: The original version of this story named Oliver Nordbjerg as a co-author as well. He says his name has since been removed from the report.)

Kirkegaard and Bjerrekær write that OkCupid is a valuable source of survey information "because users often answer hundreds if not thousands of questions."

But the data set reveals deeply personal information about many of the users. OkCupid uses a series of personal questions (on topics such as sexual habits, politics, fidelity, and feelings on homosexuality) to help match people on the site.

The data dump did not reveal anyone's real name. But it's entirely possible to use clues from a user's location, demographics, and OkCupid user name to determine their identity.

This is a huge breach of social science research ethics

The American Psychological Association makes it very clear: Participants in studies have the right to informed consent. They have a right to know how their data will be used, and they have the right to withdraw their data from that research. (There are some exceptions to the informed consent rule, but those do not apply when there's a chance a person's identity can be linked to sensitive information.)

This data scrape, and potential future studies built on it, won't provide any of those protections. And scientists who use this data set may be in breach of the standard ethical code.

"This is without a doubt one of the most grossly unprofessional, unethical and reprehensible data releases I have ever seen," writes Os Keyes, a social computing researcher*, in a blog post.

A separate paper by Kirkegaard and Bjerrekær describing the methods they used in the OkCupid data scrape (also published on the Open Science Framework) contains another big ethical red flag. The authors report that they didn't scrape profile pictures because it "would have taken up a lot of hard drive space."

And when researchers asked Kirkegaard about these concerns on Twitter, he shrugged them off.

Note: The IRB is the institutional review board, a university office that reviews the ethics of research projects.

Does open science need some gatekeeping?

"Some may object to the ethics of gathering and releasing this data," Kirkegaard and his colleagues argue in the paper. "However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it [in] a more useful form."

(The profiles might technically be public, but why would OkCupid users expect anyone but other users to look at them?)

Keyes points out that Kirkegaard published the methods paper in a journal called Open Differential Psychology. The editor of that journal? Kirkegaard.

"The thing [Open Differential Psychology] looks pretty much like a vanity press," Keyes writes. "In fact, of the last 26 papers it 'published', he authored or co-authored 13." The paper claims it was peer-reviewed, but the fact that Kirkegaard is the editor is a conflict of interest.

The Open Science Framework was created, in part, in response to the traditional scientific gatekeeping of academic publishing. Anyone can publish data to it, with the hope that the freely accessible information will spur innovation and keep scientists accountable for their analyses. And as with YouTube or GitHub, it's up to the users, not the framework, to ensure the integrity of the information.

If Kirkegaard is found to have violated the site's terms of use (i.e., if OkCupid files a legal complaint), the data will be removed, says Brian Nosek, the executive director of the Center for Open Science, which hosts the site.

This seems likely to happen. An OkCupid spokesperson tells me: "This is a clear violation of our terms of service — and the Computer Fraud and Abuse Act — and we’re exploring legal options."

Overall, Nosek says the quality of the data is the responsibility of the Open Science Framework users. He says that personally he'd never post data with potential identifiers.

(For what it's worth, Kirkegaard and his crew aren't the first to scrape OkCupid user data. One user scraped the site to match with more women, but it's a bit more controversial when data is posted on a site meant to help scientists find fodder for their projects.)

Nosek says the Center for Open Science is having internal discussions about whether it should intervene in these cases. "This is a tricky question, because we are not the moral truth of what is appropriate to share or not," he says. "That's going to require some follow-up." Even transparent science may need some gatekeeping.

It might be too late for this episode. The data has been downloaded nearly 500 times so far, and some are already analyzing it.

*This post originally identified Keyes as an employee of the Wikimedia Foundation. Keyes no longer works there.

Correction: A previous version of this story stated that all three of the Danish researchers who authored the OkCupid paper were affiliated with Aarhus University in Denmark. In fact, Kirkegaard is a graduate student there, while Oliver Nordbjerg and Julius Daugbjerg Bjerrekær are not currently students or staff there.