Paul Glimcher is on the verge of launching an absurdly ambitious project in social science. The concept is simple, but the scope is spectacularly broad. Over the next few years, he and his team are going to recruit 10,000 New Yorkers and track everything about them for decades.
By everything, I mean full genome data, medical records, diet, credit card transactions, physical activity, personality test scores, intelligence test scores, social interactions, neighborhood characteristics, loan records, time spent on email, educational achievement, employment status, sleep, GPS location data, blood work, and stool samples. And there's so much more.
Here's how granular it will get: There are plans to use Bluetooth technology to track how often family members interact with one another.
This is the Kavli HUMAN Project, and Glimcher is the director. The goal is to create an atlas of the human experience — to find out how biology, psychology, and the environment all interact to shape our lives.
The multimillion-dollar initiative will use machine learning to find patterns in all that data. Patterns that could help us understand how obesity and heart disease develop over the life span. How nutrition in our 20s impacts our cognition in our 60s. How genetics shape our financial well-being. How neighborhoods protect or hurt our long-term health.
Or it could just yield a whole lot of noise.
Glimcher is a jack-of-all-trades social scientist at New York University. He’s trained as a neuroscientist, an economist, and a psychologist but can also speak fluently about physics and astronomy.
Experimental social science is currently in the midst of a difficult period of introspection. Some call it a crisis: Long-established theories are coming undone in replication attempts, and scientists are reckoning with the fact that too many experiments have included too few participants to yield meaningful results.
A better way, perhaps, to learn about human behavior and its consequences is to just track people carefully over a long period of time. Some of the most profoundly basic discoveries in public health — like that smoking kills — have been made this way.
Glimcher sees the current struggles in social science (the problems with replication, with study design, with weak results) and senses an opportunity. Huge data projects have transformed astronomy and physics. Why can’t social and public health scientists use similar techniques?
The Kavli HUMAN Project intends to track 10,000 New Yorkers living in around 4,000 households via a data-siphoning smartphone app and physical evaluations. The team hopes to keep the study running for at least 20 years, and along the way expand it to other cities across the United States. The project is still in its planning phases, but Glimcher says it’s likely to start recruiting participants in New York's five boroughs in 2017.
Over the summer, I talked to Glimcher by phone. We discussed why he thinks this project is needed right now, the potential risks of collecting that much data on so many people, and why social scientists should look to the stars for some answers.
This interview has been edited for length and clarity.
Social science needs to become more like astronomy
I’ve read that this project was inspired by astronomy. What’s the connection? How can stars help social science?
In the 1990s, if you were an astronomer and you were interested in quasars, what you did is you booked three nights a year on a telescope, and it was clear you got some quasar data, maybe two or three. And then you went home and thought about it for a year.
And a guy at Princeton in the department of astronomy by the name of Jim Gunn said in 1990, "That’s exactly the wrong way to do it. ... What you would do instead was drift the telescope slowly across the sky, and capture everything and put it in a database."
And then an astronomer wouldn’t book a few nights on a telescope; they would go into a database, and instead of having three quasars, they might have, well, Jim Gunn famously said 100.
Now the number is more like 70,000 or 80,000. And that database is called the Sloan Digital Sky Survey. If you’ve ever been to the planetarium and flown through space, you’ve flown through that database.
So why is it we don’t do that for understanding people?
In 2014, I said to Kavli [The Kavli Foundation is a nonprofit that funds research ventures across the country], "Well, the answer is the really cool thing would be a giant discovery data set that would be like the Sloan Digital Sky Survey for humans." And it would go from genetics to social interactions, from electronic medical records to metadata on SMS texting, from financial expenditure data to who you hung out with.
I’m a fan of psychology and social science, but they also disappoint me. Here’s my problem: Psychological experiments can be very contrived. In political psychology, for example, experimenters will set up fake elections and give participants prompts to read about fake candidates to try to isolate something and determine causality.
Psychologists don’t have great experimental tools. And data from one survey can’t yield strong conclusions.
Your approach is much bigger. If it yields many meaningful results, do you think the age of small-scale experimentation in psychology will be over? Will it totally become a big-data science?
The idea that you isolate a variable and study it — that is the core scientific value.
But experiments don’t always generalize. Systems are complicated, and sometimes when you isolate a variable, you throw out the baby with the bathwater.
The problem is that we are at a stage where psychology’s found a bunch of interesting, really valuable ideas using that [experimental] approach.
But we’re ready and mature enough to turn it into a real-world science. That stuff [experimentation] won’t go away, but it won’t be the real-world part.
I always look to other sciences that got a head start on psychology as a hint to answer that question. So in astronomy, 90 percent of astronomy is a big-data science now. But there are still problems that are approached [experimentally].
And I think there will always be a bleeding edge, where scientists are thinking of new things to measure. And we shouldn’t expect that bleeding edge ever to go away.
The Kavli HUMAN Project is big-data social science
What are the types of questions that social scientists are really bad at answering now — and why is this Kavli HUMAN Project a better approach?
The way [most social science] studies are done is you take 100 people — usually college students at your university — and you have them perform the experiment under two conditions. And then you try to draw a conclusion about the human condition from the difference between those two groups of 50 kids — who are mostly white.
So where are we falling down? Number one, we’re underpowered. You really don’t want to do experiments on 50 people.
You want to have the statistical power to see unambiguously what’s significant. Sample sizes should be in the 5,000 to 10,000 range to really solve the reproducibility, reliability, and repeatability problem.
Second thing: You do not want to be studying white college students. That is utterly useless — I mean, it’s not utterly useless, I spent my career doing a lot of it. But it’s clearly the wrong solution.
The right solution: You want a demographically representative sample. Right? You don’t want to say something about the kids who attend NYU. You want to say something about the people of New York, or the people of America, or the people of Earth.
Three: Another place where we fall down as scientists is longitudinality.
We know so little about life trajectories, and we try to infer things from studying white college students for 10 minutes. ... We know how powerful longitudinal studies can be. The prototypical example is probably still Framingham.
That was the long-term study started in the '40s that tracked many community members in a Massachusetts town for decades, right? It helped establish that smoking cigarettes is lethal, and that obesity is linked to heart disease. Perhaps those were just low-hanging fruit. You can do better than Framingham?
The Framingham [researchers] got the most data they could. And they transformed our understanding [of health]. But they gathered half a kilobyte of data a subject a year.
So what if you could take modern psychology, the Framingham [design], build to scale, and use state-of-the-art technology so that you were getting a gigabyte of data a year per person instead of a half a kilobyte?
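To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 kilobyte = 1,000 bytes, 1 gigabyte = 1 billion bytes) and that all 10,000 participants contribute a full gigabyte every year for 20 years. Those assumptions, and the function names, are illustrative only and do not come from the project.

```python
# Back-of-the-envelope comparison of the data rates quoted in the interview.
# Assumptions (not from the interview): decimal units, and every participant
# contributing a full gigabyte in each of the 20 years.

BYTES_PER_KB = 1_000
BYTES_PER_GB = 1_000_000_000
BYTES_PER_TB = 1_000_000_000_000

framingham_bytes_per_person_year = 0.5 * BYTES_PER_KB  # "half a kilobyte"
human_bytes_per_person_year = 1 * BYTES_PER_GB         # "a gigabyte"


def scale_up_factor():
    """How many times more data per person per year than Framingham."""
    return human_bytes_per_person_year / framingham_bytes_per_person_year


def total_terabytes(n_people, n_years):
    """Total raw data volume, in terabytes, under the assumptions above."""
    total_bytes = n_people * n_years * human_bytes_per_person_year
    return total_bytes / BYTES_PER_TB


print(f"Per-person increase: {scale_up_factor():,.0f}x")
print(f"Total over 20 years: {total_terabytes(10_000, 20):,.0f} TB")
```

Under these assumptions, the jump is roughly two-million-fold per person, and the full cohort would generate on the order of 200 terabytes over two decades.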
How the data could help solve huge public health problems like obesity
Okay, what are some things you could hope to learn from all this data from the HUMAN Project?
Let me give an example: Alzheimer’s disease. We know that if you have an advanced education, and if you have a large social network, and if you have an active, busy social life, you are at lower risk for Alzheimer’s.
It’s probably obvious that just sending someone to college, just sitting in college classrooms, is probably not protective for Alzheimer’s disease. Right? There’s something that co-varies with being highly educated, which is the thing that protects you from Alzheimer’s. And who knows what it is?
We have no way to know without a study like the Kavli HUMAN Project.
So will the results be simple connections, like "sugar causes diabetes" or "reading books protects against Alzheimer’s"?
No, it’s going to be much richer than that. It will be multivariate.
Multivariate? I’m not sure I understand what you mean by that. Can you give me an example of a "multivariate" approach to health that exists today?
So, cancer. Now what we understand is that everybody’s tumor has a slightly different genetic structure.
And if you have breast cancer, there is a suite of genetic structures. There are actually six or eight main genes and about 10 minor genes, and they interact in defining your tumor’s sensitivity. Your treatment program is not as simple as, "You have breast cancer. Take this drug." Nor is it as simple as, "Your tumor is one of five types. Take one of these five drugs."
Your treatment is a function, a mathematical function, of the interaction of the different genes that make up your tumor and hence define your tumor’s vulnerability.
The genotype of your tumor will be mapped, and it will yield a very detailed treatment program, through a machine-learned algorithm.
Ahh, so it’s like an algorithmic approach to wellness? Like if you’re this tall, have a certain set of genes, and live in an area with a lot of pollen, you should then do X to live longer.
But that’s just me thinking out loud. What might a multivariate solution to a problem like obesity look like?
Obesity is almost certainly gonna be a mixture of phenomena.
There are people who believe that high-fructose corn syrup is the ticket, and if that’s true, we’ll see it. We’ll definitely have enough statistical power to see it. But if it’s a mixture of genetics, ethnicity, culture, and diet, where the structure of a diet for an individual depends on all of those things, this will be completely easy for us to see.
[The fix] may be as subtle as, "This group needs to avoid high-fructose corn syrup, or if they do eat high-fructose corn syrup, they have to avoid carbohydrates."
Do we really want to know this much about human behavior? Will that hinder our free will?
And what might we lose in pursuit of a longer life driven by data? I’m thinking if we do learn a lot, people will have fewer choices.
[Sighs] I’m sighing only because I don’t want to [lose] the thing I like most in life: my opportunity to exercise my free will.
Do I think that as we understand the human condition more deeply, that will influence the way we live our lives?
I also imagine you might even kind of create a fatalistic map for people’s lives. That you’ll be able to predict some outcomes, but they’ll be based on variables people can’t change, like their IQ, personality, genetics. Or based on variables they can’t change easily, like their economics. Would that information be more depressing than useful?
I’m hesitant there, because we’re just speculating. But, yeah, of course one thinks about that. When we discovered the location and structure of the Huntington’s chorea gene, we had the ability, at birth, to say whether someone was essentially doomed in their 40s to die of chorea.
There was a lot of debate. There were a lot of people in their 20s who had to decide whether they wanted to be tested or not. They knew they had a family history so that they were at risk, and it's like, "Do I want to know that? Do I want to know my fate?" And a lot of people who were at risk chose not to be tested, because they just didn't want to know.
So I hear you. That will happen. There will be the ability to get greater precision on one's fate, and people will have to decide if they want to know it or not.
We are building a map. And that map will be of use to society, whether individuals choose to see where they are on that map or not.
Wait, how will all this information be kept secure?
You’ll be collecting gigabytes of highly, highly personal information on thousands of people. How will you protect the privacy of your participants?
The core database, what we call the red cube...
The red cube?
...that's just what we call it to remind ourselves how dangerous it is.
Only four system operators will have access to the red cube data. Those operators enter a Mission Impossible–secure facility that's being built out now. There will be biometric identification; only one person goes in through an airlock door at a time.
The red cube data has seven partitions; it's dynamically reencrypted at all times.
Two system operators need to be biometrically identified to approve changes in the red cube before [the changes are] made.
Intense. How would researchers draw data from the red cube?
[They’ll do it in our] "yellow zone," which lives inside a super-secure firewall.
In the super-secure firewall, that's where all research happens. In there, everything's anonymized.
Let's say you're a scientist or a policymaker, and you come to me and you say, "I need a data set of the following type."
So we sit down together and we figure out how we'll build an anonymized version of the database you need [in the yellow zone]. That's where you can work with it.
Are there restrictions, then?
Rule No. 1 is you cannot bring your laptop or your computer or memory stick into the yellow zone. That's completely off the table.
You can sit down and work. And anything you learn, you can keep. You just can't keep the raw spreadsheet.
What you learned, you can take out, and our system operator will let you email it out to yourself. But that data set you can't take out, because if you took enough of those data sets out and kept them at home, you could begin to violate the security of our subjects.
So there are some intentional roadblocks that will slow down the rate of discoveries?
I want to say that in a totally different way. We are protecting our subjects. These are people who are, because of their commitment to making New York or the world a better place, sharing fundamental data about themselves. And we have an obligation to them that we will never breach, to protect them. They are our first obligation. The people we owe the most to are our participants, and protecting them is always job one.
Why social science needs an intergenerational approach
As a social scientist, were you jealous of big-data physics and astronomy?
For me, the most exciting kind of scientist to be, right now, is a psychologist, a neuroscientist, an economist. These are the disciplines that are changing fundamentally, that are maturing to their next level. And you want to be there when that happens, right?
The really exciting stuff in physics is already done. So I hear what you’re saying. Like, "Were you jealous of physicists?" No. The answer is, a physicist today is unlucky; she missed the best moment.
Something that has interested me is that this is a long-term project, and the people that are starting on it now may not see the end of it.
Well, I mean, I am 55 years old, and I don't really want the study to end in 20 years, so let's imagine it ran 30 years. The likelihood I'll be alive then is pretty low.
It could take generations of data to truly figure out human behavior, right? Maybe you’ll find over the decades that human behavior keeps morphing, changing, that it’s not a stable thing.
Yeah, I think it will. And that's one of the most exciting ideas, right? As technology changes, as industrial revolutions change, behavior probably changes in really interesting and fundamental ways. All the more reason that you have to take an intergenerational approach.