Welcome to De/code: A new series of conversations with leading thinkers in health and science.
In late 2003, Gill Bejerano was staring at some perplexing data.
The computer scientist-turned-biologist had written software that could cluster and compare genome regions for humans, rats and mice. He and his research team, a collaboration between UC Santa Cruz and the University of Queensland, found nearly 500 long stretches of DNA where the base pairs matched perfectly.
It didn’t make sense. Less than 2 percent of DNA produces the amino acids that produce the proteins that produce us. The rest was supposed to be “junk DNA,” just along for the evolutionary ride. Instead, the findings, published in the journal Science in 2004, suggested this genetic refuse mattered enough to survival to stay perfectly preserved in the gene pool for millions of years.
It wasn’t the first paper to spot “conserved” regions that weren’t producing proteins across species — but it showed them occurring at a magnitude that few could have imagined. It ultimately shifted the thinking on junk DNA, arguing forcefully that these regions were regulating genes. Moreover, the acclaimed paper was a breakthrough performance for what’s known as computational biology, highlighting how big data, fast computers and smart algorithms can offer fresh insights into the befuddling world of genomics.
Indeed, the size and complexity of the genome are so vast that unraveling its mysteries demands the tools of computer science. To wrap your mind around the staggering amount of data we’re talking about, consider this sentence from the introduction to “Genome” by Matt Ridley: “If I read the genome out to you at the rate of one word per second for eight hours a day, it would take me a century.”
Now imagine trying to spot commonalities among three data sets of that rough size — for the human, mouse and rat in Bejerano’s study — without a cluster of computers at your disposal.
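For readers who think in code, the core idea of hunting for perfectly matching stretches can be sketched in a few lines of Python. This is only a toy illustration, not the software Bejerano's team wrote: it assumes the three genomes have already been aligned column by column (the hard part in practice), and the function name `conserved_runs`, the `min_len` threshold and the short sequences are all hypothetical.

```python
from typing import List, Tuple


def conserved_runs(aligned: List[str], min_len: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where every aligned sequence
    carries the same base for at least `min_len` consecutive columns."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")

    runs = []
    start = None  # start index of the current run of identical columns
    for i in range(length):
        column = {seq[i] for seq in aligned}
        # A column matches if all species agree and none has a gap ("-").
        match = len(column) == 1 and "-" not in column
        if match and start is None:
            start = i
        elif not match and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_len:
        runs.append((start, length))
    return runs


# Made-up toy sequences standing in for aligned genome regions.
human = "ACGTACGTTTGACCA"
mouse = "ACGTACGATTGACCA"
rat = "ACGTACGTTTGACGA"

print(conserved_runs([human, mouse, rat], min_len=5))  # [(0, 7), (8, 13)]
```

A single pass like this is cheap. What makes the real problem demanding is producing the alignment and running the scan across billions of positions, which is where the cluster of computers comes in.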
In the years following that research, he has continued to publish work that further refined our understanding of what “non-coding DNA” is actually doing and developed open source software tools that others can apply for similar purposes.
Given the results of his own pioneering work in computational biology, Bejerano has become something of an evangelist for the field. The assistant professor of developmental biology and computer science at Stanford University is convinced the next great scientific insights, including advances that might eliminate terrible diseases and extend lives, will occur at the intersection of these disciplines.
Bejerano, who moved to the United States from Israel for his postdoctoral work, recently sat down with Re/code at his namesake lab to discuss where his work and computational biology are headed. The interview that follows has been edited for space and clarity.
Re/code: I’m new to this beat, but in the first few weeks what I keep hearing is that advances in health and advances in computer science are really getting to the point of offering better treatments and improved diagnostics. That the long-held promise there is starting to be realized. How do you view that?
Gill Bejerano: The people who make the machines to generate the data are way ahead of the people analyzing the data. We have amazing new machines we plug into the wall, we hit the big button, we get terabytes or petabytes of cool stuff coming out of that.
It’s completely virgin territory, with a few beautiful experiments panning out and, we think, hundreds of thousands of additional ones waiting for us.
Part of the fun of a computational person right now is that data is growing exponentially. It’s not even like a kid in a candy store — it’s a warehouse of candy. It’s data, data, data.
We’re building the systems and building the tools. But the gap is an education gap. We’re not filling it up with the right people yet. You need people at the intersection. It’s hard for people coming in purely from an engineering perspective to build the right thing. You have to understand your client. And for people tapping into the intersection of science and tech, the client is biology.
As someone working at this intersection, are you trying to take on the role of an evangelist to some degree? Spreading the word to smart, young people that it’s great to be a computer scientist or a biologist, but it’s really great to be both?
Absolutely. It is a lot of fun to be among the first at an intersection, but it gets lonely after a while.
If people in computer science have even a passing interest in biology, they should get their minds outside of hacker books, and get into science and watch TED talks.
The other side is, if you love biology or even want to become a doctor, the first thing you should do is get into a Computer Science 101 class. And if you get into 202 and 303, you’d probably be better off for that. You’ll become dangerous with the computer — and I really think the future is there.
There is no institutional way of pushing you toward those intersections (in academia today), or at least fewer than there will be in five or 10 years. There aren’t big hurdles, but there’s no illuminated path. You have to jump yourself.
If you like working on disruptive things, there is nothing more disruptive right now than what is going on in modern biology and medicine. We are rewriting the books completely.
You’re also affiliated with Stanford’s Artificial Intelligence Lab. How are you applying machine learning and AI to this area?
It’s a classic AI area. You have a large amount of data, a tiny fraction of that is annotated and you want to deduce from those annotations what goes on in all the other stuff. This is a prime target for machine learning, and this is exactly where biology and health care are essentially at now.
The challenge is what I said before: It’s very hard to be an agnostic machine learning person and come and say, ‘I know nothing about your field.’
If you don’t understand something about the data, biology and health, if you don’t have at least a minimum level of understanding, you can’t get around some confounding facts or appreciate what predictions you’re actually making.
What is your lab’s research focus today?
A big one is called phenotype-genotype association, [finding the connections between an organism’s hereditary information and its form and traits, from behavior to susceptibility to diseases].
Essentially what we’re exploring, in computer science terms, is this:
I show you the machine language code that makes you (the genome) and I show you the output (the human). There are some things about the machine language code we understand and a lot we don’t. And now I start giving you thousands of pairs of variants of the program and variants of the outputs.
[The question becomes can our data analysis tools] tell me something new about what the different pieces of code do? If you have 400 people with a certain disease and 400 people of the same age and ethnicity who don’t have the disease, and that disease is clearly about something messed up in the code, can you find the ‘bug’?
We can play multiple games like that. This is beautiful reverse engineering that is classically computational. It’s programs and output. The output is human beings.
Human beings are fairly complicated, but the program is just three billion base pairs (the rough size of the human genome). So the size stays fixed. The complexity is something we’re chipping away at all the time, which is awesome with today’s resources.
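The "find the bug" game Bejerano describes can be sketched as a tiny case-control comparison. This is a minimal illustration under loud assumptions, not his lab's method: genomes are toy strings of equal length, a "variant" is any position that differs from a made-up reference, and positions are ranked by a one-sided Fisher exact test written with the standard library. The names `fisher_one_sided` and `rank_variants` are hypothetical.

```python
from math import comb
from typing import List, Tuple


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """Probability of seeing at least `a` variant carriers among the cases
    by chance alone (hypergeometric null).
    a/b: cases with/without the variant; c/d: controls with/without."""
    n_cases = a + b
    carriers = a + c
    total = a + b + c + d
    upper = min(n_cases, carriers)
    favorable = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return favorable / comb(total, n_cases)


def rank_variants(
    cases: List[str], controls: List[str], reference: str
) -> List[Tuple[float, int, int, int]]:
    """Return (p_value, position, case_carriers, control_carriers),
    sorted so the most disease-associated position comes first."""
    results = []
    for pos, ref_base in enumerate(reference):
        a = sum(genome[pos] != ref_base for genome in cases)
        c = sum(genome[pos] != ref_base for genome in controls)
        p = fisher_one_sided(a, len(cases) - a, c, len(controls) - c)
        results.append((p, pos, a, c))
    return sorted(results)


# Made-up data: every case differs from the reference at position 3;
# position 1 varies in one case and one control (background noise).
reference = "ACGTAC"
cases = ["ACGGAC", "ACGGAC", "ATGGAC", "ACGGAC"]
controls = ["ACGTAC", "ATGTAC", "ACGTAC", "ACGTAC"]

ranked = rank_variants(cases, controls, reference)
best_p, best_pos, in_cases, in_controls = ranked[0]
print(best_pos, in_cases, in_controls, round(best_p, 4))  # 3 4 0 0.0143
```

With three billion positions instead of six, a ranking like this would also need correction for the sheer number of tests and for confounders such as ancestry, which is one reason the matched groups of the same age and ethnicity matter.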
We’re seeing a lot of language from human biology show up in computer science and vice versa, like “coding” and “artificial neural networks.” Is this biomimicry? Are we getting closer to computer science working the way biology works, and seeing that lead to advances? Or are these just metaphors, useful ways of thinking about it?
I don’t know that I have a good answer for you on that one. I don’t know that the field of genetic algorithms (an artificial intelligence technique that attempts to mimic natural selection) has become mainstream. But I don’t know enough about the field to tell you whether it’s exhausted its opportunities or it can’t play the game the way the genome does.
But I think there is a lot more for us to learn from the way that nature does things. If you want to fix something that’s not good in human genomes, you can do 1,000 experiments in labs for 20 years. Or you can look at nature, which has done millions of experiments over millions of years, and pick the best thing it picked.
There are so many experiments being done around us at every single moment in time, that the more we get wise to, the better it will serve us.
This article originally appeared on Recode.net.