Artificial intelligence systems like ChatGPT can do a wide range of impressive things: they can write passable essays, they can ace the bar exam, they’ve even been used for scientific research. But ask an AI researcher how it does all this, and they shrug.
“If we open up ChatGPT or a system like it and look inside, you just see millions of numbers flipping around a few hundred times a second,” says AI scientist Sam Bowman. “And we just have no idea what any of it means.”
Bowman is a professor at NYU, where he runs an AI research lab, and he’s a researcher at Anthropic, an AI research company. He’s spent years building systems like ChatGPT, assessing what they can do, and studying how they work.
He explains that ChatGPT runs on something called an artificial neural network, which is a type of AI modeled on the human brain. Instead of having a bunch of rules explicitly coded in like a traditional computer program, this kind of AI learns to detect and predict patterns over time. But Bowman says that because systems like this essentially teach themselves, it’s difficult to explain precisely how they work or what they’ll do. Which can lead to unpredictable and even risky scenarios as these programs become more ubiquitous.
I spoke with Bowman on Unexplainable, Vox’s podcast that explores scientific mysteries, unanswered questions, and all the things we learn by diving into the unknown. The conversation is included in a new two-part series on AI: The Black Box.
This conversation has been edited for length and clarity.
How do systems like ChatGPT work? How do engineers actually train them?
So the main way that systems like ChatGPT are trained is by basically doing autocomplete. We’ll feed these systems sort of long text from the web. We’ll just have them read through a Wikipedia article word by word. And after it’s seen each word, we’re going to ask it to guess what word is gonna come next. It’s doing this with probability. It’s saying, “It’s a 20 percent chance it’s ‘the,’ 20 percent chance it’s ‘of.’” And then because we know what word actually comes next, we can tell it if it got it right.
This takes months and millions of dollars’ worth of computer time, and then you get a really fancy autocomplete tool. But you want to refine it to act more like the thing that you’re actually trying to build: a sort of helpful virtual assistant.
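The autocomplete-style training Bowman describes can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: it guesses the next word from a single previous word using a lookup table of scores, whereas real systems use a large neural network that reads all of the preceding text. The function names (`train_next_word`, `predict`), the sample sentence, and the learning rate are all made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def train_next_word(text, epochs=50, lr=0.5):
    """Read the text word by word, guess the next word, then correct the guess."""
    words = text.split()
    vocab = sorted(set(words))
    # Hypothetical toy model: one row of scores per previous word.
    # All scores start at 0, so the first guess is a uniform one.
    scores = {prev: {w: 0.0 for w in vocab} for prev in vocab}
    for _ in range(epochs):
        for prev, actual in zip(words, words[1:]):
            probs = softmax(scores[prev])  # the model's guess, as probabilities
            for w in vocab:
                target = 1.0 if w == actual else 0.0
                # Raise the score of the real next word, lower the others.
                scores[prev][w] += lr * (target - probs[w])
    return scores

def predict(scores, prev):
    """Probability of each candidate next word after `prev`."""
    return softmax(scores[prev])

model = train_next_word("the cat sat on the mat the cat ran")
guess = predict(model, "the")
print(max(guess, key=guess.get))  # the word the model thinks most likely follows "the"
```

In this sample text, "the" is followed by "cat" twice and "mat" once, so the trained table ends up giving "cat" the highest probability after "the". Nothing in the code says so explicitly; the preference comes entirely from the corrections made during training.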
There are a few different ways people do this, but the main one is reinforcement learning. The basic idea behind this is you have some sort of test users chat with the system and essentially upvote or downvote responses. Sort of similarly to how you might tell the model, “All right, make this word more likely because it’s the real next word,” with reinforcement learning, you say, “All right, make this entire response more likely because the user liked it, and make this entire response less likely because the user didn’t like it.”
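The upvote and downvote idea can be sketched the same way. This is a minimal illustration, not how any particular lab does it: the three canned responses, the fixed feedback list, and the `apply_feedback` helper are hypothetical, and a real system would sample responses from a neural network rather than keep one score per response. The update is the standard policy-gradient form: reward times (1 minus probability) for the rated response, and reward times (minus probability) for the others.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {r: math.exp(s - top) for r, s in scores.items()}
    total = sum(exps.values())
    return {r: e / total for r, e in exps.items()}

def apply_feedback(scores, response, reward, lr=0.5):
    """Shift probability toward (+1) or away from (-1) an entire response."""
    probs = softmax(scores)
    for r in scores:
        indicator = 1.0 if r == response else 0.0
        scores[r] += lr * reward * (indicator - probs[r])

# Hypothetical candidate responses, all equally likely at the start.
scores = {"helpful answer": 0.0, "rude answer": 0.0, "off-topic answer": 0.0}

# Hypothetical test-user ratings: +1 is an upvote, -1 is a downvote.
feedback = [
    ("helpful answer", +1),
    ("rude answer", -1),
    ("helpful answer", +1),
    ("off-topic answer", -1),
]

for _ in range(10):
    for response, reward in feedback:
        apply_feedback(scores, response, reward)

final = softmax(scores)
print(max(final, key=final.get))  # the response the feedback has made most likely
```

The point of the sketch is that the feedback only says which whole responses users liked. It nudges probabilities up or down, and it never tells the system why a response was good.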
So let’s get into some of the unknowns here. You wrote a paper all about things we don’t know when it comes to systems like ChatGPT. What’s the biggest thing that stands out to you?
So there’s two connected big concerning unknowns. The first is that we don’t really know what they’re doing in any deep sense. If we open up ChatGPT or a system like it and look inside, you just see millions of numbers flipping around a few hundred times a second, and we just have no idea what any of it means. With only the tiniest of exceptions, we can’t look inside these things and say, “Oh, here’s what concepts it’s using, here’s what kind of rules of reasoning it’s using. Here’s what it does and doesn’t know in any deep way.” We just don’t understand what’s going on here. We built it, we trained it, but we don’t know what it’s doing.
Very big unknown.
Yes. The other big unknown that’s connected to this is we don’t know how to steer these things or control them in any reliable way. We can kind of nudge them to do more of what we want, but the only way we can tell if our nudges worked is by just putting these systems out in the world and seeing what they do. We’re really just kind of steering these things almost completely through trial and error.
Can you explain what you mean by “we don’t know what it’s doing”? Do we know what normal programs are doing?
I think the key distinction is that with normal programs, with Microsoft Word, with Deep Blue [IBM’s chess playing software], there’s a pretty simple explanation of what it’s doing. We can say, “Okay, this bit of the code inside Deep Blue is computing seven [chess] moves out into the future. If we had played this sequence of moves, what do we think the other player would play?” We can tell these stories at most a few sentences long about just what every little bit of computation is doing.
With these neural networks [e.g., the type of AI ChatGPT uses], there’s no concise explanation. There’s no explanation in terms of things like checkers moves or strategy or what we think the other player is going to do. All we can really say is just there are a bunch of little numbers and sometimes they go up and sometimes they go down. And all of them together seem to do something involving language. We don’t have the concepts that map onto these neurons to really be able to say anything interesting about how they behave.
How is it possible that we don’t know how something works and how to steer it if we built it?
I think the important piece here is that we really didn’t build it in any deep sense. We built the computers, but then we just gave the faintest outline of a blueprint and kind of let these systems develop on their own. I think an analogy here might be that we’re trying to grow a decorative topiary, a decorative hedge that we’re trying to shape. We plant the seed and we know what shape we want and we can sort of take some clippers and clip it into that shape. But that doesn’t mean we understand anything about the biology of that tree. We just kind of started the process, let it go, and try to nudge it around a little bit at the end.
Is this what you were talking about in your paper when you wrote that when a lab starts training a new system like ChatGPT they’re basically investing in a mystery box?
Yeah, so if you build a little version of one of these things, it’s just learning text statistics. It’s just learning that ‘the’ might come before a noun and a period might come before a capital letter. Then as they get bigger, they start learning to rhyme or learning to program or learning to write a passable high school essay. And none of that was designed in — you’re running just the same code to get all these different levels of behavior. You’re just running it longer on more computers with more data.
So basically when a lab decides to invest tens or hundreds of millions of dollars in building one of these neural networks, they don’t know at that point what it’s gonna be able to do. They can reasonably guess it’s gonna be able to do more things than the previous one. But they’ve just got to wait and see. We’ve got some ability to predict some facts about these models as they get bigger, but not these really important questions about what they can do.
This is just very strange. It means that these companies can’t really have product roadmaps. They can’t really say, “All right, next year we’re gonna be able to do this. Then the year after we’re gonna be able to do that.”
And it also plays into some of the concerns about these systems. That sometimes the skill that emerges in one of these models will be something you really don’t want. The paper describing GPT-4 talks about how when they first trained it, it could do a decent job of walking a layperson through building a biological weapons lab. And they definitely did not want to deploy that as a product. They built it by accident. And then they had to spend months and months figuring out how to clean it up, how to nudge the neural network around so that it would not actually do that when they deployed it in the real world.
So I’ve heard of the field of interpretability, which is the science of figuring out how AI works. What does that research look like, and has it produced anything?
Interpretability is this goal of being able to look inside our systems and say pretty clearly, with pretty high confidence, what they’re doing and why they’re doing it. Just kind of how they’re set up: being able to explain clearly what’s happening inside of a system. I think it’s analogous to biology for organisms or neuroscience for human minds.
But there are two different things people might mean when they talk about interpretability.
One of them is this goal of just trying to sort of figure out the right way to look at what’s happening inside of something like ChatGPT: figuring out how to kind of look at all these numbers and find interesting ways of mapping out what they might mean, so that eventually we could just look at a system and say something about it.
The other avenue of research is something like interpretability by design. Trying to build systems where by design, every piece of the system means something that we can understand.
But both of these have turned out in practice to be extremely, extremely hard. And I think we’re not making critically fast progress on either of them, unfortunately.
What makes interpretability so hard?
Interpretability is hard for the same reason that cognitive science is hard. If we ask questions about the human brain, we very often don’t have good answers. We can’t look at how a person thinks and explain their reasoning by looking at the firings of the neurons.
And it’s perhaps even worse for these neural networks because we don’t even have the little bits of intuition that we’ve gotten from humans. We don’t really even know what we’re looking for.
Another piece of this is just that the numbers get really big here. There are hundreds of billions of connections in these neural networks. So even if you can find a way to make sense of a piece of the network by staring at it for a few hours, we would need every single person on Earth to be staring at this network to really get through all of the work of explaining it.
And because there’s so much we don’t know about these systems, I imagine the spectrum of positive and negative possibilities is pretty wide.
Yeah, I think that’s right. I think the story here really is about the unknowns. We’ve got something that’s not really meaningfully regulated and that is more or less useful for a huge range of valuable tasks. And we’ve got increasingly clear evidence that this technology is improving very quickly in directions that seem like they’re aimed at some very, very important stuff and potentially destabilizing to a lot of important institutions.
But we don’t know how fast it’s moving. We don’t know why it’s working when it’s working.
We don’t have any good ideas yet about how to either technically control it or institutionally control it. And if we have no idea what next year’s systems are gonna do, and if next year we have no idea what the systems the year after that are gonna do, it seems very plausible to me that that’s going to be the defining story of the next decade or so: how we come to a better understanding of this and how we navigate it.