The first arguments that AI “misalignment” — when artificially intelligent systems do not do what humans ask of them, or fail to align with human values — could pose a huge risk to humankind came from philosophers and autodidacts on the fringes of the actual AI industry. Today, though, the leading AI company in the world is pledging one-fifth of its computing resources, worth billions of dollars, toward working on alignment. What happened? How did AI companies, and the White House, come to take AI alignment concerns seriously?
Paul Christiano and Beth Barnes are key characters in the story of how AI safety went mainstream.
Christiano has been writing about techniques for preventing AI disasters since he was an undergrad, and as a researcher at OpenAI he led the development of what is now the dominant approach to preventing flagrant misbehavior from language and other models: reinforcement learning from human feedback, or RLHF. In this approach, actual human beings are asked to evaluate outputs from models like GPT-4, and their ratings are used to fine-tune the model so that its outputs align better with human values.
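To make the idea concrete, here is a toy sketch of the preference-learning step at the heart of RLHF: fitting a "reward model" to human comparisons, so that outputs people prefer get higher scores. Everything here (the linear model, the two-feature outputs, the data) is illustrative, not OpenAI's actual implementation; real systems use large neural networks and then use the learned reward signal to fine-tune the language model with reinforcement learning.

```python
import math
import random

def score(weights, features):
    """Reward model: a linear score over hand-made output features (a toy stand-in)."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit weights so human-preferred outputs score higher (Bradley-Terry loss)."""
    random.seed(0)
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            # P(labeler prefers the first output) = sigmoid(score difference)
            diff = score(w, preferred) - score(w, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient ascent on the log-likelihood of the human's choice
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

# Each "output" is a made-up 2-feature vector, e.g. (helpfulness, harmlessness).
comparisons = [
    ((1.0, 1.0), (0.2, 0.1)),  # labeler preferred the first output
    ((0.8, 0.9), (0.9, 0.0)),  # a balanced answer beats a helpful-but-harmful one
]
w = train_reward_model(comparisons, dim=2)

# The trained reward model now ranks outputs the way the labelers did;
# in full RLHF, this signal steers further fine-tuning of the model itself.
assert score(w, (1.0, 1.0)) > score(w, (0.2, 0.1))
```

The key point the sketch captures is that RLHF never needs humans to write down their values explicitly; it only needs them to say which of two outputs they like better, and the preference data does the rest.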
It was a step forward, but Christiano is hardly complacent, and often describes RLHF as merely a simple first-pass approach that might not work as AI gets more powerful. To develop methods that could, he left OpenAI to found the Alignment Research Center (ARC). There, he is pursuing a research agenda called “eliciting latent knowledge” (ELK), which aims to develop methods that force AI models to tell the truth and reveal everything they “know” about a situation, even when they might normally be incentivized to lie or hide information.
That is only half of ARC’s mission, though. The other half, soon to become its own independent organization, is led by Beth Barnes, a brilliant young researcher (she got her bachelor’s degree from Cambridge in 2018) who did a short stint at Google DeepMind before joining Christiano, first at OpenAI and now at ARC. Barnes is in charge of ARC Evals, which conducts model evaluations: She works with big labs like OpenAI and Anthropic to pressure-test their models for dangerous capabilities. For example, can GPT-4 set up a phishing page to get a Harvard professor’s login details? Not really, it turns out: It can write the HTML for the page, but fails to find web hosting.
But can GPT-4 use TaskRabbit to hire a human to do a CAPTCHA test for it? It can — and it can lie to the human in the process. You may have heard of that experiment, for which Barnes and the evaluations team at ARC were responsible.
The reputations of ARC, ARC Evals, and their leaders are so formidable in AI safety circles that reassuring people that it’s okay not to be as smart as Paul Christiano has become a bit of a meme. And it’s true: It’s totally fine not to be as smart as Christiano or Barnes (I’m definitely not). But I’m glad that people like them have taken on a problem this serious.