The algorithms that detect hate speech online are biased against black people

A new study shows that leading AI models are 1.5 times more likely to flag tweets written by African Americans as “offensive” compared to other tweets.

Shirin Ghaffary is a senior Vox correspondent covering the social media industry. Previously, Ghaffary worked at BuzzFeed News, the San Francisco Chronicle, and TechCrunch.

Platforms like Facebook, YouTube, and Twitter are banking on developing artificial intelligence technology to help stop the spread of hateful speech on their networks. The idea is that complex algorithms that use natural language processing will flag racist or violent speech faster and better than human beings possibly can. Doing this effectively is more urgent than ever in light of recent mass shootings and violence linked to hate speech online.

But two new studies show that AI trained to identify hate speech may actually end up amplifying racial bias. In one study, researchers found that leading AI models for processing hate speech were one-and-a-half times more likely to flag tweets as offensive or hateful when they were written by African Americans, and 2.2 times more likely to flag tweets written in African American English (which is commonly spoken by black people in the US). Another study found similar widespread evidence of racial bias against black speech in five widely used academic data sets for studying hate speech that totaled around 155,800 Twitter posts.

This is in large part because what is considered offensive depends on social context. Terms that are slurs when used in some settings, like the “n-word” or “queer,” may not be in others. But algorithms, and the content moderators who label the training data that teaches these algorithms how to do their job, usually don’t know the context of the comments they’re reviewing.

Both papers, presented at a recent prestigious annual conference for computational linguistics, show how natural language processing AI, which is often proposed as a tool to objectively identify offensive language, can amplify the same biases that human beings have. They also show that the training data that feeds these algorithms has bias baked in from the start.

No one knows for sure if the content moderation systems that Facebook, Twitter, and Google use show exactly the same biases found in these studies; the technology these companies use to moderate content is proprietary. But the tech giants often turn to academics for guidance on how to better enforce standards around hate speech. So, if top researchers are finding flaws in widely used academic data sets, that presents a significant problem for the tech industry at large.

Anecdotally, activists have for some time accused platforms like Facebook of policing the speech of black Americans more strictly than that of white Americans. In one notable case reported on by Reveal, a black woman was banned from Facebook for posting the same “Dear White People” note that many of her white friends posted without suffering any consequences.

But these experiments provide quantitative data to suggest that these actions are not isolated incidents but are instead emblematic of a wider issue in how offensive content is policed on social media.

Flawed human decisions get reflected in algorithms

Maarten Sap, a PhD student in computer science and engineering, and his colleagues at the University of Washington set out to study what’s flagged as offensive on Twitter because of the important political conversations that happen on the platform. They first gathered more than 100,000 tweets used in two widely cited academic data sets. These tweets had been hand-labeled by human beings as “hate speech,” “offensive,” or “abusive.”

The results were astounding. Tweets written by self-identified African American users were, on average, 1.5 times more likely to be flagged as offensive than other tweets. The researchers then trained a natural language processing model on this labeled data, ran it on 56 million tweets, and saw that these biases were only further reinforced.
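To make a figure like “1.5 times more likely” concrete, here is a minimal Python sketch of one way such a ratio could be computed: the share of one group’s tweets that were flagged, divided by the share of a comparison group’s tweets that were flagged. This illustrates the arithmetic only and is not the researchers’ actual method. The function name `flag_rate_ratio`, the group names, and the toy data are all hypothetical.

```python
from collections import Counter


def flag_rate_ratio(records, group, reference):
    """Return how many times more often `group` is flagged than `reference`.

    `records` is an iterable of (group_label, was_flagged) pairs.
    Hypothetical helper for illustration; not taken from either study.
    """
    totals = Counter()
    flagged = Counter()
    for label, was_flagged in records:
        totals[label] += 1
        if was_flagged:
            flagged[label] += 1

    if totals[group] == 0 or totals[reference] == 0:
        raise ValueError("both groups need at least one record")
    if flagged[reference] == 0:
        raise ValueError("reference group has no flagged records")

    # (flagged_g / total_g) / (flagged_r / total_r), rearranged so the
    # division happens once and round-off stays small.
    return (flagged[group] * totals[reference]) / (
        totals[group] * flagged[reference]
    )


# Made-up toy data: 3 of 10 tweets flagged in one group, 2 of 10 in the other.
toy_records = (
    [("group_a", True)] * 3 + [("group_a", False)] * 7
    + [("group_b", True)] * 2 + [("group_b", False)] * 8
)

print(flag_rate_ratio(toy_records, "group_a", "group_b"))  # prints 1.5
```

In this toy example, 30 percent of one group’s tweets are flagged against 20 percent of the other’s, which gives the ratio of 1.5.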

Taking their research a step further, Sap and his colleagues decided to do something interesting. They primed workers labeling the same data to think about the user’s dialect and race when deciding whether the tweet was offensive or not. Their results showed that when moderators knew more about the person tweeting, they were significantly less likely to label that tweet as potentially offensive. At the aggregate level, racial bias against tweets associated with black speech decreased by 11 percent.

“The academic and tech sector are pushing ahead with saying, ‘let’s create automated tools of hate detection,’ but we need to be more mindful of minority group language that could be considered ‘bad’ by outside members,” Sap told Recode.

Sap’s study also tested the bias in these data sets as applied to an open source hate speech detecting tool for developers that’s run by Jigsaw, a subsidiary of Alphabet (Google’s parent company). The open source tool, called PerspectiveAPI, is used by news organizations such as the New York Times to help moderate comments online. It’s the publicly available version of an underlying technology that’s used throughout Google for its own products as well.

Researchers found that data run through PerspectiveAPI showed a significant bias against African American speech, labeling those tweets as toxic more often.

“There are lots of different biases that can arrive in machine learning models,” Jigsaw COO Dan Keyserling told Recode, acknowledging that PerspectiveAPI’s model — or any hate speech-detecting model — isn’t perfect. “We welcome more research in this field.” Keyserling said that his team is in touch with the authors of the report and that the company is constantly refining its model to be more fair.

But the problem is bigger than any one model or data set.

Thomas Davidson, a researcher at Cornell University, ran a study very similar to Sap’s. Davidson and his colleagues tested racial biases by training a model with data sets including the ones that Sap used, plus three more. Researchers also found “substantial racial bias” against African Americans in all of the data sets tested.

“What we’re drawing attention to is the quality of the data coming into these models,” Davidson told Recode. “You can have the most sophisticated neural network model, but the data is biased because humans are deciding what’s hate speech and what’s not.”

Both researchers offered the same warning: Automated systems for flagging hateful speech may be turning out flawed results.

But there’s also no specific consensus on what to do about it.

The researchers and others have advocated for giving moderators more social context about the people writing tweets, but that can prove tricky. When Facebook’s content moderation guidelines are already under scrutiny, would giving moderators more context open the door for more criticism, particularly from conservatives? And when content moderators at Facebook and other companies are reportedly working under grueling conditions and are pressured to rate content as offensive or not, making more nuanced decisions could make a difficult job even harder.

Nevertheless, these studies crack open the fantasy that AI will be able to rescue tech companies from making the complex decisions needed to police hateful speech on their platforms. These algorithms may seem like an easy solution to a complex problem, but they can have unintended consequences.
