Watson claims to predict cancer, but who trained it to ‘think?’

Watson cannot read handwriting. Machine learning’s true potential is tied to human inputs.

The computer housing case for IBM's Watson computer in New York City.
Andrew Spear / Getty

By beating humans at games of Go and Jeopardy, artificial intelligence engines like Google’s DeepMind and IBM's Watson have captured attention for their promise of solving bigger human problems. Watson, for example, is being enlisted to help doctors predict cancer in patients.

The American internet pioneer Douglas Engelbart suggests that AI’s grandest promise is the amplification of human ability. Whether it’s automating rote cognitive tasks like tagging people in photos or assisting in complex workflows like cancer treatment, the human-augmentation promise feels almost inevitable in every product and domain.

Data has crowned a new king in AI. In deep learning, the technical approach at the root of AI fever, every breakthrough in the last several years has occurred because there exists a large and highly accurate training dataset — a dataset that relies on human input. It turns out that progress toward Engelbart's hypothesis of amplification of human ability requires massive human effort first, in order to actually power the AI.

The emergence of large and highly accurate datasets has allowed deep learning to "train" algorithms to recognize patterns in digital representations of sounds, images and other data. The results have been remarkable breakthroughs that outperform previous approaches in almost every application area. For example, self-driving cars rely on massive amounts of data collected over several years from efforts like Google’s people-powered street canvassing, which provides the ability to "see" roads (and was started to power services like Google Maps). The photos we upload and collectively tag as Facebook users have led to algorithms that can "see" faces. And even Google’s 411 audio directory service from a decade ago was suspected to be an effort to crowdsource data to train a computer to "hear" about businesses and their locations.

Watson’s promise to help detect cancer also depends on data: decades of doctors’ notes containing cancer patient outcomes. However, Watson cannot read handwriting. To access the data trapped in those historical reports, researchers must have had to employ an army of people to painstakingly type and re-type (for accuracy) the data into computers before it could be used to train Watson. This is yet another example of the substantial manual effort required to capture the training data that is the core input of deep learning.

Just as Watson researchers recognized that the keys to cancer prediction lie within oncologists’ backroom shelves, a growing number of technology leaders in health and other regulated industries are realizing that they are not data-poor. They are turning toward their paper processes and legacy paper archives and seeing the stacks and folders with the eyes of a digital prospector looking at her iron mountain.

Large insurance organizations are sifting through the hieroglyphics of massive collections of hundreds of millions of pages containing policyholder data using deep learning models from my company, Captricity. They are extracting data from death certificates so the next generation of insurance products can leverage what they recognize to be their sole business advantage: Training data that literally spans lifetimes.

In the nonprofit sector, PATH, a global health nonprofit, uses the same deep learning models to digitize data out of photos of bound clinical registers' pages, so that kids who attend rural clinics can more efficiently get their vaccines. A recent effort has allowed PATH to find systematic tracking problems and reprioritize their efforts to keep Tanzanian kids healthy.

Modern AI is in an era of building the foundation for interpreting the most common mediums of human communication: Photos, videos, sounds and writing. For AI to become truly revolutionary as is hoped (and expected), able to do such things as predicting cancer, it must focus on fundamental capabilities before subsequent augmentation. Amid the hype around the potential of future applications of AI, we should first ask the question: Where did the training data come from?


Kuang Chen, PhD, is the founder and CEO of Captricity, a leading Data-as-a-Service (DaaS) company that transforms handwritten paper forms into digital data. On a mission to democratize data access, the company’s crowd-guided deep learning software helps organizations in both the public and private sectors fight expensive, time-consuming and ineffective paper processes. Reach him @kuang.

This article originally appeared on Recode.net.
