
The Future of Voice: What's Next After Siri, Alexa and Ok Google

Another perfect storm of market conditions is brewing for a second wave of virtual personal assistants and conversational interfaces, exceeding the first in both intelligence and pervasiveness.


Ever since the dawn of computing technology in the 1950s, scientists and consumers alike have dreamed of bridging the gap between man and machine with natural spoken language. As machines began to outperform humans in complex calculation-based tasks, it became frustrating that they should lag so far behind in understanding language, that most basic building block that separates us from other animals, particularly when our own species’ infants pick up language quickly and instinctively.

Despite scientists dedicating their lives to the challenge over several decades, until recently, only very slow progress had been made in teaching machines to understand spoken language at all, let alone with human-level proficiency.

The first significant advances came in speech recognition, the ability to convert sound waves into text representing spoken words. Advances in speech recognition long predated the ability to understand meaning. By the ’90s, speech recognition was sufficient to power automated corporate call centers across the globe, representing the first time speech technology stepped out of the research laboratory and into the world of business.

While speech recognition capabilities were sufficient to power menu-driven, command-and-control IVR (“interactive voice response”) phone systems, speech technology has traditionally fallen short of bringing to life that science-fiction dream of speaking conversationally to a machine and having it genuinely understand your intent. Command-and-control systems with set inputs and preprogrammed responses are like a dog that can “fetch” or “roll over.” By contrast, a large-vocabulary system with natural language understanding (NLU) is humanlike: Flexible, consistently learning and responsive to millions of statements and queries it’s hearing for the very first time.
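The contrast between the two approaches can be made concrete with a toy sketch. This is illustrative only: the phrase tables, intent names and scoring scheme below are invented for the example (real NLU systems use statistical models trained on large datasets, not word-overlap scores), but it shows why a fixed command table breaks on natural phrasing while an intent classifier does not.

```python
# A command-and-control system: fixed phrases mapped to fixed actions.
# Anything outside the preprogrammed list triggers a reprompt.
COMMANDS = {
    "check balance": "balance_inquiry",
    "pay bill": "bill_payment",
}

def command_and_control(utterance: str) -> str:
    # Only exact, preprogrammed phrases are understood.
    return COMMANDS.get(utterance.lower().strip(), "reprompt")

# A toy NLU-style classifier: scores each intent by vocabulary overlap,
# so paraphrases the system has never seen can still be matched.
INTENT_VOCAB = {
    "balance_inquiry": {"balance", "much", "money", "account", "left"},
    "bill_payment": {"pay", "bill", "payment", "owe"},
}

def toy_nlu(utterance: str) -> str:
    words = set(utterance.lower().split())
    scores = {intent: len(words & vocab) for intent, vocab in INTENT_VOCAB.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "reprompt"

# The rigid system fails on a natural paraphrase; the toy classifier does not.
print(command_and_control("how much money is left in my account"))  # reprompt
print(toy_nlu("how much money is left in my account"))               # balance_inquiry
```

The dog-trick analogy maps directly: the first function only "fetches" on the exact command it was taught, while the second generalizes, however crudely, to word configurations it has never encountered.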

Conversational Interfaces: Why Now?

The first generation of virtual personal assistants was conceived in response to improved speech recognition, faster wireless speeds, the cloud computing boom and a new type of consumer: The hyper-connected smartphone user, navigating a busy life, often on the go and eager to abandon the slow clumsiness of virtual keyboard input. Initially capturing the public’s fascination with a roar of media buzz, the realities of the technology soon fell short of high user expectations.

About five years later, another perfect storm of market conditions is brewing for a second wave of virtual personal assistants and conversational interfaces, exceeding the first in both intelligence and pervasiveness. This new wave of voice-driven assistant technologies rides on the back of advances in artificial intelligence, rich collections of user data and growth in keyboardless and screenless devices. Additionally, great speech recognition is now built into every major operating system. Google, Apple, Baidu, Microsoft and Amazon provide this capability for free, enabling a new generation of apps to drive user adoption.

The new wave of voice-driven assistants finally embodies the dreams scientists and consumers have held for so long, legitimately understanding the meaning — and delivering on the intent — of naturally spoken queries. Older assistants hinted at what was to come, but relied on a fragile illusion of conversational ability: Sometimes customer queries were grouped into ill-fitting categories that triggered scripted responses. Other times, optimal responses were excluded when they lacked a required keyword. This was problematic because real human conversation isn’t rigid; it is expansive, encompassing millions of different concepts and word configurations. The new wave of voice-driven technology meets this challenge head-on, interpreting and responding to queries with genuine intelligence.

What the Future Holds

Advanced voice technology will soon be ubiquitous, as natural and intelligent user interface technology integrates seamlessly into daily life.

Voice will be a primary interface for the connected home, providing a natural means to communicate with alarm systems, lights, kitchen appliances, sound systems and more, as users go about their day-to-day lives.

More and more cars on the market will adopt intelligent, voice-driven systems for entertainment and location-based search, keeping drivers’ and passengers’ eyes and hands free.

Audio and video entertainment systems will count on naturally spoken voice for content discovery to match the multitude of ways users think about content. (This can already be seen in products like the new Apple TV, which rejects any apps whose core functionality doesn’t support the voice-controlled Siri remote.)

Small-screened and screenless wearables will continue their climb in popularity, requiring smooth, fluid means of operation as users come to expect ever-better results.

Voice-controlled devices will also dominate workplaces that require hands-free mobility, such as hospitals, warehouses, laboratories and production plants.

According to comScore, 200 billion searches per month will be conducted by voice by 2020, creating a market opportunity around voice search worth more than $50 billion per year. As demand for voice both intensifies and diversifies, humanlike language understanding must remain front and center. Users are notoriously unforgiving when it comes to natural language technology; just two or three slip-ups can greatly erode confidence in a system.

Breadth also poses a challenge to delivering high-level natural language understanding. Every knowledge domain requires that systems not only recognize new, specialized terminology but also gain an appreciation for how meanings of words shift in a new context. What’s more, a natural language system cannot deliver sufficiently high accuracy by being an expert on shoe stores alone; it must also be an expert on your shoe store, and be trained on your specific product catalog.

Thanks to groundbreaking advancements in artificial intelligence, we are finally overcoming these formerly insurmountable challenges by training and adapting systems through machine learning to smoothly address a wide variety of conditions and requirements. Intelligent virtual assistants built into mobile operating systems keep getting better, and consumers are already demanding voice in their cars, mobile apps and home appliances.

While large companies like Google, Apple, Amazon, Microsoft and Baidu are off to a running start, a new breed of AI companies, like my company, MindMeld, are emerging to provide solutions to the growing number of businesses that now need conversational interfaces to stay competitive. At long last, verbally communicating with machines in natural language is no longer a science-fiction fantasy.


Tim Tuttle, the founder and CEO of Expect Labs/MindMeld, started his career at the MIT Artificial Intelligence Lab, where he received his PhD. He has also served on the research faculty at MIT as well as at Bell Laboratories. His first company built the Internet’s first large-scale CDN for real-time data. His second company, Truveo, built the Web’s second-largest video search platform, reaching more than 70 million monthly visitors; it was acquired by AOL, and Tuttle served as senior vice president at AOL, responsible for the Truveo business unit. He is the author of 18 technical publications and has been awarded several patents. Reach him @tim_tuttle.

This article originally appeared on Recode.net.