Conversational Analytics / 08.08.2012

How Speech Recognition Works

The Speech Recognition market is growing fast and is estimated to be worth $58.4 billion by 2015. Many contact centers across the globe offer speech-based navigation, in which customers simply speak the name of the service they want rather than navigate lengthy touch-tone menus. Countless businesses in various industries also use speech solutions to automate and digitize their pen-and-paper processes. Most recently, Virtual Assistants such as Apple’s Siri and Micromax’s AISHA have become extremely popular among consumers.

While increasing numbers of people are enjoying the benefits of Speech Recognition technology today, few actually understand how it works. The technology is indeed complicated, and sophisticated speech engines require years of research and development. Ed Grabianowski of howstuffworks.com recently authored an extremely thorough explanation of Speech Recognition technology. In this post, we summarize the article in layman’s terms and then explain how we built a Speech Engine contextualized to Indian vernacular.

First, Grabianowski describes how speech is converted to data, which he breaks down into three primary steps:

  1. When you speak, you create vibrations in the air. The analog-to-digital converter (ADC) digitizes the sound by taking precise measurements of the wave at frequent intervals. The digitized sound is then filtered to remove unwanted noise.
  2. Next, the program divides the signal into small segments and matches these segments to known phonemes in the appropriate language. A phoneme is the smallest element of a language: a representation of the sounds we make and put together to form meaningful expressions.
  3. Finally, the program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was saying and either outputs it as text or issues a computer command.
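
The first two steps can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the 16 kHz sample rate, 16-bit depth, 25 ms segments and the function names are our own choices for the example, not details from Grabianowski's article, and a pure tone stands in for real speech. Noise filtering and phoneme matching are left out.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, duration_ms, bits=16):
    """Measure a pure sine tone at regular intervals and round each
    measurement to a signed integer, as an ADC would (assumed 16-bit)."""
    max_amp = 2 ** (bits - 1) - 1
    n_samples = sample_rate_hz * duration_ms // 1000
    return [
        round(max_amp * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
        for i in range(n_samples)
    ]

def split_into_frames(samples, frame_len, hop):
    """Cut the sample stream into short, overlapping segments."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]

# 100 ms of a 440 Hz tone sampled 16,000 times per second.
samples = sample_wave(freq_hz=440, sample_rate_hz=16000, duration_ms=100)

# 25 ms segments (400 samples) that start every 10 ms (160 samples).
frames = split_into_frames(samples, frame_len=400, hop=160)

print(len(samples), "samples,", len(frames), "segments")
```

In a real engine, each segment would next be converted into acoustic features and compared against models of known phonemes.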

The last step is by far the most difficult one. Speech recognition systems have evolved considerably over time in pursuit of the most accurate way to analyze phonemes. Today’s speech recognition systems use powerful statistical models, built on probability and mathematical functions, to determine the most likely outcome. In these models, as Grabianowski describes, each phoneme is like a link in a chain, and the completed chain is a word. However, the chain branches off in different directions as the program attempts to match the digital sound with the phoneme that’s most likely to come next. During this process, the program assigns a probability score to each phoneme, based on its built-in dictionary and user training.
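
To make the chain idea concrete, here is a toy Python sketch. It is a deliberate simplification rather than the statistical model a production engine uses: the three-word dictionary, the phoneme symbols and the probability scores are all invented for the example, and every word is assumed to have exactly one phoneme per sound segment.

```python
import math

# Hypothetical pronunciation dictionary: each word is a chain of phonemes.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "bat": ["b", "ae", "t"],
}

def score_word(phonemes, segment_probs, floor=1e-6):
    """Sum the log-probabilities of each phoneme in the chain.
    Phonemes the acoustic step did not propose get a small floor score."""
    if len(phonemes) != len(segment_probs):
        return float("-inf")
    return sum(
        math.log(probs.get(ph, floor))
        for ph, probs in zip(phonemes, segment_probs)
    )

def best_word(segment_probs, lexicon=LEXICON):
    """Return the dictionary word whose phoneme chain scores highest."""
    return max(lexicon, key=lambda w: score_word(lexicon[w], segment_probs))

# Made-up scores for three consecutive sound segments.
observed = [
    {"k": 0.7, "b": 0.3},
    {"ae": 0.9, "eh": 0.1},
    {"t": 0.6, "p": 0.4},
]

print(best_word(observed))
```

With these invented scores, "cat" wins because its chain has the highest combined probability (0.7 x 0.9 x 0.6), ahead of "cap" and "bat".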

This process is most complicated for phrases and sentences, as the system has to figure out where each word starts and stops. Grabianowski gives the example of the phrase “recognize speech,” which sounds a lot like “wreck a nice beach.” To get it right, the program has to analyze the phonemes in the context of the words that came before them. The challenge becomes enormous as the vocabulary of the speech engine grows. For example, if a program has a vocabulary of 60,000 words, a sequence of three words could be any of 216 trillion possibilities.
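
The arithmetic behind that figure, and the way word context can separate two similar-sounding phrases, can be sketched in a few lines of Python. The word-pair probabilities below are invented for the example, and the simple "previous word" model is our assumption for illustration; it is not taken from Grabianowski's article.

```python
import math

VOCAB_SIZE = 60_000
# Every position in a three-word sequence can hold any of the 60,000 words.
three_word_sequences = VOCAB_SIZE ** 3
print(f"{three_word_sequences:,}")  # 216,000,000,000,000

# Made-up probabilities of a word given the word before it.
# "<s>" marks the start of the utterance.
WORD_PAIR_PROBS = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.005,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}

def sequence_log_prob(words, pair_probs, floor=1e-8):
    """Score a word sequence by chaining 'word given previous word' scores."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(pair_probs.get((previous, word), floor))
        previous = word
    return total

candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
best = max(candidates, key=lambda c: sequence_log_prob(c, WORD_PAIR_PROBS))
print(" ".join(best))
```

Under these invented numbers the two-word reading scores higher, so the sketch picks "recognize speech". A real engine combines scores like these with the acoustic scores instead of relying on word context alone.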

The only way to create a Speech Recognition system that is sophisticated enough to overcome these challenges is by providing the statistical system with thousands of hours of human-transcribed speech and hundreds of megabytes of text. This is why Uniphore’s partnership with IIT-Madras is so important. We tap into the network and research facilities of this premier institution in order to collect the exemplary training data necessary for our speech solutions to reach their optimal performance. Together, we are able to gather voice samples across Indian languages, vernacular, speaking patterns, and noise conditions. This training data is used to create acoustic models of words, word lists, and multi-word probability networks, enabling a robust and reliable Speech Recognition engine for the Indian market.
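
As a rough illustration of how transcribed text can become a multi-word probability network, the Python sketch below counts how often each word follows another. It is a minimal example under our own assumptions, not a description of Uniphore's engine: the three-sentence corpus and the function name are hypothetical, and real systems train on vastly more text and smooth the resulting estimates.

```python
from collections import Counter

def train_word_pair_probs(sentences):
    """Estimate P(word | previous word) by counting adjacent word pairs."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # "<s>" marks the start
        for previous, current in zip(words, words[1:]):
            context_counts[previous] += 1
            pair_counts[(previous, current)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }

# Hypothetical transcripts of what callers might say.
transcripts = ["check my balance", "check my bill", "pay my bill"]
probs = train_word_pair_probs(transcripts)

for (previous, current), p in sorted(probs.items()):
    print(f"P({current} | {previous}) = {p:.2f}")
```

In this tiny corpus "bill" follows "my" in two of three cases, so the model learns to prefer it over "balance" after hearing "my".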

About Uniphore: Uniphore Software Systems is the leader in multilingual speech-based software solutions. Uniphore’s solutions allow any machine to understand and respond to natural human speech, thus enabling humans to use the most natural of communication modes, speech, to engage and instruct machines. Uniphore operates from its corporate headquarters at IIT Madras Research Park, Chennai, India and has sales offices in the Middle East (Dubai, UAE) as well as in Manila, Philippines.