From Audio to Action: Demystifying Speech Recognition


As speech recognition starts to live up to its promise, more and more businesses are exploring how they can apply this technology to their operations. One of the first questions we always get is, “How does the technology work?” Below, we describe how streaming audio is understood, converted to text, and applied to automate and enhance communication. Enjoy!

The first step of speech recognition is turning a continuous audio stream into the basic sounds of the language, which are called phonemes. Every language has a different set of phonemes, from which all its native words are built. In order to identify the sequence of phonemes in continuous speech, the computer divides the incoming audio into short time-slices called “frames”.
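The framing step can be sketched in a few lines of Python. The frame length and hop size below are common choices in speech processing (25 ms windows advancing in 10 ms steps), used here as illustrative assumptions rather than this platform's actual settings:

```python
def frame_audio(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a list of audio samples into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# One second of 16 kHz audio yields 98 overlapping 25 ms frames.
frames = frame_audio([0.0] * 16000)
print(len(frames), len(frames[0]))
```

Overlapping frames are used so that a phoneme falling on a frame boundary is still captured whole by a neighboring frame.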

For each frame, the computer measures the “strength” of pre-defined frequency bands within the overall range of speech. Thus, each frame is converted into a set of numbers, one number per frequency band. The recognition engine uses a reference table to find the best-matched phoneme for a given frame. This table contains representative frequency-band strengths for each phoneme. This is why users are asked to record a profile when using the speech recognition platform for the first time – it needs to know how you pronounce each of the phonemes, so as to get the best possible entries in the look-up table.
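The table lookup can be illustrated as a nearest-neighbor search. The phoneme symbols and band-strength values below are invented for the example (real engines use many more bands, and the profile recording tunes these reference vectors per speaker):

```python
import math

# Hypothetical reference table: "typical" strengths for three frequency
# bands (low, mid, high) per phoneme. Values are illustrative only.
PHONEME_TABLE = {
    "s":  [0.1, 0.2, 0.9],   # fricatives: energy concentrated in high bands
    "aa": [0.9, 0.5, 0.1],   # open vowels: energy concentrated in low bands
    "t":  [0.2, 0.3, 0.6],
}

def best_phoneme(band_strengths):
    """Return the phoneme whose reference vector is closest to this frame."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(PHONEME_TABLE, key=lambda p: dist(PHONEME_TABLE[p], band_strengths))

print(best_phoneme([0.85, 0.45, 0.15]))  # prints "aa"
```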

Next, the system must translate phonemes into words. The recognizer uses a lexicon which contains all the words it knows about together with their pronunciations – each pronunciation is described using phonemes. For example, the pronunciation of the word “cat” has three phonemes: ‘k’, ‘ae’, ‘t’. Some words have multiple pronunciations and the lexicon has these too, for example “either” has two common pronunciations – “ay dh ax” (eye-th-er) and “iy dh ax” (ee-th-er).
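A lexicon is essentially a dictionary from words to lists of pronunciations. This toy version holds just the two entries from the text, using the article's own phoneme notation:

```python
# Toy lexicon: each word maps to one or more pronunciations,
# and each pronunciation is a sequence of phonemes.
LEXICON = {
    "cat":    [["k", "ae", "t"]],
    "either": [["ay", "dh", "ax"], ["iy", "dh", "ax"]],  # eye-th-er / ee-th-er
}

def pronunciations(word):
    """Return all known pronunciations of a word (empty list if unknown)."""
    return LEXICON.get(word, [])

print(pronunciations("either"))  # both common variants
```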

As the recognizer moves along the sequence of phonemes, it looks for words “hidden” in the sequence. It is allowed to create overlapping sequences of words, as different words and phrases may share the same pronunciations. For example, the song lyric “life is butter melon cauliflower” sounds the same as “life is but a melancholy flower”.
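The search for hidden, possibly overlapping words can be sketched as a scan that tries every lexicon entry at every position. The lexicon here is made up for the example, and a real recognizer scores these hypotheses rather than simply listing them:

```python
# Toy single-pronunciation lexicon for the sketch.
LEXICON = {
    "cat":  ["k", "ae", "t"],
    "at":   ["ae", "t"],
    "tack": ["t", "ae", "k"],
}

def find_words(phonemes):
    """Return (start_index, word) for every lexicon word found in the
    phoneme sequence; overlapping matches are all kept."""
    matches = []
    for start in range(len(phonemes)):
        for word, pron in LEXICON.items():
            if phonemes[start:start + len(pron)] == pron:
                matches.append((start, word))
    return matches

# "k ae t ae k" hides three overlapping words.
print(find_words(["k", "ae", "t", "ae", "k"]))
```

Note how “cat” (positions 0–2) and “at” (positions 1–2) both claim the same phonemes, just as the two song lyrics above share one phoneme sequence.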

After the engine has identified its “candidate” word sequences, it must sort out which is the correct one by using language modeling. A language model describes speech patterns in terms of words which are likely to be seen together. All language models start from collections of the things people say or write in a particular context. For example, if you want to create a language model for the Times of India, you might compile a year’s worth of editions and generate the relative counts of all the two- and three-word sequences in that collection. Similarly, if you want to create a language model for a medical specialty, you might gather transcripts of reports in that specialty and compile the relative frequencies of all the two- and three-word sequences in those transcripts. Models constructed in this manner represent a kind of “average”, since they reflect the combined usage of many users within a given field.
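Counting two-word (bigram) sequences is straightforward; the tiny corpus below is invented for illustration, and production models are built from far more text and add smoothing for sequences never seen in the collection:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count every adjacent two-word sequence in a corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 1):
            counts[(words[i], words[i + 1])] += 1
    return counts

corpus = [
    "over their heads",
    "over there somewhere",
    "over their heads again",
]
counts = bigram_counts(corpus)
print(counts[("over", "their")], counts[("their", "heads")])  # prints "2 2"
```

Three-word (trigram) counts work the same way with a window of three.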

The language model helps us sort between the competing sequences of words from the conversion of phonemes into possible words and phrases. For example, suppose the recognition, thus far, has yielded two possible fragments — “over there” and “over their”. If the next word identified is “heads”, then the language model would help the engine choose “over their heads” as opposed to “over there heads”.
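That choice can be sketched as scoring each candidate by the product of its bigram probabilities and keeping the winner. The probabilities below are made up to illustrate the example; a real engine combines these language-model scores with the acoustic scores from the phoneme matching:

```python
# Hypothetical bigram probabilities, invented for this example.
BIGRAM_PROB = {
    ("over", "their"):  0.004,
    ("over", "there"):  0.006,
    ("their", "heads"): 0.020,
    ("there", "heads"): 0.0001,  # "there heads" is rarely written
}

def score(words, floor=1e-6):
    """Multiply bigram probabilities along the sequence; unseen pairs
    get a small floor probability instead of zero."""
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= BIGRAM_PROB.get(pair, floor)
    return p

candidates = [["over", "their", "heads"], ["over", "there", "heads"]]
best = max(candidates, key=score)
print(" ".join(best))  # prints "over their heads"
```

Even though “over there” is slightly more common on its own, the near-zero probability of “there heads” pulls the overall score down, so the engine picks “over their heads”.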

Now that the audio has been converted to text, the world is your oyster. This text is yours to be searched, shared, analyzed, or visualized – depending on what you want to achieve.

About Uniphore: Uniphore Technologies Inc is the leader in multilingual speech-based software solutions. Uniphore’s solutions allow any machine to understand and respond to natural human speech, thus enabling humans to use the most natural of communication modes, speech, to engage and instruct machines. Uniphore operates from its corporate headquarters at IIT Madras Research Park, Chennai, India and has sales offices in the Middle East (Dubai, UAE) as well as in Manila, Philippines.

