Speech Recognition

Speech recognition is a process that allows people to speak naturally to a computer on any topic and to be understood accurately. Speech is a form of communication we learn early and practice often, so the use of speech recognition software can simplify computer interfaces and make computers accessible to users unable to key text using a standard keyboard. However, computer-based speech recognition is more difficult to achieve than one might at first assume.

The speech recognition process is statistical in nature and is based on Hidden Markov Models (HMMs). An HMM is a finite set of states, each of which is associated with a probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. The HMM is first trained using speech data for which the associated text is known. Subsequently, the trained HMM is used to "decode" new speech data into text.
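The decoding step described above can be sketched with a toy model. This is a minimal illustration, not a production recognizer: the two states, the observation labels "hiss" and "tone", and every probability are invented, and the training step is skipped by setting the parameters by hand. The function uses the Viterbi algorithm, a standard way to find the most probable state sequence in an HMM.

```python
# Toy HMM: hidden states stand for two speech sounds, observations are coarse
# acoustic labels. All names and probabilities are assumptions for illustration;
# a real system would estimate them from transcribed speech data.
states = ["s", "iy"]
start_p = {"s": 0.8, "iy": 0.2}
trans_p = {"s": {"s": 0.6, "iy": 0.4}, "iy": {"s": 0.1, "iy": 0.9}}
emit_p = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

def viterbi(observations):
    """Return (probability, state path) of the most probable state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

prob, path = viterbi(["hiss", "hiss", "tone", "tone"])
print(path)  # ['s', 's', 'iy', 'iy']
```

Real systems work with small probabilities over long utterances and therefore usually add log probabilities rather than multiplying raw values as this sketch does.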

The recognition vocabulary and vocabulary size play a key role in determining the accuracy of a system. A vocabulary defines the set of words that can be recognized by a speech recognition system. In addition, a language model is used to estimate the probability of a sequence of words in a particular domain. The language model assists the speech engine in recognizing speech by biasing the output toward high-probability word sequences. Together, vocabularies and language models are used in the selection of the best match for a word by the speech recognition engine. Therefore, speech systems can only "hear" words that are present in the vocabulary; a word that is not in the vocabulary will be misinterpreted as a similar sounding word that is present in the vocabulary.
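A small sketch can show how a vocabulary and a language model bias the output toward likely word sequences. It assumes a bigram model with add-one smoothing, which is only one of several possible language models, and the three training sentences are invented. For simplicity it rejects any hypothesis that contains an out-of-vocabulary word, rather than modeling how a recognizer substitutes a similar-sounding word.

```python
import math
from collections import Counter

# Tiny invented training text; a real language model is estimated from far more data.
corpus = [
    "it can be quite loud",
    "you must be quite sure",
    "the bird has a white beak",
]

# "<s>" marks the start of each sentence.
sentences = [["<s>"] + line.split() for line in corpus]
vocabulary = {w for sent in sentences for w in sent}
contexts = Counter(w for sent in sentences for w in sent[:-1])
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))

def log_prob(words):
    """Add-one smoothed bigram log probability of a word sequence."""
    if any(w not in vocabulary for w in words):
        return float("-inf")  # words outside the vocabulary can never be output
    total = 0.0
    for prev, word in zip(["<s>"] + words, words):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocabulary)))
    return total

# Two hypotheses that sound alike; the language model prefers the likelier one.
candidates = [["be", "quite"], ["beak", "white"]]
best = max(candidates, key=log_prob)
print(best)  # ['be', 'quite']
```

In a full recognizer this score would be combined with an acoustic score; here the language model alone decides between the candidates.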

Since speech recognition is probabilistic, the most probable decoding of the audio signal is output as the recognized text, but multiple hypotheses are considered during the process. Recognition systems generally have no means to distinguish between correctly and incorrectly recognized words. Therefore, during recognition, a "word lattice representation" is often used to consider all hypothesized word sequences. A word lattice representation is an acyclic directed graph that consists of nodes and arcs used to represent the multiple hypotheses considered during recognition. The nodes represent points in time, and the arcs represent the hypothesized words. The path with the highest probability is generally output as the final recognized text. Often, the multiple hypotheses (for example, phrases such as "be quite" and "beak white") sound the same and may only be distinguished by higher-level semantic knowledge provided by the language model.
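The search for the highest-probability path through a word lattice can be sketched as follows. The lattice, its node numbers, and the arc probabilities are invented for illustration, and each arc carries a single combined score, which is a simplification of how real recognizers score hypotheses.

```python
import math

# Arcs: (start_node, end_node, word, probability). Nodes are points in time.
# Words and scores are assumptions made up for this example.
arcs = [
    (0, 2, "be", 0.6),
    (0, 2, "bee", 0.3),
    (0, 3, "beak", 0.4),
    (2, 5, "quite", 0.7),
    (2, 5, "quiet", 0.2),
    (3, 5, "white", 0.5),
]

def best_path(arcs, start, end):
    """Return (log probability, words) of the best path in an acyclic lattice."""
    # best[node] = (log probability, words) of the best partial path reaching node
    best = {start: (0.0, [])}
    # Every arc moves forward in time, so sorting by start node gives a
    # topological order: all arcs into a node are seen before arcs out of it.
    for src, dst, word, prob in sorted(arcs):
        if src not in best:
            continue  # node not reachable from the start
        score = best[src][0] + math.log(prob)
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best[end]

score, words = best_path(arcs, start=0, end=5)
print(words)  # ['be', 'quite']
```

Because the graph is acyclic, one pass over the arcs is enough; no hypothesis needs to be revisited.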

Speech Recognition Applications

Speech recognition applications may be classified into three categories: dictation systems, navigational or transactional systems, and multimedia indexing systems. Each category of applications has a different tolerance for speech recognition errors. Advances in technology are bringing closer the goal of any individual being able to speak naturally to a computer on any topic and to be understood accurately.

Dictation Applications.

Dictation applications are those in which the words spoken by a user are transcribed directly into written text. They are used to create text such as personal letters, business correspondence, or e-mail messages. Usually, the user has to be very explicit, specifying all punctuation and capitalization in the dictation. Dictation applications often combine mouse and keyboard input with spoken input. Using speech to create text can still be a challenging experience since users have a hard time getting used to the process of dictating. Best results are achieved when the user speaks clearly, enunciates each syllable properly, and has organized the content mentally before starting. As the user speaks, the text appears on the screen and is available for correction. Correction can take place either with traditional methods such as a mouse and keyboard, or with speech.

Transactional Applications.

Speech is used in transactional applications to navigate around the application or to conduct a transaction. For example, speech can be used to purchase stock, reserve an airline itinerary, or transfer bank account balances. It can also be used to follow links on the web or move from application to application on one's desktop. Most often, but not exclusively, this category of speech applications involves the use of a telephone. The user speaks into a phone, the signal is interpreted by a computer (not the phone), and an appropriate response is produced. A custom, application-specific vocabulary is usually used; this means that the system can only "hear" the words in the vocabulary. This implies that the user can only speak what the system can "hear." These applications require careful attention to what the system says to the user since these prompts are the only way to cue the user as to which words can be used for a successful outcome.

Multimedia Indexing Applications.

In multimedia indexing applications, speech is used to transcribe words from an audio file into text. The audio may be part of a video. Subsequently, information retrieval techniques are applied on the transcript to create an index with time offsets into the audio. This enables a user to search a collection of audio/video documents using text keywords. Retrieval of unstructured multimedia documents is a challenge; retrieval using keyword search based on speech recognition is a big step toward addressing this challenge. It is important to have realistic expectations with respect to retrieval performance when speech recognition is used. The user interface design is typically guided by the "search the speech, browse the video" metaphor where the primary search interface is through textual keywords, and browsing of the video is through video segmentation techniques. In general, it has been observed that the accuracy of the top-ranking search results is more important than finding every relevant match in the audio. So, speech indexing systems often bias their ranking to reflect this. Since the user does not directly interact with the indexing system using speech input, standard search engine user interfaces are seamlessly applicable to speech indexing interfaces.
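The index with time offsets described above can be sketched as a simple inverted index. The file names, words, and timings are hypothetical recognizer output made up for this example, and ranking of results is left out.

```python
from collections import defaultdict

# Hypothetical recognizer output: (document id, word, start time in seconds).
transcript = [
    ("news.mp4", "the", 0.0),
    ("news.mp4", "stock", 0.4),
    ("news.mp4", "market", 0.9),
    ("talk.mp4", "speech", 12.5),
    ("talk.mp4", "market", 13.1),
]

def build_index(words):
    """Map each lowercased word to the (document, offset) pairs where it was heard."""
    index = defaultdict(list)
    for doc, word, offset in words:
        index[word.lower()].append((doc, offset))
    return index

index = build_index(transcript)

# A text keyword search returns positions a player could jump to.
hits = index.get("market", [])
print(hits)  # [('news.mp4', 0.9), ('talk.mp4', 13.1)]
```

Since the index stores recognized words, a word the recognizer got wrong will not be found, which is one reason to keep expectations of retrieval performance realistic.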

Conclusion

Speech recognition technology has progressed to the point that it is practical to consider speech input in applications. Speech recognition is also gaining acceptance as a means of creating searchable text from audio streams. Dictation applications have the highest accuracy requirements and must be designed for efficient error correction. Transactional applications are more tolerant of speech errors but require careful design of the constrained vocabulary and cueing of the user. Multimedia indexing applications are also tolerant of speech errors since the search algorithm can be adapted to meet the requirements of the application.

See also: Input Devices; Neural Networks; Pattern Recognition.

Bibliography

Karat, C., et al. "Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems." Proceedings of CHI '99: Human Factors in Computing Systems (1999): 568-575.

Rabiner, L. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77, no. 2 (1989): 257-286.

Schmandt, C. Voice Communications with Computers. New York: Van Nostrand Reinhold, 1994.

Wactlar, H., et al. "Lessons Learned from Building a Terabyte Digital Video Library." IEEE Computer (1999): 66-73.

Yankelovich, N. "How Do Users Know What to Say?" ACM Interactions 3, no. 6 (1996).

    Copyrights
    Speech Recognition from Macmillan Science Library: Computer Sciences. Copyright © 2001-2006 by Macmillan Reference USA, an imprint of the Gale Group. All rights reserved.
