WO2002086862A1 - Speech recognition system - Google Patents

Speech recognition system

Info

Publication number
WO2002086862A1
WO2002086862A1 PCT/US2002/012574 US0212574W WO02086862A1 WO 2002086862 A1 WO2002086862 A1 WO 2002086862A1 US 0212574 W US0212574 W US 0212574W WO 02086862 A1 WO02086862 A1 WO 02086862A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speaker
speech
speech recognition
server
Prior art date
Application number
PCT/US2002/012574
Other languages
French (fr)
Inventor
William Hutchison
Original Assignee
William Hutchison
Priority date
Filing date
Publication date
Application filed by William Hutchison
Publication of WO2002086862A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present invention relates, in general, to voice recognition, and, more particularly, to software, systems, and methods for performing voice and speech recognition over a distributed network.
  • Speech recognition is used to provide enhanced services such as interactive voice response (IVR), automated phone attendants, voice mail, fax mail, and other applications. More sophisticated speech recognition systems are used for speech-to-text conversion systems used for dictation and transcription.
  • Voice and speech recognition systems are characterized by, among other things, their recognition accuracy, speed and vocabulary size.
  • High speed, accurate, large vocabulary systems tend to be complex and so require significant computing resources to implement.
  • such systems have increased training demands to develop accurate models of users' speech patterns.
  • Speech recognition products tend to be slow and/or inaccurate.
  • Speech recognition enabled software applications must often compromise between complex but accurate solutions and simple but less accurate solutions. In many applications, however, the impracticality of meaningful training dictates that the application can only implement less accurate techniques.
  • Voice recognition is of two basic types, speaker-dependent and speaker-independent.
  • A speaker-dependent system operates in environments where the system has relatively frequent contact with each speaker, where sizable vocabularies are involved, and where the cost of recognition errors is high. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker-adaptive or speaker-independent systems.
  • a user trains the system by, for example, providing speech samples and creating a correlation between the samples and text of what was provided, usually with some manual effort on the part of the speaker.
  • Voice models are substantial data files that characterize a particular speaker for which the system has been trained. The training process can involve significant effort to obtain high recognition rates.
  • the voice model files are tightly coupled to the recognition software so that it is difficult to port the training investment to other hardware/software platforms.
  • A speaker-independent system operates for any speaker of a particular type (e.g., American English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker-dependent systems. However, they are highly useful in a wide variety of applications where many users must use the system, such as answering services, interactive voice response (IVR) systems, call processing centers, data entry and the like. Such applications sacrifice the accuracy of speaker-dependent systems for the flexibility of enabling a heterogeneous group of speakers to use the system. Such applications are characterized in that high recognition rates are desirable, but the cost of recognition failure is relatively low.
  • a middle ground is sometimes defined as a speaker adaptive system.
  • a speaker adaptive system dynamically adapts its operation to the characteristics of new speakers. These systems are more akin to speaker-dependent models, but allow the system to be trained over time. Adaptive systems can improve their vocabulary over time and result in complex, but accurate speech models. Such systems still require significant training effort, however. As in speaker-dependent systems, the complex speech models cannot be readily ported to other systems.
  • Training methods tend to be very product specific. Moreover, the data structures in which the relationships between a user's speech and text are correlated tend to be product specific. Hence, the significant training effort applied to a first speech recognition program may not be reusable for any other program or system. In some cases, speakers must re-train systems between version updates of the same program. Temporary or permanent changes to a user's voice patterns affect performance and may require retraining. This significant training burden and lack of portability between products has worked against wide scale adoption of speech recognition systems.
  • each user tends to access computer resources via a variety of computer-implemented interfaces and computing hardware. It is contemplated that any given user may wish to access voice-enabled television, voice-enabled software on a personal computer, voice-enabled automobile controls, and the like. The effort to train and maintain each of these systems individually becomes significant with only a few applications, and prohibitive with the large number of applications that could potentially become voice enabled.
  • the present invention involves a speech recognition system in which one or more speaker-dependent voice signatures are developed for each of a plurality of speakers.
  • a plurality of configurable speech processing engines are deployed and integrated with computer applications.
  • a session is initiated between the configurable engine and a particular speaker.
  • the configurable engine identifies the user using voice recognition or other explicit or implicit user-identification methods.
  • the configurable engine accesses a copy of the speaker dependent voice signature associated with the identified speaker to perform speaker-dependent speech recognition.
  • the present invention involves voice signatures that are configured to integrate with and be used by a plurality of disparate voice-enabled applications.
  • the voice signature comprises a static data structure or a dynamically adapting data structure that represents a correlation between a speaker's voice patterns and language constructs.
  • the voice signature is preferably portable across multiple computer hardware and software platforms.
  • a plurality of voice signatures are stored in a network accessible repository for access by voice-enabled applications as needed.
  • Fig. 1 shows a computer environment in which the present invention is implemented.
  • Fig. 2 shows entities and relationships in a particular embodiment of the present invention.
  • Fig. 3 illustrates an exemplary packet structure in accordance with an embodiment of the present invention.
  • Fig. 4 shows a flow diagram of processes involved in an implementation of the present invention.
  • Fig. 5 depicts a distributed service model implementing functionality in accordance with the present invention.
  • The present invention is directed to voice processing systems characterized by a number of distinct aspects. In general, the systems and methods of the present invention are intended to reduce the burden on users and developers of speech recognition systems by enabling training files and voice models to be readily shared between disparate applications. Further, initial training and voice model adaptation can be implemented with greater efficiency by sharing voice information across multiple disparate applications.
  • The present invention provides a "voice processing substrate" or "voice processing service" upon which other software applications can build.
  • the present invention involves applications of the voice processing service such as interactive voice response, dictation and transcription services, voice messaging services, voice automated application services, and the like that share a common repository of speech recognition resources.
  • These applications, typically implemented as software applications, can leverage the aggregate knowledge about their users' voice and speech patterns by using the shared common speech recognition resources.
  • The present invention involves a distributed voice processing system in which the various functions involved in voice processing can be performed in a pipelined or parallel fashion.
  • Speech tasks differ significantly in purpose and complexity. In accordance with this aspect of the present invention, the processes involved in speech processing are modularized and distributed amongst a number of processing resources. This enables the system to employ only the required resources to complete a particular task. Also, this enables the processes to be implemented in parallel or in a pipelined fashion to greatly improve overall performance.
  • the present invention is illustrated and described in terms of a distributed computing environment such as an enterprise computing system using public communication channels such as the internet.
  • an important feature of the present invention is that it is readily scaled upwardly and downwardly to meet the needs of a particular application. Accordingly, unless specified to the contrary the present invention is applicable to significantly larger, more complex network environments as well as small network environments such as conventional LAN systems.
  • FIG. 1 shows an exemplary computing environment 100 in which the present invention may be implemented.
  • Speech server 101 comprises program and data constructs that function to receive requests from a variety of sources, access voice resources 105, and provide voice services in response to the requests.
  • the provided voice services involve accessing stored voice resources 105 that implement a central repository of resources that can be leveraged to provide services for a wide variety of requests.
  • the services provided by speech server 101 may vary in complexity from simply retrieving specified voice resources (e.g., obtaining a speech sample file for a particular user) to more complex speech recognition processes (e.g., feature extraction, phoneme recognition, phoneme-to-text mapping).
  • Requests to speech server 101 may come directly from voice appliances 102; however, in preferred examples requests come from "voice portals" 110.
  • Voice portals comprise software applications and/or software servers that provide a set of fundamental behaviors and that are voice enabled by way of their coupling to speech server 101.
  • Example voice portals include interactive voice response (IVR) services 111, dictation service 112 and voice mail service 113.
  • voice portals 110 access shared speech server 101 and shared voice resources 105, they do not each need to create, obtain, or maintain duplicate or special-purpose instances of the voice resources. Instead, the voice portals can focus on implementing the logic necessary to implement their fundamental behaviors, effectively outsourcing the complex tasks associated with voice processing.
  • a set 103 of voice appliances 102 represent the hardware and software devices used to implement voice-enabled user interfaces.
  • Exemplary voice appliances 102 include, but are not limited to, personal computers with microphones or speech synthesis programs, telephones, cellular telephones, voice over IP (VoIP) terminals, laptop and hand held computers, computer games and the like. Any given speaker may use a plurality of voice appliances 102. Likewise, any given voice appliance 102 may be used by multiple speakers.
  • a variety of techniques are used to perform voice processing.
  • speech recognition starts with the digital sampling of speech followed by acoustic signal processing.
  • Most techniques include spectral analysis such as Fast Fourier Transform (FFT) analysis, LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modeling and the like.
  • In phoneme recognition, the preprocessed files are parsed to identify groups of phonemes and words using techniques such as DTW (Dynamic Time Warping), HMM (hidden Markov modeling), NNs (Neural Networks), expert systems, N-grams and combinations of techniques.
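Of the techniques named above, dynamic time warping is the simplest to illustrate. The following is a minimal sketch, not the method of the invention: it assumes each utterance has already been reduced to a one-dimensional sequence of feature values, and the function names `dtw_distance` and `closest_template` as well as the template data are hypothetical.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def closest_template(utterance: Sequence[float],
                     templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the stored template nearest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))
```

Because the warping path may repeat samples, a slowly spoken utterance can still align with a shorter stored template of the same word.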
  • The precise distribution of functionality amongst the various components shown in Fig. 1 can vary significantly. Modularization of components allows components to be placed or implemented rationally within the network architecture. For example, analog-to-digital conversion (ADC) and digital signal processing (DSP) steps may occur within voice appliances 102 such that a digital preprocessed signal is communicated to voice portals 110. Alternatively, this pre-processing can be performed by voice portals 110, or can be out-sourced to speech server 101. In many applications it is preferable to perform these preprocessing functions as near to the analog voice source (e.g., the speaker) as possible to avoid signal loss during communication. Conversely, it is contemplated that copies of shared voice resources can be stored permanently or temporarily (i.e., cached) within voice portals 110 and/or voice appliances 102 so that more complex functions can be implemented without access to speech server 101 in each instance.
  • Each of the devices shown in FIG. 1 may include memory, mass storage, and a degree of data processing capability sufficient to manage their connection to a network.
  • The computer program devices in accordance with the present invention are implemented in the memory of the various devices shown in FIG. 1 and enabled by the data processing capability of the devices shown in FIG. 1.
  • Selected components of the present invention may be stored in or implemented in shared mass storage.
  • Fig. 2 shows conceptual relationships between entities in a specific embodiment of the present invention.
  • Voice appliance 102 interacts with a speaker and communicates a voice signal over network 201 to voice portal 110.
  • the term "voice signal" is intended to convey a very broad range of signals that capture the voice utterances of a user in analog or digital form and which indicate an identity of the speaker.
  • the speaker identification can be to a specific individual speaker, or an indication of a group to which the speaker belongs (e.g., English-speaking children from Phoenix, AZ).
  • the speaker identification can take a variety of forms, and may be explicitly provided by the speaker or voice appliance 102 or implied from the connection through network 201 using techniques such as caller ID, area code information, or reverse telephone directory lookup.
  • Network 201 may comprise the public switched telephone network (PSTN) including cellular phone networks, as well as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), as well as public internetworks such as the Internet. Any network or group of networks that are capable of transporting the voice signal and speaker identification information are suitable implementations of network 201.
  • Internet 202 is an example of a data communication network suitable for exchanging data between components of the present invention. While Internet 202 is an example of a public IP-based network, any suitable public, private, or hybrid network or internetwork topology, including LANs, WANs, and MANs, is a suitable equivalent in most applications.
  • Voice portal 110 comprises speech-enabled application 204 and speech recognition (SR) front-end 203.
  • Application 204 implements desired fundamental behaviors of the application, such as looking up telephone numbers, weather information, stock quotes, addresses and the like.
  • Speech-enabled application 204 has an interface that couples to SR front end 203. This interface may be configured to receive voice-format data such as phoneme probabilities or text input, but may also be configured to receive commands or other structured input such as structured query language (SQL) statements.
  • Front-end 203 implements a defined interface that is protocol compliant with network 201 to communicate request and response traffic with voice appliances 102.
  • SR front-end 203 receives requests from voice appliances 102 where the requests identify the speaker and include a voice signal.
  • SR front-end generates a request to speech server 101 to access shared resources 105 needed to process the voice signal so as to generate input to speech enabled application 204.
  • The processing responsibilities between SR front-end 203 and speech server 101 are agreed upon in advance, but can be varied significantly. In a particular example, the requests from SR front end 203 include a digitized speech signal, and the responses from speech server 101 include a set of phoneme probabilities corresponding to the speech signal.
  • a typical system will involve multiple SR front-end devices 203 communicating simultaneously with a single speech server 101.
  • Each front end 203 may handle multiple voice appliances 102 simultaneously.
  • One advantage of the present invention is that centralized speech server 101 can be configured to process these requests in parallel more readily than could individual voice appliances 102.
  • requests to speech server 101 are preferably accompanied by a source identification that uniquely identifies a particular SR front-end 203 and a stream identifier that uniquely identifies a particular voice session that is using the identified SR front end 203.
  • the speaker ID can also be used to identify the session, although when a particular voice appliance 102 is conducting multiple simultaneous sessions, the speaker ID alone may be an ambiguous reference. This information can be used to route the resources 105 to appropriate processes that are using the resources.
  • the SR front end 203 exchanges request/response traffic with speech server 101 over the Internet 202 in the example of Fig. 2.
  • the request/response traffic comprises hypertext transfer protocol (HTTP) packets over TCP/IP in the particular example, although other protocols are suitable and may be preferable in some instances.
  • UDP user datagram protocol
  • the benefits of various protocol layers and stacks are well known and readily consulted in the selection of particular protocols.
  • Voice resources 105 comprise speaker-dependent signatures 207 and speaker group signatures 208 in a particular embodiment.
  • Speaker-dependent signatures 207 comprise one or more voice models associated with a particular speaker.
  • In contrast, speaker group signatures 208 comprise one or more voice models that are associated with a group of speakers, such as English speaking children from Phoenix, Arizona, rather than a particular speaker.
  • Group signatures are a useful middle ground where a particular speaker cannot be identified with certainty, but the speaker can be identified generally as a member of a particular speaker group.
  • the voice models essentially implement a mapping between voice signals and symbols, words, word portions (e.g., N-grams), phonemes, commands, statements, and the like (collectively referred to as "tokens") that have meaning to one or more speech enabled applications 204.
  • This mapping can be implemented in a variety of data structures such as lookup tables, databases, inverted indices as well as other data structures that enable mapping functionality. In a preferred implementation, the mapping is captured in a neural network training file that can be used to enable an artificial neural network to output appropriate tokens in response to voice signal inputs.
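The choice between a speaker-dependent signature and a speaker group signature can be sketched as a simple fallback lookup. This is an illustrative sketch only: the function name `select_signature`, the dictionary-based repositories, and the identifiers used as keys are assumptions, and the voice models are represented by opaque placeholder values.

```python
from typing import Any, Dict, Optional, Tuple


def select_signature(
    speaker_id: Optional[str],
    group_id: Optional[str],
    speaker_signatures: Dict[str, Any],
    group_signatures: Dict[str, Any],
) -> Optional[Tuple[str, Any]]:
    """Prefer a speaker-dependent signature; fall back to a group signature."""
    if speaker_id is not None and speaker_id in speaker_signatures:
        return ("speaker", speaker_signatures[speaker_id])
    # The speaker is not known individually, so try the speaker group instead.
    if group_id is not None and group_id in group_signatures:
        return ("group", group_signatures[group_id])
    # Nothing suitable is stored for this speaker or group.
    return None
```

Returning the kind of signature alongside the model lets a caller decide whether the lower accuracy of a group model is acceptable for the task at hand.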
  • Signature cache 205 is useful in environments where a given speaker or set of speakers frequent the SR front end 203, yet the SR front end 203 still requires general purpose adaptability to any speaker.
  • The speech enabled application may provide services to a single speaker for an extended period of time ranging from a few minutes to days. In such cases, SR front end 203 can implement processes to search its own cache 205 for matches to a voice signal, thereby avoiding repeated reference to speech server 101.
  • An SR front end 203 that uses a cache will typically exhibit somewhat greater complexity to implement processes that manage the cache, and that use the cache content instead of services provided by speech server 101.
  • Each cache entry within signature cache 205 may include, for example, a speaker identification, and an association between a particular voice signal and a token corresponding to the voice signal.
  • When SR front end 203 receives a new speech signal, it checks cache 205 for a matching speech signal and returns the token (e.g., a set of phoneme probabilities) without having to access speech server 101. When a speech signal does not find a match in cache 205, a conventional reference to speech server 101 is performed.
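The cache check described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `SignatureCache` and the `query_server` callback are hypothetical, and a "match" is taken to mean a byte-for-byte identical signal, whereas a practical system would presumably need some form of approximate matching.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple


class SignatureCache:
    """Front-end cache mapping (speaker, voice signal) to a token."""

    def __init__(self, query_server: Callable[[str, bytes], Any]) -> None:
        self._entries: Dict[Tuple[str, str], Any] = {}
        self._query_server = query_server
        self.server_calls = 0  # counts references made to the speech server

    def lookup(self, speaker_id: str, signal: bytes) -> Any:
        # Assumption: exact-match keying on a digest of the raw signal bytes.
        key = (speaker_id, hashlib.sha256(signal).hexdigest())
        if key in self._entries:
            return self._entries[key]  # cache hit: no server access needed
        # Cache miss: perform a conventional reference to the speech server.
        self.server_calls += 1
        token = self._query_server(speaker_id, signal)
        self._entries[key] = token
        return token
```

Keying on the speaker identification as well as the signal keeps tokens for different speakers separate even if their signals happen to coincide.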
  • some voice appliances 102 may benefit by caching signatures in signature cache 206.
  • a home telephone set or office workstation may be used by a small number of speakers frequently, yet still be available for use by other speakers.
  • the voice resources corresponding to the frequent speakers are cached locally in signature cache 206.
  • voice services can be implemented within the appliance 102 itself.
  • voice resources are either downloaded to the appliance 102, or to an SR front end 203.
  • Such functionality enables a rather simple, lightweight voice recognition system implemented within an appliance 102 to offer high quality, speaker dependent functionality without training, assuming speech server 101 contains signatures corresponding to the speakers.
  • Fig. 3 shows an exemplary format of a request made from SR front end 203 to speech server 101.
  • Each request includes a speaker ID, source address identifying a particular SR front end 203, optional context information, and a voice signal.
  • The context information may include information about the application 204, information about the speaker (e.g., age, language, accents), speaker location, information about the voice appliance 102, or a stream ID for multiprocessing.
  • the context information is used to adapt the processes undertaken in speech server 101.
  • the voice signal field includes all or a portion of a voice signal to be processed. It is contemplated that some applications may be configured such that voice processing occurs in the front end 203 or within voice appliance 102.
  • The voice signal field may include processed data structures such as the output of a Fourier transform instead of raw voice data. In such cases, it may not be necessary to send the voice signal to speech server 101 at all; instead, speech server 101 serves to supply the raw speaker-dependent resources needed to SR front end 203 and/or voice appliance 102.
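The request fields described for Fig. 3 can be modeled as a small record type. The sketch below is an assumption-laden illustration: the class name `SpeechRequest`, the JSON encoding, the base64 encoding of the voice signal, and the `stream_id` context key are all hypothetical choices, since the source specifies only which fields a request carries.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class SpeechRequest:
    speaker_id: str            # identifies the speaker (or speaker group)
    source_address: str        # identifies a particular SR front end
    voice_signal: bytes        # all or a portion of the voice signal
    context: Dict[str, Any] = field(default_factory=dict)  # optional context

    def session_key(self) -> Tuple[str, Any]:
        """Source address plus stream ID, used to tell sessions apart."""
        return (self.source_address, self.context.get("stream_id"))

    def to_json(self) -> str:
        # Assumption: a JSON body with the binary signal carried as base64 text.
        return json.dumps({
            "speaker_id": self.speaker_id,
            "source_address": self.source_address,
            "context": self.context,
            "voice_signal": base64.b64encode(self.voice_signal).decode("ascii"),
        })

    @classmethod
    def from_json(cls, text: str) -> "SpeechRequest":
        raw = json.loads(text)
        return cls(
            speaker_id=raw["speaker_id"],
            source_address=raw["source_address"],
            voice_signal=base64.b64decode(raw["voice_signal"]),
            context=raw["context"],
        )
```

Combining the source address with a stream ID reflects the point that the speaker ID alone can be ambiguous when one appliance conducts several sessions at once.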
  • Fig. 4 shows a simplified flow diagram of processes undertaken within a voice processing system in accordance with the present invention.
  • the processes shown in Fig. 4 can be distributed throughout the various components shown in Fig. 1 and Fig. 2 or may all be implemented in speech server 101.
  • Step 401 of capturing a voice signal is typically performed by a microphone or other acoustic energy sensor within a voice appliance 102.
  • Filtering, framing, and analog to digital conversion step 402 preferably takes place as close to step 401 as practicable to avoid transmission of analog voice signals. However, it is common to transfer voice signals over great distances using radio and telephony subsystems.
  • the filtering, framing and ADC step 402 may occur within voice appliance 102, at a device coupled to appliance 102 through a voice or data communication network, or a combination of these locations.
  • filtering may be explicitly performed by analog or DSP filter circuits, or implicitly by bandwidth constraints or other transmission channel characteristics through which analog voice signals are passed.
  • In step 403, features of interest are identified within a voice signal.
  • Features are patterns within the voice signal that have a likelihood of corresponding to phonemes, but phonemes are not directly extracted in step 403.
  • Features may include, for example, mathematical properties such as frequency distribution, amplitude distribution, deviation, and the like of the processed voice signals.
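Feature extraction of this kind can be illustrated with a few simple per-frame statistics. The sketch below is not the feature set of the invention: the function names, the frame length and hop parameters, and the choice of a naive discrete Fourier transform (rather than an FFT) are assumptions made to keep the example dependency-free.

```python
import cmath
import statistics
from typing import Dict, List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split a sampled signal into fixed-length, possibly overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def dft_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitude spectrum of one frame via a naive O(n^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def extract_features(frame: Sequence[float]) -> Dict[str, float]:
    """Amplitude, deviation, and frequency-distribution features for one frame."""
    mags = dft_magnitudes(frame)
    return {
        "mean_abs_amplitude": sum(abs(x) for x in frame) / len(frame),
        "deviation": statistics.pstdev(frame),
        # Index of the strongest frequency bin in the frame.
        "dominant_bin": max(range(len(mags)), key=mags.__getitem__),
    }
```

These values describe the signal only; consistent with the text above, no phoneme is identified at this stage.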
  • Phonemes are abstract units of a phonetic system of a language that correspond to a set of similar speech sounds that are perceived to be a single distinctive sound in the language. Phonemes represent identifiable components within a speech stream that, while they do not have linguistic meaning in themselves, are units with high occurrence within a spoken language. In extracting phonemes from a speech signal, it is rare to make exact matches. It is often useful to associate a probability with one or more phonemes indicating a likelihood that the particular phoneme is in fact a correct representation of the speech signal. Once particular phonemes are identified, they can be used alone or in combination to create associations with tokens such as particular text, commands, or the like in step 405.
  • Step 405 functions to associate the extracted features with tokens. In contrast with features, tokens are abstract units that have linguistic meaning.
  • Words, phrases and punctuation symbols for example, are tokens that carry linguistic meaning.
  • A token or set of tokens may represent the concept "search government databases for information about patent litigation involving speech recognition systems"; however, the token may be an SQL query, a Java object, an XML document, or other abstraction of the actual verbiage that reflects the concept in the English language.
  • a response is generated to application 204 including the tokens along with identification information that enables the application processing step 406 to associate the tokens with the voice signal that generated the request.
  • Many applications 204 can use a set of phoneme probabilities, or other feature sets, directly. For example, when application 204 is expecting a constrained choice between "yes" and "no", or expecting a single-digit numeric response (e.g., "one", "two", etc.), it is relatively easy to determine the correct token from the feature sets directly. In such cases, the present invention enables the feature extraction process 403 to supply features directly to application processing step 406 such that the step of associating tokens with features can be performed within the application 204 itself. Similarly, some applications may involve supplying phoneme probabilities directly from step 404 so that token association may occur elsewhere. The application processing step 406 implements the voice enabled behavior using the features, phonemes, and/or tokens that it is supplied in steps 403 through 405, respectively.
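A constrained choice such as "yes" versus "no" can be resolved inside the application from phoneme probabilities alone. The following sketch is illustrative only: the function name `pick_token`, the scoring rule (average probability of each candidate's phonemes), and the phoneme spellings are assumptions, and the probabilities are summarized per utterance rather than per frame.

```python
from typing import Dict, Sequence


def pick_token(phoneme_probs: Dict[str, float],
               candidates: Dict[str, Sequence[str]]) -> str:
    """Choose the candidate token whose phonemes are most probable on average."""

    def score(phonemes: Sequence[str]) -> float:
        # Phonemes absent from the probability set count as probability 0.
        return sum(phoneme_probs.get(p, 0.0) for p in phonemes) / len(phonemes)

    return max(candidates, key=lambda token: score(candidates[token]))


# Hypothetical pronunciations for a two-way constrained choice.
YES_NO = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
```

Because the vocabulary is tiny, no general-purpose token mapping is needed, which is why such an application can consume phoneme probabilities directly.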
  • Fig. 5 illustrates an important concept that speech server 101 can be implemented as a distributed computing system using distributed hardware and software resources that are coupled together by network 501. This enables, among other things, the provision of differential levels of service and parallel request processing for improved performance. In one example, the various services shown in Fig. 5 may communicate with each other directly through network 501 to pass a voice processing task from service to service until it is completed. Alternatively, the individual services 502-506 may communicate directly with voice portal 110 and/or voice appliance 102 to perform their functions without knowledge of the other components within speech server 101.
  • the various services can receive an HTTP request, perform their process, and return the partially or completely processed data to the requesting voice portal 110 or voice appliance 102 using HTTP redirection mechanisms as needed, passing the partially processed data as attributes within the HTTP redirection responses.
  • speech server 101 comprises signal processing services such as service 502 that provides digital signal processing services. While DSP functionality is often implemented within a voice appliance 102 itself, thin clients may lack the resources to perform DSP processes. Moreover, even where a voice appliance 102 could perform DSP functions, it may be desirable to have DSP service 502 perform the functions to achieve the benefits of faster, more powerful processors and up-to-date algorithms that can be implemented in DSP service 502. DSP service 502 receives a digitized voice signal, for example, and returns a processed digital signal that may implement filtering step 402 (shown in Fig. 4).
  • Process 503 provides feature extraction services described hereinbefore.
  • Process 504 provides feature mapping, for example phoneme mapping, as a specific implementation of steps 404 and 405.
  • process 505 provides features and/or phonemes-to-command mapping as an implementation of step 406.
  • A number of processes 505 may be provided for particular applications so that the command mapping is specific to a particular application 204. Requests can be directed to particular instances of process 505 to meet the needs of the application associated with the request.
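Passing a task from service to service, with an application-specific command mapper at the end, can be sketched as an ordered list of stages. This is a toy illustration under stated assumptions: the names `run_pipeline` and `make_stages`, the stage labels, and the trivial stage bodies are hypothetical stand-ins for the DSP, feature extraction, and command mapping services, and all stages run in one process rather than across a network.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(payload: Any, stages: List[Stage],
                 stop_after: Optional[str] = None) -> Any:
    """Pass the payload through each stage in order, optionally stopping early."""
    for name, stage in stages:
        payload = stage(payload)
        if name == stop_after:
            break  # use only the services the task requires
    return payload


def make_stages(command_mappers: Dict[str, Callable[[Any], Any]],
                application_id: str) -> List[Stage]:
    """Build a pipeline whose last stage is specific to one application."""
    return [
        # Stand-in for a DSP service: drop near-silent samples.
        ("dsp", lambda samples: [s for s in samples if abs(s) > 0.01]),
        # Stand-in for feature extraction: label each sample by loudness.
        ("features", lambda samples: ["loud" if abs(s) > 0.5 else "soft" for s in samples]),
        # Command mapping instance selected for the requesting application.
        ("commands", command_mappers[application_id]),
    ]
```

Stopping after an intermediate stage corresponds to a client that wants only partially processed data, such as features it will map to commands itself.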
  • Language translation process 506 illustrates a more complex voice processing process contemplated by the present invention.
  • Language translation process 506 may receive text from process 504, for example, and perform a language-to-language translation (e.g., English to Spanish). This results in text returned to the requesting voice portal 110 or voice appliance 102 in a different language than was originally spoken.
  • a number of complex services similar to language translation service 506 will be apparent to those skilled in the art.
  • a speaker's utterances may be translated into properly formed C++ code constructs as a program authoring tool, properly formed SQL queries to be applied to a database, and the like.
  • speech services 101 may also provide services that convert abstract tokens into speech signals that can be audibly presented through an appliance 102 having a speaker.
  • Fig. 6 illustrates an embodiment in which station-to-station duplex voice exchange is implemented between voice appliances 102 in a manner that is functionally analogous to conventional telephone service.
  • Conventional phone service provides a medium in which all parties to a conversation hear a voice that sounds substantially like the speaker. Although conventional telephone service limits audio bandwidth to reduce the amount of data that is being transported, it remains a very inefficient means to communicate voice information between two points.
  • the implementation of the present invention shown in Fig. 6 enables high levels of compression while retaining the benefits of conventional phone service.
  • each speaker is associated with a TX signature used to convert the speaker's voice into tokens (as described hereinbefore), and an RX signature used to convert tokens into a replica of the speaker's voice.
  • the RX signatures are akin to TX signatures in that they implement mappings between tokens and voice signals using either mapping data structures and algorithms, or neural networks, or both.
  • each appliance 102 implements processes to use speaker dependent signature files to encode and decode voice signals.
  • the appliance 102 used by speaker 1 for example, includes a TX signature file for speaker 1, and an RX signature file for speaker 2.
  • the appliance 102 used by speaker 2 conversely, includes a TX signature file for speaker 2, and an RX signature file for speaker 1.
  • speaker 1's voice is encoded to tokens and tokens received from the appliance 102 operated by speaker 2 are decoded into audible speech signals that can be presented through a speaker.
  • speaker 2's voice is encoded to tokens and tokens received from the appliance 102 operated by speaker 1 are decoded into audible speech signals that can be presented through a speaker. In this manner, station-to-station voice communication is enabled with highly compressed data communication between the stations.
  • voice appliances 102 communicate through voice portals 110 such that some or all of the voice processing functions required to associate voice signals with tokens are performed by the voice portals 110.
  • voice appliances 102 may be substantially conventional telephone sets.
  • Voice portals 110 perform the token to voice signal mapping functions transparently and communicate voice signals with the respective voice appliances 102.
  • the communication between voice portals 110 can be compressed as described above greatly reducing the overall network bandwidth consumed by a given station-to-station voice communication.
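The station-to-station exchange described for Fig. 6 can be illustrated with a minimal Python sketch. The `Appliance` class, the dictionary-based signatures, and the integer token values are all hypothetical stand-ins; in the described system the signatures may be mapping data structures, neural networks, or both, and the voice segments would be audio rather than strings.

```python
class Appliance:
    """Toy voice appliance holding its own TX signature and the peer's RX signature."""

    def __init__(self, tx_signature, peer_rx_signature):
        # TX signature (assumed shape): voice segment of the local speaker -> token
        self.tx = tx_signature
        # RX signature (assumed shape): token -> voice segment of the remote speaker
        self.rx = peer_rx_signature

    def send(self, segments):
        # Encode the local speaker's voice segments into compact tokens.
        return [self.tx[segment] for segment in segments]

    def receive(self, tokens):
        # Decode received tokens into a replica of the remote speaker's voice.
        return [self.rx[token] for token in tokens]


# Strings stand in for audio; tokens 17 and 42 are arbitrary shared identifiers.
speaker1_tx = {"s1:hello": 17, "s1:world": 42}
speaker1_rx = {17: "s1:hello", 42: "s1:world"}
speaker2_tx = {"s2:hello": 17, "s2:world": 42}
speaker2_rx = {17: "s2:hello", 42: "s2:world"}

appliance1 = Appliance(speaker1_tx, speaker2_rx)  # used by speaker 1
appliance2 = Appliance(speaker2_tx, speaker1_rx)  # used by speaker 2

tokens = appliance1.send(["s1:hello", "s1:world"])
heard = appliance2.receive(tokens)
```

Only the token list crosses the network in this sketch, which is the source of the compression the text describes; the receiving side reconstructs speech from the sender's RX signature.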

Abstract

A method of speech recognition including receiving speech signals into a front-end processor and storing at least some resources used for speech recognition in a network-attached server (Figure 1). The front-end processor is coupled to the network-attached server to perform the speech recognition.

Description

SPEECH RECOGNITION SYSTEM
BACKGROUND OF THE INVENTION
1. Field of the Invention.
The present invention relates, in general, to voice recognition, and, more particularly, to software, systems, and methods for performing voice and speech recognition over a distributed network.
2. Relevant Background.
Voice and speech recognition systems are increasingly common interfaces for obtaining user input into computer systems. Speech recognition is used to provide enhanced services such as interactive voice response (IVR), automated phone attendants, voice mail, fax mail, and other applications. More sophisticated speech recognition systems are used for speech-to-text conversion systems used for dictation and transcription.
Voice and speech recognition systems are characterized by, among other things, their recognition accuracy, speed and vocabulary size. High speed, accurate, large vocabulary systems tend to be complex and so require significant computing resources to implement. Moreover, such systems have increased training demands to develop accurate models of users' speech patterns. In applications where computing resources are limited or the ability to train to a particular user's speech patterns is limited, speech recognition products tend to be slow and/or inaccurate. Currently, speech recognition enabled software applications must often compromise between complex but accurate solutions, or simple but less accurate solutions. In many applications, however, the impracticality of meaningful training dictates that the application can only implement less accurate techniques.
Voice recognition is of two basic types, speaker-dependent and speaker-independent. A speaker-dependent system operates in environments where the system has relatively frequent contact with each speaker, where sizable vocabularies are involved, and where the cost of recognition errors is high. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker-adaptive or speaker-independent systems. In a speaker-dependent system, a user trains the system by, for example, providing speech samples and creating a correlation between the samples and text of what was provided, usually with some manual effort on the part of the speaker. Such systems often use a generic engine coupled with substantial data files, called voice models, that characterize a particular speaker for which the system has been trained. The training process can involve significant effort to obtain high recognition rates. Moreover, the voice model files are tightly coupled to the recognition software so that it is difficult to port the training investment to other hardware/software platforms.
A speaker-independent system operates for any speaker of a particular type (e.g., American English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker-dependent systems. However, they are highly useful in a wide variety of applications where many users must use the system such as answering services, interactive voice response (IVR) systems, call processing centers, data entry and the like. Such applications sacrifice the accuracy of speaker-dependent systems for the flexibility of enabling a heterogeneous group of speakers to use the system. Such applications are characterized in that high recognition rates are desirable, but the cost of recognition failure is relatively low.
A middle ground is sometimes defined as a speaker adaptive system. A speaker adaptive system dynamically adapts its operation to the characteristics of new speakers. These systems are more akin to speaker-dependent models, but allow the system to be trained over time. Adaptive systems can improve their vocabulary over time and result in complex, but accurate speech models. Such systems still require significant training effort, however. As in speaker-dependent systems, the complex speech models cannot be readily ported to other systems.
Training methods tend to be very product specific. Moreover, the data structures in which the relationships between a user's speech and text are correlated tend to be product specific. Hence, the significant training effort applied to a first speech recognition program may not be reusable for any other program or system. In some cases, speakers must re-train systems between version updates of the same program. Temporary or permanent changes to a user's voice patterns affect performance and may require retraining. This significant training burden and lack of portability between products has worked against wide scale adoption of speech recognition systems.
Moreover, even where a user has trained one or more speaker-dependent systems, this training effort cannot be leveraged to improve the performance of the many speaker-independent systems that are encountered. The speaker-independent systems cannot, by design, access or use speaker-dependent speech models to improve their performance. Hence, a need exists for improved speech recognition systems, software and methods that enable portable speech models that can be used for a wide variety of tasks and leverage the training efforts across a wide variety of systems.
The dichotomy between speaker-dependent and speaker-independent technologies has resulted in an interesting dilemma in industry. Many of the applications that could benefit most from accurate speech recognition (e.g., interactive voice response systems) cannot afford the complexity of highly accurate speaker-dependent systems, nor obtain the necessary voice models that would improve their accuracy. From a practical perspective, speakers will only invest the significant time required to develop a high quality voice model in applications where the result is worth the effort. The benefits realized by a business cannot compel individual speakers to submit to the necessary training regimens. Hence, these applications settle for speaker-independent solutions and invest heavily in improving the performance of such systems.
Increasingly, computer-implemented applications and services are targeting "thin clients" or computers with limited processing power and data storage capacity. Such devices are cost effective means of implementing user interfaces. Thin clients are becoming prominent in appliances such as televisions, telephones, Internet terminals and the like. However, the limited computing resources make it difficult to implement complex functionality such as voice and speech recognition. A need exists for voice processing systems, methods and software that can provide high quality voice processing services with reduced hardware requirements.
In the past, computers were used by one user, or perhaps a few users, to access a limited set of applications. As computers are used more frequently to provide interfaces to everyday appliances, the need to adapt user interfaces to multiple users becomes more pressing. Voice processing, in particular, represents a user input mode that is difficult to adapt to multiple users. In current systems, a voice model must be developed on and stored in each machine for each user. Not only does this tax the machine's resources, but it creates a burdensome need for each user to train each computer that they use.
Conversely, each user tends to access computer resources via a variety of computer-implemented interfaces and computing hardware. It is contemplated that any given user may wish to access voice-enabled television, voice-enabled software on a personal computer, voice-enabled automobile controls, and the like. The effort to train and maintain each of these systems individually becomes significant with only a few applications, and prohibitive with the large number of applications that could potentially become voice enabled.
Hence, a need exists for speech recognition systems, methods and software that provide increased accuracy with reduced cost. Moreover, there is a need for systems that require reduced effort on the part of the speaker. Further, a need exists for systems and software that enable users to leverage training effort across multiple, disparate speech-recognition enabled applications.
SUMMARY OF THE INVENTION
Briefly stated, the present invention involves a speech recognition system in which one or more speaker-dependent voice signatures are developed for each of a plurality of speakers. A plurality of configurable speech processing engines are deployed and integrated with computer applications. A session is initiated between the configurable engine and a particular speaker. The configurable engine identifies the user using voice recognition or other explicit or implicit user-identification methods. The configurable engine accesses a copy of the speaker-dependent voice signature associated with the identified speaker to perform speaker-dependent speech recognition.
In another aspect, the present invention involves voice signatures that are configured to integrate with and be used by a plurality of disparate voice-enabled applications. The voice signature comprises a static data structure or a dynamically adapting data structure that represents a correlation between a speaker's voice patterns and language constructs. The voice signature is preferably portable across multiple computer hardware and software platforms. Preferably, a plurality of voice signatures are stored in a network accessible repository for access by voice-enabled applications as needed.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 shows a computer environment in which the present invention is implemented;
Fig. 2 shows entities and relationships in a particular embodiment of the present invention;
Fig. 3 illustrates an exemplary packet structure in accordance with an embodiment of the present invention;
Fig. 4 shows a flow diagram of processes involved in an implementation of the present invention; and
Fig. 5 depicts a distributed service model implementing functionality in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is directed to voice processing systems characterized by a number of distinct aspects. In general, the systems and methods of the present invention intend to reduce the burden on users and developers of speech recognition systems by enabling training files and voice models to be readily shared between disparate applications. Further, initial training and voice model adaptation can be implemented with greater efficiency by sharing voice information across multiple disparate applications.
In one aspect, the present invention provides a "voice processing substrate" or "voice processing service" upon which other software applications can build. By providing high quality voice recognition and speech recognition services ubiquitously, existing software applications can become "voice-enabled" with significantly lower development cost. Moreover, applications that would not have been practical heretofore due to the high cost and proprietary nature of voice recognition software, are made viable by the distributed and highly portable and scaleable nature of the present invention.
In another aspect, the present invention involves applications of the voice processing service such as interactive voice response, dictation and transcription services, voice messaging services, voice automated application services, and the like that share a common repository of speech recognition resources. These applications, typically implemented as software applications, can leverage the aggregate knowledge about their users' voice and speech patterns by using the shared common speech recognition resources.
In yet another aspect the present invention involves a distributed voice processing system in which the various functions involved in voice processing can be performed in a pipelined or parallel fashion. Speech tasks differ significantly in purpose and complexity. In accordance with this aspect of the present invention, the processes involved in speech processing are modularized and distributed amongst a number of processing resources. This enables the system to employ only the required resources to complete a particular task. Also, this enables the processes to be implemented in parallel or in a pipelined fashion to greatly improve overall performance.
The present invention is illustrated and described in terms of a distributed computing environment such as an enterprise computing system using public communication channels such as the internet. However, an important feature of the present invention is that it is readily scaled upwardly and downwardly to meet the needs of a particular application. Accordingly, unless specified to the contrary the present invention is applicable to significantly larger, more complex network environments as well as small network environments such as conventional LAN systems.
FIG. 1 shows an exemplary computing environment 100 in which the present invention may be implemented. Speech server 101 comprises program and data constructs that function to receive requests from a variety of sources, access voice resources 105, and provide voice services in response to the requests. The provided voice services involve accessing stored voice resources 105 that implement a central repository of resources that can be leveraged to provide services for a wide variety of requests. The services provided by speech server 101 may vary in complexity from simply retrieving specified voice resources (e.g., obtaining a speech sample file for a particular user) to more complex speech recognition processes (e.g., feature extraction, phoneme recognition, phoneme-to-text mapping).
Requests to speech server 101 may come directly from voice appliances 102; however, in preferred examples requests come from "voice portals" 110. Voice portals comprise software applications and/or software servers that provide a set of fundamental behaviors and that are voice enabled by way of their coupling to speech server 101.
Example voice portals include interactive voice response (IVR) services 111, dictation service 112 and voice mail service 113. However, the number and variety of applications and services that can be voice-enabled in accordance with the present invention is nearly limitless. Because voice portals 110 access shared speech server 101 and shared voice resources 105, they do not each need to create, obtain, or maintain duplicate or special-purpose instances of the voice resources. Instead, the voice portals can focus on implementing the logic necessary to implement their fundamental behaviors, effectively outsourcing the complex tasks associated with voice processing.
A set 103 of voice appliances 102 represents the hardware and software devices used to implement voice-enabled user interfaces. Exemplary voice appliances 102 include, but are not limited to, personal computers with microphones or speech synthesis programs, telephones, cellular telephones, voice over IP (VoIP) terminals, laptop and hand held computers, computer games and the like. Any given speaker may use a plurality of voice appliances 102. Likewise, any given voice appliance 102 may be used by multiple speakers.
A variety of techniques are used to perform voice processing. Typically speech recognition starts with the digital sampling of speech followed by acoustic signal processing. Most techniques include spectral analysis such as Fast Fourier Transform (FFT) analysis, LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modeling and the like. Using phoneme recognition, the preprocessed files are parsed to identify groups of phonemes and words using techniques such as DTW (Dynamic Time Warping), HMM (hidden Markov modeling), NNs (Neural Networks), expert systems, N-grams and combinations of techniques. Most systems use some knowledge of the language (e.g., syntax and context) to aid the recognition process.
The precise distribution of functionality amongst the various components shown in Fig. 1 can vary significantly. Modularization of components allows components to be placed or implemented rationally within the network architecture. For example, analog-to-digital conversion (ADC) and digital signal processing (DSP) steps may occur within voice appliances 102 such that a digital preprocessed signal is communicated to voice portals 110. Alternatively, this pre-processing can be performed by voice portals 110, or can be out-sourced to speech server 101. In many applications it is preferable to perform these preprocessing functions as near to the analog voice source (e.g., the speaker) as possible to avoid signal loss during communication. Conversely, it is contemplated that copies of shared voice resources can be stored permanently or temporarily (i.e., cached) within voice portals 111 and/or voice appliances 102 so that more complex functions can be implemented without access to speech server 101 each instance.
Each of the devices shown in FIG. 1 may include memory, mass storage, and a degree of data processing capability sufficient to manage their connection to a network.
The computer program devices in accordance with the present invention are implemented in the memory of the various devices shown in FIG. 1 and enabled by the data processing capability of the devices shown in FIG. 1. In addition to local memory and storage associated with each device, it is often desirable to provide one or more locations of shared storage such as a disk farm (not shown) that provides mass storage capacity beyond what an individual device can efficiently use and manage. Selected components of the present invention may be stored in or implemented in shared mass storage.
Fig. 2 shows conceptual relationships between entities in a specific embodiment of the present invention. Voice appliance 102 interacts with a speaker and communicates a voice signal over network 201 to voice portal 110. The term "voice signal" is intended to convey a very broad range of signals that capture the voice utterances of a user in analog or digital form and which indicate an identity of the speaker. The speaker identification can be to a specific individual speaker, or an indication of a group to which the speaker belongs (e.g., English-speaking children from Phoenix, AZ). The speaker identification can take a variety of forms, and may be explicitly provided by the speaker or voice appliance 102 or implied from the connection through network 201 using techniques such as caller ID, area code information, or reverse telephone directory lookup.
Network 201 may comprise the public switched telephone network (PSTN) including cellular phone networks, as well as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), and public internetworks such as the Internet. Any network or group of networks capable of transporting the voice signal and speaker identification information is a suitable implementation of network 201. Internet 202 is an example of a data communication network suitable for exchanging data between components of the present invention. While Internet 202 is an example of a public IP-based network, any public, private, or hybrid network or internetwork topology including LANs, WANs, and MANs is a suitable equivalent in most applications.
Voice portal 110 comprises speech-enabled application 204 and speech recognition (SR) front-end 203. Application 204 implements desired fundamental behaviors of the application, such as looking up telephone numbers, weather information, stock quotes, addresses and the like. Speech enabled application has an interface that couples to SR front end 203. This interface may be configured to receive voice-format data such as phoneme probabilities or text input, but may also be configured to receive commands or other structured input such as structured query language (SQL) statements.
Front-end 203 implements a defined interface that is protocol compliant with network 201 to communicate request and response traffic with voice appliances 102. SR front-end 203 receives requests from voice appliances 102 where the requests identify the speaker and include a voice signal. SR front-end 203 generates a request to speech server 101 to access shared resources 105 needed to process the voice signal so as to generate input to speech enabled application 204. The processing responsibilities between SR front-end 203 and speech server 101 are agreed upon in advance, but can be varied significantly. In a particular example, the requests from SR front end 203 include a digitized speech signal, and the responses from speech server 101 include a set of phoneme probabilities corresponding to the speech signal.
It is contemplated that a typical system will involve multiple SR front-end devices 203 communicating simultaneously with a single speech server 101. Each front end 203 may handle multiple voice appliances 102 simultaneously. One advantage of the present invention is that centralized speech server 101 can be configured to process these requests in parallel more readily than could individual voice appliances 102. In such cases, requests to speech server 101 are preferably accompanied by a source identification that uniquely identifies a particular SR front-end 203 and a stream identifier that uniquely identifies a particular voice session that is using the identified SR front end 203. In some cases the speaker ID can also be used to identify the session, although when a particular voice appliance 102 is conducting multiple simultaneous sessions, the speaker ID alone may be an ambiguous reference. This information can be used to route the resources 105 to appropriate processes that are using the resources.
SR front end 203 exchanges request/response traffic with speech server 101 over the Internet 202 in the example of Fig. 2. The request/response traffic comprises hypertext transfer protocol (HTTP) packets over TCP/IP in the particular example, although other protocols are suitable and may be preferable in some instances. For example, user datagram protocol (UDP) can be faster, although it offers poorer reliability. The benefits of various protocol layers and stacks are well known and readily consulted in the selection of particular protocols.
Voice resources 105 comprise speaker-dependent signatures 207 and speaker group signatures 208 in a particular embodiment. Speaker-dependent signatures 207 comprise one or more voice models associated with a particular speaker. In contrast, speaker group signatures 208 comprise one or more voice models that are associated with a group of speakers such as English speaking children from Phoenix, Arizona, rather than a particular speaker. Group signatures are a useful middle ground where a particular speaker cannot be identified with certainty, but the speaker can be identified generally as a member of a particular speaker group.
The voice models essentially implement a mapping between voice signals and symbols, words, word portions (e.g., N-grams), phonemes, commands, statements, and the like (collectively referred to as "tokens") that have meaning to one or more speech enabled applications 204. This mapping can be implemented in a variety of data structures such as lookup tables, databases, inverted indices as well as other data structures that enable mapping functionality. In a preferred implementation the mapping is captured in a neural network training file that can be used to enable an artificial neural network to output appropriate tokens in response to voice signal inputs.
An optional feature in accordance with the present invention is the inclusion of signature caches 205 and/or 206. Signature cache 205 is useful in environments where a given speaker or set of speakers frequent the SR front end 203, yet the SR front end 203 still requires general purpose adaptability to any speaker. For example, the speech enabled application may provide services to a single speaker for an extended period of time ranging from a few minutes to days. In such cases, SR front end 203 can implement processes to search its own cache 205 for matches to a voice signal thereby avoiding repeated reference to speech server 101. An SR front end 203 that uses a cache will typically exhibit somewhat greater complexity to implement processes that manage the cache, and that use the cache content instead of services provided by speech server 101. Each cache entry within signature cache 205 may include, for example, a speaker identification, and an association between a particular voice signal and a token corresponding to the voice signal. When SR front end 203 receives a new speech signal, it checks cache 205 for a matching speech signal and returns the token (e.g., a set of phoneme probabilities) without having to access speech server 101. When a speech signal does not find a match in cache 205, a conventional reference to speech server 101 is performed.
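The cache lookup described for signature cache 205 can be sketched as follows. This is a minimal illustration, not the patented implementation: `SignatureCache`, `fake_speech_server`, and the notion of a hashable `signal_key` that identifies a voice signal are hypothetical, and storing the server's answer in the cache after a miss is an added assumption that the text does not state.

```python
class SignatureCache:
    """Toy front-end cache keyed by (speaker ID, voice-signal key)."""

    def __init__(self, lookup_on_server):
        self._entries = {}
        self._lookup_on_server = lookup_on_server
        self.server_calls = 0  # counts references made to the speech server

    def resolve(self, speaker_id, signal_key):
        key = (speaker_id, signal_key)
        if key in self._entries:
            # Cache hit: return the token without contacting the server.
            return self._entries[key]
        # Cache miss: fall back to a conventional reference to the server.
        self.server_calls += 1
        token = self._lookup_on_server(speaker_id, signal_key)
        # Assumption: remember the result so later requests can hit the cache.
        self._entries[key] = token
        return token


def fake_speech_server(speaker_id, signal_key):
    # Stand-in for speech server 101; returns made-up phoneme probabilities.
    return {"ah": 0.9, "eh": 0.1}


cache = SignatureCache(fake_speech_server)
first = cache.resolve("speaker-1", "signal-abc")
second = cache.resolve("speaker-1", "signal-abc")
```

In practice two utterances rarely match exactly, so a real cache would need an approximate matching step in place of the dictionary lookup used here.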
Similarly, some voice appliances 102 may benefit by caching signatures in signature cache 206. For example, a home telephone set or office workstation may be used by a small number of speakers frequently, yet still be available for use by other speakers. The voice resources corresponding to the frequent speakers are cached locally in signature cache 206. When one of the frequent speakers uses the voice appliance 102, voice services can be implemented within the appliance 102 itself. However, when an infrequent (i.e., uncached) speaker uses the appliance 102, voice resources are either downloaded to the appliance 102, or to an SR front end 203. Such functionality enables a rather simple, lightweight voice recognition system implemented within an appliance 102 to offer high quality, speaker dependent functionality without training, assuming speech server 101 contains signatures corresponding to the speakers.
Fig. 3 shows an exemplary format of a request made from SR front end 203 to speech server 101. Each request includes a speaker ID, source address identifying a particular SR front end 203, optional context information, and a voice signal. The context information may include information about the application 204, information about the speaker (e.g., age, language, accents), speaker location, information about the voice appliance 102, or a stream ID for multiprocessing. The context information is used to adapt the processes undertaken in speech server 101. The voice signal field includes all or a portion of a voice signal to be processed. It is contemplated that some applications may be configured such that voice processing occurs in the front end 203 or within voice appliance 102. In these cases, the voice signal field includes processed data structures such as the output of a Fourier transform instead of raw voice data. In such cases, it may not be necessary to send the voice signal to speech server 101 at all; instead, speech server 101 serves to supply the raw speaker-dependent resources needed to SR front end 203 and/or voice appliance 102.
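The request fields listed for Fig. 3 can be represented with a small Python data class. The field names, the JSON encoding, and the `session_key` helper are assumptions made for illustration; the text specifies only which pieces of information a request carries, not a wire format.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class SpeechRequest:
    """Hypothetical container for the request fields described for Fig. 3."""
    speaker_id: str
    source_address: str                          # identifies a particular SR front end
    voice_signal: bytes                          # raw or preprocessed voice data
    context: dict = field(default_factory=dict)  # optional context information

    def session_key(self):
        # Source address plus stream ID distinguishes concurrent voice sessions.
        return (self.source_address, self.context.get("stream_id"))

    def to_json(self):
        # Binary voice data is base64-encoded so it can travel as JSON text.
        return json.dumps({
            "speaker_id": self.speaker_id,
            "source_address": self.source_address,
            "context": self.context,
            "voice_signal": base64.b64encode(self.voice_signal).decode("ascii"),
        })

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        return cls(
            speaker_id=raw["speaker_id"],
            source_address=raw["source_address"],
            voice_signal=base64.b64decode(raw["voice_signal"]),
            context=raw["context"],
        )


request = SpeechRequest(
    speaker_id="speaker-1",
    source_address="front-end-203-a",
    voice_signal=b"\x00\x01\x02\x03",
    context={"stream_id": 7, "language": "en-US"},
)
restored = SpeechRequest.from_json(request.to_json())
```

Keeping the stream ID inside the optional context mirrors the text, which lists it among the context information rather than as a mandatory field.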
Fig. 4 shows a simplified flow diagram of processes undertaken within a voice processing system in accordance with the present invention. As noted hereinbefore, the processes shown in Fig. 4 can be distributed throughout the various components shown in Fig. 1 and Fig. 2 or may all be implemented in speech server 101. Step 401 of capturing a voice signal is typically performed by a microphone or other acoustic energy sensor within a voice appliance 102. Filtering, framing, and analog to digital conversion step 402 preferably takes place as close to step 401 as practicable to avoid transmission of analog voice signals. However, it is common to transfer voice signals over great distances using radio and telephony subsystems. Hence, the filtering, framing and ADC step 402 may occur within voice appliance 102, at a device coupled to appliance 102 through a voice or data communication network, or a combination of these locations. Moreover, filtering may be explicitly performed by analog or DSP filter circuits, or implicitly by bandwidth constraints or other transmission channel characteristics through which analog voice signals are passed.
In step 403, features of interest are identified within a voice signal. Features are patterns within the voice signal that have a likelihood of corresponding to phonemes, but phonemes are not directly extracted in step 403. Features may include, for example, mathematical properties such as frequency distribution, amplitude distribution, deviation, and the like of the processed voice signals.
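As a rough illustration of the kinds of mathematical properties mentioned for step 403, the sketch below computes a few per-frame statistics. The function name `frame_features`, the returned keys, and the use of a naive discrete Fourier transform are assumptions made for illustration; the disclosure does not specify which features are computed or how.

```python
import math


def frame_features(samples):
    """Compute simple amplitude and frequency statistics for one frame."""
    n = len(samples)
    if n == 0:
        raise ValueError("frame must contain at least one sample")
    mean = sum(samples) / n
    # Standard deviation of the sample amplitudes.
    deviation = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
    peak = max(abs(s) for s in samples)
    # Naive DFT magnitude spectrum for bins 0..n/2 (O(n^2), for clarity only).
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        spectrum.append(math.hypot(re, im))
    dominant_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return {"mean": mean, "deviation": deviation, "peak": peak,
            "spectrum": spectrum, "dominant_bin": dominant_bin}


# A test tone completing two cycles within a 16-sample frame.
tone = [math.sin(2 * math.pi * 2 * i / 16) for i in range(16)]
features = frame_features(tone)
```

For the test tone, the spectral energy concentrates in bin 2, matching the two cycles in the frame. A practical system would use a fast transform rather than this quadratic loop.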
These features can be used to infer or identify phonemes in step 404. Phonemes are abstract units of a phonetic system of a language that correspond to a set of similar speech sounds which are perceived to be a single distinctive sound in the language. Phonemes represent identifiable components within a speech stream that, while they do not have linguistic meaning in themselves, are units with high occurrence within a spoken language. In extracting phonemes from a speech signal, it is rare to make exact matches. It is often useful to associate a probability with one or more phonemes indicating a likelihood that the particular phoneme is in fact a correct representation of the speech signal. Once particular phonemes are identified, they can be used alone or in combination to create associations with tokens such as particular text, commands, or the like in step 405. Step 405 functions to associate the extracted features with tokens. In contrast with features, tokens are abstract units that have linguistic meaning. Words, phrases and punctuation symbols, for example, are tokens that carry linguistic meaning. However, there are a wide variety of more abstract tokens that represent actions, or other complex linguistic structures that are not literally reflected in words of a particular language. For example, a token or set of tokens may represent the concept "search government databases for information about patent litigation involving speech recognition systems"; however, the token may be an SQL query, a Java object, an XML document, or other abstraction of the actual verbiage that reflects the concept in the English language. A response is generated to application 204 including the tokens along with identification information that enables the application processing step 406 to associate the tokens with the voice signal that generated the request.
Many applications 204 can use a set of phoneme probabilities, or other feature sets, directly. For example, when application 204 is expecting a constrained choice between "yes" and "no", or expecting a single-digit numeric response (e.g., "one", "two", etc.), it is relatively easy to determine the correct token from the feature sets directly. In such cases, the present invention enables the feature extraction process 403 to supply features directly to application processing step 406 such that the step of associating tokens with features can be performed within the application 204 itself. Similarly, some applications may involve supplying phoneme probabilities directly from step 404 so that token association may occur elsewhere. The application processing step 406 implements the voice enabled behavior using the features, phonemes, and/or tokens that it is supplied in steps 403 through 405 respectively.
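The constrained-choice case can be sketched as follows. This is a deliberately naive illustration: the phoneme labels, the `VOCABULARY` table, and the scoring rule (averaging the best probability seen for each expected phoneme, with no time alignment) are all assumptions and are not the method of the disclosure.

```python
# Hypothetical phoneme spellings for the two permitted answers.
VOCABULARY = {
    "yes": ["y", "eh", "s"],
    "no": ["n", "ow"],
}


def best_token(frames, vocabulary):
    """Pick the vocabulary entry best supported by the phoneme probabilities.

    frames is a list of dicts mapping phoneme label -> probability,
    one dict per time slice, as an application might receive from step 404.
    """
    if not frames:
        raise ValueError("no phoneme probabilities supplied")

    def score(phonemes):
        # Average, over the expected phonemes, of the best probability seen.
        return sum(max(f.get(p, 0.0) for f in frames) for p in phonemes) / len(phonemes)

    return max(vocabulary, key=lambda token: score(vocabulary[token]))


observed = [
    {"y": 0.7, "n": 0.2},
    {"eh": 0.6, "ow": 0.3},
    {"s": 0.8},
]
answer = best_token(observed, VOCABULARY)
```

Because the application knows only two answers are possible, even this crude scoring separates them; a larger vocabulary would need alignment of phonemes in time.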
Fig. 5 illustrates an important concept that speech server 101 can be implemented as a distributed computing system using distributed hardware and software resources that are coupled together by network 501. This enables, among other things, the provision of differential levels of service and parallel request processing for improved performance. In one example, the various services shown in Fig. 5 may communicate with each other directly through network 501 to pass a voice processing task from service to service until it is completed. Alternatively, the individual services 502-506 may communicate directly with voice portal 110 and/or voice appliance 102 to perform their functions without knowledge of the other components within speech server 101. In a purely HTTP implementation, the various services can receive an HTTP request, perform their process, and return the partially or completely processed data to the requesting voice portal 110 or voice appliance 102 using HTTP redirection mechanisms as needed, passing the partially processed data as attributes within the HTTP redirection responses.
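The first arrangement, in which a task is handed from service to service until it is complete, can be sketched with plain function calls standing in for network hops. The service functions, the task dictionary layout, and `run_chain` are hypothetical stand-ins; no real network or HTTP traffic is modeled here.

```python
def dsp_service(task):
    # Stand-in for DSP service 502: records that filtering was applied.
    return {**task, "stages": task["stages"] + ["dsp"]}


def feature_service(task):
    # Stand-in for feature extraction process 503.
    return {**task, "stages": task["stages"] + ["features"]}


def phoneme_service(task):
    # Stand-in for mapping process 504.
    return {**task, "stages": task["stages"] + ["phonemes"]}


def run_chain(task, services):
    """Pass the task through each service in order and return the result."""
    for service in services:
        task = service(task)
    return task


result = run_chain(
    {"speaker_id": "spk-1", "stages": []},
    [dsp_service, feature_service, phoneme_service],
)
```

In the alternative arrangement described above, the voice portal or appliance would call each service itself and carry the intermediate result between calls, so the services need no knowledge of one another.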
In the implementation of Fig. 5, speech server 101 comprises signal processing services such as service 502 that provides digital signal processing services. While DSP functionality is often implemented within a voice appliance 102 itself, thin clients may lack the resources to perform DSP processes. Moreover, even where a voice appliance 102 could perform DSP functions, it may be desirable to have DSP service 502 perform the functions to achieve the benefits of faster, more powerful processors and up-to-date algorithms that can be implemented in DSP service 502. DSP service 502 receives a digitized voice signal, for example, and returns a processed digital signal that may implement filtering step 402 (shown in Fig. 4).
Process 503 provides the feature extraction services described hereinbefore. Process 504 provides, for example, feature-to-phoneme mapping as a specific implementation of steps 404 and 405. Alternatively, process 505 provides features and/or phonemes-to-command mapping as an implementation of step 406. A number of processes 505 may be provided for particular applications so that the command mapping is specific to a particular application 204. Requests can be directed to particular instances of process 505 to meet the needs of the application associated with the request.
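Directing a request to an application-specific instance of process 505 amounts to selecting a command table by application. The sketch below assumes that each application registers its own phoneme-to-command table; the table contents, the application names, and `map_to_command` are illustrative assumptions.

```python
# Hypothetical per-application command tables, one per instance of process 505.
COMMAND_TABLES = {
    "media_player": {("p", "l", "ey"): "PLAY", ("s", "t", "aa", "p"): "STOP"},
    "thermostat": {("s", "t", "aa", "p"): "HOLD_TEMPERATURE"},
}


def map_to_command(application, phonemes):
    """Map a phoneme sequence to a command using the application's own table."""
    table = COMMAND_TABLES.get(application)
    if table is None:
        raise KeyError(f"no command mapping for application {application!r}")
    return table.get(tuple(phonemes), "UNRECOGNIZED")


stop_phonemes = ["s", "t", "aa", "p"]
player_command = map_to_command("media_player", stop_phonemes)
thermostat_command = map_to_command("thermostat", stop_phonemes)
```

The same utterance yields different commands depending on which application the request is associated with, which is the point of keeping the mapping application-specific.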
Language translation process 506 illustrates a more complex voice processing process contemplated by the present invention. Language translation process 506 may receive text from process 504, for example, and perform a language-to-language translation (e.g., English to Spanish). This results in text returned to the requesting voice portal 110 or voice appliance 102 in a different language than was originally spoken. A number of complex services similar to language translation service 506 will be apparent to those skilled in the art. For example, a speaker's utterances may be translated into properly formed C++ code constructs as a program authoring tool, properly formed SQL queries to be applied to a database, and the like.
To this point, the present invention has been described as a mechanism to receive speech signals and translate them into some other meaningful form. Additionally, it is contemplated that speech services 101 may also provide services that convert abstract tokens into speech signals that can be audibly presented through an appliance 102 having a speaker. Fig. 6 illustrates an embodiment in which station-to-station duplex voice exchange is implemented between voice appliances 102 in a manner that is functionally analogous to conventional telephone service.
Conventional phone service provides a medium in which all parties to a conversation hear a voice that sounds substantially like the speaker. Although conventional telephone service limits audio bandwidth to reduce the amount of data that is being transported, it remains a very inefficient means to communicate voice information between two points. The implementation of the present invention shown in Fig. 6 enables high levels of compression while retaining the benefits of conventional phone service.
In general, the embodiments shown in Fig. 6 use speaker-dependent voice models or signatures to compress the audio information that is transmitted between stations. Each speaker is associated with a TX signature used to convert the speaker's voice into tokens (as described hereinbefore), and an RX signature used to convert tokens into a replica of the speaker's voice. The RX signatures are akin to TX signatures in that they implement mappings between tokens and voice signals using either mapping data structures and algorithms, or neural networks, or both.
In a first implementation shown in the upper portion of Fig. 6, each appliance 102 implements processes to use speaker dependent signature files to encode and decode voice signals. The appliance 102 used by speaker 1, for example, includes a TX signature file for speaker 1, and an RX signature file for speaker 2. The appliance 102 used by speaker 2, conversely, includes a TX signature file for speaker 2, and an RX signature file for speaker 1. At the appliance 102 operated by speaker 1, speaker 1's voice is encoded to tokens and tokens received from the appliance 102 operated by speaker 2 are decoded into audible speech signals that can be presented through a speaker. Conversely, at the appliance 102 operated by speaker 2, speaker 2's voice is encoded to tokens and tokens received from the appliance 102 operated by speaker 1 are decoded into audible speech signals that can be presented through a speaker. In this manner, station-to-station voice communication is enabled with highly compressed data communication between the stations.
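The pairing of signatures in the first implementation can be sketched with lookup tables. This is a toy model under strong assumptions: voice segments are represented by string labels rather than audio, tokens are small integers, and the signatures are plain dictionaries, whereas the disclosure contemplates mapping data structures, neural networks, or both. The class `Appliance` and all table contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Appliance:
    tx_signature: dict  # local speaker: voice unit -> token
    rx_signature: dict  # remote speaker: token -> replica voice unit

    def encode(self, voice_units):
        """Convert the local speaker's voice units into tokens for transmission."""
        return [self.tx_signature[unit] for unit in voice_units]

    def decode(self, tokens):
        """Convert received tokens into a replica of the remote speaker's voice."""
        return [self.rx_signature[token] for token in tokens]


speaker1_tx = {"s1:hello": 1, "s1:world": 2}
speaker1_rx = {1: "s1:hello", 2: "s1:world"}
speaker2_tx = {"s2:hi": 1}
speaker2_rx = {1: "s2:hi"}

# Each appliance holds its own speaker's TX signature and the other's RX signature.
appliance_1 = Appliance(tx_signature=speaker1_tx, rx_signature=speaker2_rx)
appliance_2 = Appliance(tx_signature=speaker2_tx, rx_signature=speaker1_rx)

tokens = appliance_1.encode(["s1:hello", "s1:world"])
replica = appliance_2.decode(tokens)
```

Only the short token list crosses the link between the stations, which is where the compression comes from; the replica is rebuilt at the far end from speaker 1's RX signature.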
In an alternative implementation shown in the lower portion of Fig. 6, appliances 102 communicate through voice portals 110 such that some or all of the voice processing functions required to associate voice signals with tokens are performed by the voice portals 110. In such a case, voice appliances 102 may be substantially conventional telephone sets. Voice portals 110 perform the token to voice signal mapping functions transparently and communicate voice signals with the respective voice appliances 102. The communication between voice portals 110, however, can be compressed as described above, greatly reducing the overall network bandwidth consumed by a given station-to-station voice communication.
Although the invention has been described and illustrated with a certain degree of particularity, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the combination and arrangement of parts can be resorted to by those skilled in the art without departing from the spirit and scope of the invention, as hereinafter claimed.

Claims

WE CLAIM:
1. A method of speech recognition comprising the acts of: receiving speech signals into a front-end processor; storing at least some resources used for speech recognition in a network-attached server; coupling the front-end processor to the network-attached server to perform the speech recognition.
2. The method of claim 1 wherein the front-end processor comprises a minimally trained voice recognition system.
3. The method of claim 1 further comprising identifying a speaker in the front-end processor.
4. The method of claim 1 wherein the stored resources comprise a speaker signature data structure comprising at least one speaker dependent voice model.
5. The method of claim 1 further comprising the acts of: transmitting the speech signal to the network-attached server; and performing speech recognition of the speech signal in the network-attached processor.
6. The method of claim 5 further comprising the acts of: returning a response from the network-attached server to the front-end processor comprising phoneme probabilities corresponding to the speech signal.
7. The method of claim 5 further comprising the acts of: returning a response from the network-attached server to the front-end processor comprising text corresponding to the speech signal.
8. The method of claim 1 wherein the front-end processor comprises a trainable neural network and the stored resources comprise a neural network training file.
9. The method of claim 1 wherein the stored resources comprise voice signal:value pairs associating a particular value with each voice signal.
10. A speech recognition server comprising: a network interface configured to receive a request; an identification of a speaker associated with each request; speaker dependent signature data structures stored in the speech recognition server; and means for generating a response including speaker-dependent voice recognition resources in response to the received request.
11. The server of claim 10 further comprising encoded speech signals within each request.
12. The server of claim 11 wherein the means for generating a response comprises a neural network operable to receive the encoded speech signals and generate an output comprising values representing the language-based content of the encoded speech signals.
13. The server of claim 11 wherein the response includes phoneme probabilities corresponding to the speech signal.
14. The server of claim 11 wherein the response includes text corresponding to the speech signal.
15. The server of claim 10 further comprising speaker group signature data structures stored in the speech recognition server, wherein the means for generating a response includes the speaker group resources when speaker-dependent resources are not available.
16. The server of claim 10 wherein the means for generating a response generates a response including a voice model corresponding to the identified speaker.
17. The server of claim 10 further comprising: an interface for receiving a feedback message corresponding to a response, the feedback message indicating whether the voice recognition resources supplied in the response were useful; and processes executing within the server for adapting the speaker dependent signature data structures in response to the feedback message.
18. A speech recognition system comprising: a centralized resource of shared, speaker-dependent speech recognition resources; two or more applications having processes for receiving a voice signal and communicating with the centralized resource over a network to perform speech recognition on the voice signal.
19. The speech recognition system of claim 18 wherein the speaker-dependent speech recognition resources comprise voice models of individual speakers.
20. The speech recognition system of claim 18 wherein the speaker-dependent speech recognition resources comprise voice models of groups of speakers.
21. The speech recognition system of claim 18 wherein the speaker-dependent speech recognition resources comprise feature extraction processes.
22. The speech recognition system of claim 21 wherein the speaker-dependent speech recognition resources comprise processes operable to associate linguistic tokens with the extracted features.
23. The speech recognition system of claim 18 wherein the speaker-dependent speech recognition resources comprise speech samples for individual speakers.
24. The speech recognition system of claim 18 further comprising: a feedback message generated by one of the applications to the centralized resource, the feedback message indicating efficacy of the shared speech recognition resource; and processes within the centralized resource operable to use the feedback message to adapt the shared speech recognition resource to improve efficacy.
25. A speech-enabled software application comprising: a first interface for receiving a voice signal from a speaker; a second interface for sending the voice signal over a network to a centralized speech recognition server; a third interface for receiving phoneme probabilities from the speech recognition server corresponding to the voice signal; and processes for converting using the phoneme probabilities to launch speech-enabled functions for the speaker.
26. A method of creating a speech sample database comprising: accepting a voice recognition task at an application; communicating the voice recognition task to a centralized resource; performing the voice recognition task at the centralized resource; causing the application to evaluate correctness of the speech recognition; and storing the speech sample from the task with its recognition result in a speech sample database.
27. The method of claim 26 further comprising storing context information metadata describing the speaker with an association to the sample.
28. The method of claim 26 further comprising providing speech samples meeting specified criteria to external entities.
PCT/US2002/012574 2001-04-20 2002-04-19 Speech recognition system WO2002086862A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/838,973 US6785647B2 (en) 2001-04-20 2001-04-20 Speech recognition system with network accessible speech processing resources
US09/838,973 2001-04-20

Publications (1)

Publication Number Publication Date
WO2002086862A1 (en) 2002-10-31

Family

ID=25278533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/012574 WO2002086862A1 (en) 2001-04-20 2002-04-19 Speech recognition system

Country Status (2)

Country Link
US (1) US6785647B2 (en)
WO (1) WO2002086862A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5475792A (en) * 1992-09-21 1995-12-12 International Business Machines Corporation Telephony channel simulator for speech recognition application
US6021387A (en) * 1994-10-21 2000-02-01 Sensory Circuits, Inc. Speech recognition apparatus for consumer electronic applications
US6092045A (en) * 1997-09-19 2000-07-18 Nortel Networks Corporation Method and apparatus for speech recognition
US6163769A (en) * 1997-10-02 2000-12-19 Microsoft Corporation Text-to-speech using clustered context-dependent phoneme-based units

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994014270A1 (en) * 1992-12-17 1994-06-23 Bell Atlantic Network Services, Inc. Mechanized directory assistance
US5664058A (en) * 1993-05-12 1997-09-02 Nynex Science & Technology Method of training a speaker-dependent speech recognizer with automated supervision of training sufficiency
JPH07210190A (en) 1993-12-30 1995-08-11 Internatl Business Mach Corp <Ibm> Method and system for voice recognition
GB2323693B (en) * 1997-03-27 2001-09-26 Forum Technology Ltd Speech to text conversion
US6246987B1 (en) * 1998-02-04 2001-06-12 Alcatel Usa Sourcing, L.P. System for permitting access to a common resource in response to speaker identification and verification
US6587822B2 (en) * 1998-10-06 2003-07-01 Lucent Technologies Inc. Web-based platform for interactive voice response (IVR)
US6519562B1 (en) * 1999-02-25 2003-02-11 Speechworks International, Inc. Dynamic semantic control of a speech recognition system
US6633846B1 (en) * 1999-11-12 2003-10-14 Phoenix Solutions, Inc. Distributed realtime speech recognition system
ATE235773T1 (en) * 2000-05-16 2003-04-15 Swisscom Ag VOICE PORTAL HOST COMPUTER AND METHOD
US20020072916A1 (en) * 2000-12-08 2002-06-13 Philips Electronics North America Corporation Distributed speech recognition for internet access
US20020091527A1 (en) * 2001-01-08 2002-07-11 Shyue-Chin Shiau Distributed speech recognition server system for mobile internet/intranet communication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DOBRISEK ET AL.: "HOMER: a voice driven text-to-speech system for the blind", ISIE'99, 1999, pages 205 - 208, XP002951180 *
HATAOKA N. ET AL.: "Speech recognition system for automatic telephone operator based on CSS architecture", IVTTA'94, 1994, pages 77 - 80, XP010124366 *

Also Published As

Publication number Publication date
US6785647B2 (en) 2004-08-31
US20020156626A1 (en) 2002-10-24

AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP