CA2308950A1 - Method and apparatus for controlling voice controlled devices

Method and apparatus for controlling voice controlled devices

Info

Publication number
CA2308950A1
Authority
CA
Canada
Prior art keywords
voice controlled
controlled device
name
command
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002308950A
Other languages
French (fr)
Inventor
Michael Geilhufe
David Macmillan
Avraham Barel
Amos Brown
Karin Lissette Bootsma
Lawrence Kent Gaddy
Phillip Paul Pyo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Winbond Electronics Corp
Original Assignee
Information Storage Devices, Inc.
Michael Geilhufe
David Macmillan
Avraham Barel
Amos Brown
Karin Lissette Bootsma
Lawrence Kent Gaddy
Phillip Paul Pyo
Winbond Electronics Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Storage Devices, Inc., Michael Geilhufe, David Macmillan, Avraham Barel, Amos Brown, Karin Lissette Bootsma, Lawrence Kent Gaddy, Phillip Paul Pyo, and Winbond Electronics Corporation.
Publication of CA2308950A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G06F3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Abstract

Voice controlled devices with speech recognition have user-assignable appliance names and default appliance names to address and control the voice controlled devices. Methods of controlling voice controlled devices include addressing a voice controlled device by name and providing a command.

Description

METHOD AND APPARATUS
FOR
CONTROLLING VOICE CONTROLLED DEVICES
MICROFICHE APPENDIX
This application contains a microfiche appendix, not printed herewith, entitled "ISD-SR 300, Embedded Speech Recognition Processor" by Information Storage Devices, Inc., which is hereby incorporated by reference, verbatim and with the same effect as though it were fully and completely set forth herein.
FIELD OF THE INVENTION
This invention relates generally to machine interfaces. More particularly, the invention relates to voice user interfaces for devices.
BACKGROUND OF THE INVENTION
Graphical user interfaces (GUIs) for computers are well known. GUIs provide an intuitive and consistent manner for human interaction with computers. Generally, once a person learns how to use a particular GUI, they can operate any computer or device which operates using the same or a similar GUI. Examples of popular GUIs are MAC OS by Apple and MS Windows by Microsoft. GUIs are now being ported to other devices. For example, the MS Windows GUI has been ported from computers to palmtops, personal organizers, and other devices so that there is a common GUI amongst a number of differing devices.
However, as the name implies, GUIs require at least some sort of visual or graphical display and an input device such as a keyboard, mouse, touch pad or touch screen. The displays and input devices take up space in a device, require additional components, and increase its cost. Thus, it is desirable to eliminate displays and input devices from devices to save costs.
Recently, voice user interfaces (VUIs) have been introduced that utilize speech recognition methods to control a device. However, these prior art VUIs have a number of shortcomings that prohibit them from being universally utilized in all devices. Prior art VUIs are usually difficult to use. Prior art VUIs usually require some sort of display device such as an LCD, or require a manual input device such as keypads or buttons, or require both a display and a manual input device. Additionally, prior art VUIs usually are proprietary and restricted in use to a single make or model of hardware device, or a single type of software application. They usually are not widely available, unlike computer operating systems, and accordingly software programmers cannot write applications that operate with the VUI in a variety of device types.
Commands associated with prior art VUIs are usually customized for that single type of device or software application. Prior art VUIs usually have additional limitations in supporting multiple users, such as how to handle personalization and security. Furthermore, prior art VUIs require that a user know of the existence of the device in advance. Prior art VUIs have not provided ways of determining the presence of devices.
Additionally, prior art VUIs usually require a user to read instruction manuals or screen-displayed commands to become trained in their use. Prior art VUIs usually do not include audible methods for a user to learn commands. Furthermore, a user may be required to learn how to use multiple prior art VUIs when utilizing multiple voice controlled devices due to a lack of standardization.
Generally, devices controlled by VUIs continue to require some sort of manual control of functions. With some manual control required, a manual input device such as a button, keypad, or a set of buttons or keypads is provided. To assure proper manual entry, a display device such as an LCD, LED, or other graphics display device may be provided. For example, many voice activated telephones require that telephone numbers be stored manually. In this case a numeric keypad is usually provided for manual entry. An LCD is usually included to assure proper manual entry and to display the status of the device. A speech synthesis or voice feedback system may be absent from these devices. The addition of buttons and display devices increases the manufacturing cost of devices. It is desirable to eliminate all manual input and display from devices in order to decrease costs. Furthermore, it is more convenient to remotely control devices without requiring specific buttons or displays.
Previously, voice controlled devices were used by few users. Additionally, they used near field microphones to listen locally for voices. Many prior devices were fixed in some manner, not readily portable, or were server based systems. It is desirable to provide voice control capability for portable devices. It is desirable to provide either near field or far field microphone technology in voice controlled devices. It is desirable to provide low cost voice control capability such that it is included in more devices. However, these desires raise a problem when multiple users of multiple voice controlled devices are in the same area. With multiple users and multiple voice controlled devices within audible range of each other, it is difficult for voice controlled devices to discern which user to accept commands from and respond to. For example, consider the case of voice controlled cell phones where one user in an environment of multiple users wants to call home. The user issues a voice activated call home command. If more than one voice controlled cell phone audibly hears the call home command, multiple voice controlled cell phones may respond and start dialing a home telephone number.
Previously this was not as significant a problem because there were few voice controlled devices.
Some voice controlled devices are speaker dependent. Speaker dependency refers to a voice controlled device that requires training by a specific user before it may be used with that user. A speaker dependent voice controlled device listens for tonal qualities in how phrases are spoken. Speaker dependent voice controlled devices do not lend themselves to applications where multiple users or speakers are required to use the voice controlled device. This is because they fail to efficiently recognize speech from users by whom they have not been trained. It is desirable to provide speaker independent voice controlled devices with a VUI requiring little or no training in order to recognize speech from any user.
In order to achieve high accuracy speech recognition, it is important that a voice controlled device avoid responding to speech that isn't directed to it. That is, voice controlled devices should not respond to background conversation, to noises, or to commands to other voice controlled devices. However, filtering out background sounds must not be so effective that it also prevents recognition of speech directed to the voice controlled device. Finding the right mix of rejection of background sounds and recognition of speech directed to a voice controlled device is particularly challenging in speaker-independent systems. In speaker-independent systems, the voice controlled device must be able to respond to a wide range of voices, and therefore cannot use a highly restrictive filter for background sounds. In contrast, a speaker-dependent system need only listen for a particular person's voice, and thus can employ a more stringent filter for background sounds. Despite this advantage in speaker-dependent systems, filtering out background sounds is still a significant challenge.
In some prior art systems, background conversation has been filtered out by having a user physically press a button in order to activate speech recognition. The disadvantage of this approach is that it requires the user to interact with the voice controlled device physically, rather than strictly by voice or speech.
One of the potential advantages of voice controlled devices is that they offer the promise of true hands-free operation. Elimination of the need to press a button to activate speech recognition would go a long way to making this hands-free objective achievable.
Additionally, in locations with a number of people talking, a voice controlled device should disregard all speech unless it is directed to it. For example, if a person says to another person "I'll call John", the cellphone in his pocket should not interpret the "call John" as a command. If there are multiple voice controlled devices in one location, there should be a way to uniquely identify which voice controlled device a user wishes to control. For example, consider a room that may have multiple voice controlled telephones - perhaps a couple of desktop phones, and multiple cellphones, one for each person. If someone were to say "Call 555-1212", each phone may try to place the call unless there was a means for them to disregard certain commands. In the case where a voice controlled device is to be controlled by multiple users, it is desirable for the voice controlled device to know which user is commanding it. For example, a voice controlled desktop phone in a house may be used by a husband, wife and child. Each could have their own phonebook of frequently called numbers. When the voice controlled device is told "Call Mother", it needs to know which user is issuing the command so that it can call the right person (i.e. should it call the husband's mother, the wife's mother, or the child's mother at her work number?). Additionally, a voice controlled device with multiple users may need a method to enforce security to protect it from unauthorized use or to protect a user's personalized settings from unintentional or malicious interactions by others (including snooping, changing, deleting, or adding to the settings). Furthermore, in a location where there are multiple voice controlled devices, there should be a way to identify the presence of voice controlled devices. For example, consider a traveler arriving at a new hotel room. Upon entering the hotel room, the traveler would like to know what voice controlled devices may be present and how to control them. It is desirable that the identification process be standardized so that all voice controlled devices may be identified in the same way.
In voice controlled devices, it is desirable to store phrases under voice control. A phrase is defined as a single word, or a group of words treated as a unit.
This storing might be to set options or create personalized settings. For example, in a voice-controlled telephone, it is desirable to store people's names and phone numbers under voice control into a personalized phone book. At a later time, this phone book can be used to call people by speaking their name (e.g. "Cellphone call John Smith", or "Cellphone call Mother").
Prior art approaches to storing the phrase ("John Smith") operate by storing the phrase in a compressed, uncompressed, or transformed manner that attempts to preserve the actual sound. Detection of the phrase in a command (i.e. detecting that John is to be called in the example above) then relies on a sound-based comparison between the original stored speech sound and the spoken command. Sometimes the stored waveform is transformed into the frequency domain and/or is time adjusted to facilitate the match, but in any case the fundamental operation being performed is one that compares the actual sounds. This sound-based storage and comparison approach suffers from a number of disadvantages. If a speaker's voice changes, perhaps due to a cold, stress, fatigue, a noisy or distorting connection by telephone, or other factors, the comparison typically is not successful and stored phrases are not recognized. Because the phrase is stored as a sound representation, there is no way to extract a text-based representation of the phrase.
Additionally, storing a sound representation results in a speaker dependent system. It is unlikely that another person could speak the same phrase using the same sounds in a command and have it be correctly recognized. It would not be reliable, for example, for a secretary to store phonebook entries and a manager to make calls using those entries. It is desirable to provide a speaker independent storage means. Additionally, if the phrases are stored as sound representations, the stored phrases cannot be used in another voice controlled device unless the same waveform processing algorithms are used by both voice controlled devices. It is desirable to recognize spoken phrases and store them in a representation such that, once stored, the phrases can be used for speaker independent recognition and can be used by multiple voice controlled devices.
Presently, computers and other devices communicate commands and data to other computers or devices using modem, infrared, or wireless radio frequency transmission. The transmitted commands and/or data are usually of a digital form that only the computer or device may understand. In order for a human user to understand the command or data, it must be decoded by a computer and then displayed in some sort of format, such as a number or ASCII text on a display. When the command and/or data are transmitted, they are usually encoded in some digital format understood by the computer, devices, or transmitting equipment. As voice controlled devices become more prevalent, it will be desirable for voice controlled devices to communicate with each other using human-like speech in order to avoid providing additional circuitry for communication between voice controlled devices. It is further desirable to allow multiple voice controlled devices to exchange information machine-to-machine without human user intervention.
BRIEF SUMMARY OF THE INVENTION
The present invention includes a method, apparatus and system as described in the claims. Briefly, a standard voice user interface is provided to control various devices by using standard speech commands. The standard VUI provides a set of standard VUI commands and syntax for the interface between a user and the voice controlled device. The standard VUI commands include an identification phrase to determine if voice controlled devices are available in an environment. Other standard VUI commands provide for determining the names of the voice controlled devices and altering them.
Voice controlled devices are disclosed. A voice controlled device is defined herein as any device that is controlled by speech, which is either audible or non-audible. A voice controlled device may also be referred to herein as an appliance, a machine, a voice controlled appliance, a voice controlled electronic device, a name activated electronic device, a speech controlled device, a voice activated electronic appliance, a voice activated appliance, a voice controlled electronic device, or a self-identifying voice controlled electronic device.
In order to gain access to the functionality of voice controlled devices, a user communicates to the voice controlled device one of its associated appliance names after a period of relative silence. The appliance name may be a default name or a user-assignable name.
The voice controlled device may have a plurality of user-assignable names associated with it for providing personalized functionality to each user.
Other aspects of the present invention are described in the detailed description.
BRIEF DESCRIPTIONS OF THE DRAWINGS
FIG. 1A is an illustration of an environment containing voice controlled devices of the present invention.
FIG. 1B is an illustration of remote communications with the voice controlled devices in the environment illustrated in FIG. 1A.
FIG. 2 is an illustration of exemplary voice controlled devices.
FIG. 3 is a detailed block diagram of the voice controlled device of the present invention.
FIG. 4 is a detailed block diagram of a voice communication chip.
FIG. 5 is a block diagram of the standard voice user interface of the present invention.
FIGs. 6A-6C are flow charts of the core command structure for the standard voice user interface of the present invention.

FIGS. 6D-6E are flow charts of the telephone command structure for the standard voice user interface of the present invention.
FIG. 7 is a flow chart of the "Store Name" telephone command structure for the standard voice user interface of the present invention.
FIG. 8 is a flow chart of the "Delete Name" telephone command structure for the standard voice user interface of the present invention.
FIGs. 9A-9B are flow charts of the "GETYESNO" function for the standard voice user interface of the present invention.
FIGS. 10A-10C are flow charts of the "GETRESPONSE" function for the standard voice user interface of the present invention.
FIG. 11 is a flow chart of the "GETRESPONSEPLUS" function for the standard voice user interface of the present invention.
FIG. 12 is a flow chart of the "LISTANDSELECT" function for the standard voice user interface of the present invention.
FIG. 13 is a block diagram of a pair of voice controlled devices communicating using the standard voice user interface of the present invention.
Like reference numbers and designations in the drawings indicate like elements providing similar functionality.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be obvious to one skilled in the art that the present invention may be practiced without these specific details. In other instances well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
The present invention includes a method, apparatus and system for a standard voice user interface and voice controlled devices. Briefly, a standard voice user interface is provided to control various devices by using standard speech commands. The standard VUI provides a set of core VUI commands and syntax for the interface between a user and the voice controlled device. The core VUI commands include an identification phrase to determine if voice controlled devices are available in an environment. Other core VUI commands provide for determining the names of the voice controlled devices and altering them.
Voice controlled devices are disclosed. A voice controlled device is defined herein as any device that is controlled by speech, which is either audible or non-audible. Audible and non-audible are defined herein later. A voice controlled device may also be referred to herein as an appliance, a machine, a voice controlled appliance, a voice controlled electronic device, a name activated electronic device, a speech controlled device, a voice activated electronic appliance, a voice activated appliance, a voice controlled electronic device, or a self-identifying voice controlled electronic device.
The present invention is controlled by and communicates using audible and non-audible speech.
Speech as defined herein for the present invention encompasses a) a signal or information, such that if the signal or information were passed through a suitable device to convert it to variations in air pressure, the signal or information could be heard by a human being and would be considered language, and b) a signal or information comprising actual variations in air pressure, such that if a human being were to hear the signal, the human would consider it language. Audible speech refers to speech that a human can hear unassisted. Non-audible speech refers to any encodings or representations of speech that are not included under the definition of audible speech, including that which may be communicated outside the hearing range of humans and transmission media other than air. The definition of speech includes speech that is emitted from a human and emitted from a machine (including machine speech synthesis, playback of previously recorded human speech such as prompts, or other forms).
Prompts which are communicated by a voice controlled device and phrases which are communicated by a user may be in languages or dialects other than English or a combination of multiple languages. A
phrase is defined herein as a single word, or a group of words treated as a unit. A user, as defined herein, is a human or a device, including a voice activated device.
Hence "a user's spoken phrase", "a user issuing a command", and all other actions by a user include actions by a device and by a human.
Voice controlled devices include some type of speech recognition in order to be controlled by speech.
Speech recognition and voice recognition are used synonymously herein. Preferably, speaker independent speech recognition systems are used to provide the speech recognition capability of the voice controlled devices. Speaker independent speech recognition systems are responsive to speaker-independent representations of speech. In the preferred embodiment, a speaker-independent representation of speech is a phonetic representation of speech. However, other speaker-independent representations of speech may also be used in accordance with the present invention.
In order to gain access to the full functionality of a voice controlled device with the present invention, a user must communicate to the voice controlled device one of its associated appliance names. The appliance name may include one or more default names or one or more user-assignable names. A voice controlled device may have a plurality of user-assignable names associated with it in order to provide personalized functionality to each user.
Additionally, the present invention provides a way to leave a speech recognition engine on throughout ongoing conversations (including local conversations or those over a telephone link), without having it be falsely triggered by background noise or speech that is not directed to it. To accomplish this, the invention makes use of a naming scheme for voice controlled devices provided by the standard VUI of the present invention. In general, unless a voice controlled device is addressed by its appliance name, it will disregard all speech. (There are a couple of special exceptions to this rule that will be discussed later.) In certain cases the criteria for recognizing a command may be further tightened, requiring a voice controlled device to be addressed by its user-assigned appliance name. A voice controlled device may have multiple users, each of whom assigns it a unique appliance name using commands of the standard VUI of the present invention. When a voice controlled device is addressed by one of its user-assigned names, the voice controlled device can determine both that it is being addressed and which user is addressing it. This allows the voice controlled device to use the personalized settings for that particular user. For example, a voice-activated telephone might have four different user-assigned names (e.g. Aardvark, Platypus, Socrates, and Zeus), and each user might have a different telephone number associated with the phonebook entry for Mother. When the first user says "Aardvark call Mother", the first user's mother is called. When the second user says "Platypus call Mother", the second user's mother is called. The command "Geronimo call Mother" would not be acted on by this voice controlled device, since Geronimo is not one of its appliance names.
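For illustration only, the following Python sketch models this name-based dispatch with per-user phonebooks. It is not part of the disclosed embodiment; the data structures, phone numbers, and helper names are hypothetical assumptions.

```python
# Hypothetical sketch of name-based command dispatch with per-user
# personalization. All names and numbers are illustrative assumptions,
# not part of the patented implementation.

# Each user-assigned appliance name maps to that user's settings.
APPLIANCE_NAMES = {
    "aardvark": {"mother": "555-0101"},   # first user's phonebook
    "platypus": {"mother": "555-0202"},   # second user's phonebook
}

def handle_utterance(phrase: str) -> None:
    """Act on a recognized phrase of the form '<name> <command> <args>'."""
    words = phrase.lower().split()
    if not words or words[0] not in APPLIANCE_NAMES:
        return  # Not addressed by one of our appliance names: disregard.
    phonebook = APPLIANCE_NAMES[words[0]]   # that user's settings
    if len(words) >= 3 and words[1] == "call":
        entry = " ".join(words[2:])
        number = phonebook.get(entry)
        if number:
            print(f"Dialing {number} for entry '{entry}'")

handle_utterance("Aardvark call mother")   # dials the first user's mother
handle_utterance("Geronimo call mother")   # ignored: not one of our names
```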
Another aspect of the present invention improves the recognition accuracy of voice controlled devices.
The present invention collectively improves recognition accuracy by requiring, first, a period of relative silence prior to a phrase directed at the voice controlled device; second, the appliance name; and third, a valid command. Complete silence is not necessary, but a relative silence is needed, where relative silence is defined as a sound level that is quieter than the sound level while the phrase is being spoken. The specific period of relative silence required, and the allowed decibel difference between the relative silence and the sound intensity of the spoken phrase directed at the voice controlled device, will depend on the type of voice controlled device being built, its intended operating environment, the capabilities of the speech recognition system used, and other factors. In some cases, the duration and/or decibel difference of relative silence required may also be varied by the voice controlled device or associated circuits or software, so as to maximize the recognition accuracy obtained in that particular circumstance. In accordance with the standard VUI, each user can assign a voice controlled device a unique name or use a default appliance name. After communicating the appliance name to a voice controlled device, a command must be spoken.
Valid input at this point includes special phrases like "Help" or "Cancel", which are part of the standard VUI
grammar. If a valid command is not recognized, the voice controlled device rejects the entire sequence and returns to the state where it is waiting for silence.
Additionally, depending on the command, one or more additional phrases, typically representing modifiers to the command, may be provided or required (for example, the phone number in the command sequence "<silence> Call 555-1212"). Valid phrases at this point also include special phrases like "Help" or "Cancel", which are part of the standard VUI grammar. Failure to detect valid phrases after the command within a short period of time can be used as a basis for rejecting the entire command sequence, or for prompting the user to clarify his intentions. Either way, this serves as an additional level of accuracy checking. Alternatively, if a phrase is not detected during the short period of time after the command, the command may be performed anyway.
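A minimal sketch of this silence-name-command gating follows, assuming illustrative timing and decibel thresholds (the patent deliberately leaves these device-dependent). All identifiers are hypothetical.

```python
from __future__ import annotations

# Hypothetical sketch of the <silence> + name + command acceptance rule.
# The window and margin values are illustrative assumptions; a real
# device would tune them to its environment and recognizer.

SILENCE_WINDOW_S = 0.5     # assumed required period of relative silence
SILENCE_MARGIN_DB = 15.0   # assumed margin between the gap and the phrase

def accept_sequence(gap_seconds: float, gap_level_db: float,
                    phrase_level_db: float, words: list[str],
                    names: set[str], commands: set[str]) -> list[str] | None:
    """Return the command words, or None to reject the whole sequence."""
    if gap_seconds < SILENCE_WINDOW_S:
        return None   # no preceding period of relative silence
    if phrase_level_db - gap_level_db < SILENCE_MARGIN_DB:
        return None   # the gap was not quiet enough relative to the phrase
    if not words or words[0] not in names:
        return None   # not addressed by one of this device's names
    if len(words) < 2 or words[1] not in commands | {"help", "cancel"}:
        return None   # no valid command: reject and wait for silence again
    return words[1:]

print(accept_sequence(0.8, 45.0, 62.0,
                      ["telephone", "call", "555-1212"],
                      {"telephone"}, {"call", "hangup"}))
```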
Voice controlled devices can be identified either by visual identification, or acoustic identification, or both. Acoustic identification is defined as including both audible and non-audible communications with the voice controlled device. Audible and non-audible are defined elsewhere. Visual identification can occur through use of a standard logo or other visual identifier. A blinking LED is another example of a visual identifier. Visual identification is particularly appropriate for voice controlled devices that do not have a speech recognition engine that is always turned on. For example, to minimize battery consumption, battery operated voice controlled devices may require the user to push a switch (or its equivalent, such as flipping open a flip-type cellphone) to activate the speech recognition engine. Acoustic identification only works for voice controlled devices that are actively listening for recognizable commands.
Acoustic identification is accomplished by a user saying an identification phrase. An example of an identification phrase is "What is out there?". A voice controlled device may have one or more identification phrases. Any voice controlled device that hears its identification phrase responds to identify its presence.
In accordance with the standard VUI, the response is a random delay of up to 2 seconds of silence, followed by a standard signal (for example, one or more tones or beeps or other sounds), then at least one of the voice controlled device's appliance names, and any applicable basic operation instructions (e.g. "<beep> I am Telephone. You can say Telephone help."). In order to coordinate responses from multiple voice controlled devices in the same communication environment, each voice controlled device must, during its silence period, listen for another voice controlled device's response, the start of which is marked by the standard signal.
Detection of the other voice controlled device's standard signal can be accomplished by any means that is convenient, including by the voice recognition system, by a DSP, by a microprocessor, or by special circuitry.
In the event another voice controlled device starts responding during this silence period, the listening voice controlled device must restart its silence timing after the responding voice controlled device finishes.
In the event two voice controlled devices start responding at approximately the same time (for example, so that their standard signals overlap in time), they both must back off for a new randomly selected silence delay, but this time the delay must be of up to twice the length of the previous silence delay, not to exceed 16 seconds.
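This response rule resembles a randomized backoff protocol. The following Python fragment is a hypothetical model of the rule only, not the disclosed implementation; the channel-sensing stub and all names are assumptions.

```python
import random

# Hypothetical model of the identification-response rule: wait a random
# silence of up to max_delay seconds, restart the silence timing if another
# device responds meanwhile, and double the window (capped at 16 seconds)
# after a collision. The Channel class below is a stub standing in for
# actual acoustic sensing of the standard signal.

MAX_BACKOFF_S = 16.0

class QuietChannel:
    """Stub channel that never hears another device's standard signal."""
    def quiet_for(self, seconds: float) -> bool:
        return True
    def wait_until_idle(self) -> None:
        pass

def respond_after_backoff(channel, send_response, max_delay: float = 2.0):
    while True:
        delay = random.uniform(0.0, max_delay)
        if not channel.quiet_for(delay):
            # Another device began responding during our silence period:
            # let it finish, then restart our own silence timing.
            channel.wait_until_idle()
            continue
        if send_response():            # False signals an overlapping response
            return
        # Collision: back off with a window up to twice the previous one.
        max_delay = min(2.0 * max_delay, MAX_BACKOFF_S)

respond_after_backoff(QuietChannel(), send_response=lambda: True)
```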
In order to restrict which voice controlled devices respond to an identification phrase, a user may include a voice controlled device's name in the identification phrase. For example, one could say "Socrates are you out there?" to see if a voice controlled device named Socrates was nearby. Similarly, one could say "Clock are you out there?", which would cause all voice controlled devices with an appliance name of Clock (whether a default appliance name or a user appliance name) to respond. A possible variation is that voice controlled devices may respond with some response other than their names, as, for example, might be needed for security reasons.
A voice controlled device may use both visual and acoustic identification methods. For example, even though a speech recognition engine is continuously on, it may still display the visual logo and/or other visual identifier. Similarly, in a voice controlled device that requires manual activation of the speech engine, once enabled, the engine could then be responsive to the command "What is out there?"
In another aspect of the present invention, the initial storage of a user's spoken phrase (for example, when making a new phonebook entry under voice control) is processed by the speaker-independent speech recognition engine of the voice controlled devices. This engine returns a speaker-independent phonetic representation of the phrase. This speaker-independent phonetic representation is what is stored.
When a command is issued by a user, it is also processed by the speaker-independent speech recognition engine of the present invention. This could be the same speaker-independent engine used for storing the original entries, or a completely different speaker-independent engine. In either case, the engine returns a speaker-independent phonetic representation of the command sequence. This speaker-independent phonetic representation can be compared to earlier stored phonetic representations to determine whether the command is recognizable.
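A minimal sketch of this store-then-compare flow follows, assuming a hypothetical to_phonemes() stand-in for the speaker-independent engine's phonetic output; everything else is likewise illustrative.

```python
from __future__ import annotations

# Hypothetical sketch: store phrases as speaker-independent phonetic
# strings, then match later commands against the stored representations
# rather than against recorded sound.

def to_phonemes(phrase: str) -> str:
    # Stand-in for a real speaker-independent recognition engine; a crude
    # normalization is used here only so the example runs.
    return phrase.lower().replace(" ", "-")

phonebook: dict[str, str] = {}

def store_entry(spoken_name: str, number: str) -> None:
    phonebook[to_phonemes(spoken_name)] = number   # store phonetics, not audio

def lookup(spoken_name: str) -> str | None:
    return phonebook.get(to_phonemes(spoken_name)) # compare phonetics

store_entry("John Smith", "555-1212")   # e.g. stored by a secretary
print(lookup("john smith"))             # recognizable for any speaker
```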
By converting both the stored spoken entries and any commands to speaker-independent phonetic representations, a number of advantages are provided:
• Recognition will be reliable even if the user's voice has changed, perhaps due to sickness, stress, fatigue, transmission over a noisy or distorting phone link, or other factors that might change a human user's or machine user's speech. Text-based information can be stored and then recognized.
• Recognition will be reliable even if some other user had stored the original voice phrase.
• Recognition can be speaker-independent, even for user-stored commands and phrases.
• Stored entries originating from text sources and from different speakers can all be combined and used reliably for recognition.
• The use of speaker-independent phonetic representations facilitates upgrading to improved recognition engines as they become available. Improved speech recognition engines can use existing stored information without impacting reliability or requiring re-storage, since all stored entries are held in phonetic form. New information stored using the improved speech recognition engines can be used on equipment with older recognition engines. Old and new generations of equipment can interoperate without prior coordination by using phonetic representations. This allows, for example, two PDAs to exchange voice-stored phonebook entries and provide reliable recognition to the new users of that information.

Finally, there are no legacy restrictions to hold back or restrict future development of speaker-independent recognition engines as long as they can create phonetic representations, unlike waveform-storage based systems, which must always be able to perform exactly the same legacy waveform transformations.
VOICE CONTROLLED DEVICES
Referring now to FIG. 1A, environment 100 is illustrated. Environment 100 may be any communication environment such as an office, a conference room, a hotel room, or any location where voice controlled devices may be located. Within environment 100, there are a number of human users 101A-101H, represented by circles. Also within the environment 100, are voice controlled devices 102A-102H, represented by squares and rectangles, each operationally controlled by the standard voice user interface (VUI) of the present invention. Voice controlled devices 102A-102E, represented by rectangles, are fixed within the environment 100. Voice controlled devices 102F-102H, represented by squares, are mobile voice controlled devices that are associated with human users 101F-101H
respectively. Voice controlled devices 102A-102H may be existing or future devices. Voice controlled devices 102A-102E may be commonly associated with a user's automobile, home, office, factory, hotel or other locations where human users may be found.
Alternatively, if the voice controlled devices 102A-102E
are to be controlled by non-audible speech, voice controlled devices may be located anywhere.
In the present invention, the standard VUI allows a user to associate a user-assignable name with these voice controlled devices 102A-102H. The user-assignable name of the voice controlled device may be generic, such as Telephone, Clock, or Light. Alternatively, the name may be personalized, such as those ordinarily given to humans, such as John, Jim, or George. In either case, the voice controlled devices 102A-102H, while constantly listening, will not respond to commands until they recognize one of their names (user-assigned or default).
Although any name can be assigned to a voice controlled device, to minimize confusion between the voice controlled device and real people, users may choose to use unusual names such as Aardvark or Socrates, which are unlikely to occur during normal conversation. With reference to FIG. 1A, consider the environment 100 to be a conference room where human users 101A-101H are meeting. Further assume that voice controlled device 102A is a telephone having speaker phone capabilities in the conference room 100 and that its appliance name is Telephone. A human user such as 101A would first call out the name Telephone before giving commands to that voice controlled device. By providing names to the voice controlled devices, the voice controlled devices can properly respond to given commands and avoid confusion between multiple users and voice controlled devices. The voice controlled device may be a telephone, an organizer, a calculator, a light fixture, a stereo system, a microwave oven, a TV set, a washer, a dryer, a heating system, a cooling system, or practically any system. Voice controlled devices 102A-102H may include an audible communications interface (ACI) in order to listen to commands and data input from human users 101A-101H and audibly notify a user that the command or data was properly interpreted and executed.
Voice controlled devices 102A-102H further include a speech recognition and synthesis system (SRS). The speech recognition of the SRS provides for interpreting speech in different dialects independent of which user is speaking, and independent of whether the user is a human or a device. While the preferred embodiments of the present invention utilize a speaker independent voice recognition system, the present invention is also compatible with speaker dependent voice recognition systems. The SRS may operate with one or more than one language. The speech synthesis of the SRS provides for generation of speech responses, status commands, or data by the voice controlled devices, which may be audibly or non-audibly communicated. Speech synthesis, also referred to herein as speech generation, is defined herein to include any method of responding with speech (audible or non-audible), including but not limited to, speech recording, storage and playback systems, pre-recorded vocabulary systems with playback, sophisticated speech synthesis systems generating utterances from a combination of characters, and some combination of the above. Preferably the voice controlled devices contain both a speech recording, storage and playback system and a pre-recorded vocabulary system with playback.
Voice controlled devices 102A-102H may optionally include a communications interface (ECI) for providing remote control of the voice controlled device via wireless or wired means using non-audible voice or speech. As illustrated in FIG. 1A, voice controlled device 102A has a connection 105 for connection to a telephone system.
In this manner, the voice controlled device 102A may remotely communicate to a user and accept and acknowledge commands. Referring now to FIG. 1B, the human user 101I communicates by telephone 112 over the wired or wireless transmission media 114 through the telephone company switch 116. The telephone company switch 116 is connected by wired or wireless means through connection 105 to the voice controlled device 102A. Telephone 112 may be a wireless or wired telephone. In this manner, human user 101I may remotely interface to a voice controlled device 102A within a communications environment 100. Alternatively, a voice controlled device such as voice controlled device 102E
may be remotely controlled over a network by a remote computer 118. In this case, a remote human user 101J
can send voice commands or instructions through remote computer 118 which is coupled to the voice controlled device 102E through the network connection 120 and connection 106. The network connection 120 may be a wireless or wired connection, realtime or store-and-forward, through a computer network such as the Internet. There are a wide variety of ways that a remote user can be connected to a voice controlled device, including but not limited to, the use of wired and wireless connections. Wired connections may include, but are not limited to, realtime communications systems such as the telephone system and realtime Internet connections, store-and-forward systems such as email of voice representations and other non-realtime Internet protocols. Wireless systems may include, but are not limited to, radio and infrared systems. Any of these alternatives can include circuit-based systems and packet-based systems, and can include analog and digital systems. Any of these alternatives can be used with or without various modulation and/or encoding and/or encryption schemes.
Referring now to FIG. 2, exemplary voice controlled devices 102I-102M are illustrated. The voice controlled device 102I is exemplary of white goods such as freezers, refrigerators, washers, dryers, air conditioners, heating units, microwave ovens, ovens, and stoves. Voice controlled device 102J is exemplary of voice controlled devices requiring an optional communications interface (ECI). These may include voice controlled devices for consumer electronics such as televisions, video cassette recorders, stereos, camcorders, tape recorders, dictation units, alarm clocks, and clock radios, as well as telephone products such as standard wired telephones and telephone answering machines, light switches, alarm systems, computing devices, Internet access devices, and servers, etc.
Voice controlled device 102K is exemplary of portable or wireless systems such as cellular telephones, walkman style systems, camcorders, and personal digital systems.
Voice controlled device 102L is exemplary of automobile voice controlled systems such as car cellular telephone systems, automobile radio systems, car navigation systems, HVAC (heating, ventilation and air conditioning) systems, and other control systems for an automobile.
Voice controlled device 102M is exemplary of remote controlled devices, such as voicemail systems.
Voice controlled device 102I includes an audible communications interface (ACI) 202, a speech recognition and synthesis system (SRS) 204, and an appliance peripheral and control circuit (APCC) 206. The ACI 202 is coupled to SRS 204 and SRS 204 is coupled to APCC 206. In the voice controlled device 102I, ACI 202 is its primary means of speech communication.
Voice controlled device 102J includes ACI 202, SRS
204, APCC 206, communications interface (ECI) 207, and connection 208. ACI 202 is coupled to SRS 204. APCC
206 is coupled to SRS 204. ECI 207 couples to SRS 204 and connection 208 couples to the ECI 207. Voice controlled device 102J can alternatively communicate using speech or voice communication signals through ACI
202 or ECI 207. Voice controlled device 102K includes ACI 202, SRS 204, APCC 206, ECI 207, and connection 212. Voice controlled device 102K can communicate using audible speech signals through the ACI 202 or using encoded speech signals through the ECI 207. ECI 207 couples to APCC 206. ECI 207 also couples to connection 212. Connection 212 could, for example, be an antenna or infrared port. Voice controlled device 102L also includes an ACI 202, SRS 204, APCC 206, and an antenna 209. ACI 202 couples to SRS 204. SRS 204 couples to APCC 206. Antenna 209 couples to APCC 206. Voice controlled device 102L can communicate by means of ACI 202 and APCC 206 through antenna 209.
Voice controlled device 102M includes an APCC 206, SRS 204, an ECI 207, and connection 210. Connection 210 may be a wired or wireless connection, including an antenna. SRS 204 couples to APCC 206 and also to ECI 207. Connection 210 couples to ECI 207. Voice controlled device 102M can communicate via ECI 207 over connection 210.
The APCC 206 represents the elements of the voice controlled device 102 that are to be controlled. For example, in the case of white goods, the items to be controlled may be temperature, a time setting, a power setting, or a cycle depending on the application. In the case of consumer electronics, the APCC 206 may consist of those items normally associated with buttons, switches, or knobs. In the case of telephone products, the APCC 206 may represent the buttons, the dials, the display devices, and the circuitry or radio equipment for making wired or wireless calls. In the case of automobile systems, the APCC 206 may represent instrumentation panels, temperature knobs, navigational systems, and the automobile radio's channels, volume, and frequency characteristics.
Referring now to FIG. 3, the voice controlled device 102 is illustrated. Voice controlled device 102, illustrated in FIG. 3, is exemplary of the functional blocks within voice controlled devices described herein.
Voice controlled device 102 includes the ACI 202, the APCC 206 and the SRS 204. The voice controlled device 102 may also have an ECI 207 such as ECI 207A or ECI
207B.
The ACI 202 illustrated in FIG. 3 includes microphone 303, speaker 304, and amplifiers 305. The SRS 204 as illustrated in FIG. 3 includes the voice communication chip 301, coder/decoder (CODEC) 306 and 308, host microcontroller 310, power supply 314, power on reset circuit 316, quartz crystal oscillator circuit 317, memory 318, and memory 328. The SRS 204 may optionally include an AC power supply connection 315, an optional keypad 311 or an optional display 312. For bidirectional communication of audible speech, such as for local commands, prompts and data, the speech communication path is through the VCC 301, CODEC 306, and the ACI 202. For bidirectional communication of non-audible speech, such as for remote commands, prompts and data, the non-audible speech communication path is through the VCC 301, CODEC 308, ECI 207A or the VCC 301, host microcontroller 310, APCC 206, and ECI 207B. The ECI 207 may provide for a wired or wireless link such as through a telephone network, computer network, Internet, radio frequency link, or infrared link.
Voice communication chip 301 provides the voice controlled device 102 with the capability of communicating via speech using the standard voice user interface of the present invention. Microphone 303 provides the voice controlled device 102 with the capability of listening for audible speech, such as voice commands and the device's appliance names. Microphone 303 may be a near field or far field microphone depending upon the application. For example, near field microphones may be preferable in portable cell phones where a user's mouth is close, while far field microphones may be preferable in car cell phones where a user's mouth is a distance away. Speaker 304 allows the voice controlled device 102 to respond using speech, such as for acknowledging receipt of its name or commands. Amplifiers 305 provide amplification for the voice or speech signals received by the microphone 303. Additionally, the amplifiers 305 allow amplification of representations of voice signals from the CODEC 306 out through the speaker 304 such that a human user 101 can properly interface to the voice controlled device 102.
Microphone 303 and speaker 304 are each transducers for converting between audible speech and representations of speech. CODEC 306 encodes representations of speech from the ACI 202 into an encoded speech signal for VCC 301. In addition, CODEC 306 decodes an encoded speech signal from the VCC 301 into a representation of speech for audible communication through the ACI 202.
Alternatively, non-audible speech signals may be bi-directionally communicated by the voice controlled device 102. In this case, VCC 301 provides encoded speech signals to CODEC 308 for decoding. CODEC 308 decodes the encoded speech signal and provides it to the ECI 207A for communication over the connection 105.

Speech signals may be received over the connection 105 and provided to the ECI 207A. The ECI 207A couples the speech signals into the CODEC 308 for encoding. CODEC
308 encodes the speech signals into encoded speech signals, which are coupled into the VCC 301.
Speech signals may also be electronically communicated through the APCC 206. Speech signals from the VCC 301 for transmission are passed to the microcontroller 310. Microcontroller 310 couples these into the APCC 206, which transmits the speech signals out to the ECI 207B. Speech signals to be received by the voice controlled device 102 may be received by the ECI 207B and passed to the APCC 206. The APCC 206 then may couple these received speech signals to the microcontroller 310, which passes these onto the VCC 301 for recognition.
The voice controlled device 102 controls the APCC
206 by means of signals from the host microcontroller 310. The host microcontroller 310 is coupled to the APCC 206 to facilitate this control. Voice controlled device 102 may optionally have a keypad 311 coupled to the microcontroller 310 as a further input means.
The keypad 311 may be a power button, a push-to-talk button, or a security code input means, in addition to optionally being used to input other information. Voice controlled device 102 may optionally include a display 312 coupled to the host microcontroller 310 in order to visually display its status or other items of interest to a user.
However, the voice controlled device can function generally without the optional keypad 311 or the optional display 312.
The voice controlled device 102 includes power supply 314. Power supply 314 may generate power from a DC supply source or an AC supply source, or from both.
The source of DC supply may be a battery, solar cell, or other DC source. In the case of an AC supply source, the optional AC power cord 315 is provided. The voice controlled device 102 includes a power on reset circuit 316 to reset its system when the power supply 314 is turned on.
Quartz crystal oscillator circuit 317 in conjunction with other circuitry within the VCC 301 provides an accurate oscillation input to the VCC 301 for generation of clock signals.
Memory 318 is coupled to VCC 301 and provides rewritable non-volatile and volatile memory as well as a read only memory. These typically are a flash RAM, a static RAM, and a ROM. Memory 318 is used to store programs as well as store pre-recorded and recorded phrases. Additionally, memory 318 provides scratch memory for program operation. As is standard practice in the industry, the types of memories used may vary depending on the specific voice controlled device being constructed. Program storage for the present invention may be permanent, as with a ROM, non-volatile but changeable, as with a flash, or volatile, as in a RAM, in which case the program could be downloaded from a non-volatile memory, or from a remote source.
Memory 328 may be volatile memory, non-volatile memory, or a mixture. If only volatile memory is used, its contents can be downloaded from another location for initialization. The size and capabilities of memory 328 will depend on the type of voice controlled device being built. Alternatively, in some cases memory may be substituted with a magnetic, optical or other type of storage medium.
In the voice controlled device 102, VCC 301 may additionally include the functionality of the host microcontroller 310 such that only one processing unit is contained within the voice controlled device 102.
Similarly, the APCC 206, CODECs 306 and/or 308, ECI 207A, ECI 207B, memory 318, memory 328, amplifiers 305, or other elements may be integrated into VCC 301, as is customary in the industry as ever-increasing levels of integration are achieved.
Referring now to FIG. 4, a block diagram of the voice communication chip (VCC) 301 is illustrated. The voice communication chip 301 is an integrated circuit and includes the processing units 402, memory units 403, a Bus and Memory Controller (BMC) 404, a bus adapter 405, and peripherals 406. The voice communication chip 301 is further described in the microfiche appendix entitled "ISD-SR 300, Embedded Speech Recognition Processor" by Information Storage Devices, Inc. The processing units 402 include a microprocessor and a digital signal processing module (DSPM). The memory units 403 include a DSPM random access memory (RAM) 407, a system RAM 408, and a read only memory (ROM) 409. The peripherals 406 include I/O ports 420, an Interrupt Control Unit (ICU) 422, a coder/decoder (CODEC) interface 424, a Pulse Width Modulator (PWM) 426, a MICROWIRE interface 428, a Master MICROWIRE controller 430, a reset and configuration controller 432, a clock generator 434 and a WATCHDOG timer 436. In order to communicate effectively, the voice communication chip 301 includes a core bus 415 and a peripheral bus interconnecting the components as shown in FIG. 4.
The microprocessor 416 is a general purpose 16-bit microprocessor core with a RISC architecture. The microprocessor 416 is responsible for integer arithmetic logic and program control. The DSP Module (DSPM) 418 performs DSP arithmetic. ROM 409 and system RAM 408 are used for the storage of programs and data. DSPM RAM 407 can be accessed directly by the DSPM 418. When the DSPM
418 is idle, the microprocessor 416 can access the DSPM
RAM 407.
The Bus and Memory Controller (BMC) 404 controls access to off-chip devices, such as DRAM, Expansion Memory, off-chip Base Memory and I/O Expansion. The I/O ports 420 provide the interface to devices coupled to the voice communication chip 301. The I/O ports 420 represent twenty-six I/O pins of the voice communication chip 301. Using the internal ROM 409 for program memory without expansion options, sixteen I/O pins can be individually configured for input or output, eight I/O pins are dedicated for output only and two I/O pins are dedicated for input only. The ICU 422 provides the capability of processing five maskable interrupts (four internal and one external) and three internal Non-Maskable Interrupts (NMIs). The CODEC interface 424 provides a direct interface to one CODEC device 306 in the case of ACI 202 only, or two CODEC devices 306 and 308 in the case of ACI 202 and ECI 207A. The Pulse Width Modulator (PWM) 426 generates a square wave with a fixed frequency and a variable duty cycle. The MICROWIRE interface 428 allows serial communication with the host microcontroller 310.
The Master MICROWIRE controller 430 allows interface to serial flash memory and other peripherals. The reset and configuration block 432 controls definition of the environment of the voice communication chip 301 during reset and handles software controlled configurations.
Some of the functions within the voice communication chip 301 are mutually exclusive. Selection among the alternatives is made upon reset or via a Module Configuration register. The clock generator 434 interfaces to the quartz crystal oscillator circuit 317 to provide clocks for the various blocks of the voice communication chip, including a real-time timer. The clock generator can also be used to reduce power consumption by setting the voice communication chip 301 into a power-down mode and returning it to normal operation mode when necessary. When the voice communication chip 301 is in power-down mode, some of its functions are disabled and the contents of some registers are altered. The watchdog timer 436 generates a non-maskable interrupt whenever software loses control of the processing units 402, and at the expiration of a time period when the voice communication chip 301 is in power-down mode.
STANDARD VOICE USER INTERFACE
Similar to computer operating systems providing a GUI, the standard voice user interface (VUI) can be thought of as being provided by standard VUI operating system code. The standard VUI, operating across a wide array of voice controlled devices, allows a user to interface with any one of the voice controlled devices, including those the user has never previously interacted with. Once a user is familiar with the standard VUI, they can walk up to and immediately start using any voice controlled device operating with the standard VUI.
The standard VUI operating system code has specific standardized commands and procedures with which to operate a voice controlled device. These standardized commands and procedures are universal to machines executing the standard VUI operating system code. Voice controlled application software, operating with the standard VUI operating system code, can be written to customize voice controlled devices to specific applications. The voice controlled application software has voice commands specific to the application to which the voice controlled device is used. A particular voice controlled device may also have additional special features that extend the core capabilities of the standard VUI.
Some of the standard VUI functionality in the core VUI includes: a way to discover the presence of voice controlled devices; a common core set of commands for all voice controlled devices; a way to learn what commands (both core commands and appliance-specific commands) the voice controlled device will respond to; a vocalized help system to assist a user without the use of a manual or display; a way to personalize the voice controlled device to a user with user assignable settings; security mechanisms to restrict use of voice controlled devices to authorized users and to protect user assignable settings and information from other users; and standard ways for a user to interact with voice controlled devices for common operations (e.g. selecting yes or no, listing and selecting items from a list of options, handling errors gracefully, etc.).
The standard VUI includes an API (Applications Programming Interface) to allow software developers to write custom voice controlled applications that interface and operate with the standard VUI and extend the voice controlled command set.
Referring now to FIG. 5, a block diagram illustrates the Software 500 for controlling Voice Controlled Device 102 and which provides the standard VUI and other functionality. The Software 500 includes Application Code 510, a VUI software module 512 and a Vocabulary 524. Application code 510 may be further modified to support more than one application, representing multiple application code modules, to provide for further customization of a voice controlled device 102. The Vocabulary 524 contains the phrases to be detected. The phrases within the Vocabulary are divided into groups called Topics, of which there may be one or more. In Figure 5, the Vocabulary 524 consists of two Topics, Topic 551 and Topic 552.
Typically, Application Code 510 interfaces to the VUI software 512 through the Application Programming Interface (API) 507. The VUI software 512 provides special services to the Application Code 510 related to voice interface, including recognition and prompting.
The interrelationship between the VUI software 512 and the application code 510 is analogous to that between Microsoft's MS Windows and Microsoft Word. Microsoft Windows provides special services to Microsoft Word related to displaying items on a screen and receiving mouse and keyboard inputs.
Generally, the Application Code 510 may be stored in host memory and executed by the host microcontroller 310. However, the functionality of the host microcontroller 310 can be embedded into the VCC 301 such that only one device or processor and one memory or storage device is needed to execute the code associated with the software 500.
All phrases that can be recognized, including those phrases for the core and application specific commands, are included in the Vocabulary 524. The VUI software module 512 can directly access the vocabulary phrases, for example for use during recognition. The VUI software module 512 can also process tokens. Tokens abstractly relate to the phrases within the Topics 551-552. Tokens are integer numbers. For example, the phrase for 'dial' might have a token value of '5', and the phrase for 'hangup' might have a token value of '6'. There is a token value assigned to every phrase that can be recognized. Because the VUI software module 512 can process tokens related to the vocabulary file 524, it can refer to phrases without having to directly access them. This makes it possible to change languages (from English to French, etc.) without modifying the VUI software module 512. Thus, the standard VUI will function using different dialects or languages simply by modifying the vocabulary file 524.
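To make this token abstraction concrete, the following is a minimal sketch in C (the patent does not prescribe a programming language, and the table names below are hypothetical) of how the same token values could be mapped onto different per-language phrase tables; the token values 5 and 6 come from the example above.

    #include <stdio.h>

    /* Token values are language-independent integers; 'dial' = 5 and
     * 'hangup' = 6 follow the example in the text. */
    #define TOKEN_DIAL   5
    #define TOKEN_HANGUP 6

    /* Hypothetical per-language phrase tables; swapping the table changes
     * the language without modifying the VUI software module. */
    static const char *english[] = { 0, 0, 0, 0, 0, "dial", "hangup" };
    static const char *french[]  = { 0, 0, 0, 0, 0, "composer", "raccrocher" };

    int main(void)
    {
        const char **vocabulary = french;  /* language selected by the vocabulary file */
        printf("token %d -> %s\n", TOKEN_DIAL, vocabulary[TOKEN_DIAL]);
        printf("token %d -> %s\n", TOKEN_HANGUP, vocabulary[TOKEN_HANGUP]);
        return 0;
    }

Because the VUI logic only ever sees the integers, replacing the table is the entire cost of a language change.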
Core capabilities of the standard VUI operating in a voice controlled device allow a user to: name the voice controlled device; identify the presence of voice controlled devices; activate a user's previously stored personalized preferences; recover from misrecognitions by canceling an operation; use a Help function to identify the commands and options that can be used with the voice controlled device; and use a standard core set of commands as well as other additional commands, confident that they follow a standard syntax. (Although the syntax of commands is common, the specific list of commands on any voice controlled device will depend on the nature of the voice controlled device.) The standard VUI also includes standard functions for the following user interactions for the API: GETYESNO - accepting a Yes/No response from the user; GETRESPONSE - accepting an arbitrary input from the user; GETRESPONSEPLUS - accepting an arbitrary input from the user, with enhanced error recovery features; LISTANDSELECT - providing the user with a list of choices, and allowing the user to select one; and ACOUSTICADDWORD - adding a phrase that can thereafter be recognized.
In order to function properly with the standard VUI, the SRS 204 of the voice controlled device 102 can provide continuous recognition of speech and digits when powered up. However, pauses exceeding certain durations may be recognized by the SRS 204 as marking the end of a command or providing an indication that an incomplete command sequence has been received.
NAMES
A key element of the standard VUI of the present invention is that each voice controlled device has one or more appliance names, each of which is a phrase. The initial appliance name is a default name for a voice controlled device programmed by the manufacturer at the factory. However, users can generally assign a user-assigned appliance name of their choosing to a voice controlled device. Naming a voice controlled device is different from other kinds of naming, such as naming people. A person has a single (first) name that can be used by everyone who wants to talk with them. In contrast, with the naming of voice controlled devices, every user of a voice controlled device usually gives the voice controlled device a different, unique name. Accordingly, a voice controlled device may have as many names as it has users.
When a user addresses a voice controlled device by name, two things happen. First, when the voice controlled device recognizes one of its names, the voice controlled device is notified that it is being addressed and will need to listen for a command. Second, since each user usually employs a different name for a voice controlled device, the device is informed of the user's identity (speaker identification). If a user has stored preferences related to the functionality of the voice controlled device, the voice controlled device can personalize itself to the preferences of that user.
To illustrate this naming concept, consider the following example of a desktop telephone, the voice controlled device, having two users. User 1 has named the phone "Aardvark" and user 2 has named the phone "Platypus". If the phone hears "Aardvark Call Mom", the phone will recognize that it is being addressed by user 1 and that it should use user 1's phonebook. Accordingly, it will dial the number for "Mom" programmed by user 1. Similarly, if the phone hears "Platypus Call Mom", it will recognize that user 2 is addressing it, and it will dial the number for "Mom" programmed by user 2.
In order to minimize false recognition, it is preferable that users assign names to the voice controlled devices that are generally not spoken during normal speech. Choosing unusual names helps ensure that two voice controlled devices within audible range of each other don't have identical names (perhaps assigned by different users). A maximum time limit for saying the phrase name may be required in some cases due to memory limitations in the voice controlled device.
Referring now to FIGS. 6A-6E, flow charts of the detailed operation of the standard VUI with voice controlled devices 102 are described. In the flow charts of FIGS. 6A-6E, a solid box shows a phrase communicated by a user (placed in quotes) or a user action (no quotes). A dotted box shows a phrase communicated by the voice controlled device (in quotes) or an action taken (no quotes). In the case where there is a solid box directly below a dotted box, the path exiting from the right of the dotted box is taken if the action within the current dotted box is completed normally, and the path to the solid box below the dotted box is taken if an unusual event occurs. Generally, the solid box directly below the dotted box indicates the unusual event.
STANDARD VUI COMMAND SYNTAX
Referring now to FIG. 6A, the general syntax for all voice commands is:
<silence> <name> <command> <modifiers & variables>.
The <silence> is a period of relative silence during which the user is not speaking, although background noise and background speech may still be present. The <name> is the appliance name associated with a voice controlled device 102. The <command> is an operation that a user wants performed. The <modifiers & variables> consist of additional information needed by some commands. The SRS 204 recognizes these elements in this syntax in order for a user to control voice controlled devices.
Most voice controlled devices will continuously listen for the voice command sequence. When a voice controlled device hears its <name>, it knows that the following <command> is intended for it. Since each user has a different <name> for a voice controlled device, the <name> also uniquely identifies the user, allowing the voice controlled device to select that user's personalization settings. Commands include core VUI commands included with all voice controlled devices, and commands specific to a given application, all of which are stored within the vocabulary 524.
Requiring <silence> before detection of <name>
helps prevent false detection of <name> during normal conversational speech (i.e. during periods when the user is speaking conversationally to other users and not to the voice controlled device). In all cases, the duration of <silence> can be configured by the manufacturer and can range from 0 (no <silence>
required) to a second or more. Typically it will be about a quarter of a second.
Examples of voice command sequences that might be used with a voice controlled device such as a telephone named Aardvark include "Aardvark Call The Office", "Aardvark Dial 1-800-55-1212", and "Aardvark Hang-up".
(In the command examples and descriptions provided, for the sake of brevity the <silence> is often not shown, and even where it is shown or described, the option always exists of a manufacturer choosing to use a silence duration of zero.)
There are two special cases where the command syntax is permitted to differ from the general syntax.
The first special case is in voice controlled devices that do not continuously listen for <silence><name>.
For example, in some battery operated applications, power consumption limitations may require the VCC 301 in the voice controlled device 102 to be powered down during idle periods. Another example is a voice controlled device located where false recognition of a name would have undesirable results, for example, a desktop phone in a conference room during a presentation. A third example is voice controlled devices where there is a high risk of false recognition, for example, where multiple conversations can be heard.
For these types of situations, an alternate command syntax is used in conjunction with a button or switch of some type. The first alternate command syntax is:
<activation of a switch> <silence (optional)>
<name> <command> <modifiers & variables>.

In this syntax, the <activation of a switch> means the user presses a button or performs some other mechanical act (e.g. opening a flip-style cell phone) to activate the recognition capability.
A second special case is where the user normally enters a series of commands in quick succession. For these cases, the user can identify themselves once to the voice controlled device using a password protection method, or by issuing a command that includes the voice controlled device's appliance <name>, and thereafter continue entering commands. The second alternate command syntax (in this example, for three successive commands) is:
<silence> <name> <command> <modifiers & variables as needed>
<silence> <name (optional)> <command> <modifiers & variables as needed>
<silence> <name (optional)> <command> <modifiers & variables as needed>
With this syntax, the user can issue a series of commands without having to constantly repeat the voice controlled device's appliance <name>. However, the user is permitted to say the <name> at the start of a command. Note that in this syntax, the <silence> is required to properly recognize the spoken <name> or <command>.
When either of the first or second alternate syntaxes is used, it is desirable to ensure that if a new user starts working with the voice controlled device, they are properly identified. This can be ensured by explicitly requiring the <name> after a period of inactivity, after power-up of the voice controlled device, or by some other similar protocol.
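A minimal sketch of how a device might enforce this protocol follows; it is written in C under the assumption that the SRS 204 delivers recognized phrases as integer tokens, and the token value, inactivity limit, and function names are all hypothetical.

    #include <stdbool.h>
    #include <stdio.h>

    #define NAME_TOKEN        1   /* hypothetical token for a recognized <name> */
    #define INACTIVITY_LIMIT 30   /* seconds of inactivity before <name> is required again */

    static bool name_required = true;   /* after power-up, <name> must come first */

    /* Returns true if a recognized command token may be dispatched under
     * the second alternate syntax (where <name> is optional mid-session). */
    bool accept_token(int token, int idle_seconds)
    {
        if (idle_seconds > INACTIVITY_LIMIT)   /* period of inactivity: re-identify */
            name_required = true;

        if (token == NAME_TOKEN) {             /* <name> identifies the user */
            name_required = false;
            return false;                      /* wait for the <command> that follows */
        }
        if (name_required) {
            puts("command ignored: the appliance <name> must be spoken first");
            return false;
        }
        return true;
    }

    int main(void)
    {
        /* A command before any <name>, then a <name>, then two quick commands. */
        int tokens[] = { 42, NAME_TOKEN, 42, 43 };
        int idle[]   = { 0, 0, 2, 3 };
        for (int i = 0; i < 4; i++)
            if (accept_token(tokens[i], idle[i]))
                printf("dispatching command token %d\n", tokens[i]);
        return 0;
    }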
STANDARD CORE VUI COMMANDS
There are a number of standard core commands included in the vocabulary 524 of voice controlled devices 102 operating using the standard VUI. FIGS. 6A-8 illustrate the syntax of the following commands.
Referring to FIG. 6A, at start 600, the appliance name, <name>, of a voice controlled device is usually spoken prior to a command. Any of the voice controlled device's appliance names can be spoken whenever the voice controlled device is listening for a command. If the <name> is not followed by a command within some period of time, the voice controlled device will return to start 600 in its original idle state. This is indicated by the solid box Silence of N seconds. N in this case is a programmable, usually application-dependent value assigned by the voice controlled device manufacturer. After supplying the appliance name, a user is granted access to further commands of the standard VUI operating on the voice controlled device at 601.

The syntax of the Help command is:
<name> Help <command (optional)>
or Help <command (optional)>
The help command can be invoked at any time, including when any other command can be given, or whenever the voice controlled device is waiting for a response. If the Help command is issued while the voice controlled device is waiting for a valid command, Help must be preceded with <name> if the voice controlled device requires a <name> before other commands. If the Help command is requested while the voice controlled device is waiting for any other type of response, <name> does not need to precede the Help command. In all cases where <name> is not required before Help, if the user says "<name> Help", the use of <name> does not generate an error.
The help function is context sensitive - whenever Help is requested, the voice controlled device responds with a description of the available options, given the current context of the voice controlled device. If Help is requested when the voice controlled device is listening for a command, the voice controlled device will respond with its state and list the commands that it can respond to (e.g. "At Main menu. You can say ... ."). Further detail on any specific command can be obtained with the "Help <command>" syntax (e.g. "Help Dial", "Help Call", and even "Help Help"). If "Help" is requested while the voice controlled device is waiting for some type of non-command response (e.g. "Say the name"), then the voice controlled device will respond with a statement of the voice controlled device's current status, followed by a description of what it is waiting for (e.g. "Waiting for user response. Say the name of the person whose phonebook entry you wish to create, or say Nevermind to cancel.").
The syntax of the cancellation command is:
<name (optional)> Nevermind
or <name (optional)> Cancel
The Nevermind or Cancel command can be issued whenever the voice controlled device is executing a command and waiting for a response from the user. Nevermind or Cancel causes the voice controlled device to cancel the current command and respond with a statement that the operation has been cancelled (e.g. "Cancelled."). If Nevermind or Cancel is issued while the voice controlled device is waiting for a command, it can be ignored.
The use of <name> with Nevermind or Cancel is optional - it works identically whether or not <name> is spoken.
The syntax of the return to main menu command is:
<name> Main Menu
For voice controlled devices that have submenus of commands, <name> Main Menu returns the user to the main menu and causes a response of "At Main menu." or the like. This command provides an easy way for the user to return to a known point from any submenu. The Main Menu command does not have to be recognized in voice controlled devices that only have one menu, but is a mandatory command for voice controlled devices with submenus.
Changing Voice Controlled Device Names
In some cases it may be desirable to change the user-assigned name of a voice controlled device.
Referring now to FIGS. 6A-6B, the syntax of the Change Name command is:
<old name> Change Your Name
This command allows a user to name or rename a voice controlled device. When a voice controlled device is new, it has at least one default factory programmed appliance name (e.g. Telephone). Most voice controlled devices have the capability of supporting one or more user-assignable appliance names. A user can set an appliance name by saying "<factory programmed name> Change your name" (e.g. "Telephone change your name").
The voice controlled device will then ask for the new name to be repeated and then change its name. This process can be repeated once for each user-assignable name. For example, consider a 4-user telephone that can be assigned four user-assignable appliance names. A user may execute the four name changes with the commands: "Telephone change your name" followed by the dialog to set the name for user 1 to (for example) Aardvark; "Telephone change your name" followed by the dialog to set the name for user 2 to (for example) Barracuda; "Telephone change your name" followed by the dialog to set the name for user 3 to (for example) Coyote; and "Telephone change your name" followed by the dialog to set the name for user 4 to (for example) Doggone. If the user attempted to change a fifth user-assignable name in sequence with the command ("Telephone change your name"), it would result in an error message because all available user-assignable appliance names were assigned. Note that the voice controlled device always responds to the factory programmed name, even if all user-assigned names are defined. Accordingly, in this example of a fifth attempt, the voice controlled device still recognizes the "Telephone" factory programmed name - it is just unable to assign a fifth new user-assignable appliance name.
An existing user-assignable appliance name can also be changed with the "Change Your Name" command. Continuing the above example, "Aardvark change your name" would alter the appliance's name for the first user (for example, it could be changed to Platypus), and leave the other three user names unchanged. Similarly, "Platypus change your name" followed by a dialog to set the name to "Telephone" would reset the first user name to the factory-programmed default.
Identification of Voice Controlled Devices
As voice controlled devices proliferate, it is important that users be capable of readily identifying what, if any, voice controlled devices are present when they enter a new environment. For example, a user walks into a hotel room that has a number of devices. In order to use them, the user needs to know which devices are voice controlled devices. Additionally, the user needs to know the appliance names in order to properly control them. Besides being audibly identified, voice controlled devices can be identified visually, such as by a logo signifying a voice controlled device utilizing the standard VUI.
Acoustic identification works when voice controlled devices are actively listening for recognizable commands. In most cases, this means the voice controlled device is constantly listening and attempting recognition. Typically, these voice controlled devices will be AC powered, since the power drain from continuous recognition will be unacceptable for most battery operated voice controlled devices. Referring to FIGS. 6A and 6C, acoustic identification is accomplished by a user communicating an identification phrase to command the voice controlled device. The identification phrase "What Is Out There?" or some other suitable identification phrase may be used for causing the voice controlled devices to identify themselves.
The syntax of the standard VUI identification phrase is:
<silence> What Is Out There?
In response to this query, any voice controlled device that hears the question must respond. The typical voice controlled device's response is a random delay of up to 2 seconds of relative silence, followed by a beep (the standard signal), and the response "You can call me <name>", where <name> is the factory-programmed name that can be used to address the voice controlled device.
In the telephony voice controlled device example described above, a response might be "<beep> You can call me Telephone."
Referring to FIG. 6C, during the random delay of up to 2 seconds, each responding voice controlled device listens for another voice controlled device's response (specifically, for another voice controlled device's beep). In the event another voice controlled device starts responding (as evidenced by a beep) during this silence period, the listening voice controlled device must restart its silence timing after the responding voice controlled device finishes. In the event two voice controlled devices start responding at the same time (overlapping beeps), they both must back off for a new randomly selected silence delay. However, this time the random delay may be greater than the first, up to twice the length of the previous silence delay. In any event, the delay should not exceed 16 seconds.
Additional back-off periods for further conflict resolution are provided if other voice controlled devices respond.
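The collision-avoidance behavior described above can be sketched as follows; the timing constants (2-second initial window, doubling back-off, 16-second cap) come from the text, while the beep-sensing hooks are hypothetical stubs standing in for the device's audio front end.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical platform hooks, stubbed out so the sketch runs stand-alone. */
    static bool another_beep_heard(void)        { return false; }  /* quiet room */
    static bool overlapping_beep_detected(void) { return false; }
    static void wait_ms(int ms)                 { (void)ms; }
    static void beep_and_announce_name(void)
    {
        puts("<beep> You can call me Telephone");
    }

    static void respond_to_what_is_out_there(void)
    {
        int max_delay_ms = 2000;                    /* initial window: up to 2 seconds */

        for (;;) {
            wait_ms(rand() % (max_delay_ms + 1));   /* random silence delay */

            if (another_beep_heard()) {             /* another device responded first: */
                while (another_beep_heard())        /* let it finish, then restart our */
                    wait_ms(10);                    /* silence timing                  */
                continue;
            }
            beep_and_announce_name();
            if (!overlapping_beep_detected())
                return;                             /* response delivered cleanly */

            max_delay_ms *= 2;                      /* overlapping beeps: back off with */
            if (max_delay_ms > 16000)               /* a longer window, never exceeding */
                max_delay_ms = 16000;               /* 16 seconds                       */
        }
    }

    int main(void)
    {
        respond_to_what_is_out_there();
        return 0;
    }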
Referring to FIG. 6A, the syntax of the Request User-Assignable Names command is:
<name> Tell Me Your Name
or <name> Tell Me Your Names
If security permits, any user-programmed <name> or the default <name> can be used. The Request User-Assignable Names command is used to ask a voice controlled device to list all the user-programmed <names> that it will respond to. If security permits, the voice controlled device communicates each user-programmed name in a list fashion. Between each user-assigned name it pauses for a moment. During this pause a user may communicate a command to the voice controlled device, and it will be executed as if given with that user-programmed <name>.
For example, consider the telephony voice controlled device example above. The command "Telephone Tell Me Your Name" will cause the telephone to respond by saying "I have been named Aardvark (pause), Barracuda (pause), Coyote (pause), and Doggone (pause)." During the pause that followed the voice controlled device saying "Coyote", a user may say "Call Mom", in which case the phone calls user Coyote's Mom (assuming that a phone number for Mom had been previously stored by user Coyote).
SECURITY CONSIDERATIONS
The command for requesting user-assignable names raises the issue of security in voice controlled devices. In some cases it is necessary to limit access to a voice controlled device to authorized users. Various methods of security protection, supported by the standard VUI, can be employed in a voice controlled device.
The simplest and least secure security protection is provided through the VUI's naming capability. In this case every user is required to choose a unique name for a voice controlled device. The user assigned appliance names are kept confidential within the voice controlled device and can only be changed or deleted by a user.
In this manner the appliance name can be used to provide basic security. However, there are many shortcomings with this approach. First, the user must typically repeat the name before issuing each command, which makes it easy for someone to overhear the name, resulting in a loss of security. Second, most voice controlled devices will include a capability for deleting or changing a user's name for the device. It is preferable to make deletions and changes easy to perform. Additionally, changes may need to be performed by someone other than that particular user. For example, the user may have forgotten the name he originally assigned to the voice controlled device, or the user may have stopped using the device and not be available to delete his settings. In the case of using the appliance name as security, there is an inherent conflict between the need for ease of use in changing a name and the quality of security.
A greater level of security can be achieved by requiring the user to say a secret numeric sequence, password or phrase in order to gain access to the voice controlled device. The login might be required when the user starts using the voice controlled device after some period of inactivity, or based on some other criteria.
A disadvantage of this approach is that the spoken numeric sequence or phrase might be overheard. Another security alternative is to require the user to enter the numeric sequence, password, or phrase on a keypad such as optional keypad 311. Although this introduces additional hardware, it eliminates the risk of a secret code being overheard by another. A variety of other security options are also possible, including use of a physical key or a security card (e.g. magnetic stripe or smartcard).

Additional security is provided by automatic cancellation or termination of user access to the voice controlled device. In some cases access may be automatically cancelled after every command execution.
In other cases automatic cancellation of access may occur following some period of inactivity, power-down or reset, completion of some operation (e.g. in a phone, at the end of a call), or upon the specific request of a user by use of a "Cancel Access" command.
APPLICATION-SPECIFIC COMMANDS
The standard VUI provides each voice controlled device with a number of application specific commands. The application specific commands provided by the standard VUI are associated with telephone and answering machine applications. Additional application specific commands can be programmed for and included in the vocabulary by a manufacturer.
General guidelines for developing commands for the standard VUI are as follows. Sub-menus should be limited in number and organized around logical groups of commands. For example, a telephone with telephone answering device (TAD) functions might have a main menu that includes telephony functions, a submenu for phonebook management, and another submenu for TAD
functions.
The number of commands in any menu or submenu should generally be limited to ten or less to minimize complexity. The help function should clearly describe the available commands.
Complex commands should be broken down into manageably small units. Command phrases should be selected that ensure high recognition success. The standard VUI commands have been selected to ensure high recognition accuracy. Care should be exercised when creating a custom vocabulary to avoid using confusable phrases.
For destructive events (delete, etc.), user-confirmation of the correct entry and verification of the operation should be requested.
TELEPHONY VOCABULARY
Referring now to FIGS. 6D-6E, 7, and 8, flow charts for the telephony vocabulary for the standard VUI are illustrated. The telephony vocabulary is particularly for telephony voice controlled devices such as desktop telephones, cellular telephones, cellular telephone car kits, and cordless phones. The SRS 204 of the present invention is capable of recognizing the commands in the telephony vocabulary and converting them into recognized tokens for control of the telephony voice controlled devices. The telephony vocabulary includes all the standard VUI Core Commands and the following application specific commands.
The syntax of the Call command is:

<name> Call <voicetag>
or <name> Call <digits>
The Call command is used to dial a specific phone number, expressed either as a series of digits or as a phonebook voicetag. The <digits> can be any list of numeric digits. The telephony voice controlled device allows the synonyms "oh" for zero, and "hundred" for zero-zero, to be enabled. The sequence of <digits> can contain embedded pauses. However, if a pause exceeds a programmable duration set by the system designer, the sequence is terminated and the command is executed. The telephony voice controlled device's response to a Call command should be "Calling <digits>" or "Calling <voicetag>", with the recognized digits or recognized voicetag voiced to verify accurate recognition. The "Cancel" command can be used to cancel the calling operation in the event of misrecognition.
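A sketch of the pause-terminated digit collection described above follows; the polling interface and the pause limit are hypothetical, and the scripted input merely simulates the recognizer output.

    #include <stdio.h>

    #define PAUSE_LIMIT_MS 1500   /* hypothetical designer-set pause duration */

    /* Scripted recognizer stub: each entry is { digit or -1, elapsed ms }.
     * It simulates "1 <short pause> 8 0 0 <long pause>". */
    static const int script[][2] = {
        {1, 100}, {-1, 400}, {8, 100}, {0, 100}, {0, 100}, {-1, 1600}
    };
    static int pos = 0;

    static int poll_digit(int *elapsed_ms)
    {
        *elapsed_ms = script[pos][1];
        return script[pos++][0];
    }

    /* Collects digits until a pause longer than PAUSE_LIMIT_MS is seen;
     * embedded pauses shorter than the limit are permitted. */
    static int collect_digits(char *out, int max)
    {
        int n = 0, silent_ms = 0;
        while (n < max - 1) {
            int elapsed = 0;
            int d = poll_digit(&elapsed);
            if (d < 0) {
                silent_ms += elapsed;
                if (silent_ms > PAUSE_LIMIT_MS)
                    break;                        /* pause terminates the sequence */
            } else {
                silent_ms = 0;
                out[n++] = (char)('0' + d);
            }
        }
        out[n] = '\0';
        return n;
    }

    int main(void)
    {
        char digits[32];
        collect_digits(digits, sizeof digits);
        printf("Calling %s\n", digits);           /* voice the digits back to verify */
        return 0;
    }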
The syntax of the Dial command is:
<name> Dial <voicetag>
or <name> Dial <digits>
The Dial command is the same as the Call command.
The syntax of the Answer command is:
<name> Answer
This command is used to answer an incoming call. The response prompt is "Go ahead".
The syntax of the Hangup command is:
<name> Hangup
This command is used to hang up an active call. The response prompt is a high-pitched beep.
The syntax of the Redial command is:
<name> Redial
This command is used to redial a number. The response is "Redialing <digits>" or "Redialing <voicetag>", depending on whether the previous Call or Dial command was to <digits> or a <voicetag>. If there was no earlier call made, the response is "Nothing to redial".
The syntax of the Store command is:
<name> Store
The Store command is in the phonebook submenu and is used to add a new voicetag.
The syntax of the Delete command is:
<name> Delete
The Delete command is in the phonebook submenu and is used to delete a voicetag.
The syntax of the Mute command is:
<name> Mute
This command mutes the microphone. The response by the voice controlled device is "Muted".
The syntax of the Online command is:

<name> Online
This command unmutes the microphone. The response is "Online".
Prompts can be communicated by the voice controlled devices to request a response from the user. Prompts may be communicated (i.e. prompting) by a speech synthesizer, by playback of pre-recorded speech, or by other means. The prompts in the telephone vocabulary include the following:
"Calling <digits>/<voicetag>"
"Dialing <digits>/<voicetag>"
"Go ahead"
"Goodbye" (for the hangup command)
"Cancelled"
"Please say the name you want to call"
"Please say the name you want to delete"
"Are you sure you want to delete <voicetag>?"
"<voicetag> deleted"
"Please say the new name"
"Please repeat the new name"
"Please say the name again"
"My name is now <name>"
"Name change canceled"
"The names did not match"
"Please say the number for <voicetag>"
"Please repeat the number"
"The number for <voicetag> is <digits>. Is this correct?"
"The number for <voicetag> has been stored"
"Do you want to store it now?"
"That name is not in the phone book"
"Please start over"
"Sorry, I didn't understand"
"Redialing <digits>/<voicetag>"
"Nothing to redial"
"Muted"
"Online"
The vocabulary also includes the digit and symbol prompts "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "zero", "hundred", "Star", "Flash", and "Pound".
In addition to these prompts, the voice controlled devices can generate a number of different tones or beeps. These include a medium pitched beep (e.g. a 200 millisecond, 500 Hz sine wave), a low pitched beep (e.g. a buzzer sound or a 250 millisecond, low frequency beep signifying erroneous entry), and a high pitched beep (e.g. a 200 millisecond, 1200 Hz sine wave). Other sounds are possible and would be within the intended scope of the present invention.
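As an illustration only, such beeps could be rendered as PCM samples roughly as follows; the 8 kHz sample rate and 16-bit amplitude are assumptions, since the text specifies only frequency and duration.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SAMPLE_RATE 8000               /* assumed telephony-grade rate */
    #define PI 3.14159265358979323846

    /* Fills buf with a sine tone of the given frequency and duration;
     * returns the number of samples written. */
    static int make_beep(int16_t *buf, int max, double freq_hz, int duration_ms)
    {
        int n = SAMPLE_RATE * duration_ms / 1000;
        if (n > max)
            n = max;
        for (int i = 0; i < n; i++)
            buf[i] = (int16_t)(30000.0 * sin(2.0 * PI * freq_hz * i / SAMPLE_RATE));
        return n;
    }

    int main(void)
    {
        static int16_t pcm[SAMPLE_RATE];   /* up to one second of audio */
        int n;

        n = make_beep(pcm, SAMPLE_RATE, 500.0, 200);    /* medium beep: 200 ms, 500 Hz */
        printf("medium beep: %d samples\n", n);
        n = make_beep(pcm, SAMPLE_RATE, 1200.0, 200);   /* high beep: 200 ms, 1200 Hz */
        printf("high beep: %d samples\n", n);
        return 0;
    }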
Vocabulary For Telephone Answering Voice Controlled Device
In addition to the foregoing, application specific commands for the standard VUI enable a user to interface to a telephone answering voice controlled device using voice commands. A user can manage message functions and obtain remote access from a telephone answering voice controlled device without using a keypad. The following lists the additional voice commands to be included in the vocabulary 224 for a telephone answering voice controlled device.
<name> Play new
<name> Play all
<name> Play Greeting
<name> Record Greeting
<name> Record message
<name> Delete this
<name> Delete all messages
<name> Rewind <n>
<name> Forward <n>
<name> Stop
<name> Answer On
<name> Answer Off
<name> Room monitor
<name> Password <password phrase>
Automobile Control Vocabulary
Additional specific commands for the standard VUI
enable a user to interface to automobile accessories using voice control. Two primary areas for automotive voice control include the control of interior accessories and control of entertainment systems.
Automotive accessories include environmental controls, windows, door locks, and interior lights. It is preferable that "Mission critical" elements in an automobile, such as steering, braking, acceleration, and exterior lights not be controlled by voice due to potential safety concerns if misrecognition occurs.
Entertainment controls are used primarily for a CD player/changer and for the radio.
The automobile control vocabulary 224 for voice controlled devices includes Air conditioning, Fan speed, Temperature, Driver window, Passenger window, Left rear window, Right rear window, Windows, Door locks, Wipers, Low, Medium, High, Increase, Decrease, Set, Reset, Cancel, Clear, Recall, On, Off, Colder, and Warmer.
STANDARD USER INTERFACE FUNCTIONS FOR THE API
The standard VUI of the present invention includes standard functions for user interactions, which are accessed by an applications programming interface (API). These standard functions for the API include GETYESNO, GETRESPONSE, GETRESPONSEPLUS, and LISTANDSELECT, which are used by custom software developers to develop applications that operate on top of the standard VUI of the present invention. FIGS. 9A-9B, 10A-10C, 11, and 12 are flow charts illustrating the functionality of these standard user interface functions within the standard VUI. Briefly, the GETYESNO function is for prompting and accepting a positive (Yes) or negative (No) response from a user. The GETRESPONSE function is for prompting and accepting an input from a user that corresponds to an expected list of responses. The GETRESPONSEPLUS
function is for prompting and accepting input from a user similar to the GETRESPONSE function but includes enhanced error recovery features. The LISTANDSELECT
function provides a user with a list of choices and allows the user to select one. The operation of the GETYESNO, GETRESPONSE, GETRESPONSEPLUS, and LISTANDSELECT functions is adapted from "Debouncing the Speech Button: A Sliding Capture Window Device for Synchronizing Turn-Taking" by Bruce E. Balentine et al., International Journal of Speech Technology, 1997. FIG.
9A illustrates the use of a Yes/No menu and FIG. 9B
illustrates how to resolve a rejection or a bad recognition. FIG. 10A illustrates the initiation or begin window for the GETRESPONSE and GETRESPONSEPLUS
functions. FIG. 10B illustrates the speech startup or open window functionality for the GETRESPONSE and GETRESPONSEPLUS functions. FIG. 10C illustrates the end recognition or close window functionality for the GETRESPONSE and GETRESPONSEPLUS functions. FIG. 11 illustrates the dual capture window functionality for the GETRESPONSEPLUS function. FIG. 12 illustrates the menu list functionality for the LISTANDSELECT function.
Referring to FIGS. 9A-9B, the GETYESNO user interface function is used to ask the user a question and to accept a positive or negative response such as "Yes" or "No" (or the equivalent phrases in other languages). The parameters associated with GETYESNO are a Question prompt and a TimeOut period. The Question parameter is a voice prompt to the user which asks a question that can be answered positively or negatively, such as with "yes" or "no". The TimeOut parameter is the number of seconds to wait for a response before flagging that a response was not detected. The voice controlled device returns a byte value depending upon the response or outcome. A 0 is returned if a "No" response is detected. A 1 is returned if a "Yes" response is detected. A 17 is returned if a response was not detected in the allowed time, indicating a TimeOut error. An 18 is returned if a response was detected but it was not recognizable, indicating an out-of-vocabulary-word error.
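A sketch of calling GETYESNO from application code follows; the text defines the parameters and return codes but not a C binding, so the signature and the canned stub here are assumptions.

    #include <stdio.h>

    /* Return codes defined in the text above. */
    #define ANSWER_NO     0
    #define ANSWER_YES    1
    #define ERR_TIMEOUT  17
    #define ERR_OOV      18   /* out-of-vocabulary-word error */

    /* Hypothetical C binding; stubbed to answer "Yes" so the sketch runs. */
    static unsigned char GETYESNO(const char *question, int timeout_seconds)
    {
        (void)timeout_seconds;
        printf("PROMPT: %s\n", question);
        return ANSWER_YES;
    }

    int main(void)
    {
        switch (GETYESNO("Are you sure you want to delete this entry?", 10)) {
        case ANSWER_YES:  puts("deleting entry");                break;
        case ANSWER_NO:   puts("Cancelled.");                    break;
        case ERR_TIMEOUT: puts("no response in allowed time");   break;
        case ERR_OOV:     puts("response was not recognizable"); break;
        }
        return 0;
    }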
Referring to FIGS. 10A-10C, the GETRESPONSE user interface function plays a Prompt to a user that solicits a response and waits for the response.
GETRESPONSE looks for a spoken response that matches a topic within a list known as TopicList. GETRESPONSE
either returns an array of recognized tokens, or an error indicator. The parameters associated with the GETRESPONSE are Prompt, TimeOut, STS_Sound, and TopicList. The Prompt parameter is the initial prompt to be played to the user. The TimeOut parameter is the number of milliseconds to wait for a response before flagging that a response was not detected. The STS_Sound parameter (Spoke-Too-Soon Sound) is the sound or prompt to be played if a user speaks before the Prompt finishes playing. Typically, the STS_Sound will be a short tone or beep sound rather than a spoken phrase. The parameter TopicList is the vocabulary subset for the list of topics which the SRS 204 should use to identify the spoken response. The voice controlled device returns a pointer to an integer array.
If the recognition of a response associated with the TopicList was successful, the first element in the array is the number of tokens returned and the following elements in the array are the tokens for each identified speech element (one or more words). Element 1 is n, the number of tokens returned. Elements 2 through n+1 are the token values for each speech element recognized.
For example, consider the phrase "Telephone Dial Office". If the token value for the speech element "Telephone" is 7, for the speech element "Dial" is 12, and for the speech element "Office" is 103, then if they are all recognized successfully, the complete array returned would be four elements long with the values 3, 7, 12, 103. If the recognition of the response was not successful, the array is two elements long. The first element is set to zero and the second element indicates the type of error that occurred. In this case, Element 1 is set to 0 indicating that an error was detected.

Element 2 is set to 17 indicating that a response was not detected in the allowed time (TimeOut error) or 18 indicating that a response was detected, but it was not recognizable (out-of-vocabulary-word error). The array returned for a timeout error is two elements long with values 0, 17 and the array returned for an out-of-vocabulary-word error is two elements long with values 0, 18.
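The integer array convention just described can be decoded as in this sketch; the array contents reuse the "Telephone Dial Office" example from the text, and the function name is illustrative.

    #include <stdio.h>

    /* Decodes the array returned by GETRESPONSE: on success, element 0
     * holds n and elements 1..n hold the recognized tokens; on failure,
     * element 0 is zero and element 1 is the error code (17 or 18). */
    static void handle_result(const int *result)
    {
        if (result[0] == 0) {
            if (result[1] == 17)
                puts("TimeOut error: no response detected");
            else
                puts("out-of-vocabulary-word error");
            return;
        }
        printf("%d tokens:", result[0]);
        for (int i = 1; i <= result[0]; i++)
            printf(" %d", result[i]);
        putchar('\n');
    }

    int main(void)
    {
        int ok[]      = { 3, 7, 12, 103 };  /* "Telephone Dial Office" */
        int timeout[] = { 0, 17 };
        handle_result(ok);
        handle_result(timeout);
        return 0;
    }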
Referring to FIG. 11, the GETRESPONSEPLUS user interface function plays a Prompt to a user that solicits a response and waits for the response.
GETRESPONSEPLUS is similar to GETRESPONSE in that it plays a Prompt for the user and then waits for a spoken response. However, GETRESPONSEPLUS includes the capability to play prompts to recover from error situations where the user has not spoken or has excessive noise in the background. GETRESPONSEPLUS
listens for a spoken response that matches the topics in TopicList. GETRESPONSEPLUS either returns an array of recognized tokens, or an error indicator. The parameters for GETRESPONSEPLUS are Initial_Prompt, TimeOut, STS_Sound, TopicList, MaxTries, Intervene_Prompt, Repeat_Prompt, and Help_Prompt. The Initial_Prompt parameter is the initial prompt to be played to a user to solicit a response. The TimeOut parameter is the number of milliseconds to wait for a response before flagging that a response was not detected. The STS_Sound parameter is a sound or prompt to be played if the user speaks before the prompt finishes playing. Typically, the STS_Sound will be a short tone or beep sound rather than a spoken phrase. The parameter TopicList is the vocabulary subset for the list of topics which the SRS 204 should use to identify the spoken response. The MaxTries parameter is the maximum number of times GETRESPONSEPLUS will re-prompt the user in an effort to get a good recognition. If recognition does not occur after MaxTries, GETRESPONSEPLUS will return and indicate an error. The Intervene_Prompt parameter is a prompt played to ask the user to repeat himself (e.g. "There was too much noise. Please repeat what you said."). This prompt is played when there was too much noise during the previous recognition attempt. The Repeat_Prompt parameter is the prompt played to ask the user to repeat what was just said (e.g. "Please repeat what you said"). This prompt is used when a spoke-too-soon error occurred. The Help_Prompt parameter is the prompt played when the user seems to need further instructions, including when the user says nothing. The voice controlled device returns a pointer to an integer array upon completion of the user interface function. If the recognition of a response associated with the TopicList was successful, the first element in the array is the number of tokens returned and the following elements in the array are the tokens for each identified speech element (one or more words).
Element 1 is n, the number of tokens returned. Elements 2 through n+1 are the token values for each speech element recognized. For example, consider the phrase "Telephone Dial Office". If the token value for the speech element "Telephone" is 7, for the speech element "Dial" is 12, and for the speech element "Office" is 103, then if they are all recognized successfully, the complete array returned would be four elements long with the values 3, 7, 12, 103. If recognition was not successful, the array is five elements long. The first element is zero. The second element indicates the most recent type of error that occurred. The third through fifth elements indicate the number of times each type of error occurred between when GETRESPONSEPLUS was called and when it returned. In this case Element 1 has a value of 0, indicating that an error was detected. Element 2 has a value of 17 indicating that a response was not detected in the allowed time (TimeOut error), 18 indicating that a response was detected but was not recognizable (out-of-vocabulary-word error), or 19 indicating that a spoke-too-soon error was detected. Element 3 has a value of x indicating the number of times a TimeOut error was detected. Element 4 has a value of y indicating the number of times an out-of-vocabulary-word error was detected. Element 5 has a value of z indicating the number of times a spoke-too-soon error was detected.
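For concreteness, a hypothetical C binding of GETRESPONSEPLUS and a call using the recovery prompts quoted above might look like this; the signature, the opaque TopicList type, and the canned stub are all assumptions.

    #include <stdio.h>

    typedef struct TopicList TopicList;   /* opaque vocabulary-subset handle (assumed) */

    /* Hypothetical binding; the canned stub pretends one token (42) was
     * recognized so the sketch runs stand-alone. */
    static int *GETRESPONSEPLUS(const char *initial_prompt, int timeout_ms,
                                const char *sts_sound, TopicList *topics,
                                int max_tries, const char *intervene_prompt,
                                const char *repeat_prompt, const char *help_prompt)
    {
        static int canned[] = { 1, 42 };
        (void)timeout_ms; (void)sts_sound; (void)topics; (void)max_tries;
        (void)intervene_prompt; (void)repeat_prompt; (void)help_prompt;
        printf("PROMPT: %s\n", initial_prompt);
        return canned;
    }

    int main(void)
    {
        int *r = GETRESPONSEPLUS(
            "Please say the name you want to call",
            5000,                 /* wait up to 5 seconds for a response */
            "beep",               /* STS_Sound, played on spoke-too-soon */
            NULL,                 /* TopicList: the phonebook voicetags */
            3,                    /* MaxTries: re-prompt at most 3 times */
            "There was too much noise. Please repeat what you said.",
            "Please repeat what you said",
            "Say the name of a phonebook entry, or say Nevermind to cancel.");

        if (r[0] == 0)
            printf("failed, most recent error code %d\n", r[1]);
        else
            printf("recognized token %d\n", r[1]);
        return 0;
    }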
Referring to FIG. 12, the LISTANDSELECT user interface function first plays an Initial_Prompt. Then it plays each prompt in the array ListOfMenuPrompts, pausing after each one for PauseTime. During these pauses, the recognizer listens for a spoken response that matches the topics in TopicList. LISTANDSELECT either returns an array of recognized tokens, or an error indicator. The parameters for LISTANDSELECT include Initial_Prompt, TimeOut, STS_Sound, TopicList, ListOfMenuPrompts, PauseTime, and Help_Prompt. The Initial_Prompt parameter is the initial prompt to be played to the user. The TimeOut parameter is the number of milliseconds to wait for a response, after playing all the prompts in ListOfMenuPrompts, before flagging that a response was not detected. The STS_Sound parameter is the sound or prompt to be played if the user speaks before a prompt finishes playing. Typically, STS_Sound will be a short tone or beep sound rather than a spoken phrase. The parameter TopicList is the vocabulary subset for the list of topics which the SRS
204 should use to identify the spoken response. The ListOfMenuPrompts parameter is an array of prompts which will be played one at a time. The first element in the array is a count of the number of prompts in ListOfMenuPrompts. The PauseTime parameter is the time to pause after playing each prompt in ListOfMenuPrompts.
The PauseTime parameter has a value in milliseconds.
The Help_Prompt parameter is the prompt played when the user seems to need further instructions, including when the user says nothing. The voice controlled device returns a pointer to an integer array upon completion of the user interface function. If recognition was successful, the first element in the array is the number of tokens returned, and the following elements in the array are the tokens for each identified speech element (one or more words). Element 1 has a value of n indicating the number of tokens returned. Elements 2 through n+1 hold the token values for each speech element recognized. If recognition was not successful, the array is two elements long. The first element is zero. The second element indicates the type of error that occurred. In this case, Element 1 has a value of 0 indicating that an error was detected. Element 2 has a value of 17 indicating a response was not detected in the allowed time (TimeOut error), or 18 indicating that a response was detected but was not recognizable (out-of-vocabulary-word error).
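A hypothetical call is sketched below; prompts are modeled here as integer prompt IDs, and element 0 of the ListOfMenuPrompts array carries the count of prompts, per the text. The binding and the canned stub are assumptions.

    #include <stdio.h>

    typedef struct TopicList TopicList;   /* opaque vocabulary-subset handle (assumed) */

    /* Hypothetical binding; the canned stub selects token 2 so the
     * sketch runs stand-alone. */
    static int *LISTANDSELECT(int initial_prompt, int timeout_ms, int sts_sound,
                              TopicList *topics, const int *list_of_menu_prompts,
                              int pause_time_ms, int help_prompt)
    {
        static int canned[] = { 1, 2 };
        (void)timeout_ms; (void)sts_sound; (void)topics; (void)help_prompt;
        printf("playing prompt %d\n", initial_prompt);
        for (int i = 1; i <= list_of_menu_prompts[0]; i++)
            printf("playing menu prompt %d, then pausing %d ms\n",
                   list_of_menu_prompts[i], pause_time_ms);
        return canned;
    }

    int main(void)
    {
        int menu[] = { 3, 101, 102, 103 };  /* count, then three prompt IDs */
        int *r = LISTANDSELECT(100, 4000, 9, NULL, menu, 1500, 110);
        if (r[0] == 0)
            printf("error code %d\n", r[1]);
        else
            printf("user selected token %d\n", r[1]);
        return 0;
    }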
The ACOUSTICADDWORD function is used by application software to allow a user to add a phrase, also called a voicetag, into the voice controlled device. These phrases can later be recognized using the GETRESPONSE
and GETRESPONSEPLUS functions. The ACOUSTICADDWORD
function can be used, for example, in a telephone to create dial-by-name entries. By storing a person's name ("John Smith") or identity ("Mother") or other distinguishing phrase ("My office number") with ACOUSTICADDWORD, a person could later call the number by saying "Call John Smith", "Call Mother", or "Call my office number".
ACOUSTICADDWORD stores the voicetag into a specified TopicList. In its operation, ACOUSTICADDWORD
plays a prompt, receives and records a voicetag, verifies the voicetag, then stores the voicetag.
AcousticAddWord has the ability to recover from errors by re-checking the voicetag more than once.
AcousticAddWord checks for and returns an error to the user in the event of duplication. The parameters for ACOUSTICADDWORD include Initial_Prompt, TimeOut, STS_Sound, TopicList, MaxTries, Repeat_Prompt, Intervene_Prompt, Error_Prompt, OK_Prompt, and Help_Prompt. The Initial_Prompt parameter is the initial prompt to be played to a user, such as "Say the new name" in the example of storing names in a voice controlled telephone's phonebook. The TimeOut parameter is the number of milliseconds to wait for a response before flagging that a failure was detected. The STS_Sound (Spoke-Too-Soon Sound) parameter is the sound or prompt to be played if the user speaks before the prompt finishes playing. Typically, the STS_Sound will be a short tone or beep sound rather than a spoken phrase. The parameter TopicList is the vocabulary subset in which the SRS 204 should store the new voicetag. The MaxTries parameter is the maximum number of times AcousticAddWord will re-prompt the user in an effort to get a good recognition.
If recognition does not occur after MaxTries, AcousticAddWord will return an error indication. The Repeat_Prompt parameter is the prompt played to ask the user to repeat what was just said (e. g. "Please repeat what you said"). This prompt is used when a spoke-too-soon error occurred. The Intervene_Prompt parameter is a prompt played to ask the user to repeat himself (e. g.
"There was too much noise. Please repeat what you said."). This prompt is played when there was too much noise during the previous recognition attempt.
The Error_Prompt parameter is the prompt played when the repeated name does not match the initial name, or if the name is a duplicate (e.g. "Please try again."). The OK_Prompt parameter is the prompt played when the new name has been successfully recorded and stored (e.g. "<name> is now stored in the address book"). The Help_Prompt parameter is the prompt played when the user seems to need further instructions, including when the user says nothing. The voice controlled device returns a pointer to an integer array upon completion of the user interface function. If the recognition of a response associated with the AcousticAddWord was successful, the array is seven elements long. Element 1 has a value of 1 indicating successful recognition. Element 2 is a value indicating the token number assigned by the SRS 204, which corresponds to the voicetag that was stored. Element 3 is a pointer to a recorded copy of the voicetag. Element 4 is a value indicating the number of timeout errors that occurred. Element 5 is a value indicating the number of times there was a failure to match the name. Element 6 is a value indicating the number of times spoke-too-soon occurred. Element 7 is a value indicating the number of times the help prompt was played. If recognition was not successful, the array is six elements long. The first element is zero. The second element indicates the most recent type of error that occurred. The third through fifth elements indicate the number of times each type of error occurred between when AcousticAddWord was called and when it returned. The sixth element indicates the number of times the help prompt was played. In this case, Element 1 has a value of 0 indicating that an error was detected. Element 2 has a value of 17 indicating that a response was not detected in the allowed time (TimeOut error); 18 indicating that a response was detected but was not recognizable (noise error); 19 indicating that a spoke-too-soon error was detected; 20 indicating a recognition failure (no match on repeat); or 21 indicating that the voicetag list is already full. Element 3 is a value of x indicating the number of times a TimeOut error was detected. Element 4 is a value of y indicating the number of times a recognition error was detected. Element 5 is a value of z indicating the number of times a spoke-too-soon error was detected.
Element 6 is a value indicating the number of times the help prompt was played.
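The success and failure arrays just described can be decoded as in the sketch below; carrying a pointer inside an int array (element 3) follows the text literally, so this illustrative sketch simply treats it as an opaque value.

    #include <stdio.h>

    /* Decodes the array returned by ACOUSTICADDWORD as described above:
     * success -> { 1, token, recording, timeouts, mismatches, sts, helps }
     * failure -> { 0, last_error, timeouts, mismatches, sts, helps } */
    static void report_addword(const int *r)
    {
        if (r[0] == 1) {
            printf("voicetag stored as token %d "
                   "(timeouts=%d, mismatches=%d, spoke-too-soon=%d, help=%d)\n",
                   r[1], r[3], r[4], r[5], r[6]);
            return;
        }
        switch (r[1]) {
        case 17: puts("add failed: timeout");                     break;
        case 18: puts("add failed: noise");                       break;
        case 19: puts("add failed: spoke-too-soon");              break;
        case 20: puts("add failed: repeated name did not match"); break;
        case 21: puts("add failed: voicetag list already full");  break;
        }
    }

    int main(void)
    {
        int ok[]   = { 1, 66, 0, 0, 1, 0, 0 };   /* stored after one mismatch */
        int full[] = { 0, 21, 0, 0, 0, 1 };      /* list full; help played once */
        report_addword(ok);
        report_addword(full);
        return 0;
    }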
ETIQUETTE FOR VOICE CONTROLLED DEVICES
The standard VUI includes an etiquette for voice controlled devices. Generally, voice controlled devices (also referred to as machines) should conduct themselves like well-behaved guests.
However, human factors and human issues involved in living with voice controlled devices are largely unexplored. In designing voice controlled devices, the following suggestions should be considered.
Machine Requests to Humans
Machines can ask humans to do things. Any request should be polite. For example, a voice activated cellular telephone might ask to be placed in its charger when its batteries are running low. Humans should always have the option to refuse a machine's request, and the machine should politely accept that, unless the machine considers the situation threatening to human life or valuable data, in which case its protests can be more urgent.
Machines That Use the Telephone On Their Own
If a voice controlled device answers the telephone, or places a call to a human user, it should clearly identify itself as a machine if there is any risk of it being considered human.
Recording User Speech
No machine should record or transcribe a human user's conversations unless those humans present are aware that this is occurring.
Volume Levels
Machines should modulate their volume levels in response to ambient noise levels, unless specifically overridden by a human. Machines should be sensitive to when humans want them to be silent (for example, when humans are sleeping). Machines shouldn't babble needlessly, and should permit user barge-in as a means to silence them.
Machine-to-Machine Communication
FIG. 13 is a block diagram of a pair of voice controlled devices 102M and 102N (each also referred to as a machine) communicating, neither, one, or both of which could be using the standard voice user interface 500 of the present invention in the communication environment 1300. Voice controlled devices can talk to each other to find out what other voice controlled devices are present, what kinds of information they understand, and to exchange information. For example, a voice controlled TV may ask a voice controlled VCR about the settings necessary for it to operate. Machine-to-machine communication between voice controlled devices occurs in both audible and non-audible formats. Essentially, machine-to-machine communication using speech may occur over any speech-compatible media, including sound waves through air, conventional telephone links, Internet voice links, radio voice channels, and the like.
Machine-to-machine communication can occur where none of the machines, some of the machines, or all of the machines include the VUI of the present invention.
Using the standard VUI, a voice controlled device can locate other voice controlled devices within a communications environment in a number of ways. These include overhearing a human interact with another machine, overhearing a machine interact with another machine, explicitly requesting nearby machines to identify themselves by using the identification phrase "<silence> What is out there?", explicitly seeking a specific class of machines (e.g. all clocks) by addressing them by a name category "<silence> Clock are you out there?", or explicitly seeking a specific machine (e.g. a clock named Socrates) by addressing it by name "<silence> Socrates are you out there?".
In the first two cases, the process of listening to other conversations would reveal the other machines' names. In the other three cases, the machines within earshot that respond to the "are you out there" command would respond with their names. In the last two cases, the "are you out there" command is restricted to a certain class of machines or a specific named machine, thereby limiting the number of machines that will respond to the command. Once the name of the target voice controlled device is known, the initiating voice controlled device can issue other commands (e.g. "Socrates what time is it?") to the other.
In some cases, a voice controlled device may need to talk to another voice controlled device, one or both of which may not adhere to the above protocol. In these cases, the machines can be explicitly programmed to issue the correct commands and recognize appropriate responses. A simple example of this interaction would be a voice controlled device with voice recognition capability and a telephone voice interface dialing a voice-based service such as a spoken report of the time, and simply capturing the desired data (the time).

The preferred embodiments of the present invention for METHOD AND APPARATUS FOR STANDARD VOICE USER
INTERFACE AND VOICE CONTROLLED DEVICES are thus described. While the preferred embodiments of the present invention utilize a speaker independent voice recognition system, the present invention is also compatible with speaker dependent voice recognition systems. While the present invention has been described in particular embodiments, the present invention should not be construed as limited by such embodiments, but rather construed according to the claims that follow below.

MICROFICHE APPENDIX
ISD CONFIDENTIAL INFORMATION

Embedded Speech Recognition Processor
Advanced Information
SPEECH RECOGNITION IS NOW EASY TO ADD TO COMMAND AND CONTROL APPLICATIONS
VOICE RECOGNITION FEATURES
Speech recognition processor optimized for command and control applications
Complete voice recognition subsystem requiring no host processor overhead
User-friendly application-specific Voice User Interface (VUI)
True hands-free control; activation by voice
Standardized, easy to use interface for voice activated appliances; minimizes training by users
Accelerated application development by providing standard interface application software
Flexible API provides high level commands suitable for a wide variety of applications
Continuous digit recognition (0 through 9)
Natural number digit recognition ("oh" for zero, "eight hundred" for 800, etc.)
Continuous recognition of 'star', 'pound', and 'flash' for voice-dialing applications
Recognition always active; allows for voice control of on-the-air services
Zero power voicetag storage
Supports both speaker-independent and speaker-dependent continuous speech
Up to 65 speaker-dependent user-defined voicetags for voice-activated speed dialing
Supports up to four independent users through keyword activation and assigned phone books
Up to 30 speaker-independent application specific commands
99% recognition accuracy for both speaker-independent commands and continuous recognition of digits
Uses Hidden Markov Models and probabilistic modeling to recognize a wide variety of speakers

IMPORTANT NOTICE: This product concept and specifications are preliminary and subject to change without notice. Please contact lSD before using this information in any product design.
February 1999 iSD . 2045 Hamilton Avenue, San Jose, CA 95125 . TEL: 408/369-2400 . FAX:
408/369-2422 . Mtp:llwww.isd.com tsv 1-1 ISD CONFIDENTIAL INFORMATION
RECOGNITION ENGINE VOICE ACTIVATED APPLICATIONS

Combination DSP module and Command and control applications RISC processor where optimized for speech recognitionvoice control is preferable to keypad control Interfaces to N-Law, A-Law,Desktop or cordless phones or linear voice CODEC Cellular handsets Serial interface to host . Cellular car kits microcontroller Single +5V or +3.3V power Automobile navigation or supply interior accessory Quiescent current: 40mA control Power-down curent: 1 mA Home applicance and entertainment control.

Package: 80-pin QFP

Figure i: Stand-Alone Speech Recognition System Diagram v~~
RESETy EMCS ADDR
BM S CONT 76 Mblt TST InrS FLASH

X2 32k X16 MWCLK
Sp~~ Mic preamp CDOUT MWDIN HOST
and speak- CDIN MWRQST MICROPRO-Mic ~ er driver CODEC ~ CCLK MWCS CESSOR
MWRDY
CFSO Vss MWDOUT
Voice Solutions Jn Slliconry ISD CONFIDENTIAL INFORMATION
Chapter 1 HARDWARE
Figure 1-1: 80-MQFP Package Connection Diagram Y
U
a z z z z z °a > a > a a a a a a a > > IU >

A9 1 60VggA

DWEIMMDIN4 57PAOlWRO

ISE 5 56PA1lWR1 DO 6 55PA2lCTTL

VgS 9 ISD-SR3OOO 52PASIMWDOUT

V~~HI 11 50Vss V~~ 12 49PA718ST1 D4 13Top View 48VCC

PCOlAII17 44CDIN

N M ~ N M O pp O U O ~ ~_ N M V N N
,,,, U
4G°.°zzZZ~~>O~~DDOD~,.'n,,, Z
V' W ' IJJ 4J O M ~ Ln ~ n N
a a a w U U U a ~ a a a a a a a in O ~ m z a U U a aU.
a N01E: Pins marked NC should not be connected.
tso 1-3 _8$_ ISD CONFIDENTIAL INFORMATION
PIN ASSIGNMENT
The following sections detail the pins of the ISD-SR3000 processor. Slashes separate the names of signals that share the same pin.
PIN SIGNAL ASSIGNMENT
Table 1-1 shows all the pins, and the signals that use them in different configurations. It also shows the type and direction of each signal.
Table 1-1: ISD-SR3000 Pin-Signal Assignment P ~ ~L., , A(0-75) TTL A(0-15) Output CCLK TTL CCLK Output CDIN TTL CDIN Input CDOUT TTL CDOUT Output CFSO TTL CFSO Output D(0-T 5) TTL D(0-15) InputlOutput EMCS/ENVO TTL1 EMCS Output ~

CMOSz ENVO Input MWCLK TTL MWCLK Input MWCS TTL3 MWCS Input MWDIN TTL MWDIN Input MWRDY TTL MWRDY Output MWRQST TTL MWRQST Output MWDOUT TTL MWDOUT Output BMCS TTL BMCS Output IOCS TTL IOCS Output RESET Schmitt3 RESET Input TST TTL TST Input Vcc Power Vcc Vu Power Vss X2/CLKIN XTALTTL X2 CLKIN OSC Input 1. TTL7 output signals provide CMOS levels in the steady state for small loads.
2. Input during RfSEl, CMOS level input.
3. Schmitt Vigger input.
Voice Solutions In SIIIconT'"

_89_ ISD CONFIDENTIAL INFORMATION
Table 1-2: ISD-SR3000 Pin-Signal Assignment (Continued) ~,~,~~~ i ii I,;i~ ~ ~i~f ~i~y'~p i ~ I i~,~f,, i i ..
1 A9 A9 i~
O ~ ~ i Address bit 9 2 A10 A10 O Address bit 10 3 MMDOUT MMDOUT O Master MICROWIRE data output 4 MMDIM MMDIN 1 Master MICROWIRE data input 6 DO DO I/OData bit 0 7 D1 D1 I/OData bit 1 8 D21RA11 D2 I/OData bit 2 g Vu V55 PowerGround for on-chip logic and output drivers 10 D3 D3 IIOData bit 3 11 VccHl VccHl PowerPower: +3.3V or +5V for on-chip voltage regulator 12 Vcc Vcc PowerPower: +3.3V or +5V for on-chip logic and output drivers 13 D4 D4 I/OData bit 4 14 D5 D5 UO Data bit 5 15 D6 D6 IIOData bit 6 16 D7 D7 I/OData bit 7 17 PCO/A11 PCO Il0Port C, bit 0 A11 O Address bit 11 16 NC NC Do not connect 19 NC NC Do not connect 20 NC NC Do not connect 21 PC/A72 PC1 O Port C, bit 1 A12 O Address bit 12 22 PC2/A13 PC2 O Port C, bit 2 A13 O Address bit 13 23 PC3/A14/BE0PC3 O Port C, bit 3 A14 O Address bit 14 BEO O Byte enable bit 0 24 PC4/A158E1PC4 O Port C, bit 4 A15 O Address bit BE1 O Byte enable bit 1 ENV2 O Environment select bit 2 fso 1-5 ISD CONFIDENTIAL INFORMATION
Table 1-2: ISD-SR3000 Pin-Signal Assignment (Continued) ,I I I, ~ ~i~~l'I~I~!'i,J , , I ~ ~,~ iI I ~ I i ' ' I
I ~~B I il y ~l j 25 'I','~il ',,~I ~ I
I I , Port C, bit 5 p PC5/IOCSlENV3 O

IOCS O Il0 expansion chip select ENV3 O Environment select bit 3 26 PC6IEMCS/ENVO PC6 O Port C, bit 6 EMCS O Expansion memory chip select ENVO O Environment select bit 0 27 PC7IBMCS/ENV1 PC7 O Port C, bit 7 BMCS O Base memory chip select ENV1 O Environment select bit 1 28 PBO/D8 PBO O Port B, bit 0 DB I/O Data bit 8 29 PBIID9 PB1 O Port B, bit 1 D9 I/O Data bit 9 30 V~~ V~~ PowerPower: +3.3V or +5V for on-chip logic and output drivers 31 P821D10 PB2 O Port B, bit 2 D10 I/O Data bit 10 32 Vss Vss PowerGround for on-chip logic and output drivers 33 P831D11 PB3 O Port B, bit 3 D11 I/O Data bit 11 34 PB4/D72 PB$ O Port B, bit 4 D12 Il0 Data bit 12 35 PBSID13 PB5 O Port B, bit 5 D13 I/O Data bit 13 36 P86/D74 PB6 O Port B, bit 6 D14 1/O Data bit 14 37 P87/D15 P87 O Port B, bit 7 D15 I/O Data bit 15 38 INT3/MWCS INT3 I External interrupt MWCS I MICROWIRE chip select 39 RESET RESET I Reset 40 NC NC Do not connect 41 CFS1/PWM CFS1 0 CODEC 1 Frame synchronization PWM O Pulse width modulation 1-s Volce Solutions In SIIiconT"

ISD CONFIDENTIAL INFORMATION
Table 1-2: ISD-SR3000 Pin-Signal Assignment (Continued) PinPin Name Signal TypeDescription Name 42 CDOUT CDOUT O Data output to CODEC

43 CFSO CFSO I/O CODEC 0 Frame synchronization 44 CDIN CDIN I Data input from CODEC

45 CCLK CCLK I/O CODEC Masterlslave clock 46 PD1/MWDIN PD1 I Port D, bit 1 MWDIN I MICROWIRE data input 47 PDO/MWCLK PDO I Port D, bit 0 MWCLK I MICROWIRE clock 48 Vcc Vcc PowerPower: +3.3V or +5V for on-chip logic and output drivers 49 PA7/BST7 PA7 I/O Port A, bit 7 BSTt O Bus status bit 1 50 Vss Vss PowerGround for on-chip logic and output drivers 51 PA6/BSTO PA6 I/O Port A, bit 6 BSTO O Bus status bit 0 52 PA5/MWOUT PA5 UO Port A, bit 5 MWOUT O MICROWIRE data output 53 PA4/MWRDY PA4 UO Port A, bit 4 MWRDY O MICROWIRE ready 54 PA3/PFS PA3 I/O Port A, bit 3 PFS

55 PA2/CTTL PA2 I/O Port A, bit 2 CTTL O CPU Clock 56 PA1/WR1 PA1 UO Port A, bit 1 57 PAOIWRO PAO I/O Port A, bit 0 WRO

58 X1/PL1 X1 Osc,Crystal oscillator interface 59 X2/CLKIN X2 Osc,Crystal oscillator interface CLKIN I Oscillator clock input 60 VssA VssA PowerGround for on-chip analog circuitry 61 Vss Vss PowerGround for on-chip logic and output drivers 62 CAS/MMCLK CAS O DRAM column address sVObe fsv 1-7 _92_ ISD CONFIDENTIAL INFORMATION
Table 1-2: ISD-SR3000 Pin-Signal Assignment (Continued) Pin Pin Name Signal TY~ Description Name MMCLK O Master MICROWIRE clock 63 VccA VccA PowerPower: +3.3V or +5V for on-chip analog circuitry 64 Vcc Vcc PowerPower: +3.3V or +5V for on-chip logic and output drivers 65 AS A8 O Address bit 8 66 A7 A7 O Address bit 7 67 A6 A6 O Address bit 6 68 A5 A5 O Address bit 5 69 A4 A4 O Address bit 4 70 A3 A3 O Address bit 3 71 A2 A2 O Address bit 2 72 Vcc Vcc PowerPower: +3.3V or +SV for on-chip logic and output drivers 73 A1 A1 O Address bit 1 74 V55 V55 PowerGround for on-chip logic and output drivers 75 AO/A16/DDIN AO O Address bit 0 A16 O Address bit 16 DDIN O Data direction 76 NC NC Do not connect 77 NC NC Do not connect 78 NC NC Do not connect 79 NC NC Do not connect 80 NC NC Do not connect Voice Solutions In SIIIconT"

r'-.

ISD CONFIDENTIAL INFORMATION
SIGNAL DESCRIPTION
The following signals are used for the interface OUTPUT SIGNALS
protocol. Input and output are relative to the ISD-SR3000. MWDOUT
MICROWIRE Data Out. Used for output only, for INPUT SIGNALS transferring data from the ISD-SR3000 to the mi-crocontroller. When the ISD-SR3000 receives data MWDIN it is echoed back to the microcontroller on this signal, unless the received data is OxAA. In this MICROWIRE Data In. Used for input only, for trans- case, the ISD-SR3000 echoes a command's ferring data from the microcontroller to the ISD- return value.
SR3000.
MWRDY
MWCLK MICROWIRE Ready. When active (0), this signal in-This signal serves as the synchronization clock dur- dicates that the ISD-SR3000 is ready to transfer (re-ing communication. One bit of data is transferred ceive or transmit) another byte of data.
on every clock cycle. The input data is available on MWDIN, and is latched on the clock rising This signal is set to 1 by the ISD-SR3000 after each edge. The transmitted data is output on MWDOUT byte transfer has been completed. It remains 1, on the clock falling edge. The signal should re- while the ISD-SR3000 is busy reading the byte, writ-main low when switching MWCS. ing the next byte, or executing the received com-mand (after the last parameter has been received). MWRDY is cleared to 0 after reset. For MWCS proper operation after a hardware reset, this sig MICROWIRE Chip Select. The MWCS signal is nal should be pulled up.
cleared to 0, to indicate that the ISD-SR3000 is be-ing accessed. Setting MWCS to 1 causes the ISD- MWRQST
SR3000 to start driving MWDOUT with bit 7 of the Uansmitted value. Setting the MWCS signal resets MICROWIRE Request. When active (0), this signal the transfer-bit counter of the protocol, so the sig- indicates that new status information is available.
nal can be used to synchronize between the ISD- MWRQST is deactivated (set to 1), after the ISD-SR3000 and the microcontroller. SR3000 receives a GSW (Get Status Word) com-mand from the microcontroller. After reset, this sig-To prevent false detection of access to the ISD- nal is active (0) to indicate that a reset occurred.
SR3000 due to spikes on the MWCLK signal, use MWRQST, unlike all the signals of the communica-this chip select signal, and toggle the MWCLK in- lion protocol, is an asynchronous line that is con-put signal, only when the ISD-SR3000 is accessed. trolled by the ISD-SR3000 firmware.
~sn 1-9 ISD CONFIDENTIAL INFORMATION
SIGNAL USE IN THE INTERFACE PROTOCOL
After reset, both MWRQST and MWRDY are cleared 3. The MWRDY signal is activated (cleared to to 0. 0) by the ISD-SR3000 when it is ready to re-The MWRQST signal is activated to indicate that a ceive the first parameter byte (if there are reset occurred. The EV_RESET bit in the status reg- any parameters) and so on till the last byte ister is used to indicate a reset condition. of parameters is transferred. An active MWRDY signal after the last byte of param-The GSW command should be issued after reset eters indicates that the command was to verify that the EV RESET event occurred, and to parsed and (if possible) executed. If that deactivate the MWRQST signal. command has a return value, the micro-controller must read the value before issu-While the MWCS signal is active (0), the ISD-SR3000 reads data from MWDIN on every rising edge of ing a new command.
MWCLK. ISD-SR3000 also writes every bit back to 4. When a return value is transmitted, the MWDOUT. This bit is either the same bit which was MWRDY signal is deactivated after every read from MWDIN (in this case it is written back as byte, and activated again when the ISD-a synchronization echo after some propagation SR3000 is ready to send another byte, or to delay), or it is a bit of a value the ISD-SR3000 trans- receive a new command.
mils to the microcontroller (in this case it is written The MWRDY signal is activated (cleared to 0) after on every falling edge of the clock). reset, and after a protocol time-out.
(See "INTER-When a command has more than one parame- FACE PROTOCOL ERROR HANDLING" on page ter/return-value, the parameters/return-values are 5.) transmitted in the order of appearance. If a pa- The MWRQST signal is used as follows:
rameter/return-value is more than one byte long, the bytes are transmitted from the most significant 1. The MWRQST signal is activated (cleared to 0), to the least significant. when the status word is changed.
The MWRDY signal is used as follows: 2. The MWRQST signal remains active (0), until the 1. Active (0) MWRDY signals the microcontrol- ISD-SR3000 receives a GSW
command.
ler that the last eight bits of data transferred Figure 1-2 illustrates the sequence of activities dur-to/from the voice module were accepted ing a MICROWIRE data transfer.
and processed (see below).
2. The MWRDY signal is deactivated (set to 1 by the ISD-SR3000) after 8-bits of data were transferred to/from the ISD-SR3000. The bit is set following the falling edge of the eighth MWCLK clock-cycle.
1-10 Voice Solutions in Silicon"

ISD CONFIDENTIAL INFORMATION
Figure 1-2: Sequence of Activities During a MICROWIRE Byte Transfer twcut &It 7 Qxt ~ ~ ~ ~ gM 1 BB Gt II i i i i i i i i i i i i ~ i i i i i w ~ vi x x AiBtlalrl ~ mn ew a eff s ~ rr i er o p:~el~
r ~nrn~ roar ~~er orr INTERFACEPROTOCOLERROR HANDLING
Interface Protocol Time-outs Echo Mechanism Depending on the ISD-SR3000'sThe ISD-SR3000 echoes back state, if more than to the microcontrol-100 milliseconds elapse ler all the bits received between the assertion of by the ISD-SR3000. Upon the MWRDY signal and the detection of an error in transmission 8th bit of the echo, the microcon-the next byte pertaining trolley should stop the to the same command protocol clock, which even-transaction, a time-out tually causes a time-out event occurs, and the ISD- error (i.e., ERR_TIMEOUT
bit SR3000 responds as follows:is set in the error word).

1. Sets the error bit in the status word to 1.
NOTE When a command has a return value, the 2. Sets the EV TIMEOUT bit in the error word ISD-SR3000 transmits bytes of the return tol. value instead of the echo value.
3. Activates the MWRQST signal (clears it to 0).
4. Activates the MWRDY signal (clears it to 0). The ISD-SR3000 transmits a byte as an echo when it receives the value OxAA from the microproces 5. Waits for a new command. (After a time- soy. Upon detection of an error the ISD-SR3000 ac out occurs, i.e., the microcontroller re- tivates the MWRQST signal, and sets the ceived MWRQST during the command ERR_COMM bit in the error word.
transfer, or result reception, the microcon-troller must wait at least four milliseconds before issuing the next command.) tsv 1-11 ISD CONFIDENTIAL INFORMATION
FUNCTIONAL DESCRIPTION
This section provides details of the functional char- Figure 1-3: Recommended Power-On Reset acteristics of the ISD-SR3000 processor. It is divid- Circuit ed into the following sections:
~cc ~ Resetting ~ Clocking ~ Power-down mode o R

~ Power and grounding ~ CODEC interface RESET
C
RESETTING '1 The RESET pin is used to reset the ISD-SR3000 pro-cessor.
On application of power, ~~KING
RESET must be held low for at least tp,~ after The ISD-SR3000 processor Vcc is stable. This ensuresprovides an internal os-that all on-chip voltages are cillator that interacts completely stable before with an external clock source operation. Whenever RESET through the X1 and X2/CLKIN
is applied, it must also pins. Either an exter-remain active for not less nal single-phase clock signal, than tRST. During this or a crystal oscilla-pe-riod, and for 100 ms after,tor, may be used as the the TST signal must be clock source.

high. This can be done with a pull-up resistor on the TST pin. External Single-Phase Clock Signal The value of MWRDY is undefinedIf an external single-phase during the reset clock source is used, it period, and for 100 ms after.should be connected to the The microcontroller CLKIN signal as should either wait before shown in Figure 1-4, and polling the signal for should conform to the the first time, or the signal voltage-level requirements should be pulled high dur- for CLKIN stated in ing this period. "Electrical Characteristics"
on page 1-11.

Upon reset, the ENVO signal is sampled to deter-mine the operating environment.
During reset, the Figure 1-4: External Clock Source EMCS/ENVO pin is used for the ENVO input signals.

An internal pull-up resistor~--I
sets ENVO to 1.

After reset, the same pin is used for EMCS. ' ISD~SR3000 System Load on ENVO X~ X2/CLKIN
For any load on the ENVO pin, the voltage should not drop below VENVh~
Single-phase Clock Signal If the load on the ENVO pin causes the current to exceed 10 NA, use an external pull-up resistor to Clock Generator keep the pin at 1.
Figure 1-3 shows a recommended circuit for gen-erating a reset signal when the power is turned on.
1-12 Voice Solutions in Sllicon'~

_97_ LSD CONFIDENTIAL INFORMATION
Crystal Oscillator You can use crystal oscillators with maximum load A crystal oscillator is connected to the on-chip os- capacitance of 20 pF, although the oscillation cillator circuit via the X1 and X2 signals, as shown frequency may differ from the crystal's specified in Figure 1-5. value.
Table 1-3 lists the components in the crystal oscil-lator circuit.
Figure 1-5: Connections for an External Crystal Oscillator Table 1-3: Crystal Oscillator Component List CrystalResonance 4.096 ResonatorFrequency MHz Resistor10Mf2 5 Capacitors33pF 20%

C1.

Keep stray capacitance and inductance, in the oscillator circuit, as low as possible, The crystal res-onator, and the external components, should be as close to the X1 and X2/CLKIN pins as possible, to keep the trace lengths in the printed circuit to an absolute minimum.
rsv 1-13 _98_ ISD CONFIDENTIAL INFORMATION
POWER-DOWN MODE
Power-down mode is useful during a power failure NOTE Entering or exiting power-down mode can or in a power-saving model, when the power distort the real-time clock by up to 500 source for the processor is a backup battery or in Ltsec. Thus, to maintain the accuracy of battery-powered devices, while the processor is in the real-time clock, enter or exit the idle mode. power-down mode as infrequently as possible.
In power-down mode, the clock frequency of the ISD-SR3000 processor is reduced and some of the processor modules are deactivated. As a result, the ISD-SR3000 processor consumes considerably less power than in normal-power mode.
NOTE In Power-down mode all the chip select signals, CSO to CS3, are set to 1. To guar-antee that there is no current flow from these signals to the Flash devices, the power supply to these devices must not be disconnected.
The ISD-SR3000 stores voice tags in Flash memory.
When Flash memory is used for storage, power does not need to be maintained to the processor to preserve stored messages.
To keep power consumption low in power-down mode, the RESET, MWCS, MWCLK and MWDIN sig-nals should be held above Vcc - 0.5 V or below Vss + 0.5 V.
The PDM (Go To Power-down Mode) command switches the ISD-SR3000 to power-down mode.
(For an explanation of the ISD-SR3000 processor commands, see Table 2-1.) This command may only be issued when the processor is in the idle mode. (For an explanation of the SR3000 states, see "Command Execution" on page 45.) If it is necessary to switch the power-down mode from any other state, the controller must first issue an S
command to switch the processor to the idle state, and then issue the PDM command. Send-ing any command while in power-down mode re-sets the ISD-SR3000 processor detectors, and returns it to normal operation mode.
1-14 Volce Solutions In SIlIcon~"

ISD CONFIDENTIAL INFORMATION
THE CODEC INTERFACE SHORT FRAME PROTOCOL

The ISD-SR3000 provides When the short frame protocol an on chip interface for is configured, eight analog and digital telephony,or sixteen data bits are supporting master exchanged with each CO-and slave CODEC interface DEC in each frame (i.e., modes. In master the CFSO cycle). Data mode, the ISD-SR3000 controlstransfer begins when CFSO
the operation of is set to 1 for one CCLK

the CODEC for use in analogcycle. The data is then telephony. In the transmitted, bit by bit, via slave mode, the ISD-SR3000 the CDOUT pin. Concurrently, CODEC interface is the received data is controlled by an external shifted in through the CDIN
source. This mode is pin. Data is shifted one used in digital telephony bit per CCLK cycle. After (i.e., ISDN or DECT lines).the last bit has been shift-The slave mode is implementeded, CFS1 is set to 1 for with respect to one CCLK cycle. Then, the IOM-2TMICGI specifications.data from the second CODEC
is shifted out via CDOUT, concurrently with the inward shift of the See Table 1-4 for CODEC data received via CDIN.
options for the ISD-SR3000 (ISD supports compatible CODECS in ad-dition to those listed below).

The CODEC interface supportsSONG FRAME PROTOCOL
the following fea-tures: When long frame protocol is configured, eight or sixteen data bits are exchanged with each CO-Master Mode or Slave Mode. DEC, as for the short frame protocol. However, for 8- or 16-bit channel width.the long frame protocol, data transfer starts by Long (Variable) or Short setting CFSO to 1 for eight (Fixed) Frame Protocol. or sixteen CCLK cycles.

Short or long frame protocol is available in both Sin le or Double bit clock Master and Slave modes.
rate.
' 9 Single or Dual Channel CODECS

One or Two CODECS

Multiple clock and sample rates.

One or Two frame sync signals This CODEC interface uses five signals: CDIN, CD-OUT, CCLK, CFSO, and CFS1.
The CDIN, CDOUT, CCLK, and CFSO pins are connected to the first CODEC. The second CODEC
is connected to CDIN, CDOUT, CCLK, and CFS1 pins. Data is trans-ferred to the CODEC through the CDOUT output pin. Data is read from the CODEC through the CDIN input pin. The CCLK
and CFSO pins are out-put in Master Mode and input in Slave Mode. The CFS1 is an output pin.

rso 1-15 ~D CONFIDENTIAL INFORMATION
Table 1-4:Supported CODEC Devices National TP3054 Single 5V E!-Law Semiconductor CODEC

National TP3057 Single 5V A-Law Semiconductor CODEC

OKI MSM7533VDual CODEC5V ,ll-Law, A-Law OKt MSM7704 Dual CODEC3.3V ,1l-Law, A-Law, LV

Macronix MX93002FCDual rail 5V El-Law CODEC

Lucent T7502 Dual CODEC5V A-Law Lucent T7503 Dual CODEC5V EL-Law Channel Width SLAVE MODE
The CODEC interface supports both 8-bit and 16- The ISD-SR3000 supports digital telephony appli-bit channel width in Master and Slave Modes. Fig- cations including DECT and ISDN by providing a ure 1 shows how the CODEC interface signals be- Slave Mode of operation. In Slave Mode opera-have when short frame protocol is configured. tion, the CCLK signal is input to the SR-3000 and controls the frequency of the CODEC interface operation. The CCLK may be any frequency be-tween 500 kHz and 4 MHz. Both long and short frame protocols are supported with only the CFS1 output signal width affected. The CFSO input sig-nal must be a minimum of one CCLK cycle.
1-16 Volce Solutions in Silicon ISD CONFIDENTIAL INFORMATION
In slave mode, a double The CODEC interface is enabled clock bit rate feature while the system is available as well. When is in normal operation mode the CODEC interface is and when configured to double clock MCFG.CDE = 1. It is disabled bit rate, the CCLK in- when MCFG.CDE =

put signal is divided internally0, during reset, or whenever by two and the re- the system is in power suiting clock used to controldown mode.
the frequency of the CODEC interface operation.

This interface supports ISDN protocol with one bit clock rate or double bit clock rate. The exact for-mat is selected with the CFG command. The slave CODEC interface uses four signals: CDIN, CDOUT, CCLK, and CFSO. The CDIN, CCLK, and CFSO input pins and the CDOUT output pins are connected to the ISDN/DECT
agent. Data is trans-ferred to the ISD-SR3000 through the CDIN pin and read out through the CDOUT
pin. The CFSO pin is used to define the start of each frame (see be-low). The source of that signal is at the master side.

The CCLK is used for bit timing of CDIN and CD-OUT. The rate of the CCLK
is configured via the CFG command and can be twice the data rate, or at the data rate. The source of that signal is at the master side.

Table 1-5:Typical CODEC Applications Analog single Master8 short 1 2.0488000 1 1 or long ISDN--8bitdual 2 Slave8 short 1 2.0488000 1 digital-- or A-Law Linear single Master16 short 7 2.0488000 1 IOM-21GCIsingle Slave8 short 1 1.5368000 1 or 1-3 or dual ~so 1-17 ISD CONFIDENTIAL INFORMATION
SPECIFICATIONS
ABSOLUTE MAXIMUM RATINGS ELECTRICAL CHARACTERISTICS
TA = 0C to +70C, V~~ = 5 V
t 10%, GND = 0 V

Storage temperature-65C to +150C

erature under -40C to bias +85C
T
m p e All input or -0.5 V to NOTE Absolute maximum ratings output voltages,+6.5 V indicate limits with respect beyond which permanent damage to GND may occur. Continuous operation at these limits is not intended,' operation should be limited to the conditions specked below.

Table 1-6:Electrical Characteristics CX X1 and X2 Capacitance' 17.0 pF

Icc~ Active Supply Normal Operation 40.0 80.0 mA
Current Mode, Running Speech Applications2 IccZ Standby Supply Normal Operation 30.0 mA
Current Mode, DSPM
IdIe2 Iccs Power-down Mode Power-Down 0.7 NA
Supply Modez~3 Current h Input Load Current'0 V 5 ViN -5.0 5.0 NA
<_ Vcc Ip Output Leakage 0 V <- Vpur -5.0 5.0 NA
(Off)Current (I/O <_ Vcc pins in Input Mode) VHh CMOS Input with 2.1 V
Hysteresis, Logical 1 Input Voltage VHF CMOS Input with 0.8 V
Hysteresis, Logical 0 Input Voltage VHys Hysteresis Loop 0.5 V
Width4 ViH TTL Input, Logical 2.0 Vcc + 0.5 V
1 Input Voltage Vit TTL Input, Logical -0.5 0.8 V
0 Input Voltage VoH Logical 1 TTL, IoH = -0.4 2.4 V
Output Voltage NA

1-18 Voice Solutions in Silicon"' ISD CONFIDENTIAL INFORMATION
Table 1-6: Electrical Characteristics (Continued) VoHwcMMCLK, MMDOUT IpH = -0.4 2.4 V
and EMCS LogicalNA
Out ut Volta e p IpH = -50 VCC V
g NAS -0, , 2 VpL Logical 0, TTL Ipt = 4 mA 0.45 V
Output Voltage Ipt = 50 0.2 V
NAs VoLwcMMCLK, MMDOUT Ipt = 4 mA 0.45 V
and EMCS Logical ut Volta Out e p IpL = 50 0.2 V
g NAs , VxH CLKIN Input, External 2.0 V
High Voltage Clock Vxt CLKIN Input, External 0.8 V
Low Voltage Clock 9. Maximum 20 NA for all pins together.
2. tour = 0, TA = 25°C, Vcc = 5 V, operating from a 4.096 MHz crystal and running from internal memory with Expansion Memory disabled.
3. All input signals are tied to 1 or 0 (above Vcc - 5 V or below Vss = 5 V.) 4. Guaranteed by design.
5. Measured in power-down mode. The total current driven or sourced by all the ISD-SR3000 processors output signals is less than 50 NA.
tsD 1-19 SWITCHING CHARACTERISTICS
Definitions All timing specifications in this section refer to 0.8 V or 2.0 V on the rising or falling edges of the signals, as illustrated in Figure 1-7 through 1-12, unless specifically stated otherwise.
Maximum times assume capacitive loading of 50 pF.
CLKIN crystal frequency is 4.096 MHz.
NOTE CTTL is an internal signal and is used as a reference to explain the timing of other signals. See Fioure 1-21.
NOTf: Signal valid, active or inactive time, after a rising edge of CTTL or MWCLK.
Figure 1-7: Synchronous Output Signals (Valid) MWCLK
0.8 V
2.0 V
Signal 0.8 V
LSignal Note: Signal valid time, aRer a falling edge of MWCLK
1-20 Voice Solutions In SIIIcon"' Figure 1-6: Synchronous Output Signals (Valid, Active and Inactive) ISD CONFIDENTIAL INFORMATION
NOTE Absolute maximum ratings indicate limits beyond which permanent damage may occur. Continuous operation at these limits is not intended, operation should be limited to the conditions specified below.
NOTE: Signal hold time, after a rising edge of CTTL.
Figure 1-9: Synchronous Output Signals (Hold), After a Falling Edge of MWCLK
Note: Signal valid time, after a falling edge of MWCLK.
fsn 1-21 Figure 1-8: Synchronous Output Signals (Hold), After a Rising Edge of CTTL

ISD CONFIDENTIAL INFORMATION
Signal A
Figure 1-10: Synchronous Input Signals Figure 1-11: Asynchronous Signals 0.8 V
2.0 V
2.0 V
Signal B
0.8 V
tSignal NO1E: Signal B starts after rising or falling edge of signal A.
The RESET signal has a Schmitt trigger input buffer. Figure 1-12 shows the characteristics of the input buff-er.
1-22 Voice Solutions in SlliconT"
NOTE: Signal setup time, before a rising edge of CTTI or MWCLK, and signal hold time after a rising edge of CTIL or MWCIK.

ISD CONFIDENTIAL INFORMATION
Figure 1-12: Hysteresis Input Characteristics Vout VHI VHh SYNCHRONOUS TIMING TABLES
In this section, R.E. means Rising Edge and F.E. means Falling Edge.
Output Signals Table 1-7: Output Signals Io, j~l'~~~Iill9il~I',I. !~'V~,7.n1AiII~ !'~I
tAh 1-20 n h, I"P,' i' I~ il Address Hold After R.E. CTTL~,, 0.0 tA~ 1-20 Address Valid After R.E. CTTL, 12.0 tccLxa1-11 CCLK Active After R.E. CTTL 12.0 tcct~,1-11 CCLK Hold After R.E. CTTL0.0 tccLaa1-11 CCLK Inactive After R.E. CTTL 12.0 tcoon1-11 CDOUT Hold After R.E. CTTL0.0 tcoo~1-11 CDOUT Valid After R.E. CTTL 12.0 tcrp 1-21 CTTL Clock PeriodsR.E. CTTL to 25.0 50,000 next R.E. CTTL

tEMCSa1-20 EMCS Active After R.E. CTTL. 12.0 tEMCSh1-20 EMCS Hold After R.E. CTTL0.0 tEMCSia1-20 EMCS Inactive After R.E. CTTL 12.0 trsa 1-11 CFSO Active After R.E. CTTL 25.0 tFSh 1-11 CFSO Hold After R.E. CTTL0.0 tFSia1-11 CFSO Inactive After R.E. CTTL 25.0 tMMCLKa Master MICROWIREAfter R.E. CTTL 12.0 Clock Active fso 1-23 ISD CONFIDENTIAL INFORMATION
Table 1-7: Output Signals (Continued) i I ~, Il,h~ ii~l' tMMCLKh Master MICROWIREp i~l ~, I 0.0 Clock Hold ~i After R.E.
CTTL

4.AMCLKia Master MICROWIREAfter R.E. 12.0 Clock Inactive CTTL

tMMOOn Master MICROWIREAfter R.E. 0.0 Data Out Hold CTTL

tMMDOv Master MICROWIREAfter R.E. 12.0 Data Out Valid CTTL

tMwoor1-4 MICROWIRE Data After R.E. 70.0 FloatZ MWCS

tMWDOh1-4 MICROWIRE Data After F.E. 0.0 Out Holdz MWCK

tMwoonr1-4 MICROWIRE Data After F.E. 0.0 70.0 No FloatZ MWCS

tMWDOv1-4 MICROWIRE Data After F.E. 70.0 Out Valid2 MWCK

tMwirop1-13 MWDIN to MWDOlITPropagation 70.0 Time tMwaoora1-4 MWRDY Active After R.E. 0.0 35.0 of CTTL

4nwROria1-4 MWRDY Inactive After F.E. 0.0 70.0 MWCLK

tPnecn1-14 PB and MWRQST After R.E. 0.0 CTTL

tPABCv1-14 PB and MWRQST After R.E. 12.0 CTTL. T2W1 1. In normal operation mode tcrp must be 25.Onr in power-down mode, tcrp must be 50,000 ns.
2. Guaranteed by design, but not fully tested.
Input Signals Table 1-8: Input Signals tcpin1-11 CDIN Hold After R.E. CTTL 0.0 tcois1-11 CDIN Setup Before R.E. C1TL 11.0 tpin Data in Hold (D0:7)After R.E. CTTL 0.0 T1, T3 or TI

toes Data in Setup (D0:7)Before R.E. CTTL 15.0 T7, T3 or TI

tMMDINh Master MICROWIRE After R.E. CTTL 0.0 Data In Hold tMMDINs Master MICROWIRE Before R.E. CTTL 11.0 Data In Setup tMwcxn1-4 MICROWIRE Clock At 2.0 V (both edges) 100.0 High (slave) tMwc~1-4 MICROWIRE Clock At 0.8 V (both edges) 100.0 Low (slave) tMWCKp1-4 MICROWIRE Clock R.E. MWCLK to next 2.5 Period (slave) R.E. MWCLK NS

(MWCtxn1-4 MWCLK Hold After MWCS becomes 50.0 inactive tMWCLKS1-4 MWCLK Setup Before MWCS becomes 100.0 active 1-24 Voice Solutions In SIlIcon""

ISD CONFIDENTIAL INFORMATION
Table 1-8: Input Signals (Continued) tMwcsn1-4 MWCS Hold otter r.t. nnwctr< 5u.v tMwcss1-4 MWCS Setup Before R.E. MWCLK 100.0 tMwom1-4 MWDIN Hold After R.E. MWCLK 50.0 tnnwois1-4 MWDIN Setup Before R.E. MWCLK 100.0 tP,~,R1-23 Power Stable After Vcc reaches 30.0 to RESET R.E.Z 4.5 V ms tRSrw1-23 RESET Pulse At 0.8 V (both edges)10.0 Width ms txn 1-21 CLKIN High At 2.0 V (both edges)tX1 p/2 txi 1-21 CLKIN Low At 0.8 V (both edges)tXlp/2 txP 1-21 CLKIN Clock R.E. CLKIN to next 24.4 Period R.E. CLKIN

1. Guaranteed by design, but not fully tested in power-down mode.
2. Guaranteed by design, but not fully tested.
TIMING DIAGRAMS
cxtl.
aa.a~, fsv 1-25 Figure 1-13: SRAM Read Cycle Timing ISD CONFIDENTIAL INFORMATION
Figure 1-14: CODEC Short Frame Timing cm ccuc CFSOI

CDOUT
CDIN
NOTE: This cycle may be either TI (Idle), T2, T3 or T3H.
Figure 1-15: CODEC Long Frame Timing cm Cf501 CDODT
CDIN
1-26 Volce Solutions In STIIconT"' ISD CONFIDENTIAL INFORMATION
Figure 1-16: Slave CODEC CCLK and CFSO Timing CCLK
CFSO
Figure 1-17: MICROWIRE Transaction Timing--Data Transmitted to Output ~te~r crrc i ~sn 1-27 ISD CONFIDENTIAL INFORMATION
Figure 1-18: MICROWIRE Transaction Timing--Data Echoed to Output rwauc ,,~:
rwcw iwcour ,~ca ~anrrt '-crrr Figure 1-19: Master MICROWIRE Timing W
1-28 Voice Solutions in SiliconT°
«. f_. s., :11.... 4... 4.... 14.. t,..tm._ ~,d.:"~ L.z_ 1.~~ 9t#a:..

ISD CONFIDENTIAL INFORMATION
Figure 1-20: Output Signal Timing for Port PB and MWRQST
1. This cycle may be either TI (Idle), T3 or T3H.
2. Data can be driven by an external device at T2W7, T2VY, T2 and T3.
Figure 1-21: CTTL and CLKIN Timing P ~ ~r~y t~~
~r ~,w tsn 1-29 r3 r, ,xirc rw tf t9. tdl~

ISD CONFIDENTIAL INFORMATION
Figure 1-22: Reset Timing When Reset is not at Power-Up Figure 1-23: Reset Timing When Reset Is at Power-Up Jk Yk ~-3p Voice Solutions In SIlIcon'"

ISD CONFIDENTIAL INFORMATION
Chapter 2 SOFTWARE
OVERVIEW TYPES OF RECOGNITION
The ISD-SR3000 software resides in the on-chip The ISD-SR3000 is capable of both speaker-inde-ROM. It includes voice recognition algorithms, sys- pendent and speaker dependent recognition.
tem support functions and a software interface to Speech input pattern is continuous for both com-hardware peripherals. The following sections de- mands and digits, allowing for a natural speech scribe the ISD-SR3000 software in detail. pattern. The commands and digits are speaker-independent, with models constructed from a large corpus of speakers. The user-provided voic-etags for the phone book are partially speaker-RECOGNITION PROCESSOR dependent. However, they are constructed by ISD-SR3000 uses a segmented sub-phoneme rec- creating acoustic models "on-the-fly" from the ognition process. The sampled speech utterance Phoneme base. This means only two passes are is split into distinct phonetic sounds, the smallest required for entering the names, and recognition units of speech. Because these phonemes vary in is possible with some variation in the way the both sound and duration, the processor must be name is spoken. The first pass is used to create the able to determine boundaries between the Phoneme model, and the second pass is used for sounds. The ISD-SR3000 uses Hidden Markov Mod- recognition confirmation.
els to hypothesize boundaries between sounds GR~~R
and to form probabilistic models on each possi-ble combination. Grammar is used to define the structure of the The outputs are then classified by determining commands. ISD-SR3000 is designed to work with matches between the phonetic sounds and the finite-state grammar. This type of grammar is de-stored phoneme models. The acoustic models for signed to limit perplexity by pre-defining the num-the phonemes are gathered from a large sample ber of allowable words at a given state. Perplexity of speakers, allowing for a wide variation across is defined as the number of branches possible accents, dialect, and gender. This allows the rec- during recognition. For example, a prompt that ognizer to associate the sound segments with a requires a "yes" or "no"
response has a perplexity number of possible phonemes, enabling recogni- of two. Greater perplexities increase the chances tion when words are pronounced differently. for substitution errors. During recognition, a limited number of topics are active. Topics are groups of The phonemes are then matched to vocabulary words that are active at a given time. For exam-words or phrases using a search routine. The set of ple, in a voice dialing application, digit topics are phonemes is compared to the vocabulary mod- active after the user issues the "dial" command.
els for the active topics, and the recognized word No other topics are open (except the global top-is returned. If the phonemes do not match any of ics such as "cancel" or "help") so that the recog-the active vocabulary words, a token is returned nizer is only trying to recognize digits. This type of indicating the word is not in the vocabulary. This grammar and active topics inherently increases token can be used by the Voice User Interface to recognition accuracy.
return a help prompt to the user. The ISD-SR3000 does not return a score with the word; like a digital system, it either recognizes a word, or it doesnlt.
iso 1-33 ISD CONFIDENTIAL INFORMATION
Figure 2-1: Topic and Grammar Organization Command To ics Call Dial tore a ete Answer Parameters Hangup Name Mute Digits n ine ~Redial From this example, it can on the tokens, refer to the be seen how topics are ISD-SR3000 Voice User linked, and how only specificInterface specification.
topics are active.

This is a voice dialing ISD supplies recommended command set furnished as vocabulary sets as ' s standard VUI. Independentpart of the VUI for specific VUIs and vocab- applications. The vo-ISD

ulary can be developed, cabulary sets have been carefully but it is necessary to selected to en-fol-low the grammar syntax as sure high recognition (avoiding shown here. confusable words) and effective user utility.
The accuracy specifica-VOCABULARY tions for ISD-SR3000 are based on the ISD provid-A vocabulary defines the ed commands. Although use following characteristics of these of the ISD-SR3000: commands is highly recommended, it is possible Speaker-independent commandto create custom vocabulary words and sets. Contact ISD for digits for which ISD-SR3000information about vocabulary responds development tools. The vocabulary can be stored either in exter-Topics under which the commandsnal ROM or flash memory.
and digits are organized Mapping of tokens to the LANGUAGE
vocabulary Strings returned by the ISD-SR3000 uses a set of CTW command acoustic models de-Default keywords used for signed to recognize American activation English. Additional languages require different acoustic models.

Contact ISD for availability of additional languag-ISD-SR3000 is designed to es.
work with a specific vo-cabulary set. Up to 30 speaker-independent commands may be used. When the processor recognizes the commands, tokens (values) are re-turned to the host conVoller.
Certain types of er-rors, such as spoke-too-soon, also return specific tokens. The host controller can use the tokens to accomplish tasks, such as generating DTMF for di-aling a phone number. For detailed information 1-34 Voice Solutions in Silicon ISD CONFIDENTIAL INFORMATION
SPEECH SYNTHESIS vocabulary content Speech synthesis is the If memory space is not an technology that is used issue, the vocabulary to create messages out of predefinedcould contain all the required words and sentences, each phrases stored in a vocabulary.recorded separately.

There are two kinds of predefinedIf memory space is a concern, messages: fixed the vocabulary messages (e.g., voice menusmust be compact; it should in a voice-mail sys- contain the minimum tem) and programmable messagesset of words and phrases (e.g., time- required to synthesize all and-day stamp, or the "You the sentences. The least have n messages" memory is used when announcement in a DTAD). phrases and words that are common to more than one sentence are recorded only once, and A vocabulary includes a the IVS tool is used to synthesize set of predefined words sentences out of and phrases, needed to synthesizethem.
messages in any language. Applications which support more than one language require Vocabulary Recording a separate vocabu-lary for each language. When recording vocabulary words, there is a compromise between space and quality. On one INTERNATIONAL VOCABULARY hand, the words should be SUPPORT (IVS) recorded and saved in IVS is a mechanism by whicha compressed form, and you the ISD-SR3000 pro- would like to use the cessor can use several vocabulariesbest voice compression for stored on an that purpose. On the external storage device. other hand, the higher the IVS enables ISD-SR3000 compression rate, the processor to synthesize worse the voice quality.
messages with the same meaning, but in different Another issue to consider languages, from sepa- is the difference in voice rate vocabularies. quality between synthesized and recorded Among IVS features: prompts and voicetags. It is more pleasant to the human ear to hear them both in the same quality.

Multiple vocabularies are stored on a single storage device. Vocabulary Access Plug-and-play. The same Sometimes compactness and microcontroller high quality are code is used for all languages.not enough. There should be a simple and flexible interface to access the vocabulary elements. Not Synthesized and recorded only the vocabulary, but messages use the also the code to access same voice compression algorithmit should be compact.
to achieve equal quality.

Support for voicetag recognitionWhen designing for a multi-lingual confirmation: environment, there are more issues to consider. Each vocabu-- Calling name. lary should be able to handle language-specific - Are you sure you want structures and designed in to delete a cooperative way with name? the other vocabularies so that the code to access - The number for name has each vocabulary is the same.
been stored.

IVS VOCABULARY COMPONENTS

VOCABULARY DESIGN This section describes the basic concept of an IVS

There are several issues, vocabulary, its components, sometimes conflicting, and the relationships which must be addressed between them.
when designing an IVS-vocabulary.

rsn 1-35 ISD CONFIDENTIAL INFORMATION
Basic Concepts Sentence Table An IVS vocabulary consists of words, sentences, The sentence table describes the predefined sen-and special codes that control the behavior of fences in the vocabulary. The purpose of this table the algorithm which the ISD-SR3000 processor is to make the microcontroller that drives the ISD-uses to synthesize sentences. SR3000 processor independent of the language Word Table being synthesized.
The words are the basic units in the vocabulary. For example, if the Flash andlor ROM contain vo-You create synthesized sentences by combining cabularies in various languages, and the first sen-words in the vocabulary. Each word in the vocab- fence in each vocabulary means you have n ulary is given an index which identifies it in the messages, the microcontroller switches languag-word table. es by issuing the following command to ISD-SR3000 processor:
Number Tables SV <storage_media> , < vocabulary_id> - Se-The number tables allow you to treat numbers dif- lect a new vocabulary ferently depending on the context. A separate number table is required for each particular type The microcontroller software is thus independent of use. The number table contains the indices of of the grammar of the language in use.
the words in the vocabulary that are used to syn- The sentences consist of words, which are repre-thesize the number. Up to nine number tables can sented by their indices in the vocabulary.
be included in a vocabulary.
Figure 2-2: The Interrelationship of a Word Table. a Sentence Table and a Number Table Sentence Table Word Table Number Table 1-36 Voice Solutions In Silicon' ISD CONFIDENTIAL INFORMATION
Control and Option Codes Graphical User Interface (GUI) The list of word indices alone cannot provide the The IVS package includes a Windows utility that as-entire range of sentences that the ISD-SR3000 pro- sists the vocabulary designer to synthesize sen-cessor can synthesize. IVS control and option fences. With this utility, you can both compose codes are used as special instructions that control sentences and listen to them.
the behavior of the speech synthesis algorithm in the ISD-SR3000 processor. HOW TO USE THE IVS TOOL WITH THE
For example, if the sentence should announce ISD-SR3000 PROCESSOR
the time of day, the ISD-SR3000 processor should The IVS tool creates IVS
vocabularies, and stores be able to substitute the current day and time in them as a binary file. This file is burnt into a ROM
the sentence. These control words do not repre- device or programmed into a Flash memory de-sent recorded words, rather they instruct the ISD- vice using the INJ command.
The ISD-SR3000 pro-SR3000 processor to take special actions. cessor SV command is used to select the required THE IVS TOOL vocabulary. The SW, SO, SS and SAS commands are used to synthesize the required word or sen The IVS tool includes two utilities: fence. The typical vocabulary-creation process is The DOS-based IVS Compiler as follows:
IVSTOOL for Windows. A Windows 3.1 based Design the Vocabulary.
utility Create the vocabulary files (as described in detail The tools allow you to create vocabularies for the below). Use IVSTOOL for Windows 3.1 to simplify this ISD-SR3000 processor. They take you all the way Process.
from designing the vocabulary structure, through Record the words using any standard PC sound defining the vocabulary sentences, and record- card and sound editing software, that can create ing the vocabulary words. .wav files.
IVS Compiler Run the IVS compiler to compress the .wav files, The IVS compiler runs on MS-DOS (version 5.0 or and compile them and the vocabulary tables into later). It allows you to insert your own vocabulary, an IVS vocabulary file.
i.e., basic words and data used to create num- Repeat steps 1 to 4 to create a separate IVS vo-bers and sentences, as directories and files in MS- cabulary for each language that you want to use.
DOS. Burn the IVS vocabulary files into a ROM or Flash The IVS compiler then outputs a binary file contain- memory device. Use the INJ (Inject IVS) command ing that vocabulary. This information can be to program the data into a Serial Flash device.
burned into an EPROM or Flash memory for use by the ISD-SR3000 software. Once the vocabulary is in place, the speech syn thesis commands of the ISD-SR3000 processor Voice Compression can be used to synthesize sentences.
Each IVS vocabulary can be compiled with either Figure 2-2 shows the vocabulary-creation process the 5.2 KbiUs or the 7.3 Kbitls voice compression for a single table on a ROM
or Flash memory de-algorithm. You define the compression rate be- vice.
fore compilation. The ISD-SR3000 processor auto-matically selects the required voice decompression algorithm when the SV com-mand is used to select the active vocabulary.
rsn 1-37 ISD CONFIDENTIAL INFORMATION
Figure 2-3: IUS Components .wav File ~ ,wav Files Editor Compressed Files (.vcd) Number IVS
P C Compiler Tables INJ

Sound IVS VocabularyCommand Card Files Sentence Table IVSTOOL a Flash mmer Progr for Windows .ini File ~

ROM

Editor 1-38 Voice Solutions In SIIiconT"

ISD CONFIDENTIAL INFORMATION
INITIALIZATION ports any status change by clearing the MWRQST
Use the following procedures to initialize the ISD- signal to 0.
SR3000 processor: If processor command's parameter is larger than one byte, the microcontroller transmits the Most NORMAL INITIALIZATION Significant Byte (MSB) first. If a return value is larger than one byte, the ISD-SR3000 processor transmits Reset the ISD-SR3000 processor by activating the the MSB first.
RESET signal. (See "RESETTING" on page 1-6.) Issue a CFG (Configure ISD-SR3000) command to INTERFACE PROTOCOL TIME-OUTS
change the configuration according to your envi- pepending on the ISD-SR3000 processor's state, if ronment. more than 100 milliseconds elapse between the Issue an INIT (Initialize System) command to initial- assertion of the MWRDY
signal and the transmis-ize the ISD-SR3000 firmwareISD-SR3000. sion 8th bit of the next byte pertaining to the same command transaction, a time-out event occurs, MICROWIRE Serial Interface and the ISD-SR3000 processor responds as fol-lows:
MICROWIRE/PLUS'"" is a synchronous serial com-munication protocol minimizes the number of 1. Sets the error bit in the status word to 1.
connections, and thus the cost, of communicat- 2, Sets the EV_TIMEOUT bit in the error word ing with peripherals. to 1.
The ISD-SR3000 MICROWIRE interface implements 3, Activates the MWRQST signal (clears it to 0).
the MICROWIRE/PLUS interface in slave mode, with an additional ready signal. It enables a microcon- 4. Activates the MWRDY
signal (clears it to 0).
trolley to interface efficiently with the ISD-SR3000 5, Waits for a new command. (After a time-processor application.
out occurs, i.e., the microcontroller re-The microcontroller is the protocol master and ceived MWRQST during the command provides the clock for the protocol. The ISD- transfer, or result reception, the microcon-SR3000 processor supports clock rates of up to trolley must wait at least four milliseconds 400 KHz. This transfer rate refers to the bit transfer; before issuing the next command.) the actual throughput is slower due to byte pro-cessing by the ISD-SR3000 processor and the mi-crocontroller.
Communication is handled in bursts of eight bits (one byte). In each burst the ISD-SR3000 processor is able to receive and transmit eight bits of data.
After eight bits have been transferred, an internal interrupt is issued for the ISD-SR3000 processor to process the byte, or to prepare another byte for sending. In parallel, the ISD-SR3000 processor sets MWRDY to 1, to signal the microcontroller that it is busy with the byte processing. Another byte can be transferred only when the MWRDY signal is cleared to 0 by the ISD-SR3000 processor. When the ISD-SR3000 processor transmits data, it ex-pects to receive the value OxAA before each transmitted byte. The ISD-SR3000 processor re-rso 1-39 ISD CONFIDENTIAL INFORMATION
MICROWIRE ECHO MECHANISM ROM Interface The ISD-SR3000 echoes back to the microcontrol- IVS Vocabularies can be stored in either Flash ler all the bits received by the ISD-SR3000. Upon memory and/or ROM. The ISD-SR3000 supports IVS
detection of an error in the echo, the microcon- ROM devices through an Expansion Memory trolley should stop the protocol clock, which even- mechanism. Up to 64 Kbytes (64kx8) of Expansion tually causes a time-out error (i.e., ERR_TIMEOUT bit Memory are directly supported. Nevertheless, the is set in the error word). processor uses bits of the on-chip port (PB) to fur-When a command has a return value, the ISD- they extend the 64 Kbytes address space up to 0.5 SR3000 transmits bytes of the return value instead Mbytes address space.
of the echo value. ROM is connected to the ISD-SR3000 using the The ISD-SR3000 transmits a byte as an echo when data bus, D(0:7), the address bus, A(0:15), the ex-it receives the value OxAA from the microproces- tended address signals, EA(16:18), and Expansion soy. Upon detection of an error the ISD-SR3000 ac- Memory Chip Select, EMCS, controls. The number tivates the MWRQST signal, and sets the of extended address pins to use may vary, de-ERR COMM bit in the error word. pending on the size and configuration of the - ROM. ISD-SR3000 configured with Samsung Flash memory can not support extension ROM.
Master Mode The ISD-SR3000 Master MICROWIRE controller im- Reading From Expansion Memory elements the MICROWIRE/PLUS interface in master An Expansion Memory read bus-sycle starts at T1, mode, thus enabling the processor to control the when the data bus is in TRI-STATE, and the address Flash memory devices. Several devices may is driven on the address bus. EMCS
is asserted share the Master MICROWIRE channel by con-necting devicen selection signals to general pur- (cleared to 0) on a T2W1 cycle. This cycle is fol pose output ports. towed by three T2W cycles and one T2 cycle. The ISD-SR3000 processor samples data at the end of the T2 cycle.
Signals The Master MICROWIRE controller's signals are the The transaction is terminated at T3, when EMCS
Master MICROWIRE Serial Clock (MMCLK), the becomes inactive (set to 1). The address remains Master MICROWIRE Serial Data Out (MMDOUT) sig- valid until T3 is complete. A
T3H cycle is added af-nal, and the Master MICROWIRE Serial Data In ter the T3 cycle. The address remains valid until the (MMDIN) signal. end of T3H.
The Master MICROWIRE conVOller can handle up to four Flash memory devices. The processor uses the signals, CSO-CS3, relative to the number of de-vices used, as device chip-select signals.
Clock for Master MICROWIRE Data Transfer Before date can be sent, the transfer rate must be determined and set. The MMCLK signal sets the data transfer rate on the Master MICROWIRE. This rate is the same as the CODEC Clock (CCLK) sig-nal. As long as the Master MICROWIRE is transfer-ing data, the CODEC interface must be enabled and its sampling rate should not be changed.
1-40 Vorce Solutions in Silicon'"

ISD CONFIDENTIAL INFORMATION
Figure 2-4: Master MircroWIRE Data Transfer end at reuy~ter MWCLK
k / I 1 I t I
t X11. i t 7 I ':
i 1 1 i ~ I I
i C ~. 1 ~ I f MYGD~IfT Brt 10~ ~ g X15 Bia 1 Bit d iI~ISB) r l 5 ~ I I
4 f I A I I
q I P I I
w ~ w r w x ~A1N~IN ~ .. 71,~ .. B~& ...
iMSBy a ! Sa~pis Point ~, Shil! OI
Y
~so 1-41 ISD CONFIDENTIAL INFORMATION
Table 2-1: ISD-SR3000 Command Summary "I " 1, ~,;r,~ijE"
CCIO S 34 RESET, config_value1 ',~o,~
IDLE None CODEC Change Il0 CFG S Config p1 RESET No config 2 None ISD- value SR3000 Change CKF S Check TBDIDLE ChangeNone - TEST 1 Flash RESULT

CKV S TBDIDLE None _ TEST 1 RESULT

Voicetags Change CTW S Convert TBDIDLE, No token 2 STRINGVARIES
Token RECD number to Word Change Suing EVA S TBDIDLE numbers 1 None -user Vo cetags Change_ EVS S Erase TBDIDLE No voicetag_1 None Selected Voicetags Changenumber Flush FRQ S RecognitionTBDIDLE Cha~geNone _ None Queue Get Version, GCFG S Configuration02 RESET,ChangeNone - Config,3 IDLE

Value Vocab GEW S Get Error18 All ChangeNone _ Error 2 Word States Word Get IDLE, PLAY, GI S Information25 RECORD,ChangeItem 1 Value 2 Item SYNTHESIS

Count, GNR S t TBDIDLE, None - Topic,4 RECD

gnition Change Token Reco GSW S tus 14 All ~ None - Status2 Ge States Word Word Cha ge INIT S Isytsae~m13 RESET None None IDLE

, Change -INJ S Inject 29 RESET,No n, bytes,4+n None IVS IDLE
Data Changebyte"...

KWNR S K TBDIDLE None None q RECD
ot R , Change -uired KWR S q TBDIDLE None None RECO

i ed , Change -Re PDM S Go to 1A IDLE No None None Power-Down Change -Mode PRD S Disable TBDIDLE, No None None _ Pause RECD

ReportingI I I Change _ 1-42 Votee Solutions In Silicon'"

ISD CONFIDENTIAL INFORMATION
PRE S Enable TgDIDLE,No None _ None _ Pause RECD

Reporting Change PV A Play VoicetagTBDIDLE Play ag num 1 None voicet b IDLE, PLAY, RES S Resume 1D RECORD,changeNone - None SYNTHESIS

RESK S Reset TBDIDLE user_numbers1 None -Keyword change Reset RESR S RecognitionTBDIDLE,IDLE None _ None RECO

Engine RD S Disable TBDRECD IDLE None - None _ Recognition RE A Enable TBDIDLE RECD None - None _ Recognition RKW A Record TBDIDLE Recorduser 1 None id Keyword id, user RTAG A Record TBDIDLE Record_ 1+1 None _ voicetag_ Voicetag number Say sentence_n, SAS A Augmented1E IDLE Synthesis 1+1 None _ Sentence arg SDET S Set Detectors10 IDLE No detectors_1 None Mask changemask SO A Say One 07 IDLE Synthesisword 1 None Word number SS A Say Sentence1 IDLE Synthesissentence_n1 None _ F

IDLE, Suspend PLAY, SUSP S 1C RECORD,changeNone - None SYNTHESIS

SW A Say Words21 IDLE Synthesisn, words'1 None +n word"...

user id info, TAGC S ~ oho ~ IDLE ~ voicetag1+1 None ~a9 TBD ~ num ~
~ ~

changeber iso 1-43 ISD CONFIDENTIAL INFORMATION
TACQ S Query TBDIDLE No voicetag_~ Voicetag T

Voicetag changenumber info TOPD S Disable TBDIDLE changetopic 1 None _ Topics id TOPE S Enable TBDIDLE changetopic_id1 None _ Topics TOPQ S Query TBDIDLE changetopic 1 Topic info Topics id 1 VC S Volume ZB IDLE,No vol ~ ~ None PLAY,~ level 1 ~ ~ ~ change_ Control SYNTHESIS

NOIE: Note for the column labeled S/A, S = Synchronous command and A =
Asynchronous command.
1-44 Voice SoluUons in SilieonT"

ISD CONFIDENTIAL INFORMATION
THE STATE MACHINE
The ISD-SR3000 processor functions as a state machine. It changes state either in response to a command sent by the host microcontroller, after execution of a command is completed, or as a result of an internal event (e.g. memory full or power failure). The ISD-SR3000 processor states are listed below.

RESET
The ISD-SR3000 processor is initialized to this state after a full hardware reset by the RESET signal.

IDLE
This is the state from which most commands are executed. As soon as a command and all its parameters are received, the ISD-SR3000 processor starts executing the command.

PLAY
In this state, a prompt is played.

SYNTHESIS
An individual word or sentence is synthesized from a vocabulary.

RECORD
In this state, a user's speech is recorded and stored.

RECO
In this state, speech recognition is active.

COMMAND EXECUTION
An ISD-SR3000 command is represented by an 8-bit opcode. Some commands have parameters, and some commands return values to the host microcontroller. Commands are either synchronous or asynchronous.

SYNCHRONOUS COMMANDS
A synchronous command must complete execution before the host microcontroller can send a new command. A synchronous command sequence starts when the host microcontroller sends an 8-bit opcode to the ISD-SR3000 processor, followed by the command's parameters (if any). The ISD-SR3000 processor executes the command and, if required, transmits a return value to the host microcontroller. Upon completion, the ISD-SR3000 processor notifies the host microcontroller that it is ready to accept a new command by asserting the MWRQST signal.

ASYNCHRONOUS COMMANDS
An asynchronous command runs in the background. During execution of an asynchronous command, other commands can be executed.

Status Word
The 16-bit Status Word indicates events that occur during normal operation. The ISD-SR3000 processor asserts the MWRQST signal to indicate a change in the Status Word. This signal remains asserted until the ISD-SR3000 processor receives a GSW command. The status word is cleared during reset, and upon successful execution of the GSW command.

Error Word
The 16-bit Error Word indicates errors that occurred during execution of the last command. If an error is detected, the command is not processed, the EV_ERROR bit in the Status Word is set to 1, and the MWRQST signal is asserted.

Error Handling
When the host microcontroller detects that the MWRQST signal has been asserted, the host should issue the GSW (Get Status Word) command, which de-asserts the MWRQST signal. Then the host should test the EV_ERROR bit in the data (the Status Word contents) returned by the GSW command. If the EV_ERROR bit is set, the host should issue the GEW (Get Error Word) command to read the Error Word for details of the error.
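As a concrete illustration of this error-handling flow, here is a hedged C sketch; mwrqst_asserted() and isd_read_word() are hypothetical host-side helpers (one samples the MWRQST pin, the other issues an opcode followed by two AA placeholder reads) and are not names taken from this document.

    #include <stdint.h>
    #include <stdbool.h>

    #define OP_GSW 0x14            /* Get Status Word */
    #define OP_GEW 0x1B            /* Get Error Word  */
    #define EV_ERROR (1u << 3)     /* Status Word bit 3 */

    /* Hypothetical host-side helpers; the real implementations depend on
     * the host microcontroller and its MICROWIRE interface. */
    extern bool mwrqst_asserted(void);
    extern uint16_t isd_read_word(uint8_t opcode);

    /* Service an MWRQST request: read the Status Word (which de-asserts
     * MWRQST), and if EV_ERROR is set, read the Error Word for details. */
    uint16_t isd_service_request(void)
    {
        if (!mwrqst_asserted())
            return 0;
        uint16_t status = isd_read_word(OP_GSW);
        if (status & EV_ERROR) {
            uint16_t error = isd_read_word(OP_GEW);
            (void)error; /* dispatch on the ERR_* bits as needed */
        }
        return status;
    }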
COMMAND DESCRIPTIONS
Commands are listed in alphabetical order, with their hex value in brackets after their mnemonic name.
All command opcodes are one byte in length.
All opcodes, parameters and examples are shown using hex values for 8-bit and larger quantities, and binary values for bit values, unless otherwise noted.
Each command description includes an example application of the command. The examples show the opcode issued by the microcontroller, and the response returned by the ISD-SR3000 processor. For commands which require a return value from the ISD-SR3000 processor, the start of the return value is indicated by a thick vertical line. When a return value is required, the host microcontroller must pass value AA (hex) to the ISD-SR3000 engine as a placeholder for each byte to be returned.
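The byte-level protocol just described can be wrapped in a single host-side routine. The C sketch below is illustrative only: isd_xfer() is a hypothetical, board-specific MICROWIRE transfer primitive (not part of this document), and isd_command() is a name invented here for the wrapper.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical board-specific primitive: clocks one byte out on the
     * MICROWIRE interface and returns the byte clocked in. */
    extern uint8_t isd_xfer(uint8_t out);

    #define ISD_PLACEHOLDER 0xAA /* clocked out once per expected return byte */

    /* Issue an opcode plus its parameters, then collect n_ret return bytes
     * by sending 0xAA placeholders, as the command reference requires. */
    void isd_command(uint8_t opcode,
                     const uint8_t *params, size_t n_params,
                     uint8_t *ret, size_t n_ret)
    {
        isd_xfer(opcode);
        for (size_t i = 0; i < n_params; i++)
            isd_xfer(params[i]);
        for (size_t i = 0; i < n_ret; i++)
            ret[i] = isd_xfer(ISD_PLACEHOLDER);
    }

For instance, GEW (1B hex), which returns two bytes, would be issued as isd_command(0x1B, NULL, 0, buf, 2), reproducing the 1B AA AA byte sequence shown in its example.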
CCIO (34 hex) Configure Codec I/O config_value
Configures the voice sample paths in various states. It should be used to change the default ISD-SR3000 processor configuration.
The config_value parameter is a byte in size and is encoded as follows:
Bit 0 - Loopback control.
0 Loopback disabled (default).
1 Loopback enabled. In the RECORD state, the input samples are echoed back unchanged (i.e., no volume control) to the codec. This is useful for debugging the analog and codec circuitry.
Bits 1-7 - Reserved. These bits must be set to 0.
Example:

Byte Sequence: Microcontroller 34 01
Description: Configure the codec to have loopback on.

CFG (01 hex) Configure ISD-SR3000 config_value Configures the ISD-SR3000 processor for various hardware environments. It should be used to change the default ISD-SR3000 processor configuration.
The config_value parameter is a 16-bit word and is encoded as follows:
Bit 0 - Codec configuration.
0 Short-frame format (default).
1 Long-frame format. (Guaranteed by design, but not tested.)
Number of installed Flash devices (the default is one flash device):
00 One flash device installed
01 Two flash devices installed
10 Three flash devices installed
11 Four flash devices installed
Flash device type (the default value is 10, for Toshiba's Serial Flash):
00 Reserved - do not use
01 Reserved - do not use
10 Toshiba's TC58A040F Flash
11 Samsung's KM29N040T Flash
All other bits must be set to 0.
Example:

Byte Sequence: Microcontroller 01 01 24
Description: Configure the ISD-SR3000 processor to work with:

CODEC that supports short frame format.

One Toshiba TC58A040F flash device
Echo cancellation on

CKF (TBD hex) Check Flash
Checks (checksum) if the flash data is correctly programmed in the Flash devices. The flash checksum is stored in the first Flash device. This checksum checks all of the flash memory except for voicetags.
If the data is correct the return value is FF (hex). Otherwise the return value is 0.
Example:
CKF

Byte Sequence: Microcontroller TBD AA

Description: ISD-SR3000 is instructed to run a Checksum on the Flash memory, and it returns a code of FF, indicating the data is correct.

CKV (TBD hex) Check Voicetags Checks (checksum) if the voicetag data is correctly programmed in the Flash devices. Each voicetag has its own checksum, which is stored with the voicetag. This command checks the voicetag checksums for all voicetags, reporting an error if any one is wrong.
If all the voicetag checksums are correct the return value is FF (hex).
Otherwise the return value is the number of the first voicetag with a bad checksum. By clearing that voicetag number with command EVS, and then repeating command CKV, bad voicetag flash entries can be stepped through and erased.
Example:
CKV

Byte Sequence: Microcontroller TBD AA

Description: ISD-SR3000 is instructed to run a Checksum on the Voicetag memory, and it returns a code of 03, indicating that an error was found in voicetag 03, and perhaps also others.
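The erase-and-repeat procedure just described is easy to drive from the host. In this hedged sketch, OP_CKV and OP_EVS are placeholders (both opcodes are TBD in this document) and isd_command() is the hypothetical wrapper sketched earlier.

    #include <stdint.h>
    #include <stddef.h>

    extern void isd_command(uint8_t opcode,
                            const uint8_t *params, size_t n_params,
                            uint8_t *ret, size_t n_ret);

    #define OP_CKV 0x00 /* placeholder: opcode is TBD in this document */
    #define OP_EVS 0x00 /* placeholder: opcode is TBD in this document */

    /* Step through bad voicetag entries: CKV returns FF (hex) when all
     * checksums are good, otherwise the number of the first bad voicetag,
     * which is erased with EVS before checking again. */
    void isd_scrub_voicetags(void)
    {
        for (;;) {
            uint8_t result;
            isd_command(OP_CKV, NULL, 0, &result, 1);
            if (result == 0xFF)
                break;                       /* all voicetag checksums good */
            isd_command(OP_EVS, &result, 1, NULL, 0); /* erase the bad entry */
        }
    }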

CTW (TBD hex) Convert Token to Word String token_number
Converts the token number indicated by token_number to an English (or other language) string. Parameter token_number is a 16-bit word.
This command is intended to allow language personalization to be entirely contained in ISD-SR3000's flash.
Note that this command assumes a token number can uniquely identify a recognition word string - with ISD-SR3000, a given token number occurs only once across all topics.
The CTW command returns a string which may be up to 256 bytes long, in the following format:
Byte 1 - Number of bytes in string. A zero (00) value means the token was not found, in which case this is the only byte returned.
Bytes 2-N - String of bytes representing the word string, with the first character in byte 2 and successive characters in subsequent bytes.
The ISD-SR3000 engine makes no assumptions about the content of the string that is returned - interpretation of the contents is entirely up to the host microcontroller. Any byte values may be used, including the values for spaces, international characters, and upper and lower cases. 16-bit Unicode values or custom string representations can also be used, as long as they are represented by an integral number of bytes with total length 255 or less.
Example:
CTW

Byte Sequence: Microcontroller TBD 37 AA AA AA AA

Description: Token 37 is passed to ISD-SR3000, and it responds with the 3-byte character string 42, 79, 65 (ASCII codes for 'Bye'). This is an example only - the actual value of token 37 depends on the vocabulary programmed into ISD-SR3000's flash by the customer. See "PROGRAMMING DIFFERENT VOCABULARIES AND LANGUAGES" on page 2-1.
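One plausible way for the host to read CTW's length-prefixed reply is sketched below; isd_xfer() is again a hypothetical MICROWIRE primitive, and the high-byte-first ordering of token_number is an assumption, since the worked example does not show both token bytes.

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t isd_xfer(uint8_t out); /* hypothetical MICROWIRE primitive */

    #define OP_CTW 0x00 /* placeholder: opcode is TBD in this document */

    /* Read the length-prefixed string returned by CTW: byte 1 is the
     * length (00 means the token was not found), bytes 2..N are the
     * characters. Returns the number of characters stored in buf. */
    size_t isd_token_to_string(uint16_t token, uint8_t buf[255])
    {
        isd_xfer(OP_CTW);
        isd_xfer((uint8_t)(token >> 8));   /* assumed: high byte first */
        isd_xfer((uint8_t)(token & 0xFF));
        uint8_t len = isd_xfer(0xAA);      /* length byte */
        for (uint8_t i = 0; i < len; i++)
            buf[i] = isd_xfer(0xAA);       /* string bytes */
        return len;
    }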

EVA (TBD hex) Erase All Voicetags for Users 1-4 user_numbers
Erases all voicetags for users 1 to 4, as selected by the bits in byte user_numbers. For confidentiality reasons, the voicetag is actually erased from flash - it is not just marked 'unused'.
If user_numbers is all zeros, then ERR_PARAM is set in the Error Word (See "GEW (1B hex) Get Error Word" on page 2-51).
If the erase operation can not be performed, then ERR_MEM is set in the Error Word (See "GEW (1B hex) Get Error Word" on page 2-51).
If the EVA command completes normally, then EV_NORMAL_END is set in the Status Word (See "GSW (14 hex) Get Status Word" on page 2-55).
Bit 0 - Reserved
0 This bit must be set to 0
Bit 1 - User 1
0 Do not erase any voicetags for user 1
1 Erase all voicetags for User 1
Bit 2 - User 2
0 Do not erase any voicetags for user 2
1 Erase all voicetags for User 2
Bit 3 - User 3
0 Do not erase any voicetags for user 3
1 Erase all voicetags for User 3
Bit 4 - User 4
0 Do not erase any voicetags for user 4
1 Erase all voicetags for User 4
Bits 5-7 - Reserved
000 These bits must be set to 0
Example:
EVA

Byte Sequence: Microcontroller TBD 0C

Description: ISD-SR3000 is instructed to erase the voicetags for users 3 and 4.

EVS (TBD hex) Erase Selected Voicetag voicetag_number Erases the voicetag indicated by byte voicetag_number. As part of the erasing procedure, the check-sum for that entry is automatically updated (See "CKV (TBD hex) Check Voicetags" on page 2-16).
For confidentiality reasons, the voicetag is actually erased from flash - it is not just marked 'unused'.
If voicetag_number has a value greater than the number of voicetags (i.e. if voicetag_number is larger than 64 for a system with 65 voicetags, since voicetag numbering starts at 0), then ERR_PARAM is set in the Error Word (See "GEW (1B hex) Get Error Word" on page 2-51).
If the erase operation can not be performed, then ERR_MEM is set in the Error Word (see Command GEW).
If the EVS command completes normally, then EV_NORMAL_END is set in the Status Word (See "GSW (14 hex) Get Status Word" on page 2-55).
Example:
EVS

Byte Sequence: Microcontroller TBD 18
Description: ISD-SR3000 is instructed to erase voicetag number 18.

FRQ (TBD hex) Flush Recognition Queue Flushes all entries from the recognition queue. Following this command, the EV_RECO_QUEUE bit in the Status word is cleared to zero.
Example:
FRQ

Byte Sequence: Microcontroller TBD

Description: ISD-SR3000 is instructed to flush the recognition queue.

GCFG (02 hex) Get Configuration Value
Returns a 24-bit word containing the following information:
Bits 0-7 Magic number, which specifies the ISD-SR3000 firmware version.
Bits 9-8 Memory type.
00 Reserved
01 Reserved
10 Toshiba's TC58A040F Flash
11 Samsung's KM29N040T Flash
Bits 10-23 Indicate the vocabulary data set and version number in the ISD-SR3000.
This command should be used together with the CFG and INIT commands during ISD-SR3000 processor initialization.
Example:
GCFG

Byte Sequence: Microcontroller 02 AA AA

Description: ISD-SR3000's configuration value is requested, and it returns the code for Toshiba flash, and magic number (firmware revision number) 03 (hex).

GEW (1B hex) Get Error Word
Returns the 16-bit Error Word.
The 16-bit error word indicates errors that occurred during execution of the last command. If an error is detected, the command is not processed; the EV_ERROR bit in the status word is set to 1, and the MWRQST signal is activated (driven low).
The GEW command reads the error word. All bits in the error word are cleared to zero during reset and after execution of each GEW command.
If errors ERR_COMMAND or ERR_PARAM occur during the execution of a command that has a return value, the return value is undefined. The microcontroller must still read the return value, to ensure proper synchronization.
The bits of the error word are as follows:
Bit 0 - ERR_BARGE
0 No barge-in error
1 Barge-in. The user interrupted an ISD-SR3000 operation by 'barging in'. Barge-in occurs when the user issues the barge-in command while the ISD-SR3000 processor is executing a command. The precise wording of the barge-in command is set by the vocabulary - typically the command is something like "<keyword> Cancel" or "<keyword> nevermind". (See "PROGRAMMING DIFFERENT VOCABULARIES AND LANGUAGES" on page 2-1.)
Bit 1 - ERR_OPCODE
0 No opcode errors
1 Illegal opcode. The ISD-SR3000 processor does not recognize the opcode.
Bit 2 - ERR_COMMAND
0 No command errors
1 Illegal command sequence. The command is not legal in the current state.
Bit 3 - ERR_PARAM
0 No parameter errors
1 Illegal parameter. The value of the parameter is out of range, or is not appropriate for the command.
Bit 4
0 or 1 Bit 4 is reserved and should be disregarded.
Bit 5 - ERR_COMM
0 No communications error
1 Microcontroller MICROWIRE communication error.
Bit 6 - ERR_TIMEOUT
0 No timeout error
1 Time-out error. Depending on the ISD-SR3000 processor's state, more than 100 milliseconds elapsed between the arrival of two consecutive bytes (for commands that have parameters).
Bit 7 - ERR_INVALID
0 No context error
1 Command can not be performed in current context.
Bits 15-8
0 or 1 Bits 15-8 are reserved and should be disregarded. These bits may return any mix of 0 and 1.
Example:
GEW

Byte Sequence: Microcontroller 1B AA AA

Description: ISD-SR3000's error word is requested, and it returns the code for a communications error.
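For reference, the error bits above translate directly into masks. The C sketch below is an assumed decoding aid: the ERR_* positions come from the bit list above, while the printed messages are purely illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* Error Word bit masks, from the bit definitions above. */
    #define ERR_BARGE   (1u << 0)
    #define ERR_OPCODE  (1u << 1)
    #define ERR_COMMAND (1u << 2)
    #define ERR_PARAM   (1u << 3)
    #define ERR_COMM    (1u << 5)
    #define ERR_TIMEOUT (1u << 6)
    #define ERR_INVALID (1u << 7)

    /* Report a GEW result; reserved bits (4 and 15-8) are deliberately
     * ignored, as the text instructs. */
    void isd_report_error(uint16_t ew)
    {
        if (ew & ERR_BARGE)   puts("barge-in interrupted the operation");
        if (ew & ERR_OPCODE)  puts("illegal opcode");
        if (ew & ERR_COMMAND) puts("illegal command sequence for this state");
        if (ew & ERR_PARAM)   puts("parameter out of range");
        if (ew & ERR_COMM)    puts("MICROWIRE communication error");
        if (ew & ERR_TIMEOUT) puts(">100 ms between consecutive bytes");
        if (ew & ERR_INVALID) puts("command invalid in current context");
    }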

GI (25 hex) Get Information item
Returns the 16-bit value specified by byte item from one of the internal registers of the ISD-SR3000 processor. Note that some values returned will never exceed 255 - the high order bits will always be zero in these cases.
item may be one of the following:
00 Returns the total number of voicetag memory locations
01 Returns the number of unused voicetags in the shared pool
02 Returns the voicetag number of the next voicetag that should be used from the shared pool. Any flash memory supports only a finite number of write cycles. ISD-SR3000 tracks the number of times each voicetag is written to and allocates voicetags from the shared pool using an algorithm that tries to distribute write cycles evenly throughout the flash, to maximize the flash memory lifetime.
03-0E Reserved. Do not use. Returns unpredictable values, but will not cause an ERR_PARAM in the Error Word.
0F Returns the number of Topics in the ISD-SR3000 vocabulary. (Note - topics may not be included in ISD-SR3000, in which case this is a reserved value for item and should not be used.)
10 Returns the number of voicetags used by user 1
11 Returns the number of voicetags used by user 2
12 Returns the number of voicetags used by user 3
13 Returns the number of voicetags used by user 4
14-FF Reserved. Do not use. Returns unpredictable values, but will not cause an ERR_PARAM in the Error Word.
Example:

Byte Sequence: Microcontroller 25 00 AA AA

Description: Information item number 00 (hex) is requested from ISD-SR3000, and it returns the 16-bit value 0003 (hex).

GNR (TBD hex) Get Next Recognition Gets the token for the next word recognized. This command should be issued by the microcontroller after it is interrupted by ISD-SR3000, following ISD-SR3000's recognition of a word (as evidenced by the EV RECO bit being set in the Status Word-see the GSW command). If the GNR
command is issued when the EV RECO bit is not set, the GNR command will execute but the data returned will not be valid.
The GNR command returns 4 bytes, defined as follows:
Byte 1 The number of recognition events in the queue after this one was removed.
Value FF (hex) means an overflow has occurred and does not necessarily mean there are precisely 255 left.
Byte 2 The Topic number that the recognition event occurred in. Topic FF (hex) indicates an error code and the following two bytes can be used for reporting error information.
The error information is TBD.
Bytes 3 & 4 The Token number that the recognition event matches. The special token number zero hex (00h) means a pause was found. Token numbers that are used in one topic must not appear in another topic - thus by knowing only the token number, the microcontroller can uniquely identify the specific word found and the topic. See also "CTW (TBD hex) Convert Token to Word String token_number"
on page 2-48.
Example:
GNR

Byte Sequence: Microcontroller TBD AA AA AA AA

Description: The next recognition event is requested from ISD-SR3000.

ISD-SR3000's response indicates:

there are 02 (hex) events in the recognition queue, in addition to the event returned with this command
this recognition event is from Topic 01 (hex)
the Token recognized is Token number 0003 (hex).
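A host might drain the recognition queue along the lines sketched below. OP_GNR is a placeholder (the opcode is TBD in this document), isd_command() is the hypothetical wrapper from earlier, and the sketch assumes the caller has already seen EV_RECO_QUEUE set in the Status Word.

    #include <stdint.h>
    #include <stddef.h>

    extern void isd_command(uint8_t opcode,
                            const uint8_t *params, size_t n_params,
                            uint8_t *ret, size_t n_ret);

    #define OP_GNR 0x00 /* placeholder: opcode is TBD in this document */

    /* Drain the recognition queue. Byte 1 of each reply counts the events
     * still queued (FF means overflow), byte 2 is the topic (FF reports an
     * error), and bytes 3-4 are the token (0000 means a pause was found).
     * handle() is a caller-supplied hook invoked per event. */
    void isd_drain_queue(void (*handle)(uint8_t topic, uint16_t token))
    {
        uint8_t r[4];
        do {
            isd_command(OP_GNR, NULL, 0, r, 4);
            if (r[1] != 0xFF)   /* topic FF carries error info instead */
                handle(r[1], (uint16_t)((r[2] << 8) | r[3]));
        } while (r[0] != 0);    /* r[0] = events remaining after this one */
    }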

GSW (14 hex) Get Status Word Returns the 16-bit status word.
Status Word The ISD-SR3000 processor has a 16-bit status word to indicate events that occur during normal operation.
The ISD-SR3000 processor asserts the MWRQST signal (driven low), to indicate a change in the status word.
This signal remains active until the ISD-SR3000 processor receives a GSW
command.
The status word is cleared during reset, and upon successful execution of the GSW command.
The bits in the status word are used as follows:
Bit 0 - EV_RECO_QUEUE
0 There are no events in the ISD-SR3000 recognition queue.
1 Reco-1 has one or more recognition events in its queue. Use command GNR to retrieve items from the queue.
Bit 1 - EV_NORMAL_END
0 When this bit is zero, it means either:
a) no command is underway, or b) a command is being processed but has not yet completed, or c) a command completed but had an error (as indicated by a '1' in the EV_ERROR
bit)
1 Normal completion of an operation, e.g., end of playing of a prompt, or final token detected in a command sequence.
Bit 2 - EV_MEMFULL
0 Memory is not full 1 Memory is full.
Bit 3 - EV_ERROR
0 No error detected
1 Error detected in the last command. The host microcontroller must issue the GEW command to return the error code and clear the error condition.
Bit 4 - EV_RESET
0 Normally, this bit changes to 0 after performing the INIT command.
1 When the ISD-SR3000 processor completes its power-up sequence and enters the RESET
state, this bit is set to 1, and the MWRQST signal is activated (driven low).
Normally, this bit changes to 0 after performing the INIT command. If this bit is set during normal operation of the ISD-SR3000 processor, it indicates an internal ISD-SR3000 processor error. The microcontroller can recover from such an error by re-initializing the system.
Bit 6,5 - EV_RECORD_STATUS
00 Displayed following commands other than RKW and RTAG, or if the RKW and RTAG commands cause an error.
01 This bit combination is only displayed when the RKW or RTAG commands complete successfully (as evidenced by a 1 in EV_NORMAL_END). When this code is displayed, it means the keyword or voicetag was successfully captured and analyzed, but the RKW or RTAG command must be repeated for verification. Normally, one repeat is needed for a verification cycle. However, if the recognition engine is having trouble analyzing the voicetag or keyword, it may require multiple repeats. It is up to the host microcontroller to limit the number of repeats (i.e. to avoid endless looping). The loop should be broken by erasing the same keyword or voicetag location (using the EKEY or EVS commands).
10 Never used.
11 This bit combination is only displayed when the RKW or RTAG commands complete successfully (as evidenced by a 1 in EV_NORMAL_END). When this code is displayed, it means the keyword or voicetag was successfully captured and analyzed, and a repeat of the command is not needed.
Bits 15-7 0 or 1 Bits 15-7 are reserved and should be disregarded. These bits may return any mix of 0 and 1.
Example:
GSW

Byte Sequence: Microcontroller 14 AA AA

Description: ISD-SR3000's status word is requested, and it returns code 0002 hex, indicating normal command completion (EV_NORMAL_END).
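The RKW/RTAG verification protocol described under EV_RECORD_STATUS can be driven by a bounded host-side loop along these lines. OP_RKW is a placeholder (the opcode is TBD in this document), and isd_wait_status() is a hypothetical helper that blocks on MWRQST and then issues GSW; neither name comes from this document.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    extern void isd_command(uint8_t opcode,
                            const uint8_t *params, size_t n_params,
                            uint8_t *ret, size_t n_ret);
    extern uint16_t isd_wait_status(void); /* hypothetical: wait on MWRQST,
                                              then issue GSW and return it */

    #define OP_RKW 0x00 /* placeholder: opcode is TBD in this document */
    #define EV_NORMAL_END     (1u << 1)
    #define REC_STATUS_MASK   (3u << 5)  /* Status Word bits 6,5 */
    #define REC_STATUS_DONE   (3u << 5)  /* 11: verified, no repeat needed */
    #define REC_STATUS_REPEAT (1u << 5)  /* 01: repeat as a verification pass */

    /* Record a keyword, repeating while the engine asks for verification
     * passes; the repeat count is capped so the host cannot loop forever,
     * as the EV_RECORD_STATUS description requires. */
    bool isd_record_keyword(uint8_t user_id, int max_tries)
    {
        for (int i = 0; i < max_tries; i++) {
            isd_command(OP_RKW, &user_id, 1, NULL, 0);
            uint16_t sw = isd_wait_status();
            if (!(sw & EV_NORMAL_END))
                return false;              /* the command did not complete */
            uint16_t rs = sw & REC_STATUS_MASK;
            if (rs == REC_STATUS_DONE)
                return true;               /* captured and verified */
            if (rs != REC_STATUS_REPEAT)
                return false;              /* 00: recording failed */
            /* 01: captured; loop to repeat the command for verification */
        }
        return false; /* cap reached; caller should erase the location */
    }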

INIT (13 hex) Initialize Execute this command after the ISD-SR3000 processor has been configured (see the CFG and GCFG
commands).
Performs a soft reset of the ISD-SR3000 processor, which includes:
~ TBD
Example:
INIT

Byte Sequence: Microcontroller 13
Description: ISD-SR3000 is initialized.

INJ (29 hex) Inject IVS Data n byte1 . . . byten
Injects vocabulary data of size n bytes (where n is expressed as a 32-bit value) to good Flash blocks.
This command programs Flash devices, on a production line, with vocabulary data. (See "PROGRAMMING DIFFERENT VOCABULARIES AND LANGUAGES" on page 2-1.) It is optimized for speed; all ISD-SR3000 processor detectors are suspended during execution of the command. Use the CKF command to check whether programming was successful.
If there is not enough memory space for the vocabulary data, ERR_PARAM is set in the error word, and execution stops.
Example:
INJ

Byte Sequence: Microcontroller 29 00 0E 00 00 n bytes of vocabulary data
ISD-SR3000 29 00 0E 00 00 Echo of n bytes of vocabulary data

Description: Inject E0000 (hex) (917,504 decimal) bytes of data into the Flash.
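Driving INJ from the host might look like the sketch below; isd_xfer() is the same hypothetical MICROWIRE primitive as before, and the most-significant-byte-first length encoding is inferred from the 00 0E 00 00 = E0000 (hex) worked example.

    #include <stdint.h>

    extern uint8_t isd_xfer(uint8_t out); /* hypothetical MICROWIRE primitive */

    #define OP_INJ 0x29

    /* Send the INJ opcode, the 32-bit byte count (assumed most significant
     * byte first, matching the worked example), then the vocabulary image.
     * The ISD-SR3000 echoes each byte back during transfer. */
    void isd_inject(const uint8_t *image, uint32_t n)
    {
        isd_xfer(OP_INJ);
        isd_xfer((uint8_t)(n >> 24));
        isd_xfer((uint8_t)(n >> 16));
        isd_xfer((uint8_t)(n >> 8));
        isd_xfer((uint8_t)(n));
        for (uint32_t i = 0; i < n; i++)
            isd_xfer(image[i]);
    }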
KWNR (TBD hex) Keyword Not Required
After this command is issued, spoken commands do not have to be prefaced with a valid keyword to be recognized. Note that although a keyword is not required after this command is issued, use of a keyword is still allowed - commands prefaced with a keyword will still be recognized.
Example:
KWNR

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 that a keyword is not required for recognition.

KWR (TBD hex) Keyword Required
After this command is issued, all spoken commands must be prefaced with a valid keyword to be recognized.
Example:
KWR

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to require a keyword.

PDM (1A hex) Go To Power-down Mode Switches the ISD-SR3000 processor to power-down mode. Sending any command while in power-down mode returns the ISD-SR3000 processor to normal operation mode.
If an event report is pending (i.e., MWRQST is active), and it is not processed by the microcontroller prior to issuing the PDM command, the event is lost.
Example:
PDM

Byte Sequence: Microcontroller 1A

Description: This command tells ISD-SR3000 to go into power-down mode.

PKW (TBD hex) Play Keyword keyword_number Plays the keyword indicated by byte keyword_number, where keyword_number is defined as follows:
Bits 2-0
000 Play the user-programmed keyword for user 1 (see RKW on page 2-62)
001 Play the user-programmed keyword for user 2 (see RKW on page 2-62)
010 Play the user-programmed keyword for user 3 (see RKW on page 2-62)
011 Play the user-programmed keyword for user 4 (see RKW on page 2-62)
100 Play the factory-default keyword for user 1
101 Play the factory-default keyword for user 2
110 Play the factory-default keyword for user 3
111 Play the factory-default keyword for user 4
Bits 7-3
0 These bits must be set to zero.
Example:
PKW

Byte Sequence: Microcontroller TBD 12
Description: ISD-SR3000 is instructed to play the keyword selected by keyword_number 12 (hex).

PRD (TBD hex) Disable Pause Reporting Disables reporting of detection of pause events. When pause reporting is disabled, ISD-SR3000 does not generate a recognition event for each pause heard. Typically, if ISD-SR3000 is left in recognition mode, while waiting for a keyword, pause detection should be disabled to avoid accumulating recognition events on the recognition queue during each detection of a pause in conversation. PRD and PRE affect placement of pause events on the recognition queue - the contents already on the recognition queue do not change as a result of executing PRD or PRE.
Example:
PRD

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to disable pause reporting.

PRE (TBD hex) Enable Pause Reporting Enables reporting of detection of pause events. See PRD on page 2-59.
Example:
PRE

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to enable pause reporting.

PV (TBD hex) Play Voicetag voicetag_number Plays the voicetag indicated by byte voicetag_number.
Example:
PV

Byte Sequence: Microcontroller TBD 12
Description: ISD-SR3000 is instructed to play voicetag number 12 (hex).

RES (1D hex) Resume
Resumes the activity that was suspended by the SUSP command.
Example:
RES

Byte Sequence: Microcontroller 1D

Description: This command tells ISD-SR3000 to resume.

RESK (TBD hex) Reset Keyword user_numbers
Resets the keyword for users 1 to 4, as selected by the bits in byte user_numbers.
Bit 0 - User 1
0 Do not change keyword
1 Reset keyword to factory-programmed value for User 1
Bit 1 - User 2
0 Do not change keyword
1 Reset keyword to factory-programmed value for User 2
Bit 2 - User 3
0 Do not change keyword
1 Reset keyword to factory-programmed value for User 3
Bit 3 - User 4
0 Do not change keyword
1 Reset keyword to factory-programmed value for User 4
Bits 7-4 - Reserved
0000 These bits must be set to 0
Example:
RESK

Byte Sequence: Microcontroller TBD 08
Description: Resets the keyword for user 4.
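Composing the user_numbers byte is a simple bit-OR; the hedged C sketch below follows the bit assignments above, and resk_mask(0, 0, 0, 1) reproduces the 08 (hex) of the worked example. The function name is invented here.

    #include <stdint.h>

    /* Build the RESK user_numbers byte: bits 0-3 select users 1-4. */
    static uint8_t resk_mask(int user1, int user2, int user3, int user4)
    {
        return (uint8_t)((user1 ? 1u      : 0u)
                       | (user2 ? 1u << 1 : 0u)
                       | (user3 ? 1u << 2 : 0u)
                       | (user4 ? 1u << 3 : 0u));
    }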

RESR (TBD hex) Reset Recognition Engine
Resets the recognition engine to the initial power-up state: recognition is disabled, the recognition queue is flushed (as if the FRQ command were issued), and all topics are disabled. The keywords and voicetag entries are not affected.
Example:
RESR

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to reset the recognition engine.

RD (TBD hex) Disable Recognition
Stops the recognition engine. Any recognition events on the recognition queue are preserved. The setting of the EV_RECO_QUEUE bit in the Status Word is not altered by this command. If the RD command is issued while the recognition engine is in the middle of recognizing a word or pause, that recognition event is discarded.
Example:
RD

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to disable the recognition engine.

RE (TBD hex) Enable Recognition Turns the ISD-SR3000 recognition engine on, allowing it to start listening for a keyword or command (as determined by the KWR and KWNR commands).
Example:
RE

Byte Sequence: Microcontroller TBD

Description: This command tells ISD-SR3000 to start the recognition engine.

RKW (TBD hex) Record Keyword user_id
Records a new keyword for the user number indicated in byte user_id. See the description for the EV_RECORD_STATUS bits under the GSW command. The definition of the user_id byte is as follows:
Bits 1,0
00 Record the keyword for user 1
01 Record the keyword for user 2
10 Record the keyword for user 3
11 Record the keyword for user 4
Bits 7-2
0 These bits must be set to zero.
Example:
RKW

Byte Sequence: Microcontroller TBD 01
Description: ISD-SR3000 is instructed to record the keyword for user 2.

RTAG (TBD hex) Record Voicetag user_id voicetag_number
Records a new voicetag into the voicetag location given by byte voicetag_number and marks the voicetag to show it belongs to the user numbers flagged in byte user_id. Note that a voicetag can be used by more than one user (for example, this might be used for a voicetag of 'Police'). See the description for the EV_RECORD_STATUS bits under the GSW command.
The byte user_id is coded as follows:
Bit 0
0 Do not assign this voicetag for user 1
1 Assign this voicetag to user 1
Bit 1
0 Do not assign this voicetag for user 2
1 Assign this voicetag to user 2
Bit 2
0 Do not assign this voicetag for user 3
1 Assign this voicetag to user 3
Bit 3
0 Do not assign this voicetag for user 4
1 Assign this voicetag to user 4
Bits 7-4
0 These bits must be set to zero.
Example:
RTAG

Byte Sequence: Microcontroller TBD 03 10
Description: ISD-SR3000 is instructed to record a new voicetag into voicetag number 10 (hex) and mark the tag as belonging to users 1 and 2.

SAS (1E hex) Say Augmented Sentence sentence_n arg
Announces sentence number sentence_n of the currently selected vocabulary, and passes arg to it.
sentence_n and arg are each 1-byte long.
When playing is complete, the ISD-SR3000 processor sets the EV_NORMAL_END bit in the status word, and activates the MWRQST signal.
If the current vocabulary is undefined, ERR_INVALID is reported.
Example:
SAS

Byte Sequence: Microcontroller 1E 00 03
Description: Announce the first sentence in the sentence table of the currently selected vocabulary with '3' as the actual parameter.

SDET (10 hex) Set Detectors Mask detectors_mask
Controls the reporting of detection for tones and VOX according to the value of the detectors_mask parameter. A bit set to 1 in the mask enables the reporting of the corresponding detector. A bit cleared to 0 disables the reporting.
Disabling reporting of a detector does not stop or reset the detector.
The 1-byte detectors_mask is encoded as follows:
Bit 0 Report detection of a busy tone.
Bit 1 Report detection of a dial tone.
Bits 2-3 Reserved. Must be cleared to 0.
Bit 4 Report detection of a constant energy.
Bit 5 Report detection of no energy (VOX) on the line. (The VOX attributes are specified with the tunable parameters VOX_TIME_COUNT and VOX_ENERGY_LEVEL.)
Bit 6 Report the ending of a detected DTMF.
Bit 7 Report the start of a detected DTMF (up to 40 ms after detection start).
Example:

Byte Sequence: Microcontroller 10 A3
Description: Set reporting of all ISD-SR3000 processor detectors, except for end-of-DTMF.

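The mask bits above translate into the following illustrative C constants; the names are invented here. Note that composing every detector except end-of-DTMF yields B3 (hex); the example byte as printed, A3 (hex), additionally leaves the constant-energy detector unreported.

    #include <stdint.h>

    /* Detector mask bits, from the SDET encoding above (bits 2-3 are
     * reserved and must stay 0). */
    #define DET_BUSY_TONE  (1u << 0)
    #define DET_DIAL_TONE  (1u << 1)
    #define DET_CONST_NRG  (1u << 4)
    #define DET_VOX        (1u << 5)
    #define DET_DTMF_END   (1u << 6)
    #define DET_DTMF_START (1u << 7)

    /* Everything except end-of-DTMF reporting: composes to B3 (hex). */
    static const uint8_t all_but_dtmf_end =
        DET_BUSY_TONE | DET_DIAL_TONE | DET_CONST_NRG |
        DET_VOX | DET_DTMF_START;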
SO (07 hex) Say One Word word_number
Plays the word number word_number in the current vocabulary. The 1-byte word_number may be any value from 0 through the index of the last word in the vocabulary.
When playback of the selected word has been completed, the ISD-SR3000 processor sets the EV_NORMAL_END bit in the status word, and activates the MWRQST signal.
If word_number is not defined in the current vocabulary, or if it is an IVS control or option code, ERR_PARAM is set in the error word.
If the current vocabulary is undefined, ERR_INVALID is reported.
Example:

Byte Sequence: Microcontroller 07 00
Description: Announce the first word in the word table of the currently selected vocabulary.

SS (1F hex) Say Sentence sentence_n
Say sentence number sentence_n of the currently selected vocabulary.
sentence_n is 1-byte long.
If the sentence has an argument, 0 is passed as the value for this argument.
When playing has been completed, the ISD-SR3000 processor sets the EV_NORMAL_END bit in the status word, and activates the MWRQST signal.
If sentence_n is not defined in the current vocabulary, ERR_PARAM is set in the error word. If the current vocabulary is undefined, ERR_INVALID is reported.
Example:

Byte Sequence: Microcontroller 1F 00
Description: Announce the first sentence in the sentence table of the currently selected vocabulary.

SUSP (1C hex) Suspend
Suspends the execution of the current PV, SAS, SO, SS, or SW command. The SUSP command does not change the state of the ISD-SR3000 processor; execution can be resumed with the RES command.
Example:
SUSP

Byte Sequence: Microcontroller 1C

Description: This command tells ISD-SR3000 to pause.

SW (21 hex) Say Words n word1 . . . wordn
Plays n words, indexed by word1 to wordn. On completion, the EV_NORMAL_END bit in the status word is set, and the MWRQST signal goes low.
If one of the words is not defined in the current vocabulary, or if it is an IVS control or option code, or if n > 8, ERR_PARAM is reported.
If the current vocabulary is undefined, ERR_INVALID is reported.
Example:

Byte Sequence: Microcontroller 21 02 00 00
Description: Announce the first word, in the word table of the currently selected vocabulary, twice.

TAGC (TBD hex) Change Voicetag user_id_info voicetag_number
Changes the user id info associated with a particular voicetag given in byte voicetag_number. See TAGQ also. Byte user_id_info is defined as shown below.
Bits 4,0
00, 01 This voicetag's data for user 1 is not changed
10 Voicetag data is updated to show that this voicetag is not used by user 1.
11 Voicetag data is updated to show that this voicetag is used by user 1.
Bits 5,1
00, 01 This voicetag's data for user 2 is not changed
10 Voicetag data is updated to show that this voicetag is not used by user 2.
11 Voicetag data is updated to show that this voicetag is used by user 2.
Bits 6,2
00, 01 This voicetag's data for user 3 is not changed
10 Voicetag data is updated to show that this voicetag is not used by user 3.
11 Voicetag data is updated to show that this voicetag is used by user 3.
Bits 7,3
00, 01 This voicetag's data for user 4 is not changed
10 Voicetag data is updated to show that this voicetag is not used by user 4.
11 Voicetag data is updated to show that this voicetag is used by user 4.
Example:
TAGC

Byte Sequence: Microcontroller TBD 31 10
Description: ISD-SR3000 is instructed to update voicetag 10's user data to show that it is used by user 1 and not used by user 2; user 3 and user 4's information is left unchanged.
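Packing the two-bit-per-user fields can be done as sketched below; tagc_info() is a name invented here, and tagc_info(TAGC_SET, TAGC_CLEAR, TAGC_KEEP, TAGC_KEEP) reproduces the 31 (hex) of the worked example.

    #include <stdint.h>

    /* Per-user action for TAGC's two-bit fields (bits 4,0 = user 1,
     * bits 5,1 = user 2, bits 6,2 = user 3, bits 7,3 = user 4). */
    enum tagc_action { TAGC_KEEP = 0, TAGC_CLEAR, TAGC_SET };

    /* Compose the user_id_info byte: the high bit of each pair flags a
     * change, and the low bit carries the new used/not-used value. */
    static uint8_t tagc_info(enum tagc_action u1, enum tagc_action u2,
                             enum tagc_action u3, enum tagc_action u4)
    {
        enum tagc_action a[4] = { u1, u2, u3, u4 };
        uint8_t info = 0;
        for (int k = 0; k < 4; k++) {
            if (a[k] == TAGC_KEEP)
                continue;
            info |= (uint8_t)(1u << (k + 4));      /* change flag */
            if (a[k] == TAGC_SET)
                info |= (uint8_t)(1u << k);        /* used by user k+1 */
        }
        return info;
    }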

TAGQ (TBD hex) Query Voicetag voicetag_number
Returns the user info for the voicetag given by byte voicetag_number.
The data returned is as follows:
Bit 0 0 This voicetag is not used by user 1.
1 This voicetag is used by user 1.
Bit 1 0 This voicetag is not used by user 2.
1 This voicetag is used by user 2.
Bit 2 0 This voicetag is not used by user 3.
1 This voicetag is used by user 3.
Bit 3 0 This voicetag is not used by user 4.
1 This voicetag is used by user 4.
Bits 4-7
x These bits are reserved and may return unpredictable values.
Example:
TAGQ

Byte Sequence: Microcontroller TBD 15 AA

Description: ISD-SR3000 is instructed to return the user information for voicetag 15. The data returned indicates that the voicetag is used by user 1 and by user 3.

TOPD (TBD hex) Disable Topics topic_id Disables selected topics. If the topic number doesn't exist, this command sets bit ERR_PARAM in the Error Word (see the GEW command on page 2-51).
Byte value topic_id identifies the topic to be disabled. The special value FF (hex) disables all topics. (Note: topics are currently not planned for the ISD-SR3000 product, but we wanted to include this command description in case this decision is changed.)
Example:
TOPD

Byte Sequence: Microcontroller TBD 10
Description: ISD-SR3000 is instructed to disable Topic number 10 (hex).

TOPE (TBD hex) Enable Topics topic_id Enables selected topics. If the topic number doesn't exist, this command sets bit ERR_PARAM in the Error Word (see the GEW command on page 2-51).
Byte value topic_id identifies the topic to be enabled. The special value FF (hex) enables all topics. (Note: topics are currently not planned for the ISD-SR3000 product, but we wanted to include this command description in case this decision is changed.)
Example:
TOPE

Byte Sequence: Microcontroller TBD 05
Description: ISD-SR3000 is instructed to enable Topic number 05 (hex).

TOPQ (TBD hex) Query Topics topic_id
Tests if a specific topic is enabled. If the topic number doesn't exist, this command sets bit ERR_PARAM in the Error Word (see the GEW command on page 2-51) and indicates the topic is invalid in a bit in the returned byte.
Byte value topic_id identifies the topic number to be queried. (Note: topics are currently not planned for the ISD-SR3000 product, but we wanted to include this command description in case this decision is changed.)
The TOPQ command returns a single byte with the following bit definitions:
Bit 0 0 Topic is not enabled.
1 Topic is enabled.
Bit 1 0 Topic number is valid 1 Topic number is invalid. (Bit ERR_PARAM in the Error Word is also set.) Bits 7-2 0 or 1 Bits 7-2 are reserved and should be disregarded. These bits may return any mix of 0 and 1.
Example:
TOPQ

Byte Sequence: Microcontroller TBD 10 AA

Description: ISD-SR3000 is queried for the status of Topic 10 (hex). The response from ISD-SR3000 indicates the Topic is a valid number and is enabled.

VC (28 hex) Volume Control vol_level
Controls the energy level of all the voice outputs. The resolution is ~3 dB.
The actual output level is composed of the tunable level variable, plus the vol_level. The valid range for the actual output level of each output generator is defined in Table 24.
For example, if the tunable variable VCD_LEVEL is 6, and vol_level is -2, then the output level equals VCD_LEVEL + vol_level = 4.
Example:
VC

Byte Sequence: Microcontroller 28 05
Description: ISD-SR3000 is instructed to set the vol_level to +5 (hex).


Claims (30)

1. A voice controlled device, comprising:
a processor;
a processor readable storage medium;
code recorded in the processor readable storage medium to store at least one user assignable appliance name in the processor readable storage medium;
code recorded in the processor readable storage medium to recognize the at least one user assignable appliance name;
code recorded in the processor readable storage medium to recognize a command; and code recorded in the processor readable storage medium to control the voice controlled device in response to recognizing the user assignable appliance name and the command.
2. The voice controlled device of claim 1, wherein, the user assignable appliance name and the command are provided using audible speech.
3. The voice controlled device of claim 1, wherein, the user assignable appliance name and the command are provided using non-audible speech.
4. The voice controlled device of claim 1 further comprising:
code recorded in the processor readable storage medium to store personal preferences of the voice controlled device associated with the at least one user assignable appliance name; and code recorded in the processor readable storage medium to personalize the voice controlled device to the stored personal preferences associated with the at least one user assignable appliance name upon recognition of the at least one user assignable appliance name.
5. The voice controlled device of claim 1 further comprising:
code recorded in the processor readable storage medium to store a default appliance name associated with the voice controlled device;
code recorded in the processor readable storage medium to recognize the default appliance name associated with the voice controlled device; and wherein, code recorded in the processor readable storage medium to control the voice controlled electronic device is further responsive to recognizing the default appliance name and the command.
6. The voice controlled device of claim 5, wherein, the default appliance name associated with the voice controlled device is factory assignable.
7. The voice controlled device of claim 5, wherein, the default appliance name associated with the voice controlled device is factory and user assignable.
8. A method of controlling a voice controlled device, the method comprising:
providing a voice controlled device having a speech recognition system for recognizing speech;
storing at least one user assignable appliance name into the voice controlled device;
communicating a communicated appliance name and a command to the voice controlled device; and controlling the voice controlled device if the communicated appliance name is recognized as matching the at least one user-assignable appliance name and the command is recognized by the voice controlled device.
9. The method of claim 8 for activating a voice controlled device, wherein, the communicated appliance name and the command are communicated using audible speech.
10. The method of claim 8 for activating a voice controlled device, wherein, the communicated appliance name and the command are communicated using non-audible speech.
11. A method of controlling a voice controlled device, the method comprising:
providing a voice controlled device having a speech recognition system for recognizing speech;
storing a default appliance name into the voice controlled device;
communicating a communicated name and a command to the voice controlled device; and controlling the voice controlled device if the communicated name is recognized as matching the default appliance name and the command is recognized by the voice controlled device.
12. The method of claim 11 for activating a voice controlled device, wherein, the communicated appliance name and the command are communicated using audible speech.
13. The method of claim 11 for activating a voice controlled device, wherein, the communicated appliance name and the command are communicated using non-audible speech.
14. A method for activating a voice controlled device, the method comprising:
providing a voice controlled device having a speech recognition system for recognizing speech;
storing a default appliance name into the voice controlled device;
storing at least one user assignable appliance name into the voice controlled device;
communicating a communicated name and a command to the voice controlled device; and controlling the voice controlled device if the communicated name is recognized as matching the at least one user assignable appliance name or the default appliance name and the command is recognized by the voice controlled device.
15. A method of assigning a new name to a voice controlled device, the method comprising:
providing a voice controlled device having a speech recognition system for recognizing speech;
activating the voice controlled device; and communicating a new name to the voice controlled device at least once.
16. The method of claim 15 for assigning a new name to a voice controlled device, wherein, the voice controlled device is activated by communicating a current appliance name and a change name command.
17. The method of claim 15 for assigning a new name to a voice controlled device, wherein, the new name is communicated using audible speech.
18. The method of claim 15 for assigning a new name to a voice controlled device, wherein, the new name is communicated using non-audible speech.
19. The method of claim 15 for assigning a new name to a voice controlled device, wherein:
the voice controlled device includes prompting capability and the voice controlled device communicates audible prompts to a user in order to request communication from the user of the new name.
20. The method of claim 15 for assigning a new name to a voice controlled device, wherein:
the voice controlled device includes prompting capability and the voice controlled device communicates non-audible prompts to another voice controlled device in order to request communication from the device of the new name.
21. A first voice controlled device capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device comprising:
a processor;
a processor readable storage medium;
code recorded in the processor readable storage medium to store a plurality of user assignable appliance names in the processor readable storage medium for activating the voice controlled device;
code recorded in the processor readable storage medium to recognize the plurality of user assignable appliance names associated with the one voice controlled device;
code recorded in the processor readable storage medium to recognize a command; and code recorded in the processor readable storage medium to control the voice controlled electronic device in response to recognizing one of the plurality of user assignable appliance names and the command.
22. The first voice controlled device of claim 21 capable of operating in a communication environment with at least one other voice controlled device, wherein, the user assignable appliance names and the command are provided using audible speech.
23. The first voice controlled device of claim 21 capable of operating in a communication environment with at least one other voice controlled device, wherein, the user assignable appliance names and the command are provided using non-audible speech.
24. The first voice controlled device of claim 21 capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device further comprising:
code recorded in the processor readable storage medium to store personal preferences of the voice controlled device associated with the at least one user assignable appliance name; and code recorded in the processor readable storage medium to personalize the voice controlled device to the stored personal preferences associated with the at least one user assignable appliance name upon recognition of the at least one user assignable appliance name.
25. The first voice controlled device of claim 21 capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device further comprising:

code recorded in the processor readable storage medium to store a default appliance name associated with the voice controlled device;
code recorded in the processor readable storage medium to recognize the default appliance name associated with the voice controlled device; and wherein, code recorded in the processor readable storage medium to control the voice controlled electronic device is further responsive to recognizing the default appliance name and the command.
26. The first voice controlled device of claim 25 capable of operating in a communication environment with at least one other voice controlled device, wherein, the default appliance name associated with each of the voice controlled devices is factory assignable.
27. The first voice controlled device of claim 25 capable of operating in a communication environment with at least one other voice controlled device, wherein the default appliance name associated with each of the voice controlled devices is factory and user assignable.
28. The first voice controlled device of claim 21 capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device further comprising:
a security means to protect each voice controlled device from unauthorized use.
29. The first voice controlled device of claim 24 capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device further comprising:
a security means to protect each voice controlled device from unauthorized use.
30. The first voice controlled device of claim 27 capable of operating in a communication environment with at least one other voice controlled device, the first voice controlled device further comprising:
a security means to protect each voice controlled device from unauthorized use.
CA002308950A 1999-05-21 2000-05-19 Method and apparatus for controlling voice controlled devices Abandoned CA2308950A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/316,643 US6584439B1 (en) 1999-05-21 1999-05-21 Method and apparatus for controlling voice controlled devices
US09/316,643 1999-05-21

Publications (1)

Publication Number Publication Date
CA2308950A1 true CA2308950A1 (en) 2000-11-21

Family

ID=23229977

Family Applications (2)

Application Number Title Priority Date Filing Date
CA002308946A Abandoned CA2308946A1 (en) 1999-05-21 2000-05-19 Method and apparatus for controlling voice controlled devices
CA002308950A Abandoned CA2308950A1 (en) 1999-05-21 2000-05-19 Method and apparatus for controlling voice controlled devices

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CA002308946A Abandoned CA2308946A1 (en) 1999-05-21 2000-05-19 Method and apparatus for controlling voice controlled devices

Country Status (5)

Country Link
US (1) US6584439B1 (en)
EP (1) EP1054390A3 (en)
JP (1) JP2001027897A (en)
KR (1) KR20010020875A (en)
CA (2) CA2308946A1 (en)

Families Citing this family (171)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5958012A (en) 1996-07-18 1999-09-28 Computer Associates International, Inc. Network management system using virtual reality techniques to display and simulate navigation to network components
US8621032B2 (en) 1996-07-18 2013-12-31 Ca, Inc. Method and apparatus for intuitively administering networked computer systems
US6775264B1 (en) 1997-03-03 2004-08-10 Webley Systems, Inc. Computer, internet and telecommunications based network
US8648692B2 (en) * 1999-07-23 2014-02-11 Seong Sang Investments Llc Accessing an automobile with a transponder
JP4314680B2 (en) * 1999-07-27 2009-08-19 ソニー株式会社 Speech recognition control system and speech recognition control method
KR100812109B1 (en) * 1999-10-19 2008-03-12 소니 일렉트로닉스 인코포레이티드 Natural language interface control system
JP2001125900A (en) * 1999-10-29 2001-05-11 Yazaki Corp Interactive system, interactive method and bi-directional interactive system, bi-directional interactive method and recording medium
US6947893B1 (en) * 1999-11-19 2005-09-20 Nippon Telegraph & Telephone Corporation Acoustic signal transmission with insertion signal for machine control
JP2001197379A (en) * 2000-01-05 2001-07-19 Matsushita Electric Ind Co Ltd Unit setting device, unit setting system, and recording medium having unit setting processing program recorded thereon
DE10002321C2 (en) * 2000-01-20 2002-11-14 Micronas Munich Gmbh Voice-controlled device and system with such a voice-controlled device
US7516190B2 (en) 2000-02-04 2009-04-07 Parus Holdings, Inc. Personal voice-based information retrieval system
US6721705B2 (en) 2000-02-04 2004-04-13 Webley Systems, Inc. Robust voice browser system and voice activated device controller
US6956941B1 (en) * 2000-04-12 2005-10-18 Austin Logistics Incorporated Method and system for scheduling inbound inquiries
JP2001296881A (en) * 2000-04-14 2001-10-26 Sony Corp Device and method for information processing and recording medium
DE10021389A1 (en) * 2000-05-03 2001-11-08 Nokia Mobile Phones Ltd Electronic system setting modification method e.g. for radio receiver, involves interpreting user input with respect to each electronic device and confirming the input before regulation
US7142662B2 (en) * 2000-07-11 2006-11-28 Austin Logistics Incorporated Method and system for distributing outbound telephone calls
CA2313717A1 (en) * 2000-07-11 2002-01-11 Mitercom Inc. Speech activated network appliance system
US7103173B2 (en) * 2001-07-09 2006-09-05 Austin Logistics Incorporated System and method for preemptive goals based routing of contact records
US20030023435A1 (en) * 2000-07-13 2003-01-30 Josephson Daryl Craig Interfacing apparatus and methods
WO2002010900A2 (en) * 2000-07-28 2002-02-07 Siemens Automotive Corporation User interface for telematics systems
JP2004505327A (en) * 2000-07-28 2004-02-19 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ A system for controlling devices with voice commands
DE10041456A1 (en) * 2000-08-23 2002-03-07 Philips Corp Intellectual Pty Method for controlling devices using voice signals, in particular in motor vehicles
WO2002023389A1 (en) 2000-09-15 2002-03-21 Robert Fish Systems and methods for translating an item of information using a distal computer
DE60120062T2 (en) * 2000-09-19 2006-11-16 Thomson Licensing Voice control of electronic devices
US7003465B2 (en) * 2000-10-12 2006-02-21 Matsushita Electric Industrial Co., Ltd. Method for speech recognition, apparatus for the same, and voice controller
US20020095473A1 (en) * 2001-01-12 2002-07-18 Stuart Berkowitz Home-based client-side media computer
MXPA03007687A (en) * 2001-02-28 2004-11-12 Carrier Corp Controle module for hvac systems.
US7715546B2 (en) * 2001-07-09 2010-05-11 Austin Logistics Incorporated System and method for updating contact records
US7054434B2 (en) 2001-07-09 2006-05-30 Austin Logistics Incorporated System and method for common account based routing of contact records
US7149691B2 (en) * 2001-07-27 2006-12-12 Siemens Corporate Research, Inc. System and method for remotely experiencing a virtual environment
US20030069733A1 (en) * 2001-10-02 2003-04-10 Ryan Chang Voice control method utilizing a single-key pushbutton to control voice commands and a device thereof
US20030069734A1 (en) * 2001-10-05 2003-04-10 Everhart Charles Allen Technique for active voice recognition grammar adaptation for dynamic multimedia application
US6889191B2 (en) * 2001-12-03 2005-05-03 Scientific-Atlanta, Inc. Systems and methods for TV navigation with compressed voice-activated commands
LU90879B1 (en) * 2001-12-12 2003-06-13 Hagen Hultzsch Method and device for recognizing the sleepiness state of guides of moving objects
DE10163213A1 (en) * 2001-12-21 2003-07-10 Philips Intellectual Property Method for operating a speech recognition system
US7752047B2 (en) * 2002-05-01 2010-07-06 Morris Gary J Environmental condition detector with speech recognition
US7398209B2 (en) 2002-06-03 2008-07-08 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US7080013B2 (en) * 2002-06-06 2006-07-18 International Business Machines Corporation Categorization and recall methodology for physical media in large carousel systems
US20040006477A1 (en) * 2002-07-05 2004-01-08 Craner Michael L. Voice-controllable communication gateway for controlling multiple electronic and information appliances
US7693720B2 (en) 2002-07-15 2010-04-06 Voicebox Technologies, Inc. Mobile systems and methods for responding to natural language speech utterance
US7464035B2 (en) * 2002-07-24 2008-12-09 Robert Bosch Corporation Voice control of home automation systems via telephone
US7663300B2 (en) * 2002-08-16 2010-02-16 Universal Display Corporation Organic light emitting devices for illumination
US7797159B2 (en) * 2002-09-16 2010-09-14 Movius Interactive Corporation Integrated voice navigation system and method
US7027842B2 (en) * 2002-09-24 2006-04-11 Bellsouth Intellectual Property Corporation Apparatus and method for providing hands-free operation of a device
US7899500B2 (en) * 2002-09-24 2011-03-01 At&T Intellectual Property I, L. P. Apparatus and method for providing hands-free operation of a device
US7076428B2 (en) * 2002-12-30 2006-07-11 Motorola, Inc. Method and apparatus for selective distributed speech recognition
US20040143437A1 (en) * 2003-01-06 2004-07-22 Jbs Technologies, Llc Sound-activated system for remotely operating vehicular or other functions
US7480619B1 (en) * 2003-03-04 2009-01-20 The Board Of Trustees Of The Leland Stanford Junior University Integration manager and natural interaction processor
US20080154602A1 (en) * 2004-05-05 2008-06-26 Omega Engineering, Inc. Speech generating means for use with signal sensors
US7249025B2 (en) * 2003-05-09 2007-07-24 Matsushita Electric Industrial Co., Ltd. Portable device for enhanced security and accessibility
US7555533B2 (en) 2003-10-15 2009-06-30 Harman Becker Automotive Systems Gmbh System for communicating information from a server via a mobile communication device
GB0325497D0 (en) * 2003-10-31 2003-12-03 Vox Generation Ltd Automated speech application creation deployment and management
US7627095B2 (en) * 2003-11-05 2009-12-01 AT&T Intellecutal Property, I. L.P. Telephone and method for routing a telephone call in a telecommunications network
GB0328035D0 (en) * 2003-12-03 2004-01-07 British Telecomm Communications method and system
WO2005062293A1 (en) * 2003-12-05 2005-07-07 Kabushikikaisha Kenwood Audio device control device,audio device control method, and program
EP1690706A4 (en) * 2003-12-05 2009-11-18 Kenwood Corp Air conditioner control device and air conditioner control method
US7505909B2 (en) * 2003-12-05 2009-03-17 Kabushikikaisha Kenwood Device control device and device control method
EP1555652B1 (en) * 2004-01-19 2007-11-14 Harman Becker Automotive Systems GmbH Activation of a speech dialogue system
ATE400871T1 (en) 2004-01-29 2008-07-15 Harman Becker Automotive Sys MULTIMODAL DATA ENTRY
DE602004017955D1 (en) 2004-01-29 2009-01-08 Daimler Ag Method and system for voice dialogue interface
US20090164215A1 (en) * 2004-02-09 2009-06-25 Delta Electronics, Inc. Device with voice-assisted system
US8654936B1 (en) * 2004-02-24 2014-02-18 At&T Intellectual Property I, L.P. Home control, monitoring and communication system using remote voice commands
US7689404B2 (en) * 2004-02-24 2010-03-30 Arkady Khasin Method of multilingual speech recognition by reduction to single-language recognizer engine components
US20050256719A1 (en) * 2004-05-12 2005-11-17 Vanorman Stacy L Voice-activated remote control
US7403898B2 (en) * 2004-08-20 2008-07-22 AT&T Delaware Intellectual Property, Inc. Methods, systems, and storage mediums for implementing voice-commanded computer functions
US20060123220A1 (en) * 2004-12-02 2006-06-08 International Business Machines Corporation Speech recognition in BIOS
TWI295052B (en) * 2004-12-15 2008-03-21 Delta Electronics Inc Method and apparatus for speech control
US20060136220A1 (en) * 2004-12-22 2006-06-22 Rama Gurram Controlling user interfaces with voice commands from multiple languages
WO2006117793A1 (en) * 2005-05-02 2006-11-09 Hewlett-Packard Development Company, L.P. Voice based network management method and agent
US7424431B2 (en) * 2005-07-11 2008-09-09 Stragent, Llc System, method and computer program product for adding voice activation and voice control to a media player
US7640160B2 (en) 2005-08-05 2009-12-29 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
US7620549B2 (en) 2005-08-10 2009-11-17 Voicebox Technologies, Inc. System and method of supporting adaptive misrecognition in conversational speech
US7848928B2 (en) * 2005-08-10 2010-12-07 Nuance Communications, Inc. Overriding default speech processing behavior using a default focus receiver
US20190362725A1 (en) 2005-08-17 2019-11-28 Tamiras Per Pte. Ltd., Llc Providing access with a portable device and voice commands
US7949529B2 (en) 2005-08-29 2011-05-24 Voicebox Technologies, Inc. Mobile systems and methods of supporting natural language human-machine interactions
US7697827B2 (en) 2005-10-17 2010-04-13 Konicek Jeffrey C User-friendlier interfaces for a camera
US20070117581A1 (en) * 2005-11-07 2007-05-24 Frank Chu Systems and methods for providing push-to-talk to full-duplex voice internetworking
US20070135969A1 (en) * 2005-11-09 2007-06-14 Robert Curl Method and system for adjusting thermostat via telephone
FR2898419B1 (en) * 2006-03-10 2008-08-15 Emmanuel Andre Simon Device having the function of interpreting information provided to a cellular telephone for use by a computer
US7836183B1 (en) * 2006-05-05 2010-11-16 Rangecast Technologies, Llc Internet audio scanner and method
US9514746B2 (en) * 2006-09-26 2016-12-06 Storz Endoskop Produktions Gmbh System and method for hazard mitigation in voice-driven control applications
US8073681B2 (en) 2006-10-16 2011-12-06 Voicebox Technologies, Inc. System and method for a cooperative conversational voice user interface
US9070363B2 (en) * 2007-10-26 2015-06-30 Facebook, Inc. Speech translation with back-channeling cues
US8972268B2 (en) * 2008-04-15 2015-03-03 Facebook, Inc. Enhanced speech-to-speech translation system and methods for adding a new word
US7831431B2 (en) * 2006-10-31 2010-11-09 Honda Motor Co., Ltd. Voice recognition updates via remote broadcast signal
US20080153465A1 (en) * 2006-12-26 2008-06-26 Voice Signal Technologies, Inc. Voice search-enabled mobile device
US20080154608A1 (en) * 2006-12-26 2008-06-26 Voice Signal Technologies, Inc. On a mobile device tracking use of search results delivered to the mobile device
US20080154870A1 (en) * 2006-12-26 2008-06-26 Voice Signal Technologies, Inc. Collection and use of side information in voice-mediated mobile search
US20080154612A1 (en) * 2006-12-26 2008-06-26 Voice Signal Technologies, Inc. Local storage and use of search results for voice-enabled mobile communications devices
US8731146B2 (en) 2007-01-04 2014-05-20 AT&T Intellectual Property I, L.P. Call re-directed based on voice command
US7818176B2 (en) 2007-02-06 2010-10-19 Voicebox Technologies, Inc. System and method for selecting and presenting advertisements based on natural language processing of voice-based input
US20080228493A1 (en) * 2007-03-12 2008-09-18 Chih-Lin Hu Determining voice commands with cooperative voice recognition
KR100883105B1 (en) * 2007-03-30 2009-02-11 Samsung Electronics Co., Ltd. Method and apparatus for voice recognition dialing in a portable terminal
EP2045140B1 (en) * 2007-10-01 2010-01-27 Harman/Becker Automotive Systems GmbH Adjustment of vehicular elements by speech control
US8010369B2 (en) 2007-10-30 2011-08-30 AT&T Intellectual Property I, L.P. System and method for controlling devices that are connected to a network
US20090144061A1 (en) * 2007-11-29 2009-06-04 Plantronics, Inc. Systems and methods for generating verbal feedback messages in head-worn electronic devices
US8140335B2 (en) 2007-12-11 2012-03-20 Voicebox Technologies, Inc. System and method for providing a natural language voice user interface in an integrated voice navigation services environment
CN101493987B (en) * 2008-01-24 2011-08-31 Shenzhen Futaihong Precision Industry Co., Ltd. Sound control remote-control system and method for mobile phone
US20090198750A1 (en) * 2008-02-06 2009-08-06 Chiao-Ling Lin Sound control calculator device
US8078472B2 (en) * 2008-04-25 2011-12-13 Sony Corporation Voice-activated remote control service
US9305548B2 (en) 2008-05-27 2016-04-05 Voicebox Technologies Corporation System and method for an integrated, multi-modal, multi-device natural language voice services environment
US8589161B2 (en) 2008-05-27 2013-11-19 Voicebox Technologies, Inc. System and method for an integrated, multi-modal, multi-device natural language voice services environment
US20110185342A1 (en) * 2008-06-03 2011-07-28 Whirlpool Corporation Appliance development toolkit
US8326637B2 (en) 2009-02-20 2012-12-04 Voicebox Technologies, Inc. System and method for processing multi-modal device interactions in a natural language voice services environment
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
WO2011008164A1 (en) * 2009-07-17 2011-01-20 Milux Holding S.A. A system for voice control of a medical implant
US9171541B2 (en) 2009-11-10 2015-10-27 Voicebox Technologies Corporation System and method for hybrid processing in a natural language voice services environment
US9502025B2 (en) 2009-11-10 2016-11-22 Voicebox Technologies Corporation System and method for providing a natural language content dedication service
US8175884B1 (en) 2011-02-08 2012-05-08 Gary Jay Morris Environmental condition detector with validated personalized verbal messages
JP5039214B2 (en) * 2011-02-17 2012-10-03 Toshiba Corporation Voice recognition operation device and voice recognition operation method
US9368107B2 (en) * 2011-04-20 2016-06-14 Nuance Communications, Inc. Permitting automated speech command discovery via manual event to command mapping
JP2013068532A (en) * 2011-09-22 2013-04-18 Clarion Co Ltd Information terminal, server device, search system, and search method
EP2575128A3 (en) * 2011-09-30 2013-08-14 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US8340975B1 (en) * 2011-10-04 2012-12-25 Theodore Alfred Rosenberger Interactive speech recognition device and system for hands-free building control
US10088853B2 (en) 2012-05-02 2018-10-02 Honeywell International Inc. Devices and methods for interacting with an HVAC controller
US11023520B1 (en) * 2012-06-01 2021-06-01 Google Llc Background audio identification for query disambiguation
US9288421B2 (en) * 2012-07-12 2016-03-15 Samsung Electronics Co., Ltd. Method for controlling external input and broadcast receiving apparatus
KR101428245B1 (en) * 2012-12-05 2014-08-07 Hyundai Motor Company Apparatus and method for speech recognition
US9148499B2 (en) * 2013-01-22 2015-09-29 Blackberry Limited Method and system for automatically identifying voice tags through user operation
US9218052B2 (en) * 2013-03-14 2015-12-22 Samsung Electronics Co., Ltd. Framework for voice controlling applications
US20140282273A1 (en) * 2013-03-15 2014-09-18 Glen J. Anderson System and method for assigning voice and gesture command areas
US10145579B2 (en) 2013-05-01 2018-12-04 Honeywell International Inc. Devices and methods for interacting with a control system that is connected to a network
US9020469B2 (en) 2013-06-04 2015-04-28 Rangecast Technologies, Llc Network audio distribution system and method
EP3014610B1 (en) 2013-06-28 2023-10-04 Harman International Industries, Incorporated Wireless control of linked devices
US10030878B2 (en) 2013-08-21 2018-07-24 Honeywell International Inc. User interaction with building controller device using a remote server and a duplex connection
WO2015026933A2 (en) 2013-08-21 2015-02-26 Honeywell International Inc. Devices and methods for interacting with an hvac controller
US9295086B2 (en) 2013-08-30 2016-03-22 Motorola Solutions, Inc. Method for operating a radio communication device in a multi-watch mode
US8990079B1 (en) * 2013-12-15 2015-03-24 Zanavox Automatic calibration of command-detection thresholds
US9443516B2 (en) * 2014-01-09 2016-09-13 Honeywell International Inc. Far-field speech recognition systems and methods
KR101573138B1 (en) * 2014-04-04 2015-12-01 삼성전자주식회사 Method and apparatus for measuring user physical activity
US10514677B2 (en) 2014-04-11 2019-12-24 Honeywell International Inc. Frameworks and methodologies configured to assist configuring devices supported by a building management system
US9632748B2 (en) * 2014-06-24 2017-04-25 Google Inc. Device designation for audio input monitoring
US20160055847A1 (en) * 2014-08-19 2016-02-25 Nuance Communications, Inc. System and method for speech validation
US9898459B2 (en) 2014-09-16 2018-02-20 Voicebox Technologies Corporation Integration of domain information into state transitions of a finite state transducer for natural language processing
EP3195145A4 (en) 2014-09-16 2018-01-24 VoiceBox Technologies Corporation Voice commerce
US9747896B2 (en) 2014-10-15 2017-08-29 Voicebox Technologies Corporation System and method for providing follow-up responses to prior natural language inputs of a user
US10614799B2 (en) 2014-11-26 2020-04-07 Voicebox Technologies Corporation System and method of providing intent predictions for an utterance prior to a system detection of an end of the utterance
US10431214B2 (en) 2014-11-26 2019-10-01 Voicebox Technologies Corporation System and method of determining a domain and/or an action related to a natural language input
KR102323393B1 (en) 2015-01-12 2021-11-09 Samsung Electronics Co., Ltd. Device and method of controlling the device
US9583097B2 (en) * 2015-01-30 2017-02-28 Google Inc. Dynamic inference of voice command for software operation from help information
US20160225369A1 (en) * 2015-01-30 2016-08-04 Google Technology Holdings LLC Dynamic inference of voice command for software operation from user manipulation of electronic device
JP6501217B2 (en) * 2015-02-16 2019-04-17 Alpine Electronics, Inc. Information terminal system
KR102417682B1 (en) 2015-09-09 2022-07-07 Samsung Electronics Co., Ltd. Method and apparatus for managing a nickname using voice recognition
US11587559B2 (en) * 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
WO2018023106A1 (en) 2016-07-29 2018-02-01 Erik SWART System and method of disambiguating natural language processing requests
US10229678B2 (en) 2016-10-14 2019-03-12 Microsoft Technology Licensing, Llc Device-described natural language control
US10628635B1 (en) 2017-03-29 2020-04-21 Valyant AI, Inc. Artificially intelligent hologram
US10592706B2 (en) 2017-03-29 2020-03-17 Valyant AI, Inc. Artificially intelligent order processing system
US11423879B2 (en) * 2017-07-18 2022-08-23 Disney Enterprises, Inc. Verbal cues for high-speed control of a voice-enabled device
KR102435186B1 (en) * 2017-08-31 2022-08-24 Samsung Electronics Co., Ltd. Cooking apparatus and cooking system
US11099540B2 (en) 2017-09-15 2021-08-24 Kohler Co. User identity in household appliances
US10887125B2 (en) 2017-09-15 2021-01-05 Kohler Co. Bathroom speaker
US11093554B2 (en) 2017-09-15 2021-08-17 Kohler Co. Feedback for water consuming appliance
US10448762B2 (en) 2017-09-15 2019-10-22 Kohler Co. Mirror
US11314214B2 (en) 2017-09-15 2022-04-26 Kohler Co. Geographic analysis of water conditions
US10586537B2 (en) * 2017-11-30 2020-03-10 International Business Machines Corporation Filtering directive invoking vocal utterances
US10524046B2 (en) 2017-12-06 2019-12-31 Ademco Inc. Systems and methods for automatic speech recognition
US11201849B2 (en) * 2018-01-15 2021-12-14 Lenovo (Singapore) Pte. Ltd. Natural language connectivity
KR101972545B1 (en) * 2018-02-12 2019-04-26 주식회사 럭스로보 A Location Based Voice Recognition System Using A Voice Command
US10878824B2 (en) 2018-02-21 2020-12-29 Valyant AI, Inc. Speech-to-text generation using video-speech matching from a primary speaker
US10636423B2 (en) * 2018-02-21 2020-04-28 Motorola Solutions, Inc. System and method for managing speech recognition
US11145299B2 (en) 2018-04-19 2021-10-12 X Development Llc Managing voice interface devices
US11011162B2 (en) * 2018-06-01 2021-05-18 Soundhound, Inc. Custom acoustic models
CN112262584B (en) * 2018-06-15 2023-03-21 Mitsubishi Electric Corporation Device control apparatus, device control system, device control method, and device control program
US10811009B2 (en) * 2018-06-27 2020-10-20 International Business Machines Corporation Automatic skill routing in conversational computing frameworks
US11627012B2 (en) 2018-10-09 2023-04-11 NewTekSol, LLC Home automation management system
WO2020139121A1 (en) 2018-12-28 2020-07-02 Ringcentral, Inc., (A Delaware Corporation) Systems and methods for recognizing a speech of a speaker
US11361764B1 (en) * 2019-01-03 2022-06-14 Amazon Technologies, Inc. Device naming-indicator generation
US11170774B2 (en) * 2019-05-21 2021-11-09 Qualcomm Incorporated Virtual assistant device
US11438452B1 (en) 2019-08-09 2022-09-06 Apple Inc. Propagating context information in a privacy preserving manner
US10855841B1 (en) * 2019-10-24 2020-12-01 Qualcomm Incorporated Selective call notification for a communication device

Family Cites Families (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4556944A (en) 1983-02-09 1985-12-03 Pitney Bowes Inc. Voice responsive automated mailing system
US4520576A (en) 1983-09-06 1985-06-04 Whirlpool Corporation Conversational voice command control system for home appliance
FR2571191B1 (en) 1984-10-02 1986-12-26 Renault Radiotelephone system, particularly for motor vehicles
US4644107A (en) 1984-10-26 1987-02-17 Ttc Voice-controlled telephone using visual display
NL8502959A (en) 1985-08-26 1987-03-16 C. van der Lely N.V. Electronic device responding to sound.
DE3570569D1 (en) 1985-09-03 1989-06-29 Motorola Inc Hands-free control system for a radiotelephone
US4797924A (en) * 1985-10-25 1989-01-10 Nartron Corporation Vehicle voice recognition method and apparatus
EP0238695B1 (en) 1986-03-27 1991-08-28 International Business Machines Corporation Automatic generation of simple markov model stunted baseforms for words in a vocabulary
US4857030A (en) 1987-02-06 1989-08-15 Coleco Industries, Inc. Conversing dolls
EP0293259A3 (en) 1987-05-29 1990-03-07 Kabushiki Kaisha Toshiba Voice recognition system used in telephone apparatus
DE3885683T2 (en) 1987-09-11 1994-03-10 Toshiba Kawasaki Kk Telephone.
EP0311414B2 (en) 1987-10-08 1997-03-12 Nec Corporation Voice controlled dialer having memories for full-digit dialing for any users and abbreviated dialing for authorized users
US4928302A (en) 1987-11-06 1990-05-22 Ricoh Company, Ltd. Voice actuated dialing apparatus
US5054082A (en) 1988-06-30 1991-10-01 Motorola, Inc. Method and apparatus for programming devices to recognize voice commands
US5117460A (en) 1988-06-30 1992-05-26 Motorola, Inc. Voice controlled pager and programming techniques therefor
US5007081A (en) 1989-01-05 1991-04-09 Origin Technology, Inc. Speech activated telephone
JP2927891B2 (en) 1989-06-19 1999-07-28 NEC Corporation Voice dialing device
US5020107A (en) 1989-12-04 1991-05-28 Motorola, Inc. Limited vocabulary speech recognition system
US5165095A (en) 1990-09-28 1992-11-17 Texas Instruments Incorporated Voice telephone dialing
US5369685A (en) 1991-03-07 1994-11-29 Sprint Communications Company L.P. Voice-activated telephone directory and call placement system
US5406618A (en) 1992-10-05 1995-04-11 Phonemate, Inc. Voice activated, handsfree telephone answering device
US5457769A (en) 1993-03-30 1995-10-10 Earmark, Inc. Method and apparatus for detecting the presence of human voice signals in audio signals
US5452340A (en) 1993-04-01 1995-09-19 Us West Advanced Technologies, Inc. Method of voice activated telephone dialing
US5602963A (en) 1993-10-12 1997-02-11 Voice Powered Technology International, Inc. Voice activated personal organizer
US5566272A (en) 1993-10-27 1996-10-15 Lucent Technologies Inc. Automatic speech recognition (ASR) processing using confidence measures
IL108608A (en) 1994-02-09 1998-01-04 Dsp Telecomm Ltd Accessory voice operated unit for a cellular telephone
CH689410A5 (en) 1994-04-21 1999-03-31 Info Byte Ag Method and apparatus for voice-activated remote control of electrical loads.
US5737723A (en) 1994-08-29 1998-04-07 Lucent Technologies Inc. Confusable word detection in speech recognition
US5652789A (en) 1994-09-30 1997-07-29 Wildfire Communications, Inc. Network based knowledgeable assistant
US5752232A (en) 1994-11-14 1998-05-12 Lucent Technologies Inc. Voice activated device and method for providing access to remotely retrieved data
US5685000A (en) * 1995-01-04 1997-11-04 U S West Technologies, Inc. Method for providing a linguistically competent dialogue with a computerized service representative
JP3254994B2 (en) 1995-03-01 2002-02-12 Seiko Epson Corporation Speech recognition dialogue apparatus and speech recognition dialogue processing method
JPH08307509A (en) 1995-04-19 1996-11-22 Texas Instruments Inc. Method and equipment for hands-free dialing of a telephone
JP3968133B2 (en) 1995-06-22 2007-08-29 Seiko Epson Corporation Speech recognition dialogue processing method and speech recognition dialogue apparatus
US5842168A (en) 1995-08-21 1998-11-24 Seiko Epson Corporation Cartridge-based, interactive speech recognition device with response-creation capability
US5774841A (en) 1995-09-20 1998-06-30 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Real-time reconfigurable adaptive speech recognition command and control apparatus and method
US6052666A (en) * 1995-11-06 2000-04-18 Thomson Multimedia S.A. Vocal identification of devices in a home environment
US5895447A (en) 1996-02-02 1999-04-20 International Business Machines Corporation Speech recognition using thresholded speaker class model selection or model adaptation
US5895448A (en) 1996-02-29 1999-04-20 Nynex Science And Technology, Inc. Methods and apparatus for generating and using speaker independent garbage models for speaker dependent speech recognition purpose
US5799065A (en) 1996-05-06 1998-08-25 Matsushita Electric Industrial Co., Ltd. Call routing device employing continuous speech
EP0810502A1 (en) 1996-05-30 1997-12-03 Diehl GmbH & Co. Control unit for a heating system
US5926090A (en) 1996-08-26 1999-07-20 Sharper Image Corporation Lost article detector unit with adaptive actuation signal recognition and visual and/or audible locating signal
US5777571A (en) 1996-10-02 1998-07-07 Holtek Microelectronics, Inc. Remote control device for voice recognition and user identification restrictions
US5874939A (en) 1996-12-10 1999-02-23 Motorola, Inc. Keyboard apparatus and method with voice recognition
US6188985B1 (en) 1997-01-06 2001-02-13 Texas Instruments Incorporated Wireless voice-activated device for control of a processor-based host system
DE19712632A1 (en) 1997-03-26 1998-10-01 Thomson Brandt Gmbh Method and device for remote voice control of devices
US5884265A (en) 1997-03-27 1999-03-16 International Business Machines Corporation Method and system for selective display of voice activated commands dialog box
US5867574A (en) 1997-05-19 1999-02-02 Lucent Technologies Inc. Voice activity detection system and method
US5924070A (en) * 1997-06-06 1999-07-13 International Business Machines Corporation Corporate voice dialing with shared directories
WO1998055992A1 (en) 1997-06-06 1998-12-10 BSH Bosch und Siemens Hausgeräte GmbH Household appliance, especially an electrically operated household appliance
WO1999005671A1 (en) 1997-07-24 1999-02-04 Knowles Electronics, Inc. Universal voice operated command and control engine
US6101473A (en) 1997-08-08 2000-08-08 Board of Trustees of the Leland Stanford Junior University Using speech recognition to access the internet, including access via a telephone
AU9501198A (en) 1997-09-15 1999-04-05 Intellivoice Communications, Inc. Simplified training of voice dialing systems
DE69712485T2 (en) 1997-10-23 2002-12-12 Sony International (Europe) GmbH Voice interface for a home network
US6119088A (en) * 1998-03-03 2000-09-12 Ciluffo; Gary Appliance control programmer using voice recognition
US6397186B1 (en) * 1999-12-22 2002-05-28 Ambush Interactive, Inc. Hands-free, voice-operated remote control transmitter

Also Published As

Publication number Publication date
US6584439B1 (en) 2003-06-24
CA2308946A1 (en) 2000-11-21
EP1054390A3 (en) 2001-11-14
JP2001027897A (en) 2001-01-30
EP1054390A2 (en) 2000-11-22
KR20010020875A (en) 2001-03-15

Similar Documents

Publication Title
CA2308950A1 (en) Method and apparatus for controlling voice controlled devices
EP1054389A2 (en) Method and apparatus for machine to machine communication using speech recognition
US20020193989A1 (en) Method and apparatus for identifying voice controlled devices
EP1054387A2 (en) Method and apparatus for activating voice controlled devices
KR102543693B1 (en) Electronic device and operating method thereof
US20060074658A1 (en) Systems and methods for hands-free voice-activated devices
USRE41080E1 (en) Voice activated/voice responsive item locater
KR100804855B1 (en) Method and apparatus for a voice controlled foreign language translation device
JP3479691B2 (en) Automatic control method of one or more devices by voice dialogue or voice command in real-time operation and device for implementing the method
CN107580113B (en) Reminding method, device, storage medium and terminal
US20100332003A1 (en) Controlling audio players using environmental audio analysis
CN111971742A (en) Techniques for language independent wake word detection
EP1063636A2 (en) Method and apparatus for standard voice user interface and voice controlled devices
JPH0962293A (en) Speech recognition dialogue device and speech recognition dialogue processing method
KR102029820B1 (en) Electronic device and method for controlling power thereof using voice recognition
US20240005918A1 (en) System For Recognizing and Responding to Environmental Noises
EP3422344B1 (en) Electronic device for performing operation corresponding to voice input
CN101253547B (en) Speech dialog method and system
Mittal et al. Speech based command and control system for mobile phones: issues and challenges
US11699438B2 (en) Open smart speaker
KR102445779B1 (en) A method for controlling a conversational service device and a conversational service device
US10854196B1 (en) Functional prerequisites and acknowledgments
McLoughlin et al. Speech recognition for smart homes
JP3846500B2 (en) Speech recognition dialogue apparatus and speech recognition dialogue processing method
Hirsch Speech Assistant System With Local Client and Server Devices to Guarantee Data Privacy

Legal Events

Code Title
FZDE Discontinued