RETAINING PROSODY DURING SPEECH ANALYSIS FOR LATER PLAYBACK
CROSS REFERENCE TO RELATED APPLICATIONS
The subject matter of the present application is related to the subject matter of U.S. patent application attorney docket number 2207/4031, entitled "Representing Speech Using MIDI," to Dale Boss, Sridhar Iyengar and T. Don Dennis and assigned to Intel Corporation, filed on even date herewith, and U.S. patent application attorney docket number 2207/4069, entitled "Audio Fonts Used For Capture and Rendering," to Timothy Towell and assigned to Intel Corporation, filed on even date herewith.
The present invention relates to speech systems and more particularly to a system for encoding speech signals into a compact representation, including speech segments and prosodic parameters, that permits accurate and natural sounding playback.
Speech analysis systems include speech recognition systems and speech synthesis systems. Automatic speech recognition systems, also known as speech-to-text systems, include a computer (hardware and software) that analyzes a speech signal and produces a textual representation of the speech signal. FIG. 1 illustrates a functional block diagram of a prior art automatic speech recognition system. An automatic speech recognition system can include an analog-to-digital (A/D) converter 10 for digitizing the analog speech signal, a speech analyzer 12 and a language analyzer 14. Initially, the system stores a dictionary including a pattern (i.e., digitized waveform) and textual representation for each of a plurality of speech segments (i.e., vocabulary). These speech segments may include words, syllables, diphones, etc. The speech analyzer divides the speech into a plurality of segments, and compares the pattern of each input segment to the segment patterns in the known vocabulary using pattern recognition or pattern matching in an attempt to identify each segment.
Language analyzer 14 uses a language model, which is a set of principles describing language use, to construct a textual representation of the received speech segments. In other words, the speech recognition system uses a combination of pattern recognition and sophisticated guessing based on some linguistic and contextual knowledge. For example, certain word sequences are much more likely to occur than others. The language analyzer may work with the speech analyzer to identify words or resolve ambiguities between different words or word spellings. However, due to a limited vocabulary and other system limitations, a speech recognition system can guess incorrectly. For example, a speech recognition system receiving a speech signal having an unfamiliar accent or unfamiliar words may incorrectly guess several words, resulting in a textual output which can be unintelligible.
One proposed speech recognition system is disclosed in Alex Waibel, "Prosody and Speech Recognition, Research Notes In Artificial Intelligence," Morgan Kaufman Publishers, 1988 (ISBN 0-934613-70-2).
Waibel discloses a speech-to-text system (such as an automatic dictation machine) that extracts prosodic information or parameters from the speech signal to improve the accuracy of text generation. Prosodic parameters associated with each speech segment may include, for example, the pitch (fundamental frequency F0) of the segment, duration of the segment, and amplitude (or stress or volume) of the segment. Waibel's speech recognition system is limited to the generation of an accurate textual representation of the speech signal. After generating the textual representation of the speech signal, any prosodic information that was extracted from the speech signal is discarded. Therefore, a person or system receiving the textual representation output by a speech-to-text system will know what was said, but will not know how it was said (i.e., pitch, duration, rhythm, intonation, stress).
Similarly, as illustrated in FIG. 2, speech synthesis systems exist for converting text to synthesized speech, and can include, for example, a language synthesizer 16, a speech synthesizer 18 and a digital-to-analog (D/A) converter 20. Speech synthesizers use a plurality of stored speech segments and their associated representation (i.e., vocabulary) to generate speech by, for example, concatenating the stored speech segments. However, because no information is provided with the text as to how the speech should be generated (i.e., pitch, duration, rhythm, intonation, stress), the result is typically unnatural or robotic sounding speech. As a result, automatic speech recognition (speech-to-text) systems and speech synthesis (text-to-speech) systems may not be effectively used for the encoding, storing and transmission of natural sounding speech signals. Moreover, the areas of speech recognition and speech synthesis are separate disciplines. Speech recognition systems and speech synthesis systems are not typically used together to provide for a complete system that includes both encoding an analog signal into a digital representation and then decoding the digital representation to reconstruct the speech signal. Rather, speech recognition systems and speech synthesis systems are employed independently of one another, and therefore, do not typically share the same vocabulary and language
A functional block diagram of a prior art system which may be used for encoding, storage and transmission of audio signals is illustrated in FIG. 3. An audio signal, which may include a speech signal, is digitized by an A/D converter 22. A compressor/decompressor (codec) 24 compresses the digitized audio signal by, for example, removing superfluous or unnecessary information. The digitized audio may be transmitted over a transmission medium 26. At the receiving end, the signal is decompressed by a codec 28 and converted to an analog signal by a D/A converter 30 for output to a speaker 32. Even though the system of FIG. 3 can provide excellent speech rendering, this technique requires a relatively high bit rate (bandwidth) for transmission and a very large storage capacity for storing the digitized speech information, and provides no flexibility.
Therefore, a need has arisen for a speech system that provides a compact representation of a speech signal for efficient transmission, storage, etc., and which permits accurate (i.e., what was said) and natural sounding (i.e., how it was said) reconstruction of the speech signal.
SUMMARY OF THE INVENTION
The present invention overcomes disadvantages and drawbacks of prior art speech systems.
An embodiment of a speech encoding system of the present invention includes a memory for storing a speech dictionary. The dictionary includes a pattern and a corresponding identification (ID) for each of a plurality of speech segments (i.e., phonemes). The speech encoding system also includes an A/D converter for digitizing an analog speech signal. A speech analyzer is coupled to the memory and receives the digitized speech signal from the A/D converter. The speech analyzer identifies each of the speech segments in the received digitized speech signal based on the dictionary. The speech analyzer outputs each of the digitized speech segments and the segment ID for each of the identified speech segments. The speech encoding system also includes one or more prosodic parameter detectors, such as a pitch detector, a duration detector, and an amplitude detector coupled to the memory and the analyzer. The prosodic parameter detectors detect various prosodic parameters of each digitized segment, and output prosodic parameter values indicating the values of the detected parameters. The speech encoding system also includes a digital data encoder coupled to the prosodic parameter detectors and the speech analyzer. The digital data encoder generates a digital data stream for transmission or storage, or other use. The digital data stream includes a speech segment ID and the corresponding prosodic parameter values for each of the digitized speech segments of the received speech signal.
An embodiment of a speech decoding system of the present invention includes a memory storing a dictionary comprising a digitized pattern and a corresponding segment ID for each of a plurality of speech segments (i.e., phonemes). The speech decoding system also includes a digital data decoder coupled to the memory and receiving a digital data stream from a transmission medium. The decoder identifies and outputs speech segment IDs and the corresponding prosodic parameter values (e.g., 1 kHz for pitch, 0.35 ms for duration, 3.2 volts peak-to-peak for amplitude) in the received data stream. A speech synthesizer is coupled to the memory and the decoder. The synthesizer selects digitized patterns in the dictionary corresponding to the segment IDs received from the decoder and modifies each of the selected digitized patterns according to the corresponding prosodic parameter values received from the decoder. The speech synthesizer then outputs the modified speech patterns to generate a speech signal.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a functional block diagram of a prior art automatic speech recognition system.
FIG. 2 illustrates a functional block diagram of a prior art speech synthesis system.
FIG. 3 illustrates a functional block diagram of a prior art system which may be used for encoding, storage and transmission of audio signals.
FIG. 4 illustrates a functional block diagram of a speech encoding system according to an embodiment of the present invention.
FIG. 5 illustrates a functional block diagram of a speech decoding system according to an embodiment of the present invention.
FIG. 6 illustrates a block diagram of an embodiment of a computer for implementing the speech encoding system of FIG. 4 and speech decoding system of FIG. 5.
FIG. 4 illustrates a speech encoding system according to an embodiment of the present invention. Speech encoding system 40 includes an A/D converter 42 for digitizing an analog speech signal received on line 44. Encoding system 40 also includes a memory 50 for storing a speech dictionary, comprising a digitized pattern and a corresponding phoneme identification (ID) for each of a plurality of phonemes. A speech analyzer 48 is coupled to A/D converter 42 and memory 50 and identifies the phonemes of the digitized speech signal received over line 46 based on the stored dictionary. A plurality of prosodic parameter detectors, including a pitch detector 56, a duration detector 58, and an amplitude detector 60, are each coupled to memory 50 and speech analyzer 48 for detecting various prosodic parameters of the phonemes received over line 52 from analyzer 48, and outputting prosodic parameter values indicating the value of each detected parameter. A digital data encoder 68 is coupled to memory 50, detectors 56, 58 and 60, and analyzer 48, and generates a digital data stream including phoneme IDs and corresponding prosodic parameter values for each of the phonemes received by analyzer 48.
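As a rough illustration of the kind of data stream encoder 68 might produce, the sketch below packs each phoneme into a four-byte record: a phoneme ID followed by quantized pitch, duration and amplitude values. The record layout, the function names and the example values are assumptions made only for illustration; the description does not fix a particular stream format.

```python
import struct

# Hypothetical record layout (an assumption, not taken from the description):
# one unsigned byte each for phoneme ID, pitch, duration and amplitude.
RECORD = struct.Struct("BBBB")

def encode_stream(phonemes):
    """Pack (phoneme_id, pitch, duration, amplitude) tuples into a byte string."""
    return b"".join(RECORD.pack(*p) for p in phonemes)

def decode_stream(data):
    """Unpack a byte string produced by encode_stream back into tuples."""
    return [RECORD.unpack_from(data, offset)
            for offset in range(0, len(data), RECORD.size)]

# Three phonemes with made-up IDs and quantized prosodic values (0..255).
records = [(7, 128, 140, 120), (12, 135, 100, 128), (3, 120, 128, 90)]
stream = encode_stream(records)
print(len(stream), decode_stream(stream) == records)
```

A decoder that shares the same dictionary only needs these few bytes per phoneme to look up the stored pattern and adjust it, which is the source of the compactness discussed above.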
The speech dictionary (i.e., phoneme dictionary) stored in memory 50 comprises a digitized pattern (i.e., a phoneme pattern) and a corresponding phoneme ID for each of a plurality of phonemes. It is advantageous, although not required, for the dictionary used in the present invention to use phonemes because there are only 40 phonemes in American English, including 24 consonants and 16 vowels, according to the International Phonetic Association. Phonemes are the smallest segments of sound that can be distinguished by their contrast within words. Examples of phonemes include /b/, as in bat, /d/, as in dad, and /k/, as in key or coo. Phonemes are abstract units that form the basis for transcribing a language unambiguously. Although embodiments of the present invention are explained in terms of phonemes (i.e., phoneme patterns, phoneme dictionaries), the present invention may alternatively be implemented using other types of speech segments, such as diphones, words, syllables, etc.
The digitized phoneme patterns stored in the phoneme dictionary in memory 50 can be the actual digitized waveforms of the phonemes. Alternatively, each of the stored phoneme patterns in the dictionary may be a simplified or processed representation of the digitized phoneme waveform, obtained, for example, by processing the digitized phoneme to remove any unnecessary information. Each of the phoneme IDs stored in the dictionary is a multi-bit quantity (e.g., a byte) that uniquely identifies each phoneme.
The phoneme patterns stored for all 40 phonemes in the dictionary are together known as a voice font. A voice font can be stored in memory 50 by a person saying into a microphone a standard sentence that contains all 40 phonemes, and by digitizing, separating and storing the digitized phonemes as digitized phoneme patterns in memory 50. System 40 then assigns a standard phoneme ID to each phoneme pattern. The dictionary can be created or implemented with a generic or neutral voice font, a generic male voice font (lower in pitch, rougher quality, etc.), a generic female voice font (higher pitch, smoother quality), or any specific voice font, such as the voice of the person inputting speech to be encoded.
A plurality of voice fonts can be stored in memory 50. Each voice font contains information identifying unique voice qualities (unique pitch or frequency, frequency range, rough, harsh, throaty, smooth, nasal, etc.) that distinguish each particular voice from others. The pitch, duration and amplitude of the received digitized phonemes (patterns) of the voice font can be calculated (for example, using the method discussed below) and are assigned as the average pitch, duration and amplitude for this voice font. In addition, a speech frequency (pitch) range can be estimated for this voice, for example as the speech frequency range of an average person (e.g., 3 kHz), but centered at the average frequency for each phoneme. Range estimates for duration and amplitude can similarly be used.
Also, with eight bits, for example, used to represent the value of each prosodic parameter, there are 256 possible quantized values for each of pitch, duration and amplitude, which can, for example, be spaced evenly across their respective ranges. The average pitch, duration and amplitude values for each voice font are each assigned, for example, the middle quantized level, number 128 out of 256 total quantized levels. For example, with 256 quantized pitch levels spread across a 3 kHz pitch range, and an average pitch for the phoneme /b/ of, for example, 11.5 kHz, the 256 quantized pitch levels would extend across the range 10-13 kHz, with a spacing between quantized levels of approximately 11.7 Hz (3000 Hz/256). Any number of bits can be used to represent each prosodic parameter, and it is not necessary to center the ranges on the average value. Alternatively, each person may read several sentences into the encoding system 40, and encoding system 40 may estimate a range of each prosodic parameter based on the variation of each prosodic parameter between the sentences.
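The arithmetic of this example can be checked with a short sketch. It uses the figures given in the text (8 bits, a 3 kHz range, an average of 11.5 kHz); the function names and the choice to truncate to the level at or below the measured value are assumptions made for illustration only.

```python
def make_quantizer(average_hz, range_hz=3000.0, bits=8):
    """Build quantize/dequantize helpers for levels spaced evenly across
    a range centered on the average value."""
    levels = 2 ** bits              # 256 levels for 8 bits
    step = range_hz / levels        # spacing between adjacent levels
    low = average_hz - range_hz / 2 # bottom of the range

    def quantize(value_hz):
        # Assumed rule: take the level at or below the value, clamped to the range.
        index = int((value_hz - low) / step)
        return max(0, min(levels - 1, index))

    def dequantize(index):
        return low + index * step

    return quantize, dequantize, step

quantize, dequantize, step = make_quantizer(11500.0)
print(step)                 # spacing in Hz, 3000 / 256
print(quantize(11500.0))    # level assigned to the average pitch
print(dequantize(0), dequantize(255))  # lowest and highest representable pitch
```

With these figures the spacing is 11.71875 Hz, the average pitch falls on level 128, and values outside 10-13 kHz are clamped to the end levels.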
Therefore, one or more voice fonts can be stored in memory 50 including the phoneme patterns (indicating average values for each prosodic parameter). Although not required, to increase speed of the system, encoding system 40 may also calculate and store in memory 50 with the voice font the average prosodic parameter values for each phoneme including average pitch, duration and amplitude, the ranges for each prosodic parameter for this voice, the number of quantization levels, and the spacing between each quantization level for each prosodic parameter.
In order to assist system 40 in accurately encoding the speech signal received on line 44 into the correct values, memory 50 should include the voice font of the person inputting the speech signal for encoding, as discussed below. The voice font which is used by system 40 to assist in encoding speech signal 44 can be user selectable through a keyboard, pointing device, etc., or a verbal command at the beginning of the speech signal 44, and is known as the designated input voice font. Also, as discussed in greater detail below regarding FIG. 5, the person inputting the sentence to be encoded can also select a designated output voice font to be used to reconstruct and generate the speech signal.
Speech analyzer 48 receives the digitized speech signal on line 46 output by A/D converter 42 and has access to the phoneme dictionary (i.e., phoneme patterns and corresponding phoneme IDs) stored in memory 50. Speech analyzer 48 uses pattern matching or pattern recognition to match the pattern of the received digitized speech signal 46 to the plurality of phoneme patterns stored in the designated input voice font in memory 50. In this manner, speech analyzer 48 identifies all of the phonemes in the received speech signal. To identify the phonemes in the received speech signal, speech analyzer 48, for example, may break up the received speech signal into a plurality of speech segments (syllables, words, groups of words, etc.) larger than a phoneme for comparison to the stored phoneme vocabulary to identify all the phonemes in the large speech segment. This process is repeated for each of the large speech segments until all of the phonemes in the received speech signal have been identified.
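One simple way to picture the pattern matching step is nearest-template matching, in which each stored phoneme pattern is reduced to a short feature vector and an incoming segment is given the ID of the closest stored pattern. The sketch below is only an illustration under that assumption. The feature vectors, the squared-distance measure and the names `identify_phoneme` and `font` are hypothetical; the description leaves the matching technique open.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify_phoneme(features, voice_font):
    """Return the phoneme ID whose stored pattern is closest to `features`.

    `voice_font` maps phoneme ID -> stored feature vector (hypothetical format).
    """
    return min(voice_font,
               key=lambda pid: squared_distance(features, voice_font[pid]))

# A toy voice font with three made-up patterns.
font = {
    1: [0.9, 0.1, 0.0],
    2: [0.1, 0.8, 0.1],
    3: [0.0, 0.2, 0.9],
}
print(identify_phoneme([0.2, 0.7, 0.2], font))  # closest to pattern 2
```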
After identifying each of the phonemes in the speech signal received over line 46, speech analyzer 48 separates the received digitized speech signal into the plurality of digitized phoneme patterns. The pattern for each of the received phonemes can be the digitized waveform of the phoneme, or can be a simplified representation that includes information necessary for subsequent processing of the phoneme, discussed in greater detail below.
Speech analyzer 48 outputs the pattern of each received phoneme on line 52 for further processing, and at the same time, outputs the corresponding phoneme ID on line 54. For 40 phonemes, the phoneme ID may be a 6-bit signal provided in parallel over line 54. Analyzer 48 outputs the phoneme patterns and corresponding phoneme IDs sequentially for all received phonemes (i.e., on a first-in, first-out basis). The phoneme IDs output on line 54 indicate only what was said in the speech signal input on line 44, and do not indicate how the speech was said. Prosodic parameter detectors 56, 58 and 60 are used to identify how the original speech signal was said. The designated input voice font, if it was selected to be the voice font of the person inputting the speech signal, also provides information regarding the qualities of the original speech signal.
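The 6-bit figure follows from the size of the phoneme set: five bits give only 32 distinct IDs, while six give 64. The sketch below works this out and, assuming the 8-bit pitch, duration and amplitude values used as examples elsewhere in this description, totals the bits needed per phoneme. The helper name `id_bits` is hypothetical.

```python
def id_bits(num_segments):
    """Smallest number of bits that gives each segment a distinct ID."""
    return max(1, (num_segments - 1).bit_length())

PHONEMES = 40         # phoneme count stated in the description
PROSODIC_PARAMS = 3   # pitch, duration, amplitude
BITS_PER_PARAM = 8    # example width used in the description

bits_for_id = id_bits(PHONEMES)
bits_per_phoneme = bits_for_id + PROSODIC_PARAMS * BITS_PER_PARAM
print(bits_for_id, bits_per_phoneme)  # 6 30
```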
Pitch detector 56, duration detector 58 and amplitude detector 60 measure various prosodic parameters for each phoneme. The prosodic parameters (pitch, duration and amplitude) of each phoneme indicate how the speech was said and are important to permit a natural sounding reconstruction or playback of the original speech signal.
Pitch detector 56 receives each phoneme pattern on line 52 from speech analyzer 48 and estimates the pitch (fundamental frequency F0) of the phoneme represented by the received phoneme pattern by any one of several conventional time-domain techniques or by any one of the commonly employed frequency-domain techniques, such as autocorrelation, average magnitude difference, cepstrum, spectral compression and harmonic matching methods. These techniques may also be used to identify changes in the fundamental frequency of the phoneme (i.e., a rising or falling pitch, or a pitch shift). Pitch detector 56 also receives the designated input voice font from memory 50 over line 51. With 8 bits used to indicate phoneme pitch, there are 256 distinct frequencies or quantized levels, which are spaced evenly across the frequency range and centered at the average frequency for this phoneme, as indicated by information stored in memory 50 with the designated input voice font. Therefore, there are approximately 128 frequency values above the average frequency, and 128 frequency values below it, for each phoneme. Due to the unique qualities of each voice, different voice fonts can have different average pitches (frequencies) for each phoneme, different frequency ranges, and different spacing between each quantized level in the frequency range.
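Of the techniques listed, autocorrelation is perhaps the easiest to sketch. The example below estimates F0 by finding the lag at which a segment best matches a shifted copy of itself. It is a minimal illustration rather than the detector's actual method: the function name, the 50-500 Hz search band, the 8 kHz sample rate and the synthetic test tone are all assumptions, and a practical detector would add normalization and voicing checks.

```python
import math

def estimate_pitch(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate F0 in Hz as sample_rate / lag, where lag maximizes the
    (unnormalized) autocorrelation within the assumed search band."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the segment with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 200 Hz tone, 0.1 s long, sampled at 8 kHz (made-up test input).
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]
print(estimate_pitch(tone, RATE))
```

Because the lag is a whole number of samples, the estimate is limited to frequencies of the form sample rate divided by an integer; the test tone was chosen so that its period is exactly 40 samples.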
Pitch detector 56 compares the pitch of the phoneme represented by the received phoneme pattern (received over line 52) to the pitch of the corresponding phoneme in the designated input voice font. Pitch detector 56 outputs an eight-bit value on line 62 identifying the relative pitch of the received phoneme as compared to the average pitch for this phoneme (as indicated by the designated input voice font).
Duration detector 58 receives each phoneme pattern on line 52 from speech analyzer 48 and measures the time duration of the received phoneme represented by the received phoneme pattern. Duration detector 58 compares the duration of the phoneme to the average duration for this phoneme as indicated by the designated input voice font. With, for example, 8 bits used to indicate phoneme duration, there are 256 distinct duration values, which are spaced evenly across a range centered at the average duration for this phoneme, as indicated by the designated input voice font. Therefore, there are approximately 128 duration values above the average duration, and 128 duration values below it, for each phoneme. Duration detector 58 outputs an eight-bit value on line 64 identifying the relative duration of the received phoneme as compared to the average phoneme duration indicated by the designated input voice font.
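A minimal sketch of this measurement is shown below: the duration is taken from the number of samples in the phoneme pattern, and its offset from the voice font's average is mapped to one of 256 levels with the average at level 128. The 8 kHz sample rate, the 80 ms average, the 128 ms range and the function names are assumptions chosen only to make the arithmetic easy to follow; the description does not give duration figures.

```python
def duration_ms(num_samples, sample_rate):
    """Duration of a digitized phoneme pattern in milliseconds."""
    return 1000.0 * num_samples / sample_rate

def relative_level(value, average, value_range, bits=8):
    """Map a value to a quantized level, with the average at the middle level."""
    levels = 2 ** bits
    step = value_range / levels
    index = levels // 2 + int(round((value - average) / step))
    return max(0, min(levels - 1, index))  # clamp to the representable range

# Hypothetical phoneme: 720 samples at 8 kHz, i.e. 90 ms.
measured = duration_ms(720, 8000)
# Hypothetical voice font data: 80 ms average, 128 ms range (0.5 ms per level).
level = relative_level(measured, average=80.0, value_range=128.0)
print(measured, level)  # 90.0 148
```

Here the phoneme is 10 ms longer than average, which is 20 steps of 0.5 ms above the middle level.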