US RE43099 E1
Abstract
Coding systems that provide a perceptually improved approximation of the short-term characteristics of speech signals compared to typical coding techniques such as linear predictive analysis while maintaining enhanced coding efficiency. The invention advantageously employs a non-linear transformation and/or a spectral warping process to enhance particular short-term spectral characteristic information for respective voiced intervals of a speech signal. The non-linear transformed and/or warped spectral characteristic information is then coded, such as by linear predictive analysis, to produce a corresponding coded speech signal. The use of the non-linear transformation and/or spectral warping operation of the particular spectral information advantageously causes more coding resources to be used for those spectral components that contribute more to the perceptible quality of the corresponding synthesized speech. It is possible to employ this coding technique in a variety of speech coding techniques including, for example, vocoder and analysis-by-synthesis coding systems.
Claims (38)
1. A method for coding a speech signal to generate a coded signal comprising:
generating a sequence of spectral magnitude values for a frame interval of said speech signal representing voiced speech, said spectral magnitude value sequence characterizing spectral components of a short-term frequency spectrum of said interval;
performing at least one of a non-linear transformation or spectral warping process on said sequence to produce an intermediate spectral value sequence having an enhanced characterization of at least one particular frequency range relative to another frequency range in the intermediate spectral sequence; and
coding said intermediate spectral value sequence to produce at least a portion of said coded signal for said interval of said speech signal.
2. The method of
3. The method of
inverse transforming said intermediate spectral values into a time domain representation signal; and
generating linear predictive codes for said time domain representation signal.
4. The method of
[A(i)]^{N}, where A(i) represents the respective values in said sequence portion and the value N is not 0 or 1.
5. The method of
6. The method of
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
12. The method of
13. The method of
14. The method of
15. The method of
16. The method of
identifying a portion of said frame interval of said speech signal representing a pitch period;
performing a discrete Fourier transform of said identified portion of said frame interval to generate a sequence of spectral component values; and
determining respective magnitudes of said spectral component values to produce said spectral magnitude value sequence for said frame interval.
17. A method for decoding a coded speech signal, said coded signal including successive coded frame intervals of a speech signal, the decoding of a frame interval of said coded signal comprising the steps of:
generating an intermediate spectral value sequence for at least a portion of said interval representing voiced speech, said intermediate spectral value sequence characterizing spectral components of a short-term frequency spectrum of said interval and further having an enhanced characterization of at least one particular frequency range relative to another frequency range; and
processing said intermediate spectral value sequence with at least one of an inverse non-linear transformation or inverse spectral warping process to produce a sequence of spectral magnitude values characterizing the short-term frequency spectrum for the voiced portion of said interval.
18. The method of
19. The method of
[Ā′(i)]^{N}, where Ā′(i) represents the respective values in said sequence portion and the value N is not 0 or 1, and wherein said expression performs an inverse transformation of a non-linear transformation used in coding said coded signal interval.
20. The method of
21. The method of
22. The method of
23. The method of
24. The method of
25. The method of
26. The method of
27. A coder for generating a coded signal based on a speech signal comprising:
a spectral transformer for generating a sequence of spectral magnitude values for a frame interval of said speech signal representing voiced speech, said spectral magnitude value sequence characterizing spectral components of a short-term frequency spectrum of said frame interval;
an encoder coupled to said spectral processor, said encoder for performing at least one of a non-linear transformation or a spectral warping process on said sequence to produce an intermediate spectral value sequence having an enhanced characterization of at least one particular frequency range relative to another frequency range in the intermediate spectral sequence; and
a spectral coder coupled to said encoder, said spectral coder for coding said intermediate spectral value sequence to produce at least a portion of said coded signal for said interval of said speech signal.
28. The coder of
an inverse transformer for inverse transforming said spectral parameters processed by said spectral processor into a time domain representation signal; and
a linear predictive code generator for generating linear predictive coefficients for said coded signal based on said time domain representation signal for said interval of said speech signal.
29. The coder of
30. The coder of
31. The coder of
32. The coder of
33. The coder of
a window processor and pitch detector for identifying an interval in said frame interval of said speech signal representing a pitch period; and
a discrete Fourier transformer coupled to said window processor, said discrete Fourier transformer for generating said spectral magnitude value sequence for said interval.
34. A coder for generating a coded signal from a speech signal comprising:
means for generating a sequence of spectral magnitude values for a frame interval of said speech signal representing voiced speech, said spectral magnitude value sequence characterizing spectral components of a short-term frequency spectrum of said interval;
means for performing at least one of a non-linear transformation or spectral warping process on said sequence to produce an intermediate spectral value sequence having an enhanced characterization of at least one particular frequency range relative to another frequency range in the intermediate spectral sequence; and
means for coding said intermediate spectral value sequence to produce at least a portion of said coded signal for said interval of said speech signal.
35. A decoder for decoding a coded speech signal, said coded signal including successive coded frame intervals of a speech signal, said decoder comprising:
a spectral decoder, said spectral decoder for generating an intermediate spectral value sequence for voiced speech represented in said frame interval of the coded signal, said intermediate spectral value sequence characterizing spectral components of a short-term frequency spectrum of said voiced speech and further having an enhanced characterization of at least one particular frequency range relative to another frequency range; and
an inverse processor coupled to said spectral decoder, said inverse processor for processing said intermediate spectral value sequence with at least one of an inverse non-linear transformation or inverse spectral warping process to produce a sequence of spectral magnitude values characterizing a short-term frequency spectrum for the voiced portion of said interval.
36. The decoder of
37. The decoder of
38. A decoder for decoding a coded speech signal, said coded signal including successive coded frame intervals of a speech signal, said decoder comprising:
means for generating an intermediate spectral value sequence for voiced speech represented in said frame interval of the coded signal, said intermediate spectral value sequence characterizing spectral components of a short-term speech spectrum of voiced speech represented in said interval and further having an enhanced characterization of at least one particular frequency range relative to another frequency range; and
means for processing said intermediate spectral value sequence with at least one of an inverse non-linear transformation or inverse spectral warping process to produce a sequence of spectral magnitude values characterizing said short-term frequency spectrum for the voiced portion of said interval.
Description
The invention relates generally to speech communication systems and more specifically to systems for encoding and decoding speech.
Digital speech communication systems including voice storage and voice response systems use speech coding and data compression techniques to reduce the bit rate needed for storage and transmission. Voiced speech is produced by a periodic excitation of the vocal tract by the vocal cords. As a consequence, a corresponding signal for voiced speech contains a succession of similar but evolving waveforms having a substantially common period, which is referred to as the pitch period. Typical speech coding systems take advantage of short-term redundancies within a pitch period interval to achieve data compression in a coded speech signal.
In a typical voice coder (vocoder) system, such as that described in U.S. Pat. No. 3,624,302, which is incorporated by reference herein, the speech signal is partitioned into successive fixed duration intervals of 10 msec. to 30 msec. and a set of coefficients is generated approximating the short-term frequency spectrum resulting from the short-term redundancies or correlation in each interval. These coefficients are generated by linear predictive analysis and referred to as linear predictive coefficients (LPC's). The LPC's represent a time-varying all-pole filter that models the vocal tract.
The LPC's are useable for reproducing the original speech signal by employing an excitation signal referred to as a prediction residual. The prediction residual represents a component of the original speech signal that remains after removal of the short-term redundancy by linear predictive analysis. In vocoders, the prediction residual is typically modeled as white noise for unvoiced sounds and a periodic sequence of impulses for voiced speech.
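The linear predictive analysis outlined above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain LPC's and is not stated here to be the method of the cited vocoder. The function names `autocorrelation` and `levinson_durbin` and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_P z^-P are choices made for this illustration only.

```python
import numpy as np


def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one speech frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(max_lag + 1)])


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation values.

    Returns (a, err), where a[0] == 1 and a[1..order] are the predictor
    polynomial coefficients, and err is the residual prediction error energy.
    Assumes r[0] > 0 (a silent frame would divide by zero).
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Correlation of the current predictor with lag m.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err  # reflection coefficient for stage m
        prev = a.copy()
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


if __name__ == "__main__":
    # A frame that decays as 0.5**n is predicted by half of the previous sample.
    frame = 0.5 ** np.arange(64)
    coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
    print(coeffs, residual_energy)
```

For a frame that decays as 0.5^n, the first coefficient comes out close to -0.5 and the second close to 0, meaning each sample is predicted as half of the previous one.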
A synthesized speech signal can be generated by a vocoder synthesizer based on the modeled residual and the LPC's of the linear predictive filter modeling the vocal tract. Vocoders approximate the spectral information of an original speech signal and not the time-domain waveform of such a signal. Moreover, a speech signal synthesized from such codes often exhibits a perceptible synthetic quality that is, at times, difficult to understand. Alternative known speech coding techniques having improved perceptual speech quality approximate the waveform of a speech signal. Conventional analysis-by-synthesis systems employ such a coding technique. Typical analysis-by-synthesis systems are able to achieve synthesized speech having acceptable perceptual quality. Such systems employ both linear predictive analysis for coding the short-term redundant characteristics of the pitch period as well as a long-term predictor (LTP) for coding long term pitch correlation in the prediction residual. In LTP's, characteristics of past pitch periods are used to provide an approximation of characteristics of a present pitch period. Typical LTP's have included an all-pole filter providing delayed feedback of past pitch-period characteristics, or a codebook of overlapping vectors of past pitch-period characteristics. In particular analysis-by-synthesis systems, the prediction residual is modeled by an adaptive or stochastic codebook of noise signals. The optimum excitation is found by searching through the codebook of candidate excitation vectors for successive speech intervals referred to as frames. A code specifying the particular codebook entry of the found optimum excitation is then transmitted on a channel along with coded LPC's and the LTP parameters. These particular analysis-by-synthesis systems are referred to as code-excited linear prediction (CELP) systems. Exemplary CELP coders are described in greater detail in B. Atal and M. 
Schroeder, “Stochastic Coding of Speech Signals at Very Low Bit Rates”, Proceedings IEEE Int. Conf. Comm., p. 48.1 (May 1984); M. Schroeder and B. Atal, “Code-Excited Linear Predictive (CELP): High Quality Speech at Very Low Bit Rates”, Proc. IEEE Int. Conf. ASSP., pp. 937-940 (1985) and P. Kroon and E. Deprettere, “A Class of Analysis-by-Synthesis Predictive Coders for High-Quality Speech Coding at Rate Between 4.8 and 16 KB/s”, IEEE J. on Sel. Areas in Comm., SAC-6(2), pp. 353-363 (Feb. 1988), which are all incorporated by reference herein.
However, in vocoder and analysis-by-synthesis systems as well as other types of speech coding systems, there is a recognized need for methods of coding characteristics of the short-term frequency spectrum with enhanced perceptual accuracy.
In particular, spectral warping spreads frequency ranges that substantially affect the perceptual quality of corresponding synthesized speech and compresses perceptually less significant frequency ranges. In a corresponding manner, the non-linear transformation performs a magnitude warping operation on the spectral magnitude values. Such transformation amplifies and/or attenuates spectral magnitude values to enhance the characterization of the perceptual quality of a corresponding synthesized speech signal.
The invention is based on the realization that typical coding methods, including linear predictive analysis, perform coding of the short-term frequency spectrum of a speech signal with substantially equal coding resources used for respective frequency components whether such frequency components substantially affect the perceptual quality of a speech signal synthesized from the coded signal or otherwise. In other words, typical coding techniques do not perform coding of frequency components of the short-term frequency spectrum characterization based on the perceptual accuracy such frequency components produce in a corresponding synthesized speech signal.
In contrast, the present invention processes the spectral component values by spectral warping and/or non-linear transformation to produce a transformed and/or warped characterization that causes subsequent spectral coding, such as by linear predictive analysis, to provide more coding resources for perceptually more significant spectral components and less coding resources to those spectral components that are less perceptually significant. Accordingly, the resulting synthesized voiced speech produced from such a coded signal would have an improved perceptual quality while maintaining an advantageous coding efficiency relative to the coding process alone.
A corresponding decoder according to the invention employs a complementary inverse non-linear transformation and/or spectral warping process to obtain the corresponding approximation of the original short-term frequency spectrum of the respective frames of the speech signal with improved perceptual quality.
It is possible to employ the coding technique of the invention in a variety of spectral coding arrangements including, for example, vocoder and analysis-by-synthesis coding systems, or other techniques where linear prediction analysis has been used for characterizing the short-term frequency spectrum of a speech signal.
Additional features and advantages of the present invention will become more readily apparent from the following detailed description and accompanying drawings.
The invention advantageously employs processing of successive frames of a speech signal by performing a non-linear transformation and/or spectral warping process on spectral magnitude value sequences characterizing the short-term frequency spectrum of respective voiced speech frames prior to spectral coding by, for example, linear predictive analysis.
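As a hedged illustration of the non-linear transformation and its complementary inverse, the sketch below assumes the power-law form [A(i)]^N recited in the claims, with N not equal to 0 or 1. The function names, the example exponent N = -1/3 and the small floor that guards against zero magnitudes are assumptions of this sketch, not details taken from the embodiments.

```python
import numpy as np

# Assumed guard value: a zero magnitude cannot be raised to a negative power.
FLOOR = 1e-12


def nonlinear_transform(mags, n):
    """Raise each spectral magnitude to the power n (n must not be 0 or 1)."""
    if n in (0, 1):
        raise ValueError("N must not be 0 or 1")
    mags = np.maximum(np.asarray(mags, dtype=float), FLOOR)
    return mags ** n


def inverse_nonlinear_transform(transformed, n):
    """Undo nonlinear_transform by raising each value to the power 1/n."""
    if n in (0, 1):
        raise ValueError("N must not be 0 or 1")
    return np.asarray(transformed, dtype=float) ** (1.0 / n)


def dynamic_range(values):
    """Ratio of the largest to the smallest value in a positive sequence."""
    values = np.asarray(values, dtype=float)
    return values.max() / values.min()


if __name__ == "__main__":
    mags = np.array([100.0, 10.0, 1.0, 0.1])
    n = -1.0 / 3.0  # illustrative exponent of the form -1/B with B = 3
    transformed = nonlinear_transform(mags, n)
    print(dynamic_range(mags), dynamic_range(transformed))
    print(inverse_nonlinear_transform(transformed, n))
```

With this exponent, a magnitude range of 1000:1 shrinks to about 10:1 and spectral peaks become valleys; applying the inverse with the same N recovers the original magnitudes up to floating-point error.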
As used herein, “short-term frequency spectrum” refers to spectral characteristics arising from the short-term correlation in the speech signal excluding the correlation resulting from the pitch periodicity. The short-term frequency spectrum is alternatively referred to as the short-time frequency spectrum in the art, and is described in greater detail in L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, sects. 6.0-6.1, pp. 250-282 (Prentice-Hall, New Jersey, 1978), which is incorporated by reference herein in its entirety.
Spectral warping spreads or compresses particular frequency ranges represented in the spectral magnitude value sequence based on the effect such frequency ranges have on the perceptual accuracy produced in corresponding speech synthesized from the coded signal. In a corresponding manner, the non-linear transformation performs a magnitude warping operation on the spectral magnitude values. Such transformation amplifies and/or attenuates the spectral magnitude values to enhance the characterization for producing an improved perceptual accuracy in corresponding synthesized speech.
The invention is based on the realization that typical coders, including linear predictive coders, code frequency components of a voiced speech signal interval such that perceptually significant frequency components are coded using identical or similar resources to those used for coding perceptually less significant frequency components. In contrast, the invention processes the spectral magnitude values by spectral warping and/or non-linear transformation to produce a transformed and/or warped characterization having an enhanced characterization of at least one particular frequency range that causes the coder to provide more coding resources to perceptually more significant spectral components and less coding resources to those spectral components that are less perceptually significant.
Accordingly, synthesized speech produced from such a coded speech signal has an improved perceptual quality relative to the coding process alone while maintaining an advantageous coding efficiency.
The invention is described below with regard to using linear predictive analysis for providing the spectral coding for illustration purposes only and is not intended to be a limitation of the invention. It is alternatively possible to employ numerous other spectral coding techniques that code the frequency components of the short-term frequency spectrum by methods other than coding based on a corresponding perceptual quality or accuracy that such components would have in corresponding synthesized speech. For instance, it is possible to use a spectral coder according to the invention that does not allocate coded signal bits or coding resources based on the perceptual quality of the respective spectral components.
The invention is useable in a variety of coder systems for encoding the short-term vocal tract characteristics of voiced speech including, for example, vocoders or analysis-by-synthesis systems such as CELP coders. Exemplary vocoder and CELP type coder and decoder systems employing the technique of the invention are illustrated in the accompanying drawings.
For clarity of explanation, the illustrative embodiments of the invention are shown as including, among other things, individual function blocks. The functions these blocks represent may be provided through the use of either shared or dedicated hardware including hardware capable of executing software instructions. For example, such functions can be performed by digital signal processor (DSP) hardware, such as the Lucent DSP16 or DSP32C, and software performing the operations discussed below, which is not meant to be a limitation of the invention. It is also possible to use very large scale integration (VLSI) hardware components as well as hybrid DSP/VLSI arrangements in accordance with the invention.
An exemplary vocoder-type coder arrangement
The processor
Nevertheless, in the encoder
The processor
The quantized coefficient sequence ά
The remaining output signals of the processor
An exemplary configuration for the short-term frequency spectrum processor
The pitch detector
Exemplary methods for determining if a frame contains a voiced speech component and for identifying pitch period intervals are described in the previously cited Digital Processing of Speech Signals book, sects. 4.8, 7.2, 8.10.1, pp. 150-157, 372-378, 447-450. It is possible to determine a pitch period interval by examining the long-term correlation in the speech frame and/or by performing linear predictive analysis on the speech frame and identifying the location of the pitch impulse in the resulting prediction residual.
The pitch detector
The window processor
Moreover, it is advantageous to align the determined window function relative to the frame sequence of digitized speech samples for obtaining essentially a pitch period interval of samples from the beginning of a pitch period to the beginning of a next pitch period.
It is possible for the pitch detector
The sequence S
The spectral magnitude sequence A(i) represents a sampled version of a continuous, i.e., non-discrete, short-term frequency spectrum A(z). However, the spectral magnitude sequence A(i) will alternatively be referred to as the short-term frequency spectrum for ease of explanation. A conventional DFT processor is useable to generate the desired spectral magnitude values A(i). However, phase components in addition to the desired magnitude components are typically produced by conventional DFT processors and are not required for this particular embodiment of the invention.
Accordingly, since the phase component is not required according to the invention, other transforms that directly generate magnitude values are useable for the spectral processor
Moreover, the previously described method for producing the spectral magnitude value sequence A(i) characterizing the short-term frequency spectrum of the frame j is for illustration purposes only and is not meant as a limitation of the invention. It should be readily understood that numerous other techniques are useable for producing such a sequence characterizing the short-term frequency spectrum of the frame j.
Referring again to
An exemplary method for the spectral warper
It is possible to compress the frequency ranges Z
In a similar manner, it is possible to expand or spread the frequency ranges 0 to Z
The warped spectral magnitude values A′(i), i=0, 1, ..., K′−1, are obtained by concatenating the magnitude values in the four warped groups. The total number of warped spectral magnitude values K′ will likely be different than the original number of spectral magnitude values K. Further, it is possible to perform only compression of particular groups or only spreading of other groups to produce the warped spectral magnitude values A′(i) according to the invention.
The previously described warping method first performs the discrete Fourier transformation to generate a sequence of spectral magnitude values A(i) characterizing the short-term frequency spectrum of a digitized speech frame S
Moreover, the previously described warping methods for spreading and compressing the spectral characterization of the short-term frequency spectrum in a voiced speech frame are based on piece-wise linear warping functions for illustration purposes only. It should be readily understood that the frequency warping can also be performed by other invertible warping functions.
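The piece-wise linear spreading and compressing described above can be sketched as follows. The breakpoint lists `src_breaks` and `dst_breaks`, the function name `warp` and the use of linear interpolation between samples are hypothetical choices for this illustration; the actual frequency ranges and group boundaries of the embodiment are not reproduced here.

```python
import numpy as np


def warp(mags, src_breaks, dst_breaks):
    """Resample a magnitude sequence through a piece-wise linear index map.

    src_breaks are indices into the input sequence and dst_breaks are the
    indices they map to in the output. A segment that grows is spread and a
    segment that shrinks is compressed. The output has dst_breaks[-1] + 1
    values, so its length K' generally differs from the input length K.
    """
    mags = np.asarray(mags, dtype=float)
    src = np.asarray(src_breaks, dtype=float)
    dst = np.asarray(dst_breaks, dtype=float)
    if src[0] != 0 or dst[0] != 0 or src[-1] != len(mags) - 1:
        raise ValueError("breakpoints must span the whole sequence")
    if np.any(np.diff(src) <= 0) or np.any(np.diff(dst) <= 0):
        raise ValueError("breakpoints must be strictly increasing")
    out_idx = np.arange(int(dst[-1]) + 1)
    # Map every output index back to a (fractional) input position.
    src_pos = np.interp(out_idx, dst, src)
    # Read the input sequence at those positions by linear interpolation.
    return np.interp(src_pos, np.arange(len(mags)), mags)


if __name__ == "__main__":
    mags = np.arange(9, dtype=float)          # K = 9 values
    src, dst = [0, 4, 8], [0, 8, 10]          # spread the low half, compress the high half
    warped = warp(mags, src, dst)             # K' = 11 values
    restored = warp(warped, dst, src)         # inverse warp: swap the breakpoint lists
    print(len(warped), warped, restored)
```

Swapping the two breakpoint lists gives the inverse warp. The index mapping itself is invertible, but resampling a compressed range discards detail, so the round trip is exact only for sequences that are linear within each segment.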
The particular warping process used for the spectral magnitude value sequence A(i) for respective voiced speech frame intervals can, for instance, be chosen from a codebook of transforms. In such instance, the signal W is generated by the spectral warper
The warped sequence of spectral magnitude values A′(i) generated by the spectral warper
When the value N is negative, the linear predictive analysis of the transformed spectrum represented by the sequence A″(i) effectively provides an all-zero spectrum representation for the spectrum represented by the sequence A′(i). When the order of the linear predictive analysis is relatively small, such as less than 30, it is often advantageous to use a value N corresponding to −1/B, where B is greater than one, to reduce the dynamic range of the spectrum. Such a reduction of the dynamic range of the spectrum effectively shortens its time response, facilitating the subsequent modeling of the spectrum by an all-zero filter of smaller order. Although the non-linear transformation was previously described with a negative value N, it is alternatively possible to use a positive value N, that is not equal to one, to produce a corresponding all-pole spectrum representation according to the invention.
The previously described non-linear transformation is a fixed transformation and is typically known by a corresponding decoder for decoding the coded speech signal according to the invention. However, it is alternatively possible for the non-linear transformation to base the value N on a particular property of the current or previously processed speech frame such as, for example, the pitch period duration X that is provided in the coded signal received from the channel. The value N of the non-linear transformation can also be determined from a codebook of transformations.
In such instance, the corresponding codebook index is included in the coded signal produced by the channel coder
The transformed and warped sequence A″(i) generated by the transformer
The generated autocorrelation coefficients are then provided to a P-th order linear predictive analyzer
The exemplary embodiment of the short-term frequency spectrum processor
An exemplary decoder
The short-term frequency spectrum decoder
The filter
An exemplary configuration for the short-term frequency spectrum decoder
The LPC's generated by the inverse transformer
Each of the spectral magnitude values Ā″(i) generated by the block
The inverse transformed spectral magnitude value sequence Ā″(i) generated by the processor
Although the previously described signal W indicates a respective codebook entry, it is alternatively possible for the signal W to indicate the particular employed spectral warping operation performed by the encoder for the short-term frequency spectrum of respective speech frames in another manner. Also, the warping signal W can be omitted if the employed warping function for a coded speech frame is based on a property of the speech frame such as, for example, the duration of the pitch period.
In such a system, the signal X indicating the pitch period duration for the interval should also be provided to the inverse warper
In operation, if the spectral warper
Each of the K″ inverse warped and transformed magnitude values in the sequence Ā(i) is then squared by squarer
The reciprocal sequence of power spectral values produced by the processor
Although the exemplary short-term frequency spectrum decoder
The method for encoding the short-term frequency spectrum of speech signals according to the invention has been described with respect to vocoder-type speech coders in
Referring to the CELP coder
The difference between the encoders
In addition, a stochastic codebook or code store
For each speech frame, the synthesis filter
Then, a peak picker
The decoder
Although several embodiments of the invention have been described in detail above, many modifications can be made without departing from the teaching thereof. All such modifications are intended to be encompassed within the following claims. For example, although the previously described embodiments have employed LPC analysis to code the non-linear transformed and/or warped spectral parameters, such coding can be performed by numerous alternative techniques according to the invention. It is possible for such alternative techniques to include those techniques that code the frequency components of the short-term frequency spectrum by methods other than coding based on a corresponding perceptual quality or accuracy that such components would have in corresponding synthesized speech.
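The path from a spectral magnitude sequence to linear predictive coefficients referred to in the description (an inverse transformation into a time domain representation followed by P-th order linear predictive analysis) can be sketched as follows. The sketch assumes that the time domain representation is the autocorrelation sequence obtained by an inverse DFT of the squared magnitudes, which is a standard relation and not necessarily the exact processing of the embodiments. The function names and the one-sided (DC to Nyquist) layout of the magnitude sequence are assumptions of this illustration.

```python
import numpy as np


def autocorrelation_from_magnitudes(mags, max_lag):
    """Return autocorrelation values r[0..max_lag] from one-sided magnitudes.

    mags is assumed to hold magnitude samples from DC to the Nyquist
    frequency inclusive. The inverse real DFT of the power spectrum is the
    (circular) autocorrelation of the underlying time signal.
    """
    power = np.asarray(mags, dtype=float) ** 2
    r = np.fft.irfft(power)
    return r[: max_lag + 1]


def lpc_from_autocorrelation(r, order):
    """Solve the Toeplitz normal equations for predictor coefficients.

    Returns [1, a_1, ..., a_P] for the polynomial 1 + a_1 z^-1 + ... + a_P z^-P.
    A direct solve is used here for brevity instead of a recursive method.
    """
    r = np.asarray(r, dtype=float)
    toeplitz = np.array(
        [[r[abs(i - j)] for j in range(order)] for i in range(order)]
    )
    a = np.linalg.solve(toeplitz, -r[1 : order + 1])
    return np.concatenate(([1.0], a))


if __name__ == "__main__":
    # Magnitudes sampled from the all-pole spectrum 1 / |1 - 0.9 e^{-jw}|.
    w = np.linspace(0.0, np.pi, 257)
    mags = 1.0 / np.abs(1.0 - 0.9 * np.exp(-1j * w))
    r = autocorrelation_from_magnitudes(mags, 2)
    print(lpc_from_autocorrelation(r, 2))
```

For magnitudes sampled from the all-pole spectrum in the example, a second-order analysis returns coefficients close to [1, -0.9, 0]. In a coder of the kind described, the input would be the warped and/or non-linearly transformed sequence rather than the raw magnitudes.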