CLAIM OF PRIORITY

[0001]
This application is a divisional application of a copending U.S. Utility Application, entitled, “Apparatus and Quality Enhancement Algorithm for Mixed Excitation Linear Predictive (MELP) and Other Speech Coders,” to Unno et al., filed Sep. 29, 1999, granted Ser. No. 09/408,195, which is incorporated herein by reference in its entirety.
FIELD OF THE INVENTION

[0002]
The present invention relates to speech signal coding using a parametric coder to model a speech waveform. The speech signal parameters are communicated via a communications channel and used to synthesize the speech waveform at the receiver. More specifically, the present invention enhances the speech quality and reduces the computations of the mixed excitation linear predictive (MELP) speech coder.
BACKGROUND OF THE INVENTION

[0003]
Low bitrate speech coding technology is widely used for digital voice communication in narrow-bandwidth channels. The common objective of this technology is to transfer the digital speech signal information at a low bit rate (typically 2,400 bits/sec) while providing good quality speech synthesis at the destination. This technology also strives to provide low computational complexity, low memory requirements, and a small algorithmic delay, particularly for real-time, low-cost voice communications. FIG. 1A illustrates the general environment surrounding speech encoders and decoders as used in a one-way communications system. Full duplex communications are easily enabled by integrating both an encoder and decoder at both ends of the communications system.

[0004]
The first widely-used low bitrate speech coder was the Federal Standard linear predictive coding (LPC) vocoder (FS1015), in which either a periodic pulse train or white noise excites an all-pole filter in order to synthesize speech. While the 2.4 kbps bit rate was attractive, the LPC vocoder was not acceptable for many speech applications as users characterized the synthesized speech as synthetic and buzzy.

[0005]
The LPC vocoder analyzes the speech waveform and extracts parameters such as filter coefficients, pitch period, voicing decision, and gain, which are updated every 20-30 ms and transmitted over the communications channel. The artifacts residing in the traditional LPC vocoder include buzzes, clicks, and tonal noise. In addition, the speech quality is very poor in the presence of background noise. These unintended additions to the synthesized speech are the result of the simple excitation model and the binary voicing decision error.

[0006]
Over the years, several low bitrate speech coding algorithms have been developed, and some state-of-the-art coders now provide a good natural quality. The mixed excitation linear predictive (MELP) coder is one of these speech coders. The MELP coder is a linear-prediction-based speech coder, which includes five features not found in the LPC vocoder: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features improve the synthesized speech quality by removing distortions resident in the LPC vocoder. FIG. 1B and FIG. 1C illustrate block diagrams of the MELP encoder and decoder, respectively.

[0007]
However, the MELP still has some perceivable distortions, particularly around the non-stationary speech segments and for some low-pitch male speakers. These distortions can also be observed with other low bitrate speech coders. The distortion around the non-stationary speech segments results from the update of speech parameters at a low frame rate (typically 30-50 frames/sec). It is known that increasing the frame rate helps to solve this problem. Unfortunately, this solution requires a much higher bit rate. Another possible solution is a variable frame-rate system that updates the speech parameters in the less stationary segments at a higher frame rate while maintaining a low frame rate in the stationary segments. Such an approach is provided by the delayed decision approach based on dynamic programming, which uses the future frame information to control the frame rate. This system can produce high-quality speech while maintaining a relatively low bit rate by reducing the average frame rate. However, this method requires a considerably longer algorithmic delay (around 150 ms), which is unacceptable in many applications (such as two-way voice communications).

[0008]
The distortion for low-pitch male speakers in the MELP is characterized by a high-pass filtered quality of the coded speech. In other words, the synthesized speech lacks “sound pressure” in the low frequencies. This distortion is caused by a post filter and a preprocessing high-pass filter, which are used in modern low bitrate speech coders to remove 60 Hz noise and to enhance the coded speech quality. These filters suppress the harmonic magnitudes in the low frequencies, particularly for low-pitch male speakers whose fundamental frequencies are less than 100 Hz. The suppression of these low-frequency harmonics results in high-pass filtered speech that is perceived as too synthetic.

[0009]
The most significant speech distortion present in the prior art is the lack of a suitable model or method to accurately synthesize a plosive sound. Plosive sounds are characterized by the sudden opening or closing of the vocal cords. Plosive phonemes are created when most English-speaking persons create sounds such as: “b,” “d,” “g,” “k,” “p,” “t,” “th,” “ch,” or “tch.” It is important to note that the preceding list of plosive phonemes is not exclusive and that not all speakers will create like sounds. Plosive phonemes may be created both at the start and at the end of syllables (e.g., “pop,” “tank,” “tot”), at the end of syllables (e.g., “sound,” “sat,” “shrug”), or at the start of syllables (e.g., “toy,” “boy,” “boss”). Plosive sounds are easily identified in a speech waveform but difficult to model and synthesize in low bitrate speech coders. Plosive sounds are characterized by an impulse of energy followed by a brief period where the speech waveform is aperiodic. Prior art speech encoders have been unable to model and synthesize plosive sounds in a manner acceptable to the human ear.
SUMMARY OF THE INVENTION

[0010]
As described briefly, an object of the present invention is to enhance the coded speech quality of the existing low bitrate speech coders including the MELP vocoder while maintaining its low bit rate, small algorithmic delay, and low computational complexity.

[0011]
Another object of the present invention is to provide an efficient mixed excitation algorithm to reduce the computational complexity of the existing MELP vocoder. Another object of the present invention is to provide bitstream compatibility with the existing MELP vocoder in order to permit the introduction of the invention into systems where only the present MELP decoder is available. This would allow for backward compatibility through the introduction of an updated encoder while allowing for full system upgrades where both the encoder and the decoder could be updated.

[0012]
The present invention provides four embodiments. The first is a robust pitch-detection algorithm. In the encoder, the fixed-length pitch analysis window is manipulated around the original position to seek the position that contains the signal with the highest pitch correlation. Once the window position is determined, pitch is estimated using the signal that is contained in the selected window. Other parameters such as LPC coefficients, gain, and voicing decision are also estimated using the signal corresponding to the selected window. The estimated parameters are used to synthesize the coded speech in the decoder on each sample window in the same manner as earlier fixed-position windows in the prior art.

[0013]
The second embodiment is a plosive analysis/synthesis method. In the encoder, the system first detects the frame that contains the plosive signal. The plosive detection is performed with sliding-window peakiness analysis. The detected plosive signal is quantized to only a small number of bits and transmitted via the communication channel to the decoder. In the decoder, the plosive signal is synthesized independently and added back to the coded speech.

[0014]
The third embodiment is a postprocessor for the Fourier magnitude model. In the decoder, the harmonic magnitudes of the coded speech in the low frequencies are emphasized to overcome the muffling effect of the high-pass filter. In this way, the decoded speech is synthesized without the muffling effect often observed in the high-pass filtered speech of current low bitrate speech encoders.

[0015]
The fourth embodiment is a new mixed-excitation algorithm. In the decoder, a pulse train is mixed with random noise in the frequency domain in unvoiced frequency bands to eliminate the bandpass filtering operations, which are required to generate the mixed-excitation signal in the existing MELP coder. The elimination of the filters results in a significant reduction of computational complexity in the MELP decoder. As a result, the present system is shown to be compatible in terms of bitstream and is interchangeable with the coder/decoder of the existing MELP speech coder.
BRIEF DESCRIPTION OF THE DRAWINGS

[0016]
The present invention will be more fully understood from the accompanying drawings of the embodiments of the invention, which however, should not be taken to limit the invention to the specific embodiments enumerated, but are for explanation and for better understanding only. Finally, like reference numerals in the figures designate corresponding parts throughout the drawings.

[0017]
FIG. 1A is a block diagram of a communications system having a MELP speech encoder and decoder;

[0018]
FIG. 1B is a block diagram illustrating the MELP encoder of FIG. 1A;

[0019]
FIG. 1C is a block diagram illustrating the MELP decoder of FIG. 1A;

[0020]
FIG. 2A is a block diagram highlighting the new embodiments of the present system;

[0021]
FIG. 2B is a block diagram illustrating the new encoder of FIG. 2A;

[0022]
FIG. 2C is a block diagram illustrating the new decoder of FIG. 2A;

[0023]
FIG. 3A illustrates plosive signal types and locations in a sample sentence and reveals how plosive sounds remain undetected in the prior art;

[0024]
FIG. 3B illustrates plosive signal synthesis in coded speech;

[0025]
FIG. 3C illustrates a typical LPC residual waveform for a plosive signal;

[0026]
FIG. 3D illustrates the Fourier spectrums of an original plosive sound along with the replacement plosive model;

[0027]
FIG. 3E illustrates the Fourier spectrums of an original plosive sound with a click with the replacement plosive model;

[0028]
FIG. 4 illustrates the relative time shifting in the robust pitch detector shown in FIG. 2B;

[0029]
FIG. 5 illustrates a block diagram of the plosive analysis/synthesis system of the present invention as shown in FIG. 2B and FIG. 2C;

[0030]
FIG. 6 illustrates the plosive detector of the present invention as shown in FIG. 5;

[0031]
FIG. 7 illustrates a block diagram of the plosive synthesizer of the present invention as shown in FIG. 5;

[0032]
FIG. 8 illustrates a block diagram of the postprocessor for the Fourier magnitude of the present invention as shown in FIG. 2C;

[0033]
FIG. 9 illustrates a block diagram of the new mixed excitation method of the present invention as shown in FIG. 2C;

[0034]
FIG. 10 illustrates the flow diagram of bit packing for the plosive signal parameters within voiced and unvoiced frames;

[0035]
FIG. 11 illustrates the flow diagram of the bit unpacking for the plosive signal parameters for voiced and unvoiced frames;

[0036]
FIG. 12 illustrates words with plosive sounds;

[0037]
FIG. 13 illustrates the replacement of different plosive types in the present invention;

[0038]
FIG. 14 reveals the bit allocation for the plosive signal model;

[0039]
FIG. 15 reveals the 99-level Pitch and Voicing level quantization in the existing MELP;

[0040]
FIG. 16A reveals the bit allocation in the existing MELP frame; and

[0041]
FIG. 16B reveals the bit transmission order in the existing MELP frame.
DESCRIPTION OF THE PREFERRED EMBODIMENT

[0042]
The present invention is embedded in the existing MELP coder as shown in FIG. 2A to enhance coded speech quality. It will be apparent to those skilled in the art that the MELP coder can be replaced with other low bitrate speech coders that are based on a parametric speech coding algorithm in order to practice the current invention. The present invention consists of four embodiments. The first embodiment, a robust pitch detector, is shown as 52 in FIG. 2A. The robust pitch detector 52 replaces a portion of the refinement of pitch and voicing decision 37 in the MELP coder and does not require additional bits for transmission.

[0043]
The second embodiment, the plosive analysis/plosive synthesis function, is illustrated in FIG. 2A. Plosive analysis 55 is added to the encoder. Plosive synthesis 59 is added to the decoder and requires two bits for transmission.

[0044]
The third embodiment, a postprocessor for the Fourier magnitude 62, is shown in FIG. 2A. It is added to the decoder and does not require additional bits for transmission.

[0045]
The fourth embodiment, a new mixed excitation 35, is also shown in FIG. 2A. It replaces the mixed excitation method of the prior art. The new mixed excitation 35 is embedded in the decoder, and does not require additional bits for transmission.

[0046]
MELP Encoder

[0047]
FIG. 1B illustrates a block diagram of the processing flow within the MELP encoder. A frame of speech data is processed by the MELP coder every 22.5 ms. Each frame contains 180 voice samples at a sampling rate of 8,000 samples per second. The MELP is a parametric speech coder that creates a 54-bit-per-frame concatenated code that is used by the MELP decoder to synthesize the speech waveform at the receiver. Each frame contains the following parameters: Line Spectral Frequencies (LSFs), Fourier Magnitudes, Gain, Pitch, Bandpass Voicing, Aperiodic Flag, Error Protection (in unvoiced frames only), and a synchronization bit.
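The frame figures given above are mutually consistent, as the following short arithmetic check illustrates. It is only a sketch; the constant names are illustrative and are not taken from the MELP specification.

```python
# Frame timing and bit-rate arithmetic for the figures quoted above.
# The constant names are illustrative assumptions.
SAMPLE_RATE_HZ = 8000      # samples per second
SAMPLES_PER_FRAME = 180    # voice samples in one frame
BITS_PER_FRAME = 54        # size of the concatenated code for one frame

frame_ms = SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE_HZ
frames_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_FRAME
bit_rate_bps = BITS_PER_FRAME * SAMPLE_RATE_HZ / SAMPLES_PER_FRAME

print(frame_ms, round(frames_per_second, 2), bit_rate_bps)
```

The resulting bit rate matches the 2,400 bits/sec figure cited in the background discussion.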

[0048]
Input speech is encoded as follows. First, the input speech signal is processed through high-pass filter 11 with a cutoff frequency of 60 Hz to remove low-frequency noise. A buffer containing the most recent samples of the actual input speech signal is maintained in the encoder. One of the samples is identified as the last sample of the current frame. The buffer contains samples that extend beyond the current frame both in the past and into the future to enable the coding process. This designated last sample of the frame is the reference point for many of the encoder calculations.

[0049]
Next, the speech signal is bandpass filtered into five frequency bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz for voicing analysis. An initial pitch estimation is made using the 0-500 Hz filter output signal. The measurement is centered on the filter output produced when its input is the last sample in the current frame. The initial pitch estimation from the first bandpass filter is used as the initial reference point for robust pitch detector 52 (FIG. 2B). For each of the remaining frequency bands, the bandpass voicing strength is determined using the pitch determined by the robust pitch detector 52 described below. The time envelopes of each of the bandpass filter outputs are calculated by full-wave rectification followed by a smoothing filter. The analysis windows for each of the remaining frequency bands are centered on the last sample in the current frame as in the case of the first band.

[0050]
Robust Pitch Detection

[0051]
Most low bitrate speech coders use the normalized pitch correlation to estimate pitch lag. In the MELP coder, the pitch correlation is also used to make bandpass voicing decisions. The normalized pitch correlation r(T) is computed with the signal in the fixed-position analysis window in the prior art as follows:
$\begin{array}{cc}r(T)=\frac{c_{T}(0,T)}{\sqrt{c_{T}(0,0)\,c_{T}(T,T)}},\quad c_{T}(m,n)=\sum_{k=-\frac{T}{2}-\frac{N}{2}}^{-\frac{T}{2}+\frac{N}{2}-1}s_{k+m}\,s_{k+n},& \mathrm{Eq.}\ (1)\end{array}$

[0052]
where s_{k} is the kth sample in the fixed-position window, s_{0} is the signal at the center of the fixed-position window, T is a pitch lag, and N is the number of samples accumulated for the correlation computation.

[0053]
The binary voicing decision forces the MELP to use either periodic pulse or noise excitation for each frequency band even in frames containing an irregular or ill-defined pitch. As a result, noise excitation for bands inappropriately designated as noise, or pitch excitation inappropriately matched with an inaccurate pitch lag, leads to distortion in transitions. To solve this problem, a sliding-sample window is used in the present invention. This method seeks the pitch analysis window position that provides the highest pitch correlation by sliding the window around the original position. This is equivalent to using a more periodically stable signal rather than using a portion of the signal with an irregular pitch for pitch analysis. By using a periodically stable portion of the signal for pitch analysis, the present invention avoids inappropriate voicing decisions and pitch estimates, thus reducing the artifactual noise in the non-periodically stable signal segments.

[0054]
FIG. 4 shows a robust pitch detector used in the present invention. In FIG. 4, the normalized pitch correlation in the window 43 is first computed in the same manner as the fixed window pitch detection as shown in Equation (1), where s_{k} is the kth signal and s_{0} is the signal at the center of the original fixed-position window. The normalized pitch correlation in the window 43 is computed recursively as follows:
$\begin{array}{cc}r_{i}(T)=\frac{c_{T}(i,T+i)}{\sqrt{c_{T}(i,i)\,c_{T}(T+i,T+i)}},\quad \mathrm{where}\quad c_{T}(i,j)=c_{T}(i-1,j-1)+s_{i-\frac{T}{2}+\frac{N}{2}-1}\,s_{j-\frac{T}{2}+\frac{N}{2}-1}-s_{i-1-\frac{T}{2}-\frac{N}{2}}\,s_{j-1-\frac{T}{2}-\frac{N}{2}}& \mathrm{Eq.}\ (2)\end{array}$

[0055]
In each window, the maximum normalized pitch correlation r_{i}(T_{i}) and the associated pitch lag T_{i} are determined, and the final pitch lag is selected as the pitch lag associated with the maximum normalized pitch correlation r(T) in all windows as follows:
$\begin{array}{cc}r(T)=\max_{-N_{s}\le i\le N_{s}-1}\left[\max_{T}\left\{r_{i}(T)\right\}\right],& \mathrm{Eq.}\ (3)\end{array}$

[0056]
where N_{s} is the maximum window-sliding range from the original fixed-position window. In the present invention, an LPC parameter, a gain, bandpass voicing decision, and fractional pitch are computed using the signal in the window that maximizes the normalized pitch correlation. A direct implementation of Equation (2) solving for r_{i}(T) for all values of i would result in a significant increase in the computational complexity. To reduce the additional complexity, the recursion in Equation (2) for c_{T}(i, j) is used to compute the autocorrelation.
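The sliding-window search of Equations (1) through (3) can be sketched as follows. This is a minimal illustration rather than the coder's actual implementation: the function names, the `center` index into the sample buffer, and the use of integer division for T/2 and N/2 are assumptions made for the example.

```python
import math

def c_direct(s, center, T, N, m, n):
    """Direct sum of Eq. (1): c_T(m, n) over k = -T/2 - N/2 .. -T/2 + N/2 - 1.
    Integer division for T/2 and N/2 is an assumption of this sketch."""
    lo = -(T // 2) - N // 2
    hi = -(T // 2) + N // 2 - 1
    return sum(s[center + k + m] * s[center + k + n] for k in range(lo, hi + 1))

def c_update(prev, s, center, T, N, i, j):
    """Recursion of Eq. (2): obtain c_T(i, j) from c_T(i-1, j-1) by adding the
    product entering the window and subtracting the product leaving it."""
    new = s[center + i - T // 2 + N // 2 - 1] * s[center + j - T // 2 + N // 2 - 1]
    old = s[center + i - 1 - T // 2 - N // 2] * s[center + j - 1 - T // 2 - N // 2]
    return prev + new - old

def robust_pitch(s, center, lags, N, Ns):
    """Eq. (3): return (best correlation, lag, window offset) over all offsets
    i in [-Ns, Ns - 1] and all candidate lags."""
    best = (-1.0, None, None)
    for T in lags:
        i = -Ns
        c00 = c_direct(s, center, T, N, i, i)
        c0T = c_direct(s, center, T, N, i, T + i)
        cTT = c_direct(s, center, T, N, T + i, T + i)
        while True:
            prod = c00 * cTT
            r = c0T / math.sqrt(prod) if prod > 0.0 else 0.0
            if r > best[0]:
                best = (r, T, i)
            i += 1
            if i > Ns - 1:
                break
            c00 = c_update(c00, s, center, T, N, i, i)
            c0T = c_update(c0T, s, center, T, N, i, T + i)
            cTT = c_update(cTT, s, center, T, N, T + i, T + i)
    return best
```

Only the first window position for each lag needs the full N-term sum; every subsequent position costs two multiplications per correlation term, which is the saving the recursion is intended to provide.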

[0057]
The aperiodic flag is set to 1 if V_{bp1}, determined in the voicing analysis for the 0 to 500 Hz bandpass, is less than 0.5 and set to 0 otherwise. When set, the flag informs the decoder that the voiced component of the excitation should be aperiodic.

[0058]
A 10th-order linear prediction analysis is performed on the input speech signal using a 200-sample (25 ms) Hamming window centered on the last sample in the current frame. A traditional autocorrelation analysis procedure is implemented using Levinson-Durbin recursion. In addition, a bandwidth expansion constant of 0.994 (15 Hz) is applied to the prediction coefficients by multiplying each coefficient by the bandwidth expansion constant.

[0059]
Next, a linear prediction residual signal is calculated by filtering the input speech signal with the prediction filter using the coefficients determined above and an inverse of the prediction filter using those same coefficients. The two resulting signals are summed to create the linear prediction residual signal.

[0060]
Plosive Analysis

[0061]
The plosive analysis/synthesis system of the current invention consists of three parts: plosive detection, plosive modeling, and plosive synthesis. FIG. 5 shows the plosive analysis/synthesis system.

[0062]
Plosive Detection

[0063]
With reference to FIG. 5, the plosive detector 56 uses a sliding window for “peakiness” computation to detect the frame that contains a plosive signal. The peakiness value is sensitive to the phase of the plosive signal. By using a sliding window to detect a window position that maximizes the peakiness value, the phase sensitivity of the plosive is reduced. The peakiness, P, is defined as a ratio of the L2 norm to the L1 norm of the signal:
$\begin{array}{cc}P=\frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}r_{n}^{2}}}{\frac{1}{N}\sum_{n=0}^{N-1}\left|r_{n}\right|},& \mathrm{Eq.}\ (4)\end{array}$

[0064]
where r_{n} is an LPC residual signal and N is a frame size. As shown in FIG. 6, the plosive detector slides the peakiness analysis window 63 to find the maximum peakiness value in all windows. The peakiness of each window is given by:
$\begin{array}{cc}P_{i}=\frac{\sqrt{\frac{1}{N}\sum_{n=0}^{N-1}r_{n+i}^{2}}}{\frac{1}{N}\sum_{n=0}^{N-1}\left|r_{n+i}\right|}=\frac{\sqrt{\frac{1}{N}B_{i}}}{\frac{1}{N}A_{i}},& \mathrm{Eq.}\ (5)\end{array}$

[0065]
where P_{i} is the peakiness of the ith window from the past, and r_{0} is the first LPC residual signal in the original fixed-position window. In FIG. 6, the peakiness in the window 63 (P_{−Ns}) is first computed. The peakiness in the window 63 is computed recursively as follows:
$\begin{array}{cc}A_{i}=A_{i-1}+\left|r_{N-1+i}\right|-\left|r_{i-1}\right|,\quad B_{i}=B_{i-1}+r_{N-1+i}^{2}-r_{i-1}^{2},& \mathrm{Eq.}\ (6)\end{array}$

[0066]
Then, the maximum peakiness value in all windows is used as the peakiness value P of the frame:
$\begin{array}{cc}P=\max_{-N_{s}\le i\le N_{s}-1}\left[P_{i}\right],& \mathrm{Eq.}\ (7)\end{array}$

[0067]
where N_{s} is the maximum window-sliding range, which is also used for the pitch detector of the present invention. The peakiness value with the sliding window is illustrated in FIG. 3A along with that of the fixed-position window and a corresponding speech input waveform. In addition to the peakiness value, the low pass energy is computed and used to distinguish the rapid onset of a vowel from the plosive signal.
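The sliding-window peakiness search of Equations (5) through (7) may be sketched as below. The function name, the `start` index of the original fixed-position window within the residual buffer, and the guard against a zero denominator are assumptions of this example; the low pass energy test mentioned above is not included.

```python
import math

def sliding_peakiness(r, start, N, Ns):
    """Return (maximum peakiness, window offset) over the windows
    r[start + i : start + i + N] for i in [-Ns, Ns - 1].
    A is the running sum of |r| and B the running sum of r^2."""
    i = -Ns
    window = r[start + i : start + i + N]
    A = sum(abs(x) for x in window)
    B = sum(x * x for x in window)
    best_P, best_i = 0.0, i
    while True:
        # Eq. (5): ratio of the RMS value to the mean absolute value.
        P = math.sqrt(max(B, 0.0) / N) / (A / N) if A > 0.0 else 0.0
        if P > best_P:
            best_P, best_i = P, i
        i += 1
        if i > Ns - 1:
            break
        # Eq. (6): add the sample entering the window, drop the one leaving.
        new, old = r[start + N - 1 + i], r[start + i - 1]
        A += abs(new) - abs(old)
        B += new * new - old * old
    return best_P, best_i
```

A residual with roughly constant magnitude yields a peakiness near 1, whereas a single strong impulse inside the window raises the value well above 1, which is why a window that just misses the impulse can fail to flag the frame.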

[0068]
Plosive Modeling

[0069]
In the present invention, a simple model is applied to the plosive signal expression in plosive modeling 57 of FIG. 5 so as to minimize the additional transmission bits. FIG. 12 shows the plosive signals detectable in the English language. Analysis of the frequency spectrums associated with the identified plosive sounds in FIG. 12 reveals that the 28 separate plosive sounds could be closely represented by the frequency spectrums of 18 replacement plosive sounds by aligning the maximum amplitude positions of each plosive signal. Near transparent replacement requires at least a rough spectral fit for each frequency. FIG. 13 illustrates the replacement matrix for the plosive sounds in the current invention.

[0070]
In this model, all plosive signals p(n) are produced by scaling and LPC synthesis filtering the single prestored template LPC residual signal v(n) as follows:
$\begin{array}{cc}p(n)=g_{p}\,v(n)+\sum_{i=1}^{P}a_{i}\,p(n-i),& \mathrm{Eq.}\ (8)\end{array}$

[0071]
where g_{p} is the scaling factor based on the energy of the input plosive signal, and a_{i} are the LPC coefficients computed from the input plosive signal. The template plosive signal v(n) was chosen arbitrarily and filtered with the 14th-order inverse linear prediction filter. Since only a rough spectral fit between the input and the synthesized plosive signals provides a near transparent sound, an accurate LPC analysis is not required for the input plosive signal. In order to minimize the additional bits required for the plosive model, the same 10th-order LPC model used for voiced pitch modeling is used for the production of the plosive signal.
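The scaling and LPC synthesis filtering of Equation (8) amounts to an all-pole recursion driven by the scaled template. The sketch below assumes the usual synthesis-filter form in which each output sample feeds back the previous outputs weighted by the coefficients a_{1}..a_{P}; the function and argument names are illustrative.

```python
def synthesize_plosive(template, a, gain):
    """Eq. (8) as an all-pole recursion: p(n) = gain * v(n) + sum_i a[i-1] * p(n - i).
    template is the stored residual v(n); a holds a_1..a_P."""
    out = []
    for n, v in enumerate(template):
        acc = gain * v
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:          # samples before the template start are taken as zero
                acc += ai * out[n - i]
        out.append(acc)
    return out

# A unit impulse through a one-pole filter decays geometrically.
print(synthesize_plosive([1.0, 0.0, 0.0, 0.0], [0.5], 2.0))
```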

[0072]
The parameters for transmission are a plosive flag, a plosive location, and plosive gain. The gain is computed by comparing the energy of the LPC residual of the plosive signal with that of the template signal. For the specific embodiment of the present invention, the gain is quantized with two bits. The position of the plosive signal is identified by seeking the maximum amplitude position in the frame and representing the plosive signal position with one bit in either the first half or the second half of the current frame. Thus, for the specific embodiment of the present invention, the plosive signal is quantized with only four bits including one bit for a plosive flag, two bits for a plosive gain and one bit for plosive position as is shown in FIG. 14. In the present invention, plosive synthesis is performed in the MELP decoder and will be disclosed in the description of the decoder.

[0073]
Next, the input speech signal gain is measured twice per frame using a pitch adaptive window length. This adaptive length is identical for both gain measurements and is determined as follows. When V_{bp1}>0.6, the length is the shortest multiple of P_{2} which is longer than 120 samples. If this length exceeds 320 samples, it is divided by 2. When V_{bp1} is less than or equal to 0.6, the window length is 120 samples. The gain calculation for the first window produces G_{1} and is centered 90 samples before the last sample of the current frame. The calculation for the second window produces G_{2} and is centered on the last sample of the current frame. The gain is the RMS value, measured in dB, of the signal in the window s_{n}:
$\begin{array}{cc}G_{i}=10\log_{10}\left(0.01+\frac{1}{L}\sum_{n=1}^{L}s_{n}^{2}\right),& \mathrm{Eq.}\ (9)\end{array}$

[0074]
where, L is the window length. The 0.01 offset prevents the log argument from approaching zero. If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is −32768 to 32767.
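The window-length rule and the gain formula of Equation (9) can be illustrated with the following sketch. It assumes an integer pitch value for simplicity; the function names are hypothetical, and window placement within the sample buffer is left to the caller.

```python
import math

def gain_window_length(vbp1, pitch):
    """Pitch-adaptive window length: 120 samples when the low-band voicing
    strength is 0.6 or less, otherwise the shortest multiple of the pitch
    that is longer than 120 samples, halved if it exceeds 320 samples."""
    if vbp1 <= 0.6:
        return 120
    length = pitch * (120 // pitch + 1)
    if length > 320:
        length //= 2
    return length

def gain_db(window):
    """Eq. (9): RMS level in dB with the 0.01 offset, clamped at 0.0 dB."""
    L = len(window)
    g = 10.0 * math.log10(0.01 + sum(x * x for x in window) / L)
    return max(g, 0.0)

print(gain_window_length(0.9, 60), gain_db([1000.0] * 120))
```

For an all-zero window the logarithm evaluates to a negative value, so the clamp returns 0.0.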

[0075]
Next, the encoder performs a quantization of the LPC coefficients. First, the LPC coefficients are converted into line spectrum frequencies (LSFs). All adjacent pairs of the LSF components are organized such that each is in ascending frequency order with a minimum of 50 Hz separation. The resulting LSF vector f is quantized using a multistage vector quantizer. The resulting vector is used in the Fourier magnitude calculation in the decoder.

[0076]
The final pitch value, P_{3}, is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a lookup table. The all zero codeword represents the unvoiced state and is sent if V_{bp1} is less than or equal to 0.6. All 28 codewords with Hamming weight of 1 or 2 are reserved for error protection.
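A uniform quantizer on a logarithmic scale, together with a count of the 7-bit codewords by Hamming weight, can be sketched as follows. The lookup table that maps quantizer levels to codewords is not reproduced here, and the function names are assumptions of the example.

```python
import math

LO, HI, LEVELS = 20.0, 160.0, 99
STEP = (math.log10(HI) - math.log10(LO)) / (LEVELS - 1)

def quantize_pitch(p):
    """Index (0..98) of the nearest level on a log10 scale between 20 and 160."""
    p = min(max(p, LO), HI)
    return int(round((math.log10(p) - math.log10(LO)) / STEP))

def dequantize_pitch(idx):
    """Pitch value represented by a quantizer index."""
    return LO * 10 ** (idx * STEP)

# Partition of the 128 possible 7-bit codewords by Hamming weight.
weights = [bin(code).count("1") for code in range(128)]
reserved = sum(1 for w in weights if w in (1, 2))   # kept for error protection
remaining = sum(1 for w in weights if w >= 3)       # left for pitch levels

print(quantize_pitch(57.0), reserved, remaining)
```

The count shows why 99 levels fit: of the 128 codewords, one is the all-zero unvoiced code and 28 have weight 1 or 2, leaving exactly 99.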

[0077]
The two gain values are quantized as follows. G_{2} is quantized with a 5-bit uniform quantizer ranging from 10 to 77 dB. G_{1} is quantized to 3 bits using the following adaptive algorithm. If G_{2} for the current frame is within 5 dB of G_{2} for the previous frame, and G_{1} is within 3 dB of the average of G_{2} values for the current and previous frames, then the frame is steady-state and a code of all zeros is sent to indicate that the decoder should set G_{1} to the mean of G_{2} values for the current and previous frames. Otherwise, the frame represents a transition and G_{1} is quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G_{2} values for the current and previous frames to 6 dB above the maximum of those G_{2} values.
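The adaptive gain quantization described above may be sketched as follows. The function names are hypothetical, "within" is interpreted here as a strict inequality, and codes 1 through 7 are assumed to index the seven transition levels in ascending order; none of these details is fixed by the text.

```python
def quantize_uniform(x, lo, hi, levels):
    """Index of the nearest of `levels` equally spaced values between lo and hi."""
    x = min(max(x, lo), hi)
    step = (hi - lo) / (levels - 1)
    return int(round((x - lo) / step))

def quantize_g2(g2):
    """5-bit uniform quantizer spanning 10 to 77 dB."""
    return quantize_uniform(g2, 10.0, 77.0, 32)

def quantize_g1(g1, g2_cur, g2_prev):
    """3-bit adaptive code: 0 marks a steady-state frame, 1..7 a transition level."""
    mean = 0.5 * (g2_cur + g2_prev)
    if abs(g2_cur - g2_prev) < 5.0 and abs(g1 - mean) < 3.0:
        return 0
    lo = min(g2_cur, g2_prev) - 6.0
    hi = max(g2_cur, g2_prev) + 6.0
    return 1 + quantize_uniform(g1, lo, hi, 7)

print(quantize_g1(50.0, 51.0, 49.0), quantize_g1(66.0, 60.0, 40.0))
```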

[0078]
Bandpass voicing quantization occurs as follows. When V_{bp1} is less than or equal to 0.6 (unvoiced state), the remaining strengths V_{bpi}, i=2, 3, 4, 5 are set to 0. When V_{bp1} is greater than 0.6, the remaining voicing strengths are quantized to 1.

[0079]
Fourier magnitude calculation and quantization occur as follows. The Fourier magnitudes of the first 10 pitch harmonics are measured from the prediction residual signal generated by the quantized prediction coefficients. The measurement uses a 512-point Fast Fourier Transform (FFT) of a 200-sample window centered at the end of the frame. First, a set of quantized predictor coefficients is calculated from the quantized LSF vector. Then, the residual window is generated using the quantized prediction coefficients. Next, a 200-sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed. Finally, the complex FFT output is transformed into magnitudes and the harmonics are found with a spectral peak-selecting algorithm.

[0080]
The peak-selecting algorithm finds the maximum within a width of 512/P frequency samples centered around the initial estimate for each pitch harmonic, where P is the quantized pitch. This width is truncated to an integer. The initial estimate for the location of the ith harmonic is 512i/P. The number of harmonic magnitudes searched for is limited to the smaller of 10 or P/4. These magnitudes are then normalized to have an RMS value of 1.0. If fewer than 10 harmonics are found, the remaining magnitudes are set to 1.0.
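The peak-selecting step can be illustrated with the sketch below, which operates on a list of FFT magnitudes that has already been computed. The function name, the rounding of the initial estimate to the nearest bin, and the exact placement of the search window around that bin are assumptions of the example.

```python
import math

def pick_harmonics(mag, P, nfft=512, max_harm=10):
    """mag holds FFT magnitudes for bins 0..nfft/2; P is the quantized pitch in samples.
    Returns max_harm magnitudes, normalized to unit RMS and padded with 1.0."""
    width = int(nfft / P)                      # search width, truncated to an integer
    count = min(max_harm, int(P / 4))          # number of harmonics searched for
    peaks = []
    for i in range(1, count + 1):
        center = int(round(nfft * i / P))      # initial estimate of the ith harmonic
        lo = max(center - width // 2, 0)
        hi = min(lo + width, len(mag))
        if hi > lo:
            peaks.append(max(mag[lo:hi]))
    rms = math.sqrt(sum(p * p for p in peaks) / len(peaks)) if peaks else 0.0
    if rms > 0.0:
        peaks = [p / rms for p in peaks]
    return peaks + [1.0] * (max_harm - len(peaks))
```

For a pitch of 20 samples only P/4 = 5 harmonics are searched, so the last five returned values are the 1.0 padding.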

[0081]
The 10 magnitudes are quantized with an 8-bit quantizer. The codebook is searched for a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over higher frequencies. The weights are given by:
$\begin{array}{cc}w_{i}=\left[\frac{117}{25+75\left(1+1.4\left(\frac{f_{i}}{1000}\right)^{2}\right)^{0.69}}\right]^{2},\quad i=1,2,\dots,10& \mathrm{Eq.}\ (10)\end{array}$

[0082]
where f_{i}=8000i/60 is the frequency in Hz corresponding to the ith harmonic for a default pitch period of 60 samples. The weights are applied to the squared difference between the input Fourier magnitudes and the codebook values.
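Evaluating Equation (10) for the ten harmonics of the default 60-sample pitch period shows how the weights fall off with frequency. The sketch below, including the hypothetical `weighted_distance` helper, is only an illustration of how such weights would enter a codebook search.

```python
def fourier_weights(num=10, fs=8000.0, default_pitch=60.0):
    """Eq. (10) evaluated at f_i = fs * i / default_pitch for i = 1..num."""
    weights = []
    for i in range(1, num + 1):
        f = fs * i / default_pitch
        weights.append((117.0 / (25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69)) ** 2)
    return weights

def weighted_distance(mags, codeword, weights):
    """Weighted squared Euclidean distance between magnitudes and a codebook entry."""
    return sum(w * (m - c) ** 2 for w, m, c in zip(weights, mags, codeword))

w = fourier_weights()
print([round(x, 3) for x in w])
```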

[0083]
Lastly, the MELP encoder adds error protection and structures the 54-bit frame as follows. FIG. 16A shows the bit allocation for the MELP coder. To improve performance in channel errors, the unused coder parameters for the unvoiced mode are replaced with forward error correction. Three Hamming (7,4) codes and one Hamming (8,4) code may be used. The (7,4) code corrects single bit errors, while the (8,4) code additionally detects double bit errors. The (8,4) code is applied to the 4 most significant bits (MSBs) of the first multistage vector quantization index, and the 4 parity bits are written over the bandpass voicing. The remaining three bits of the first multistage vector quantization index, along with the reserved bit, are covered by a (7,4) code with the resulting 3 parity bits written to the MSBs of the Fourier series vector quantization index. The 4 MSBs of the G_{2} codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes. Finally, the least significant bit (LSB) of the second gain index and the 3-bit G_{1} codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag bit. The parity generator matrix for the Hamming (7,4) code is:
$\begin{array}{cc}G_{7,4}=\left[\begin{array}{cccc}1& 1& 0& 1\\ 1& 0& 1& 1\\ 0& 1& 1& 1\end{array}\right].& \mathrm{Eq.}\ (11)\end{array}$

[0084]
The parity generator matrix for the Hamming (8,4) code is:
$\begin{array}{cc}{G}_{8,4}=\left[\begin{array}{cccc}1& 1& 0& 1\\ 1& 0& 1& 1\\ 0& 1& 1& 1\\ 1& 1& 1& 0\end{array}\right].& \mathrm{Eq.}\ (12)\end{array}$
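
For illustration only, the parity computation implied by the two generator matrices, together with single-error correction for the (7,4) code, may be sketched as follows. The ordering of the data bits within each protected group and the function names are assumptions of the example; the actual bit ordering is defined by the MELP specification.

```python
# Parity generator matrices of Eq. (11) and Eq. (12), one row per parity bit.
G74 = [[1, 1, 0, 1],
       [1, 0, 1, 1],
       [0, 1, 1, 1]]
G84 = G74 + [[1, 1, 1, 0]]

def parity_bits(data, generator):
    """Parity bits for 4 data bits: generator times data, modulo 2."""
    return [sum(g * d for g, d in zip(row, data)) % 2 for row in generator]

def correct_7_4(data, parity):
    """Return the 4 data bits after correcting at most one error in the 7-bit word."""
    syndrome = [p ^ q for p, q in zip(parity_bits(data, G74), parity)]
    fixed = list(data)
    if any(syndrome):
        for k in range(4):
            # A syndrome equal to column k of G74 points at data bit k.
            if [row[k] for row in G74] == syndrome:
                fixed[k] ^= 1
        # Any other nonzero syndrome means a parity bit was hit; data is intact.
    return fixed
```

Each data bit feeds a distinct set of at least two parity checks, so a single error in a data bit yields a syndrome that matches exactly one column, while a single error in a parity bit yields a syndrome with only one bit set.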

[0085]
FIG. 16A illustrates the bit allocation across the parameters communicated in the 54 bits of each MELP frame. FIG. 16B shows the transmission order for the 54 bits of each MELP frame for both voiced and unvoiced frame modes.

[0086]
MELP Decoder

[0087]
The received bit stream is unpacked from the communications channel 18 and assembled into the parametric codewords. Parameter decoding differs for the voiced and unvoiced frames. Pitch is decoded first as it contains the voiced/unvoiced mode information. If the pitch code is all zeros or has only 1 bit set, then the unvoiced mode is used. If two bits are set, a frame erasure is indicated. Otherwise, the pitch value is decoded and the voiced mode is used.
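
The mode decision above reduces to counting the set bits of the pitch code. A minimal sketch follows; the 7-bit width of the pitch code and the returned labels are assumptions of the example.

```python
def decode_pitch_mode(pitch_code, n_bits=7):
    """Classify a frame from its pitch code (n_bits=7 is an assumed width)."""
    ones = bin(pitch_code & ((1 << n_bits) - 1)).count("1")
    if ones <= 1:
        return "unvoiced"   # all zeros, or only one bit set
    if ones == 2:
        return "erasure"    # two bits set indicates a frame erasure
    return "voiced"         # otherwise the pitch value is decoded
```

Under this rule a single bit error in the all-zero unvoiced code still decodes as unvoiced.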

[0088]
In the unvoiced mode, the (8,4) Hamming code is decoded to correct single bit errors and to detect double bit errors. If an uncorrectable error is detected, a frame erasure is indicated. Otherwise, the (7,4) Hamming codes are decoded, correcting single bit errors.

[0089]
If an erasure is indicated in the current frame, by the Hamming code, by the pitch code, or directly signaled from the communication channel 18, then a frame repeat mechanism is implemented. All of the parameters for the current frame are replaced with the parameters from the previous frame. In addition, the first gain term is set equal to the second gain term so that no gain transitions are permitted.

[0090]
If an erasure is not indicated, the remaining parameters are decoded. The LSFs are checked for ascending order and a minimum separation of 50 Hz. In the unvoiced mode, default parameter values are used for the pitch, jitter, bandpass voicing, and Fourier magnitudes. The pitch value is set to 50 samples, the jitter is set to 25%, the bandpass voicing strengths are set to 0, and the Fourier magnitudes are set to 1.0. In the voiced mode, V_{bpl }is set to 1; jitter is set to 25% if the aperiodic flag is set; otherwise, jitter is set to 0%. The bandpass voicing strength for the upper four bands is set to 1.0 if the corresponding bit is a 1; otherwise, the voicing strength is set to 0.

[0091]
When the special all-zero code for the first gain parameter, G_{1}, is received, some errors in the second gain parameter, G_{2}, can be detected and corrected. This correction process provides improved performance in channel errors.

[0092]
For quiet input signals, a small amount of gain attenuation is applied to both gain parameters using a power subtraction rule. This attenuation is a simplified, frequency invariant case of a smooth spectral subtraction noise suppression method. The background noise estimate is also used in the adaptive spectral enhancement calculation.

[0093]
Gain, G_{1}, is then modified by subtracting a positive correction term, G_{att}, given in dB by:

$\begin{array}{cc}{G}_{att}=-10{\mathrm{log}}_{10}\left(1-{10}^{0.1\left[{G}_{n}+3-{G}_{1}\right]}\right).& \mathrm{Eq.}\ (13)\end{array}$
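
As a numerical illustration of Eq. (13), reading the factor in the exponent as 0.1, the correction term may be computed as sketched below. The function name is hypothetical, and the handling of the case in which G_{1} does not exceed G_{n}+3 dB (where the logarithm is undefined) is an assumption, since that case is not addressed above.

```python
import math

def gain_attenuation_db(g1_db, noise_db):
    """Positive correction term G_att of Eq. (13), in dB."""
    arg = 1.0 - 10.0 ** (0.1 * (noise_db + 3.0 - g1_db))
    if arg <= 0.0:
        # Assumption: the text does not say what happens when G1 <= Gn + 3 dB.
        raise ValueError("G1 must exceed the noise estimate by more than 3 dB")
    return -10.0 * math.log10(arg)
```

With G_{1} lying 13 dB above G_{n}, the argument of the logarithm is 0.9 and the attenuation is a little under 0.5 dB; the attenuation shrinks toward zero as the signal rises further above the noise estimate.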

[0094]
All MELP speech synthesis parameters are interpolated pitch synchronously for each synthesized pitch period. The interpolated parameters are the gain in dB, LSFs, pitch, jitter, Fourier magnitudes, pulse and noise coefficients for mixed excitation, and spectral tilt coefficient for the adaptive spectral enhancement filter. Gain is linearly interpolated between the gain of the prior frame, G_{2p}, and the first gain of the current frame, G_{1}, if the starting point, t_{0} (t_{0}=0, 1, . . . , 179), of the new pitch period is less than 90; otherwise, gain is interpolated between G_{1} and G_{2}. Normally, the other parameters are linearly interpolated between the past and current frame values. The interpolation factor, int, for these parameters is based on the starting point of the new pitch period:

$\begin{array}{cc}\mathrm{int}=\frac{{t}_{0}}{180}& \mathrm{Eq.}\ (14)\end{array}$

[0095]
There are two exceptions to the interpolation procedure. First, if there is an onset with a high pitch frequency, pitch interpolation is disabled and the new pitch is immediately used. This condition is met when G_{1} is more than 6 dB greater than G_{2} and the current frame's pitch period is less than half the prior frame's pitch period. The second exception also involves a gain onset. If G_{2} differs from G_{2p} by more than 6 dB, then the LSFs, spectral tilt, and pitch are interpolated using the interpolated gain trajectory as a basis, since the gain is transmitted twice per frame and has a more accurate interpolation path. In this case, the interpolation factor is given by:
$\begin{array}{cc}\mathrm{int}=\frac{{G}_{\mathrm{int}}-{G}_{2p}}{{G}_{2}-{G}_{2p}},& \mathrm{Eq.}\ (15)\end{array}$

[0096]
where G_{int }is the interpolated gain. This interpolation factor is then clamped between 0 and 1.
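
A minimal sketch of the interpolation factor follows, covering the normal case of Eq. (14) and the gain-onset exception of Eq. (15). The function names are hypothetical, and the exact position of the gain anchor points within the frame (taken here as t_{0}=0, 90 and 180) is an assumption of the example.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def interpolated_gain(t0, g2_prev, g1, g2, frame_len=180):
    """Piecewise-linear gain in dB at pitch-period start t0 (anchor spacing assumed)."""
    half = frame_len // 2
    if t0 < half:
        return g2_prev + (g1 - g2_prev) * t0 / half
    return g1 + (g2 - g1) * (t0 - half) / half

def interpolation_factor(t0, g2_prev, g1, g2, frame_len=180):
    """Factor for LSFs, spectral tilt, and pitch at pitch-period start t0."""
    if abs(g2 - g2_prev) > 6.0:
        # Gain onset: follow the gain trajectory, Eq. (15), clamped to [0, 1].
        g_int = interpolated_gain(t0, g2_prev, g1, g2, frame_len)
        return clamp((g_int - g2_prev) / (g2 - g2_prev))
    return t0 / frame_len  # normal case, Eq. (14)
```

When the gain rises quickly in the first half of the frame, the factor of Eq. (15) moves the other parameters toward their new values earlier than the purely time-based factor of Eq. (14) would.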

[0097]
New Mixed Excitation Algorithm

[0098]
Although the mixed excitation method in the existing MELP coder minimizes the bandpass filtering operations, it still requires two 32^{nd} order FIR filtering operations for a pulse train and noise. The present invention removes these filters to reduce the computational complexity of the existing MELP. FIG. 9 shows a new mixed excitation algorithm in the present invention. The existing MELP uses the Fourier magnitudes to generate a pulse train. The pulse train is mixed with random noise in the time domain by bandpass filtering. In the present invention, noise is mixed with a pulse train in the frequency domain by adding a random phase to the Fourier magnitudes. Block 64 shows the random phase generator. The random phase is added only to the Fourier magnitudes in unvoiced frequency bands. The mixed excitation signal in the present method is given by:
$\begin{array}{cc}{e}_{m}(n)=\frac{1}{2\pi}{\int}_{-\pi}^{\pi}{E}_{M}({e}^{j\omega}){e}^{j\omega n}\,d\omega ,\quad {E}_{M}({e}^{j\omega})={E}_{0}({e}^{j\omega})\end{array}$

[0099]
if ω=0, ω=π, or ω is in the voiced band,

[0100]
otherwise,

$\begin{array}{cc}{E}_{M}({e}^{j\omega})={E}_{0}({e}^{j\omega}){e}^{j\phi},\quad \phi =U[-\alpha \pi ,\alpha \pi ],& \mathrm{Eq.}\ (16)\end{array}$

[0101]
where α is an interpolation coefficient between 0 and 1. Since the existing MELP coder generates a pulse pitch-synchronously, the bandpass voicing decision needs to be linearly interpolated between 0 (voiced) and 1 (unvoiced).
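
By way of illustration only, the frequency-domain mixing may be sketched for a single pitch period as follows. The sketch assumes a zero-phase pulse spectrum E_{0} sampled at the pitch harmonics, a zero DC term, fewer harmonics than half the period length (so that ω=π is never reached), and a discrete sum in place of the integral; the function name and arguments are hypothetical.

```python
import math
import random

def mixed_excitation(magnitudes, unvoiced, alpha, period, rng=None):
    """One pitch period of mixed excitation built in the frequency domain.

    magnitudes[k-1]: Fourier magnitude of harmonic k (k = 1..K, K < period / 2).
    unvoiced[k-1]:   True if harmonic k lies in an unvoiced band.
    alpha:           coefficient between 0 and 1 scaling the phase range.
    """
    rng = rng if rng is not None else random.Random()
    # A random phase drawn from U[-alpha*pi, alpha*pi] is added in unvoiced bands only.
    phases = [rng.uniform(-alpha * math.pi, alpha * math.pi) if u else 0.0
              for u in unvoiced]
    out = []
    for n in range(period):
        acc = 0.0
        for k, (mag, phi) in enumerate(zip(magnitudes, phases), start=1):
            acc += mag * math.cos(2.0 * math.pi * k * n / period + phi)
        out.append(2.0 * acc / period)  # real inverse DFT over conjugate pairs
    return out
```

With every band voiced, or with α equal to 0, all phases are zero and the result is a single pulse at the start of the period; as α approaches 1 the unvoiced harmonics become fully randomized in phase and noise-like.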

[0102]
The adaptive spectral enhancement filter is then applied to the mixed excitation signal. This filter is a 10^{th} order pole/zero filter with additional first order tilt compensation. The coefficients are generated by bandwidth expansion of the LPC filter transfer function A(z), corresponding to the interpolated LSFs. The transfer function of the enhancement filter, H_{ase}(z), is given by:
$\begin{array}{cc}{H}_{\mathrm{ase}}(z)=\frac{A(\alpha {z}^{-1})}{A(\beta {z}^{-1})}\times \left(1+\mu {z}^{-1}\right),& \mathrm{Eq.}\ (17)\end{array}$

[0103]
where,

α=0.5p, β=0.8p, Eq. (18)

[0104]
and the tilt coefficient, μ, is first calculated as max(0.5k_{1}, 0), then interpolated and multiplied by p, the signal probability. The first reflection coefficient, k_{1}, is calculated from the decoded LSFs. By the MELP predictor coefficient sign convention, k_{1} is usually negative for voiced spectra. The signal probability p is estimated by comparing the current interpolated gain, G_{int}, to the background noise estimate, G_{n}, using the formula:
$\begin{array}{cc}p=\frac{{G}_{\mathrm{int}}-{G}_{n}-12}{18}.& \mathrm{Eq.}\ (19)\end{array}$

[0105]
This signal probability is clamped between 0 and 1.
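
The coefficient generation for the enhancement filter may be sketched as follows, reading Eq. (18) as α=0.5p and β=0.8p. The function names and the representation of A(z) as a list of coefficients of ascending powers of z^{-1} are assumptions of the example, and the frame-to-frame interpolation of the tilt coefficient is omitted.

```python
def signal_probability(g_int_db, noise_db):
    """Signal probability p of Eq. (19), clamped between 0 and 1."""
    return max(0.0, min(1.0, (g_int_db - noise_db - 12.0) / 18.0))

def bandwidth_expand(lpc, gamma):
    """Coefficients of A(gamma * z^-1), given A(z) = sum over k of lpc[k] * z^-k."""
    return [a * gamma ** k for k, a in enumerate(lpc)]

def enhancement_filter(lpc, k1, g_int_db, noise_db):
    """Numerator, denominator, and tilt coefficient of the filter of Eq. (17)."""
    p = signal_probability(g_int_db, noise_db)
    numerator = bandwidth_expand(lpc, 0.5 * p)    # A(alpha * z^-1), alpha = 0.5p
    denominator = bandwidth_expand(lpc, 0.8 * p)  # A(beta * z^-1), beta = 0.8p
    tilt = max(0.5 * k1, 0.0) * p                 # interpolation across frames omitted
    return numerator, denominator, tilt
```

When the interpolated gain is within 12 dB of the noise estimate, p is 0, both polynomials collapse to 1, and the filter passes the excitation unchanged.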

[0106]
Linear prediction synthesis is performed by applying the coefficients corresponding to the interpolated LSFs in a direct form filter.

[0107]
Since excitation of the synthesized voice signal is generated at an arbitrary level, a speech gain adjustment must be performed on the synthesized speech. The correct scaling factor, S_{gain}, is computed for each synthesized pitch period of length T by dividing the desired RMS value (G_{int} must be converted from dB) by the RMS value of the unscaled synthetic speech signal s_{n}:
$\begin{array}{cc}{S}_{\mathrm{gain}}=\frac{{10}^{{G}_{\mathrm{int}}/20}}{\sqrt{\frac{1}{T}\sum _{n=1}^{T}{s}_{n}^{2}}}.& \mathrm{Eq.}\ (20)\end{array}$

[0108]
To prevent discontinuities in the synthesized speech, this scale factor is linearly interpolated between the previous and current values for the first ten samples of the pitch period.
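
A minimal sketch of the gain adjustment of Eq. (20) and of the ten-sample transition follows. The function names are hypothetical; the guard for an all-zero pitch period and the exact shape of the ramp (reaching the current scale factor at the tenth sample) are assumptions of the example.

```python
import math

def scale_factor(g_int_db, period_samples):
    """S_gain of Eq. (20): desired RMS (converted from dB) over the measured RMS."""
    rms = math.sqrt(sum(s * s for s in period_samples) / len(period_samples))
    if rms == 0.0:
        return 0.0  # assumption: a silent period is left at zero
    return 10.0 ** (g_int_db / 20.0) / rms

def apply_gain(period_samples, prev_scale, cur_scale, ramp=10):
    """Scale one pitch period, ramping linearly from the previous scale factor."""
    out = []
    for n, s in enumerate(period_samples):
        if n < ramp:
            w = (n + 1) / ramp  # assumed ramp shape over the first ten samples
            g = prev_scale + (cur_scale - prev_scale) * w
        else:
            g = cur_scale
        out.append(g * s)
    return out
```

For a unit-RMS period and G_{int} equal to 20 dB, the scale factor is 10.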

[0109]
The pulse dispersion filter is a 65^{th }order FIR filter derived from a spectrally flattened triangular pulse. The coefficients used in the filter are provided in the Specification for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction herein enclosed for reference.

[0110]
PostProcessor for the Fourier Magnitude Model

[0111]
In the present invention, a postprocessor for the Fourier magnitude model 62 is added to the MELP decoder as shown in FIG. 2A. In the prior art, it was observed that the first few harmonic magnitudes of the coded speech for some low-pitch male speakers were suppressed by the preprocessing highpass filter 11 in FIG. 2B and the adaptive spectral enhancement filter (ASEF) 30 in FIG. 2C. It was found that this effect led to a highpass filtered quality for low-pitch male speakers. To provide more natural speech quality for such speakers, the present invention adaptively emphasizes the harmonic magnitudes in low frequencies by removing the effect of the two filters. The emphasized harmonic magnitude is given by:
$\begin{array}{cc}\left|\tilde{S}({e}^{j{\omega}_{i}})\right|=\left|S({e}^{j{\omega}_{i}})\right|\frac{G}{H({e}^{j{\omega}_{i}})},& \mathrm{Eq.}\ (21)\end{array}$

[0112]
where ω_{i} is the i^{th} harmonic frequency, G is the average Fourier spectrum energy, and S(e^{jω}) is the nonemphasized Fourier magnitude of the i^{th} harmonic. As shown in FIG. 8, the present invention uses the MELP Fourier magnitude parameters, which are the Fourier magnitudes of the LPC residual signal 23, for the harmonic magnitude emphasis rather than using the harmonic magnitude of the synthesized speech S(e^{jω}). From Parseval's theorem, the average Fourier spectrum magnitude G is given by:
$\begin{array}{cc}G=\sum _{n=0}^{N-1}{\left|h(n)\right|}^{2},& \mathrm{Eq.}\ (22)\end{array}$

[0113]
where h(n) is the impulse response of the filter H(e^{jω}), and N is the length of the impulse response. The magnitude response of the filter H(e^{jω}) is given by:

H(e^{jω})=H_{1}(e^{jω})H_{2}(e^{jω}), Eq. (23)

[0114]
where H_{1}(e^{jω}) and H_{2}(e^{jω}) are the magnitude responses of the ASEF 30 and preprocessing highpass filter 11, respectively. To avoid losing the advantage of the ASEF 30 in the prior art, the harmonic magnitude emphasis is applied to only the harmonics that are 200 Hz less than the first formant frequency of the frame. The first formant frequency F_{1} is roughly estimated using quantized line spectrum frequencies (LSFs) as follows:
${F}_{1}=\frac{{\hat{f}}_{1}+{\hat{f}}_{2}}{2},$

[0115]
otherwise,
$\begin{array}{cc}{F}_{1}=\frac{{\hat{f}}_{2}+{\hat{f}}_{3}}{2},& \mathrm{Eq.}\ (23)\end{array}$

[0116]
where ${\hat{f}}_{i}$ is the i^{th} quantized LSF. Based on experimental results, the emphasized harmonic magnitude $\tilde{S}({e}^{j\omega})$ is further emphasized by 2 dB in the present invention.
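
The emphasis of Eq. (21) may be sketched as follows. The sketch assumes that the combined magnitude response H has already been evaluated at each harmonic frequency, and it reads the 200 Hz condition as selecting harmonics whose frequency lies more than 200 Hz below F_{1}; both the function names and that reading are assumptions of the example.

```python
def average_energy(impulse_response):
    """G of Eq. (22): sum of squared impulse-response samples."""
    return sum(x * x for x in impulse_response)

def emphasize_harmonics(magnitudes, filter_response, harmonic_hz, f1_hz, g,
                        extra_db=2.0):
    """Apply Eq. (21) to the low harmonics, with the additional 2 dB emphasis."""
    boost = 10.0 ** (extra_db / 20.0)
    out = []
    for mag, h, f in zip(magnitudes, filter_response, harmonic_hz):
        # Assumed reading: only harmonics more than 200 Hz below F1 are emphasized.
        if f < f1_hz - 200.0 and h > 0.0:
            out.append(mag * g / h * boost)
        else:
            out.append(mag)
    return out
```

Harmonics at or above the threshold are returned unchanged, which preserves the effect of the ASEF near the first formant.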

[0117]
Plosive Synthesis

[0118]
FIG. 7 shows the block diagram of the plosive synthesis 66. As shown in FIG. 7, all plosive signals are produced by scaling and LPC synthesis/filtering 32 the plosive residual template 71, which is prestored in the synthesizer. This plosive residual template 71 was chosen arbitrarily and filtered with the 14^{th} order LPC inverse filter. The LPC coefficients for the frame that contains the plosive 81 are also used for the plosive signal synthesis. The gain of the synthesized plosive signal is adjusted by applying plosive gain 76 to the MELP gain 34. In the present invention, the length of the synthesized plosive signal is half of the frame length, and the synthesized plosive is added back to either the first half or the second half of the coded speech frame according to the plosive position as shown in block 73. Before the plosive is added back to the coded speech, the gain of the coded speech is adjusted in gain suppressor 75 such that the gain of the half frame to which the plosive is added back is suppressed. This is realized by simply replacing the gain of the half frame to which the plosive is added back with that of the previous half frame:

[0119]
g_{i}(0)=g_{i−1}(1), if the plosive position is the first half of the frame; otherwise, g_{i}(1)=g_{i}(0), if the plosive position is the second half of the frame, where g_{i}(j) is the j^{th} gain (j=0,1) in the i^{th} frame. Since plosive detection, modeling and synthesis are performed independently from the MELP coder as shown in FIG. 5, this embodiment can be applied to other low bitrate speech coders.
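
The gain replacement performed by the gain suppressor may be illustrated with the short sketch below. The function name and the representation of the two gains of a frame as a pair are hypothetical.

```python
def suppress_gain(prev_frame_gains, cur_frame_gains, plosive_in_first_half):
    """Replace the gain of the half frame that receives the plosive.

    prev_frame_gains, cur_frame_gains: pairs (g(0), g(1)) for frames i-1 and i.
    """
    g0, g1 = cur_frame_gains
    if plosive_in_first_half:
        g0 = prev_frame_gains[1]  # g_i(0) = g_{i-1}(1)
    else:
        g1 = g0                   # g_i(1) = g_i(0)
    return (g0, g1)
```

In both cases the half frame that receives the plosive takes the gain of the half frame immediately preceding it.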

[0120]
Bit Allocation

[0121]
Another advantage of the present invention is bitstream compatibility with the existing MELP coder. The present invention consists of four embodiments including a robust pitch detector, a plosive analysis/synthesis system, a postprocessor for the Fourier magnitude model and a new mixed excitation algorithm. As shown in FIG. 14, only the plosive analysis/synthesis system requires additional bits for transmission. In the present invention, the additional bits for the plosive can be packed into the bitstream of the existing MELP. There are two different modes for the bit allocation of the existing MELP: one voiced, the other unvoiced. The mode is selected as voiced if the first band is voiced and as unvoiced if the first band is unvoiced. For unvoiced mode, the existing MELP coder sets only the first and fifth band to voiced and the index for a pitch lag is set less than three so as to indicate that the frame is unvoiced. In the decoder, if the index for the pitch lag is less than three, the frame is regarded as unvoiced. Otherwise, the frame is regarded as voiced. In the present invention, a frame that contains a plosive is assumed to be an unvoiced frame. FIG. 10 shows the bit packing flow diagram for the plosive signal. To identify the plosive frame in the decoder of the present invention, the first and the fifth bands are set to voiced but the pitch is set to three as a dummy. Then, a plosive gain and position are packed into the bits for the Fourier magnitude, which is used for the voiced frame in the existing MELP. FIG. 11 shows the bit unpacking flow diagram for the plosive signal. The decoder of the existing MELP regards the frame as unvoiced if the pitch index is less than three. If the pitch index is equal to or greater than three, the combination that only the first and the fifth bands are unvoiced will never occur in the existing MELP. In the decoder of the present invention, the frame is regarded as the plosive frame if this combination occurs. Then, the plosive parameters such as a gain and position are extracted from the bits for the Fourier magnitude. Since the bitstream specification is maintained in the present invention, the present system can interchange the encoder/decoder with the existing MELP.

[0122]
While preferred embodiments of the invention have been disclosed in detail in the foregoing description and drawings, it will be understood by those skilled in the art that variations and modifications thereof can be made without departing from the spirit and scope of the invention as set forth in the following claims.