Publication number | USRE43191 E1 |
Publication type | Grant |
Application number | US 10/621,240 |
Publication date | Feb 14, 2012 |
Filing date | Aug 24, 2004 |
Priority date | Apr 19, 1995 |
Fee status | Paid |
Also published as | US6263307 |
Publication number | 10621240, 621240, US RE43191 E1, US RE43191E1, US-E1-RE43191, USRE43191 E1, USRE43191E1 |
Inventors | Levent M. Arslan, Alan V. McCree, Vishu R. Viswanathan |
Original Assignee | Texas Instruments Incorporated |
Cofiled patent applications with Ser. Nos. 08/424,928, 08/425,125, 08/426,746, and 08/426,427 are copending and disclose related subject matter. These applications all have a common assignee.
The invention relates to electronic devices, and, more particularly, to speech analysis and synthesis devices and systems.
Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 KHz; but the band of 100 Hz to 5 KHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog electrical voltage signal stream (e.g., microphone) for transmission and reconversion to an acoustic signal stream (e.g., loudspeaker) for reception.
The advantages of digital electrical signal transmission led to a conversion from analog to digital telephone transmission beginning in the 1960s. Typically, digital telephone signals arise from sampling analog signals at 8 KHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electrical signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 Kbps (kilobits per second). Many communications applications, such as digital cellular telephone, cannot handle such a high transmission rate, and this has inspired various speech compression methods.
The storage of speech information in analog format (e.g., on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 KHz would require about 5 MB (megabytes) of storage. This demands speech compression analogous to digital transmission compression.
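The transmission rate and storage figures quoted above follow directly from the sampling parameters. The short Python sketch below simply redoes that arithmetic; the variable names are illustrative and not part of the original text.

```python
# Back-of-envelope check of the PCM figures quoted above.
SAMPLE_RATE_HZ = 8_000      # 8 KHz sampling
BITS_PER_SAMPLE = 8         # 8-bit mu-law codes

bit_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE                   # bits per second
seconds = 10 * 60                                                 # ten minutes
storage_bytes = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * seconds

print(bit_rate_bps)     # 64000, i.e. 64 Kbps
print(storage_bytes)    # 4800000, i.e. about 5 MB
```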
One approach to speech compression models the physiological generation of speech and thereby reduces the necessary information transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train for voiced sounds or white noise for unvoiced sounds followed by amplification or gain to adjust the loudness. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision plus adjusting the filter coefficients and the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976).
More particularly, the linear prediction method partitions a stream of speech samples s(n) into “frames” of, for example, 180 successive samples (22.5 msec intervals for an 8 KHz sampling rate); and the samples in a frame then provide the data for computing the filter coefficients for use in coding and synthesis of the sound associated with the frame. Each frame generates coded bits for the linear prediction filter coefficients (LPC), the pitch, the voiced/unvoiced decision, and the gain. This approach of encoding only the model parameters represents far fewer bits than encoding the entire frame of speech samples directly, so the transmission rate may be only 2.4 Kbps rather than the 64 Kbps of PCM. In practice, the LPC coefficients must be quantized for transmission, and the sensitivity of the filter behavior to the quantization error has led to quantization based on the Line Spectral Frequencies (LSF) representation.
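To make the bit savings concrete, the following sketch converts the quoted rates into bits per 180-sample frame. It uses only the numbers stated above; the variable names are illustrative.

```python
# Per-frame bit budget implied by the frame size and rates quoted above.
FRAME_SAMPLES = 180
SAMPLE_RATE_HZ = 8_000

frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE_HZ                # frame duration in msec
bits_per_frame_lpc = 2_400 * FRAME_SAMPLES / SAMPLE_RATE_HZ     # at 2.4 Kbps
bits_per_frame_pcm = 8 * FRAME_SAMPLES                          # 8-bit PCM, one code per sample

print(frame_ms)             # 22.5
print(bits_per_frame_lpc)   # 54.0
print(bits_per_frame_pcm)   # 1440
```

At these rates the parametric coder spends 54 bits where PCM spends 1440 bits on the same frame.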
To improve the sound quality, further information may be extracted from the speech, compressed and transmitted or stored along with the LPC coefficients, pitch, voicing, and gain. For example, the codebook excitation linear prediction (CELP) method first analyzes a speech frame to find the LPC filter coefficients, and then filters the frame with the LPC filter. Next, CELP determines a pitch period from the filtered frame and removes this periodicity with a comb filter to yield a noise-looking excitation signal. Lastly, CELP encodes the excitation signals using a codebook. Thus CELP transmits the LPC filter coefficients, pitch, gain, and the codebook index of the excitation signal.
The advent of digital cellular telephones has emphasized the role of noise suppression in speech processing, both coding and recognition. Customer expectation of high performance even in extreme car noise situations plus the demand to move to progressively lower data rate speech coding in order to accommodate the ever-increasing number of cellular telephone customers have contributed to the importance of noise suppression. While higher data rate speech coding methods tend to maintain robust performance even in high noise environments, that typically is not the case with lower data rate speech coding methods. The speech quality of low data rate methods tends to degrade drastically with high additive noise. Noise suppression to prevent such speech quality losses is important, but it must be achieved without introducing any undesirable artifacts or speech distortions or any significant loss of speech intelligibility. These performance goals for noise suppression have existed for many years, and they have recently come to the forefront due to digital cellular telephone application.
One approach to noise suppression in speech employs spectral subtraction and appears in Boll, Suppression of Acoustic Noise in Speech Using Spectral Subtraction, 27 IEEE Tr.ASSP 113 (1979), and Lim and Oppenheim, Enhancement and Bandwidth Compression of Noisy Speech, 67 Proc.IEEE 1586 (1979). Spectral subtraction proceeds roughly as follows. Presume a sampled speech signal s(j) with uncorrelated additive noise n(j) to yield an observed windowed noisy speech y(j)=s(j)+n(j). These are random processes over time. Noise is assumed to be a stationary process in that the process's autocorrelation depends only on the difference of the variables; that is, there is a function r_{N}(.) such that:
E{n(j)n(i)}=r_{N}(i−j)
where E is the expectation. The Fourier transform of the autocorrelation is called the power spectral density, P_{N}(ω). If speech were also a stationary process with autocorrelation r_{S}(j) and power spectral density P_{S}(ω), then the power spectral densities would add due to the lack of correlation:
P_{Y}(ω)=P_{S}(ω)+P_{N}(ω)
Hence, an estimate for P_{S}(ω), and thus s(j), could be obtained from the observed noisy speech y(j) and the noise observed during intervals of (presumed) silence in the observed noisy speech. In particular, take P_{Y}(ω) as the squared magnitude of the Fourier transform of y(j) and P_{N}(ω) as the squared magnitude of the Fourier transform of the observed noise.
Of course, speech is not a stationary process, so Lim and Oppenheim modified the approach as follows. Take s(j) not to represent a random process but rather to represent a windowed speech signal (that is, a speech signal which has been multiplied by a window function), n(j) a windowed noise signal, and y(j) the resultant windowed observed noisy speech signal. Then Fourier transforming and multiplying by complex conjugates yields:
|Y(ω)|^{2}=|S(ω)|^{2}+|N(ω)|^{2}+2Re{S(ω)N(ω)*}
For ensemble averages the last term on the righthand side of the equation equals zero due to the lack of correlation of noise with the speech signal. This equation thus yields an estimate, S^(ω), for the speech signal Fourier transform as:
|S^(ω)|^{2}=|Y(ω)|^{2}−E{|N(ω)|^{2}}
This resembles the preceding equation for the addition of power spectral densities.
An autocorrelation approach for the windowed speech and noise signals simplifies the mathematics. In particular, the autocorrelation for the speech signal is given by
r_{S}(j)=Σ_{i}s(i)s(i+j),
with similar expressions for the autocorrelation for the noisy speech and the noise. Thus the noisy speech autocorrelation is:
r_{Y}(j)=r_{S}(j)+r_{N}(j)+c_{SN}(j)+c_{SN}(−j)
where c_{SN}(.) is the cross correlation of s(j) and n(j). But the speech and noise signals should be uncorrelated, so the cross correlations can be approximated as 0. Hence, r_{Y}(j)=r_{S}(j)+r_{N}(j). And the Fourier transforms of the autocorrelations are just the power spectral densities, so
P_{Y}(ω)=P_{S}(ω)+P_{N}(ω)
Of course, P_{Y}(ω) equals |Y(ω)|^{2 }with Y(ω) the Fourier transform of y(j) due to the autocorrelation being just a convolution with a time-reversed variable.
The power spectral density P_{N}(ω) of the noise signal can be estimated by detection during noise-only periods, so the speech power spectral estimate becomes
|S^(ω)|^{2}=|Y(ω)|^{2}−|N(ω)|^{2}=P_{Y}(ω)−P_{N}(ω)
which is the spectral subtraction.
The spectral subtraction method can be interpreted as a time-varying linear filter H(ω) so that S^(ω)=H(ω)Y(ω) which the foregoing estimate then defines as:
H(ω)^{2}=[P_{Y}(ω)−P_{N}(ω)]/P_{Y}(ω)
The ultimate estimate for the frame of windowed speech, s^(j), then equals the inverse Fourier transform of S^(ω), and then combining the estimates from successive frames (“overlap add”) yields the estimated speech stream.
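The per-frame procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the referenced authors' implementation: it uses a direct O(n²) DFT from the standard library for self-containment, keeps the noisy phase, and floors negative values of H(ω)^{2} at zero, which is an assumption made here because the text at this point does not say how negative differences are handled. The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (slow, but dependency-free)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

def spectral_subtract(y, noise_power):
    """Filter one windowed frame with H^2 = (P_Y - P_N) / P_Y per bin.

    noise_power holds one P_N value per DFT bin. Negative H^2 is floored
    at 0 (an assumption of this sketch).
    """
    Y = dft(y)
    S_hat = []
    for Yk, pn in zip(Y, noise_power):
        py = abs(Yk) ** 2
        h2 = max(0.0, (py - pn) / py) if py > 0.0 else 0.0
        S_hat.append(math.sqrt(h2) * Yk)    # scale magnitude, keep noisy phase
    return [v.real for v in idft(S_hat)]
```

With a zero noise estimate the frame passes through unchanged, and with a noise estimate that exceeds the frame power in every bin the output is silence.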
This spectral subtraction can attenuate noise substantially, but it has problems including the introduction of fluctuating tonal noises commonly referred to as musical noises.
The Lim and Oppenheim article also describes an alternative noise suppression approach using noncausal Wiener filtering which minimizes the mean-square error. That is, again S^(ω)=H(ω)Y(ω) but with H(ω) now given by:
H(ω)=P_{S}(ω)/[P_{S}(ω)+P_{N}(ω)]
This Wiener filter generalizes to:
H(ω)=[P_{S}(ω)/[P_{S}(ω)+αP_{N}(ω)]]^{β}
where constants α and β are called the noise suppression factor and the filter power, respectively. Indeed, α=1 and β=½ leads to the spectral subtraction method in the following.
A noncausal Wiener filter cannot be directly applied to provide an estimate for s(j) because speech is not stationary and the power spectral density P_{S}(ω) is not known. Thus approximate the noncausal Wiener filter by an adaptive generalized Wiener filter which uses the squared magnitude of the estimate S^(ω) in place of P_{S}(ω):
H(ω)=(|S^(ω)|^{2}/[|S^(ω)|^{2}+αE{|N(ω)|^{2}}])^{β}
Recalling S^(ω)=H(ω)Y(ω) and then solving for |S^(ω)| in the β=½ case yields:
|S^(ω)|=[|Y(ω)|^{2}−αE{|N(ω)|^{2}}]^{1/2 }
which just replicates the spectral subtraction method when α=1.
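The reduction to spectral subtraction can be checked numerically for a single frequency bin. The sketch below assumes illustrative power values (9 for the noisy speech, 4 for the noise) and a hypothetical helper name; it verifies that the spectral subtraction estimate is a fixed point of S^(ω)=H(ω)Y(ω) when β=½.

```python
def wiener_gain(ps, pn, alpha=1.0, beta=0.5):
    """Generalized Wiener gain H = [P_S / (P_S + alpha * P_N)] ** beta."""
    return (ps / (ps + alpha * pn)) ** beta

# Illustrative single-bin values (assumptions, not from the text).
y_power, n_power, alpha = 9.0, 4.0, 1.0

s_power = y_power - alpha * n_power              # spectral subtraction estimate
h = wiener_gain(s_power, n_power, alpha, 0.5)    # gain built from that estimate

# Applying the gain to the noisy power reproduces the same speech estimate.
print(h ** 2 * y_power, s_power)
```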
However, this generalized Wiener filtering has problems including how to estimate S^, and estimators usually apply an iterative approach with perhaps a half dozen iterations which increases computational complexity.
Ephraim, A Minimum Mean Square Error Approach for Speech Enhancement, Conf.Proc. ICASSP 829 (1990), derived a Wiener filter by first analyzing noisy speech to find linear prediction coefficients (LPC) and then resynthesizing an estimate of the speech to use in the Wiener filter.
In contrast, O'Shaughnessy, Speech Enhancement Using Vector Quantization and a Formant Distance Measure, Conf.Proc. ICASSP 549 (1988), computed noisy speech formants and selected quantized speech codewords to represent the speech based on formant distance; the speech was resynthesized from the codewords. This has problems including degradation for high signal-to-noise signals because of the speech quality limitations of the LPC synthesis.
The Fourier transforms of the windowed sampled speech signals in systems 100 and 150 can be computed in either fixed point or floating point format. Fixed point is cheaper to implement in hardware but has less dynamic range for a comparable number of bits. Automatic gain control limits the dynamic range of the speech samples by adjusting magnitudes according to a moving average of the preceding sample magnitudes, but this also destroys the distinction between loud and quiet speech. Further, the acoustic energy may be concentrated in a narrow frequency band and the Fourier transform will have large dynamic range even for speech samples with relatively constant magnitude. To compensate for such overflow potential in fixed point format, a few bits may be reserved for large Fourier transform dynamic range; but this implies a loss of resolution for small magnitude samples and consequent degradation of quiet speech. This is especially true for systems which follow a Fourier transform with an inverse Fourier transform.
The present invention provides speech noise suppression by spectral subtraction filtering improved with filter clamping, limiting, and/or smoothing, plus generalized Wiener filtering with a signal-to-noise ratio dependent noise suppression factor, and plus a generalized Wiener filter based on a speech estimate derived from codebook noisy speech analysis and resynthesis. And each frame of samples has a frame-energy-based scaling applied prior to and after Fourier analysis to preserve quiet speech resolution.
The invention has advantages including simple speech noise suppression.
The drawings are schematic for clarity.
Overview
The preferred embodiment noise suppression filters may also be realized without Fourier transforms; however, the multiplication of Fourier transforms then corresponds to convolution of functions.
The preferred embodiment noise suppression filters may each be used as the noise suppression blocks in the generic systems of
The smoothed spectral subtraction preferred embodiments have a spectral subtraction filter which (1) clamps attenuation to limit suppression for inputs with small signal-to-noise ratios, (2) increases noise estimate to avoid filter fluctuations, (3) smoothes noisy speech and noise spectra used for filter definition, and (4) updates a noise spectrum estimate from the preceding frame using the noisy speech spectrum. The attenuation clamp may depend upon speech and noise estimates in order to lessen the attenuation (and distortion) for speech; this strategy may depend upon estimates only in a relatively noise-free frequency band.
The signal-to-noise ratio adaptive generalized Wiener filter preferred embodiments use H(ω)=[P_{S}^(ω)/[P_{S}^(ω)+αP_{N}(ω)]]^{β} where the noise suppression factor α depends on E_{Y}/E_{N }with E_{N }the noise energy and E_{Y }the noisy speech energy for the frame. These preferred embodiments also use a scaled LPC spectral approximation of the noisy speech for a smoothed speech power spectrum estimate as illustrated in the flow diagram
The codebook-based generalized Wiener filter noise suppression preferred embodiments use H(ω)=[P_{S}^(ω)/[P_{S}^(ω)+αP_{N}(ω)]]^{β} with P_{S}^(ω) estimated from LSFs as weighted sums of LSFs in a codebook of LSFs with the weights determined by the LSFs of the input noisy speech. Then iterate: use this H(ω) to form H(ω)Y(ω), next redetermine the input LSFs from H(ω)Y(ω), and then redetermine H(ω) with these LSFs as weights for the codebook LSFs. A half dozen iterations may be used.
The power estimates used in the preferred embodiment filter definitions may also be used for adaptive scaling of low power signals to avoid loss of precision during FFT or other operations. The scaling factor adapts to each frame so that with fixed-point digital computations the scale expands or contracts the samples to provide a constant overflow headroom, and after the computations the inverse scale restores the frame power level.
Smoothed spectral subtraction preferred embodiments
H(ω)^{2}=[|Y(ω)|^{2}−|N(ω)|^{2}]/|Y(ω)|^{2}=1−|N(ω)|^{2}/|Y(ω)|^{2 }
A graph of this function with logarithmic scales appears in
The preferred embodiments modify this standard spectral subtraction in four independent but synergistic approaches as detailed in the following.
Preliminarily, partition an input stream of noisy speech sampled at 8 KHz into 256-sample frames with a 50% overlap between successive frames; that is, each frame shares its first 128 samples with the preceding frame and shares its last 128 samples with the succeeding frame. This yields an input stream of frames with each frame having 32 msec of samples and a new frame beginning every 16 msec.
Next, multiply each frame with a Hann window of width 256. (A Hann window has the form w(k)=(1+cos(2πk/K))/2 with K+1 the window width.) Thus each frame has 256 samples y(j), and the frames add to reconstruct the input speech stream.
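The statement that the windowed frames add back to the input can be illustrated as follows. The sketch assumes the periodic form of the window, with K taken equal to 256 so that the two overlapping window halves sum to exactly one; with K+1=256 the sum would be only approximately one. Function names are hypothetical.

```python
import math

FRAME = 256          # samples per frame (32 msec at 8 KHz)
HOP = FRAME // 2     # 50% overlap: a new frame every 128 samples (16 msec)

def hann(k, period=FRAME):
    """w(k) = (1 + cos(2*pi*k/K)) / 2, centered at k = 0 (K = period assumed)."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * k / period))

# Shift so that k - HOP runs from -128 to 127 across the frame.
window = [hann(k - HOP) for k in range(FRAME)]

def split_frames(samples):
    """Cut the stream into 50%-overlapped frames and apply the window."""
    return [[s * w for s, w in zip(samples[start:start + FRAME], window)]
            for start in range(0, len(samples) - FRAME + 1, HOP)]

def overlap_add(frames):
    """Sum the frames back at their original offsets."""
    out = [0.0] * (HOP * (len(frames) - 1) + FRAME)
    for i, frame in enumerate(frames):
        for j, v in enumerate(frame):
            out[i * HOP + j] += v
    return out
```

Apart from the first and last half frames, which are covered by only one window, the overlap-added output equals the input stream.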
Fourier transform the windowed speech to find Y(ω) for the frame; the noise spectrum estimation differs from the traditional methods and appears in modification (4).
(1) Clamp the H(ω) attenuation curve so that the attenuation cannot go below a minimum value;
H(ω)^{2}=max[10^{−2}, 1−|N(ω)|^{2}/|Y(ω)|^{2}]
Of course, the 10 dB clamp could be replaced with any other desirable clamp level, such as 5 dB or 20 dB. Also, the clamping could include a sloped clamp or stepped clamping or other more general clamping curves, but a simple clamp lessens computational complexity. The following “Adaptive filter clamp” section describes a clamp which adapts to the input signal energy level.
(2) Increase the noise power spectrum estimate by a factor such as 2 so that small errors in the spectral estimates for input (noisy) signals do not result in fluctuating attenuation filters. The corresponding filter for this factor alone would be:
H(ω)^{2}=1−4|N(ω)|^{2}/|Y(ω)|^{2 }
For small input signal-to-noise power ratios this becomes negative, but a clamp as in (1) eliminates the problem. This noise increase factor appears as a shift in the logarithmic input signal-to-noise power ratio independent variable of
(3) Reduce the variance of spectral estimates used in the noise suppression filter H(ω) by smoothing over neighboring frequencies. That is, for an input windowed noisy speech signal y(j) with Fourier transform Y(ω), apply a running average over frequency so that |Y(ω)|^{2 }is replaced by (W★|Y|^{2})(ω) in H(ω) where W(ω) is a window about 0 and ★ is the convolution operator.
H(ω)^{2}=1−|N(ω)|^{2}/W★|Y|^{2}(ω)
Thus a filter with all three of the foregoing features has transfer function:
H(ω)^{2}=max[10^{−2}, 1−4|N(ω)|^{2}/W★|Y|^{2}(ω)]
Extend the definition of H(ω) by symmetry to π<ω<2π or −π<ω<0
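A filter with the clamp, the increased noise estimate, and the frequency smoothing might be sketched as below. The smoothing window W is taken here as a three-bin rectangular running average with circular wrap-around; both choices are assumptions of this sketch, since the text only requires a window about 0. Function and parameter names are hypothetical.

```python
def smooth(power, half_width=1):
    """Running average over neighbouring frequency bins (circular indexing)."""
    n = len(power)
    span = 2 * half_width + 1
    return [sum(power[(k + d) % n] for d in range(-half_width, half_width + 1)) / span
            for k in range(n)]

def clamped_gain_squared(y_power, n_power, floor=1e-2, noise_boost=4.0, half_width=1):
    """H^2 = max[floor, 1 - noise_boost * |N|^2 / (W * |Y|^2)] for each bin."""
    smoothed = smooth(y_power, half_width)
    gains = []
    for py, pn in zip(smoothed, n_power):
        raw = 1.0 - noise_boost * pn / py if py > 0.0 else floor
        gains.append(max(floor, raw))
    return gains

print(clamped_gain_squared([100.0] * 8, [1.0] * 8)[0])   # 0.96
print(clamped_gain_squared([1.0] * 8, [1.0] * 8)[0])     # 0.01
```

The first call shows a high signal-to-noise bin passing nearly unattenuated; the second shows the clamp taking over when the unclamped expression would be negative.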
(4) Any noise suppression by spectral subtraction requires an estimate of the noise power spectrum. Typical methods update an average noise spectrum during periods of non-speech activity, but the performance of this approach depends upon accurate estimation of speech intervals which is a difficult technical problem. Some kinds of acoustic noise may have speech-like characteristics, and if they are incorrectly classified as speech, then the noise estimated will not be updated frequently enough to track changes in the noise environment.
Consequently, the preferred embodiment takes noise as any signal which is always present. At each frequency recursively estimate the noise power spectrum P_{N}(ω) for use in the filter H(ω) by updating the estimate from the previous frame, P′_{N}(ω), using the current frame smoothed estimate for the noisy speech power spectrum, P_{Y}(ω)=W★|Y|^{2}(ω), as follows:
For the first frame, just take P_{N}^(ω) equal to P_{Y}(ω).
Thus, the noise power spectrum estimate can increase up to 3 dB per second or decrease up to 12 dB per second. As a result, the noise estimates will only slightly increase during short speech segments, and will rapidly return to the correct value during pauses between words. The initial estimate can simply be taken as the first input frame which typically will be silence; of course, other initial estimates could be used such as a simple constant. This approach is simple to implement, and is robust in actual performance since it makes no assumptions about the characteristics of either the speech or the noise signals. Of course, multiplicative factors other than 0.978 and 1.006 could be used provided that the decrease limit exceeds the increase limit. That is, the product of the multiplicative factors is less than 1; e.g., (0.978) (1.006) is less than 1.
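A recursive update consistent with the factors 0.978 and 1.006 can be sketched as a per-frequency clamp of the current smoothed spectrum to the interval between 0.978 and 1.006 times the previous estimate. The explicit update rule is not reproduced in the text above, so this form is inferred from the quoted factors and should be read as an assumption. The rate check likewise assumes a new frame every 128 samples at 8 KHz and a 20·log10 conversion, under which the quoted 3 dB and 12 dB per second figures are approximately recovered.

```python
import math

DOWN, UP = 0.978, 1.006     # per-frame decrease and increase limits

def update_noise(prev_noise, y_power):
    """Move each bin toward P_Y, limited to [DOWN, UP] times the previous estimate."""
    return [min(UP * pn, max(DOWN * pn, py)) for pn, py in zip(prev_noise, y_power)]

# Implied drift rates (assumed conventions: 62.5 frames per second, 20*log10).
frames_per_second = 8000 / 128
up_db_per_s = 20 * math.log10(UP) * frames_per_second
down_db_per_s = 20 * math.log10(DOWN) * frames_per_second

print(round(up_db_per_s, 2), round(down_db_per_s, 2))
```

A loud speech frame therefore raises the estimate by at most 0.6% per frame, while a quiet frame lowers it by up to 2.2%.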
A preferred embodiment filter may include one or more of the four modifications, and a preferred embodiment filter combining all four of the foregoing modifications will have a transfer function:
H(ω)^{2}=max[10^{−2}, 1−4P_{N}^(ω)/W★|Y|^{2}(ω)]
with P_{N}^(ω) the noise power estimate as in the preceding.
Adaptive Filter Clamp
The filter attenuation clamp of the preceding section can be replaced with an adaptive filter attenuation clamp. For example, take
H(ω)^{2}=max[M^{2}, 1−|N(ω)|^{2}/|Y(ω)|^{2}]
and let the minimum filter gain M depend upon the signal and noise power of the current frame (or, for computational simplicity, of the preceding frame). Indeed, when speech is present, it serves to mask low-level noise; therefore, M can be increased in the presence of speech without the listener hearing increased noise. This has the benefit of lessening the attenuation of the speech and thus causing less speech distortion. Because a common response to having difficulty communicating over the phone is to speak louder, this decrease in filter attenuation with increased speech power will lessen distortion and improve speech quality. Simply put, the system will transmit clearer speech the louder a person talks.
In particular, let YP be the sum of the signal power spectrum over the frequency range 1.8 KHz to 4.0 KHz: with a 256-sample frame sampling at 8 KHz and 256-point FFT, this corresponds to frequencies 51π/128 to π. That is,
YP=Σ_{ω}P_{Y}(ω) for 51π/128≦ω≦π
Similarly, let NP be the corresponding sum of the noise power:
NP=Σ_{ω}P_{N}^(ω) for 51π/128≦ω≦π
with P_{N}^(ω) the noise estimate from the preceding section. The frequency range 1.8 KHz to 4.0 KHz lies in a band with small road noise for an automobile but still with significant speech power, thus detect the presence of speech by considering YP−NP. Then take M equal to A+B(YP−NP) where A is the minimum filter gain with an all-noise input (analogous to the clamp of the preceding section), and B is the dependence of the minimum filter gain on speech power. For example, A could be −8 dB or −10 dB as in the preceding section, and B could be in the range of ¼ to 1. Further, YP−NP may become negative for near silent frames, so preserve the minimum clamp at A by ignoring the B(YP−NP) factor when YP−NP is negative. Also, an upper limit of −4 dB for very loud frames could be imposed by replacing B(YP−NP) with min[−4 dB, B(YP−NP)].
More explicitly, presume a 16-bit fixed-point format of two's complement numbers, and presume that the noisy speech samples have been scaled so that numbers X arising in the computations will fall into the range −1≦X<+1, which in hexadecimal notation will be the range 8000 to 7FFF. Then the filter gain clamp could vary between A taken equal to 1000 (0.125), which is roughly −9 dB, and an upper limit for A+B(YP−NP) taken equal to 3000 (0.375), which is roughly −4.4 dB. More conservatively, the clamp could be constrained to the range of 1800 to 2800.
Furthermore, a simpler implementation of the adaptive clamp which still provides its advantages uses the M from the previous frame (called M_{OLD}) and takes M for the current frame simply equal to (17/16)M_{OLD }when M_{OLD }is less than A+B(YP−NP) and (15/16)M_{OLD }when M_{OLD }is greater than A+B(YP−NP).
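The simpler stepwise clamp can be sketched as follows. The values A=0.125 and an upper limit of 0.375 are the fixed-point examples given above; B=0.5, the function names, and holding M unchanged when it exactly equals the target are assumptions of this sketch.

```python
def clamp_target(yp, np_, a=0.125, b=0.5, upper=0.375):
    """A + B*(YP - NP), ignoring the B term when negative and capping at `upper`."""
    speech = max(0.0, yp - np_)
    return min(upper, a + b * speech)

def step_clamp(m_old, target):
    """Move M one step toward the target: (17/16)M up, (15/16)M down."""
    if m_old < target:
        return m_old * 17.0 / 16.0
    if m_old > target:
        return m_old * 15.0 / 16.0
    return m_old    # equal case is not specified in the text; hold M (assumption)

# A loud frame sequence: M climbs from A toward the upper limit, then hovers near it.
m = 0.125
for _ in range(40):
    m = step_clamp(m, clamp_target(yp=10.0, np_=0.0))
print(m)
```

Because each step changes M by only one sixteenth, the clamp follows the target with a lag and then oscillates within one step of it, which avoids a multiplication by B for every update of M.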
The preceding adaptive clamp depends linearly on the speech power; however, other dependencies such as quadratic could also be used provided that the functional dependence is monotonic. Indeed, memory in system and slow adaptation rates for M make the clamp nonlinear.
The frequency range used to measure the signal and noise powers could be varied, such as 1.2 KHz to 4.0 KHz or another band (or bands) depending upon the noise environment.
Note that the adaptive clamp could be taken as dependent upon the ratio YP/NP instead of just the difference or on some combination. Also, the positive slope of the adaptive clamp (see
Note that the estimates YP and NP could be defined by the previous frame in order to make an implementation on a DSP more memory efficient. For most frames the YP and NP will be close to those of the preceding frame.
Modified generalized Wiener filter preferred embodiments
H(ω)^{2}=P_{S}^(ω)/[P_{S}^(ω)+αP_{N}^(ω)]
with P_{S}^(ω) an estimate for the speech power spectrum, P_{N}^(ω) an estimate for the noise power spectrum, and α a noise suppression factor. The preferred embodiments modify the generalized Wiener filter by using an α which tracks the signal-to-noise power ratio of the input rather than just a constant.
Heuristically, the preferred embodiment may be understood in terms of the following intuitive analysis. First, take P_{S}^(ω) to be cP_{Y}^(ω) for a constant c with P_{Y}^(ω) the power spectrum of the input noisy speech modelled by LPC. That is, the LPC model for y(j) in some sense removes the noise. Then solve for c by substituting this presumption into the statement that the speech and the noise are uncorrelated (P_{Y}(ω)=P_{S}(ω)+P_{N}(ω)) and integrating (summing) over all frequencies to yield:
∫P_{Y}(ω)dω=∫cP_{Y}^(ω)dω+∫P_{N}(ω)dω
where cP_{Y}^(ω) is the estimate P_{S}^(ω) of P_{S}(ω).
Thus by Parseval's theorem, E_{Y}=cE_{Y}+E_{N}, where E_{Y }is the energy of the noisy speech LPC model and also an estimate for the energy of y(j), and E_{N }is the energy of the noise in the frame. Thus, c=(E_{Y}−E_{N})/E_{Y }and so P_{S}^(ω)=[(E_{Y}−E_{N})/E_{Y}]P_{Y}(ω). Then inserting this into the definition of the generalized Wiener filter transfer function gives:
H(ω)^{2}=P_{Y}(ω)/(P_{Y}(ω)+[E_{Y}/(E_{Y}−E_{N})]αP_{N}^(ω))
Now take the factor multiplying P_{N}^(ω) (i.e., [E_{Y}/(E_{Y}−E_{N})]α) as inversely dependent upon signal-to-noise ratio (i.e., [E_{Y}/(E_{Y}−E_{N})]α=κE_{N}/E_{S} for a constant κ, with E_{S}=E_{Y}−E_{N} the estimated speech energy) so that the noise suppression varies from frame to frame and is greater for frames with small signal-to-noise ratios. Thus the modified generalized Wiener filter ensures stronger suppression for noise-only frames and weaker suppression for voiced-speech frames which are not noise corrupted as much. In short, take α=κE_{N}/E_{Y}, so the noise suppression factor has been made inversely dependent on the signal-to-noise ratio, and the filter transfer function becomes:
H(ω)^{2}=P_{Y}(ω)/(P_{Y}(ω)+[E_{N}/(E_{Y}−E_{N})]κP_{N}^(ω))
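The resulting transfer function can be evaluated per frequency bin as in the sketch below. The function name is hypothetical, and returning zero gain when E_{Y} does not exceed E_{N} is an assumption of this sketch (it is the limit of the expression as E_{Y} approaches E_{N} from above); the text does not specify that case.

```python
def adaptive_wiener_gain_squared(y_power, n_power, e_y, e_n, kappa=1.0):
    """H^2 = P_Y / (P_Y + [E_N / (E_Y - E_N)] * kappa * P_N) for each bin."""
    if e_y <= e_n:
        return [0.0] * len(y_power)     # assumed handling of noise-only frames
    weight = kappa * e_n / (e_y - e_n)
    return [py / (py + weight * pn) if py > 0.0 else 0.0
            for py, pn in zip(y_power, n_power)]

# Same noise level, two frame energies: suppression is far stronger at low SNR.
high_snr = adaptive_wiener_gain_squared([10.0], [1.0], e_y=10.0, e_n=1.0)[0]
low_snr = adaptive_wiener_gain_squared([1.1], [1.0], e_y=1.1, e_n=1.0)[0]
print(high_snr, low_snr)
```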
Optionally, average α by weighting with the α from the preceding frame to limit discontinuities. Further, the value of the constant κ can be increased to obtain higher noise suppression, which does not result in fluctuations in the speech as much as it does for standard spectral subtraction because H(ω) is always nonnegative.
In more detail, the modified generalized Wiener filter preferred embodiment proceeds through the following steps as illustrated in
Thus the noise spectrum estimate can increase at 3 dB per second and decrease at 12 dB per second. For the first frame, just take P_{N}(ω) equal to P_{Y}(ω). And E_{N }is the integration (sum) of P_{N }over all frequencies.
Also, optionally, to handle abrupt increases in noise level, use a counter to keep track of the number of successive frames in which the condition P_{Y}>1.006 P′_{N}(ω) occurs. If 75 successive frames have this condition, then change the multiplier from 1.006 to (1.006)^{2 }and restart the counter at 0. And if the next successive 75 frames have the condition P_{Y}>(1.006)^{2 }P′_{N}(ω), then change the multiplier from (1.006)^{2 }to (1.006)^{3}. Continue in this fashion provided 75 successive frames all satisfy the condition. Once a frame violates the condition, return to the initial multiplier of 1.006.
Of course, other multipliers and count limits could be used.
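The counter logic for abrupt noise increases can be sketched as a small state machine. The sketch tracks a single frequency (or a single aggregate condition); whether the counter is kept per frequency or per frame is not stated in the text, so that choice, like the class and method names, is an assumption.

```python
class UpwardMultiplier:
    """Escalates the increase multiplier 1.006 -> 1.006**2 -> 1.006**3 ..."""

    BASE = 1.006
    LIMIT = 75      # successive frames required before escalating

    def __init__(self):
        self.power = 1
        self.count = 0

    @property
    def value(self):
        return self.BASE ** self.power

    def observe(self, p_y, prev_noise):
        """Update the state with one frame and return the multiplier to use next."""
        if p_y > self.value * prev_noise:
            self.count += 1
            if self.count == self.LIMIT:
                self.power += 1      # escalate and restart the counter
                self.count = 0
        else:
            self.power = 1           # any violation returns to the initial multiplier
            self.count = 0
        return self.value
```

For example, 75 successive frames with P_{Y} well above the noise estimate raise the multiplier to (1.006)^{2}, a further 75 raise it to (1.006)^{3}, and a single quiet frame resets it to 1.006.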
Insertion of noise suppressor 1100 into the systems of
Codebook based generalized Wiener filter preferred embodiment
H(ω)^{2}=P_{S}^(ω)/[P_{S}^(ω)+αP_{N}^(ω)]
with α the noise suppression constant. Heuristically, the preferred embodiments estimate the noise P_{N}^(ω) in the same manner as step (5) of the previously described generalized Wiener filter preferred embodiments, and estimate P_{S}^(ω) by the use of the line spectral frequencies (LSF) of the input noisy speech as weightings for LSFs from a codebook of noise-free speech samples. In particular, codebook preferred embodiments proceed as follows.
In more detail, let (LSF_{j,1}, LSF_{j,2}, LSF_{j,3}, . . . , LSF_{j,M}) be M LSFs of the jth entry of the codebook; then take the distance of the noisy speech frame LSFs, (LSF_{n,1}, LSF_{n,2}, LSF_{n,3}, . . . , LSF_{n,M}), from the jth entry to be:
d_{j}=Σ_{i}(LSF_{j,i}−LSF_{n,i})/(LSF_{n,i}−LSF_{n,c(i)})
where LSF_{n,c(i) }is the noisy speech frame LSF which is the closest to LSF_{n,i }(so c(i) will be either i−1 or i+1 if the LSF_{n,i }are in size order). Thus, this distance measure is dominated by the LSF_{n,i }which are close to each other, and this provides good results because such LSFs have a higher chance of being formants in the noisy speech frame.
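The weighting effect of the distance measure can be illustrated with the sketch below. It squares each ratio so that every term is nonnegative; the formula as reproduced above shows no exponent, so the squaring is an assumption of this sketch rather than a statement of the patented measure. The helper names and the three-element LSF vectors are purely illustrative.

```python
def closest_gap(lsf, i):
    """Spacing between lsf[i] and its nearest neighbouring LSF in the same frame."""
    gaps = []
    if i > 0:
        gaps.append(abs(lsf[i] - lsf[i - 1]))
    if i + 1 < len(lsf):
        gaps.append(abs(lsf[i + 1] - lsf[i]))
    return min(gaps)

def lsf_distance(codebook_entry, noisy_lsf):
    """Sum of squared LSF differences, each divided by the squared closest gap."""
    return sum(((c - n) / closest_gap(noisy_lsf, i)) ** 2
               for i, (c, n) in enumerate(zip(codebook_entry, noisy_lsf)))

noisy = [0.10, 0.12, 0.50]                          # first two LSFs are close together
near_pair = lsf_distance([0.11, 0.12, 0.50], noisy)   # 0.01 error on a close LSF
isolated = lsf_distance([0.10, 0.12, 0.51], noisy)    # same error on an isolated LSF
print(near_pair, isolated)
```

The same 0.01 error costs far more on an LSF that sits close to its neighbour than on an isolated one, which is the dominance described above.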
Insertion of noise suppressor 1200 into the systems of
Internal precision control
The preferred embodiments employ various operations such as FFT, and with low power frames the signal samples are small and precision may be lost in multiplications. For example, squaring a 16-bit fixed-point sample will yield a 32-bit result, but memory limitations may demand that only 16 bits be stored and so only the upper 16 bits will be chosen to avoid overflow. Thus an input sample with only the lowest 9 bits nonzero will have an 18-bit answer which implies only the two most significant bits will be retained and thus a loss of precision.
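The loss of precision in this example can be reproduced with plain integer arithmetic. The sketch below uses the largest sample with only its lowest 9 bits nonzero; the variable names are illustrative.

```python
sample = 0x01FF                 # 16-bit sample with only the lowest 9 bits set (511)
product = sample * sample       # fits in the 32-bit result of a 16x16 multiply
kept = product >> 16            # store only the upper 16 bits of the 32-bit result

print(product.bit_length())     # 18
print(kept, kept.bit_length())  # 3 2
```

Only the top two bits of the 18-bit square survive the truncation.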
An automatic gain control to bring input samples up to a higher level avoids such a loss of precision but destroys the power level information: both loud and quiet input speech will have the same power output levels. Also, such automatic gain control typically relies on the sample stream and does not consider a frame at a time.
A preferred embodiment precision control method proceeds as follows.
An alternative precision control scaling uses the sum of the absolute values of the samples in a frame rather than the power estimate (sum of the squares of the samples). As with the power estimate scaling, count the number S of significant bits in the sum of absolute values and scale the input samples by a factor of 2^{N+8−S−H }where again N+1 is the number of bits in the sample representation, the 8 comes from the 256 (2^{8}) sample frame size, and H provides headroom bits. Heuristically, with samples of K significant bits on the average, the sum of absolute values should be about K+8 bits, and so S will be about K+8 and the factor will be 2^{N−K−H }which is the same as the power estimate sum scaling.
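The sum-of-absolute-values scaling can be sketched with integer shifts. The sketch assumes 16-bit samples (N=15), H=3 headroom bits, and a 256-sample frame; the function names are hypothetical.

```python
def frame_shift(samples, n_bits=15, headroom=3, log2_frame=8):
    """Exponent N + 8 - S - H, with S the bit count of the sum of |samples|."""
    s = sum(abs(v) for v in samples).bit_length()
    return n_bits + log2_frame - s - headroom

def scale(samples, shift):
    """Multiply by 2**shift using shifts (right shift when the exponent is negative)."""
    if shift >= 0:
        return [v << shift for v in samples]
    return [v >> -shift for v in samples]

frame = [100, -100] * 128            # quiet frame: samples of about 7 significant bits
shift = frame_shift(frame)           # 15 + 8 - 15 - 3 = 5
scaled = scale(frame, shift)         # samples now occupy 12 bits, leaving 3 of headroom
restored = scale(scaled, -shift)     # inverse scale after processing

print(shift, max(scaled), restored == frame)   # 5 3200 True
```

Here K=7, so the exponent agrees with N−K−H=5, and the inverse scale restores the original frame level after the fixed-point computations.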
Further, even using the power estimate sum with S significant bits, scaling factors such as 2^{(2N+8−S)−H }have yielded good results. That is, variations of the method of scaling up according to a frame characteristic, processing, and then scaling down will also be viable provided the scaling does not lead to excessive overflow.
Modifications
The preferred embodiments may be varied in many ways while retaining one or more of the features of clamping, noise enhancing, smoothed power estimating, recursive noise estimating, adaptive clamping, adaptive noise suppression factoring, codebook based estimating, and internal precision controlling.
For example, the various generalized Wiener filters of the preferred embodiments had power β equal to ½, but other powers such as 1, ¾, ¼, and so forth also apply; higher filter powers imply stronger filtering. The frame size of 256 samples could be increased or decreased, although powers of 2 are convenient for FFTs. The particular choice of 3 bits of additional headroom could be varied, especially with different size frames and different number of bits in the sample representation. The adaptive clamp could have a negative dependence upon frame noise and signal estimates (B<0). Also, the adaptive clamp could invoke a near-end speech detection method to adjust the clamp level. The α and κ coefficients could be varied and could enter the transfer functions as simple analytic functions of the ratios, and the number of iterations in the codebook based generalized Wiener filter could be varied.
Cited Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|
US4964166 | May 26, 1988 | Oct 16, 1990 | Pacific Communication Science, Inc. | Adaptive transform coder having minimal bit allocation processing |
US5012519 * | Jan 5, 1990 | Apr 30, 1991 | The Dsp Group, Inc. | Noise reduction system |
US5036540 | Sep 28, 1989 | Jul 30, 1991 | Motorola, Inc. | Speech operated noise attenuation device |
US5133013 * | Jan 18, 1989 | Jul 21, 1992 | British Telecommunications Public Limited Company | Noise reduction by using spectral decomposition and non-linear transformation |
US5140638 | Aug 6, 1990 | Jul 20, 1999 | U S Philips Corp | Speech coding system and a method of encoding speech |
US5148489 * | Mar 9, 1992 | Sep 15, 1992 | Sri International | Method for spectral estimation to improve noise robustness for speech recognition |
US5212764 * | Apr 24, 1992 | May 18, 1993 | Ricoh Company, Ltd. | Noise eliminating apparatus and speech recognition apparatus using the same |
US5230060 | Feb 22, 1991 | Jul 20, 1993 | Kokusai Electric Co., Ltd. | Speech coder and decoder for adaptive delta modulation coding system |
US5337251 | Jun 5, 1992 | Aug 9, 1994 | Sextant Avionique | Method of detecting a useful signal affected by noise |
US5353408 | Dec 30, 1992 | Oct 4, 1994 | Sony Corporation | Noise suppressor |
US5537647 * | Nov 5, 1992 | Jul 16, 1996 | U S West Advanced Technologies, Inc. | Noise resistant auditory model for parametrization of speech |
US5544250 | Jul 18, 1994 | Aug 6, 1996 | Motorola | Noise suppression system and method therefor |
US5581653 | Aug 31, 1993 | Dec 3, 1996 | Dolby Laboratories Licensing Corporation | Low bit-rate high-resolution spectral envelope coding for audio encoder and decoder |
US5590242 * | Mar 24, 1994 | Dec 31, 1996 | Lucent Technologies Inc. | Signal bias removal for robust telephone speech recognition |
US5598505 * | Sep 30, 1994 | Jan 28, 1997 | Apple Computer, Inc. | Cepstral correction vector quantizer for speech recognition |
US5623577 | Jan 28, 1994 | Apr 22, 1997 | Dolby Laboratories Licensing Corporation | Computationally efficient adaptive bit allocation for encoding method and apparatus with allowance for decoder spectral distortions |
No. | | Reference |
---|---|---|
1 | | Arslan et al., "New Methods for Adaptive Noise Suppression," ICASSP '95: Acoustics, Speech & Signal Processing Conference, pp. 812-815, 1995. |
2 | * | Arslan et al., "New Methods for Adaptive Noise Suppression," ICASSP '95: Acoustics, Speech & Signal Processing Conference, pp. 812-815, May 1995. |
3 | | Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, No. 2, Apr. 1979, pp. 113-120. |
4 | * | Deller et al., "Discrete-Time Processing of Speech Signals," Prentice-Hall, Inc., pp. 331-333, 1987. |
5 | | Deller et al., "Discrete-Time Processing of Speech Signals," Prentice-Hall, Inc., pp. 506-528, 1987. |
6 | | Oppenheim et al., "Digital Signal Processing," 1975, pp. 239-240. |
7 | | Parsons, "Voice and Speech Processing," 1987. |
Citing Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|
US8271276 * | | Sep 18, 2012 | Dolby Laboratories Licensing Corporation | Enhancement of multichannel audio |
US8731214 * | Apr 23, 2010 | May 20, 2014 | Stmicroelectronics International N.V. | Noise removal system |
US8972250 * | Aug 10, 2012 | Mar 3, 2015 | Dolby Laboratories Licensing Corporation | Enhancement of multichannel audio |
US9015044 * | Aug 20, 2012 | Apr 21, 2015 | Malaspina Labs (Barbados) Inc. | Formant based speech reconstruction from noisy signals |
US9020818 * | Aug 20, 2012 | Apr 28, 2015 | Malaspina Labs (Barbados) Inc. | Format based speech reconstruction from noisy signals |
US9240190 * | Mar 16, 2015 | Jan 19, 2016 | Malaspina Labs (Barbados) Inc. | Formant based speech reconstruction from noisy signals |
US9368128 * | Jan 26, 2015 | Jun 14, 2016 | Dolby Laboratories Licensing Corporation | Enhancement of multichannel audio |
US20110142254 * | | Jun 16, 2011 | Stmicroelectronics Pvt., Ltd. | Noise removal system |
US20120221328 * | May 3, 2012 | Aug 30, 2012 | Dolby Laboratories Licensing Corporation | Enhancement of Multichannel Audio |
US20130231924 * | Aug 20, 2012 | Sep 5, 2013 | Pierre Zakarauskas | Format Based Speech Reconstruction from Noisy Signals |
US20130231927 * | Aug 20, 2012 | Sep 5, 2013 | Pierre Zakarauskas | Formant Based Speech Reconstruction from Noisy Signals |
US20150142424 * | Jan 26, 2015 | May 21, 2015 | Dolby Laboratories Licensing Corporation | Enhancement of Multichannel Audio |
US20150187365 * | Mar 16, 2015 | Jul 2, 2015 | Malaspina Labs (Barbados), Inc. | Formant Based Speech Reconstruction from Noisy Signals |
U.S. Classification | 704/226, 704/205, 704/230 |
International Classification | G10L21/02, G10L19/14, G10L19/06 |
Cooperative Classification | G10L21/0208, G10L21/0216, G10L19/07 |
European Classification | G10L19/07, G10L21/0208 |