US RE40280 E1 Abstract A method and apparatus for quantizing audio signals is disclosed which advantageously produces a quantized audio signal which can be encoded within an acceptable range. Advantageously, the quantizer uses a scale factor which is interpolated between a threshold based on the calculated threshold of hearing at a given frequency and the absolute threshold of hearing at the same frequency.
Claims(8) 1. A method of coding an audio signal comprising:
(a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients; (b) calculating a masking threshold based upon the set of frequency coefficients; (c) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients for use in encoding the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and (d) coding the set of frequency coefficients based upon the set of quantization step size coefficients. 2. The method of
(a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients;
(b) calculating a masking threshold based upon the set of frequency coefficients;
(c) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients for use in encoding the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and
(d) coding the set of frequency coefficients based upon the set of quantization step size coefficients, wherein the set of frequency coefficients are MDCT coefficients.
3. The method of
4. A decoder for decoding a set of frequency coefficients representing an audio signal, the decoder comprising:
(a) means for receiving the set of coefficients, the set of frequency coefficients having been encoded by:
(1) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal comprising the set of frequency coefficients;
(2) calculating a masking threshold based upon the set of frequency coefficients;
(3) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients needed to encode the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and
(4) coding the set of frequency coefficients based upon the set of quantization step size coefficients; and
(b) means for converting the set of coefficients to a time domain signal. 5. A method of coding an audio signal comprising:
(a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients; (b) calculating a masking threshold based upon the set of frequency coefficients; (c) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients for use in encoding the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and (d) coding the set of frequency coefficients based upon the set of quantization step size coefficients, wherein said using the masking threshold and the absolute hearing threshold to determine the set of quantization step size coefficients comprises using the absolute hearing threshold to modify the masking threshold and then using the modified masking threshold to determine the set of quantization step size coefficients. 6. A method of coding an audio signal comprising:
(a) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal, the frequency domain representation comprising a set of frequency coefficients; (b) calculating a masking threshold based upon the set of frequency coefficients; (c) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients for use in encoding the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and (d) coding the set of frequency coefficients based upon the set of quantization step size coefficients, wherein the masking threshold is modified based on the absolute hearing threshold, and wherein said using the masking threshold and the absolute hearing threshold to determine the set of quantization step size coefficients comprises using the modified masking threshold to determine the set of quantization step size coefficients. 7. A decoder for decoding a set of frequency coefficients representing an audio signal, the decoder comprising:
(a) means for receiving the set of coefficients, the set of frequency coefficients having been encoded by:
(1) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal comprising the set of frequency coefficients;
(2) calculating a masking threshold based upon the set of frequency coefficients;
(3) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients needed to encode the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and
(4) coding the set of frequency coefficients based upon the set of quantization step size coefficients; and
(b) means for converting the set of coefficients to a time domain signal, wherein said using the masking threshold and the absolute hearing threshold to determine the set of quantization step size coefficients comprised using the absolute hearing threshold to modify the masking threshold and then using the modified masking threshold to determine the set of quantization step size coefficients. 8. A decoder for decoding a set of frequency coefficients representing an audio signal, the decoder comprising:
(a) means for receiving the set of coefficients, the set of frequency coefficients having been encoded by:
(1) converting a time domain representation of the audio signal into a frequency domain representation of the audio signal comprising the set of frequency coefficients;
(2) calculating a masking threshold based upon the set of frequency coefficients;
(3) using a rate loop processor in an iterative fashion to determine a set of quantization step size coefficients needed to encode the set of frequency coefficients, said set of quantization step size coefficients determined by using the masking threshold and an absolute hearing threshold; and
(4) coding the set of frequency coefficients based upon the set of quantization step size coefficients; and
(b) means for converting the set of coefficients to a time domain signal, wherein the masking threshold was modified based on the absolute hearing threshold, and wherein said using the masking threshold and the absolute hearing threshold to determine the set of quantization step size coefficients comprised using the modified masking threshold to determine the set of quantization step size coefficients.
Description This application is a continuation of application Ser. No. 07/844,811, filed on Mar. 2, 1992 now abandoned. Notice: More than one reissue application has been filed for the reissue of U.S. Pat. No. in-part of application Ser. No. 07/844,967, filed on Feb. 28, 1992, now abandoned, which is a continuation of application Ser. No. 07/292,598, filed on Dec. 10, 1988, now abandoned. The following U.S. patent applications filed concurrently with the present application and assigned to the assignee of the present application are related to the present application and each is hereby incorporated herein as if set forth in its entirety: “A METHOD AND APPARATUS FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS,” by A. Ferreira and J. D. Johnston, application Ser. No. 07/844,819, now abandoned, which in turn was parent of application Ser. No. 08/334,889, allowed Jul. 11, 1996; “A METHOD AND APPARATUS FOR CODING AUDIO SIGNALS BASED ON PERCEPTUAL MODEL,” by J. D. Johnston, application Ser. No. 07/844,804, now U.S. Pat. No. 5,285,498, issued Feb. 8, 1994; and “AN ENTROPY CODER,” by J. D. Johnston and J. A. Reeds, application Ser. No. 07/844,809, now U.S. Pat. No. 5,227,788, issued Jul. 13, 1993. The present invention relates to processing of signals, and more particularly, to the efficient encoding and decoding of monophonic and stereophonic audio signals, including signals representative of voice and music for storage or transmission.
Consumer, industrial, studio and laboratory products for storing, processing and communicating high quality audio signals are in great demand. For example, so-called compact disc (“CD”) and digital audio tape (“DAT”) recordings for music have largely replaced the long-popular phonograph record and cassette tape. Likewise, recently available digital audio tape (“DAT”) recordings promise to provide greater flexibility and high storage density for high quality audio signals. See, also, Tan and Vermeulen, “Digital audio tape for data storage”, IEEE Spectrum, pp. 34-38 (October 1989). A demand is also arising for broadcast applications of digital technology that offer CD-like quality. While these emerging digital techniques are capable of producing high quality signals, such performance is often achieved only at the expense of considerable data storage capacity or transmission bandwidth. Accordingly, much work has been done in an attempt to compress high quality audio signals for storage and transmission. Most of the prior work directed to compressing signals for transmission and storage has sought to reduce the redundancies that the source of the signals places on the signal. Thus, such techniques as ADPCM, sub-band coding and transform coding described, e.g., in N. S. Jayant and P. Noll, “Digital Coding of Waveforms,” Prentice-Hall, Inc. 1984, have sought to eliminate redundancies that otherwise would exist in the source signals. In other approaches, the irrelevant information in source signals is sought to be eliminated using techniques based on models of the human perceptual system. Such techniques are described, e.g., in E. F. Schroeder and J. J. Platte, “‘MSC’: Stereo Audio Coding with CD-Quality and 256 kBIT/SEC,” IEEE Trans. on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, IEEE J.S.C.A. (February 1988).
Perceptual coding, as described, e.g., in the Johnston paper, relates to a technique for lowering required bitrates (or reapportioning available bits) or total number of bits in representing audio signals. In this form of coding, a masking threshold for unwanted signals is identified as a function of frequency of the desired signal. Then, inter alia, the coarseness of quantizing used to represent a signal component of the desired signal is selected such that the quantizing noise introduced by the coding does not rise above the noise threshold, though it may be quite near this threshold. The introduced noise is therefore masked in the perception process. While traditional signal-to-noise ratios for such perceptually coded signals may be relatively low, the quality of these signals upon decoding, as perceived by a human listener, is nevertheless high. Brandenburg et al. U.S. Pat. No. 5,040,217, issued Aug. 13, 1991, describes a system for efficiently coding and decoding high quality audio signals using such perceptual considerations. In particular, using a measure of the “noise-like” or “tone-like” quality of the input signals, the embodiments described in the latter system provide a very efficient coding for monophonic audio signals. It is, of course, important that the coding techniques used to compress audio signals do not themselves introduce offensive components or artifacts. This is especially important when coding stereophonic audio information where coded information corresponding to one stereo channel, when decoded for reproduction, can interfere or interact with coding information corresponding to the other stereo channel. Implementation choices for coding two stereo channels include so-called “dual mono” coders using two independent coders operating at fixed bit rates.
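As a rough numerical illustration of selecting the coarseness of quantizing so that the noise stays at or below a threshold, the following Python sketch picks a uniform quantizer step size from a noise threshold. It assumes the common approximation that a uniform quantizer with step size d introduces noise power d²/12; that approximation, the function names and the sample values are illustrative assumptions and are not taken from this disclosure.

```python
import math


def step_for_threshold(noise_threshold_power):
    """Largest uniform step whose approximate noise power (step**2 / 12)
    does not exceed the given threshold (assumed approximation)."""
    return math.sqrt(12.0 * noise_threshold_power)


def quantize(value, step):
    """Map a coefficient to an integer quantizer index."""
    return round(value / step)


def dequantize(index, step):
    """Reconstruct the coefficient value from its index."""
    return index * step


# Hypothetical coefficients and per-coefficient noise thresholds.
coefficients = [3.2, -7.9, 40.3]
thresholds = [0.01, 1.0, 25.0]
for value, threshold in zip(coefficients, thresholds):
    step = step_for_threshold(threshold)
    error = value - dequantize(quantize(value, step), step)
    # A higher threshold permits a coarser step, hence fewer distinct levels.
    print(f"threshold={threshold} step={step:.3f} error={error:.3f}")
```

A larger threshold yields a larger step, so components that are strongly masked can be represented with fewer quantizer levels.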
By contrast, “joint mono” coders use two monophonic coders but share one combined bit rate, i.e., the bit rate for the two coders is constrained to be less than or equal to a fixed rate, but trade-offs can be made between the bit rates for individual coders. “Joint stereo” coders are those that attempt to use interchannel properties for the stereo pair for realizing additional coding gain. It has been found that the independent coding of the two channels of a stereo pair, especially at low bit-rates, can lead to a number of undesirable psychoacoustic artifacts. Among them are those related to the localization of coding noise that does not match the localization of the dynamically imaged signal. Thus the human stereophonic perception process appears to add constraints to the encoding process if such mismatched localization is to be avoided. This finding is consistent with reports on binaural masking-level differences that appear to exist, at least for low frequencies, such that noise may be isolated spatially. Such binaural masking-level differences are considered to unmask a noise component that would be masked in a monophonic system. See, for example, B. C. J. Moore, “An Introduction to the Psychology of Hearing, Second Edition,” especially chapter 5, Academic Press, Orlando, Fla., 1982. One technique for reducing psychoacoustic artifacts in the stereophonic context employs the ISO-WG11-MPEG-Audio Psychoacoustic II [ISO] Model. In this model, a second limit of signal-to-noise ratio (“SNR”) is applied to signal-to-noise ratios inside the psychoacoustic model. However, such additional SNR constraints typically require the expenditure of additional channel capacity or (in storage applications) the use of additional storage capacity, at low frequencies, while also degrading the monophonic performance of the coding.
Limitations of the prior art are overcome and a technical advance is made in a method and apparatus for coding a stereo pair of high quality audio channels in accordance with aspects of the present invention. Interchannel redundancy and irrelevancy are exploited to achieve lower bit-rates while maintaining high quality reproduction after decoding. While particularly appropriate to stereophonic coding and decoding, the advantages of the present invention may also be realized in conventional dual monophonic stereo coders. An illustrative embodiment of the present invention employs a filter bank architecture using a Modified Discrete Cosine Transform (MDCT). In order to code the full range of signals that may be presented to the system, the illustrative embodiment advantageously uses both L/R (Left and Right) and M/S (Sum/Difference) coding, switched in both frequency and time in a signal dependent fashion. A new stereophonic noise masking model advantageously detects and avoids binaural artifacts in the coded stereophonic signal. Interchannel redundancy is exploited to provide enhanced compression without degrading audio quality. The time behavior of both Right and Left audio channels is advantageously accurately monitored and the results used to control the temporal resolution of the coding process. Thus, in one aspect, an illustrative embodiment of the present invention provides processing of input signals in terms of either a normal MDCT window, or, when signal conditions indicate, shorter windows. Further, dynamic switching between RIGHT/LEFT or SUM/DIFFERENCE coding modes is provided both in time and frequency to control unwanted binaural noise localization, to prevent the need for overcoding of SUM/DIFFERENCE signals, and to maximize the global coding gain. A typical bitstream definition and rate control loop are described which provide useful flexibility in forming the coder output.
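The SUM/DIFFERENCE representation mentioned above can be illustrated with a short Python sketch. The 1/2 normalization and the function names are assumptions made for the example; the text here states only that M/S coding uses sum and difference signals.

```python
def lr_to_ms(left, right):
    """Form sum (M) and difference (S) signals from L and R.
    The 1/2 scaling is an assumed normalization."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_to_lr(mid, side):
    """Invert lr_to_ms: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical, nearly identical channels: the S signal is small,
# which is the interchannel redundancy that M/S coding can exploit.
left = [1.0, 0.5, -2.0, 0.75]
right = [1.0, 0.25, -2.0, 0.5]
mid, side = lr_to_ms(left, right)
print(mid, side)
print(ms_to_lr(mid, side))
```

When the two channels differ strongly, the S signal is no smaller than L or R, which is one reason a coder might choose between the two representations per frequency region and per segment.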
Interchannel irrelevancies are advantageously eliminated and stereophonic noise masking improved, thereby to achieve improved reproduced audio quality in jointly coded stereophonic pairs. The rate control method used in an illustrative embodiment uses an interpolation between absolute threshold and masking threshold for signals below the rate-limit of the coder, and a threshold elevation strategy under rate-limited conditions. In accordance with an overall coder/decoder system aspect of the present invention, it proves advantageous to employ an improved Huffman-like entropy coder/decoder to further reduce the channel bit rate requirements, or storage capacity for storage applications. The noiseless compression method illustratively used employs Huffman coding along with a frequency-partitioning scheme to efficiently code the frequency samples for L, R, M and S, as may be dictated by the perceptual threshold. The present invention provides a mechanism for determining the scale factors to be used in quantizing the audio signal (i.e., the MDCT coefficients output from the analysis filter bank) by using an approach different from the prior art, while avoiding many of the restrictions and costs of prior quantizer/rate-loops. The audio signals quantized pursuant to the present invention introduce less noise and encode into fewer bits than the prior art. These results are obtained in an illustrative embodiment of the present invention whereby the utilized scale factor is iteratively derived by interpolating between a scale factor derived from a calculated threshold of hearing at the frequency of the respective spectral coefficient to be quantized and a scale factor derived from the absolute threshold of hearing at said frequency, until the quantized spectral coefficients can be encoded within permissible limits.
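The iterative interpolation described above can be sketched as a bisection over an interpolation weight. Everything specific in this Python sketch is an assumption made for illustration: the linear interpolation of step sizes, the bisection search, the crude `estimate_bits` cost (a stand-in for the entropy coder) and all sample values. The threshold elevation strategy for rate-limited conditions is not sketched.

```python
def estimate_bits(coeffs, steps):
    """Hypothetical stand-in for the entropy coder's bit count."""
    total = 0
    for value, step in zip(coeffs, steps):
        index = round(value / step)
        total += 1 + 2 * abs(index).bit_length()
    return total


def rate_loop(coeffs, steps_masking, steps_absolute, bit_budget, iterations=16):
    """Bisect alpha in [0, 1]; alpha=1 uses the steps derived from the
    masking threshold, alpha=0 those from the absolute threshold.
    Returns (alpha, steps), or None when even alpha=1 exceeds the budget."""
    def steps_at(alpha):
        return [alpha * m + (1.0 - alpha) * a
                for m, a in zip(steps_masking, steps_absolute)]

    if estimate_bits(coeffs, steps_at(1.0)) > bit_budget:
        return None  # rate-limited case: threshold elevation would apply
    if estimate_bits(coeffs, steps_at(0.0)) <= bit_budget:
        return 0.0, steps_at(0.0)
    low, high = 0.0, 1.0  # invariant: steps_at(high) fits the budget
    for _ in range(iterations):
        middle = (low + high) / 2.0
        if estimate_bits(coeffs, steps_at(middle)) <= bit_budget:
            high = middle
        else:
            low = middle
    return high, steps_at(high)


# Hypothetical coefficients and per-coefficient step sizes.
coeffs = [100.0, -40.0, 12.5, 3.0]
steps_masking = [8.0, 4.0, 2.0, 1.0]
steps_absolute = [0.5, 0.5, 0.25, 0.25]
print(rate_loop(coeffs, steps_masking, steps_absolute, bit_budget=40))
```

The loop keeps the finest interpolated step sizes it has found that still fit the budget, so spare bits are spent on lowering the quantization noise below the masking threshold.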
Overview To simplify the present disclosure, the following patents, patent applications and publications are hereby incorporated by reference in the present disclosure as if fully set forth herein: U.S. Pat. No. 5,040,217, issued Aug. 13, 1991 by K. Brandenburg et al.; U.S. patent application Ser. No. 07/292,598, entitled Perceptual Coding of Audio Signals, filed Dec. 30, 1988; J. D. Johnston, Transform Coding of Audio Signals Using Perceptual Noise Criteria, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2 (February 1988); International Patent Application (PCT) WO 88/01811, filed Mar. 10, 1988; U.S. patent application Ser. No. 07/491,373, entitled Hybrid Perceptual Coding, filed Mar. 9, 1990; Brandenburg et al., Aspec: Adaptive Spectral Entropy Coding of High Quality Music Signals, AES 90th Convention (1991); Johnston, J., Estimation of Perceptual Entropy Using Noise Masking Criteria, ICASSP (1988); J. D. Johnston, Perceptual Transform Coding of Wideband Stereo Signals, ICASSP (1989); E. F. Schroeder and J. J. Platte, “‘MSC’: Stereo Audio Coding with CD-Quality and 256 kBIT/SEC,” IEEE Trans. on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform Coding of Audio Signals Using Noise Criteria, Vol. 6, No. 2, IEEE J.S.C.A. (February 1988). For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual functional blocks (including functional blocks labeled as “processors”). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. (Use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, and software performing the operations discussed below.
Very large scale integration (VLSI) hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided. An illustrative embodiment of the perceptual audio coder The filter bank Features of the MDCT that make it useful in the present context include its critical sampling characteristic, i.e., for every n samples into the filter bank, n samples are obtained from the filter bank. Additionally, the MDCT typically provides half-overlap, i.e., the transform length is exactly twice the number of samples, n, shifted into the filter bank. The half-overlap provides a good method of dealing with the control of noise injected independently into each filter tap as well as providing a good analysis window frequency response. In addition, in the absence of quantization, the MDCT provides exact reconstruction of the input samples, subject only to a delay of an integral number of samples. One aspect in which the MDCT is advantageously modified for use in connection with a highly efficient stereophonic audio coder is the provision of the ability to switch the length of the analysis window for signal sections which have strongly non-stationary components in such a fashion that it retains the critically sampled and exact reconstruction properties. The incorporated U.S. patent application by Ferreira and Johnston, entitled “A METHOD AND APPARATUS FOR THE PERCEPTUAL CODING OF AUDIO SIGNALS,” (referred to hereinafter as the “filter bank application”) filed of even date with this application, describes a filter bank appropriate for performing the functions of element The perceptual model processor The psychoacoustic analysis performed in the perceptual model processor In operation, an illustrative embodiment of the perceptual model processor The same threshold calculation used for L and R thresholds is also used for M and S thresholds, with the threshold calculated on the actual M and S signals.
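The critical sampling, half-overlap and exact reconstruction properties noted above can be checked numerically with a small, direct (O(N²)) MDCT. The Python sketch below is a generic textbook formulation, assuming a sine window that satisfies the Princen-Bradley condition and a 2/N synthesis scaling; it is not the filter bank of the incorporated filter bank application, and the function names are hypothetical.

```python
import math


def sine_window(length):
    """Sine window; w[n]**2 + w[n + length/2]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(frame):
    """2N windowed samples in, N coefficients out."""
    n = len(frame) // 2
    return [sum(frame[t] * math.cos(math.pi / n * (t + 0.5 + n / 2.0) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]


def imdct(coeffs):
    """N coefficients in, 2N time-aliased samples out."""
    n = len(coeffs)
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2.0) * (k + 0.5))
                          for k in range(n))
            for t in range(2 * n)]


def analysis_synthesis(signal, hop):
    """Window, transform, inverse transform, window again and overlap-add."""
    window = sine_window(2 * hop)
    output = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * hop + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(2 * hop)]
        coeffs = mdct(frame)  # hop coefficients per hop new samples
        recon = imdct(coeffs)
        for i in range(2 * hop):
            output[start + i] += recon[i] * window[i]
    return output


signal = [math.sin(0.3 * i) + 0.1 * i for i in range(64)]
output = analysis_synthesis(signal, hop=8)
# Samples covered by two overlapping frames are recovered to rounding error.
print(max(abs(output[i] - signal[i]) for i in range(8, 56)))
```

Each frame of 2N samples produces only N coefficients, and the aliasing introduced by one frame is cancelled by its neighbors during overlap-add; the first and last N samples are covered by a single frame and are therefore not reconstructed in this sketch.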
First, the basic thresholds, denoted BTHR 1. An additional factor is calculated for each of the M and S thresholds. This factor, called MLD 2. The actual threshold for M (THR In effect, the MLD signal substitutes for the BTHR signal in cases where there is a chance of stereo unmasking. It is not necessary to consider the issue of M and S threshold depression due to unequal L and R thresholds, because the L and R thresholds are known to be equal. The quantizer/rate loop processor Entropy encoder Illustrative entropy encoder The use of each of the elements shown in 2.1. The Analysis Filter Bank The analysis filter bank An illustrative analysis filter bank The analysis filter bank Input signal buffer Assuming that, at a given time, the input signal buffer Each frame of the input audio signal is provided to the window multiplier Each data window is a vector of scalars called “coefficients”. While all seven of the data windows have 2N coefficients (i.e., the same number as there are audio signal samples in the frame), four of the seven only have N/2 non-zero coefficients (i.e., one-fourth the number of audio signal samples in the frame). As is discussed below, the data window coefficients may be advantageously chosen to reduce the perceptual entropy of the output of the MDCT processor The information for the data window coefficients is stored in the window memory Keeping in mind that the data window is a vector of 2N scalars and that the audio signal frame is also a vector of 2N scalars, the data window coefficients are applied to the audio signal frame scalars through point-to-point multiplication (i.e., the first audio signal frame scalar is multiplied by the first data window coefficient, the second audio signal frame scalar is multiplied by the second data window coefficient, etc.).
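The point-to-point multiplication described above amounts to an element-wise product of two vectors of length 2N. In the Python sketch below, the window shape, the placement of the non-zero coefficients and the function names are hypothetical; only the vector lengths (2N coefficients, of which N/2 are non-zero for a short window) follow the text.

```python
import math


def apply_window(frame, window):
    """Multiply each frame sample by the window coefficient at the same index."""
    if len(frame) != len(window):
        raise ValueError("frame and window must both hold 2N values")
    return [sample * coeff for sample, coeff in zip(frame, window)]


def short_window(n, offset):
    """Hypothetical 2N-long data window with only N/2 non-zero,
    sine-shaped coefficients starting at index `offset`."""
    length = n // 2
    window = [0.0] * (2 * n)
    for i in range(length):
        window[offset + i] = math.sin(math.pi * (i + 0.5) / length)
    return window


n = 8
frame = [float(i + 1) for i in range(2 * n)]  # 2N audio samples
window = short_window(n, offset=6)
windowed = apply_window(frame, window)
print(sum(1 for coeff in window if coeff != 0.0), "non-zero coefficients")
print(windowed)
```

Because all windows share the length 2N, the same multiplier can produce every windowed frame vector; the windows with few non-zero coefficients simply zero out most of the frame.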
Window multiplier The seven windowed frame vectors are provided by window multiplier FFT processor MDCT processor As discussed above with reference to window multiplier Delay memory Data selector For purposes of an illustrative stereo embodiment, the above analysis filterbank 2.2. The Perceptual Model Processor A perceptual coder achieves success in reducing the number of bits required to accurately represent high quality audio signals, in part, by introducing noise associated with quantization of information bearing signals, such as the MDCT information from the filter bank The perceptual model processor In order to mask the quantization noise by the signal, one must consider the spectral contents of the signal and the duration of a particular spectral pattern of the signal. These two aspects are related to masking in the frequency domain, where signal and noise are approximately steady state (given the integration period of the hearing system), and also to masking in the time domain, where signal and noise are subjected to different cochlear filters. The shape and length of these filters are frequency dependent. Masking in the frequency domain is described by the concept of simultaneous masking. Masking in the time domain is characterized by the concepts of premasking and postmasking. These concepts are extensively explained in the literature; see, for example, E. Zwicker and H. Fastl, “Psychoacoustics, Facts, and Models,” Springer-Verlag, 1990. To make these concepts useful to perceptual coding, they are embodied in different ways. Simultaneous masking is evaluated by using perceptual noise shaping models. Given the spectral contents of the signal and its description in terms of noise-like or tone-like behavior, these models produce a hypothetical masking threshold that rules the quantization level of each spectral component.
This noise shaping represents the maximum amount of noise that may be introduced in the original signal without causing any perceptible difference. A measure called the PERCEPTUAL ENTROPY (PE) uses this hypothetical masking threshold to estimate the theoretical lower bound of the bitrate for transparent encoding. See J. D. Johnston, “Estimation of Perceptual Entropy Using Noise Masking Criteria,” ICASSP, 1988. Premasking characterizes the (in)audibility of a noise that starts some time before the masker signal, which is louder than the noise. The noise amplitude must be more attenuated as the delay increases. This attenuation level is also frequency dependent. If the noise is the quantization noise attenuated by the first half of the synthesis window, experimental evidence indicates the maximum acceptable delay to be about 1 millisecond. This problem is very sensitive and can conflict directly with achieving a good coding gain. Assuming stationary conditions (which is a false premise), the coding gain is bigger for larger transforms, but the quantization error spreads back to the beginning of the reconstructed time segment. So, if a transform length of 1024 points is used, with a digital signal sampled at a rate of 48000 Hz, the noise will appear at most 21 milliseconds before the signal. This scenario is particularly critical when the signal takes the form of a sharp transient in the time domain commonly known as an “attack”. In this case the quantization noise is audible before the attack. The effect is known as pre-echo. Thus, a fixed length filter bank is neither a good perceptual solution nor a good signal processing solution for non-stationary regions of the signal. It will be shown later that a possible way to circumvent this problem is to improve the temporal resolution of the coder by reducing the analysis/synthesis window length. This is implemented as a window switching mechanism when conditions of attack are detected.
In this way, the coding gain achieved by using a long analysis/synthesis window will be affected only when such detection occurs with a consequent need to switch to a shorter analysis/synthesis window. Postmasking characterizes the (in)audibility of a noise when it remains after the cessation of a stronger masker signal. In this case the acceptable delays are in the order of 20 milliseconds. Given that the longer transform time segment lasts 21 milliseconds (1024 samples), no special care is needed to handle this situation. WINDOW SWITCHING The PERCEPTUAL ENTROPY (PE) measure of a particular transform segment gives the theoretical lower bound of bits/sample to code that segment transparently. Due to its memory properties, which are related to premasking protection, this measure shows a significant increase of the PE value relative to its previous value (that of the previous segment) when some situations of strong non-stationarity of the signal (e.g. an attack) are presented. This important property is used to activate the window switching mechanism in order to reduce pre-echo. This window switching mechanism is not a new strategy, having been used, e.g., in the ASPEC coder, described in the ISO/MPEG Audio Coding Report, 1990, but the decision technique behind it is new, using the PE information to accurately localize the non-stationarity and define the right moment to operate the switch. Two basic window lengths are used: 1024 samples and 256 samples. The former corresponds to a segment duration of about 21 milliseconds and the latter to a segment duration of about 5 milliseconds. Short windows are associated in sets of 4 to represent as much spectral data as a large window (but they represent a “different” number of temporal samples). In order to make the transition from large to short windows and vice-versa it proves convenient to use two more types of windows.
A START window makes the transition from large (regular) to short windows and a STOP window makes the opposite transition, as shown in FIG. In order to exploit interchannel redundancy and irrelevancy, the same type of window is used for RIGHT and LEFT channels in each segment. The stationarity behavior of the signal is monitored at two levels: first by large regular windows, then, if necessary, by short windows. Accordingly, the PE of the large (regular) window is calculated for every segment while the PEs of short windows are calculated only when needed. However, the tonality information for both types is updated for every segment in order to follow the continuous variation of the signal. Unless stated otherwise, a segment involves 1024 samples which is the length of a large regular window. The diagram of The process begins by analysing a “new” segment with 512 new temporal samples (the remaining 512 samples belong to the previous segment). As shown in It has been observed that for short windows the information about stationarity lies more in the PE value itself than in its difference from the PE value of the preceding window. Accordingly, the first window that has a PE value larger than a predefined threshold is detected. PE If none of the short PEs is above the threshold, the remaining possibilities are L To identify the correct location, another short window must be processed. It is represented in As mentioned before for each segment, RIGHT and LEFT channels use the same type of analysis/synthesis window. This means that the switch is made for both channels when at least one channel requires it. It has been observed that for low bitrate applications the solution of It is also evident that the details of the reconstructed signal when short windows are used are closer to the original signal than when only regular large windows are used. This is so because the attack is basically a wide bandwidth signal and may only be considered stationary for very short periods of time.
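The window durations quoted above and the two-level PE monitoring can be summarized in a brief Python sketch. The use of a simple difference for the large-window test, the threshold values and all function names are assumptions for illustration; the text states only that a significant PE increase triggers the analysis, that the first short window whose PE exceeds a predefined threshold is detected, and that both channels switch when either requires it.

```python
def segment_ms(samples, sample_rate_hz=48000):
    """Duration of a window of `samples` samples in milliseconds."""
    return 1000.0 * samples / sample_rate_hz


def large_window_flags_attack(pe_previous, pe_current, jump_threshold):
    """Level 1: a significant PE increase over the previous segment
    (modelled here, as an assumption, by a plain difference)."""
    return (pe_current - pe_previous) > jump_threshold


def first_short_window_above(short_pes, short_threshold):
    """Level 2: index of the first short window whose PE exceeds the
    predefined threshold, or None when no short window does."""
    for index, pe in enumerate(short_pes):
        if pe > short_threshold:
            return index
    return None


def stereo_switch(left_requires, right_requires):
    """RIGHT and LEFT share one window type: switch if either requires it."""
    return left_requires or right_requires


print(round(segment_ms(1024), 2), round(segment_ms(256), 2))  # about 21 ms and 5 ms
print(4 * 256 == 1024)  # four short windows span as many spectral values as one large window

# Hypothetical PE values for one channel.
if large_window_flags_attack(pe_previous=1.0, pe_current=3.5, jump_threshold=2.0):
    print(first_short_window_above([0.5, 0.7, 2.4, 3.0], short_threshold=2.0))
```

Computing the short-window PEs only after the large-window test fires keeps the common, stationary case inexpensive.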
Since short windows have a greater temporal resolution than large windows, they are able to follow and reproduce with more fidelity the varying pattern of the spectrum. In other words, this is the difference between a more precise local (in time) quantization of the signal and a global (in frequency) quantization of the signal. The final masking threshold of the stereophonic coder is calculated using a combination of monophonic and stereophonic thresholds. While the monophonic threshold is computed independently for each channel, the stereophonic one considers both channels. The independent masking threshold for the RIGHT or the LEFT channel is computed using a psychoacoustic model that includes an expression for tone masking noise and noise masking tone. The latter is used as a conservative approximation for a noise masking noise expression. The monophonic threshold is calculated using the same procedure as in previous work. In particular, a tonality measure considers the evolution of the power and the phase of each frequency coefficient across the last three segments to identify the signal as being more tone-like or noise-like. Accordingly, each psychoacoustic expression is more or less weighted than the other. These expressions found in the literature were updated for better performance. They are defined as:
A brief description of the complete monophonic threshold calculation follows. Some terminology must be introduced in order to simplify the description of the operations involved. The spectrum of each segment is organized in three different ways, each one serving a different purpose.
- 1. First, it may be organized in partitions. Each partition is associated with a single Bark value. These partitions provide a resolution of approximately either one MDCT line or ⅓ of a critical band, whichever is wider. At low frequencies a single line of the MDCT will constitute a coder partition. At high frequencies, many lines will be combined into one coder partition. In this case the associated Bark value is the median Bark point of the partition. This partitioning of the spectrum is necessary to ensure an acceptable resolution for the spreading function. As will be shown later, this function represents the masking influence among neighboring critical bands.
- 2. Secondly, the spectrum may be organized in bands. Bands are defined by a parameter file. Each band groups a number of spectral lines that are associated with a single scale factor that results from the final masking threshold vector.
- 3. Finally, the spectrum may be organized in sections. It will be shown later that sections involve an integer number of bands and represent a region of the spectrum coded with the same Huffman code book.
Three indices for data values are used. These are:
ω → indicates that the calculation is indexed by frequency in the MDCT line domain.
b → indicates that the calculation is indexed in the threshold calculation partition domain. In the case where we do a convolution or sum in that domain, bb will be used as the summation variable.
n → indicates that the calculation is indexed in the coder band domain.
Additionally, some symbols are also used:
- 1. The index of the calculation partition, b.
- 2. The lowest frequency line in the partition, ωlow_{b}.
- 3. The highest frequency line in the partition, ωhigh_{b}.
- 4. The median bark value of the partition, bval_{b}.
- 5. The value for tone masking noise (in dB) for the partition, TMN_{b}.
- 6. The value for noise masking tone (in dB) for the partition, NMT_{b}.
Several points in the following description refer to the “spreading function”. It is calculated by the following method:
The following steps are necessary for calculating the SMR:
- 1. Concatenate 512 new samples of the input signal to form another 1024-sample segment. Please refer to FIG. 5a.
- 2. Calculate the complex spectrum of the input signal using the O-FFT as described in 2.0 and using a sine window.
- 3. Calculate a predicted r and φ.
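The polar representation of the transform is calculated, giving the magnitude r_{ω} and phase φ_{ω} of each spectral line. A predicted magnitude, $\hat{r}_{\omega}$, and a predicted phase, $\hat{\varphi}_{\omega}$, are then calculated for each line.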
- 4. Calculate the unpredictability measure c_{ω}.
c_{ω}, the unpredictability measure, is:
$c_{\omega}=\frac{\left(\left(r_{\omega}\cos\varphi_{\omega}-\hat{r}_{\omega}\cos\hat{\varphi}_{\omega}\right)^{2}+\left(r_{\omega}\sin\varphi_{\omega}-\hat{r}_{\omega}\sin\hat{\varphi}_{\omega}\right)^{2}\right)^{0.5}}{r_{\omega}+\mathrm{abs}\left(\hat{r}_{\omega}\right)}$
- 5. Calculate the energy and unpredictability in the threshold calculation partitions.
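As an illustration of step 4, the following minimal Python sketch evaluates the unpredictability measure for a single spectral line. It reads the numerator as a Euclidean distance (exponent 0.5) between the actual and predicted complex values. The function name and the handling of an all-zero line are assumptions made for this example and are not taken from the patent text.

```python
import math


def unpredictability(r, phi, r_hat, phi_hat):
    """Unpredictability c_w of one spectral line.

    r, phi         -- actual magnitude and phase of the line
    r_hat, phi_hat -- predicted magnitude and phase of the line
    """
    # Distance between the actual and the predicted complex value.
    dx = r * math.cos(phi) - r_hat * math.cos(phi_hat)
    dy = r * math.sin(phi) - r_hat * math.sin(phi_hat)
    denom = r + abs(r_hat)
    if denom == 0.0:
        # Assumption: a silent line is treated as perfectly predictable.
        return 0.0
    return math.hypot(dx, dy) / denom


# A perfectly predicted line gives 0; a line predicted with opposite
# phase gives a value close to 1.
print(unpredictability(1.0, 0.3, 1.0, 0.3))
print(unpredictability(1.0, 0.0, 1.0, math.pi))
```

With this normalization the measure stays between 0 (fully predictable, tone-like) and 1 (unpredictable, noise-like).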
The energy in each partition, e -
- 6. Convolve the partitioned energy and unpredictability with the spreading function.
$ecb_{b}=\sum_{bb=1}^{bmax}e_{bb}\,\mathrm{sprdngf}\left(bval_{bb},bval_{b}\right)$
$ct_{b}=\sum_{bb=1}^{bmax}c_{bb}\,\mathrm{sprdngf}\left(bval_{bb},bval_{b}\right)$
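The convolution of step 6 can be sketched in a few lines of Python. The spreading function is passed in as a parameter. The `toy_sprdngf` used here is a hypothetical triangular placeholder chosen only so that the example runs, and it is not the spreading function of the patent.

```python
def spread(values, bval, sprdngf):
    """For each partition b, return sum over bb of values[bb] * sprdngf(bval[bb], bval[b])."""
    n = len(values)
    return [
        sum(values[bb] * sprdngf(bval[bb], bval[b]) for bb in range(n))
        for b in range(n)
    ]


def toy_sprdngf(bark_from, bark_to):
    # Hypothetical placeholder: linear falloff over one Bark.
    return max(0.0, 1.0 - abs(bark_to - bark_from))


e = [1.0, 0.0, 4.0]      # partitioned energy e_b
bval = [0.0, 0.5, 1.0]   # median Bark value of each partition
ecb = spread(e, bval, toy_sprdngf)
print(ecb)  # [1.0, 2.5, 4.0]
```

The same `spread` call applied to the partitioned unpredictability values yields ct_{b}.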
Because ct The normalization coefficient, rnorm -
- 7. Convert cb_{b} to tb_{b}.
$tb_{b}=-0.299-0.43\log_{e}\left(cb_{b}\right)$
Each tb -
- 8. Calculate the required SNR in each partition.
$TMN_{b}=19.5+bval_{b}\frac{18.0}{26.0}$
$NMT_{b}=6.56-bval_{b}\frac{3.06}{26.0}$
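Steps 7 and 8 can be illustrated with the short Python sketch below. The expressions for tb_{b}, TMN_{b} and NMT_{b} follow the formulas given above. Two parts are assumptions made for this example: clamping the tonality index to the range 0 to 1, and combining TMN_{b} and NMT_{b} as a tonality-weighted blend, which is one plausible reading of the statement that each psychoacoustic expression is weighted according to the tonality measure.

```python
import math


def tonality_index(cb):
    """tb_b from the normalized unpredictability cb_b (requires cb > 0)."""
    tb = -0.299 - 0.43 * math.log(cb)
    # Assumption: limit the index to [0, 1] so it can act as a weight.
    return min(1.0, max(0.0, tb))


def tmn_db(bval):
    """Tone-masking-noise value in dB for a partition with Bark value bval."""
    return 19.5 + bval * 18.0 / 26.0


def nmt_db(bval):
    """Noise-masking-tone value in dB for a partition with Bark value bval."""
    return 6.56 - bval * 3.06 / 26.0


def required_snr_db(cb, bval):
    # Assumption: tonality-weighted blend of the two masking expressions.
    tb = tonality_index(cb)
    return tb * tmn_db(bval) + (1.0 - tb) * nmt_db(bval)


print(tmn_db(0.0), nmt_db(0.0))    # 19.5 6.56
print(required_snr_db(1.0, 0.0))   # noise-like partition: 6.56
```

A highly unpredictable partition (cb_{b} near 1) ends up with the lower NMT value, while a predictable, tone-like partition moves toward the higher TMN value.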
Where TMN The required signal to noise ratio, SNR -
- 9. Calculate the power ratio.
The power ratio, bc -
- 10. Calculate the actual energy threshold, nb_{b}.
$nb_{b}=en_{b}\,bc_{b}$
- 11. Spread the threshold energy over MDCT lines, yielding nb_{ω}.
$nb_{\omega}=\frac{nb_{b}}{\omega high_{b}-\omega low_{b}+1}$
- 12. Include absolute thresholds, yielding the final energy threshold of audibility, thr_{ω}.
$thr_{\omega}=\max\left(nb_{\omega},absthr_{\omega}\right)$
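Steps 10 to 12 map directly onto a small loop. The Python sketch below is only an illustration: the function and argument names are chosen for this example, line indices are assumed to start at 0, and all inputs are assumed to be already expressed in the energy domain.

```python
def line_thresholds(en, bc, w_low, w_high, absthr):
    """Final per-line threshold thr_w from per-partition values.

    en, bc        -- normalized energy and power ratio per partition
    w_low, w_high -- lowest and highest MDCT line of each partition (inclusive)
    absthr        -- absolute threshold per MDCT line (energy domain)
    """
    thr = list(absthr)
    for b in range(len(en)):
        nb_b = en[b] * bc[b]                # step 10: threshold of the partition
        width = w_high[b] - w_low[b] + 1
        nb_w = nb_b / width                 # step 11: spread evenly over the lines
        for w in range(w_low[b], w_high[b] + 1):
            thr[w] = max(nb_w, absthr[w])   # step 12: include absolute threshold
    return thr


thr = line_thresholds(
    en=[8.0, 6.0], bc=[0.5, 0.5],
    w_low=[0, 2], w_high=[1, 4],
    absthr=[1.0, 3.0, 0.5, 0.5, 2.0],
)
print(thr)  # [2.0, 3.0, 1.0, 1.0, 2.0]
```

In the example, lines 1 and 4 are governed by the absolute threshold, and the remaining lines by the spread masking threshold.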
The dB values of absthr shown in the “Absolute Threshold Tables” are relative to the level that a sine wave of ±½ lsb has in the MDCT used for threshold calculation. The dB values must be converted into the energy domain after considering the MDCT normalization actually used.
- 13. Pre-echo control.
- 14. Calculate the signal to mask ratios, SMR_{n}.
The table of “Bands of the Coder” shows:
- 1. The index, n, of the band.
- 2. The upper index, ωhigh_{n}, of the band n. The lower index, ωlow_{n}, is computed from the previous band as ωhigh_{n−1}+1.
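Because only the upper index of each band is tabulated, the lower index has to be derived from the previous band. The Python sketch below shows this derivation. The upper indices in the example and the `first_line` parameter (the index of the first spectral line) are hypothetical values, not entries from the patent's table.

```python
def band_edges(w_high, first_line=0):
    """Return (w_low, w_high) per band, with w_low[n] = w_high[n-1] + 1."""
    edges = []
    low = first_line  # assumption: the first band starts at this line
    for high in w_high:
        edges.append((low, high))
        low = high + 1
    return edges


print(band_edges([3, 7, 15]))  # [(0, 3), (4, 7), (8, 15)]
```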
To further classify each band, another variable is created. The width index, width Then, if (width Where, in this case, minimum(a, . . . ,z) is a function returning the most negative or smallest positive argument of the arguments a . . . z. The ratios to be sent to the decoder, SMR It is important to emphasize that since the tonality measure is the output of a spectrum analysis process, the analysis window has a sine form in all cases, for large or short segments. In particular, when a segment is chosen to be coded as a START or STOP window, its tonality information is obtained considering a sine window; the remaining operations, e.g. the threshold calculation and the quantization of the coefficients, consider the spectrum obtained with the appropriate window.
STEREOPHONIC THRESHOLD
The stereophonic threshold has several goals. It is known that most of the time the two channels sound “alike”. Thus, some correlation exists that may be converted into coding gain. Looking into the temporal representation of the two channels, this correlation is not obvious. However, the spectral representation has a number of interesting features that may advantageously be exploited. In fact, a very practical and useful possibility is to create a new basis to represent the two channels. This basis involves two orthogonal vectors, the vector SUM and the vector DIFFERENCE, defined by the following linear combination:
These vectors, which have the length of the window being used, are generated in the frequency domain since the transform process is by definition a linear operation. This has the advantage of simplifying the computational load.
The first goal is to have a more decorrelated representation of the two signals. The concentration of most of the energy in one of these new channels is a consequence of the redundancy that exists between the RIGHT and LEFT channels and, on average, always leads to a coding gain.
A second goal is to correlate the quantization noise of the RIGHT and LEFT channels and control the localization of the noise or the unmasking effect. This problem arises if the RIGHT and LEFT channels are quantized and coded independently. This concept is exemplified by the following context: supposing that the threshold of masking for a particular signal has been calculated, two situations may be created. First, we add to the signal an amount of noise that corresponds to the threshold. If we present this same signal with this same noise to the two ears, then the noise is masked. However, if we add an amount of noise that corresponds to the threshold to the signal and present this combination to one ear, and do the same operation for the other ear but with noise uncorrelated with the previous one, then the noise is not masked. In order to achieve masking again, the noise at both ears must be reduced by a level given by the masking level difference (MLD).
The unmasking problem may be generalized to the following form: the quantization noise is not masked if it does not follow the localization of the masking signal. Hence, in particular, we may have two limit cases: center localization of the signal, with unmasking more noticeable on the sides of the listener, and side localization of the signal, with unmasking more noticeable on the center line.
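The SUM and DIFFERENCE basis discussed above can be illustrated with a small Python sketch that operates on frequency coefficients. The ½ scaling and the LEFT minus RIGHT sign convention are assumptions made for this example, since the defining linear combination is not reproduced in this text. Any invertible sum/difference pair would show the same effect.

```python
def to_sum_diff(left, right):
    """Map LEFT/RIGHT coefficients to SUM/DIFFERENCE (assumed scaling of 1/2)."""
    s = [(l + r) / 2.0 for l, r in zip(left, right)]
    d = [(l - r) / 2.0 for l, r in zip(left, right)]
    return s, d


def to_left_right(s, d):
    """Inverse mapping, recovering LEFT and RIGHT from SUM and DIFFERENCE."""
    left = [a + b for a, b in zip(s, d)]
    right = [a - b for a, b in zip(s, d)]
    return left, right


# Two similar channels: most of the energy ends up in SUM.
left = [1.0, 2.0, -3.0]
right = [1.0, 1.5, -3.5]
s, d = to_sum_diff(left, right)
print(s)  # [1.0, 1.75, -3.25]
print(d)  # [0.0, 0.25, 0.25]
print(to_left_right(s, d) == (left, right))  # True
```

Because the mapping is linear, applying it to the transform coefficients gives the same result as transforming the sum and difference of the time signals, which is why the vectors can be generated directly in the frequency domain.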
The new vectors SUM and DIFFERENCE are very convenient because they express the signal localized on the center and also on both sides of the listener. They also make it possible to control the quantization noise with center and side image. Thus, the unmasking problem is solved by controlling the protection level for the MLD through these vectors. Based on some psychoacoustic information and other experiments and results, the MLD protection is particularly critical from very low frequencies up to about 3 kHz. It appears to depend only on the signal power and not on its tonality properties. The following expression for the MLD proved to give good results:
C(i) is the spread signal energy on the basilar membrane, corresponding only to the partition i.
A third and last goal is to take advantage of a particular stereophonic signal image to extract irrelevance from directions of the signal that are masked by that image. In principle, this is done only when the stereo image is strongly defined in one direction, in order to not compromise the richness of the stereo signal. Based on the vectors SUM and DIFFERENCE, this goal is implemented by postulating the following two dual principles:
- 1. If there is a strong depression of the signal (and hence of the noise) on both sides of the listener, then an increase of the noise on the middle line (center image) is perceptually tolerated. The upper bound is the side noise.
- 2. If there is a strong localization of the signal (and hence of the noise) on the middle line, then an increase of the (correlated) noise on both sides is perceptually tolerated. The upper bound is the center noise.
However, any increase of the noise level must be corrected by the MLD threshold. According to these goals, the final stereophonic threshold is computed as follows. First, the thresholds for channels SUM and DIFFERENCE are calculated using the monophonic models for noise-masking-tone and tone-masking-noise. The procedure is exactly the one presented in pages 25 and 26. At this point we have the actual energy threshold per band, nb Secondly, the MLD threshold for both channels i.e. THRn After these operations, the remaining steps after the 11th, as presented in 3.2, are also taken for both channels. In essence, these last thresholds are further adjusted to consider the absolute threshold and also a partial premasking protection. It must be noted that this premasking protection was simply adopted from the monophonic case. It considers a monaural time resolution of about 2 milliseconds. However, the binaural time resolution is as accurate as 6 microseconds! How to conveniently code stereo signals with a relevant stereo image based on interchannel time differences is a subject that needs further investigation.
STEREOPHONIC CODER
The simplified structure of the stereophonic coder allows for the encoding of the stereo signals, which are subsequently decoded by the stereophonic decoder, which is presented in FIG.
Coding Mode Selection
When a new segment is read, the tonality updating for large and short analysis windows is done. Monophonic thresholds and the PE values are calculated according to the technique described previously. This gives the first decision about the type of window to be used for both channels. Once the window sequence is chosen, an orthogonal coding decision is then considered. It involves the choice between independent coding of the channels, mode RIGHT/LEFT (R/L), or joint coding using the SUM and DIFFERENCE channels (S/D). This decision is taken on a band basis of the coder.
This is based on the assumption that binaural perception is a function of the output of the same critical bands at the two ears. If the thresholds of the two channels are very different, then there is no need for MLD protection, and the signals will not be more decorrelated if the SUM and DIFFERENCE channels are considered. If the signals are such that they generate a stereo image, then an MLD protection must be activated and additional gains may be exploited by choosing the S/D coding mode. A convenient way to detect this latter situation is to compare the monophonic thresholds of the RIGHT and LEFT channels. If the thresholds in a particular band do not differ by more than a predefined value, e.g. 2 dB, then the S/D coding mode is chosen. Otherwise the independent mode R/L is assumed. Associated with each band is a one-bit flag that specifies the coding mode of that band and that must be transmitted to the decoder as side information. From now on it is called a coding mode flag. The coding mode decision is adaptive in time, since for the same band it may differ for subsequent segments, and is also adaptive in frequency, since for the same segment the coding mode for subsequent bands may be different. An illustration of a coding decision is given in FIG. At this point it is clear that since the window switching mechanism involves only monophonic measures, the maximum number of PE measures per segment is 10 (2 channels × [1 large window + 4 short windows]). However, the maximum number of thresholds that may need to be computed per segment is 20, and therefore 20 tonality measures must always be updated per segment (4 channels × [1 large window + 4 short windows]).
Bitrate Adjustment
It was previously said that the decisions for window switching and for coding mode selection are orthogonal in the sense that they do not depend on each other. Also independent of these decisions is the final step of the coding process, which involves quantization, Huffman coding and bitstream composing; i.e. there is no feedback path. This has the advantage of reducing the whole coding delay to a minimum value (1024/48000 = 21.3 milliseconds) and of avoiding instabilities due to unorthodox coding situations.
The quantization process affects both spectral coefficients and scale factors. Spectral coefficients are clustered in bands, each band having the same step size or scale factor. Each step size is directly computed from the masking threshold corresponding to its band. The quantized values, which are integer numbers, are then converted to variable word length or Huffman codes. The total number of bits to code the segment, considering additional fields of the bitstream, is computed. Since the bitrate must be kept constant, the quantization process must be repeated iteratively until that number of bits is within predefined limits. After the number of bits needed to code the whole segment, considering the basic masking threshold, has been computed, the degree of adjustment is dictated by a buffer control unit. This control unit shares the deficit or credit of additional bits among several segments, according to the needs of each one.
The technique of the bitrate adjustment routine is represented by the flowchart of FIG. The main steps of this routine are depicted in FIG. If at this point, neither the basic masking threshold nor absolute thresholds have provided an acceptable bit representation of the whole segment, an iterative procedure (as shown in In order to use the same procedure for segments coded with large and short windows, in the latter case the coefficients of the 4 short windows are clustered by concatenating homologous bands. Scale factors are clustered in the same way. The bitrate adjustment routine ( The spectrum partitioning is done using a minimum cost strategy. The main steps are as follows.
First, all possible sections are defined (the limit is one section per band), each one having the code book that best matches the amplitude distribution of the coefficients within that section. As the beginning and the end of the whole spectrum are known, if K is the number of sections, there are K-1 separators between sections. The price to eliminate each separator is computed. The separator that has the lowest price is eliminated (initial prices may be negative). Prices are computed again before the next iteration. This process is repeated until a maximum allowable number of sections is obtained and the smallest price to eliminate another separator is higher than a predefined value.
Aspects of the processing accomplished by quantizer/rate-loop The inputs to quantizer/rate-loop Quantizer/rate-loop A “utilized scale factor”, Δ, is iteratively derived by interpolating between a calculated scale factor and a scale factor derived from the absolute threshold of hearing at the frequency corresponding to the frequency of the respective spectral coefficient to be quantized, until the quantized spectral coefficients can be encoded within permissible limits. An illustrative embodiment of the present invention can be seen in FIG. α α α=α Next, as shown in Next, as shown in Next, as shown in Advantageously, and depending on the relationship of the cost C to the permissible range PR, the interpolation constant and bounds are adjusted until the utilized scale factor yields a quantized spectral coefficient which has a cost within the permissible range. Illustratively, as shown in The stereophonic decoder has a very simple structure as shown in FIG.
Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, and software performing the operations of the present invention discussed above. Very large scale integration (VLSI) hardware embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, may also be provided.
For example, an AT&T DSP16 may be employed to perform the operations of the rate loop processor depicted in FIG.
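The interpolation performed by the rate loop, in which a utilized scale factor Δ is adjusted until the cost C falls within the permissible range PR, can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not taken from the patent: the interpolation is linear in an interpolation constant `alpha`, the bounds are adjusted by bisection, both step sizes are positive, and `toy_cost` is a hypothetical stand-in for the real Huffman bit count.

```python
def toy_cost(coeffs, delta):
    """Hypothetical bit-cost model (not Huffman coding): one sign bit plus
    the bit length of each quantized magnitude."""
    return sum(1 + int(abs(x) / delta + 0.5).bit_length() for x in coeffs)


def rate_loop(coeffs, delta_calc, delta_abs, pr_low, pr_high,
              cost_fn=toy_cost, max_iter=32):
    """Search for alpha in [0, 1] whose step size gives a cost inside
    [pr_low, pr_high]. Returns (alpha, delta, cost), or None if not found."""
    lo, hi = 0.0, 1.0
    alpha = 1.0  # start from the calculated scale factor
    # True when raising alpha enlarges the step (coarser quantization).
    raising_alpha_coarsens = delta_calc >= delta_abs
    for _ in range(max_iter):
        # Assumed linear interpolation between the two scale factors.
        delta = alpha * delta_calc + (1.0 - alpha) * delta_abs
        cost = cost_fn(coeffs, delta)
        if pr_low <= cost <= pr_high:
            return alpha, delta, cost
        need_coarser = cost > pr_high
        if need_coarser == raising_alpha_coarsens:
            lo = alpha  # a larger alpha is needed
        else:
            hi = alpha  # a smaller alpha is needed
        alpha = 0.5 * (lo + hi)
    return None


coeffs = [100.0, -60.0, 25.0, 8.0]
print(rate_loop(coeffs, delta_calc=8.0, delta_abs=1.0, pr_low=18, pr_high=20))
# (0.5, 4.5, 18)
```

In the example, the calculated step size of 8.0 costs 15 bits under the toy model, which is below the permissible range, so the loop moves toward the finer step derived from the absolute threshold and stops at Δ = 4.5. When no value of `alpha` in the interval yields an acceptable cost, the sketch returns `None`.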