US 8219409 B2 Abstract An encoder/decoder for multi-channel audio data, and in particular for audio reproduction through wave field synthesis. The encoder applies a two-dimensional filter-bank to the multi-channel signal, in which the channel index is treated as an independent variable as well as time, and the resulting spectral coefficients are quantized according to a two-dimensional psychoacoustic model, including masking effects in the spatial frequency as well as in the temporal frequency. The coded spectral data are organized in a bitstream together with side information containing scale factors and Huffman codebook identifiers.
Claims(28) 1. Method for encoding a plurality of audio channels comprising the steps of: applying to said plurality of audio channels a two-dimensional filter-bank along both the time dimension and the channel dimension resulting in two-dimensional spectra; coding said two-dimensional spectra, resulting in coded spectral data, organizing said plurality of audio channels into a two-dimensional signal with time dimension and channel dimension, wherein said two-dimensional spectra and said coded spectral data represent transform coefficients in a four-dimensional uniform or non-uniform tiling, comprising the temporal-index of the block, the channel-index of the block, the temporal frequency dimension, and the spatial frequency dimension.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
12. The method of
13. The method of
14. The method of
15. Method for decoding a coded set of data representing a plurality of audio channels comprising the steps of: obtaining a reconstructed two-dimensional spectra from the coded data set;
transforming the reconstructed two-dimensional spectra with a two-dimensional inverse filter-bank,
wherein said reconstructed two-dimensional spectra represent transform coefficients in a four-dimensional uniform or non-uniform tiling, comprising the time-index of the block, the channel-index of the block, the temporal frequency dimension, and the spatial frequency dimension.
16. The method of
17. The method of
18. The method of
19. The method of
20. The method of
21. The method of
22. An encoding device, operatively arranged to carry out the method of
23. A non-transitory digital carrier on which is recorded an encoding software loadable in the memory of a digital processor, containing instructions to carry out the method of
24. A decoding device, operatively arranged to carry out the method of
25. A non-transitory digital carrier on which is recorded a decoding software loadable in the memory of a digital processor, containing instructions to carry out the method of
26. An acoustic reproduction system comprising:
a digital decoder, for decoding a bitstream representing samples of an acoustic wave field or loudspeaker drive signals at a plurality of positions in space and time, the decoder including an entropy decoder, operatively arranged to decode and decompress the bitstream, into a quantized two-dimensional spectra, and a quantization remover, operatively arranged to reconstruct a two-dimensional spectra containing transform coefficients relating to a temporal-frequency value and a spatial-frequency value, said quantization remover applying a masking model of the frequency masking effect along the temporal frequency and/or the spatial frequency, and a two-dimensional inverse filter-bank, operatively arranged to transform the reconstructed two-dimensional spectra into a plurality of audio channels;
a plurality of loudspeaker or acoustical transducers arranged in a set disposition in space, the positions of the loudspeakers or acoustical transducers corresponding to the position in space of the samples of the acoustic wave field;
one or more Digital-to-Analog Converters (DACs) and signal conditioning units, operatively arranged to extract a plurality of driving signals from plurality of audio channels, and to feed the driving signals to the loudspeakers or acoustical transducers, wherein said reconstructed two-dimensional spectra represent transform coefficients in a four-dimensional uniform or non-uniform tiling, comprising the time-index of the block, the channel-index of the block, the temporal frequency dimension, and the spatial frequency dimension, the system further comprising an interpolating unit, for providing an interpolated acoustic wave field signal.
27. An acoustic recording system comprising:
a plurality of microphones or acoustical transducers arranged in a set disposition in space to sample an acoustic wave field at a plurality of locations;
one or more Analog-to-Digital Converters (ADCs), operatively arranged to convert the output of the microphones or acoustical transducers into a plurality of audio channels containing values of the acoustic wave field at a plurality of positions in space and time;
a digital encoder, including a two-dimensional filter bank operatively arranged to transform the plurality of audio channels into a two-dimensional spectra containing transform coefficients relating to a temporal-frequency value and a spatial-frequency value, a quantizing unit, operatively arranged to quantize the two-dimensional spectra into a quantized two-dimensional spectra, said quantizing applying a masking model of the frequency masking effect along the temporal frequency and/or the spatial frequency, and an entropy coder, for providing a compressed bitstream representing the acoustic wave field or the loudspeaker drive signals;
a digital storage unit for recording the compressed bitstream,
a windowing unit, operatively arranged to partition the time dimension and/or the spatial dimension in a series of two-dimensional signal blocks;
wherein said two-dimensional spectra represent frequency coefficients in a four-dimensional uniform or non-uniform tiling, comprising the time-index of the block, the channel-index of the block, the temporal frequency dimension, and the spatial frequency dimension.
28. A non-transitory digital carrier containing an encoded bitstream representing a plurality of audio channels including a series of frames corresponding to two-dimensional signal blocks, each frame comprising:
entropy-coded spectral coefficients of the represented wave field in the corresponding two-dimensional signal block, the spectral coefficients being quantized according to a two-dimensional masking model, and allowing reconstruction of the wave field or the loudspeaker drive signal by a two-dimensional filter-bank,
side information necessary to decode the spectral data, wherein said reconstructed two-dimensional spectra represent transform coefficients in a four-dimensional uniform or non-uniform tiling, comprising the time-index of the block, the channel-index of the block, the temporal frequency dimension, and the spatial frequency dimension.
Description

The present invention relates to digital encoding and decoding for storing and/or reproducing sampled acoustic signals and, in particular, signals that are sampled or synthesized at a plurality of positions in space and time. The encoding and decoding allow reconstruction of the acoustic pressure field over an area or a volume of space.

Reproduction of audio through Wave Field Synthesis (WFS) has gained considerable attention, because it offers to reproduce an acoustic wave field with high accuracy at every location of the listening room. This is not the case in traditional multi-channel configurations, such as Stereo and Surround, which are not able to generate the correct spatial impression beyond an optimal location in the room, the so-called sweet spot. With WFS, the sweet spot can be extended to enclose a much larger area, at the expense of an increased number of loudspeakers.

The WFS technique consists of surrounding the listening area with an arbitrary number of loudspeakers, organized in some selected layout, and using the Huygens-Fresnel principle to calculate the drive signals for the loudspeakers in order to replicate any desired acoustic wave field inside that area. Since an actual wave front is created inside the room, the localization of virtual sources does not depend on the listener's position.

A typical WFS reproduction system comprises both a transducer (loudspeaker) array and a rendering device, which is in charge of generating the drive signals for the loudspeakers in real time. The signals can be either derived from a microphone array at the positions where the loudspeakers are located in space, or synthesized from a number of source signals, by applying known wave equation and sound processing techniques.

The fact that WFS requires a large number of audio channels for reproduction presents several challenges related to processing power and data storage or, equivalently, bitrate.
Usually, optimally encoded audio data requires more processing power and complexity for decoding, and vice-versa. A compromise must therefore be struck between data size and processing power in the decoder.

Coding the original source signals provides, potentially, a considerable reduction of data storage with respect to coding the sound field at a given number of locations in space. These algorithms are, however, very demanding in processing power for the decoder, which is therefore more expensive and complex. The original sources, moreover, are not always available and, even when they are, it may not be desirable, from a copyright protection standpoint, to disclose them.

Several encoding and decoding schemes have been proposed and used, and they can yield, in many cases, substantial bitrate reductions. Suitable encoding methods and systems include, among others, those described in international application WO8801811 and in U.S. Pat. Nos. 5,535,300 and 5,579,430, which rely on a spectral representation of the audio signal, on the use of psycho-acoustic modelling for discarding information of lesser perceptual importance, and on entropy coding for further reducing the bitrate. While these methods have been extremely successful for conventional mono, stereo, or surround audio recordings, they cannot be expected to deliver optimal performance if applied individually to a large number of WFS audio channels.

There is accordingly a need for audio encoding and decoding methods and systems which are able to store the WFS information in a bitstream with a favorable reduction in bitrate and that are not too demanding for the decoder.

According to the invention, these aims are achieved by means of the encoding method, the decoding method, the encoding and decoding devices and software, the recording system and the reproduction system that are the object of the appended claims.
In particular, the aims of the present invention are achieved by a method for encoding a plurality of audio channels comprising the steps of: applying to said plurality of audio channels a two-dimensional filter-bank along both the time dimension and the channel dimension resulting in two-dimensional spectra; coding said two-dimensional spectra, resulting in coded spectral data.

The aims of the present invention are also attained by a method for decoding a coded set of data representing a plurality of audio channels comprising the steps of: obtaining a reconstructed two-dimensional spectra from the coded data set; transforming the reconstructed two-dimensional spectra with a two-dimensional inverse filter-bank.

According to another aspect of the same invention, the aforementioned goals are met by an acoustic reproduction system comprising: a digital decoder, for decoding a bitstream representing samples of an acoustic wave field or loudspeaker drive signals at a plurality of positions in space and time, the decoder including an entropy decoder, operatively arranged to decode and decompress the bitstream, into a quantized two-dimensional spectra, and a quantization remover, operatively arranged to reconstruct a two-dimensional spectra containing transform coefficients relating to a temporal-frequency value and a spatial-frequency value, said quantization remover applying a masking model of the frequency masking effect along the temporal frequency and/or the spatial frequency, and a two-dimensional inverse filter-bank, operatively arranged to transform the reconstructed two-dimensional spectra into a plurality of audio channels; a plurality of loudspeakers or acoustical transducers arranged in a set disposition in space, the positions of the loudspeakers or acoustical transducers corresponding to the position in space of the samples of the acoustic wave field; one or more DACs and signal conditioning units, operatively arranged to extract a plurality of driving signals from the plurality of audio channels, and to feed the driving signals to the loudspeakers or acoustical transducers.

Further, the invention also comprises an acoustic recording system comprising: a plurality of microphones or acoustical transducers arranged in a set disposition in space to sample an acoustic wave field at a plurality of locations; one or more ADCs, operatively arranged to convert the output of the microphones or acoustical transducers into a plurality of audio channels containing values of the acoustic wave field at a plurality of positions in space and time; a digital encoder, including a two-dimensional filter bank operatively arranged to transform the plurality of audio channels into a two-dimensional spectra containing transform coefficients relating to a temporal-frequency value and a spatial-frequency value, a quantizing unit, operatively arranged to quantize the two-dimensional spectra into a quantized two-dimensional spectra, said quantizing applying a masking model of the frequency masking effect along the temporal frequency and/or the spatial frequency, and an entropy coder, for providing a compressed bitstream representing the acoustic wave field or the loudspeaker drive signals; a digital storage unit for recording the compressed bitstream.

The aims of the invention are also achieved by an encoded bitstream representing a plurality of audio channels including a series of frames corresponding to two-dimensional signal blocks, each frame comprising: entropy-coded spectral coefficients of the represented wave field in the corresponding two-dimensional signal block, the spectral coefficients being quantized according to a two-dimensional masking model, and allowing reconstruction of the wave field or the loudspeaker drive signal by a two-dimensional filter-bank; side information necessary to decode the spectral data.
The invention will be better understood with the aid of the description of an embodiment given by way of example and illustrated by the figures, in which: The acoustic wave field can be modeled as a superposition of point sources in the three-dimensional space of coordinates (x, y, z). We assume, for the sake of simplicity, that the point sources are located at z=0, as is often the case. This should not be understood, however, as a limitation of the present invention. Under this assumption, the three dimensional space can be reduced to the horizontal xy-plane. Let p(t,r) be the sound pressure at r=(x,y) generated by a point source located at r The spacetime signal p(t,x) can be represented as a linear combination of complex exponentials with temporal frequency Ω and spatial frequency Φ, by applying a spatio-temporal version of the Fourier transform:
which we call the continuous-space-time spectrum. It is important to note, however, that the spacetime signal can also be spectrally decomposed with respect to basis functions other than the complex exponentials of the Fourier basis. Thus it could be possible to obtain a spectral decomposition of the spacetime signal in spatial and temporal cosine components (DCT transformation), in wavelets, or according to any other suitable basis. It may also be possible to choose different bases for the space axes and for the time axis. These representations generalize the concepts of frequency spectrum and frequency component and are all comprised in the scope of the present invention.

Consider the space-time signal p(t,x) generated by a point source located in far-field, and driven by s(t). According to (4)

If the point source is not far enough from the x-axis to be considered in far-field, (1) must be used, such that

Note that the space-time signal p(t,x) generated by a source signal s(t)=δ(t) is in fact a Green's solution for the wave equation measured on the x-axis. This means that (9) and (11) act as a transfer function between p(t,r

The simple linear disposition of

The short-space analysis of the acoustic wave field is similar to its time domain counterpart, and therefore exhibits the same issues. For instance, the length L

The windowing operation in the space-time domain consists of multiplying p(t,x) both by a temporal window w

Consider the plane wave examples of previous section, and let w

An example of an encoder device according to the present invention is now described with reference to the

The spacetime signal P

Even if the

The present invention also includes an encoder producing a bitstream that is broadcast, or streamed on a network, without being locally stored.
Even if the different elements

On the decoder side, described now with reference to the

The drive signals q(n,m) for the loudspeakers

In practical implementations of the invention, the filtering operation could also be carried out, in an equivalent manner, in the frequency domain, on the two-dimensional spectral coefficients Y

The

As mentioned with reference to the encoder, the present invention also includes a standalone decoder, implementing the sole decoding unit

Sampling and Reconstruction

In most practical applications, p(t,x) can only be measured on discrete points along the x-axis. A typical scenario is when the wave field is measured with microphones, where each microphone represents one spatial sample. If s

The discrete-spacetime signal p

According to the present invention, the actual coding occurs in the frequency domain, where each frequency pair (Ω,Φ) is quantized and coded, and then stored in the bitstream. The transformation to the frequency domain is performed by a two-dimensional filterbank that represents a space-time lapped block transform. For simplicity, we assume that the transformation is separable, i.e., the individual temporal and spatial transforms can be cascaded and interchanged. In this example, we assume that the temporal transform is performed first. Let p
The matrices $\hat{X}$, $\hat{Y}$, and $\hat{P}$ are the estimations of $X$, $Y$, and $P$, and have size $N \times M$. Combining all transformation steps in the table yields $\hat{P} = \tilde{\Psi}\tilde{\Psi}^{T} \cdot P \cdot \tilde{\Upsilon}\tilde{\Upsilon}^{T}$, and thus perfect reconstruction is achieved if $\tilde{\Psi}\tilde{\Psi}^{T} = I$ and $\tilde{\Upsilon}\tilde{\Upsilon}^{T} = I$, i.e., if the transformation matrices are orthonormal.
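The perfect-reconstruction condition above can be checked numerically. The following is a minimal sketch, not the patent's filter bank: it assumes a plain (non-lapped) orthonormal DCT-IV matrix as a stand-in for both the temporal and the spatial transform, and the names `dct4_matrix`, `forward` and `inverse` are hypothetical. It applies the temporal transform first, then the spatial one, and verifies that inverting both steps returns the original block.

```python
import math

def dct4_matrix(n):
    """Orthonormal DCT-IV matrix of size n x n (stand-in for the patent's transform)."""
    scale = math.sqrt(2.0 / n)
    return [[scale * math.cos(math.pi / n * (i + 0.5) * (k + 0.5))
             for k in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def forward(p, psi, ups):
    """Temporal transform first (X = Psi^T P), then spatial (Y = X Ups)."""
    return matmul(matmul(transpose(psi), p), ups)

def inverse(y, psi, ups):
    """Undo the spatial step, then the temporal step: P_hat = Psi Y Ups^T."""
    return matmul(matmul(psi, y), transpose(ups))

N, M = 8, 4  # N time samples (rows), M channels (columns); illustrative sizes
P = [[math.sin(0.3 * n + 0.7 * m) for m in range(M)] for n in range(N)]
PSI = dct4_matrix(N)   # temporal transform matrix
UPS = dct4_matrix(M)   # spatial transform matrix

Y = forward(P, PSI, UPS)
P_HAT = inverse(Y, PSI, UPS)
MAX_ERR = max(abs(P[n][m] - P_HAT[n][m]) for n in range(N) for m in range(M))
print("max reconstruction error:", MAX_ERR)
```

Because both matrices are orthonormal, the products of each matrix with its transpose reduce to the identity and the error is at the level of floating-point rounding.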
According to a preferred variant of the invention, the WFC scheme uses a known orthonormal transformation matrix called the Modified Discrete Cosine Transform (MDCT), which is applied to both temporal and spatial dimensions. This is not, however, an essential feature of the invention, and the skilled person will observe that other orthogonal transforms providing frequency-like coefficients could also serve. In particular, the filter bank used in the present invention could be based, among others, on the Discrete Cosine Transform (DCT), the Fourier Transform (FT), or the wavelet transform.

The transformation matrix $\tilde{\Psi}$ (or $\tilde{\Upsilon}$ for space) is defined by

Note that the spatio-temporal MDCT generates a transform block of size B

One last important note is that, when using the spatio-temporal MDCT, if the signal is zero-padded, the spatial axis requires K

Preferably the blocks partition the space-time domain in a four-dimensional uniform or non-uniform tiling. The spectral coefficients are encoded according to a four-dimensional tiling, comprising the time-index of the block, the spatial-index of the block, the temporal frequency dimension, and the spatial frequency dimension.

Psychoacoustic Model

The psychoacoustic model for spatio-temporal frequencies is an important aspect of the invention. It requires the knowledge of both temporal-frequency masking and spatial-frequency masking, and these may be combined in a separable or non-separable way. The advantage of using a separable model is that the temporal and spatial contributions can be derived from existing models that are used in state-of-the-art audio coders. On the other hand, a non-separable model can estimate the dome-shaped masking effect produced by each individual spatio-temporal frequency over the surrounding frequencies.
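As an illustration of the lapped MDCT filter bank mentioned above, here is a minimal one-dimensional sketch with overlap-and-add reconstruction. It is written under stated assumptions rather than taken from the patent: the sine window, the 2/N inverse scaling, the zero-padding of one half-block at each end, and the function names are all choices made for this example. In the spatio-temporal case the same routine would be run along the time axis and then along the channel axis.

```python
import math

def sine_window(half):
    """Sine window of length 2*half; satisfies w[n]^2 + w[n+half]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * half)) for n in range(2 * half)]

def mdct(frame, half):
    """MDCT: 2*half windowed samples in, half coefficients out."""
    n0 = 0.5 + half / 2.0
    return [sum(frame[n] * math.cos(math.pi / half * (n + n0) * (k + 0.5))
                for n in range(2 * half)) for k in range(half)]

def imdct(coeffs, half):
    """Inverse MDCT: half coefficients in, 2*half time-aliased samples out."""
    n0 = 0.5 + half / 2.0
    return [2.0 / half * sum(coeffs[k] * math.cos(math.pi / half * (n + n0) * (k + 0.5))
                             for k in range(half)) for n in range(2 * half)]

def analyze(signal, half):
    """Split into 50%-overlapping windowed blocks and transform each one.

    Assumes len(signal) is a multiple of half; pads half zeros at both ends.
    """
    w = sine_window(half)
    padded = [0.0] * half + list(signal) + [0.0] * half
    blocks = []
    for start in range(0, len(padded) - half, half):
        frame = [padded[start + n] * w[n] for n in range(2 * half)]
        blocks.append(mdct(frame, half))
    return blocks

def synthesize(blocks, half):
    """Inverse-transform, window again, and overlap-add; aliasing cancels."""
    w = sine_window(half)
    out = [0.0] * (half * (len(blocks) + 1))
    for b, coeffs in enumerate(blocks):
        frame = imdct(coeffs, half)
        for n in range(2 * half):
            out[b * half + n] += frame[n] * w[n]
    return out[half:-half]  # strip the zero-padding

HALF = 4
SIGNAL = [math.sin(0.4 * n) + 0.25 * n for n in range(16)]
BLOCKS = analyze(SIGNAL, HALF)
RECON = synthesize(BLOCKS, HALF)
print(max(abs(a - b) for a, b in zip(SIGNAL, RECON)))
```

Each block yields only `HALF` coefficients for `2 * HALF` input samples, so the transform is critically sampled despite the overlap; a single block cannot be inverted alone, and reconstruction relies on the overlap-and-add of neighbouring blocks.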
These two possibilities are illustrated in

The goal of the psychoacoustic model is to estimate, for each spatio-temporal spectral block of size B

The psychoacoustic models thus allow encoding information using more bits for the perceptually important spectral components, and fewer bits for other components of lesser perceptual importance. Preferably the different embodiments of the present invention include a masking model that takes into account both the masking effect along the spatial frequency and the masking effect along the temporal frequency, and is based on a two-dimensional masking function of the temporal frequency and of the spatial frequency. Three different methods for estimating M are now described. This list is not exhaustive, however, and the present invention also covers other two-dimensional masking models.

Average Based Estimation

A way of obtaining a rough estimation of M is to first compute the masking curve produced by the signal in each channel independently, and then use the same average masking curve in all spatial frequencies. Let x

Another way of estimating M is to compute one masking curve per spatial frequency. This way, the triangular energy distribution in the spectral block Y is better exploited. Let x

One interesting remark about this method is that, since the masking curves are estimated from vertical lines along the Ω-axis, this is actually equivalent to coding each channel separately after decorrelation through a DCT. Further on, we show that this method gives a worse estimate of M than the plane-wave method, which is optimal when spatial masking is not taken into account.

Plane-wave Based Estimation

Another, more accurate, way of estimating M is by decomposing the spacetime signal p(t,x) into plane-wave components, and estimating the masking curve for each component.
The theory of wave propagation states that any acoustic wave field can be decomposed into a linear combination of plane waves and evanescent waves traveling in all directions. In the spacetime spectrum, plane waves constitute the energy inside the triangular region |Φ|≤|Ω|c

As derived in (7), the spacetime spectrum P(Ω,Φ) generated by a plane wave with angle of arrival α is given by

As mentioned before, we are discarding spatial-frequency masking effects in this analysis, i.e., we are assuming there is total separation of the plane waves by the auditory system. Under this assumption,

The main purpose of the psychoacoustic model, and the matrix M, is to determine the quantization step Δ

Another way of controlling the quantization noise, which we adopted for the WFC, is by setting Δ

After quantization, the spectral coefficients are preferably converted into binary form using entropy coding, for example, but not necessarily, by Huffman coding. A Huffman codebook with a certain range is assigned to each spatio-temporal critical band, and all coefficients in that band are coded with the same codebook. The use of entropy coding is advantageous because the MDCT does not generate all values with equal probability. An MDCT occurrence histogram, for different signal samples, clearly shows that small absolute values are more likely than large absolute values, and that most of the values fall within the range of −20 to 20. MDCT is not the only transformation with this property, however, and Huffman coding could be used advantageously in other implementations of the invention as well.

Preferably, the entropy coding adopted in the present invention uses a predefined set of Huffman codebooks that cover all ranges up to a certain value r. Coefficients bigger than r or smaller than −r are encoded with a fixed number of bits using Pulse Code Modulation (PCM).
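The range-based codebook assignment with a PCM escape can be sketched as follows. This is only an illustration under assumptions: the plain uniform quantizer stands in for the patent's quantization rule (whose formula is not reproduced here), the limit of 7 mirrors the seven-codebook embodiment, and `quantize`, `select_codebook` and the returned tuple layout are hypothetical names and conventions.

```python
def quantize(coeffs, step):
    """Uniform quantizer (assumed stand-in): map each coefficient to an integer index."""
    return [int(round(c / step)) for c in coeffs]

def dequantize(indices, step):
    """Re-scale integer indices back to amplitudes."""
    return [q * step for q in indices]

def select_codebook(band, max_range=7):
    """Pick a codebook for one critical band from its peak quantized amplitude.

    Returns (codebook_range, escape_positions): codebook_range r means the
    codebook covering [-r, r]; escape_positions lists coefficients whose
    magnitude exceeds max_range and would be sent as fixed-length PCM instead.
    """
    peak = max(abs(q) for q in band)
    codebook_range = min(max(peak, 1), max_range)  # smallest codebook covers [-1, 1]
    escapes = [i for i, q in enumerate(band) if abs(q) > max_range]
    return codebook_range, escapes

BAND = quantize([0.2, -3.1, 6.9, 0.0], step=1.0)
print(BAND)                          # quantized indices of one band
print(select_codebook(BAND))         # codebook range and escape positions
print(select_codebook([1, -9, 3]))   # one coefficient outside [-7, 7]
```

Choosing the smallest codebook that still covers the band keeps code words short for quiet bands, while the escape list isolates the rare large coefficients so they do not force a larger codebook on the whole band.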
In addition, adjacent values (Y

According to an embodiment, a set of 7 Huffman codebooks covering all ranges up to [−7,7] is generated according to the following probability model. Consider a pair of spectral coefficients y=(Y

When performing the actual coding of the spectral block Y, the appropriate Huffman codebook is selected for each critical band according to the maximum amplitude value Y

According to another aspect of the invention, the binary data resulting from an encoding operation are organized into a time series of bits, called the bitstream, in a way that the decoder can parse the data and use it to reconstruct the multichannel signal p(t,x). The bitstream can be recorded on any appropriate digital data carrier for distribution and storage.

The main header

The frame format is repeated for each spectral block Y

The scale factors can be encoded in a number of alternative formats, for example in logarithmic scale using 5 bits. The number of scale factors depends on the size B

Decoding

The decoding stage of the WFC comprises three steps: decoding, re-scaling, and inverse filter-bank. The decoding is controlled by a state machine representing the Huffman codebook assigned to each critical band. Since Huffman encoding generates prefix-free binary sequences, the decoder knows immediately how to parse the coded spectral coefficients. Once the coefficients are decoded, the amplitudes are re-scaled using (42) and the scale factor associated with each critical band. Finally, the inverse MDCT is applied to the spectral blocks, and the recombination of the signal blocks is obtained through overlap-and-add in both temporal and spatial domains. The decoded multi-channel signal p

The inventors have found, by means of realistic simulations, that the encoding method of the present invention provides substantial bitrate reductions with respect to the known methods in which all the channels of a WFC system are encoded independently from each other.
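The remark in the decoding stage that prefix-free code words let the decoder parse the coefficients without separators can be made concrete with a small sketch. The codebook below is a hypothetical example covering the range [−2, 2]; it is not one of the patent's codebooks, and `encode` and `decode` are illustrative names.

```python
# Hypothetical prefix-free codebook: no code word is a prefix of another.
CODEBOOK = {0: "0", 1: "10", -1: "110", 2: "1110", -2: "1111"}

def encode(symbols, codebook):
    """Concatenate the code words of the quantized coefficients."""
    return "".join(codebook[s] for s in symbols)

def decode(bits, codebook):
    """Read bits one at a time; emit a symbol as soon as a code word matches."""
    inverse = {word: sym for sym, word in codebook.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # unambiguous because the code is prefix-free
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bitstream ended in the middle of a code word")
    return symbols

COEFFS = [0, 1, -1, 2, -2, 0]
BITS = encode(COEFFS, CODEBOOK)
print(BITS)
print(decode(BITS, CODEBOOK))
```

The shortest code words go to the small amplitudes, which matches the observation that small quantized values are the most frequent ones.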