METHOD FOR THE TREATMENT OF
COMPRESSED SOUND DATA FOR
 The invention relates to the processing of sound data for the spatialized restitution of acoustic signals.
 The appearance of new formats for coding data on telecommunications networks allows the transmission of complex and structured sound scenes comprising multiple sound sources. In general, these sound sources are spatialized, that is to say they are processed in such a way as to afford a realistic final rendition in terms of the position of the sources and the room effect (reverberation). Such is the case, for example, for coding according to the MPEG-4 standard, which makes it possible to transmit complex sound scenes comprising compressed or uncompressed sounds, and synthesized sounds, with which are associated spatialization parameters (position, effect of the surrounding room). This transmission takes place over constrained networks, and the sound rendition depends on the type of terminal used. On a mobile terminal of the PDA type ("Personal Digital Assistant"), for example, a listening headset will preferably be used. The constraints of terminals of this type (calculation power, memory size) make the implementation of sound spatialization techniques difficult.
 Sound spatialization covers two different processing types. On the basis of a monophonic audio signal, one seeks to give a listener the illusion that the sound source or sources are at very precise positions in space (which one desires to be able to modify in real time), and immersed in a space having particular acoustic properties (reverberation, or other acoustic phenomena such as occlusion). By way of example, on telecommunication terminals of the mobile type, it is natural to envisage a sound rendition with a stereophonic listening headset. The most effective technique for positioning the sound sources is then binaural synthesis.
 It consists, for each sound source, in filtering the monophonic signal via acoustic transfer functions, called HRTFs ("Head Related Transfer Functions"), which model the transformations engendered by the torso, the head and the auricle of the ear of the listener on a signal originating from a sound source. For each position in space, it is possible to measure a pair of these functions (one for the right ear, one for the left ear). The HRTFs are therefore functions of a spatial position, more particularly of an angle of azimuth θ and an angle of elevation φ, and of the sound frequency f. Thus, for a given subject, a database of acoustic transfer functions for N positions in space is obtained, for each ear, in which a sound may be "placed" (or "spatialized" according to the terminology used hereinbelow).
 It is indicated that a similar spatialization processing consists of a so-called "transaural" synthesis, in which provision is simply made for more than two loudspeakers in a restitution device (which then takes a different form from a headset with two earpieces, left and right).
 In a conventional manner, the implementation of this technique is effected in a so-called "bichannel" form (processing represented diagrammatically in FIG. 1 pertaining to the prior art). For each sound source to be positioned according to the pair of azimuth and elevation angles [θ, φ], the signal of the source is filtered with the HRTF function of the left ear and with the HRTF function of the right ear. The two channels, left and right, deliver acoustic signals which are then broadcast to the ears of the listener with a stereophonic listening headset. This bichannel binaural synthesis is of a type referred to hereinbelow as "static", since in this case the positions of the sound sources do not change over time.
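As a rough illustration of this static bichannel processing, the per-source operation can be sketched as a pair of convolutions of the monophonic signal with left- and right-ear impulse responses. The filter values below are illustrative stand-ins, not measured HRTFs:

```python
import numpy as np

def binaural_static(mono, hrir_left, hrir_right):
    # Filter the mono source with the left- and right-ear impulse
    # responses derived from the HRTFs of one fixed position [theta, phi].
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy impulse responses standing in for a measured HRTF pair.
mono = np.random.randn(1000)
left, right = binaural_static(mono,
                              np.array([1.0, 0.5, 0.25]),
                              np.array([0.8, 0.4, 0.2]))
```

Note that each additional source requires its own pair of filtering operations, which is precisely the complexity issue the linear-decomposition approach described below addresses.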
 If one wishes, on the contrary, to vary the positions of the sound sources in space in the course of time ("dynamic" synthesis), the filters used to model the HRTFs (left ear and right ear) have to be modified. However, these filters being for the most part of the finite impulse response (FIR) or infinite impulse response (IIR) type, problems of discontinuity in the left and right output signals appear, giving rise to audible "clicks". The technical solution conventionally employed to alleviate this problem is to run two sets of binaural filters in parallel. The first set simulates a position [θ1, φ1] at the instant t1, the second a position [θ2, φ2] at the instant t2. The signal giving the illusion of a displacement between the positions at the instants t1 and t2 is then obtained by cross-fading the left and right signals resulting from the filtering processes for the position [θ1, φ1] and for the position [θ2, φ2]. Thus, the complexity of the system for positioning the sound sources is doubled (two positions at two instants) with respect to the static case.
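The cross-fade between the two filter sets can be sketched as a simple linear ramp over the transition interval (a minimal sketch; real systems may use other fade laws):

```python
import numpy as np

def crossfade(sig_from, sig_to):
    # Linear fade from the output of the filter set simulating
    # position [theta1, phi1] to that simulating [theta2, phi2].
    ramp = np.linspace(0.0, 1.0, len(sig_from))
    return (1.0 - ramp) * sig_from + ramp * sig_to
```

This fade must be applied to the left and right signals separately, and both filter sets must run for the whole transition, which is the doubling of complexity mentioned above.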
 In order to alleviate this problem, techniques of linear decomposition of the HRTFs have been proposed (processing represented diagrammatically in FIG. 2 pertaining to the prior art). One of the advantages of these techniques is that they allow an implementation whose complexity depends much less on the total number of sources to be positioned in space. Specifically, these techniques make it possible to decompose the HRTFs over a basis of functions common to all the positions in space, and therefore depending only on frequency, thereby making it possible to reduce the number of filters required. Thus, this number of filters is fixed, independently of the number of sources and/or of the number of positions of sources to be envisaged. The addition of a further sound source then adds only operations of multiplication by a set of weighting coefficients and by a delay τi, these coefficients and this delay depending only on the position [θ, φ]. No further filter is therefore necessary.
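A minimal sketch of this decomposition follows (per-source delays τi are omitted for brevity, and all weights and basis filters are illustrative values, not an actual HRTF decomposition). Each basis filter is applied once to a weighted mix of all the sources, so the number of convolutions depends only on the size of the basis:

```python
import numpy as np

def spatialize_decomposed(sources, weights, basis_filters):
    # weights[i][j]: direction-dependent gain of source i on basis filter j.
    # basis_filters[j]: frequency-dependent filter common to all positions.
    out = np.zeros(max(len(s) for s in sources)
                   + max(len(f) for f in basis_filters) - 1)
    for j, f in enumerate(basis_filters):
        # Mix all sources with their direction-dependent weights first...
        mix = np.zeros(max(len(s) for s in sources))
        for i, s in enumerate(sources):
            mix[:len(s)] += weights[i][j] * s
        # ...then filter the mix once per basis function.
        filtered = np.convolve(mix, f)
        out[:len(filtered)] += filtered
    return out
```

Adding a source only extends the inner weighting loop; no new convolution is introduced, which mirrors the complexity argument made above.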
 These techniques of linear decomposition are also of interest in the case of dynamic binaural synthesis (i.e. when the position of the sound sources varies in the course of time). Specifically, in this configuration, the values of the weighting coefficients and of the delays, rather than the coefficients of the filters, are now made to vary as a function of position alone. The principle described hereinabove of linear decomposition of sound rendition filters generalizes to other approaches, as will be seen hereinbelow.
 Moreover, in the various group communication services (teleconferencing, audio conferencing, video conferencing, or the like) or "streaming" communication services, to adapt the bit rate to the bandwidth provided by a network, the audio and/or speech streams are transmitted in a compressed coded format. Hereinbelow we consider only streams initially compressed by coders of the frequency type (or by frequency transform), such as those operating according to the MPEG-1 standard (layers I-II-III), the MPEG-2/4 AAC standard, the MPEG-4 TwinVQ standard, the Dolby AC-2 standard, the Dolby AC-3 standard, or else the ITU-T G.722.1 standard for speech coding, or else the Applicant's TDAC coding method. The use of such coders amounts to first performing a time/frequency transformation on blocks of the time signal. The parameters obtained are thereafter quantized and coded so as to be transmitted in a frame with other supplementary information required for decoding. This time/frequency transformation may take the form of a bank of frequency subband filters or else a transform of the MDCT type ("Modified Discrete Cosine Transform"). Hereinbelow, the term "subband domain" will designate a domain defined in a frequency subband space, a domain of a frequency-transformed time space, or a frequency domain.
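As a point of reference, the MDCT mentioned above maps a block of 2N time samples onto N subband coefficients. The naive, unwindowed direct form can be sketched as follows (illustration only; practical coders use a windowed, fast implementation):

```python
import numpy as np

def mdct(block):
    # Naive MDCT: a 2N-sample time block yields N subband coefficients.
    two_n = len(block)
    n = two_n // 2
    times = np.arange(two_n)
    coeffs = np.empty(n)
    for k in range(n):
        coeffs[k] = np.sum(
            block * np.cos(np.pi / n * (times + 0.5 + n / 2) * (k + 0.5)))
    return coeffs
```

It is these N-coefficient vectors, one per block and per channel, that constitute the "subband domain" signals on which the invention operates.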
 To perform sound spatialization on such streams, the conventional procedure consists in first decoding them, carrying out the sound spatialization processing on the time signals, then recoding the resulting signals for transmission to a restitution terminal. This irksome succession of steps is often very expensive in terms of calculation power, of the memory required for the processing, and of the algorithmic lag introduced. It is therefore often unsuited to the constraints imposed by the machines where the processing is performed and to the communication constraints.
 The present invention improves on this situation.
 One of the aims of the present invention is to propose a method of processing sound data grouping together the operations of compression coding/decoding of the audio streams and of spatialization of said streams.
 Another aim of the present invention is to propose a method of processing sound data, by spatialization, which adapts to a dynamically variable number of sound sources to be positioned.
 A general aim of the present invention is to propose a method of processing sound data, by spatialization, allowing wide broadcasting of the spatialized sound data, in particular broadcasting for the general public, the restitution devices being simply equipped with a decoder of the signals received and restitution loudspeakers.
 To this end it proposes a method of processing sound data, for spatialized restitution of acoustic signals, in which:
 a) at least one first set and one second set of weighting terms, representative of a direction of perception of said acoustic signal by a listener, are obtained for each acoustic signal; and
 b) said acoustic signals are applied to at least two sets of filtering units, disposed in parallel, so as to deliver at least a first output signal and a second output signal each corresponding to a linear combination of the acoustic signals weighted by the collection of weighting terms respectively of the first set and of the second set and filtered by said filtering units.
 Each acoustic signal in step a) of the method within the sense of the invention is at least partially compression-coded and is expressed in the form of a vector of subsignals associated with respective frequency subbands, and each filtering unit is devised so as to perform a matrix filtering applied to each vector, in the frequency subband space.
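Steps a) and b) can be sketched for a single subband frame as follows. This is a deliberately simplified illustration: a true subband-domain matrix filtering has memory across successive frames (the matrices carry the impulse-response history of the converted time-domain filters), whereas the sketch below applies one matrix per output channel to a single frame, with all weights and matrices as illustrative values:

```python
import numpy as np

def spatialize_subband_frame(subband_vectors, w_left, w_right,
                             m_left, m_right):
    # subband_vectors: one length-M vector of subband samples per source.
    # w_left / w_right: direction-dependent weighting terms (step a).
    # m_left / m_right: M x M matrix filters acting in the subband
    # space, converted from time-domain rendition filters (step b).
    mix_l = sum(w * v for w, v in zip(w_left, subband_vectors))
    mix_r = sum(w * v for w, v in zip(w_right, subband_vectors))
    return m_left @ mix_l, m_right @ mix_r
```

The weighted linear combination of all sources is formed first, so the matrix filtering cost is independent of the number of sources, consistent with the decomposition approach described earlier.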
 Advantageously, each matrix filtering is obtained by conversion, in the frequency subband space, of a (finite or infinite) impulse response filter defined in the time space. Such an impulse response filter is preferably obtained by determination of an acoustic transfer function dependent on a direction of perception of a sound and the frequency of this sound.
 According to an advantageous characteristic of the invention, these transfer functions are expressed by a linear combination of frequency-dependent terms weighted by direction-dependent terms, thereby making it possible, as indicated hereinabove, on the one hand, to process a variable number of acoustic signals in step a) and, on the other hand, to vary the position of each source dynamically over time. Furthermore, such an expression for the transfer functions "integrates" the interaural delay which is conventionally applied to one of the output signals, with respect to the other, before restitution, in binaural processing. To this end, matrices of gain filters associated with each signal are envisaged.
 Thus, said first and second output signals preferably being intended to be decoded into first and second restitution signals, the aforesaid linear combination already takes account of a time shift between these first and second restitution signals, in an advantageous manner.
 Finally, between the step of reception/decoding of the signals received by a restitution device and the step of restitution itself, it is possible not to envisage any further step of sound spatialization, this spatialization processing being completely performed upstream and directly on coded signals.
 According to one of the advantages afforded by the present invention, association of the techniques of linear decomposition of the HRTFs with the techniques of filtering in the subband domain makes it possible to profit from the advantages of the two techniques so as to arrive at sound spatialization systems with low complexity and reduced memory for multiple coded audio signals.
 Specifically, in a conventional "bichannel" architecture, the number of filters to be used is dependent on the number of sources to be positioned. As indicated hereinabove, this problem does not arise in an architecture based on the linear decomposition of HRTFs. This technique is therefore preferable in terms of calculation power, but also of the memory space required for storing the binaural filters. Finally, this architecture makes it possible to manage dynamic binaural synthesis optimally, since it makes it possible to effect the "fading" between two instants t1 and t2 on coefficients which depend only on position, and therefore does not require two sets of filters in parallel.
 According to another advantage afforded by the present invention, the direct filtering of the signals in the coded domain allows a saving of one complete decoding per audio stream before undertaking the spatialization of the sources, thereby entailing a considerable gain in terms of complexity.
 According to another advantage afforded by the present invention, the sound spatialization of the audio