US 6826284 B1 Abstract A real-time passive acoustic source localization system for video camera steering advantageously determines the relative delay between the direct paths of two estimated channel impulse responses. The illustrative system employs an approach referred to herein as the “adaptive eigenvalue decomposition algorithm” (AEDA) to make such a determination, and then advantageously employs a “one-step least-squares algorithm” (OSLS) for purposes of acoustic source localization, providing the desired features of robustness, portability, and accuracy in a reverberant environment. The AEDA technique directly estimates the (direct path) impulse response from the sound source to each of a pair of microphones, and then uses these estimated impulse responses to determine the time delay of arrival (TDOA) between the two microphones by measuring the distance between the first peaks thereof (i.e., the first significant taps of the corresponding transfer functions). In one embodiment, the system minimizes an error function (i.e., a difference) which is computed with the use of two adaptive filters, each such filter being applied to a corresponding one of the two signals received from the given pair of microphones. The filtered signals are then subtracted from one another to produce the error signal, which is minimized by a conventional adaptive filtering algorithm such as, for example, an LMS (Least Mean Squared) technique. Then, the TDOA is estimated by measuring the “distance” (i.e., the time) between the first significant taps of the two resultant adaptive filter transfer functions.
Claims(27) 1. A method of locating an acoustic source within a physical environment with use of a plurality of microphones placed at different locations within said physical environment, each microphone receiving an acoustic signal resulting from said acoustic source and generating a corresponding microphone output signal in response thereto, the method comprising the steps of:
estimating a first impulse response representative of the acoustic signal received by a first one of said microphones and estimating a second impulse response representative of the acoustic signal received by a second one of said microphones;
determining a relative time delay of arrival between said acoustic signal received by said first one of said microphones and said acoustic signal received by said second one of said microphones, said determination based on said estimated first impulse response and on said estimated second impulse response; and
locating said acoustic source within said physical environment based on said determined relative time delay of arrival,
wherein the step of estimating the first and second impulse responses includes filtering said microphone output signal of said first one of said microphones with a first adaptive filter and filtering said microphone output signal of said second one of said microphones with a second adaptive filter; and,
said step of estimating the first and second impulse responses further includes adjusting said first adaptive filter to provide an estimate of said second impulse response and adjusting said second adaptive filter to provide an estimate of said first impulse response.
2. The method of
3. The method of
4. The method of
5. The method of
6. The method of
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
12. The method of
13. The method of
14. An apparatus for locating an acoustic source within a physical environment comprising:
a plurality of microphones placed at different locations within said physical environment, each microphone receiving an acoustic signal resulting from said acoustic source and generating a corresponding microphone output signal in response thereto;
a first adaptive filter applied to said microphone output signal of a first one of said microphones, and a second adaptive filter applied to said microphone output signal of a second one of said microphones, the second adaptive filter estimating a first impulse response representative of the acoustic signal received by said first one of said microphones, and the first adaptive filter estimating a second impulse response representative of the acoustic signal received by said second one of said microphones;
means for determining a relative time delay of arrival between said acoustic signal received by said first one of said microphones and said acoustic signal received by said second one of said microphones, said determination based on said estimated first impulse response and on said estimated second impulse response; and
means for locating said acoustic source within said physical environment based on said determined relative time delay of arrival.
15. The apparatus of
16. The apparatus of
17. The apparatus of
18. The apparatus of
19. The apparatus of
20. The apparatus of
21. The apparatus of
22. The apparatus of
23. The apparatus of
24. The apparatus of
25. The apparatus of
26. The apparatus of
27. The apparatus of
Description The present invention relates generally to the field of acoustics, and more particularly to the problem of localizing an acoustic source for purposes of, for example, tracking the source with a video camera. In current video-conferencing environments, a set of cameras is typically set up at a plurality of different locations to provide video of the active talker as he or she contributes to the discussion. In previous teleconferencing environments, this tedious task required the full involvement of professional camera operators. More recently, however, artificial object tracking techniques have been advantageously used to locate and track an active talker automatically in three-dimensional space—namely, to determine his or her range and azimuth, as well as his or her elevation. In this manner, the need for one or more human camera operators can be advantageously eliminated. There are several possible approaches to automatically tracking an active talker. Broadly, they can be divided into two classes—the class of visual tracking and the class of acoustic tracking—depending on what particular type of information (visual or acoustic cues, respectively) is employed. Even though visual tracking techniques have been investigated for several decades and have had reasonably good success, acoustic source localization systems have certain advantages that are not present in vision-based tracking systems. For example, acoustic approaches which receive an acoustic signal omnidirectionally can advantageously operate in the dark, and are therefore able to detect and locate sound sources behind the camera or otherwise hidden from its view. Humans, like most vertebrates, have two ears which form a microphone array mounted on a mobile base (i.e., the human head). 
By continuously receiving and processing the propagating acoustic signals with such a binaural auditory system, we accurately and instantaneously gather information about the environment, particularly about the spatial positions and trajectories of sound sources and about their states of activity. The remarkable performance of the human binaural auditory system, however, poses a significant technical challenge for acoustic engineers attempting to artificially recreate the same effect, primarily as a result of room reverberation. Nonetheless, microphone array processing is a rapidly emerging technique which can play an important role in a practical solution to the active talker tracking problem. In general, locating point sources using measurements or estimates from passive, stationary sensor arrays has found numerous applications in navigation, aerospace, and geophysics. Algorithms for radiative source localization, for example, have been studied for nearly 100 years, particularly for radar and underwater sonar systems. Many processing techniques have been proposed, with differing levels of complexity and differing restrictions. The application of such source localization concepts to the automation of video camera steering in teleconferencing applications, however, has been addressed only recently. Specifically, existing acoustically-based source localization methods can be loosely divided into three categories—steered-beamformer-based techniques, high-resolution spectral-estimation-based techniques, and time-delay-estimation-based techniques. (See, e.g., “A Practical Methodology for Speech Source Localization with Microphone Arrays” by M. S. Brandstein et al., Comput., Speech, Language, vol. 2, pp. 91-126, November 1997.) With continued investigation over the last two decades, the time-delay-estimation-based location method has become the technique of choice, especially in recent digital systems. 
In particular, research efforts applied to time-delay-estimation-based source localization techniques focus primarily on obtaining improved (in the sense of accuracy, robustness, and efficiency) source location estimators which can be implemented in real time on a digital computer. More specifically, time-delay-estimation-based localization systems determine the location of acoustic sources from a plurality of microphones in a two-step process. In the first step, a set of time delays of arrival (TDOAs) among different microphone pairs is calculated. That is, for each of a set of microphone pairs, the relative time delay between the arrival of the acoustic source signal at each of the microphones in the pair is determined. In the second step, this set of TDOA information is then employed to estimate the acoustic source location with knowledge of the particular microphone array geometry. Methods which have been employed to perform such localization (i.e., the second step of the two-step process) include, for example, the maximum likelihood method, the triangulation method, the spherical intersection method, and the spherical interpolation method. (See the discussion of these techniques below.) Specifically, time delay estimation (TDE) (i.e., the first step of the two-step process) is concerned with the computation of the relative time delay of arrival between different microphone sensors. In developing a time delay estimation algorithm (i.e., the first of the two steps of a time-delay-estimation-based acoustic source localization system), it is necessary to make use of an appropriate parametric model for the acoustic environment. Two parametric acoustic models for TDE problems—namely, the ideal free-field model and the real reverberant model—may be employed. 
Generally then, the task of a time delay estimation algorithm is to estimate the model parameters (more specifically, the TDOAs) based on the model employed, which typically involves determining the parameter values that minimize an error criterion computed from the received microphone signals. In particular, conventional prior art time-delay-estimation-based acoustic source localization systems typically use a generalized cross-correlation (GCC) method, which selects as its estimate the time delay that maximizes the cross-correlation function between time-shifted versions of the signals of two distinct microphones. (See, e.g., “The Generalized Correlation Method for Estimation of Time Delay” by C. H. Knapp et al., IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 320-327, August 1976.) More specifically, in the GCC approach, the TDOA between two microphone sensors can be found by computing the cross-correlation function of their signals and selecting the peak location. The peak can be sharpened by pre-whitening the signals before computing the cross-correlation, which leads to the so-called phase transform method. Techniques have been proposed to improve the generalized cross-correlation (GCC) algorithms in the presence of noise. (See, e.g., “A Pitch-Based Approach to Time-Delay Estimation of Reverberant Speech” by M. S. Brandstein, Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, 1997). But because GCC is based on a simple signal propagation model, in which the signals acquired by each microphone are regarded as delayed replicas of the source signal plus background noise, it suffers from a fundamental inability to cope well with room reverberation. (See, e.g., “Performance of Time-Delay Estimation in the Presence of Room Reverberation” by B. Champagne et al., IEEE Trans. Speech Audio Processing, vol. 4, pp. 148-152, March 1996.) Although some improvement may be gained by cepstral prefiltering, shortcomings still remain. 
(See, e.g., “Cepstral Prefiltering for Time Delay Estimation in Reverberant Environments” by A. Stephenne et al., Proc. IEEE ICASSP, 1995, pp. 3055-58.) Even though more sophisticated techniques exist, they tend to be computationally intensive and are thus not well suited for real-time applications. (See, e.g., “Modeling Human Sound-Source Localization and the Cocktail-Party-Effect,” by M. Bodden, Acta Acoustica 1, pp. 43-55, 1993.) Therefore, an alternative approach to the GCC method for use in reverberant environments would be highly desirable. In accordance with an illustrative embodiment of the present invention, a real-time passive acoustic source localization system for video camera steering advantageously determines the relative delay between the direct paths of two estimated channel impulse responses. The illustrative system employs a novel approach referred to herein as the “adaptive eigenvalue decomposition algorithm” (AEDA) to make such a determination, and then advantageously employs a “one-step least-squares” algorithm (OSLS) for purposes of acoustic source localization. The illustrative system advantageously provides the desired features of robustness, portability, and accuracy in a reverberant environment. More specifically, and in accordance with one aspect of an illustrative embodiment of the present invention, the AEDA technique directly estimates the (direct path) impulse response from the sound source to each of the microphones in a pair of microphones, and then uses these estimated impulse responses to determine the TDOA associated with the given pair of microphones, by determining the distance between the first peaks thereof (i.e., the first significant taps of the corresponding transfer function). 
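For concreteness, the baseline GCC time-delay estimator with phase-transform (pre-whitening) weighting discussed above can be sketched as follows. This is a minimal illustration only; the function name, the FFT length, and the small regularization constant are our own choices and are not part of the patent:

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals using the
    generalized cross-correlation with phase transform (GCC-PHAT),
    i.e. the weighting Phi(f) = 1 / |cross-spectrum(f)|."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = np.conj(X1) * X2
    # PHAT pre-whitening: keep only the phase of the cross-spectrum.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(n // 2, int(max_tau * fs))
    # Rearrange so that lag 0 sits at the center of the search window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs  # positive when x2 lags x1
```

On a synthetic pair in which x2 is simply a delayed copy of x1, the sharpened cross-correlation peak recovers the delay directly; in a reverberant room, however, competing peaks from reflections are precisely the weakness that motivates the approach described next.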
For example, in accordance with one illustrative embodiment of the present invention, a passive acoustic source localization system minimizes an error function (i.e., a difference) which is computed with the use of two adaptive filters, each such filter being applied to a corresponding one of the two signals received from the pair of microphones for which it is desired to compute a TDOA. The filtered signals are advantageously subtracted from one another to produce the error signal, which signal is minimized by a conventional adaptive filtering algorithm such as, for example, an LMS (Least-Mean-Squared) technique, such as may be used, for example, in acoustic echo cancellation systems and which is fully familiar to those of ordinary skill in the art. Then, the TDOA may be estimated by determining the “distance” (i.e., the time) between the first significant taps of the two resultant adaptive filter transfer functions. In accordance with another aspect of an illustrative embodiment of the present invention, the acoustic source location is subsequently performed (based on the resultant TDOAs) with use of an OSLS algorithm, which advantageously reduces the computational complexity but achieves the same results as a conventional spherical interpolation (SI) method. And in accordance with still another aspect of the present invention, the filter coefficients may be advantageously updated in the frequency domain using the unconstrained frequency-domain LMS algorithm, so as to take advantage of the computational efficiencies of a Fast Fourier Transform (FFT). FIG. 1 shows schematic diagrams of acoustic models which may be used for time delay estimation problems. FIG. 1A shows a diagram of an ideal free-field acoustic model, and FIG. 1B shows a diagram of a real reverberant acoustic model. FIG. 2 shows an adaptive filter arrangement for use in the adaptive eigenvalue decomposition algorithm employed in accordance with an illustrative embodiment of the present invention. 
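The dual-adaptive-filter arrangement just described can be sketched as follows for the simple noiseless, pure-delay case. This is a didactic sketch rather than the patent's implementation: the filter length, step size, initialization, and the half-max "first significant tap" threshold are our own assumptions, and a plain time-domain normalized LMS update with renormalization stands in for the constrained and frequency-domain variants discussed herein:

```python
import numpy as np

def aeda_tdoa(x1, x2, M=32, mu=0.5):
    """Sketch of the adaptive eigenvalue decomposition TDOA idea:
    jointly adapt two filters so that (filter2 * x1) - (filter1 * x2)
    is driven toward zero under a unit-norm constraint, then read the
    TDOA as the spacing between the first significant taps of the two
    estimated (direct-path) impulse responses."""
    u = np.zeros(2 * M)          # u = [g2_hat ; -g1_hat]
    u[M // 2] = 1.0              # assume the direct path of channel 2 sits mid-filter
    for n in range(M, len(x1)):
        xv = np.concatenate((x1[n - M:n][::-1], x2[n - M:n][::-1]))
        e = np.dot(u, xv)                              # error signal
        u -= mu * e * xv / (np.dot(xv, xv) + 1e-12)    # normalized LMS step
        u /= np.linalg.norm(u)   # unit-norm constraint avoids the trivial zero solution
    g2_hat, g1_hat = u[:M], -u[M:]
    # "First significant tap" of each estimated response (half-max threshold).
    t1 = int(np.argmax(np.abs(g1_hat) > 0.5 * np.abs(g1_hat).max()))
    t2 = int(np.argmax(np.abs(g2_hat) > 0.5 * np.abs(g2_hat).max()))
    return t2 - t1               # TDOA in samples (positive: x2 lags x1)
```

On a synthetic pure-delay pair, each recovered response converges to a single dominant tap, and the spacing between the two taps gives the delay in samples.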
FIG. 3 shows a schematic diagram of three-dimensional space illustrating certain defined notation for use in the source localization problem solved in accordance with an illustrative embodiment of the present invention. FIG. 4 shows a schematic block diagram of a real-time system infrastructure which may be employed in accordance with an illustrative embodiment of the present invention. FIG. 5 shows an illustrative three-dimensional microphone array for passive acoustic source location in accordance with an illustrative embodiment of the present invention. FIG. 1A shows a diagram of an ideal free-field acoustic model which may be used for time delay estimation problems. Sound is generated by sound source
where α
Assume that s(n), b In a real acoustic environment, however, one must take into account the reverberation of the room and the above ideal model as illustratively shown in FIG. 1A no longer holds. FIG. 1B shows a diagram of a real reverberant acoustic model which may be used more advantageously for time delay estimation problems in accordance with an illustrative embodiment of the present invention. In particular, a more complicated but more realistic model for the microphone signals x
where * denotes convolution and g In the GCC technique which is based on the simple signal propagation model, the time-delay estimate is obtained as the value of τ that maximizes the generalized cross-correlation function given by where S
is the generalized cross-spectrum. Then, the GCC TDE may be expressed as: The choice of Φ(ƒ) is important in practice. The classic cross-correlation (CCC) method is obtained by taking Φ(ƒ)=1. In the noiseless case, with the model of Equation (1), knowing that X
The fact that ψ It is clear by examining Equation (6) that the phase rather than the magnitude of cross-spectrum provides the TDOA information. Thereafter, the cross-correlation peak can be sharpened by pre-whitening the input signals, i.e., by choosing φ(ƒ)=1/|S
depends only on the τ In accordance with an illustrative embodiment of the present invention, a completely different approach from the prior art GCC technique is employed. This novel method focuses directly on the channel impulse responses for TDE. Specifically, an adaptive eigenvalue decomposition algorithm (AEDA) is advantageously used for TDE. The AEDA focuses directly on the channel impulse responses for TDE and assumes that the system (i.e., the room) is linear and time invariant. By following the real reverberant model as illustrated in FIG. 1B, and by observing the fact that
in the noiseless case, one can deduce the following relation at time n:
where x
and M is the length of the impulse responses. (See, e.g., “Adaptive Filtering Algorithms for Stereophonic Acoustic Echo Cancellation” by J. Benesty et al., Proc. IEEE ICASSP, pp. 3099-3102, 1995.) From Equation (10), it can be derived that R(n)u=0, where R(n)=E{x(n)x In practice, accurate estimation of the vector u is not trivial due to the nonstationary nature of speech, the length of the impulse responses, the background noise, etc. However, for the present application we only need to find an efficient way to detect the direct paths of the two impulse responses. In order to efficiently estimate the eigenvector (here û) corresponding to the minimum eigenvalue of R(n), the constrained LMS algorithm, familiar to those of ordinary skill in the art, may be used. (See, e.g., “An Algorithm for Linearly Constrained Adaptive Array Processing” by O. L. Frost III, Proc. of the IEEE, vol. 60, no. 8, pp. 926-935, August 1972.) The error signal is and the constrained LMS algorithm may be expressed as
where μ, the adaptation step, is a positive small constant and Substituting Equations (15) and (17) into Equation (16), and taking expectation after convergence gives which is the desired result: û converges in the mean to the eigenvector of R corresponding to the smallest eigenvalue E{e Note that if this normalization is used, then ∥û(n)∥ (which appears in e(n) and ∇e(n)) can be removed, since we will always have ∥û(n)∥=1. If the smallest eigenvalue is equal to zero, which is the case here, the algorithm can be simplified as follows: Since the goal here is not to accurately estimate the two impulse responses g FIG. 2 shows an adaptive filter arrangement for use in the adaptive eigenvalue decomposition algorithm employed in accordance with an illustrative embodiment of the present invention. Specifically, impulse response estimates Ĝ In a passive acoustic source localization system in accordance with various illustrative embodiments of the present invention, the second step of the process employs the set of TDOA estimates (determined in the first step of the process) to perform the localization of the acoustic source. Since the equations describing source localization issues are highly nonlinear, the estimation and quantization errors introduced into the TDE step will be magnified more with some methods than others. As pointed out above, many processing techniques have been proposed for acoustic source localization, any of which may be employed in accordance with various illustrative embodiments of the present invention. For example, one approach derives an expression for the likelihood function, and gives the maximum likelihood (ML) source location estimate as its maximizer. (See “Optimum Localization of Multiple Sources by Passive Arrays” by M. Wax et al., IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 1210-1218, October 1983.) 
Since the joint probability density function (pdf) of the errors in the time differences has to be assumed a priori, however, an ML estimator cannot generate optimum solutions in most practical applications. Furthermore, an ML estimator needs to search for its maximizer iteratively by using initialized gradient-descent-based methods. Selecting a good initial guess to avoid local minima is difficult, and convergence of the iterative process to the optimal solution cannot be guaranteed. Thus, closed-form source localization estimators have gained wider attention due mainly to their computational simplicity and, to some extent, acceptable robustness to noise. One such approach is to triangulate the source location from time delay estimates. (See, e.g., “Voice Source Localization for Automatic Camera Pointing System in Videoconferencing” by H. Wang et al., Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, 1997.) In such a triangulation procedure, familiar to those of ordinary skill in the art, the number of TDOA estimates is equal to the number of unknowns (depth, azimuth, and elevation). Therefore, such a technique is unable to take advantage of extra sensors and TDOA redundancy, and it is also very sensitive to noise in the TDOA estimates, especially when the source is far from the sensors and the values of the TDOAs are small. Another approach, also familiar to those of ordinary skill in the art, recognizes that the sensitivity problems of the triangulation algorithm are basically due to solving the hyperbolic equations. In order to avoid intersecting hyperboloids, the problem may be recast into one that employs spheres. This is known to those of ordinary skill in the art as the spherical intersection (SX) algorithm. (See, e.g., “Passive Source Localization Employing Intersecting Spherical Surfaces from Time-of-Arrival Differences” by H. C. Schau et al., IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1223-1225, August 1987.) 
Note, however, that the SX algorithm requires the solution of a quadratic equation to determine the depth of the acoustic source. The solution may not exist or may not be unique. Finally, a spherical interpolation (SI) method, also known to those of ordinary skill in the art, has been proposed. (See, “Closed-Form Least-Squares Source Location Estimation from Range-Difference Measurements” by O. Smith et al., IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 1661-1669, December 1987.) The SI method solves the spherical equations in a least-squares sense. In practice, the SI method has been reported capable of achieving greater noise immunity than the SX method. Specifically, therefore, the problem addressed herein is the determination of the location of an acoustic source given the array geometry and relative TDOA measurements among different microphone pairs. The problem can be stated mathematically with reference to FIG. 3 as follows: A microphone array of N+1 microphones (microphones
in a Cartesian coordinate (see FIG.
The distances from the origin to the i-th microphone and the source are denoted by R The distance between the source (sound source Then, the distance difference between microphone
d_{ij} = D_{i} − D_{j}, i, j = 0, . . . , N. (27) The difference is usually referred to as the “range difference” and it is proportional to the time delay of arrival τ
where, the speed of sound (in meters per second) can be estimated from the air temperature t
(The notations defined above are illustrated in FIG. 3.) The localization problem is then to estimate r distinct TDOA estimates τ Given a set of TDOA estimates τ=(τ
The maximum likelihood solution is then given by By assuming that the additive noise in TDOAs is zero mean and jointly Gaussian distributed, the joint probability density function of τ on r where, τ Since Σ is positive definite, the maximum likelihood solution is equivalent to that minimizing the error function defined as,
Direct estimation of the minimizer is not practicable. If the noise is assumed uncorrelated, the covariance matrix is diagonal
and the error function as defined in Equation (33) can be rewritten as The steepest descent algorithm can be used to find {circumflex over (r)}
where μ is the step size. Among the closed-form source locators, triangulation is the most straightforward method. Consider the given set of TDOA estimates τ=(τ All points on such a hyperboloid have the same range difference d This equation set is highly nonlinear. If the microphones are arbitrarily arranged, a closed-form solution may not exist and numerical methods must be used. In our simulations and for this method, the microphones are placed at
and the problem is solved in polar coordinates. From a geometric point of view, the triangulation locator finds the source location by intersecting three hyperboloids. This approach is easy to implement. However, the deterministic character of the triangulation algorithm makes it very sensitive to noise in the TDOAs; small errors in the TDOAs can drive the estimated source location far from the true location. In order to avoid solving the set of hyperbolic Equations (38), whose solution is very sensitive to noise, the source localization problem can be reorganized into a set of spherical equations. The spherical intersection locator seeks to find the source location by intersecting a group of spheres centered at the microphones. Consider the distance from the i-th microphone to the acoustic source. From the definition of the range difference of Equation (27) and from the fact that D
From the Pythagorean theorem, D
Substituting Equation (40) into Equation (41) yields,
or,
Putting the N equations together and writing them into a matrix form,
where the spherical equations are linear in r The SX locator solves the problem in two steps. It first finds the least-squares solution for r
Then, substituting (40) into R
After expansion,
where, a=1−d b=2b c=−b S′=[(S The valid (real, positive) root is taken as an estimate of the source depth R In the SX procedure, the solution of the quadratic Equations (47) for the source depth R In order to overcome the drawback of spherical intersection, a spherical interpolation locator has been proposed which attempts to relax the restriction R To begin, substitute the least-squares solution of Equation (45) into the original spherical Equations (44) to obtain, [ where I where,
_{N} − S(S^{T}S)^{−1}S^{T} = P_{s}^{T}P_{s}. (50) Substituting this solution into Equation (45) yields the spherical interpolation estimate. In simulations, the SI locator performs better, but is computationally more complex than the SX locator. The SI method tries to solve the spherical Equations (44) for the source depth and the source location in two separate steps. Since both are calculated in the least-squares sense, however, it has been realized that the procedure can be simplified by a one-step least-squares method as described herein. In particular, the one-step least-squares method advantageously generates the same results as the SI method but with less computational complexity, which is highly desirable for real-time implementations. In particular, the least-squares solution of Equation (44) for θ (the source location as well as its depth) is given by:
or written into block form as First, write the matrix that appears in Equation (53) as follows: where, Define and find, Substituting, Equation (54) with Equation (59) and Equation (60) into Equation (53) yields It can be determined that the solution of Equation (61) is equivalent to the SI estimates of Equation (51), i.e.,
However, consider the computational complexity for the SI and OSLS methods. To calculate the inverse of an M×M matrix by using the Gauss-Jordan method without pivoting (familiar to one of ordinary skill in the art), the numbers of necessary scalar multiplications and additions are given by To multiply matrix X
For the SI locator of Equation (51), many matrix multiplications and one 3×3 matrix inverse need to be performed. Note that P
multiplications and additions, respectively. Both are on the order of O(N Mul
which are on the order of only O(N). When N≧4 (i.e., 5 microphones), the OSLS method is more computationally efficient than the SI method. A real-time passive acoustic source localization system for video camera steering in accordance with an illustrative embodiment of the present invention has been implemented and consists of several hardware components and software modules. The performance of the illustrative source localization algorithm depends in particular on the geometry and size of the microphone array, an illustrative design for which is described below. Specifically, FIG. 4 shows a schematic block diagram of a real-time system infrastructure which may be employed in accordance with an illustrative embodiment of the present invention. The illustrative system comprises a real-time PC-based passive acoustic source localization system for video camera steering, comprising a front-end 6-element microphone array consisting of microphone A Pentium III™ 500 MHz PC (Dell Dimension XPS T500™) is used as a host computer (PC The analog acoustic signal is sampled at 44.1 kHz, downsampled to 8.82 kHz, and quantized with 16 bits per sample. In accordance with the principles of the present invention, five TDOAs are advantageously estimated by the AEDA algorithm of TDOA Estimator In this illustrative system, the microphones may, for example, comprise a set of six Lucent Speech Tracker Directional™ hypercardioid microphones with a 6 dB directivity index. These microphones advantageously reject noise and reverberation from the rear and allow better speech signal recording than omnidirectional microphones in the half plane in front of the microphone array. The frequency response of a Lucent Speech Tracker Directional™ microphone is 200-6000 Hz. Beyond 4 kHz there is negligible energy, and the recorded signal can be regarded as bandlimited to 4 kHz. 
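Returning to the localization step, the one-step least-squares (OSLS) solution of the linear spherical equations described above can be sketched as follows. This is a simplified illustration under our own conventions (reference microphone at the origin, noiseless range differences supplied directly), not the production implementation:

```python
import numpy as np

def osls_locate(mics, range_diffs):
    """One-step least-squares source locator (sketch). `mics` holds the
    positions of microphones 1..N relative to the reference microphone
    at the origin; `range_diffs[i]` is the range difference d_i of
    microphone i versus the reference (d_i = c * TDOA_i). Solves the
    spherical equations, which are linear in the unknowns,
        r_i . r_s + d_i * R_s = (||r_i||^2 - d_i^2) / 2,
    for theta = [x, y, z, R_s] in a single least-squares step."""
    S = np.asarray(mics, dtype=float)           # N x 3 microphone coordinates
    d = np.asarray(range_diffs, dtype=float)    # N range differences
    A = np.hstack((S, d[:, None]))              # N x 4 system matrix [S | d]
    b = 0.5 * (np.sum(S**2, axis=1) - d**2)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta[:3], theta[3]                  # source position, source depth
```

With five or more auxiliary microphones in general position, the system is overdetermined and noise in the range differences is averaged out in the least-squares sense, which is exactly the redundancy that the triangulation locator discussed earlier cannot exploit.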
A set of six one-stage pre-amplifiers having, for example, a fixed gain of 37 dB may be advantageously used to enhance the signal output of each microphone. The pre-amplifier outputs are connected to the analog inputs from the rear panel using twisted-pair balanced cables provided with XLR connectors. The digital outputs are supplied to the STUDI/O™ digital audio interface card which is installed on the PC running Windows98™, using an optic fiber provided with a TOSLINK connector. The STUDI/O™ board may be set as an eight-channel logical device to the Windows Sound System for easy synchronization. The illustrative passive acoustic source localization system advantageously uses a high-quality (resolution 460 H×350 V lines) Sony EVI-D30™ color video camera with pan, tilt and zoom control. The ±100° pan range, ±25° tilt range, and 4.4°˜48.8° horizontal angle of view allow for optimum coverage of a normal conference room environment. The EVI-D30™ camera has two high-speed motors for panning and tilting. The maximum pan speed is approximately 80°/sec and the maximum tilt speed is 50°/sec. Pan, tilt, and zoom operation can be performed at the same time. Therefore, the camera is able to capture the full interaction of videoconference participants at remote locations. The EVI-D30™ camera can be controlled by PC The microphone array geometry and size play an important role in the system performance. In general, the bias and standard deviation decrease as the array size gets bigger. However, the array size is usually restricted by the application requirements. For the illustrative system of FIG. 4, for example, in order to provide portability, the distance of each microphone Specifically then, in accordance with the illustrative embodiment of the present invention illustrated in FIG. was used as the figure of merit. 
For each source location, white Gaussian noise with zero mean and variance 0.25 was added to the TDOAs, and the rms error was obtained from 100-trial Monte-Carlo runs. As a result of these simulations, appropriate values of the angle θ were selected. Therefore, in the illustrative system of FIGS. 4 and 5, the microphone array has been designed according to the above-described simulation results, and the reference microphone and the remaining microphones are positioned in accordance therewith. The preceding merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future—i.e., any elements developed that perform the same function, regardless of structure. Thus, for example, it will be appreciated by those skilled in the art that the block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. 
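The Monte-Carlo evaluation described above can be sketched as follows. This is a hedged illustration rather than the patent's code: the localizer passed in below is a trivial stand-in (an identity mapping), where the actual system would apply the OSLS estimator to map TDOAs to a source location; the noise statistics (zero mean, variance 0.25) and trial count (100) follow the text, and all function names are assumptions.

```python
import math
import random

# Hypothetical sketch of the 100-trial Monte-Carlo rms-error procedure:
# zero-mean white Gaussian noise of variance 0.25 (std dev 0.5) is added
# to the true TDOAs, a localizer maps the noisy TDOAs to a location
# estimate, and the rms location error is accumulated over the trials.

def rms_error(true_tdoas, localize, true_location,
              trials=100, noise_var=0.25, seed=0):
    rng = random.Random(seed)          # fixed seed for a repeatable run
    sigma = math.sqrt(noise_var)
    total = 0.0
    for _ in range(trials):
        noisy = [t + rng.gauss(0.0, sigma) for t in true_tdoas]
        est = localize(noisy)
        total += sum((e - t) ** 2 for e, t in zip(est, true_location))
    return math.sqrt(total / trials)

# Toy stand-in localizer: identity mapping from TDOAs to "coordinates",
# so the resulting rms error simply reflects the injected noise level.
err = rms_error([1.0, 2.0, 3.0],
                localize=lambda d: list(d),
                true_location=[1.0, 2.0, 3.0])
print(err)
```

In the patent's procedure this loop would be repeated for each candidate array geometry and source location, and the geometry yielding the smallest rms error adopted.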
Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown. The functions of the various elements shown in the figures, including functional blocks labeled as “processors” or “modules,” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementor as more specifically understood from the context. In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, (a) a combination of circuit elements which performs that function or (b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the functions. 
The invention as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. Applicant thus regards any means which can provide those functionalities as equivalent (within the meaning of that term as used in 35 U.S.C. 112, paragraph 6) to those explicitly shown and described herein.