BACKGROUND OF THE INVENTION
The present invention is related to the article "Using Speech Acoustics to Drive Facial Motion", by Hani Yehia, Takaaki Kuratate and Eric Vatikiotis-Bateson (Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 1, pp. 631-634, American Institute of Physics, August 1999), which is attached hereto.
1. Field of the Invention
The present invention is an electronic communication technique. More specifically, it consists of a method and system for the digital encoding and decoding of audiovisual speech, i.e., the facial image and sound produced by a speaker. The signal is encoded at low bit-rates: the speech acoustics is represented in parametric form, whereas the facial image is estimated from the speech acoustic parameters by means of a statistical model.
2. Description of the Background Art
Developments in wide area computer networks and digital communication techniques have contributed to the practical use of video conference systems. These systems enable persons at remote locations to hold a conference through a network. Also, telephone communication can be expanded to incorporate video information by means of the digital (CCD) cameras currently available. Such systems, however, require bit-rates sufficiently low that the users' demand is compatible with channel capacity.
Using conventional techniques, the transmission of image signals requires a bit-rate two to three orders of magnitude larger than that required for the transmission of telephone speech acoustics. Thus, if video is to be transmitted over a telephone line, the frame rate has to be very low.
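As a rough numerical illustration of this gap, the ratio can be computed as below. The figures are assumed for the sketch only and do not appear in the specification:

```python
import math

# Assumed, illustrative figures: telephone-grade speech audio versus a
# conventional video stream (neither number comes from the specification).
speech_bps = 64_000        # e.g. 64 kbit/s telephone-quality audio
video_bps = 10_000_000     # e.g. ~10 Mbit/s for a modest video stream

ratio = video_bps / speech_bps
print(f"video needs about {ratio:.0f}x the speech bit-rate "
      f"(~{math.log10(ratio):.1f} orders of magnitude)")
```

With these assumed numbers the ratio is about 156, i.e. a little over two orders of magnitude, consistent with the range stated above.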
One way to solve this problem is to increase the bit-rate capacity of the channel. Such a solution is, however, expensive and, hence, not practical. Moreover, the increasing demand for real-time video communication justifies efforts in the direction of innovative video compression techniques.
The video compression rate is limited if compression is carried out without taking into account the contents of the image sequence that forms the video signal. In the case of audiovisual speech coding, however, it is known that the image being encoded is that of a human face. The use of this information allows the development of much more efficient compression techniques. Furthermore, during speech, the acoustic signal is directly related to the speaker's facial motion. Thus, if the redundancy between the audio and video signals is reduced, larger compression rates can be achieved. The technique described in this text goes in this direction.
SUMMARY OF THE INVENTION
The objective of the present invention is to provide a method and system of audiovisual speech coding, which is capable of transmitting and recovering a speaker's facial motion and speech audio with high quality even through a channel of limited capacity.
This objective is achieved in two steps. First, facial images are encoded based on the a priori information that the image being encoded is that of a human face. Second, the dependence between speech acoustics and facial motion is used to allow facial image recovery from the speech audio signal.
In the present invention, the method of transmitting facial image includes the following steps: (1) setup, at the receiver, of a facial shape estimator which receives the speech audio signal as input and generates a facial image of the speaker as output; (2) transmission of the speech audio signal to the receiver; and (3) generation of the facial images which form the speaker's video signal.
Thus, transmission of only the speech audio signal enables the receiver to generate the speaker's facial video. The facial image can then be transmitted with high efficiency, using a channel of far lower bit-rate, as compared with the transmission bit-rate required for standard image coding.
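The three numbered steps above can be sketched as follows. The interfaces and the toy estimator are hypothetical stand-ins for the trained facial shape estimator and are not part of the specification:

```python
from typing import Callable, List

Frame = List[float]

def transmit(audio_frames: List[Frame], channel: List[Frame]) -> None:
    """Transmitting side: only the speech audio signal goes on the channel."""
    for frame in audio_frames:
        channel.append(frame)

def receive(channel: List[Frame],
            estimator: Callable[[Frame], Frame]) -> List[Frame]:
    """Receiving side: the facial video is regenerated from the audio alone."""
    return [estimator(frame) for frame in channel]

# Toy estimator standing in for the trained neural network (step 1):
toy_estimator = lambda audio: [2.0 * x for x in audio]

channel: List[Frame] = []
transmit([[0.1, 0.2], [0.3, 0.4]], channel)   # step 2: audio only
video = receive(channel, toy_estimator)       # step 3: video from audio
print(video)  # one facial-coordinate frame per audio frame
```

The point of the sketch is that nothing resembling video ever crosses the channel: only audio frames are transmitted, and the receiver reconstructs one facial-coordinate frame per audio frame.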
Preferably, the setup step is divided into the following parts: (1.a) specification of an artificial neural network architecture to be used at both the transmitter and receiver sides; (1.b) training of the artificial neural network on the transmitting side so that facial images determined from the speech audio signal match the original facial images as well as possible; and (1.c) transmission of the weights of the trained artificial neural network to the receiver.
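A minimal sketch of parts (1.a)-(1.c), assuming a small two-layer network whose weight matrices are simply copied to model their transmission over the channel (the architecture, dimensions, and names are invented for illustration):

```python
import random

def make_network(n_in, n_hidden, n_out, seed=None):
    """(1.a) One agreed architecture, instantiated on both sides."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    return {"w1": w1, "w2": w2}

tx_net = make_network(n_in=12, n_hidden=8, n_out=6, seed=0)  # transmitter
rx_net = make_network(n_in=12, n_hidden=8, n_out=6)          # receiver

# (1.b) ...training on the transmitting side is omitted in this sketch...

# (1.c) Only the trained weights cross the channel:
rx_net["w1"] = [row[:] for row in tx_net["w1"]]
rx_net["w2"] = [row[:] for row in tx_net["w2"]]
print(rx_net == tx_net)  # receiver now mirrors the transmitter
```

Because only the weights (a fixed, small payload) are sent once before communication starts, this setup cost is independent of the length of the subsequent conversation.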
The artificial neural network of the transmitting side is trained and its parameters are sent to the receiving side before communication starts. So, the artificial neural network of the receiving side is set identically to that of the transmitting side when communication is established. It is thus ready for audiovisual speech communication using only the speech audio to recover the speech video counterpart.
Preferably, the step of neural network training consists of: measuring coordinates of predetermined portions of the speaker's face during speech production on the transmitting side; simultaneously extracting parameters from the speech audio signal; and adjusting the weights of the artificial neural network using the speech audio parameters as input and the measured facial coordinates as the reference signal.
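This training step might be sketched as below, with a single linear layer trained by least-mean-squares standing in for the artificial neural network; the data, dimensions, and learning rate are invented for illustration:

```python
def train(audio_params, face_coords, lr=0.1, epochs=200):
    """Adjust weights so that audio parameters (input) reproduce the
    measured facial coordinates (reference signal)."""
    n_in = len(audio_params[0])
    n_out = len(face_coords[0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for x, y_ref in zip(audio_params, face_coords):
            y_est = [sum(w[o][i] * x[i] for i in range(n_in))
                     for o in range(n_out)]
            for o in range(n_out):
                err = y_ref[o] - y_est[o]          # reference minus estimate
                for i in range(n_in):
                    w[o][i] += lr * err * x[i]     # LMS weight update
    return w

# Toy data: one facial coordinate that is twice the single audio parameter.
w = train(audio_params=[[0.5], [1.0], [1.5]],
          face_coords=[[1.0], [2.0], [3.0]])
print(round(w[0][0], 2))  # converges to 2.0
```

In practice the specification calls for a multi-layer artificial neural network trained per speaker; the linear LMS rule here is only the simplest member of that family.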
The artificial neural network is trained for each speaker. Therefore, efficient real time transmission of facial images of an arbitrary speaker is possible.
Preferably, the method of face image transmission also includes the following steps: measuring, for each frame, coordinates of predetermined portions of the speaker's face during speech production; applying the speech audio signal to the trained artificial neural network of the transmitting side to obtain estimated values of the coordinates of the predetermined portions of the speaker's face; and comparing the measured and estimated coordinate values to find the estimation error.
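The comparison step reduces, per frame, to a coordinate-wise difference between measured and estimated values; the sketch below uses invented marker coordinates:

```python
def estimation_error(measured, estimated):
    """Per-frame error between tracked and network-estimated coordinates."""
    return [m - e for m, e in zip(measured, estimated)]

measured  = [10.0, 5.0, 7.5]   # coordinates measured on the speaker's face
estimated = [9.5, 5.5, 7.5]    # trained-network output for the same frame
print(estimation_error(measured, estimated))  # [0.5, -0.5, 0.0]
```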
As the error between the coordinate values of the predetermined positions of the speaker's face estimated by the artificial neural network and the actual coordinates of those positions measured on the transmitting side is found, it becomes possible to determine to what extent the face image of the speaker generated on the receiving side through communication matches the original face image.
Preferably, the method of face image transmission further includes the following steps: transmitting the estimation error to the receiving side; and correcting the output of the artificial neural network on the receiving side based on the estimation error received. The precision used to transmit the estimation error is, however, limited by the channel capacity (bit-rate) available.
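One way to picture the bit-rate limit on the error signal is coarse quantization: the transmitter sends only a rounded version of the error, and the receiver adds it to the network output. The quantization step and all values below are invented for the sketch:

```python
def quantize(err, step=0.5):
    """Coarse quantization models the limited bit-rate available
    for transmitting the estimation error (step size is invented)."""
    return [round(e / step) * step for e in err]

estimated = [9.5, 5.5, 7.5]      # receiver-side network output
true_err  = [0.4, -0.6, 0.1]     # error found on the transmitting side
sent_err  = quantize(true_err)   # what actually crosses the channel
corrected = [y + e for y, e in zip(estimated, sent_err)]
print(sent_err)    # [0.5, -0.5, 0.0]
print(corrected)   # [10.0, 5.0, 7.5]
```

A finer quantization step gives a more faithful correction but consumes more of the channel capacity, which is exactly the trade-off the paragraph above describes.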
As the error signal obtained on the transmitting side is transmitted to the receiving side, it becomes possible to correct the image obtained on the receiving side by using the error signal. As a result, a video signal of the speaker's face matching the speech signal can be generated.