A facial feature extraction method and apparatus uses the variation in light intensity (gray-scale) of a frontal view of a speaker's face. The sequence of video images is sampled and quantized into a regular array of 150×150 pixels that naturally form a coordinate system of scan lines...
Inventors: K. Venkatesh Prasad, David G. Stork
Assignees: Ricoh Corporation, Ricoh Company, LTD
Primary Examiner: Monica S. Davis
U.S. Classification: 382/190; 382/118; 382/159; 382/193; 382/202
International Classification: G06K 9/46; G06K 9/32; G06K 9/00; G06K 9/62
Claims
What is claimed is:
1. A method for extracting a visual feature vector from a sequence of images, each having a plurality of horizontal raster lines, of frontal views of a speaker's face in a speech classification system, the method comprising the following steps:
- a) sampling and quantizing each image at uniform intervals along each horizontal raster line of the image to produce an image represented by an array of pixels, each pixel value representing gray-scale level;
- b) preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
- c) thresholding the preconditioned pixel image by using a threshold value for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
- d) calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
- e) establishing an eye line segment as a straight line connecting the left and right eye area locations;
- f) establishing a vertical axis of symmetry as a straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
- g) establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
- h) selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
- i) selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale values; and
- j) selecting a set of pixels and associated pixel values that occur at the peaks and valleys (maxima and minima) of the vertical and the horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
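Steps (e) through (j) of claim 1 can be illustrated with a minimal Python sketch. It assumes the eye and mouth area locations from step (d) are already available as (x, y) pixel coordinates and that the image is a list of rows of gray-scale values. The function names `face_axes`, `sample_profile`, and `local_extrema`, the nearest-pixel sampling, and the strict-inequality test for peaks and valleys are illustrative assumptions, not details taken from the patent.

```python
import math


def face_axes(left_eye, right_eye, mouth):
    """Return (crossing, axis_dir, mouth_dir) for steps (e)-(g).

    crossing  -- point where the mouth line meets the axis of symmetry
    axis_dir  -- unit vector along the axis of symmetry
    mouth_dir -- unit vector along the mouth line (parallel to the eye line)
    """
    ex = right_eye[0] - left_eye[0]
    ey = right_eye[1] - left_eye[1]
    norm = math.hypot(ex, ey)
    if norm == 0:
        raise ValueError("eye locations coincide")
    # The mouth line is perpendicular to the axis, hence parallel to the eye line.
    mouth_dir = (ex / norm, ey / norm)
    # Rotate the eye-line direction by 90 degrees to get the axis direction.
    axis_dir = (-mouth_dir[1], mouth_dir[0])
    # The axis bisects the eye line segment, so it passes through the midpoint.
    mid = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    # Project the mouth location onto the axis to find the crossing point.
    t = (mouth[0] - mid[0]) * axis_dir[0] + (mouth[1] - mid[1]) * axis_dir[1]
    crossing = (mid[0] + t * axis_dir[0], mid[1] + t * axis_dir[1])
    return crossing, axis_dir, mouth_dir


def sample_profile(image, center, direction, half_len):
    """Steps (h)/(i): gray-scale values along a line, nearest-pixel sampling.

    Returns a list of ((x, y), value); samples outside the image are skipped.
    """
    profile = []
    for k in range(-half_len, half_len + 1):
        x = int(round(center[0] + k * direction[0]))
        y = int(round(center[1] + k * direction[1]))
        if 0 <= y < len(image) and 0 <= x < len(image[0]):
            profile.append(((x, y), image[y][x]))
    return profile


def local_extrema(profile):
    """Step (j): pixels at strict local peaks and valleys of a profile.

    Flat plateaus are ignored in this sketch (an assumption).
    """
    found = []
    for i in range(1, len(profile) - 1):
        before, value, after = profile[i - 1][1], profile[i][1], profile[i + 1][1]
        if value > before and value > after:
            found.append((profile[i][0], value, "peak"))
        elif value < before and value < after:
            found.append((profile[i][0], value, "valley"))
    return found
```

Calling `sample_profile` once with `axis_dir` and once with `mouth_dir`, both centered on `crossing`, gives the vertical and horizontal sectional views; the extrema of the two profiles together form the candidate feature-vector elements.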
2. The method of claim 1 wherein the step of selecting image pixels along the mouth line results in the selected pixels corresponding to a left and a right mouth corner position, and wherein the step of selecting image pixels along the axis of symmetry results in the selected pixels corresponding to an upper lip, a mouth, and a lower lip position.
3. The method of claim 2 further comprising the following steps:
- k) computing a mouth corner-to-corner distance measure by taking the difference between the location of the selected pixels corresponding to the left and right mouth corner positions;
- l) computing a vertical mouth separation distance by taking the difference between the location of the selected pixels corresponding to the upper and lower lip positions;
- m) computing horizontal mouth corner-to-corner speed by taking the difference between mouth corner-to-corner distances of adjacent sequential image frames;
- n) computing a set of vertical speeds of an upper lip position, of a mouth area position, and of a lower lip position by taking the difference between pixel positions corresponding to upper lip positions, mouth area positions, and lower lip positions of adjacent sequential image frames;
- o) computing a set of pixel gray-level value changes with respect to time by taking the difference in pixel values of the selected set of pixels between adjacent sequential image frames; and
- p) forming a visual feature vector from the values calculated in steps (k) through (o) and from the selected set of pixel values.
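The arithmetic in steps (k) through (p) amounts to differences within one frame and between adjacent frames. The sketch below shows one way to assemble such a vector; the dictionary keys, the function name `visual_feature_vector`, and the ordering of elements are hypothetical choices made for illustration, since the claim does not fix a layout.

```python
def visual_feature_vector(prev, cur):
    """Build a feature vector for frame `cur` given the preceding frame `prev`.

    Each frame is a dict (hypothetical layout) with:
      'left', 'right'           -- mouth corner positions along the mouth line
      'upper', 'mouth', 'lower' -- lip and mouth positions along the axis of symmetry
      'values'                  -- gray levels of the selected pixels
    """
    # (k) mouth corner-to-corner distance
    width = cur["right"] - cur["left"]
    # (l) vertical separation between upper and lower lip
    height = cur["lower"] - cur["upper"]
    # (m) horizontal speed: change in corner-to-corner distance between frames
    prev_width = prev["right"] - prev["left"]
    width_speed = width - prev_width
    # (n) vertical speeds of upper lip, mouth, and lower lip positions
    vertical_speeds = [cur[k] - prev[k] for k in ("upper", "mouth", "lower")]
    # (o) gray-level change of each selected pixel between frames
    gray_changes = [c - p for c, p in zip(cur["values"], prev["values"])]
    # (p) concatenate the computed values with the selected pixel values
    return [width, height, width_speed] + vertical_speeds + gray_changes + list(cur["values"])
```

Because the frame interval is constant, the frame-to-frame differences serve directly as speeds in units of pixels (or gray levels) per frame.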
4. The method of claim 1 wherein the step of selecting image pixels along the mouth line selects pixels so that the selected pixels correspond to a left and right mouth corner position and wherein the step of selecting image pixels along the axis of symmetry selects pixels so that the selected pixels correspond to a lower nose area position, an upper lip position, a mouth area position, a lower lip position, and a chin area position, the lower nose area position and the chin area position respectively defining the upper and lower limits of the vertical sectional view of gray-scale pixel values.
5. The method of claim 1 further comprising the step of frame-to-frame temporal smoothing for noise reduction by convolving sequential image frames with a prescribed low-pass filter kernel.
6. An apparatus for extracting a visual feature vector from a sequence of raster-scanned video images of frontal views of a speaker's face for use in a speech classification system, the apparatus comprising:
- a) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
- b) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
- c) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
- d) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
- e) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
- f) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
- g) computing means for establishing a mouth line by passing a straight line through the mouth area location, the line being perpendicular to the vertical axis of symmetry;
- h) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
- i) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
- j) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maxima and minima) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
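The filter, threshold, and location-computing elements (b) through (d) could be realized in software along the following lines. This is a rough sketch under several assumptions not stated in the claims: a 3×3 kernel, border pixels left unfiltered, eye and mouth areas taken to be darker than the threshold, each area searched inside a caller-supplied window, and the area location taken as the centroid of its pixels. All function names are hypothetical.

```python
def convolve3x3(image, kernel):
    """2-D convolution of a gray-scale image (list of rows) with a 3x3 kernel.

    Border pixels are copied unchanged (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    out = [list(row) for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Kernel indices are flipped, as convolution requires.
                    acc += kernel[1 - dy][1 - dx] * image[y + dy][x + dx]
            out[y][x] = acc
    return out


def dark_pixels(image, threshold, window):
    """Pixels below `threshold` inside window (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = window
    return [(x, y)
            for y in range(y0, y1)
            for x in range(x0, x1)
            if image[y][x] < threshold]


def centroid(points):
    """Area location as the mean (x, y) of the pixels in a thresholded area."""
    if not points:
        raise ValueError("no pixels in area")
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
```

A smoothing kernel such as nine equal weights of 1/9, followed by an edge-enhancing kernel, would be passed to `convolve3x3` before thresholding; the specific kernels are left to the caller because the claims do not name them.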
7. The apparatus of claim 6 wherein the filter and threshold means are computing means.
8. The apparatus of claim 7 wherein the computing means are a programmable computer.
9. A speech recognition system for recognizing utterances belonging to a preestablished set of allowable candidate utterances, comprising:
- a) a visual feature vector extraction apparatus for forming a sequence of visual feature vectors from a sequence of raster-scanned video images of a speaker's face, comprising:
- i) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale centered at the uniform intervals;
- ii) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using convolution techniques;
- iii) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
- iv) computing means for calculating a left eye area location, a right eye area location and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
- v) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
- vi) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
- vii) computing means for establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
- viii) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
- ix) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
- x) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maxima and minima) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector;
- b) an acoustic feature vector extraction apparatus for converting signals representative of acoustic speech occurring concomitantly with the raster-scanned video images into a corresponding sequence of acoustic feature vectors; and
- c) a neural network classifying apparatus for generating a conditional probability distribution of the allowable speech utterances by accepting and operating on the acoustic and visual feature vector sequences respectively supplied by the acoustic and visual feature vector extraction apparatus.
10. The system of claim 9 wherein the visual feature vector extraction apparatus comprises a programmable computer for performing the filter, threshold, and computing means functions.
11. The system of claim 9 wherein the step of selecting image pixels along the mouth line selects pixels so that the selected pixels correspond to a left and a right mouth corner position and wherein the step of selecting image pixels along the axis of symmetry selects pixels so that the selected pixels correspond to a lower nose area position, an upper lip position, a mouth area position, a lower lip position, and a chin area position, the lower nose area position and the chin area position respectively defining the upper and lower limits of the vertical sectional view of gray-scale pixel values.
12. The system of claim 11 wherein the computing means for selecting a set of pixels and associated pixel values as a set of elements of a visual feature vector further comprises computer means for the following steps:
- aa) computing a mouth corner-to-corner distance measure by taking the difference between the location of the selected pixels corresponding to the left and right mouth corner positions;
- bb) computing a vertical mouth separation distance by taking the difference between the location of the selected pixels corresponding to the upper and lower lip positions;
- cc) computing horizontal mouth corner-to-corner speed by taking the difference between mouth corner-to-corner distances of adjacent sequential image frames;
- dd) computing a set of vertical speeds of an upper lip position, of a mouth area position, and of a lower lip position by taking the difference between pixel positions corresponding to upper lip positions, mouth area positions, and lower lip positions of adjacent sequential image frames;
- ee) computing a set of pixel gray-level value changes with respect to time by taking the difference in pixel values of the selected set of pixels between adjacent sequential image frames; and
- ff) forming a visual feature vector from the values calculated in steps (aa) through (ee) and from the selected set of pixel values.
13. The system of claim 9 further comprising means for temporal smoothing across successive pixel image frames for noise reduction by convolution with a prescribed low-pass filter kernel.
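Temporal smoothing of this kind can be sketched as a per-pixel one-dimensional convolution along the frame axis. The example below is an assumption-laden illustration: the function name `temporal_smooth` is hypothetical, the kernel is supplied by the caller (the claims only say "prescribed"), and only output frames for which the kernel fully overlaps the input sequence are produced.

```python
def temporal_smooth(frames, kernel):
    """Convolve a sequence of equally sized frames with a 1-D kernel over time.

    frames -- list of images, each a list of rows of gray-scale values
    kernel -- low-pass weights, e.g. [0.25, 0.5, 0.25] (an illustrative choice)
    Returns len(frames) - len(kernel) + 1 smoothed frames ("valid" outputs only).
    """
    k = len(kernel)
    if len(frames) < k:
        return []
    h, w = len(frames[0]), len(frames[0][0])
    smoothed = []
    for t in range(len(frames) - k + 1):
        frame = [
            [
                # Kernel is index-reversed so this is a true convolution.
                sum(kernel[k - 1 - j] * frames[t + j][y][x] for j in range(k))
                for x in range(w)
            ]
            for y in range(h)
        ]
        smoothed.append(frame)
    return smoothed
```

With weights that sum to one, a pixel whose value is constant over time is left unchanged, while frame-to-frame noise is attenuated.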