Facial feature extraction method and apparatus for a neural network acoustic ...

K. Venkatesh Prasad et al.
A facial feature extraction method and apparatus uses the variation in light intensity (gray-scale) of a frontal view of a speaker's face. The sequence of video images is sampled and quantized into a regular array of 150×150 pixels that naturally form a coordinate system of scan lines...
Inventors: K. Venkatesh Prasad, David G. Stork
Assignees: Ricoh Corporation, Ricoh Company, LTD
Primary Examiner: Monica S. Davis

U.S. Classification
382/190; 382/118; 382/159; 382/193; 382/202

International Classification
G06K 9/46; G06K 9/32; G06K 9/00; G06K 9/62

Citations

Patent Number | Title | Issue date
3999006 | Visitor confirmation system | Dec 21, 1976
4109237 | Apparatus and method for identifying individuals through their retinal vasculature patterns | Aug 22, 1978
4228465 | Analog video convolver for real time two-dimensional image processing | Oct 14, 1980
4449189 | Personal access control system using speech and face recognition | May 15, 1984
4625329 | Position analyzer for vehicle drivers | Nov 25, 1986
4773024 | Brain emulation circuit with reduced confusion | Sep 20, 1988
4931865 | Apparatus and methods for monitoring television viewers | Jun 5, 1990
4975960 | Electronic facial tracking and detection system and method and apparatus for automated speech recognition | Dec 4, 1990
4975969 | Method and apparatus for uniquely identifying individuals by particular physical characteristics and security system utilizing the same | Dec 4, 1990
4975978 | Image signal amplifier | Dec 4, 1990
5008946 | System for recognizing image | Apr 16, 1991
5012522 | Autonomous face recognition machine | Apr 30, 1991
5063603 | Dynamic method for recognizing objects and image processing system therefor | Nov 5, 1991
5412738 | Recognition system, particularly for recognising people | May 2, 1995

Claims

What is claimed is:

1. A method for extracting a visual feature vector from a sequence of images, each having a plurality of horizontal raster lines, of frontal views of a speaker's face in a speech classification system, the method comprising the following steps:

a) sampling and quantizing each image at uniform intervals along each horizontal raster line of the image to produce an image represented by an array of pixels, each pixel value representing gray-scale level;
b) preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) thresholding the preconditioned pixel image by using a threshold value for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) establishing an eye line segment as a straight line connecting the left and right eye area locations;
f) establishing a vertical axis of symmetry as a straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
h) selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale values; and
j) selecting a set of pixels and associated pixel values that occur at the peaks and valleys (maximas and minimas) of the vertical and the horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
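
The geometric steps (e) through (j) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes points are given as (x, y) = (column, row) with rows increasing downward, that profiles are sampled at the nearest pixel in unit steps, and that a peak or valley is a strict local extremum. The function names face_axes, sample_profile, and peaks_and_valleys are hypothetical.

```python
import math


def face_axes(left_eye, right_eye, mouth):
    """Steps (e)-(g). Points are (x, y) = (column, row), rows growing downward.

    Returns the eye-line midpoint, the unit direction of the axis of symmetry,
    the unit direction of the mouth line, and the point where the two cross.
    """
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("eye locations coincide")
    eye_dir = (dx / length, dy / length)
    # The axis of symmetry bisects the eye line segment, so it passes
    # through the midpoint of the two eye area locations.
    midpoint = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    # Rotating the eye direction by 90 degrees gives the axis direction. It
    # points toward larger rows when left_eye has the smaller column.
    axis_dir = (-eye_dir[1], eye_dir[0])
    # A line perpendicular to the axis is parallel to the eye line.
    mouth_dir = eye_dir
    # Project the mouth location onto the axis to find where the lines cross.
    t = (mouth[0] - midpoint[0]) * axis_dir[0] + (mouth[1] - midpoint[1]) * axis_dir[1]
    crossing = (midpoint[0] + t * axis_dir[0], midpoint[1] + t * axis_dir[1])
    return midpoint, axis_dir, mouth_dir, crossing


def sample_profile(image, start, direction, count):
    """Steps (h)-(i): gray levels at `count` unit steps from `start`.

    `image` is a list of rows; nearest-pixel sampling is an assumption.
    """
    profile = []
    for k in range(count):
        col = round(start[0] + k * direction[0])
        row = round(start[1] + k * direction[1])
        profile.append(image[row][col])
    return profile


def peaks_and_valleys(profile):
    """Step (j): indices of strict local maxima and minima of a profile."""
    peaks, valleys = [], []
    for i in range(1, len(profile) - 1):
        if profile[i - 1] < profile[i] > profile[i + 1]:
            peaks.append(i)
        elif profile[i - 1] > profile[i] < profile[i + 1]:
            valleys.append(i)
    return peaks, valleys
```

In use, the vertical profile would start a few pixels above the crossing point and run along axis_dir, and the horizontal profile would start to one side of the crossing point and run along mouth_dir. How wide "the vicinity" is remains a free parameter that the claim does not fix.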

2. The method of claim 1 wherein the step of selecting image pixels along the mouth line results in the selected pixels corresponding to a left and a right mouth corner position, and wherein the step of selecting image pixels along the axis of symmetry results in the selected pixels corresponding to an upper lip, a mouth, and a lower lip position.

3. The method of claim 2 further comprising the following steps:

k) computing a mouth corner-to-corner distance measure by taking the difference between the location of the selected pixels corresponding to the left and right mouth corner positions;
l) computing a vertical mouth separation distance by taking the difference between the location of the selected pixels corresponding to the upper and lower lip positions;
m) computing horizontal mouth corner-to-corner speed by taking the difference between mouth corner-to-corner distances of adjacent sequential image frames;
n) computing a set of vertical speeds of an upper lip position, of a mouth area position, and of a lower lip position by taking the difference between pixel positions corresponding to upper lip positions, mouth area positions, and lower lip positions of adjacent sequential image frames;
o) computing a set of pixel gray-level value changes with respect to time by taking the difference in pixel values of the selected set of pixels between adjacent sequential image frames; and
p) forming a visual feature vector from the values calculated in steps (k) through (o) and from the selected set of pixel values.
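
Steps (k) through (p) amount to simple differences within a frame and between adjacent frames. The following Python sketch shows one way to assemble them, under stated assumptions: each frame is described by a dictionary with the hypothetical keys shown in the docstring, corner positions are scalar coordinates along the mouth line, lip positions are scalar coordinates along the axis of symmetry, and speeds are expressed in pixels per frame. The ordering of elements within the vector is also an assumption.

```python
def visual_feature_vector(prev, curr):
    """Build a feature vector for `curr` using `prev` as the preceding frame.

    Each frame is a dict with hypothetical keys:
      left_corner, right_corner  -- positions along the mouth line
      upper_lip, mouth, lower_lip -- positions along the axis of symmetry
      gray                        -- gray levels of the selected pixels
    """
    # (k) mouth corner-to-corner distance
    width = curr["right_corner"] - curr["left_corner"]
    # (l) vertical mouth separation
    opening = curr["lower_lip"] - curr["upper_lip"]
    # (m) horizontal speed: change in corner-to-corner distance between frames
    prev_width = prev["right_corner"] - prev["left_corner"]
    width_speed = width - prev_width
    # (n) vertical speeds of the upper lip, mouth area, and lower lip
    vertical_speeds = [curr[k] - prev[k] for k in ("upper_lip", "mouth", "lower_lip")]
    # (o) gray-level change of each selected pixel between frames
    gray_changes = [c - p for c, p in zip(curr["gray"], prev["gray"])]
    # (p) concatenate the computed values with the selected pixel values
    return [width, opening, width_speed] + vertical_speeds + gray_changes + list(curr["gray"])
```

With three selected pixels per frame, the resulting vector has twelve elements: two distances, four speeds, three gray-level changes, and three gray levels.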

4. The method of claim 1 wherein the step of selecting image pixels along the mouth line selects pixels so that the selected pixels correspond to a left and right mouth corner position and wherein the step of selecting image pixels along the axis of symmetry selects pixels so that the selected pixels correspond to a lower nose area position, an upper lip position, a mouth area position, a lower lip position, and a chin area position, the lower nose area position and the chin area position respectively defining the upper and lower limits of the vertical sectional view of gray-scale pixel values.

5. The method of claim 1 further comprising the step of frame-to-frame temporal smoothing for noise reduction by convolving sequential image frames with a prescribed low-pass filter kernel.
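
The temporal smoothing of claim 5 can be sketched as a one-dimensional convolution applied along the time axis at every pixel. The claim does not give the kernel; the three-tap kernel (0.25, 0.5, 0.25) used here is only an illustrative low-pass choice. Repeating the first and last frames at the ends of the sequence is likewise an assumption, and the function name temporal_smooth is hypothetical.

```python
def temporal_smooth(frames, kernel=(0.25, 0.5, 0.25)):
    """Smooth a sequence of equally sized frames along the time axis.

    `frames` is a list of images, each a list of rows. The kernel is assumed
    to be symmetric and of odd length, so convolution and correlation agree.
    Frames beyond either end are replaced by the nearest existing frame.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    smoothed = []
    for t in range(n):
        out = [[0.0] * cols for _ in range(rows)]
        for k, weight in enumerate(kernel):
            # Clamp the source index so the ends reuse the edge frames.
            src = frames[min(max(t + k - half, 0), n - 1)]
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += weight * src[r][c]
        smoothed.append(out)
    return smoothed
```

Because the kernel weights sum to one, a pixel whose value is constant over time is left unchanged, while frame-to-frame noise is attenuated.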

6. An apparatus for extracting a visual feature vector from a sequence of raster-scanned video images of frontal views of a speaker's face for use in a speech classification system, the apparatus comprising:

a) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
b) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
f) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) computing means for establishing a mouth line by passing a straight line through the mouth area location, the line being perpendicular to the vertical axis of symmetry;
h) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
j) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
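
When the filter, threshold, and location computing means of elements (b) through (d) are realized in software, they might look like the following Python sketch. Several points are assumptions not stated in the claim: the convolution kernel is 3×3 and leaves border pixels unchanged, the eye and mouth areas are taken to be darker than the threshold, each area is isolated by a caller-supplied search window, and an area location is the centroid of its pixels. All function names are hypothetical.

```python
def convolve3x3(image, kernel):
    """Element (b): 2-D convolution with a 3x3 kernel; borders are copied."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                kernel[i][j] * image[r + 1 - i][c + 1 - j]
                for i in range(3)
                for j in range(3)
            )
    return out


def threshold_mask(image, threshold):
    """Element (c): mark pixels darker than the threshold (an assumption)."""
    return [[pixel < threshold for pixel in row] for row in image]


def region_centroid(mask, window):
    """Element (d): centroid (x, y) of marked pixels inside a search window.

    `window` is (row0, row1, col0, col1) with half-open ranges. Returns None
    when the window contains no marked pixel.
    """
    row0, row1, col0, col1 = window
    points = [
        (c, r)
        for r in range(row0, row1)
        for c in range(col0, col1)
        if mask[r][c]
    ]
    if not points:
        return None
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
```

A smoothing kernel (for example, nine equal weights of 1/9) and an edge-enhancing kernel could be applied in sequence with convolve3x3 before thresholding. The particular kernels are a design choice left open by the claim.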

7. The apparatus of claim 6 wherein the filter and threshold means are computing means.

8. The apparatus of claim 7 wherein the computing means are a programmable computer.

9. A speech recognition system for recognizing utterances belonging to a preestablished set of allowable candidate utterances, comprising:

a) a visual feature vector extraction apparatus for forming a sequence of visual feature vectors from a sequence of raster-scanned video images of a speaker's face, comprising:
i) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale centered at the uniform intervals;
ii) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using convolution techniques;
iii) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
iv) computing means for calculating a left eye area location, a right eye area location and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
v) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
vi) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
vii) computing means for establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
viii) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
ix) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
x) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector;
b) an acoustic feature vector extraction apparatus for converting signals representative of acoustic speech occurring concomitantly with the raster-scanned video images into a corresponding sequence of acoustic feature vectors; and
c) a neural network classifying apparatus for generating a conditional probability distribution of the allowable speech utterances by accepting and operating on the acoustic and visual feature vector sequences respectively supplied by the acoustic and visual feature vector extraction apparatus.
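
Element (c) requires only that the classifier accept both feature vector sequences and produce a conditional probability distribution over the allowable utterances. The network architecture is not fixed by the claim. As a deliberately simplified stand-in, the Python sketch below concatenates one acoustic and one visual feature vector and passes them through a single linear layer followed by a softmax. The weights, biases, and function names are hypothetical, and a practical system would operate on whole sequences with a trained network.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(acoustic, visual, weights, biases):
    """Return one probability per candidate utterance.

    `weights` has one row per candidate utterance, each row as long as the
    concatenated acoustic and visual feature vectors.
    """
    features = list(acoustic) + list(visual)
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)
```

The most probable utterance is then the index of the largest entry in the returned list.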

10. The system of claim 9 wherein the visual feature vector extraction apparatus comprises a programmable computer for performing the filter, threshold, and computing means functions.

11. The system of claim 9 wherein the step of selecting image pixels along the mouth line selects pixels so that the selected pixels correspond to a left and a right mouth corner position and wherein the step of selecting image pixels along the axis of symmetry selects pixels so that the selected pixels correspond to a lower nose area position, an upper lip position, a mouth area position, a lower lip position, and a chin area position, the lower nose area position and the chin area position respectively defining the upper and lower limits of the vertical sectional view of gray-scale pixel values.

12. The system of claim 11 wherein the computing means for selecting a set of pixel and associated pixel values as a set of elements of a visual feature vector further comprises computer means for the following steps:

aa) computing a mouth corner-to-corner distance measure by taking the difference between the location of the selected pixels corresponding to the left and right mouth corner positions;
bb) computing a vertical mouth separation distance by taking the difference between the location of the selected pixels corresponding to the upper and lower lip positions;
cc) computing horizontal mouth corner-to-corner speed by taking the difference between mouth corner-to-corner distances of adjacent sequential image frames;
dd) computing a set of vertical speeds of an upper lip position, of a mouth area position, and of a lower lip position by taking the difference between pixel positions corresponding to upper lip positions, mouth area positions, and lower lip positions of adjacent sequential image frames;
ee) computing a set of pixel gray-level value changes with respect to time by taking the difference in pixel values of the selected set of pixels between adjacent sequential image frames; and
ff) forming a visual feature vector from the values calculated in steps (aa) through (ee) and from the selected set of pixel values.

13. The system of claim 9 further comprising means for temporal smoothing across successive pixel image frames for noise reduction by convolution with a prescribed low-pass filter kernel.
