What is claimed is:
1. An apparatus for extracting a visual feature vector from a sequence of video camera images of frontal views of a speaker's face in a speech classification system, the apparatus comprising:
- a) a set of fiducial markers placed on a speaker's face in the vicinity of the lips, nose, and chin so that the fiducial markers are readily identifiable in a video camera image of the speaker's face, and the position and movement of the set of fiducial markers are representative of physiological facial phenomena associated with speech generation;
- b) a video camera directed at the speaker's face for generating a sequence of electrical images of the speaker's face in the vicinity of the lips, nose, and chin; and
- c) a video processor for converting and storing the sequence of electrical images as a rectangular grid of digitized pixels and for detecting and locating a position for each member of the set of fiducial markers within each image of the sequence, the positions being elements of the visual feature vector that is independent of head shifts and rotations by establishing, as a reference vertical axis, a line connecting the centroids of the nose and the chin fiducial markers, and rotating all fiducial marker positions by an angle that is necessary to make the line connecting the nose and chin fiducial marker centroids vertical.
2. The apparatus of claim 1 wherein the video processor locates the position of each member of the set of fiducial markers by detecting a set of pixels corresponding to each fiducial marker and then calculating a centroid for each set of pixels, the centroid being the position of the corresponding fiducial marker.
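The centroid computation recited in claim 2 can be sketched as follows. This is a minimal illustration, not the claimed apparatus: the claims do not say how marker pixels are detected, so the brightness threshold is an assumption, and the names `detect_marker_pixels` and `marker_centroid` are hypothetical.

```python
def detect_marker_pixels(image, threshold):
    """Collect (x, y) coordinates of pixels at or above a brightness threshold.

    Assumption: markers appear brighter than the surrounding face. The claims
    do not specify a detection rule; thresholding is used only for illustration.
    `image` is a list of rows, each row a list of integer pixel values.
    """
    return [
        (x, y)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if value >= threshold
    ]


def marker_centroid(pixels):
    """Return the (x, y) centroid of a set of detected pixel coordinates."""
    if not pixels:
        raise ValueError("no pixels detected for this marker")
    count = len(pixels)
    centroid_x = sum(x for x, _ in pixels) / count
    centroid_y = sum(y for _, y in pixels) / count
    return (centroid_x, centroid_y)
```

In this sketch a single marker is assumed per image region; separating the pixels of several markers (for example by connected-component grouping) would be a further step that the claims leave open.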
3. The apparatus of claim 1 wherein the set of fiducial markers are placed on the nose tip, the center of the chin, left and right mouth corners, and at least one vertically juxtaposed pair of fiducial markers, with one marker each on the upper and lower lips of the speaker's face.
4. The apparatus of claim 3 wherein the video processor further includes means for calculating a distance between the nose and chin fiducial marker positions, a distance between the left and right mouth corner fiducial marker positions, and a distance between each vertically juxtaposed pair on the upper and lower lips.
5. A method for extracting a visual feature vector from a sequence of video camera images of frontal views of a speaker's face in a speech classification system, the method comprising the following steps:
- a) placing a set of fiducial markers on a speaker's face in the vicinity of the lips, nose, and chin so that the fiducial markers are readily identifiable in a video camera image of the speaker's face, and the movement and position of the set of fiducial markers are representative of physiological facial phenomena associated with speech generation;
- b) producing a sequence of raster scanned electrical video images of the speaker's face in the vicinity of the fiducial markers;
- c) sampling and quantizing each raster scanned video image so as to produce a grid of digitized pixels representative of each raster scanned video image;
- d) detecting a set of pixels representative of each fiducial marker;
- e) computing a location for each fiducial marker from each set of detected pixels associated with each fiducial marker;
- f) establishing a reference axis corresponding to a straight line passing through the location of the nose and chin fiducial markers; and
- g) rotating all fiducial marker positions by the angle required to rotate the reference axis to a true vertical orientation.
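Steps f) and g) of claim 5 can be sketched as follows. This is an illustration under stated assumptions: image coordinates with x to the right and y downward, rotation about the nose marker (the claims do not name a pivot), and the hypothetical names `normalize_rotation`, `"nose"`, and `"chin"`.

```python
import math


def normalize_rotation(markers, nose_key="nose", chin_key="chin"):
    """Rotate all marker positions so the nose-to-chin line becomes vertical.

    `markers` maps a marker name to its (x, y) position in image coordinates
    (x to the right, y downward). The pivot is the nose marker, which is an
    assumption; the claims only require that the reference axis end up vertical.
    """
    nose_x, nose_y = markers[nose_key]
    chin_x, chin_y = markers[chin_key]

    # Angle between the nose-to-chin line and the image's vertical axis.
    theta = math.atan2(chin_x - nose_x, chin_y - nose_y)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    rotated = {}
    for name, (x, y) in markers.items():
        dx, dy = x - nose_x, y - nose_y
        # Rotation by theta maps the nose-to-chin vector onto (0, length).
        rotated[name] = (
            nose_x + dx * cos_t - dy * sin_t,
            nose_y + dx * sin_t + dy * cos_t,
        )
    return rotated
```

After this rotation the nose and chin share the same x coordinate, so later horizontal and vertical measurements are taken in a frame that does not depend on in-plane head tilt.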
6. The method of claim 5 wherein the step of computing a location for each fiducial marker comprises computing a centroid for each set of pixels associated with each fiducial marker.
7. The method of claim 5 wherein the set of fiducial markers are placed on the nose tip, the center of the chin, the left and right mouth corners, and at least one vertically juxtaposed pair of fiducial markers with one fiducial marker placed on the upper lip and one placed on the lower lip.
8. The method of claim 7 further comprising the following steps:
- a) calculating a distance between the nose and chin fiducial markers;
- b) calculating a distance between the left and right mouth corner fiducial markers; and
- c) calculating a distance between each vertically juxtaposed fiducial marker of each pair placed on the lips.
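The distances of claim 8 can be sketched as follows. Claim 8 does not say which distance measure is used, so Euclidean distance is assumed here; the marker names and the function `marker_distances` are hypothetical.

```python
import math


def marker_distances(markers, lip_pairs):
    """Compute the distances named in claim 8, assuming Euclidean distance.

    `markers` maps marker names to (x, y) positions. `lip_pairs` is a list of
    (upper_name, lower_name) tuples, one per vertically juxtaposed lip pair.
    All names are illustrative and do not come from the claims.
    """
    distances = {
        "nose_chin": math.dist(markers["nose"], markers["chin"]),
        "mouth_corners": math.dist(
            markers["left_corner"], markers["right_corner"]
        ),
    }
    for upper, lower in lip_pairs:
        distances[f"{upper}_{lower}"] = math.dist(markers[upper], markers[lower])
    return distances
```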
9. The method of claim 7 wherein the set of vertically juxtaposed pairs of fiducial markers are placed at the lip center and midway between the lip center and each mouth corner.
10. The method of claim 9 further comprising the following steps for calculating elements of a visual feature vector:
- a) calculating a vertical distance between nose and chin fiducial marker positions;
- b) calculating a vertical distance between corresponding fiducial markers of each vertically juxtaposed fiducial marker pair;
- c) calculating a horizontal distance between upper lip fiducial marker positions located midway between the upper lip center and each mouth corner;
- d) calculating a horizontal distance between lower lip fiducial marker positions located midway between the lower lip center and each mouth corner;
- e) calculating a vertical distance between each fiducial marker of each vertically juxtaposed marker pair located midway between the lip centers and each mouth corner position; and
- f) calculating a horizontal distance between each mouth corner fiducial marker position.
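The feature elements of claim 10 can be sketched as follows, assuming the marker positions have already been rotated so that the nose-chin axis is vertical. The marker names and the ordering of the vector are hypothetical. Step e) is read here as covered by the two midway pairs already measured under step b), which is an interpretive assumption, so those distances are not duplicated.

```python
def vertical_distance(p, q):
    """Absolute difference of the y coordinates of two (x, y) points."""
    return abs(p[1] - q[1])


def horizontal_distance(p, q):
    """Absolute difference of the x coordinates of two (x, y) points."""
    return abs(p[0] - q[0])


def visual_feature_vector(m):
    """Build a feature vector from rotation-normalized marker positions.

    `m` maps illustrative marker names to (x, y) positions. The element order
    is an assumption; the claims list the distances but not their ordering.
    """
    return [
        # a) vertical nose-to-chin distance
        vertical_distance(m["nose"], m["chin"]),
        # b) and e) vertical distance within each upper/lower lip pair
        vertical_distance(m["upper_center"], m["lower_center"]),
        vertical_distance(m["upper_left_mid"], m["lower_left_mid"]),
        vertical_distance(m["upper_right_mid"], m["lower_right_mid"]),
        # c) horizontal distance between the upper-lip midway markers
        horizontal_distance(m["upper_left_mid"], m["upper_right_mid"]),
        # d) horizontal distance between the lower-lip midway markers
        horizontal_distance(m["lower_left_mid"], m["lower_right_mid"]),
        # f) horizontal distance between the mouth corners
        horizontal_distance(m["left_corner"], m["right_corner"]),
    ]
```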