FACE TRACKING FOR CONTROLLING
PRIORITY AND RELATED APPLICATIONS
 This application is a continuation-in-part (CIP) of U.S. patent application Ser. No. 12/141,042, filed Jun. 17, 2008, which claims priority to U.S. Ser. No. 60/945,558, filed Jun. 21, 2007, and which is a CIP of U.S. Ser. No. 12/063,089, filed Feb. 6, 2008, which is a CIP of U.S. Ser. No. 11/766,674, filed Jun. 21, 2007, now U.S. Pat. No. 7,460,695, which is a CIP of U.S. Ser. No. 11/753,397, filed May 24, 2007, now U.S. Pat. No. 7,403,643, which is a CIP of U.S. Ser. No. 11/464,083, filed Aug. 11, 2006, now U.S. Pat. No. 7,315,631; and this application is related to Ser. No. 11/765,212, filed Jun. 19, 2007, now U.S. Pat. No. 7,460,694; and Ser. No. 11/765,307, filed Jun. 19, 2007, now U.S. Pat. No. 7,469,055; and Ser. No. 12/333,221, filed Dec. 11, 2008; and Ser. No. 12/167,500, filed Jul. 3, 2008; and Ser. No. 12/042,104, filed Mar. 4, 2008; and Ser. No. 11/861,854, filed Sep. 26, 2007, which are all assigned to the same assignee as the present application and hereby incorporated by reference.
BACKGROUND

1. Field of the Invention
 The present invention provides an improved method and apparatus for image processing in acquisition devices. In particular, the invention provides improved real-time face tracking in a digital image acquisition device.

2. Description of the Related Art

 Face tracking in digital image acquisition devices includes methods of marking human faces in a series of images such as a video stream or a camera preview. Face tracking can be used to indicate to a photographer the locations of faces in an image, thereby improving acquisition parameters, or to allow post-processing of the images based on knowledge of the locations of the faces.

 In general, face tracking systems employ two principal modules: (i) a detection module for locating new candidate face regions in an acquired image or a sequence of images; and (ii) a tracking module for confirming face regions.
 A well-known fast face detection algorithm is disclosed in US 2002/0102024, hereinafter referred to as "Viola-Jones", which is hereby incorporated by reference. In brief, Viola-Jones first derives an integral image from an acquired image, which is usually an image frame in a video stream. Each element of the integral image is calculated as the sum of intensities of all points above and to the left of the point in the image. The total intensity of any sub-window in an image can then be derived from the integral image values at the four corner points of the sub-window. Also, intensities for adjacent sub-windows can be efficiently compared using particular combinations of integral image values from points of the sub-windows.
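The integral image construction and the four-corner sub-window sum described above can be sketched as follows (a minimal NumPy illustration; the function names and signatures are ours, not taken from the reference):

```python
import numpy as np

def integral_image(img):
    """Each element ii[y, x] holds the sum of all pixel intensities
    above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def window_sum(ii, top, left, bottom, right):
    """Total intensity of the sub-window [top..bottom, left..right],
    derived from the four corner values of the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

This is why the technique is fast: once the integral image is built, the sum over any rectangle costs four lookups regardless of the rectangle's size.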
 In Viola-Jones, a chain (cascade) of 32 classifiers based on rectangular (and increasingly refined) Haar features is used with the integral image by applying the classifiers to a sub-window within the integral image. For a complete analysis of an acquired image, this sub-window is shifted incrementally across the integral image until the entire image has been covered.
 In addition to moving the sub-window across the entire integral image, the sub-window is also scaled up/down to cover the possible range of face sizes. In Viola-Jones, a scaling factor of 1.25 is used and, typically, a range of about 10-12 different scales is used to cover the possible face sizes in an XVGA-size image.
 The resolution of the integral image is determined by the smallest sized classifier sub-window, i.e. the smallest size face to be detected, as larger sized sub-windows can use intermediate points within the integral image for their calculations.
 A number of variants of the original Viola-Jones algorithm are known in the literature. These generally employ rectangular, Haar feature classifiers and use the integral image techniques of Viola-Jones.
 Even though Viola-Jones is significantly faster than previous face detectors, it still involves significant computation, and a Pentium-class computer can only just achieve real-time performance. In a resource-restricted embedded system, such as a hand-held image acquisition device, e.g., a digital camera, a hand-held computer or a cellular phone equipped with a camera, it is generally not practical to run such a face detector at real-time frame rates for video. From tests within a typical digital camera, it is possible to achieve complete coverage of all 10-12 sub-window scales with a 3-4 classifier cascade. This allows some level of initial face detection to be achieved, but with undesirably high false positive rates.
 In US 2005/0147278, by Rui et al., which is hereby incorporated by reference, a system is described for automatic detection and tracking of multiple individuals using multiple cues. Rui et al. disclose using Viola-Jones as a fast face detector. However, in order to avoid the processing overhead of Viola-Jones, Rui et al. instead disclose using an autoinitialization module which uses a combination of motion, audio and fast face detection to detect new faces in the frame of a video sequence. The remainder of the system employs well-known face tracking methods to follow existing or newly discovered candidate face regions from frame to frame. The method described by Rui et al. involves some video frames being dropped in order to run a complete face detection process.
 U.S. Pat. No. 6,940,545 to Ray et al., which is incorporated by reference, describes the use of face detection to adjust various camera parameters including Auto-Focus (AF), Auto-Exposure (AE), Auto White Balance (AWB) and Auto Color Correction (ACC). The detection algorithm employed by Ray et al. is a two part algorithm wherein the first stage is fast, but exhibits a high false positive rate and the second stage is more accurate but requires significantly more processing time.
 In particular, Ray et al. state that the face detector must operate on an image in a timeframe of less than one second (col. 10, line 57), although they do not specify whether this timeframe applies to the combination of fast and accurate detectors or is a limit on the fast detector only. Where a detection or combined detection/tracking algorithm is applied to preview images in a state-of-the-art camera, it typically operates on a timeframe of 20-30 ms in order to be compatible with preview frame rates of 30-50 fps. This is a significantly faster requirement than any capability specified in Ray et al.

 Ray et al. also describe the use of a "framing image" (e.g., FIG. 3), which may be deemed somewhat analogous to a preview image within a state-of-the-art camera. However, the concept of tracking face regions from frame to frame within a stream (or collection) of (low-resolution) preview images is not described by Ray et al. Also, U.S. Pat. No. 7,269,292, which is incorporated by reference, discloses the concept of tracking a face region within a collection of low-resolution images and using this information to selectively adjust image compression.
 Disadvantages of the processes described by Ray et al. include the following: First, using a color-based fast face detector is actually quite unreliable, as many backgrounds and scene objects can be confused with skin colors. Second, the face detector of Ray et al. is applied to an entire scene before any additional processing occurs. This can lead to a time lag of one second plus the time to implement processes such as auto-focus, auto-exposure or auto white balance. Third, Ray et al. only describe adjusting camera parameters responsive to a user activating an acquisition, whereas in a practical camera it is desirable to constantly adjust exposure, focus and color balance based on each frame of a preview stream. Fourth, where image acquisition is asynchronous with respect to the preview stream, Ray et al. apply face tracker processing to the current frame in its entirety before the rest of their method is applied.
SUMMARY OF THE INVENTION
 A method of acquiring an improved image based on tracking a face in a preview image stream with a digital image acquisition device is provided. An initial location and/or size of a face is/are determined in a first preview image of a preview image stream. A subsequent location and/or size for the same face is determined in a subsequent preview image. Based on the initial and subsequent locations and/or sizes, or combinations thereof, the method further includes predicting a region, within a third preview image which has just been acquired, within which the same face is expected to occur again. One or more characteristics of the region of the third preview image are analyzed. Based on the analyzing, one or more acquisition parameters of a main acquired image are adjusted, for example, white balance, color balance, focus and/or exposure. The one or more analyzed characteristics of the region may include sharpness, luminance, texture, color histogram, luminance histogram, horizontal luminance profile, vertical luminance profile, horizontal chrominance profile, vertical chrominance profile, or region correlogram, or combinations thereof. The preview and main acquired images may have different resolutions.
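One simple way to predict the region in the just-acquired frame from the two earlier observations is constant-velocity extrapolation with a safety margin (a hedged sketch only; the function name, the margin and the constant-velocity assumption are ours and not taken from the text):

```python
def predict_face_region(prev, curr, margin=0.25):
    """Extrapolate a face's motion between two preview frames to
    predict where it should appear in the frame just acquired.
    Regions are (x, y, w, h); `margin` pads the prediction on every
    side to allow for estimation error."""
    px, py, pw, ph = prev
    cx, cy, cw, ch = curr
    # Constant-velocity prediction of position and size.
    nx, ny = cx + (cx - px), cy + (cy - py)
    nw, nh = cw + (cw - pw), ch + (ch - ph)
    # Pad the predicted region by the margin on every side.
    pad_w, pad_h = nw * margin, nh * margin
    return (nx - pad_w, ny - pad_h, nw + 2 * pad_w, nh + 2 * pad_h)
```

The detector then only needs to analyze this padded region of the third preview image, rather than the whole frame, before the main-image acquisition parameters are adjusted.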
 A further method of tracking faces in an image stream with a digital image acquisition device is provided. According to one aspect, a first image is received from an image stream including one or more face regions. A corresponding first integral image is calculated for at least a portion of the first image or a sub-sampled version or a combination thereof. A first subset of face detection windows, such as rectangles or other shapes, is applied to the first integral image to provide a first set of candidate face regions each having a given size and a respective location. A second image is received from the image stream including the one or more face regions, wherein the second image includes substantially a same scene as the first image. A corresponding second integral image is calculated for at least a portion of the second image or a sub-sampled version or a combination thereof. A second subset of face detection windows is applied to the second integral image to provide a second set of candidate face regions each also having a given size and a respective
location. The second subset includes one or more different face detection windows than the first subset, and the first and second subsets include one or more candidate face regions of different sizes or locations or both. The process further includes tracking the candidate face regions of different sizes or locations, or both, of the first and second images from the image stream.
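One way to realize different subsets of detection windows on successive frames is a round-robin partition of the scale/position search space (an illustrative sketch; this particular partitioning scheme is our assumption, not the one claimed):

```python
import itertools

def window_subsets(scales, positions, n_frames):
    """Partition the full set of face detection windows (one window
    per scale/position pair) into n_frames round-robin subsets, so
    that each incoming frame of the stream is scanned with a
    different subset and the whole search space is covered once
    every n_frames frames."""
    windows = list(itertools.product(scales, positions))
    return [windows[i::n_frames] for i in range(n_frames)]
```

Candidate face regions found by any subset are then carried forward by the tracker, so a face missed by one frame's subset is typically caught by another frame's subset a few frames later.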
 A further method is provided for tracking faces in an image stream with a digital image acquisition device. Digital images are received from an image stream including faces. Corresponding integral images are calculated for the digital images. Different subsets of face detection windows are applied to different subsets of the integral images to provide different sets of candidate face regions of different sizes or locations or both within the digital images. Each of the different candidate face regions is tracked within further images of the image stream and/or a main target image with which the image stream is utilized.
 The first and second sets of candidate face regions may be merged with one or more previously detected face regions to provide a merged set of candidate face regions of different sizes or locations or both. The at least one previously detected face region may include a set of confirmed face regions for one or more previously acquired images.

 Variable-sized face detection may be applied to one or more of the face regions of the merged set of candidate face regions to provide a set of confirmed face regions and a set of rejected face regions. The applying of variable-sized face detection may include applying cascades of Haar classifiers of varying size to integral images of face candidate regions of the merged set.
 The applying of the first and second subsets of face detection windows may include applying fixed-size face detection. The applying of fixed-size face detection may include applying a cascade of Haar classifiers of a fixed size to integral images of face candidate regions of the merged set.

 The method may also include checking a rejected face region based on criteria alternative to the fixed- and variable-sized face detection. Responsive to the checking, an indication may be provided that the rejected face region is actually a face region. That previously rejected face region is then added to the set of confirmed face regions. The checking may include applying a skin prototype to a rejected face region.
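A skin-prototype check of this kind could look like the following (a hedged sketch: the RGB rule and the threshold are common rough heuristics we supply for illustration, not the prototype the text refers to):

```python
def passes_skin_check(region_pixels, threshold=0.4):
    """Re-examine a region the Haar cascades rejected: count pixels
    falling inside a crude RGB skin prototype and accept the region
    as a face candidate if the skin ratio exceeds `threshold`."""
    if not region_pixels:
        return False
    skin = 0
    for r, g, b in region_pixels:
        # Rough RGB skin rule: red clearly dominant over green/blue.
        if r > 95 and g > 40 and b > 20 and r > g and r > b and (r - min(g, b)) > 15:
            skin += 1
    return skin / len(region_pixels) >= threshold
```

A region that passes such a check despite failing the classifier cascades would then be moved back into the set of confirmed face regions.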
 Responsive to the first image being captured with a flash, one or more tracked regions of the first integral image may be analyzed for red-eye defect. The red-eye defect may be corrected in the first integral image and/or an indication of red-eye defect may be stored with the first integral image.

 The method may include repeating the receiving, calculating and applying for one or more further images, including applying one or more further subsets of face detection windows to one or more further integral images to provide one or more further sets of candidate face regions each having a given size and a respective location. The one or more further subsets would include different face detection windows than the first and second subsets, such that the first, second and one or more further subsets respectively include candidate face regions of different sizes or locations or both.

 A digital image acquisition device is also provided that includes a lens, an image sensor, a processor, and a processor-readable memory having code embedded therein for programming the processor to perform any of the methods described herein.
 One or more computer-readable storage devices are also provided that have computer readable code embedded therein for programming one or more processors to perform any of the methods described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
 Embodiments of the invention will now be described by way of example, with reference to the accompanying drawings, in which:
 FIG. 1 is a block diagram illustrating principal components of an image processing apparatus in accordance with certain embodiments;
 FIG. 2 is a flow diagram illustrating operation of the image processing apparatus of FIG. 1; and
 FIGS. 3(a) to 3(d) illustrate examples of images processed by an apparatus in accordance with certain embodiments;
 FIG. 4 illustrates a distribution of face detection over two or more frames, e.g., five frames, under certain example conditions.
 FIG. 5 illustrates basic operations of a face tracker on a preview image frame in accordance with certain embodiments.
DETAILED DESCRIPTIONS OF THE EMBODIMENTS
 Methods are provided for detecting, tracking or recognizing faces, or combinations thereof, within acquired digital images of an image stream. An image processing apparatus is also provided including one or more processors and one or more digital storage media having digitally-encoded instructions embedded therein for programming the one or more processors to perform any of these methods.
 A first method is provided for tracking faces in an image stream with a digital image acquisition device. In one embodiment, an acquired image is received from an image stream including one or more face regions. The acquired image is sub-sampled at a specified resolution to provide a sub-sampled image. A corresponding integral image is calculated for at least a portion of the sub-sampled image. Fixed-size face detection is applied to at least a portion of the integral image to provide a set of one or more candidate face regions each having a given size and a respective location. Responsive to the given size and respective location of the candidate face regions, and optionally taking into account one or more previously detected face regions, a resolution at which a next acquired image is sub-sampled is adjusted.
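The resolution adjustment step might be sketched as follows (illustrative only; the available scale set, the classifier window size and the selection rule are our assumptions, not values from the text):

```python
def next_subsample_scale(face_sizes, classifier_size=24, scales=(1, 2, 4, 8)):
    """Choose the sub-sampling factor for the next acquired frame so
    that the fixed-size classifier window best matches the smallest
    face found in the current frame.  `face_sizes` are face widths in
    full-resolution pixels."""
    if not face_sizes:
        return scales[0]  # no faces yet: keep the finest sub-sampling
    target = min(face_sizes) / classifier_size
    # Largest available factor that does not shrink the smallest
    # face below the classifier window.
    candidates = [s for s in scales if s <= target]
    return max(candidates) if candidates else scales[0]
```

Coarser sub-sampling for large faces keeps the integral image small, which is the source of the computational saving described next.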
 In certain embodiments, calculation of a complete highest-resolution integral image for every acquired image in an image stream is avoided, thereby reducing integral image calculations in an advantageous face tracking system. This either reduces the processing overhead for face detection and tracking or allows longer classifier chains to be employed during the frame-to-frame processing interval to provide higher-quality results, either way providing enhanced face tracking. This can significantly improve the performance and/or accuracy of real-time face detection and tracking.
 In certain embodiments, when implemented in an image acquisition device during face detection, a subsampled copy of an acquired image may be extracted from the camera hardware image acquisition subsystem and the integral image may be calculated for this subsampled image. During face tracking, the integral image may be calculated for an image patch surrounding each candidate region, rather than for the entire image.
 In such an implementation, the process of face detection may be spread across multiple frames. This approach is advantageous for effective implementation; in one example, digital image acquisition hardware is designed to subsample to a single size. This aspect takes advantage of the fact that, when composing a picture, a face will typically be present for multiple frames within a video sequence. Significant improvements in efficiency are provided, while the reduction in computation does not impact very significantly on the initial detection of faces.
 In certain embodiments, the 3-4 smallest sizes (lowest resolutions) of subsampled images are used in a cycle. In some cases, such as when the focus of the camera is set to infinity, larger image subsamples may be included in the cycle, as smaller (more distant) faces may occur within the acquired image(s). In yet another embodiment, the number of subsampled images may change based on the potential face sizes estimated from the distance to the subject. Such distance may be estimated based on the focal length and focus distance; these acquisition parameters may be available from other subsystems within the imaging appliance firmware.

 By varying the resolution/scale of the sub-sampled image, which is in turn used to produce the integral image, a single fixed size of classifier can be applied to the different sizes of integral image. Such an approach is particularly amenable to hardware embodiments where the subsampled image memory space can be scanned by a fixed-size direct memory access (DMA) window and digital logic to implement a Haar-feature classifier chain can be applied to this DMA window. However, it will be seen that several sizes of classifier (in a software embodiment), or multiple fixed-size classifiers (in a hardware embodiment), could also be used.

 A key advantage of this aspect is that, from frame to frame, the calculation involves a low-resolution integral image. This is particularly advantageous when working with a consumer digital image acquisition device such as a portable camera or camera phone.
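The per-frame cycling of subsample sizes can be sketched as follows (the particular scale factors are illustrative assumptions; the extra scale for focus-at-infinity mirrors the case described above):

```python
import itertools

def subsample_cycle(base_scales=(8, 6, 4), focus_at_infinity=False):
    """Cycle through the smallest (lowest-resolution) sub-sample
    factors frame by frame; when the camera focus is set to infinity,
    include a larger sub-sample (smaller factor) so that smaller,
    more distant faces can still be detected."""
    scales = base_scales + ((2,) if focus_at_infinity else ())
    return itertools.cycle(scales)
```

Each preview frame is then sub-sampled by the next factor from this cycle before its integral image is computed, so the fixed-size classifier sweeps the full range of face sizes over a few frames.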
 A full-resolution image patch surrounding each candidate face region may be acquired prior to the acquisition of the next image frame. An integral image may then be calculated for each such image patch, and a multi-scaled face detector may be applied to each such image patch. Regions which are found by the multi-scaled face detector to be face regions are referred to as confirmed face regions.

 This aspect advantageously avoids involvement of motion and audio cues such as those favored by Rui et al., and allows significantly more robust face detection and tracking to be achieved in a digital camera, particularly a portable camera, camera-phone or camera-enabled embedded device.

 In accordance with certain embodiments, a face detection and recognition method is also provided. In these embodiments, an acquired image is received from an image stream including one or more face regions. The acquired image is sub-sampled at a specified resolution to provide a first sub-sampled image. An integral image is calculated for at least a portion of the sub-sampled image. Face detection is applied to at least a portion of the integral image to provide a set of one or more candidate face regions each including a given size and a respective location. Using a database, face recognition is selectively applied to one or more candidate face regions to provide an identifier for a recognized face. The