Publication number | US7650011 B2 |

Publication type | Grant |

Application number | US 11/179,280 |

Publication date | Jan 19, 2010 |

Filing date | Jul 11, 2005 |

Priority date | Jul 9, 2004 |

Fee status | Paid |

Also published as | US7369682, US20060023916, US20060036399, WO2006010129A2, WO2006010129A3 |

Publication number | 11179280, 179280, US 7650011 B2, US 7650011B2, US-B2-7650011, US7650011 B2, US7650011B2 |

Inventors | Ming-Hsuan Yang, Ruei-Sung Lin |

Original Assignee | Honda Motor Co., Inc. |


US 7650011 B2

Abstract

Visual tracking over a sequence of images is formulated by defining an object class and one or more background classes. The most discriminant features available in the images are then used to select a portion of each image as belonging to the object class. Fisher's linear discriminant method is used to project high-dimensional image data onto a lower-dimensional space, e.g., a line, and perform classification in the lower-dimensional space. The projection function is incrementally updated.

Claims(21)

1. A computer-implemented method for tracking a location of an object within a sequence of digital image frames, the method comprising:

using a computer processor to perform steps of:

receiving a first image vector representing a first image frame within the sequence of digital image frames;

determining an initial location of the object in said first image frame from said first image vector;

applying a dynamic model to said first image vector to predict motion of the object between the first image frame and a successive image frame and determine at least one predicted location of the object in the successive image frame within the sequence of digital image frames;

projecting samples from the at least one predicted location of the object in the successive image frame to a low dimensional projection space according to projection parameters;

applying a classification model to the projected samples of said successive image frame, the classification model applied in the low dimensional projection space and classifying each of the projected samples as one of a foreground object type and a background type;

applying an inference model to the classified samples in the low dimensional projection space to predict a most likely location of the object resulting from motion of the object between the first image frame and the successive image frame of the sequence of digital image frames; and

updating the projection parameters based on the most likely location of the object.

2. The method of claim 1 , wherein said dynamic model represents a windows position, an angular orientation, a width and a height of the object.

3. The method of claim 1 , wherein said inference model determines a distance from said successive image vector to a mean of said foreground object type.

4. The method of claim 1 , wherein said classification model comprises a Fisher Linear Discriminant model.

5. The method of claim 1 , wherein said background type comprises a single class.

6. The method of claim 1 , wherein said background type comprises multiple classes.

7. The method of claim 1 , wherein said background type comprises a number of classes equal to the number of images in the set of digital images.

8. A computer system for tracking the location of an object within a sequence of digital image frames, the system comprising:

means for receiving a first image vector representing a first image frame within the sequence of digital image frames;

means for determining an initial location of the object in said first image frame from said first image vector;

means for applying a dynamic model to said first image vector to predict motion of the object between the first image frame and a successive image frame and determine at least one predicted location of the object in the successive image frame within the sequence of digital image frames;

means for projecting samples from the at least one predicted location of the object in the successive image frame to a low dimensional projection space according to projection parameters;

means for applying a classification model to the projected samples of said successive image frame, the classification model applied in the low dimensional projection space and classifying each of the projected samples as one of a foreground object type and a background type;

means for applying an inference model to the classified samples in the low dimensional projection space to predict a most likely location of the object resulting from motion of the object between the first image frame and the successive image frame of the sequence of digital image frames; and

means for updating the projection parameters based on the most likely location of the object.

9. The system of claim 8 , wherein said means for applying a dynamic model comprises means for representing a windows position, an angular orientation, a width and a height of the object.

10. The system of claim 8 , wherein said means for applying an inference model comprises means to determine a distance from said successive image vector to a mean of said foreground object type.

11. The system of claim 8 , wherein said classification model is a Fisher linear Discriminant model.

12. The system of claim 8 , wherein said background type comprises a number of classes equal to the number of images in the set of digital images.

13. An image processing computer system for tracking the location of an object within a sequence of digital image frames, the image processing computer system comprising:

an input module for receiving data representative of the sequence of digital image frames;

a memory device coupled to said input module for storing said data representative of the sequence of digital image frames;

a processor coupled to said memory device for iteratively retrieving the data representative of sequence of digital image frames, said processor configured to:

apply a dynamic model to said first image vector to predict motion of the object between the first image frame and a successive image frame and determine at least one predicted location of the object in the successive image frame within the sequence of digital image frames;

project samples from the at least one predicted location of the object in the successive image frame to a low dimensional projection space according to projection parameters;

apply a classification model to the projected samples of said successive image frame, the classification model applied in the low dimensional projection space and classifying each of the projected samples as one of a foreground object type and a background type;

apply an inference model to the classified samples in the low dimensional projection space to predict a most likely location of the object resulting from motion of the object between the first image frame and the successive image frame of the sequence of digital image frames; and

update the projection parameters based on the most likely location of the object.

14. The system of claim 13 , wherein said dynamic model represents a windows position, an angular orientation, a width and a height of the object.

15. The system of claim 13 , wherein said inference model determines a distance from said successive image vector to a mean of said foreground object type.

16. The system of claim 13 , wherein said classification model is a Fisher linear Discriminant model.

17. The method of claim 13 , wherein said background type comprises a number of classes equal to the number of images in the set of digital images.

18. The system of claim 8 , wherein said background type comprises a single class.

19. The system of claim 8 , wherein said background type comprises multiple classes.

20. The system of claim 13 , wherein said background type comprises a single class.

21. The system of claim 13 , wherein said background type comprises multiple classes.

Description

This application claims priority under 35 USC § 119(e) to U.S. Provisional Patent Application No. 60/586,598, filed Jul. 9, 2004, entitled “Object Tracking Using Incremental Fisher Discriminant Analysis,” which is incorporated by reference herein in its entirety.

This application claims priority under 35 USC § 119(e) to U.S. Provisional Patent Application No. 60/625,501, filed Nov. 5, 2004, entitled “Adaptive Discriminative Generative Model and its Applications,” which is incorporated by reference herein in its entirety.

This application is also related to U.S. patent application Ser. No. 11/179,881, filed on Jul. 11, 2005, entitled “Adaptive Discriminative Generative Model and Application to Visual Tracking,” which is incorporated by reference herein in its entirety.

This application is also related to U.S. patent application Ser. No. 10/989,986, filed on Nov. 15, 2004, entitled “Adaptive Probabilistic Visual Tracking with Incremental Subspace Update,” which is incorporated by reference herein in its entirety.

The present invention generally relates to the field of computer-based visual perception, and more specifically, to adaptive probabilistic discriminative generative modeling.

In the field of visual perception, many applications require separating a target object or image of interest from a background. In particular, motion video applications often require an object of interest to be tracked against a static or time-varying background.

The visual tracking problem can be formulated as continuous or discrete-time state estimation based on a “latent model.” In such a model, observations or observed data encode information from captured images, and unobserved states represent the actual locations or motion parameters of the target objects. The model infers the unobserved states from the observed data over time.

At each time step, a dynamic model predicts several possible locations (e.g., hypotheses) of the target at the next time step based on prior and current knowledge. The prior knowledge includes previous observations and estimated state transitions. As each new observation is received, an observation model estimates the target's actual position. The observation model determines the most likely location of the target object by validating the various dynamic model hypotheses. Thus, the overall performance of such a tracking algorithm is limited by the accuracy of the observation model.

One conventional approach builds static observation models before tracking begins. Such models assume that factors such as illumination, viewing angle, and shape deformation do not change significantly over time. To account for all possible variations in such factors, a large set of training examples is required. However, the appearance of an object varies significantly as such factors change. It is therefore daunting, if not impossible, to obtain a training set that accommodates all possible scenarios of a visually dynamic environment.

Another conventional approach combines multiple tracking algorithms that each track different features or parts of the target object. Each tracking algorithm includes a static observation model. Although each tracking algorithm may fail under certain circumstances, it is unlikely that all will fail simultaneously. This approach adaptively selects the tracking algorithms that are currently robust. Although this improves overall robustness, each static observation model must be trained, i.e., initialized, before tracking begins. This severely restricts the application domain and precludes application to previously unseen targets.

Thus, there is a need for improved observation accuracy to provide improved tracking accuracy, and to robustly accommodate appearance variation of target objects in real time, without the need for training.

Visual tracking over a sequence of images is formulated by defining an object class and one or more background classes. The most discriminant features available in the images are then used to select a portion of each image as belonging to the object class. This approach is referred to as classification.

Fisher's linear discriminant (FLD) method is used to project high-dimensional image data onto a lower-dimensional space, e.g., a line, and perform classification in the lower-dimensional space. A projection function maximizes the distance between the means of the object class and background class or classes while minimizing the variance of each class.
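The two-class form of Fisher's linear discriminant can be illustrated with a minimal NumPy sketch. The synthetic data, the ridge term, and the midpoint threshold are assumptions made for illustration and are not taken from the specification.

```python
import numpy as np

def fisher_direction(obj, bg):
    """Unit vector w that maximizes between-class over within-class scatter."""
    mu_o, mu_b = obj.mean(axis=0), bg.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of both classes.
    Sw = (obj - mu_o).T @ (obj - mu_o) + (bg - mu_b).T @ (bg - mu_b)
    # Small ridge keeps Sw invertible with few samples (assumption of this sketch).
    Sw += 1e-6 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mu_o - mu_b)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Hypothetical 2-D "object" and "background" samples.
obj = rng.normal([2.0, 0.0], [0.3, 2.0], size=(200, 2))
bg = rng.normal([-2.0, 0.0], [0.3, 2.0], size=(200, 2))

w = fisher_direction(obj, bg)
proj_obj, proj_bg = obj @ w, bg @ w          # projection onto a line
threshold = 0.5 * (proj_obj.mean() + proj_bg.mean())
accuracy = 0.5 * ((proj_obj > threshold).mean() + (proj_bg <= threshold).mean())
```

Classification then reduces to comparing a scalar projection against a threshold, which is the sense in which classification is performed in the lower-dimensional space.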

FLD requires that the samples in each class are clustered, that is, that the appearance variance within each class is relatively small. In practice, while this constraint likely holds for the object class, it does not hold for the single background class. Accordingly, one embodiment of the present invention comprises one object class and multiple background classes. However, the number of background classes required is an issue. Another embodiment overcomes this issue by using one class per sample to model the background. Yet another embodiment extends FLD by incrementally updating the projection function. Experimental results confirm the effectiveness of the invention.

The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.

The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:

Panel (*a*) illustrates poor discrimination of positive and negative samples.

Panel (*b*) illustrates good discrimination of positive and negative samples, and in-class and between-class scatter.

Reference will now be made in detail to several embodiments of the present invention(s), examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

The visual tracking problem is illustrated schematically in the accompanying figures. At each time step t, an image region or frame o_{t} is observed in sequence, and the state variable s_{t} corresponding to the target object is treated as unobserved. The motion of the object from one frame to the next is modeled based upon the probability of the object appearing at s_{t} given that it was just at s_{t−1}. In other words, the model represents possible locations of the object at time t, as determined prior to observing the current image frame. The likelihood that the object is located at a particular possible position is then determined according to a probability distribution. The goal is to determine the most probable a posteriori object location.

The visual tracking problem is formulated in this step as a recursive state estimation problem. A description of this can be found in M. Isard and A. Blake, Contour Tracking by Stochastic Propagation of Conditional Density, *Proceedings of the Fourth European Conference on Computer Vision*, LNCS 1064, Springer Verlag, 1996, which is incorporated by reference herein in its entirety, and in U.S. patent application Ser. No. 10/989,986, entitled “Adaptive Probabilistic Visual Tracking with Incremental Subspace Update,” which was referenced above.

Based on o_{t}, the image region observed at time t, O_{t}={o_{1}, . . . , o_{t}} is defined as a set of image regions observed from the beginning to time t. A visual tracking process infers state s_{t }from observation O_{t}, where state s_{t }contains a set of parameters referring to the tracked object's 2-D position, orientation, and scale in image o_{t}. Assuming a Markovian state transition, this inference problem is formulated with the recursive equation

$$p(s_t \mid O_t) = k\, p(o_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid O_{t-1})\, ds_{t-1} \tag{1}$$

where k is a constant, and p(o_{t}|s_{t}) and p(s_{t}|s_{t−1}) correspond to observation and dynamic models, respectively, to be described below.

In equation (1), p(s_{t−1}|O_{t−1}) is the state estimation given all the prior observations up to time t−1, and p(o_{t}|s_{t}) is the likelihood of observing image o_{t} in state s_{t}. For visual tracking, an ideal distribution of p(s_{t}|O_{t}) should peak at the state s_{t} that matches the observed object's location in o_{t}. While the integral in equation (1) predicts the regions where the object is likely to appear given all the prior observations, the observation model p(o_{t}|s_{t}) determines the most likely state that matches the observation at time t.
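The recursion of equation (1) can be sketched on a small discrete state grid, where the integral becomes a matrix-vector product. The grid size, the random-walk transition, and the likelihood values below are hypothetical and serve only to show the predict-then-weight structure.

```python
import numpy as np

def bayes_step(prior, transition, likelihood):
    """One step of equation (1) on a discrete state grid.

    prior[j]         = p(s_{t-1} = j | O_{t-1})
    transition[i, j] = p(s_t = i | s_{t-1} = j)
    likelihood[i]    = p(o_t | s_t = i)
    """
    predicted = transition @ prior        # the integral becomes a sum
    posterior = likelihood * predicted    # weight by the observation model
    return posterior / posterior.sum()    # normalization plays the role of k

n = 5
prior = np.full(n, 1.0 / n)

# Hypothetical random-walk dynamics: stay with 0.6, move to a neighbor with 0.2.
transition = np.zeros((n, n))
for j in range(n):
    for i, p in ((j - 1, 0.2), (j, 0.6), (j + 1, 0.2)):
        transition[min(max(i, 0), n - 1), j] += p

likelihood = np.array([0.05, 0.1, 0.7, 0.1, 0.05])   # made-up p(o_t | s_t)
posterior = bayes_step(prior, transition, likelihood)
best_state = int(posterior.argmax())
```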

According to this embodiment, p(o_{t}|s_{t}) measures the probability of observing o_{t} as a sample generated by the target object class. O_{t} is an image sequence, and if the images are acquired at a high frame rate, the difference between o_{t} and o_{t−1} is expected to be small, even though the object's appearance might vary with viewing angle, illumination, and possible self-deformation. Instead of adopting a complex static model to learn p(o_{t}|s_{t}) for all possible o_{t}, a simpler adaptive model is sufficient to account for the appearance changes. In addition, since o_{t} and o_{t−1} are most likely similar, and since computing p(o_{t}|s_{t}) depends on p(o_{t−1}|s_{t−1}), the prior information p(o_{t−1}|s_{t−1}) is used to enhance the distinction between the object and its background in p(o_{t}|s_{t}).

Referring now to the figures, an initial frame vector is received **206**. This frame vector includes one element per pixel, where each pixel comprises a description of brightness, color, etc. Then the initial location of the target object is determined **212**. This may be accomplished either manually or through automatic means. An example of automatic object location determination is face detection. One embodiment of face detection is illustrated in patent application Ser. No. 10/858,878, Method, Apparatus and Program for Detecting an Object, which is incorporated by reference herein in its entirety. Such an embodiment informs the tracking algorithm of an object or area of interest within an image.

Returning to the figures, the tracking algorithm applies **224** a dynamic model to predict possible locations of the target object in the next frame, s_{t+1}, based upon the location within the current frame, s_{t}, according to a distribution p(s_{t}|s_{t−1}). This is shown conceptually by the location in the current frame **310** and the possible locations in the next frame **320**(*i*). In other words, a probability distribution provided by the dynamic model encodes beliefs about where the target object might be at time t, prior to observing the respective frame and image region. According to the applied **224** dynamic model, s_{t}, the location of the target object at time t, is a length-5 vector, s=(x,y,θ,w,h), that parameterizes the windows position (x,y), angular orientation (θ) and width and height (w,h).
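A minimal sketch of drawing candidate states from the dynamic model follows. It assumes the Gaussian transition with diagonal covariance that the description adopts later in equation (16). The function name, the standard deviations, and the sample count are hypothetical.

```python
import numpy as np

def propose_states(s_prev, sigma_diag, n_samples, rng):
    """Draw candidates c_i ~ N(s_prev, diag(sigma_diag**2)) for s = (x, y, theta, w, h)."""
    s_prev = np.asarray(s_prev, dtype=float)
    noise = rng.normal(size=(n_samples, s_prev.size)) * sigma_diag
    return s_prev + noise

rng = np.random.default_rng(1)
s_prev = np.array([120.0, 80.0, 0.0, 40.0, 60.0])    # (x, y, theta, w, h)
sigma_diag = np.array([4.0, 4.0, 0.02, 1.0, 1.0])    # hypothetical std. deviations
candidates = propose_states(s_prev, sigma_diag, 500, rng)
```

Each row of `candidates` is one hypothesis about the window position, orientation, and size in the next frame, to be scored by the observation model.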

Then, an image observation model is applied **230**. This model is based on probabilistic principal component analysis (PPCA). A description of this can be found in M. E. Tipping and C. M. Bishop, Probabilistic principal component analysis, *Journal of the Royal Statistical Society, Series B*, 1999, which is incorporated by reference herein in its entirety.

Applying **230** the observation model determines p(o_{t}|s_{t}), the probability of observing o_{t} as a sample being generated by the target object class. Note that O_{t} is a sequence of images, and if the images are acquired at a high frame rate, it is expected that the difference between o_{t} and o_{t−1} is small, though the object's appearance might vary with viewing angle, illumination, and possible self-deformation. Instead of adopting a complex static model to learn p(o_{t}|s_{t}) for all possible o_{t}, a simpler adaptive model suffices to account for appearance changes. In addition, since o_{t} and o_{t−1} are most likely similar, and since computation of p(o_{t}|s_{t}) depends on the prior information p(o_{t−1}|s_{t−1}), such prior information can be used to enhance the distinction between the object and the background in p(o_{t}|s_{t}).

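Referring again to the figures, a discriminative generative model (DGM) is applied **236** to improve the estimated target object location. The development of the DGM follows the work of Tipping and Bishop, which was referenced above. The latent model relates the observation y to the latent variable x according to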

$$y = Wx + \mu + \varepsilon \tag{2}$$

In equation (2), y and x are analogous to o and s, respectively, W is an n×m projection matrix associating y and x, μ is the mean of y, and ε is additive noise. As is commonly assumed in factor analysis and other graphical models, the latent variables x are independent with unit variance, x˜N(0, I_{m}), where I_{m} is the m-dimensional identity matrix, and ε is zero mean Gaussian noise, ε˜N(0, σ^{2}I_{n}). A description of this is in An Introduction to Multivariate Statistical Analysis, T. W. Anderson, Wiley, 1984, and Learning in Graphical Models, Michael I. Jordan, MIT Press, 1999, which are incorporated by reference herein in their entirety.

Since x and ε are both Gaussian random vectors, it follows that the vector y also has a Gaussian distribution, y˜N(μ,C), where C=WW^{T}+σ^{2}I_{n} and I_{n} is an n-dimensional identity matrix. Together with equation (2), the generative observation model is defined by

$$p(o_t \mid s_t) = p(y_t \mid W, \mu, \varepsilon) \sim N(y_t \mid \mu,\; WW^T + \sigma^2 I_n) \tag{3}$$

This latent variable model follows the form of probabilistic principal component analysis, and its parameters can be estimated from a set of example images. Given a set of image frames Y={y_{1}, . . . , y_{N}}, the covariance matrix of Y is denoted as

$$S = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu)(y_i - \mu)^T$$

{λ_{i}|i=1, . . . , n} are the eigenvalues of S arranged in descending order, i.e., λ_{i}≧λ_{j} if i<j. Also, the diagonal matrix Σ_{m}=diag(λ_{1}, . . . , λ_{m}) is defined, and U_{m} are the eigenvectors that correspond to the eigenvalues in Σ_{m}. Tipping and Bishop show that the maximum likelihood estimate of μ, W and σ^{2} can be obtained by

$$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad W = U_m\left(\Sigma_m - \sigma^2 I_m\right)^{1/2} R, \qquad \sigma^2 = \frac{1}{n-m}\sum_{i=m+1}^{n} \lambda_i \tag{4}$$

where R is an arbitrary m×m orthogonal rotation matrix.

According to this embodiment, the single, linear PPCA model described above suffices to model gradual appearance variation, since the model parameters W, μ, and σ^{2 }may be dynamically adapted to account for appearance change.
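The closed-form maximum likelihood fit of Tipping and Bishop can be sketched in NumPy as follows. The sketch takes the rotation R to be the identity and uses synthetic data. The function name `fit_ppca` and all numeric values are assumptions for illustration.

```python
import numpy as np

def fit_ppca(Y, m):
    """Closed-form PPCA fit with R = I.

    Y: (N, n) array whose rows are appearance vectors; m: latent dimension (m < n).
    Returns mu, W, sigma2, U_m, and the top-m eigenvalues.
    """
    N, n = Y.shape
    mu = Y.mean(axis=0)
    S = (Y - mu).T @ (Y - mu) / N            # sample covariance
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # reorder to descending
    sigma2 = lam[m:].mean()                  # average of the discarded eigenvalues
    W = U[:, :m] * np.sqrt(np.maximum(lam[:m] - sigma2, 0.0))
    return mu, W, sigma2, U[:, :m], lam[:m]

rng = np.random.default_rng(2)
n, m, N = 6, 2, 4000
W_true = rng.normal(size=(n, m))
# Synthetic data: latent x ~ N(0, I), noise std 0.1, mean 3 in every coordinate.
Y = rng.normal(size=(N, m)) @ W_true.T + 0.1 * rng.normal(size=(N, n)) + 3.0

mu, W, sigma2, U_m, lam_m = fit_ppca(Y, m)
C = W @ W.T + sigma2 * np.eye(n)             # model covariance
```

Within the retained subspace the model covariance C reproduces the leading eigenvalues of S, while every discarded direction is assigned the common variance σ².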

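The log-probability that a vector y is a sample of this generative appearance model can be computed from equation (4) as

$$L(W, \mu, \sigma^2) = -\frac{1}{2}\left(n \log 2\pi + \log |C| + \bar{y}^T C^{-1} \bar{y}\right)$$

where $\bar{y} = y - \mu$. Neglecting the terms that do not depend on y, the log-probability is determined by $\bar{y}^T C^{-1} \bar{y}$. Together with C=WW^{T}+σ^{2}I_{n} and equation (4), it follows that

$$\bar{y}^T C^{-1} \bar{y} = \bar{y}^T U_m \Sigma_m^{-1} U_m^T \bar{y} + \frac{1}{\sigma^2}\, \bar{y}^T \left(I_n - U_m U_m^T\right) \bar{y} \tag{6}$$

Here $\bar{y}^T U_m \Sigma_m^{-1} U_m^T \bar{y}$ is the distance of y within the subspace spanned by U_{m}, which is represented by dw in the figures, and $\bar{y}^T (I_n - U_m U_m^T) \bar{y}$ is the shortest distance from y to this subspace, as represented by dt in the figures.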
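The split of the Mahalanobis term into a within-subspace distance dw and a residual distance dt can be checked numerically. The sketch below assumes C = WW^T + σ²I_n with W = U_m(Σ_m − σ²I_m)^{1/2}, and all numeric values are made up.

```python
import numpy as np

def subspace_distances(y, mu, U_m, lam_m):
    """Return (d_w, d_t): distance within span(U_m) and squared residual to it."""
    yb = y - mu
    z = U_m.T @ yb                        # coordinates inside the subspace
    d_w = np.sum(z ** 2 / lam_m)          # yb^T U_m Sigma_m^{-1} U_m^T yb
    d_t = yb @ yb - z @ z                 # yb^T (I - U_m U_m^T) yb
    return d_w, d_t

rng = np.random.default_rng(3)
n, m, sigma2 = 8, 3, 0.05
U_m, _ = np.linalg.qr(rng.normal(size=(n, m)))   # orthonormal basis (n x m)
lam_m = np.array([4.0, 2.0, 1.0])                # hypothetical leading eigenvalues
mu = rng.normal(size=n)

W = U_m * np.sqrt(lam_m - sigma2)
C = W @ W.T + sigma2 * np.eye(n)

y = rng.normal(size=n)
d_w, d_t = subspace_distances(y, mu, U_m, lam_m)
mahalanobis = (y - mu) @ np.linalg.solve(C, y - mu)
```

Because dt is divided by σ², a very small σ² makes the residual term dominate, which is why the choice of σ matters for balancing the two distances.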

As discussed above, it is expected that the target object's appearance does not change significantly from o_{t−1 }to o_{t}. Therefore, the observation at o_{t−1 }can be used to improve the likelihood measurement corresponding to o_{t}. That is, a set of samples (e.g., image patches) is drawn, parameterized by {s_{t−1} ^{i}|i=1, . . . , k} in o_{t−1 }that have large p(o_{t−1}|s_{t−1} ^{i}), but low posterior p(s_{t−1} ^{i}|O_{t−1}). These are treated as the negative samples (i.e., samples that are not generated from the class of the target object) that the generative model is likely to confuse as positive samples (generated from the class of the target object) at O_{t}.

Given a set of image samples Y′={y^{1}, . . . , y^{k}}, where y^{i} is the appearance vector collected in o_{t−1} based on state parameter s_{t−1}^{i}, a linear projection V* can be determined that projects Y′ onto a subspace such that the likelihood of Y′ in the subspace is minimized. Let V be a p×n matrix, and since p(y|W,μ,σ) is a Gaussian distribution, p(Vy|V,W,μ,σ)˜N(Vμ,VCV^{T}) is also a Gaussian distribution. The log likelihood is computed by

$$L(V, W, \mu, \sigma) = -\frac{k}{2}\left(p \log 2\pi + \log\left|VCV^T\right| + \operatorname{tr}\!\left(\left(VCV^T\right)^{-1} V S' V^T\right)\right)$$

where $S' = \frac{1}{k}\sum_{i=1}^{k} (y^i - \mu)(y^i - \mu)^T$.

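To facilitate the following analysis, it is assumed that V projects Y′ to a one-dimensional space, i.e., p=1 and V=v^{T}, and thus

$$L(v, W, \mu, \sigma) = -\frac{k}{2}\left(\log 2\pi + \log\left(v^T C v\right) + \frac{v^T S' v}{v^T C v}\right)$$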

v^{T}Cv is the variance of the object samples in the projected space. A constraint, e.g., v^{T}Cv=1, is imposed to ensure that the minimum likelihood solution of v does not increase the variance in the projected space. By letting v^{T}Cv=1, the optimization problem becomes

$$v^* = \arg\max_{\{v \,\mid\, v^T C v = 1\}} v^T S' v = \arg\max_{v} \frac{v^T S' v}{v^T C v} \tag{11}$$

In equation (11), v is a projection that maintains the target object's samples in the projected space (i.e., the positive samples) close to μ (with the constraint that variance v^{T}Cv=1), while keeping negative samples in Y′ away from μ. The optimal value of v is the generalized eigenvector of S′ and C that corresponds to the largest eigenvalue. In a general case, it follows that

$$V^* = \arg\max_{\{V \,\mid\, VCV^T = I\}} \left|V S' V^T\right| = \arg\max_{V} \frac{\left|V S' V^T\right|}{\left|V C V^T\right|} \tag{12}$$

where V* can be obtained by solving a generalized eigenvalue problem of S′ and C. By projecting observation samples onto a lower-dimensional subspace, the discriminative power of the generative model is enhanced. Advantageously, this reduces the time required to compute probabilities, which represents a critical improvement for real time applications like visual tracking.
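For the one-dimensional case, the generalized eigenvalue problem of S′ and C can be sketched with NumPy alone by whitening with a Cholesky factor of C. This is one standard way to solve such a problem and is not necessarily the procedure used in the embodiment. The matrices below are hypothetical.

```python
import numpy as np

def discriminative_projection(S_neg, C):
    """Top generalized eigenvector v of (S_neg, C), scaled so that v^T C v = 1."""
    L = np.linalg.cholesky(C)                 # C = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_neg @ L_inv.T               # whitened negative-sample scatter
    evals, evecs = np.linalg.eigh(M)          # ascending eigenvalues
    u = evecs[:, -1]                          # unit eigenvector, largest eigenvalue
    v = L_inv.T @ u                           # map back: S_neg v = lambda C v
    return v, evals[-1]

rng = np.random.default_rng(4)
C = np.diag([4.0, 0.25])                      # object-class covariance (mu = 0 here)
negatives = rng.normal([3.0, 3.0], 0.2, size=(50, 2))
S_neg = negatives.T @ negatives / len(negatives)   # scatter of negatives about mu

v, lam = discriminative_projection(S_neg, C)

# Rayleigh quotients of random directions never exceed the top eigenvalue.
R = rng.normal(size=(100, 2))
quotients = np.einsum("ij,jk,ik->i", R, S_neg, R) / np.einsum("ij,jk,ik->i", R, C, R)
```

The returned `v` keeps the object variance at one in the projected space while pushing the negative samples as far from the object mean as that constraint allows.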

Understanding of the projection v and its optimal value may be informed by reference to the figures. Positive and negative samples **510** may be projected **530** and **550** onto lines **520** and **540**, respectively. Line **540** represents a poor choice, since there will be low discrimination between positive and negative samples. This is shown conceptually by the projection in panel (*a*). Line **520** is a much better choice, since there will generally be much better separation of the projections of positive and negative samples, as illustrated in panel (*b*).

Panel (*b*) illustrates the meanings of C and S′ according to a hypothetical one-dimensional example exhibiting very good discrimination. C corresponds to the variance of positive or negative sample clusters, taken as separate classes. This is referred to as “in-class scatter.” S′ corresponds to the separation between the positive and negative clusters, and is referred to as “between-class scatter.” Thus, V* corresponds to the linear projection that maximizes the ratio of between-class scatter to in-class scatter.

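The computation of the projection matrix V depends on matrices C and S′. S′ may be updated as follows. Let

$$\mu_{Y'} = \frac{1}{k}\sum_{i=1}^{k} y^i \quad \text{and} \quad S_{Y'} = \frac{1}{k}\sum_{i=1}^{k} (y^i - \mu_{Y'})(y^i - \mu_{Y'})^T;$$

then

$$S' = \frac{1}{k}\sum_{i=1}^{k} (y^i - \mu)(y^i - \mu)^T = S_{Y'} + (\mu - \mu_{Y'})(\mu - \mu_{Y'})^T \tag{13}$$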

Given S′ and C, V may be computed by solving a generalized eigenvalue problem. If S′ and C are factored as S′=A^{T}A and C=B^{T}B, then V can be more efficiently determined using generalized singular value decomposition (SVD). By denoting U_{Y′} and Σ_{Y′} as the SVD of S_{Y′}, it follows that by defining A=[U_{Y′}Σ_{Y′}^{1/2}|(μ−μ_{Y′})]^{T} and B=[U_{m}Σ_{m}^{1/2}|σ^{2}I]^{T}, then S′=A^{T}A and C=B^{T}B.
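The relation between the scatter of the negative samples about the object mean μ and their scatter about their own mean μ_{Y′}, together with the factor A, can be verified numerically. The sketch assumes S′ is the scatter of Y′ about μ and uses random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 4
Yp = rng.normal(size=(k, n))          # negative samples Y' stored as rows
mu = rng.normal(size=n)               # object-class mean

mu_Y = Yp.mean(axis=0)
S_Y = (Yp - mu_Y).T @ (Yp - mu_Y) / k # scatter about the samples' own mean
d = mu - mu_Y

S_direct = (Yp - mu).T @ (Yp - mu) / k    # scatter about mu, computed directly
S_update = S_Y + np.outer(d, d)           # same quantity via the update rule

# Factor S' = A^T A. For a symmetric PSD matrix the SVD and the
# eigendecomposition coincide, so eigh supplies U_Y' and Sigma_Y'.
lam, U = np.linalg.eigh(S_Y)
lam = np.clip(lam, 0.0, None)             # guard against tiny negative round-off
A = np.vstack([(U * np.sqrt(lam)).T, d[None, :]])
```

Only μ_{Y′} and S_{Y′} need to be stored for the negative samples, so S′ can be refreshed cheaply whenever the object mean μ changes.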

V can be computed by first performing a QR factorization:

$$\begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} Q_A \\ Q_B \end{bmatrix} R \tag{14}$$

and computing the singular value decomposition of Q_{A }according to

$$Q_A = U_A D_A V_A^T \tag{15}$$

which yields V=R^{−1}V_{A}. The rank of A is usually small in vision applications, and V can be computed efficiently, thereby facilitating the tracking process. A description of the method used in the above derivation can be found in G. H. Golub and C. F. Van Loan, *Matrix Computations*, Johns Hopkins University Press, 1996, which is incorporated by reference herein in its entirety.
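The QR-then-SVD route to V can be sketched as follows. To keep the check exact, the sketch takes B to be a Cholesky factor of a made-up positive definite C, which is an assumption and differs from the B defined in the text. The function name is hypothetical.

```python
import numpy as np

def gsvd_projection(A, B):
    """Columns of V jointly diagonalize A^T A and B^T B (QR followed by an SVD)."""
    ka = A.shape[0]
    Q, R = np.linalg.qr(np.vstack([A, B]))    # reduced QR of the stacked matrix
    Q_A = Q[:ka]                              # rows of Q that belong to A
    _, d, V_A_t = np.linalg.svd(Q_A)          # Q_A = U_A D_A V_A^T
    V = np.linalg.solve(R, V_A_t.T)           # V = R^{-1} V_A
    d_full = np.zeros(V.shape[1])
    d_full[:d.size] = d                       # pad when A has fewer rows than columns
    return V, d_full

rng = np.random.default_rng(6)
n = 5
A = rng.normal(size=(2, n))                   # low-rank factor of S'
M = rng.normal(size=(n, n))
C = M @ M.T + 0.5 * np.eye(n)                 # hypothetical positive definite C
B = np.linalg.cholesky(C).T                   # B^T B = C

V, d = gsvd_projection(A, B)
S = A.T @ A
ratio = d[0] ** 2 / (1.0 - d[0] ** 2)         # largest generalized eigenvalue
```

The first column of V is the generalized eigenvector of (S′, C) with the largest eigenvalue, so it plays the role of the one-dimensional projection v discussed earlier.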

Returning to the figures, the inference model is applied **242**, based on the preceding steps, and according to equation (1). Since the appearance of the target object or its illumination may be time varying, and since an Eigenbasis is used for object representation, the Eigenbasis is preferably continually updated **248** from the time-varying covariance matrix. This problem has been studied in the signal processing community, where several computationally efficient techniques have been proposed in the form of recursive algorithms. A description of this is in B. Champagne and Q. G. Liu, “Plane rotation-based EVD updating schemes for efficient subspace tracking,” IEEE Transactions on Signal Processing 46 (1998), which is incorporated by reference herein in its entirety. In this embodiment, a variant of the efficient sequential Karhunen-Loeve algorithm is utilized to update the Eigenbasis, as explained in A. Levy and M. Lindenbaum, “Sequential Karhunen-Loeve basis extraction and its application to images,” IEEE Transactions on Image Processing 9 (2000), which is incorporated by reference herein in its entirety. This in turn is based on the classic R-SVD method. A description of this is in G. H. Golub and C. F. Van Loan, “Matrix Computations,” The Johns Hopkins University Press (1996), which is incorporated by reference herein in its entirety.
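The core step of an R-SVD style basis update can be sketched as below. This is a simplified illustration that omits the mean update and any forgetting factor, so it should not be read as the exact variant used in the embodiment. The function name and data are hypothetical.

```python
import numpy as np

def update_basis(U, s, B, keep):
    """Update the SVD of old data (U, s) with new columns B, without the old data.

    Computes the left singular vectors and singular values of [U diag(s) | B].
    """
    proj = U.T @ B                            # part of B inside the current basis
    resid = B - U @ proj                      # part of B orthogonal to it
    Q, R = np.linalg.qr(resid)                # orthonormal basis for the residual
    m = s.size
    K = np.block([[np.diag(s), proj],
                  [np.zeros((Q.shape[1], m)), R]])
    Uk, sk, _ = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Uk            # rotate the enlarged basis
    return U_new[:, :keep], sk[:keep]

rng = np.random.default_rng(7)
A_old = rng.normal(size=(10, 3))              # previously seen appearance vectors
B_new = rng.normal(size=(10, 2))              # newly observed appearance vectors

U, s, _ = np.linalg.svd(A_old, full_matrices=False)
U_new, s_new = update_basis(U, s, B_new, keep=5)

# Reference: batch SVD over all data at once.
U_batch, s_batch, _ = np.linalg.svd(np.hstack([A_old, B_new]), full_matrices=False)
```

Setting `keep` below the full rank truncates the basis to its leading directions, which bounds the cost of each update as new frames arrive.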

One embodiment of the present invention then determines **262** whether all frames of a motion video sequence have been processed. If not, the method receives **268** the next frame vector, and steps **224**-**256** are repeated.

Having described some of the features of one embodiment of the tracking algorithm, additional aspects of this embodiment are now noted. The algorithm is based on a maximum likelihood estimate that determines the most probable location of the target object at the current time, given all observations up to that time. This is described by s_{t}*=arg max_{s_{t}} p(s_{t}|O_{t}). It is assumed that the state transition is a Gaussian distribution, i.e.,

$$p(s_t \mid s_{t-1}) \sim N(s_{t-1}, \Sigma_s) \tag{16}$$

where Σ_{s }is a diagonal matrix. According to this distribution, the tracking algorithm then draws N samples, or state vectors, S_{t}={c_{1}, . . . , c_{N}} that represent the possible locations of the target. y_{t} ^{i }is the appearance vector of o_{t}, and Y={y_{t} ^{1}, . . . y_{t} ^{N}} is a set of appearance vectors that correspond to the set of state vectors S_{t}. The posterior probability that the tracked object is at c_{i }in video frame o_{t }is then defined as

$$p(s_t = c_i \mid O_t) = \kappa\, p(y_t^i \mid V, W, \mu, \sigma)\, p(s_t = c_i \mid s_{t-1}^*) \tag{17}$$

where κ is a constant. Therefore, s_{t}*=arg max_{c_{i}∈S_{t}} p(s_{t}=c_{i}|O_{t}).
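Selecting s_{t}* among the sampled candidates according to equation (17) can be sketched in the log domain. The candidate states and log-likelihood values below are made up, and the observation model is represented only by a precomputed array.

```python
import numpy as np

def select_state(candidates, log_lik, s_prev_star, sigma_diag):
    """Return the candidate maximizing log p(y_i | ...) + log p(c_i | s*_{t-1}).

    The transition is Gaussian with diagonal covariance; additive constants,
    which do not affect the argmax, are dropped.
    """
    z = (candidates - s_prev_star) / sigma_diag
    log_trans = -0.5 * np.sum(z ** 2, axis=1)
    log_post = log_lik + log_trans
    best = int(np.argmax(log_post))
    return candidates[best], best

s_prev_star = np.array([10.0, 10.0, 0.0, 20.0, 20.0])    # (x, y, theta, w, h)
sigma_diag = np.array([4.0, 4.0, 0.05, 1.0, 1.0])        # hypothetical std. deviations
candidates = np.array([
    [10.0, 10.0, 0.0, 20.0, 20.0],
    [12.0, 10.0, 0.0, 20.0, 20.0],
    [30.0, 10.0, 0.0, 20.0, 20.0],
])
log_lik = np.array([-5.0, -1.0, -0.5])    # stand-in for log p(y_t^i | V, W, mu, sigma)

s_star, best = select_state(candidates, log_lik, s_prev_star, sigma_diag)
```

The third candidate has the highest likelihood but lies far from the previous state, so the transition term outweighs it and the second candidate is selected.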

Once s_{t}* is determined, the corresponding observation y_{t}* will be a new example to update W and μ. Appearance vectors y_{t}^{i} with large p(y_{t}^{i}|V,W,μ,σ) but whose corresponding state parameters c_{i} are away from s_{t}* will be used as new examples to update V. The tracking algorithm assumes o_{1} and s_{1}* are given (through object detection, as discussed above), and thus obtains the first appearance vector y_{1}, which in turn is used as the initial value of μ. However, V and W are unknown at the outset. When initial values of V and W are not available, the tracking algorithm is based on template matching, with μ being the template. The matrix W is computed after a small number of appearance vectors are observed. When W is available, V can be computed and updated accordingly.

As discussed above, it is difficult to obtain an accurate initial estimate of σ. Consequently, σ is adaptively updated according to Σ_{m} in W. σ is initially set to a fraction, e.g., 0.1, of the smallest eigenvalue in Σ_{m}. This ensures the distance measurement in equation (6) will not be biased to favor either dw or dt.

Now referring to the figures, a computer system **700** according to one embodiment comprises an input module **710**, a memory device **714**, a processor **716**, and an output module **718**. In an alternative embodiment, an image processor **712** can be part of the main processor **716** or a dedicated device to pre-format digital images to a preferred image format. Similarly, memory device **714** may be a standalone memory device (e.g., a random access memory chip, flash memory, or the like), or an on-chip memory with the processor **716** (e.g., cache memory). Likewise, computer system **700** can be a stand-alone system, such as a server, a personal computer, or the like. Alternatively, computer system **700** can be part of a larger system such as, for example, a robot having a vision system, a security system (e.g., airport security system), or the like.

According to this embodiment, computer system **700** comprises an input module **710** to receive the digital images O. The digital images may be received directly from an imaging device **701**, for example, a digital camera **701** *a *(e.g., robotic eyes), a video system **701** *b *(e.g., closed circuit television), image scanner, or the like. Alternatively, the input module **710** may be a network interface to receive digital images from another network system, for example, an image database, another vision system, Internet servers, or the like. The network interface may be a wired interface, such as, a USB, RS-232 serial port, Ethernet card, or the like, or may be a wireless interface module, such as, a wireless device configured to communicate using a wireless protocol, e.g., Bluetooth, WiFi, IEEE 802.11, or the like.

An optional image processor **712** may be part of the processor **716** or a dedicated component of the system **700**. The image processor **712** could be used to pre-process the digital images O received through the input module **710** to convert the digital images to the preferred format on which the processor **716** operates. For example, if the digital images received through the input module **710** come from a digital camera **701** *a *in a JPEG format and the processor is configured to operate on raster image data, image processor **712** can be used to convert from JPEG to raster image data.

The digital images O, once in the preferred image format if an image processor **712** is used, are stored in the memory device **714** to be processed by processor **716**. Processor **716** applies a set of instructions that when executed perform one or more of the methods according to the present invention, e.g., dynamic model, observation model, and the like. In one embodiment this set of instructions is stored in the Adaptive Discriminative Generative (ADG) unit **716** within memory device **714**. While executing the set of instructions, processor **716** accesses memory device **714** to perform the operations according to methods of the present invention on the image data stored therein.

Processor **716** tracks the location of the target object within the input images, I, and outputs indications of the tracked object's identity and location through the output module **718** to an external device **725** (e.g., a database **725** *a*, a network element or server **725** *b*, a display device **725** *c*, or the like). Like the input module **710**, output module **718** can be wired or wireless. Output module **718** may be a storage drive interface, (e.g., hard-drive or optical drive driver), a network interface device (e.g., an Ethernet interface card, wireless network card, or the like), or a display driver (e.g., a graphics card, or the like), or any other such device for outputting the target object identification and/or location.

The tracking algorithm with the discriminative-generative model was tested with numerous experiments. To examine whether the algorithm was able to adapt and track objects in dynamic environments, videos exhibiting appearance deformation, large illumination change, and large pose variations were recorded. All image sequences consisted of 320×240 pixel grayscale videos, recorded at 30 frames/second and 256 gray-levels per pixel. The forgetting term was empirically selected as 0.85, and the batch size for update was set to 5 as a trade-off between computational efficiency and the effectiveness of modeling appearance change in the presence of fast motion. A description of the forgetting term can be found in *Levy and Lindenbaum*, which was cited above.
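To give a feel for what a forgetting term of 0.85 with batches of 5 implies, the sketch below applies exponential down-weighting to a running mean. This is only an illustration of the down-weighting idea under assumed names (`update_mean`, `weight`); it is not the sequential Karhunen-Loeve update referenced in the text, which also maintains a basis.

```python
import math


def update_mean(mean, weight, batch, forgetting=0.85):
    """Blend a new batch into a running mean while down-weighting older data.

    mean:   current mean as a list of floats
    weight: effective number of samples already summarized by `mean`
    batch:  list of new sample vectors (each a list of floats)
    """
    decayed = forgetting * weight              # older data counts for less
    dim = len(mean)
    batch_sum = [sum(row[j] for row in batch) for j in range(dim)]
    new_weight = decayed + len(batch)
    new_mean = [(decayed * mean[j] + batch_sum[j]) / new_weight for j in range(dim)]
    return new_mean, new_weight


# Hypothetical one-dimensional example: old mean 0.0 from 5 samples,
# then a batch of 5 new samples all equal to 1.0.
new_mean, new_weight = update_mean([0.0], 5.0, [[1.0]] * 5)
```

Under this illustrative scheme the effective weight of the history saturates at 5 / (1 − 0.85), roughly 33 samples, so older appearance data fades out over a few dozen frames.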

**810** and **910**. There are two rows of small images below each main video frame. The first row **820**/**920** shows the sampled images in the current frame that have the largest likelihoods of being the target locations according to the discriminative-generative model (DGM). The second row **830**/**930** shows the sample images in the current video frame that are selected online for updating the DGM. The results in *Proceedings of the Fourth European Conference on Computer Vision*, LNCS 1064, Springer Verlag, 1996, which is incorporated herein by reference in its entirety. The results show that such methods do not perform as well as the DGM-based method, as the former do not update the object representation to account for appearance change.

According to another embodiment of the present invention, a Fisher Linear Discriminant (FLD) projects image samples onto a lower-dimensional subspace. Within the lower-dimensional space, the within-class scatter is minimized while the between-class scatter is maximized, as discussed above with regard to the embodiment based on the discriminative-generative model. The distribution of the background class is modeled by multiple Gaussian distributions or by a single Gaussian distribution. Preferably, one class models the target object and multiple classes model the background. According to one embodiment, one class per image sample models the background class. The FLD distinguishes samples of the object class from samples of the background classes.

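Let X_{i}={x_{1}^{i}, . . . , x_{N_{i}}^{i}} be samples from class i. The FLD computes an optimal projection matrix W by maximizing the objective function

$$J(W) = \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} S_{W} W \right|} \qquad (17)$$

where

$$S_{B} = \sum_{i} N_{i} (m_{i} - m)(m_{i} - m)^{T} \qquad (18)$$

$$S_{W} = \sum_{i} \sum_{x \in X_{i}} (x - m_{i})(x - m_{i})^{T} \qquad (19)$$

are the between- and within-class scatter matrices respectively, with m_{i} being the mean of class i, N_{i} being the number of samples in class i, and m being the overall mean of the samples.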

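Let X={x_{1}, . . . , x_{N_{x}}} be samples from the object class and Y={y_{1}, . . . , y_{N_{y}}} be samples from the background class. Treating each sample of the background as a separate class, there are N_{y}+1 classes with X_{1}=X and X_{i}={y_{i−1}}, i=2, . . . , N_{y}+1. Except for X_{1}, every class has exactly one sample. Hence, m_{i}=y_{i−1} when i≠1, and the single-sample classes contribute nothing to the within-class scatter. Applying these relationships to equations (18) and (19) gives

$$S_{B} = N_{x} (m_{1} - m)(m_{1} - m)^{T} + \sum_{i=1}^{N_{y}} (y_{i} - m)(y_{i} - m)^{T}, \qquad S_{W} = \sum_{x \in X} (x - m_{1})(x - m_{1})^{T} \qquad (20)$$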

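Now denote m_{x} and m_{y} as the means, and C_{x} and C_{y} as the covariance matrices (normalized by the sample counts N_{x} and N_{y}), of samples in X and Y. By applying the fact that

$$m = \frac{N_{x} m_{x} + N_{y} m_{y}}{N_{x} + N_{y}} \qquad (21)$$

the between-class and within-class scatter matrices can be written as

$$S_{B} = \frac{N_{x} N_{y}}{N_{x} + N_{y}} (m_{x} - m_{y})(m_{x} - m_{y})^{T} + N_{y} C_{y}, \qquad S_{W} = N_{x} C_{x} \qquad (22)$$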
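The effect of treating every background sample as its own class can be checked numerically. The sketch below builds the scatter matrices directly from the class definitions and compares them with a closed form written in terms of m_x, m_y, C_x and C_y. The closed form used here is derived from those definitions and assumes covariances normalized by the sample count; the function names and the synthetic data are hypothetical.

```python
import numpy as np


def scatter_per_class(X, Y):
    """Scatter matrices with X as one class and each row of Y as its own class."""
    m = np.vstack([X, Y]).mean(axis=0)          # overall mean
    m_x = X.mean(axis=0)
    d_x = (m_x - m)[:, None]
    S_B = len(X) * (d_x @ d_x.T)
    for y in Y:                                  # single-sample class: mean is the sample
        d = (y - m)[:, None]
        S_B += d @ d.T
    Xc = X - m_x
    S_W = Xc.T @ Xc                              # single-sample classes add nothing
    return S_B, S_W


def scatter_closed_form(X, Y):
    """Same matrices expressed through class means and covariances."""
    n_x, n_y = len(X), len(Y)
    m_x, m_y = X.mean(axis=0), Y.mean(axis=0)
    C_x = np.cov(X, rowvar=False, bias=True)     # normalized by n_x
    C_y = np.cov(Y, rowvar=False, bias=True)     # normalized by n_y
    d = (m_x - m_y)[:, None]
    S_B = n_x * n_y / (n_x + n_y) * (d @ d.T) + n_y * C_y
    S_W = n_x * C_x
    return S_B, S_W


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                     # synthetic object samples
Y = rng.normal(loc=2.0, size=(50, 4))            # synthetic background samples
direct = scatter_per_class(X, Y)
closed = scatter_closed_form(X, Y)
```

The practical benefit of the closed form is that only the means and covariances of the two sample sets need to be maintained, which is what makes an incremental update feasible.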

Referring now to the flowchart of the FLD-based tracking method, an initial frame vector is received **1006**. The characteristics of this frame vector are as discussed above in connection with step **206**. The initial location of the target object is next determined **1012**. This may be accomplished as discussed above regarding step **212**. This method initially classifies the target and background using samples in the first video frame. Starting at the first video frame, a set of motion parameters specifies a window that defines the initial target object location, as discussed above regarding step **224**. The image portion inside that window is preferably an initial example for the object class.

A dynamic model is next applied **1024** to predict s_{t+1}, the object's location at time t+1, as discussed above in connection with step **224**. A small perturbation is applied to the window representing the object class and the corresponding image region is cropped, e.g., a portion of the region specified by the window is taken out. A larger set of samples is thus obtained that emulates possible variations of the target object class over the interval from time t to t+1. Alternatively, applying a larger perturbation provides samples of the non-target background classes. For example, n_{0} (e.g., 500) samples may be drawn, corresponding to a set of cropped images at time t+1. These images are then projected onto a low-dimensional space using projection matrix W. It is assumed that object images in the projected space are governed by Gaussian distributions. An inference model is next applied **1042**. Of the n_{0} samples drawn, this model determines the image that has the smallest distance to the mean of the projected samples in the projection space. This distance is equivalent to dw, as shown in the corresponding figure.
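The sampling and inference steps above can be sketched as follows. The Gaussian perturbation of the window centre, the feature dimension, and all names (`draw_window_centres`, `nearest_to_object_mean`) are assumptions made for illustration; random vectors stand in for the cropped image vectors, and no actual image cropping is performed.

```python
import numpy as np


def draw_window_centres(centre, n, scale, rng):
    """Perturb an (x, y) window centre with Gaussian noise (assumed dynamic model)."""
    return np.asarray(centre, dtype=float) + rng.normal(scale=scale, size=(n, 2))


def nearest_to_object_mean(features, W, object_mean):
    """Project samples with W and return the index closest to the projected mean."""
    projected = features @ W                     # shape (n, k)
    target = object_mean @ W                     # shape (k,)
    distances = np.linalg.norm(projected - target, axis=1)
    return int(np.argmin(distances)), distances


rng = np.random.default_rng(1)
n0, dim = 500, 16                                # 500 samples as in the text; dim is arbitrary
centres = draw_window_centres((160.0, 120.0), n0, scale=4.0, rng=rng)
W = rng.normal(size=(dim, 2))                    # stand-in projection matrix
object_mean = rng.normal(size=dim)
features = object_mean + rng.normal(size=(n0, dim))
features[123] = object_mean                      # plant one sample at the object mean
best, distances = nearest_to_object_mean(features, W, object_mean)
best_centre = centres[best]                      # window chosen as the new location
```

A larger `scale` in `draw_window_centres` would yield windows far from the object, which is how background samples could be generated in the same framework.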

The FLD is next updated **1056**. The non-selected members of the n_{0} samples whose corresponding motion parameters are close to those of the chosen sample are selected as training examples for the object class at time t+1. Exemplars for the background class are chosen as those having small distances to the object mean in the projection space, and having motion parameters that deviate significantly from those of the chosen sample. These samples are likely to have been generated from one of the background classes, since their motion parameters significantly differ from those of the chosen sample. However, these samples appear to belong to the object class in the projection space, since they have small distances dw to the object mean, as shown in the corresponding figure.
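The selection rule for new training examples can be sketched as below. The thresholds `near` and `far`, the count `k`, and the function name are hypothetical; the text does not give numeric criteria for "close" or "deviate significantly".

```python
import numpy as np


def select_training_examples(params, distances, chosen, near, far, k):
    """Split the non-chosen samples into object and background examples.

    params:    (n, p) motion parameters of the samples
    distances: (n,) distance of each sample to the object mean in the projected space
    chosen:    index of the sample picked by the inference step
    near, far: assumed thresholds on motion-parameter deviation
    k:         number of background exemplars to keep
    """
    deviation = np.linalg.norm(params - params[chosen], axis=1)
    idx = np.arange(len(params))
    object_idx = idx[(deviation <= near) & (idx != chosen)]
    candidates = idx[deviation >= far]
    # Background exemplars: far away in motion space, yet close to the object mean.
    background_idx = candidates[np.argsort(distances[candidates])[:k]]
    return object_idx, background_idx


params = np.array([[0, 0], [1, 0], [0, 1], [30, 0], [40, 5], [50, 50]], dtype=float)
distances = np.array([0.1, 0.3, 0.2, 0.4, 0.25, 5.0])
object_idx, background_idx = select_training_examples(
    params, distances, chosen=0, near=2.0, far=20.0, k=2
)
```

Picking the background samples that look most like the object is a form of hard-negative selection: those are the samples the current projection separates worst, so they are the most informative for the update.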

The FLD is further updated **1056** by finding the W that maximizes J(W) in equation (17). This may be accomplished by solving a generalized eigenvalue problem. Since S_{W} is a rank deficient matrix, J(W) is changed to

$$J(W) = \frac{\left| W^{T} S_{B} W \right|}{\left| W^{T} (S_{W} + \epsilon I) W \right|} \qquad (23)$$

where ε is a scalar having a small value. Using the sequential Karhunen-Loeve algorithm discussed above, C_{x} and C_{y} are approximated by

$$C_{x} \approx U_{x} D_{x} U_{x}^{T} \quad \text{and} \quad C_{y} \approx U_{y} D_{y} U_{y}^{T} \qquad (24)$$

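Now define

$$A = \left[ \sqrt{N_{y}}\, U_{y} D_{y}^{1/2} \;\Big|\; \sqrt{\tfrac{N_{x} N_{y}}{N_{x} + N_{y}}}\, (m_{x} - m_{y}) \right]^{T}, \qquad B = \left[ \sqrt{N_{x}}\, U_{x} D_{x}^{1/2} \;\Big|\; \sqrt{\epsilon}\, I \right]^{T} \qquad (25)$$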

It can be shown that

$$S_{B} = A^{T} A \quad \text{and} \quad S_{W} + \epsilon I = B^{T} B \qquad (26)$$

The desired value of W is found by applying equations (14) and (15) as discussed above, with W substituted for V.
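For readers who want to see the regularized criterion in action, the sketch below solves the generalized eigenvalue problem S_B w = λ (S_W + εI) w directly, by whitening with a Cholesky factor. This is a plain batch solve offered as an assumption-laden illustration, not the incremental procedure of equations (14) and (15); the function name and the toy matrices are hypothetical.

```python
import numpy as np


def fld_projection(S_B, S_W, eps=1e-3, k=1):
    """Top-k directions maximizing the regularized Fisher criterion.

    Adding eps * I makes the within-class term invertible even when
    S_W is rank deficient.
    """
    d = S_W.shape[0]
    L = np.linalg.cholesky(S_W + eps * np.eye(d))    # S_W + eps I = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_B @ L_inv.T                        # whitened between-class scatter
    M = (M + M.T) / 2                                # enforce symmetry numerically
    vals, vecs = np.linalg.eigh(M)                   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    W = L_inv.T @ vecs[:, order]                     # map back to the original space
    return W, vals[order]


# Toy example: rank-deficient S_W, rank-one S_B built from the direction (1, 1).
# eps is deliberately large here so the expected values are easy to verify by hand.
S_B = np.outer([1.0, 1.0], [1.0, 1.0])
S_W = np.diag([1.0, 0.0])
W, vals = fld_projection(S_B, S_W, eps=0.5, k=1)
```

In this toy case the optimal direction is proportional to (S_W + εI)^{-1}(1, 1)^T = (2/3, 2)^T, with criterion value 8/3, which shows how the regularizer keeps the solution finite along the direction where S_W has no scatter.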

Returning to the flowchart, steps **1062** and **1068** are applied in the manner discussed above regarding steps **262** and **268**, respectively.

The tracking algorithm with FLD was tested with a face-tracking experiment. Videos including a human subject's face and exhibiting illumination change and pose variations were recorded. All image sequences consisted of 320×240 pixel grayscale videos, recorded at 30 frames/second and 256 gray-levels per pixel. For initialization, 100 exemplars for the target class and 500 exemplars of the background classes were used to compute the FLD. These sample sizes are chosen as a compromise. The more positive and negative examples used, the better the results. However, more computation is required as the number of examples increases. The number of negative examples is preferably larger than the number of positive examples, since preferably more than one class is used for the negative examples. The FLD was incrementally updated every five frames. During tracking, 5 new target object and background examples were added at each frame, and the previously-used examples were retained.

The first rows **1120**/**1220** show the current mean of the object class followed by the five new object image examples collected in the respective frame. The second rows **1130**/**1230** show the new background examples collected in the respective frame. As shown, tracking is stable despite sharp illumination and pose changes and variation in facial expression.

Advantages of the present invention as applied to visual tracking include improved tracking accuracy and computational efficiency relative to conventional methods. Since the visual tracking models continually adapt, large appearance variations of the target object and background due to pose and lighting changes are effectively accommodated.

Those of skill in the art will appreciate still additional alternative structural and functional designs for a discriminative-generative model and a Fisher Linear Discriminant model and their applications through the disclosed principles of the present invention. Thus, while particular embodiments and applications of the present invention have been illustrated and described, it is to be understood that the invention is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims.

Patent Citations

Cited Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|
US5960097 | Jan 21, 1997 | Sep 28, 1999 | Raytheon Company | Background adaptive target detection and tracking with multiple observation and processing stages |
US6047078 | Oct 3, 1997 | Apr 4, 2000 | Digital Equipment Corporation | Method for extracting a three-dimensional model using appearance-based constrained structure from motion |
US6226388 * | Jan 5, 1999 | May 1, 2001 | Sharp Labs Of America, Inc. | Method and apparatus for object tracking for automatic controls in video devices |
US6236736 | Feb 6, 1998 | May 22, 2001 | Ncr Corporation | Method and apparatus for detecting movement patterns at a self-service checkout terminal |
US6295367 | Feb 6, 1998 | Sep 25, 2001 | Emtera Corporation | System and method for tracking movement of objects in a scene using correspondence graphs |
US6337927 * | Jun 4, 1999 | Jan 8, 2002 | Hewlett-Packard Company | Approximated invariant method for pattern detection |
US6363173 * | Oct 14, 1998 | Mar 26, 2002 | Carnegie Mellon University | Incremental recognition of a three dimensional object |
US6400831 | Apr 2, 1998 | Jun 4, 2002 | Microsoft Corporation | Semantic video object segmentation and tracking |
US6539288 | May 23, 2001 | Mar 25, 2003 | Matsushita Electric Industrial Co., Ltd. | Vehicle rendering device for generating image for drive assistance |
US6580810 | Jun 10, 1999 | Jun 17, 2003 | Cyberlink Corp. | Method of image processing using three facial feature points in three-dimensional head motion tracking |
US6683968 * | Sep 1, 2000 | Jan 27, 2004 | Hewlett-Packard Development Company, L.P. | Method for visual tracking using switching linear dynamic system models |
US6757423 | Feb 18, 2000 | Jun 29, 2004 | Barnes-Jewish Hospital | Methods of processing tagged MRI data indicative of tissue motion including 4-D LV tissue tracking |
US6810079 * | Aug 14, 2001 | Oct 26, 2004 | Canon Kabushiki Kaisha | Image processing apparatus, image processing method, and computer-readable storage medium storing thereon program for executing image processing |
US6870945 | Jun 4, 2001 | Mar 22, 2005 | University Of Washington | Video object tracking by estimating and subtracting background |
US6999600 * | Jan 30, 2003 | Feb 14, 2006 | Objectvideo, Inc. | Video scene background maintenance using change detection and classification |
US7003134 | Mar 8, 2000 | Feb 21, 2006 | Vulcan Patents Llc | Three dimensional object pose estimation which employs dense depth information |
US20010048753 | Apr 2, 1998 | Dec 6, 2001 | Ming-Chieh Lee | Semantic video object segmentation and tracking |
US20040208341 | Mar 5, 2004 | Oct 21, 2004 | Zhou Xiang Sean | System and method for tracking a global shape of an object in motion |
USRE37668 | Jun 16, 2000 | Apr 23, 2002 | Matsushita Electric Industrial Co., Ltd. | Image encoding/decoding device |
WO2000048509A1 | Feb 18, 2000 | Aug 24, 2000 | Barnes-Jewish Hospital | Methods of processing tagged mri data indicative of tissue motion including 4-d lv tissue tracking |
WO2003049033A1 * | Dec 3, 2002 | Jun 12, 2003 | Honda Giken Kogyo Kabushiki Kaisha | Face recognition using kernel fisherfaces |

Non-Patent Citations

Reference | ||
---|---|---|
1 | "Pose Invariant Affect Analysis Using Thin-Plate Splines," To appear Int. Conference on Pattern Recognition, Cambridge, UK, Aug. 2004, [online] [Retrieved on Oct. 9, 2006] Retrieved from the Internet. | |
2 | "Pose Invariant Affect Analysis Using Thin-Plate Splines," To appear Int. Conference on Pattern Recognition, Cambridge, UK, Aug. 2004, [online] [Retrieved on Oct. 9, 2006] Retrieved from the Internet<URL:http://cvrr.ucsd.edu/publications/2004/RAAS-ICPR2004.pdf>. | |
3 | Black, Michael J. et al., "EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation," International Journal of Computer Vision, 1998, pp. 63-84, vol. 26, No. 1. | |
4 | Collins, R.T. et al., "On-Line Selection of Discriminative Tracking Features," Carnegie Mellon University, 2003, pp. 1-14. | |
5 | International Search Report and Written Opinion, PCT/US04/38189, Mar. 2, 2005. | |
6 | International Search Report and Written Opinion, PCT/US05/24582, Feb. 9, 2006, 8 pages. | |
7 | Tipping, Michael E. et al., "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society, Series B, Sep. 27, 1998, pp. 611-622, vol. 61, part 3. | |

Referenced by

Citing Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|
US9152880 | May 30, 2014 | Oct 6, 2015 | The United States Of America As Represented By The Secretarty Of The Army | Method for modeling human visual discrimination task performance of dynamic scenes |
US20120219176 * | May 18, 2011 | Aug 30, 2012 | Al Cure Technologies, Inc. | Method and Apparatus for Pattern Tracking |

Classifications

U.S. Classification | 382/103, 382/223, 382/224 |

International Classification | G06K9/00 |

Cooperative Classification | G06K9/6247, G06T2207/30241, G06K9/6234, G06K9/00241, G06K9/3233, G06T7/208, G06T7/2033, G06K9/6214, G06K9/00261, G06T7/004 |

European Classification | G06K9/62B4P, G06K9/32R, G06K9/62B4D, G06K9/00F1V, G06T7/00P, G06K9/62A6, G06T7/20C, G06T7/20K |

Legal Events

Date | Code | Event | Description |
---|---|---|---|
Oct 13, 2005 | AS | Assignment | Owner name: HONDA MOTOR CO., LTD., JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YANG, MING-HSUAN;LIN, RUEI-SUNG;REEL/FRAME:016879/0979;SIGNING DATES FROM 20051007 TO 20051011 |
Mar 13, 2013 | FPAY | Fee payment | Year of fee payment: 4 |
Jan 19, 2016 | CC | Certificate of correction |
