US 20050083248 A1
Image processing procedures include receiving at least two side view images of a face of a user. In other aspects, side view images are warped and blended into an output image of a face of a user as if viewed from a virtual point of view. In further aspects, a virtual video is produced in real time of output images from a video feed of side view images.
1. An image processing method, comprising:
receiving at least two side view images of a face of a user;
warping and blending the side view images into an output image of the face of the user as if viewed from a virtual point of view; and
producing a virtual video in real time of output images from a video feed of side view images.
2. The method of
accessing a three-dimensional closed mesh model of points corresponding to salient facial feature points; and
warping and blending the side view images by texture mapping the side view images to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of side view images.
3. The method of
4. The method of
5. The method of
6. The method of
receiving a selection of a virtual point of view from which to view a three-dimensional model of the face of the user; and
rendering the three-dimensional model from the virtual point of view based on the selection.
7. The method of
8. The method of
9. The method of
10. The method of
11. The method of
hand labeling mesh points for a diverse set of faces and multiframe video recording; and
performing principal components analysis to obtain a minimum spanning dimensionality.
12. The method of
13. The method of
accessing transformation tables for the side view images, wherein the transformation tables define rules for interpolating regions of the side view images into side portions of the output image;
warping the side view images based on the transformation tables, thereby producing the side portions of the output image; and
blending the side portions of the output image, thereby producing the output image.
14. The method of
15. The method of
16. The method of
17. The method of
18. An apparatus, comprising:
a head mounted display unit worn by a first user, the display unit rendering to the first user an output image of a face of a second user virtually interacting with the first user in a collaborative, virtual environment, wherein the output image has been formed, based on offset view images of the face of the second user, such that the face of the second user appears as if viewed from a virtual point of view; and
an input port receiving at least one of the following: (a) offset view images of the face of the second user; (b) user-specific scaling and deformation transformations specific to the second user; (c) position of the face of the second user in a common coordinate system of the collaborative, virtual environment; (d) a three-dimensional model of the face of the second user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the second user; and (f) an output image of the face of the second user.
19. The apparatus of
an array of at least two imaging components having fixed positions and orientations relative to a face of the first user and adapted to receive at least two offset views of the face of the first user; and
an output port transmitting at least one of the following: (a) offset view images of the face of the first user; (b) user-specific scaling and deformation transformations specific to the first user; (c) position of the face of the first user in the common coordinate system of the collaborative, virtual environment within which the first user and the second user virtually interact; (d) a three-dimensional model of the face of the first user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the first user; and (f) an output image of the face of the first user.
20. The apparatus of
21. The apparatus of
22. The apparatus of
23. The apparatus of
24. The apparatus of
25. The apparatus of
26. The apparatus of
27. The apparatus of
28. An apparatus, comprising:
an array of at least two imaging components having fixed positions and orientations relative to a face of a first user and adapted to receive at least two offset views of the face of the first user; and
an output port transmitting at least one of the following: (a) offset view images of the face of the first user; (b) user-specific scaling and deformation transformations specific to the first user; (c) position of the face of the first user in a common coordinate system of a collaborative, virtual environment within which the first user and a second user virtually interact; (d) a three-dimensional model of the face of the first user; (e) a selection of a virtual point of view from which to render the three-dimensional model of the face of the first user; and (f) an output image of the face of the first user, wherein the output image of the face of the first user has been formed by combining offset view images of the face of the first user into an output image of the face of the first user as if viewed from a virtual point of view.
29. The apparatus of
30. The apparatus of
31. The apparatus of
32. The apparatus of
33. Computer software, comprising:
first instructions receiving at least two offset view images of a contoured structure;
second instructions forming, from the offset view images, an output image of the contoured structure as if viewed from a virtual point of view.
34. The computer software of
35. The computer software of
36. Computer software, comprising:
a first set of instructions receiving at least two offset view images of a contoured structure;
a second set of instructions recognizing feature points of the contoured structure in the offset view images, accessing a three-dimensional closed mesh model of feature points similar to the recognized feature points, and texture mapping the offset view images to the three-dimensional closed mesh model based on mappings of vertices of polygons of the mesh model into two-dimensional coordinate spaces of the offset view images, thereby forming a three-dimensional model of the contoured structure.
This application is a continuation-in-part of U.S. patent application Ser. No. 09/748,761 filed on Dec. 22, 2000. The disclosure of the above application is incorporated herein by reference in its entirety for any purpose.
The present invention generally relates to computer-based teleconferencing in a networked virtual reality environment, and more particularly to mobile face capture and image processing.
Networked virtual environments allow users at remote locations to use a telecommunication link to coordinate work and social interaction. Teleconferencing systems and virtual environments that use 3D computer graphics displays and digital video recording systems allow remote users to interact with each other, to view virtual work objects such as text, engineering models, medical models, play environments and other forms of digital data, and to view each other's physical environment.
A number of teleconferencing technologies support collaborative virtual environments which allow interaction between individuals in local and remote sites. For example, video-teleconferencing systems use simple video screens and wide screen displays to allow interaction between individuals in local and remote sites. However, wide screen displays are disadvantageous because virtual 3D objects presented on the screen are not blended into the environment of the room of the users. In such an environment, local users cannot have a virtual object between them. This problem applies to representation of remote users as well. The location of the remote participants cannot be anywhere in the room or the space around the user, but is restricted to the screen.
Networked immersive virtual environments also present various disadvantages. Networked immersive virtual reality systems are sometimes used to allow remote users to connect via a telecommunication link and interact with each other and virtual objects. In many such systems the users must wear a virtual reality display where the user's eyes and a large part of the face are occluded. Because these systems only display 3D virtual environments, the user cannot see both the physical world of the site in which they are located and the virtual world which is displayed. Furthermore, people in the same room cannot see each other's full face and eyes, so local interaction is diminished. Because the face is occluded, such systems cannot capture and record a full stereoscopic view of remote users' faces.
Another teleconferencing system is termed CAVES. CAVES systems use multiple screens arranged in a room configuration to display virtual information. Such systems have several disadvantages. In CAVES systems, there is only one correct viewpoint; all other local users have a distorted perspective on the virtual scene. Scenes in the CAVES are only projected on a wall, so two local users can view a scene on the wall, but an object cannot be presented in the space between users. These systems also use multiple rear screen projectors, and therefore are very bulky and expensive. Additionally, CAVES systems may also utilize stereoscopic screen displays. Stereoscopic screen display systems do not present 3D stereoscopic views that interpose 3D objects between local users of the system. These systems sometimes use 3D glasses to present a 3D view, but only one viewpoint is shared among many users, often with perspective distortions.
Consequently, there is a need for an augmented reality display that mitigates the above-mentioned disadvantages and has the capability to display virtual objects and environments, superimpose virtual objects on “real world” scenes, provide “face-to-face” recording and display, be used in various ambient lighting environments, and correct for optical distortion, while minimizing computational power and time.
Faces have been captured passively in rooms instrumented with a set of cameras, where stereo computations can be done using selected viewpoints. Other objects can be captured using the same methods. Such hardware configurations are unavailable for mobile use in arbitrary environments, however. Other work has shown that faces can be captured using a single camera and processing that uses knowledge of the human face. Either the face has to move relative to the camera, or assumptions of symmetry are employed. Our approach is to use two cameras affixed to the head, which is necessary to convey asymmetric facial expressions and appearance, such as the closing of one eye and not the other, or the reflection of a fire on only one side of the face.
There is little overlap in the images taken from outside the user's central field of view, so the frontal view synthesized is a novel view. In previous work, novel views have been synthesized by a panoramic system and/or by interpolating between a set of views. Producing novel views in a dynamic scenario was successfully shown for a highly rigid motion. This work extended interpolation techniques to the temporal domain from the spatial domain. A novel view at a new time instant was generated by interpolating views at nearby time intervals using spatio-temporal view interpolation, where a dynamic 3-D scene is modeled and novel views are generated at intermediate time intervals.
There remains a need for a way to generate in real time a synthetic frontal view of a human face from two real side views.
In accordance with the present invention, image processing procedures include receiving at least two side view images of a face of a user. In other aspects, side view images are warped and blended into an output image of a face of a user as if viewed from a virtual point of view. In further aspects, a video is produced in real time of output images from a video feed of side view images.
In yet other aspects, a teleportal system is provided. A principal feature of the teleportal system is that single or multiple users at a local site and a remote site use a telecommunication link to engage in face-to-face interaction with other users in a 3D augmented reality environment. Each user utilizes a system that includes a display such as a projection augmented-reality display and sensors such as a stereo facial expression video capture system. The video capture system allows the participants to view a 3D, stereoscopic, video-based image of the face of all remote participants and hear their voices, view the local participants unobstructed, and view a room that blends physical objects with virtual objects that users can interact with and manipulate.
In one preferred embodiment of the system, multiple local and remote users can interact in a room-sized space draped in a fine grained retro-reflective fabric. An optical tracker preferably having markers attached to each user's body and digital video cameras at the site records the location of each user at a site. A computer uses the information about each user's location to calculate the user's body location in space and create a correct perspective on the location of the 3D virtual objects in the room.
The projection augmented-reality display projects stereo images towards a screen which is covered by a fine grain retro-reflective fabric. The projection augmented-reality display uses an optics system that preferably includes two miniature source displays, and projection-optics, such as a double Gauss form lens combined with a beam splitter, to project an image via light towards the surface covered with the retro-reflective fabric. The retro-reflective fabric retro-reflects the projected light brightly and directly back to the eyes of the user. Because of the properties of the retro-reflective screen and the optics system, each eye receives the image from only one of the source displays. The user perceives a 3D stereoscopic image apparently floating in space. The projection augmented-reality display and video capture system does not occlude vision of the physical environment in which the user is located. The system of the present invention allows users to see both virtual and physical objects, so that the objects appear to occupy the same space. Depending on the embodiment of the system, the system can completely immerse the user in a virtual environment, or the virtual environment can be restricted to a specific region in space, such as a projection window or table top. Furthermore, the restricted regions can be made part of an immersive wrap-around display.
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The present invention will become more fully understood from the detailed description and the accompanying drawings, wherein:
The following description of the preferred embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Teleportal sites 101 and 102 preferably include a screen 103. Screen 103 is made of a retro-reflective material such as beads-based or corner-cube based materials manufactured by 3M® and Reflexite Corporation. The retro-reflective material is preferably gold, which produces a bright image with adequate resolution. Alternatively, other material which has metallic fiber adequate to reflect at least a majority of the image or light projected onto its surface may be used. The retro-reflective material preferably provides about 98 percent reflection of the incident light projected onto its surface. The material retro-reflects light projected onto its surface directly back upon its incident path and to the eyes of the user. Screen 103 can be a surface of any shape, including but not limited to a plane, sphere, pyramid, and body-shaped, for example, like a glove for a user's hand or a body suit for the entire body. Screen 103 can also be formed to a substantially cubic shape resembling a room, preferably similar to four walls and a ceiling which generally surround the users. In the preferred embodiment, screen 103 forms four walls which surround users 110. 3D graphics are visible via screen 103. Because the users can see 3D stereographic images, text, and animations, all surfaces that have retro-reflective property in the room or physical environment can carry information. For example, a spherical screen 104 is disposed within the room or physical environment for projecting images. The room or physical environment may include physical objects substantially unrelated to the teleportal system 100. For example, physical objects may include furniture, walls, floors, ceilings and/or other inanimate objects.
With a continued reference to
Optical tracking system 106 further includes markers 96 that are preferably attached to one or more body parts of the user. In the preferred embodiment, markers 96 are coupled to each user's hand, which is monitored for movement and position. Markers 96 communicate marker location data regarding the location of the user's head and hands. It should be appreciated that the location of any other body part of the user or object to which a marker is attached can be acquired.
Users 110 wear a novel teleportal headset 105. Each headset preferably has displays and sensors. Each teleportal headset 105 communicates with a networked computer. For example, teleportal headsets 105 of site 101 communicate with networked computer 107 a. Networked computer 107 a communicates with a networked computer 107 b of site 102 via a networked data system 99. In this manner, teleportal headsets can exchange data via the networked computers. It should be appreciated that teleportal headset 105 can be connected via a wireless connection to the networked computers. It should also be appreciated that headset 105 can alternatively communicate directly with networked data system 99. Networked data system 99 can be, for example, the Internet, a dedicated telecommunication line connecting the two sites, or a wireless network connection.
Each of sensors 203, 106 and 204 is preferably connected to networked computer 107 and sends signals to the networked computer. Facial capture system 203 sends signals to the networked computer. However, it should be appreciated that sensors 203, 106 and 204 can directly communicate with a networked data system 99. Facial capture system 203 provides image signals, based on the image viewed by a digital camera, which are processed by a face-unwarping and image stitching module 207. Images or “first images” sensed by face capture system 203 are morphed for viewing by users at remote sites via a networked computer. The images for viewing are 3D and stereoscopic such that each user experiences a perspectively correct viewpoint on an augmented reality scene. The images of participants can be located anywhere in space around the user.
Morphing distorts the stereo images to produce a viewpoint of preferably a user's moving face that appears different from the viewpoint originally obtained by facial capture system 203. The distorted viewpoint is accomplished via image morphing to approximate a direct face-to-face view of the remote face. Face-warping and image-stitching module 207 morphs images to the user's viewpoint. The pixel correspondence algorithm or face warping and image stitching module 207 calculates the corresponding points between the first images to create second images for remote users. Image data retrieved from the first images allows for a calculation of a 3D structure of the head of the user. The 3D image is preferably a stereoscopic video image or a video texture mapping to a 3D virtual mesh. The 3D model can display the 3D structure or second images to the users in the remote location. Each user in the local and remote sites has a personal and correct perspective viewpoint on the augmented reality scene. Optical tracking system 106 and microphone 204 provide signals to networked computer 107 that are processed by a virtual environment module 208.
A display array 222 is provided to allow the user to experience the 3D virtual environment, for example via a projection augmented-reality display 401 and stereo audio earphones 205 which are connected to user 110. Display array 222 is connected to a networked computer. In the preferred embodiment, a modem 209 connects a networked computer to network 99.
Source display 501 transmits light to beamsplitter 503. It should be appreciated that
Additionally, the projection augmented-reality display can be mounted on the head. More specifically,
Each video camera 601 a and 601 b is mounted to a housing 406. Housing 406 is formed as a temple section of the headset 105. In the preferred embodiment, each digital video camera 601 a and 601 b is pointed at a respective convex mirror 602 a and 602 b. Each convex mirror 602 a and 602 b is connected to housing 406 and is angled to reflect an image of the adjacent side of the face. Digital cameras 601 a and 601 b located on each side of the user's face 410 capture a first image or particular image of the face from each convex mirror 602 a and 602 b associated with the individual digital cameras 601 a and 601 b, respectively, such that a stereo image of the face is captured. A lens 408 is located at each eye of user face 606. Lens 408 allows images to be displayed to the user as the lens 408 is positioned at 45 degrees relative to the axis along which a light beam is transmitted from a projector. Lens 408 is made of a material that both reflects and transmits light. One preferred material is a “half silvered mirror.”
Communication of the expressive human face is important to tele-communication and distributed collaborative work. In addition to sophisticated collaborative work environments, there is a strong popular trend for the merger of cell phone and video functionality at consumer prices. At both ends of the technology spectrum, there is a problem producing quality video of a person's face without interfering with that person's ability to perform some task requiring both visual and motor attention. When the person is mobile, the technology of most collaborative environments is unusable. Referring now to
Side view as used herein should be interpreted as any offset view. Thus, the angle with respect to the face does not have to be directly from the side. Also, the side view can be from an angle beneath or above the face. Further, while side views of faces of users are typically captured and used from/in a virtual view, it should be readily understood that other parts of a user may also be captured, such as a user's hand.
A prototype HMD facial capture system has been developed. The development of the video processing reported here was isolated from the HMD device and performed using a fixed lab bench and conventional computer. Porting and integration of the video processing with the mobile HMD hardware can be accomplished in a variety of ways as further described below.
The prototype system was configured with off-the-shelf hardware and software components.
In the experiment demonstrating feasibility of some embodiments of the present invention, several videos were taken for several volunteers so that the synthetic video could be compared to real video. One question posed was whether the synthetic frontal video would be of sufficient quality to support the applications intended for the HMD. The bench was set up for a general user and adjustments were made for individuals only when needed. Video and audio were recorded for each subject for offline processing.
The problem is to generate a virtual frontal view from two side views. The projected light grid provides a basis for mapping pixels from the side images into a virtual image with the projector's viewpoint. The grid is projected onto the face for only a few frames so that mapping tables can be built, and then is switched off for regular operation.
There are three 2D coordinate systems involved in creation of the virtual video. A global 3D coordinate system is denoted; however, it must be emphasized that 3D coordinates are not required for the task according to some embodiments of the present invention.
During the calibration phase, the transformation tables are generated using the grid pattern coordinates. A rectangular grid is projected onto the face and the two side views are captured as shown in
The behavior of a single gridded cell in the original side view and the virtual frontal view is demonstrated in
Equations 1 and 2 are four functions determined during the calibration stage and implemented via the transformation tables. These transformation tables are then used in the operational stage immediately after the grid is switched off. During operation, it is known for each pixel V[x, y] in which grid cell of LCS or RCS it lies. Bilinear interpolation is then used on the grid cell corners to access an actual pixel value to be output to the VV.
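By way of illustration only, the per-pixel lookup described above can be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function names `cell_to_source` and `sample_bilinear`, the corner ordering, and the grayscale list-of-rows image layout are all assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cell_to_source(corners: Sequence[Point], u: float, v: float) -> Point:
    """Map a fractional position (u, v) in [0, 1] x [0, 1] inside a virtual-view
    grid cell to side-image coordinates by bilinear interpolation of the
    cell corners.

    corners: side-image positions of the cell's top-left, top-right,
    bottom-left and bottom-right corners (hypothetical ordering), as they
    would be recorded during calibration.
    """
    (x00, y00), (x10, y10), (x01, y01), (x11, y11) = corners
    w00, w10 = (1 - u) * (1 - v), u * (1 - v)
    w01, w11 = (1 - u) * v, u * v
    x = w00 * x00 + w10 * x10 + w01 * x01 + w11 * x11
    y = w00 * y00 + w10 * y10 + w01 * y01 + w11 * y11
    return x, y


def sample_bilinear(image: List[List[float]], x: float, y: float) -> float:
    """Read a grayscale image (rows of floats) at a non-integer position."""
    height, width = len(image), len(image[0])
    # Clamp so that the 2x2 neighbourhood stays inside the image.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


# Usage: a 2x2 side image and one cell whose corners coincide with its pixels.
side = [[0.0, 10.0], [20.0, 30.0]]
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sx, sy = cell_to_source(corners, 0.5, 0.5)
value = sample_bilinear(side, sx, sy)
```

In this sketch the mapping from the virtual view to the side image and the read-out of the pixel value are kept as two separate steps, so that the first step could be precomputed once into a table and only the second step repeated for each video frame.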
In the case where convex mirrors, wide angle lenses, or equivalent sensors are employed to capture offset views of user faces, warping can still be accomplished in one step by making the point correspondences. However, in the case of strong nonlinear distortion, it is envisioned that bicubic interpolation may be employed instead of bilinear interpolation. It is also envisioned that subpixel coordinates and multiple pixel sampling can be used in cases where the face texture changes fast or where the face normal is away from the sensor direction.
Some implementation details are as follows. A rectangular grid of dimension 400×400 is projected onto the face. The grid is made by repeating three colored lines. White, green and cyan colors proved useful because of their bright appearance over the skin color. This combination of hues demonstrated good performance over a wide variety of skin pigmentations. However, it is envisioned that other hues may be employed. The first few frames have the grid projected onto the face before the grid is turned off. One of the frames with the grid is taken and the transformation tables are generated. The size of the grid pattern that is projected in the calibration stage plays a significant role in the quality of the video. This size was decided based on the trade-off between the quality of the video and execution time. An appropriate grid size was chosen based on trial and error. The trial and error process started by projecting a sparse grid pattern onto the face and then increasing the density of the grid pattern. At one point, the increase in the density did not significantly improve the quality of the face image but consumed too much time. At that point, the grid was finalized with a grid cell size of row-width 24 pixels and column-width 18 pixels.
Using the transformation tables 1012 generated in the calibration phase, each virtual frontal frame is generated. The algorithm reconstructs each (x, y) coordinate in the virtual view by accessing the corresponding location in the transformation table and retrieving the pixel in IL (or IR) using interpolation. Then a 1D linear smoothing filter is used to smooth the intensity across the vertical midline of the face. Without this smoothing, a human viewer usually perceives a slight intensity edge at the midline of the face.
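The midline smoothing step can be illustrated with a short sketch. The disclosure does not specify the filter kernel or the width of the region it is applied to, so the horizontal moving average, the `band` and `radius` parameters, and the function name `smooth_midline` below are assumptions chosen only to make the idea concrete.

```python
from typing import List


def smooth_midline(frame: List[List[float]], band: int = 2, radius: int = 1) -> List[List[float]]:
    """Return a copy of `frame` in which columns within `band` pixels of the
    vertical midline are replaced by a horizontal moving average of the
    original pixels (window of 2 * radius + 1, truncated at the borders).
    """
    width = len(frame[0])
    mid = width // 2
    lo, hi = max(mid - band, 0), min(mid + band, width)
    out = [row[:] for row in frame]  # leave the input frame untouched
    for r, row in enumerate(frame):
        for c in range(lo, hi):
            left, right = max(c - radius, 0), min(c + radius, width - 1)
            window = row[left:right + 1]
            out[r][c] = sum(window) / len(window)
    return out


# Usage: one row with a hard intensity step where the two halves meet.
step = [[0.0, 0.0, 0.0, 0.0, 100.0, 100.0, 100.0, 100.0]]
smoothed = smooth_midline(step)
```

Restricting the filter to a narrow band around the seam leaves the rest of the warped face untouched, which is one plausible way to soften the intensity edge without blurring the eyes or mouth.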
Some other post-processing can be included. For example, frames with a gridded pattern can be deleted from the final output: these can be identified by a large shift in intensity when the projected grid is switched off. Also, a microphone recording of the voice of the user, stored in a separate .wav file, can be appended to the video file to obtain a final output.
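As an illustration of how gridded frames might be identified, the following sketch locates the largest frame-to-frame change in mean intensity and treats everything before it as calibration frames. The threshold `min_shift`, the function names, and the grayscale list-of-rows frame layout are assumptions; the disclosure states only that a large shift in intensity marks the point where the grid is switched off.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]


def mean_intensity(frame: Frame) -> float:
    """Average pixel value of a grayscale frame given as rows of numbers."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count


def first_frame_after_grid(frames: Sequence[Frame], min_shift: float = 20.0) -> int:
    """Index of the first frame following the largest change in mean intensity.

    Returns 0 when no frame-to-frame change exceeds `min_shift`, meaning that
    no gridded prefix was detected and all frames are kept.
    """
    means: List[float] = [mean_intensity(f) for f in frames]
    best_index, best_shift = 0, min_shift
    for i in range(1, len(means)):
        shift = abs(means[i] - means[i - 1])
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index


# Usage: three bright gridded frames followed by three darker regular frames.
frames = [[[m, m], [m, m]] for m in (200.0, 198.0, 201.0, 120.0, 118.0, 121.0)]
start = first_frame_after_grid(frames)
regular_frames = frames[start:]
```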
Color balancing of the cameras can also be performed. Although software-based approaches to color balancing can be taken, the color balancing in the present work is done at the hardware level. Before the cameras are used for calibration, they are balanced using the white balancing technique. A single sheet of white paper is shown to both cameras, and the cameras are white balanced at that moment.
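For comparison with the hardware approach used in the present work, a software-based alternative can be sketched as follows. This is an illustrative white-patch method and not part of the disclosed system: the function names, the RGB tuple layout, and the choice of neutral gray target are assumptions.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def white_patch_gains(patch: Sequence[Pixel]) -> Tuple[float, ...]:
    """Per-channel gains that make the average colour of a white reference
    patch neutral (equal R, G and B).

    Assumes every channel has a non-zero mean over the patch.
    """
    n = len(patch)
    means = [sum(p[ch] for p in patch) / n for ch in range(3)]
    target = sum(means) / 3.0
    return tuple(target / m for m in means)


def apply_gains(pixels: Sequence[Pixel], gains: Sequence[float]) -> List[Tuple[float, ...]]:
    """Scale each channel by its gain, clipping to the 8-bit maximum."""
    return [tuple(min(255.0, p[ch] * gains[ch]) for ch in range(3)) for p in pixels]


# Usage: a camera that renders white paper with a bluish cast.
paper = [(200.0, 220.0, 240.0)] * 4
gains = white_patch_gains(paper)
balanced = apply_gains(paper, gains)
```

Computing the gains separately for each camera from the same sheet of paper would bring the two side views to a common color response before they are blended.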
The virtual video of the face can be adequate to support the communication of identity, mental state, gesture, and gaze direction. Some objective comparisons between the synthesized and real videos are reported below, plus a qualitative assessment.
The real video frames from the camcorder and the virtual video frames were normalized to the same size of 200×200 and compared using cross correlation and interpoint distances between salient face features. Five images that were considered for evaluation are shown in
1) Normalized cross-correlation: The cross correlation between regions of the virtual image and real image was computed for rectangular regions containing the eyes and mouth (
2) Euclidean distance measure: The differences in the normalized Euclidean distances between some of the most prominent feature points were computed. The feature points are chosen in such a way that one of them is relatively static with respect to the other. Among prominent feature points such as the corners of the eyes, the nose tip, and the corners of the mouth, the corners of the eyes are relatively static when compared with the corners of the mouth.
The results in Table II indicate small errors in the Euclidean distance measurements, of the order of 3 pixels in an image of size 200×200. The facial feature points in the five frames were selected manually, and hence the errors may also have been caused in part by the instability of manual selection. One can note that the error values of Dcf and Dcg are larger than the others. This is probably because the nose tip is not located as robustly as the eye corners.
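The two objective measures can be illustrated with the following sketch. It is not the evaluation code used for the reported results: the zero-mean form of the normalized cross-correlation, the flat pixel-list representation of a region, and the function names are assumptions made for the example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def normalized_cross_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean normalized cross-correlation of two equally sized regions,
    each given as a flat list of pixel intensities. Result lies in [-1, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("regions must be non-empty and of equal size")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a perfectly flat region has no defined correlation
    return sum(x * y for x, y in zip(da, db)) / denom


def distance_error(real_pair: Tuple[Point, Point], virtual_pair: Tuple[Point, Point]) -> float:
    """Absolute difference, in pixels, between the interpoint distance of a
    feature pair in the real frame and the same pair in the virtual frame."""
    return abs(math.dist(*real_pair) - math.dist(*virtual_pair))


# Usage: a region compared with a brightened copy of itself, and one feature
# pair whose separation differs by 3 pixels between the two frames.
region = [10.0, 20.0, 30.0, 40.0]
brighter = [2.0 * p + 3.0 for p in region]
score = normalized_cross_correlation(region, brighter)
error = distance_error(((50.0, 80.0), (150.0, 80.0)), ((50.0, 80.0), (147.0, 80.0)))
```

Because the correlation is normalized, a uniform change in brightness or contrast between the real and virtual frames does not lower the score; only differences in the spatial pattern of intensities do.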
A preliminary subjective study was also performed. In general, the quality of the videos was assessed as adequate to support the variety of intended applications. The two halves of all the videos are well synchronized and color balanced. The quality of the audio is good and it has been synchronized well with the lip movements. Some observed problems were distortion in the eyes and teeth and in some cases a cross-eyed appearance. The face appears slightly bulged compared with the real videos, which is probably due to the combined radial distortions of the camera and projector lenses.
Synchronization of the two videos is preferred in the application of the invention. Since two views of a face with lip movements are merged together, any small change in the synchronization has a large impact on the alignment of the lips. This synchronization was evaluated based on sensitive movements such as eyeball movements and blinking eyelids. Similarly, mouth movements were examined in the virtual videos. FIGS. 26 to 27 show some of these effects.
Analysis indicates that a real-time mobile system is feasible. The total computation time consists of (1) transferring the images into buffers, (2) warping by interpolating each of the grid blocks, and (3) linearly smoothing each output image. The average time is about 60 ms per frame using a 746 MHz computer. Less than 30 ms would be considered to be real-time: this can be achieved with a current computer with clock rate of 2.6 GHz. Some implementations can require more power to mosaic training data into the video to account for features occluded from the cameras.
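The real-time estimate above can be checked with simple arithmetic. The sketch below assumes that per-frame time scales inversely with processor clock rate, which is only a rough first-order approximation that ignores memory bandwidth and architectural differences; the function name is hypothetical.

```python
def scaled_frame_time_ms(measured_ms: float, measured_mhz: float, target_mhz: float) -> float:
    """Estimate per-frame time on a faster processor, assuming time is
    inversely proportional to clock rate (a simplifying assumption)."""
    return measured_ms * measured_mhz / target_mhz


# 60 ms per frame at 746 MHz, projected to a 2.6 GHz (2600 MHz) processor.
estimate = scaled_frame_time_ms(60.0, 746.0, 2600.0)
meets_real_time = estimate < 30.0
print(f"{estimate:.1f} ms per frame")  # prints "17.2 ms per frame"
```

Under this assumption the projected time of roughly 17 ms per frame falls within the 30 ms budget, leaving some margin for additional processing.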
It can be concluded that the algorithm being used can be made to work in real-time. The working prototype has been tested on a diverse set of seven individuals. From comparisons of the virtual videos with real videos, it is expected that important facial expressions will be represented adequately and not distorted by more than 2%. Thus, the HMD system implementing the image processing software of the present invention can support the intended telecommunication applications.
It is envisioned that calibration using a projected grid can be used with the algorithms described above. 3D texture-mapped face models can also be created by calibrating the cameras and projector in the WCS. 3D models present the opportunity for greater compression of the signal and for arbitrary frontal viewpoints, which are desired for virtual face-to-face collaboration. Although technically feasible, structured light projection is an obtrusive step in the process and may be cumbersome in the field. Thus, a generic mesh model of the face can also be employed.
Occlusion poses a problem in the blending of the two side images. Some facial surface points that should be displayed in the frontal image are not visible in the side images. For example, the two cameras cannot see the back of the mouth. It is envisioned that training data may be taken from a user and patched into the synthetic video, either for that user or for another, similar user. During training, the user can generate a basis for all possible future output material. The system can contain methods to index to the right material and blend it with the regular warped output. A related problem is that facial deformations that make significant alterations to the face surface may not be rendered well by the static warp. Examples are tongue thrusts and severe facial distortions. The static warp algorithm achieves good results for moderate facial distortion. It does not crash when severe cases are encountered, but the virtual video can show a discontinuity in important facial features.
Other embodiments of the present invention employ a 3D model as described below. The 3D modeling embodiments include one or more of the following: (a) a calibration method that does not depend upon structured light, (b) an output format that is a dynamic 3D model rather than just a 2D video, and (c) a real-time tracking method that identifies salient face points in the two side videos and updates both the 3D structure and the texture of the 3D model accordingly.
The 3D face model can be represented by a closed mesh of n points $(x_i, y_i, z_i)$, $i = 1, \ldots, n$, and a texture map. This model can be rendered rapidly by standard graphics software and displayed by standard graphics cards. The mesh point 3D coordinates are available for a generic face. Scaling and deformation transformations can be used to instantiate this model for an individual wearing the Face Capture Head Mounted Display Units (FCHMDs). The model can be viewed/rendered from a general viewpoint within the coverage of the cameras and not just from the central point in front of the face. Triangles of the mesh can be texture-mapped to the sensed images and to other stored face data that may be needed to fill in for unimaged patches.
The 3D face model can be instantiated to fit a specific individual by one or more of the following: (1) choosing special points by hand on a digital frontal and profile photo; (2) choosing special points from the two side video frames of a neutral expression taken from the FCHMD, and enabling the wearer to make adjustments while viewing the resulting rendered 3D model.
In some embodiments, standard rendering of the face model requires one or more of the following: (1) the set of triangles modeling the 3D geometry; (2) the two side images from the FCHMD; (3) a mapping of all vertices of each 3D triangle into the 2D coordinate space of one of the side images; (4) a viewpoint from which to view the 3D model; and (5) a lighting model that determines how the 3D model is illuminated.
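The five rendering inputs listed above can be gathered into one record before they are handed to a graphics library. The following is a minimal sketch of such a record with a consistency check. All class, field, and function names are hypothetical, and the lighting model is reduced to a single direction vector purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


@dataclass
class TexturedTriangle:
    vertex_ids: Tuple[int, int, int]  # indices into RenderInputs.points
    side: str                         # side image that textures it: "left" or "right"
    uv: Tuple[Vec2, Vec2, Vec2]       # pixel coordinates of each vertex in that image


@dataclass
class RenderInputs:
    points: List[Vec3]                       # (1) 3D mesh geometry
    image_sizes: Dict[str, Tuple[int, int]]  # (2) width, height of each side image
    triangles: List[TexturedTriangle]        # (1) and (3) triangles plus 2D mapping
    viewpoint: Vec3                          # (4) virtual camera position
    light_direction: Vec3                    # (5) simplified lighting model


def find_problems(inputs: RenderInputs) -> List[str]:
    """Return descriptions of inconsistent inputs (empty list if none)."""
    problems = []
    for t_index, tri in enumerate(inputs.triangles):
        if tri.side not in inputs.image_sizes:
            problems.append(f"triangle {t_index}: unknown side image {tri.side!r}")
            continue
        width, height = inputs.image_sizes[tri.side]
        for vid, (u, v) in zip(tri.vertex_ids, tri.uv):
            if not 0 <= vid < len(inputs.points):
                problems.append(f"triangle {t_index}: vertex id {vid} out of range")
            if not (0 <= u < width and 0 <= v < height):
                problems.append(f"triangle {t_index}: texture coordinate ({u}, {v}) outside image")
    return problems


example = RenderInputs(
    points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.2)],
    image_sizes={"left": (320, 240), "right": (320, 240)},
    triangles=[TexturedTriangle((0, 1, 2), "left", ((10.0, 20.0), (50.0, 20.0), (10.0, 60.0)))],
    viewpoint=(0.0, 0.0, 5.0),
    light_direction=(0.0, 0.0, -1.0),
)
print(find_problems(example))
```

Each triangle is mapped into exactly one side image, matching item (3) of the list. A triangle that neither camera can see would need a third texture source, such as stored training data.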
It is envisioned that users communicating with one another may each wear an FCHMD, and that the FCHMD can operate in a variety of ways. For example, side views of a first user's face can be transmitted to the second user's FCHMD, where they can be warped and blended to produce the 3D model, which is then rendered from a selected perspective to produce the output image. Also, the first user's FCHMD can warp and blend the side views to produce the 3D model, and transmit the 3D model to the second user's FCHMD where it can be rendered from a selected perspective to produce the output image. Further, the first user's FCHMD can warp and blend the side views, render the resulting 3D model from a selected perspective to produce the output image, and transmit the output image to the second user's FCHMD. Yet further, an external image processing module external to the FCHMDs can perform some or all of the steps necessary to produce the output image from the side views. Further still, this external image processing module can be remotely located on a communications network, rather than physically located at a location of one or more of the users. Accordingly, an FCHMD may be adapted to transmit to a remote location and/or receive from a remote location at least one of the following: (1) side view images; (2) user-specific scaling and deformation transformations; (3) position of a user's face in a common coordinate system of a collaborative, virtual environment; (4) a 3D model of a user's face; (5) a selection of a virtual point of view from which to render a user's face; and (6) an output image. Supplemental image data obtained from a particular user or from training users can also be transmitted or received, and can even be integrated into the generic mesh models ahead of time.
It should be readily understood that the FCHMD does not have to transmit or receive one or more of each of the types of data listed above. For example, it is possible that an FCHMD may only transmit and receive output images. It is also possible that an FCHMD may transmit and receive only two data types, including output images together with position of a user's face in a common coordinate system of a collaborative, virtual environment. It is further possible that an FCHMD will transmit and receive only side view images. It is still further possible that an FCHMD will transmit and receive only two data types, including side view images, together with position of a user's face in a common coordinate system of a collaborative, virtual environment. It is yet further possible that an FCHMD will transmit and receive only 3D models of users' faces. It is still yet further possible that an FCHMD will transmit and receive only two data types, including 3D models of users' faces, together with position of a user's face in a common coordinate system of a collaborative, virtual environment. In the cases where 3D models or side view images are transmitted and received, it may be the case that user-specific scaling and deformation transformations are transmitted and received at some point, perhaps during an initialization of collaboration. It is additionally possible that one FCHMD can do most or all of the work for both FCHMDs, and receive side view images and face position data for a first user while transmitting output images or a 3D model for a second user. Accordingly, all of these embodiments and others that will be readily apparent to those skilled in the art are described above.
During operation, the FCHMD optics/electronics of some embodiments can sense in real time the real expressive face of the wearer from the two side videos, and the software can create in real time an active 3D face model to be transmitted to remote collaborators.
The morphable model is trained for dynamic use on a population of users. A diverse set of training users may wear the FCHMD and follow a script that induces a variety of facial expressions, while frontal video is also recorded. This training set can support salient point tracking and also the substitution of real data for viewpoints that cannot be observed by the side cameras (inside the mouth, for example). Moreover, the training videos can record sequences of articulator movements that can be used during online FCHMD use.
Let S be a set of shape vectors composed of the face surface points, and let T be the corresponding set of texture vectors.
Any face $(S_p, T_p)$ in the population can be represented as $S_p = \sum_{j=1}^{M} a_j S_j$ and $T_p = \sum_{j=1}^{M} a_j T_j$, with $\sum_{j=1}^{M} a_j = 1$ and $\sum_{j=1}^{M} b_j = 1$. The parameters $a_j$, $b_j$ represent the face $p$ in terms of the training faces and the new illumination conditions and possibly slight variation in the camera view.
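The weighted sums above can be illustrated with a small numerical sketch. The helper below forms a combination of training vectors and enforces the sum-to-one constraint on whatever weight vector it is given. The function name and the toy vectors are hypothetical. Using a separate weight vector b for the texture is an assumption of this sketch; passing the shape weights a for the texture as well gives the other reading of the formula.

```python
def combine(vectors, weights, tol=1e-9):
    """Weighted sum of equal-length vectors; the weights must sum to one."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per training vector")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


# Toy training set with M = 2 faces. Each shape vector stacks the (x, y, z)
# coordinates of two mesh points; each texture vector holds two gray levels.
shapes = [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]]
textures = [[100.0, 120.0],
            [200.0, 80.0]]

a = [0.25, 0.75]  # shape weights
b = [0.5, 0.5]    # texture weights (assumed to be separate from a)

shape_p = combine(shapes, a)
texture_p = combine(textures, b)
print(shape_p)    # [0.0, 0.0, 0.0, 2.5, 0.0, 0.0]
print(texture_p)  # [150.0, 100.0]
```

Because the weights sum to one, the combined face stays within the span of the training faces, so a new face is described by M weights rather than by every mesh coordinate.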
Salient feature points can be tracked to update the transformation tables dynamically and thereby achieve a dynamic model. The parameters of the model aj, bj can be dynamically fit by optimizing the similarity between a model rendered using these parameters and the observed images.
The FCHMD can be calibrated by finding the optimal fit between a parameterized model and the video data currently observed on the FCHMD. Once this fit is known, locations of the salient mesh points (Xk, Yk, Zk) are known and thus a texture map is defined between the 3D mesh and the 2D images for that instant of time (current expression). Since iterative hill-climbing is used for the fitting procedure, it is expected that either an intelligent initial guess or some hand selection will be needed to initialize the fitting. A fully automatic procedure can be initialized from an average wearer's face determined from the training data. The control software for the FCHMD can have a backup procedure in which the HMD wearer initializes the fitting manually by viewing the video images and selecting salient face points.
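The fitting procedure can be illustrated with a toy hill-climbing sketch. It is not the fitting code of the described system: the "rendering" is reduced to a weighted sum of two hypothetical training vectors, the similarity measure is a sum of squared differences, and the search is a simple step-halving coordinate search. All names and values are illustrative assumptions.

```python
def hill_climb(cost, start, step=0.25, min_step=1e-6):
    """Greedy coordinate search: try +/- step on each parameter and
    halve the step whenever no trial lowers the cost."""
    params = list(start)
    best = cost(params)
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                trial_cost = cost(trial)
                if trial_cost < best:
                    params, best, improved = trial, trial_cost, True
        if not improved:
            step /= 2.0
    return params, best


# Two hypothetical training vectors; the observation is a known blend of them.
S1 = [0.0, 1.0, 2.0]
S2 = [4.0, 1.0, 0.0]
observed = [0.3 * s1 + 0.7 * s2 for s1, s2 in zip(S1, S2)]


def render(params):
    # One free weight t for S1; S2 receives 1 - t so the weights sum to one.
    t = params[0]
    return [t * s1 + (1.0 - t) * s2 for s1, s2 in zip(S1, S2)]


def cost(params):
    # Dissimilarity between the rendered model and the observation.
    return sum((m - o) ** 2 for m, o in zip(render(params), observed))


# Start from equal weights, a stand-in for the "average face" initialization.
fit, error = hill_climb(cost, [0.5])
print(f"recovered weight: {fit[0]:.4f}, residual: {error:.2e}")
```

In this toy problem the cost has a single minimum, so any starting point converges. With real images the cost surface can have many local minima, which is why the text calls for a good initial guess or hand-selected points.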
The description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the invention. Such variations are not to be regarded as a departure from the spirit and scope of the invention.