FIELD OF THE INVENTION
The present invention relates to the field of video superposition devices, and more particularly to multiple image source windowed display generation systems.
BACKGROUND OF THE INVENTION
A video superposition system known as “chroma keying” employs a foreground image which is separated from an actual background by detection of a background screen chrominance value. Thus, for example, a person is presented in front of a blue screen. A video processing circuit detects the chrominance level, producing a signal when the key color is detected. This color is generally a deep blue, for two reasons. First, this color is generally uncommon in natural foreground scenes, so that artifacts are minimized. Second, this color represents an extreme, so that a single ended comparator may be used to produce the key signal.
When the key signal occurs, a video source switches a synchronized (genlocked) background video signal to the output. Thus, where the key level in the foreground is not detected, the foreground is output, while where the key color is detected, the background signal is output. This technology is well established, and many variations and modifications exist. U.S. Pat. Nos. 4,200,890 and 4,409,618 relate to digital video effects systems employing a chroma key tracking technique. U.S. Pat. No. 4,319,266 relates to a chroma keying system. U.S. Pat. No. 5,251,016 relates to a chroma keyer with secondary hue selector for reduced artifacts. U.S. Pat. No. 5,313,275 relates to a chroma processor including a look-up table or memory, permitting chroma key operation. U.S. Pat. No. 5,398,075 relates to the use of analog chroma key technology in a computer graphics environment. U.S. Pat. No. 5,469,536 relates to an image editing system including masking capability, which employs a computerized hue analysis of the image to separate a foreground object from the background.
Computer generated graphics are well known, as are live video windows within computer graphics screens. U.S. Pat. No. 3,899,848 relates to the use of a chroma key system for generating animated graphics. U.S. Pat. No. 5,384,912 relates to a computer animated graphics system employing a chroma key superposition technique. U.S. Pat. No. 5,345,313 relates to an image editing system for taking a background and inserting part of an image therein, relying on image analysis of the foreground image. U.S. Pat. No. 5,394,517 relates to a virtual reality, integrated real and virtual environment display system employing chroma key technology to merge the two environments.
A number of spatial position sensor types are known. These include electromagnetic, acoustic, infrared, optical, gyroscopic, accelerometer, electromechanical, and other types. In particular, systems are available from Polhemus and Ascension which accurately measure position and orientation over large areas, using electromagnetic fields.
Rangefinder systems are known, which allow the determination of a distance to an object. Known systems include optical focus zone, optical parallax, infrared, and acoustic methods. Also known are non-contact depth mapping systems which determine a depth profile of an object without physical contact with a surface of the object.
U.S. Pat. No. 5,521,373 relates to a position tracking system having a position sensitive radiation detector. U.S. Pat. No. 4,988,981 relates to a glove-type computer input device. U.S. Pat. No. 5,227,985 relates to a computer vision system for position monitoring in three dimensions using non-coplanar light sources attached to a monitored object. U.S. Pat. No. 5,423,554 relates to a virtual reality game method and apparatus employing image chroma analysis for tracking a colored glove as an input to a computer system.
U.S. Pat. No. 5,502,482 relates to a system for deriving a studio camera position and motion from the camera image by image analysis. U.S. Pat. No. 5,513,129 relates to a method and system for controlling a computer-generated virtual environment with audio signals.
SUMMARY OF THE INVENTION
The present invention employs a live video source, a background image source, a mask region generator and an overlay device which merges the foreground with the background image based on the output of the mask region generator. Two classes of mask region generators are provided: first, an “in-band” system, which acquires the necessary mask region boundaries based on the foreground image acquisition system, and second, an “out-of-band” system, which provides a separate sensory input to determine the mask region boundary.
A preferred embodiment of the “in-band” system is a rangefinder system which operates through the video camera system, to distinguish the foreground object in the live video source from its native background based on differences in distance from the camera lens. Thus, rather than relying on an analysis of the image per se to extract the foreground object, this preferred embodiment of the system defines the boundary of the object through its focal plane or parallax.
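By way of illustration only, the distance-based separation described above might be sketched as follows. The sketch assumes that a per-pixel depth map (a list of rows of distances) is already available from the rangefinder. The names depth_mask and apply_mask and the near and far limits are hypothetical and do not appear in the disclosure.

```python
# Illustrative sketch only. Images are lists of rows; depth values share
# whatever unit the rangefinder reports. All names here are assumptions.

def depth_mask(depth_map, near, far):
    """Return True where the measured distance lies in the foreground band."""
    return [[near <= d <= far for d in row] for row in depth_map]


def apply_mask(foreground, background, mask):
    """Take the foreground pixel where the mask is True, else the background."""
    return [
        [f if m else b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]
```

In this sketch the boundary of the foreground object follows from distance alone, so no analysis of the image content itself is required.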
A preferred embodiment of the “out-of-band” system includes an absolute position and orientation sensor physically associated with the foreground object with a predetermined relationship of the sensor to the desired portion of the foreground object. Thus, where the foreground object is a person, the sensor may be an electromagnetic position sensor mounted centrally on top of the head with the mask region defined by an oval boundary below and in front of the position and orientation sensor.
In a preferred embodiment, the foreground image is a portrait of a person, while the background image is a computer generated image of a figure. A position sensor tracks a head position in the portrait, which is used to estimate a facial area. The image of the facial area is then merged in an anatomically appropriate fashion with the background figure.
The background image is, for example, an animated “character”, with a masked facial portion. The live video signal in this case includes, as the foreground image, a face, with the face generally having a defined spatial relation to the position sensor. The masked region of the character is generated, based on the output of the position sensor in an appropriate position, so that the face may be superimposed within the masked region. As seen in the resulting composite video image, the live image of the face is presented within a mask of an animated character, presenting a suitable foundation for a consumer entertainment system. The mask may obscure portions of the face, as desired. Manual inputs or secondary position sensors for the arms or legs of the individual may be used as further control inputs, allowing the user to both control the computer generated animation and to become a part of the resultant image. This system may therefore be incorporated into larger virtual reality systems to allow an increased level of interaction, while minimizing the need for specialized environments.
In practice, it is generally desired to mask a margin of the face so that no portion of the background appears in a composite image. Thus, the actual video background is completely obscured and irrelevant. In order to produce an aesthetically pleasing and natural appearing result, the region around the face is preferably provided with an image which appears as a mask. Thus, the background image may appear as a masked character, with the foreground image as a video image of a face within the mask region. The mask region may be independent of the video image data, or developed based on an image processing algorithm of the video image data. In the latter case, where processing latencies are substantial, the composite output may be initially provided as a video image data independent mask which is modified over time, when the image is relatively static, for greater correspondence with the actual image. Thus, such a progressive rendering system will allow operation on platforms having various available processing power for image processing, while yielding acceptable results on systems having a low amount of available processing power.
It is not always possible to adjust the size and placement of an image mask for each user of the system. Thus, the preferred embodiment provides a background image which is tolerant of misalignments and misadjustments of the video image with the background image. In the case of a masked character background image, this tolerance includes providing an edge portion of the mask which merges appropriately with a variety of types of facial images, e.g., men, women, children, and possibly pet animals.
Because the system is not limited to a chroma key superposition system, the information from the position sensor and the video camera allow simple extraction of the image of an individual's face in a more generalized computer graphic image, based on an estimate of its position. Thus, multiple individuals may be presented in a single graphic image, each interacting with his or her environment or with each other. While these individuals may be present in the same environment, for example, within the field of view of a single video camera, this ability to build complex images from multiple inputs allows individuals at remote locations within a computer network to interact while viewing each other's faces. Therefore, personalized multiplayer “video” games become possible. This same technology may also have uses outside the fields of entertainment, including communications and video conferencing. This personalized representation separated from its native background also forms the basis for a new style of multi-user graphic interface system.
In implementation, the position estimation system preferably acts as an input to a computer generated animation as the background image figure. In one set of embodiments, the generation of the resulting combined image is performed through a chroma key system. Therefore, in such systems, the background figure image is provided with a key color in a facial region or desired superposed video image region(s) of the figure. In contrast to typical applications of chroma key technology, the chroma key appears in the presented background, with the live video image overlaid in the chroma key region. Of course, chroma key technology is not the only method for combining the various image information, and in fact the process may be performed digitally in a computerized system.
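The reversed keying arrangement described above, performed digitally, might be sketched as follows. This is a minimal illustration under stated assumptions: pixels are (R, G, B) tuples, the key color is matched exactly, and the names KEY and reverse_key_composite are hypothetical. A practical keyer would tolerate a range of chrominance values rather than a single exact color.

```python
# Illustrative sketch only. The key color value and the exact-match test
# are assumptions made for brevity.

KEY = (0, 0, 255)  # assumed key color painted into the character's face region


def reverse_key_composite(character, live_video, key=KEY):
    """Show the live video wherever the generated character carries the key color."""
    return [
        [live if ch == key else ch for ch, live in zip(ch_row, live_row)]
        for ch_row, live_row in zip(character, live_video)
    ]
```

Here the key color lies in the generated character image rather than in the camera scene, so the live face appears only inside the keyed region, whatever the actual camera background may be.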
In one embodiment, the position sensor defines a predefined window, which is translated around the video space. Where further refinement is desired, the orientation and distance of the foreground object from the video camera may be compensated. The shape of the window may be a regular shape or an outline of a foreground image. Thus, with an image of a person as the foreground image, the image may be initially processed to determine the shape. Thereafter, the shape may be translated, resized or otherwise transformed as the window. In this latter case, the shape of the window may be periodically redetermined, but need not be recalculated in real time.
In a particularly preferred embodiment, the live video image is an image of a person having a face, with a position sensor mounted on top of a set of headphones. An oval mask region is defined with respect to the position sensor, so that the position of the face within the video image is predicted to be within the oval mask region. The position sensor also serves as an input to a computer animation graphic image generator, which generates an animated body in appropriate position and orientation for combining with the face. Further position sensors may be provided on the arms of the person, as further inputs to the computer animation graphic image generator, allowing further control over the resulting image. The computer animation graphic image includes a chroma key portion in a region intended for the facial image. The live video image is then merged with the computer animation graphic image and presented as a composite.
The position tracking system is, for example, an Ascension position tracking system mounted centrally on a bridging portion on a set of headphones, worn by a person. The person is present within the image of a video camera, and the system is calibrated to locate the position tracking system unit within the field of view of the video camera. The face of the person is estimated to be within an oval area approximately 10 inches down and 8 inches wide below the position tracking system sensor, when the person is facing the camera. Since the preferred position tracking sensor senses movement in six degrees of freedom, the window is altered to correspond to the expected area of presentation of the face in the image. The border between the live video image of the face and the animated character need not be presented as a simple oval region, and may include images which overlay the face, as well as complex boundaries.
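The oval window described above might be computed as in the following sketch. It reads the stated dimensions as an oval 8 inches wide extending 10 inches downward from the sensor, and it assumes the sensor position has already been calibrated to pixel coordinates, that image rows increase downward, and that the person faces the camera without tilt. The function name, the pixels_per_inch scale and the parameter names are hypothetical.

```python
# Illustrative sketch only. Compensation for orientation and distance
# (the remaining degrees of freedom) is deliberately left out.

def oval_face_mask(width, height, sensor_xy, pixels_per_inch,
                   oval_w_in=8.0, oval_h_in=10.0):
    """Return a height x width grid that is True inside the estimated face oval."""
    sx, sy = sensor_xy
    a = oval_w_in * pixels_per_inch / 2.0   # horizontal semi-axis in pixels
    b = oval_h_in * pixels_per_inch / 2.0   # vertical semi-axis in pixels
    cx, cy = sx, sy + b                     # oval hangs directly below the sensor
    return [
        [((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0 for x in range(width)]
        for y in range(height)
    ]
```

As the sensor moves, only sensor_xy changes, so the window is translated around the video space without any analysis of the image itself.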
By employing a separate position tracking sensor, the preferred embodiment avoids the need for sophisticated image analysis, thereby allowing relatively simple and available components. Further, known types of position sensors also provide orientation information which may be useful for providing control inputs into the background image generation system and also to control the position and shape of the mask region to compensate for skew, profile, tilting and other degrees of freedom of the object. The computer generated animated image responds to the position tracking sensor as an input, allowing the animation to track the movements of the person.
While one preferred embodiment employs an Ascension tracking system, which, while of high quality, is also expensive, another preferred embodiment employs an acoustic sensor to determine location in three or more dimensions. This system is similar to the known “power glove” accessory for video games. Other types of position sensors may also be used.
Thus, the present invention avoids the need for a defined background for a foreground image when electronically superimposing video images by providing a position sensor to provide information for determining a location of the desired foreground image in a foreground video stream. The position sensor thus minimizes the need for analysis of the foreground image stream, allowing relatively simple merging of the video streams.
In systems where the facial image is captured and electronically processed, rather than genlocked and superimposed, the use of the position sensor to define a mask region in the video image substantially reduces the computational complexity required to extract a facial portion from a video image, especially as compared to a typical digital image processing system. As noted above, the margin of the face need not be determined with high precision in many instances, and therefore the background image which is generated to surround the facial image may be provided to include a degree of tolerance to this imprecision, such as a wide edge margin and avoidance of structures which should be precisely aligned with facial features. Where the image is to be transmitted over a computer network, and where the facial portion of the image is the most important component of the image, the use of the present system allows transmission of the masked portion of the image only, reducing the amount of information which must be transmitted and thus compressing the image data.
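Transmission of only the masked portion of the image, as described above, might be sketched as follows. The sketch assumes a frame and a Boolean mask of the same dimensions, each a list of rows. The function names and the use of a fill value for pixels outside the mask are hypothetical choices made for illustration.

```python
# Illustrative sketch only. Names and the (box, patch) return shape are
# assumptions, not part of the disclosure.

def mask_bounding_box(mask):
    """Return (x0, y0, x1, y1) enclosing all True cells, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, m in enumerate(row) if m]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def crop_for_transmission(frame, mask, fill=None):
    """Return the bounding box and only the pixels inside it; unmasked pixels become fill."""
    box = mask_bounding_box(mask)
    if box is None:
        return None, []
    x0, y0, x1, y1 = box
    patch = [
        [frame[y][x] if mask[y][x] else fill for x in range(x0, x1 + 1)]
        for y in range(y0, y1 + 1)
    ]
    return box, patch
```

The receiving system would need only the box and the patch to place the face within its own locally generated background.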
A known paradigm for user interaction with computers is the “Avatar”, a computer generated representation of a user, which is generally wholly animated. These Avatars may be transmitted through a computer network system, for example the Internet, to allow a user to interact with a graphical environment of a system. According to the present invention, these Avatars need not be completely computer generated, and may therefore include a real time video image of a face. This system therefore allows, with reduced computational requirements and limited bandwidth requirements, the personalization of Avatars. Thus, the present invention provides a new type of graphical user interface in which a user is represented as an actual image within a computer graphic space. Multiple users may therefore interact while viewing actual images of each other, even where the users are separated over nodes of a computer network.
As stated above, an in-band mask region determining system may operate based on the foreground video input device. Thus, the position sensing system need not include physically separate hardware. Likewise, the video signal superposition system need not be an external chroma key superposition system, and may be integrated with the animation generation system.
In a first image analysis embodiment, an outline of a major foreground object is determined, and the outline used to define a mask. In a second image analysis embodiment, the foreground object is irradiated with an unobtrusive radiation, e.g., infrared, which is detected by a special video camera or optical sensors. Thus, the infrared contrast of the foreground image defines the foreground object, and a corresponding mask provided. In a third embodiment, an optical transmitter, e.g., one or more LEDs, preferably including a blue LED, is mounted on the headphones, visible to the video camera. The presence of an illuminated spot is detected, and a mask defined in relation to the position of the spot. If distance and orientation information are desired, a plurality of LEDs may be mounted, in a configuration sufficient to allow estimation of position and orientation. Thus, it can be seen that the position detecting system may operate through the video feed without requiring rigorous image analysis, which often cannot be performed in or near real time.
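Detection of an illuminated spot in the third embodiment above might be sketched as follows, assuming a single blue LED and frames given as rows of (R, G, B) tuples. The threshold values, the function names and the placement of the window with the spot at its top center are all hypothetical.

```python
# Illustrative sketch only. Thresholds are arbitrary assumptions; a real
# system would be calibrated to the LED and camera actually used.

def find_led_spot(frame, min_blue=220, max_other=120):
    """Return the (x, y) centroid of bright, blue-dominant pixels, or None."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, (r, g, b) in enumerate(row):
            if b >= min_blue and r <= max_other and g <= max_other:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def window_below_spot(spot, width, height):
    """Return (x0, y0, x1, y1) for a window whose top center is the spot."""
    cx, cy = spot
    return (cx - width / 2.0, cy, cx + width / 2.0, cy + height)
```

A simple threshold and centroid of this kind involves far less computation than outlining the foreground object, which is the point made above about avoiding rigorous image analysis.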
The image resulting from the present system and method may be presented on a video monitor, transmitted over a video network for rendering at a remote site, or stored on a video storage medium, such as video tape. In the latter case, the opportunities for complex background generation become apparent. Where the image is not simply transient, a higher level of detail in the background image may be preferred, because the stored image may be reviewed a number of times. Further, since the background is computer generated, it need not be constant. Thus, for example, the foreground image and control signals, e.g., position and orientation signals, may be stored on a CD-ROM, with the background image generated in real time on reproduction of the images on a computer system. Since the video image and the control parameters are stored, the reproduced image sequence need not be fixed, and may therefore vary based on a set of background parameters.
An alternative set of embodiments provides different processing systems to capture the facial image for presentation. For example, the location of the image of the face may be identified, with the facial image texture mapped onto a computer generated image. Thus, the boundary between the foreground image and background image need not be a discrete edge, and the present invention therefore allows a more subtle merging of the images.
The location of the foreground image need not be determined with a typical position sensor, and other systems may be used. Advantageously, the focal plane of the foreground object, e.g., person, differs from the background. In this case, the boundary of the foreground object may be determined by detecting a focal plane boundary. This technique offers two advantages. First, it allows redundant use of the focus control system found in many video camera systems, eliminating the need for a separate position sensing system. Second, it allows imaging of irregular-shaped objects, e.g., a person wearing a hat, without being limited by a predefined mask shape.
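The disclosure does not specify how a focal plane boundary is detected. Purely as an assumed stand-in, the following sketch marks pixels as in focus where a discrete Laplacian of a grayscale image is large, on the premise that sharply focused regions carry more fine detail than defocused ones. The function name and the threshold are hypothetical, and uniformly colored in-focus areas would not be detected by so simple a test.

```python
# Illustrative sketch only. A Laplacian sharpness test is one assumed way
# to approximate "in focus"; it is not taken from the disclosure.

def in_focus_mask(gray, threshold):
    """Return True where the 4-neighbor Laplacian magnitude meets the threshold."""
    h, w = len(gray), len(gray[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels are left False
        for x in range(1, w - 1):
            lap = (4 * gray[y][x] - gray[y - 1][x] - gray[y + 1][x]
                   - gray[y][x - 1] - gray[y][x + 1])
            mask[y][x] = abs(lap) >= threshold
    return mask
```

A mask derived in this way follows the shape of whatever lies in the focal plane, consistent with the second advantage noted above concerning irregular-shaped objects.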
A rangefinder system may be used to obtain a depth map of a face in real-time, with the resulting data used as control parameters for a computer generated character's face. This rangefinder information allows use of facial expression as a control input, while reducing the need for strict image analysis. This depth information may also be employed to assist in texture mapping the video image information on the background. Likewise, other objects or images may be tracked and used as control inputs.
It is noted that, while many embodiments according to the present invention employ a computer generated graphic image, the background image need not be computer generated. Thus, the background image may also represent a video feed signal. In one embodiment, the background image is a video image of a robot or computer automated mechanical structure, which, e.g., responds to the position and orientation inputs from the foreground input to provide coordination. The merging of foreground and background images in this case may be through the use of typical chroma key technology.
It is thus an object of the present invention to provide a graphic image system comprising a source of a first signal representing a first image including a moving human subject having a head with a face; an image position estimating system for estimating the instantaneous position of said head of said human subject; a source of a second signal representing a second image including a character having a head with a mask outline; and means, responsive to said position estimating system and to said first and second signals, for dynamically defining an estimated boundary of said face of said human subject in said first image and for merging the face in said first image, as limited by said estimated boundary, with the second image within the mask outline.
It is also an object of the invention to provide a video system comprising a video input, receiving a video signal representing an image having a movable foreground object; a position tracking system for tracking a position of said movable foreground object; and means, responsive to said position tracking system, for dynamically defining an estimated boundary of said movable foreground object in said image.
It is a further object to provide a video system having a source of background video image and a video superposition control for superposing the foreground object of said image within said estimated boundary on said background video image.
It is a still further object of the invention to provide a video system wherein said background video image is responsive to said position tracking system.
It is another object of the invention to provide a video superposition control having a chroma key video superposition unit. The background video image preferably comprises a computer generated animated image stream.
According to various objects of the invention, the position tracking system may be a radio frequency field sensor, an electro-acoustic transducer or an optical position sensing system. The position tracking system may have various degrees of freedom, for example two, three or six. The position tracking system may include a physical transducer mounted on the foreground object.
According to the present invention, the mask or estimated boundary may be geometric in shape, for example oval, round or having multiple discontinuous geometric segments.
The position tracking system produces a position, and optionally an orientation, of the foreground object within the field of view of the video camera.
According to another object of the invention, the normal relation of foreground and background signals in a chroma key video superposition unit is reversed, with the “background” image having a region of chroma activating the keying circuit, to allow a foreground object to be presented. The location of the region is determined by the position tracking system.
It is another object according to the present invention to provide a video system comprising, in combination: a video camera producing a first video signal defining a first image including a foreground object and a background; a position tracking system for identifying a position with respect to said foreground object, said foreground object having features in constant physical relation to said position; and a computer, responsive to said position tracking system, for defining a mask region separating said foreground object from said background. The computer preferably generates a second video signal including a portion corresponding to said mask region, wherein an image in the second video signal is responsive to said position tracking system. The mask region differentiates image regions which are processed differently. The mask region preferably is employed in a chroma key system to form a composite image of the first and second images, although an out-of-band signal may be used to define the mask region.
According to another object of the invention, the position tracking sensor determines a position and orientation of the foreground object, and is used to control a size, shape and position of the mask region.
These and other objects and features of the present invention will become more fully apparent from the following description and appended claims taken in conjunction with the accompanying drawings, in which like numerals refer to like parts.