WO1996031047A2 - Immersive video - Google Patents

Immersive video

Info

Publication number
WO1996031047A2
WO1996031047A2 (PCT/US1996/004400)
Authority
WO
WIPO (PCT)
Prior art keywords
scene
video
viewer
dimensional
real
Prior art date
Application number
PCT/US1996/004400
Other languages
French (fr)
Other versions
WO1996031047A3 (en)
Inventor
Ramesh Jain
Koji Wakimoto
Saied Moezzi
Arun Katkere
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/414,437 (US5729471A)
Application filed by The Regents Of The University Of California filed Critical The Regents Of The University Of California
Priority to AU53802/96A (AU5380296A)
Publication of WO1996031047A2
Publication of WO1996031047A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/139 Format conversion, e.g. of frame-rate or size
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/156 Mixing image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/194 Transmission of image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • H04N 13/243 Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • H04N 13/246 Calibration of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/275 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N 13/279 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals the virtual viewpoint locations being selected by the viewers or determined by tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/296 Synchronisation thereof; Control thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/2625 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects for obtaining an image which is composed of images from a temporal image sequence, e.g. for a stroboscopic effect
    • H04N 5/2627 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects for obtaining an image which is composed of images from a temporal image sequence, e.g. for a stroboscopic effect for providing spin image effect, 3D stop motion effect or temporal freeze effect
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 2219/00 Program-control systems
    • G05B 2219/30 Nc systems
    • G05B 2219/32 Operator till task planning
    • G05B 2219/32014 Augmented reality assists operator in maintenance, repair, programming, assembly, use of head mounted display with 2-D 3-D display and voice feedback, voice and gesture command
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10021 Stereoscopic video; Stereoscopic image sequence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/167 Synchronising or controlling image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/189 Recording image signals; Reproducing recorded image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/257 Colour aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/286 Image signal generators having separate monoscopic and stereoscopic modes
    • H04N 13/289 Switching between monoscopic and stereoscopic modes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/334 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] using spectral multiplexing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/337 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] using polarisation multiplexing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/341 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] using temporal multiplexing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/344 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30 Image reproducers
    • H04N 13/363 Image reproducers using image projection screens
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 2013/0074 Stereoscopic image analysis
    • H04N 2013/0081 Depth or disparity estimation from stereoscopic image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 2013/0074 Stereoscopic image analysis
    • H04N 2013/0085 Motion estimation from stereoscopic image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 2013/0074 Stereoscopic image analysis
    • H04N 2013/0092 Image segmentation from stereoscopic image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 2013/0074 Stereoscopic image analysis
    • H04N 2013/0096 Synchronisation or controlling aspects

Definitions

  • The present invention generally concerns (i) multimedia, (ii) video, including video-on-demand and interactive video, and (iii) television, including television-on-demand and interactive television.
  • The present invention particularly concerns automated dynamic selection of one video camera/image from multiple real video cameras/images in accordance with a particular perspective, an object in the scene, or an event in the video scene.
  • the present invention also particularly concerns the synthesis of diverse spatially and temporally coherent and consistent virtual video cameras, and virtual video images, from multiple real world video images that are obtained by multiple real video cameras.
  • The present invention still further concerns the creation of three-dimensional video image databases, and the location and dynamic tracking of video images of selected objects depicted in the databases for, among other purposes, the selection of a real camera or image, or the synthesis of a virtual camera or image, best showing the selected object.
  • the present invention still further concerns (i) interactive synthesis of video, or television, images of a real- world scene on demand, (ii) the synthesis of virtual video images of a real-world scene in real time, or virtual television, (iii) the synthesis of virtual video images/virtual television pictures of a real-world scene which video images/virtual television are linked to any of a particular perspective on the video/television scene, an object in the video/television scene, or an event in the video/television scene, (iv) the synthesis of virtual video images/virtual television pictures of a real-world scene wherein the pictures are so synthesized to user-specified parameters of presentation, e.g. panoramic, or at magnified scale if so desired by the user, and (v) the synthesis of 3-D stereoscopic virtual video images/virtual television.
  • Video presentation of a real-world scene in accordance with the present invention will be seen to be interactive with both (i) a viewer of the scene and, in the case of a selected dynamically moving object, or an event, in the scene, (ii) the scene itself.
  • True interactive video or television is thus presented to a viewer.
  • the image presented to the viewer will be seen to be a virtual image that is not mandated to correspond to any real world camera nor to any real world camera image.
  • A viewer may thus view a video or television of a real-world scene from a vantage point (i.e., a perspective on the video scene), and/or dynamically in response to objects moving in the scene and/or events transpiring in the scene, in a manner that is not possible in reality.
  • the viewer may, for example, view the scene from a point in the air above the scene, or from the vantage point of an object in the scene, where no real camera exists or even, in some cases, can exist.
  • Multiple Perspective Interactive (MPI) video supports the editing of, and viewer interaction with, video and television in a manner that is useful in viewing activities ranging from education to entertainment.
  • In conventional video, viewers are substantially passive; all they can do is control the flow of video by pressing buttons such as play, pause, fast forward or fast reverse.
  • The related invention of MPI video makes considerable progress -- even by use of currently existing technology -- towards “liberating” video and TV from the traditional single-source, broadcast, model, and towards placing each viewer in his or her own “director's seat”.
  • a three-dimensional (3-D) video model, or database, is used in MPI video
  • the immersive video and immersive telepresence systems of the present invention preserve, expand, and build upon this 3-D model
  • This three-dimensional model, and the functions that it performs, are well and completely understood, and will be completely taught within this specification.
  • The considerable computational power required if a full custom virtual video image for each viewer is to be synthesized in real time and on demand requires that the model be constructed and maintained in consideration of (i) powerful organizing principles, (ii) efficient algorithms, and (iii) effective and judicious simplifying assumptions. This, then, and more, is what the present invention will be seen to concern.
  • Scene-interactive video and television is nothing so grandiose as permitting a user/viewer to interact with the objects and/or events of a scene -- as will be seen to be the subject of the present and related inventions. Rather, the interaction with the scene is simply that of a machine -- a computer -- that must recognize, classify and, normally, adapt its responses to what it "sees" in the scene.
  • Scene-interactive video and television is thus simply an extension of machine vision so as to permit a computer to make decisions, sound alarms, etc., based on what it detects in, and detects to be transpiring in, a video scene.
  • Two classic problems in this area are (i) security cameras, which must detect contraband, and (ii) an autonomous computer-guided automated battlefield tank, which must sense and respond to its environment.
  • The present invention functions oppositely. It “defines the world”, or at least so much of the world as is “on stage” and in view of (each of) multiple video cameras.
  • The video and television systems of the present invention have at their command a plethora of correlatable and correlated, simultaneous, positional information. Once it is known where each of multiple cameras is, and is pointing, it is a straightforward matter for computer processes to fix, and to track, items in the scene.
  • The systems, including the MPI-video subsystem, of the present invention will be seen to perform co-ordinate transformation of (video) image data (i.e., pixels), and to do this during the generation of two- and three-dimensional image databases.
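  • As a hedged illustration only (not the patent's own algorithm), the sketch below shows the kind of co-ordinate transformation involved: projecting a 3-D world point into pixel coordinates with a pinhole-camera model, and back-projecting a pixel onto an assumed ground plane. The calibration matrices K, R and t are assumed inputs.

```python
import numpy as np

def project_to_pixel(X_world, K, R, t):
    """Map a 3-D world point to 2-D pixel coordinates (pinhole model)."""
    X_cam = R @ X_world + t                 # world frame -> camera frame
    u, v, w = K @ X_cam                     # camera frame -> image plane
    return np.array([u / w, v / w])         # perspective divide

def backproject_to_ground(pixel, K, R, t, ground_z=0.0):
    """Intersect a pixel's viewing ray with the plane z = ground_z."""
    ray_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    ray_world = R.T @ ray_cam               # ray direction in world frame
    cam_center = -R.T @ t                   # camera position in world frame
    s = (ground_z - cam_center[2]) / ray_world[2]
    return cam_center + s * ray_world       # 3-D point on the ground plane
```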
  • the present invention of immersive video will be seen to involve the manipulation, processing and compositing of video data in order to synthesize video images.
  • Video compositing is the amalgamation of video data from separate video streams.
  • The previous process of so doing is called “video mosaicing”.
  • The underlying task in video mosaicing is to create larger images from frames obtained from one or more single cameras, typically one single camera producing a panning video stream.
  • To generate seamless video mosaics, registration and alignment of the frames in a sequence are critical issues.
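  • As a minimal, hedged sketch of such frame registration (the specification does not prescribe this method), two frames can be aligned with matched ORB features and a RANSAC homography using OpenCV; the function name and parameters below are illustrative assumptions.

```python
import cv2
import numpy as np

def mosaic_pair(base, frame):
    """Warp `frame` into the coordinate system of `base` and overlay it."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(base, None)
    k2, d2 = orb.detectAndCompute(frame, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d2, d1)
    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)     # frame -> base
    h, w = base.shape[:2]
    canvas = cv2.warpPerspective(frame, H, (2 * w, h))        # room to grow
    canvas[0:h, 0:w] = base                                   # crude overlay
    return canvas
```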
  • the immersive video system of the present invention will be seen to use its several streams of 2D video data to build and maintain a 3D video database
  • The utility of such a 3D database in the synthesis of virtual video images seems clear. For example, an arbitrary planar view of the scene will contain the data of a 2D planar slice "through" the 3D database.
  • The immersive video system of the present invention will so show that -- (i) certain scene constraints being made, (ii) certain simplifying assumptions regarding scene objects and object dynamical motions being made, and (iii) certain computational efficiencies in the manipulations of video data being embraced -- it is indeed possible, and even practical, to so synthesize useful and aesthetically pleasing video, and even television, images.
  • Machine Dynamic Selection of One Video Camera/Image of a Scene from Multiple Video Cameras/Images of the Scene in Accordance with a Particular Perspective on the Scene, an Object in the Scene, or an Event in the Scene: The present invention contemplates machine dynamic selection of one video camera/image of a scene from multiple video cameras/images of the scene in accordance with a particular perspective on the scene, an object in the scene, or an event in the scene.
  • the present invention thus contemplates making each and any viewer of a video or a television scene to be his or her own proactive editor of the scene, having the ability to interactively dictate and select -- in advance of the unfolding of the scene, and by high-level command -- a particular perspective by which the scene will be depicted, as and when the scene unfolds.
  • The viewer can command the selection of real, or -- in advanced embodiments of the invention -- even the synthesis of virtual, video images of the scene in response to any of his or her desired and selected (i) spatial perspective on the scene, (ii) static or dynamically moving object appearing in the scene, or (iii) event depicted in the scene.
  • the viewer -- any viewer -- is accordingly considerably more powerful than even the broadcast video editor of, for example, a live sporting event circa 1995
  • The viewer is accorded the ability to (i) select in advance a preferred video perspective of view as optionally may be related to dynamic object movements and/or to events unfolding in the scene, and even, as the ultimate extension of the invention, (ii) to synthesize video views where no real video camera even exists.
  • Multiple Perspective Interactive (MPI) Video forms the core of the Immersive Video discussed hereinafter in section 3.
  • A viewer of an American football game on video or on television can command a consistent "best" view of (i) one particular player, or, alternatively, (ii) the football itself as will be, from time to time, handled by many players.
  • the system receives and processes multiple video views (images) generally of the football field, the football and the players within the game.
  • the system classifies, tags and tracks objects in the scene, including static objects such as field markers, and dynamically moving objects such as the football and the football players.
  • the system will consistently, dynamically, select and present a single "best" view of the selected object (for example, the football, or a particular player) . This will require, and the system will automatically accomplish, a "handing off” from one camera to another camera as different ones of multiple cameras best serve to image over time the selected object.
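  • One plausible (purely illustrative) way to automate such a "hand-off" is to score, at every frame, how directly each calibrated camera looks at the tracked object and to switch only when another camera is clearly better; the camera records and hysteresis margin below are assumptions.

```python
import numpy as np

def best_camera(object_pos, cameras, current=None, hysteresis=0.1):
    """Pick the camera whose optical axis points most directly at the object.

    `cameras`: list of dicts with 'pos' (3-vector), 'axis' (unit 3-vector)
    and 'half_fov' (radians). Hysteresis avoids rapid back-and-forth switching.
    """
    scores = []
    for cam in cameras:
        to_obj = object_pos - cam["pos"]
        to_obj = to_obj / np.linalg.norm(to_obj)
        cos_angle = float(np.clip(np.dot(to_obj, cam["axis"]), -1.0, 1.0))
        visible = np.arccos(cos_angle) < cam["half_fov"]
        scores.append(cos_angle if visible else -np.inf)
    best = int(np.argmax(scores))
    if current is not None and scores[current] >= scores[best] - hysteresis:
        return current        # keep the current camera unless clearly worse
    return best               # otherwise hand off to the better camera
```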
  • the viewer can ask to be shown a synthesized video view, such as from a perspective constantly positioned behind a certain offensive running back, where no real video camera actually exists.
  • the system of the invention is powerful (i) in accepting viewer specification at a high level of those particular objects and/or events in the scene that the user/viewer desires to be shown, and (ii) to subsequently identify and track all user/viewer-selected objects and events (and still others for other users/viewers) in the scene.
  • the system of the present invention can also, based on its scene knowledge database, serve to answer questions about the scene .
  • the system of the present invention can replay events in the scene from the same perspective, or from selected new perspectives, depending upon the desires of the user/viewer. It is not necessary for the user/viewer to "find" the best and proper image; the system performs this function. For example, if the user/viewer wants to see how player number twenty (#20) came to make an interception in the football game, then he or she could order a replay of the entire down focused on player number twenty (#20) .
  • the user/viewer can generate commands like: "replay for me at 1/2 speed the event of the fumble as shown from a straight overhead view” .
  • Such commands are honored by the system of the present invention even though no real video camera may, in actuality, exist at this precise overhead location.
  • the present invention contemplates selecting real, or -- in advanced embodiments -- synthesizing virtual, video/television images of a scene from multiple real video/television images of the scene, particularly so as to select or to synthesize video/television images that are linked to any such (i) spatial perspective (s) on the scene, (ii) object (s) in the scene, or (iii) event (s) in the scene, as are selectively desired by a user/viewer to be shown.
  • The method of the invention is directed to presenting to a user/viewer a particular, viewer-selected, two-dimensional video image of a real-world, three-dimensional, scene.
  • Multiple video cameras, each at a different spatial location, produce multiple two-dimensional images of the real-world scene, each at a different spatial perspective.
  • Objects of interest in the scene are identified and classified in these two-dimensional images.
  • These multiple two-dimensional images of the scene, and their accompanying object information, are then combined in a computer into a three-dimensional video database, or model, of the scene.
  • The model of a football game knows, for example, that the game is played upon a football field replete with static, fixed-position, field yard lines and hash mark markings, as well as of the existence of the dynamic objects of play.
  • The model is, it will be seen, not too hard to construct so long as there are, or are made to be, sufficient points of reference in the imaged scene. It is, conversely, almost impossible to construct the 3-D model, and select or synthesize the chosen image, of an amorphous scene, such as the depths of the open ocean.
  • the computer also receives from a prospective user/viewer of the scene a user/viewer-specified criterion relative to which criterion the user/viewer wishes to view the scene.
  • From (i) the 3-D model and (ii) the criterion, the computer produces a particular two-dimensional image of the scene that is in accordance with the user/viewer-specified criterion. This particular two-dimensional image of the real-world scene is then displayed on a video display to the user/viewer.
  • The computer may ultimately produce, and the display may finally show, only such a particular two-dimensional image of the scene -- in accordance with the user/viewer-specified criterion -- as was originally one of the images of the real-world scene that was directly imaged by one of the multiple video cameras.
  • This is, indeed, the way the rudimentary embodiment of the invention taught and shown herein functions.
  • this automatic camera selection may seem unimpressive
  • The appropriate, selected, scene image may change over time in accordance with just what is imaged, and in what location(s), by which camera(s), and in accordance with just what transpires in the scene.
  • The evolving contents of the scene, as the scene is imaged by the multiple cameras and as it is automatically interpreted by the computer, determine just what image of the scene is shown at any one time, and just what sequence of images is shown from time to time, to the user/viewer. Action in the scene "feeds back" on how the scene is shown to the viewer.
  • The computer is not limited to selecting from the three-dimensional model a two-dimensional image that is, or that corresponds to, any of the images of the real-world scene as was imaged by any of the multiple video cameras. Instead, the computer may synthesize from the three-dimensional model a completely new two-dimensional image that is without exact equivalence to any of the images of the real-world scene as have been imaged by any of the multiple video cameras.
  • The user/viewer-specified criterion may be of a particular spatial perspective relative to which the user/viewer wishes to view the scene.
  • This spatial perspective need not be immutably fixed, but can instead be linked to a dynamic object in the scene.
  • the computer produces from the three-dimensional model a particular two-dimensional image of the scene that is in best accordance with some particular spatial perspective criterion that has been received from the viewer.
  • the particular two-dimensional image of the scene that is generated and displayed may, or may not, be, or be equivalent to, any real image of the scene as was obtained by any of the video cameras.
  • The scene image shown may be a virtual image. Even if the image shown is a real image, the computer will still automatically select, and the display will still display, over time, those actual images of the scene as are imaged, over time, by different ones of the multiple video cameras. Automated scene switching, especially in relation to dynamic objects in the scene, is not known to the inventors to exist in the prior art.
  • the user/viewer-specified criterion may be of a particular object in the scene.
  • the computer will combine the images from the multiple video cameras not only so as to generate a three-dimensional video model of the scene, but so as to generate a model in which objects in the scene are identified.
  • the computer will subsequently produce, and the display will subsequently show, the particular image -- whether real or virtual -- appropriate to best show the selected object.
  • this is a feedback loop: the location of an object in the scene serves to influence, in accordance with a user/viewer selection of the object, how the scene is shown.
  • the same video scene could be, if desired, shown over and over, each time focusing view on a different selected object in the scene.
  • the selected object may either be static, and unmoving, or dynamic, and moving, in the scene. Regardless of whether the object in the scene is static or dynamic, it is preferably specified to the system by the user/viewer by act of positioning a cursor on the video display.
  • The cursor is a special type that unambiguously specifies an object in the scene by an association between the object position and the cursor position in three dimensions, and is thus called "a three-dimensional cursor".
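  • A hedged sketch of how such a three-dimensional cursor might resolve the selected object: cast the viewing ray through the cursor's screen position into the model and pick the tracked object whose 3-D centroid lies nearest that ray. The object records and calibration inputs are assumptions for illustration.

```python
import numpy as np

def pick_object(cursor_px, K, R, t, objects, max_dist=1.0):
    """Return the tracked object closest to the viewing ray through the
    2-D cursor position, or None if nothing is within `max_dist` metres."""
    ray = R.T @ (np.linalg.inv(K) @ np.array([cursor_px[0], cursor_px[1], 1.0]))
    ray = ray / np.linalg.norm(ray)
    origin = -R.T @ t                      # camera centre in world coordinates
    best, best_d = None, max_dist
    for obj in objects:                    # e.g. {"id": "player_20", "pos": np.array([...])}
        v = obj["pos"] - origin
        d = np.linalg.norm(v - np.dot(v, ray) * ray)   # point-to-ray distance
        if d < best_d:
            best, best_d = obj, d
    return best
```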
  • the criterion specified by the user/viewer may be of a particular event in the scene.
  • the computer will again combine the images from the multiple video cameras not only so as to generate a three-dimensional video model of the scene, but so as to generate a model in which one or more dynamically occurring event (s) in the scene are recognized and identified.
  • the computer will subsequently produce, and the display will show, a particular image -- whether real or virtual -- that is appropriate to best show the selected event.
  • the location of an event in the scene influences, in accordance with a viewer selection of the event, how the scene is shown.
  • the method of the invention may be performed in real time as interactive television.
  • the television scene will be presented to a user/viewer interactively in accordance with the user/viewer-specified criterion.
  • the present invention still further contemplates telepresence and immersive video, being the non-real-time creation of a synthesized, virtual, camera/video image of a real-world scene, typically in accordance with one or more viewing criteria that are chosen by a viewer of the scene.
  • Immersive video, or telepresence, or visual reality (VisR) is an extension of Multiple Perspective Interactive (MPI) video.
  • the creation of the virtual image is based on a computerized video processing -- in a process called hypermosaicing -- of multiple video views of the scene, each from a different spatial perspective on the scene.
  • The present invention is embodied in a method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer.
  • the method includes (i) capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene, (ii) creating from the captured video a full three-dimensional model of the scene, and (iii) producing, or synthesizing, from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene.
  • This method is thus called “immersive telepresence” because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his or her desires. Namely, it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene. Notably, the viewer-desired perspective on the scene, and the video representation synthesized in accordance with this viewer-desired perspective, need not be in accordance with any of the video captured from any scene perspective.
  • The video representation can be in accordance with the position and direction of the viewer's eyes and head, and can exhibit "motional parallax". "Motional parallax" is normally and conventionally defined as a three-dimensional effect where different views on the scene are produced as the viewer moves position, making the viewer's brain comprehend that the viewed scene is three-dimensional. Motional parallax is observable even if the viewer has but one eye.
  • The video representation can be stereoscopic.
  • “Stereoscopy” is normally and conventionally defined as a three-dimensional effect where each of the viewer's two eyes sees a slightly different view on the scene, thus making the viewer's brain comprehend that the viewed scene is three-dimensional. Stereoscopy is detectable even should the viewer not move his or her head or eyes in spatial position, as is required for motional parallax.
  • The present invention is embodied in a method of telepresence where, again, video of a real-world scene is obtained from a multiplicity of different spatial perspectives on the scene. Again, a full three-dimensional model of the scene is created from the captured video. From this three-dimensional model a video representation on the scene that is in accordance with a predetermined criterion -- selected from among criteria including a perspective on the scene, an object in the scene and an event in the scene -- is produced, or synthesized.
  • This embodiment of the invention is thus called “interactive telepresence" because the presentation to the viewer is interactive in accordance with the criterion.
  • the synthesized video presentation of the scene in accordance with the criterion need not be, and normally is not, equivalent to any of the video captured from any scene perspective.
  • the video representation can be in accordance with a criterion selected by the viewer, thus viewer-interactive telepresence
  • the presentation can be in accordance with the position and direction of the viewer's eyes and head, and will thus exhibit motional parallax; and/or the presentation can exhibit stereoscopy.
  • An immersive video, or telepresence, system serves to synthesize and to present diverse video images of a real-world scene in accordance with a predetermined criterion or criteria.
  • The criterion or criteria of presentation is (are) normally specified by, and may be changed at times and from time to time by, a viewer/user of the system. Because the criterion (criteria) is (are) changeable, the system is viewer/user-interactive, presenting (primarily) those particular video images (of a real-world scene) that the viewer/user desires to see.
  • the immersive video system includes a knowledge database containing information about the scene.
  • Existence of this "knowledge database" immediately means that something about the scene is both (i) fixed and (ii) known; for example that the scene is of "a football stadium", or of "a stage", or even, despite the considerable randomness of waves, of "a surface of an ocean that lies generally in a level plane".
  • the antithesis of a real-world scene upon which the immersive video system of the present invention may successfully operate is a scene of windswept foliage in a deep jungle.
  • the knowledge database may contain, for example, data regarding any of (i) the geometry of the real-world scene, (ii) potential shapes of objects in the real-world scene, (iii) dynamic behaviors of objects in the real-world scene, (iv) an internal camera calibration model, and/or (v) an external camera calibration model.
  • The knowledge base of an American football game would be something to the effect that (i) the game is played essentially in a thick plane lying flat upon the surface of the earth, this plane being marked with both (yard) lines and hash marks; (ii) humans appear in the scene, substantially at ground level; (iii) a football moves in the thick plane both in association with (e.g., running plays) and detached from (e.g., passing and kicking plays) the humans; and (iv) the locations of each of several video cameras on the football game are a priori known, or are determined by geometrical analysis of the video view received from each.
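  • The knowledge database could be organized, for illustration, as a simple structured record whose fields mirror items (i)-(v) above; the field and key names below are assumptions, not terms from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict

import numpy as np

@dataclass
class KnowledgeDatabase:
    # (i) geometry of the real-world scene, e.g. the field plane and markings
    scene_geometry: Dict[str, np.ndarray] = field(default_factory=dict)
    # (ii) potential shapes of objects expected in the scene
    object_shapes: Dict[str, np.ndarray] = field(default_factory=dict)
    # (iii) dynamic behaviours, e.g. maximum speed of a player or the ball
    dynamic_behaviors: Dict[str, dict] = field(default_factory=dict)
    # (iv) internal camera calibration (focal length, principal point) per camera
    internal_calibration: Dict[str, np.ndarray] = field(default_factory=dict)
    # (v) external camera calibration (rotation, translation) per camera
    external_calibration: Dict[str, dict] = field(default_factory=dict)

# Example with assumed values: the field is the plane z = 0, and players
# are expected to stay near ground level at bounded speeds.
kdb = KnowledgeDatabase()
kdb.scene_geometry["field_plane"] = np.array([0.0, 0.0, 1.0, 0.0])
kdb.dynamic_behaviors["player"] = {"max_speed_mps": 10.0, "stays_on_ground": True}
```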
  • The system further includes multiple video cameras, each at a different spatial location. Each of these multiple video cameras serves to produce a two-dimensional video image of the real-world scene at a different spatial perspective.
  • Each of these multiple cameras can typically change the direction from which it observes the scene, and can typically pan and zoom, but, at least in the more rudimentary versions of the immersive video system, remains fixed in location.
  • A classic example of multiple stationary video cameras on a real-world scene is provided by the cameras at a sporting event, for example at an American football game.
  • the system also includes a viewer/user interface
  • a prospective viewer/user of the scene uses this interface to specify a criterion, or several criteria, relative to which he or she wishes to view the scene
  • This viewer/user interface may commonly be anything from head gear mounted to a boom, to a computer joystick, to a simple keyboard.
  • the viewer/user who establishes (and re-establishes) the criterion (criteria) by which an image on the scene is synthesized is the final consumer of the video images so synthesized and presented by the system.
  • the control input (s) arising at the viewer/user interface typically arise from a human video sports director (in the case of an athletic event) , from a human stage director (in the case of a stage play) , or even from a computer (performing the function of a sports director or stage director) .
  • the viewing desires of the ultimate viewer/user may sometimes be translated to the immersive video system through an intermediary agent that may be either animate or inanimate.
  • the immersive video system includes a computer running a software program.
  • This computer receives the multiple two- dimensional video images of the scene from the multiple video cameras, and also the viewer-specified criterion (criteria) from the viewer interface.
  • The typical computer functioning in an immersive video system is fairly powerful. It is typically an engineering work station class computer, or several such computers that are linked together if video must be processed in real time -- i.e., as television. Especially if the immersive video is real time -- i.e., as television -- then some or all of the computers normally incorporate hardware graphics accelerators, a well-known but expensive part for this class of computer.
  • The computer(s) and other hardware elements of an immersive video system are both general purpose and conventional but are, at the present time (circa 1995), typically "state-of-the-art", and of considerable cost ranging to tens, and even hundreds, of thousands of American dollars.
  • The system computer includes (in software and/or in hardware) (i) a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, (ii) an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene, within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, (iii) a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model, in order to produce parameters of perspective on the scene, and (iv) a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene.
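  • Read as a software architecture, functions (i)-(iv) suggest a per-frame pipeline along the lines of the hedged skeleton below; the class and method names are invented for illustration and are not taken from the specification.

```python
class VideoDataAnalyzer:
    def analyze(self, frames):
        """(i) Detect and track objects of potential interest; return
        per-camera detections with their image locations."""
        raise NotImplementedError

class EnvironmentalModelBuilder:
    def update(self, detections, knowledge_db):
        """(ii) Fuse multi-camera detections into a dynamic 3-D environmental
        model recording each object and its instant spatial location."""
        raise NotImplementedError

class ViewerCriterionInterpreter:
    def interpret(self, criterion, model):
        """(iii) Correlate a viewer criterion (perspective, object or event)
        with the model and return parameters of perspective on the scene."""
        raise NotImplementedError

class Visualizer:
    def render(self, model, view_params):
        """(iv) Generate the particular 2-D image of the scene from the 3-D
        model in accordance with the parameters of perspective."""
        raise NotImplementedError

# One frame-time of the loop, under the assumptions above:
#   detections = analyzer.analyze(frames)
#   model      = builder.update(detections, knowledge_db)
#   params     = interpreter.interpret(viewer_criterion, model)
#   image      = visualizer.render(model, params)
```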
  • the computer function (i) -- the video data analyzer -- is a machine vision function.
  • the function can presently be performed quite well and quickly, especially if (i) specialized video digitalizing hardware is used, and/or (ii) simplifying assumptions about the scene objects are made. Primarily because of the scene model builder next discussed, abundant simplifying assumptions are both well and easily made in the immersive video system of the present invention. For example, it is assumed that, in a video scene of an American football game, the players remain essentially in and upon the thick plane of the football field, and do not "fly" into the airspace above the field.
  • an immersive video system in accordance with the present invention not yet having been discussed, it is somewhat premature to explain how a scene object that is not in accordance with the model may suffer degradation in presentation. More particularly, the scene model is not overly particular as to what appears within the scene, but it is particular as to where within (the volume of) the scene an object to be modeled appears.
  • The immersive video system can fully handle a scene-intrusive object that is not in accordance with prior simplifications -- for example, a spectator or many spectators or a dog or even an elephant walking onto a football field during or after a football game -- and can process these unexpected objects, and object movements, quite as well as any other.
  • A parachutist parachuting into a football stadium may not be "well-modeled" by the system when he/she is high above the field, and outside the thick plane, but will be modeled quite well when finally near, or on, ground level.
  • "quite well" it is meant that, while tne immersive video system will readily permit a viewer to examine, for example, the dentation of the quarterback if ne or sne is interested in staring tne quarteroack "in the teetn” , it is very difficult for the system lespecially initially, and in real time as television , to process through a discordant scene occurrence, such as the stadium parachutist, so well so as to permit the examination of his or her teeth also wnen the parachutist is still many meters above the
  • The computer function (ii) -- the environmental model builder -- is likely the "backbone" of the present invention. It incorporates important assumptions that, while scene specific, are generally of a common nature throughout all scenes that are of interest for viewing with the present invention.
  • The environmental model is three-dimensional (3-D), with both (i) static and (ii) dynamic components.
  • the scene environmental model is not the scene image, nor the scene images rendered three-dimensionally.
  • the current scene image such as of the play action on a football field, may be, and typically is, considerably smaller than the scene environmental model which may be, for example, the entire football stadium and the objects and actors expected to be present therein.
  • the static "underlayment” or “background” of any scene is pre-processed into the three-dimensional video database.
  • The video model of an (empty) sports stadium -- the field, field markings, goal posts, stands, etc. -- is pre-processed (as the environmental model) into the three-dimensional video database.
  • The dynamic elements in the scene -- i.e., the players, the officials, the football and the like -- need be, and are, dealt with.
  • the typically greater portion of any scene that is (at any one time) static is neither processed nor re-processed from moment to moment, and from frame to frame. It need not be so processed or re-processed because nothing has changed, nor is changing.
  • the static background is not inflexible, and may be a "rolling" static background based on the past history of elements within the video scene.
  • Dynamic objects in the scene -- which objects typically appear only in a minority of the scene (e.g., the football players) but may appear in the entire scene (e.g., the crowd) -- are preferably processed in two ways. If the computer recognition and classification algorithm can recognize -- in consideration of a priori model knowledge of items appearing in the scene such as the football, and the football players -- an item in the scene, then that item will be isolated, and will be processed/re-processed into the three-dimensional video database as a multiple voxel representation.
  • A voxel is a three-dimensional pixel.
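  • For illustration (one possible reading, not the specification's prescribed method), a recognized dynamic object could be given a multiple-voxel representation by carving a small occupancy grid against the object's silhouettes in several calibrated views; the `project` helper and silhouette masks are assumed inputs.

```python
import numpy as np

def carve_voxels(bounds_min, bounds_max, n, cameras, silhouettes, project):
    """Mark a voxel occupied only if it projects inside the object's
    silhouette in every camera. `project(point, cam)` -> (u, v) pixel."""
    xs = np.linspace(bounds_min[0], bounds_max[0], n)
    ys = np.linspace(bounds_min[1], bounds_max[1], n)
    zs = np.linspace(bounds_min[2], bounds_max[2], n)
    occupied = np.zeros((n, n, n), dtype=bool)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            for k, z in enumerate(zs):
                point = np.array([x, y, z])
                inside_all = True
                for cam, mask in zip(cameras, silhouettes):
                    u, v = project(point, cam)
                    u, v = int(round(u)), int(round(v))
                    if not (0 <= v < mask.shape[0] and 0 <= u < mask.shape[1]
                            and mask[v, u]):
                        inside_all = False
                        break
                occupied[i, j, k] = inside_all
    return occupied     # True where the object's visual hull is occupied
```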
  • Other dynamic elements of the scene that cannot be classified or isolated into the three-dimensional environmental model are swept up into the three-dimensional video database mostly in their raw, two-dimensional, video data form.
  • Such a dynamic, but un-isolated, video element could be, for example, the movement of a crowd doing a "wave" motion at a sports stadium, or the surface of the sea.
  • The system and method do not truly know, of course, whether they are inserting into the instant three-dimensional video database, in accordance with the scene environmental model, an instant video image of a football quarterback taking a drink, or an instant video image of a football fan taking the same drink. Moreover, dynamic objects can both enter (e.g., as in coming onto the imaged field of play) and exit (e.g., as in leaving the imaged field of play) the scene.
  • The system and method of the present invention for constructing a 3-D video scene deal only with (i) the scene environmental model, and (ii) the mathematics of the pixel dynamics. What must be recognized is that, in so doing, the system and method serve to discriminate between and among raw video image data in processing such image data into the three-dimensional video database.
  • The system and method of the present invention find it hard to discriminate, and hard to process for entrance into the three-dimensional database, a three-dimensional scene object (or actor) where there was no previous scene object (or actor).
  • It is hard, for example, for the system of the present invention to classify and to process the throw and the thrower into the three-dimensional database so completely that the facial features of the thrower may -- either upon an "instant replay" of the scene focusing on the area of the perpetrator, or for that rare viewer who had been focusing his view to watch the crowd instead of the athletes all along -- immediately be recognized. (If the original raw video data streams still exist, then it is always possible to process them better.)
  • The system includes a video display that receives the particular two-dimensional video image of the scene from the computer, and that displays this particular two-dimensional video image of the real-world scene to the viewer/user as that particular view of the scene which is in satisfaction of the viewer/user-specified criterion (criteria).
  • That a viewer/user of an immersive video system in accordance with the present invention may view the scene from any static or dynamic viewpoint -- regardless that a real camera/video does not exist at the chosen viewpoint -- only but starts to describe the experience of immersive video.
  • any video image (s) can be generated.
  • The immersive video image(s) that is (are) actually displayed to the viewer/user are ultimately, in one sense, a function of the display devices, or the arrayed display devices -- i.e., the television(s) or monitor(s) -- that are available for the viewer/user to view.
  • Any "planar” view on the scene may be derived as the information which is present on any (straight or curved) plane (or other closed surface, such as a saddle) that is "cut" through the three-dimensional model of the scene.
  • This "planar" surface may, of course, be positioned anywhere within the three-dimensional volume of the scene model.
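  • As a hedged sketch, such a "planar" view can be sampled from a voxelized scene model by stepping along two in-plane axes and reading the nearest voxel; the regular integer-indexed grid and the sampling step are assumptions for illustration.

```python
import numpy as np

def planar_slice(volume, origin, u_axis, v_axis, size=256, step=1.0):
    """Sample a size x size image on the plane through `origin` spanned by
    the unit vectors `u_axis` and `v_axis`, from voxel grid `volume`."""
    image = np.zeros((size, size), dtype=volume.dtype)
    for i in range(size):
        for j in range(size):
            p = origin + (i - size // 2) * step * u_axis \
                       + (j - size // 2) * step * v_axis
            idx = np.round(p).astype(int)
            if np.all(idx >= 0) and np.all(idx < np.array(volume.shape)):
                image[i, j] = volume[idx[0], idx[1], idx[2]]   # nearest voxel
    return image
```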
  • Video views may be presented in any aspect ratio, and in any geometric form that is supported by the particular video display, or arrayed video displays (e.g., televisions and video projectors), by which the video imagery is presented to the viewer/user.
  • A cylindrical, hemispherical, or spherical panoramic view of a video scene may be generated from any point inside or outside the cylinder, hemisphere, or sphere.
  • Successive views on the scene may appear as the scene is circumnavigated from a position outside the scene.
  • An observer at the video horizon of the scene will look into the scene as if through a window, with the scene in plan view, or, if foreshortened, as if viewing the interior surface of a cylinder or a sphere from a peephole in the surface of the cylinder or sphere.
  • the viewer/user could view the game in progress as if he or she "walked” at ground level, or even as if he or she "flew at low altitude” , around or across the field
  • a much more unusual panoramic cylindrical, or spherical "surround” view of the scene may be generated from a point inside the scene.
  • the scene can be made to appear -- especially when the display presentation is made so as to surround the user as do the four walls of a room or as does the dome of a planetarium -- to completely encompass the viewer.
  • The viewer/user could view the game in progress as if he or she were a player "inside" the game, even to the extent of looking "outward" at the stadium spectators.
  • the immersive video of the present invention should be modestly interesting.
  • two-dimensional screen views of three-dimensional real world scenes suffer in realism because of subtle systematic dimensional distortion.
  • the surface of the two-dimensional display screen (e.g., a television)
  • CCD: Charge Coupled Device
  • the immersive video system of the present invention straightens all this out, exactly matching (in dedicated embodiments) the image presented to the particular screen upon which the image is so presented.
  • The immersive video system of the present invention is using the image of one (or more) cameras to "correct" the presentation (not the imaging, the presentation) of an image derived (actually synthesized in part) from another camera!
  • Immersive video in accordance with the present invention permits machine dynamic generation of views on a scene.
  • Images of a real-world scene may be linked at the discretion of the viewer to any of a particular perspective on the scene, an object in the scene, or an event in the scene.
  • A viewer/user may interactively choose to view a field goal attempt from the location of the goalpost crossbars (a perspective on the scene), watching a successful place kick sail overhead.
  • The viewer/user may choose to have the football (an object in the scene) centered in a field of view that is 90° to the field of play (i.e., a perfect "sideline seat") at all times.
  • The viewer/user may choose to view the scene from the position of the left shoulder of the defensive center linebacker unless the football is launched airborne (as a pass) (an event in the scene) from the offensive quarterback, in which case presentation reverts to broad angle aerial coverage of the secondary defensive backs.
  • the present and related inventions serve to make each and any viewer of a video or a television depicting a real-world scene to be his or her own proactive editor of the scene, having the ability to interactively dictate and select -- in advance of the unfolding of the scene, and by high-level command -- any reasonable parameter or perspective by which the scene will be depicted, as and when the scene unfolds.
  • Scene views are constantly generated by reference to the content of a dynamic three-dimensional model -- which model is sort of a three-dimensional video memory without the storage requirement of a one-to-one correspondence between voxels (solid pixels) and memory storage addresses. Therefore it is "no effort at all" for an immersive video system to present, as a selected stream of video data containing a selected view, first scan time video data and second scan time video data that is displaced, each relative to the other, in accordance with the location of each object depicted along the line of view.
  • The immersive video of the present invention, with its superior knowledge of the three-dimensional spatial positions of all objects in a scene, excels in such stereoscopic presentations (which stereoscopic presentations are, alas, impossible to show on the two-dimensional pages of the drawings).
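  • Because the model supplies a depth for every object along the line of view, a stereoscopic pair can be approximated, for illustration only, by shifting each pixel horizontally by a disparity inversely proportional to its depth; the per-pixel depth map, eye baseline and focal length below are assumed values, and a fuller implementation would instead render the 3-D model twice from two eye positions.

```python
import numpy as np

def stereo_pair(image, depth, baseline_m=0.065, focal_px=800.0):
    """Synthesize rough left/right eye views from one image plus per-pixel
    depth in metres. Disparity (pixels) = focal * baseline / depth."""
    h, w = depth.shape
    disparity = (focal_px * baseline_m / np.maximum(depth, 1e-3)).astype(int)
    left, right = np.zeros_like(image), np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x] // 2
            if 0 <= x + d < w:
                left[y, x + d] = image[y, x]     # shift half-disparity right
            if 0 <= x - d < w:
                right[y, x - d] = image[y, x]    # shift half-disparity left
    return left, right
```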
  • the immersive video presentations of the present invention are clearly susceptible of combination with the objects, characters and environments of artificial reality.
  • Computer models and techniques for the generation and presentation of artificial reality commonly involve three- dimensional organization and processing, even if only for tracing light rays for both perspective and illumination.
  • the central, "cartoon”, characters and objects are often "finely wrought", and commonly appear visually pleasing. Alas, equal attention cannot be paid to each and every element of a scene, and the scene background to the focus characters and objects is often either stark, or unrealistic, or both.
  • Immersive video in accordance with the present invention provides the vast, relatively inexpensive, "database” of the real world (at all scales, time compressions/expansions, etc.) as a suitable “field of operation” (or “playground”) for the characters of virtual reality.
  • immersive video permits viewer/user interactive viewing of a scene
  • A viewer/user may "move" in and through a scene in response to what he/she "sees" in a composite scene of both a real, and an artificial virtual, nature. It is therefore possible, for example, to interactively flee from a "dinosaur" (a virtual animal) appearing in the scene of a real world city. It is therefore possible, for example, to strike a virtual "baseball" (a virtual object) appearing in the scene of a real world baseball park. It is therefore possible, for example, to watch a "tiger", or a "human actor" (both real) appearing in the scene of a virtual landscape (which landscape has been laid out in consideration of the movements of the tiger or the actor).
  • (i) Visual reality and (ii) virtual reality can, in accordance with the present invention, be combined with (1) a synthesis of real/virtual video images/television pictures of a combination real-world/virtual scene wherein the synthesized pictures are to user-specified parameters of presentation, e.g., panoramic or at magnification if so desired by the user, and/or (2) the synthesis of said real/virtual video images/television pictures can be 3-D stereoscopic.
  • The present invention assumes, and uses, a three-dimensional model of the (i) static, and (ii) dynamic, environment of a real-world scene -- a three-dimensional, environmental, model.
  • Portions of each of multiple video streams showing a single scene, each from a different spatial perspective, that are identified to be (then, at the instant) static by a running comparison are "warped” onto the three-dimensional environmental model
  • This "warping" may be into 2-D (static) representations within the 3-D model -- e.g., a football field as is permanently static or even a football bench as is only normally static -- or, alternatively, as a reconstructed 3-D (static) object -- e.g., the goal posts.
  • each video stream that arises from a particular perspective
  • the dynamic part of each video stream is likewise "warped" onto the three- dimensional environmental model
  • the "warping" of dynamic objects is into reconstructed three-dimensional (dynamic) objects -- e.g., a football player. This is for the simple reason that dynamic objects in the scene are of primary interest, and it is they that will later likely be important in synthesized views of the scene.
  • the "warping" of a dynamic object may also be into a two-dimensional representation -- e.g., the stadium crowd producing a wave motion.
  • Simple changes in video data determine whether an object is (then) static or dynamic.
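  • A minimal sketch of such a test, offered as an illustration only and not as the particular method of the invention, is per-pixel differencing of successive frames against a threshold; the names classify_dynamic, diff_threshold and min_pixels below are hypothetical:

```python
import numpy as np

def classify_dynamic(prev_frame: np.ndarray, curr_frame: np.ndarray,
                     diff_threshold: int = 25, min_pixels: int = 50) -> np.ndarray:
    """Return a boolean mask of pixels judged dynamic (changed) between scans.

    A region is treated as static when too few of its pixels change between
    successive frames; otherwise it is treated as dynamic.
    """
    # Absolute per-pixel difference between successive grey-level frames.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > diff_threshold
    # Suppress isolated noise: keep the mask only if enough pixels moved.
    return moving if moving.sum() >= min_pixels else np.zeros_like(moving)

# Example: two synthetic 8-bit frames in which a small block has moved.
prev = np.zeros((120, 160), dtype=np.uint8)
curr = prev.copy()
curr[40:60, 70:90] = 200          # a "player" appears in the second frame
mask = classify_dynamic(prev, curr)
print("dynamic pixels:", int(mask.sum()))
```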
  • the environmental model itself determines whether any scene portion or scene object is to be warped onto itself as a two- dimensional representation or as a reconstructed three- dimensional object
  • the reasons no attempt is made to reconstruct everything in three dimensions are twofold. First, video data is lacking to model everything in and about the scene in three dimensions -- e.g., the underside of the field or the back of the crowd are not within any video stream. Second, and more importantly, there is insufficient computational power to reconstruct a three-dimensional video representation of everything that is within a scene, especially in real time (i.e., as television).
  • Any desired scene view is then synthesized (alternatively, "extracted") from the representations and reconstituted objects that are within the three-dimensional model, and is displayed to a viewer/user.
  • the synthesis/extraction may be in accordance with a viewer-specified criterion, and may be dynamic in accordance with such criterion.
  • the viewer of a football game may request a consistent view from the "fifty yard line", or may alternatively ask to see all plays from a stadium view at the line of scrimmage.
  • the views presented may be dynamically selected in accordance with an object in the scene, or an event in the scene.
  • Any interior or exterior perspectives on the scene may be presented.
  • the viewer may request a view looking into a football game from the sideline position of a coach, or may request a view looking out of the football game at the coach from the then position of the quarterback on the football field.
  • Any requested view may be panoramic, or at any aspect ratio, in presentation. Views may also be magnified, or reduced.
  • any and all views can be rendered stereoscopically, as desired.
  • the synthesized/extracted video views may be processed in real time, as television.
  • Any and all synthesized/extracted video views contain only as much information as is within the multiple video streams; no video view can contain information that is not within any video stream, and such a view will simply show black (or white) in this area.
  • the immersive video computer system of the present invention receives multiple video images of view on a real world scene, and serves to synthesize a video image of the scene which synthesized image is not identical to any of the multiple received video images.
  • the computer system includes an information base containing a geometry of the real-world scene, shapes and dynamic behaviors expected from moving objects in the scene, plus, additionally, internal and external camera calibration models on the scene.
  • a video data analyzer means detects and tracks objects of potential interest in the scene, and the locations of these objects .
  • a three-dimensional environmental model builder records the detected and tracked objects at their proper locations in a three-dimensional model of the scene. This recording is in consideration of the information base.
  • a viewer interface is responsive to a viewer of the scene to receive a viewer selection of a desired view on the scene. This selected and desired view need not be identical to any views that are within any of the multiple received video images.
  • a visualizer generates (alternatively, “synthesizes”) (alternatively “extracts”) from the three- dimensional model of the scene, and in accordance with the received desired view, a video image on the scene that so shows the scene from the desired view.
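  • The cooperation of these components can be sketched, purely illustratively, as a data flow from analyzer to model builder to visualizer; every class and function name below (EnvironmentModel, analyze_video, build_model, visualize) is hypothetical and merely stands in for the corresponding component described above:

```python
from dataclasses import dataclass, field

@dataclass
class EnvironmentModel:
    """Hypothetical container for tracked objects at 3-D locations."""
    objects: dict = field(default_factory=dict)    # name -> (x, y, z)

def analyze_video(frames_per_camera):
    """Stand-in for the video data analyzer: detect and track objects per camera."""
    return {"player_1": (10.0, 20.0, 0.0)}         # placeholder detections

def build_model(detections, information_base):
    """Stand-in for the environmental model builder."""
    model = EnvironmentModel()
    model.objects.update(detections)               # place objects in 3-D
    return model

def visualize(model, requested_view):
    """Stand-in for the visualizer: render the model from the requested view."""
    return f"synthesized view {requested_view} of {list(model.objects)}"

# One pass through the pipeline for a viewer-selected (possibly virtual) viewpoint.
info_base = {"scene_geometry": "football field", "camera_calibration": {}}
detections = analyze_video(frames_per_camera={})
model = build_model(detections, info_base)
print(visualize(model, requested_view="fifty yard line"))
```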
  • FIG. 1 is a top-level block diagram showing the high level architecture of the system for Multiple Perspective Interactive (MPI) video in accordance with the present invention.
  • FIG. 2 is a functional block diagram showing an overview of the MPI system in accordance with the present invention, previously seen in block diagram in Figure 1, in use for interactive football video.
  • FIG. 3 is a diagrammatic representation of the hardware configuration of the MPI system in accordance with the present invention, previously seen in block diagram in Figure 1.
  • FIG. 4 is a pictorial representation of a video display particularly showing how, as a viewer interface feature of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention previously seen in block diagram in Figure 1, a viewer can select one of the many items in the scene on which to focus.
  • Figure 5 is a diagrammatic representation showing how different cameras provide focus on different objects in the MPI system in accordance with the present invention; depending on the viewer's current interest an appropriate camera must be selected.
  • FIG. 6 is another pictorial representation of the video display of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, this the video display particularly showing a viewer-controlled three-dimensional cursor serving to mark a point in three-dimensional (3-D) space, with the projection of the 3-D cursor being a regular 2-D cursor.
  • FIG. 7 is a diagram showing coordinate systems for camera calibration in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention.
  • FIG. 8, consisting of Figures 8a through 8c, is a pictorial representation, and accompanying diagram, of three separate video displays in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, the three separate displays showing how three different cameras provide three different sequences, the three different sequences being used to build the model of events in the scene.
  • Figure 9, consisting of Figures 9a and 9b, is a pictorial representation of two separate video displays in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention showing how many known points in an image can be used for camera calibration; the frame of Figure 9a having sufficient points for calibration but the frame of Figure 9b having insufficient points for calibration.
  • Figure 10, consisting of Figures 10a through 10c, is a pictorial representation of three separate video frames, arising from three separate algorithm-selected video cameras, in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention.
  • FIG. 11 is a schematic diagram showing a Global Multi-Perspective Perception System (GM-PPS) portion of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention in use to take data from calibrated cameras covering a scene from different perspectives in order to dynamically detect, localize, track and model moving objects -- including a robot vehicle and human pedestrians -- in the scene.
  • FIG. 12 is a top-level block diagram showing the high level architecture of the Global Multi-Perspective Perception System (GM-PPS) portion, previously seen in Figure 11, of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, the architecture showing the interaction between a priori information formalized in a static model and the information computed during system processing and used to formulate a dynamic model.
  • Figure 13 is a graphical illustration showing the intersection formed by the rectangular viewing frustum of each camera scene onto the environment volume in the GM-PPS portion of the MPI video system of the present invention; the filled frustum representing possible areas where the object can be located in the 3-D model while, by use of multiple views, the intersection of the frustum from each camera will closely approximate the 3-D location and form of the object in the environment model.
  • Figure 14, consisting of Figure 14a and Figure 14b, is a diagram of a particular, exemplary, environment of use of the GM-PPS portion, and of the overall MPI video system of the present invention; the environment being an actual courtyard on the campus of the University of California, San Diego, where four cameras, the locations and optical axes of which are shown, monitor an environment consisting of static objects, a moving robot vehicle, and several moving persons.
  • Figure 15 is a pictorial representation of the distributed architecture of the GM-PPS portion of the MPI video system of the present invention wherein (i) a graphics and visualization workstation acts as the modeler, (ii) several workstations on the network act as slaves which process individual frames based on the master's request so as to (iii) physically store the processed frames either locally, in a nearby storage server, or, in the real-time case, as digitized information on a local or nearby frame-grabber.
  • Figure 16 is a diagram showing the derivation of a camera coverage table for an area of interest, or environment, in which objects will be detected, localized, tracked and modeled by the GM-PPS portion of the MPI video system of the present invention; each grid cell in the area is associated with its image in each camera plane while, in addition, the diagram shows an object dynamically moving through the scene and the type of information the GM-PPS portion of the MPI video system uses to maintain knowledge about this object's identity.
  • Figure 17, consisting of Figures 17a through 17d, is four pictorial views of the campus courtyard previously diagrammed in Figure 14 at global time 00:22:29:06; the scene containing four moving objects including a vehicle, two walkers and a bicyclist.
  • Figure 18 is a pictorial view of a video display to the GM-PPS portion of the MPI video system of the present invention, the video display showing, as different components of the GM-PPS, views from the four cameras of Figure 17 in a top row, and a panoramic view of the model showing hypotheses corresponding to the four moving objects in the scene in a bottom portion; the GM-PPS serving to detect each object in one or more views as is particularly shown by the bounding boxes, and serving to update object hypotheses by a line-of-sight projection of each observation.
  • Figure 19 consisting of Figures 19a through 19e, is five pictorial views of the GM-PPS model showing various hypotheses corresponding to the four moving objects in the scene of Figure 17 at global time 00:22:29:06;
  • Figures 19a-19d correspond to four actual camera views while Figure 19e shows a virtual image from the top of the scene.
  • Figure 20, consisting of Figures 20a through 20d, is four pictorial views of the same campus courtyard previously diagrammed in Figure 14, and shown in Figure 17, at global time 00:22:39:06; the scene still containing four moving objects including a vehicle, two walkers and a bicyclist.
  • Figure 21 is another pictorial view of the video display to the GM-PPS portion of the MPI video system of the present invention previously seen in Figure 18, the video display now showing a panoramic view of the model showing the hypotheses corresponding to the four moving objects in the scene at the global time 00:22:39:06 as was previously shown in Figure 20.
  • Figure 22, consisting of Figures 22a through 22c, is a diagrammatic view showing how immersive video in accordance with the present invention uses video streams from multiple strategically-located cameras that monitor a real-world scene from different spatial perspectives.
  • Figure 23 is a schematic block diagram of the software architecture of the immersive video system in accordance with the present invention.
  • Figure 24 is a pictorial view showing how the video data analyzer portion of the immersive video system of the present invention detects and tracks objects of potential interest and their locations in the scene.
  • Figure 25 is a diagrammatic view showing how, in an immersive video system in accordance with the present invention, the three-dimensional (3D) shapes of all moving objects are found by intersecting the viewing frustums of objects found by the video data analyzer; two views of a full three-dimensional model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration being particularly shown.
  • Figure 26 is a pictorial view showing how, in the immersive video system in accordance with the present invention, a remote viewer is able to walk through, and observe, a scene from anywhere using virtual reality control devices such as the boom shown here.
  • Figure 27 is an original video frame showing video views from four cameras simultaneously recording the scene of a campus courtyard at a particular instant of time.
  • Figure 28 is four selected virtual camera, or synthetic video, images taken from a 116-frame "walk through” sequence generated by the immersive video system in accordance with the present invention (color differences in the original color video are lost in monochrome illustration) .
  • Figure 29, consisting of Figures 29a and Figure 29b, are synthetic video images generated from original video by the immersive video system in accordance with the present invention, the synthetic images respectively showing a "bird's eye view" and a ground level view of the same courtyard previously seen in Figure 27 at the same instant of time.
  • Figure 30a is a graphical rendition of the 3D environment model generated for the same time instant shown in Figure 27, the volume of voxels in the model intentionally being at a scale sufficiently coarse so that the 3D environmental model of two humans appearing in the scene may be recognized without being so fine that it cannot be recognized that it is only a 3D model, and not an image, that is depicted.
  • Figure 30b is a graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration as was previously shown in Figure 25, the two human participants being clothed in karate clothing with a kick in progress, the scale and the resolution of the model being clearly observable.
  • Figure 30c is another graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention, this time for an outdoor karate demonstration, this time the environmental model being further shown to be located in the static scene, particularly of an outdoor courtyard.
  • Figure 31 is a listing of Algorithm 1, the Vista "Compositing" or "Hypermosaicing" Algorithm, in accompaniment to a diagrammatic representation of the terms of the algorithm, of the present invention where, at each time instant, multiple vistas are computed using the current dynamic model and video streams from multiple perspectives; for stereoscopic presentations vistas are created from left and from right cameras.
  • Figure 32 is a listing of Algorithm 2, the Voxel Construction and Visualization for Moving Objects Algorithm in accordance with the present invention.
  • Figure 33 is a set of synthetic video frames, similar to the frames of Figure 10, created by the immersive video system of the present invention at a random user-specified viewpoint during a performance of an indoor karate exercise by an actor in the scene, the virtual views of the indoor karate exercise of Figure 33 being rendered at a higher resolution than were the virtual views of the outdoor karate exercise of Figure 30.
  • the present specification presents a system, a method and a model for Multiple Perspective Interactive -- "MPI" -- video or television.
  • the cameras are real, and exist in the real world: to use a source camera, or a source image, that is itself virtual constitutes a second-level extension of the invention, and is not presently contemplated.
  • MPI video is always interactive -- the "I" in MPI -- in the sense that the perspective from which the video scene is desired to be, and will be, shown and presented to a viewer is permissively chosen by such viewer, and is not predetermined.
  • MPI video is also interactive in that, quite commonly, the perspective on the scene is dynamic, and responsive to developments in the scene. This may be the case regardless that the real video images of the scene from which the MPI video is formed are themselves dynamic and may, for example, exhibit pan and zoom. Accordingly, a viewer-selected dynamic presentation of dynamic events that are themselves dynamically imaged is contemplated by the present invention.
  • the "viewer-selected dynamic presentation” might be, for example, a viewer-selected imaging of the quarterback.
  • This image is dynamic in accordance that the quarterback should, by his movement during play, cause that, in the simplest case, the images of several different video cameras should be successively selected or, in the case of such full virtual video as is contemplated by the present invention, that the quarterback's image should be variously dynamically synthesized by digital computer means.
  • the football game is, of course, a dynamic event wherein the quarterback moves.
  • the real-world source, camera, images that are used to produce the MPI video are themselves dynamic in accordance that the cameramen at the football game attempt to follow play.
  • the present invention is not restricted to use with video depicting reality --
  • But reality is the cheapest source of such information, and can, when viewed through the MPI video system of the present invention, still be quite "intense". In other words, it may not be necessary to be attacked by a fake, virtual, tiger when one can visually experience the onrush of a real, hostile, football linebacker.
  • MPI video is presented upon a common monitor, or television set, and does not induce the viewer to believe that he or she has entered a fantasy reality.
  • MPI video need not be implemented for each and every individual video or television viewer in order to be useful.
  • a broadcast major American football game might reasonably consume not one, but 25+ channels -- one for each player of both sides on the football field, one for each coach, one for the football, and one for the stadium, etc.
  • An early alternative may be MPI video on pay per view. It has been hypothesized that the Internet, in particular, may expand in the future to as likely connect smart machines to human users, and to each other, as to communicatively interconnect more and more humans, only. Customized remote viewing can certainly be obtained by assigning every one his or her own remotely-controllable TV camera, and robotic rover. However, this scheme soon breaks down. How can hundreds and thousands of individually-remotely-controlled cameras jockey for position and for viewer-desired vantage points at a single event, such as the birth of a whale, or an auto race? It is likely a better idea to construct a comprehensive video image database from quality images obtained from only a few strategically positioned cameras, and to then permit universal construction of customized views from this database, all as is taught by the present invention.
  • the MPI video of the present invention causes video databases to be built in which databases are contained -- dynamically and from moment to moment (frame to frame) -- much useful information that is interpretive of the scene depicted.
  • the MPI video system contains information of the player's present whereabouts, and image. It is thus a straightforward matter for the system to provide information, in the form of text or otherwise, on the scene viewed, either continuously or upon request.
  • Such auxiliary information can augment the entertainment experience. For example, a viewer might be alerted to a changed association of a football in motion from a member of a one team to a member of the opposing team as is recognized by the system to be a fumble recovery or interception. For example, a viewer might simply be kept informed as to which player presently has possession of the football.
  • the MPI video model, its implementation, and the architectural components of a rudimentary system implementing the model are taught in the following sections 3 through of this specification.
  • Television is a real-time version of MPI video.
  • Interactive TV is a special case of MPI video. In MPI TV, many operations must be done in real time because many television programs are broadcast in real time.
  • The concept of MPI video is taught in the context of a sport event.
  • the MPI video model allows a viewer to be active; he or she may request a preferred camera position or angle, or the viewer may even ask questions about contents described in the video. Even the rudimentary system automatically determines the best camera and view to satisfy the demands of the viewer.
  • the particular, rudimentary, embodiment of an MPI video system features automatic camera selection and interaction using three-dimensional cursors.
  • the complete computational techniques used in the rudimentary system are not fully contained in this specification in detail because, by and large, known techniques hereinafter referred to are implemented. Certain computational techniques are, however, believed novel, and the mathematical basis of each of these few techniques is fully explained herein.
  • the running MPI video system is presently being extended to other applications besides American football.
  • a detailed teaching of the concept, and method, of generating a three-dimensional database required by the MPI video system of the present invention is taught and demonstrated in this specification not in the context of football, but rather, as a useful simplification, in the context of a university courtyard through which human and machine subjects (as opposed to football players) roam.
  • the present specification will accordingly be understood as being directed to the enabling principles, construction, features and resulting performance of a rudimentary embodiment of an MPI video system, as opposed to presenting great details on any or all of the several separate aspects of the system.
  • a physical phenomenon or an event can usually be viewed from multiple perspectives.
  • Current remote viewing via video or television permits viewing only from one perspective, that perspective being that of an author or editor and not of the viewer. A viewer has no choice.
  • remote viewing via video or television even under such limitations has been very attractive and has influenced our modern society in many aspects.
  • an episode is being recorded, or being viewed in real time
  • This episode could be related, for example, to a scientific experiment, an engineering analysis, a security post, a sports event, or a movie
  • the episode can be recorded using multiple cameras strategically located at different points. These cameras provide different perspectives of the episode.
  • Each camera view is individually very limited
  • the famous parable about an elephant and the blind men may be recalled.
  • With just one camera only a narrow aspect of the episode may be viewed.
  • a single camera is unable to provide a global description of an episode.
  • the environment model has a global view of the episode, and it also knows where each individual camera is.
  • the environment model is used in the MPI system to permit a viewer to view what he or she wants from where he or she wants (within the scene, and within limits) .
  • the viewer may be interested in a specific perspective, and may want to view a scene, an episode, or an entire video presentation from this specific perspective.
  • the user may specify a real, or a virtual, camera specifically.
  • the viewer may only specify the desired general location of the camera, without actual knowledge whether a camera in such location would be real or virtual.
  • the viewer may be interested in a specific object. There may be several objects in a scene, an episode, or a presentation. A viewer may want to always view a particular object independent of its situation in the scene, episode, or presentation. Alternatively, the object that is desired to be viewed may be context sensitive: the viewer may desire to view the basketball until the goal is scored and to then shift view to the last player to touch the basketball.
  • the viewer may be interested in a specific event.
  • a viewer may specify characteristics of an event and may want to view a scene, an episode, or a presentation from the best perspective for that event.
  • the viewer may be interested in having a view from a virtual camera.
  • the viewer may request to view a scene of an event within the scene from a perspective that is not provided by any real camera that is situated to acquire the scene or any portion thereof.
  • the MPI video system of the present invention will, by use of the environment model and video synthesis techniques, synthesize a virtual camera, and video image, so as to view a scene, an episode, or an entire presentation from a viewer-specified perspective.
  • FIG. 1 The high level architecture for a MPI video system so functioning is shown in a first level block diagram in Figure 1.
  • An image at a certain perspective from each camera 10a, 10b, ... 10n is converted to its associated camera scene in camera screen buffers CSB 11a, 11b, ... 11n. Multiple camera scenes are then assimilated into the environment model 13 by computer process in the Environ. Model Builder 12.
  • a viewer 14 (shown in phantom line for not being part of the MPI video system of the present invention) can select his perspective at the Viewer Interface 15, and that perspective is communicated to the Environment Model via a computer process in Query Generator 16.
  • the programmed reasoning system in the Environment Model 13 decides what to send via Display Control 17 to the Display 18 of the viewer 14.
  • a complete MPI video system with limited features can be, and has been, implemented using the existing technology.
  • the exact preferred architecture of a MPI video system will depend on the area to which the system is intended to be applied, and the type and level of viewer interaction allowed.
  • certain general issues are in common to any and all implementations of MPI video systems. Seven critical areas that must be addressed in building any MPI video system are as follows.
  • a camera scene builder is required as a programmed computer process.
  • In order to convert an image sequence of a camera to a scene sequence, the MPI video system must, and does, know where the camera is located, its orientation, and its lens parameters. Using this information, the MPI video system is then able to locate objects of potential interest, and the locations of these objects in the scene. This requires powerful image segmentation methods.
  • the MPI video system may use some knowledge of the domain, and may even change or label objects to make its segmentation task easier. This is, in fact, the approach of the rudimentary embodiment of the MPI video system, as will be further discussed later.
  • an environment model builder is required as a programmed computer process. Individual camera scenes are combined in the MPI video system to form a model of the environment. All potential objects of interest and their locations are recorded in the environment model. The representation of the environment model depends on the facilities provided to the viewer. If the images are segmented properly, then, by use of powerful but known computers and computing methods, it is possible to build environment models in real time, or almost in real time.
  • a viewer interface permits the viewer to select the perspective that he or she wants. This information is obtained from the user in a friendly but directed manner. Adequate tools are provided to the user to point and to pick objects of interest, to select the desired perspective, and to specify events of interest. Recent advances in visual interfaces, virtual reality, and related areas have contributed to making the MPI video system viewer interface very powerful -- even in the rudimentary embodiment of the system.
  • a display controller software process is required to respond to the viewers' requests by selecting appropriate images to be displayed to each such viewer. These images may all come from one perspective, or the MPI video system may have to select the best camera at every point in time in order to display the selected view and perspective. Accordingly, multiple cameras may be used to display a sequence over time, but at any given time only a single best camera is used. This has required solving a camera hand-off problem.
  • a video database must be maintained. If the video event is not in real time (i.e., television), then it is possible to store an entire episode in a video database. Each camera sequence is stored along with its metadata. Some of the metadata is feature based, and permits content-based operations. See Ramesh Jain and Arun Hampapur, "Metadata for video-databases," appearing in SIGMOD Record, Dec. 1994.
  • environment models are also stored in the database to allow rapid interactions with the system.
  • this sixth requirement is nothing but the fifth requirement performed faster, and in real time.
  • the requirement might just barely be realizable in software if computational parallelism is exploited, but, depending upon simplifying assumptions made, a computer ranging from an engineering work station to a full-blown supercomputer (both circa 1995) may be required.
  • low-cost (but powerful) microprocessors are likely distributable to each of the Camera Sequence Buffers CSB 11a, 11b, ... 11n in order to isolate, and to report, features and dynamic features within each camera scene. Correlation of scene features at a higher process level may thus be reduced to a tractable problem.
  • a visualizer is required in those applications that require the displaying of a synthetic image in order to satisfy a viewer's request. For example, it is possible that a user selects a perspective that is not available from any camera.
  • a trivial solution is simply to select the closest camera, and to use its image.
  • the solution of the rudimentary MPI video system of the present specification -- which solution is far from trivial in implementation or trite in the benefits obtained -- is to select a best -- and not necessarily a closest -- camera and to use its image and sequence.
  • the ability of an MPI video system to synthesize a full virtual video image is basically a function of "raw" computational power. If real time video (i.e., television) is not required, short virtual video segments of real world occurrences may be quite as reasonably produced, and maybe more reasonably produced, than the computer-generated special effects, including morphing, so popular in American movies circa 1995. Of course, it should be understood that even the synthesis of such segments requires computers of considerable speed and capacity.
  • a coach or a player may want to analyze how a particular player ran, or tackled, and to ignore all other players.
  • An interactive viewing system should allow the viewing of only plays of interest, and these from different angles.
  • the video would desirably be good enough so that some detailed analysis would be capable of being performed on the video of the plays in order to study the precise patterns, and performance, of the selected player.
  • viewers may both (i) select cameras according to their preference, and (ii) ask questions about the name(s), or the movement(s), of players.
  • the following are some examples of interaction between a viewer and the MPI video system.
  • the viewer may request that the MPI video system should show a shot of some upcoming play or plays taken from a camera located behind the quarterback.
  • the viewer may request that the MPI video system should show a best shot of a particular, viewer-identified, player.
  • the viewer may request that the MPI video system should show as text the name of the player to which the viewer points, with his or her cursor, on the screen of the display 18 (shown in Figure 1).
  • the viewer may request that the MPI video system should highlight on the screen a particular player whose name the viewer has selected from a player list.
  • the viewer may request that the MPI video system should show him or her the exact present location of a selected player.
  • the viewer may request that the MPI video system should show him or her the sequence when a selected player crossed, for example, the 40 yard line.
  • the viewer may request that the MPI video system should show him or her the event of a fumble.
  • the viewer may request that the MPI video system should show all third down plays in which quarterback X threw the ball to the receiver Y.
  • the MPI video system needs to have information about (i) contents of the football scene as well as (ii) video data.
  • Some of the above, and several similar questions, are relevant to MPI television, while others are relevant to MPI video.
  • the major distinction between MPI TV and MPI video is in the role of the database. In case of MPI video, it is assumed that much preprocessing can transpire, with the pre-processed information stored in a database. In case of MPI TV, most processing must be, and will be, in real time.
  • a football scene is captured by several cameras and analyzed by a scene analysis system.
  • the information obtained from individual cameras is used to form the environment model .
  • the environment model allows viewers to interactively view the scene.
  • the configuration of the MPI football video/television system is shown in Figure 3.
  • the current system consists of a UNIX workstation, a laser disc player, a video capture board, and a TV monitor and graphical display.
  • the TV monitor is connected to the laser disc player.
  • the laser disc player is controlled by the UNIX workstation.
  • a graphical user interface is built using X-window and Motif on graphical display.
  • This video data was divided into shots, each of which corresponds to one football play. Each shot was analyzed and a three-dimensional scene description -- to be discussed in considerable detail in sections 5-10 hereinafter -- was generated. Shots from multiple cameras were combined into the environment model.
  • the environment model contains information about position of players and status of cameras.
  • the environment model is used by the system to allow MPI video viewing to a user. User commands are treated as queries to the system and are handled by the environment model and the database .
  • the interactive video interface of the system is shown in Figure 4.
  • the video screen of Figure 4 shows video frames taken from laser disc.
  • Video control buttons control video playback.
  • From a camera list, a viewer can choose any camera.
  • From a player list, a viewer can choose certain players to be focused on. If a viewer doesn't select a camera, then the system automatically selects the best camera. Also, multiple viewers can interact using the three-dimensional cursor.
  • Automatic camera selection is a function that selects the best camera according to the preference of a user.
  • a player is captured by three cameras and they produce three views shown in Fig. 5.
  • camera 2 is the best to see this player, for in camera 1 the player is out of the area while in camera 3 the player is too small.
  • Different cameras provide focus on different objects.
  • an appropriate camera must be selected.
  • This function is performed by the system in the following way. First, viewers select the player that they want to see. Then the system looks into information on player position and camera status in the environment model to determine which camera provides the best shot of the player. Finally the selected shot is routed to the screen.
  • a three-dimensional cursor is introduced in support of the interaction between viewers and the MPI video/TV system.
  • a three-dimensional cursor is a cursor that moves in three-dimensional space. It is used to indicate a particular position in the scene.
  • the MPI video/TV system uses this cursor to highlight players. Viewers also use it to specify players that they want to ask questions about.
  • the cursor consists of five lines. Three of the five lines indicate the x, y and z axes of the three-dimensional space. The intersection of these three lines shows cursor position. The other two lines indicate a projection of the three lines onto the ground. The projection helps viewers have correct information of cursor position.
  • a viewer can manipulate the three-dimensional cursor so as to mark a point in the three-dimensional space.
  • the projection of the three dimensional cursor is a regular cursor centered at the projection of this marked point.
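  • A sketch of this cursor geometry follows, assuming the pinhole projection developed in the camera-calibration discussion below; the helper names project and cursor_lines, and the numeric values, are hypothetical illustrations only:

```python
import numpy as np

def project(point_3d, R, camera_pos, f):
    """Pinhole projection of a world point (a sketch of the mapping used
    for camera calibration in the following section)."""
    p, q, s = R @ (np.asarray(point_3d, dtype=float) - camera_pos)
    return f * p / s, f * q / s

def cursor_lines(marked_point, axis_len=2.0):
    """Return the five 3-D line segments of the cursor: three axis-parallel
    lines through the marked point and two lines of its ground projection."""
    x, y, z = marked_point
    axes = [((x - axis_len, y, z), (x + axis_len, y, z)),
            ((x, y - axis_len, z), (x, y + axis_len, z)),
            ((x, y, z - axis_len), (x, y, z + axis_len))]
    ground = [((x - axis_len, y, 0.0), (x + axis_len, y, 0.0)),
              ((x, y - axis_len, 0.0), (x, y + axis_len, 0.0))]
    return axes + ground

R = np.eye(3)                       # assumed camera orientation
camera_pos = np.array([0.0, 0.0, -50.0])
f = 800.0
print(project((5.0, 2.0, 1.0), R, camera_pos, f))   # 2-D cursor position
print(len(cursor_lines((5.0, 2.0, 1.0))), "cursor line segments")
```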
  • Both viewers and the MPI system use the three-dimensional cursor to interact with each other.
  • a viewer moves the cursor to the position of a player and asks who this player is.
  • the MPI system then compares the position of the cursor and the present position of each player to determine which player the viewer is pointing at.
  • a viewer tells the MPI system a name of a player and asks where the player is .
  • the MPI system then shows the picture of the player and overlays the cursor on the position of the player so as to highlight the player.
  • the purpose of scene analysis is to extract three-dimensional information from video frames captured by cameras. This process is performed in the following two stages:
  • 2-D information is extracted. From each video frame, feature points such as players and field marks are extracted and a list of feature points is generated.
  • 3-D information is extracted. From the two-dimensional description of the video frame, three-dimensional information in the scene, such as player position and camera status, is then extracted.
  • feature points are extracted from each video frame.
  • Feature points include two separate items in the images.
  • the players are defined by using their feet as feature points.
  • the field marks of the football field are used as feature points.
  • As is known to fans of American football, an American football field has yard lines to indicate yardage between goal lines, and hash marks to indicate a set distance from the side border, or sidelines, of the field.
  • Field marks are defined as feature points because their exact positions are a priori known, and their registration and detection can be used to determine camera status.
  • the feature points are extracted by human-machine interaction. This process is currently carried out as follows. First, the system displays a video frame on the screen of Display 18 (shown in Figure 1) . A viewer, or operator, 14 locates some feature points on the screen and inputs required information for each feature point. The system reads image coordinates of the feature points and generates two-dimensional description.
  • This process results in two-dimensional description of a video frame that consists of a list describing the players and a list describing the field marks.
  • the player descriptions include each player's name and the coordinates of each player's image.
  • the field mark descriptions include the positions (in the three-dimensional world) , and the image coordinates, of all the field marks.
  • the purpose of this step is to obtain three-dimensional information from the two-dimensional frames.
  • the spatial relationship between the three-dimensional world and the video frames captured by the cameras is shown in Figure 7.
  • a camera is observing a point (x, y, z) .
  • a point (u, v) in the image coordinate system to which the point (x, y, z) is mapped may be determined by the following relationships, which relationships comprise a coordinate system for camera calibration.
  • a point (x, y, z) in the world coordinate system is transformed to a point (p, q, s) in the camera coordinate system by the following equation
  • R is a transformation matrix from the world coordinate system to the camera coordinate system
  • (x1, y1, z1) is the position of the camera.
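  • The transformation equation itself is present in the original only as a drawing; a standard rigid-body form consistent with the definitions above (an assumed reconstruction, not a reproduction of the original figure) is:

\[
\begin{pmatrix} p \\ q \\ s \end{pmatrix}
= R \left( \begin{pmatrix} x \\ y \\ z \end{pmatrix}
- \begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} \right)
\]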
  • a point (p,q,s) in the camera coordinate system is projected to point (u,v) on the image plane according to the following equation:
  • f is a camera parameter that determines the degree of zoom in or zoom out.
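  • The projection equation likewise appears only as a drawing in the original; the standard perspective projection consistent with these definitions, again an assumed reconstruction, is:

\[
u = f \, \frac{p}{s}, \qquad v = f \, \frac{q}{s}
\]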
  • an image coordinate (u,v) which corresponds to world coordinate (x,y,z) is determined depending on the (i) camera position, (ii) camera angle and (iii) camera parameter.
  • a camera calibration is performed. If only one known point is observed, a pair of image coordinates and world coordinates may be known. By applying this known pair to the above equations, two equations regarding the seven parameters that determine camera status may be obtained. Observing at least four known points will suffice to provide the minimum equations to solve the seven unknown parameters.
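  • One way to realize this calibration, offered only as a sketch under the assumptions of the equations above (seven parameters: three of camera position, three orientation angles, and f), is nonlinear least squares over the known point pairs; the names rotation, project and calibrate are hypothetical, and a reasonable initial guess is assumed:

```python
import numpy as np
from scipy.optimize import least_squares

def rotation(rx, ry, rz):
    """Rotation matrix built from three orientation angles (one convention)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(params, world_pts):
    """Map world points to image points under the 7 camera-status parameters
    (x1, y1, z1, rx, ry, rz, f) of the equations above."""
    x1, y1, z1, rx, ry, rz, f = params
    cam = (world_pts - np.array([x1, y1, z1])) @ rotation(rx, ry, rz).T
    return f * cam[:, :2] / cam[:, 2:3]

def calibrate(world_pts, image_pts, initial_guess):
    """Recover the 7 unknowns from known point pairs (>= 4 points give
    >= 8 equations for the 7 unknowns) by nonlinear least squares."""
    def residual(p):
        return (project(p, world_pts) - image_pts).ravel()
    return least_squares(residual, initial_guess).x

# Synthetic check: four known field marks on the ground plane z = 0.
true_params = np.array([0.0, -40.0, 15.0, 1.93, 0.0, 0.0, 900.0])
marks = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                  [10.0, 5.0, 0.0], [0.0, 5.0, 0.0]])
observed = project(true_params, marks)
guess = true_params + np.array([0.5, 0.5, 0.5, 0.02, 0.0, 0.0, 10.0])
print(np.round(calibrate(marks, observed, guess), 3))
```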
  • the world coordinate may be determined from the image coordinate if it is considered that the point is constrained to lie in a plane.
  • the imaged football players are always approximately on the ground. Accordingly, the positions of players can be determined according to the above equations.
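  • A sketch of this ground-plane back-projection, assuming the same pinhole model and a camera that looks down on the field, follows; ground_position is a hypothetical helper, not the system's actual routine:

```python
import numpy as np

def ground_position(u, v, R, camera_pos, f):
    """Back-project an image point to the world ground plane z = 0.

    The pixel (u, v) defines a ray proportional to (u, v, f) in camera
    coordinates; the ray is expressed in world coordinates and intersected
    with the plane z = 0 on which the players are assumed to stand.
    """
    direction = R.T @ np.array([u, v, f], dtype=float)   # ray in world frame
    t = -camera_pos[2] / direction[2]                     # reach z = 0
    return camera_pos + t * direction

# Round trip under assumed numbers: a camera looking straight down.
R = np.diag([1.0, -1.0, -1.0])
camera_pos = np.array([0.0, 0.0, 20.0])
f = 800.0
# A player at (6, 9, 0) projects to (u, v); check that we recover the position.
p, q, s = R @ (np.array([6.0, 9.0, 0.0]) - camera_pos)
u, v = f * p / s, f * q / s
print(ground_position(u, v, R, camera_pos, f))
```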
  • the scene analysis process just described should be applied to every video frame in order to get the most precise information about (i) the location of players and (ii) the events in the scene.
  • the rudimentary, prototype, MPI video system is able to determine and select a single best camera to show a particular player or an event. This is determined by the system using the environment model. Effectively, for the given player's location, the system uses reverse mapping for given camera locations, and then determines where the image of the player will be for different cameras.
  • the system selects the camera in which the selected player is closest to the center of the viewing area.
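  • The selection rule can be sketched as follows, assuming hypothetical per-camera calibration records and image half-dimensions; best_camera and image_of are illustrative names only:

```python
import numpy as np

def image_of(point, camera):
    """Project a world point into a camera image (pinhole sketch)."""
    R, pos, f = camera["R"], camera["pos"], camera["f"]
    p, q, s = R @ (np.asarray(point, dtype=float) - pos)
    if s <= 0:                       # player is behind this camera
        return None
    return np.array([f * p / s, f * q / s])

def best_camera(player_pos, cameras, half_width=320, half_height=240):
    """Pick the camera whose image puts the player closest to the view
    center, ignoring cameras in which the player is outside the viewing area."""
    best, best_dist = None, float("inf")
    for name, cam in cameras.items():
        uv = image_of(player_pos, cam)
        if uv is None or abs(uv[0]) > half_width or abs(uv[1]) > half_height:
            continue                 # outside this camera's viewing area
        dist = float(np.hypot(*uv))  # distance from the center of the image
        if dist < best_dist:
            best, best_dist = name, dist
    return best

# Hypothetical cameras looking straight down from different field positions.
down = np.diag([1.0, -1.0, -1.0])
cameras = {
    "camera_1": {"R": down, "pos": np.array([-40.0, 0.0, 25.0]), "f": 600.0},
    "camera_2": {"R": down, "pos": np.array([0.0, 0.0, 25.0]),   "f": 600.0},
    "camera_3": {"R": down, "pos": np.array([40.0, 0.0, 25.0]),  "f": 600.0},
}
print(best_camera(np.array([3.0, 2.0, 0.0]), cameras))   # -> camera_2
```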
  • the system could prospectively be made more precise by considering the orientation of the player also.
  • the problem of transferring display control from one camera to another is called the "camera hand-off problem".
  • the rudimentary, prototype, MPI video system has been exercised on a very simple football scene imaged from three different cameras.
  • the goal of this example is to demonstrate the method and apparatus of the invention, and the feasibility of obtaining practical results.
  • the present implementation and embodiment can clearly be extended to process longer sequences, and also to different applications, and, indeed, is already being so extended.
  • the actual video data used in the experimental exercise of the MPI video system is shown in Figure 8.
  • the video data consists of the three shots respectively shown in Figures 8a through 8c. These three shots record the same football play but are taken from different camera angles. Each shot lasted about ten seconds. The three different cameras thus provide three separate, but related, sequences. These sequences are used to build the model of events in the scene.
  • the positions of the player "Washington" that a human may read from the video frames are close to the values that the system calculates. The values calculated by the MPI video system are shown below each picture in Fig. 10.
  • the three-dimensional cursor appears to be close to the chosen player "Washington" in the screen video image.
  • the present section 7 and following sections 8- expound the most conceptually and practically difficult portion of the MPI video system: its capture, organization and processing of real-world events in order that a system action -- such as, for example, an immediate selection, or synthesis, of an important video image (e.g., a football fumble, or an interception) -- may be predicated on this detection.
  • an omniscient multi-perspective perception system based on multiple stationary video cameras permits comprehensive live recognition, and coverage, of objects and events in an extended environment.
  • the system of the invention maintains a realistic representation of the real-world events.
  • a static model is built first using detailed a priori information.
  • Subsequent dynamic modeling involves the detection and tracking of people and objects in at least portions of the scene that are perceived (by the system, and in real time) to be the most pertinent.
  • the perception system uses camera hand-off to dynamically track objects in the scene as they move from one camera coverage zone to another. This tracking is possible due to several important aspects of the approach of the present invention, including (i) strategic placement of cameras for optimal coverage, (ii) accurate knowledge of scene-camera transformation, and (iii) the constraining of object motion to a known set of surfaces.
  • the exemplary courtyard environment contains (i) one object -- a human walker -- that follows a prescribed and predetermined dynamic path, namely a walkway path.
  • the exemplary environment contains (ii) still other objects -- other human walkers -- that do not even know that they are in any of a scene, a system, or an experiment, and who accordingly move as they please in unpredetermined patterns (which are nonetheless earthbound).
  • the exemplary environment contains (iii) an object -- a robot -- that is not independent, but which rather moves in the scene in response to static and dynamic objects and events therein, such as to, for example, traverse the scene without running into a static bench or a dynamic human.
  • Global Multi-Perspective Perception is taught and exercised in a campus environment containing (i) a mobile robot, (ii) stationary obstacles, and (iii) people and vehicles moving about -- actors in the scene that are shown diagrammatically in Figure 11a.
  • an omniscient multi-perspective perception system uses multiple stationary cameras which provide comprehensive coverage of an extended environment. The use of fixed global cameras simplifies visual processing.
  • the particular global multi-perspective perception system that monitors the campus environment containing people, vehicles and the robot uses the several color and monochrome CCD cameras also diagrammatically represented in Figure 12.
  • This particular perception system is not only useful in the MPI video system, but is also useful in any completely autonomous system with or without a human in the loop, such as in the monitoring of planes on airport runways.
  • the next section 8 describes the preferred approach and the principle behind camera coverage, integration and camera hand-off.
  • the prototype global multi-perspective perception system, and the results of experiments thereon, is next described in section 9.
  • the approach of the present invention is, to the best present knowledge of the inventors, a revolutionary application of computer vision that is immediately practically useable in several diverse fields such as intelligent vehicles as well as the interactive video applications -- such as situation monitoring and tour guides, etc. -- that are the principal subject of the present specification.
  • Multi-perspective perception involves each of the following.
  • the separate observations are assimilated into a three dimensional model.
  • the preferred embodiment of the present invention leaves "familiar ground” quickly, and “plunges" into a new construct for any perception system, whether global and/or multi-perspective or not.
  • The model is used in performing the required tasks. Exactly what this means must be postponed until the "model" is better understood.
  • FIG. 12 A high-level schematic diagram of the different components of the preferred embodiment of the prototype multi-perspective perception system in accordance with the present invention is shown in Figure 12.
  • a study of the diagram will show that the system includes both two-dimensional and three-dimensional processing.
  • Reference S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Y. Kuramura, and S. Moezzi; Modeling and Interactivity in MPI-Video, Technical Report VCL-94-103, Visual Computing Laboratory, University of California, San Diego, Dec. 1994.
  • the static model contains a priori information such as camera calibration parameters, look-up tables and obstacle information.
  • the dynamic model contains task specific information like two dimensional and three dimensional maps, dynamic objects, states of objects in the scene (e.g., a particular human is mobile, or the robot vehicle is immobile), etc.
  • the three-dimensional model of the preferred embodiment of the prototype multi-perspective perception system in accordance with the present invention is created using information from multiple video streams. This model provides information that cannot be derived from a single camera view due to occlusion, size of the objects, etc. Reference S. Chatterjee, et al., op. cit.
  • a good three dimensional model is required to recognize complex static and moving obstacles
  • the multi-perspective perception system must maintain information about the positions of all the significant static obstacles and dynamic objects in the environment.
  • the system must extract information from both the two-dimensional static model as well as the three-dimensional dynamic model.
  • a representation must be chosen that (i) facilitates maintenance of object positional information as well as (ii) supports more sophisticated questions about object behavior.
  • Although information representation can be considered an implementation issue, the particular representation chosen will significantly affect the system development.
  • information representation is considered to be an important element of the preferred multi-perspective perception system, and of its architecture.
  • geometric information is represented as a combination of voxel representation, gridmap representation and object-location representation. Specific implementations and domains deal with this differently.
  • When combined with information about the exact position and orientation of a camera, the a priori knowledge of the static environment is a very rich source of information which has not previously received much attention.
  • the preferred system is able to compute the three dimensional position of each dynamic object detected by its motion segmentation component. To do so, the (i) a priori information about the scene and (ii) the camera calibration parameters are coupled with (iii) the assumption that all dynamic objects move on the ground surface.
  • the three-dimensional voxel representation is particularly efficacious.
  • a dynamic object recorded on an image plane projects into some set of voxels. Multiple views of an object will produce multiple projections, one for each camera. The intersection of all such projections provides an estimate of the three-dimensional form of the dynamic object, as illustrated in Figure 13 for an object seen by four cameras.
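  • A coarse sketch of this intersection, testing each voxel center against the object's two-dimensional bounding box in every observing camera, is given below; the names carve and observations, and the specific bounding boxes, are hypothetical:

```python
import numpy as np

def project(point, R, pos, f):
    """Pinhole projection used throughout these sketches."""
    p, q, s = R @ (point - pos)
    return np.array([f * p / s, f * q / s]) if s > 0 else None

def carve(voxel_centers, observations):
    """Keep voxels whose projection falls inside the object's 2-D bounding
    box in every camera that observes the object; the intersection of the
    per-camera frustums approximates the object's 3-D form."""
    kept = []
    for voxel in voxel_centers:
        inside_all = True
        for obs in observations:
            uv = project(voxel, obs["R"], obs["pos"], obs["f"])
            (u0, v0), (u1, v1) = obs["bbox"]
            if uv is None or not (u0 <= uv[0] <= u1 and v0 <= uv[1] <= v1):
                inside_all = False
                break
        if inside_all:
            kept.append(voxel)
    return np.array(kept)

# A coarse voxel grid over a small region of the environment model.
xs, ys, zs = np.meshgrid(np.arange(0, 4, 0.5), np.arange(0, 4, 0.5),
                         np.arange(0, 2, 0.5), indexing="ij")
voxels = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

down = np.diag([1.0, -1.0, -1.0])      # cameras looking straight down
observations = [
    {"R": down, "pos": np.array([1.0, 1.0, 10.0]), "f": 500.0,
     "bbox": ((-30.0, -80.0), (80.0, 30.0))},
    {"R": down, "pos": np.array([3.0, 1.0, 10.0]), "f": 500.0,
     "bbox": ((-130.0, -80.0), (0.0, 30.0))},
]
print(len(carve(voxels, observations)), "voxels in the intersection")
```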
  • Camera handoff should be understood to be the event in which a dynamic object passes from one camera coverage zone to another.
  • the multi-perspective perception system must maintain a consistent representation of an object's identity and behavior during camera handoff. This requires the maintenance of information about the object's position, its motion, etc.
  • Camera handoff is a crucial aspect of processing in the multi-perspective perception system because it integrates a variety of key system components. Firstly, it relies on accurate camera calibration information and static model data. Secondly, it requires knowledge of objects and their motion through the environment determined from the dynamic model. Finally, the camera handoff can influence dynamic object detection processing.
  • This section 8 has described the architecture, and some important features, of the multi-perspective perception system. Reference also S. Chatterjee, et al., op. cit. The next section describes in detail the preferred implementation of the multi-perspective perception system for the application of monitoring a college courtyard.
  • the implementation of an integrated Multiple Perspective Interactive (MPI) video system demands a robust and capable implementation of the multi-perspective perception subsystem.
  • the four views of the scene were digitized using a frame-addressable VCR and frame capture board combination.
  • the synchronization was done by hand using synthetic synchronization points in the scene (known as hat drops).
  • the resulting image sequences were placed on separate disks and controllers for independent distributed access. Having an extended pre-digitized sequence (i) accorded repeatability and (ii) permitted development of the perception system without the distractions and time consumption of repeated digitization of the scene.
  • the source of the scene image sequence was transparent to the perception system, and was, in fact, hidden behind a virtual frame grabber. Hence, the test was not only realistic, but migration of the perception system into (i) real-time using (ii) real video frame capture boards proved easy.
  • Calibration of the cameras in the perception system is important because accurate camera-world transformation is vital to correct system function.
  • the cameras are assumed to be calibrated a priori, so that precise information about each camera's position and orientation could be used either directly, or by use of pre-computed camera coverage tables, to convert two-dimensional observations into three-dimensional model space, and, further, three-dimensional expectations into 2D.
  • the work stations in the experiment were connected on a 120 Mbps ethernet switch which guaranteed full-speed point-to-point connection.
  • a central graphical work station was used to control the four video processing workstations, to maintain the environment model (and associated temporal database) , and, optionally, to communicate results to another computer process such as that exercising and performing an MPI video function.
  • the central master computer and the remote slave computers communicate at a high symbolic level; minimal image information is exchanged. Hence only a very low network bandwidth is required for master-slave communication.
  • the master-slave information exchange protocol is preferably as follows:
  • the master computer initializes graphics, the database and the environment model, and waits on a pre-specified port .
  • each slave computer contacts the master computer, using a pre-specified machine-port combination, and an initialization hand-shaking protocol ensues.
  • the master computer acknowledges each slave computer and sends the slave computer initialization information such as (i) where the images are actually stored (for the laboratory case), (ii) the starting frame and frame interval, and (iii) camera-specific image-processing information like thresholds, masks etc.
  • the slave initializes itself based on the information sent by the master computer
  • the master computer, either synchronously or asynchronously depending on application, will process the individual cameras as described in following steps seven through nine.
  • the master computer sends a request to that particular slave computer with information about processing the frame, viz. focus of attention windows, frame-specific thresholds and other parameters, current and expected locations and identifications of moving objects etc., continuing during this processing any user interaction.
  • requests to all slave computers are sent simultaneously and the integration is done after all slave computers have responded. In asynchronous mode, this will not necessarily proceed in unison.
  • the frame information is used to update the environment model and the database as described in following Section 9.1.7.
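  • A minimal sketch of this exchange, assuming TCP sockets and JSON-encoded symbolic messages (the port number, field names and file path are illustrative values, not the prototype's actual protocol), is as follows:

```python
import json, socket

MASTER_PORT = 7000                                  # pre-specified port (illustrative value)

def master_init_and_wait():
    """Master: initialize, then wait for a slave on a pre-specified port and send it
    its initialization information during the hand-shake."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", MASTER_PORT))
    srv.listen(4)                                   # one slave per camera in the prototype
    conn, _ = srv.accept()                          # a slave contacts the master
    init = {"image_store": "/data/camera3",         # (i) where the images are stored (illustrative path)
            "start_frame": 0, "frame_interval": 1,  # (ii) starting frame and frame interval
            "diff_threshold": 25,                   # (iii) camera-specific processing parameters
            "mask_file": "cam3_mask.png"}
    conn.sendall(json.dumps(init).encode())
    return conn

def master_request_frame(conn, frame_no, focus_windows, expected_objects):
    """Master: per-frame request carrying only symbolic information (no image data)."""
    req = {"frame": frame_no,
           "focus_of_attention": focus_windows,     # rectangles limiting slave-side processing
           "expected_objects": expected_objects}    # current/expected locations and identities
    conn.sendall(json.dumps(req).encode())
    return json.loads(conn.recv(65536).decode())    # slave replies with bounding boxes etc.
```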
  • a communication master computer that manages all slave computers, assimilates the processed information into an environment model, processes user input (if any), and sends information to the MPI video process (if any), resides at the heart of the multi-perspective perception system.
  • this master computer is an SGI Indigo2 workstation with high-end graphics hardware. This machine, along with graphics software -- OpenGL and Inventor -- was used to develop a functional Environment Model building and visualization system.
  • Reference J. Neider, T. Davis, and M. Woo, OpenGL Programming Guide: The Official Guide to Learning OpenGL, Release 1, Addison-Wesley Publishing Company, 1993.
  • One of the goals of the exercise of the multi-perspective perception system was to illustrate the advantages of using static cameras for scene capture, and the relative simplicity of visual processing in this scenario when compared to processing from a single camera. While more sophisticated detection, recognition and tracking algorithms are still being developed and applied, the initial, prototype multi-perspective perception system uses simple yet robust motion detection and tracking.
  • the processing of individual video streams is done using independent video processing slaves, possibly running on several different machines. The synchronization and coordination of these slaves, any required resolution of inconsistencies, and generation of expectations is done at the master.
  • Independent processing of information streams is an important feature of the information assimilation architecture of the present invention, and is a continuation and outgrowth of the work of some of the inventors and their colleagues. See, for example, R. Jain, Environment models and information assimilation, Technical Report RJ 6866(65692), IBM Almaden Research Center, San Jose, CA, 1989; Y. Roth and R. Jain, Knowledge caching for sensor-based systems, Artificial Intelligence, 71:257-280, Dec. 1994; and A. Katkere and R. Jain, A framework for information assimilation, to be published in Exploratory Vision, edited by M. Landy, et al., 1994.
  • the independent processing results in pluggable and dynamically reconfigurable processing tracks.
  • the preferred, prototypical, communication slave computers perform the following steps on each individual video frame. Video processing is limited by focus of attention rectangles specified by the master computer, and by pre-computed static mask images delineating portions of a camera view which cannot possibly have any interesting motion. The computation of the former is done using current locations of the object hypotheses in each view and projected locations in the next view. The latter is currently created by hand, painting out areas of each view not on the navigable surface (walls, for example). Camera coverage tables help the master computer in these computations. Coverage tables, and the concept of objects, are both illustrated in Figure 16.
  • the input frame is first smoothed to remove some noise. Then the difference image d(t) is computed. Only pixels that are in the focus of attention windows and that are not masked are considered; a simplified sketch of these steps follows the filter list below.
  • This shadow-removing step is not invariably used nor required since it needs a one frame look-ahead. In many cases simple heuristics may be used to eliminate motion shadows at a symbolic level .
  • filters are applied at the slave site to the list of components obtained from the previous step.
  • Commonly used filters include (i) merging of overlapping bounding boxes, (ii) hard limits of orientation and elongation, and (iii) distance from expected features etc.
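  • The per-frame slave processing just described can be sketched as follows (a simplified Python/NumPy rendition assuming grayscale frames; the threshold and component size limits are illustrative values, not the prototype's actual parameters):

```python
import numpy as np
from scipy import ndimage

def detect_motion(prev_frame, frame, mask, focus, diff_thresh=25, min_area=30, max_area=5000):
    """One slave-side step: smooth, difference, threshold, 4-connected components, size filter.
    `mask` is 1 where motion is possible; `focus` is 1 inside the focus-of-attention windows."""
    smoothed = ndimage.gaussian_filter(frame.astype(float), sigma=1.0)
    prev_s   = ndimage.gaussian_filter(prev_frame.astype(float), sigma=1.0)
    diff = np.abs(smoothed - prev_s) * mask * focus        # only unmasked, attended pixels
    binary = diff > diff_thresh
    labels, n = ndimage.label(binary)                      # default structure = 4-neighbourhood
    boxes = []
    for i, comp in enumerate(ndimage.find_objects(labels), start=1):
        area = np.count_nonzero(labels[comp] == i)
        if min_area <= area <= max_area:                   # discard components that are too small/large
            boxes.append(comp)                             # (slice_y, slice_x) bounding box
    return boxes
```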
  • the central visualization and modeling site receives processed visual information from the video processing sites and creates/updates object hypotheses. There are several sophisticated ways of so doing. Currently, and for the sake of simplicity in developing a completely operative prototype, this is done as follows:
  • the footprint of each bounding box is projected to the primary surface of motion by intersecting a ray drawn from the optic center of that particular camera through the foot of the bounding box with the ground surface.
  • each valid footprint is tested for membership with existing objects and the observation is added as support to the closest object, if any. If no object is close enough, then a new object hypothesis is created.
  • the object positions are projected into the next frame based on a domain-dependent tracker.
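  • A sketch of this hypothesis update at the central site (assuming the footprint has already been projected to the ground plane as above; the gating radius and the simple merge rule are illustrative assumptions):

```python
import numpy as np

ASSOC_RADIUS = 1.5   # metres; gating distance (illustrative value)

class ObjectHypothesis:
    def __init__(self, hyp_id, position):
        self.hyp_id = hyp_id
        self.position = np.asarray(position, dtype=float)   # footprint on the ground plane
        self.confidence = 1.0
        self.history = [self.position.copy()]

def assimilate_footprint(hypotheses, footprint, next_id):
    """Add a projected footprint as support to the closest existing hypothesis,
    or create a new hypothesis if none is close enough."""
    footprint = np.asarray(footprint, dtype=float)
    if hypotheses:
        closest = min(hypotheses, key=lambda h: np.linalg.norm(h.position - footprint))
        if np.linalg.norm(closest.position - footprint) <= ASSOC_RADIUS:
            closest.position = 0.5 * (closest.position + footprint)   # simple support merge
            closest.confidence += 1.0
            closest.history.append(footprint)
            return hypotheses, next_id
    hypotheses.append(ObjectHypothesis(next_id, footprint))            # new object hypothesis
    return hypotheses, next_id + 1
```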
  • object positions and associations are compared against predetermined templates. For example, if in the courtyard scene the robot has moved into spatial coincidence with one of the predetermined immovable objects, such as a bench, then the robot may have run into the bench -- an abnormal and undesired occurrence.
  • any of a (i) kickoff, (ii) fumble, or (iii) interception may have transpired.
  • If the detected event is of interest to the viewer in the MPI video system, then appropriate control signals are sent.
  • an appropriate message may be sent. If in the scene of a football game the football is determined to be in spatial coincidence with the forty yard marker, then it is reported that the football is on the forty yard line.
  • Figures 17 through 21 show frames in an exemplary exercise -- consisting of one thousand (1000) total frames from four (4) different cameras acquired as described in Section 9.1.2 -- of the multi-perspective perception subsystem.
  • Figures 17 through 19 show the state of the subsystem at global time 00:22:29:06.
  • Figures 20 and 21 show the state of the subsystem at the global time 00:22:39:06.
  • four dynamic objects are shown in the scene: a robot vehicle, two pedestrians and a bicyclist. The scene is covered by four different cameras.
  • a fifth object -- another bicyclist -- is shown, but is not labeled for clarity.
  • Each of the four cameras has its own clock, as is shown under the camera's view in one of Figures 17a through 17d. Camera number three (#3), which is arbitrarily known as "Saied's camera", is used to maintain the global clock since this camera has the largest coverage and the best image quality. Figures 17a-17d clearly show the coverage of each camera.
  • an object that is out of view, too small, and/or occluded from view in one camera is in view, large and/or un-occluded in the view of another camera.
  • the object labels used in Figure 17 are for explanation only.
  • the prototype subsystem does not include any non-trivial object recognition, and all object identifiers that persist over time are automatically assigned by the system. Mnemonic names like "Walker 1", or "Walker", refer to the same object identification (e.g., what the software program would label
  • a pictorial representation of the display screen showing the operator interface to the multi-perspective perception subsystem is shown in Figure 18.
  • Four camera views are shown in the top row of Figure 18. Each view is labeled using its mnemonic identification instead of its numeric identification because humans respond better to mnemonic "id's"
  • Each view may be associated with one of Figures 17a-17d.
  • the bottom section of the operator display screen in Figure 18 shows the object hypotheses which are formed over several frames (first frame is global clock 00:22:10:0) .
  • the intensity of each object's marker represents the confidence in each hypothesis.
  • the entire display screen, the objects depicted, and the object hypotheses diagrammatically depicted, are, as might well be expected, in full color.
  • Figures 17-21 are therefore monochrome renditions of color images.
  • the object markers are preferably in the color yellow, and the intensity of the bright yellow color of each object's marker represents the confidence in the hypothesis for that object. The eye is sensitive to discern even such slight differences in color intensity as correspond to differences in confidence.
  • the multi-perspective perception subsystem has a high confidence in each object for which a marker is depicted in Figure 18 because, at the particular global time represented, each object happens to have been observed from many cameras over several past frames.
  • The three-dimensional model at global time 00:22:29:06 is shown in Figures 19a-19e in both real and virtual views.
  • Figures 19a-19d show the model from the four real camera views.
  • the fifth view of Figure 19e is a virtual view of the model from directly overhead the courtyard -- where no real camera actually exists. This virtual view shows the exact locations of all three objects, including the robotic vehicle, in the two-dimensional plane of the courtyard.
  • Figures 20 and 21 show that state.
  • Figure 20 corresponds to Figure 18
  • while Figure 21 corresponds to Figure 19.
  • One important observation to make in Figures 20 and 21 is that, given the relative proximity of Walker number one (#1) and Bicyclist number one (#1) , both are still classified as separate objects. This is only possible due to the subsystem's history and tracking mechanism.
  • MPI multi-perspective interactive
  • a variety of other application areas can benefit from the global multi-perspective perception subsystem described.
  • environments demanding sophisticated visual monitoring, such as airport runways and hazardous or complex roadway traffic situations can advantageously use the global multi-perspective perception subsystem.
  • objects must be recognized and identified, and spatial-temporal information about objects' locations and behaviors must be provided to a user.
  • the expected first application of the global multi-perspective perception subsystem to the MPI video system has been sports, and it is expected that sports and other entertainment applications -- which greatly benefit -- will be the first commercial application of the subsystem/system.
  • Sports events e.g. football games
  • the reason that still more cameras are not used is primarily perceived as having to do with the expense of such human cameramen as are required to focus the camera image on the "action", and not the cost of the camera. Additionally, it is unsure how many different "feeds" a sports editor can use and select amongst -- especially in real time.
  • the reason the televised sporting event viewing public is by and large satisfied with the coverage offered is that they have never seen anything better -- including the movies. Few people have been privileged to edit a movie or a video, and even fewer to their own personal taste (no matter how weird, or deviant).
  • the machine-based MPI video of the present invention will, of course, accord viewing diversity without the substantial expense of human labor.
  • Still another application where the global multi-perspective perception subsystem may be used directly is as a tour guide in a museum or any such confined space. Rather than moving objects in the scene (i.e., the courtyard, or the football field), the scene can remain fixed (i.e., the museum) and the camera can move. The response accorded a museum visitor/video camera user will be even more powerful than, for example, the hypertext linkage on the World Wide Web of the Internet. On an interactive computer screen and system (whether
  • what the global multi-perspective perception subsystem offers the art museum environment is that accumulation of a "user track", instead of an "object track", becomes trivial.
  • the user may be guided in a generally non-repetitious track through the galleries. If he/she stops and lingers for one artist, or one subject matter, or a style, or a period, etc., then selected further works of the artist, subject matter, style, period, etc., that seem to command the user's interest may be highlighted to the user. If the user dwells at length at a single work, or at a portion thereof, the central computer can perhaps send textual or audio information regarding it.
  • If the provided information is obviously of no interest to the user, it may be terminated. If the user listens and views through all offered messages that are classified "historical perspective of the persons and things depicted in the art work viewed", then it might reasonably be assumed that the user is interested in history. If, on the contrary, the user listens and views through all offered messages that are classified "life of the artist", then it might reasonably be assumed that the user is interested in biography.
  • Future improvements to the global multi-perspective perception subsystem may also be taken in the area of cooperative human-machine systems. Interactivity at the central site might be improved so as to permit a human to perform higher-level cognitive tasks than simply asking "where", or "what", or "when".
  • the human might ask, for example, "why?" In the context of football, and for the event of a tackle, the machine (the computer) might be able to advance as a possible answer (which would not invariably be correct) to the question "why (the tackle)?" something like "Defensive linebacker #24 at the site of the tackle has not been impeded in his motion since the start of the play." The machine has sensed that linebacker #24 --
  • The present specification has taught a coherent, logical, and useful scheme of implementing virtual video/television.
  • the particular embodiment within which the invention is taught is, as would be expected and as is desirable for the sake of simplicity of teaching, rudimentary.
  • the synthesized video image is not completely of a virtual camera/image that may be located anywhere, but is instead of a machine-determined most appropriate real-world camera. This may initially seem like a significant, and substantive, curtailment of the described scope of the present invention.
  • the computer and computer system realizing the present invention can usefully be very powerful, and can usefully exercise certain exotic software functions in the areas of machine vision, scene and feature analysis, and interactive control.
  • the present invention has not been, to the present date of filing, implemented at its "full blown" level of interactive virtual television. It need not be in order that it may be understood as a coherent, logical, and useful scheme of so implementing virtual video/television.
  • the MPI video system is in its infancy.
  • the potential of the MPI video techniques is obvious, but cost effective implementation, especially for the individual "John Q. Public" viewer has a long way to go.
  • Almost all medium- to large- scale computer technology involved in the implementation of the prototype MPI video systems was stretched to its limits. The following are only a few examples of the useful, and probable,
  • indexing techniques will be required. These techniques for images and video are just being developed.
  • a virtual video camera, and a virtual video image, of a scene were synthesized in a computer and in a computer system from multiple real video images of the scene that were obtained by multiple real video cameras.
  • the present invention need not be (and to the present date of filing has not been) implemented at its "full blown" level of interactive virtual television in order that it may be recognized that a coherent, logical, and useful scheme of implementing virtual video/television is shown and taught.
  • the virtual video camera, and virtual image, produced by the MPI video system need not, and commonly does not, have any real-world counterpart.
  • the image may show, for example, a view of a sporting event, for example American football, from an aerial, or an on-field, perspective at which no real camera exists or can exist.
  • the synthesis of virtual video images/virtual television pictures may be linked to any of (i) a perspective, (ii) an object in the video/television scene, or (iii) an event in the video/television scene.
  • the linkage may be to a static, or a dynamic, (i) perspective, (ii) object or (iii) event.
  • the virtual video/television camera could be located
  • the virtual camera, and virtual image, that is synthesized from multiple real-world video images may be so synthesized interactively, and on demand.
  • a television sports director might select a virtual video replay of a play in a football game keyed on a perspective, player or event, or might even so key a selected perspective of an upcoming play to be synthesized in real time, and shown as virtual television.
  • many separate viewers are able to select, as sports fans, their desired virtual images.
  • a virtual video replay, or even a virtual television, image of each of the eleven players on each of two American football teams, plus the image of the football is carried on twenty-three television channels. The "fan" can thus follow his favorite player.
  • MPI Multiple Perspective Interactive
  • video data is an attractive source of information for the creation of "virtual worlds" which, despite being virtual, incorporate some "real world" fidelity.
  • the present invention concerns the use of multiple streams of video data for the creation of immersive, "visual reality", environments.
  • the immersive video system of the present invention for so synthesizing "visual reality" from multiple streams of video data is based on, and is a continuance of, the Multiple Perspective Interactive Video (MPI-Video) just discussed.
  • An immersive video system incorporates the MPI-Video architecture, which architecture provides the infrastructure for the processing and the analysis of multiple streams of video data.
  • the MPI-Video portion of the immersive video system (i) performs automated analysis of the raw video and (ii) constructs a model of the environment and object activity within the environment. This model, together with the raw video data, can be used to create immersive video environments. This is the most important, and most difficult, functional portion of the immersive video system. Accordingly, this MPI-Video portion of the immersive video system is first re-visited, and actual results from an immersive "virtual" walk through as processed by the MPI-Video portion of the immersive video system are presented.
  • multiple video cameras cover a dynamic, real-world, environment.
  • These multiple video data streams are a useful source of information for building, first, accurate three-dimensional models of the events occurring in the real world, and, then, completely immersive environments.
  • the immersive environment does not, in accordance with the present invention, come straight from the real world environment.
  • the present invention is not simply a linear, brute-force, processing of two-dimensional (video) data into a three-dimensional (video) database (and the subsequent uses thereof). Instead, in accordance with the present invention, the immersive environment comes to exist through a three-dimensional model, particularly a model of real-world dynamic events. This will later become clearer such as in, inter alia, the discussion of Figure 25.
  • In the immersive video system of the present invention, visual processing algorithms are used to extract information about object motion and activity (both of which are dynamic by definition) in the real world environment.
  • This information -- along with (i) the raw video data and (ii) a priori information about the geometry of the environment -- is used to construct a coherent and complete visual representation of the environment.
  • This representation can then be used to construct accurate immersive environments based on real world object behavior and events. Again, the rough concept, if not the particulars, is clear.
  • the immersive environment comes to be only through a model, or representation, of the real world environment.
  • video data proves a powerful source medium for these tasks (leading to the model, and the immersive environment).
  • the effective use of video requires sophisticated data management and processing capabilities.
  • the manipulation of video data is a daunting task, as it typically entails staggering amounts of complex data.
  • using powerful visual analysis techniques it is possible to accurately model the real world using video streams from multiple perspectives covering a dynamic environment.
  • Such "real-world” models are necessary for "virtual world” development and analysis.
  • the MPI-Video portion of the immersive video system builds the infrastructure to capture, analyze and manage information about real-world events from multiple perspectives, and provides viewers (or persons interacting with the scene) interactive access to this
  • the MPI-Video sub-system uses a variety of visual computing operations, modeling and visualization techniques, and multimedia database methodologies to (i) synthesize and (ii) manage a rich and dynamic representation of object behavior in real-world environments monitored by multiple cameras (see Figures 2 and 22).
  • An Environment Model (EM) is a hierarchical representation of (i) the structure of an environment and (ii) the actions that take place in this environment.
  • the EM is used as a bridge between the process of analyzing and monitoring the environment and those processes that present information to the viewer and support the construction of "immersive visual reality" based on the video data input.
  • Interactively presenting the information about the world to the viewer is another important aspect of "immersive visual reality”. For many applications and many viewer/users, this includes presentation of a "best” view of the real-world environment at all times to the viewer/user. Of course, the concept of what is "best” is dependent on both the viewer and the current context. In following Section 12, the different ways of defining the "best” view, and how to compute the "best” view based on viewer preferences and available model information, is described.
  • immersion of the viewer/user is vital. Selecting the "best” view among available video streams, which selection involves constant change of viewer perspective, may be detrimental towards creating immersion.
  • Generalizing the "best" view concept to selecting a continuous sequence of views that best suit viewer/user requirements and create immersion overcomes this. When such arbitrary views are selected, then the world must somehow be visualized from that perspective for the viewer/user.
  • video can be used as a dynamic source of generating texture information.
  • the complete immersive video system discussed in Section 13 uses a comprehensive three-dimensional model of the environment and the multiple video channels to create immersive, realistic renditions of real-world events from arbitrary perspective, in both monocular and stereo presentations.
  • Section 12 is a description of the construction of accurate three-dimensional models of an environment from multi-perspective video streams in consideration of a priori knowledge of an environment. Specifically, Section 12 discusses the creation of an Environment Model and also provides details on the preferred MPI-Video architecture.
  • section 4 describes how this model, along with the raw video data, can be used to build realistic "immersive visual reality” vistas, and how a viewer can interact with the model .
  • Section 15 describes various applications of the immersive video system of the present invention.
  • the MPI-Video modeling system described in Section 12 uses multiple video signals to faithfully reconstruct a model of the real-world actions and structure.
  • a distributed implementation coupled with expectation-driven, need-based analysis ensures near real-time model construction.
  • the preferred immersive video system, described in Section 13, reconstructs realistic monocular and stereo vistas from the viewer perspective (see, for example, Figure 33).
  • video-based systems, such as the one taught in this specification, can be very beneficial.
  • a video-based system can assist the user by assuming the low level tasks like building the structural model based on the real world, leaving only high level annotation to the user.
  • Video data can be used to collect a myriad of visual information about an environment. This information can be stored, analyzed and used to develop "virtual" models of the environment. These models, in turn, can be analyzed to determine potential changes or modifications to an environment.
  • MPI-Video might be employed at a particularly hazardous traffic configuration. Visual data of traffic would be recorded and analyzed to determine statistics about usage, accident characteristics, etc
  • changes to the environment could be designed and modeled, where input to the model again could come from the analysis performed on the real data.
  • architectural analysis could benefit by the consideration of current building structures using MPI-Video. This analysis could guide architects in the identification and modeling of building environments.
  • MPI-Video Multiple Perspective Interactive Video
  • MPI-Video is a framework for the management of, and interactive access to, multiple streams of video data capturing different perspectives of the same or of related events.
  • MPI-Video supports the collection, processing and maintenance of multiple streams of data which are integrated to represent an environment.
  • Such representations can be based solely on the "real" world recorded by the video cameras, or can incorporate elements of a "virtual" world as well.
  • the preferred MPI-Video system supports a structured approach to the construction of "virtual worlds" using video data.
  • the MPI-Video architecture is shown in Figure 1.
  • Those elements salient to the application of MPI-Video in the context of the processing and creation of "immersive visual reality" are highlighted.
  • The MPI-Video architecture involves the following operations. During processing, multiple data streams are forwarded to the Video Data Analyzer. This unit evaluates each stream to (i) detect and track objects and (ii) identify events recorded in the data. Information derived in the Video Data Analyzer is sent to the Assimilator. Data from all input streams is integrated by the Assimilator and used to construct a comprehensive representation of events occurring in the scene over time (e.g. object movements and positions).
  • the Assimilator thus models spatial-temporal activities of objects in the environment, building a so-called environment model.
  • these tracking and modeling processes provide input to the database which maintains both the annotated raw video data as well as information about object behavior, histories and events.
  • Information in the database can be queried by the user or by system processes for information about the events recorded by the video streams as well as being a data repository for analysis operations.
  • a View Selector module -- used to compute and select "best views" and further discussed below -- interfaces with the database and a user interface subsystem to select appropriate views in response to user or system input .
  • a visualizer and virtual view builder uses the raw video data, along with information from the environment model to construct synthetic views of the environment.
  • a user interface provides a variety of features to support access, control and navigation of the data.
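  • A skeleton of this dataflow (in Python; the class and method names are illustrative placeholders only, and the component internals are described elsewhere in this specification) might look as follows:

```python
class MPIVideoPipeline:
    """Minimal sketch of the module chain: Analyzer -> Assimilator -> Database ->
    View Selector -> Visualizer/virtual view builder."""
    def __init__(self, analyzer, assimilator, database, view_selector, visualizer):
        self.analyzer = analyzer          # per-stream detection, tracking, event identification
        self.assimilator = assimilator    # integrates all streams into the environment model
        self.database = database          # annotated raw video + object histories and events
        self.view_selector = view_selector
        self.visualizer = visualizer      # builds synthetic (virtual) views

    def step(self, frames, t, user_request=None):
        per_camera = [self.analyzer.analyze(f, cam, t) for cam, f in enumerate(frames)]
        model = self.assimilator.integrate(per_camera, t)   # spatial-temporal environment model
        self.database.store(t, frames, per_camera, model)
        view = self.view_selector.select(self.database, user_request)
        return self.visualizer.render(view, frames, model)
```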
  • FIG. 22a shows a schematic of this courtyard environment, indicating the positions of the cameras. Synchronized frames from each of the four cameras are shown in Figures 22b and 22c.
  • EM Environment Model
  • the preferred EM consists of a set of interdependent objects O(t).
  • This set in turn is comprised of a set of dynamic objects D(t) and a set of static objects S.
  • For instance, vehicles moving in a courtyard are dynamic objects D(t); pillars standing in the courtyard are static objects S.
  • the time variance of the set O(t) is a result of the time variation of the dynamic objects.
  • any changes that occur in one level should be propagated to other levels (higher and lower) , or at least tagged as an apparent inconsistency for future updating.
  • Each dynamic object at the lowest level has a spatial extent of exactly one grid. Objects with higher extent are composed of these grid objects, and hence belong to higher levels.
  • Each dynamic object has several attributes, the most basic being the confidence that it exists. Each of the factors listed below may contribute to either an increase or decrease in this confidence. These factors also affect the values of other object attributes.
  • the value of an object O(t), and hence the state S(t), may change due to the following factors:
  • (1) new input information, i.e., new data regarding object position from the video data;
  • (2) changes in related model information;
  • (3) advice from higher processes; and (4) decay (due to aging).
  • the preferred MPI-Video system provides facilities for managing dynamic and static objects, as is discussed further below in this section.
  • the EM, informed by the two-dimensional video data, provides a wealth of information not available from a single camera view. For instance, objects occluded in one camera view may be visible in another. In this case, comparison of objects in D(t) at a particular time instant t with objects in S can help anticipate and resolve such occlusions.
  • the model, which takes inputs from both views, can continue to update the status of an object regardless of the occlusion in a particular camera plane.
  • a representation must be chosen which facilitates maintenance of object positional information as well as supporting more sophisticated questions about object behavior.
  • the preferred dynamic model relies on the following two components .
  • The first component is voxels.
  • the environment is divided up into a set of cubic volume elements, or voxels.
  • Each voxel contains information such as which objects currently occupy this voxel, information about the history of objects in this voxel, an indication of which cameras can "see” this voxel.
  • objects can be described by the voxels they occupy. The voxel representation is discussed in greater detail in section 4.
  • the second component is (x,y,z) world coordinates.
  • the environment and objects in the environment are represented using (x,y,z) world coordinates.
  • objects can be
  • Each of these representations provides different support for modeling and data manipulation activities.
  • the preferred MPI-Vide system utilizes both representations.
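  • A minimal sketch of the two representations (the field and class names are illustrative assumptions, not the preferred system's literal data structures):

```python
from dataclasses import dataclass, field

@dataclass
class Voxel:
    """One cubic volume element of the environment."""
    occupants: set = field(default_factory=set)    # object ids currently occupying this voxel
    history: list = field(default_factory=list)    # (time, object id) pairs previously seen here
    cameras: set = field(default_factory=set)      # indices of cameras that can "see" this voxel

class EnvironmentModel:
    """Holds both representations: a (sparse) voxel grid and per-object (x, y, z) world coordinates."""
    def __init__(self, nx, ny, nz):
        self.grid = {}                              # sparse voxel grid, keyed by integer (i, j, k)
        self.object_positions = {}                  # object id -> (x, y, z) world coordinates
        self.shape = (nx, ny, nz)

    def voxel(self, i, j, k):
        return self.grid.setdefault((i, j, k), Voxel())

    def occupy(self, obj_id, i, j, k, t):
        v = self.voxel(i, j, k)
        v.occupants.add(obj_id)
        v.history.append((t, obj_id))
```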
  • the Video Data Analyzer uses image and visual processing techniques to perform object detection, recognition and tracking in each of the camera planes corresponding to the different perspectives.
  • the currently employed technique is based on differences in spatial position to determine object motion in each of the camera views. The technique is as follows.
  • each input frame is smoothed to remove some noise.
  • the difference image d(t) is computed as follows. Only pixels that are in the focus of attention windows and that are not masked are considered. (Here F(t) refers to the pixels in the focus of attention, i.e., a region of interest in the frame t.)
  • components on the thresholded binary difference image are computed based on a 4-neighborhood criterion. Components that are too small or too big are thrown away as they usually constitute noise. Also, frames that contain a large number of components are discarded. Both centroid (from the first moments), and orientation and elongation (from the second moments), are extracted for each component.
  • any of several optional filters can be applied to the components obtained from the previous step. These filters include merging of overlapping bounding boxes, hard limits of orientation and elongation, distance from expected features etc.
  • the list of components associated with each camera is sent from the Video Analysis unit to the Assimilator module which integrates data derived from the multiple streams into a comprehensive representation of the environment.
  • the Assimilator module maintains a record of all objects in the environment .
  • the Assimilator determines if the new data corresponds to an object whose identity it currently maintains.
  • the list of 2D object bounding boxes is further filtered based on global knowledge
  • each bounding box is projected to the primary surface of motion by intersecting a ray drawn from the optic center of that particular camera through the foot of the bounding box with the ground surface.
  • the object positions are projected into the next frame based on a domain-dependent tracker.
  • a current area of our research seeks to employ additional methods to determine and maintain object identity
  • active contour models can be employed in each of the cameras to track object movements. See A. M. Baumberg and D. C. Hogg, An efficient method for contour tracking using active shape models, Technical Report 94.11, School of Computer Studies, University of Leeds, April, 1994. See also M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, International Journal of Computer Vision, pages 321-331, 1988. See also F. Leymarie and M. D.
  • camera calibration process which maps related locations in the two-dimensional video data to a fully three-dimensional representation of the world recorded by the cameras. If an event is seen in one camera, e.g., a wide receiver making a catch, or a dancer executing a jump, the system, using this mapping, can determine other cameras that are also recording the event, and where in the various video frames the event is occurring. Then a viewer, or the system, can choose between these different views of the action, subject to some preference, for example, the frames which provide a frontal view of the wide receiver or the dancer. This "best view" selection is described further below and in section 14.
  • cameras can be calibrated before processing the video data using methods such as those described by Tsai and Lenz. See R. Y. Tsai and R. K. Lenz, A new technique for fully autonomous and efficient 3D robotics hand/eye calibration, IEEE Transactions on Robotics and Automation, 5(3):345-58, June 1989.
  • the preferred MPI-Video system of the present invention has the capability to integrate these techniques into our analysis and assimilation modules when they become available. To date, evaluation of the preferred MPI-Video system has been done only by use of fixed cameras. The Assimilator maintains the Environment Model discussed above.
  • a key element in the maintenance of multiple camera views is the notion of a Camera Hand-off, here understood to be the event in which a dynamic object passes from one camera coverage zone to another.
  • the Assimilator module also manages this processing task, maintaining a consistent representation of an object's identity and behavior during camera hand-off. This requires information about the object's position, its motion, etc.
  • Let c(v) be the camera list, or set, associated with a particular voxel v,
  • and let V be the set of all voxels in which an object resides.
  • the View Selector can use a variety of criteria and metrics to determine a "best” view.
  • “best” is understood to be relative to a metric either specified by the user or employed by the system in one of its processing modules.
  • the best view concept can be illustrated by considering a case where there are N cameras monitoring the environment. Cameras will be denoted by C_i, where the index i ∈ {1, ..., N} varies over all cameras. At every time step t, each camera produces a video frame F_i(t).
  • the term i_BV will be used to indicate the best view index. That is, i_BV is the index of the camera which produces the best view at time t. Then, the best view at time t is the frame F_iBV(t).
  • Some possible best view criteria include the least occluded view, the distance of the object to the camera, and object orientation.
  • the system chooses, at time t, that frame from the camera in which an object of interest is least occluded.
  • the best view camera index is defined according to the following criterion:
  i_BV = arg max_i S_i(t)
  • the object size metric S_i(t) is given by the apparent size of the object of interest in frame F_i(t), for example the count of its unoccluded object pixels.
  • the best view is the frame in which an object of interest is closest to the corresponding camera.
  • i_BV = arg min_i D_i(t), where D_i(t) is the Euclidean distance between the (x,y,z) location of camera C_i and the world coordinates of the object of interest.
  • the world coordinate representation mentioned above is most appropriate for this metric. Note also that this criterion does not require any computation involving the data in the frames. However, it does depend on three-dimensional data available from the environment model.
  • For an orientation criterion a variety of possibilities exist: for instance, direction of object motion, that view in which a face is most evident, or the view in which the object is located closest to the center of the frame. This last metric is described by,
  • CD_i(t) is given by
  CD_i(t) = sqrt( (x_i(t) - Xsize/2)^2 + (y_i(t) - Ysize/2)^2 )
  • the values Xsize and Ysize give the extent of the screen, and (x_i(t), y_i(t)) are the screen coordinates of the object's two-dimensional centroid in frame F_i(t).
  • each m_j is a metric, e.g., size as defined above, and we have M such metrics, each of which is applied to the data from each camera; hence, the C_i terms in equation (10).
  • each g_i combines these metrics for C_i, e.g. as a weighted linear sum.
  • the use of the time t in this equation supports a best view optimization which uses a temporal selection criterion involving many frames over time, as well as spatial metrics computed on each frame. This is addressed in the following paragraph.
  • the criterion G chooses between all such combinations and arg_i selects the appropriate index. For instance, G might specify the minimum value.
  • i_BV = arg_i G( g_1(t), g_2(t), ..., g_N(t) )
  • G is a criterion which chooses the optimum from the set of g_i(t)'s. Note that time does not appear explicitly in the right hand side of this equation, indicating that the same best view evaluation is applied at each time step t. Note, in this case, the same g (here, a weighted linear sum) is applied to all cameras, although this need not be the case.
  • the best view is a frame from a particular camera.
  • smoothness over time may also be important to the viewer or a system processing module.
  • a spatial metric such as object size or distance from camera is important
  • a smooth sequence of frames with some minimum number of cuts i.e. camera changes
  • best view selection can be a result of optimizing some spatial criteria such that a temporal criteria is also optimum.
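  • A sketch of a per-frame best view selection combining such spatial metrics as a weighted linear sum (the particular metric definitions, weights, and the camera/object interfaces are illustrative assumptions, not the system's literal implementation):

```python
import math

def best_view_index(cameras, obj, t, weights=(1.0, 1.0, 1.0)):
    """Return the index of the camera giving the 'best' view at time t by combining
    (i) object size, (ii) camera-object distance D_i(t), and (iii) centring in the
    frame CD_i(t) into a single score g_i(t); smaller score = better."""
    w_size, w_dist, w_centre = weights
    scores = []
    for cam in cameras:
        bbox = cam.bounding_box(obj, t)            # (x0, y0, x1, y1) screen coordinates, or None
        if bbox is None:                           # object not visible from this camera
            scores.append(float("inf"))
            continue
        size = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        dist = math.dist(cam.position, obj.world_position(t))        # Euclidean distance D_i(t)
        cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
        centre_off = math.hypot(cx - cam.xsize / 2, cy - cam.ysize / 2)   # CD_i(t)
        # larger apparent size is better, so it enters with a negative weight
        scores.append(-w_size * size + w_dist * dist + w_centre * centre_off)
    return min(range(len(scores)), key=scores.__getitem__)           # arg min over cameras = G
```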
  • x = (x, y, z, α, β, f)
  • (x, y, z) is the world coordinate position, or index, of the camera
  • α is a pan angle
  • β is a camera tilt angle
  • f is a camera parameter which determines zoom in/out.
  • the set of all such vectors x forms a 6-dimensional space, Θ. In Θ, (x, y, z) varies continuously over all points in R³, -π ≤ α, β ≤ π, and f > 0.
  • the best view is that camera positioned at location x*, where this value of the vector optimizes the criterion G, i.e., x* = arg opt over x ∈ Θ of G(x, t).
  • the camera index x can vary over all points in the environment, and the system must determine, subject to a mathematical formulation of the viewing specification, where to position the camera to satisfy a best view criterion. Views meeting this criterion can then be constructed using the techniques outlined in section 14.
  • the best view computations in equations (5), (7) and (8) can all be computed on the fly as video data comes into the system. More complex best view calculations, including those that optimize a temporal measure, may require buffered or stored data to perform best view selection.
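  • When the camera index can vary over all points in the environment, one simple (if brute-force) way to approximate the optimization over Θ is a discretized search; the following sketch assumes a caller-supplied scoring function and illustrative parameter grids.

```python
import itertools, math

def best_virtual_camera(candidate_positions, pan_angles, tilt_angles, zooms, score):
    """Search a discretised subset of the 6-D camera-parameter space
    x = (x, y, z, alpha, beta, f) for the pose that optimises the viewing
    criterion `score` (lower is better). Both the discretisation and the
    `score` function are illustrative; the true space is continuous."""
    best_x, best_value = None, math.inf
    for (px, py, pz), a, b, f in itertools.product(candidate_positions,
                                                   pan_angles, tilt_angles, zooms):
        x = (px, py, pz, a, b, f)
        value = score(x)          # e.g., occlusion and framing criteria evaluated on the model
        if value < best_value:
            best_x, best_value = x, value
    return best_x
```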
  • Figure 23 shows how a selected image sequence is derived from four cameras and the determined "best" view.
  • the "best" view is based upon two criteria, largest size and central location within the image where size takes precedence over location.
  • the outlined frames represent chosen images which accommodate the selection criteria.
  • the oval tracings are superimposed onto the images to assist the viewer in tracking the desired object.
  • the last row presents the preferred "best" view according to the desired criteria.
  • a digital zoom mechanism has been applied to the original image.
  • In images T0 and T1, only from the view of camera 3 is the desired object visible. Although all camera views detect the object in images T2 and T3, the criteria select the image with the greatest size.
  • In image T4 the object is only visible in camera 4.
  • the visualizer and virtual view builder provides processing to create virtual camera views . These are views which are not produced by physical cameras. Rather they are realistic renditions composed from the available camera views and appear as if actually recorded. Such views are essential for immersive applications and are addressed in section 4 below.
  • FIGs 27, 28 and 29 show the current Motif-based preferred MPI-Video interface.
  • This interface provides basic visualization of the model, the raw camera streams and the results of video data analysis applied to these streams.
  • its menus provide control over the data flow as well as some other options.
  • augmentations may include user selection of viewing position and manipulation (e.g. placement) of virtual model information into the environment.
  • the model shown in figures 27, 28 and 29 employs an (x, y, z) world coordinate, bounding box object representation. That is, the system tracks object centroid and uses a bounding box to indicate presence of an object at a particular location.
  • a voxel-based representation supports finer resolution of object shape and location. Such a formulation is discussed in the next section 14.
  • Immersive and interactive telepresence is an idea that has captured the imagination of science fiction writers for a long time. Although not feasible in its entirety, it is conjectured that limited telepresence will play a major role in visual communication media in the foreseeable future. See, for example, N. Negroponte, Being digi tal , Knopf, New York, 1995.
  • Immersive Video (ImmV) is a spatially-temporally realistic 3D rendition of real-world events. See the inventors' own papers: S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain, Immersive Video, Technical Report VCL-95-104, Visual Computing Laboratory, University of California, San Diego, Mar. 1995; and S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain, Visual Reality: Rendition of Live Events from Multi-Perspective Videos, Technical Report VCL-95-102, Visual Computing Laboratory, University of California, San Diego, Mar. 1995.
  • ImmV allows an interactive viewer, for example, to watch a broadcast of a football or soccer game from anywhere in the field, even from the position of the quarterback or "walk” through a live session of the US Congress .
  • Immersive Video involves manipulating, processing and compositing of video data, a research area that has received increasing attention. For example, there is a growing interest in generating a mosaic from a video sequence. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, Real-time Scene Stabilization and Mosaic Construction, in ARPA Image Understanding
  • the underlying task is to create larger images from frames obtained from a single-camera panning video stream.
  • Video mosaicing has numerous applications including data compression.
  • Another application is video enhancement. See M. Irani and S. Peleg, Motion analysis for image enhancement: resolution, occlusion, and transparency, J. of Visual Communication and Image Representation, 4(4):324-35, Dec. 1993.
  • Yet another application is the generation of panoramic views. See R. Szeliski, Image mosaicing for tele-reality applications, Proc. of Workshop on Applications of Computer Vision, pages 44-52, Sarasota, FL, Dec. 1994.
  • Algorithm 1 is the vista compositing algorithm. At each time instant, multiple vistas are computed using the current dynamic model and video streams from multiple perspectives. For stereo, vistas are created from left and right cameras.
  • a basic element of this algorithmic process is a set of transformations between the model (or world) coordinate system W : {(x_w, y_w, z_w)}, the coordinate system of the cameras C : {(x_c, y_c, z_c)}, and the vista coordinate system V : {(x_v, y_v, z_v)}.
  • the model (or world) coordinate system is W : {(x_w, y_w, z_w)}; the coordinate system of the cameras is C : {(x_c, y_c, z_c)}
  • and the vista coordinate system is V : {(x_v, y_v, z_v)}.
  • M_v is the 4 x 4 homogeneous transformation matrix representing the transformation between V and the world W [6].
  • This point is then projected onto each of the camera image planes C.
  • M_c is the 4 x 4 homogeneous transformation matrix representing the transformation between C and the world.
  • the distance of the object point (x_c, y_c, z_c) from the camera, at window coordinate (x_c, y_c), is the depth value d_c(x_c, y_c).
  • an evaluation criterion e_cv is computed for each candidate view cv.
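  • A sketch of the per-point compositing step of Algorithm 1 (the camera helper methods, the depth tolerance, and the packing of model points with their vista pixel coordinates are illustrative assumptions, not the algorithm's literal form):

```python
import numpy as np

def composite_vista(model_points, cameras, vista_shape):
    """Each 3-D model point visible in the requested vista is projected into every
    candidate camera; a camera whose stored depth agrees with the point (i.e., the
    point is unoccluded there) and whose evaluation score e_cv is best supplies the
    pixel colour. Points covered by no camera stay blank (white in the figures)."""
    H, W = vista_shape
    vista = np.zeros((H, W, 3), dtype=np.uint8)
    for pt in model_points:                       # (x_w, y_w, z_w, u_v, v_v): world point + vista pixel
        Xw = np.append(pt[:3], 1.0)
        u, v = int(pt[3]), int(pt[4])
        best_score, best_rgb = np.inf, None
        for cam in cameras:
            Xc = cam.M @ Xw                       # 4x4 world-to-camera transform M_c
            if Xc[2] <= 0:                        # behind this camera
                continue
            xc, yc = cam.project(Xc)              # camera window coordinates
            if not cam.in_frame(xc, yc):
                continue
            if abs(cam.depth(xc, yc) - Xc[2]) > cam.depth_tol:
                continue                          # occluded in this camera
            score = cam.evaluation(Xc)            # e_cv, e.g., viewing-angle difference from the vista
            if score < best_score:
                best_score, best_rgb = score, cam.color(xc, yc)
        if best_rgb is not None:
            vista[v, u] = best_rgb
    return vista
```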
  • Figure 29b illustrates how photo-realistic video images are generated by the system for a given viewpoint, in this case a ground level view overlooking the scene entrance.
  • This view was generated by the prototype immersive video system using the comprehensive 3D model built by the MPI-Video modeling system and employing Algorithm 1 for the corresponding video frames shown in Figure 28. Note that this perspective is entirely different from the original views.
  • a panoramic view of the same scene may also be produced.
  • a bird's eye view of the walkway for the same time instant is shown in Figure 29a. Again, white portions represent areas not covered by any camera. Note the alignment of the circular arc. Images from all four cameras contributed towards the construction of the views.
  • Figure 28 also illustrates the immersive abilities of the immersive video technology of the present invention by presenting selected frames from a 116-frame sequence generated for a walk through the entire courtyard.
  • the walk-through sequence illustrates how an event can be viewed from any perspective, while taking into account true object bearings and occlusions.
  • Voxels, or Spatial Occupancy Enumeration Cells -- which are cells on a three-dimensional grid representing spatial occupancy -- provide one way of building accurate and tight object models. See J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice, Addison-Wesley Publishing Company, Inc., second edition, 1990.
  • the immersive video system of the present invention uses techniques to determine occupancy of the voxels.
  • An a priori static model (which occupies the majority of filled space) is used to determine the default occupancy of the voxels.
  • the occupancy of only those voxels whose state could have changed from the previous time instant is continuously determined. Using higher level knowledge, and information from prior processing, this computation may be, and preferably is, restricted to expected locations of dynamic objects.
  • Algorithm 2 is the voxel-construction-and-visualization-for-moving-objects algorithm.
  • Figure 23 and the diagrammatic portion of Figure 31 illustrate the viewing frustums that define this space. Treating the voxels as an accumulative array to hold positive and negative evidence of occupancy, the positive evidence of occupancy for this subtended space can be increased. Similarly, the space not subtended by motion points contributes to the negative evidence.
  • Voxels are generated by integrating motion information across the four frames of Figure 27.
  • the physical dimension of each voxel is 8 dm³, or 2 x 2 x 2 dm. Comparing this with the cylindrical approximations of the MPI-Video modeling system, it is evident that more realistic virtual vistas can be created with voxels. Close contour approximations like Kalman snakes can also be used to achieve similar improvements.
  • Voxels have been traditionally vilified for their extreme computing and storage requirements. To even completely fill a relatively small area like the courtyard used in the prototype system, some 14.4 million 1 dm³ voxels are needed. With the recent and ongoing advances in storage and computing, this discussion may be moot. High speed, multi-processor desk-top machines with enormous amounts of RAM and secondary storage have arrived (e.g., high-end desk top computers from SGI). However, for efficiency considerations and elegance, it is herein discussed how storage and computing requirements can be greatly reduced using certain assumptions and optimizations.
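  • The evidence accumulation of Algorithm 2 can be sketched as follows (with the voxel grid restricted to expected dynamic-object regions; the projection helper and the evidence step size are illustrative assumptions):

```python
import numpy as np

def update_voxel_evidence(evidence, voxel_centres, cameras, motion_masks, step=1.0):
    """Accumulate positive/negative occupancy evidence for candidate dynamic-object voxels.
    `evidence[i]` corresponds to `voxel_centres[i]`; `motion_masks[c]` is 1 inside the
    detected motion regions of camera c."""
    for cam, mask in zip(cameras, motion_masks):
        for i, centre in enumerate(voxel_centres):
            px = cam.project_world_point(centre)   # None if this voxel is outside the camera's view
            if px is None:
                continue
            u, v = int(px[0]), int(px[1])
            if mask[v, u]:
                evidence[i] += step                # voxel lies in the space subtended by motion
            else:
                evidence[i] -= step                # visible to this camera but not moving there
    return evidence
```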
  • the dynamic objects are assumed to be limited in their vertical extent.
  • all dynamic objects are in the range of 10-20 dm in height.
  • bounds are put on where the objects may be at the current time instant based on prior state and tracking information.
  • the former assumption reduces the number of voxels by limiting the vertical dimension.
  • voxels are dynamically allocated to certain limited regions in the environment, and it is assumed that the remaining space retains the characteristics of the a priori static model.
  • the number of voxels becomes a function of the number of expected dynamic objects instead of being a function of the total modeled space. While making these assumptions, and using two representations, slightly complicates spatial reasoning, the complexity in terms of storage and computation is greatly reduced.
  • Figure 15 shows the hardware configuration of the prototype immersive video system incorporating MPI video.
  • the preferred setup consists of several independent heterogeneous computers.
  • one work station is used to process data from a single camera, preferably a Model 10 or 20 work station available from Sun.
  • multiple video processing modules can run on a reduced number of work stations (down to a single work station) .
  • a central (master) graphics work station (an SGI Indigo2, Indy or Challenge) controls these video processing work stations (slaves) and maintains the Environment Model (and associated temporal database).
  • the central master and the remote slaves communicate at a high symbolic level and minimal image information is exchanged.
  • object bounding box information is sent from the slaves to the master.
  • actual image data need not be exchanged, resulting in a very low required network bandwidth for master-slave communication.
  • the work stations in the prototype system are connected on a 120 Mbps Ethernet switch which guarantees full-speed point-to-point connection.
  • the master-slave information exchange protocol is as follows: First, the master initializes graphics, the database and the Environment Model (EM) , and waits on a pre-specified port.
  • EM Environment Model
  • each slave contacts the master (using a pre-specified machine-port combination) and an initialization hand-shaking protocol ensues.
  • the master acknowledges each slave and sends it initialization information, e.g., where the images are actually stored (for the laboratory case), the starting frame and frame interval, and camera-specific image-processing information like thresholds, masks etc.
  • each slave initializes itself based on the information sent by the master.
  • the master sends a request to that particular slave with information about processing the frame, viz. focus of attention windows, frame-specific thresholds and other parameters, current and expected locations and identifications of moving objects etc., and continues its own processing (modeling and user interaction). (The focus of attention is essentially a region of interest in the image specifying where the visual processing algorithms should concentrate their action.)
  • In synchronous mode, requests to all slaves are sent simultaneously and the integration is done after all slaves have responded. In asynchronous mode, this will not necessarily go in unison.
  • the frame information is used to update the Environment Model (EM) .
  • EM Environment Model
  • Immersive Video so far presented has used multi-perspective video and a priori maps to construct three-dimensional models that can be used in interaction and immersion for diverse virtual world applications.
  • One of these applications is real-time virtual video, or virtual television, or telepresence -- next discussed in the following section 6.
  • Various ways of presenting virtual video information have been discussed. Selection of the best view, creation of visually realistic virtual views, and interactive querying of the model have also been discussed. The actual
  • Immersive telepresence is an immersive, interactive and realistic real-time rendition of real-world events captured by multiple video cameras placed at different locations in the environment. It is the real-time rendition of virtual video -- "virtual television" instead of just "virtual video".
  • immersive telepresence is based on and incorporates Multiple Perspective Interactive Video (MPI-Video) infrastructure for processing video data from multiple perspectives.
  • MPI-Video Multiple Perspective Interactive Video
  • the differences between immersive telepresence and immersive video are these: First, more computer processing time is clearly available in non-real-time immersive video than in immersive telepresence. This may not be, however, of any great significance. More importantly, with immersive video the scene model may be revised, so as to improve the video renderings on an iterative basis and/or to account for scene occurrences that are unanticipated and not within the modeled space, e.g., the parachutist falling in elevation into the scene of a football game, which motion is totally unlike the anticipated motion of the football players and is not at or near ground level.
  • the scene models used for immersive telepresence have been developed, and validated, for virtual video.
  • It is not required that a scene should be "canned", or rote. It is, however, required that the structure of the scene (note that the scene has "structure", and is not a "windy jungle") should be, to a certain extent, pre-processed into a scene model.
  • scene models will grow in sophistication, integration, and comprehensiveness, becoming able to do better in presentation, with fewer video feeds, faster.
  • Telepresence is generally understood in the context of virtual reality (VR) with displays of real, remote scenes.
  • VR Virtual reality
  • This specification and this section instead describe immersive telepresence, being the real-time interactive and realistic rendition of real-world events, i.e., television where the viewer cannot control (does not interact with) what is happening in a real-world scene, but can interact with how the scene is viewed.
  • Jaron Lanier defines Virtual Reality as an immersive, interactive simulation of realistic or imaginary environments. See J. Lanier, Virtual reality: the promise of the future, Interactive Learning International, 8(4):275-9, Oct.-Dec. 1992.
  • the new concept called visual reality is an immersive, interactive and realistic rendition of real-world events simultaneously captured by video cameras placed at different locations in the environment.
  • Visual reality uses the Multiple Perspective Interactive Video (MPI-Video) infrastructure. See S.
  • MPI-Video Multiple Perspective Interactive Video
  • MPI-Video is a move away from conventional video-based systems, which permit users only a limited amount of control and insight into the data.
  • Traditional systems provide a sparse set of actions such as fast-forward, rewind and play of stored information. No provision is made for automatic analysis and management of the raw video data.
  • Visual Reality involves manipulating, processing and compositing of video data, a research area that has received increasing attention. For example, there is a growing interest in generating a mosaic from a video sequence. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, Real-time scene stabilization and mosaic construction, in ARPA Image Understanding Workshop, Monterey, CA, Nov. 13-16, 1994. See also H. Sawhney, Motion video annotation and analysis: An overview, in Proc. 27th Asilomar Conference on Signals, Systems and Computers, pages 85-89, IEEE, Nov. 1993.
  • Video mosaicing has numerous applications including data compression, video
  • Section 6.2 recapitulates the concepts of MPI-Video as especially applied to VisR.
  • Section 6.3 provides implementation details and presents results for the same campus walkway covered by multiple video cameras -- only
  • MPI-Video is a framework for management of, and interactive access to, multiple streams of video data capturing different perspectives of related events. It involves automatic or semi-automatic extraction of content from the data streams, modeling of the scene observed by these video streams, and management of raw, derived and associated data. These video data streams can reflect different views of events such as movements of people and vehicles.
  • MPI-Video also facilitates access to raw and derived data through a sophisticated hypermedia and query interface.
  • a user, or an automated system, can query about objects and events in the scene, follow a specified object as it moves between zones of camera coverage, and select from multiple views.
  • a schematic showing multiple camera coverage typical in an MPI-Video analysis was shown in Figure 22a.
  • Visual reality: While in virtual reality (VR) texture mapping is used to create realistic replicas of both static and dynamic components, in visual reality (VisR), distinctively, actual video streams are used. Ideally, exact ambiance will always be reflected in the rendition, i.e., purely two-dimensional image changes are also captured.
  • in VisR a viewer is able to move about a football stadium and watch the spectators from anywhere in the field and see them waving, moving, etc.
  • the current prototype immersive telepresence system is used in conjunction with multiple actual video feeds of a real-world scene to compose vistas of this scene.
  • Experimental results obtained for a campus scene show how an interactive viewer can "walk through" this dynamic, live environment as it exists in real time (e.g., as seen through television).
  • Any comprehensive three-dimensional model consists of static and dynamic components.
  • a priori information, e.g., a CAD model, about the environment is used.
  • the model views are then registered with the cameras.
  • Accurate camera calibration plays a significant role in this.
  • for the dynamic model it is necessary to (i) detect the objects in the images from different views, (ii) position them in 3-D using calibration information, (iii) associate them across multiple perspectives, and (iv) obtain their 3-D shape characteristics.
  • a complete, geometric 3-D model of a campus scene was constructed using architectural map data.
  • the VisR system must and does extract information from all the video streams, reconciling extracted information with the 3-D model.
  • a scene representation was chosen which facilitates maintenance of object location and shape information.
  • object information is stored as a combination of voxel representation, grid-map representation and object-location representation. Note the somewhat lavish use of information.
  • the systems of the present invention are generally compute limited, and are generally not limited in storage. Consider also that more and faster storage may be primarily a matter of expending more money, but there is a limit to how fast the computers can compute no matter how much money is expended. Accordingly, it is generally better to maintain an information-rich texture from which the computer(s) can quickly recognize and maintain scene objects than to use a more parsimonious data representation at the expense of greater computational requirements. (A sketch of the combined representation follows below.)
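By way of a non-limiting illustration, the combined voxel / grid-map / object-location representation may be sketched in Python as follows. The class and field names are illustrative assumptions, not the appendix source code; the deliberate redundancy of storage mirrors the information-rich representation preferred above.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    """Object-location representation: one entry per tracked dynamic object."""
    object_id: int
    position: tuple        # (x, y, z) position in world coordinates, metres
    footprint_cells: list  # grid-map cells the object currently occupies
    voxels: set            # occupied (i, j, k) voxel indices giving 3-D shape

class SceneRepresentation:
    """Deliberately redundant storage: voxels for shape, a 2-D grid map for
    fast occupancy queries, and an object table for direct lookup."""
    def __init__(self, grid_w, grid_h, voxel_size=0.25):
        self.voxel_size = voxel_size
        self.voxels = set()                        # all occupied voxels
        self.grid = [[[] for _ in range(grid_w)]   # grid cell -> object ids
                     for _ in range(grid_h)]
        self.objects = {}                          # object id -> SceneObject

    def insert(self, obj: SceneObject):
        self.objects[obj.object_id] = obj
        self.voxels |= obj.voxels
        for (row, col) in obj.footprint_cells:
            self.grid[row][col].append(obj.object_id)
```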
  • the prototype VisR, or telepresence, system is able to compute the 3-D position of each dynamic object detected by a motion segmentation module in real time.
  • a priori information about the scene and camera calibration parameters, coupled with the assumption that all dynamic objects move on planar surfaces, permits object detection and localization. Note the similarity of the constraints on object motion(s), and the use of a priori information, to immersive video.
  • necessary positional information is extracted from each view. The extracted information is then assimilated and stored in a 2D grid representing the viewing area.
  • moving objects. This also initializes/updates a tracker which exchanges information with a global tracker that maintains state information for all the moving objects. (A sketch of this ground-plane localization and tracking step follows below.)
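A minimal sketch of this localization and tracking step, assuming each camera is calibrated with an intrinsic matrix K and world-to-camera pose (R, t), and that each detected object's lowest image point lies on the ground plane z = 0, might read as follows. The function and class names are illustrative only, not the appendix source code.

```python
import numpy as np

def ground_point_from_pixel(u, v, K, R, t):
    """Back-project an image pixel (u, v) -- e.g. the lowest point of a detected
    moving object -- onto the ground plane z = 0. K is the 3x3 intrinsic matrix;
    R, t are the world-to-camera rotation and translation."""
    camera_center = -R.T @ t                                  # camera centre in world frame
    ray = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])      # viewing ray in world frame
    s = -camera_center[2] / ray[2]                            # intersect with plane z = 0
    return camera_center + s * ray                            # (x, y, ~0) world position

class GlobalTracker:
    """Maintains the state of every moving object; per-camera trackers report
    ground-plane positions which are merged by nearest-neighbour gating."""
    def __init__(self, gate=1.0):
        self.gate = gate          # association radius in metres (assumed value)
        self.tracks = {}          # track id -> last (x, y) position
        self.next_id = 0

    def report(self, xy):
        xy = np.asarray(xy, dtype=float)
        for track_id, position in self.tracks.items():
            if np.linalg.norm(xy - position) < self.gate:
                self.tracks[track_id] = xy        # update an existing track
                return track_id
        track_id = self.next_id                   # otherwise start a new track
        self.next_id += 1
        self.tracks[track_id] = xy
        return track_id
```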
  • FIG. 104 of the present invention involved the same campus scene (actually, a courtyard) as was used for the immersive video.
  • the scene was covered by four cameras at different locations.
  • Figure 22a shows the model schematic (of the environment) along with the camera positions. Note that though the zones of camera coverage have significant overlaps, they are not identical, thus effectively increasing the overall zone being covered.
  • Figure 27 shows corresponding frames from four views of the courtyard with three people walking.
  • the model view of the scene is overlaid on each image.
  • Figure 28 shows some "snapshots" from a 116-frame sequence generated for a "walk through” the entire courtyard. People in the scene are detected and modeled as cylinders in our current implementation as shown in Figure 29.
  • the "walk” sequence illustrates how an event can be viewed from anywhere, while taking into account true object bearings and pertinent shadows.
  • Figure 29a shows a ground-level view of the scene, and Figure 29b a bird's eye view from the top of the scene. Each view is without correspondence to any view within any of the video streams.
  • the prototype VisR system serves to render live video from multiple perspectives. This provides a true immersive telepresence with simple processing modules.
  • Immersive video may be divided into real-time applications, i.e., immersive television, and all other, non-real-time, applications where there is, suddenly, more time to process video of a scene.
  • an immersive video system based on a single engineering work station class computer can, at the present time, process and monitor (being two separate things)
  • Such a system can, for example, perform the function of a "television sports director" -- at least so far as a "video sports director" focused on limited criteria -- reasonably well.
  • the immersive video "sports director" would, for example, be an aid to the human sports director, who would control the live television primary feed of a televised sporting event such as a football game.
  • the immersive video "sports director" might be tasked, for example, to "follow the football". This view could go out constantly upon a separate television channel.
  • the view would normally only be accessed upon selected occasions such as, for example, an "instant replay".
  • the synthesized virtual view is immediately ready, without any such delay as normally presently occurs while humans figure out what camera or cameras really did show the best view(s) of a football play, upon the occasion of an instant replay.
  • the synthesized view generally presenting the "football" at center screen can be ordered. If a particular defensive back made a tackle, then his movements throughout the play may be of interest. In that case a sideline view, or helmet view, of this defensive back can be ordered.
  • multiple video views can be simultaneously synthesized, each transmission upon a separate television channel. Certain channels would be devoted to views of certain players, etc.
  • An immersive video system can be directed to synthesize and deliver up "heads-up facial view" images of people in a crowd, one after the next and continuously, as and when camera(s) angle(s) permit the capture/synthesis of a quality image.
  • the immersive system can image, re-image and synthetically image anything that its classification stage suspects to be a "firearm".
  • the environment model of a football game expects the players to move but the field to remain fixed
  • the environment model of a secured area can expect the human actors therein to move but the moveable physical property (inventory) to remain fixed or relatively fixed, and not to merge inside the human images as might be the case if the property was being concealed for purposes of theft .
  • an immersive video system is directed to image synthesis and presentation, and not to image classification.
  • the immersive video system has good ability (as it should, at its high cost) to permit existing computer image classification programs to successfully recognize deviations -- objects in the scene or events in the scene.
  • the three-dimensional database, or world model, within an immersive video system can be the input to three-, as opposed to two-, dimensional classification programs.
  • Human faces (and heads) in particular might be matched against stored data representing existing, candidate, human heads in three dimensions.
  • Machine classification of human facial images is expected to be much improved if, instead of just one video view at an essentially random view angle, video of an entire observed head is available for comparison.
  • Command and control computers should perhaps compensate for the crudity of their environmental models by assimilating more video data inputs derived from more spatial sites.
  • humans, as supported by present-day military computer systems, already recognize the great utility of sharing tactical information on a theater-of-warfare basis.
  • NTDS Naval Tactical Data System
  • An attached appendix contains the computer program source code for realizing immersive video in accordance with the present invention.
  • a scene from the 3-D database can be "played back" at normal, real-time, speeds and in accordance with the particular desires of a particular end viewer/user by use of a computer, normally a personal computer, of much less power than the computer(s) that created the 3-D database. Every man or woman will thus be accorded an aid to his or her imagination, and can, as did the fictional Walter Mitty, enter into any scene and into any event.
  • For example, one immediate use of immersive video is in the analysis of athlete behaviors.
  • An athlete, athlete in training, or aspiring athlete performs a sports motion such as, for example, a golf swing that is videotaped from multiple, typically three, camera perspectives.
  • a 3-D video model of the swing, which may only be a matter of ten or so seconds, is constructed at leisure, perhaps over some minutes, in a personal computer.
  • a student golfer and/or his/her instructor can subsequently play back the swing from any perspective that best suits observation of its salient characteristics, or those of its attributes that are undergoing corrective revision.
  • if two such 3-D models of the same golfer are made, one can be compared against the other for deviations, which may possibly be presented as colored areas or the like on the video screen. If a model of an expert golfer, or a composite of expert golfers, is made, then the swing of the student golfer can be compared in three dimensions to the swing(s) of the expert golfer(s).
  • MRI Magnetic Resonance Imaging
  • immediate medical applications of immersive video in accordance with the present invention are much more mundane.
  • a primary care physician might, instead of just recording patient height and weight and relying on his or her memory from one patient visit to the next, simply videotape the standing patient's unclothed body from multiple perspectives at periodic intervals, an inexpensive procedure conducted in but a few seconds.
  • Three-dimensional patient views constructed from each session could subsequently be compared to note changes in weight, general appearance, etc.
  • the three-dimensional imaging of video information (which video information need not, however, have been derived from video cameras) as is performed by the immersive video system of the present invention will likely be useful for machine recognition of pathologies.
  • a computer is inaccurate in interpreting, for example, x-ray mammograms, because it looks at only a two-dimensional image with deficient understanding of how the light and shadow depicted thereon translates to pathology of the breast.
  • it is not that a tumor might be small, but that a small object shown at low contrast and low visual signal-to-noise is difficult to recognize in two dimensions. It is generally easier to
  • Another use of the same 3-D human images realized with the immersive video system of the present invention would be in video representations of the prospective results of reconstructive or cosmetic (plastic) surgery, or of exercise regimens.
  • the surgeon or trainer would modify the body image, likely by manipulation of the 3-D image database as opposed to 2-D views thereof, much in the manner that any computerized video image is presently edited.
  • the patient/client would be presented with the edited view(s) as being the possible or probable results of surgery, or of exercise.

Abstract

Immersive video, or television, images of a real-world scene are synthesized (i) on demand, (ii) in real time, (iii) as linked to any of a particular perspective on the scene, or an object or event in the scene, (iv) in accordance with user-specified parameters of presentation, including panoramic or magnified presentations, and/or (v) stereoscopically. Multiple video cameras (10a-10c) each at a different spatial location produce multiple two-dimensional video images of the scene. A viewer/user specifies viewing criterion (ia) at a viewer interface (10). A video display (18) receives and displays the synthesized 2-D video image(s).

Description

IMMERSIVE VIDEO
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally concerns (i) multimedia, (ii) video, including video-on-demand and interactive video, and (iii) television, including television-on-demand and interactive television.
The present invention particularly concerns automated dynamic selection of one video camera/image from multiple real video cameras/images in accordance with a particular perspective, an object in the scene, or an event in the video scene.
The present invention also particularly concerns the synthesis of diverse spatially and temporally coherent and consistent virtual video cameras, and virtual video images, from multiple real world video images that are obtained by multiple real video cameras.
The present invention still further concerns the creation of three-dimensional video image databases, and the location and dynamical tracking of video images of selected objects depicted in the databases for, among other purposes, the selection of a real camera or image, or the synthesis of a virtual camera or image, best showing the selected object.
The present invention still further concerns (i) interactive synthesis of video, or television, images of a real- world scene on demand, (ii) the synthesis of virtual video images of a real-world scene in real time, or virtual television, (iii) the synthesis of virtual video images/virtual television pictures of a real-world scene which video images/virtual television are linked to any of a particular perspective on the video/television scene, an object in the video/television scene, or an event in the video/television scene, (iv) the synthesis of virtual video images/virtual television pictures of a real-world scene wherein the pictures are so synthesized to user-specified parameters of presentation, e.g. panoramic, or at magnified scale if so desired by the user, and (v) the synthesis of 3-D stereoscopic virtual video images/virtual television.
2. Description of the Prior Art
2.1 Limitations in the Present Viewing of Video and Television
The traditional model of television and video is based on a single video stream transmitted to a passive viewer. A viewer has the option to watch the particular video stream, and to re-watch should the video be recorded, but little else. Due to the emergence of the information highways and other related information infrastructure circa 1995, there has been considerable interest in concepts like video-on-demand, interactive movies, interactive TV, and virtual presence. Some of these concepts are exciting, and suggest many dramatic changes in society due to the continuing dawning of the information age.
It will shortly be seen that this specification teaches that a novel form of video, and television, is possible where a viewer of video, or television, depicting a real-world scene may select a particular perspective from which perspective the scene will henceforth be presented. The viewer may alternatively select a particular object -- which may be a dynamically moving object -- or even an event in the real-world scene that is of particular interest. As the scene develops, it will then be presented to the viewer with the selected object or the selected event (if occurring) prominently featured.
Accordingly, video presentation of a real-world scene in accordance with the present invention will be seen to be interactive with both (i) a viewer of the scene and, in the case of a selected dynamically moving object, or an event, in the scene, (ii) the scene itself. True interactive video or television is thus presented to a viewer.
In an extension of the present invention the image presented to the viewer will be seen to be a virtual image that is not mandated to correspond to any real-world camera nor to any real-world camera image. A viewer may thus view a video or television of a real-world scene from a vantage point (i.e., a perspective on the video scene), and/or dynamically in response to objects moving in the scene and/or events transpiring in the scene, in a manner that is not possible in reality. The viewer may, for example, view the scene from a point in the air above the scene, or from the vantage point of an object in the scene, where no real camera exists or even, in some cases, can exist.
This video system, and approach, is called Multiple Perspective Interactive ("MPI") video. MPI video will be seen to be the basis, and the core, of the even more sophisticated "immersive video" (non-real-time) and "immersive telepresence" or "Visual Reality (VisR)" (real-time) systems of the present invention.
MPI video supports the editing of, and viewer interaction with, video and television in a manner that is useful in viewing activities ranging from education to entertainment. In particular, in conventional video, viewers are substantially passive; all they can do is to control the flow of video by pressing buttons such as play, pause, fast forward or fast reverse. These controls essentially provide the viewer only one choice for a particular segment of video: the viewer can either see the video (albeit at a controllable rate), or skip it.
In the case of live television broadcast, viewers have essentially no control at all. A viewer must either see exactly what a broadcaster chooses to show, or else change away from that broadcaster and station. Even in sports and other broadcast events where multiple cameras are used, a viewer has no choice except the obvious one of either viewing the image presented or else using a remote control so as to "surf" multiple channels.
Interactive video and television systems such as MPI video make good use of the availability of increased video bandwidth due to new satellite and fiber-optic video links, and due to advances in several areas of video technology. Author George Gilder argues that because the viewers really have no choice in the current form of television, it is destined to be replaced by a more viewer-driven system or device. See George Gilder, Life After Television: The Coming Transformation of Media and American Life, W. W. Norton & Co., 1994.
The related invention of MPI video makes considerable progress -- even by use of currently existing technology -- towards "liberating" video and TV from the traditional single-source, broadcast, model, and towards placing each viewer in his or her own "director's seat".
A three-dimensional (3-D) video model, or database, is used in MPI video. The immersive video and immersive telepresence systems of the present invention preserve, expand, and build upon this 3-D model. This three-dimensional model, and the functions that it performs, are well and completely understood, and will be completely taught within this specification. However, the considerable computational power required if a full custom virtual video image for each viewer is to be synthesized in real time and on demand requires that the model should be constructed and maintained in consideration of (i) powerful organizing principles, (ii) efficient algorithms, and (iii) effective and judicious simplifying assumptions. This then, and more, is what the present invention will be seen to concern.
2.2 Previous Scene-Interactive Video and Television
Existing scene-interactive video and television is nothing so grandiose as permitting a user/viewer to interact with the objects and/or events of a scene -- as will be seen to be the subject of the present and related inventions. Rather, the interaction with the scene is simply that of a machine -- a computer -- that must recognize, classify and, normally, adapt its responses to what it "sees" in the scene. Scene-interactive video and television is thus simply an extension of machine vision so as to permit a computer to make decisions, sound alarms, etc., based on what it detects in, and detects to be transpiring in, a video scene. Two classic problems in this area (which problems are not commensurate in difficulty) are (i) security cameras, which must detect contraband, and (ii) an autonomous computer-guided automated battlefield tank, which must sense and respond to its environment.
The general concepts, and voluminous prior art, concerning "machine vision", "(target) classification", and "(target) tracking" are all relevant to the present invention. However, the video and television systems of the present invention -- while doing very, very well in each of viewing, classifying and tracking -- will be seen to come to these problems from a very different perspective than does the prior art. Namely, the prior art considers platforms -- whether they are rovers or warships -- that are "located in the world", and that must make sense of their view thereof from essentially but a single perspective centered on present location.
The present invention functions oppositely. It "defines the world", or at least so much of the world as is "on stage" and in view to (each of) multiple video cameras. The video and television systems of the present invention have at their command a plethora of correlatable and correlated, simultaneous, positional information. Once it is known where each of multiple cameras is, and is pointing, it is a straightforward matter for computer processes to fix, and to track, items in the scene.
The systems, including the MPI-video subsystem, of the present invention will be seen to perform co-ordinate transformation of (video) image data (i.e., pixels), and to do this during the generation of two- and three-dimensional image databases.
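Such a co-ordinate transformation is, at its core, the standard pinhole-camera projection between world and image coordinates. The following Python sketch is illustrative only: the calibration values shown are assumed, and the function name project_to_pixel is ours rather than that of the appendix source code.

```python
import numpy as np

def project_to_pixel(X_world, K, R, t):
    """Transform a 3-D world point into 2-D pixel coordinates for one calibrated
    camera (world -> camera -> image). K is the 3x3 intrinsic matrix; R, t map
    world coordinates into the camera frame."""
    X_cam = R @ np.asarray(X_world, dtype=float) + t
    uvw = K @ X_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Example with assumed calibration: a camera 10 m above the origin, looking
# straight down (R is a 180-degree rotation about the x axis).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.array([[1.0,  0.0,  0.0],
              [0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0]])
t = np.array([0.0, 0.0, 10.0])
u, v = project_to_pixel((1.0, 2.0, 0.0), K, R, t)   # pixel of a ground point
```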
2.3 Previous Composite Video and Television
The present invention of immersive video will be seen to involve the manipulation, processing and compositing of video data in order to synthesize video images. (Video compositing is the amalgamation of video data from separate video streams.) It is known to produce video images that -- by virtue of view angle, size, magnification, etc. -- are generally without exact correspondence to any single "real-world" video image. The previous process of so doing is called "video mosaicing".
The underlying task in video mosaicing is to create larger images from frames obtained from one or more single cameras, typically one single camera producing a panning video stream. To generate seamless video mosaics, registration and alignment of the frames from a sequence are critical issues. Next, the immersive video system of the present invention will be seen to use its several streams of 2-D video data to build and maintain a 3-D video database. The utility of such a 3-D database in the synthesis of virtual video images seems clear. For example, an arbitrary planar view of the scene will contain the data of a 2-D planar slice "through" the 3-D database.
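As a concrete, if deliberately simplified, illustration: assuming the 3-D database is held as a mapping from integer voxel indices to colour values (an assumption made only for this sketch), an axis-aligned planar slice may be read out as follows; an arbitrarily oriented plane would additionally require resampling.

```python
def planar_slice(voxel_db, z_index, x_range, y_range):
    """Extract the 2-D image lying in the horizontal plane z = z_index of a voxel
    database. voxel_db maps (i, j, k) voxel indices to RGB tuples; cells holding
    no voxel are returned as None (background)."""
    image = []
    for j in range(*y_range):
        row = [voxel_db.get((i, j, z_index)) for i in range(*x_range)]
        image.append(row)
    return image

# Usage: a slice one voxel above the ground through a tiny example database.
db = {(0, 0, 1): (255, 0, 0), (1, 0, 1): (0, 255, 0)}
slice_image = planar_slice(db, z_index=1, x_range=(0, 4), y_range=(0, 4))
```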
The limitation on such a scheme of an information-intensive representation, and manipulation, of the video data of a real-world scene is that a purely "brute force" approach is impossible with presently available technology. The "trade-off" in handling a lot of video data is that (i) certain scene (or at least scene video) constraints must be imposed, (ii) certain simplifying assumptions must be made (regarding the content of the video information), (iii) certain expediencies must be embraced (regarding the manipulations of the video data), and/or (iv) certain limitations must be put on what images can, or cannot, be synthesized from such data. (The present invention will be seen to involve essentially no (iv) limitations on presentation.) Insofar as the necessary choices and trade-offs are astutely made, then it may well be possible to synthesize useful and aesthetically pleasing video, and even television, images by the use of tractable numbers of affordable computers and other equipment running software programs of reasonable
The immersive video system of the present invention will so show that -- (i) certain scene constraints being made, (ii) certain simplifying assumptions being made regarding scene objects and object dynamical motions, and (iii) certain computational efficiencies in the manipulations of video data being embraced -- it is indeed possible, and even practical, to so synthesize useful and aesthetically pleasing video, and even television, images.
SUMMARY OF THE INVENTION
1. Machine Dynamic Selection of One Video Camera/Image of a Scene from Multiple Video Cameras/Images of the Scene in Accordance with a Particular Perspective on the Scene, an Object in the Scene, or an Event in the Scene
The present invention contemplates machine dynamic selection of one video camera/image of a scene from multiple video cameras/images of the scene in accordance with a particular perspective on the scene, an object in the scene, or an event in the scene.
The present invention thus contemplates making each and any viewer of a video or a television scene to be his or her own proactive editor of the scene, having the ability to interactively dictate and select -- in advance of the unfolding of the scene, and by high-level command -- a particular perspective by which the scene will be depicted, as and when the scene unfolds.
The viewer can command the selection of real, or -- in advanced embodiments of the invention -- even the synthesis of virtual, video images of the scene in response to any of his or her desired and selected (i) spatial perspective on the scene, (ii) static or dynamically moving object appearing in the scene, or (iii) event depicted in the scene. The viewer -- any viewer -- is accordingly considerably more powerful than even the broadcast video editor of, for example, a live sporting event circa 1995. The viewer is accorded the ability to (i) select in advance a preferred video perspective of view as optionally may be related to dynamic object movements and/or to events unfolding in the scene, and even, as the ultimate extension of the invention, (ii) to synthesize video views where no real video camera even exists.
1.1 The Basis of the Present Invention in Multiple Perspective Interactive (MPI) Video
The basis, and most basic part, of the present invention is called Multiple Perspective Interactive, or MPI, Video. MPI Video forms the core of the Immersive Video discussed hereinafter in section 3.
For example, in accordance with the present invention of MPI Video a viewer of an American football game on video or on television can command a consistent "best" view of (i) one particular player, or, alternatively, (ii) the football itself as it will be, from time to time, handled by many players. The system receives and processes multiple video views (images) generally of the football field, the football and the players within the game. The system classifies, tags and tracks objects in the scene, including static objects such as field markers, and dynamically moving objects such as the football and the football players. Some of the various views (images) will at times, and from time to time, be "better" -- by various criteria -- in showing certain things than are other views.
In the rudimentary embodiment of the invention taught within this specification the system will consistently, dynamically, select and present a single "best" view of the selected object (for example, the football, or a particular player) . This will require, and the system will automatically accomplish, a "handing off" from one camera to another camera as different ones of multiple cameras best serve to image over time the selected object. In the ultimate extension of the present invention, the viewer can ask to be shown a synthesized video view, such as from a perspective constantly positioned behind a certain offensive running back, where no real video camera actually exists.
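One way such an automatic "hand-off" might be realized -- a sketch under assumptions of our own, not a statement of the appendix source code -- is to score every camera on how squarely it faces, and how near it is to, the tracked object, and to switch the presented feed whenever the best-scoring camera changes:

```python
import math

def best_camera(cameras, object_xyz):
    """Choose the camera giving the 'best' view of a tracked object. Each camera
    is a dict with 'id', 'position' (x, y, z), 'aim' (unit view direction) and
    'occluded' (whether the object is hidden in that view). The score favours
    cameras that face the object squarely and are close to it."""
    best_id, best_score = None, -math.inf
    for cam in cameras:
        if cam['occluded']:
            continue
        to_obj = [o - p for o, p in zip(object_xyz, cam['position'])]
        dist = math.sqrt(sum(d * d for d in to_obj)) or 1e-9
        facing = sum(a * d for a, d in zip(cam['aim'], to_obj)) / dist
        score = facing - 0.01 * dist          # weighting is illustrative only
        if score > best_score:
            best_id, best_score = cam['id'], score
    return best_id

# A hand-off occurs whenever best_camera() returns a different id than it did
# for the previous frame; the displayed feed is then switched accordingly.
```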
The system of the invention is powerful (i) in accepting viewer specification, at a high level, of those particular objects and/or events in the scene that the user/viewer desires to be shown, and (ii) in subsequently identifying and tracking all user/viewer-selected objects and events (and still others for other users/viewers) in the scene.
The system of the present invention can also, based on its scene knowledge database, serve to answer questions about the scene .
Finally, the system of the present invention can replay events in the scene from the same perspective, or from selected new perspectives, depending upon the desires of the user/viewer. It is not necessary for the user/viewer to "find" the best and proper image; the system performs this function. For example, if the user/viewer wants to see how player number twenty (#20) came to make an interception in the football game, then he or she could order a replay of the entire down focused on player number twenty (#20) .
For example, and continuing with the example of an American football game, an individual viewer can ask questions like: Who is the particular player shown marked by my cursor? Where is player Mr. X? Where is the football?
In advanced, image-synthesizing, embodiments of the system of the present invention, the user/viewer can generate commands like: "replay for me at 1/2 speed the event of the fumble as shown from a straight overhead view" . Such commands are honored by the system of the present invention even though no real video camera may, in actuality, exist at this precise overhead location.
1.2 Machine Dynamic Selection of One Video Camera/Image of a Scene from Multiple Video Cameras/Images of the Scene in Accordance with a Particular Perspective on the Scene, an Object in the Scene, or an Event in the Scene
The present invention contemplates selecting real, or -- in advanced embodiments -- synthesizing virtual, video/television images of a scene from multiple real video/television images of the scene, particularly so as to select or to synthesize video/television images that are linked to any such (i) spatial perspective(s) on the scene, (ii) object(s) in the scene, or (iii) event(s) in the scene, as are selectively desired by a user/viewer to be shown.
The method of the invention is directed to presenting to a user/viewer a particular, viewer-selected, two-dimensional video image of a real-world, three-dimensional, scene. In order to do so, multiple video cameras, each at a different spatial location, produce multiple two-dimensional images of the real-world scene, each at a different spatial perspective. Objects of interest in the scene are identified and classified in these two-dimensional images. These multiple two-dimensional images of the scene, and their accompanying object information, are then combined in a computer into a three-dimensional video database, or model, of the scene. The database is called a model because it incorporates information about the scene as well as the scene video. It incorporates, for example, a definition, or "world view", of the three-dimensional space of the scene. The model of a football game knows, for example, that the game is played upon a football field replete with static, fixed-position, field yard lines and hash mark markings, as well as of the existence of the dynamic objects of play. The model is, it will be seen, not too hard to construct so long as there are, or are made to be, sufficient points of reference in the imaged scene. It is, conversely, almost impossible to construct the 3-D model, and select or synthesize the chosen image, of an amorphous scene, such as the depths of the open ocean. (Luckily, viewers are generally more interested in people in the world than in fish.) The computer also receives from a prospective user/viewer of the scene a user/viewer-specified criterion relative to which criterion the user/viewer wishes to view the scene.
From (i) the 3-D model and (ii) the criterion, the computer produces a particular two-dimensional image of the scene that is in accordance with the user/viewer-specified criterion. This particular two-dimensional image of the real-world scene is then displayed on a video display to the user/viewer.
At the highest level, the description of the previous paragraphs regarding the method of the present invention, and the computer-based system performing the method, may not seem much different in effect than that prior art system presently accorded, say, a network sports director who is able to select among many video feeds in accordance with his (or her) own "user/viewer-specified criterion". The significance of the production of the three-dimensional video model (of the real-world scene) by the method, and in the system, of the present invention is, at this highest level of describing the system's functions, as yet unclear. Consider, then, exactly what flows from the method, and the system, of the present invention that produces and uses a three-dimensional video model.
First, the computer may ultimately produce, and the display may finally show, only such a particular two-dimensional image of the scene -- in accordance with the user/viewer-specified criterion -- as was originally one of the images of the real-world scene that was directly imaged by one of the multiple video cameras. This is, indeed, the way the rudimentary embodiment of the invention taught and shown herein functions. At first consideration, this automatic camera selection may seem unimpressive. However, consider not only that the user/viewer criterion is specified at a high level, but that the appropriate, selected, scene image may change over time in accordance with just what is imaged, and in what location(s), by which camera(s), and in accordance with just what transpires in the scene. In other words, the evolving contents of the scene, as the scene is imaged by the multiple cameras and as it is automatically interpreted by the computer, determine just what image of the scene is shown at any one time, and just what sequence of images are shown from time to time, to the user/viewer. Action in the scene "feeds back" on how the scene is shown to the viewer.
Second, in advanced embodiments of the system, the computer is not limited to selecting from the three-dimensional model a two-dimensional image that is, or that corresponds to, any of the images of the real-world scene as was imaged by any of the multiple video cameras. Instead, the computer may synthesize from the three-dimensional model a completely new two- dimensional image that is without exact equivalence to any of the images of the real-world scene as have been imaged by any of the multiple video cameras
Third, the user/viewer-specified criterion may be of a particular spatial perspective relative to which the user/viewer wishes to view the scene. This spatial perspective need not be immutably fixed, but can instead be linked to a dynamic object in the scene. In the case of generating a scene view from a user/viewer-specified spatial perspective, the computer produces from the three-dimensional model a particular two-dimensional image of the scene that is in best accordance with some particular spatial perspective criterion that has been received from the viewer. The particular two-dimensional image of the scene that is generated and displayed may, or may not, be, or be equivalent to, any real image of the scene as was obtained by any of the video cameras. In other words, in advanced embodiments of the invention the scene image shown may be a virtual image. Even if the image shown is a real image, the computer will still automatically select, and the display will still display, over time, those actual images of the scene as are imaged, over time, by different ones of the multiple video cameras. Automated scene switching, especially in relation to dynamic objects in the scene, is not known to the inventors to exist in the prior art.
Fourth, the user/viewer-specified criterion may be of a particular object in the scene. In this case the computer will combine the images from the multiple video cameras not only so as to generate a three-dimensional video model of the scene, but so as to generate a model in which objects in the scene are identified. The computer will subsequently produce, and the display will subsequently show, the particular image -- whether real or virtual -- appropriate to best show the selected object. Clearly this is a feedback loop: the location of an object in the scene serves to influence, in accordance with a user/viewer selection of the object, how the scene is shown. Clearly the same video scene could be, if desired, shown over and over, each time focusing view on a different selected object in the scene.
Moreover, the selected object may either be static, and unmoving, or dynamic, and moving, in the scene. Regardless of whether the object in the scene is static or dynamic, it is preferably specified to the system by the user/viewer by act of positioning a cursor on the video display. The cursor is a special type that unambiguously specifies an object in the scene by an association between the object position and the cursor position in three dimensions, and is thus called "a three-dimensional cursor".
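One plausible realization of that association -- a sketch under our own assumptions, not a recitation of the appendix source code -- casts the viewing ray through the cursor's pixel and designates the modeled object lying nearest that ray:

```python
import numpy as np

def pick_object(cursor_uv, K, R, t, objects, max_offset=0.5):
    """Resolve a 3-D cursor placed at pixel cursor_uv into the scene object it
    designates. 'objects' maps object id -> 3-D world position. The object whose
    recorded position lies closest to the viewing ray (within max_offset metres,
    an assumed tolerance) is returned, or None if nothing is near the ray."""
    center = -R.T @ t                                           # camera centre, world frame
    ray = R.T @ np.linalg.inv(K) @ np.array([cursor_uv[0], cursor_uv[1], 1.0])
    ray = ray / np.linalg.norm(ray)                             # unit viewing ray
    best_id, best_dist = None, max_offset
    for obj_id, pos in objects.items():
        rel = np.asarray(pos, dtype=float) - center
        if np.dot(rel, ray) <= 0:                               # behind the camera
            continue
        # Perpendicular distance from the object position to the viewing ray.
        dist = np.linalg.norm(rel - np.dot(rel, ray) * ray)
        if dist < best_dist:
            best_id, best_dist = obj_id, dist
    return best_id
```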
Fifth, the criterion specified by the user/viewer may be of a particular event in the scene. In this case the computer will again combine the images from the multiple video cameras not only so as to generate a three-dimensional video model of the scene, but so as to generate a model in which one or more dynamically occurring event (s) in the scene are recognized and identified. The computer will subsequently produce, and the display will show, a particular image -- whether real or virtual -- that is appropriate to best show the selected event. Clearly this is again a feedback loop: the location of an event in the scene influences, in accordance with a viewer selection of the event, how the scene is shown.
Sixth, and finally, the method of the invention may be performed in real time as interactive television. The television scene will be presented to a user/viewer interactively in accordance with the user/viewer-specified criterion.
2. Immersive Video, Also Called Telepresence, Also Called Visual Reality (VisR)
The present invention still further contemplates telepresence and immersive video, being the non-real-time creation of a synthesized, virtual, camera/video image of a real-world scene, typically in accordance with one or more viewing criteria that are chosen by a viewer of the scene. Immersive video, or telepresence, or visual reality (VisR) is an extension of Multiple Perspective Interactive (MPI) video.
In immersive video the creation of the virtual image is based on a computerized video processing -- in a process called hypermosaicing -- of multiple video views of the scene, each from a different spatial perspective on the scene.
When the synthesis and the presentation of the virtual image transpires as the viewer desires -- and particularly as the viewer indicates his or her viewing desires simply by action of moving and/or orienting any of his or her body, head and eyes -- then the process is called "immersive telepresence", or simply "telepresence". Alternatively, the process is sometimes called "visual reality", or simply "VisR".
(The proliferation of descriptive terms has more to do with the apparent reality(ies) of the synthesized views drawn from the real-world scene than it does with the system and processes of the present invention for synthesizing such views. For example, a quite reasonable ground-level view of a football quarterback, as may be synthesized by the system and method of the present invention, may appear to a viewer to have been derived from a hand-held television camera, although in fact no such camera exists and the view was not so derived. These views of common experience are preliminarily called "telepresence". Contrast a magnified, eye-to-eye, view with an ant. This magnified view is also of the real world, although it is clearly a view that is neither directly visible to the naked eye, nor of common experience. Although derived by entirely the same processes, views of this latter type of synthesized view of the real world are preliminarily called "visual reality", or "VisR", by juxtaposition of such views with the similar sensory effects engendered by "virtual reality", or "VR".)
2.1 Telepresence, Both Immersive and Interactive
In one of its aspects, the present invention is embodied in a method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer. The method includes (i) capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene, (ii) creating from the captured video a full three-dimensional model of the scene, and (iii) producing, or synthesizing, from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene.
This method is thus called "immersive telepresence" because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his or her desires. Namely, it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene. Notably, the viewer-desired perspective on the scene, and the video representation synthesized in accordance with this viewer-desired perspective, need not be in accordance with any of the video captured from any scene perspective.
The video representation can be in accordance with the position and direction of the viewer's eyes and head, and can exhibit "motional parallax". "Motional parallax" is normally and conventionally defined as a three-dimensional effect where different views on the scene are produced as the viewer moves position, causing the viewer's brain to comprehend that the viewed scene is three-dimensional. Motional parallax is observable even if the viewer has but one eye.
Still further, and additionally, the video representation can be stereoscopic. "Stereoscopy" is normally and conventionally defined as a three-dimensional effect where each of the viewer's two eyes sees a slightly different view on the scene, thus causing the viewer's brain to comprehend that the viewed scene is three-dimensional. Stereoscopy is detectable even should the viewer not move his or her head or eyes in spatial position, as is required for motional parallax.
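Both effects follow directly from re-rendering the one 3-D scene model from the appropriate viewpoints. A minimal sketch, under our own assumed names and a nominal 65 mm interocular distance, is:

```python
import numpy as np

def stereo_eye_positions(head_position, gaze_direction, up=(0.0, 0.0, 1.0),
                         interocular=0.065):
    """Return the left- and right-eye viewpoints for a stereoscopic rendering of
    the 3-D scene model. Each eye is offset by half the interocular distance
    along the axis perpendicular to the gaze and 'up' directions; motional
    parallax follows simply from re-evaluating this as head_position moves."""
    gaze = np.asarray(gaze_direction, dtype=float)
    gaze = gaze / np.linalg.norm(gaze)
    right = np.cross(gaze, np.asarray(up, dtype=float))
    right = right / np.linalg.norm(right)
    head = np.asarray(head_position, dtype=float)
    left_eye = head - 0.5 * interocular * right
    right_eye = head + 0.5 * interocular * right
    return left_eye, right_eye

# The visualizer then renders the same 3-D scene model twice, once from each
# returned viewpoint, to produce the stereoscopic image pair.
```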
In another of its aspects, the present invention is embodied in a method of telepresence where, again, video of a real-world scene is obtained from a multiplicity of different spatial perspectives on the scene. Again, a full three-dimensional model of the scene is created from the captured video. From this three-dimensional model a video representation on the scene that is in accordance with a predetermined criterion -- selected from among criteria including a perspective on the scene, an object in the scene and an event in the scene -- is produced, or synthesized.
This embodiment of the invention is thus called "interactive telepresence" because the presentation to the viewer is interactive in accordance with the criterion. Again, the synthesized video presentation of the scene in accordance with the criterion need not be, and normally is not, equivalent to any of the video captured from any scene perspective.
In this method of viewer-interactive telepresence the video representation can be in accordance with a criterion selected by the viewer, thus viewer-interactive telepresence. Furthermore, the presentation can be in accordance with the position and direction of the viewer's eyes and head, and will thus exhibit motional parallax; and/or the presentation can exhibit stereoscopy.
2.2 A System for Generating Immersive Video
A huge range of heretofore unobtainable, and quite remarkable, video views may be synthesized in accordance with the present invention. Notwithstanding that an early consideration of exemplary video views of diverse types would likely provide significant motivation to understanding the construction, and the operation, of the immersive video system described in this section 2.2, discussion of these views is delayed until the next section 2.3. This is so that the reader, having gained some appreciation and understanding in this section 2.2 of the immersive video system, and process, by which the video views are synthesized, may later better place these diverse views in context.
An immersive video, or telepresence, system serves to synthesize and to present diverse video images of a real-world scene in accordance with a predetermined criterion or criteria. The criterion or criteria of presentation is (are) normally specified by, and may be changed at times and from time to time by, a viewer/user of the system. Because the criterion (criteria) is (are) changeable, the system is viewer/user-interactive, presenting (primarily) those particular video images (of a real-world scene) that the viewer/user desires to see.
The immersive video system includes a knowledge database containing information about the scene. Existence of this "knowledge database" immediately means that something about the scene is both (i) fixed and (ii) known; for example that the scene is of "a football stadium", or of "a stage", or even, despite the considerable randomness of waves, of "a surface of an ocean that lies generally in a level plane". For many reasons -- including the reason that a knowledge database is required -- the antithesis of a real-world scene upon which the immersive video system of the present invention may successfully operate is a scene of windswept foliage in a deep jungle.
The knowledge database may contain, for example, data regarding any of (i) the geometry of the real-world scene, (ii) potential shapes of objects in the real-world scene, (iii) dynamic behaviors of objects in the real-world scene, (iv) an internal camera calibration model, and/or (v) an external camera calibration model. For example, the knowledge base of an American football game would be something to the effect that (i) the game is played essentially in a thick plane lying flat upon the surface of the earth, this plane being marked with both (yard) lines and hash marks; (ii) humans appear in the scene, substantially at ground level; (iii) a football moves in the thick plane both in association with (e.g., running plays) and detached from (e.g., passing and kicking plays) the humans; and (iv) the locations of each of several video cameras on the football game are a priori known, or are determined by geometrical analysis of the video view received from each.
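Such a knowledge database might, purely by way of illustration, be organized along the following lines; every field name and numeric value below is an assumption of this sketch, not a recitation of the appendix source code.

```python
from dataclasses import dataclass, field

@dataclass
class CameraCalibration:
    # Internal (intrinsic) and external (pose) calibration for one camera.
    camera_id: int
    focal_length_px: float
    principal_point: tuple          # (u0, v0) in pixels
    position: tuple                 # camera centre in world coordinates
    orientation: tuple              # e.g. (pan, tilt, roll) in degrees

@dataclass
class KnowledgeDatabase:
    """Static, a priori knowledge about the scene (illustrative fields only)."""
    scene_geometry: dict            # e.g. field dimensions, yard-line positions
    object_shapes: dict             # class name -> expected shape / size bounds
    object_dynamics: dict           # class name -> expected dynamic behavior
    cameras: list = field(default_factory=list)   # CameraCalibration entries

# An assumed, abbreviated knowledge base for an American football game.
football_kb = KnowledgeDatabase(
    scene_geometry={'field_length_m': 109.7, 'field_width_m': 48.8},
    object_shapes={'player': {'height_m': (1.6, 2.1)}, 'ball': {'length_m': 0.28}},
    object_dynamics={'player': {'on_ground_plane': True, 'max_speed_mps': 11.0}},
)
```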
The system further includes multiple video cameras, each at a different spatial location. Each of these multiple video cameras serves to produce a two-dimensional video image of the real-world scene at a different spatial perspective. Each of these multiple cameras can typically change the direction from which it observes the scene, and can typically pan and zoom, but, at least in the more rudimentary versions of the immersive video system, remains fixed in location. A classic example of multiple stationary video cameras on a real-world scene are the cameras at a sporting event, for example at an American football game.
The system also includes a viewer/user interface. A prospective viewer/user of the scene uses this interface to specify a criterion, or several criteria, relative to which he or she wishes to view the scene. This viewer/user interface may commonly be anything from head gear mounted to a boom, to a computer joystick, to a simple keyboard. In ultimate applications of the immersive video system of the present invention, the viewer/user who establishes (and re-establishes) the criterion (criteria) by which an image on the scene is synthesized is the final consumer of the video images so synthesized and presented by the system. However, for more rudimentary present versions of the immersive video system, the control input(s) arising at the viewer/user interface typically arise from a human video sports director (in the case of an athletic event), from a human stage director (in the case of a stage play), or even from a computer (performing the function of a sports director or stage director). In other words, the viewing desires of the ultimate viewer/user may sometimes be translated to the immersive video system through an intermediary agent that may be either animate or inanimate.
The immersive video system includes a computer running a software program. This computer receives the multiple two-dimensional video images of the scene from the multiple video cameras, and also the viewer-specified criterion (criteria) from the viewer interface. At the present time, circa 1995, the typical computer functioning in an immersive video system is fairly powerful. It is typically an engineering work station class computer, or several such computers that are linked together if video must be processed in real time -- i.e., as television. Especially if the immersive video is real time -- i.e., as television -- then some or all of the computers normally incorporate hardware graphics accelerators, a well-known but expensive part for this class of computer. Accordingly, the computer(s) and other hardware elements of an immersive video system are both general purpose and conventional but are, at the present time (circa 1995), typically "state-of-the-art", and of considerable cost ranging to tens, and even hundreds, of thousands of American dollars.
The system computer includes (in software and/or in hardware) (i) a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, (ii) an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene, within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, (iii) a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model, in order to produce parameters of perspective on the scene, and (iv) a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene.
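The per-frame cooperation of these four functions may be summarized structurally as follows. This is a skeleton only: the class and method names are ours, the four collaborator objects are assumed to be supplied elsewhere, and no claim is made that the appendix source code is organized in this way.

```python
class ImmersiveVideoComputer:
    """Skeleton of the per-frame flow through the four functions named above."""
    def __init__(self, analyzer, model_builder, interpreter, visualizer):
        self.analyzer = analyzer              # (i)   video data analyzer
        self.model_builder = model_builder    # (ii)  environmental model builder
        self.interpreter = interpreter        # (iii) viewer criterion interpreter
        self.visualizer = visualizer          # (iv)  visualizer

    def process_frame(self, camera_frames, viewer_criterion):
        # (i) detect and track objects of potential interest in every view.
        detections = {cam_id: self.analyzer.detect(frame)
                      for cam_id, frame in camera_frames.items()}
        # (ii) fold all views into the 3-D dynamic environmental model.
        model = self.model_builder.update(camera_frames, detections)
        # (iii) turn the viewer's criterion into parameters of perspective.
        perspective = self.interpreter.resolve(viewer_criterion, model)
        # (iv) render the requested 2-D image from the 3-D model.
        return self.visualizer.render(model, perspective)
```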
The computer function (i) -- the video data analyzer -- is a machine vision function. The function can presently be performed quite well and quickly, especially if (i) specialized video digitalizing hardware is used, and/or (ii) simplifying assumptions about the scene objects are made. Primarily because of the scene model builder next discussed, abundant simplifying assumptions are both well and easily made in the immersive video system of the present invention. For example, it is assumed that, in a video scene of an American football game, the players remain essentially in and upon the thick plane of the football field, and do not "fly" into the airspace above the field.
The views provided by an immersive video system in accordance with the present invention not yet having been discussed, it is somewhat premature to explain how a scene object that is not in accordance with the model may suffer degradation in presentation. More particularly, the scene model is not overly particular as to what appears within the scene, but it is particular as to where within (the volume of) the scene an object to be modeled appears. Consider, for example, that the immersive video system can fully handle a scene-intrusive object that is not in accordance with prior simplifications -- for example, a spectator or many spectators or a dog or even an elephant walking onto a football field during or after a football game -- and can process these unexpected objects, and object movements, quite as well as any other. However, it is necessary that the modeled object should appear within a volume of the real-world scene whereat the scene model is operational -- basically that volume portion of the scene where the fields of view of multiple cameras overlap. For example, a parachutist parachuting into a football stadium may not be "well-modeled" by the system when he/she is high above the field, and outside the thick plane, but will be modeled quite well when finally near, or on, ground level. By modeling "quite well", it is meant that, while the immersive video system will readily permit a viewer to examine, for example, the dentition of the quarterback if he or she is interested in staring the quarterback "in the teeth", it is very difficult for the system (especially initially, and in real time as television) to process through a discordant scene occurrence, such as the stadium parachutist, so well as to permit the examination of his or her teeth also when the parachutist is still many meters above the field.
The computer function (ii) -- the environmental model builder -- is likely the "backbone" of the present invention. It incorporates important assumptions that, while scene specific, are generally of a common nature throughout all scenes that are of interest for viewing with the present invention.
In the first place, the environmental model is three-dimensional (3-D), with both (i) static and (ii) dynamic components. The scene environmental model is not the scene image, nor the scene images rendered three-dimensionally. The current scene image, such as of the play action on a football field, may be, and typically is, considerably smaller than the scene environmental model which may be, for example, the entire football stadium and the objects and actors expected to be present therein. Within this three-dimensional dynamic environmental model both (i) the scene and (ii) all potential objects of interest in the scene are dynamically recorded as associated with, or "in", their proper instant spatial locations. (It should be remembered that the computer memory in which this 3-D model is recorded is actually one-dimensional (1-D), being but memory locations each of which is addressed by but a single 1-D address.) Understanding that the scene environmental model, and the representation of scene video information, in the present invention is 3-D will much simplify understanding of how the remarkable views discussed in the next section are derived.
At present there is not enough computer "horsepower" to process a completely amorphous unstructured video scene -- the windy jungle -- into 3-D, at least in real time (i.e., as television). It is, however, eminently possible to process many scenes of great practical interest and importance into 3-D if and when appropriate simplifying assumptions are made. In accordance with the present invention, these necessary simplifying assumptions are very effective, making production of the three-dimensional video database (in accordance with the 3-D environmental model) very efficient.
First, the static "underlayment" or "background" of any scene is pre-processed into the three-dimensional video database. For example, the video model of an (empty) sports stadium -- the field, field markings, goal posts, stands, etc. -- is pre-processed (as the environmental model) into the three-dimensional video database. From this point on only the dynamic elements in the scene -- i.e., the players, the officials, the football and the like -- need be, and are, dealt with. The typically greater portion of any scene that is (at any one time) static is neither processed nor re-processed from moment to moment, and from frame to frame. It need not be so processed or re-processed because nothing has changed, nor is changing. (In some embodiments of the immersive video system, the static background is not inflexible, and may be a "rolling" static background based on the past history of elements within the video scene.)
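(The following Python sketch is offered for illustration only, and is not the disclosed implementation: it shows one plausible way of pre-processing a static background once, thereafter touching only those pixels that depart from it, and optionally maintaining a "rolling" background; the threshold and update rate are hypothetical.)

```python
import numpy as np

def build_static_background(empty_scene_frames):
    """Pre-process the static "underlayment" once, from frames of the empty scene."""
    stack = np.stack(empty_scene_frames).astype(np.float32)
    return np.median(stack, axis=0)

def dynamic_mask(frame, background, threshold=20.0):
    """Only pixels departing from the pre-processed background need further work."""
    diff = np.abs(frame.astype(np.float32) - background)
    return diff.max(axis=-1) > threshold      # True where the scene is (now) dynamic

def update_rolling_background(background, frame, mask, rate=0.01):
    """Optional "rolling" static background: slowly absorb pixels that stay static."""
    static = ~mask
    background[static] = (1 - rate) * background[static] + rate * frame[static]
    return background
```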
Meanwhile, dynamic objects in the scene -- which objects typically appear only in a minority of the scene (e.g., the football players) but may appear in the entire scene (e.g., the crowd) -- are preferably processed in two ways. If the computer recognition and classification algorithm can recognize -- in consideration of a priori model knowledge of items appearing in the scene such as the football, and the football players -- an item in the scene, then that item will be isolated, and will be processed/re-processed into the three-dimensional video database as a multiple voxel representation. (A voxel is a three-dimensional pixel.) Other dynamic elements of the scene that cannot be classified or isolated into the three-dimensional environmental model are swept up into the three-dimensional video database mostly in their raw, two-dimensional, video data form. Such a dynamic, but un-isolated, video element could be, for example, the movement of a crowd doing a "wave" motion at a sports stadium, or the surface of the sea.
As will be seen, those recognized and classified objects in the three-dimensional video database -- such as, for example, a football or a football player -- can later be viewed (to the limits of being obscured in all two-dimensional video data streams from which the three-dimensional video scene is composed) from any desired perspective. But it is not possible to view those unclassified and un-isolated dynamic elements of the scene that are stored in the 3-D video database in their 2-D video data form from any random perspective. These dynamic objects can indeed be dynamically viewed, but it is impossible in the
system to, for example, go "behind" the moving crowd, or "under" the undulating surface of the sea.
The system and method does not truly know, of course, whether it is inserting into the instant three-dimensional video database in accordance with the scene environmental model an instant video image of a football quarterback taking a drink, or an instant video image of a football fan taking the same drink. Moreover, dynamic objects can both enter (e.g., as in coming onto the imaged field of play) and exit (e.g., as in leaving the imaged field of play) the scene. The system and method of the present invention for constructing a 3-D video scene deal only with (i) the scene environmental model, and (ii) the mathematics of the pixel dynamics. What must be recognized is that, in so doing, the system and method serve to discriminate between and among raw video image data in processing such image data into the three-dimensional video database.
These assumptions that the real-world scene contains both static and dynamic elements (indeed, preferably two kinds of dynamic elements), this organization, and these expediencies of video data processing are very important. They are collectively estimated to reduce the computational requirements for the maintenance of a 3-D video database of a typical real-world scene of genuine interest by a factor of from fifty to one hundred times (x 50 to x 100).
However, these simplifications have a price, thankfully normally one that is so small as to be all but unnoticeable. Portions of the scene "where the action is, or has been" are entered into the three-dimensional video database quite splendidly. Viewers normally associate such "action areas" with the center of their video or television presentation. When action spontaneously erupts at the periphery of a scene, it takes even our human brains -- whose attention has been focused elsewhere (i.e., at the scene center) -- several hundred milliseconds or so to recognize what has happened. So also, but in a different sense, it is possible to "sandbag" the system and method of the present invention by a spontaneous eruption of action, or dynamism, in a previously unclassified scene area. In a "first pass", or in real time (i.e., as television), the system and method of the present invention find it hard to discriminate, and hard to process for entrance into the three-dimensional database, a three-dimensional scene object (or actor) where there was no previous scene object (or actor). Without a priori knowledge in the scene environmental model that a spectator may throw a bottle into a sporting arena, it is hard for the system of the present invention to classify and to process the throw and the thrower into the three-dimensional database so completely that the facial features of the thrower may -- either upon an "instant replay" of the scene focusing on the area of the perpetrator or for that rare viewer who had been focusing his view to watch the crowd instead of the athletes all along -- immediately be recognized. (If the original raw video data streams still exist, then it is always possible to process them better.)
Finally, the algorithms themselves that are used to produce the three-dimensional video database are efficient.
Lastly, the system includes a video display that receives the particular two-dimensional video image of the scene from the computer, and that displays this particular two-dimensional video image of the real-world scene to the viewer/user as that particular view of the scene which is in satisfaction of the viewer/user-specified criterion (criteria).
2.3 Scene Views Obtainable With Immersive Video
To immediately note that a viewer/user of an immersive video system in accordance with the present invention may view the scene from any static or dynamic viewpoint -- regardless that a real camera/video does not exist at the chosen viewpoint -- only but starts to describe the experience of immersive video. Literally any video image(s) can be generated. The immersive video image(s) that is (are) actually displayed to the viewer/user are ultimately, in one sense, a function of the display devices, or the arrayed display devices -- i.e., the television(s) or monitor(s) -- that are available for the viewer/user to view. Because, at present (circa 1995), the most ubiquitous forms of these display devices -- televisions and monitors -- have substantially rectangular screens, most of the following explanations of the various experiences of immersive video will be couched in terms of the planar presentations of these devices. However, when in the future new display devices such as volumetric three-dimensional televisions are built -- see, for example, U.S. Patent Nos. 5,268,862 and 5,325,324 each for a THREE-DIMENSIONAL OPTICAL MEMORY -- then the system of the present invention will stand ready to provide the information displayed by these devices.
2.3.1 Planar Video Views on a Scene
First, consider the generation of two-dimensional, planar and curved surface, video views on a scene.
Any "planar" view on the scene may be derived as the information which is present on any (straight or curved) plane (or other closed surface, such as a saddle) that is "cut" through the three-dimensional model of the scene. This "planar" surface may, or course, be positioned anywhere within the three- dimensional volume of the scene model Literally any interior or exterior virtual video view on the scene may be derived and displayed. Video views may be presented in any aspect ratio, and m any geometric form that is supported by the particular video display, or arrayed video displays (e g , televisions, and video projectors , by wnich the video imagery is presented to the viewer/user
Next, recall that a plane is but the surface of a sphere or cylinder of infinite radius. In accordance with the present invention, a cylindrical, hemispherical, or spherical panoramic view of a video scene may be generated from any point inside or outside the cylinder, hemisphere, or sphere. For example, successive views on the scene may appear as the scene is circumnavigated from a position outside the scene. An observer at the video horizon of the scene will look into the scene as if through a window, with the scene in plan view, or, if foreshortened, as if viewing the interior surface of a cylinder or a sphere from a peephole in the surface of the cylinder or sphere. In the example of an American football game, the viewer/user could view the game in progress as if he or she "walked" at ground level, or even as if he or she "flew at low altitude", around or across the field, or throughout the entire stadium.
A much more unusual panoramic cylindrical, or spherical, "surround" view of the scene may be generated from a point inside the scene. The views presented greatly surpass the crude, but commonly experienced, example of "you are there" home video where the viewer sees a real-world scene unfold as a walking video cameraman shoots video of only a limited angular, and solid angular, perspective on the scene. Instead, the scene can be made to appear -- especially when the display presentation is made so as to surround the user as do the four walls of a room or as does the dome of a planetarium -- to completely encompass the viewer. In the example of an American football game, the viewer/user could view the game in progress as if he or she was a player "inside" the game, even to the extent of looking "outward" at the stadium spectators.
(It should be understood that where the immersive video system has no information -- normally because the view is obscured to the several cameras -- then no image can be presented of such a scene portion, which portion normally shows black upon presentation. This is usually not objectionable; the viewer/user does not really expect to be able to see "under" the pile of football players, or from a camera view "within" the earth. Note, however, that when the 3-D video database does contain more than just surface imagery such as, for example, the complete 3-D human physiology (the "visible man"), then "navigation" "inside" solid objects, into areas that have never been "seen" by eye or by camera, and at non-normal scales of view is totally permissible.)
Notably, previous forms of displaying multi-perspective, and/or surround, video presently (circa 1995) suffer from distortion. Insofar as the view caught at the focal plane of the camera, or each camera (whether film or video), is not identical to the view recreated for the viewer, the (often composite) views suffer from distortion, and to that extent a composite view lacks "reality" -- even to the point of being disconcerting. However -- and considering again that each and all views presented by an immersive video system in accordance with the present invention are drawn from the volume of a three-dimensional model -- there is absolutely no reason that each and every view produced by an immersive video system should not be of absolute fidelity and correct spatial relationship to all other views.
For example, consider first the well known, but complex, pincushion correction circuitry of a common television. This circuitry serves to match the information modulation of the display-generating electron beam to the slightly non-planar, pincushion-like, surface of a common cathode ray tube. If the information extracted from a three-dimensional video model is so extracted in the contour of a common pincushion, then no correction of the information is required in presenting it on an equivalent pincushion surface of a cathode ray tube. Taking this analogy to the next level, if a scene is to be presented on some selected panels of a Liquid Crystal Display (LCD), or walls of a room, then the pertinent video information as would constitute a perspective on the scene at each such panel or wall is simply withdrawn from the three-dimensional model. Because they are correctly spatially derived from a seamless 3-D model, the video presentations on each panel or wall fit together seamlessly, and perfectly.
By now, this capability of the immersive video of the present invention should be modestly interesting. As well as commonly lacking stereoscopy, the attenuation effects of intervening atmosphere, true color fidelity, and other assorted shortcomings, two-dimensional screen views of three-dimensional real world scenes suffer in realism because of subtle systematic dimensional distortion. The surface of the two-dimensional display screen (e.g., a television) is seldom so (optically) flat as is the surface of the Charge Coupled Device (CCD) of a camera providing a scene image. The immersive video system of the present invention straightens all this out, exactly matching (in dedicated embodiments) the image presented to the particular screen upon which the image is so presented. This is, of course, a product of the 3-D video database which was itself constructed from multiple video streams from multiple video cameras. It might thus be said that the immersive video system of the present invention is using the image of one (or more) cameras to "correct" the presentation (not the imaging, the presentation) of an image derived (actually synthesized in part) from another camera!
2.3.2 Interactive Video Views on a Scene
Second, consider that immersive video in accordance with the present invention permits machine dynamic generation of views on a scene. Images of a real-world scene may be linked at the discretion of the viewer to any of a particular perspective on the scene, an object in the scene, or an event in the scene.
For example, consider again the example of the real-world event of an American football game. A viewer/user may interactively choose to view a field goal attempt from the location of the goalpost crossbars (a perspective on the scene), watching a successful place kick sail overhead. The viewer/user may choose to have the football (an object in the scene) centered in a field of view that is 90° to the field of play (i.e., a perfect "sideline seat") at all times. Finally, the viewer/user may choose to view the scene from the position of the left shoulder of the defensive center linebacker unless the football is launched airborne (as a pass) (an event in the scene) from the offensive quarterback, in which case presentation reverts to broad angle aerial coverage of the secondary defensive backs.
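(By way of illustration only, a viewer/user-specified criterion of the kind just described -- a perspective, an object, or an event, with a fallback presentation -- might be represented as a small data structure; the Python sketch below, with wholly hypothetical names and values, conveys the idea.)

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ViewCriterion:
    """A viewer/user criterion binding the presented view to a perspective,
    an object, or an event in the scene."""
    perspective: Optional[Tuple[float, float, float]] = None  # e.g., the goalpost crossbar
    track_object: Optional[str] = None                        # e.g., "football"
    on_event: Optional[str] = None                            # e.g., "pass_launched"
    fallback: Optional["ViewCriterion"] = None                 # presentation to revert to

# Example: follow the linebacker's left shoulder until a pass is launched,
# then revert to broad aerial coverage of the defensive secondary.
aerial = ViewCriterion(perspective=(60.0, 25.0, 40.0))
linebacker_view = ViewCriterion(track_object="defensive_center_linebacker",
                                on_event="pass_launched", fallback=aerial)
```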
The present and related inventions serve to make each and any viewer of a video or a television depicting a real-world scene his or her own proactive editor of the scene, having the ability to interactively dictate and select -- in advance of the unfolding of the scene, and by high-level command -- any reasonable parameter or perspective by which the scene will be depicted, as and when the scene unfolds.
2.3.3 Stereoscopic Video Views on a Scene
Third, consider that stereoscopy is inherent in immersive video in accordance with the present invention.
Scene views are constantly generated by reference to the content of a dynamic three-dimensional model -- which model is sort of a three-dimensional video memory without the storage requirement of a one-to-one correspondence between voxels (solid pixels) and memory storage addresses. Therefore it is "no effort at all" for an immersive video system to present, as a selected stream of video data containing a selected view, first scan time video data and second scan time video data that is displaced, each relative to the other, in accordance with the location of each object depicted along the line of view.
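(The following Python sketch is illustrative only: it shows one simple way in which such a three-dimensional video memory may avoid any one-to-one correspondence between voxels and memory addresses, namely by storing only the occupied voxels in a dictionary keyed on their 3-D coordinates.)

```python
class SparseVoxelStore:
    """A sparse 3-D "video memory": storage grows with content, not with volume."""
    def __init__(self):
        self._voxels = {}                     # (x, y, z) -> (r, g, b), occupied voxels only

    def set(self, x, y, z, color):
        self._voxels[(x, y, z)] = color

    def get(self, x, y, z):
        return self._voxels.get((x, y, z))    # None wherever the scene volume is empty

    def count(self):
        return len(self._voxels)

store = SparseVoxelStore()
store.set(40, 20, 2, (255, 255, 255))         # e.g., one voxel of a tracked football
print(store.count())                          # 1, regardless of the nominal scene volume
```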
This is, of course, the basis of stereoscopy. When one video stream is presented in a one color, or, more commonly at present, at a one time or in a one polarization, while the other video stream is presented in a separate color, or at a separate time, or in an orthogonal polarization, and each stream is separately gated to the eye (at greater than the eye flicker fusion frequency ≈ 70 Hz) by action of colored glasses, or time-gated filters, or polarizing filters, then the image presented to the eyes will appear stereoscopic, and three-dimensional. The immersive video of the present invention, with its superior knowledge of the three-dimensional spatial positions of all objects in a scene, excels in such stereoscopic presentations (which stereoscopic presentations are, alas, impossible to show on the two-dimensional pages of the drawings).
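(A minimal Python sketch, for illustration only, of how a left/right stereoscopic pair may be synthesized from the 3-D model by rendering the same view from two eye positions separated by a hypothetical interocular distance; render_fn stands for any view synthesizer, such as the ray-casting sketch given earlier.)

```python
import numpy as np

def stereoscopic_pair(render_fn, eye_center, gaze_dir, up=(0.0, 0.0, 1.0), ipd=6.5):
    """Render a left/right stereo pair from the 3-D model.

    render_fn(position, rotation) is any view synthesizer; ipd is the
    interocular separation expressed in model units (hypothetical value).
    """
    gaze = np.asarray(gaze_dir, dtype=float)
    gaze /= np.linalg.norm(gaze)
    right = np.cross(gaze, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    rot = np.column_stack([right, np.cross(gaze, right), gaze])   # camera axes in world frame
    left_image = render_fn(np.asarray(eye_center) - (ipd / 2) * right, rot)
    right_image = render_fn(np.asarray(eye_center) + (ipd / 2) * right, rot)
    return left_image, right_image   # to be shown in separate colors, times, or polarizations
```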
2.3.4 A Combination of Visual Reality and Virtual Reality
Fourth, the immersive video presentations of the present invention are clearly susceptible of combination with the objects, characters and environments of artificial reality. Computer models and techniques for the generation and presentation of artificial reality commonly involve three- dimensional organization and processing, even if only for tracing light rays for both perspective and illumination. The central, "cartoon", characters and objects are often "finely wrought", and commonly appear visually pleasing. Alas, equal attention cannot be paid to each and every element of a scene, and the scene background to the focus characters and objects is often either stark, or unrealistic, or both.
Immersive video in accordance with the present invention provides the vast, relatively inexpensive, "database" of the real world (at all scales, time compressions/expansions, etc.) as a suitable "field of operation" (or "playground") for the characters of virtual reality.
When it is considered that immersive video permits viewer/user interactive viewing of a scene, then it is straightforward to understand that a viewer/user may "move" in and through a scene in response to what he/she "sees" in a composite scene of both a real, and an artificial virtual, nature. It is therefore possible, for example, to interactively flee from a "dinosaur" (a virtual animal) appearing in the scene of a real world city. It is therefore possible, for example, to strike a virtual "baseball" (a virtual object) appearing in the scene of a real world baseball park. It is therefore possible, for example, to watch a "tiger", or a "human actor" (both real), appearing in the scene of a virtual landscape (which landscape has been laid out in consideration of the movements of the tiger or the actor).
Note that (i) visual reality and (ii) virtual reality can, in accordance with the present invention, be combined with (1) a synthesis of real/virtual video images/television pictures of a combination real-world/virtual scene wherein the synthesized pictures are to user-specified parameters of presentation, e.g., panoramic or at magnification if so desired by the user, and/or (2) the synthesis of said real/virtual video images/television pictures can be 3-D stereoscopic.
2.4 The Method of the Present Invention, In Brief
In brief, the present invention assumes, and uses, a three-dimensional model of the (i) static, and (ii) dynamic, environment of a real-world scene -- a three-dimensional, environmental, model.
Portions of each of multiple video streams showing a single scene, each from a different spatial perspective, that are identified to be (then, at the instant) static by a running comparison are "warped" onto the three-dimensional environmental model. This "warping" may be into 2-D (static) representations within the 3-D model -- e.g., a football field as is permanently static, or even a football bench as is only normally static -- or, alternatively, as a reconstructed 3-D (static) object -- e.g., the goal posts.
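(For illustration only, and not as the disclosed procedure: one standard way to "warp" video pixels of a known flat static surface onto its 2-D representation within the model is by a plane-to-plane homography estimated from a few known correspondences, such as yard-line intersections. The Python sketch below uses wholly hypothetical point values.)

```python
import numpy as np

def homography_from_points(img_pts, plane_pts):
    """Estimate the 3x3 homography taking image pixels of a flat static surface
    -- e.g., the football field -- onto plane coordinates within the model."""
    A = []
    for (x, y), (X, Y) in zip(img_pts, plane_pts):
        A.append([x, y, 1, 0, 0, 0, -X * x, -X * y, -X])
        A.append([0, 0, 0, x, y, 1, -Y * x, -Y * y, -Y])
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)

def warp_pixel(H, x, y):
    """"Warp" one image pixel onto its 2-D location within the 3-D model."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Hypothetical correspondences: image pixels -> field coordinates (yards).
H = homography_from_points([(100, 400), (540, 410), (120, 200), (520, 210)],
                           [(0, 0), (53, 0), (0, 20), (53, 20)])
print(warp_pixel(H, 320, 300))    # approximate field location of that pixel
```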
The dynamic part of each video stream (that arises from a particular perspective) is likewise "warped" onto the three-dimensional environmental model. Normally the "warping" of dynamic objects is into a reconstructed three-dimensional (dynamic) object -- e.g., a football player. This is for the simple reason that dynamic objects in the scene are of primary interest, and it is they that will later likely be important in synthesized views of the scene. However, the "warping" of a dynamic object may also be into a two-dimensional representation -- e.g., the stadium crowd producing a wave motion.
Simple changes in video data determine whether an object is (then) static or dynamic.
The environmental model itself determines whether any scene portion or scene object is to be warped onto it as a two-dimensional representation or as a reconstructed three-dimensional object. The reasons that no attempt is made to reconstruct everything in three dimensions are twofold. First, video data is lacking to model everything in and about the scene in three dimensions -- e.g., the underside of the field or the back of the crowd are not within any video stream. Second, and more importantly, there is insufficient computational power to reconstruct a three-dimensional video representation of everything that is within a scene, especially in real time (i.e., as television).
Any desired scene view is then synthesized (alternatively, "extracted") from the representations and reconstituted objects that are both within the three-dimensional model, and is displayed to a viewer/user.
The synthesis/extraction may be in accordance with a viewer-specified criterion, and may be dynamic in accordance with such criterion. For example, the viewer of a football game may request a consistent view from the "fifty yard line", or may alternatively ask to see all plays from a stadium view at the line of scrimmage. The views presented may be dynamically selected in accordance with an object in the scene, or an event in the scene.
Any interior or exterior perspectives on the scene may be presented. For example, the viewer may request a view looking into a football game from the sideline position of a coach, or may request a view looking out of the football game at the coach from the then position of the quarterback on the football field. Any requested view may be panoramic, or at any aspect ratio, in presentation. Views may also be magnified, or reduced.
Finally, any and all views can be rendered stereoscopically, as desired.
The synthesized/extracted video views may be processed in real time, as television.
Any and all synthesized/extracted video views contain only as much information as is within any of the multiple video streams; no video view can contain information that is not within any video stream, and will simply show black (or white) in this area.
2.5 The Immersive System of the Present Invention, In Brief
In brief, the immersive video computer system of the present invention receives multiple video images of views on a real-world scene, and serves to synthesize a video image of the scene which synthesized image is not identical to any of the multiple received video images.
The computer system includes an information base containing a geometry of the real-world scene, shapes and dynamic behaviors expected from moving objects in the scene, plus, additionally, internal and external camera calibration models on the scene.
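(The following Python sketch, offered only as an illustration and using hypothetical field names, indicates the kind of content such an information base might hold: internal and external camera calibration models together with scene geometry and expected object shapes/behaviors.)

```python
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class CameraCalibration:
    """Internal (intrinsic) and external (extrinsic) calibration for one camera."""
    K: np.ndarray        # 3x3 intrinsic matrix (focal lengths, principal point)
    R: np.ndarray        # 3x3 rotation, world -> camera
    t: np.ndarray        # 3-vector translation, world -> camera

    def project(self, world_point):
        """Project a 3-D world point into this camera's image plane."""
        p = self.K @ (self.R @ np.asarray(world_point, dtype=float) + self.t)
        return p[:2] / p[2]

@dataclass
class InformationBase:
    """Scene geometry, expected object shapes/behaviors, and camera models."""
    scene_geometry: Dict[str, np.ndarray] = field(default_factory=dict)  # e.g., field plane
    object_models: Dict[str, dict] = field(default_factory=dict)         # e.g., player height range
    cameras: List[CameraCalibration] = field(default_factory=list)
```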
A video data analyzer means detects and tracks objects of potential interest in the scene, and the locations of these objects.
A three-dimensional environmental model builder records the detected and tracked objects at their proper locations in a three-dimensional model of the scene. This recording is in consideration of the information base.
A viewer interface is responsive to a viewer of the scene to receive a viewer selection of a desired view on the scene. This selected and desired view need not be identical to any views that are within any of the multiple received video images.
Finally, a visualizer generates (alternatively, "synthesizes") (alternatively "extracts") from the three- dimensional model of the scene, and in accordance with the received desired view, a video image on the scene that so shows the scene from the desired view.
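(Purely for illustration, the cooperation of these components may be summarized as one processing cycle; the Python sketch below uses hypothetical callables standing for each component, and is not the disclosed implementation.)

```python
def immersive_video_cycle(frames_per_camera, info_base, analyzer, model_builder,
                          viewer_interface, visualizer):
    """One processing cycle of the immersive video computer system, in brief."""
    # (i) video data analyzer: detect and track objects of interest in each view
    detections = [analyzer(frame, cam)
                  for frame, cam in zip(frames_per_camera, info_base.cameras)]
    # (ii) environmental model builder: place the tracked objects within the 3-D model
    scene_model = model_builder(detections, info_base)
    # (iii) viewer interface: obtain the desired (possibly virtual) view on the scene
    desired_view = viewer_interface()
    # (iv) visualizer: synthesize/extract the requested image from the 3-D model
    return visualizer(scene_model, desired_view)
```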
These and other aspects and attributes of the present invention will become increasingly clear upon reference to the following drawings and accompanying specification.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a top-level block diagram showing the high level architecture of the system for Multiple Perspective Interactive (MPI) video in accordance with the present invention.
Figure 2 is a functional block diagram showing an overview of the MPI system in accordance with the present invention, previously seen in block diagram in Figure 1, in use for interactive football video.
Figure 3 is a diagrammatic representation of the hardware configuration of the MPI system in accordance with the present invention, previously seen in block diagram in Figure 1.
Figure 4 is a pictorial representation of a video display particularly showing how, as a viewer interface feature of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention previously seen in block diagram in Figure 1, a viewer can select one of the many items to focus on in the scene.
Figure 5 is a diagrammatic representation showing how different cameras provide focus on different objects in the MPI system in accordance with the present invention; depending on the viewer's current interest an appropriate camera must be selected.
Figure 6 is another pictorial representation of the video display of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, this video display particularly showing a viewer-controlled three-dimensional cursor serving to mark a point in three-dimensional (3-D) space, with the projection of the 3-D cursor being a regular 2-D cursor.
Figure 7 is a diagram showing coordinate systems for camera calibration in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention.
Figure 8, consisting of Figures 8a through 8c, is a pictorial representation, and accompanying diagram, of three separate video displays in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, the three separate displays showing how three different cameras provide three different sequences, the three different sequences being used to build the model of events in the scene.
Figure 9, consisting of Figures 9a and 9b, is a pictorial representation of two separate video displays in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention showing how many known points in an image can be used for camera calibration; the frame of Figure 9a having sufficient points for calibration but the frame of Figure 9b having insufficient points for calibration.
Figure 10, consisting of Figures 10a through 10c, is a pictorial representation of three separate video frames, arising from three separate algorithm-selected video cameras, in the Multiple Perspective Interactive (MPI) video system in accordance with the present invention.
Figure 11 is a schematic diagram showing a Global Multi-Perspective Perception System (GM-PPS) portion of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention in use to take data from calibrated cameras covering a scene from different perspectives in order to dynamically detect, localize, track and model moving objects -- including a robot vehicle and human pedestrians -- in the scene.
Figure 12 is a top-level block diagram showing the high level architecture of the Global Multi-Perspective Perception System (GM-PPS) portion, previously seen in Figure 11, of the Multiple Perspective Interactive (MPI) video system in accordance with the present invention, the architecture showing the interaction between a priori information formalized in a static model and the information computed during system processing and used to formulate a dynamic model.
Figure 13 is a graphical illustration showing the intersection formed by the rectangular viewing frustum of each camera scene onto the environment volume in the GM-PPS portion of the MPI video system of the present invention; the filled frustum representing possible areas where the object can be located in the 3-D model while, by use of multiple views, the intersection of the frustum from each camera will closely approximate the 3-D location and form of the object in the environment model.
Figure 14, consisting of Figure 14a and Figure 14b, is a diagram of a particular, exemplary, environment of use of the GM-PPS portion, and of the overall MPI video system of the present invention; the environment being an actual courtyard on the campus of the University of California, San Diego, where four cameras, the locations and optical axes of which are shown, monitor an environment consisting of static objects, a moving robot vehicle, and several moving persons.
Figure 15 is a pictorial representation of the distributed architecture of the GM-PPS portion of the MPI video system of the present invention wherein (i) a graphics and visualization workstation acts as the modeler, (ii) several workstations on the network act as slaves which process individual frames based on the master's request so as to (iii) physically store the processed frames either locally, in a nearby storage server, or, in the real-time case, as digitized information on a local or nearby frame-grabber.
Figure 16 is a diagram showing the derivation of a camera coverage table for an area of interest, or environment, in which objects will be detected, localized, tracked and modeled by the GM-PPS portion of the MPI video system of the present invention; each grid cell in the area is associated with its image in each camera plane while, in addition, the diagram shows an object dynamically moving through the scene and the type of information the GM-PPS portion of the MPI video system uses to maintain knowledge about this object's identity.
Figure 17, consisting of Figures 17a through 17d, is four pictorial views of the campus courtyard previously diagrammed in Figure 14 at global time 00:22:29:06; the scene containing four moving objects including a vehicle, two walkers and a bicyclist.
Figure 18 is a pictorial view of a video display to the GM-PPS portion of the MPI video system of the present invention, the video display showing, as different components of the GM-PPS, views from the four cameras of Figure 17 in a top row, and a panoramic view of the model showing hypotheses corresponding to the four moving objects in the scene in a bottom portion; the GM-PPS serving to detect each object in one or more views as is particularly shown by the bounding boxes, and serving to update object hypotheses by a line-of-sight projection of each observation.
Figure 19, consisting of Figures 19a through 19e, is five pictorial views of the GM-PPS model showing various hypotheses corresponding to the four moving objects in the scene of Figure 17 at global time 00:22:29:06; Figures 19a-19d correspond to four actual camera views while Figure 19e shows a virtual image from the top of the scene.
Figure 20, consisting of Figures 20a through 20d, is four pictorial views of the same campus courtyard previously diagrammed in Figure 14, and shown in Figure 17, at global time 00:22:39:06; the scene still containing four moving objects including a vehicle, two walkers and a bicyclist.
Figure 21 is another pictorial view of the video display to the GM-PPS portion of the MPI video system of the present invention previously seen in Figure 18, the video display now showing a panoramic view of the model showing the hypotheses corresponding to the four moving objects in the scene at the global time 00:22:39:06 as was previously shown in Figure 20.
Figure 22, consisting of Figures 22a through 22c, is a diagrammatic view showing how immersive video in accordance with the present invention uses video streams from multiple strategically-located cameras that monitor a real-world scene from different spatial perspectives.
Figure 23 is a schematic block diagram of the software architecture of the immersive video system in accordance with the present invention.
Figure 24 is a pictorial view showing how the video data analyzer portion of the immersive video system of the present invention detects and tracks objects of potential interest and their locations in the scene.
Figure 25 is a diagrammatic view showing how, in an immersive video system in accordance with the present invention, the three-dimensional (3D) shapes of all moving objects are found by intersecting the viewing frustums of objects found by the video data analyzer; two views of a full three-dimensional model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration being particularly shown.
Figure 26 is a pictorial view showing how, in the immersive video system in accordance with the present invention, a remote viewer is able to walk through, and observe, a scene from anywhere using virtual reality control devices such as the boom shown here.
Figure 27 is an original video frame showing video views from four cameras simultaneously recording the scene of a campus courtyard at a particular instant of time.
Figure 28 is four selected virtual camera, or synthetic video, images taken from a 116-frame "walk through" sequence generated by the immersive video system in accordance with the present invention (color differences in the original color video are lost in monochrome illustration) .
Figure 29, consisting of Figures 29a and 29b, are synthetic video images generated from original video by the immersive video system in accordance with the present invention, the synthetic images respectively showing a "bird's eye view" and a ground level view of the same courtyard previously seen in Figure 27 at the same instant of time.
Figure 30a is a graphical rendition of the 3D environment model generated for the same time instant shown in Figure 27, the volume of voxels in the model intentionally being at a scale sufficiently coarse so that the 3D environmental model of two humans appearing in the scene may be recognized without being so fine that it cannot be recognized that it is only a 3D model, and not an image, that is depicted.
Figure 30b is a graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention for an indoor karate demonstration as was previously shown in Figure 25, the two human participants being clothed in karate clothing with a kick in progress, the scale and the resolution of the model being clearly observable.
Figure 30c is another graphical rendition of the full 3D environment model generated by the environmental model builder of the immersive video system of the present invention, this time for an outdoor karate demonstration, this time the environmental model being further shown to be located in the static scene, particularly of an outdoor courtyard.
Figure 31 is a listing of Algorithm 1, the Vista "Compositing" or "Hypermosaicing" Algorithm, in accompaniment to a diagrammatic representation of the terms of the algorithm, of the present invention where, at each time instant, multiple vistas are computed using the current dynamic model and video streams from multiple perspectives; for stereoscopic presentations vistas are created from left and from right cameras.
Figure 32 is a listing of Algorithm 2, the Voxel Construction and Visualization for Moving Objects Algorithm in accordance with the present invention.
Figure 33 is a set of synthetic video frames, similar to the frames of Figure 10, created by the immersive video system of the present invention at a random user-specified viewpoint during a performance of an indoor karate exercise by an actor in the scene, the virtual views of the indoor karate exercise of Figure 33 being rendered at a higher resolution than were the virtual views of the outdoor karate exercise of Figure 30.
DESCRIPTION OF THE PREFERRED EMBODIMENT
1. Capabilities of the Multiple Perspective Interactive Video of the Present Invention, and Certain Potential Implications of These Capabilities
The capabilities of the Multiple Perspective Interactive (MPI) video of the present invention are discussed even prior to teaching the system that realizes these capabilities in order that certain potential implications of these capabilities may best be understood. Should these implications be understood, it may soon be recognized that the present invention accords not merely a "fancy form" of video, but an in-depth change to the existing, fundamental, video and television viewing experience.
The present specification presents a system, a method and a model for Multiple Perspective Interactive -- "MPI" -- video or television. In the MPI video model multiple cameras are used to acquire an episode or a program of interest from several different spatial perspectives. The cameras are real, and exist in the real world: to use a source camera, or a source image, that is itself virtual constitutes a second-level extension of the invention, and is not presently contemplated.
MPI video is always interactive -- the "I" in MPI -- in the sense that the perspective from which the video scene is desired to be, and will be, shown and presented to a viewer is permissively chosen by such viewer, and predetermined. However, MPI video is also interactive in that, quite commonly, the perspective on the scene is dynamic, and responsive to developments in the scene. This may be the case regardless that the real video images of the scene from which the MPI video is formed are themselves dynamic and may, for example, exhibit pan and zoom. Accordingly, a viewer-selected dynamic presentation of dynamic events that are themselves dynamically imaged is contemplated by the present invention.
Consider, for example, the presentation of MPI video for a game of American football. The "viewer-selected dynamic presentation" might be, for example, a viewer-selected imaging of the quarterback. This image is dynamic in accordance that the quarterback should, by his movement during play, cause that, in the simplest case, the images of several different video cameras should be successively selected or, in the case of such full virtual video as is contemplated by the present invention, that the quarterback's image should be variously dynamically synthesized by digital computer means. The football game is, of course, a dynamic event wherein the quarterback moves. Finally, the real-world source, camera, images that are used to produce the MPI video are themselves dynamic in accordance that the cameramen at the football game attempt to follow play.
The net effect of all this dynamism is non-obvious, and of a different order than even such video, or television, experience as is commonly accorded a network video director of a major sporting event who is exposed to a multitude of (live) video feeds. The experience of MPI video in accordance with the present invention may usefully be compared, and contrasted, with virtual reality. The term "virtual reality" commonly has connotations of (i) unreality, (ii) sensory immersion, and/or (iii) self-directed interaction with a reality that is only fantasy, or "virtual". The effect of the MPI video of the present invention differs from
reality" m all these factors, put is nonetheless quite shocking
In the first place, the present invention is not restricted to use with video depicting reality -- but reality is the cheapest source of such information as can, when viewed through the MPI video system of the present invention, still be quite "intense". In other words, it may not be necessary to be attacked by a fake, virtual, tiger when one can visually experience the onrush of a real hostile football linebacker.
In the second place, MPI video is presented upon a common monitor, or television set, and does not induce the viewer to believe that he or she has entered a fantasy reality.
Finally, and in the third place, the self-directed interaction with MPI video is directed to observational perspective, and not to a viewer's dynamic control of developments in the scene in accordance with his or her action, or inaction.
What MPI video can do, and what causes it to be "shocking", is that the viewer can view, or, in the American vernacular, "get into", the video scene just where, and even when, the viewer chooses. Who at a live sporting event has not looked at the cheerleaders, a favorite player, or even the referee? Psychological and sociological research has shown that, among numerous other differences between us all, men and women, as one example, do not invariably visually acquire the same elements of a picture or painting, let alone do the two sexes visually linger on such elements as they identify in common for equal time durations. (Women like to look at babies in a scene more so than do men, and men like to look at women in a scene more so than do women.) Quite simply, humans often have different interests, and focal points of interest, even in the same visual subject matter. With present video and television presentations everyone must watch the same thing, a "common composite". With the viewer-interactive control that is inherent in MPI video, different things can be differently regarded at each viewer's behest. Accordingly, MPI video removes some of the limitations that presently make a video or a television viewer only a passive participant in the video or television viewing process (in the American vernacular, a "couch potato").
Of course, MPI video need not be implemented for each and every individual video or television viewer in order to be useful. Perhaps with the advent of communicating 500 channels of television to the home, a broadcast major American football game might reasonably consume not one, but 25+ channels -- one for each player of both sides on the football field, one for each coach, one for the football, and one for the stadium, etc.
An early alternative may be MPI video on pay-per-view. It has been hypothesized that the Internet, in particular, may expand in the future to as likely connect smart machines to human users, and to each other, as to communicatively interconnect more and more humans, only. Customized remote viewing can certainly be obtained by assigning every one his or her own remotely-controllable TV camera, and robotic rover. However, this scheme soon breaks down. How can hundreds and thousands of individually-remotely-controlled cameras jockey for position and for viewer-desired vantage points at a single event, such as the birth of a whale, or an auto race? It is likely a better idea to construct a comprehensive video image database from quality images obtained from only a few strategically positioned cameras, and to then permit universal construction of customized views from this database, all as is taught by the present invention.
As will additionally be seen, the MPI video of the present invention causes video databases to be built in which databases are contained -- dynamically and from moment to moment (frame to frame) -- much useful information that is interpretive of the scene depicted. Clearly, in order to select, or to synthesize, an image of a particular player, the MPI video system contains information of the player's present whereabouts, and image. It is thus a straightforward matter for the system to provide information, in the form of text or otherwise, on the scene viewed, either continuously or upon request.
Such auxiliary information can augment the entertainment experience. For example, a viewer might be alerted to a changed association of a football in motion from a member of a one team to a member of the opposing team as is recognized by the system to be a fumble recovery or interception. For example, a viewer might simply be kept informed as to which player presently has possession of the football.
The more probable use of such auxiliary information is education. It will no longer be necessary to remain in confused ignorance of what one is viewing if, by certain simple commands, "helps" to understanding the scene, and the experience, may be obtained.
2. An Actual System Performing Multiple Perspective
Interactive (MPI) Video in Accordance With the Present Invention, and Certain Limitations of this Exemplary System
The MPI video model, its implementation, and the architectural components of a rudimentary system implementing the model are taught in the following sections 3 through of this specification. Television is a real-time version of MPI video. Interactive TV is a special case of MPI video. In MPI TV, many operations must be done in real time because many television programs are broadcast in real time.
The concept of MPI video is taught in the context of a sport event. The MPI video model allows a viewer to be active; he or she may request a preferred camera position or angle, or the viewer may even ask questions about contents described in the video. Even the rudimentary system automatically determines the best camera and view to satisfy the demands of the viewer.
Videos of American football have been selected as the video source texts upon which the performance of MPI video will be taught and demonstrated. Football video already in existence was retrieved, and operated upon as a sample application of MPI video in order to demonstrate certain desirable features.
The particular, rudimentary, embodiment of an MPI video system features automatic camera selection and interaction using three-dimensional cursors. The complete computational techniques used in the rudimentary system are not fully contained herein this specification in detail because, by and large, known techniques hereinafter referred to are implemented. Certain computational techniques are, however, believed novel, and the mathematical basis of each of these few techniques is fully explained herein.
The rudimentary, demonstration, system of the present invention has been reduced to operative practice, and all drawings or photographs of the present specification that appear to be of video screens are representations or photographs of actual screens, and are not mock-ups. Additionally, where continuity between successive video views is implied, then this continuity exists in reality although, commensurate with the amount of computer resource and computational power harnessed to do the necessary transformations, the successive and continuous views and presentations may not be in full real time.
The running MPI video system is presently being extended to other applications besides American football. In particular, a detailed teaching of the concept, and method, of generating a three-dimensional database required by the MPI video system of the present invention is taught and demonstrated in this specification not in the context of football, but rather, as a useful simplification, in the context of a university courtyard through which human and machine subjects (as opposed to football players) roam. The present specification will accordingly be understood as being directed to the enabling principles, construction, features and resulting performance of a rudimentary embodiment of an MPI video system, as opposed to presenting great details on any or all of the several separate aspects of the system.
3. Architecture of the MPI Video System
A physical phenomenon or an event can usually be viewed from multiple perspectives. The ability to view from multiple perspectives is essential in many applications. Current remote viewing via video or television permits viewing only from one perspective, that perspective being that of an author or editor and not of the viewer. A viewer has no choice. However, remote viewing via video or television even under such limitations has been very attractive and has influenced our modern society in many aspects.
Technology has now advanced to the state that each of many simultaneous remote viewers (i) can be provided with a choice to so view remotely from whatever perspective they want, and, with limitations, (ii) can interactively select just what in the remote scene they want to view.
Let us assume that an episode is being recorded, or being viewed in real time. This episode could be related, for example, to a scientific experiment, an engineering analysis, a security post, a sports event, or a movie. In a simplest and most obvious case, the episode can be recorded using multiple cameras strategically located at different points. These cameras provide different perspectives of the episode. Each camera view is individually very limited. The famous parable about an elephant and the blind men may be recalled. With just one camera, only a narrow aspect of the episode may be viewed. Like a single blind man, a single camera is unable to provide a global description of an episode.
Using computer vision and related techniques in accordance with the present invention, it is possible to take individual camera views and reconstruct an entire scene. These individual camera scenes are then assimilated into a model that represents the complete episode. This model is called an "environment model". The environment model has a global view of the episode, and it also knows where each individual camera is. The environment model is used in the MPI system to permit a viewer to view what he or she wants from where he or she wants (within the scene, and within limits).
Assume that a viewer is interested in one of the following.
First, the viewer may be interested in a specific perspective, and may want to view a scene, an episode, or an entire video presentation from this specific perspective. The user may specify a real, or a virtual, camera specifically. Alternatively, the viewer may only specify the desired general location of the camera, without actual knowledge whether a camera in such location would be real or virtual.
Second, the viewer may be interested in a specific object. There may be several objects in a scene, an episode, or a presentation. A viewer may want to always view a particular object independent of its situation in the scene, episode, or presentation. Alternatively, the object that is desired to be viewed may be context sensitive: the viewer may desire to view the basketball until a goal is scored, and then to shift view to the last player to touch the basketball.
Third, the viewer may be interested in a specific event. A viewer may specify characteristics of an event and may want to view a scene, an episode, or a presentation from the best perspective for that event.
Fourth, the viewer may be interested in having a view from a virtual camera. The viewer may request to view a scene, or an event within the scene, from a perspective that is not provided by any real camera that is situated to acquire the scene or any portion thereof. In such cases, the MPI video system of the present invention will, by use of the environment model and video synthesis techniques, synthesize a virtual camera, and video image, so as to view a scene, an episode, or an entire presentation from a viewer-specified perspective.
The high level architecture for a MPI video system so functioning is shown in a first level block diagram in Figure 1. An image at a certain perspective from each camera 10a, 10b, ...10n is converted to its associated camera scene in camera scene buffers CSB 11a, 11b, ...11n. Multiple camera scenes are then assimilated into the environment model 13 by computer process in the Environ. Model Builder 12. A viewer 14 (shown in phantom line for not being part of the MPI video system of the present invention) can select his perspective at the Viewer Interface 15, and that perspective is communicated to the Environment Model via a computer process in Query Generator 16. The programmed reasoning system in the Environment Model 13 decides what to send via Display Control 17 to the Display 18 of the viewer 14.
Implementation of a universal, plug and play, MPI video system that can (i) track virtually anything, (ii) function in real time (i.e., for television), and/or (iii) produce virtually any desired image, including a full virtual image, severely stresses modern computer and video hardware technology circa 1995, and can quickly come to consume the processing power of a mini-supercomputer. Economical deployment of the MPI video system requires, circa 1995, advances in several hardware technology areas. Notably, however, there is, as will imminently be demonstrated, no basic hardware nor software function required by such a MPI video system that is not only presently realizable, but that is, in actual fact, already realized. Moreover, a relatively high level, user friendly, viewer interface -- which might have been considered impossible or extremely difficult of being successfully achieved -- "falls out" quite naturally, and to good effect, from the preferred implementation of, and the partitioning of function within, the MPI system.
A complete MPI video system with limited features can be, and has been, implemented using the existing technology. The exact preferred architecture of a MPI video system will depend on the area to which the system is intended to be applied, and the type and level of viewer interaction allowed. However, certain general issues are common to any and all implementations of MPI video systems. Seven critical areas that must be addressed in building any MPI video system are as follows.
First, a camera scene builder is required as a programmed computer process. In order to convert an image sequence of a camera to a scene sequence, the MPI video system must, and does, know where the camera is located, its orientation, and its lens parameters. Using this information, the MPI video system is then able to locate objects of potential interest, and the locations of these objects in the scene. This requires powerful image segmentation methods. For structured applications, the MPI video system may use some knowledge of the domain, and may even change or label objects to make its segmentation task easier. This is, in fact, the approach of the rudimentary embodiment of the MPI video system, as will be further discussed later.
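(As an illustration of this first requirement only, and assuming the calibration structure sketched earlier, the following Python fragment back-projects a segmented object's image footpoint onto the scene ground plane; the names, and the flat ground-plane assumption, are hypothetical.)

```python
import numpy as np

def locate_on_ground(cal, image_point):
    """Back-project an object's image footpoint onto the ground plane (z = 0),
    giving its location in the environment model's coordinate frame."""
    u, v = image_point
    ray_cam = np.linalg.inv(cal.K) @ np.array([u, v, 1.0])   # viewing ray, camera frame
    ray_world = cal.R.T @ ray_cam                             # same ray, world frame
    origin = -cal.R.T @ cal.t                                 # camera center, world frame
    s = -origin[2] / ray_world[2]                             # intersect with z = 0
    return origin + s * ray_world                             # 3-D point on the ground
```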
Second, an environment model builder is required as a programmed computer process. Individual camera scenes are combined in the MPI video system to form a model of the environment. All potential objects of interest and their locations are recorded in the environment model. The representation of the environment model depends on the facilities provided to the viewer. If the images are segmented properly, then, by use of powerful but known computers and computing methods, it is possible to build environment models in real time, or almost in real time.
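(For illustration only: one way an environment model builder can combine per-camera observations is to intersect the viewing frustums of the detections, keeping only those voxels that project inside a detected bounding box in every camera. The brute-force Python sketch below, with hypothetical arguments, conveys the idea rather than the disclosed algorithm.)

```python
import numpy as np

def build_environment_model(detections, cameras, grid_shape, voxel_to_world):
    """Intersect detection frustums from all cameras into a 3-D occupancy model.

    detections:     per camera, a list of bounding boxes (u0, v0, u1, v1)
    cameras:        calibrated cameras offering a project(world_point) method
    voxel_to_world: maps a voxel index (i, j, k) to world coordinates
    """
    occupancy = np.ones(grid_shape, dtype=bool)
    for boxes, cam in zip(detections, cameras):
        seen = np.zeros(grid_shape, dtype=bool)
        for idx in np.ndindex(grid_shape):
            u, v = cam.project(voxel_to_world(idx))
            if any(u0 <= u <= u1 and v0 <= v <= v1 for (u0, v0, u1, v1) in boxes):
                seen[idx] = True
        occupancy &= seen        # keep only voxels supported by every camera's view
    return occupancy
```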
Third, a viewer interface permits the viewer to select the perspective that he or she wants. This information is obtained from the user in a friendly but directed manner. Adequate tools are provided to the user to point and to pick objects of interest, to select the desired perspective, and to specify events of interest. Recent advances in visual interfaces, virtual reality, and related areas have contributed to making the MPI video system viewer interface very powerful -- even in the rudimentary embodiment of the system.
Fourth, a display controller software process is required to respond to the viewers' requests by selecting appropriate images to be displayed to each such viewer. These images may all come from one perspective, or the MPI video system may have to select the best camera at every point in time in order to display the selected view and perspective. Accordingly, multiple cameras may be used to display a sequence over time, but at any given time only a single best camera is used. This has required solving a camera hand-off problem.
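(By way of illustration of the camera hand-off problem only: the following Python sketch selects, at a given instant, the single camera in which a viewer-selected target is visible and nearest the image center; the scoring rule is a hypothetical simplification of the selection actually required.)

```python
import numpy as np

def select_best_camera(cameras, image_sizes, target_world_point):
    """Camera hand-off: pick the one camera that best views the selected target."""
    best, best_score = None, np.inf
    for idx, (cam, (w, h)) in enumerate(zip(cameras, image_sizes)):
        u, v = cam.project(target_world_point)
        if not (0 <= u < w and 0 <= v < h):
            continue                                  # target outside this camera's view
        score = np.hypot(u - w / 2, v - h / 2)        # distance from the image center
        if score < best_score:
            best, best_score = idx, score
    return best                                       # index of the camera to display
```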
Fifth, a video database must be maintained. If the video event is not in real time (i.e., television), then it is possible to store an entire episode in a video database. Each camera sequence is stored along with its metadata. Some of the metadata is feature based, and permits content-based operations. See Ramesh Jain and Arun Hampapur, "Metadata for video-databases," appearing in SIGMOD Record, Dec. 1994.
In many applications of the MPI video system, environment models are also stored in the database to allow rapid interactions with the system.
Sixth, real-time processing of video must be implemented to permit viewing of real time video events, i.e. television. In this case a special system architecture is required to interpret each camera sequence in real time and to assimilate their results in real time so that, based on a viewer input, the MPI video system can use the environment model to solve the camera selection problem.
A practitioner of the computer arts and sciences will recognize that this sixth requirement is nothing but the fifth requirement performed faster, and in real time. The requirement might just barely be realizable in software if computational parallelism is exploited, but, depending upon the simplifying assumptions made, a computer ranging from an engineering work station to a full-blown supercomputer (both circa 1995) may be required. Luckily, low-cost (but powerful) microprocessors are likely distributable to each of the Camera Sequence Buffers CSB 11a, 11b, ... 11n in order to isolate, and to report, features and dynamic features within each camera scene. Correlation of scene features at a higher process level may thus be reduced to a tractable problem. Another excellent way of simplifying the problem -- which way is used in the rudimentary embodiment of the MPI video system taught within this specification -- is to demand that the scene, and each camera view thereof, include constant, and readily identifiable, markers as a sort of video "grid". An American football field already has this grid in the form of yard lines and hash marks. So might a college courtyard with benches and trees. A whale free-swimming in an amorphous tank while giving birth is at the other end of the spectrum, and presents an exceedingly severe camera image selection (if not also correlation) problem.
Seventh, a visualizer is required in those applications that require the displaying of a synthetic image in order to satisfy a viewer's request. For example, it is possible that a user selects a perspective that is not available from any camera. A trivial solution is simply to select the closest camera, and to use its image. The solution of the rudimentary MPI video system of the present specification -- which solution is far from trivial in implementation or trite in the benefits obtained -- is to select a best -- and not necessarily the closest -- camera and to use its image and sequence.
The ultimate response of the MPI video system is to synthesize the exact synthetic image, and image sequence, that the viewer desires and demands. Even here, no image can be formed where no source image data exists, such as a view from below the playing field (i.e., from in the ground). Even a synthetic view that is normally acceptable, such as "from the nose of the football in the vector direction of the movement of same", cannot be produced when, and at such times as, the football becomes "buried", and obscured from view, under a pile-up after the ball carrier is tackled. "Weird" views in synthesized MPI video can be exciting, but, in accordance with their "weirdness", are not always reliably capable of being successfully synthesized.
The ability of an MPI video system to synthesize a full virtual video image is basically a function of "raw" computational power. If real-time video (i.e., television) is not required, short virtual video segments of real-world occurrences may be quite as reasonably produced, and maybe more reasonably produced, than the computer-generated special effects, including morphing, so popular in American movies circa 1995. Of course, it should be understood that even the synthesis of such segments requires computers of considerable speed and capacity.
Clearly, implementation of an MPI video system with unrestricted capability requires state-of-the-art computer hardware and software, and will benefit by such improvements in both as are confidently expected. Some new issues, other than the above seven, are expected to arise in addressing different applications of MPI video. At the present time, and in this specification, only a rudimentary MPI video system is taught. By implementing this first MPI video system, the inventors have identified interesting future issues in each of computer vision, artificial intelligence, human interfaces, and databases. However, and for the moment, the following sections serve to discuss and teach an actual MPI video system that was implemented to demonstrate the concept of the invention more concretely and completely, as well as to define and identify performance issues.
4. A Rudimentary, Prototype, Embodiment of an MPI Video System in Use for Producing MPI Video of American Football
Key concepts in MPI video are taught in this section 4. by reference to a rudimentary, prototype, embodiment of an MPI video system that was built particularly for multiple perspective interactive viewing of American football. The motivation of the inventors in selecting this domain was to find a domain that was realistic, interesting, non-trivial and sufficiently well structured so as to demonstrate many important concepts of MPI video. It is also of note that, should the present MPI video system be applied commercially, it might already be possessed of such characteristics as would seemingly make it of some practical use in certain applications such as the "instant replay".
Many other sports and many other applications were considered by the inventors. American football was chosen due to the several attributes of the game that make it highly structured from both (i) a database and (ii) a computer vision perspective. These issues of structure are hereinafter discussed in the context of the implementation of the rudimentary, prototype, embodiment of the MPI video system.
4.1 Scenario of Use, and Required Functions, of an MPI Video System As Applied to American Football
Although American-type football games are very popular in North America on conventional television, the broadcasts of these football games have several limitations from a viewer's perspective. The viewing of American football games could seemingly be significantly enhanced by adding the following facilities.
Usually a football game is captured by several cameras that are placed at different locations on the field. Though those cameras cover various parts of the game, viewers can get only one camera view at a time. This view is not a result of the viewers' choice, but is instead what an editor thinks most people want to see. In most cases, the editor's decisions are right. In any case, with the current technology this expert selection of views is seemingly the best that can be done. If a viewer is interested in a certain player, or a shot from a different angle, then he or she cannot see the desired image unless the editor's choice happens to be the same as the viewer's. By giving choices to a viewer, it is anticipated that watching the game might be made significantly more interesting.
Moreover, when watching a football game, questions often occur to viewers such as "who is this player who just made the tackle", or "how far did this player run in this play". Conventional video or television does not necessarily provide such information. Tools that provide such information would seemingly be useful.
Still further, while watching a video of a football game, a coach or a player may want to analyze how a particular player ran, or tackled, and to ignore all other players. An interactive viewing system should allow the viewing of only plays of interest, and these from different angles. Moreover, the video would desirably be good enough so that some detailed analysis would be capable of being performed on the video of the plays in order to study the precise patterns, and performance, of the selected player.
In the rudimentary MPI video system, viewers may both (i) select cameras according to their preference, and (ii) ask questions about the name(s), or the movement(s), of players. The following are some examples of interaction between a viewer and the MPI video system.
The viewer may request that the MPI video system should show a shot of some upcoming play or plays taken from a camera located behind the quarterback.
The viewer may request that the MPI video system should show a best shot of a particular, viewer-identified, player.
The viewer may request that the MPI video system should show as text the name of the player to which the viewer points, with his or her cursor, on the screen of the display 18 (shown in Figure 1).
The viewer may request that the MPI video system should highlight on the screen a particular player whose name the viewer has selected from a player list.
The viewer may request that the MPI video system should show him or her the exact present location of a selected player.
The viewer may request that the MPI video system should show him or her the sequence when a selected player crossed, for example, the 40 yard line.
The viewer may request that the MPI video system should show him or her the event of a fumble.
The viewer may request that the MPI video system should show all third down plays in which quarterback X threw the ball to the receiver Y.
To perform these functions, and others, the MPI video system needs to have information about (i) the contents of the football scene as well as (ii) the video data. Some of the above, and several similar questions, are relevant to MPI television, while others are relevant to MPI video. The major distinction between MPI TV and MPI video is in the role of the database. In the case of MPI video, it is assumed that much preprocessing can transpire, with the pre-processed information stored in a database. In the case of MPI TV, most processing must be, and will be, in real time.
In the following section the rudimentary, prototype, MPI system discussed is, remarkably, an MPI TV system. A large random-access video database system that is usable as a component of an MPI video system is realizable by conventional means, but is expensive (circa 1995) in accordance with the amount of video stored, and the rapidity of the retrieval thereof.
In the rudimentary, prototype, MPI TV system, as shown in Fig. 2, a football scene is captured by several cameras and analyzed by a scene analysis system. The information obtained from individual cameras is used to form the environment model. The environment model allows viewers to interactively view the scene.
Additionally, a prototype football video retrieval system has been implemented, as hereafter explained. This system incorporates some of the above-listed functions such as automatic camera selection and pointing to players. Other functions are readily susceptible of implementation using the same, existing, hardware and software technologies as are already within the rudimentary embodiment of the system.
4.1.1 Overview of the MPI Football Video/Television System
The configuration of the MPI football video/television system is shown in Figure 3. The current system consists of a UNIX workstation, a laser disc player, a video capture board, and a TV monitor and graphical display. The TV monitor is connected to the laser disc player. The laser disc player is controlled by the UNIX workstation. A graphical user interface is built using X-window and Motif on graphical display.
In use of the system, video of a football game was recorded on a laser disc. The actual video recorded was a part of the 1994 Super Bowl game. Since this video footage was obtained from a commercial broadcast, the inventors did not have any control over camera location. Instead, the camera positions were reverse engineered using camera calibration algorithms. See R. M. Haralick and L. G. Shapiro; Computer and Robot Vision, Addison-Wesley Publishing, 1993.
Next, parts of the Super Bowl football game in which views from three different cameras were shown were selected. The three views were, of course, broadcast at three separate times. They depict an important, and exciting, play in the 1994 Super Bowl game. This selection was necessary to simulate the availability of separate video streams from multiple cameras.
This video data was divided into shots, each of which corresponds to one football play. Each shot was analyzed and a three-dimensional scene description -- to be discussed in considerable detail in sections 5-10 hereinafter -- was generated. Shots from multiple cameras were combined into the environment model. The environment model contains information about the positions of players and the status of cameras. The environment model is used by the system to allow MPI video viewing to a user. User commands are treated as queries to the system and are handled by the environment model and the database.
The interactive video interface of the system is shown in Figure 4. The video screen of Figure 4 shows video frames taken from the laser disc. Video control buttons control video playback. Using a camera list, a viewer can choose any camera. Using a player list, a viewer can choose certain players to be focused on. If a viewer doesn't select a camera, then the system automatically selects the best camera. Also, multiple viewers can interact using the three-dimensional cursor. These new features are described below. Some interface features for the interactive video are shown here. A user can select one of the many items to focus on in the scene.
4.2 Automatic Camera Selection
At any moment, there are several cameras that shoot the game. Automatic camera selection is a function that selects the best camera according to the preference of a user. Suppose a player is captured by three cameras and they produce the three views shown in Fig. 5. In this case camera 2 is the best from which to see this player, for in camera 1 the player is out of the viewing area while in camera 3 the player is too small. Different cameras provide focus on different objects. Depending on the current interest, an appropriate camera must be selected.
This function is performed by the system in the following way. First, viewers select the player that they want to see. Then the system looks into information on player position and camera status in the environment model to determine which camera provides the best shot of the player. Finally the selected shot is routed to the screen.
4.3 Interaction Using Three-Dimensional Cursors
In accordance with the present invention, a three-dimensional cursor is introduced in support of the interaction between viewers and the MPI video/TV system. A three-dimensional cursor is a cursor that moves in three-dimensional space. It is used to indicate a particular position in the scene. The MPI video/TV system uses this cursor to highlight players. Viewers also use it to specify players that they want to ask questions about.
Examples of interaction using three-dimensional cursors are shown in Figure 6. As shown in Figure 6, the cursor consists of five lines. Three of the five lines indicate the x, y and z axes of the three-dimensional space. The intersection of these three lines shows the cursor position. The other two lines indicate a projection of the three lines onto the ground. The projection helps viewers form a correct impression of the cursor position.
A viewer can manipulate the three-dimensional cursor so as to mark a point in the three-dimensional space. The projection of the three-dimensional cursor is a regular cursor centered at the projection of this marked point.
Both viewers and the MPI system use the three-dimensional cursor to interact with each other. In the first example of Figure 6, a viewer moves the cursor to the position of a player and asks who this player is. The MPI system then compares the position of the cursor and the present position of each player to determine which player the viewer is pointing to.
In the second example of Figure 6, a viewer tells the MPI system the name of a player and asks where the player is. The MPI system then shows the picture of the player and overlays the cursor on the position of the player so as to highlight the player.
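By way of illustration only, the first interaction of Figure 6 -- determining which player a viewer is pointing to with the three-dimensional cursor -- may be sketched as a simple nearest-neighbor comparison against the player positions held in the environment model. The following sketch is written in modern Python notation; the function name, the data layout, and the rejection distance are assumptions of the sketch, and are not part of the specification.

import math

def player_at_cursor(cursor_xyz, player_positions, max_distance=2.0):
    """Return the name of the player closest to the 3-D cursor, if any.

    player_positions maps a player name to an (x, y, z) world position;
    max_distance is an assumed rejection threshold for cursor placements
    that are not near any player.
    """
    best_name, best_dist = None, float("inf")
    for name, position in player_positions.items():
        d = math.dist(cursor_xyz, position)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_distance else None

The second interaction of Figure 6 is simply the inverse look-up: the player's stored position is retrieved by name and the cursor is overlaid at that position.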
5. Three-dimensional Scene Analysis
The purpose of scene analysis is to extract three-dimensional information from video frames captured by cameras. This process is performed in the following two stages:
First, 2-D information is extracted. From each video frame, feature points such as players and field marks are extracted and a list of feature points is generated.
Second, 3-D information is extracted. From the two-dimensional description of the video frame, three-dimensional information in the scene, such as player position and camera status, is then extracted.
The details of these extractions are contained within the following sub-sections.
5.1 Extracting Two-dimensional Information
In the extraction of two-dimensional information, feature points are extracted from each video frame. Feature points include two separate items in the images. First, the players are defined by using their feet as feature points. Second, the field marks of the football field are used as feature points. As is known to fans of American football, an American football field has yard lines to indicate yardage between goal lines, and hash marks to indicate a set distance from the side border, or sidelines, of the field. Field marks are defined as feature points because their exact positions are known a priori, and their registration and detection can be used to determine camera status.
In the rudimentary, prototype, MPI system, the feature points are extracted by human-machine interaction. This process is currently carried out as follows. First, the system displays a video frame on the screen of the Display 18 (shown in Figure 1). A viewer, or operator, 14 locates some feature points on the screen and inputs the required information for each feature point. The system reads the image coordinates of the feature points and generates a two-dimensional description.
This process results in a two-dimensional description of a video frame that consists of a list describing the players and a list describing the field marks. The player descriptions include each player's name and the coordinates of each player's image. The field mark descriptions include the positions (in the three-dimensional world), and the image coordinates, of all the field marks.
In the rudimentary embodiment of the MPI video system, all feature points are specified interactively with the aid of human intelligence. Many features can be detected automatically using machine vision techniques. See R. M. Haralick and L. G. Shapiro, op. cit. The process of automatically detecting features in arbitrary images is not trivial, however. It is anticipated, however, that two trends will help the process of feature point identification in MPI video. First, new techniques have recently been developed, and will likely continue to be developed, that should be useful in permitting the MPI video system to extract feature point information automatically. Future new techniques may include some bar-code like mechanism for each player, fluorescent coloring on the players' helmets, or even some simple active devices that will automatically provide the location of each player to the system. It is also anticipated that many current techniques for dynamic vision and related areas may suitably be adapted for the MPI video application.
Because the goal of the rudimentary, prototype, system is primarily to demonstrate MPI video, no extensive effort has been made to extract the feature points automatically. Further progress, and greater system capabilities, in this area is deemed straightforward, and susceptible of implementation by a practitioner of the digital video arts.
5.2 Extracting Three-dimensional Information
The purpose of this step is to obtain three-dimensional information from the two-dimensional frames. The spatial relationship between the three-dimensional world and the video frames captured by the cameras is shown in Figure 7. Consider that a camera is observing a point (x, y, z). A point (u, v) in the image coordinate system to which the point (x, y, z) is mapped may be determined by the following relationships, which relationships comprise a coordinate system for camera calibration.
A point (x, y, z) in the world coordinate system is transformed to a point (p, q, s) in the camera coordinate system by the following equation:

    (p, q, s)^T = R ( (x, y, z)^T - (xc, yc, zc)^T )

where R is a transformation matrix from the world coordinate system to the camera coordinate system, and (xc, yc, zc) is the position of the camera.
A point (p, q, s) in the camera coordinate system is projected to a point (u, v) on the image plane according to the following equation:

    u = f p / s ,    v = f q / s

where f is a camera parameter that determines the degree of zoom in or zoom out.
Thus, we see that the image coordinate (u, v) which corresponds to world coordinate (x, y, z) is determined depending on the (i) camera position, (ii) camera angle and (iii) camera parameter.
Therefore, from two-dimensional information that is described above, we can obtain three-dimensional camera and player information in the following way. (See R. M. Haralick and L. G. Shapiro; Computer and Robot Vision, Addison-Wesley Publishing, 1993.)
First, a camera calibration is performed. When one known point is observed, a corresponding pair of image coordinates and world coordinates is known. By applying this known pair to the above equations, two equations regarding the seven parameters that determine camera status may be obtained. Observing at least four known points will suffice to provide the minimum number of equations needed to solve for the seven unknown parameters.
However, in the application of the MPI video system to football, (i) the camera position is usually fixed, and (ii) the rotation angle is zero. This reduces the number of unknowns to three, which requires a minimum of two known points. The field marks extracted in the previous process are then used as known points.
Next, an image-to-world coordinate mapping is performed. Once the camera status -- which is described by the seven parameters above -- is known, the world coordinate may be determined from the image coordinate if it is considered that the point is constrained to lie in a plane. In the application of the MPI video system to football, the imaged football players are always approximately on the ground. Accordingly, the positions of the players can be determined according to the above equations.
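A minimal sketch of this image-to-world mapping, written in modern Python notation and using the equations above, is given below. It assumes the seven camera-status parameters (the rotation matrix R, the camera position, and the zoom parameter f) are already known from calibration, and that the imaged point -- for example, a player's feet -- lies on the ground plane z = 0. The function name is illustrative only.

import numpy as np

def image_to_ground(u, v, R, camera_position, f):
    """Map image coordinates (u, v) to the world point on the plane z = 0.

    R is the world-to-camera rotation matrix and camera_position is
    (xc, yc, zc), as in the equations above; f is the zoom parameter.
    Returns None if the viewing ray does not meet the ground in front of
    the camera.
    """
    R = np.asarray(R, dtype=float)
    c = np.asarray(camera_position, dtype=float)
    # (p, q, s) is proportional to (u, v, f), so the viewing ray's direction
    # in world coordinates is R^T (u, v, f).
    direction = R.T @ np.array([u, v, f], dtype=float)
    if abs(direction[2]) < 1e-9:
        return None                    # ray is parallel to the ground plane
    t = -c[2] / direction[2]           # intersection with z = 0
    if t <= 0:
        return None                    # intersection lies behind the camera
    return c + t * direction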
5.3 Interpolation
Ideally the scene analysis process just described should be applied to every video frame in order to get the most precise information about (i) the location of players and (ii) the events in the scene. However, it would require significant human and computational effort to do so in the rudimentary, prototype, MPI video system because feature points are located manually, and not by automation. Therefore, one key frame has been manually selected for every thirty frames, and scene analysis has been applied to the selected key frames. For frames in between, player position and camera status are estimated by interpolation between key frames, proceeding under the assumption that coordinate values change linearly between two consecutive key frames.
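The interpolation itself is simple linear interpolation of each coordinate value between the bracketing key frames, as in the following illustrative sketch (modern Python notation; the names and the numeric values are examples only).

def interpolate(frame, key_frame_a, value_a, key_frame_b, value_b):
    """Linearly interpolate a coordinate value between two key frames."""
    alpha = (frame - key_frame_a) / float(key_frame_b - key_frame_a)
    return value_a + alpha * (value_b - value_a)

# For example, the estimated x-coordinate of a player at frame 45, given
# key-frame values at frames 30 and 60:
x_45 = interpolate(45, 30, 12.0, 60, 18.0)   # -> 15.0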
5.4 Camera Hand-Off
The rudimentary, prototype, MPI video system is able to determine and select a single best camera to show a particular player or an event. This is determined by the system using the environment model. Effectively, for the given player's location, the system uses reverse mapping for the given camera locations, and then determines where the image of the player will be in the view from each of the different cameras.
At the present time, the system selects the camera in which the selected player is closest to the center of the viewing area. The system could prospectively be made more precise by considering the orientation of the player also. The problem of transferring display control from one camera to another is called the "camera hand-off problem".
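A minimal sketch of this camera hand-off rule -- project the selected player into each camera using the equations of section 5.2 and choose the camera in which the projection falls nearest the image center -- is given below in modern Python notation. The class and function names, and the assumption that image coordinates are centered on the principal point, are illustrative only.

import numpy as np

class Camera:
    """Illustrative record holding one camera's status parameters."""
    def __init__(self, name, R, position, f, width, height):
        self.name = name
        self.R = np.asarray(R, dtype=float)          # world-to-camera rotation
        self.position = np.asarray(position, float)  # camera location (xc, yc, zc)
        self.f = f                                   # zoom parameter
        self.width, self.height = width, height      # image size

    def project(self, world_point):
        """Project a world point (x, y, z) to image coordinates (u, v)."""
        p, q, s = self.R @ (np.asarray(world_point, float) - self.position)
        if s <= 0:                                   # point is behind the camera
            return None
        return self.f * p / s, self.f * q / s

def select_best_camera(cameras, player_position):
    """Pick the camera in which the player lies nearest the image center."""
    best, best_dist = None, float("inf")
    for cam in cameras:
        uv = cam.project(player_position)
        if uv is None:
            continue
        u, v = uv
        # discard cameras in which the player falls outside the frame
        if abs(u) > cam.width / 2 or abs(v) > cam.height / 2:
            continue
        dist = (u * u + v * v) ** 0.5
        if dist < best_dist:
            best, best_dist = cam, dist
    return best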
6. Results of the Exercise of the Rudimentary MPI Video System
The rudimentary, prototype, MPI video system has been exercised on a very simple football scene imaged from three different cameras. The goal of this example is to demonstrate the method and apparatus of the invention, and the feasibility of obtaining practical results. The present implementation and embodiment can clearly be extended to process longer sequences, and also to different applications, and, indeed, is already being so extended.
The actual video data used in the experimental exercise of the MPI video system is shown in Figure 8. The video data consists of the three shots respectively shown in Figures 8a through 8c. These three shots record the same football play but are taken from different camera angles. Each shot lasted about ten seconds. The three different cameras thus provide three separate, but related, sequences. These sequences are used to build the model of events in the scene.
Key frames were selected as previously explained, and scene analysis was applied. In the process of scene analysis, at least three field marks were located for each key frame. This reference information was subsequently used as known points in order to solve for the three unknown parameters that determine camera status. Note that this entire step could be avoided if a priori knowledge of the camera status were available. It is likely that in early, television network, applications of the MPI video system in coverage of structured events like American football, the camera (i) positions and (ii) status parameters will be known, and continuously known, to the MPI video system. To such extent as they are known they obviously need not be calculated. In application of the scene analysis process to the actual video data it was found that not all video frames have enough known points. An example of a video frame that lacks sufficient known points is shown in Figure 9b. This may be contrasted with a video frame having more than sufficient known points as is shown in Figure 9a; the many known points in this image can be used for camera calibration. In the experimental data used, 14 out of 15 key frames from camera 1 had at least three (3) known points, while none of seven (7) key frames from camera 2, and eight (8) out of fourteen (14) key frames from camera 3, had three (3) or more obvious known points. The difference between the cameras was that camera 1 was placed at a high position while cameras 2 and 3 were placed at low positions. Accordingly, estimates had to be made for those video frames that didn't show enough obvious known points. The results of such estimations are not necessarily accurate.
Some examples of actual results obtained by use of the rudimentary, prototype, MPI system are shown in Figure 10. These illustrated results were obtained by selecting "Washington" as the player to be focused on. For each video frame, a three-dimensional cursor was overlaid according to the position of "Washington". Regarding these video frames, we see that the results of the scene analysis are substantially accurate according to the following observations.
First, the positions of the player "Washington" that a human may read from the video frames are close to the values that the system calculates. The values calculated by the MPI video system are shown below each picture in Figure 10.
Second, each axis of the three-dimensional cursor appears to agree with the directions of the football field that a human may read from the video frames.
Third, the three-dimensional cursor appears to be close to the chosen player "Washington" in the screen video image.
Other frames were checked as well. It has been confirmed that the results of the MPI video system in isolating, and in tracking, "target" objects of interest are mostly accurate, at least for those frames that contain enough known points to calibrate.
7. Global Multi-Perspective Perception In the MPI Video System
The present section 7 and the following sections 8-11 expound the most conceptually and practically difficult portion of the MPI video system: its capture, organization and processing of real-world events in order that a system action -- such as, for example, an immediate selection, or synthesis, of an important video image (e.g., a football fumble, or an interception) -- may be predicated on this detection. Until this task is broken down into tractable parts in accordance with the present invention, it may seem to require a solution in the areas of machine vision and/or artificial intelligence, and to be of such awesome difficulty as to likely be intractable, and impossible of solution with present technology. In fact, it is possible to make such significant progress on this task by use of modern technology applied in accordance with the present invention so as not only to get recognizable results, but so as to get results that are by some measure useful, and arguably even cost effective.
In accordance with the present invention of Multiple Perspective Interactive (MPI) video, an omniscient multi-perspective perception system based on multiple stationary video cameras permits comprehensive live recognition, and coverage, of objects and events in an extended environment. The system of the invention maintains a realistic representation of the real-world events. A static model is built first using detailed a priori information. Subsequent dynamic modeling involves the detection and tracking of people and objects in at least those portions of the scene that are perceived (by the system, and in real time) to be the most pertinent.
The perception system, using camera hand-off, dynamically tracks objects in the scene as they move from one camera coverage zone to another. This tracking is possible due to several important aspects of the approach of the present invention, including (i) strategic placement of cameras for optimal coverage, (ii) accurate knowledge of the scene-camera transformation, and (iii) the constraining of object motion to a known set of surfaces.
In this and the following sections of this specification, (i) a description of particularly the novel pattern and event recognition capability of the MPI video system of the present invention, and (ii) certain results presently obtainable with the system, are shown and discussed in the context of a practical implementation of the system on a college campus, to wit: a courtyard of the Engineering School at the University of California, San Diego. This environment is chosen in lieu of -- as a possible alternative choice -- further discussion of a football field and a football game because it is desired to show more generally how (i) cameras may be strategically placed for optimal coverage, (ii) accurate knowledge facilitates scene-camera transformation, and (iii) object motion may be constrained to a known set of surfaces.
Momentarily considering only (iii) object motion, the exemplary courtyard environment contains (i) one object -- a human walker -- that follows a prescribed and predetermined dynamic path, namely a walkway path. The exemplary environment contains (ii) still other objects -- other human walkers -- that do not even know that they are in any of a scene, a system, or an experiment, and who accordingly move as they please in un-predetermined patterns (which are nonetheless earthbound). Finally, the exemplary environment contains (iii) an object -- a robot -- that is not independent, but which rather moves in the scene in response to static and dynamic objects and events therein, so as to, for example, traverse the scene without running into a static bench or a dynamic human.
It will therefore be recognized that even more is transpiring in the exemplary courtyard environment than on the previously-discussed football field, and that while this exemplary courtyard environment is admittedly arbitrary, it is also very rich in static and dynamic objects important to the exercise and demonstration of an omniscient multi-perspective perception capability of the MPI video system of the present invention.
7.1 Organization of the Teaching of Global Multi-Perspective Perception In the MPI Video System
Global Multi-Perspective Perception is taught and exercised in a campus environment containing (i) a mobile robot, (ii) stationary obstacles, and (iii) people and vehicles moving about -- actors in the scene that are shown diagrammatically in Figure 11a. In the present approach an omniscient multi-perspective perception system uses multiple stationary cameras which provide comprehensive coverage of an extended environment. The use of fixed global cameras simplifies visual processing.
All dynamic objects in the environment, including the robot, can be easily and accurately detected by (i) integrating motion information from the different cameras covering these objects, and, importantly to the invention, (ii) constraining the environment by analyzing only such motion as is constrained to a small set of known surfaces.
The particular global multi-perspective perception system that monitors the campus environment containing people, vehicles and the robot uses the several color and monochrome CCD cameras also diagrammatically represented in Figure 12. This particular perception system is not only useful in the MPI video system, but is also useful in any completely autonomous system with or without a human in the loop, such as in the monitoring of planes on airport runways.
The operation of the global multi-perspective perception system is discussed in both human-controlled and autonomous modes. In the preferred system, individual video streams are (i) processed on separate work stations on the local network and (ii) integrated on a special purpose graphics machine on the same network. The particular system, the particular experimental setup, and pertinent performance issues, are described as follows:
The next section 8 describes the preferred approach and the principle behind camera coverage, integration and camera hand-off. The prototype global multi-perspective perception system, and the results of experiments thereon, are next described in section 9. The approach of the present invention is, to the best present knowledge of the inventors, a revolutionary application of computer vision that is immediately practically useable in several diverse fields such as intelligent vehicles as well as the interactive video applications -- such as situation monitoring and tour guides, etc. -- that are the principal subject of the present specification.
The applicability of the prototype global multi-perspective perception system to just some of these applications is presented in section 10. Opportunities for further improvements and expansions are discussed in Section 11.
8. Multi-Perspective Perception
Multi-perspective perception involves each of the following.
First, the "expectations" that various objects will be observed must be generated from multiple different camera views by use of each cf (i) a priori information, (ii1 ar. environment model, and
Figure imgf000054_0001
the information requirements cf tne present task. The statement cf the immediately preceding sentence must ce read carefully because the sentence contains a great deal of information, and important characterization of one aspect of the present invention. Each of (i) a priori information, (ii) an environment model, and 'iii) the information requirements of the task, have variously been considered, and melded into, prior art systems for, and methods of, machine perception. Note however, that the first sentence of this paragraph is definitive. Next, note that the use of the (i) information, (ii,1 environment model, and (iii) information requirements is to generate -- specifically from multiple different camera views -- something called "expectations" . These "expectations" are the probabilities that a !i> particular object will be observed (ii) at a particular place.
Second, objects from each camera must be independently detected and localized. This is not always done in the prior art, although it is not unduly complex. Simple motion detection is mostly used in the preferred embodiment of the present, prototype, global multi-perspective perception system.
Next, the separate observations are assimilated into a three dimensional model. In this step, the preferred embodiment of the present invention leaves "familiar ground" quickly, and "plunges" into a new construct for any perception system, whether global and/or multi-perspective or not.
Fourth, and finally, the model is used in performing the required tasks. Exactly what this means must be postponed until the "model" is better understood.
A high-level schematic diagram of the different components of the preferred embodiment of the prototype multi-perspective perception system in accordance with the present invention is shown in Figure 12. A study of the diagram will show that the system includes both two-dimensional and three-dimensional processing. Reference S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Y. Kuramura, and S. Moezzi; Modeling and interactivity in MPI-Video, Technical Report VCL-94-103, Visual Computing Laboratory, University of California, San Diego, Dec. 1994.
Two key aspects of the architecture diagrammed in Figure 12 are the (i) static model and the (ii) dynamic model. The static model contains a priori information such as camera calibration parameters, look-up tables and obstacle information. The dynamic model contains task-specific information like two-dimensional and three-dimensional maps, dynamic objects, states of objects in the scene (e.g., a particular human is mobile, or the robot vehicle is immobile), etc.
8.1 Three-dimensional Modeling
The three-dimensional model of the preferred embodiment of the prototype multi-perspective perception system in accordance with the present invention is created using information from multiple video streams. This model provides information that cannot be derived from a single camera view due to occlusion, the size of the objects, etc. Reference S. Chatterjee, et al., op. cit.
A good three-dimensional model is required to recognize complex static and moving obstacles. At a basic level, the multi-perspective perception system must maintain information about the positions of all the significant static obstacles and dynamic objects in the environment. In addition, the system must extract information from both the two-dimensional static model as well as the three-dimensional dynamic model. As such, a representation must be chosen that (i) facilitates maintenance of object positional information as well as (ii) supports more sophisticated questions about object behavior.
While information representation can be considered an implementation issue, the particular representation chosen will significantly affect the system development. Thus, information representation is considered to be an important element of the preferred multi-perspective perception system, and of its architecture. In the preferred system, geometric information is represented as a combination of voxel representation, gridmap representation and object-location representation. Specific implementations and domains deal with this differently.
When combined with information about the exact position and orientation of a camera, the a priori knowledge of the static environment is a very rich source of information which has not previously received much attention. For each single view, the preferred system is able to compute the three-dimensional position of each dynamic object detected by its motion segmentation component. To do so, (i) the a priori information about the scene and (ii) the camera calibration parameters are coupled with (iii) the assumption that all dynamic objects move on the ground surface.
Using this information it is a straightforward exercise for a practitioner of the computer programming arts to compute the equation of the line that passes through the camera projection point and a given feature on its image plane. Then, by assuming that the lowest image point of a dynamic object is on the ground, the approximate position of the object on the ground plane is readily found. Positional information obtained from all views is assimilated and stored in the 2D grid representing the viewing area.
For the case where an object is observed by more than one camera, the three-dimensional voxel representation is particularly efficacious. Here a dynamic object recorded on an image plane projects into some set of voxels. Multiple views of an object will produce multiple projections, one for each camera. The intersection of all such projections provides an estimate of the 3-dimensional form of the dynamic object as illustrated in Figure 13 for an object seen by four cameras.
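A minimal sketch of this intersection step is given below in modern Python notation; it assumes that the motion-segmentation result from each camera has already been projected into a set of occupied voxel indices, and the names used are illustrative only.

def intersect_voxel_projections(projections_per_camera):
    """Intersect the voxel sets contributed by every camera seeing the object.

    projections_per_camera is a list of sets of (i, j, k) voxel indices, one
    set per camera view; their intersection estimates the object's 3-D form.
    """
    if not projections_per_camera:
        return set()
    estimate = set(projections_per_camera[0])
    for voxels in projections_per_camera[1:]:
        estimate &= set(voxels)
    return estimate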
This section and its accompanying illustrations -- short as they may be -- have set forth a complete disclosure of how to make two- and three-dimensional models of the scene. It now remains only to use such models, in conjunction with other information, for useful purposes.
8.2 Automatic Camera Handoff
Camera handoff should be understood to be the event in which a dynamic object passes from one camera coverage zone to another. The multi-perspective perception system must maintain a consistent representation of an object's identity and behavior during camera handoff. This requires the maintenance of information about the object's position, its motion, etc.
Camera handoff is a crucial aspect of processing in the multi-perspective perception system because it integrates a variety of key system components. Firstly, it relies on accurate camera calibration information, i.e., static model data. Secondly, it requires knowledge of objects and their motion through the environment as determined from the dynamic model. Finally, the camera handoff can influence dynamic object detection processing. This section 8 has described the architecture, and some important features, of the multi-perspective perception system. Reference also S. Chatterjee, et al., op. cit. The next section describes in detail the preferred implementation of the multi-perspective perception system for the application of monitoring a college courtyard.
9. Setup of the Multi-perspective Perception System, and Results of System Use
The implementation of an integrated Multiple Perspective Interactive (MPI) video system demands a robust and capable implementation of the multi-perspective perception subsystem. To simplify the teaching of the multi-perspective perception subsystem, and since this subsystem taken alone is useful in several other applications (described in Section 4) than just MPI video, the following describes the multi-perspective perception subsystem as a stand-alone system independent of the MPI video system of which it is a part. It will be understood that, once the object identifications, object tracking, and multiple perspective views of the multi-perspective perception subsystem are obtained, it is a straightforward matter to use these results in an MPI video system. (For many purposes of supplying information to the video viewer, only a high-level viewer interface is required to access the considerable current information of the multi-perspective perception subsystem.) The following sections describe the multi-perspective perception subsystem/system in detail.
9.1 Multi-Perspective Perception System Prototype
9.1.1 Setup and Use
The initial development and exercise of the multi-perspective perception system took place in a laboratory on an extended digitized color sequence. A one-minute-long scene was digitized from four color CCD cameras overlooking a typical campus scene. The one-minute scene covers two pedestrians, two cyclists, and a robot vehicle moving between coverage zones. A schematic of this scene is shown in Figure 14, consisting of Figure 14a and Figure 14b.
For calibration and experimental evaluation of the prototype system, one of the two pedestrians walked on a pre-determined known path. No restrictions were placed on other moving objects in the scene.
9.1.2 Digitalization
The four views of the scene were digitized using a frame-addressable VCR and frame capture board combination. The synchronization was done by hand using synthetic synchronization points in the scene (known as hat drops). The resulting image sequences were placed on separate disks and controllers for independent distributed access. Having an extended pre-digitized sequence (i) accorded repeatability and (ii) permitted development of the perception system without the distractions and time consumption of repeated digitalization of the scene. The source of the scene image sequence was transparent to the perception system, and was, in fact, hidden behind a virtual frame grabber. Hence, the test was not only realistic, but migration of the perception system into (i) real-time use with (ii) real video frame capture boards proved easy.
9.1.3 Camera Calibration
Calibration of the cameras in the perception system is important because an accurate camera-world transformation is vital to correct system function. The cameras are assumed to be calibrated a priori, so that precise information about each camera's position and orientation could be used either directly, or by use of pre-computed camera coverage tables, to convert two-dimensional observations into three-dimensional model space, and, further, three-dimensional expectations into 2D.
For the experimental exercise of the perception system, a complete, geometric three-dimensional model of the courtyard was built using map data. This information was then used for external calibration of each camera. Calibration was done with a user in the loop. The static model was visualized from a location near the actual camera location and the user interactively modified the camera parameters until the visualized view exactly matched the actual camera view (displayed underneath).
9.1.4 Distributed Architecture
At the University of California, San Diego, cameras are physically distributed throughout the campus to provide security coverage. Because the experimental use of the perception system requires synchronized frames from these cameras at a very fast rate, frame capture was done close to the camera on separate computers. For modularity and real-time video processing, it is very important that the video be independently processed close to the sources thereof. The preferred hardware setup for the experimental exercise is pictorially diagrammed in Figure 15. Several independent heterogeneous computers -- Sun SPARCstation models 10 and 20 and/or SGI models Indigo2, Indy and Challenge -- were selectively used based on criteria including (i) the load on the CPU, and the computer throughput, (ii) computer proximity to the camera and availability of a frame capture board (for the real-time setup), and (iii) the proximity of each computer to a storage location, measured in Mbps (for the experimental setup).
The work stations in the experiment were connected on a 120 Mbps ethernet switch which guaranteed full-speed point-to-point connection. A central graphical work station was used to control the four video processing workstations, to maintain the environment model (and associated temporal database), and, optionally, to communicate results to another computer process such as that exercising and performing an MPI video function.
The central master computer and the remote slave computers communicate at a high symbolic level; minimal image information is exchanged. Hence only a very low network bandwidth is required for master-slave communication. The master-slave information exchange protocol is preferably as follows:
First, the master computer initializes graphics, the database and the environment model, and waits on a pre-specified port.
Second, and based on the master computer's knowledge of the network, machine throughput etc., a separate computer process starts the slave computer processes on selected remote machines.
Third, each slave computer contacts the master computer, using a pre-specified machine-port combination, and an initialization hand-shaking protocol ensues.
Fourth, the master computer acknowledges each slave computer and sends the slave computer initialization information such as (i) where the images are actually stored (for the laboratory case), (ii) the starting frame and frame interval, and (iii) camera-specific image-processing information like thresholds, masks, etc.
Fifth, the slave initializes itself based on the information sent by the master computer.
Sixth, once the initialization is completed, the master computer, either synchronously or asynchronously depending on the application, will process the individual cameras as described in following steps seven through nine.
Seventh, whenever a frame from a specific camera needs to be processed, the master computer sends a request to that particular slave computer with information about processing the frame: focus-of-attention windows, frame-specific thresholds and other parameters, current and expected locations and identifications of moving objects, etc., continuing during this processing any user interaction. In synchronous mode, requests to all slave computers are sent simultaneously and the integration is done after all slave computers have responded. In asynchronous mode, this will not necessarily proceed in unison.
Eighth, when a reply is received, the frame information is used to update the environment model and the database as described in the following Section 9.1.7.
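To make the symbolic nature of this exchange concrete, the following sketch (modern Python notation) shows one possible form of the step-seven request and the step-eight reply. The message fields, the use of JSON-encoded messages over an already-connected TCP socket, and all names are assumptions of the sketch, and are not part of the specification.

import json

def send_frame_request(slave_socket, frame_number, focus_windows, thresholds,
                       expected_objects):
    """Master side of step seven: ask one slave to process a frame."""
    request = {
        "frame": frame_number,
        "focus_windows": focus_windows,        # focus-of-attention rectangles
        "thresholds": thresholds,              # frame-specific thresholds
        "expected_objects": expected_objects,  # projected object locations/ids
    }
    slave_socket.sendall((json.dumps(request) + "\n").encode())

def receive_frame_reply(slave_socket):
    """Master side of step eight: read the slave's list of detected components."""
    data = b""
    while not data.endswith(b"\n"):
        chunk = slave_socket.recv(4096)
        if not chunk:
            break
        data += chunk
    return json.loads(data.decode())           # e.g. a list of bounding boxes

Because only such symbolic lists -- and never image data -- cross the network, the required master-slave bandwidth remains very low, as noted above.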
The next sections describe the communication traffic between the master and the slave computers .
9.1.5 Modeling and Visualization
A communication master computer that manages all slave computers, assimilates the processed information into an environment model, processes user input (if any), and sends information to the MPI video process (if any), resides at the heart of the multi-perspective perception system. In the preferred prototype system, this master computer is an SGI Indigo2 work station with high-end graphics hardware. This machine, along with graphics software -- OpenGL and Inventor -- was used to develop a functional Environment Model building and visualization system. Reference J. Neider, T. Davis, and M. Woo; OpenGL Programming Guide: Official Guide to Learning OpenGL, Release 1, Addison-Wesley Publishing Company, 1993. Reference also J. Wernecke, The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, Release 2, Addison-Wesley Publishing Company, 1994.
In the preferred system, Inventor manages the scene database and OpenGL performs the actual rendering. A "snapshot" of the visualization system of the master computer, including four camera views, and a rendered model showing all the moving objects in iconic form, is shown in Figure 18.
9.1.6 Video Processing
One of the goals of the exercise of the multi-perspective perception system was to illustrate the advantages of using static cameras for scene capture, and the relative simplicity of visual processing in this scenario when compared to processing from a single camera. While more sophisticated detection, recognition and tracking algorithms are still being developed and applied, the initial, prototype multi-perspective perception system uses simple yet robust motion detection and tracking. In the prototype system, and as described in previous sections, the processing of individual video streams is done using independent video processing slaves, possibly running on several different machines. The synchronization and coordination of these slaves, any required resolution of inconsistencies, and the generation of expectations are done at the master.
Independent processing of information streams is an important feature of the information assimilation architecture of the present invention, and is a continuation and outgrowth of the work of some of the inventors and their colleagues. See, for example, R. Jain; Environment models and information assimilation, Technical Report RJ 6866(65692), IBM Almaden Research Center, San Jose, CA, 1989; Y. Roth and R. Jain; Knowledge caching for sensor-based systems, Artificial Intelligence, 71:257-280, Dec. 1994; and A. Katkere and R. Jain; A framework for information assimilation, to be published in Exploratory Vision edited by M. Landy, et al., 1994.
The independent processing results in pluggable and dynamically reconfigurable processing tracks. The preferred, prototypical, communication slave computers perform the following steps on each individual video frame. Video processing is limited by focus-of-attention rectangles specified by the master computer, and by pre-computed static mask images delineating portions of a camera view which cannot possibly have any interesting motion. The computation of the former is done using the current locations of the object hypotheses in each view and their projected locations in the next view. The latter is currently created by hand, painting out areas of each view not on the navigable surface (walls, for example). Camera coverage tables help the master computer in these computations. Coverage tables, and the concept of objects, are both illustrated in Figure 16.
In operation, the input frame is first smoothed to remove some noise. Then the difference image d(i, i+1) is computed as follows. Only pixels that are within the focus-of-attention windows and that are not masked are considered:

    d(i, i+1) = Threshold( Abs( F(i) - F(i+1) ), threshold value )

Optionally, to remove motion shadows, the following operation is done:

    d(i) = d(i-1, i) & d(i, i+1)

This shadow-removing step is not invariably used nor required since it needs a one-frame look-ahead. In many cases simple heuristics may be used to eliminate motion shadows at a symbolic level.
Next, connected components of the binary difference image are computed based on a four-neighborhood criterion. Components that are too small or too big are thrown away because they usually constitute noise. Frames that contain a large number of components are also discarded. Both the centroid (from the first moments), and the orientation and elongation (from the second moments), are extracted for each component.
Next, several optional filters are applied at the slave site to the list of components obtained from the previous step. Commonly used filters include (i) merging of overlapping bounding boxes, (ii) hard limits on orientation and elongation, and (iii) distance from expected features, etc.
Finally, the resulting list is sent back to the master site.
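The per-frame slave processing just described -- thresholded frame differencing within the focus-of-attention windows, optional shadow removal by combining consecutive difference images, and connected-component extraction -- may be sketched as follows in modern Python notation. The parameter values, the use of the scipy library, and the function names are assumptions of the sketch only.

import numpy as np
from scipy import ndimage

def difference_image(frame_a, frame_b, threshold):
    """d(a, b) = Threshold( Abs( F(a) - F(b) ), threshold )."""
    return np.abs(frame_a.astype(int) - frame_b.astype(int)) > threshold

def process_frame(prev_frame, frame, next_frame, mask, threshold=25,
                  min_area=20, max_area=5000, max_components=30):
    """Return (centroid, area) for each moving component found in `frame`.

    The frames are assumed to be already smoothed; `mask` is a boolean image
    that is False wherever no interesting motion can occur (the static mask
    and the focus-of-attention windows can both be folded into it).
    """
    d_curr = difference_image(frame, next_frame, threshold)
    d_prev = difference_image(prev_frame, frame, threshold)
    motion = d_prev & d_curr           # optional shadow removal (one-frame look-ahead)
    motion &= mask                     # restrict to unmasked, interesting regions
    labels, n = ndimage.label(motion)  # default structure gives four-neighborhood
    if n > max_components:
        return []                      # too many components: probably a noisy frame
    components = []
    for i in range(1, n + 1):
        blob = labels == i
        area = int(blob.sum())
        if min_area <= area <= max_area:           # discard noise-sized blobs
            components.append((ndimage.center_of_mass(blob), area))
    return components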
9.1.7 Assimilation and Updating Object Hypotheses
The central visualization and modeling site receives processed visual information from the video processing sites and creates/updates object hypotheses. There are several sophisticated ways of so doing. Currently, and for the sake of simplicity in developing a completely operative prototype, this is done as follows:
First, the list of two-dimensional (2-D) object bounding boxes is further filtered based on global knowledge.
Second, the footprint of each bounding box is projected to the primary surface of motion by intersecting a ray drawn from the optic center of that particular camera through the foot of the bounding box with the ground surface.
Third, each valid footprint is tested for membership with existing objects and the observation is added as support to the closest object, if any. If no object is close enough, then a new object hypothesis is created.
Fourth, all supporting observations are used (with appropriate weighting based on distance from the camera, direction of motion, etc.) to update the position of each object.
Fifth, the object positions are projected into the next frame based on a domain-dependent tracker.
Sixth, if events in the scene are to be recognized, object positions and associations are compared against predetermined templates. For example, if in the courtyard scene the robot has moved into spatial coincidence with one of the predetermined immovable objects, such as a bench, then the robot may have run into the bench -- an abnormal and undesired occurrence. For example, if in the scene of a football game the football has moved in a short time interval from spatial coincidence with a moving player that was predetermined to be of a first team to spatial coincidence with a moving player that is predetermined to be of a second team -- especially if the football is detected to have reversed its direction of movement on the field -- then any of a (i) kickoff, (ii) fumble, or (iii) interception may have transpired. If the detected event is of interest to the viewer in the MPI video system, then appropriate control signals are sent. Also, based on the sub-system's knowledge of static objects, if an actual or projected position of a dynamic object intersects a static object, then an appropriate message may be sent. If in the scene of a football game the football is determined to be in spatial coincidence with the forty yard marker, then it is reported that the football is on the forty yard line.
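Steps two through five above -- projecting each bounding-box footprint to the ground and associating it with the nearest existing object hypothesis, or creating a new hypothesis -- may be sketched as follows in modern Python notation. The class name, the association radius, and the simple averaging used to refine positions are assumptions of the sketch only; the preferred system weights observations by camera distance, direction of motion, etc., as stated in step four.

import math

class ObjectHypothesis:
    """Illustrative record for one tracked object on the ground plane."""
    def __init__(self, position):
        self.position = position      # (x, y) on the ground surface
        self.support = 1              # number of supporting observations

def assimilate(footprints, hypotheses, association_radius=1.5):
    """Update the object hypotheses with one frame's ground-plane footprints."""
    for fx, fy in footprints:
        nearest, nearest_dist = None, float("inf")
        for hyp in hypotheses:
            d = math.dist((fx, fy), hyp.position)
            if d < nearest_dist:
                nearest, nearest_dist = hyp, d
        if nearest is not None and nearest_dist <= association_radius:
            # add the observation as support and refine the position estimate
            hx, hy = nearest.position
            n = nearest.support
            nearest.position = ((hx * n + fx) / (n + 1), (hy * n + fy) / (n + 1))
            nearest.support += 1
        else:
            hypotheses.append(ObjectHypothesis((fx, fy)))   # new hypothesis
    return hypotheses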
9.1.8 Results
Each of Figures 17 through 21 shows frames in an exemplary exercise -- consisting of one thousand (1000) total frames from four (4) different cameras acquired as described in Section 9.1.2 -- of the multi-perspective perception subsystem. Figures 17 through 19 show the state of the subsystem at global time 00:22:29:06. Figures 20 and 21 show the state of the subsystem at the global time 00:22:39:06. In Figure 17, four dynamic objects are shown in the scene: a robot vehicle, two pedestrians and a bicyclist. The scene is covered by four different cameras. A fifth object -- another bicyclist -- is shown, but is not labeled for clarity.
Each of the four cameras has its own clock, as is shown under the camera's view in one of Figures 17a through 17d. Camera number three (#3), which is arbitrarily known as "Saied's camera", is used to maintain the global clock since this camera has the largest coverage and the best image quality. Figures 17a-17d clearly show the coverage of each camera.
As shown in Figure 17, an object that is out of view, too small, and/or occluded from view in one camera is in view, large and/or un-occluded in the view of another camera. Note that the object labels used in Figure 17 are for explanation only. The prototype subsystem does not include any non-trivial object recognition, and all object identifiers that persist over time are automatically assigned by the system. Mnemonic names like "Walker 1", or "Walker", refer to the same object identification (e.g., what the software program would label "BasicEnvObject0023", "BasicEnvObject0047", etc.) over all the different frames of Figures 17-21.
A pictorial representation of the display screen showing the operator interface to the multi-perspective perception subsystem is shown in Figure 18. Four camera views are shown in the top row of Figure 18. Each view is labeled using its mnemonic identification instead of its numeric identification because humans respond better to mnemonic "id's". Each view may be associated with a one of Figures 17a-17d.
A red rectangle is drawn automatically around each detected object in each camera's view of the scene. It can be clearly seen how objects are robustly detected in the different images obtained with cameras of different characteristics (huge variations in color, color vs. monochrome) -- even when the object is just a few pixels wide.
The bottom section of the operator display screen in Figure 18 shows the object hypotheses which are formed over several frames (the first frame is at global clock 00:22:10:0). The intensity of each object's marker represents the confidence in each hypothesis. The entire display screen, the objects depicted, and the object hypotheses diagrammatically depicted, are, as might well be expected, in full color. Figures 17-21 are therefore monochrome renditions of color images. In particular, the object markers are preferably in the color yellow, and the intensity of the bright yellow color of each object's marker represents the confidence in the hypothesis for that object. The eye is sensitive to discern even such slight differences in color intensity as correspond to differences in confidence.
The multi-perspective perception subsystem has a high confidence in each object for which a marker is depicted in Figure 18 because, at the particular global time represented, each object happens to have been observed from many cameras over several past frames.
The three-dimensional model at global time 00:22:29:06 is shown in Figures 19a-19e in both real and virtual views. Figures 19a-19d show the model from the four real camera views. One-to-one correspondence between the model and the camera views can be clearly seen. The fifth view of Figure 19e is a virtual view of the model from directly overhead the courtyard -- where no real camera actually exists. This virtual view shows the exact locations of all the objects, including the robotic vehicle, on the two-dimensional plane of the courtyard. Three objects are very accurately localized. The fourth object, Walker number two (#2) of Figures 17 and 18, has some error in localization since this person is (i) not visible in Camera number four (#4), and (ii) his/her coverage is very small in Cameras numbers two and three (#2 & #3), hence leading to some errors.
Note that even though the object Walker number two (#2) is visible in Camera number one (#1), that particular observation is not used since its bounding box intersects the bottom of the image. Obviously, when an object's bounding box intersects the bottom of the image, its full extent cannot be determined and it should be ignored. To show the development of object hypotheses over time, a snapshot of the experiment is taken ten (10) seconds later. Figures 20 and 21 show that state. Figure 20 corresponds to Figure 18 while Figure 21 corresponds to Figure 19. One important observation to make in Figures 20 and 21 is that, given the relative proximity of Walker number one (#1) and Bicyclist number one (#1), both are still classified as separate objects. This is only possible due to the subsystem's history and tracking mechanism.
9.2 Applications
In addition to multi-perspective interactive (MPI) video, a variety of other application areas can benefit from the global multi-perspective perception subsystem described. For instance, environments demanding sophisticated visual monitoring, such as airport runways and hazardous or complex roadway traffic situations, can advantageously use the global multi-perspective perception subsystem. In these environments, as in MPI video, objects must be recognized and identified, and spatial-temporal information about objects' locations and behaviors must be provided to a user.
The expected first application of the global multi-perspective perception subsystem to the MPI video system has been sports, and it is expected that sports and other entertainment applications -- which greatly benefit -- will be the first commercial application of the subsystem/system. Sports events, e.g. football games, are already commonly imaged with video cameras from several different spatial perspectives -- as many as several dozen such for a major professional football game. The reason that still more cameras are not used is primarily perceived as having to do with the expense of such human cameramen as are required to focus the camera image on the "action", and not the cost of the camera. Additionally, it is unsure how many different "feeds" a sports editor can use and select amongst -- especially in real time. The reason the televised sporting event viewing public is by and large satisfied with the coverage offered is that they have never seen anything better -- including the movies. Few people have been privileged to edit a movie or a video, and even fewer to their own personal taste (no matter how weird, or deviant). The machine-based MPI video of the present invention will, of course, accord viewing diversity without the substantial expense of human labor.
Consider that, in using the global multi-perspective perception subsystem and the MPI video system, multiple video perspectives are integrated into a single comprehensive model of the action. Such a representation can initially assist a number of video editors in choosing between different perspectives, for example a video editor for the "defense", and one for the "offense", and one for the "offensive receivers", etc., as well as the standard "whole game" video editor. Ultimately, and with increasingly affordable computer power, even a regular viewer who is interested, for example, in a particular player would be able to customize his video display based on that player. Interactive video applications such as these will greatly benefit from, and will use, both the global multi-perspective perception subsystem and the MPI video system.
Still another application where the global multi-perspective perception subsystem may be used directly is as a tour guide in a museum or any such confined space. Rather than moving objects in the scene (i.e., the courtyard, or the football field), the scene can remain fixed (i.e., the museum) and the camera can move. The response accorded a museum visitor/video camera user will be even more powerful than, for example, the hypertext linkage on the World Wide Web of the Internet. On an interactive computer screen and system (whether
on the Internet or not) a viewer/user can point and click his/her way to additional information. However, the viewer/user is viewing a video representation of museum art, and not the real thing.
Consider now a visit to a museum of art using, instead of a self-guided tour headset, a hand-held video camera. The user/viewer can go anywhere that he or she wants within the galleries of the museum, and can point at any art work, to perhaps show not only the scene at hand in the viewfinder of his or her video camera, but perhaps also a video and/or audio overlay that has interactively been sent to the user's video camera from "computer central". The "computer central" recognizes where in the museum the user's video -- which is also transmitted out to the "computer central" -- arises from. Simple "helps" in the gallery rooms, such as bar codes, may perhaps help the "computer central" to better recognize where an individual user is, and in what direction the user is pointing. So far this scheme may not seem much different, and potentially more complex and expensive, than simply having a user-initiated information playback system at each painting (although problems of time synchronization for multiple simultaneous viewers may be encountered with such a system).
The advantage that the global multi-perspective perception subsystem offers the art museum environment is that accumulation of a "user track", instead of an "object track", becomes trivial. The user may be guided in a generally non-repetitious track through the galleries. If he/she stops and lingers for a one artist, or a one subject matter, or a style, or a period, etc., then selected further works of the artist, subject matter, style, period, etc., that seem to command the user's interest may be highlighted to the user. If the user dwells at length at a single work, or at a portion thereof, the central computer can perhaps send textual or audio information so regarding. If the user fidgets, or moves on, then the provided information is obviously of no interest to the user, and may be terminated. If the user listens and views through all offered messages that are classified "historical perspective of the persons and things depicted in the art work viewed", then it might reasonably be assumed that the user is interested in history. If, on the contrary, the user listens and views through all offered messages that are classified "life of the artist", then it might reasonably be assumed that the user is interested in biography.
9.3 Conclusions, and Future Developments, Concerning the Global Multi-perspective Perception Subsystem
The complex phenomena of "man-machine information systems of the future" discussed in the immediately preceding section may seem all "fine and good", or even fascinating, but some minutes' deliberation is likely required to understand exactly what this all has to do with the present invention. In the simplest possible terms, information -- and a great, great deal of such information, indeed -- comes to a camera, which is the best present machine substitute for human vision, in the form of two-dimensional images. However, our own human vision is stereoscopic, and our eye-brain combination is perceptive of not two, but three dimensions. We reason things out spatially in three dimensions, and we are interested in what goes on in three dimensions -- as at a real live football game -- as well as in two dimensions -- as in the presentation of a football game on television. (We are also interested in smelling, tasting and/or hearing concurrently with our viewing, but the present invention cannot do anything about satisfying this desire.)
It is the teaching of the present invention, broadly speaking, that in order to best serve man, machine systems that convey visual information ought to, if at all possible or practical, "rise to the level" of three-dimensional information. The machine system would desirably so rise not in the images that it displays to viewers (which displayed images will, alas, remain two-dimensional for the foreseeable future) but, instead, in the construction and management of a database from which information can be drawn. Moreover, if this three-dimensional database is good enough, and if the machine (computer) processes that operate upon it are clever enough, then the power, the flexibility, the viewer service, and the presentations, are magnified. This magnification is in the same sense that we get more out of life by operating as autonomous agents in the three-dimensional world than we would if we could view all the cinema of the world for free forever in a darkened room. If a human cannot interact with his/her environment -- even as viewed, when necessary, through a two-dimensional window -- then some of the essence of living is surely lost.
It is the teaching of the present invention how to so construct from multiple two-dimensional video images a three-dimensional database, and how to so manage the three-dimensional database for the production of two-dimensional video images that are not necessarily those images from which the database was constructed.
Future improvements to the global multi-perspective perception subsystem will involve building on the complete framework provided in this specification. Improvements on two-dimensional motion detection and tracking, three-dimensional integration and tracking, etc. are possible. Another important extension of the present invention would be to use cooperative active cameras for enhanced tracking of robots and other moving objects over wide areas. This approach could both (i) reduce the number of cameras required to cover an area, and (ii) improve object detection and recognition by keeping objects towards the center of view.
Future improvements to the global multi-perspective perception subsystem may also be taken in the area of cooperative human-machine systems. Interactivity at the central site might be improved so as to permit a human to perform higher-level cognitive tasks than simply asking "where", or "what now", or "when". The human might ask, for example, "why?" In the context of football, and for the event of a tackle, the machine (the computer) might be able to advance as a possible answer (which would not invariably be correct) to the question "why (the tackle)?" something like "Defensive linebacker #24 at the site of the tackle has not been impeded in his motion since the start of the play." The machine has sensed that linebacker #24 -- who may or may not have actually made the tackle but who was apparently nearby -- was not in contact with any defensive player prior to the tackle. In a highest-level interpretation of this event as would be, and as of the present can be, rendered only by a human being, the likely interpretation of this sequence -- as was recognized by the machine -- is that someone has missed a tackle.
10. The Particular, Rudimentary, Embodiment of the Invention Taught Within This Specification
The present specification has taught a coherent, logical, and useful scheme of implementing virtual video/television. The particular embodiment within which the invention is taught is, as would be expected and as is desirable for the sake of simplicity of teaching, rudimentary.
The rudimentary nature of the particular embodiment taught within this specification dictates, for example, that the described manipulation and synthesis is of recorded video images, and is not of television in real time. However, this factor is a function only of the power of the computer used. The efficacy and utility of the image manipulation and synthesis scheme of the present invention taught, including by rigorous mathematics, is not diminished by the computational speed at which it is accomplished.
The rudimentary nature of the particular embodiment taught within this specification further dictates, for example, that the extraction of some scene features from these video images is not only not in real time, but is in fact done manually. This will turn out to be an insignificant expedient. First, many of the features extracted will turn out to be (i) distinct and (ii) fixed; and are in fact the hash marks and yard markings of an American football field! It is clear that these fixed features could be entered into any system, even by manual means, just once "before the game". Moreover, they are easily captured by even the most rudimentary machine vision programs. Other features extracted from the video images -- such as football players and/or a football in motion -- are much harder to extract, especially at high speeds and most especially in real time. To extract these moving features enters the realm of machine vision. Nonetheless, though this portion of the system of the present invention is challenging, many simple machine solutions -- ranging from fluorescently bar-coded objects in the scene (e.g., players and football) to full-blown, state-of-the-art machine vision programs -- are possible and are discussed within this specification. In fact, with non-real-time video it is even possible -- and quite practical -- to have a trained human, or a squad of such, track each player or other object of concern through each video scene, e.g., a football play. The "tracked" objects (the players) are only viewed later, upon an "instant replay" or from a video archive on tape or CD-ROM. Accordingly, it is respectfully suggested that the utility, and the scope, of the present invention is not degraded by certain practical limitations, as of present, on the particular image extraction function performed in the rudimentary embodiment of the invention.
Finally, in the particular, rudimentary, embodiment of the invention taught in this specification the synthesized video image is not completely of a virtual camera/image that may be located anywhere, but is instead of a machine-determined most appropriate real-world camera. This may initially seem like a significant, and substantive, curtailment of the described scope of the present invention. However, important mitigating factors should be recognized. First, the combination of multiple images, even video images, to generate a new image is called "morphing", and is, circa 1995, well known. One simple reason that the rudimentary system of the present invention does not proceed to perform this "well known" step is that it is slow when performed on the engineering workstation on which the rudimentary embodiment of the present invention has been fully operationally implemented. Another simple reason that the rudimentary system of the present invention does not proceed to perform this "well known" step is that, for the example of American football initially dealt with by the system and method of the present invention, it is uncertain whether this expensive, and computationally extensive, step (which turns out to be a final step) is actually needed. Namely, many cameras exist, and will exist, at a football telecast. Even if some virtual image is desired of, for example, the right halfback during the entirety of one play, it is likely that some existing camera or combinations thereof can deliver the desired image(s). Accordingly, it is again respectfully suggested that the utility, and the scope, of the present invention is not degraded by certain practical limitations, as of present, on the particular selection/morphing function performed in the rudimentary embodiment of the invention.
In return for some compromises rooted in practical considerations, the present specification completely teaches, replete with pictures, how to implement a virtual video camera, and a virtual video image, by synthesis in a computer and in a computer system from multiple real video images that are obtained by multiple real video cameras. Because this synthesis is computationally intensive, the computer is usefully powerful, and is, in the preferred embodiment, an engineering workstation.
Moreover, depending upon how extensively and how fast (i) three-dimensional analysis of the multiple scenes is to transpire, (ii) information from the multiple scenes is to be extracted, and (iii) linkage between the multiple scenes is to be established, the computer and computer system realizing the present invention can usefully be very powerful, and can usefully exercise certain exotic software functions in the areas of machine vision, scene and feature analysis, and interactive control.
As explained, the present invention has not been, to the present date of filing, implemented at its "full blown" level of interactive virtual television. It need not be in order that it may be understood as a coherent, logical, and useful scheme of so implementing virtual video/television.
10.1 Directions of Future Development
This specification has described the development and actual use of a prototype football video retrieval system. This system serves to demonstrate the concepts and the potential of MPI video. The feasibility of the broader concepts is completely demonstrated. Design and implementation of MPI video for longer sequences of football, and also for other applications, is still proceeding as of the filing date.
However, as is also clear from the present specification, the MPI video system is in its infancy. The potential of the MPI video techniques is obvious, but cost-effective implementation, especially for the individual "John Q. Public" viewer, has a long way to go. Almost all medium- to large-scale computer technology involved in the implementation of the prototype MPI video systems was stretched to its limits. The following are only a few examples of the useful, and probable, future developments and enhancements.
10.1.1 Scene Analysis
In the prototype MPI video system, much information was inserted manually by an operator. However, to make MPI video practical for commercial use, this process should be automated as much as possible. (Notice that it is not necessary that MPI video should invariably be so automated in order to be used. Certain very crucial or interesting events for which multiple video images exist -- such as key plays in sporting events -- may be well deserving of careful analysis after the fact.)
Also, and as may be recalled, it was found to be difficult to determine camera status for some video frames which contain very few known points to calibrate. This problem may be solved by using information obtained from other video frames, both of other cameras in the same instant and/or of the same camera in the instants before and after. Once this technology becomes practical, it will be possible to structure many other items and objects to simplify the object recognition task.
10.1.2 Data Modeling and Indexing
Information structure that is contained in a scene is usually complicated, and the amount of information in the scene is huge. Moreover, this video information is developed and received over but a short period of time. To deal with various types of queries, good data modeling is required. See Amarnath Gupta, Terry Weymouth, and Ramesh Jain; "Semantic queries with pictures: the VIMSYS model" appearing in Proceedings of the 17th Interna tional Conference on Very Large Da ta Bases, September 1991.
To enable quick response to the queries, indexing techniques will be required. Such techniques for images and video are just now being developed.
10.1.3 The Human Interface
The present specification has taught that interaction using a three-dimensional cursor is a good way for a user/viewer to point to or highlight objects in three-dimensional space. However, in the field of entertainment and training, where interactive video is expected to be useful, an even more friendly interface is desired. Techniques to specify camera location, describe events of interest, and other similar things need further development. In many applications, like "telepresence", one may require extensive use of virtual reality environments. In applications like digital libraries, strong emphasis on user modeling will be essential.
Nonetheless to the potential of improving, and rendering more abstract, the user/viewer interface in some applications, this interface is most assuredly not a "weak point" of the present invention of MPI video. Indeed, it is difficult to even imagine how new and improved user/viewer interface tools might be used in the context of interactive movies and similar other applications of MPI video. It seems as if the tools that the user/viewer might reasonably require are already available right now.
10.1.4 Video Databases
As access to data from more and more cameras is permitted, the storage requirements for MPI video will increase significantly. Where and how to store this video data, and how to organize it for timely retrieval, is likely to be a major issue for expansion and extension of the MPI video system. In the prototype system, the single most critical problem has been the storage of data. Future MPI video will continue to put tremendous demands on the capacity and efficiency of organization of the storage and database systems.
10.2 Recapitulation of the MPI Portion of the Present Invention
In one, rudimentary, embodiment of the present invention, a virtual video camera, and a virtual video image, of a scene were synthesized in a computer and in a computer system from multiple real video images of the scene that were obtained by multiple real video cameras.
This synthesis of a virtual video image was computationally intensive. Depending upon how extensively and how fast (i) three-dimensional analysis of the multiple scenes is to transpire, (ii) information from the multiple scenes is to be extracted, and (iii) linkage between the multiple scenes is to be established, the computer and computer system realizing the present invention can usefully be very powerful, and can usefully exercise certain exotic software functions in the areas of machine vision, scene and feature analysis, and interactive control. In the prototype system, network-connected engineering workstations that were relatively new as of the 1995 date of filing were used.
Notably, however, the present invention need not be (and to the present date of filing has not been) implemented at its "full blown" level of interactive virtual television in order that it may be recognized that a coherent, logical, and useful scheme of implementing virtual video/television is shown and taught.
The virtual video camera, and virtual image, produced by the MPI video system need not, and commonly does not, have any real-world counterpart. The virtual video camera and virtual image may show, for example, a view of a sporting event, for example American football, from an aerial, or an on-field, perspective at which no real camera exists or can exist.
In an advanced, computationally intensive, form the virtual camera/virtual image can be computer synthesized in real time, producing virtual television.
The synthesis of virtual video images/virtual television pictures may be linked to any of (i) a perspective, (ii) an object in the video/television scene, or (iii) an event in the video/television scene. The linkage may be to a static, or a dynamic, (i) perspective, (ii) object or (iii) event. For example, the virtual video/television camera could be located (i) statically at the line of scrimmage, (ii) dynamically behind the halfback wheresoever he might go, or (iii) dynamically on the football wheresoever it might go, in a video/television presentation of a game of American football.
The virtual camera, and virtual image, that is synthesized from multiple real-world video images may be so synthesized interactively, and on demand. For example, in early deployments of the system of the invention, a television sports director might select a virtual video replay of a play in a football game keyed on a perspective, player or event, or might even so key a selected perspective of an upcoming play to be synthesized in real time, and shown as virtual television. Ultimately, many separate viewers are able to select, as sports fans, their desired virtual images. For example, a virtual video replay, or even a virtual television, image of each of the eleven players on each of two American football teams, plus the image of the football, is carried on twenty-three television channels. The "fan" can thus follow his favorite player.
Ultimate interactive control where each "fan" can be his own sports director is possible, but demands that considerable image data (actually, three-dimensional image data) be delivered to the "fan" either non-real time in batch (e.g., on CD-ROM), or in real time (e.g., by fiber optics), and, also, that the "fan" should have a powerful computer (e.g., an engineering workstation, circa 1995).
In accordance with the preceding explanation, variations and adaptations of Multiple Perspective Interactive (MPI) video in accordance with the present invention will suggest themselves to a practitioner of the digital imaging arts. For example, monitors of the positions of the eyes might "feed back" into the view presented by the MPI video system in a manner more akin to "flying" in a virtual reality landscape than watching a football game -- even as a live spectator. It may be possible for a viewer to "swoop" onto the playing field, to "circle" the stadium, and even, having crossed over to the "other side" of the stadium, to pause for a look at that side's cheerleaders.
11. Immersive Video, and the Motivation for Immersive Video
Because it provides a comprehensive visual record of environment activity, video data is an attractive source of information for the creation of "virtual worlds" which, nonetheless to being virtual, incorporate some "real world" fidelity. The present invention concerns the use of multiple streams of video data for the creation of immersive, "visual reality", environments.
The immersive video system of the present invention for so synthesizing "visual reality" from multiple streams of video data is based on, and is a continuance of, the Multiple Perspective Interactive Video (MPI-Video) just discussed. An immersive video system incorporates the MPI-Video architecture, which architecture provides the infrastructure for the processing and the analysis of multiple streams of video data.
The MPI-Video portion of the immersive video system (i) performs automated analysis of the raw video and (ii) constructs a model of the environment and object activity within the environment. This model, together with the raw video data, can be used to create immersive video environments. This is the most important, and most difficult, functional portion of the immersive video system. Accordingly, this MPI-Video portion of the immersive video system is first re-visited, and actual results from an immersive "virtual" walk-through as processed by the MPI-Video portion of the immersive video system are presented.
As computer applications that model and interact with the real world increase in numbers and types, the term "virtual world" is becoming a misnomer. These applications, which require accurate and real-time modeling of actions and events in the "real world" (e.g., gravity), interact with a world model either directly (e.g., "telepresence") or in a modified form (e.g., augmented reality). A variety of mechanisms can be employed to acquire data about the "real world" which is then used to construct a model of the world for use in a "virtual" representation.
Long established as a predominant medium in entertainment and sports, video is now emerging as a medium of great utility in science and engineering as well. It thus comes as little surprise that video should find application as a "sensor" in the area of "virtual worlds." Video is especially useful in cases where such "virtual worlds" might usefully incorporate a significant "real world" component. These cases turn out to be both abundant and important; basically because we all live in, and interact with, the real world, and not inside a computer video game. Therefore, those sensations and experiences that are most valuable, entertaining and pleasing to most people most of the time are sensations and experiences of the real world, or at least sensations and experiences that have a strong real-world component. Man cannot thrive on fantasy alone (which state is called insanity); a good measure of reality is required.
In one such use of video as a "sensor", multiple video cameras cover a dynamic, real-world, environment. These multiple video data streams are a useful source of information for building, first, accurate three-dimensional models of the events occurring in the real world, and, then, completely immersive environments. Note that the immersive environment does not, in accordance with the present invention, come straight from the real-world environment. The present invention is not simply a linear, brute-force, processing of two-dimensional (video) data into a three-dimensional (video) database (and the subsequent uses thereof). Instead, in accordance with the present invention, the immersive environment comes to exist through a three-dimensional model, particularly a model of real-world dynamic events. This will later become clearer such as in, inter alia, the discussion of Figure 25.
In the immersive video system of the present invention, visual processing algorithms are used to extract information about object motion and activity (both of which are dynamic by definition) in the real-world environment. This information -- along with (i) the raw video data and (ii) a priori information about the geometry of the environment -- is used to construct a coherent and complete visual representation of the environment. This representation can then be used to construct accurate immersive environments based on real-world object behavior and events. Again, the rough concept, if not the particulars, is clear. The immersive environment comes to be only through a model, or representation, of the real-world environment.
While video data proves a powerful source medium for these tasks (leading to the model, and the immersive environment), the effective use of video requires sophisticated data management and processing capabilities. The manipulation of video data is a daunting task, as it typically entails staggering amounts of complex data. However, in restricted domains, using powerful visual analysis techniques, it is possible to accurately model the real world using video streams from multiple perspectives covering a dynamic environment. Such "real-world" models are necessary for "virtual world" development and analysis.
The MPI-Video portion of the immersive video system builds the infrastructure to capture, analyze and manage information about real-world events from multiple perspectives, and provides viewers (or persons interacting with the scene) interactive access to this
information. The MPI-Video sub-system uses a variety of visual computing operations, modeling and visualization techniques, and multimedia database methodologies to (i) synthesize and (ii) manage a rich and dynamic representation of object behavior in real-world environments monitored by multiple cameras (see Figures 2 and 22).
An Environment Model (EM) is a hierarchical representation of (i) the structure of an environment and (ii) the actions that take place in this environment. The EM is used as a bridge between the process of analyzing and monitoring the environment and those processes that present information to the viewer and support the construction of "immersive visual reality" based on the video data input.
The following sections explain the use of multiple streams of video data to construct "immersive visual reality" environments. In addition, salient details are provided regarding support of the MPI-Video subsystem for other video analysis tasks .
A variety of design issues arise in realizing immersive environments, and in managing and processing multiple streams of video data. These issues include, for instance, how to select a "best" view from the multiple video streams, and how to recognize the frame(s) of a scene "event". Interactively presenting the information about the world to the viewer is another important aspect of "immersive visual reality". For many applications and many viewer/users, this includes presentation of a "best" view of the real-world environment at all times to the viewer/user. Of course, the concept of what is "best" is dependent on both the viewer and the current context. In the following Section 12, the different ways of defining the "best" view, and how to compute the "best" view based on viewer preferences and available model information, are described.
In some applications, e.g., "telepresence" and "telecontrol", immersion of the viewer/user is vital. Selecting the "best" view among available video streams, which selection involves constant change of viewer perspective, may be detrimental towards creating immersion. Generalizing the "best" view concept to selecting a continuous sequence of views that best suit viewer/user requirements and create immersion overcomes this. When such arbitrary views are selected, then the world must somehow be visualized from that perspective for the viewer/user.
Traditionally, immersion has been realized by rendering three-dimensional models realistically, preferably in stereo. This is the approach of the common computer game, circa 1995, offering "graphics immersion". This approach, which uses a priori texture maps, suffers from some defects when the immersive experience to be created is that of a real-world environment. In real-world environments, the lighting conditions change constantly in ways that cannot be modeled precisely. Also, unknown dynamic objects can appear, and when they do it is not clear how and what to render.
When multiple video cameras cover an environment from multiple perspectives, as in the immersive video system of the present invention, then, in accordance with the invention, video can be used as a dynamic source of generating texture information. The complete immersive video system discussed in Section 13 uses a comprehensive three-dimensional model of the environment and the multiple video channels to create immersive, realistic renditions of real-world events from arbitrary perspectives in both monocular and stereo presentations.
The further sections of this specification are organized as follows: Section 12 is a description of the construction of accurate three-dimensional models of an environment from multi-perspective video streams in consideration of a priori knowledge of the environment. Specifically, Section 12 discusses the creation of an Environment Model and also provides details on the preferred MPI-Video architecture.
Following this, Section 13 describes how this model, along with the raw video data, can be used to build realistic "immersive visual reality" vistas, and how a viewer can interact with the model.
Details on the implementation of the MPI-Video portion of the immersive video system, outlining hardware details, etc., are given in Section 14.
The possibilities of using video to construct immersive environments are limitless. Section 15 describes various applications of the immersive video system of the present invention.
12. Applications of Video-Based Immersive Environments
It is the contention of the inventors that video of real-world scenes will play an important role in automation and semi-automation of both (i) virtual and (ii) immersive visual reality environments. In telepresence applications, a virtual copy of the world is created at a remote site to produce immersion. See B. Chapin, Telepresence Definitions, a World Wide Web (WWW) document on the Internet at URL http://cdr.stanford.edu/html/telepresence/definition.html, 1995.
Key features of telepresence applications are: 1) the entire application is real-time; 2) the virtual world is reasonably faithful to the real world being mimicked; 3) since real-time and real-world are cardinal, sensors should be used in acquiring the virtual world in a completely automated way; and 4) the virtual world must be visualized realistically from the viewer perspective.
The MPI-Video modeling system described in Section 12 uses multiple video signals to faithfully reconstruct a model of the real-world actions and structure. A distributed implementation coupled with expectation-driven, need-based analysis (described in Section 14) ensures near real-time model construction. The preferred immersive video system, described in Section 13, reconstructs realistic monocular and stereo vistas from the viewer perspective (see, for example, Figure 33).
Even in non-real-time applications, video-based systems, such as the one taught in this specification, can be very beneficial. Generally, it is very difficult and laborious to construct virtual environments by hand. In a semi-autonomous mode, however, a video-based system can assist the user by assuming the low-level tasks like building the structural model based on the real world, leaving only high-level annotation to the user.
Video data can be used to collect a myriad of visual information about an environment. This information can be stored, analyzed and used to develop "virtual" models of the environment. These models, in turn, can be analyzed to determine potential changes or modifications to an environment. For instance, MPI-Video might be employed at a particularly hazardous traffic configuration. Visual data of traffic would be recorded and analyzed to determine statistics about usage, accident characteristics, etc. Based on this analysis, changes to the environment could be designed and modeled, where input to the model again could come from the analysis performed on the real data. Similarly, architectural analysis could benefit by the consideration of current building structures using MPI-Video. This analysis could guide architects in the identification and modeling of building environments.
13. MPI-Video Architecture
To effectively create synthetic worlds which integrate real and virtual components, sophisticated data processing and data management mechanisms are required. This is especially true in the case where video is employed because high frame rates and large images result in daunting computational and storage demands. The
present invention addresses such data processing and management issues through the concept of Multiple Perspective Interactive Video (MPI-Video).
MPI-Video is a framework for the management of, and interactive access to, multiple streams of video data capturing different perspectives of the same or of related events. As applied to the creation of virtual environments, MPI-Video supports the collection, processing and maintenance of multiple streams of data which are integrated to represent an environment. Such representations can be based solely on the "real" world recorded by the video cameras, or can incorporate elements of a "virtual" world as well.
The preferred MPI-Video system supports a structured approach to the construction of "virtual worlds" using video data. In this section the MPI-Video architecture, shown in Figure 1, is outlined. Those elements salient to the application of MPI-Video in the context of the processing and creation of "immersive visual reality" are highlighted.
In brief, the MPI-Video architecture involves the following operations. During processing, multiple data streams are forwarded to the Video Data Analyzer. This unit evaluates each stream to (i) detect and track objects and (ii) identify events recorded in the data. Information derived in the Video Data Analyzer is sent to the Assimilator. Data from all input streams is integrated by the Assimilator and used to construct a comprehensive representation of events occurring in the scene over time (e.g., object movements and positions).
The Assimilator thus models spatial-temporal activities of objects in the environment, building a so-called environment model. In addition, these tracking and modeling processes provide input to the database which maintains both the annotated raw video data as well as information about object behavior, histories and events. Information in the database can be queried by the user or by system processes for information about the events recorded by the video streams as well as being a data repository for analysis operations. A View Selector module -- used to compute and select "best views" and further discussed below -- interfaces with the database and a user interface subsystem to select appropriate views in response to user or system input .
A visualizer and virtual view builder uses the raw video data, along with information from the environment model to construct synthetic views of the environment.
Finally, a user interface provides a variety of features to support access, control and navigation of the data.
To demonstrate and explore the ideas involved in MPI-Video, a prototype system was constructed. The prototype system uses data from a university courtyard environment. Figure 22a shows a schematic of this courtyard environment, indicating the positions of the cameras. Synchronized frames from each of the four cameras are shown in Figures 22b and 22c.
13.1 Three-Dimensional Environmental Model
"Virtual worlds" -- whether of an actual "real world" environment or a purely synthetic environment -- depend on the creation and maintenance of an Environment Model (EM). The EM will be understood to be a comprehensive three-dimensional model containing both (i) the structural primitives of the static environment, e.g. surfaces, shapes, elevation, and (ii) characteristics of moving objects such as motion, position and shapes.
Formally, the preferred EM consists of a set of interdependent objects O(t). This set in turn is comprised of a set of dynamic objects D(t) and a set of static objects S. For instance, vehicles moving in a courtyard are dynamic objects; pillars standing in the courtyard are static objects. The time variance of the set O(t) is a result of the time variation of the dynamic objects.
As befits their name, static objects do not vary with time. The set of values of these objects at any instant comprises the state of the system S(t). The preferred EM uses a layered model to represent objects at different levels of abstraction, such that there is a strong correlation between objects at different abstractions. Figure 4 shows some of the possible layers of the environment model, and how each layer communicates independently with other modules. Reference A. Katkere and R. Jain, A framework for information assimilation, in Exploratory Vision edited by M. Landy, et al., 1995.
To ensure consistency, any changes that occur in one level should be propagated to other levels (higher and lower) , or at least tagged as an apparent inconsistency for future updating.
In general, propagation from higher to lower levels of abstraction is easier than vice versa. Accordingly, changes are attempted to be assimilated at as high a level of abstraction as possible. Each dynamic object at the lowest level has a spatial extent of exactly one grid. Objects with higher extent are composed of these grid objects, and hence belong to higher levels.
Direct information acquisition at higher levels must be followed by conversion of that information to the information at the densest level, so that information at all levels is consistent. It is important to come up with efficient access (and update) strategies at this level since this could potentially be the bottleneck of the entire representation and assimilation module.
Each dynamic object has several attributes, the most basic being the confidence that it exists. Each of the factors below may contribute to either an increase or decrease in this confidence. These factors also affect the values of other object attributes. The value of an object O(t), and hence, the state S(t), may change due to the following factors: (1) new input information, i.e., new data regarding object position from the video data; (2) change in related model information; (3) advice from higher processes; and (4) decay (due to aging).
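A minimal sketch of one possible confidence bookkeeping consistent with these four factors follows; the particular update rule, weights and decay constant are assumptions for illustration only, not values taken from the prototype.

def update_confidence(obj, new_support=0.0, model_consistency=0.0,
                      advice=0.0, decay=0.02):
    """Keep an object's existence confidence in [0, 1]."""
    confidence = obj.get("confidence", 0.5)
    confidence += new_support         # (1) new input information from the video data
    confidence += model_consistency   # (2) change in related model information
    confidence += advice              # (3) advice from higher processes
    confidence -= decay               # (4) decay due to aging
    obj["confidence"] = max(0.0, min(1.0, confidence))
    return obj["confidence"]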
The preferred MPI-Video system provides facilities for managing dynamic and static objects, as is discussed further below in this section.
The EM, informed by the two-dimensional video data, provides a wealth of information not available from a single camera view. For instance, objects occluded in one camera view may be visible in another. In this case, comparison of objects observed in a camera view at a particular time instant t with objects in the state S(t) can help anticipate and resolve such occlusions. The model, which takes inputs from both views, can continue to update the status of an object regardless of the occlusion in a particular camera plane. To maintain and manipulate information about the position of static and dynamic objects in the environment, a representation must be chosen which facilitates maintenance of object positional information as well as supporting more sophisticated questions about object behavior. The preferred dynamic model relies on the following two components.
The first component is voxels. In this representation, the environment is divided up into a set of cubic volume elements, or voxels. Each voxel contains information such as which objects currently occupy this voxel, information about the history of objects in this voxel, and an indication of which cameras can "see" this voxel. In this representation, objects can be described by the voxels they occupy. The voxel representation is discussed in greater detail in section 4.
The second component is (x,y,z) world coordinates. In this case, the environment and objects in the environment are represented using (x,y,z) world coordinates. Here objects can be
described by a centroid in (x,y,z), by bounding boxes, etc.
Each of these representations provides different support for modeling and data manipulation activities. The preferred MPI-Video system utilizes both representations.
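By way of illustration, the two representations might be carried in data structures like the following; the field names are assumptions chosen for clarity, not those of the preferred system.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Voxel:
    """One cubic volume element of the Environment Model."""
    occupants: Set[int] = field(default_factory=set)    # ids of objects now in this voxel
    history: List[Tuple[float, int]] = field(default_factory=list)  # (time, object id) pairs
    cameras: Set[int] = field(default_factory=set)       # cameras that can "see" this voxel

@dataclass
class WorldObject:
    """World-coordinate description of the same object."""
    object_id: int
    centroid: Tuple[float, float, float]                  # (x, y, z) world coordinates
    voxels: Set[Tuple[int, int, int]] = field(default_factory=set)  # voxel indices occupied

# The Environment Model keeps both: a voxel grid indexed by integer grid coordinates,
# and a table of objects described in (x, y, z) world coordinates.
VoxelGrid = Dict[Tuple[int, int, int], Voxel]
ObjectTable = Dict[int, WorldObject]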
13.2 Video Data Analysis and Information Assimilation
The Video Data Analyzer uses image and visual processing techniques to perform object detection, recognition and tracking in each of the camera planes corresponding to the different perspectives. The currently employed technique is based on differences in spatial position to determine object motion in each of the camera views. The technique is as follows.
First, each input frame is smoothed to remove some noise. Second, the difference image d_t is computed as follows. Only pixels that are in the focus of attention windows and that are not masked are considered. (Here F_t refers to the pixels in the focus of attention, i.e., a region of interest in frame t.)
d_t = Threshold(Abs(F_t-1 - F_t), threshold_value)    (1)
To remove motion shadows, the following operation is done:
d'_t = d_t & d_t+1    (2)
Third, components on the thresholded binary difference image are computed based on a 4-neighborhood criterion. Components that are too small or too big are thrown away as they usually constitute noise. Also, frames that contain a large number of components are discarded. Both centroid (from the first moments), and orientation and elongation (from the second moments), are extracted for each component (a sketch of this moment computation is given after the fourth step below).
Fourth, any of several optional filters can be applied to the components obtained from the previous step. These filters include merging of overlapping bounding boxes, hard limits on orientation and elongation, distance from expected features, etc.
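The centroid, orientation and elongation of the third step may be computed from the first and second central moments of a component's pixel coordinates, as in the following sketch. These are standard image-moment formulas, not code taken from the specification, and the axis-length ratio used below for elongation is one common choice among several.

import numpy as np

def component_shape(ys, xs):
    """Centroid, orientation and elongation from a component's pixel coordinates."""
    cy, cx = ys.mean(), xs.mean()                    # first moments -> centroid
    mu20 = ((xs - cx) ** 2).mean()                   # second central moments
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    orientation = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    spread = np.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_max = (mu20 + mu02 + spread) / 2.0           # variances along the principal axes
    lam_min = (mu20 + mu02 - spread) / 2.0
    elongation = np.sqrt(lam_max / lam_min) if lam_min > 0 else float("inf")
    return (cx, cy), orientation, elongation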
The list of components associated with each camera is sent from the Video Analysis unit to the Assimilator module which integrates data derived from the multiple streams into a comprehensive representation of the environment.
The Assimilator module maintains a record of all objects in the environment. When new data arrives from the Video Data Analysis module the Assimilator determines if the new data corresponds to an object whose identity it currently maintains. If so, it uses the new data to update the object information. Otherwise, it instantiates a new object with the received information. The following steps are employed to update objects.
First, the list of 2D object bounding boxes is further filtered based on global knowledge.
Second, the footprint of each bounding box is projected to the primary surface of motion by intersecting a ray drawn from the optic center of that particular camera through the foot of the bounding box with the ground surface.
Third, each valid footprint is tested for membership with existing objects and the observation is added as support to the closest object, if any. If no object is close enough, a new object hypothesis is created.
Fourth, all supporting observations are used (with appropriate weighting based on distance from the camera, direction of motion, etc.) to update the position of each object.
Fifth, the object positions are projected into the next frame based on a domain-dependent tracker.
More sophisticated tracking mechanisms are easily integrated into the preferred system. A current area of our research seeks to employ additional methods to determine and maintain object identity. For instance, active contour models can be employed in each of the cameras to track object movements. See A. M. Baumberg and D. C. Hogg, An efficient method for contour tracking using active shape models, Technical Report 94.11, School of Computer Studies, University of Leeds, April, 1994. See also M. Kass, A. Witkin, and D. Terzopolous, Snakes: Active contour models, International Journal of Computer Vision, pages 321-331, 1988. See also F. Leymarie and M. D. Levine, Tracking deformable objects in the plane using an active contour model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):617-634, June 1993. Such methods provide a more refined representation of object shape and dynamics.
One important assumption that is made is that the "static" world is known a priori and the only elements of interest in the video frames are the objects that undergo some type of change, e.g., a player running on a field. In addition, we introduce additional constraints by requiring cameras to be stationary and make the following realistic assumptions about objects of interest:
First, these objects are in motion most of the time.
Second, these objects move on known planar surfaces.
Third, these objects are visible from at least two viewpoints.
This knowledge of the "static" world is captured through the
camera calibration process which maps related locations in the two-dimensional video data to a fully three-dimensional representation of the world recorded by the cameras. If an event is seen in one camera, e.g., a wide receiver making a catch, or a dancer executing a jump, the system, using this mapping, can determine other cameras that are also recording the event, and where in the various video frames the event is occurring. Then a viewer, or the system, can choose between these different views of the action, subject to some preference, for example, the frames which provide a frontal view of the wide receiver or the dancer. This "best view" selection is described further below and in section 14.
When their positions and orientations are fixed, cameras can be calibrated before processing the video data using methods such as those described by Tsai and Lenz. See R. Y. Tsai and R. K. Lenz, A new technique for fully autonomous and efficient 3D robotics hand/eye calibration, IEEE Transactions on Robotics and Automation, 5(3):345-58, June 1989.
Calibration of moving cameras is a more difficult task and is currently an area of active research, e.g., ego motion. See E. S. Dickmanns and V. Graefe, Dynamic monocular machine vision, Machine Vision and Applications, 1:223-240, 1988.
The preferred MPI-Video system of the present invention has the capability to integrate these techniques into our analysis and assimilation modules when they become available. To date, evaluation of the preferred MPI-Video system has been done only by use of fixed cameras. The Assimilator maintains the Environment Model discussed above.
13.2.1 Camera Handoff
A key element in the maintenance of multiple camera views is the notion of a Camera Hand-off, here understood to be the event in which a dynamic object passes from one camera coverage zone to another. The Assimilator module also manages this processing task, maintaining a consistent representation of an object's identity and behavior during camera hand-off. This requires information about the object's position, its motion, etc.
Using the voxel information, noted above, we can determine which cameras can "see" (or partially "see") an object. Namely, a camera completely "sees" an object if all voxels occupied by the object are also seen by the camera. Let c(v) be the camera list, or set, associated with a particular voxel v, and V be the set of all voxels in which an object resides. Then, C is the complete coverage, i.e., that set of cameras which can see all voxels in which an object resides, and P is the partial coverage set, i.e., those cameras which can see some part of the object. These are defined as:
C_O = ∩_{v ∈ V} c(v)

P_O = ∪_{v ∈ V} c(v)    (4)
Thus, we can determine which cameras "see" a particular object by considering the intersection and/or union of the camera lists associated with the voxels in which the object resides. When an object moves between different zones of coverage, camera hand-off is essentially automatic as a result of the a priori information regarding camera location and environment configuration. This is significant as it alleviates the necessity of reclassifying objects when they appear in a different camera view. That is, an object may enter a camera view and appear quite different than it did before, e.g., in this new perspective it may appear quite large.
However, reclassification is not necessary as the system, using its three-dimensional model of the world, can determine which object this new camera measurement belongs to and can update the appropriate object accordingly. This capability is important for maintaining a temporally consistent representation of the objects in the environment. Such a temporal representation is necessary if the system is to keep track of object behavior and events unfolding in the environment over time.
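By way of illustration only, the following Python sketch shows one way the complete and partial coverage sets defined in equation (4) might be computed; the names camera_lists and object_voxels are assumptions for illustration and do not appear in the preferred system.

```python
def coverage_sets(object_voxels, camera_lists):
    """Return (complete, partial) camera coverage for an object.

    object_voxels : iterable of voxel ids occupied by the object (the set V)
    camera_lists  : dict mapping voxel id -> set of camera indices, c(v)
    """
    voxels = list(object_voxels)
    if not voxels:
        return set(), set()
    complete = set(camera_lists.get(voxels[0], set()))
    partial = set()
    for v in voxels:
        cams = camera_lists.get(v, set())
        complete &= cams   # cameras that see every voxel of the object
        partial |= cams    # cameras that see at least one voxel of the object
    return complete, partial

# Example: an object occupying three voxels seen by overlapping camera sets.
cams = {1: {0, 1}, 2: {1, 2}, 3: {1}}
print(coverage_sets([1, 2, 3], cams))   # -> ({1}, {0, 1, 2})
```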
13.3 Best View Selection
The View Selector can use a variety of criteria and metrics to determine a "best" view. Here, "best" is understood to be relative to a metric either specified by the user or employed by the system in one of its processing modules.
The best view concept can be illustrated by considering a case where there are N cameras monitoring the environment. Cameras will be denoted by C_i where the index i ∈ {1, ..., N} varies over all cameras. At every time step t, each camera produces a video frame F_{i,t}. The term i_BV,t will be used to indicate the best view index. That is, i_BV,t is the index of the camera which produces the best view at time t. Then, the best view is found by selecting the frame from camera C_{i_BV,t} at time t, i.e., the best view is F_{i_BV,t}.
Some possible best view criteria include the least occluded view, the distance of the object to the camera, and object orientation.
In the case of a least occluded view criterion, the system chooses, at time t, that frame from the camera in which an object of interest is least occluded. Here, the best view camera index is defined according to the following criterion:

i_BV,t = arg_i ( max_i S_{i,t} )    (5)

The object size metric S_{i,t} is given by:

S_{i,t} = (1 / S_E) Σ_{(x,y)} p(x, y)    (6)

where p(x, y) = 1 if pixel (x, y) ∈ R_{i,t} and 0 otherwise, R_{i,t} being the region of frame F_{i,t} that contains the object of interest. We normalize the total size by the expected size, S_E, of the object, i.e., the number of pixels we expect the object to occupy in the camera view if no occlusion occurs. Finally, arg_i returns the index which optimizes this criterion.
In the case of an object distance from camera criterion, the best view is the frame in which an object of interest is closest to the corresponding camera:

i_BV,t = arg_i ( min_i D_i(t) )    (7)

where D_i(t) is the Euclidean distance between the (x, y, z) location of camera C_i and the world coordinates of the object of interest. The world coordinate representation, mentioned above, is most appropriate for this metric. Note also that this criterion does not require any computation involving the data in the frames. However, it does depend on three-dimensional data available from the environment model.
For an orientation criterion a variety of possibilities exist: for instance, the direction of object motion, the view in which a face is most evident, or the view in which the object is located closest to the center of the frame. This last metric is described by

i_BV,t = arg_i ( min_i CD_i(t) )    (8)

Here, CD_i(t) is given by

CD_i(t) = sqrt[ (x(t) − Xsize/2)² + (y(t) − Ysize/2)² ]    (9)

The values Xsize and Ysize give the extent of the screen and (x(t), y(t)) are the screen coordinates of the object's two-dimensional centroid in frame F_{i,t}.
Combinations of metrics can also be employed. We can formulate a general representation of best view as follows:

i_BV,t = arg_i G( g_{i,t}( m_j(C_i) | j ∈ {1, ..., M}, i ∈ {1, ..., N}, t ∈ {1, ..., T} ) )    (10)

In this equation, each m_j is a metric, e.g., size as defined above, and we have M such metrics, each of which is applied to the data from each camera, hence the C_i terms in equation (10). Furthermore, each g_{i,t} combines these metrics for C_i, e.g., as a weighted linear sum. The use of the time t in this equation supports a best view optimization which uses a temporal selection criterion involving many frames over time, as well as spatial metrics computed on each frame. This is addressed in a following paragraph. Finally, the criterion G chooses between all such combinations and arg_i selects the appropriate index. For instance, G might specify the minimum value.
For example, if we have three cameras (N = 3), two metrics (M = 2) and g specifying a linear weighted sum (using weights ω₁ and ω₂), G would pick the optimum of

g_{1,t} = ω₁ m₁(C₁) + ω₂ m₂(C₁)

g_{2,t} = ω₁ m₁(C₂) + ω₂ m₂(C₂)

g_{3,t} = ω₁ m₁(C₃) + ω₂ m₂(C₃)

i_BV,t = arg_i G(g_{1,t}, g_{2,t}, g_{3,t})
Again, G is a criterion which chooses the optimum from the set of g_{i,t}'s. Note that time does not appear explicitly in the right hand side of this equation, indicating that the same best view evaluation is applied at each time step t. Note, in this case, the same g (here, a weighted linear sum) is applied to all cameras, although this need not be the case.
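The weighted-sum selection just described can be sketched in a few lines of Python. The sketch below assumes the metrics of equations (6), (7) and (9) have already been evaluated per camera; the distance-style metrics are negated so that a single maximizing criterion G can be used. All function and variable names are illustrative, not part of the preferred system.

```python
import math

def size_metric(object_pixels, expected_size):
    # Normalized object size of equation (6): visible pixels / expected pixels.
    return len(object_pixels) / float(expected_size)

def distance_metric(camera_xyz, object_xyz):
    # Euclidean camera-to-object distance of equation (7); negated so that
    # "smaller is better" becomes "larger is better" for a maximizing G.
    return -math.dist(camera_xyz, object_xyz)

def center_metric(centroid_xy, frame_size):
    # Distance of the 2-D centroid from the frame center, equation (9), negated.
    cx, cy = frame_size[0] / 2.0, frame_size[1] / 2.0
    return -math.hypot(centroid_xy[0] - cx, centroid_xy[1] - cy)

def best_view(per_camera_metrics, weights):
    """per_camera_metrics: one list of metric values per camera.
    Returns the index i_BV of the camera whose weighted sum g_i is maximal."""
    scores = [sum(w * m for w, m in zip(weights, metrics))
              for metrics in per_camera_metrics]
    return max(range(len(scores)), key=scores.__getitem__)
```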
Two further generalizations are possible. Both are research issues we are currently addressing. Firstly, an optimization which accounts for temporal conditions is possible. The best view is a frame from a particular camera. However, smoothness over time may also be important to the viewer or a system processing module. Thus, while a spatial metric such as object size or distance from camera is important, a smooth sequence of frames with some minimum number of cuts (i.e., camera changes) may also be desired. Hence, best view selection can be a result of optimizing some spatial criterion such that a temporal criterion is also optimum.
A second generalization results if we consider the fact that the C_i's do not have to correspond to actual camera views. That is, the preferred MPI-Video system has the capability of locating a camera anywhere in the environment. Thus, best view selection can be a function of data from actual cameras as well as from "virtual" cameras. In this case, equation (10) becomes a continuous function in the camera "index" variable. That is, we no longer have to restrict ourselves to the case of a finite number of cameras from which to choose the best view. Let x = (x, y, z, α, β, f), where (x, y, z) is the world coordinate position, or index, of the camera, α is a pan angle, β is the camera tilt angle and f is a camera parameter which determines zoom in/out. The set of all such vectors x forms a 6-dimensional space, Ω. In Ω, (x, y, z) varies continuously over all points in R³, −π ≤ α, β ≤ π, and f > 0.
To determine the best view in the environment subject to some criteria, we search over all points in this space, optimizing the optimization function. In this case, the best view is that of the camera positioned at location x_BV, where this value of the vector optimizes the constraint G given by:

G( g_t( m_j(x) | j ∈ {1, ..., M}, t ∈ {1, ..., T}, x ∈ Ω ) )    (11)

The camera index x can vary over all points in the environment, and the system must determine, subject to a mathematical formulation of the viewing specification, where to position the camera to satisfy a best view criterion. Views meeting this criterion can then be constructed using the techniques outlined in section 14.
For instance, using the same parameters as above, i.e., two metrics m_j, the weighted linear summing function g and the criterion function G, we have

g_t(x) = ω₁ m₁(x) + ω₂ m₂(x)    (12)

Then, to determine the best view, we find the value of x for which

G(g_t(x)), x ∈ Ω    (13)

is optimal.

Note that, assuming the computational power is available, the best view computations in equations (5), (7) and (8) can all be performed on the fly as video data comes into the system. More complex best view calculations, including those that optimize a temporal measure, may require buffered or stored data to perform best view selection.
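For the continuous formulation of equations (11)-(13), one simple, illustrative approach is a coarse search over sampled points of Ω, refined if desired by a numerical optimizer. In the following sketch, score() stands in for G(g_t(x)) evaluated on a virtual view rendered at camera parameter vector x; all names and the sampling choices are assumptions, not part of the preferred system.

```python
import itertools
import math

def grid_best_view(score, xyz_bounds, steps=5, zooms=(1.0, 2.0, 4.0)):
    """Return the 6-vector x = (x, y, z, pan, tilt, zoom) maximizing score(x).

    xyz_bounds : three (lo, hi) pairs bounding the searched region of space
    """
    axes = [[lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
            for lo, hi in xyz_bounds]                          # x, y, z samples
    angles = [-math.pi + k * 2 * math.pi / (steps - 1) for k in range(steps)]
    best_x, best_s = None, -math.inf
    for x, y, z in itertools.product(*axes):
        for pan, tilt, f in itertools.product(angles, angles, zooms):
            s = score((x, y, z, pan, tilt, f))
            if s > best_s:
                best_x, best_s = (x, y, z, pan, tilt, f), s
    return best_x
```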
Figure 23 shows how a selected image sequence is derived from four cameras and the determined "best" view. In this example, the "best" view is based upon two criteria, largest size and central location within the image, where size takes precedence over location. Here, the function g_{i,t} is just a simple weighted sum, as above, of the size and location metrics. The outlined frames represent chosen images which accommodate the selection criteria. Moreover, the oval tracings are superimposed onto the images to assist the viewer in tracking the desired object. The last row presents the preferred "best" view according to the desired criteria. In order to clarify the object's location, a digital zoom mechanism has been applied to the original image. In images T0 and T1, only from the view of camera 3 is the desired object visible. Although all camera views detect the object in images T2 and T3, the criteria select the image with the greatest size. Once again, in image T4, the object is only visible in camera 4.
13.4 Visualizer and Virtual View Builder
The visualizer and virtual view builder provides processing to create virtual camera views. These are views which are not produced by physical cameras. Rather, they are realistic renditions composed from the available camera views and appear as if actually recorded. Such views are essential for immersive applications and are addressed in section 14 below.
13.5 Model and Analysis Interface
Figures 27, 28 and 29 show the current Motif-based preferred MPI-Video interface. This interface provides basic visualization of the model, the raw camera streams and the results of video data analysis applied to these streams. In addition, its menus provide control over the data flow as well as some other options. We are currently developing a hyper-media interface, in conjunction with the development of a database system, which will extend the range of control and interaction a user has with the data input to and generated by our MPI-Video system. In the context of virtual scene creation such augmentations may include user selection of viewing position and manipulation (e.g., placement of virtual model information into the environment).
The model shown in Figures 27, 28 and 29 employs an (x, y, z) world coordinate, bounding box object representation. That is, the system tracks the object centroid and uses a bounding box to indicate the presence of an object at a particular location. A voxel-based representation supports finer resolution of object shape and location. Such a formulation is discussed in the next section 14.
14. Operation of Immersive Video (ImmV), or Interactive Telepresence, or Visual Reality (VisR)
Immersive and interactive telepresence is an idea that has captured the imagination of science fiction writers for a long time. Although not feasible in its entirety, it is conjectured that limited telepresence will play a major role in visual communication media in the foreseeable future. See, for example, N. Negroponte, Being digi tal , Knopf, New York, 1995.
In this section we describe Immersive Video (ImmV), a spatially-temporally realistic 3D rendition of real-world events. See the inventors' own papers: S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain, Immersive Video, Technical Report VCL-95-104, Visual Computing Laboratory, University of California, San Diego, Mar. 1995; and S. Moezzi, A. Katkere, S. Chatterjee, and R. Jain, Visual Reality: Rendition of Live Events from Multi-Perspective Videos, Technical Report VCL-95-102, Visual Computing Laboratory, University of California, San Diego, Mar. 1995.
These events are simultaneously captured by video cameras placed at different locations in the environment. ImmV allows an interactive viewer, for example, to watch a broadcast of a football or soccer game from anywhere in the field, even from the position of the quarterback, or to "walk" through a live session of the US Congress.
Immersive Video involves manipulating, processing and compositing of video data, a research area that has received increasing attention. For example, there is a growing interest in generating a mosaic from a video sequence. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, Real-time Scene Stabilization and Mosaic Construction, in ARPA Image Understanding Workshop, Monterey, CA, Nov. 13-16, 1994. See also H. Sawhney, Motion Video Annotation and Analysis: An Overview, Proc. 27th Asilomar Conference on Signals, Systems and Computers, pages 85-89, IEEE, Nov. 1993.
The underlying task is to create larger images from frames obtained from a single-camera panning video stream. Video mosaicing has numerous applications including data compression. Another application is video enhancement. See M. Irani and S. Peleg, Motion analysis for image enhancement: resolution, occlusion, and transparency, J. of Visual Communication and Image Representation, 4(4):324-35, Dec. 1993. Yet another application is the generation of panoramic views. See R. Szeliski, Image mosaicing for tele-reality applications, Proc. of Workshop on Applications of Computer Vision, pages 44-52, Sarasota, FL, Dec. 1994, IEEE, IEEE Computer Society Press. See also L. McMillan, Acquiring Immersive Virtual Environments with an Uncalibrated Camera, Technical Report TR95-006, Computer Science Department, University of North Carolina, Chapel Hill, Apr. 1995. See also S. Mann and R. W. Picard, Virtual Bellows: Constructing High Quality Stills from Video, Technical Report TR#259, Media Lab, MIT, Cambridge, Mass., Nov. 1994. Still further applications include high-definition television, digital libraries, etc.
To generate seamless video mosaics, registration and alignment of the frames from a sequence are critical issues. Simple, yet robust techniques have been suggested to alleviate this problem using multi-resolution area-based schemes. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, et al., Real-time scene stabilization and mosaic construction, in Proc. of Workshop on Applications of Computer Vision, pages 54-62, Sarasota, FL, Dec. 1994, IEEE, IEEE Computer Society Press. For scenes containing dynamic objects, parallax has been used to extract dominant 2D and 3D motions which were then used in registration of the frames and generation of the mosaic. See H. Sawhney, S. Ayer, and M. Gorkani, Model-based 2D and 3D Dominant Motion Estimation for Mosaicing and Video Representation, Technical Report, IBM Almaden Res. Ctr., 1994.
For multiple moving objects in a scene, motion layers have been introduced where each dynamic object is assumed to move in a plane parallel to the camera. See J. Wang and E. Adelson, Representing moving images with layers, IEEE Transactions on Image Processing, 3(4) :625-38, Sept. 1994. This permits segmentation of the video into different components each containing a dynamic object, which can then be interpreted and/or re-synthesized as a video stream.
However, for immersive telepresence there is a need to generate 3D mosaics -- a "HyperMosaic" -- that can also handle multiple dynamic objects. Maintaining spatial-temporal coherence and consistency is integral to the generation of such a HyperMosaic. In order to obtain a 3D description, multiple perspectives that provide simultaneous coverage must therefore be used and their associated visual information integrated. Another necessary feature would be to provide a viewpoint that may be selected. The immersive video system and method of the present invention caters to these needs.
Immersive Video requires sophisticated vision processing and modeling, next described in Section 14.1. While Virtual Reality systems use graphical models and texture mapping to create realistic replicas of both static and dynamic components, in immersive video, distinctively, the data used is from actual video streams. This also aids the rendition of exact ambiance, i.e., purely two dimensional image changes are also captured. For example, in ImmV, a viewer is able to move around a football stadium and watch the spectators from anywhere in the field and see them waving, moving, etc., in live video. For faithful reconstruction of realism, ImmV requires addressing issues such as synchronization between cameras, maintenance of consistency in both spatial and temporal signals, distributed processing and efficient data structures.
14.1 HyperMosaicing: Creating "Visual Realism"
Given the comprehensive model of the environment and accurate external and internal camera calibration information, compositing new vistas is accomplished by mosaicing pixels from the appropriate video streams. Algorithm 1 shown in Figure 31 outlines the steps involved. Algorithm 1 is the vista compositing algorithm. At each time instant, multiple vistas are computed using the current dynamic model and video streams from multiple perspectives. For stereo, vistas are created from left and right cameras.
A basic element of this algorithmic process is a set of transformations between the model (or world) coordinate system W: {(x_w, y_w, z_w)}, the coordinate system of the cameras C: {(x_c, y_c, z_c)} and the vista coordinate system V: {(x_v, y_v, z_v)}. For each pixel (x_v, y_v, d_v(x_v, y_v)) on the vista, the corresponding point (x_w, y_w, z_w) is found in the world coordinate system using its depth value.
[x_w y_w z_w 1]^T = M_v [x_v y_v z_v 1]^T    (14)

where M_v is the 4 × 4 homogeneous transformation matrix representing the transformation between V and the world W [6].

This point is then projected onto each of the camera image planes c:

[x_c y_c z_c 1]^T = M_c [x_w y_w z_w 1]^T    (15)

where M_c is the 4 × 4 homogeneous transformation matrix representing the transformation between c and the world.

These points (x_c, y_c, z_c), for each camera c, are tested for occlusion from that view by comparing z_c with the depth value of the corresponding pixel. At this point, we have several candidates that could be used for the pixel (x_v, y_v) of the vista. Each candidate view c_v is evaluated using the following two criteria:
First, the angle A subtended by line a of Figure 31 with the object point (x_w, y_w, z_w), computed using the cosine formula:

A = arccos( (b² + c² − a²) / (2bc) )    (16)

See, for example, R. Courant and D. Hilbert, Methods of Mathematical Physics, volume 1, New York: Interscience Publishers, first English edition, 1953.

Second, the distance of the object point (x_w, y_w, z_w) from the camera window coordinate (x_c, y_c), which is the depth value d_c(x_c, y_c).

The evaluation criterion e_cv for each candidate view c_v is:

e_cv = f( A, B · d_c(x_c, y_c) ), where B is a small number    (17)
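A condensed, illustrative Python rendering of the per-pixel steps of equations (14)-(17) follows. It assumes the 4 × 4 view-to-world transform M_v, per-camera world-to-camera transforms, depth maps and images are given; the constants, the particular form chosen for f in equation (17), and the geometric reading of the angle A are assumptions for illustration only, not the preferred implementation.

```python
import numpy as np

B, DEPTH_EPS, BACKGROUND = 0.01, 1e-3, 255   # illustrative values only

def subtended_angle(obj, cam_center, vista_center):
    # Angle A of equation (16) at the object point, by the law of cosines.
    a = np.linalg.norm(np.subtract(cam_center, vista_center))
    b = np.linalg.norm(np.subtract(obj, vista_center))
    c = np.linalg.norm(np.subtract(obj, cam_center))
    return np.arccos(np.clip((b * b + c * c - a * a) / (2 * b * c), -1.0, 1.0))

def composite_pixel(xv, yv, dv, M_v, vista_center, cameras):
    """cameras: list of dicts with keys 'M_c', 'depth', 'image', 'center'."""
    w = M_v @ np.array([xv, yv, dv, 1.0])                 # equation (14)
    world = w[:3] / w[3]
    best = None
    for cam in cameras:
        p = cam['M_c'] @ np.append(world, 1.0)            # equation (15)
        xc, yc, zc = p[:3] / p[3]
        ix, iy = int(round(xc)), int(round(yc))
        if not (0 <= iy < cam['depth'].shape[0] and 0 <= ix < cam['depth'].shape[1]):
            continue
        if zc > cam['depth'][iy, ix] + DEPTH_EPS:         # occluded in this view
            continue
        e = subtended_angle(world, cam['center'], vista_center) \
            + B * cam['depth'][iy, ix]                    # one choice of f, eq. (17)
        if best is None or e < best[0]:
            best = (e, cam['image'][iy, ix])
    return best[1] if best else BACKGROUND
```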
14.2 Immersive Video Prototype and Results
Our Immersive Video prototype is built on top of our MPI-Video system. See S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Kuramura, and S. Moezzi, Modeling and Interactivity in MPI-Video, Technical Report VCL-94-104, Visual Computing Laboratory, UCSD, Dec. 1994; and A. Katkere, S. Moezzi, and R. Jain, Global Multi-Perspective Perception for Autonomous Mobile Robots, Technical Report VCL-95-101, Visual Computing Laboratory, UCSD, 1995.
People in the scene are detected and modeled as cylinders in our current implementation. For our experiments, a one minute long scene was digitized, at 6 frames/sec, from a half hour recording of four video cameras overlooking a typical campus scene. The digitized scene covers three pedestrians, a vehicle, and two bicyclists moving between coverage zones. Figure 22 shows the
relative placements of all four cameras. Frames from four cameras (for the same arbitrary time instant, 00:21:08:02) are shown in Figure 27. The scene contains three walkers. Note that though the zones of coverage have significant overlaps, they are not identical, thus effectively increasing the overall zone being covered.
Some of the vistas generated by the prototype immersive video system of the present invention are shown in Figures 29a and 29b. White portions represent areas not covered by any camera. Note how each of the perspectives shown is completely different from any of the four original camera views.
Figure 29b illustrates how photo-realistic video images are generated by the system for a given viewpoint, in this case a ground level view overlooking the scene entrance. This view was generated by the prototype immersive video system using the comprehensive 3D model built by the MPI-Video modeling system and employing Algorithm 1 for the corresponding video frames shown in Figure 28. Note that this perspective is entirely different from the original views. A panoramic view of the same scene may also be produced. A bird's eye view of the walkway for the same time instant is shown in Figure 29a. Again, white portions represent areas not covered by any camera. Note the alignment of the circular arc. Images from all four cameras contributed towards the construction of the views.
Figure 28 also illustrates immersive abilities of the immersive video technology of the present invention by presenting selected frames from a 116-frame sequence generated for a walk through the entire courtyard. The walk-through sequence illustrates how an event can be viewed from any perspective, while taking into account true object bearings and occlusions.
14.3 Discussion on the Representations
In this section 14, the concept for Immersive Video for rendition of live video from multiple perspectives has been described, and key aspects of the prototype system are described and shown. Although the system is at an early stage, it has been illustrated that immersive video can be achieved using today's technology and that photo-realistic video from arbitrary perspectives can be generated given appropriate camera coverage.
One of the limitations of the immersive video system, highlighted in closeups of people, is the simplistic modeling of dynamic objects (as bounding cylinders). While this simplification permitted development of a complete and fairly functional prototype, such quirks should be, can be, and will be removed to achieve a greater degree of immersion. Towards this end, objects should be modeled more accurately. Two ways of achieving this are contemplated: detecting objects using predicted contours (Kalman snakes) and integrating these contours across perspectives, and using voxel-based integration. See D. Terzopoulos and R. Szeliski, Tracking with Kalman snakes, in A. Blake and A. Yuille, editors, Active Vision, pages 3-20, MIT Press, Cambridge, Mass., 1992. See also D. Koller, J. Weber, and J. Malik, Robust Multiple Car Tracking with Occlusion Reasoning, Proc. 3rd European Conference on Computer Vision, pages 189-96, Stockholm, Sweden, May 1994, Springer-Verlag.
In the next section, how better object models can be built using voxels, and how this will improve the building of virtual vistas, is briefly described.
14.4 Voxel-Based Object Models
Voxels (or Spatial Occupancy Enumeration Cells) -- which are cells on a three-dimensional grid representing spatial occupancy -- provide one way of building accurate and tight object models. See J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice, Addison-Wesley Publishing Company, Inc., second edition, 1990.
Using techniques to determine occupancy of the voxels, the immersive video system of the present invention builds an accurate three-dimensional model of the environment. An a priori static model (which occupies the majority of filled space) is used to determine the default occupancy of the voxels. To build the dynamic model, the occupancy of only those voxels whose state could have changed from the previous time instant is continuously determined. Using higher level knowledge, and information from prior processing, this computation may be, and preferably is, restricted to expected locations of dynamic objects.
The set of points that denote motion in an image can be computed using Algorithm 2 shown in Figure 32. Algorithm 2 is the voxel-construction-and-visualization-for-moving-objects algorithm.
This set subtends a portion of three-dimensional space where motion might have occurred. Figure 23 and the diagrammatic portion of Figure 31 illustrate the viewing frustums that define this space. Treating the voxels as an accumulative array to hold positive and negative evidence of occupancy, the positive evidence of occupancy for this subtended space can be increased. Similarly, the space not subtended by motion points contributes to the negative evidence. Assuming synchronized video streams, this information is accumulated over multiple perspectives (as shown in Figure 23 and the diagrammatic portion of Figure 31). A suitably selected threshold will separate voxels that receive positive support from multiple perspectives. Such voxels, with a high probability, represent dynamic objects. Algorithm 2 of Figure 32 shows the exact steps involved in this process.
Voxels are generated by integrating motion information across the four frames of Figure 27. The physical dimension of each voxel is 8 dm³, or 2×2×2 dm. Comparing this with the cylindrical approximations of the MPI-Video modeling system, it is evident that more realistic virtual vistas can be created with voxels. Close contour approximations like Kalman snakes can also be used to achieve similar improvements.
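The evidence-accumulation idea behind Algorithm 2 can be sketched as follows, assuming synchronized motion masks from each camera and a precomputed voxel-to-pixel projection table (see the look-up-table optimization discussed in section 14.4.1 below). The threshold and all names are illustrative assumptions, not taken from the patent's Algorithm 2.

```python
def update_voxel_occupancy(candidate_voxels, motion_masks, voxel_to_pixel,
                           threshold=2):
    """Return the set of voxels supported as dynamic by multiple perspectives.

    candidate_voxels : voxels whose state could have changed this instant
    motion_masks     : list of boolean 2-D arrays, one per synchronized camera
    voxel_to_pixel   : dict (voxel, cam index) -> (row, col), or None if unseen
    """
    evidence = {v: 0 for v in candidate_voxels}
    for v in candidate_voxels:
        for cam, mask in enumerate(motion_masks):
            pix = voxel_to_pixel.get((v, cam))
            if pix is None:
                continue                     # voxel outside this camera's view
            r, c = pix
            evidence[v] += 1 if mask[r, c] else -1   # positive/negative support
    return {v for v, e in evidence.items() if e >= threshold}
```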
14.4.1 Discussion on Computational and Storage Efficiency of Voxels
Voxels have been traditionally vilified for their extreme computing and storage requirements. To even completely fill a relatively small area like the courtyard used in the prototype system, some 14.4 million 1 dm³ voxels are needed. With the recent and ongoing advances in storage and computing, this discussion may be moot. High speed, multi-processor desk-top machines with enormous amounts of RAM and secondary storage have arrived (e.g., high-end desk-top computers from SGI). However, for efficiency considerations and elegance, it is herein discussed how storage and computing requirements can greatly be reduced using certain assumptions and optimizations.
One basic assumption is that motion is restricted to a small subset of the total three-dimensional space and the static portion of the world is known a priori. Hence a combination with an efficient geometry-based representation, like the Inventor format, can be used. See J. Wernecke, The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, Release 2, Addison-Wesley Publishing Company, 1994. Given that a three-dimensional structure can be derived out of such a format, it is then necessary to model just the dynamic portions using voxels.
Next, several assumptions are made about the dynamic objects:
First, the dynamic objects are assumed to be limited in their vertical extent. E.g., in the prototype immersive video system, all dynamic objects are in the range of 10-20 dm in height.
Second, bounds are put on where the objects may be at the current time instant based on prior state, tracking information,
assumptions about surfaces of motion, etc.
The former assumption reduces the number of voxels by limiting the vertical dimension. Using the latter assumption, voxels are dynamically allocated to certain limited regions in the environment, and it is assumed that the remaining space retains the characteristics of the a priori static model. With this assumption, the number of voxels becomes a function of the number of expected dynamic objects instead of being a function of the total modeled space. While making these assumptions, and using two representations, slightly complicates spatial reasoning, the complexity in terms of storage and computation is greatly reduced.
In addition, to reduce the computational complexity of Algorithm 2, it is preferred to build look-up tables a priori to store the projection of each voxel on each camera. Since the relationship between each camera and the world is accurately known, this is a valid optimization.
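Such a table might be built once, before any video is processed, roughly as sketched below; project() stands in for the composition of the fixed world-to-camera transformation and the camera's internal projection, and returns None when a voxel falls outside the image. All names are assumptions for illustration.

```python
def build_projection_lut(voxel_centres, cameras, project):
    """Precompute the pixel projection of every voxel in every fixed camera.

    voxel_centres : dict voxel id -> (x, y, z) world position of the voxel centre
    cameras       : list of calibrated camera models
    Returns a dict (voxel id, camera index) -> (row, col) or None if off-image.
    """
    lut = {}
    for v, centre in voxel_centres.items():
        for i, cam in enumerate(cameras):
            lut[(v, i)] = project(cam, centre)   # computed once, reused every frame
    return lut
```

A table of this kind can supply the voxel_to_pixel argument assumed in the Algorithm 2 sketch above.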
15. Immersive Video / MPI-Video Prototype Implementation
This section provides some details on our MPI-Video prototype system used in the creation of the "virtual views" discussed in section 14.
Figure 15 shows the hardware configuration of the prototype immersive video system incorporating MPI-Video. The preferred setup consists of several independent heterogeneous computers. Ideally, one work station is used to process data from a single camera, preferably a Model 10 or 20 work station available from Sun. However, using a socket-based protocol, multiple video processing modules can run on a reduced number of work stations (down to a single work station). In addition, a central (master) graphics work station (an SGI Indigo2, Indy or Challenge) controls these video processing work stations (slaves) and maintains the Environment Model (and associated temporal database). The central master and the remote slaves communicate at a high symbolic level and minimal image information is exchanged. For instance, as will be discussed further below, object bounding box information is sent from the slaves to the master. Thus, actual image data need not be exchanged, resulting in a very low required network bandwidth for master-slave communication. The work stations in the prototype system are connected on a 120 Mbps Ethernet switch which guarantees full-speed point-to-point connection.
The master-slave information exchange protocol is as follows: First, the master initializes graphics, the database and the Environment Model (EM) , and waits on a pre-specified port.
Second, based on its knowledge of the network, machine throughput, etc., a separate process starts the slave processes on selected remote machines.
Third, each slave contacts the master (using a pre-specified machine-port combination) and an initialization hand-shaking protocol ensues.
Fourth, the master acknowledges each slave and sends it initialization information, e.g., where the images are actually stored (for the laboratory case), the starting frame and frame interval, and camera-specific image-processing information like thresholds, masks, etc.
Fifth, each slave initializes itself based on the information sent by the master.
Sixth, once the initialization is completed, the master processes individual cameras as described in the next steps.
Seventh, whenever a frame from a specific camera needs to be processed, the master sends a request to that particular slave with information about processing the frame, viz., focus of attention windows, frame-specific thresholds and other parameters, current and expected locations and identifications of moving objects, etc., and continues its processing (modeling and user interaction). (The focus of attention is essentially a region of interest in the image specifying where the visual processing algorithms should concentrate their action.) In synchronous mode, requests to all slaves are sent simultaneously and the integration is done after all slaves have responded. In asynchronous mode, this will not necessarily proceed in unison.
Eighth, when a reply is received, the frame information is used to update the Environment Model (EM). A schematic sketch of this message exchange is given below. The following subsections present more detail on the individual components of our MPI-Video architecture. Virtual view synthesis is discussed in greater detail below.
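The following is a highly simplified sketch, with assumed message fields, of how such a symbolic master/slave exchange could be realized over TCP sockets using length-prefixed JSON messages; detect_objects() is a hypothetical stand-in for the slave's per-frame video processing. Only symbolic results (bounding boxes), never raw images, are returned to the master, consistent with the low-bandwidth design described above.

```python
import json
import socket
import struct

def send_msg(sock, obj):
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_msg(sock):
    header = b""
    while len(header) < 4:
        header += sock.recv(4 - len(header))
    n, buf = struct.unpack("!I", header)[0], b""
    while len(buf) < n:
        buf += sock.recv(n - len(buf))
    return json.loads(buf)

def slave_loop(master_host, master_port, camera_id, detect_objects):
    """Connect to the master, hand-shake, then answer per-frame requests."""
    sock = socket.create_connection((master_host, master_port))
    send_msg(sock, {"type": "hello", "camera": camera_id})
    init = recv_msg(sock)   # step four: storage paths, start frame, thresholds, masks
    request = recv_msg(sock)
    while request.get("type") != "stop":
        boxes = detect_objects(request["frame"], request.get("focus"))
        send_msg(sock, {"type": "result", "frame": request["frame"],
                        "boxes": boxes})
        request = recv_msg(sock)
    sock.close()
```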
16. Conclusions
Immersive Video so far presented has used multi-perspective video and a priori maps to construct three-dimensional models that can be used in interaction and immersion for diverse virtual world applications. One of these applications is real-time virtual video, or virtual television, or telepresence -- next discussed in the following section. Various ways of presenting virtual video information have been discussed. Selection of the best view, creation of visually realistic virtual views, and interactive querying of the model have also been discussed. The actual implementation of an immersive video system presented shows that construction of video-based immersive environments is feasible and viable. The goal of the initial prototype immersive video system was not only to build a complete working system, but also to build a test-bed for the continuing development of more complicated and refined algorithms and techniques yet to be developed and tested. Towards this end, simple analysis and modeling techniques were used. Future work includes making these more sophisticated so that truly immersive environments can be constructed and used.
17. Immersive Telepresence
Immersive telepresence, or visual reality, is an immersive, interactive and realistic real-time rendition of real-world events captured by multiple video cameras placed at different locations in the environment. It is the real-time rendition of virtual video: "virtual television" instead of just "virtual video".
Unlike virtual reality, which is synthesized using graphical primitives, visual reality provides total immersion in live events. For example, a viewer can elect to watch a live broadcast of a football or soccer game from anywhere in the field. As with immersive video, immersive telepresence is based on and incorporates the Multiple Perspective Interactive Video (MPI-Video) infrastructure for processing video data from multiple perspectives. In this section the particular adaptations of immersive video/MPI-Video for the implementation of immersive telepresence, or just plain "telepresence", are discussed. It is particularly shown and discussed how immersive telepresence may become an integral part of future television.
Alas, the drawings of this specification, being both (i) static, and (ii) two-dimensional, necessarily give only partial renditions of both (i) dynamic video and (ii) stereoscopy. Exemplary stereoscopic views produced by the immersive video system of the present invention respectively for the left and the right eyes are shown in Figures 14a, 14b and also 15a, 15b. In actual use both images are presented so as to be gated to an associated eye by such well-known virtual reality equipment as the "CrystalEyes" 3D Liquid Crystal Shutter (LCS) technology eyewear available from StereoGraphics Corporation.
It is also impossible to convey in the drawings when something is happening in real time. In some cases the multiple video feeds from a scene that was processed in real time to present telepresence to a user/viewer were also recorded and were then later processed as immersive video. If the processing is the same then, quite obviously, the presentations are also the same. Accordingly, some of the following discussion of exemplary results of immersive telepresence will refer to the same figures as did the discussion of immersive video!
The distinctions of note between immersive telepresence and immersive video are these. First, more computer processing time is clearly available in non-real-time immersive video than in immersive telepresence. This may not be, however, of any great significance. More importantly, with immersive video the scene model may be revised, so as to improve the video renderings on an iterative basis and/or to account for scene occurrences that are unanticipated and not within the modeled space, e.g., the parachutist falling in elevation into the scene of a football game, which motion is totally unlike the anticipated motion of the football players and is not at or near ground level. The scene models used for immersive telepresence have been developed, and validated, for virtual video.
To be processed into immersive telepresence, it is not required that a scene should be "canned", or rote. It is, however, required that the structure of the scene (note that the scene has "structure", and is not a "windy jungle") should be, to a certain extent, pre-processed into a scene model. Therefore, not only does the scene model of a "football game" cover all football games, or the scene model of a "prizefight" cover all prizefights, but a scene model of a "news conference" may be pretty good at capturing the human actors therein, or a scene model of a "terrain scene including freeways from multiple helicopters" may be pretty good at capturing and displaying buildings, vehicles and pedestrians. The former two models are, of course, usable by sports broadcast organizations in the televising of scheduled events. However, the last two models are usable by broadcast news organizations in the televising of events that may be unscheduled.
Competition by software developers in the development, and licensing, of scene models may arise. A television broadcaster able to obtain multiple television feeds would select and use the telepresence model giving best performance. Ultimately scene models will grow in sophistication, integration, and comprehensiveness, becoming able to do better in presentation, with fewer video feeds, faster.
17.1 The Use of Immersive Telepresence
It is conjectured that telepresence will play a major role in visual communication media. See N. Negroponte, Being Digital, Knopf, New York, 1995. Telepresence is generally understood in the context of virtual reality (VR) with displays of real, remote scenes. This specification and this section instead describe immersive telepresence, being the real-time interactive and realistic rendition of real-world events, i.e., television where the viewer cannot control (does not interact with) what is happening in a real world scene, but can interact with how the scene is viewed.
Jaron Lanier defines Virtual Reality as an immersive, interactive simulation of realistic or imaginary environments. See J. Lanier, Virtual reality: the promise of the future, Interactive Learning International, 8(4):275-9, Oct.-Dec. 1992. The new concept called visual reality is an immersive, interactive and realistic rendition of real-world events simultaneously captured by video cameras placed at different locations in the environment. In contrast with virtual reality, or VR, where one can interact with and view a virtual world, visual reality, or VisR, permits a viewer/user, for example, to watch a live broadcast of a football or soccer game from anywhere in the field, even from the position of the quarterback! Visual reality uses the Multiple Perspective Interactive Video (MPI-Video) infrastructure. See S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Kuramura, and S. Moezzi, Modeling and interactivity in MPI-video, Technical Report VCL-94-104, Visual Computing Lab, UCSD, Dec. 1994.
MPI-Video is a move away from conventional video-based systems which permit users only a limited amount of control and insight into the data. Traditional systems provide a sparse set of actions such as fast-forward, rewind and play of stored information. No provision for automatic analysis and management of the raw video data is available.
Visual Reality involves manipulating, processing and compositing of video data, a research area that has received increasing attention. For example, there is a growing interest in generating a mosaic from a video sequence. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, Real-time scene stabilization and mosaic construction, in ARPA Image Understanding Workshop, Monterey, CA, Nov. 13-16, 1994. See also H. Sawhney, Motion video annotation and analysis: An overview, in Proc. 27th Asilomar Conference on Signals, Systems and Computers, pages 85-89, IEEE, Nov. 1993.
The underlying task in video mosaicing is to create larger images from frames obtained as a video stream. Video mosaicing has numerous applications including data compression and video enhancement. See M. Irani and S. Peleg, Motion analysis for image enhancement: resolution, occlusion, and transparency, in J. of Visual Communication and Image Representation, 4(4):324-35, Dec. 1993. See also R. Szeliski, Image mosaicing for tele-reality applications, in Proc. of Workshop on Applications of Computer Vision, pages 44-53, Sarasota, FL, Dec. 1994, IEEE, IEEE Comput. Soc. Press. Still further applications include high-definition television, digital libraries, etc.
To generate video mosaics, registration and alignment of the frames from a sequence are critical issues. Simple, yet robust techniques have been suggested to alleviate this problem using multi-resolution area-based schemes. See M. Hansen, P. Anandan, K. Dana, G. van der Wal, et al., Real-time scene stabilization and mosaic construction, in Proc. of Workshop on Applications of Computer Vision, pages 54-62, Sarasota, FL, Dec. 1994, IEEE, IEEE Comput. Soc. Press. For scenes containing dynamic objects, parallax has been used to extract dominant 2-D and 3-D motions, which were then used in registration of the frames and generation of the video mosaic. See H. Sawhney, S. Ayer, and M. Gorkani, Model-based 2D and 3D dominant motion estimation for mosaicing and video representation, Technical Report, IBM Almaden Res. Ctr., 1994.
For multiple moving objects in a scene, motion layers have been introduced where each dynamic object is assumed to move in a plane parallel to the camera. See J. Wang and E. Adelson. Representing moving images with layers. IEEE Transacti ons on Image Processing, 3(4) :625-38, Sept. 1994. This permits segmentation of the video into different components each containing a dynamic object, which components can then be interpreted and/or re- synthesized as a video stream.
However, for immersive telepresence there is a need to generate a comprehensive 3-D mosaic that can handle multiple dynamic objects as well. The name affixed to this process is "hyper-mosaicing". In order to obtain a 3-D description, multiple perspectives that provide simultaneous coverage must be used, and their associated visual information must be integrated. Another necessary feature is to provide a selected viewpoint. Visual reality satisfies all these requirements.
These issues, and a description of a prototype visual reality system, are contained in the following sections. Section 17.2 recapitulates the concepts of MPI-Video as especially applied to VisR. Section 17.3 provides implementation details and presents results for the same campus walkway covered by multiple video cameras -- only this time as television in real time as opposed to non-real-time video. Future directions for VisR are outlined in section 17.4.
17.2 Visual Reality using Multi-Perspective Videos
Visual Reality requires sophisticated vision processing, as well as modeling and visualization. Some of these are readily available under MPI-Video. See S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Kuramura, and S. Moezzi, Modeling and interactivity in MPI-video, Technical Report VCL-94-104, Visual Computing Lab, UCSD, Dec. 1994. MPI-Video is a framework for management and interactive access to multiple streams of video data capturing different perspectives of related events. It involves automatic or semi-automatic extraction of content from the data streams, modeling of the scene observed by these video streams, and management of raw, derived and associated data. These video data streams can reflect different views of events such as movements of people and vehicles. In addition, MPI-Video also facilitates access to raw and derived data through a sophisticated hypermedia and query interface. Thus a user, or an automated system, can query about objects and events in the scene, follow a specified object as it moves between zones of camera coverage and select from multiple views. A schematic showing multiple camera coverage typical in a MPI-Video analysis was shown in Figure 22a.
For a true immersive experience, a viewer should be able to view the events from anywhere. To achieve this, vistas composed from appropriate video streams must be made available. Generating these vistas requires a comprehensive three-dimensional model that represents events captured from these multiple perspective videos. Given multiple 'static' views, it is theoretically possible to extract this 3-D model using low-level vision algorithms, e.g., shape-from-X methods. However, it is widely accepted that current methods make certain assumptions that cannot be met and that are, in general, non-robust. For environments that are mostly static, a priori information, e.g., a CSG/CAD model of the scene, can be used in conjunction with camera information to bypass the extraction of the static portions and to reduce the complexity of processing the dynamic portions. This is analogous to extracting the optical flow in only the portions of the scene where brightness changes are expected due to motion (flow discontinuities). This is exploited in the present implementation of Visual Reality (VisR) to create realistic models.
While in virtual reality (VR) texture mapping is used to create realistic replicas of both static and dynamic components, in visual reality (VisR), distinctively, actual video streams are used. Ideally, exact ambiance will always be reflected in the rendition, i.e., purely two dimensional image changes are also captured. For example, in VisR a viewer is able to move around a football stadium and watch the spectators from anywhere in the field and see them waving, moving, etc.
17.3 Approach and Results
The current prototype immersive telepresence system is used in conjunction with multiple actual video feeds of a real-world scene to compose vistas of this scene. Experimental results obtained for a campus scene show how an interactive viewer can 'walk through' this dynamic, live environment as it exists in real time (e.g., as seen through television).
17.3.1 Building a Comprehensive, Dynamic 3-D Model
Any comprehensive three-dimensional model consists of static and dynamic components. For the static model, a priori information, e.g., a CAD model, about the environment is used. The model views are then registered with the cameras. Accurate camera calibration plays a significant role in this. For the dynamic model, it is necessary to (i) detect the objects in the images from different views, (ii) position them in 3-D using calibration information, (iii) associate them across multiple perspectives, and (iv) obtain their 3-D shape characteristics. These issues, hereinafter next described, are also accorded explanation in the technical report by S. Chatterjee, R. Jain, A. Katkere, P. Kelly, D. Kuramura, and S. Moezzi titled Modeling and interactivity in MPI-video, Technical Report VCL-94-104, Visual Computing Lab, UCSD, Dec. 1994. See also A. Katkere, S. Moezzi, and R. Jain, Global multi-perspective perception for autonomous mobile robots, Technical Report VCL-95-101, Visual Computing Laboratory, UCSD, 1995.
It is widely accepted that if a 3-D model of the scene is available, then many of the low-level processing tasks can be simplified. See Y. Roth and R. Jain, Simulation and expectation in sensor-based systems, International Journal of Pattern Recognition and Artificial Intelligence, 7(1):145-73, Feb. 1993. For example, associating images taken at different times or from different views becomes easier if one has some knowledge about the 3-D scene points and the camera calibration parameters (both internal and external). In VisR this is exploited -- as it was in immersive video -- to simplify vision tasks, e.g., segmentation, etc. (model-based vision).
In the approach of the present invention, cameras are assumed to be calibrated a priori. Using pre-computed camera coverage tables, 2-D observations are mapped into 3-D model space and 3-D expectations into 2-D image space. Note the bi-directional operation.
For the prototype VisR system, a complete, geometric 3-D model of a campus scene was built using architectural map data. At a basic level, the VisR system must and does extract information from all the video streams, reconciling extracted information with the 3-D model. As such, a scene representation was chosen which facilitates maintenance of an object's location and shape information.
In the preferred VisR, or telepresence, system, object information is stored as a combination of voxel representation, grid-map representation and object-location representation. Note the somewhat lavish use of information. The systems of the present invention are generally compute limited, and are generally not limited in storage. Consider also that more and faster storage may be primarily a matter of expending more money, but there is a limit to how fast the computers can compute no matter how much money is expended. Accordingly, it is generally better to maintain an information-rich texture from which the computer(s) can quickly recognize and maintain scene objects than to use a more parsimonious data representation at the expense of greater computational requirements.
For each view, the prototype VisR, or telepresence, system is able to compute the 3-D position of each dynamic object detected by a motion segmentation module in real time. A priori information about the scene and camera calibration parameters, coupled with the assumption that all dynamic objects move on planar surfaces, permits object detection and localization. Note the similarity in constraints on object motion(s), and the use of a priori information, to immersive video. Using projective geometry, necessary positional information is extracted from each view. The extracted information is then assimilated and stored in a 2-D grid representing the viewing area.
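The planar-surface localization just described can be illustrated with the following sketch, in which the viewing ray of an image observation is intersected with an assumed ground plane z = 0 using the camera's 3 × 4 projection matrix P obtained from calibration; the function name and the example camera are illustrative only.

```python
import numpy as np

def image_to_ground(P, u, v):
    """Map image point (u, v) to its (x, y) world location on the plane z = 0."""
    H = P[:, [0, 1, 3]]                  # plane-to-image homography (z = 0 column dropped)
    ray = np.linalg.solve(H, np.array([u, v, 1.0]))
    return ray[:2] / ray[2]              # inhomogeneous ground-plane coordinates

# Example with a trivial camera looking straight down from height 10:
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 10.0]])
print(image_to_ground(P, 2.0, 3.0))      # -> [20. 30.]
```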
17.3.1.1 Dynamic Objects
While more sophisticated detection, recognition and tracking algorithms are still susceptible of development and application, the initial prototype VisR, or immersive telepresence, system uses simple yet robust motion detection and tracking. Connected components labelling is used on the difference images to detect
moving objects. This also initializes/updates a tracker which exchanges information with a global tracker that maintains state information of all the moving objects.
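A sketch of this simple motion detection step is given below: the difference against a reference frame is thresholded and connected components large enough to be plausible moving objects are kept. The threshold and minimum-size values are illustrative, not taken from the preferred system.

```python
import numpy as np
from scipy import ndimage

def detect_moving_objects(frame, reference, diff_thresh=25, min_pixels=50):
    """Return bounding boxes (row slice, col slice) of detected moving regions."""
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    labels, count = ndimage.label(diff > diff_thresh)       # connected components
    boxes = []
    for i, box in enumerate(ndimage.find_objects(labels), start=1):
        if np.count_nonzero(labels[box] == i) >= min_pixels: # drop small noise blobs
            boxes.append(box)
    return boxes
```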
Even though instantaneous 3-D shape information is not currently processed due to lack of computation power, it is an option under development. See A. Baumberg and D. Hogg, An efficient method for contour tracking using active shape models, in Proc. Workshop on Motion of Non-rigid and Articulated Objects, pages 194-9, Austin, TX, Nov. 1994, IEEE, Comput. Soc. Press. Video processing is simplified by "focus of attention rectangles" and pre-computed static mask images delineating portions of a camera view that cannot possibly have any interesting motion. The computation of the former is done using current locations of the object hypotheses in each view and projected locations in the next view. The latter is created by painting out areas of each view not on the planar surface (walls, for example).
17.3.2 Vista Compositing
Given the comprehensive model of the environment and accurate external and internal camera calibration information, compositing new vistas at the view-port is simply a number of transformations between the model (or world) coordinate system (x_w, y_w, z_w), the coordinate system of the cameras (x_c, y_c, z_c) and the view-port coordinate system (x_v, y_v, z_v). Each pixel (on the composited display) is projected onto the world coordinate system. The world point is then projected onto each of the camera image planes and tested for occlusion from that view. Given all such un-occluded points (and their intensity values), the following selection criterion is used: the pixel value for the point which subtends the smallest angle with respect to the vista and is closest to the viewing position is used in the rendition. This is then repeated for every time instant (or every frame), assuming a stationary view-port. To generate a "fly-by" sequence this is repeated for every position of the view-port in the world coordinates. Note that this also makes the task of handling sudden zonal illumination changes ("spotlight effects") easier. Algorithm 1 shown in Figure 31 outlines the steps involved. Note that the generation of panoramic views from any view-port is a by-product with a suitable selection of camera parameters (angle of view, depth of field, etc.).
17.3.3 Visual Reality Prototype and Results
The prototype application of the immersive telepresence system of the present invention involved the same campus scene (actually, a courtyard) as was used for the immersive video. The scene was covered by four cameras at different locations. Figure 22a shows the model schematic (of the environment) along with the camera positions. Note that though the zones of camera coverage have significant overlaps, they are not identical, thus effectively increasing the overall zone being covered.
To illustrate the compositing effect, cameras with different physical characteristics were used. To study the dynamic objects, people were allowed to saunter through the scene. Although in the current version no articulated motion analysis is incorporated, work is underway to integrate such and other higher-order behaviors. See S. Niyogi and E. Adelson, Analyzing gait with spatio-temporal surfaces, in Proc. of Workshop on Motion of Non-Rigid and Articulated Objects, pages 64-9, Austin, TX, Nov. 1994, IEEE, Comput. Soc. Press.
As previously discussed, Figure 27 shows corresponding frames from four views of the courtyard with three people walking. The model view of the scene is overlaid on each image. Figure 28 shows some "snapshots" from a 116-frame sequence generated for a "walk through" the entire courtyard. People in the scene are detected and modeled as cylinders in our current implementation, as shown in Figure 29. The "walk" sequence illustrates how an event can be viewed from anywhere, while taking into account true object bearings and pertinent shadows. Also as previously discussed, Figure 29a shows a ground level view of the scene, and Figure 29b a bird's eye view from the top of the scene. Each view is without correspondence to any view within any of the video streams.
17.4 Conclusions and Future Work
The prototype VisR system serves to render live video from multiple perspectives. This provides a true immersive telepresence with simple processing modules. The incorporation of more sophisticated vision modules, e.g., detecting objects using predicted contours (Kalman snakes) , distributed processing of the video streams, etc., is expected in the future.
In the prototype system each of the cameras is assumed to be fixed with respect to the static environment. An incorporation of camera panning and zooming into the model is expected to be useful in representing sporting events. To date no problems with camera jitter, frame dropouts etc. have been encountered in the prototype system. However, if the frame digitalizations are synchronized, then any such occurrences as these can be handled quite
efficiently.
Given the nature of events transpiring in the scene, and the simplified processing transpiring, digitalization in the prototype system was set at 6 frames/second. This can be easily made adaptive for each individual camera.
The next generation of television is anticipated to contain features of VisR, although a great deal of work remains in either reducing or meeting some of the stringent computational and memory demands. See N. Negroponte, Being Digital, Knopf, New York, 1995.
18. Immersive Video/Television At the Present Time, or How to Use Five Hundred Television Channels Beneficially
The diverse sophisticated video presentations discussed in this specification are so discussed in the necessarily formative terminology of the present time, when not enough people have seen the effects of these video presentations so as to give them the popular names that they will, no doubt, ultimately assume. Moreover, the showing within this specification of examples of these video presentations is limited to drawings that are both (i) static and (ii) two-dimensional (and, as will be explained, are of scenes intentionally rendered sufficiently crudely so that certain effects can be observed). According to the limits of description and of illustration, it is perhaps difficult for the reader to know what is reality and what is "hype", and what can be done right now (circa 1995) versus what is likely coming in the future world of video and television. The inventors endeavor to be candid, and blunt, while acknowledging that they cannot perfectly foresee the future.
Immersive video may be divided into real-time applications, i.e., immersive television, and all other, non-real-time, applications where there is, mercifully, more time to process video of a scene.
Both applications are presently developed to a usable, and arguably a practically usable, state. Each application is, however, perceived to have a separate development and migration path, roughly similarly as video and television entertainments constitute a separate market from computer games and interactive computerized tutors at the present time.
18.1 Monitoring Live Events in Real Time or Near Real Time
With high-speed video digitizers, an immersive video system based on a single engineering workstation class computer can, at the present time, process and monitor (being two separate things) the video of live events in real time or near real time.
Such a system can, for example, perform the function of a "television sports director" -- at least so far as a "video sports director" focused on limited criteria -- reasonably well. The immersive video "sports director" would, for example, be an aid to the human sports director, who would control the live television primary feed of a televised sporting event such as a football game. The immersive video "sports director" might be tasked, for example, to "follow the football". This view could go out constantly upon a separate television channel.
Upon incipient use of an immersive video system so applied, however, the view would normally only be accessed upon selected occasions such as, for example, an "instant replay". The synthesized virtual view is immediately ready, without any such delay as normally presently occurs while humans figure out what camera or cameras really did show the best view(s) of a football play, upon the occasion of an instant replay. For example, the synthesized view generally presenting the "football" at center screen can be ordered. If a particular defensive back made a tackle, then his movements throughout the play may be of interest. In that case a sideline view, or helmet view, of this defensive back can be ordered.
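How a synthesized view might be kept centered on a tracked object such as the football can be sketched very simply: given the ball's position in the world model and a chosen synthetic camera position (a sideline or overhead vantage, for example), a viewing rotation is computed that looks directly at the ball. The following Python fragment is an illustrative sketch only, with assumed names and coordinates, and is not the appendix code:

    # Hedged sketch: orient a synthetic camera at a tracked object (e.g. the
    # football) so the object appears at the center of the synthesized view.
    # Positions are world coordinates taken from the environment model.
    import math

    def look_at(camera_pos, target_pos, up=(0.0, 0.0, 1.0)):
        """Return a 3x3 rotation (rows: right, true up, -forward) for a camera
        at camera_pos looking toward target_pos."""
        def sub(a, b): return tuple(x - y for x, y in zip(a, b))
        def norm(v):
            length = math.sqrt(sum(x * x for x in v))
            return tuple(x / length for x in v)
        def cross(a, b):
            return (a[1] * b[2] - a[2] * b[1],
                    a[2] * b[0] - a[0] * b[2],
                    a[0] * b[1] - a[1] * b[0])
        forward = norm(sub(target_pos, camera_pos))
        right = norm(cross(forward, up))
        true_up = cross(right, forward)
        return (right, true_up, tuple(-x for x in forward))

    # e.g. a "sideline view" from 20 meters to the side of the tracked ball:
    ball = (30.0, 26.5, 0.2)
    camera = (30.0, 6.5, 2.0)
    print(look_at(camera, ball))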
With multiple computers, multiple video views can be simultaneously synthesized, each transmission upon a separate television channel. Certain channels would be devoted to views of certain players, etc.
As the performance of computer hardware and communication links increases, it may ultimately be possible to have television views on demand.
Another presently-realizable real-time application is security, as at, for example, airports. An immersive video system can be directed to synthesize and deliver up "heads-up facial view" images of people in a crowd, one after the next and continuously, as and when camera angle(s) permit the capture/synthesis of a quality image. Alternatively, the immersive system can image, re-image and synthetically image anything that its classification stage suspects to be a "firearm". Finally, just as the environment model of a football game expects the players to move but the field to remain fixed, the environment model of a secured area can expect the human actors therein to move but the moveable physical property (inventory) to remain fixed or relatively fixed, and not to merge into the human images as might be the case if the property was being concealed for purposes of theft.
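The "inventory stays put" expectation just described lends itself to a very small rule. The sketch below is illustrative Python, not part of the invention's source code; it assumes only that the environment model reports the expected position of each item of fixed property, which items are currently seen at those positions, and the tracked ground-plane position of each person:

    # Hedged sketch of the inventory rule: flag any fixed item that is no
    # longer seen at its expected position while a tracked person stands
    # within a small radius of that position.
    def theft_alerts(inventory, visible_items, people, merge_radius=1.0):
        """inventory:     {item_id: (x, y)} expected positions of fixed property
           visible_items: set of item_ids currently seen at their positions
           people:        {person_id: (x, y)} current tracked positions
           Returns (item_id, person_id) pairs worth a closer, synthesized view."""
        alerts = []
        for item_id, (ix, iy) in inventory.items():
            if item_id in visible_items:
                continue                      # still where it belongs
            for person_id, (px, py) in people.items():
                if (px - ix) ** 2 + (py - iy) ** 2 <= merge_radius ** 2:
                    alerts.append((item_id, person_id))
        return alerts

    print(theft_alerts({"case-7": (3.0, 4.0)}, set(),
                       {"p1": (3.2, 4.1), "p2": (9.0, 1.0)}))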
It will be understood that the essence of an immersive video system is image synthesis and presentation, and not image classification. However, by "forming up" images from desired optimal vantage points, and by operating under an environment model, the immersive video system has good ability (as it should, at its high cost) to permit existing computer image classification programs to successfully recognize deviations -- objects in the scene or events in the scene. Although human judgment as to what is being represented, and "seen", by the system may ultimately be required, the system, as a machine, is tireless and continuously regards the world that it views with an "attentiveness" not realizable by humans.
It should further be considered that the three-dimensional database, or world model, within an immersive video system can be the input to three-, as opposed to two-, dimensional classification programs. Human faces (heads) in particular might be matched against stored data representing existing, candidate, human heads in three dimensions. Even when humans regard "mug shots", they prefer both frontal and side views. Machine classification of human facial images, as just one example, is expected to be much improved if, instead of just one video view at an essentially random view angle, video of an entire observed head is available for comparison.
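A crude three-dimensional matching step of the kind contemplated can be sketched as follows. The Python fragment below is purely illustrative: it assumes that an observed head and each stored candidate head have already been reduced to small sets of 3-D surface points in a common alignment (the alignment step itself being omitted), and it scores candidates by a symmetric mean nearest-point distance:

    # Hedged sketch: score an observed 3-D head against stored candidate heads
    # by a symmetric mean nearest-point distance; lower scores indicate a
    # closer match. Real matching would first register (align) the heads.
    def mean_nearest_distance(points_a, points_b):
        def d2(p, q):
            return sum((x - y) ** 2 for x, y in zip(p, q))
        return sum(min(d2(p, q) for q in points_b) ** 0.5
                   for p in points_a) / len(points_a)

    def best_match(observed, candidates):
        """candidates: {name: list of (x, y, z) points}."""
        scores = {name: (mean_nearest_distance(observed, pts) +
                         mean_nearest_distance(pts, observed)) / 2.0
                  for name, pts in candidates.items()}
        return min(scores, key=scores.get), scores

    observed = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.0, 0.1, 0.25)]
    candidates = {"A": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2)],
                  "B": [(1.0, 1.0, 1.0), (1.2, 1.0, 1.3)]}
    print(best_match(observed, candidates))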
The ultimate use of real-time and near-real-time immersive video may in fact be in machine perception as opposed to human entertainment. The challenge of satisfying the military requirement of an autonomous vehicle that navigates in the environment, let alone the environment of a battlefield, is a very great one. The wondrous "visual world view" presented to our brains by our eyes is actually quite limited in acuity, sensitivity, spectral sensitivity, scale, detection of temporal phenomena, etc. However, a human does a much better job of making sense of the environment than does a computer that may actually "see" better, because the human's understanding, or "environmental model", of the real-world environment is much better than that of the computer. Command and control computers should perhaps compensate for the crudity of their environmental models by assimilating more video data inputs derived from more spatial sites. Interestingly, humans, as supported by present-day military computer systems, already recognize the great utility of sharing tactical information on a theater-of-warfare basis. In particular, the Naval Tactical Data System (NTDS) -- now almost forty years old -- permits sharing of the intelligence data developed from many separate sensor platforms (ships, planes, submarines, etc.).
It may be essential that computers that operate autonomously or semi-autonomously during warfare should be allowed to likewise share and assimilate sensor information, particularly including video data, from multiple spatially separated platforms. In other words, although one robot tank seeing a battlefield from just one vantage point, even with binocular vision, may become totally lost, three or four such tanks together sharing information might be able to collectively "make sense" of what is going on. The immersive video system of the present invention is clearly involved with world-, or environment-, level integration of video information taken from spatially separated video sources (cameras), and it would be a mistake to think that the only function of an immersive video system is for the entertainment or education of humans.
An attached appendix contains the computer program source code for realizing immersive video in accordance with the present invention.
18.2 Processing of Video in Non-Real-Time
In parallel with developments in immersive television, the processing of video information -- which is not required to transpire in real time -- and the communication of video information -- which may be by disc or like transportable storage media instead of over land cable or radio frequency links -- may proceed in another direction. Any event or scene that people wish to view with great exactitude, or to interact with realistically (which are not the same thing), can be very extensively "worked up" with considerable computer processing. A complete 3-D database of fine detail can be developed, over time and by computer processing, from historical multiple video feeds of anything from a football game to a stage play or, similarly to the more exotic scenes common in "surround vision" theaters, travel locales and action sequences. When recorded, a scene from the 3-D database can be "played back" at normal, real-time, speeds and in accordance with the particular desires of a particular end viewer/user by use of a computer, normally a personal computer, of much less power than the computer(s) that created the 3-D database. Every man or woman will thus be accorded an aid to his or her imagination, and can, as did the fictional Walter Mitty, enter into any scene and into any event.
For example, one immediate use of immersive video is in the analysis of athlete behaviors. An athlete, athlete in training, or aspiring athlete performs a sports motion such as, for example, a golf swing that is videotaped from multiple, typically three, camera perspectives. A 3-D video model of the swing, which may only be a matter of ten or so seconds, is constructed at leisure, perhaps over some minutes in a personal computer. A student golfer and/or his/her instructor can subsequently play back the swing from any perspective that best suits observation of its salient characteristics, or those of its attributes that are undergoing corrective revision. If two such 3-D models of the same golfer are made, one can be compared against the other for deviations, which may possibly be presented as colored areas or the like on the video screen. If a model of an expert golfer, or a composite of expert golfers, is made, then the swing of the student golfer can be compared in three dimensions to the swing(s) of the expert golfer(s).
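The comparison step for two such swing models can be sketched in a few lines. The Python fragment below is an illustrative assumption, not the invention's code: it presumes the two swings have already been reduced to time-aligned sequences of corresponding 3-D joint positions (hands, club head, and so forth), and it simply reports, frame by frame, the joints whose positions deviate beyond a threshold -- exactly the information that could be tinted as colored areas on the display:

    # Hedged sketch: frame-by-frame deviation between two time-aligned 3-D
    # swing models; joints deviating beyond `threshold` metres are flagged.
    def swing_deviations(swing_a, swing_b, threshold=0.05):
        """swing_a, swing_b: lists of frames, each frame a dict
           {joint_name: (x, y, z)}."""
        flagged = []
        for frame_a, frame_b in zip(swing_a, swing_b):
            bad = {}
            for joint, pa in frame_a.items():
                pb = frame_b.get(joint)
                if pb is None:
                    continue
                dist = sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5
                if dist > threshold:
                    bad[joint] = dist
            flagged.append(bad)
        return flagged

    student = [{"club_head": (0.0, 0.0, 0.1)}, {"club_head": (0.3, 0.1, 0.5)}]
    expert = [{"club_head": (0.0, 0.0, 0.1)}, {"club_head": (0.2, 0.1, 0.4)}]
    print(swing_deviations(student, expert))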
Another use of machine-aided comparison, and content-based retrieval, of video, or video-type, images is in medicine. New generations of Magnetic Resonance Imaging (MRI) sensors are already poised to deliver physiological information in stereoscopic representation, for example as a 3-D model of the patient's brain facilitating the planning of neurosurgery. However, immediate medical applications of immersive video in accordance with the present invention are much more mundane. A primary care physician might, instead of just recording patient height and weight and relying on his or her memory from one patient visit to the next, simply videotape the standing patient's unclothed body from multiple perspectives at periodic intervals, an inexpensive procedure conducted in but a few seconds. Three-dimensional patient views constructed from each session could subsequently be compared to note changes in weight, general appearance, etc.
In the long term, the three-dimensional imaging of video information (which video information need not, however, have been derived from video cameras) as is performed by the immersive video system of the present invention will likely be useful for machine recognition of pathologies. For somewhat the same reasons that it is difficult for the computerized battlefield tank discussed above to find its way around on the battlefield from only a two-dimensional view thereof, a computer is inaccurate in interpreting, for example, x-ray mammograms, because it looks at only a two-dimensional image with deficient understanding of how the light and shadow depicted thereon translate to pathology of the breast. It is not so much that a tumor might be small, but that a small object shown at low contrast and poor visual signal-to-noise ratio is difficult to recognize in two dimensions. It is generally easier to recognize, and to classify, a medical image in three dimensions because most of our bodies and their ailments -- excepting the skin and the retina -- are substantially three-dimensional.
Another use of the same 3-D human images realized with the immersive video system of the present invention would be in video representations of the prospective results of reconstructive or cosmetic (plastic) surgery, or of exercise regimens. The surgeon or trainer would modify the body image, likely by manipulation of the 3-D image database as opposed to 2-D views thereof, much in the manner that any computerized video image is presently edited. The patient/client would be presented with the edited view(s) as being the possible or probable results of surgery, or of exercise.
In accordance with these and other possible variations and adaptations of the present invention, the scope of the invention should be determined in accordance with the following claims, only, and not solely in accordance with that embodiment within which the invention has been taught.

Claims

What is claimed is:
1. A method of presenting a particular two-dimensional video image of a real-world three-dimensional scene to a viewer comprising: imaging in multiple video cameras each at a different spatial location multiple two-dimensional images of a real-world scene each at a different spatial perspective; combining in a computer the multiple two-dimensional images of the scene into a three-dimensional model of the scene; receiving in the computer from a prospective viewer of the scene a viewer-specified criterion relative to which criterion the viewer wishes to view the scene; producing in the computer from the three-dimensional model a particular two-dimensional image of the scene in accordance with the received viewer criterion; and displaying in a video display the particular two-dimensional image of the real-world scene to the viewer.
2. The method according to claim 1 wherein the producing in the computer comprises: selecting from the three-dimensional model a two-dimensional image corresponding to one of the images of the real-world scene that is imaged by one of the multiple video cameras.
3. The method according to claim 1 wherein the producing in the computer comprises: synthesizing from the three-dimensional model a two-dimensional image that is without exact correspondence to any of the images of the real-world scene that are imaged by any of the multiple video cameras.
4. The method according to claim 1 wherein the receiving is of the viewer-specified criterion of a particular spatial perspective, relative to which particular spatial perspective the viewer wishes to view the scene; and wherein the producing in the computer from the three-dimensional model is of a particular two-dimensional image of the scene in accordance with the particular spatial perspective criterion received from the viewer; and wherein the displaying in the video display is of the particular two-dimensional image of the scene that is in accordance with the particular spatial perspective received from the viewer.
5. The method according to claim 4 wherein the producing in the computer comprises: selecting from the three-dimensional model an actual image of the scene as was imaged by a one of the multiple video cameras, this selected image being an actual image of the scene, out of all the actual images of the scene as were imaged by all the multiple video cameras, that is most closely in accordance with the particular spatial perspective criterion received from the viewer.
6. The method according to claim 5 wherein the selecting from the three-dimensional model is, over time, of plural actual images of the scene as are imaged, over time, by plural ones of the multiple video cameras; wherein the computer does not invariably select from the three-dimensional model an image that arises from one only of the multiple video cameras, but instead selects plural images as arise over time from plural ones of the multiple video cameras.
7. The method according to claim 4 wherein the producing in the computer comprises: synthesizing from the three-dimensional model a virtual image that is without correspondence to any of the images of the scene that are imaged by any of the multiple video cameras, this synthesized virtual image being in accordance with the particular spatial perspective criterion received from the viewer.
8. The method according to claim 1 wherein the combining is so as to generate a three-dimensional model of the scene in which model objects in the scene are identified; wherein the receiving is of the viewer-specified criterion of a selected object that the viewer wishes to particularly view within the scene; and wherein the producing in the computer from the three-dimensional model is of a particular two-dimensional image of the selected object in the scene; and wherein the displaying in the video display is of the particular two-dimensional image of the scene showing the viewer-selected object.
9. The method according to claim 8 wherein the viewer-selected object in the scene is static, and unmoving, in the scene.
10. The method according to claim 8 wherein the viewer-selected object in the scene is dynamic, and moving, in the scene.
11. The method according to claim 8 wherein the viewer selects the object that he or she wishes to particularly view in the scene by act of positioning a cursor on the video display, which cursor unambiguously specifies an object in the scene by an association between the object position and the cursor position in three dimensions and is thus a three-dimensional cursor.
12. The method according to claim 1 wherein the combining is so as to generate a three-dimensional model of the scene in which model events in the scene are identified; wherein the receiving is of the viewer-specified criterion of a selected event that the viewer wishes to particularly view within the scene; and wherein the producing in the computer from the three-dimensional model is of a particular two-dimensional image of the selected event in the scene; and wherein the displaying in the video display is of the particular two-dimensional image of the scene showing the viewer-selected event.
13. The method according to claim 12 wherein the viewer selects the event that he or she wishes to particularly view in the scene by act of positioning a cursor on the video display, which cursor unambiguously specifies an event in the scene by an association between the event position and the cursor position in three dimensions and is thus a three-dimensional cursor.
14. The method according to claim 1 performed in real time as television presented to a viewer interactively in accordance with the viewer-specified criterion.
15. A method of synthesizing a virtual video image from real video images obtained by multiple real video cameras, the method comprising: storing in a video image database the real two-dimensional video images of a scene from each of a multiplicity of real video cameras; creating in a computer from the multiplicity of stored two-dimensional video images a three-dimensional video database containing a three-dimensional video image of the scene; and generating a two-dimensional virtual video image of the scene from the three-dimensional video database.
16. The method according to claim 15 wherein the generating comprises: selecting from the three-dimensional video database a two-dimensional virtual video image of the scene that corresponds to a real two-dimensional video image of a scene.
17. The method according to claim 15 wherein the generating comprises: synthesizing from the three-dimensional video database a two-dimensional virtual video image of the scene that is without correspondence to any real two-dimensional video image of a scene.
18. The method according to claim 15 that, between the creating and the generating, further comprises: selecting a spatial perspective, which spatial perspective is not that of any of the multiplicity of real video cameras, on the scene as is imaged within the three-dimensional video database; wherein the generating of the two-dimensional virtual video image is so as to show the scene from the selected spatial perspective.
19. The method according to claim 18 wherein the selected spatial perspective is static, and fixed, during the video of the scene.
20. The method according to claim 18 wherein the selected spatial perspective is dynamic, and variable, during the video of the scene.
21. The method according to claim 18 wherein the selected spatial perspective is so dynamic and variable dependent upon occurrences in the scene.
22. The method according to claim 15 that, between the creating and the generating, further comprises: locating a selected object in the scene as is imaged within the three-dimensional video database; wherein the generating of the two-dimensional virtual video image is so as to best show the selected object.
23. The method according to claim 15 that, between the creating and the generating, further comprises: dynamically tracking the scene as is imaged within the three-dimensional video database in order to recognize any occurrence of a predetermined event in the scene; wherein the generating of the two-dimensional virtual video image is so as to best show the predetermined event.
24. The method according to claim 15 wherein the generating is of a selected two-dimensional virtual video image, on demand.
25. The method according to claim 15 wherein the generating of the selected two-dimensional virtual video image is in real time on demand, thus interactive virtual television.
26. A method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising: capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene; creating from the captured video a full three-dimensional model of the scene; producing from the three-dimensional model a video representation on the scene that is in accordance with the desired perspective on the scene of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires; wherein the representation is called immersive telepresence because it appears to the viewer that, since the scene is presented as the viewer desires, the viewer is immersed in the scene; wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video.
27. The method of immersive telepresence according to claim 26 wherein the video representation is in accordance with the position and direction of the viewer's eyes and head, and exhibits motional parallax; wherein motional parallax is, normally and conventionally, a three-dimensional effect where different views on the scene are produced as the viewer moves position even should the viewer have but one eye, making the viewer's brain comprehend that the viewed scene is three-dimensional.
28. The method of immersive telepresence according to claim 26 wherein the video representation is stereoscopic; wherein stereoscopy is, normally and conventionally, a three-dimensional effect where each of the viewer's two eyes sees a slightly different view on the scene, making the viewer's brain comprehend that the viewed scene is three-dimensional even should the viewer not move his/her head or eyes in spatial position.
29. A method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising: capturing video of a real-world scene from a multiplicity of different spatial perspectives on the scene; creating from the captured video a full three-dimensional model of the scene; producing from the three-dimensional model a video representation on the scene that is in accordance with a predetermined criterion selected from among criteria including a perspective on the scene, an object in the scene and an event in the scene, thus interactive telepresence because the presentation to the viewer is interactive in accordance with the criterion; wherein the video presentation of the scene in accordance with the criterion need not be in accordance with any of the captured video.
30. The method of viewer-interactive telepresence according to claim 29 wherein the video representation is in accordance with a criterion selected by the viewer, thus viewer-interactive telepresence.
31. The method of viewer-interactive telepresence according to claim 30 wherein the presentation is in accordance with the position and direction of the viewer's eyes and head, and exhibits motional parallax.
32. The method of viewer-interactive telepresence according to claim 30 wherein the presentation exhibits stereoscopy.
33. An immersive video system for presenting video images of a real-world scene in accordance with a predetermined criterion, the system comprising: a knowledge database containing information about the scene; multiple video sources each at a different spatial location for producing multiple two-dimensional video images of a real-world scene each at a different spatial perspective; a viewer interface at which a prospective viewer of the scene may specify a criterion relative to which criterion the viewer wishes to view the scene; a computer, receiving the multiple two-dimensional video images of the scene from the multiple video cameras and the viewer-specified criterion from the viewer interface, the computer including a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, and a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects, as recorded in the dynamic environmental model in order to produce parameters of perspective on the scene, and a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene; and a video display, receiving the particular two-dimensional video image of the scene from the computer, for displaying this particular two-dimensional video image of the real-world scene to the viewer as that particular view of the scene which is in satisfaction of the viewer-specified criterion.
34. The immersive video system according to claim 33 wherein the knowledge database contains data regarding at least two of: the geometry of the real-world scene, potential shapes of objects in the real-world scene, dynamic behaviors of objects in the real-world scene, and a camera calibration model.
35. The immersive video system according to claim 33 wherein the knowledge database contains data regarding each of: the geometry of the real-world scene, potential shapes of objects in the real-world scene, dynamic behaviors of objects in the real-world scene, and a camera calibration model.
36. The immersive video system according to claim 33 wherein the camera calibration model of the knowledge database includes at least one of: an internal camera calibration model, and an external camera calibration model.
PCT/US1996/004400 1995-03-31 1996-03-29 Immersive video WO1996031047A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU53802/96A AU5380296A (en) 1995-03-31 1996-03-29 Immersive video

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US08/414,437 1995-03-31
US08/414,437 US5729471A (en) 1995-03-31 1995-03-31 Machine dynamic selection of one video camera/image of a scene from multiple video cameras/images of the scene in accordance with a particular perspective on the scene, an object in the scene, or an event in the scene
US08/554,848 US5850352A (en) 1995-03-31 1995-11-06 Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images
US08/554,848 1995-11-07

Publications (2)

Publication Number Publication Date
WO1996031047A2 true WO1996031047A2 (en) 1996-10-03
WO1996031047A3 WO1996031047A3 (en) 1996-11-07

Family

ID=27022548

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/004400 WO1996031047A2 (en) 1995-03-31 1996-03-29 Immersive video

Country Status (3)

Country Link
US (1) US5850352A (en)
AU (1) AU5380296A (en)
WO (1) WO1996031047A2 (en)

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0785532A3 (en) * 1996-01-16 1998-07-29 University Corporation For Atmospheric Research Virtual reality imaging system
WO1998045813A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Media production with correlation of image stream and abstract objects in a three-dimensional virtual stage
WO1998045814A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Iterative process for three-dimensional image generation
WO1998046029A1 (en) * 1997-04-04 1998-10-15 Orad Hi-Tec Systems Limited Graphical video systems
WO1998045812A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Integrating live/recorded sources into a three-dimensional environment for media productions
WO1998050834A1 (en) * 1997-05-06 1998-11-12 Control Technology Corporation Distributed interface architecture for programmable industrial control systems
EP0903695A1 (en) * 1997-09-16 1999-03-24 Canon Kabushiki Kaisha Image processing apparatus
EP0930585A1 (en) * 1998-01-14 1999-07-21 Canon Kabushiki Kaisha Image processing apparatus
US6124864A (en) * 1997-04-07 2000-09-26 Synapix, Inc. Adaptive modeling and segmentation of visual image streams
EP1064817A1 (en) * 1998-08-07 2001-01-03 Be Here Corporation Method and apparatus for electronically distributing motion panoramic images
SG83686A1 (en) * 1997-09-12 2001-10-16 Matsushita Electric Ind Co Ltd Virtual www server for enabling a single display screen of a browser to be utilized to concurrently display data of a plurality of files which are obtained from respective servers and to send commands to these servers.
CN1107292C (en) * 1997-06-20 2003-04-30 日本电信电话株式会社 Scheme for interactive video manipulation and display of moving object on background image
US6570587B1 (en) 1996-07-26 2003-05-27 Veon Ltd. System and method and linking information to a video
GB2403364A (en) * 2003-06-24 2004-12-29 Christopher Paul Casson Virtual scene generating system
US6853867B1 (en) 1998-12-30 2005-02-08 Schneider Automation Inc. Interface to a programmable logic controller
WO2005048200A2 (en) * 2003-11-05 2005-05-26 Cognex Corporation Method and system for enhanced portal security through stereoscopy
US7146408B1 (en) 1996-05-30 2006-12-05 Schneider Automation Inc. Method and system for monitoring a controller and displaying data from the controller in a format provided by the controller
WO2007014216A2 (en) * 2005-07-22 2007-02-01 Cernium Corporation Directed attention digital video recordation
US7397929B2 (en) 2002-09-05 2008-07-08 Cognex Technology And Investment Corporation Method and apparatus for monitoring a passageway using 3D images
US7400744B2 (en) 2002-09-05 2008-07-15 Cognex Technology And Investment Corporation Stereo door sensor
EP2034441A1 (en) * 2007-09-05 2009-03-11 Sony Corporation System and method for communicating a representation of a scene
US7614083B2 (en) 2004-03-01 2009-11-03 Invensys Systems, Inc. Process control methods and apparatus for intrusion detection, protection and network hardening
US7650058B1 (en) 2001-11-08 2010-01-19 Cernium Corporation Object selective video recording
WO2010127418A1 (en) * 2009-05-07 2010-11-11 Universite Catholique De Louvain Systems and methods for the autonomous production of videos from multi-sensored data
US7984420B2 (en) * 1999-05-17 2011-07-19 Invensys Systems, Inc. Control systems and methods with composite blocks
US20120293606A1 (en) * 2011-05-20 2012-11-22 Microsoft Corporation Techniques and system for automatic video conference camera feed selection based on room events
US8326084B1 (en) 2003-11-05 2012-12-04 Cognex Technology And Investment Corporation System and method of auto-exposure control for image acquisition hardware using three dimensional information
EP2533533A1 (en) * 2011-06-08 2012-12-12 Sony Corporation Display Control Device, Display Control Method, Program, and Recording Medium
EP2634772A1 (en) * 2012-02-28 2013-09-04 BlackBerry Limited Methods and devices for selecting objects in images
EP2791909A4 (en) * 2011-12-16 2015-06-24 Thomson Licensing Method and apparatus for generating 3d free viewpoint video
US9215467B2 (en) 2008-11-17 2015-12-15 Checkvideo Llc Analytics-modulated coding of surveillance video
US9317133B2 (en) 2010-10-08 2016-04-19 Nokia Technologies Oy Method and apparatus for generating augmented reality content
EP3007038A3 (en) * 2014-09-22 2016-06-22 Samsung Electronics Co., Ltd. Interaction with three-dimensional video
US9558575B2 (en) 2012-02-28 2017-01-31 Blackberry Limited Methods and devices for selecting objects in images
EP3151554A1 (en) * 2015-09-30 2017-04-05 Calay Venture S.a.r.l. Presence camera
EP3264761A1 (en) * 2016-06-23 2018-01-03 Thomson Licensing A method and apparatus for creating a pair of stereoscopic images using least one lightfield camera
DE102016119637A1 (en) * 2016-10-14 2018-04-19 Uniqfeed Ag Television transmission system for generating enriched images
US10181218B1 (en) 2016-02-17 2019-01-15 Steelcase Inc. Virtual affordance sales tool
US10182210B1 (en) 2016-12-15 2019-01-15 Steelcase Inc. Systems and methods for implementing augmented reality and/or virtual reality
CN109275358A (en) * 2016-05-25 2019-01-25 佳能株式会社 The method and apparatus for generating virtual image from the camera array with chrysanthemum chain link according to the selected viewpoint of user
EP3338106A4 (en) * 2015-08-17 2019-04-03 C360 Technologies, Inc. Generating objects in real time panoramic video
US10257494B2 (en) 2014-09-22 2019-04-09 Samsung Electronics Co., Ltd. Reconstruction of three-dimensional video
US10277890B2 (en) 2016-06-17 2019-04-30 Dustin Kerstein System and method for capturing and viewing panoramic images having motion parallax depth perception without image stitching
US10404938B1 (en) 2015-12-22 2019-09-03 Steelcase Inc. Virtual world method and system for affecting mind state
US10740905B2 (en) 2016-10-14 2020-08-11 Uniqfeed Ag System for dynamically maximizing the contrast between the foreground and background in images and/or image sequences
US10748008B2 (en) 2014-02-28 2020-08-18 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
CN111580670A (en) * 2020-05-12 2020-08-25 黑龙江工程学院 Landscape implementing method based on virtual reality
US10769446B2 (en) 2014-02-28 2020-09-08 Second Spectrum, Inc. Methods and systems of combining video content with one or more augmentations
US10805558B2 (en) 2016-10-14 2020-10-13 Uniqfeed Ag System for producing augmented images
US10832057B2 (en) 2014-02-28 2020-11-10 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US11049218B2 (en) 2017-08-11 2021-06-29 Samsung Electronics Company, Ltd. Seamless image stitching
US11113535B2 (en) 2019-11-08 2021-09-07 Second Spectrum, Inc. Determining tactical relevance and similarity of video sequences
US11120271B2 (en) 2014-02-28 2021-09-14 Second Spectrum, Inc. Data processing systems and methods for enhanced augmentation of interactive video content
CN113938653A (en) * 2021-10-12 2022-01-14 钱保军 Multi-video monitoring display method based on AR echelon cascade
US11275949B2 (en) 2014-02-28 2022-03-15 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US11380014B2 (en) 2020-03-17 2022-07-05 Aptiv Technologies Limited Control modules and methods
US11380101B2 (en) 2014-02-28 2022-07-05 Second Spectrum, Inc. Data processing systems and methods for generating interactive user interfaces and interactive game systems based on spatiotemporal analysis of video content
EP3882866A4 (en) * 2018-11-14 2022-08-10 Canon Kabushiki Kaisha Information processing system, information processing method, and program
EP4122767A1 (en) * 2021-07-22 2023-01-25 Continental Automotive Systems, Inc. Vehicle mirror image simulation
US11861906B2 (en) 2014-02-28 2024-01-02 Genius Sports Ss, Llc Data processing systems and methods for enhanced augmentation of interactive video content

Families Citing this family (604)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6418424B1 (en) 1991-12-23 2002-07-09 Steven M. Hoffberg Ergonomic man-machine interface incorporating adaptive pattern recognition based control system
US6850252B1 (en) 1999-10-05 2005-02-01 Steven M. Hoffberg Intelligent electronic appliance system and method
US6400996B1 (en) 1999-02-01 2002-06-04 Steven M. Hoffberg Adaptive pattern recognition based control system and method
US10361802B1 (en) 1999-02-01 2019-07-23 Blanding Hovenweep, Llc Adaptive pattern recognition based control system and method
US8352400B2 (en) 1991-12-23 2013-01-08 Hoffberg Steven M Adaptive pattern recognition based controller apparatus and method and human-factored interface therefore
WO1996024216A1 (en) 1995-01-31 1996-08-08 Transcenic, Inc. Spatial referenced photography
US6654014B2 (en) * 1995-04-20 2003-11-25 Yoshinori Endo Bird's-eye view forming method, map display apparatus and navigation system
EP0773515B1 (en) * 1995-04-27 2003-08-27 Sega Enterprises, Ltd. Image processor, image processing method, game apparatus using them, and memory medium
JP3871224B2 (en) * 1995-12-07 2007-01-24 株式会社セガ Image generating apparatus, image generating method, game machine using the same, and medium
US6009188A (en) * 1996-02-16 1999-12-28 Microsoft Corporation Method and system for digital plenoptic imaging
CN1286327C (en) * 1996-02-28 2006-11-22 松下电器产业株式会社 High-resolution optical disk for recording stereoscopic video, optical disk reproducing device, and optical disk recording device
US6084979A (en) * 1996-06-20 2000-07-04 Carnegie Mellon University Method for creating virtual reality
US6373642B1 (en) 1996-06-24 2002-04-16 Be Here Corporation Panoramic imaging arrangement
US6341044B1 (en) 1996-06-24 2002-01-22 Be Here Corporation Panoramic imaging arrangement
US6459451B2 (en) 1996-06-24 2002-10-01 Be Here Corporation Method and apparatus for a panoramic camera to capture a 360 degree image
US6493032B1 (en) 1996-06-24 2002-12-10 Be Here Corporation Imaging arrangement which allows for capturing an image of a view at different resolutions
US6064749A (en) * 1996-08-02 2000-05-16 Hirota; Gentaro Hybrid tracking for augmented reality using both camera motion detection and landmark tracking
US6384908B1 (en) 1996-08-15 2002-05-07 Go Sensors, Llc Orientation dependent radiation source
US6108005A (en) * 1996-08-30 2000-08-22 Space Corporation Method for producing a synthesized stereoscopic image
US6421047B1 (en) * 1996-09-09 2002-07-16 De Groot Marc Multi-user virtual reality system for simulating a three-dimensional environment
JP3064928B2 (en) * 1996-09-20 2000-07-12 日本電気株式会社 Subject extraction method
JPH10111953A (en) * 1996-10-07 1998-04-28 Canon Inc Image processing method, device therefor and recording medium
JP3745475B2 (en) * 1996-12-06 2006-02-15 株式会社セガ GAME DEVICE AND IMAGE PROCESSING DEVICE
US6080063A (en) * 1997-01-06 2000-06-27 Khosla; Vinod Simulated real time game play with live event
US6195090B1 (en) 1997-02-28 2001-02-27 Riggins, Iii A. Stephen Interactive sporting-event monitoring system
JPH1186038A (en) * 1997-03-03 1999-03-30 Sega Enterp Ltd Image processor, image processing method, medium and game machine
US6209028B1 (en) 1997-03-21 2001-03-27 Walker Digital, Llc System and method for supplying supplemental audio information for broadcast television programs
US6205485B1 (en) * 1997-03-27 2001-03-20 Lextron Systems, Inc Simulcast WEB page delivery using a 3D user interface system
JP3861928B2 (en) * 1997-04-03 2006-12-27 株式会社セガ・エンタープライゼス Game image display method and control method
US6708184B2 (en) * 1997-04-11 2004-03-16 Medtronic/Surgical Navigation Technologies Method and apparatus for producing and accessing composite data using a device having a distributed communication controller interface
JP4332231B2 (en) * 1997-04-21 2009-09-16 ソニー株式会社 Imaging device controller and imaging system
US6356296B1 (en) 1997-05-08 2002-03-12 Behere Corporation Method and apparatus for implementing a panoptic camera system
US6466254B1 (en) 1997-05-08 2002-10-15 Be Here Corporation Method and apparatus for electronically distributing motion panoramic images
JP3932461B2 (en) * 1997-05-21 2007-06-20 ソニー株式会社 Client device, image display control method, shared virtual space providing device and method, and recording medium
JP3799134B2 (en) 1997-05-28 2006-07-19 ソニー株式会社 System and notification method
US6392665B1 (en) * 1997-05-29 2002-05-21 Sun Microsystems, Inc. Capture mechanism for computer generated motion video images
JP3183632B2 (en) * 1997-06-13 2001-07-09 株式会社ナムコ Information storage medium and image generation device
JP3116033B2 (en) * 1997-07-02 2000-12-11 一成 江良 Video data creation method and video data display method
JPH11126017A (en) 1997-08-22 1999-05-11 Sony Corp Storage medium, robot, information processing device and electronic pet system
JP3931392B2 (en) 1997-08-25 2007-06-13 ソニー株式会社 Stereo image video signal generating device, stereo image video signal transmitting device, and stereo image video signal receiving device
US20020113865A1 (en) * 1997-09-02 2002-08-22 Kotaro Yano Image processing method and apparatus
US6230200B1 (en) * 1997-09-08 2001-05-08 Emc Corporation Dynamic modeling for resource allocation in a file server
US6409596B1 (en) * 1997-09-12 2002-06-25 Kabushiki Kaisha Sega Enterprises Game device and image displaying method which displays a game proceeding in virtual space, and computer-readable recording medium
US7190392B1 (en) * 1997-10-23 2007-03-13 Maguire Jr Francis J Telepresence system and active/passive mode display for use therein
JP3594471B2 (en) * 1997-12-19 2004-12-02 株式会社日立製作所 Scenario display device and method
JPH11203009A (en) 1998-01-20 1999-07-30 Sony Corp Information processing device and method and distribution medium
JP3855430B2 (en) 1998-01-23 2006-12-13 ソニー株式会社 Information processing apparatus and method, information processing system, and recording medium
US6266068B1 (en) * 1998-03-13 2001-07-24 Compaq Computer Corporation Multi-layer image-based rendering for video synthesis
RU2161871C2 (en) * 1998-03-20 2001-01-10 Латыпов Нурахмед Нурисламович Method and device for producing video programs
IL138808A0 (en) 1998-04-02 2001-10-31 Kewazinga Corp A navigable telepresence method and system utilizing an array of cameras
US6522325B1 (en) 1998-04-02 2003-02-18 Kewazinga Corp. Navigable telepresence method and system utilizing an array of cameras
US6192393B1 (en) * 1998-04-07 2001-02-20 Mgi Software Corporation Method and system for panorama viewing
US6333749B1 (en) * 1998-04-17 2001-12-25 Adobe Systems, Inc. Method and apparatus for image assisted modeling of three-dimensional scenes
US6154771A (en) * 1998-06-01 2000-11-28 Mediastra, Inc. Real-time receipt, decompression and play of compressed streaming video/hypervideo; with thumbnail display of past scenes and with replay, hyperlinking and/or recording permissively intiated retrospectively
US6281904B1 (en) * 1998-06-09 2001-08-28 Adobe Systems Incorporated Multi-source texture reconstruction and fusion
US6442293B1 (en) * 1998-06-11 2002-08-27 Kabushiki Kaisha Topcon Image forming apparatus, image forming method and computer-readable storage medium having an image forming program
JP3522537B2 (en) * 1998-06-19 2004-04-26 洋太郎 村瀬 Image reproducing method, image reproducing apparatus, and image communication system
JP4146938B2 (en) * 1998-09-08 2008-09-10 オリンパス株式会社 Panorama image synthesis apparatus and recording medium storing panorama image synthesis program
US6215498B1 (en) * 1998-09-10 2001-04-10 Lionhearth Technologies, Inc. Virtual command post
ATE420528T1 (en) * 1998-09-17 2009-01-15 Yissum Res Dev Co SYSTEM AND METHOD FOR GENERATING AND DISPLAYING PANORAMIC IMAGES AND FILMS
US6111571A (en) * 1998-10-01 2000-08-29 Full Moon Productions, Inc. Method and computer program for operating an interactive themed attraction accessible by computer users
US6374402B1 (en) 1998-11-16 2002-04-16 Into Networks, Inc. Method and apparatus for installation abstraction in a secure content delivery system
US7017188B1 (en) 1998-11-16 2006-03-21 Softricity, Inc. Method and apparatus for secure content delivery over broadband access networks
US6763370B1 (en) 1998-11-16 2004-07-13 Softricity, Inc. Method and apparatus for content protection in a secure content delivery system
US6369818B1 (en) 1998-11-25 2002-04-09 Be Here Corporation Method, apparatus and computer program product for generating perspective corrected data from warped information
US6320600B1 (en) * 1998-12-15 2001-11-20 Cornell Research Foundation, Inc. Web-based video-editing method and system using a high-performance multimedia software library
JP4240343B2 (en) * 1998-12-19 2009-03-18 株式会社セガ Image generating apparatus and image generating method
US7143358B1 (en) * 1998-12-23 2006-11-28 Yuen Henry C Virtual world internet web site using common and user-specific metrics
AU2487000A (en) * 1998-12-30 2000-07-31 Chequemate International Inc. System and method for recording and broadcasting three-dimensional video
US6282317B1 (en) * 1998-12-31 2001-08-28 Eastman Kodak Company Method for automatic determination of main subjects in photographic images
US6384823B1 (en) * 1999-01-05 2002-05-07 C. Michael Donoghue System and method for real-time mapping and transfer of coordinate position data from a virtual computer-aided design workspace to the real workspace
US7271803B2 (en) * 1999-01-08 2007-09-18 Ricoh Company, Ltd. Method and system for simulating stereographic vision
US6175454B1 (en) 1999-01-13 2001-01-16 Behere Corporation Panoramic imaging arrangement
US7035897B1 (en) 1999-01-15 2006-04-25 California Institute Of Technology Wireless augmented reality communication system
US6337688B1 (en) * 1999-01-29 2002-01-08 International Business Machines Corporation Method and system for constructing a virtual reality environment from spatially related recorded images
US7904187B2 (en) 1999-02-01 2011-03-08 Hoffberg Steven M Internet appliance system and method
US6396535B1 (en) 1999-02-16 2002-05-28 Mitsubishi Electric Research Laboratories, Inc. Situation awareness system
JP3851014B2 (en) * 1999-03-01 2006-11-29 富士通株式会社 Effective visual field verification device, effective visual field verification method, and computer-readable recording medium
JP2002538542A (en) * 1999-03-02 2002-11-12 シーメンス アクチエンゲゼルシヤフト Enhanced Reality System for Supporting User-Industry Equipment Dialogue According to Context
JP4006873B2 (en) 1999-03-11 2007-11-14 ソニー株式会社 Information processing system, information processing method and apparatus, and information providing medium
US6751354B2 (en) * 1999-03-11 2004-06-15 Fuji Xerox Co., Ltd Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
DE50003361D1 (en) * 1999-03-25 2003-09-25 Siemens Ag SYSTEM AND METHOD FOR DOCUMENTATION PROCESSING WITH MULTILAYER STRUCTURE OF INFORMATION, ESPECIALLY FOR TECHNICAL AND INDUSTRIAL APPLICATIONS
US6891561B1 (en) * 1999-03-31 2005-05-10 Vulcan Patents Llc Providing visual context for a mobile active visual display of a panoramic region
AU4221000A (en) * 1999-04-08 2000-10-23 Internet Pictures Corporation Remote controlled platform for camera
US7370071B2 (en) 2000-03-17 2008-05-06 Microsoft Corporation Method for serving third party software applications from servers to client computers
US7730169B1 (en) 1999-04-12 2010-06-01 Softricity, Inc. Business method and system for serving third party software applications
US7015950B1 (en) 1999-05-11 2006-03-21 Pryor Timothy R Picture taking method and apparatus
US8099758B2 (en) 1999-05-12 2012-01-17 Microsoft Corporation Policy based composite file system and method
US7620909B2 (en) * 1999-05-12 2009-11-17 Imove Inc. Interactive image seamer for panoramic images
US20040075738A1 (en) * 1999-05-12 2004-04-22 Sean Burke Spherical surveillance system architecture
JP2000350865A (en) * 1999-06-11 2000-12-19 Mr System Kenkyusho:Kk Game device for composite real space, image processing method therefor and program storage medium
US7084887B1 (en) * 1999-06-11 2006-08-01 Canon Kabushiki Kaisha Marker layout method, mixed reality apparatus, and mixed reality space image generation method
JP3831548B2 (en) * 1999-06-16 2006-10-11 ペンタックス株式会社 Photogrammetry image processing apparatus, photogrammetry image processing method, and storage medium storing photogrammetry image processing program
US6954217B1 (en) * 1999-07-02 2005-10-11 Pentax Corporation Image processing computer system for photogrammetric analytical measurement
US6529723B1 (en) * 1999-07-06 2003-03-04 Televoke, Inc. Automated user notification system
US7310509B2 (en) * 2000-04-17 2007-12-18 Decarta Inc. Software and protocol structure for an automated user notification system
US6591094B1 (en) * 1999-07-06 2003-07-08 Televoke, Inc. Automated user notification system
US6409599B1 (en) * 1999-07-19 2002-06-25 Ham On Rye Technologies, Inc. Interactive virtual reality performance theater entertainment system
US6693670B1 (en) * 1999-07-29 2004-02-17 Vision - Sciences, Inc. Multi-photodetector unit cell
US7996878B1 (en) * 1999-08-31 2011-08-09 At&T Intellectual Property Ii, L.P. System and method for generating coded video sequences from still media
US20070008099A1 (en) * 1999-09-01 2007-01-11 Nettalon Security Systems, Inc. Method and apparatus for remotely monitoring a site
US6972676B1 (en) 1999-09-01 2005-12-06 Nettalon Security Systems, Inc. Method and apparatus for remotely monitoring a site
US6281790B1 (en) 1999-09-01 2001-08-28 Net Talon Security Systems, Inc. Method and apparatus for remotely monitoring a site
US6917288B2 (en) * 1999-09-01 2005-07-12 Nettalon Security Systems, Inc. Method and apparatus for remotely monitoring a site
US6795109B2 (en) 1999-09-16 2004-09-21 Yissum Research Development Company Of The Hebrew University Of Jerusalem Stereo panoramic camera arrangements for recording panoramic images useful in a stereo panoramic image pair
US7111252B1 (en) * 1999-09-22 2006-09-19 Harris Scott C Enhancing touch and feel on the internet
US6330356B1 (en) * 1999-09-29 2001-12-11 Rockwell Science Center Llc Dynamic visual registration of a 3-D object with a graphical model
US6654031B1 (en) * 1999-10-15 2003-11-25 Hitachi Kokusai Electric Inc. Method of editing a video program with variable view point of picked-up image and computer program product for displaying video program
JP2001118081A (en) 1999-10-15 2001-04-27 Sony Corp Device and method for processing information, and program storing medium
US7075556B1 (en) * 1999-10-21 2006-07-11 Sportvision, Inc. Telestrator system
US6628282B1 (en) * 1999-10-22 2003-09-30 New York University Stateless remote environment navigation
JP2001128122A (en) * 1999-10-28 2001-05-11 Venture Union:Kk Recording medium reproducing device
US7369130B2 (en) * 1999-10-29 2008-05-06 Hitachi Kokusai Electric Inc. Method and apparatus for editing image data, and computer program product of editing image data
US7230653B1 (en) * 1999-11-08 2007-06-12 Vistas Unlimited Method and apparatus for real time insertion of images into video
DE19953739C2 (en) * 1999-11-09 2001-10-11 Siemens Ag Device and method for object-oriented marking and assignment of information to selected technological components
AU1599801A (en) * 1999-11-12 2001-06-06 Brian S. Armstrong Robust landmarks for machine vision and methods for detecting same
WO2001039130A1 (en) * 1999-11-24 2001-05-31 Dartfish Ltd. Coordination and combination of video sequences with spatial and temporal normalization
WO2001040912A2 (en) * 1999-11-30 2001-06-07 Amico Joseph N D Security system linked to the internet
US7231327B1 (en) * 1999-12-03 2007-06-12 Digital Sandbox Method and apparatus for risk management
US6507353B1 (en) * 1999-12-10 2003-01-14 Godot Huard Influencing virtual actors in an interactive environment
US6957177B1 (en) 1999-12-10 2005-10-18 Microsoft Corporation Geometric model database for use in ubiquitous computing
US6687387B1 (en) 1999-12-27 2004-02-03 Internet Pictures Corporation Velocity-dependent dewarping of images
JP3363861B2 (en) 2000-01-13 2003-01-08 キヤノン株式会社 Mixed reality presentation device, mixed reality presentation method, and storage medium
US6980690B1 (en) * 2000-01-20 2005-12-27 Canon Kabushiki Kaisha Image processing apparatus
EP1125608A3 (en) * 2000-01-21 2005-03-30 Sony Computer Entertainment Inc. Entertainment apparatus, storage medium and object display method
US6989832B2 (en) * 2000-01-21 2006-01-24 Sony Computer Entertainment Inc. Entertainment apparatus, storage medium and object display method
US20010034788A1 (en) * 2000-01-21 2001-10-25 Mcternan Brennan J. System and method for receiving packet data multicast in sequential looping fashion
US6762746B2 (en) 2000-01-21 2004-07-13 Sony Computer Entertainment Inc. Entertainment apparatus, storage medium and operation method of manipulating object
US6798923B1 (en) 2000-02-04 2004-09-28 Industrial Technology Research Institute Apparatus and method for providing panoramic images
US7240067B2 (en) * 2000-02-08 2007-07-03 Sybase, Inc. System and methodology for extraction and aggregation of data from dynamic content
US7522186B2 (en) * 2000-03-07 2009-04-21 L-3 Communications Corporation Method and apparatus for providing immersive surveillance
US7328119B1 (en) * 2000-03-07 2008-02-05 Pryor Timothy R Diet and exercise planning and motivation including apparel purchases based on future appearance
JP3983953B2 (en) * 2000-03-10 2007-09-26 パイオニア株式会社 Stereoscopic two-dimensional image display apparatus and image display method
DE60137660D1 (en) * 2000-03-17 2009-04-02 Panasonic Corp Map display and navigation device
WO2001071662A2 (en) * 2000-03-20 2001-09-27 Cognitens, Ltd. System and method for globally aligning individual views based on non-accurate repeatable positioning devices
AU2001252223A1 (en) * 2000-04-07 2001-10-23 Pilz Gmbh And Co. Protective device for safeguarding a dangerous area and method for verifying thefunctional reliability of such device
KR100328482B1 (en) * 2000-04-24 2002-03-16 김석배 System for broadcasting using internet
JP2002014611A (en) * 2000-04-28 2002-01-18 Minoruta Puranetariumu Kk Video projecting method to planetarium or spherical screen and device therefor
US7475404B2 (en) 2000-05-18 2009-01-06 Maquis Techtrix Llc System and method for implementing click-through for browser executed software including ad proxy and proxy cookie caching
US8086697B2 (en) 2005-06-28 2011-12-27 Claria Innovations, Llc Techniques for displaying impressions in documents delivered over a computer network
US20010056574A1 (en) * 2000-06-26 2001-12-27 Richards Angus Duncan VTV system
US7812856B2 (en) 2000-10-26 2010-10-12 Front Row Technologies, Llc Providing multiple perspectives of a venue activity to electronic wireless hand held devices
US7630721B2 (en) 2000-06-27 2009-12-08 Ortiz & Associates Consulting, Llc Systems, methods and apparatuses for brokering data between wireless devices and data rendering devices
US7353188B2 (en) * 2000-06-30 2008-04-01 Lg Electronics Product selling system and method for operating the same
JP3992909B2 (en) * 2000-07-03 2007-10-17 富士フイルム株式会社 Personal image providing system
US6788333B1 (en) * 2000-07-07 2004-09-07 Microsoft Corporation Panoramic video
WO2002007440A2 (en) 2000-07-15 2002-01-24 Filippo Costanzo Audio-video data switching and viewing system
US7193645B1 (en) 2000-07-27 2007-03-20 Pvi Virtual Media Services, Llc Video system and method of operating a video system
US7039630B2 (en) * 2000-07-27 2006-05-02 Nec Corporation Information search/presentation system
US6636237B1 (en) * 2000-07-31 2003-10-21 James H. Murray Method for creating and synchronizing links to objects in a video
US6778207B1 (en) * 2000-08-07 2004-08-17 Koninklijke Philips Electronics N.V. Fast digital pan tilt zoom video
US7464344B1 (en) * 2000-08-14 2008-12-09 Connie Carmichael Systems and methods for immersive advertising
US7788323B2 (en) * 2000-09-21 2010-08-31 International Business Machines Corporation Method and apparatus for sharing information in a virtual environment
US8564661B2 (en) 2000-10-24 2013-10-22 Objectvideo, Inc. Video analytic rule detection system and method
US7868912B2 (en) * 2000-10-24 2011-01-11 Objectvideo, Inc. Video surveillance system employing video primitives
US20050146605A1 (en) * 2000-10-24 2005-07-07 Lipton Alan J. Video surveillance system employing video primitives
US8711217B2 (en) 2000-10-24 2014-04-29 Objectvideo, Inc. Video surveillance system employing video primitives
US9892606B2 (en) 2001-11-15 2018-02-13 Avigilon Fortress Corporation Video surveillance system employing video primitives
GB2370738B (en) * 2000-10-27 2005-02-16 Canon Kk Image processing apparatus
US6947578B2 (en) * 2000-11-02 2005-09-20 Seung Yop Lee Integrated identification data capture system
US6963662B1 (en) * 2000-11-15 2005-11-08 Sri International Method and system for detecting changes in three dimensional shape
US7221794B1 (en) 2000-12-18 2007-05-22 Sportsvision, Inc. Foreground detection
EP1371019A2 (en) * 2001-01-26 2003-12-17 Zaxel Systems, Inc. Real-time virtual viewpoint in simulated reality environment
US20020118883A1 (en) * 2001-02-24 2002-08-29 Neema Bhatt Classifier-based enhancement of digital images
US8306635B2 (en) * 2001-03-07 2012-11-06 Motion Games, Llc Motivation and enhancement of physical and mental exercise, rehabilitation, health and social interaction
US6747647B2 (en) 2001-05-02 2004-06-08 Enroute, Inc. System and method for displaying immersive video
EP1397144A4 (en) * 2001-05-15 2005-02-16 Psychogenics Inc Systems and methods for monitoring behavior informatics
WO2002096096A1 (en) * 2001-05-16 2002-11-28 Zaxel Systems, Inc. 3d instant replay system and method
US7540011B2 (en) * 2001-06-11 2009-05-26 Arrowsight, Inc. Caching graphical interface for displaying video and ancillary data from a saved video
US7224892B2 (en) * 2001-06-26 2007-05-29 Canon Kabushiki Kaisha Moving image recording apparatus and method, moving image reproducing apparatus, moving image recording and reproducing method, and programs and storage media
US7046273B2 (en) * 2001-07-02 2006-05-16 Fuji Photo Film Co., Ltd System and method for collecting image information
US7206434B2 (en) * 2001-07-10 2007-04-17 Vistas Unlimited, Inc. Method and system for measurement of the duration an area is included in an image stream
US20030023974A1 (en) * 2001-07-25 2003-01-30 Koninklijke Philips Electronics N.V. Method and apparatus to track objects in sports programs and select an appropriate camera view
US20030034956A1 (en) * 2001-08-17 2003-02-20 Yuichiro Deguchi Virtual e-marker
US7091931B2 (en) * 2001-08-17 2006-08-15 Geo-Rae Co., Ltd. Method and system of stereoscopic image display for guiding a viewer's eye motion using a three-dimensional mouse
US6999083B2 (en) * 2001-08-22 2006-02-14 Microsoft Corporation System and method to provide a spectator experience for networked gaming
EP1425718B1 (en) * 2001-08-31 2011-01-12 Dassault Systemes SolidWorks Corporation Simultaneous use of 2d and 3d modeling data
US7342489B1 (en) 2001-09-06 2008-03-11 Siemens Schweiz Ag Surveillance system control unit
EP1301039B1 (en) * 2001-09-07 2006-12-13 Matsushita Electric Industrial Co., Ltd. A video distribution device and a video receiving device
US6641484B2 (en) * 2001-09-21 2003-11-04 Igt Gaming machine including security data collection device
US20030058342A1 (en) * 2001-09-27 2003-03-27 Koninklijke Philips Electronics N.V. Optimal multi-camera setup for computer-based visual surveillance
US6583808B2 (en) 2001-10-04 2003-06-24 National Research Council Of Canada Method and system for stereo videoconferencing
JP3744841B2 (en) * 2001-10-22 2006-02-15 三洋電機株式会社 Data generator
JP4148671B2 (en) * 2001-11-06 2008-09-10 ソニー株式会社 Display image control processing apparatus, moving image information transmission / reception system, display image control processing method, moving image information transmission / reception method, and computer program
US20030210329A1 (en) * 2001-11-08 2003-11-13 Aagaard Kenneth Joseph Video system and methods for operating a video system
JP3529373B2 (en) * 2001-11-09 2004-05-24 ファナック株式会社 Work machine simulation equipment
US20030105558A1 (en) * 2001-11-28 2003-06-05 Steele Robert C. Multimedia racing experience system and corresponding experience based displays
US7265663B2 (en) 2001-11-28 2007-09-04 Trivinci Systems, Llc Multimedia racing experience system
US7050050B2 (en) * 2001-12-07 2006-05-23 The United States Of America As Represented By The Secretary Of The Army Method for as-needed, pseudo-random, computer-generated environments
US7088773B2 (en) * 2002-01-17 2006-08-08 Sony Corporation Motion segmentation system with multi-frame hypothesis tracking
JP2003216981A (en) * 2002-01-25 2003-07-31 Iwane Kenkyusho:Kk Automatic working system
DE10204430A1 (en) 2002-02-04 2003-08-07 Zeiss Carl Stereo microscopy method and stereo microscopy system
US6909790B2 (en) * 2002-02-15 2005-06-21 Inventec Corporation System and method of monitoring moving objects
JP3962607B2 (en) * 2002-02-28 2007-08-22 キヤノン株式会社 Image processing apparatus and method, program, and storage medium
US20030227453A1 (en) * 2002-04-09 2003-12-11 Klaus-Peter Beier Method, system and computer program product for automatically creating an animated 3-D scenario from human position and path data
EP1497799A1 (en) * 2002-04-18 2005-01-19 Computer Associates Think, Inc. Integrated visualization of security information for an individual
US6976846B2 (en) * 2002-05-08 2005-12-20 Accenture Global Services Gmbh Telecommunications virtual simulator
JP4147059B2 (en) * 2002-07-03 2008-09-10 株式会社トプコン Calibration data measuring device, measuring method and measuring program, computer-readable recording medium, and image data processing device
JP4013684B2 (en) * 2002-07-23 2007-11-28 オムロン株式会社 Unauthorized registration prevention device in personal authentication system
JP2004072356A (en) * 2002-08-06 2004-03-04 Hitachi Ltd Server and program for performing the server
US20040032489A1 (en) * 2002-08-13 2004-02-19 Tyra Donald Wayne Method for displaying a visual element of a scene
US7631261B2 (en) * 2002-09-12 2009-12-08 Inoue Technologies, LLC Efficient method for creating a visual telepresence for large numbers of simultaneous users
US7447380B2 (en) 2002-09-12 2008-11-04 Inoe Technologies, Llc Efficient method for creating a viewpoint from plurality of images
US7002551B2 (en) 2002-09-25 2006-02-21 Hrl Laboratories, Llc Optical see-through augmented reality modified-scale display
US20060033739A1 (en) * 2002-09-30 2006-02-16 Tang Wilson J Apparatus and method of defining a sequence of views prior to production
JP3744002B2 (en) * 2002-10-04 2006-02-08 ソニー株式会社 Display device, imaging device, and imaging / display system
WO2004042662A1 (en) * 2002-10-15 2004-05-21 University Of Southern California Augmented virtual environments
US8458028B2 (en) * 2002-10-16 2013-06-04 Barbaro Technologies System and method for integrating business-related content into an electronic game
WO2004038657A2 (en) * 2002-10-22 2004-05-06 Artoolworks Tracking a surface in a 3-dimensional scene using natural visual features of the surface
AU2003269448B2 (en) * 2002-10-30 2008-08-28 Nds Limited Interactive broadcast system
US7603341B2 (en) 2002-11-05 2009-10-13 Claria Corporation Updating the content of a presentation vehicle in a computer network
CA2507959A1 (en) * 2002-11-29 2004-07-22 Bracco Imaging, S.P.A. System and method for displaying and comparing 3d models
JP4101043B2 (en) * 2002-12-11 2008-06-11 キヤノン株式会社 Image data display system, image data display method, program, storage medium, and imaging apparatus
JP2004199496A (en) * 2002-12-19 2004-07-15 Sony Corp Information processor and method, and program
US7050078B2 (en) * 2002-12-19 2006-05-23 Accenture Global Services Gmbh Arbitrary object tracking augmented reality applications
US20040166484A1 (en) * 2002-12-20 2004-08-26 Mark Alan Budke System and method for simulating training scenarios
US20050024488A1 (en) * 2002-12-20 2005-02-03 Borg Andrew S. Distributed immersive entertainment system
US7823058B2 (en) * 2002-12-30 2010-10-26 The Board Of Trustees Of The Leland Stanford Junior University Methods and apparatus for interactive point-of-view authoring of digital video content
US7082572B2 (en) * 2002-12-30 2006-07-25 The Board Of Trustees Of The Leland Stanford Junior University Methods and apparatus for interactive map-based analysis of digital video content
AU2004211721B2 (en) * 2003-02-11 2009-08-20 Nds Limited Apparatus and methods for handling interactive applications in broadcast networks
JP2004258802A (en) * 2003-02-24 2004-09-16 Fuji Xerox Co Ltd Working space management device
GB2400513B (en) * 2003-03-14 2005-10-05 British Broadcasting Corp Video processing
EP1577795A3 (en) * 2003-03-15 2006-08-30 Oculus Info Inc. System and Method for Visualising Connected Temporal and Spatial Information as an Integrated Visual Representation on a User Interface
WO2004086747A2 (en) 2003-03-20 2004-10-07 Covi Technologies, Inc. Systems and methods for multi-stream image processing
FR2854301B1 (en) * 2003-04-24 2005-10-28 Yodea Method for transmitting data representing the position in space of a video camera and system for implementing the method
JP2007525068A (en) * 2003-06-19 2007-08-30 エル3 コミュニケーションズ コーポレイション Method and apparatus for providing scalable multi-camera distributed video processing and visualization surveillance system
US20050050021A1 (en) * 2003-08-25 2005-03-03 Sybase, Inc. Information Messaging and Collaboration System
JP2005100367A (en) * 2003-09-02 2005-04-14 Fuji Photo Film Co Ltd Image generating apparatus, image generating method and image generating program
US7634352B2 (en) * 2003-09-05 2009-12-15 Navteq North America, Llc Method of displaying traffic flow conditions using a 3D system
US7299126B2 (en) * 2003-11-03 2007-11-20 International Business Machines Corporation System and method for evaluating moving queries over moving objects
US7436429B2 (en) * 2003-11-24 2008-10-14 The Boeing Company Virtual pan/tilt camera system and method for vehicles
US20050131658A1 (en) * 2003-12-16 2005-06-16 Mei Hsaio L.S. Systems and methods for 3D assembly venue modeling
US20050131657A1 (en) * 2003-12-16 2005-06-16 Sean Mei Hsaio L. Systems and methods for 3D modeling and creation of a digital asset library
US20050131659A1 (en) * 2003-12-16 2005-06-16 Mei Hsaio L.S. Systems and methods for 3D modeling and asset management
US7683937B1 (en) 2003-12-31 2010-03-23 Aol Inc. Presentation of a multimedia experience
WO2005081191A1 (en) * 2004-02-18 2005-09-01 Bloodworth, Keith Adaptive 3D image modelling system and apparatus and method therefor
US20050185047A1 (en) * 2004-02-19 2005-08-25 Hii Desmond Toh O. Method and apparatus for providing a combined image
EP1569150A1 (en) * 2004-02-27 2005-08-31 Koninklijke KPN N.V. Method and system for storing and presenting personal information
US7139651B2 (en) * 2004-03-05 2006-11-21 Modular Mining Systems, Inc. Multi-source positioning system for work machines
WO2005088970A1 (en) * 2004-03-11 2005-09-22 Olympus Corporation Image generation device, image generation method, and image generation program
JP2005295004A (en) * 2004-03-31 2005-10-20 Sanyo Electric Co Ltd Stereoscopic image processing method and apparatus thereof
JP2005288014A (en) * 2004-04-05 2005-10-20 Interstructure Co Ltd System and method for evaluating form
EP1589758A1 (en) * 2004-04-22 2005-10-26 Alcatel Video conference system and method
JP4474640B2 (en) * 2004-05-11 2010-06-09 株式会社セガ Image processing program, game processing program, and game information processing apparatus
JP2005353047A (en) * 2004-05-13 2005-12-22 Sanyo Electric Co Ltd Three-dimensional image processing method and three-dimensional image processor
US7680300B2 (en) * 2004-06-01 2010-03-16 Energid Technologies Visual object recognition and tracking
US8063936B2 (en) * 2004-06-01 2011-11-22 L-3 Communications Corporation Modular immersive surveillance processing system and method
WO2006071259A2 (en) * 2004-06-01 2006-07-06 L-3 Communications Corporation Method and system for wide area security monitoring, sensor management and situational awareness
JP4952995B2 (en) * 2004-06-18 2012-06-13 日本電気株式会社 Image display system, image display method, and image display program
US7724258B2 (en) * 2004-06-30 2010-05-25 Purdue Research Foundation Computer modeling and animation of natural phenomena
US8730322B2 (en) 2004-07-30 2014-05-20 Eyesee360, Inc. Telepresence using panoramic imaging and directional sound and motion
US20060028476A1 (en) * 2004-08-03 2006-02-09 Irwin Sobel Method and system for providing extensive coverage of an object using virtual cameras
WO2006023647A1 (en) * 2004-08-18 2006-03-02 Sarnoff Corporation System and method for monitoring training environment
US8255413B2 (en) 2004-08-19 2012-08-28 Carhamm Ltd., Llc Method and apparatus for responding to request for information-personalization
US8078602B2 (en) 2004-12-17 2011-12-13 Claria Innovations, Llc Search engine for a computer network
US7500795B2 (en) * 2004-09-09 2009-03-10 Paul Sandhu Apparatuses, systems and methods for enhancing telemedicine, video-conferencing, and video-based sales
US7529429B2 (en) * 2004-11-12 2009-05-05 Carsten Rother Auto collage
US7653261B2 (en) * 2004-11-12 2010-01-26 Microsoft Corporation Image tapestry
WO2006053271A1 (en) 2004-11-12 2006-05-18 Mok3, Inc. Method for inter-scene transitions
US20060103723A1 (en) * 2004-11-18 2006-05-18 Advanced Fuel Research, Inc. Panoramic stereoscopic video system
CN101080762A (en) * 2004-11-19 2007-11-28 Daem交互有限公司 Personal device and method with image-acquisition functions for the application of augmented reality resources
US7693863B2 (en) 2004-12-20 2010-04-06 Claria Corporation Method and device for publishing cross-network user behavioral data
US8880205B2 (en) * 2004-12-30 2014-11-04 Mondo Systems, Inc. Integrated multimedia signal processing system using centralized processing of signals
US8015590B2 (en) 2004-12-30 2011-09-06 Mondo Systems, Inc. Integrated multimedia signal processing system using centralized processing of signals
US7653447B2 (en) 2004-12-30 2010-01-26 Mondo Systems, Inc. Integrated audio video signal processing system using centralized processing of signals
US7524119B2 (en) * 2005-02-03 2009-04-28 Paul Sandhu Apparatus and method for viewing radiographs
US20070065002A1 (en) * 2005-02-18 2007-03-22 Laurence Marzell Adaptive 3D image modelling system and apparatus and method therefor
US20060187297A1 (en) * 2005-02-24 2006-08-24 Levent Onural Holographic 3-d television
US8073866B2 (en) 2005-03-17 2011-12-06 Claria Innovations, Llc Method for providing content to an internet user based on the user's demonstrated content preferences
US7996699B2 (en) 2005-04-11 2011-08-09 Graphics Properties Holdings, Inc. System and method for synchronizing multiple media devices
US20060247808A1 (en) * 2005-04-15 2006-11-02 Robb Walter L Computer-implemented method, tool, and program product for training and evaluating football players
US10692536B1 (en) * 2005-04-16 2020-06-23 Apple Inc. Generation and use of multiclips in video editing
US20060262140A1 (en) * 2005-05-18 2006-11-23 Kujawa Gregory A Method and apparatus to facilitate visual augmentation of perceived reality
US7884848B2 (en) * 2005-05-25 2011-02-08 Ginther Mark E Viewing environment and recording system
KR100785594B1 (en) * 2005-06-17 2007-12-13 오므론 가부시키가이샤 Image process apparatus
JP4774824B2 (en) * 2005-06-17 2011-09-14 オムロン株式会社 Method for confirming measurement target range in three-dimensional measurement processing, method for setting measurement target range, and apparatus for performing each method
WO2007013231A1 (en) * 2005-07-29 2007-02-01 Matsushita Electric Industrial Co., Ltd. Imaging region adjustment device
US20070070069A1 (en) * 2005-09-26 2007-03-29 Supun Samarasekera System and method for enhanced situation awareness and visualization of environments
JP4821253B2 (en) * 2005-10-13 2011-11-24 船井電機株式会社 Image output device
EP2005271A2 (en) * 2005-10-24 2008-12-24 The Toro Company Computer-operated landscape irrigation and lighting system
US10614626B2 (en) * 2005-10-26 2020-04-07 Cortica Ltd. System and method for providing augmented reality challenges
US20070146367A1 (en) * 2005-11-14 2007-06-28 Alion Science And Technology Corporation System for editing and conversion of distributed simulation data for visualization
US8025572B2 (en) * 2005-11-21 2011-09-27 Microsoft Corporation Dynamic spectator mode
US7632186B2 (en) * 2005-11-21 2009-12-15 Microsoft Corporation Spectator mode for a game
JP5026692B2 (en) * 2005-12-01 2012-09-12 株式会社ソニー・コンピュータエンタテインメント Image processing apparatus, image processing method, and program
US20070126864A1 (en) * 2005-12-05 2007-06-07 Kiran Bhat Synthesizing three-dimensional surround visual field
US20070141545A1 (en) * 2005-12-05 2007-06-21 Kar-Han Tan Content-Based Indexing and Retrieval Methods for Surround Video Synthesis
US20070126932A1 (en) * 2005-12-05 2007-06-07 Kiran Bhat Systems and methods for utilizing idle display area
US8130330B2 (en) * 2005-12-05 2012-03-06 Seiko Epson Corporation Immersive surround visual fields
US20100287473A1 (en) * 2006-01-17 2010-11-11 Arthur Recesso Video analysis tool systems and methods
US20070174010A1 (en) * 2006-01-24 2007-07-26 Kiran Bhat Collective Behavior Modeling for Content Synthesis
KR101249988B1 (en) * 2006-01-27 2013-04-01 삼성전자주식회사 Apparatus and method for displaying image according to the position of user
KR100841315B1 (en) * 2006-02-16 2008-06-26 엘지전자 주식회사 Mobile telecommunication device and data control server managing broadcasting program information, and method for managing broadcasting program information in mobile telecommunication device
US20070210985A1 (en) * 2006-03-13 2007-09-13 Royer George R Three dimensional imaging system
JP4302113B2 (en) * 2006-03-24 2009-07-22 三菱電機株式会社 In-vehicle control device
US7768527B2 (en) * 2006-05-31 2010-08-03 Beihang University Hardware-in-the-loop simulation system and method for computer vision
US20090167843A1 (en) * 2006-06-08 2009-07-02 Izzat Hekmat Izzat Two pass approach to three dimensional Reconstruction
US7907750B2 (en) * 2006-06-12 2011-03-15 Honeywell International Inc. System and method for autonomous object tracking
US7847808B2 (en) * 2006-07-19 2010-12-07 World Golf Tour, Inc. Photographic mapping in a simulation
US20080018792A1 (en) * 2006-07-19 2008-01-24 Kiran Bhat Systems and Methods for Interactive Surround Visual Field
GB2441365B (en) * 2006-09-04 2009-10-07 Nds Ltd Displaying video data
US20080071559A1 (en) * 2006-09-19 2008-03-20 Juha Arrasvuori Augmented reality assisted shopping
US8012023B2 (en) * 2006-09-28 2011-09-06 Microsoft Corporation Virtual entertainment
US9746912B2 (en) 2006-09-28 2017-08-29 Microsoft Technology Licensing, Llc Transformations for virtual guest representation
WO2008043036A1 (en) * 2006-10-04 2008-04-10 Rochester Institute Of Technology Aspect-ratio independent, multimedia capture and presentation systems and methods thereof
US20100157021A1 (en) * 2006-11-15 2010-06-24 Abraham Thomas G Method for creating, storing, and providing access to three-dimensionally scanned images
US8498497B2 (en) 2006-11-17 2013-07-30 Microsoft Corporation Swarm imaging
US9526995B2 (en) 2006-11-22 2016-12-27 Sony Interactive Entertainment America Llc Video game recording and playback with visual display of game controller manipulation
US10795457B2 (en) 2006-12-28 2020-10-06 D3D Technologies, Inc. Interactive 3D cursor
US11275242B1 (en) 2006-12-28 2022-03-15 Tipping Point Medical Images, Llc Method and apparatus for performing stereoscopic rotation of a volume on a head display unit
US11315307B1 (en) 2006-12-28 2022-04-26 Tipping Point Medical Images, Llc Method and apparatus for performing rotating viewpoints using a head display unit
US11228753B1 (en) 2006-12-28 2022-01-18 Robert Edwin Douglas Method and apparatus for performing stereoscopic zooming on a head display unit
CN101689394B (en) * 2007-02-01 2014-03-26 耶路撒冷希伯来大学伊森姆研究发展有限公司 Method and system for video indexing and video synopsis
EP2135197A4 (en) * 2007-03-02 2012-11-14 Organic Motion System and method for tracking three dimensional objects
US8593506B2 (en) * 2007-03-15 2013-11-26 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for forming a panoramic image of a scene having minimal aspect distortion
US20080252786A1 (en) * 2007-03-28 2008-10-16 Charles Keith Tilford Systems and methods for creating displays
US8599253B2 (en) * 2007-04-03 2013-12-03 Hewlett-Packard Development Company, L.P. Providing photographic images of live events to spectators
JP4561766B2 (en) * 2007-04-06 2010-10-13 株式会社デンソー Sound data search support device, sound data playback device, program
DE102007023739B4 (en) * 2007-05-16 2018-01-04 Seereal Technologies S.A. Method for rendering and generating color video holograms in real time and holographic display device
US9581965B2 (en) 2007-05-16 2017-02-28 Seereal Technologies S.A. Analytic method for computing video holograms in real time
DE102007023785B4 (en) * 2007-05-16 2014-06-18 Seereal Technologies S.A. Analytical method for calculating video holograms in real time and holographic display device
DE102007025069B4 (en) * 2007-05-21 2018-05-24 Seereal Technologies S.A. Holographic reconstruction system
CN101689309A (en) * 2007-06-29 2010-03-31 3M创新有限公司 The synchronized views of video data and three-dimensional modeling data
WO2009006605A2 (en) * 2007-07-03 2009-01-08 Pivotal Vision, Llc Motion-validating remote monitoring system
US7965866B2 (en) * 2007-07-03 2011-06-21 Shoppertrak Rct Corporation System and process for detecting, tracking and counting human objects of interest
US8235817B2 (en) * 2009-02-12 2012-08-07 Sony Computer Entertainment America Llc Object based observation
GB2452546B (en) * 2007-09-07 2012-03-21 Sony Corp Video processing system and method
US8973058B2 (en) * 2007-09-11 2015-03-03 The Directv Group, Inc. Method and system for monitoring and simultaneously displaying a plurality of signal channels in a communication system
MY162861A (en) 2007-09-24 2017-07-31 Koninklijke Philips Electronics N.V. Method and system for encoding a video data signal, encoded video data signal, method and system for decoding a video data signal
KR102204485B1 (en) 2007-09-26 2021-01-19 에이큐 미디어 인크 Audio-visual navigation and communication
WO2009041918A1 (en) * 2007-09-26 2009-04-02 Agency For Science, Technology And Research A method and system for generating an entirely well-focused image of a large three-dimensional scene
US7929804B2 (en) * 2007-10-03 2011-04-19 Mitsubishi Electric Research Laboratories, Inc. System and method for tracking objects with a synthetic aperture
US20090113505A1 (en) * 2007-10-26 2009-04-30 At&T Bls Intellectual Property, Inc. Systems, methods and computer products for multi-user access for integrated video
US20090132967A1 (en) * 2007-11-16 2009-05-21 Microsoft Corporation Linked-media narrative learning system
US8081186B2 (en) * 2007-11-16 2011-12-20 Microsoft Corporation Spatial exploration field of view preview mechanism
US8073190B2 (en) * 2007-11-16 2011-12-06 Sportvision, Inc. 3D textured objects for virtual viewpoint animations
US8584044B2 (en) * 2007-11-16 2013-11-12 Microsoft Corporation Localized thumbnail preview of related content during spatial browsing
US8049750B2 (en) * 2007-11-16 2011-11-01 Sportvision, Inc. Fading techniques for virtual viewpoint animations
US9041722B2 (en) * 2007-11-16 2015-05-26 Sportvision, Inc. Updating background texture for virtual viewpoint animations
US8466913B2 (en) * 2007-11-16 2013-06-18 Sportvision, Inc. User interface for accessing virtual viewpoint animations
US8154633B2 (en) * 2007-11-16 2012-04-10 Sportvision, Inc. Line removal and object detection in an image
EP2063648A1 (en) * 2007-11-24 2009-05-27 Barco NV Sensory unit for a 3-dimensional display
US8264505B2 (en) * 2007-12-28 2012-09-11 Microsoft Corporation Augmented reality and filtering
US9418474B2 (en) * 2008-01-04 2016-08-16 3M Innovative Properties Company Three-dimensional model refinement
US8452052B2 (en) * 2008-01-21 2013-05-28 The Boeing Company Modeling motion capture volumes with distance fields
US8390685B2 (en) * 2008-02-06 2013-03-05 International Business Machines Corporation Virtual fence
US8345097B2 (en) * 2008-02-15 2013-01-01 Harris Corporation Hybrid remote digital recording and acquisition system
US20090237492A1 (en) * 2008-03-18 2009-09-24 Invism, Inc. Enhanced stereoscopic immersive video recording and viewing
US8237791B2 (en) * 2008-03-19 2012-08-07 Microsoft Corporation Visualizing camera feeds on a map
US10060499B2 (en) 2009-01-07 2018-08-28 Fox Factory, Inc. Method and apparatus for an adjustable damper
US11306798B2 (en) 2008-05-09 2022-04-19 Fox Factory, Inc. Position sensitive suspension damping with an active valve
US9452654B2 (en) 2009-01-07 2016-09-27 Fox Factory, Inc. Method and apparatus for an adjustable damper
US20120305350A1 (en) 2011-05-31 2012-12-06 Ericksen Everet O Methods and apparatus for position sensitive suspension damping
US8627932B2 (en) 2009-01-07 2014-01-14 Fox Factory, Inc. Bypass for a suspension damper
US10047817B2 (en) 2009-01-07 2018-08-14 Fox Factory, Inc. Method and apparatus for an adjustable damper
US20100170760A1 (en) 2009-01-07 2010-07-08 John Marking Remotely Operated Bypass for a Suspension Damper
US9033122B2 (en) 2009-01-07 2015-05-19 Fox Factory, Inc. Method and apparatus for an adjustable damper
US8824801B2 (en) * 2008-05-16 2014-09-02 Microsoft Corporation Video processing
FR2933218B1 (en) * 2008-06-30 2011-02-11 Total Immersion Method and apparatus for real-time detection of interactions between a user and an augmented reality scene
US20100005028A1 (en) * 2008-07-07 2010-01-07 International Business Machines Corporation Method and apparatus for interconnecting a plurality of virtual world environments
US8786596B2 (en) * 2008-07-23 2014-07-22 Disney Enterprises, Inc. View point representation for 3-D scenes
EP2150057A3 (en) * 2008-07-29 2013-12-11 Gerald Curry Camera-based tracking and position determination for sporting events
US8393446B2 (en) 2008-08-25 2013-03-12 David M Haugen Methods and apparatus for suspension lock out and signal generation
JP2012501506A (en) * 2008-08-31 2012-01-19 ミツビシ エレクトリック ビジュアル ソリューションズ アメリカ, インコーポレイテッド Conversion of 3D video content that matches the viewer position
US20120075296A1 (en) * 2008-10-08 2012-03-29 Strider Labs, Inc. System and Method for Constructing a 3D Scene Model From an Image
US8108267B2 (en) * 2008-10-15 2012-01-31 Eli Varon Method of facilitating a sale of a product and/or a service
US9158823B2 (en) * 2008-10-15 2015-10-13 At&T Intellectual Property I, L.P. User interface monitoring in a multimedia content distribution network
US8396004B2 (en) 2008-11-10 2013-03-12 At&T Intellectual Property Ii, L.P. Video share model-based video fixing
US9140325B2 (en) 2009-03-19 2015-09-22 Fox Factory, Inc. Methods and apparatus for selective spring pre-load adjustment
US10036443B2 (en) 2009-03-19 2018-07-31 Fox Factory, Inc. Methods and apparatus for suspension adjustment
US9422018B2 (en) 2008-11-25 2016-08-23 Fox Factory, Inc. Seat post
EP3666347B1 (en) 2008-11-25 2021-10-20 Fox Factory, Inc. Computer usable storage medium for virtual competition
US9266017B1 (en) * 2008-12-03 2016-02-23 Electronic Arts Inc. Virtual playbook with user controls
US20100156906A1 (en) * 2008-12-19 2010-06-24 David Montgomery Shot generation from previsualization of a physical environment
EP2385705A4 (en) 2008-12-30 2011-12-21 Huawei Device Co Ltd Method and device for generating stereoscopic panoramic video stream, and method and device of video conference
CN101771830B (en) * 2008-12-30 2012-09-19 华为终端有限公司 Three-dimensional panoramic video stream generating method and equipment and video conference method and equipment
US10821795B2 (en) 2009-01-07 2020-11-03 Fox Factory, Inc. Method and apparatus for an adjustable damper
US9038791B2 (en) 2009-01-07 2015-05-26 Fox Factory, Inc. Compression isolator for a suspension damper
US11299233B2 (en) 2009-01-07 2022-04-12 Fox Factory, Inc. Method and apparatus for an adjustable damper
US8624962B2 (en) * 2009-02-02 2014-01-07 Ydreams—Informatica, S.A. Systems and methods for simulating three-dimensional virtual interactions from two-dimensional camera images
US9462030B2 (en) * 2009-03-04 2016-10-04 Jacquelynn R. Lueth System and method for providing a real-time three-dimensional digital impact virtual audience
US8599317B2 (en) * 2009-03-13 2013-12-03 Disney Enterprises, Inc. Scene recognition methods for virtual insertions
US8838335B2 (en) 2011-09-12 2014-09-16 Fox Factory, Inc. Methods and apparatus for suspension set up
US8936139B2 (en) 2009-03-19 2015-01-20 Fox Factory, Inc. Methods and apparatus for suspension adjustment
US8217993B2 (en) * 2009-03-20 2012-07-10 Cranial Technologies, Inc. Three-dimensional image capture system for subjects
US8103088B2 (en) * 2009-03-20 2012-01-24 Cranial Technologies, Inc. Three-dimensional image capture system
EP2415023A1 (en) * 2009-03-29 2012-02-08 Nomad3D SAS System and format for encoding data and three-dimensional rendering
WO2010116329A2 (en) * 2009-04-08 2010-10-14 Stergen Hi-Tech Ltd. Method and system for creating three-dimensional viewable video from a single video stream
WO2010119496A1 (en) * 2009-04-13 2010-10-21 富士通株式会社 Image processing device, image processing program, and image processing method
US8516534B2 (en) 2009-04-24 2013-08-20 At&T Intellectual Property I, Lp Method and apparatus for model-based recovery of packet loss errors
US9025007B1 (en) * 2009-04-28 2015-05-05 Lucasfilm Entertainment Company Ltd. Configuring stereo cameras
KR20100128233A (en) * 2009-05-27 2010-12-07 삼성전자주식회사 Method and apparatus for processing video image
JP2011101229A (en) * 2009-11-06 2011-05-19 Sony Corp Display control device, display control method, program, output device, and transmission apparatus
EP2287806B1 (en) * 2009-07-20 2018-08-29 Mediaproducción, S.L. Calibration method for a TV and video camera
US20110172550A1 (en) 2009-07-21 2011-07-14 Michael Scott Martin USPA: Systems and methods for EMS device communication interface
WO2011127459A1 (en) 2010-04-09 2011-10-13 Zoll Medical Corporation Systems and methods for EMS device communications interface
US8992315B2 (en) * 2009-07-27 2015-03-31 Obscura Digital, Inc. Automated enhancements for billiards and the like
US8727875B2 (en) * 2009-07-27 2014-05-20 Obscura Digital, Inc. Automated enhancements for billiards and the like
US8616971B2 (en) 2009-07-27 2013-12-31 Obscura Digital, Inc. Automated enhancements for billiards and the like
KR20110011000A (en) * 2009-07-27 2011-02-08 삼성전자주식회사 Method and apparatus for generating three-dimensional image datastream including additional information for displaying three-dimensional image, method and apparatus for receiving the same
JP5008703B2 (en) * 2009-08-18 2012-08-22 株式会社ソニー・コンピュータエンタテインメント Content creation support device, image processing device, content creation support method, image processing method, and data structure of image display content
US8922628B2 (en) * 2009-09-01 2014-12-30 Prime Focus Vfx Services Ii Inc. System and process for transforming two-dimensional images into three-dimensional images
EP2312180B1 (en) 2009-10-13 2019-09-18 Fox Factory, Inc. Apparatus for controlling a fluid damper
US8672106B2 (en) 2009-10-13 2014-03-18 Fox Factory, Inc. Self-regulating suspension
US8325187B2 (en) * 2009-10-22 2012-12-04 Samsung Electronics Co., Ltd. Method and device for real time 3D navigation in panoramic images and cylindrical spaces
GB2475730A (en) * 2009-11-27 2011-06-01 Sony Corp Transformation of occluding objects in 2D to 3D image generation
US8817078B2 (en) * 2009-11-30 2014-08-26 Disney Enterprises, Inc. Augmented reality videogame broadcast programming
IL202460A (en) 2009-12-01 2013-08-29 Rafael Advanced Defense Sys Method and system of generating a three-dimensional view of a real scene
EP2339537B1 (en) * 2009-12-23 2016-02-24 Metaio GmbH Method of determining reference features for use in an optical object initialization tracking process and object initialization tracking method
US10697514B2 (en) 2010-01-20 2020-06-30 Fox Factory, Inc. Remotely operated bypass for a suspension damper
JP2013080987A (en) * 2010-02-15 2013-05-02 Panasonic Corp Stereoscopic image display device
GB2477793A (en) * 2010-02-15 2011-08-17 Sony Corp A method of creating a stereoscopic image in a client device
US20110222757A1 (en) 2010-03-10 2011-09-15 Gbo 3D Technology Pte. Ltd. Systems and methods for 2D image and spatial data capture for 3D stereo imaging
US8213708B2 (en) * 2010-03-22 2012-07-03 Eastman Kodak Company Adjusting perspective for objects in stereoscopic images
JP2011205573A (en) 2010-03-26 2011-10-13 Sony Corp Control device, camera system, and program
WO2011121741A1 (en) * 2010-03-30 2011-10-06 富士通株式会社 Image generation device, image generation program, synthesis table generation device, and synthesis table generation program
US8896671B2 (en) 2010-04-09 2014-11-25 3D-4U, Inc. Apparatus and method for capturing images
KR20110116525A (en) * 2010-04-19 2011-10-26 엘지전자 주식회사 Image display device and operating method for the same
US11606615B2 (en) * 2010-04-27 2023-03-14 Comcast Cable Communications, Llc Remote user interface
KR101108145B1 (en) * 2010-05-07 2012-02-06 광주과학기술원 Apparatus and method for cost effective haptic-based Networked Virtual Environments with High Resolution Display
US20110310003A1 (en) * 2010-05-21 2011-12-22 Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V. Image display device and method of displaying images
US9183560B2 (en) * 2010-05-28 2015-11-10 Daniel H. Abelow Reality alternate
US8602887B2 (en) 2010-06-03 2013-12-10 Microsoft Corporation Synthesis of information from multiple audiovisual sources
US8694553B2 (en) 2010-06-07 2014-04-08 Gary Stephen Shuster Creation and use of virtual places
WO2011162227A1 (en) * 2010-06-24 2011-12-29 富士フイルム株式会社 Stereoscopic panoramic image synthesis device, image capturing device, stereoscopic panoramic image synthesis method, recording medium, and computer program
US9132352B1 (en) * 2010-06-24 2015-09-15 Gregory S. Rabin Interactive system and method for rendering an object
US8451346B2 (en) * 2010-06-30 2013-05-28 Apple Inc. Optically projected mosaic rendering
EP2402239B1 (en) 2010-07-02 2020-09-02 Fox Factory, Inc. Adjustable seat post
EP2413286A1 (en) 2010-07-29 2012-02-01 LiberoVision AG Image processing method and device for instant replay
JP5500255B2 (en) * 2010-08-06 2014-05-21 富士通株式会社 Image processing apparatus and image processing program
US9192110B2 (en) 2010-08-11 2015-11-24 The Toro Company Central irrigation control system
US9398315B2 (en) * 2010-09-15 2016-07-19 Samsung Electronics Co., Ltd. Multi-source video clip online assembly
US8533192B2 (en) 2010-09-16 2013-09-10 Alcatel Lucent Content capture device and methods for automatically tagging content
US8655881B2 (en) 2010-09-16 2014-02-18 Alcatel Lucent Method and apparatus for automatically tagging content
US8666978B2 (en) 2010-09-16 2014-03-04 Alcatel Lucent Method and apparatus for managing content tagging and tagged content
WO2012046239A2 (en) * 2010-10-06 2012-04-12 Nomad3D Sas Multiview 3d compression format and algorithms
US9736462B2 (en) * 2010-10-08 2017-08-15 SoliDDD Corp. Three-dimensional video production system
US20120105581A1 (en) * 2010-10-29 2012-05-03 Sony Corporation 2d to 3d image and video conversion using gps and dsm
US8890896B1 (en) 2010-11-02 2014-11-18 Google Inc. Image recognition in an augmented reality application
US8576276B2 (en) 2010-11-18 2013-11-05 Microsoft Corporation Head-mounted display device which provides surround video
KR101295712B1 (en) * 2010-11-22 2013-08-16 주식회사 팬택 Apparatus and Method for Providing Augmented Reality User Interface
US8817080B2 (en) 2010-12-02 2014-08-26 At&T Intellectual Property I, L.P. Location based media display
WO2012122269A2 (en) 2011-03-07 2012-09-13 Kba2, Inc. Systems and methods for analytic data gathering from image providers at an event or geographic location
JP5325248B2 (en) * 2011-03-18 2013-10-23 株式会社スクウェア・エニックス Video game processing apparatus and video game processing program
KR101779423B1 (en) * 2011-06-10 2017-10-10 엘지전자 주식회사 Method and apparatus for processing image
JP2013026808A (en) * 2011-07-21 2013-02-04 Sony Corp Image processing apparatus, image processing method, and program
US8928729B2 (en) * 2011-09-09 2015-01-06 Disney Enterprises, Inc. Systems and methods for converting video
US9639857B2 (en) * 2011-09-30 2017-05-02 Nokia Technologies Oy Method and apparatus for associating commenting information with one or more objects
JP6121647B2 (en) * 2011-11-11 2017-04-26 ソニー株式会社 Information processing apparatus, information processing method, and program
US20150049167A1 (en) * 2011-11-15 2015-02-19 Naoki Suzuki Photographic device and photographic system
US20130163854A1 (en) * 2011-12-23 2013-06-27 Chia-Ming Cheng Image processing method and associated apparatus
US11279199B2 (en) 2012-01-25 2022-03-22 Fox Factory, Inc. Suspension damper with by-pass valves
US8787726B2 (en) 2012-02-26 2014-07-22 Antonio Rossi Streaming video navigation systems and methods
KR101318552B1 (en) * 2012-03-12 2013-10-16 가톨릭대학교 산학협력단 Method for measuring recognition warping about 3d image
US9019316B2 (en) * 2012-04-15 2015-04-28 Trimble Navigation Limited Identifying a point of interest from different stations
US10330171B2 (en) 2012-05-10 2019-06-25 Fox Factory, Inc. Method and apparatus for an adjustable damper
US9153073B2 (en) * 2012-05-23 2015-10-06 Qualcomm Incorporated Spatially registered augmented video
US9747306B2 (en) * 2012-05-25 2017-08-29 Atheer, Inc. Method and apparatus for identifying input features for later recognition
US9873045B2 (en) 2012-05-25 2018-01-23 Electronic Arts, Inc. Systems and methods for a unified game experience
US20130321564A1 (en) 2012-05-31 2013-12-05 Microsoft Corporation Perspective-correct communication window with motion parallax
US9767598B2 (en) 2012-05-31 2017-09-19 Microsoft Technology Licensing, Llc Smoothing and robust normal estimation for 3D point clouds
US9846960B2 (en) 2012-05-31 2017-12-19 Microsoft Technology Licensing, Llc Automated camera array calibration
US9092896B2 (en) 2012-08-07 2015-07-28 Microsoft Technology Licensing, Llc Augmented reality display of scene behind surface
WO2014052802A2 (en) * 2012-09-28 2014-04-03 Zoll Medical Corporation Systems and methods for three-dimensional interaction monitoring in an EMS environment
US20140125702A1 (en) * 2012-10-15 2014-05-08 Fairways 360, Inc. System and method for generating an immersive virtual environment using real-time augmentation of geo-location information
US9648357B2 (en) * 2012-11-05 2017-05-09 Ati Technologies Ulc Method and device for providing a video stream for an object of interest
US9210385B2 (en) * 2012-11-20 2015-12-08 Pelco, Inc. Method and system for metadata extraction from master-slave cameras tracking system
US20140164282A1 (en) * 2012-12-10 2014-06-12 Tibco Software Inc. Enhanced augmented reality display for use by sales personnel
US8837906B2 (en) 2012-12-14 2014-09-16 Motorola Solutions, Inc. Computer assisted dispatch incident report video search and tagging systems and methods
US9571726B2 (en) 2012-12-20 2017-02-14 Google Inc. Generating attention information from photos
US9116926B2 (en) 2012-12-20 2015-08-25 Google Inc. Sharing photos
US9224036B2 (en) 2012-12-20 2015-12-29 Google Inc. Generating static scenes
JP6182917B2 (en) * 2013-03-15 2017-08-23 ノーリツプレシジョン株式会社 Monitoring device
WO2014145722A2 (en) * 2013-03-15 2014-09-18 Digimarc Corporation Cooperative photography
JP6062039B2 (en) * 2013-04-04 2017-01-18 株式会社Amatel Image processing system and image processing program
US9264474B2 (en) 2013-05-07 2016-02-16 KBA2 Inc. System and method of portraying the shifting level of interest in an object or location
JP2014236874A (en) 2013-06-07 2014-12-18 任天堂株式会社 Information processing system, server device, information processor, server program, information processing program, and information processing method
JP6180802B2 (en) 2013-06-07 2017-08-16 任天堂株式会社 Information processing system, information processing apparatus, information processing program, and information display method
US9776085B2 (en) 2013-06-07 2017-10-03 Nintendo Co., Ltd. Information processing system, information processing device, server machine, recording medium and information processing method
US9727667B2 (en) 2013-06-10 2017-08-08 Honeywell International Inc. Generating a three dimensional building management system
CN105284108B (en) * 2013-06-14 2019-04-02 株式会社日立制作所 Image monitoring system, monitoring arrangement
KR102096398B1 (en) * 2013-07-03 2020-04-03 삼성전자주식회사 Method for recognizing position of autonomous mobile robot
EP2824913A1 (en) * 2013-07-09 2015-01-14 Alcatel Lucent A method for generating an immersive video of a plurality of persons
US10500479B1 (en) * 2013-08-26 2019-12-10 Venuenext, Inc. Game state-sensitive selection of media sources for media coverage of a sporting event
US9426365B2 (en) 2013-11-01 2016-08-23 The Lightco Inc. Image stabilization related methods and apparatus
WO2015066571A1 (en) 2013-11-01 2015-05-07 The Lightco Inc. Methods and apparatus relating to image stabilization
US10441876B2 (en) * 2013-12-20 2019-10-15 Activision Publishing, Inc. Video game integrating recorded video
CN103761734B (en) * 2014-01-08 2016-09-28 北京航空航天大学 Binocular stereoscopic video scene fusion method preserving temporal consistency
CA2938973A1 (en) 2014-02-08 2015-08-13 Pictometry International Corp. Method and system for displaying room interiors on a floor plan
US9417911B2 (en) * 2014-03-12 2016-08-16 Live Planet Llc Systems and methods for scalable asynchronous computing framework
US9677840B2 (en) 2014-03-14 2017-06-13 Lineweight Llc Augmented reality simulator
JP5948508B2 (en) * 2014-05-08 2016-07-06 オリンパス株式会社 Video processor and method of operating video processor
US9723109B2 (en) 2014-05-28 2017-08-01 Alexander Hertel Platform for constructing and consuming realm and object feature clouds
US10089785B2 (en) * 2014-07-25 2018-10-02 mindHIVE Inc. Real-time immersive mediated reality experiences
JP2016046642A (en) * 2014-08-21 2016-04-04 キヤノン株式会社 Information processing system, information processing method, and program
CN107003600A (en) * 2014-09-15 2017-08-01 Dmitry Gorilovsky System comprising multiple digital cameras observing a large scene
US9977495B2 (en) * 2014-09-19 2018-05-22 Utherverse Digital Inc. Immersive displays
US10262426B2 (en) 2014-10-31 2019-04-16 Fyusion, Inc. System and method for infinite smoothing of image sequences
US10275935B2 (en) 2014-10-31 2019-04-30 Fyusion, Inc. System and method for infinite synthetic image generation from multi-directional structured image array
KR20160058519A (en) * 2014-11-17 2016-05-25 삼성전자주식회사 Image processing for multiple images
US20160198140A1 (en) * 2015-01-06 2016-07-07 3DOO, Inc. System and method for preemptive and adaptive 360 degree immersive video streaming
US9836895B1 (en) * 2015-06-19 2017-12-05 Waymo Llc Simulating virtual objects
US10701282B2 (en) * 2015-06-24 2020-06-30 Intel Corporation View interpolation for visual storytelling
US10628009B2 (en) 2015-06-26 2020-04-21 Rovi Guides, Inc. Systems and methods for automatic formatting of images for media assets based on user profile
US10672186B2 (en) * 2015-06-30 2020-06-02 Mapillary Ab Method in constructing a model of a scenery and device therefor
EP3113159A1 (en) * 2015-06-30 2017-01-04 Thomson Licensing Method and device for processing a part of an immersive video content according to the position of reference parts
US10222932B2 (en) 2015-07-15 2019-03-05 Fyusion, Inc. Virtual reality environment based manipulation of multilayered multi-view interactive digital media representations
US10147211B2 (en) 2015-07-15 2018-12-04 Fyusion, Inc. Artificially rendering images using viewpoint interpolation and extrapolation
US11095869B2 (en) 2015-09-22 2021-08-17 Fyusion, Inc. System and method for generating combined embedded multi-view interactive digital media representations
US10242474B2 (en) 2015-07-15 2019-03-26 Fyusion, Inc. Artificially rendering images using viewpoint interpolation and extrapolation
US10852902B2 (en) 2015-07-15 2020-12-01 Fyusion, Inc. Automatic tagging of objects on a multi-view interactive digital media representation of a dynamic entity
TWI547177B (en) * 2015-08-11 2016-08-21 晶睿通訊股份有限公司 Viewing Angle Switching Method and Camera Therefor
US10636206B2 (en) 2015-08-14 2020-04-28 Metail Limited Method and system for generating an image file of a 3D garment model on a 3D body model
GB2546572B (en) * 2015-08-14 2019-12-04 Metail Ltd Method and system for generating an image file of a 3D garment model on a 3D body model
US10291845B2 (en) * 2015-08-17 2019-05-14 Nokia Technologies Oy Method, apparatus, and computer program product for personalized depth of field omnidirectional video
US11783864B2 (en) 2015-09-22 2023-10-10 Fyusion, Inc. Integration of audio into a multi-view interactive digital media representation
US10791285B2 (en) * 2015-10-05 2020-09-29 Woncheol Choi Virtual flying camera system
JP6696149B2 (en) * 2015-10-29 2020-05-20 富士通株式会社 Image generation method, image generation program, information processing device, and display control method
WO2017075614A1 (en) 2015-10-29 2017-05-04 Oy Vulcan Vision Corporation Video imaging an area of interest using networked cameras
US10769849B2 (en) * 2015-11-04 2020-09-08 Intel Corporation Use of temporal motion vectors for 3D reconstruction
US9473758B1 (en) * 2015-12-06 2016-10-18 Sliver VR Technologies, Inc. Methods and systems for game video recording and virtual reality replay
JP6674247B2 (en) * 2015-12-14 2020-04-01 キヤノン株式会社 Information processing apparatus, information processing method, and computer program
US10839717B2 (en) * 2016-01-11 2020-11-17 Illinois Tool Works Inc. Weld training systems to synchronize weld data for presentation
EP3196838A1 (en) * 2016-01-25 2017-07-26 Nokia Technologies Oy An apparatus and associated methods
US10347102B2 (en) 2016-03-22 2019-07-09 Sensormatic Electronics, LLC Method and system for surveillance camera arbitration of uplink consumption
US10764539B2 (en) 2016-03-22 2020-09-01 Sensormatic Electronics, LLC System and method for using mobile device of zone and correlated motion detection
US9965680B2 (en) 2016-03-22 2018-05-08 Sensormatic Electronics, LLC Method and system for conveying data from monitored scene via surveillance cameras
US10475315B2 (en) 2016-03-22 2019-11-12 Sensormatic Electronics, LLC System and method for configuring surveillance cameras using mobile computing devices
US11601583B2 (en) 2016-03-22 2023-03-07 Johnson Controls Tyco IP Holdings LLP System and method for controlling surveillance cameras
US10192414B2 (en) * 2016-03-22 2019-01-29 Sensormatic Electronics, LLC System and method for overlap detection in surveillance camera network
US11216847B2 (en) 2016-03-22 2022-01-04 Sensormatic Electronics, LLC System and method for retail customer tracking in surveillance camera network
US10733231B2 (en) 2016-03-22 2020-08-04 Sensormatic Electronics, LLC Method and system for modeling image of interest to users
US10318836B2 (en) * 2016-03-22 2019-06-11 Sensormatic Electronics, LLC System and method for designating surveillance camera regions of interest
US10665071B2 (en) 2016-03-22 2020-05-26 Sensormatic Electronics, LLC System and method for deadzone detection in surveillance camera network
US10319071B2 (en) * 2016-03-23 2019-06-11 Qualcomm Incorporated Truncated square pyramid geometry and frame packing structure for representing virtual reality video content
US10737546B2 (en) 2016-04-08 2020-08-11 Fox Factory, Inc. Electronic compression and rebound control
JP6735592B2 (en) * 2016-04-08 2020-08-05 キヤノン株式会社 Image processing apparatus, control method thereof, and image processing system
US10523929B2 (en) * 2016-04-27 2019-12-31 Disney Enterprises, Inc. Systems and methods for creating an immersive video content environment
US10848743B2 (en) * 2016-06-10 2020-11-24 Lucid VR, Inc. 3D Camera calibration for adjustable camera settings
JP6400879B2 (en) * 2016-07-15 2018-10-03 ナーブ株式会社 Image display device and image display system
JP6938123B2 (en) * 2016-09-01 2021-09-22 キヤノン株式会社 Display control device, display control method and program
US11202017B2 (en) 2016-10-06 2021-12-14 Fyusion, Inc. Live style transfer on a mobile device
WO2018095366A1 (en) * 2016-11-24 2018-05-31 腾讯科技(深圳)有限公司 Frame-synchronisation-based data processing method for video recommendation determination and information display
GB2556910A (en) * 2016-11-25 2018-06-13 Nokia Technologies Oy Virtual reality display
EP4030767A1 (en) * 2016-11-30 2022-07-20 Panasonic Intellectual Property Corporation of America Three-dimensional model distribution method and three-dimensional model distribution device
CN107223245A (en) * 2016-12-27 2017-09-29 深圳前海达闼云端智能科技有限公司 A kind of data display processing method and device
CN110114803B (en) * 2016-12-28 2023-06-27 松下电器(美国)知识产权公司 Three-dimensional model distribution method, three-dimensional model reception method, three-dimensional model distribution device, and three-dimensional model reception device
US10358180B2 (en) 2017-01-05 2019-07-23 Sram, Llc Adjustable seatpost
WO2018147329A1 (en) * 2017-02-10 2018-08-16 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Free-viewpoint image generation method and free-viewpoint image generation system
US10666923B2 (en) * 2017-02-24 2020-05-26 Immervision, Inc. Wide-angle stereoscopic vision with cameras having different parameters
JP7212611B2 (en) * 2017-02-27 2023-01-25 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Image delivery method, image display method, image delivery device and image display device
JP6730613B2 (en) * 2017-02-28 2020-07-29 株式会社Jvcケンウッド Overhead video generation device, overhead video generation system, overhead video generation method and program
US10306212B2 (en) * 2017-03-31 2019-05-28 Verizon Patent And Licensing Inc. Methods and systems for capturing a plurality of three-dimensional sub-frames for use in forming a volumetric frame of a real-world scene
WO2018199052A1 (en) * 2017-04-25 2018-11-01 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Image display method and image display device
CN110869980B (en) * 2017-05-18 2024-01-09 交互数字Vc控股公司 Distributing and rendering content as a spherical video and 3D portfolio
US10313651B2 (en) 2017-05-22 2019-06-04 Fyusion, Inc. Snapshots at predefined intervals or angles
CN108933913A (en) * 2017-05-24 2018-12-04 中兴通讯股份有限公司 Video conference implementation method, device, system and computer storage medium
US11272160B2 (en) * 2017-06-15 2022-03-08 Lenovo (Singapore) Pte. Ltd. Tracking a point of interest in a panoramic video
US11069147B2 (en) 2017-06-26 2021-07-20 Fyusion, Inc. Modification of multi-view interactive digital media representation
US20190012822A1 (en) * 2017-07-05 2019-01-10 Cinova Media Virtual reality system with advanced low-complexity user interactivity and personalization through cloud-based data-mining and machine learning
GB2564642A (en) * 2017-07-10 2019-01-23 Nokia Technologies Oy Methods and apparatuses for panoramic image processing
US20200213647A1 (en) * 2017-08-03 2020-07-02 David G. Perryman Method and system for providing an electronic video camera generated live view or views to a user's screen
US11109066B2 (en) 2017-08-15 2021-08-31 Nokia Technologies Oy Encoding and decoding of volumetric video
EP3669333A4 (en) * 2017-08-15 2021-04-07 Nokia Technologies Oy Sequential encoding and decoding of volumetric video
KR102336997B1 (en) 2017-08-16 2021-12-08 삼성전자 주식회사 Server, display apparatus and control method thereof
JP6921686B2 (en) * 2017-08-30 2021-08-18 キヤノン株式会社 Generator, generation method, and program
US10347045B1 (en) 2017-09-29 2019-07-09 A9.Com, Inc. Creating multi-dimensional object representations
US10380798B2 (en) * 2017-09-29 2019-08-13 Sony Interactive Entertainment America Llc Projectile object rendering for a virtual reality spectator
US10593066B1 (en) 2017-09-29 2020-03-17 A9.Com, Inc. Compression of multi-dimensional object representations
CN109801351B (en) * 2017-11-15 2023-04-14 阿里巴巴集团控股有限公司 Dynamic image generation method and processing device
US10697757B2 (en) * 2017-12-22 2020-06-30 Symbol Technologies, Llc Container auto-dimensioning
US10592747B2 (en) 2018-04-26 2020-03-17 Fyusion, Inc. Method and apparatus for 3-D auto tagging
WO2019213497A1 (en) 2018-05-03 2019-11-07 Scotty Labs Virtual vehicle control system
US20210258503A1 (en) * 2018-06-13 2021-08-19 Pelco, Inc. Systems and methods for tracking a viewing area of a camera device
US10540824B1 (en) 2018-07-09 2020-01-21 Microsoft Technology Licensing, Llc 3-D transitions
US10740986B2 (en) * 2018-08-30 2020-08-11 Qualcomm Incorporated Systems and methods for reconstructing a moving three-dimensional object
US10841509B2 (en) * 2018-10-22 2020-11-17 At&T Intellectual Property I, L.P. Camera array orchestration
US10818076B2 (en) * 2018-10-26 2020-10-27 Aaron Bradley Epstein Immersive environment from video
CN109634413B (en) * 2018-12-05 2021-06-11 腾讯科技(深圳)有限公司 Method, device and storage medium for observing virtual environment
US11074697B2 (en) 2019-04-16 2021-07-27 At&T Intellectual Property I, L.P. Selecting viewpoints for rendering in volumetric video presentations
US11153492B2 (en) 2019-04-16 2021-10-19 At&T Intellectual Property I, L.P. Selecting spectator viewpoints in volumetric video presentations of live events
US10970519B2 (en) 2019-04-16 2021-04-06 At&T Intellectual Property I, L.P. Validating objects in volumetric video presentations
US11012675B2 (en) 2019-04-16 2021-05-18 At&T Intellectual Property I, L.P. Automatic selection of viewpoint characteristics and trajectories in volumetric video presentations
CN110211222B (en) * 2019-05-07 2023-08-01 谷东科技有限公司 AR immersion type tour guide method and device, storage medium and terminal equipment
JP7418101B2 (en) * 2019-07-26 2024-01-19 キヤノン株式会社 Information processing device, information processing method, and program
US11023729B1 (en) * 2019-11-08 2021-06-01 Msg Entertainment Group, Llc Providing visual guidance for presenting visual content in a venue
US11284824B2 (en) * 2019-12-02 2022-03-29 Everseen Limited Method and system for determining a human social behavior classification
EP3861720B1 (en) 2019-12-03 2023-07-26 Discovery Communications, LLC Non-intrusive 360 view without camera at the viewpoint
US20210105451A1 (en) * 2019-12-23 2021-04-08 Intel Corporation Scene construction using object-based immersive media
CN111784655B (en) * 2020-06-24 2023-11-24 江苏科技大学 Underwater robot recovery and positioning method
CN112087578A (en) * 2020-09-14 2020-12-15 深圳移动互联研究院有限公司 Cross-region evidence collection method and device, computer equipment and storage medium
US11622100B2 (en) * 2021-02-17 2023-04-04 flexxCOACH VR 360-degree virtual-reality system for dynamic events
US11657578B2 (en) 2021-03-11 2023-05-23 Quintar, Inc. Registration for augmented reality system for viewing an event
US20220295139A1 (en) * 2021-03-11 2022-09-15 Quintar, Inc. Augmented reality system for viewing an event with multiple coordinate systems and automatically generated model
US11645819B2 (en) 2021-03-11 2023-05-09 Quintar, Inc. Augmented reality system for viewing an event with mode based on crowd sourced images
US20220295040A1 (en) * 2021-03-11 2022-09-15 Quintar, Inc. Augmented reality system with remote presentation including 3d graphics extending beyond frame
US11527047B2 (en) 2021-03-11 2022-12-13 Quintar, Inc. Augmented reality system for viewing an event with distributed computing
US20220406003A1 (en) * 2021-06-17 2022-12-22 Fyusion, Inc. Viewpoint path stabilization
CN113572978A (en) * 2021-07-30 2021-10-29 北京房江湖科技有限公司 Panoramic video generation method and device
US11671575B2 (en) * 2021-09-09 2023-06-06 At&T Intellectual Property I, L.P. Compositing non-immersive media content to generate an adaptable immersive content metaverse
US20230128826A1 (en) * 2021-10-22 2023-04-27 Tencent America LLC Generating holographic or lightfield views using crowdsourcing
CN114302060A (en) * 2021-12-30 2022-04-08 苏州瀚易特信息技术股份有限公司 Real-time generation method for 360-degree VR panoramic image and video
CN114679549B (en) * 2022-05-27 2022-09-02 潍坊幻视软件科技有限公司 Cross-platform video communication method
CN115311409A (en) * 2022-06-26 2022-11-08 杭州美创科技有限公司 WEBGL-based electromechanical system visualization method and device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5267329A (en) * 1990-08-10 1993-11-30 Kaman Aerospace Corporation Process for automatically detecting and locating a target from a plurality of two dimensional images

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5490239A (en) * 1992-10-01 1996-02-06 University Corporation For Atmospheric Research Virtual reality imaging system
US5495576A (en) * 1993-01-11 1996-02-27 Ritchey; Kurtis J. Panoramic image based virtual reality/telepresence audio-visual system and method

Cited By (113)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0785532A3 (en) * 1996-01-16 1998-07-29 University Corporation For Atmospheric Research Virtual reality imaging system
US5982362A (en) * 1996-05-30 1999-11-09 Control Technology Corporation Video interface architecture for programmable industrial control systems
US7146408B1 (en) 1996-05-30 2006-12-05 Schneider Automation Inc. Method and system for monitoring a controller and displaying data from the controller in a format provided by the controller
US6570587B1 (en) 1996-07-26 2003-05-27 Veon Ltd. System and method for linking information to a video
WO1998046029A1 (en) * 1997-04-04 1998-10-15 Orad Hi-Tec Systems Limited Graphical video systems
US6380933B1 (en) 1997-04-04 2002-04-30 Orad Hi-Tec Systems Limited Graphical video system
WO1998045812A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Integrating live/recorded sources into a three-dimensional environment for media productions
WO1998045814A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Iterative process for three-dimensional image generation
WO1998045813A1 (en) * 1997-04-07 1998-10-15 Synapix, Inc. Media production with correlation of image stream and abstract objects in a three-dimensional virtual stage
US6084590A (en) * 1997-04-07 2000-07-04 Synapix, Inc. Media production with correlation of image stream and abstract objects in a three-dimensional virtual stage
US6124864A (en) * 1997-04-07 2000-09-26 Synapix, Inc. Adaptive modeling and segmentation of visual image streams
US6160907A (en) * 1997-04-07 2000-12-12 Synapix, Inc. Iterative three-dimensional process for creating finished media content
WO1998050834A1 (en) * 1997-05-06 1998-11-12 Control Technology Corporation Distributed interface architecture for programmable industrial control systems
AU747729B2 (en) * 1997-05-06 2002-05-23 Schneider Automation Inc. Distributed interface architecture for programmable industrial control systems
CN1107292C (en) * 1997-06-20 2003-04-30 日本电信电话株式会社 Scheme for interactive video manipulation and display of moving object on background image
SG83686A1 (en) * 1997-09-12 2001-10-16 Matsushita Electric Ind Co Ltd Virtual www server for enabling a single display screen of a browser to be utilized to concurrently display data of a plurality of files which are obtained from respective servers and to send commands to these servers.
US6421459B1 (en) 1997-09-16 2002-07-16 Canon Kabushiki Kaisha Image processing apparatus
EP0903695A1 (en) * 1997-09-16 1999-03-24 Canon Kabushiki Kaisha Image processing apparatus
US6914599B1 (en) 1998-01-14 2005-07-05 Canon Kabushiki Kaisha Image processing apparatus
EP0930585A1 (en) * 1998-01-14 1999-07-21 Canon Kabushiki Kaisha Image processing apparatus
EP1064817A1 (en) * 1998-08-07 2001-01-03 Be Here Corporation Method and apparatus for electronically distributing motion panoramic images
EP1064817A4 (en) * 1998-08-07 2003-02-12 Be Here Corp Method and apparatus for electronically distributing motion panoramic images
US6853867B1 (en) 1998-12-30 2005-02-08 Schneider Automation Inc. Interface to a programmable logic controller
US7984420B2 (en) * 1999-05-17 2011-07-19 Invensys Systems, Inc. Control systems and methods with composite blocks
US7650058B1 (en) 2001-11-08 2010-01-19 Cernium Corporation Object selective video recording
US7397929B2 (en) 2002-09-05 2008-07-08 Cognex Technology And Investment Corporation Method and apparatus for monitoring a passageway using 3D images
US7400744B2 (en) 2002-09-05 2008-07-15 Cognex Technology And Investment Corporation Stereo door sensor
GB2403364A (en) * 2003-06-24 2004-12-29 Christopher Paul Casson Virtual scene generating system
WO2005048200A2 (en) * 2003-11-05 2005-05-26 Cognex Corporation Method and system for enhanced portal security through stereoscopy
WO2005048200A3 (en) * 2003-11-05 2005-12-15 Cognex Corp Method and system for enhanced portal security through stereoscopy
US8326084B1 (en) 2003-11-05 2012-12-04 Cognex Technology And Investment Corporation System and method of auto-exposure control for image acquisition hardware using three dimensional information
US7623674B2 (en) 2003-11-05 2009-11-24 Cognex Technology And Investment Corporation Method and system for enhanced portal security through stereoscopy
US7614083B2 (en) 2004-03-01 2009-11-03 Invensys Systems, Inc. Process control methods and apparatus for intrusion detection, protection and network hardening
WO2007014216A2 (en) * 2005-07-22 2007-02-01 Cernium Corporation Directed attention digital video recordation
US8026945B2 (en) 2005-07-22 2011-09-27 Cernium Corporation Directed attention digital video recordation
WO2007014216A3 (en) * 2005-07-22 2007-12-06 Cernium Corp Directed attention digital video recordation
EP2034441A1 (en) * 2007-09-05 2009-03-11 Sony Corporation System and method for communicating a representation of a scene
US8355532B2 (en) 2007-09-05 2013-01-15 Sony Corporation System for communicating and method
US11172209B2 (en) 2008-11-17 2021-11-09 Checkvideo Llc Analytics-modulated coding of surveillance video
US9215467B2 (en) 2008-11-17 2015-12-15 Checkvideo Llc Analytics-modulated coding of surveillance video
US8854457B2 (en) 2009-05-07 2014-10-07 Universite Catholique De Louvain Systems and methods for the autonomous production of videos from multi-sensored data
WO2010127418A1 (en) * 2009-05-07 2010-11-11 Universite Catholique De Louvain Systems and methods for the autonomous production of videos from multi-sensored data
US9317133B2 (en) 2010-10-08 2016-04-19 Nokia Technologies Oy Method and apparatus for generating augmented reality content
US20120293606A1 (en) * 2011-05-20 2012-11-22 Microsoft Corporation Techniques and system for automatic video conference camera feed selection based on room events
CN102819413A (en) * 2011-06-08 2012-12-12 索尼公司 Display control device, display control method, program, and recording medium
EP2533533A1 (en) * 2011-06-08 2012-12-12 Sony Corporation Display Control Device, Display Control Method, Program, and Recording Medium
EP2791909A4 (en) * 2011-12-16 2015-06-24 Thomson Licensing Method and apparatus for generating 3d free viewpoint video
EP2634772A1 (en) * 2012-02-28 2013-09-04 BlackBerry Limited Methods and devices for selecting objects in images
US11631227B2 (en) 2012-02-28 2023-04-18 Blackberry Limited Methods and devices for selecting objects in images
US9558575B2 (en) 2012-02-28 2017-01-31 Blackberry Limited Methods and devices for selecting objects in images
US11069154B2 (en) 2012-02-28 2021-07-20 Blackberry Limited Methods and devices for selecting objects in images
US10319152B2 (en) 2012-02-28 2019-06-11 Blackberry Limited Methods and devices for selecting objects in images
US10657730B2 (en) 2012-02-28 2020-05-19 Blackberry Limited Methods and devices for manipulating an identified background portion of an image
US11023736B2 (en) 2014-02-28 2021-06-01 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US10755102B2 (en) 2014-02-28 2020-08-25 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US11861906B2 (en) 2014-02-28 2024-01-02 Genius Sports Ss, Llc Data processing systems and methods for enhanced augmentation of interactive video content
US11861905B2 (en) 2014-02-28 2024-01-02 Genius Sports Ss, Llc Methods and systems of spatiotemporal pattern recognition for video content development
US10997425B2 (en) 2014-02-28 2021-05-04 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US11120271B2 (en) 2014-02-28 2021-09-14 Second Spectrum, Inc. Data processing systems and methods for enhanced augmentation of interactive video content
US10832057B2 (en) 2014-02-28 2020-11-10 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US10769446B2 (en) 2014-02-28 2020-09-08 Second Spectrum, Inc. Methods and systems of combining video content with one or more augmentations
US10762351B2 (en) 2014-02-28 2020-09-01 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US10755103B2 (en) 2014-02-28 2020-08-25 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US10748008B2 (en) 2014-02-28 2020-08-18 Second Spectrum, Inc. Methods and systems of spatiotemporal pattern recognition for video content development
US11275949B2 (en) 2014-02-28 2022-03-15 Second Spectrum, Inc. Methods, systems, and user interface navigation of video content based spatiotemporal pattern recognition
US11373405B2 (en) 2014-02-28 2022-06-28 Second Spectrum, Inc. Methods and systems of combining video content with one or more augmentations to produce augmented video
US11380101B2 (en) 2014-02-28 2022-07-05 Second Spectrum, Inc. Data processing systems and methods for generating interactive user interfaces and interactive game systems based on spatiotemporal analysis of video content
US10313656B2 (en) 2014-09-22 2019-06-04 Samsung Electronics Company Ltd. Image stitching for three-dimensional video
EP3007038A3 (en) * 2014-09-22 2016-06-22 Samsung Electronics Co., Ltd. Interaction with three-dimensional video
US11205305B2 (en) 2014-09-22 2021-12-21 Samsung Electronics Company, Ltd. Presentation of three-dimensional video
US10547825B2 (en) 2014-09-22 2020-01-28 Samsung Electronics Company, Ltd. Transmission of three-dimensional video
US10750153B2 (en) 2014-09-22 2020-08-18 Samsung Electronics Company, Ltd. Camera system for three-dimensional video
US10257494B2 (en) 2014-09-22 2019-04-09 Samsung Electronics Co., Ltd. Reconstruction of three-dimensional video
EP3338106A4 (en) * 2015-08-17 2019-04-03 C360 Technologies, Inc. Generating objects in real time panoramic video
US10623636B2 (en) 2015-08-17 2020-04-14 C360 Technologies, Inc. Generating objects in real time panoramic video
GB2561946B (en) * 2015-08-17 2021-11-17 C360 Tech Inc Generating objects in real time panoramic video
EP3151554A1 (en) * 2015-09-30 2017-04-05 Calay Venture S.a.r.l. Presence camera
CN108141578A (en) * 2015-09-30 2018-06-08 卡雷风险投资有限责任公司 Presence camera
US11196972B2 (en) 2015-09-30 2021-12-07 Tmrw Foundation Ip S. À R.L. Presence camera
WO2017054925A1 (en) * 2015-09-30 2017-04-06 Calay Venture S.A.R.L. Presence camera
US11006073B1 (en) 2015-12-22 2021-05-11 Steelcase Inc. Virtual world method and system for affecting mind state
US10404938B1 (en) 2015-12-22 2019-09-03 Steelcase Inc. Virtual world method and system for affecting mind state
US11856326B1 (en) 2015-12-22 2023-12-26 Steelcase Inc. Virtual world method and system for affecting mind state
US11490051B1 (en) 2015-12-22 2022-11-01 Steelcase Inc. Virtual world method and system for affecting mind state
US11222469B1 (en) 2016-02-17 2022-01-11 Steelcase Inc. Virtual affordance sales tool
US10181218B1 (en) 2016-02-17 2019-01-15 Steelcase Inc. Virtual affordance sales tool
US10984597B1 (en) 2016-02-17 2021-04-20 Steelcase Inc. Virtual affordance sales tool
US11521355B1 (en) 2016-02-17 2022-12-06 Steelcase Inc. Virtual affordance sales tool
US10614625B1 (en) 2016-02-17 2020-04-07 Steelcase, Inc. Virtual affordance sales tool
CN109275358A (en) * 2016-05-25 2019-01-25 佳能株式会社 Method and apparatus for generating a virtual image from a camera array with a daisy chain connection according to a viewpoint selected by the user
US10848748B2 (en) 2016-05-25 2020-11-24 Canon Kabushiki Kaisha Method for generating virtual viewpoint image and image processing apparatus
CN109275358B (en) * 2016-05-25 2020-07-10 佳能株式会社 Method and apparatus for generating virtual images from an array of cameras having a daisy chain connection according to a viewpoint selected by a user
US10277890B2 (en) 2016-06-17 2019-04-30 Dustin Kerstein System and method for capturing and viewing panoramic images having motion parallax depth perception without image stitching
US10594999B2 (en) 2016-06-23 2020-03-17 Interdigital Ce Patent Holdings Method and apparatus for creating a pair of stereoscopic images using at least one lightfield camera
EP3264761A1 (en) * 2016-06-23 2018-01-03 Thomson Licensing A method and apparatus for creating a pair of stereoscopic images using at least one lightfield camera
US10805558B2 (en) 2016-10-14 2020-10-13 Uniqfeed Ag System for producing augmented images
US10832732B2 (en) 2016-10-14 2020-11-10 Uniqfeed Ag Television broadcast system for generating augmented images
US10740905B2 (en) 2016-10-14 2020-08-11 Uniqfeed Ag System for dynamically maximizing the contrast between the foreground and background in images and/or image sequences
DE102016119637A1 (en) * 2016-10-14 2018-04-19 Uniqfeed Ag Television transmission system for generating enriched images
US11178360B1 (en) 2016-12-15 2021-11-16 Steelcase Inc. Systems and methods for implementing augmented reality and/or virtual reality
US10182210B1 (en) 2016-12-15 2019-01-15 Steelcase Inc. Systems and methods for implementing augmented reality and/or virtual reality
US11863907B1 (en) 2016-12-15 2024-01-02 Steelcase Inc. Systems and methods for implementing augmented reality and/or virtual reality
US10659733B1 (en) 2016-12-15 2020-05-19 Steelcase Inc. Systems and methods for implementing augmented reality and/or virtual reality
US11049218B2 (en) 2017-08-11 2021-06-29 Samsung Electronics Company, Ltd. Seamless image stitching
EP3882866A4 (en) * 2018-11-14 2022-08-10 Canon Kabushiki Kaisha Information processing system, information processing method, and program
US20220321856A1 (en) * 2018-11-14 2022-10-06 Canon Kabushiki Kaisha Information processing system, information processing method, and storage medium
US11778244B2 (en) 2019-11-08 2023-10-03 Genius Sports Ss, Llc Determining tactical relevance and similarity of video sequences
US11113535B2 (en) 2019-11-08 2021-09-07 Second Spectrum, Inc. Determining tactical relevance and similarity of video sequences
US11380014B2 (en) 2020-03-17 2022-07-05 Aptiv Technologies Limited Control modules and methods
CN111580670B (en) * 2020-05-12 2023-06-30 黑龙江工程学院 Garden landscape implementation method based on virtual reality
CN111580670A (en) * 2020-05-12 2020-08-25 黑龙江工程学院 Garden landscape implementation method based on virtual reality
EP4122767A1 (en) * 2021-07-22 2023-01-25 Continental Automotive Systems, Inc. Vehicle mirror image simulation
CN113938653A (en) * 2021-10-12 2022-01-14 钱保军 Multi-video monitoring display method based on AR echelon cascade

Also Published As

Publication number Publication date
US5850352A (en) 1998-12-15
AU5380296A (en) 1996-10-16
WO1996031047A3 (en) 1996-11-07

Similar Documents

Publication Publication Date Title
US5850352A (en) Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images
US5745126A (en) Machine synthesis of a virtual video camera/image of a scene from multiple video cameras/images of the scene in accordance with a particular perspective on the scene, an object in the scene, or an event in the scene
US11776199B2 (en) Virtual reality environment based manipulation of multi-layered multi-view interactive digital media representations
Kelly et al. An architecture for multiple perspective interactive video
KR101323966B1 (en) A system and method for 3D space-dimension based image processing
Kanade et al. Virtualized reality: Constructing virtual worlds from real scenes
EP1095501A2 (en) Method and apparatus for generating virtual views of sporting events
CN108693970A (en) Method and apparatus for adjusting the video image of a wearable device
Katkere et al. Towards video-based immersive environments
WO2004012141A2 (en) Virtual reality immersion system
Kurillo et al. A framework for collaborative real-time 3D teleimmersion in a geographically distributed environment
Lu et al. Automatic object extraction and reconstruction in active video
KR102343267B1 (en) Apparatus and method for providing a 360-degree video application using video sequences filmed at multiple viewer locations
Lai Immersive Dynamic Scenes for Virtual Reality from a Single RGB-D Camera
Tanahashi et al. Live events accessing for multi-users with free viewpoints using stereo omni-directional system
Rajkumar Best of Both Worlds: Merging 360˚ Image Capture with 3D Reconstructed Environments for Improved Immersion in Virtual Reality
KR20080097403A (en) Method and system for creating event data and making same available to be served
Qui Interactive mixed reality media with real time 3d human capture

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase