VIDEO DIARY WITH EVENT SUMMARY
This invention relates to the field of consumer electronics, and in particular to a video processing system that is configured to record a user's life experiences and to organize these recordings for efficient search and retrieval.
Video cameras are becoming increasingly smaller, and "wearable" cameras are becoming increasingly popular. Jewelry-sized cameras are currently available, as well as eyeglass-mounted cameras. Similarly, many hand-held devices, such as cell-phones, are commonly equipped with video cameras, and security and surveillance cameras are becoming ubiquitous.
Memory devices are also becoming increasingly smaller. Solid state memory devices are currently able to store over a gigabyte of data, and small hard-disk drives, such as those used in portable music players, are capable of storing dozens of gigabytes of data. Personal computers are available with hundreds of gigabytes of data storage, and storage nodes with thousands of gigabytes of data storage are also readily available.
Although the technologies exist for a user to continuously record videos corresponding to months or years of the user's life, the absence, heretofore, of a viable method and system for organizing these videos for subsequent recall and recollection has substantially diminished the practicality of using these technologies to create such a video diary.
Techniques are available for analyzing, characterizing, and summarizing video information from conventional sources, such as television programs, but these techniques generally rely upon known recording patterns and fairly static characterizations. For example, television programs may be characterized as "comedy", "drama", "news", "weather", and so on, with easy to recognize context shifts and program breaks. Such techniques are not well suited for characterizing and organizing 'free-flowing' video, such as the collection of videos from an always-on wearable camera, and such techniques are not well suited for maintaining long-term archives and summaries of events.
At the CARPE workshop co-located with ACM Multimedia Conference 2004 in New York, October 15, 2004, a number of papers were presented regarding techniques for capturing and organizing information regarding a person's life, and are summarized below.
"Passive Capture and Ensuing Issues for a Personal Lifetime Store," by Gemmell, presents a system for integrating the input from cameras and sensors, such as time and location sensors. The time and location that each photograph is taken is stored in a database, so that a user can retrieve pictures based on time or location. The system also maintains a log of events, such as a record of telephone calls, a record of web-pages visited, a record of appointments in a calendar, and so on. If the user recalls an event associated with the time or location of pictures to be retrieved, the user consults the log for the time of the event, and thereby locates the pictures.
"Efficient Retrieval of Life Log Based on Context and Content" by Aizawa et al. presents a lifelog that contains input from substantially continuous video recordings, location and motion sensors, physiological sensors, documents, emails, and so on. Spatio-temporal sampling of the video recordings, such as sampling every N seconds or after every M meters of movement, is used to create summary frames and key frames that facilitate retrieval. Additionally, these key frames are analyzed to distinguish conversation scenes from non-conversation scenes, based, for example, on whether the scene contains a close-up face. Retrieval of the information can be based on time, location, and behavior, as well as annotations provided by words in documents, e-mails and so on.
"A Layered Interpretation of Human Interactions Captured by Ubiquitous Sensors," by Takahashi et al., introduces a system architecture for capturing continuous video, including a layered model of interactions based on semantic levels. At the lowest level is the raw data collected by audio/visual equipment, location sensors, and the like. At a higher level, segments of raw data are created, and at a further level, characterized as elements of human behavior, such as "LOOK_AT", "TALK_TO", and so on. At a still higher level, the individual behavior elements are grouped into social interactions, such as "GROUP_DISCUSSION", "TOGETHER_WITH", and so on. This structure facilitates the sharing of information, including the creation of summary videos that combine the collected information from multiple users, as well as identifying each user's interests, based, for example, on the "LOOK_AT" and "TALK_TO" elements in the video. In like manner, an automated-guide system can be provided, wherein the guide provides suggestions based on prior interactions, such as exhibits associated with each "LOOK_AT" element.
Although the above articles address techniques for storing and organizing videos, a need exists for improved techniques for such storage and organization, to facilitate long-term as well as short-term retrieval and recollection. Of particular note, a need exists for a system that facilitates the retrieval of events in a person's life, recognizing that the identification of material that constitutes an "event" is dynamic, and that the significance of an event changes with time.
It is an object of this invention to provide a method and system for collecting and organizing videos related to a user's everyday experiences, to facilitate recollection of events in the user's life. It is a further object of this invention to provide a method and system for collecting and organizing videos to facilitate maintenance of a personal or business diary.
These objects, and others, are achieved by a video processing system that is configured to process videos related to activities of a user, to identify events that facilitate the organization and storage of the videos for subsequent recall. Preferably, the user wears one or more camera devices that continuously record the activities of the user. Processing elements analyze the recorded videos to recognize events, such as a greeting with another person. The recognized events are used to index or otherwise organize the videos to facilitate recollection, such as recalling "people I met today", or answering queries such as "when did I last speak to Wendy?" The topic of recorded conversations can also be used to characterize events, as well as the recognition of key words or phrases. A hierarchy of archives is used to organize events and corresponding videos on a daily, weekly, monthly, and yearly basis.
The invention is explained in further detail, and by way of example, with reference to the accompanying drawings wherein:
FIG. 1 illustrates a block diagram of an example video archiving system in accordance with this invention.
FIG. 2 illustrates an example flow diagram for creating and maintaining an event library in accordance with this invention.
FIG. 3 illustrates an example flow diagram for recognizing events in accordance with this invention.
Throughout the drawings, the same reference numeral refers to the same element, or an element that performs substantially the same function. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.
Using commonly available compression techniques, one gigabyte of memory is able to store about one hour's worth of high quality video, or at least four hours of "teleconference-quality" video. This invention is based on the observation that, using existing technology, a continuous recording of an average person's activities for days at a time could be recorded on a portable recorder, and with judicious event selection and repetition reduction, years of key events in the person's life could be stored for retrieval and recollection on a personal computer. As memory and processing technologies advance, a user will be able to maintain a life-long video diary on a portable viewing system using the techniques of this invention.
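The capacity figures above follow from simple bitrate arithmetic. The sketch below illustrates the calculation; the bitrates chosen are illustrative assumptions, not figures from this disclosure:

```python
# Back-of-the-envelope storage estimate for a continuous video diary.
# The bitrates below are illustrative assumptions only.

def hours_per_gigabyte(bitrate_mbps: float) -> float:
    """Hours of video that fit in one gigabyte at the given bitrate."""
    bits_per_gb = 8 * 1e9                       # 1 GB = 8e9 bits (decimal convention)
    seconds = bits_per_gb / (bitrate_mbps * 1e6)
    return seconds / 3600

# Roughly 2 Mbit/s for "high quality", 0.5 Mbit/s for teleconference quality:
high_quality = hours_per_gigabyte(2.0)          # about 1.1 hours per gigabyte
teleconference = hours_per_gigabyte(0.5)        # about 4.4 hours per gigabyte
```

At these assumed rates, one gigabyte holds roughly an hour of high quality video or over four hours of teleconference-quality video, consistent with the estimate above.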
Techniques are continuing to be developed for summarizing and classifying video segments. For example, Mauro Barbieri, Nevenka Dimitrova, and Lalitha Agnihotri's "MOVIE-IN-A-MINUTE: AUTOMATICALLY GENERATED VIDEO PREVIEWS", published at the Pacific Rim Conference on Multimedia, in Tokyo, 1-3 December 2004, presents a technique for automatically creating a video preview of a movie. USP 6,754,389, "PROGRAM CLASSIFICATION USING OBJECT TRACKING", issued 22 June 2004 to Nevenka Dimitrova, Lalitha Agnihotri, and Gang Wei, presents a technique for classifying television programs based on the path or trajectory of objects within each scene, and is incorporated by reference herein.
Similarly, techniques are continuing to be developed for recognizing significant objects within video images. For example, Jun Fan, Nevenka Dimitrova, and Vasanth Philomin's "ONLINE FACE RECOGNITION SYSTEM FOR VIDEOS BASED ON MODIFIED PROBABILISTIC NEURAL NETWORKS WITH ADAPTIVE THRESHOLD", published at IEEE ICIP 2004, teaches an online learning technique for efficiently recognizing faces in video images, and adding new faces to the database "online". During continuous operation, the system "learns" new faces and remains able to memorize new face models.
FIG. 1 illustrates a block diagram of an example video archiving system in accordance with this invention. The system includes a memory 150 that receives videos from one or more cameras 110, and an abstraction/summarization module 120 that is configured to organize the videos in the memory 150 so as to facilitate recall via an access and editing module 160 that is coupled to a playback device 170. To facilitate dynamic event recognition and characterization, an adaptive learning module 130 is also provided.
The memory 150 is illustrated as containing three levels of memory, named for convenience as "immediate" 152, "short-term" 154, and "long-term" 156 memory. Fewer or more levels of memory may be used, and the different levels may be distinguished physically or logically, or a combination of both. Although a single block is used to represent the memory 150, the memory 150 may be distributed among multiple physical blocks. For example, the immediate memory 152 may be located in a portable device that is coupled directly to a user's camera 110; it may be integral to a camera 110; it may include portions of different users' cameras 110; and so on. In like manner, the short-term memory 154 may be partitioned between a portable device and a personal computer, and the long-term memory 156 may be located at a mass-storage facility that is accessed via the Internet. Similarly, a portion of the immediate memory 152 may be temporary memory that is used to download videos from a surveillance tape of a region that the user visited, or from another person's cell-phone, and so on.
The abstraction and summarization module 120 is configured to analyze the videos in the immediate memory 152, to identify videos that should be placed in short-term memory 154, and to semantically index these stored videos for efficient retrieval. The module 120 is also optionally configured to further determine which videos should be placed in long-term memory 156, as detailed further below. The abstraction and summarization module 120 also provides features of a personal assistant device, by processing the videos to identify personal introductions, telephone numbers, addresses, and so on.
In conjunction with the adaptive learning module 130, discussed further below, the abstraction and summarization module 120 is configured to partition the continuous video stream into discrete events. To facilitate this partitioning, "break points" are identified within the video, based on changes in the visual or audio content, or based on other "break" cues. Events are defined to begin and end at break points, and may contain one or more intermediate break points. For example, during a ride in an automobile, the general content or form of a video stream will be fairly consistent. When the ride terminates, the content and form of the video stream will change substantially, and this visual transition will be identified by a break point. In like manner, the audio content of a video stream will often change substantially when a person enters a new environment, when a meeting commences, when the unexpected happens, and so on. Each substantial audio change can
be used to identify break points in the video. If the system is coupled with a location determination device, changes in velocity or direction may also serve to identify potential break points in the video.
Nevenka Dimitrova, Jacquelyn Martino, Lalitha Agnihotri, and Herman Elenbaas's "SUPERHISTOGRAMS FOR VIDEO REPRESENTATION", IEEE ICIP 1999, Kobe, Japan, discloses the use of color histograms to identify related frames, and identifies breaks where the histograms differ substantially. A data structure is presented to identify related but non-contiguous frames, to accommodate temporal gaps in a common scene, such as a temporary departure from a meeting, using "families" of histograms. Additional techniques for defining breaks in video are presented in Nevenka Dimitrova, Lalitha Agnihotri, and Radu Jasinschi's "TEMPORAL VIDEO BOUNDARIES", at pages 61-90 of the VIDEO MINING BOOK, edited by A. Rosenfeld et al., and published by Kluwer, 2003.
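By way of illustration only, a break-point detector based on frame-to-frame histogram distance, in the spirit of the techniques cited above, might be sketched as follows. The bin count and threshold are arbitrary assumptions, frames are represented as flat lists of (r, g, b) pixel tuples for simplicity, and a production system would operate on decoded video frames:

```python
def color_histogram(frame, bins=8):
    """Normalized color histogram of a frame given as a flat list of
    (r, g, b) pixel tuples with 0-255 values per channel."""
    hist = [0.0] * (3 * bins)
    for pixel in frame:
        for c, value in enumerate(pixel):
            hist[c * bins + min(value * bins // 256, bins - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist]

def find_break_points(frames, threshold=0.4):
    """Indices where the histogram distance to the previous frame is
    large, marking candidate event boundaries ("break points")."""
    breaks = []
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        # L1 distance between normalized histograms lies in [0, 2].
        if sum(abs(a - b) for a, b in zip(cur, prev)) > threshold:
            breaks.append(i)
        prev = cur
    return breaks
```

The "families of histograms" idea cited above would extend this by comparing each frame against a set of recent histograms, so that a temporary departure and return to a scene does not produce a spurious break.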
In a preferred embodiment of this invention, the abstraction and summarization module 120 is configured to provide multi-level hierarchical summaries based on the recorded events, including, for example, daily, weekly, monthly, and yearly summaries. Additionally, special-purpose summaries may be created, for example to create personal "life videos" to pass along to family and friends.
Generally, the daily summaries will correspond to the videos that are selected for inclusion in the short-term memory 154, and the other summaries will correspond to videos that are selected for inclusion in the long-term memory 156.
Preferably, the daily summary will include a table of contents that identifies the recorded events, the people encountered or referenced, as well as the topics of conversation, based on the audio content of the videos. User cues, such as "Hello, Wendy", or facial recognition techniques, are used to identify the people encountered. In a preferred embodiment of this system, the techniques presented in the aforementioned "ONLINE FACE RECOGNITION SYSTEM FOR VIDEOS BASED ON MODIFIED PROBABILISTIC NEURAL NETWORKS WITH ADAPTIVE THRESHOLD" paper are used to add new faces and new voices to the memory, and to reconcile the identities of the people involved in the conversations. Similarly, the system can be coupled to a dictionary to identify proper names, so that, for example, the text "I was speaking to Charles, yesterday... " can be used to identify people referenced in a conversation. In like manner, other cues, such as "This is Sally", or "I'd like you to meet Sally", can be used to capture
images of newly-met people, to facilitate recollection at a later time. Also, the occurrence of numbers in a conversation can trigger savable events, such as identifying telephone numbers, addresses, appointment times, and so on.
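For illustration, a simple cue-phrase detector for identifying people from utterances such as "Hello, Wendy" or "This is Sally" might be sketched as follows; the cue patterns, and the assumption of an accurate speech transcript, are illustrative rather than part of the disclosed embodiment:

```python
import re

# Cue phrases drawn from the examples above; the regular expressions
# themselves are illustrative assumptions.
GREETING_CUES = [
    re.compile(r"\b[Hh]ello,?\s+([A-Z][a-z]+)"),
    re.compile(r"\b[Tt]his is\s+([A-Z][a-z]+)"),
    re.compile(r"\bI'd like you to meet\s+([A-Z][a-z]+)"),
    re.compile(r"\bI was speaking to\s+([A-Z][a-z]+)"),
]

def extract_names(transcript: str) -> list:
    """Names of people encountered or referenced in a speech transcript,
    found by matching capitalized words after known greeting cues."""
    names = []
    for pattern in GREETING_CUES:
        names.extend(pattern.findall(transcript))
    return names
```

Each extracted name could then be reconciled against the face and voice models in memory, as described above.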
In addition to the text of the conversation, the tenor, duration, and tone of the conversation may be used to classify the conversation as "small talk", "greeting", "significant news", "argument", and so on.
Note that within a given video segment corresponding to a saved event, multiple classifications and indexing may occur. In like manner, to facilitate efficient retrieval, multiple levels of cross-indexing are preferably used, so that, for example, all encounters with a particular person can be efficiently recalled.
The abstraction module 120 may also be configured to recognize commands directed to this archiving system, such as "DIARY, TAKE NOTE ... ", or "DIARY, THIS MEETING IS IMPORTANT" to direct the system to include subsequent videos in the short term memory 154, or "DIARY, OFF" to direct the system not to include the subsequent videos in the short term memory 154. Similarly, "DIARY, DELETE LAST 15 MINUTES" can be used to remove recording information from the short term memory 154. In like manner, commands such as "DIARY, NEW EVENT", or "DIARY, NEW MEETING" can be used to facilitate the identification of segment breaks, as well as the classification of events.
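A minimal dispatcher for such "DIARY, ..." commands might look like the following sketch; the returned action names are hypothetical placeholders, and a real system would act on the output of a speech recognizer:

```python
def parse_diary_command(utterance: str):
    """Map a spoken "DIARY, ..." utterance to a recorder action.

    The command phrases are those described above; the action tuples
    returned here are illustrative placeholders.
    """
    text = utterance.strip().upper()
    if not text.startswith("DIARY,"):
        return None                               # not a command to the diary
    command = text[len("DIARY,"):].strip()
    if command.startswith("TAKE NOTE") or "IMPORTANT" in command:
        return ("mark_important",)                # save subsequent video
    if command == "OFF":
        return ("stop_saving",)                   # suspend short-term storage
    if command.startswith("DELETE LAST"):
        # e.g. "DELETE LAST 15 MINUTES" -> remove the last 15 minutes
        minutes = int(command.split()[2])
        return ("delete_recent", minutes)
    if command.startswith("NEW"):
        # e.g. "NEW MEETING" -> force a segment break of type "meeting"
        return ("segment_break", command.split()[1].lower())
    return ("unknown", command)
```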
In addition to audio cues, visual cues may be used to identify or classify events. For example, the detection of text within the visual image may facilitate a determination of the location or venue of the video, including for example, the recognition of road signs, building names, skylines, and so on. In like manner, pattern recognition and learning techniques can be applied to recognize visual patterns corresponding to "home", "office", "traveling to work", "airport", and so on. Similarly, the immediate memory 152 may include information from other sources, such as a GPS receiver, that can be used to facilitate identification of events, such as "trip to New York", "at the Empire State Building", and so on.
Optionally, the abstraction module 120 may be configured to always save videos to the short term memory when the user is at a particular location, such as "in the Board room", and never save videos when the user is at another location, such as "in the bedroom".
Other sources of information may be used to facilitate the identification and classification of events. For example, the module 120 may be coupled to a personal calendar, and key dates, such as birthdays, anniversaries, holidays, and so on, can effect a different prioritization for saving events. The calendar will also affect the movement of events from short-term memory 154 to long-term memory 156, as detailed further below. In like manner, if the abstraction and summarization module 120 detects a schedule event, such as "Let's meet next Wednesday at three", this scheduled event can be automatically added to the personal calendar.
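A rudimentary detector for scheduling utterances such as "Let's meet next Wednesday at three" might be sketched as follows; the single cue pattern shown is an illustrative assumption and would miss many phrasings:

```python
import re

# Illustrative cue pattern for meeting proposals; a real system would
# use a far richer grammar or a trained language model.
SCHEDULE_RE = re.compile(
    r"\b(?:let'?s\s+)?meet\s+(next\s+\w+|tomorrow|today)\s+at\s+(\w+)",
    re.IGNORECASE,
)

def detect_scheduled_event(transcript: str):
    """Return (day phrase, time phrase) if the transcript proposes a
    meeting, so the event can be added to the personal calendar."""
    m = SCHEDULE_RE.search(transcript)
    if m:
        return m.group(1), m.group(2)
    return None
```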
As with the memory 150, although the abstraction and summarization module 120 is illustrated as a single block in FIG. 1, it may include a number of components that are logically or physically independent from each other. For example, the aforementioned "DIARY, OFF" or "in the bedroom" controls may be located with the immediate memory 152 components, so that the videos are not recorded to the memory 152, for greater privacy. In like manner, the components used to determine which events are stored in long-term memory 156 may be located wherever the long-term memory 156 components are located.
As the use of this invention proliferates, third party vendors may offer summarization and abstraction components that can be added to the archiving system, including abstraction and storage components that are accessed via the Internet. For example, to replace the "holiday letter" that many families send each year, summarizing the significant events of the year, a third-party vendor may offer a "summarization and e-mailing service" that creates and distributes a video summary that includes videos from the video diaries of multiple family members. These and other configurations and enhancements will be evident to one of ordinary skill in the art in view of this disclosure.
The adaptive learning module 130 is configured to facilitate the recognition and characterization of events, including distinguishing between rare events and repetitive events. In contrast with conventional systems used to characterize video streams, such as television programs or videoconference meetings, the video archiving system of this invention cannot rely on a "supervised" structure of the images in the stream. For example, in a system that summarizes news stories, the system can expect each segment to begin with a scene break, each news story to begin with an image of the anchor person, each weather story to begin with an image of a map, and so on. In a supervised videoconference,
the image of the current speaker is usually placed in the foreground, or a light appears at the current speaker's location, and so on. In a typical embodiment of this invention, on the other hand, a "free flow", or "unsupervised flow", of images will be collected, and the learning module 130 is configured to facilitate the segmentation of this continuous flow of video into discrete events by creating a library of labeled events and their characteristic features. To facilitate efficient storage and recall of the videos, the learning module 130 is also configured to further distinguish events as being rare or repetitive. Additional levels of distinction may also be provided, such as 'extremely rare', 'somewhat repetitive', and so on.
FIG. 2 illustrates an example flow diagram for creating and maintaining a dynamic event library in accordance with this invention, as would be included, for example, in the learning module 130 of FIG. 1.
At 210, features are extracted from the video information that is stored in the memory 150. Generally, these features are extracted from the information in the immediate memory 152. The extraction of features includes audio, visual, and textual analysis of the video information, as discussed above, optionally enhanced by ancillary information, such as location and time. At 220, the features are processed to identify clusters of events, such as "meeting people", "driving in car", "in kitchen", and so on; typically, this clustering is performed on a daily basis.
As illustrated at 230, each day 232, 234 will likely exhibit different clusters. However, over time, each cluster can generally be classified as being "rare" or "repetitive". A repetitive cluster, for example, might be "driving to work", or "eating breakfast", whereas rare clusters may include "at a beach", or "at a party". In a preferred embodiment, conventional unsupervised learning methods are used to define the clusters. Clusters that contain many sequences are representative of repetitive events, whereas clusters that contain few sequences represent rare events, which may be characterized as important, such as meeting the President, or not important, such as losing one's way through an unknown part of town. The system delineates repetitive vs. rare events, and provides the user the option of further distinguishing the events as important. In a preferred embodiment, known patterns of events are identified using statistical time series methods, such as hierarchical temporal Hidden Markov Models (HMMs). Optionally, adaptive resonance theory is used to further distinguish long-term and short-term patterns of events.
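The rare-versus-repetitive distinction can be illustrated with a simple frequency rule. The sketch below assumes that cluster assignments have already been produced by some unsupervised clustering method, and the 10% rarity cutoff is an arbitrary assumption:

```python
from collections import Counter

def label_cluster_rarity(cluster_ids, rare_fraction=0.1):
    """Tag each cluster as 'rare' or 'repetitive' according to how many
    event sequences it contains, relative to the total.

    `cluster_ids` is one cluster assignment per event sequence, as
    produced by any unsupervised clustering method.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {
        cid: ("rare" if n / total < rare_fraction else "repetitive")
        for cid, n in counts.items()
    }

# E.g. many "driving to work" sequences versus one "meeting the President":
labels = label_cluster_rarity(["drive"] * 18 + ["breakfast"] * 11 + ["president"])
```

Here the "president" cluster is tagged rare while "drive" and "breakfast" are tagged repetitive; the user would then be offered the option of marking the rare cluster as important.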
At 260, the clusters of events are identified by labels, such as the aforementioned "driving to work", "eating breakfast", "at a beach", and so on. In a preferred embodiment, the learning module 130 includes default labels, and patterns, for easily-identifiable and predictable events, such as views from an automobile, views of particular rooms (meeting room, bedroom, kitchen, etc.), cues for greeting people, and others, and provides a user- interface to allow the user to label unrecognized events, change labels of specific events, further distinguish clustered events, and so on. Similarly, this user interface is also used to allow the user to identify particular images, such as "Sally", "John", "home-kitchen", "office-kitchen", so that subsequent videos can be suitably identified.
The labeled events and their characteristic features are stored in an event library, at 270, for subsequent recognition and classification of events, as further detailed with regard to FIG. 3.
FIG. 3 illustrates an example flow diagram for recognizing and classifying events in accordance with this invention, as would be used, for example, in the abstraction and summarization module 120 of FIG. 1.
At 310, features are extracted from the memory 150, typically the immediate memory 152 of FIG. 1. The extraction of features includes audio, visual, and textual analysis of the video information, as discussed above, optionally enhanced by ancillary information, such as location and time.
At 320, the extracted features are analyzed to determine if they correspond to one or more defined events in the event library 270, discussed above. If, at 330, the features correspond to an event, the occurrence of the event is stored, including the transfer of the video segments into the short-term memory 154 of FIG. 1, and the creation of a summary record that characterizes the event, such as "Meeting Sally in Manhattan", "On the beach with Bill", and so on. The summary record also includes an "importance" rating, to facilitate prioritization; generally, for example, rare events are rated more important than routine and/or repetitive events. In a preferred embodiment of this invention, an interface is provided to allow the user to adjust the importance ratings and/or the resultant prioritization.
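As a sketch of this matching and rating step, the following uses Euclidean distance to a per-event prototype feature vector and an inverse-frequency importance heuristic; both choices are illustrative assumptions, not the disclosed method:

```python
import math

def match_event(features, event_library, max_distance=1.0):
    """Return the label of the closest library event, or None if no
    library entry lies within `max_distance` of the extracted features
    (in which case the features go to the adaptive learning module)."""
    best_label, best_dist = None, max_distance
    for label, prototype in event_library.items():
        dist = math.dist(features, prototype)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def importance(event_label, occurrence_counts):
    """Rate rarer events as more important: a simple inverse-frequency
    heuristic, one of many possible prioritization rules."""
    return 1.0 / (1 + occurrence_counts.get(event_label, 0))
```

A never-before-seen event thus receives the maximum importance of 1.0, while a frequently repeated event such as a daily commute receives a rating near zero, subject to user adjustment as noted above.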
If, at 330, the extracted features do not correspond to a defined event, the information from the memory 150 is passed to the adaptive learning module 130, at 340, for the potential creation of a newly defined event in the event library, at 350, using the
techniques detailed above with regard to FIG. 2. In a preferred embodiment of this invention, the user is also provided the option of explicitly defining events, preferably via the learning module 130. If a newly defined event is created, this first occurrence of the event is stored and summarized at 360, discussed above.
If the identified event corresponds to one of a defined set of events that trigger other operations, such as storing a telephone number, or storing an image of a new acquaintance, these operations are subsequently performed, at 370.
In addition to storing and summarizing the daily events as illustrated in FIG. 3, the abstraction and summarization module 120 is also configured to prepare weekly, monthly, and yearly summaries of events, based on a prioritization of the events. In a preferred embodiment of this invention, the module 120 allows both automated and manual identification of events for inclusion in these summaries, as well as manual control of the parameters and procedures used for automated event identification and prioritization.
Video data mining is used to detect weekly, monthly, or yearly patterns, to facilitate the detection of memorable days and/or memorable events. High level semantic pointers are preferably used to index these memorable days and events, to facilitate retrieval. As discussed above, ancillary information, such as calendar information for birthdays, anniversaries, travel schedule, and so on, and GPS information for locations, can be used to identify memorable days and events as well. The aforementioned importance rating is also used to select memorable events. Generally, days and events are identified as being memorable when they include out-of-the-ordinary occurrences, such as rare events, new locations, new vistas, and so on.
As noted above, the user is provided the option of manually overriding the system's choice of memorable days or events. Optionally, the system includes a list of criteria and/or options for defining memorable events, such as a list that includes "location", "vista", "people", "calendar", etc., and the user selects which of these criteria should or should not be used to define memorable events.
Indexed archives are used to manage the storage of the daily, weekly, monthly, and yearly summaries, using archiving techniques common in the art. Generally, "meta summaries" are created at each level (e.g. daily level) to produce the summary at the next higher level (e.g. weekly level). In a preferred embodiment, to control the amount of storage required to maintain the video diary of this invention, a fixed number of daily
summaries, with their associated stored videos, are maintained, with new daily summaries and associated videos replacing the oldest daily summaries and videos. In like manner, a fixed number of each of weekly and monthly summaries and videos are maintained, with new summaries and videos replacing the oldest summaries and videos. As noted above, any of a variety of techniques can be used for maintaining this hierarchy of saved information. For example, the selected daily videos may be copied to a memory reserved for weekly summaries, and selected weekly videos may be copied to a memory reserved for monthly summaries, and so on. Alternatively, pointers could be maintained to a single copy of each video segment, and the video segment is not permitted to be overwritten if any of the daily, weekly, monthly, etc. summaries contain a pointer to this segment. Note that the terms "delete" and "replace" are used herein in the "logical" sense, and do not necessarily imply a physical deletion or immediate overwriting of the material. A video segment is "deleted" from the system whenever a reference to the segment does not appear in any of the current summaries.
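The pointer-based alternative described above, in which a segment is logically "deleted" once no summary at any level references it, can be sketched as follows; the class name, level limits, and summary representation are all illustrative assumptions:

```python
class VideoArchive:
    """Hierarchical summaries holding pointers to a single copy of each
    video segment; a segment is "deleted" (logically) once no summary
    at any level still references it. Illustrative sketch only.
    """

    def __init__(self, max_daily=30, max_weekly=52, max_monthly=24):
        self.segments = {}                        # segment_id -> video data
        self.levels = {"daily": [], "weekly": [], "monthly": []}
        self.limits = {"daily": max_daily, "weekly": max_weekly,
                       "monthly": max_monthly}

    def add_summary(self, level, segment_ids, videos):
        """Store a new summary (a list of segment pointers) at `level`."""
        self.segments.update(videos)
        self.levels[level].append(list(segment_ids))
        # Fixed number of summaries per level: drop the oldest summary.
        if len(self.levels[level]) > self.limits[level]:
            self.levels[level].pop(0)
        self._collect_garbage()

    def _collect_garbage(self):
        """Logically delete segments no summary points to any longer."""
        referenced = {sid for summaries in self.levels.values()
                      for summary in summaries for sid in summary}
        for sid in list(self.segments):
            if sid not in referenced:
                del self.segments[sid]
```

Because weekly and monthly summaries hold pointers rather than copies, a segment promoted into a weekly summary survives even after its daily summary is replaced.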
The access and editing module 160 of FIG. 1 is configured to allow a user to browse the hierarchy of summaries, search for key words or phrases, sort by time or location, and so on, using data access techniques common in the art. In a preferred embodiment, each stored event includes a preview image or short clip, to facilitate selection of the stored videos.
Preferably, the accessing and editing module 160 allows a user to search and browse the video diary using a variety of query modes. For example, if the user is browsing the video, the module 160 allows queries such as: "Which building is this?", "Who is this person?", "Have I met this person before?", "Whose voice is this?", and so on. In response to such queries, the accessing module 160 compares the current image, audio, or video sequence to other images, audio and sequences within the memory 150. The query may also be in the form of a summary, such as "How often have I met Mary?", "What songs did we listen to at Mike's party?", "Who attended the staff meeting yesterday?", and so on.
The module 160 also provides a variety of editing functions, ranging from straightforward copying, cutting, and deleting functions to automated tasks such as "Prepare a disk containing all of my meetings with Charles", or "Prepare a summary disk of all of our family reunions", and so on. Optionally, the module 160 allows a user to use a variety of "director styles" in the creation of such composites and summaries.
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are thus within its spirit and scope. For example, although this invention is presented in the context of a single user, the system can be configured to communicate with similar systems, to provide for multi-user-based visual memories. The recordings of multiple users who attend a meeting, party, or other assembly, for example, can be merged to provide a common multiple- view recording of the event. These and other system configuration and optimization features will be evident to one of ordinary skill in the art in view of this disclosure, and are included within the scope of the following claims.
In interpreting these claims, it should be understood that: a) the word "comprising" does not exclude the presence of other elements or acts than those listed in a given claim; b) the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements; c) any reference signs in the claims do not limit their scope; d) several "means" may be represented by the same item or hardware or software implemented structure or function; e) each of the disclosed elements may be comprised of hardware portions (e.g., including discrete and integrated electronic circuitry), software portions (e.g., computer programming), and any combination thereof; f) hardware portions may be comprised of one or both of analog and digital portions; g) any of the disclosed devices or portions thereof may be combined together or separated into further portions unless specifically stated otherwise; h) no specific sequence of acts is intended to be required unless specifically indicated; and i) the term "plurality of" an element includes two or more of the claimed element, and does not imply any particular range of number of elements; that is, a plurality of elements can be as few as two elements.