|Publication number||US20080059522 A1|
|Application number||US 11/511,816|
|Publication date||Mar 6, 2008|
|Filing date||Aug 29, 2006|
|Priority date||Aug 29, 2006|
|Inventors||Ying Li, Youngja Park|
|Original Assignee||International Business Machines Corporation|
1. Field of the Invention
The present invention relates generally to the field of multimedia content analysis and, more particularly, to a system and method for automatically creating personal profiles for video characters.
2. Description of the Prior Art
With the rapid development of multimedia technology and the growth of the Internet, a person can now find nearly anything he/she wants on the World Wide Web ("Web"). One popular type of information that people search for on the web is person-specific information, for instance, typing in "Tom Hanks" to find all information related to Tom Hanks (the actor, perhaps). However, considering the overwhelming amount of information that can be obtained from the web, it would be desirable to implement smart tools that can automatically collect the information related to a particular person, identify the important pieces, assemble them into a personal profile, and finally present the profile to the user for viewing. Another example is to create such profiles for people who appear in a video (i.e., video characters). Generation of such profiles can benefit many multimedia applications such as personal activity tracking, information management and retrieval.
There has been some previous work on extracting person-specific information from video streams in the video content analysis community. Some examples include voice-based person identification as described in the reference to Y. Li, S. Narayanan and C. Kuo, entitled "Adaptive Speaker Identification with Audiovisual Cues for Movie Content Analysis", Pattern Recognition Letters, vol. 25, no. 7, 2004; face detection and recognition as described in the reference to E. Acosta, L. Torres, A. Albiol and E. Delp, entitled "An Automatic Face Detection and Recognition System for Video Indexing Applications", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2002; as well as the detection of other soft biometrics such as gait as described in the reference to A. Bissacco, A. Chiuso, Y. Ma and S. Soatto, entitled "Recognition of Human Gaits", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001. However, none of the existing work has attempted to extract various types of person-specific information, such as name, affiliation, portrait and voice, from a video and to correlate them with each other for a particular video character. Such information extraction and assembly would require fairly sophisticated multimedia content analysis techniques.
Heretofore, there has been no solution enabling the automatic creation of personal profiles for video characters that contain various aspects of information specific to each individual character.
It would thus be highly desirable to provide a media searching and data aggregation system and methodology for automatically creating personal profiles associated with video characters (i.e., characters appearing in video media).
In a broad sense, the present invention is directed to a system, method and computer program product for extracting various personal information with respect to an individual character appearing in a video media source, i.e., a video character. The extracted personal information includes not only voice/speech information but also other information such as affiliation, job title, face, gait, etc.
More particularly, a method and apparatus are provided for automatically extracting personal information and creating personal profiles for people who appear in video streams. Personal information that can be automatically extracted from videos may include name, affiliation, job position, face (or portrait), voice, motion (e.g., gait and gesture), and other related features. Specifically, extensive text analysis is first carried out on the video text (which includes the video transcript, video scene texts, etc.) to identify various types of text-related personal identity information such as a character's name, affiliation, work location and job position. Different pieces of information are then correlated and fused with each other across the entire video. This forms the name identities for video characters. Meanwhile, advanced audiovisual content analysis is carried out to extract audiovisual-related personal identities from the video, such as face, voice and motion. This forms the visual, voice and kinematics identities for video characters, respectively. In addition to the information from the video and its text sources, both the text and audiovisual content analysis processes can also access additional or external information sources such as the World Wide Web (WWW) and other private information databases (such as employee databases and fingerprint or iris databases) for data enrichment purposes. Next, the text-related name identity and the audiovisual-related visual, voice and kinematic identities that all refer to the same particular video character are correlated with each other based on advanced semantic context analysis. Finally, a personal profile for this video character is generated by assembling all of his or her identity information together.
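The multi-stage flow described above (text analysis forming name identities, audiovisual analysis forming visual/voice/kinematics identities, correlation, profile assembly) can be sketched as follows. This is a minimal illustrative sketch only; every function name, dictionary key, and the time-cue matching rule are assumptions for illustration, not structures prescribed by the patent.

```python
# Hypothetical sketch of the four-stage profile-generation flow.
# All names and data layouts here are illustrative assumptions.

def extract_name_identities(video_text):
    """Text analysis: derive name/affiliation/position identities."""
    identities = {}
    for entry in video_text:  # e.g. transcript lines, scene texts
        name = entry.get("name")
        if name:
            identity = identities.setdefault(name, {"name": name})
            # Fuse any attributes mentioned alongside the name.
            for key in ("affiliation", "position"):
                if entry.get(key):
                    identity[key] = entry[key]
    return identities

def extract_av_identities(av_segments):
    """Audiovisual analysis: face/voice/motion features per observation."""
    return [{"time": s["time"], "face": s.get("face"),
             "voice": s.get("voice"), "motion": s.get("motion")}
            for s in av_segments]

def correlate(name_identities, av_identities, cues):
    """Semantic-context correlation: tie a name to the audiovisual
    features observed at the time instance the text cues point to."""
    profiles = {}
    for name, identity in name_identities.items():
        profile = dict(identity)
        t = cues.get(name)  # time instance at which the name is "active"
        for av in av_identities:
            if av["time"] == t:
                profile.update({k: v for k, v in av.items() if k != "time"})
        profiles[name] = profile
    return profiles

# Usage with the document's running example (all values illustrative):
video_text = [{"name": "Lisa Smith", "affiliation": "OpenMind Company"},
              {"name": "Lisa Smith", "position": "sales manager"}]
av_segments = [{"time": "B", "face": "face_b", "voice": "voice_b"}]
cues = {"Lisa Smith": "B"}
profiles = correlate(extract_name_identities(video_text),
                     extract_av_identities(av_segments), cues)
```

The fused profile for "Lisa Smith" then carries both her text-derived attributes and the audiovisual features observed at time instance B.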
Furthermore, in one aspect of the invention, various personal information with respect to an individual video character is extracted. Thus, the personal profile not only includes voice/speech information but, in a much broader sense, also includes other information such as affiliation, job title, face, gait, gestures, etc. It is understood that the invention contemplates extracting features such as face, voice and gait from the video stream, with or without recognition. For example, the extracted visual information may be obtained without knowing the names of the persons who "own" these features. Consequently, the invention relies upon text mining tools to correlate a "name" with the extracted "features". Similarly, because the extracted subject matter also includes voice (or speech) information extracted from the audio stream, the invention likewise relies on text mining tools (e.g., semantic context analysis) to correlate a person's "name" with his/her "voice".
Thus, in accordance with the invention, there is provided a system, method and computer program product for generating a personal profile for a subject appearing in a video media source. The method includes:
extracting audiovisual-related personal information related to a subject appearing in the video media source;
extracting text-related personal information that is related to the subject in the video source;
correlating the extracted audiovisual-related personal information and the extracted text-related personal information related to the subject; and
assembling a personal profile data structure for the subject, the personal profile data structure comprising the text-related personal information and audiovisual-related personal information related to the subject.
Further to the system for generating personal profiles, there is provided a means for extracting video texts from the video media source, as well as from other possible additional information sources; the text-related personal information extracting means receives the extracted video texts and extracts the text-related personal information from them.
The text-related personal information forms the name identity of the subject, while the audiovisual-related personal information includes audiovisual-related features including information forming one or more of: a visual identity, a kinematic identity, and a voice identity of the subject.
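The four identity types just described (name, visual, voice, kinematic) suggest a simple profile data structure. The sketch below is a hypothetical rendering; the field names and types are assumptions, since the patent does not define a concrete layout.

```python
# Minimal sketch of a personal-profile data structure, assuming one
# name identity plus optional visual, voice and kinematic identities.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonalProfile:
    name: str                                   # name identity (text analysis)
    affiliation: Optional[str] = None
    position: Optional[str] = None
    visual_identity: Optional[bytes] = None     # e.g. face/portrait features
    voice_identity: Optional[bytes] = None      # e.g. speaker model
    kinematic_identity: Optional[bytes] = None  # e.g. gait/gesture features

    def is_complete(self) -> bool:
        """True once all three audiovisual identity types are populated."""
        return all((self.visual_identity, self.voice_identity,
                    self.kinematic_identity))
```

A profile can thus be assembled incrementally, starting from the name identity alone and filling in audiovisual identities as they are correlated.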
In an alternate embodiment, in an iterative manner, the correlated extracted audiovisual-related personal information and extracted text-related personal information may be fed back to a search engine means and utilized for performing an additional search to obtain additional texts relating to the subject or obtain additional video media sources having the subject.
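The iterative feedback of this alternate embodiment can be sketched as a loop that re-queries a search-engine means with the correlated identity attributes until no new information arrives. In the sketch, `search` is a hypothetical stand-in for any search-engine means, and the query construction and merge policy are illustrative assumptions.

```python
# Hedged sketch of the iterative enrichment loop: correlated identity
# attributes are fed back as a query; any newly found attributes are
# merged into the profile, and the loop stops when nothing new arrives.

def iterative_enrichment(profile, search, max_rounds=3):
    for _ in range(max_rounds):
        # Build a query from the attributes correlated so far.
        query = " ".join(filter(None, (profile.get("name"),
                                       profile.get("affiliation"))))
        new_facts = search(query)           # additional texts / sources
        if not new_facts:
            break                           # nothing found; stop iterating
        before = dict(profile)
        for key, value in new_facts.items():
            profile.setdefault(key, value)  # enrich, never overwrite
        if profile == before:
            break                           # no change; converged
    return profile
```

Each round may also surface additional video media sources having the subject, which would re-enter the audiovisual analysis stage in the same way.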
Advantageously, the system and method for generating personal profiles further enable updating an assembled personal profile of a subject as a new video media source having said subject becomes available.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
Referring now to the drawings, and more particularly to
In the high-level overview of the system 10 for generating personal profiles for video characters, shown in
This type of identity information is alternately referred to herein as a name identity. One example is given here. For purposes of illustration, this example assumes that a character called Lisa Smith from OpenMind Company is introduced at the beginning of the video (denoted as time instance A). Later on, when a person called Lisa is giving a speech and mentions that she works as a sales manager (denoted as time instance B), a text analyzer 33 provided in the text-related personal information extractor module 30 is implemented to determine whether this Lisa refers to the Lisa Smith who was introduced earlier. If so, the analyzer 33 should be able to derive the fact that Lisa Smith is a sales manager at OpenMind Company, that is, to correlate and fuse the various types of information related to one specific video character, which may have been collected at different analysis stages.
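The fusion decision in the example above can be sketched as follows: resolve whether a later short mention ("Lisa") refers to an earlier full name ("Lisa Smith"), then merge the newly observed attributes into that identity. The matching rule here (a mention matches a known full name that contains it as a token, and only when the match is unambiguous) is a simplifying assumption; the patent leaves the analyzer's internals open.

```python
# Illustrative name-fusion sketch; matching rule is an assumption.

def resolve_mention(mention, known_names):
    """Return the known full name a short mention unambiguously refers to."""
    matches = [n for n in known_names if mention in n.split()]
    return matches[0] if len(matches) == 1 else None

def fuse(identities, mention, attributes):
    """Merge attributes observed for a mention into the resolved identity."""
    full_name = resolve_mention(mention, identities)
    if full_name:
        identities[full_name].update(attributes)
    return identities

# Time instance A: full introduction; time instance B: short mention.
identities = {"Lisa Smith": {"affiliation": "OpenMind Company"}}
fuse(identities, "Lisa", {"position": "sales manager"})
```

After fusion, the single identity record carries both the affiliation collected at time instance A and the job position collected at time instance B.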
Next, as shown in
Contemporaneously with the extraction/processing of the text information related to the video, referring to
Finally, referring to
Continuing, when all identity information, including the name, visual, voice and kinematics identities relating to the various video characters, has been extracted and finalized, it is fed into the information correlator module 25 to be correlated. As known in the art, complex and advanced semantic context analysis may be performed in the information correlator module 25. For example, assume that Lisa Smith's name identity is obtained from the text-related personal information extractor module 30, and that a set of visual, voice and kinematics identities for multiple video characters is obtained from the audiovisual-related personal information extractor module 40, whose names are still unknown (i.e., it is only known from the audiovisual-related personal information extractor module 40 that this group includes faces, voices or motions that correspond to one specific video character, but it is not known who he or she is). The information correlator module 25 will then determine which visual, voice and kinematics identities belong to the character Lisa Smith. One approach to fulfilling this task is to perform context analysis. For instance, if it is known that, starting from time instance B, Lisa Smith is giving a speech, which could be derived from extracted text cues (e.g., a sentence that says "now, let's welcome Lisa Smith to give us a speech"), then it is possible to correlate Lisa's name identity with the visual identity that contains the face extracted at time instance B. In the same manner, Lisa Smith's voice and kinematics identities can be identified. Another example of performing such information correlation is to take advantage of cues from the video scene texts. As mentioned hereinabove, scene texts are frequently used to inform the audience of the current speaker's name, job position, affiliation, etc.
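One way to realize the context analysis just described is to detect an introduction cue in the timed transcript, take its timestamp, and attach the audiovisual features observed at that timestamp to the named character. The cue pattern below and the exact-timestamp lookup are illustrative assumptions, simplified from the richer semantic analysis the text contemplates.

```python
# Illustrative cue-based correlation: an introduction sentence in the
# transcript yields (name, time); the face/voice observed at that time
# are then linked to that name. Pattern and data layout are assumptions.
import re

INTRO_CUE = re.compile(r"let's welcome (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)")

def correlate_by_cue(transcript, av_observations):
    """Map names introduced in the transcript to AV features at that time."""
    links = {}
    for time, sentence in transcript:
        m = INTRO_CUE.search(sentence)
        if m:
            # Features observed at the cue's time instance, if any.
            links[m.group("name")] = av_observations.get(time, {})
    return links

# The document's example cue at time instance B:
transcript = [("B", "now, let's welcome Lisa Smith to give us a speech")]
av_observations = {"B": {"face": "face_at_B", "voice": "voice_at_B"}}
links = correlate_by_cue(transcript, av_observations)
```

The same mechanism extends to scene-text cues: a superimposed caption giving a name at a frame plays the role of the introduction sentence, with the frame's timestamp as the time instance.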
Therefore, if it is detected that there is a person who is present in the current frame with superimposed video texts showing the name, job position and affiliation, the person's visual identity can be easily correlated with his or her name identity. Moreover, if it is further detected that the person is also speaking at that time, then his or her voice identity will also be correlated.
Finally, as shown in
The computer system 200 also includes a display device 299 or like monitor and an associated I/O device, e.g., a video adapter device 270 that couples the display device 299 to a system bus 101 implemented for connecting the various system components together. For instance, the bus 101 connects the CPU or like processor 210 to the RAM or other system memory 230. The bus 101 can be implemented using any kind of bus structure or combination of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures, such as an ISA bus, an Enhanced ISA (EISA) bus, and a Peripheral Component Interconnect (PCI) bus or like bus device. The computer node 200 implements functionality for providing a user interface, via the associated display device 299, for initiating and controlling execution of the respective video text extraction, text-related personal information extraction, audio/video-related personal information extraction, information correlation and personal profile generation aspects of the invention. Although not shown, the computing node 200 includes other user input devices such as a keyboard and a pointing device (e.g., a "mouse") for entering commands and information into the computer (e.g., data storage devices), and, particularly, for searching for additional information from additional or external information sources, visualizing the extracted text-related personal information and audiovisual-related personal information, and presenting the generated personal profiles to users via a user interface generated on the display device 299.
As mentioned herein, the computer system 200 is adapted to operate in a networked environment for conducting searches and receiving information from additional information sources 20, e.g., a web-site and a database server. As shown in
It should be understood that other kinds of computer and network architectures are contemplated. For example, although not shown, the computer system 200 can include hand-held or laptop devices. It is further understood that the computing system 200 can employ a distributed processing configuration. In a distributed computing environment, computing resources for implementing the video text extractor module 35, text-related personal information extractor module 30, audiovisual (A/V)-related personal information extractor module 40, information correlator 25 and personal profile generator 50 can be physically dispersed.
The invention has been described herein with reference to particular exemplary embodiments. Certain alterations and modifications may be apparent to those skilled in the art, without departing from the scope of the invention. The exemplary embodiments are meant to be illustrative, not limiting of the scope of the invention.
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8024341 *||Jul 10, 2008||Sep 20, 2011||AudienceScience Inc.||Query expansion|
|US8060373 *||Mar 21, 2007||Nov 15, 2011||At&T Intellectual Property I, L.P.||System and method of identifying contact information|
|US8521531 *||Feb 6, 2013||Aug 27, 2013||Lg Electronics Inc.||Displaying additional data about outputted media data by a display device for a speech search command|
|US8548999||Aug 15, 2011||Oct 1, 2013||AudienceScience Inc.||Query expansion|
|US20110153417 *||Aug 21, 2009||Jun 23, 2011||Dolby Laboratories Licensing Corporation||Networking With Media Fingerprints|
|US20130279573 *||May 9, 2012||Oct 24, 2013||Vixs Systems, Inc.||Video processing system with human action detection and methods for use therewith|
|CN102750366A *||Jun 18, 2012||Oct 24, 2012||海信集团有限公司||Video search system and method based on natural interactive import and video search server|
|WO2012027607A2 *||Aug 25, 2011||Mar 1, 2012||Intel Corporation||Technique and apparatus for analyzing video and dialog to build viewing context|
|WO2012027607A3 *||Aug 25, 2011||May 31, 2012||Intel Corporation||Technique and apparatus for analyzing video and dialog to build viewing context|
|U.S. Classification||1/1, 707/999.107|
|Cooperative Classification||G06F17/30793, G06F17/30796, G06K9/00335, G06K9/00973|
|European Classification||G06F17/30V1R1, G06F17/30V1T, G06K9/00Y, G06K9/00G|
|Aug 8, 2008||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LI, YING;PARK, YOUNGJA;REEL/FRAME:021360/0137
Effective date: 20060822