|Publication number||US20080162577 A1|
|Application number||US 11/964,971|
|Publication date||Jul 3, 2008|
|Filing date||Dec 27, 2007|
|Priority date||Dec 27, 2006|
|Also published as||CN101212648A, CN101212648B, US8838594|
|Inventors||Takashi Fukuda, Daisuke Sato|
|Original Assignee||Takashi Fukuda, Daisuke Sato|
This application claims priority under 35 U.S.C. § 119 to Japanese Patent Application No. 2006-351358 filed Dec. 27, 2006, the entire text of which is specifically incorporated by reference herein.
The present invention relates to identifying a time position of a data stream of multimedia content by using a feature vector of multimedia content data being reproduced.
The prevalence of broadband has led to a rapid increase in services that distribute multimedia content such as video images. However, content with captions or audio descriptions for people with visual or hearing impairments hardly exists. To improve accessibility, it is strongly desired that captions and audio descriptions be provided as metadata for video content distributed on the Internet. Currently, content providers in many cases do not supply metadata for captions or audio descriptions, so there is a rapidly increasing need for an infrastructure that allows third parties, such as volunteers, to provide metadata.
However, current content players do not include a structure for interpreting metadata provided by third parties. Moreover, since various types of content players are in use, a considerable amount of time is likely to be needed before all of them implement support for such metadata. Normally, metadata is synchronized with the content by use of time stamps whose origin is the starting point of the content, so the playback position of the content player needs to be obtained. However, the playback position cannot be obtained from every kind of content player. For this reason, having an external application interpret the metadata and reproduce it in synchronization with the content does not sufficiently solve the problem.
Japanese Patent Application Laid-Open Publication No. 2005-339038 discloses a device which determines a timing to provide a particular service on the basis of a feature vector of media. Here, a certain feature vector, and degree of appropriateness (the degree of appropriateness for providing the specific service) are registered in advance, and the degree of appropriateness is then obtained from the feature vector of media being reproduced. When the degree of appropriateness is greater than a threshold value, the service is provided; in other words, the timing at which an advertisement or the like is to be inserted is determined. Accordingly, the technique disclosed in Japanese Patent Application Laid-Open Publication No. 2005-339038 is to determine whether or not it is appropriate to provide a service, but is not to specify a time stamp of media.
An object of the present invention is to provide a method and an apparatus for specifying a time position of a data stream of multimedia content by using a feature vector of the multimedia content data being reproduced.
In order to solve the aforementioned problems, the present invention proposes an apparatus which synchronizes a content data stream with metadata by use of a feature vector of the content data stream. The apparatus synchronizes content data with metadata, and includes: a storage device having recorded therein the metadata, which includes a feature vector of the content data; a calculation component which calculates a feature vector from the content data; a search component which searches for metadata in the storage device on the basis of the calculated feature vector; and a reproducing component which reproduces the retrieved metadata in synchronization with the content data.
According to the apparatus of the invention, it is made possible to provide metadata, and to cause a content data stream to be synchronized with the metadata without processing the content data.
Although the outline of the invention has been described so far as an apparatus, the present invention can also be grasped as a method, a program, or a program product. The program product includes, for example, a recording medium having the aforementioned program stored thereon, or a medium which transfers the program.
It should be noted that the outline of the present invention does not list all of the features required for the invention, and other combinations or sub-combinations of these constitutional elements may also constitute the invention.
For a more complete understanding of the present invention and the advantage thereof, reference is now made to the following description taken in conjunction with the accompanying drawings.
A metadata synchronization system 210 includes a feature extractor 213. The feature extractor 213 calculates a feature vector from either or both of the audio stream 228 and the video stream 229. A metadata search component 215 searches the metadata DB 211 for metadata having a feature vector corresponding to the calculated feature vector. When the corresponding feature vector is found in the metadata DB 211 as a result of the search, a metadata reproducing component 217 reproduces the metadata associated with that feature vector. Because metadata normally includes captions and audio descriptions, the metadata reproducing component 217 includes a caption reproducing component 219 and an audio description reproducing component 221. The reproduced captions and audio descriptions are output to a display device 227 and a speaker 226, respectively, together with the content data stream. It should be noted that, in consideration of the search time and the like, the metadata synchronization system 210 preferably reads data into a buffer memory in advance before processing them.
Here, the whole metadata, and whole content data are once downloaded. However, the metadata, and content data may be synchronized with each other while being downloaded little by little as a data stream.
Furthermore, in the present invention, metadata includes data such as: a time stamp; the information to be provided as metadata (caption data, audio description data, and the like); feature vector data that serves as a pointer; and a feature vector type (information specifying the method used to calculate the feature vector). The following is an example of the metadata.
The character string enclosed between <data type=“base64”> and </data> is the feature vector data serving as the pointer. This character string is obtained by converting the feature vector calculated from the content into a character string in accordance with a fixed rule.
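The fields listed above, and the conversion of a feature vector into a character string, can be sketched as follows. This is a minimal illustration only: the class name `MetadataEntry`, its field names, the sample values, and the choice of packing the vector as little-endian 32-bit floats before Base64 encoding are all assumptions, since the source states only that the string follows a fixed rule and is marked as base64.

```python
import base64
import struct
from dataclasses import dataclass


@dataclass
class MetadataEntry:
    """One metadata record; all field names here are hypothetical."""
    timestamp_s: float   # time stamp, in seconds from the start of the content
    payload: str         # caption text or audio description text
    feature_type: str    # identifies how the feature vector was calculated
    feature_b64: str     # feature vector encoded as a character string


def encode_feature(vector):
    """Pack floats as little-endian float32, then Base64-encode (assumed rule)."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(raw).decode("ascii")


def decode_feature(text):
    """Invert encode_feature: Base64-decode, then unpack float32 values."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Made-up sample values, for illustration only.
entry = MetadataEntry(
    timestamp_s=12.5,
    payload="A door opens.",
    feature_type="mfcc12",
    feature_b64=encode_feature([0.5, -1.25, 2.0]),
)
```

Any reversible rule would serve equally well, provided the party creating the metadata and the party searching it apply the same one.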
Here, data for audio descriptions using an automatic speechreading system is cited as an example. However, captions can also be provided to users by using data for captions.
Incidentally, as a feature vector of a data stream of multimedia content, a feature vector of audio data, and a feature vector of video data are conceivable. As a feature vector of audio data, a mel frequency cepstral coefficient (MFCC), or a linear predictive coding (LPC) mel cepstrum, which is used in a standard automatic speech recognition device, or simply, a log power spectrum, or the like can be used. For example, in a derivation process of an MFCC, first, audio signals of 25 ms time length (normally referred to as frames) are extracted from input audio, and thereafter, a frequency analysis is performed on the signals. Subsequently, an analysis is performed by a 24 channel bandpass filter (BPF) having a center frequency following the mel scale. Then, an output of the resultant BPF group is subjected to discrete cosine transform to obtain an MFCC. Here, the mel scale is an interval scale based on human pitch perception with respect to high and low audio frequencies, and the value thereof substantially corresponds to the logarithm value of the frequency. An MFCC calculated by the frame is a vector having 12 components (a 12 dimensional vector). The following are some examples of a feature vector of a video image: a feature of shape, which indicates an area or a circumference length of an object; temporal variation of a feature of a pixel tone; and a velocity vector image (optical flow) of each point on a screen.
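The MFCC derivation described above can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not the calculation used in the embodiment: the Hamming window, the triangular filter shape, the mel formula 2595·log10(1 + f/700) (one common convention), the naive DFT, the 8 kHz sampling rate, and the choice of keeping coefficients 1 to 12 are not taken from the source. All function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of a Hamming-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank_energies(spectrum, sample_rate, n_filters=24):
    """Apply triangular band-pass filters whose centers follow the mel scale."""
    n_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge points, equally spaced in mel, expressed as bin positions.
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for b in range(n_bins):
            if lo < b <= mid:
                total += spectrum[b] * (b - lo) / (mid - lo)
            elif mid < b < hi:
                total += spectrum[b] * (hi - b) / (hi - mid)
        energies.append(total)
    return energies


def mfcc(frame, sample_rate, n_filters=24, n_coeffs=12):
    """Log filter-bank energies followed by a DCT; returns n_coeffs values."""
    energies = mel_filterbank_energies(power_spectrum(frame), sample_rate, n_filters)
    log_e = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
    return [sum(v * math.cos(math.pi * k * (j + 0.5) / n_filters)
                for j, v in enumerate(log_e))
            for k in range(1, n_coeffs + 1)]


# Usage: one 25 ms frame of a synthetic 440 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
         for t in range(int(0.025 * SAMPLE_RATE))]
coeffs = mfcc(frame, SAMPLE_RATE)
```

A practical implementation would normally use an FFT rather than the naive DFT shown here; the direct form is kept only to make each stage of the derivation visible.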
In step 309, metadata is searched for by use of the feature vector calculated in step 307. In the search for metadata, a Euclidean distance between the feature vector obtained in step 307 and the feature vector in the metadata, or a likelihood ratio found on the basis of a probability model, can be utilized. In step 311, it is determined, from the result of the search in step 309, whether or not corresponding metadata has been detected. In a case where it is determined in step 311 that the corresponding metadata cannot be detected (No), the process returns to step 305, where content data are newly obtained, to repeat the search for metadata. On the other hand, in a case where it is determined in step 311 that metadata is detected (Yes), the process proceeds to step 313. In step 313, the content data and the metadata are synchronized with each other and reproduced. In a case where the content data have been read in advance, the synchronization is adjusted to account for the content data already read.
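The Euclidean distance variant of the search in step 309 and the detection decision in step 311 can be sketched as a nearest-match lookup with a threshold. The function names, the `max_distance` parameter, and the sample entries are assumptions introduced for illustration; the source does not specify how a match is decided.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_metadata(query, entries, max_distance):
    """Return the metadata closest to query, or None if none is near enough.

    entries is a list of (feature_vector, metadata) pairs.
    """
    best = None
    best_distance = max_distance
    for vector, metadata in entries:
        d = euclidean_distance(query, vector)
        if d <= best_distance:
            best, best_distance = metadata, d
    return best


# Made-up entries: each pairs a stored feature vector with its metadata.
entries = [
    ([0.0, 0.0, 1.0], {"time": 5.0, "caption": "first cue"}),
    ([1.0, 2.0, 3.0], {"time": 42.0, "caption": "second cue"}),
]
hit = search_metadata([1.1, 2.0, 2.9], entries, max_distance=0.5)
miss = search_metadata([9.0, 9.0, 9.0], entries, max_distance=0.5)
```

Here `hit` corresponds to the Yes branch of step 311 and `miss` (a value of `None`) to the No branch.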
In step 317, it is determined whether or not all of the content data has been read. In step 317, in a case where it is determined that all of the content data has not been read (No), the process returns to step 305 where content data are newly obtained, to repeat the search for metadata. On the other hand, in a case where it is determined that all of the content data has been read in step 317 (Yes), the process proceeds to step 319, and the process ends. It should be noted that it is possible to configure the process to proceed to step 319 and end the process also in the case where all of the metadata has been detected in step 317, in addition to the case where it is determined that all of the content data has been read.
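The loop formed by steps 305 through 319 can be sketched as below. The function `run_sync` and its callable parameters are hypothetical stand-ins for the feature extractor, search component, and reproducing component; the toy lookup table and features are made up solely so the loop can be exercised.

```python
def run_sync(chunks, extract_feature, search, reproduce):
    """Read content piece by piece and reproduce metadata whenever it is found.

    Returns the list of metadata items that were reproduced.
    """
    reproduced = []
    for data in chunks:                    # step 305: obtain content data
        feature = extract_feature(data)    # step 307: calculate feature vector
        metadata = search(feature)         # step 309: search for metadata
        if metadata is None:               # step 311: not detected (No)
            continue                       # return to step 305
        reproduce(metadata)                # step 313: synchronized reproduction
        reproduced.append(metadata)
    return reproduced                      # step 317 (Yes), then step 319: end


# Toy stand-ins: the "feature" is the sum of a chunk, and the table maps a
# feature to a (time stamp, caption) pair.
table = {3: (12.0, "caption A"), 7: (30.5, "caption B")}
shown = []
result = run_sync(
    chunks=[[1, 1], [1, 2], [3, 4], [0, 0]],
    extract_feature=sum,
    search=table.get,
    reproduce=shown.append,
)
```

In this sketch the time position comes from the time stamp stored in the retrieved metadata, not from the player, which reflects the purpose of using the feature vector as a pointer.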
The process flow 400 begins in step 401. In step 403, the content data are partially read. In step 405, feature vectors are calculated. For step 405, one or more of a plurality of feature vector calculation methods are selected in advance. In step 407, the plurality of feature vectors calculated in step 405 are compared with one another.
In step 409, it is determined, on the basis of the result of the comparison in step 407, whether or not similar feature vectors exist. In a case where it is determined in step 409 that similar feature vectors exist (Yes), the process proceeds to step 413. In step 413, an alternative calculation method is employed, and the process returns to step 403 to repeat the calculation of feature vectors. An alternative calculation method may include the use of a different calculation formula, or an alteration in the acquisition time of the content data used for calculating a feature vector. On the other hand, in a case where it is determined that similar feature vectors do not exist (No), the process proceeds to step 411, where the feature vectors are registered as the pointers (search keys). Thereafter, the process proceeds to step 415 and ends.
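The uniqueness check of the process flow 400 can be sketched as a pairwise comparison followed by a fallback to an alternative calculation method. The similarity test (Euclidean distance below a threshold), the two toy calculation methods, and all names are assumptions made for illustration; the source does not define how similarity is judged.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def has_similar_pair(vectors, threshold):
    """Steps 407 and 409: True if any two vectors are closer than threshold."""
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if distance(vectors[i], vectors[j]) < threshold:
                return True
    return False


def select_pointers(segments, methods, threshold):
    """Try each calculation method until all feature vectors are distinct.

    Returns (method_name, vectors), or None if every method yields a collision.
    """
    for name, method in methods:                      # step 413 on later passes
        vectors = [method(seg) for seg in segments]   # step 405
        if not has_similar_pair(vectors, threshold):  # steps 407 and 409
            return name, vectors                      # step 411: register
    return None


# Toy content segments and two hypothetical calculation methods.
segments = [[1, 2, 3], [3, 2, 1], [4, 4, 4]]
methods = [
    ("mean", lambda s: [sum(s) / len(s)]),
    ("first_last", lambda s: [float(s[0]), float(s[-1])]),
]
chosen = select_pointers(segments, methods, threshold=0.5)
```

With these toy inputs the first method gives identical vectors for the first two segments, so the sketch falls back to the second method, whose vectors are all distinct.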
It should be noted that whether or not the calculated feature vector is one with which a certain scene in the video image can be uniquely specified can be examined by calculating matching with the entire video image. Accordingly, it is easily understandable for those skilled in the art that the examination of the uniqueness of a feature vector is not limited to the process flow 400.
Furthermore, when the feature vectors used are extracted at broader intervals than the time intervals used in a general automatic speech recognition system, the original data cannot be restored from them, which reduces problems concerning copyright. Specifically, by use of the so-called mel log spectrum approximation (MLSA), the original audio signals can be restored from a feature vector of audio (time-series data of MFCCs) in a quality with which a person can at least understand what is spoken, although the signals are somewhat deteriorated. Consequently, when the feature vector of audio to be added as metadata is calculated from continuous frames, the metadata may amount to an unauthorized copy from the viewpoint of copyright, since the audio signals can be restored. However, when feature vectors extracted at constant intervals are used, as shown in the example, the audio signals cannot be restored. Thus, problems concerning copyright can be reduced.
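Extracting feature vectors at constant intervals, rather than from continuous frames, can be sketched as a simple subsampling step. The function name, the 10 ms frame period, and the one-second interval are assumed values chosen for illustration; the source does not state the interval used.

```python
def subsample_features(frames, frame_period_s, interval_s):
    """Keep one feature vector every interval_s seconds, with its time stamp."""
    step = max(1, round(interval_s / frame_period_s))
    return [(i * frame_period_s, frames[i])
            for i in range(0, len(frames), step)]


# 300 placeholder per-frame vectors at a 10 ms period; keep one per second.
frames = [[float(i)] for i in range(300)]
sparse = subsample_features(frames, frame_period_s=0.01, interval_s=1.0)
```

Only the retained vectors would be stored as pointers in the metadata, so the continuous time series needed for resynthesis is not available from the metadata alone.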
The information processing apparatus includes a CPU (central processing unit) 801, and a main memory 804 connected to a bus 802. Removable storage devices (media-exchangeable external storage systems) such as hard disk drives 813 and 830, CD-ROM drives 826 and 829, a flexible disk drive 820, an MO drive 828, and a DVD drive 831 are connected to the bus 802 via a Floppy® disk controller 819, an IDE controller 825, a SCSI controller 827, or the like.
Recording media such as a flexible disk, an MO disc, a CD-ROM disc, and a DVD-ROM disc are inserted in the removable storage devices. Computer program code for implementing the present invention by providing commands to the CPU or the like in cooperation with an operating system can be stored in these recording media, in the hard disk drives 813 or 830, or in a ROM 814. The computer program is executed by being loaded into the main memory 804. The computer program may be compressed, or be divided into multiple pieces and then stored in multiple media.
The information processing apparatus receives an input from an input device such as a keyboard 806, or a mouse 807 via a keyboard/mouse controller 805. The information processing apparatus is connected, via a DAC/LCDC 810 to a display device 811 for displaying visual data to users.
The information processing apparatus is capable of communicating with another computer or the like by being connected to a network via a network adapter 818 (an Ethernet® card, or a token ring card) or the like. The information processing apparatus is also capable of being connected to a printer, or a modem via a parallel port 816, or a serial port 815, respectively.
From the descriptions provided so far, it is to be easily understood that the information processing apparatus preferably used for implementing the system according to the embodiment of the present invention can be a general information processing apparatus such as a personal computer, a workstation, or a mainframe, or a combination of these. However, the constitutional elements of these apparatuses are mere examples, and not all of them are necessarily required constitutional elements of the present invention.
As a matter of course, various modifications including combination of each hardware constitutional element of the information processing apparatus used in the embodiment of the present invention, or multiple machines, and allocation of functions thereto can be easily conceived by a person skilled in the art. Needless to say, such modifications are within the concept included in the spirit of this invention.
The system according to the embodiment of the present invention employs an operating system which supports a graphical user interface (GUI) multi-window environment, such as the Windows® operating system provided by Microsoft Corporation, MacOS® provided by Apple Computer Inc., or a Unix® system including the X Window System (for example, AIX® provided by International Business Machines Corporation).
From the descriptions provided so far, it can be understood that the system used in the embodiment of the present invention is not limited to a specific operating system environment. Specifically, any operating system can be employed as long as the operating system is capable of providing a resource management function which allows an application software program or the like to utilize a resource of a data processing system. It should be noted that the resource management function may include a hardware resource management function, a file handling function, a spool function, a job management function, a storage protection function, a virtual storage management function, or the like. However, the descriptions of these functions are omitted here since they are well known to those skilled in the art.
Moreover, the present invention can be implemented by means of a combination of hardware components, or software components, or a combination of hardware and software components. As a typical example of an implementation by means of a combination of hardware and software components, an implementation by a data processing system including a predetermined program can be cited. In this case, the program controls, and causes the data processing system to execute the processing according to the present invention. This program is constituted of command sets which can be described by an arbitrary language, code, or description. Such command sets allow the system to execute a specific function directly or after any one of, or both of, 1. conversion into another language, code, or description, and 2. copy to another medium are performed.
As a matter of course, the scope of the present invention includes not only such a program itself, but also a medium having the program recorded thereon. A program causing a system to execute the functions of the present invention can be stored in an arbitrary computer-readable recording medium such as a flexible disk, an MO disc, a CD-ROM disc, a DVD disc, a hard disk drive, a ROM, an MRAM, or a RAM. Such a program can be downloaded from another data processing system connected to a communications line, or can be copied from another recording medium, for the purpose of storing the program in a recording medium. Furthermore, such a program can be compressed, or divided into multiple pieces, and then be stored in a single medium or in multiple recording media. In addition, it should be noted that, as a matter of course, it is also possible to provide a program product in various forms which implements the present invention.
From the descriptions provided, it is understood that, according to the embodiment of the present invention, a system which synchronizes a content data stream with metadata by use of a feature vector of the content data, without processing the content data, can be easily built.
Hereinabove, the present invention has been described by using the embodiment. However, the technical scope of the present invention is not limited to the above-described embodiment. It is obvious to those skilled in the art that various modifications and improvements may be made to the embodiment. Moreover, it is also obvious from the scope of the present invention that thus modified and improved embodiments are included in the technical scope of the present invention.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US20020088009 *||Nov 16, 2001||Jul 4, 2002||Dukiewicz Gil Gavriel||System and method for providing timing data for programming events|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8179475 *||Mar 9, 2007||May 15, 2012||Legend3D, Inc.||Apparatus and method for synchronizing a secondary audio track to the audio track of a video source|
|US8205148||Jan 7, 2009||Jun 19, 2012||Bruce Sharpe||Methods and apparatus for temporal alignment of media|
|US8730232||Feb 1, 2011||May 20, 2014||Legend3D, Inc.||Director-style based 2D to 3D movie conversion system and method|
|CN101937268A *||Jun 23, 2010||Jan 5, 2011||索尼公司||Apparatus control based on visual lip share recognition|
|WO2012078429A1 *||Nov 30, 2011||Jun 14, 2012||Baker Hughes Incorporated||System and methods for integrating and using information relating to a complex process|
|U.S. Classification||1/1, 707/E17.028, 707/E17.026, 707/E17.009, 707/999.107|
|Cooperative Classification||Y10S707/913, G06F17/30796, G06F17/30265, G11B27/10|
|European Classification||G06F17/30V1T, G06F17/30M2, G11B27/10|
|Mar 5, 2008||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FUKUDA, TAKASHI;SATO, DAISUKE;REEL/FRAME:020605/0421
Effective date: 20080108