Publication number: US 5815196 A
Publication type: Grant
Application number: US 08/581,666
Publication date: Sep 29, 1998
Filing date: Dec 29, 1995
Priority date: Dec 29, 1995
Fee status: Paid
Inventor: Hiyan Alshawi
Original assignee: Lucent Technologies Inc.
Videophone with continuous speech-to-subtitles translation
US 5815196 A
Abstract
There is disclosed a method and apparatus for providing continuous speech-to-subtitles translation utilizing a video-based communications device but without speech synthesis at the output. Instead, a translation of each user's speech is displayed continuously in text form on the other user's screen. In the preferred embodiment, the sending party speaks into a conventional videophone. Speech recognition and translation of the transmitted signal are performed by a remote device at the receiving party's location. The audio portion of the signal is sent both to a speaker for audio output and to a speech recognizer and text-based translation system, the output of which is text translated into the target language. The video portion of the signal and the translated text are combined in a subtitle generator and sent to a display device for viewing by the receiving party.
Claims (9)
What is claimed is:
1. An apparatus for providing continuous speech-to-subtitle translation of a signal containing a video portion and an audio portion, comprising:
(a) means for converting said audio portion to a corresponding first textual signal, wherein said converting means is located at a sending party's location;
(b) means for translating said corresponding first textual signal to a second textual signal wherein said second textual signal is in a target language and wherein said translating means is located remotely from said sending party's location;
(c) means for combining said video portion with said second textual signal to form a display signal, wherein said display signal displays said second textual signal as subtitles; and
(d) means for simultaneously displaying said display signal and outputting said audio portion.
2. An apparatus according to claim 1, wherein said converting means comprises a speech recognizer.
3. An apparatus according to claim 1, wherein said translating means comprises a text-based machine translation system.
4. An apparatus according to claim 1, wherein said combining means comprises a subtitle generator.
5. A method of providing continuous speech-to-subtitle translation of a signal containing a video portion and an audio portion, comprising the steps of:
(a) converting said audio portion to a corresponding first textual signal at a sending party's location;
(b) translating said corresponding first textual signal to a second textual signal at a location remote from said sending party's location, wherein said second textual signal is in a target language; and
(c) combining said video portion with said second textual signal to form a display signal, wherein said display signal displays said second textual signal as subtitles.
6. A method according to claim 5, further comprising the step of displaying said display signal and outputting said audio portion simultaneously.
7. A method according to claim 5, wherein said converting step is performed by a speech recognizer.
8. A method according to claim 5, wherein said translating step is performed by a text-based machine translation system.
9. A method according to claim 5, wherein said combining step is performed by a subtitle generator.
Description
TECHNICAL FIELD

This invention relates to a method and apparatus for providing continuous speech-to-subtitles translation for communication between people speaking different languages.

BACKGROUND OF THE INVENTION

As the world moves closer and closer to a true global economy, the need for individuals who speak different languages to be able to easily communicate has increased. Efforts have been made to facilitate communication between people speaking different languages using current speech-to-speech translation technology wherein the translated speech is synthesized in the target language.

Current speech-to-speech translation technology operates such that language is translated and synthesized sentence by sentence or phrase by phrase. Typically in such systems, a user speaks an entire sentence or phrase and presses a button or flips a switch when completed. The device then translates the entire sentence or phrase and synthesizes and outputs the translation in the target language. Thus, the other party must wait for the speech synthesizer to stop before responding. Such systems are currently preferred because they include in the synthesized translation the intonations, often related to emotion, contained in the original speech. Such intonations are generally thought to increase the quality of the communication. The result, however, is a delay between one party speaking and the other party hearing the synthesized translation which can make communication awkward and unnatural. A system which instead translates and synthesizes one word at a time would most likely also sound awkward and unnatural and would lack the normal intonations of speech. Thus, although research and development efforts are aimed at eliminating the current limitations of speech-to-speech translation technology, it is unlikely that any resulting systems will be capable of perfect simultaneous translation and speech synthesis for many years.

SUMMARY OF THE INVENTION

The problems and limitations associated with speech-to-speech translation are avoided, in accordance with the principles of the present invention, by using a video-based communication device for speech translation, but without speech synthesis of the output. Instead, a translation of each user's speech is displayed continuously in text form on the other user's screen. At the same time, the original, untranslated speech is played over a speaker.

In the preferred embodiment, the sending party speaks into a conventional prior art videophone. The output of the videophone, a signal consisting of both the audio and video portions of the communication, is transmitted to the receiving party's location. Speech recognition and translation of the transmitted signal are performed by a remote device at the receiving party's location. The audio portion of the signal is sent both to a speaker for audio output and to a speech recognizer and text-based machine translation system, the output of which is translated text. The video portion of the signal and the translated text are then combined in a subtitle generator and sent to a display device for viewing by the receiving party.

Because users hear the actual voice of the other party, the communication is more personal and is likely to be perceived to be of higher quality. Hearing the original speech can also reduce misunderstanding because emotional cues are available to the listener. Also, in the event that an imperfect translation takes place, users can look over the stream of translated words and make use of their knowledge of the other language to try to reconstruct the intended meaning. Moreover, because the original untranslated speech is provided audibly and the translated text is provided visually, users can employ any knowledge they may have of the original language. Finally, according to the preferred embodiment, the sending party need not be aware that the other party has translation subtitles displayed on their screen, a feature that would be appreciated by users embarrassed about their foreign language skills.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of the present invention wherein recognition, translation and subtitle generation are performed remotely.

FIG. 2 is a flow diagram illustrating the method for providing continuous speech-to-subtitles translation.

FIG. 3 is a block diagram of an embodiment of the present invention wherein recognition, translation and subtitle generation are performed by a telephone service provider network.

DETAILED DESCRIPTION

Referring to FIG. 1, a diagram of the presently preferred embodiment of the system is shown. Conventional prior art videophone 5, such as the AT&T VT2500, is located at the sending party's location. Remote receiving device 8 is located at the receiving party's location. The sending party speaks into videophone 5, which contains camera 9 and microphone 11. Camera 9 outputs video signal 10, which represents the visual component of the communication, and microphone 11 outputs audio signal 12, which represents the speech component of the communication. Video signal 10 and audio signal 12 are fed into audio/video encoder 13. Audio/video encoder 13 is a conventional digital signal processing device which can be found in the transmitting end of the AT&T VT2500. Audio/video encoder 13 converts video signal 10 and audio signal 12 into a single encoded digital signal 14. Encoded digital signal 14 is placed on conventional telephone line 15 for transmission to remote receiving device 8.
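
For illustration only, the division of labor between encoder 13 and decoder 16 can be sketched in Python; the type and field names below are editorial assumptions, not the actual VT2500 or H.261 bitstream format.

    from dataclasses import dataclass

    @dataclass
    class EncodedSignal:
        video: bytes  # visual component (video signal 10 / 23)
        audio: bytes  # speech component (audio signal 12 / 17)

    def encode(video: bytes, audio: bytes) -> EncodedSignal:
        # Combine both components into a single unit for transmission
        # over telephone line 15 (encoded digital signal 14).
        return EncodedSignal(video=video, audio=audio)

    def decode(signal: EncodedSignal) -> tuple[bytes, bytes]:
        # Recover video signal 23 and audio signal 17 at device 8.
        return signal.video, signal.audio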

At remote receiving device 8, encoded digital signal 14 is fed into audio/video decoder 16. Similar to audio/video encoder 13, audio/video decoder 16 is a conventional digital signal processing device which can be found in the receiving end of the AT&T VT2500. Audio/video decoder 16 converts encoded digital signal 14 back into two separate signals, video signal 23 and audio signal 17. Audio signal 17, which is in the original language of the sending party, is simultaneously fed into speaker 18 for audio output to the receiving party and into recognizer 19. Recognizer 19 is a conventional speech recognizer which converts human speech to text. Speech recognizers are well known in the art and are described, for example, in L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall (1993). Using a prior art statistical pattern recognition technique, recognizer 19 converts audio signal 17 into recognition hypothesis 20, which consists of one or more possible sequences of words in text format corresponding to audio signal 17. Essentially, recognition hypothesis 20 is a signal representing the most likely textual counterpart to the spoken language represented by audio signal 17. Recognition hypothesis 20, however, is still in the original language of the sending party, and thus needs to be translated into the target language.
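
A minimal sketch of the recognizer's contract, assuming an n-best interface; the function and its fixed return values are hypothetical stand-ins, not the behavior of any particular recognizer.

    def recognize(audio: bytes) -> list[tuple[str, float]]:
        # Map speech audio in the original language to ranked candidate
        # transcriptions (recognition hypothesis 20). The fixed output
        # below is a stand-in so the sketch runs.
        return [("how are you today", 0.8), ("how our you today", 0.2)]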

Recognition hypothesis 20 is sent to translator 21. Translator 21 is a conventional text-based machine translation system which converts text in one natural language to text in another natural language. Text-based machine translation systems are well known in the art and are described, for example, in W. J. Hutchins and H. L. Somers, An Introduction to Machine Translation, Academic Press (1992). Translator 21 takes recognition hypothesis 20 and translates it into the target language. If recognition hypothesis 20 consists of a set of possible sequences of words, translator 21 applies a language model which chooses the most likely grammatical version for translation. The output of translator 21 is text signal 22.
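
The selection-then-translation step can be sketched as follows; the scoring function and word-for-word lexicon are toy stand-ins, not a real language model or machine translation system.

    def language_model_score(sentence: str) -> float:
        # Toy grammaticality score; a real system would use an n-gram
        # or grammar-based language model.
        known = {"how", "are", "you", "today"}
        words = sentence.lower().split()
        return sum(w in known for w in words) / max(len(words), 1)

    def machine_translate(text: str) -> str:
        # Placeholder for a conventional text-based MT system.
        lexicon = {"how": "comment", "are": "allez", "you": "vous",
                   "today": "aujourd'hui"}
        return " ".join(lexicon.get(w, w) for w in text.lower().split())

    def translate_hypotheses(hypotheses: list[tuple[str, float]]) -> str:
        # Choose the most likely grammatical hypothesis, then translate
        # it into the target language (text signal 22).
        best = max(hypotheses, key=lambda h: language_model_score(h[0]))
        return machine_translate(best[0])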

Text signal 22 and video signal 23 are sent to subtitle generator 24 where the two signals are overlaid onto one another to create display signal 25. In display signal 25, text signal 22 appears as subtitles to video signal 23. Subtitle generator 24 is common in the prior art, especially in the film industry.

Subtitle generator 24 outputs display signal 25 which in turn is sent to video display device 26, such as a monitor, for display to the receiving party. Thus, the receiving party can simultaneously hear the original speech of the sending party and view the video of the sending party overlaid with subtitles translating the sending party's speech.
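
The overlay itself can be sketched as drawing text onto each decoded frame; the Pillow imaging library and CIF frame size used here are illustrative assumptions.

    from PIL import Image, ImageDraw

    def overlay_subtitle(frame: Image.Image, text: str) -> Image.Image:
        # Draw the translated text near the bottom of the frame, film
        # style, producing one frame of display signal 25.
        out = frame.copy()
        draw = ImageDraw.Draw(out)
        x, y = 10, out.height - 30
        draw.text((x + 1, y + 1), text, fill="black")  # shadow for legibility
        draw.text((x, y), text, fill="white")
        return out

    # Example on a blank 352x288 (CIF) test frame:
    subtitled = overlay_subtitle(Image.new("RGB", (352, 288)), "comment allez vous")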

Because the recognition and translation functions are performed at remote receiving device 8, standard existing videophone transmission signals and protocols, such as ITU-T H.261 and ITU-T H.263, can be used. As such, the only party that needs to have any special equipment is the receiving party. The sending party need only use a standard videophone and need not know that the translation and subtitling are taking place.

Referring now to FIG. 2, a flow diagram illustrating the method for providing continuous speech-to-subtitles translation is shown. Block 30 shows that after a sending party speaks into his or her videophone, the encoded signal output thereby is sent to the receiving party's location. As shown in Block 32, the encoded signal is then decoded into an audio signal and a video signal. Block 34 shows that the audio signal is then converted into a corresponding textual signal in the sending party's language. Block 36 shows that the textual signal is then translated into a textual signal in the receiving party's language, also known as the target language. As shown in block 38, the textual signal in the target language is then overlaid onto the video signal as subtitles. Finally, as shown in block 40, the video signal with subtitles overlaid thereon and the audio signal are simultaneously output to the receiving party.
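
The same flow can be summarized in one self-contained sketch; every helper below is a toy stand-in, wired together only to show the order of the blocks in FIG. 2.

    def decode_signal(encoded: bytes) -> tuple[bytes, bytes]:
        mid = len(encoded) // 2         # block 32: toy split into video/audio
        return encoded[:mid], encoded[mid:]

    def recognize_speech(audio: bytes) -> str:
        return "how are you"            # block 34: stand-in recognizer output

    def translate_text(text: str) -> str:
        return "comment allez vous"     # block 36: stand-in translation

    def add_subtitles(video: bytes, text: str) -> bytes:
        return video + text.encode()    # block 38: toy overlay

    def receive(encoded: bytes) -> tuple[bytes, bytes]:
        video, audio = decode_signal(encoded)               # block 32
        subtitle = translate_text(recognize_speech(audio))  # blocks 34, 36
        return add_subtitles(video, subtitle), audio        # blocks 38, 40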

Referring now to FIG. 3, an alternate embodiment of the system is shown wherein the recognition, translation and subtitle generation functions are performed by telephone service provider network 60 rather than at the receiving party's location. Conventional videophone 62 located at the sending party's location 64 outputs signal 66. Signal 66 is a standard videophone signal. Signal 66 is sent to central processing unit, or CPU, 68 which is attached to telephone network switch 70. CPU 68 and telephone network switch 70 are part of telephone service provider network 60. CPU 68 contains algorithms which perform the recognition, translation and subtitle generation functions on signal 66. CPU 68 outputs signal 72. Signal 72 consists of an audio portion, which contains the sending party's original speech, and a subtitled video portion. Signal 72 is sent to conventional videophone 74 located at the receiving party's location 76 where it can be viewed by the receiving party.
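
Under the same assumptions, the only change in this embodiment is where the pipeline runs; reusing the receive() sketch above, a network-side handler might look as follows (the function is hypothetical, not a real telephony API).

    def network_translation_service(signal_66: bytes) -> tuple[bytes, bytes]:
        # CPU 68 performs recognition, translation and subtitle
        # generation in the network, so both endpoints can remain
        # unmodified videophones; the result corresponds to signal 72.
        return receive(signal_66)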

In this embodiment, the continuous speech-to-subtitles translation would be provided as a service by the telephone service provider wherein the user is charged a fee for each use. As such, a person desiring speech-to-subtitle translation could access the service as needed, using a conventional videophone. This embodiment would also allow the use of standard videophone transmission signals and protocols. Finally, like the preferred embodiment, this embodiment would allow the receiving party to use the service without the knowledge of the sending party.

Still further alternate embodiments of the system are possible. One such embodiment would entail performing the recognition and translation functions at the sending party's location. Another such embodiment would entail performing the recognition function at the sending party's location and the translation function at the receiving party's location. Both such alternate embodiments would require the transmission of data, i.e., the translated text in the case of the former and the recognition hypothesis in the case of the latter, in addition to the transmission of audio and video signals. Thus, these alternate embodiments would require the use of modified videophone equipment. These alternate embodiments, however, would have the advantage of providing more accurate recognition than in the preferred embodiment because the recognition function is performed locally rather than at the receiving party's location. Local recognition is more accurate because the audio signal does not have to be transmitted over telephone lines before recognition takes place. Conversely, in the preferred embodiment, when encoded digital signal 14 is sent to the receiving party's location, the quality of audio signal 17 is somewhat diminished due to the limited bandwidth of conventional telephone line 15.

It is to be understood that the above description comprises only a few of the possible embodiments of the present invention. Numerous other arrangements may be devised by one skilled in the art without departing from the spirit and scope of the invention. The invention is thus limited only as defined in the accompanying claims.

Patent Citations

US5369704 * (filed Mar 24, 1993; published Nov 29, 1994; Engate Incorporated): Down-line transcription system for manipulating real-time testimony
US5512938 * (filed Mar 30, 1995; published Apr 30, 1996; Matsushita Electric Industrial Co., Ltd.): Teleconference terminal

* Cited by examiner
Classifications

U.S. Classification: 348/14.12; 704/2; 704/235; 379/52; 348/E07.079; 704/277; 348/14.01
International Classification: H04N7/14
Cooperative Classification: H04N7/142
European Classification: H04N7/14A2
Legal Events

Apr 9, 1996 (AS, Assignment): Owner name: LUCENT TECHNOLOGIES INC., NEW JERSEY. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: ALSHAWI, HIYAN; REEL/FRAME: 007885/0466. Effective date: 19960321.
Apr 5, 2001 (AS, Assignment): Owner name: THE CHASE MANHATTAN BANK, AS COLLATERAL AGENT, TEX. Free format text: CONDITIONAL ASSIGNMENT OF AND SECURITY INTEREST IN PATENT RIGHTS; ASSIGNOR: LUCENT TECHNOLOGIES INC. (DE CORPORATION); REEL/FRAME: 011722/0048. Effective date: 20010222.
Feb 27, 2002 (FPAY, Fee payment): Year of fee payment: 4.
Mar 6, 2006 (FPAY, Fee payment): Year of fee payment: 8.
Dec 6, 2006 (AS, Assignment): Owner name: LUCENT TECHNOLOGIES INC., NEW JERSEY. Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS; ASSIGNOR: JPMORGAN CHASE BANK, N.A. (FORMERLY KNOWN AS THE CHASE MANHATTAN BANK), AS ADMINISTRATIVE AGENT; REEL/FRAME: 018590/0047. Effective date: 20061130.
Mar 24, 2010 (FPAY, Fee payment): Year of fee payment: 12.
May 29, 2014 (AS, Assignment): Owner name: ALCATEL-LUCENT USA INC., NEW JERSEY. Free format text: MERGER; ASSIGNOR: LUCENT TECHNOLOGIES INC.; REEL/FRAME: 033053/0885. Effective date: 20081101.