This application claims priority under 35 U.S.C. §119 based on U.S. Provisional Application Nos. 60/394,064 and 60/394,082 filed Jul. 3, 2002 and Provisional Application No. 60/419,214 filed Oct. 17, 2002, the disclosures of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
 The U.S. Government may have a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Contract No. 1999-S018900-0 (Federal Broadcast Information Service (FBIS)).
A. Field of the Invention
The present invention relates generally to speech processing and, more particularly, to the transcription of speech.
B. Description of Related Art
Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although recording audio is straightforward, automatically transcribing and indexing speech in an intelligent and useful manner can be difficult.
Automatic transcription systems are generally based on language and acoustic models. The acoustic model is trained on a speech signal and on a corresponding transcription of the speech; during training, the model "learns" how the speech signal corresponds to the transcription. Typically, the training transcriptions are produced through a manual process in which a user listens to the training audio, segments the audio, and types in the text corresponding to the audio. While typing in the text, the user may additionally annotate the text so that certain words, such as proper names, are marked as such.
Manually transcribing speech can be a time-consuming, and thus expensive, task. Conventionally, generating one hour of transcribed training data requires up to 40 hours of a skilled transcriber's time. Accordingly, in situations in which a large amount of training data is required, or in which a number of different languages are to be modeled, the cost of obtaining the training data can be prohibitive.
SUMMARY OF THE INVENTION
Thus, there is a need in the art to be able to cost-effectively transcribe speech.
Systems and methods consistent with the principles of this invention provide a transcription tool that allows a user to quickly, and with minimal training, transcribe segments of speech.
One aspect of the invention is directed to a speech transcription tool that includes an audio classification component, control logic, and an input device. The audio classification component receives an audio stream containing speech data and segments the audio stream into speech and non-speech audio segments based on locations of the speech data within the audio stream. The control logic plays the speech segments and skips playing of the non-speech segments. The input device receives user transcription text relating to a transcription of the speech segments played by the control logic.
A second aspect of the invention is directed to a method that includes receiving an audio stream containing speech data, determining where the speech data is located in the audio stream, and playing select portions of the audio stream to a user. The select portions of the audio stream are based on the location of the speech data. The method additionally includes receiving text corresponding to the played portions of the audio stream and outputting the text.
A third aspect of the invention is directed to a method that includes analyzing a data stream based on acoustic characteristics of the data stream to generate acoustic classification information for the data stream, and playing portions of the data stream that meet predetermined criteria based on the acoustic classification information. Further, the method includes receiving transcription information relating to the played portions of the data stream.
Yet another aspect of the invention is directed to a computing device for transcribing an audio file that includes speech. The computing device comprises speakers, a processor, and a computer memory. The computer memory contains program instructions that, when executed by the processor, cause the processor to automatically segment the audio file into speech and non-speech segments based on acoustic characteristics of the audio file. Additionally, the processor plays a current one of the speech segments through the speakers, receives transcription information for the speech segments played through the speakers, and skips the non-speech segments when locating a next current one of the speech segments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings:
FIG. 1 is a diagram illustrating an exemplary system in which concepts consistent with the invention may be implemented;
FIG. 2 is a block diagram of a transcription tool consistent with the present invention;
FIG. 3 is a diagram illustrating a graphical user interface of the transcription tool consistent with the present invention; and
FIG. 4 is a flow chart illustrating methods of operation of the transcription tool consistent with the present invention.
DETAILED DESCRIPTION
The following detailed description of the invention refers to the accompanying drawings. The same reference numbers may be used in different drawings to identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents of the claim limitations.
- System Overview
A speech transcription tool assists a user in transcribing speech. The transcription tool automatically identifies segments in an audio stream appropriate for transcription. Additionally, the transcription tool presents the user with a simple graphical interface for typing in the transcription text.
Speech transcription, as described herein, may be performed on one or more processing devices or networks of processing devices. FIG. 1 is a diagram illustrating an exemplary system 100 in which concepts consistent with the invention may be implemented. System 100 includes a computing device 101 that has a computer-readable medium 109, such as random access memory, coupled to a processor 108. Computing device 101 may also include a number of additional external or internal devices. An external input device 120 and an external output device 121 are shown in FIG. 1. Input device 120 may include, without limitation, a mouse, a CD-ROM drive, or a keyboard. Output device 121 may include, without limitation, a display or an audio output device, such as a speaker. A keyboard, in particular, may be used by the user of system 100 when transcribing a speech segment that is played back through an output device, such as a speaker.
In general, computing device 101 may be any type of computing platform, and may be connected to a network 102. Computing device 101 is exemplary only. Concepts consistent with the present invention can be implemented on any computing device, whether or not connected to a network.
Processor 108 executes program instructions stored in memory 109. Processor 108 can be any of a number of well-known computer processors, such as processors from Intel Corporation, of Santa Clara, Calif.
- Transcription Tool
Memory 109 contains an application program 115. In particular, application program 115 may implement the transcription tool described below. Transcription tool 115 plays audio segments to a user, who types the words spoken in the audio into transcription tool 115. Transcription tool 115 automates many of the traditional transcription responsibilities of the user.
FIG. 2 is a block diagram illustrating software elements of transcription tool 115. Input audio information is received by audio classification component 201. Users of transcription tool 115 (i.e., transcribers) interact with transcription tool 115 through user input component 203 and graphical user interface (GUI) 204. Control logic 202 processes the output of audio classification component 201 and coordinates the operation of graphical user interface 204 and user input component 203 to perform transcription in a manner consistent with the present invention.
Audio classification component 201 receives an input audio stream and performs acoustic classification functions on the audio stream. More particularly, audio classification component 201 may classify segments of the audio as either speech or non-speech audio, or as wideband or narrowband audio. Audio classification component 201 may output a series of classification codes that indicate when a particular segment of the audio changes from one classification state to another. For example, when audio classification component 201 begins to detect speech, it may output an indication that speech is beginning and a time code corresponding to when the speech begins. When the speech segment ends, audio classification component 201 may similarly output an indication that the speech is ending along with a corresponding time code.
In performing speech/non-speech and wideband/narrowband classifications, audio classification component 201 may analyze the frequency spectrum of the input audio information. For example, a wideband audio signal, such as a studio-quality audio signal, will have a wider frequency range than a narrowband signal, such as an audio signal received over a telephone line. Similarly, in classifying audio signals as speech or non-speech, audio classification component 201 may examine the frequency characteristics of the signal. Because signals that include human speech tend to exhibit certain characteristics, audio classification component 201 can determine whether or not the audio signal includes speech based on these characteristics. An implementation of audio classification component 201 is described in additional detail in application Ser. No. ______ (Attorney Docket No. 02-4022), titled "Systems and Methods for Providing Acoustic Classification," the contents of which are incorporated by reference herein.
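The text leaves the classifier itself unspecified. As a minimal sketch of the segment-boundary codes described above, the following uses short-time energy as a stand-in for a real speech/non-speech classifier; the frame size, threshold, and event names are illustrative assumptions, not part of the disclosure:

```python
import math

def classify_segments(samples, rate, frame_ms=20, energy_thresh=0.01):
    """Emit (time_sec, event) codes marking where speech-like audio begins
    and ends, analogous to the classification codes output by audio
    classification component 201. Short-time energy stands in for the
    unspecified classifier; a real system would use richer features."""
    frame_len = int(rate * frame_ms / 1000)
    codes = []
    in_speech = False
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > energy_thresh and not in_speech:
            codes.append((i * frame_len / rate, "speech_start"))
            in_speech = True
        elif energy <= energy_thresh and in_speech:
            codes.append((i * frame_len / rate, "speech_end"))
            in_speech = False
    if in_speech:  # close a segment that runs to the end of the stream
        codes.append((len(samples) // frame_len * frame_len / rate, "speech_end"))
    return codes
```

For example, half a second of silence followed by half a second of tone yields a "speech_start" code at 0.5 seconds and a "speech_end" code at 1.0 seconds.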
User input component 203 processes information received from the user. A user may input information through a number of different hardware input devices. A keyboard, for example, is an input device that the user is likely to use in entering the text corresponding to speech. Other devices, such as a foot pedal or a mouse, may be used to control the operation of transcription tool 115.
Graphical user interface 204 displays the graphical interface through which the user interacts with transcription tool 115. FIG. 3 is an exemplary diagram of an interface 300 that may be presented to the user via graphical user interface 204. Interface 300 includes waveform section 301 and transcription section 302. Additionally, interface 300 may include selectable menu options 303 and window control buttons 304. Through menu options 303, a user may initiate functions of transcription tool 115, such as opening an audio file for transcription, saving a transcription, and setting program options.
Waveform section 301 graphically illustrates the time-domain waveform of the audio stream that is being processed. The exemplary waveform shown in FIG. 3, waveform 310, includes a number of quiet segments 311 that delineate audible segments 312. Audible segments 312 may include, for example, speech, music, other sounds, or combinations thereof. Audio classification component 201 identifies the start location, the end location, and the classification (i.e., speech or non-speech, wideband or narrowband) of each audible segment 312.
Concurrently with the display of audio waveform 310, transcription tool 115 plays the audio signal to the user. Transcription tool 115 may visually mark the portion of waveform 310 that is currently being played. For example, as shown in FIG. 3, waveform segment 315 is the portion of the audio signal that is currently being played. An additional, more precise location marker, such as arrow 316, may point to the current playback position in audio waveform 310.
The user of transcription tool 115 may input (e.g., type) the textual transcription corresponding to the audio stream into transcription section 302. The user may control the playback of the audio signal through actions such as actuating a foot pedal, issuing graphical commands with a mouse, or entering keyboard commands. For example, if the user misses a particular word, the user may back up five seconds in the audio stream by tapping the foot pedal.
In alternate implementations, waveform section 301 may be omitted. In this situation, the interface may simply include transcription section 302.
- Operation of Transcription Tool
FIG. 4 is a flow chart illustrating the operation of transcription tool 115 for an input audio stream that contains speech. The audio stream may include, for example, audio from a radio or television broadcast.
Audio classification component 201 receives the input audio stream (act 401) and segments the audio stream into speech and non-speech segments (act 402). Audio classification component 201 may send indications of the segments to control logic 202, which also receives the audio stream. Control logic 202 displays a graphical waveform representing the audio stream (or a portion of the audio stream), such as waveform 310, in graphical user interface 204 (act 403). The waveform may include graphical indications of, for example, the segments that correspond to speech signals, the segment that is active, and the current playback position within the active segment. Concurrently with the graphical display of waveform 310, control logic 202 plays the audio stream back to the user (act 404).
As the audio stream is played to the user, the user may type in text corresponding to the audio signal. Control logic 202 displays the text in transcription section 302 (act 405). Additionally, control logic 202 may receive and process any commands input by the user (act 406). Examples of such commands include commands to move backwards or forwards in time in the audio stream or a command to skip to the next audio segment. The user may additionally enter predefined formatting commands that define additional information for a typed-in word. For example, the user, before typing in a proper name, may instruct transcription tool 115 that the word that the user is about to type is a proper name. The user instruction can be as simple as a key-code, such as a function key. Control logic 202 internally annotates the typed-in word as a proper name.
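The proper-name annotation step above can be sketched as a pass over the typed word stream, where a marker token flags the word that follows it. The marker value below is a hypothetical key-code; the text says only that a simple key-code such as a function key may be used:

```python
NAME_KEY = "<F2>"  # hypothetical key-code; the actual key is left open by the text

def annotate_tokens(typed_tokens):
    """Turn a typed token stream into (word, is_proper_name) pairs.
    A NAME_KEY token flags the next word as a proper name, mirroring the
    internal annotation performed by control logic 202."""
    annotated = []
    flag_next = False
    for token in typed_tokens:
        if token == NAME_KEY:
            flag_next = True      # remember the flag for the next real word
            continue              # the marker itself is not transcription text
        annotated.append((token, flag_next))
        flag_next = False
    return annotated
```

For example, the input stream `["the", "<F2>", "smith", "report"]` annotates only "smith" as a proper name.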
At the end of the current audio segment, control logic 202 automatically skips to the next audio segment 312 that corresponds to a speech signal (acts 407 and 408). Alternatively, control logic 202 may skip to the next audio segment based on user commands. Control logic 202 may thus skip over audio segments that are not useful for transcription purposes, such as music segments. In one implementation, the user may configure transcription tool 115 to play back only audio segments that meet additional criteria, such as segments that contain wideband speech signals.
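The segment-skipping behavior above amounts to a filter over the classified segments. The tuple layout and the wideband option below are assumptions chosen for illustration, not a format fixed by the text:

```python
def playable_segments(segments, require_wideband=False):
    """Return the (start_sec, end_sec) spans a transcriber should hear,
    skipping non-speech segments (and, optionally, narrowband ones) as
    control logic 202 does when locating the next segment to play.
    Each input segment is assumed to be a
    (start_sec, end_sec, is_speech, is_wideband) tuple."""
    keep = []
    for start, end, is_speech, is_wideband in segments:
        if not is_speech:
            continue  # e.g., music or silence: not useful for transcription
        if require_wideband and not is_wideband:
            continue  # e.g., telephone-quality speech, if filtered out
        keep.append((start, end))
    return keep
```

With a music segment, a studio speech segment, and a telephone speech segment, the default filter keeps both speech segments, while the wideband-only configuration keeps just the studio segment.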
When the user is finished transcribing, transcription tool 115 may output the transcription entered by the user to a file. The file may include the text typed by the user as well as meta-information added by transcription tool 115. The meta-information may include, for example, time codes that correlate the transcription with the original audio and codes that indicate which words the user indicated as being proper names.
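One plausible shape for the output file described above pairs each word with a time code and a proper-name flag. The exact file format is not specified by the text; this tab-separated layout is purely illustrative:

```python
def format_transcript(entries):
    """Serialize transcription entries to lines of the form
    '<time>\t<word>\t<flag>', where the flag marks words the user
    annotated as proper names. Each entry is assumed to be a
    (start_sec, word, is_proper_name) tuple; the layout is a guess
    at the meta-information described in the text."""
    lines = []
    for start_sec, word, is_name in entries:
        tag = "NAME" if is_name else "-"
        lines.append("%.2f\t%s\t%s" % (start_sec, word, tag))
    return "\n".join(lines)
```

A downstream training process could then re-align each word with the original audio via the time codes, which is the stated purpose of the meta-information.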
For some applications, it may be acceptable to generate transcriptions that do not include all of the words in the audio stream. In other words, a partial transcription that skips certain words may be an acceptable transcription. In these situations, in order to speed the transcription rate, the user may simply skip words that are not understood or the user may skip sections when the user falls behind.
As described herein, a transcription tool automates and simplifies the transcription process. By automatically identifying segments in an audio stream that are appropriate for transcription, the transcription tool saves the user from having to listen to and manually identify suitable segments for transcription from the audio stream. Additionally, the transcription tool is relatively simple to use and does not require the user to memorize or actively use a large number of commands. Accordingly, users can successfully transcribe audio with relatively little specialized training. Essentially, any user with competent typing skills and literacy in the target language can, with little training, effectively use the transcription tool. Additionally, due to the ability of the transcription tool to identify speech segments appropriate for transcription, the transcription tool can increase the efficiency, and thus lower the cost, of creating transcriptions.
The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, while a series of acts has been presented with respect to FIG. 4, the order of the acts may be different in other implementations consistent with the present invention.
Certain portions of the invention have been described as software that performs one or more functions. The software may more generally be implemented as any type of logic. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, software, or a combination of hardware and software.
No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used.
The scope of the invention is defined by the claims and their equivalents.