WO1994016443A1

WO1994016443A1 - Display system facilitating computer assisted audio editing

Info

Publication number: WO1994016443A1
Application number: PCT/US1993/012677
Authority: WO
Inventors: Mark J. Norton
Original assignee: Avid Technology, Inc.
Priority date: 1992-12-31
Filing date: 1993-12-30
Publication date: 1994-07-21
Also published as: GB9513392D0; US5634020A; GB2289558B; GB2289558A

Abstract

A display system represents an audio track as a discrete waveform, wherein each sample in the waveform may take on one of a discrete range of possible values. The discrete waveform indicates the presence of sound energy above user-set thresholds. Audio data samples are smoothed and applied to one or more thresholding functions. The resulting discrete output is displayed by using graphics of either different size or different color. An editor using this display system in conjunction with a computerized editing system receives an indication of where sufficient sound levels occur on the audio track. This information may be used to locate breaks in sound, dialog and other sound effects, which simplifies audio editing and synchronization of audio with other media in a multimedia presentation. Similar thresholds may be applied to the results of frequency analysis to provide an indication of which frequencies are present in the audio's signal.

Description

DISPLAY SYSTEM FACILITATING COMPUTER ASSISTED AUDIO EDITING

Field of the Invention

The present invention is related to computerized multimedia editing systems. More particularly, the invention is related to display systems which facilitate audio editing in computerized multimedia editing systems.

Background of the Invention

A common problem in the production of a multimedia program is dialog and other sound editing. Audio tracks often are searched for desired words, sentences, or other sound effects (often called "clips"), and appropriate mark-in and mark-out points are selected. These "clips" must then be synchronized with video or other media with which they are associated in a multimedia program. In conventional, linear editing, relying on analog or digital source tape, an editor linearly searches (i.e., jogs) through the source tape until a word break is detected. This process is slow even for an experienced editor.

In computerized editing systems, such as a digital audio workstation available from Avid Technology, Inc. of Tewksbury, Massachusetts, and a digital video workstation (e.g., Media Composer or Media Suite Pro, also available from Avid Technology, Inc.), the audio editing process has been made somewhat easier by providing a representation of the audio waveform for the audio track being edited.

Other available digital audio workstations include: the DSE-7000, from AKG Acoustics, Inc. of San Leandro, California; the DDR-10 from Otari Corp. of Tokyo, Japan; the Audio File Plus, from AMC Industries, PLC of Bernley, Great Britain; Dyaxis from Studer Editech of Menlo Park, California; and Waveframe 401 from Waveframe, Inc., of Sherman Oaks, California. Other available digital video workstations which allow for audio editing include: Video F/X Plus from Digital F/X of Mountain View, California; Studio from Matrox of Dorval, Quebec, Canada; Premier 2.0 from Adobe Systems, Inc., of Mountain View, California; EMC2 from Edit Machines Corporation of Washington, D.C.; Lightworks from OLE Partners, LTD., of London, England; and Picture Processor System III from Montage Group, Ltd., of New York, New York.

An audio waveform can be an amplitude or an energy (absolute value of the amplitude) plot. Unfortunately, a fair amount of experience is still needed to interpret these waveforms in order to take full advantage of their utility.

Summary of the Invention

To facilitate editing, a display system was developed which represents an audio track as a discrete waveform (i.e., each sample in the waveform may take on one of a discrete range of possible values) indicating the presence of sound energy above user-set thresholds. The audio data is smoothed and applied to one or more thresholding functions. The resulting discrete output is displayed, giving the editor an indication of where sufficient sound levels occur on the audio track.

Frequency analysis is also provided in one embodiment to allow detection of specific signals, rhythms and the like.

Brief Description of the Drawing

In the drawing,

FIG. 1 is a block diagram of a computer system suitable for implementing a display system in accordance with the present invention;

FIG. 2 illustrates sample graphics suitable for a display system in accordance with an embodiment of the present invention;

FIG. 3 is a sample energy plot of the prior art compared to a binary waveform as may be displayed in accordance with the present invention;

FIG. 4 is a graph illustrating how a thresholding function may be applied over a window of audio samples;

FIG. 5 is a graph illustrating a typical result from sorting audio samples with respect to amplitude;

FIG. 6 is a flow chart describing how the display of FIG. 2 can be generated in accordance with the present invention;

FIG. 7 is a sample energy plot of the prior art compared to a discrete waveform as may be displayed in accordance with the present invention;

FIG. 8 shows a sample waveform used to identify the location of music tracks on a compact disk;

FIG. 9 shows a sample waveform used to identify sound effects on a Foley track;

FIG. 10 shows a sample waveform used for frequency analysis; and

FIG. 11 shows a sample waveform used to identify the location of electronic slate beeps on a track.

Detailed Description

The invention will be more completely understood through a reading of the detailed description which follows, when taken in conjunction with the attached drawing.

Fig. 1 illustrates a suitable data processing system 10 with which the present invention may be implemented. The data processing system 10 is a typical programmable digital computer, such as the Macintosh family of computers available from Apple Computer of Cupertino, California (preferably model Quadra 950), or a workstation available from Silicon Graphics, Inc., of Mountain View, California (preferably the Indigo model computer). It should be understood that many other data processing systems could be used to implement the present invention and that those specified are intended to be merely exemplary and illustrative. Such a data processing system may be programmed using typical computer programming languages, such as C++ (on the Indigo) or ThinkC 5.0 (on the Macintosh), which may then be compiled into object code, readable by the data processing system 10, using a suitable compiler, as those familiar with this art would understand.

A suitable data processing system 10 includes a main unit 12 which includes a central processing unit (CPU) 14 which controls the operation of the computer and performs arithmetic and logical operations. The data processing system also includes a random access memory 16 (in which the data is volatile) connected to the CPU via a bus 18. The bus 18 also connects the CPU 14 to a display 20, such as a cathode ray tube (CRT) display or a liquid crystal display (LCD). The data processing system 10 also includes a nonvolatile memory 22, such as a hard disk or floppy disk drive. This disk drive is also connected to the CPU via bus 18. An input device 24, such as a keyboard, mouse, track ball, graphic tablet or other mechanical user interface, enables a user of the system to input information into the computer. The input device is connected to the CPU and memory via bus 18. The data processing system 10 also has an input port 26 which enables various multimedia data to be input directly into the computer system. Such an input usually includes, or may be connected to, an analog-to-digital converter (not shown) and other hardware subsystems which enable data such as video data and audio data to be directly input to the computer, sampled, and stored in memory 16 or disk 22.

The data processing system 10 may be programmed, for example, by using the computer languages described above, along with other computer languages, to enable audio editing. Many such systems are currently available. The present invention provides a display system which facilitates such audio editing by representing the audio data to be edited as a binary waveform having user-selectable parameters.

Referring now to Fig. 2, suitable graphics for such a display system are shown. Audio data is represented as a strip, or a track, 30. A plurality of tracks can be displayed as desired, along with video or other multimedia data (indicated at 32). Buttons 34 enable a user to select a track 30 for editing. The provision of such buttons is a familiar display and user interface technique in the art. A user may select such buttons and perform other editing functions using an input device which controls a cursor 36 on the display. Such cursor control devices include the track ball, mouse, or graphics tablet as described above.

Each strip or track 32 represents a selected duration of time from the audio track. The amount of time is typically user-selectable, and is selected on the basis of the resolution to which accuracy in editing is desired by the editor using the system. In the example shown in Fig. 2, five seconds of samples from an audio track are displayed. The audio data for a track is mapped to the display space provided. That is, if the number of samples for the selected time period is greater than the number of pixels available to display the track, the data samples are averaged so as to provide one data sample per pixel in the strip 32. In previous computerized audio editing systems, these averaged samples were typically displayed to provide an audio waveform display, having amplitudes scaled to fit the vertical limits of the corresponding display strip 32. A sample of such a waveform is shown in Fig. 3 at 40. In the present invention, such a waveform is converted into a binary waveform, such as shown at 42 in Fig. 3.

How the binary values are generated will now be described in connection with the graph of Fig. 4. Fig. 4 is a graph representing five samples S_{n-2} through S_{n+2}, each having a corresponding amplitude or energy, taken in a given period of time. For each sample S_n in the audio data displayed, the absolute values or squared values of the surrounding data samples S_{n-N} through S_{n+N} are summed. This sum, or an average or root-average based on this sum, is then compared to a threshold which is user-selectable. Preferably, the root-mean-square of the audio data is used for increased accuracy.

The threshold is taken from a range of values corresponding to the range of possible values subjected to the threshold operation. In this embodiment, the amplitude of the audio signal is represented by a signed 16-bit number. The possible threshold value therefore ranges between 0 and 32,000. It was experimentally determined that a suitable default threshold is 10,500. This default can then be adjusted by the user for a given audio track so that the resulting output corresponds to sounds heard while listening to the audio track.
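The windowed thresholding just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `binary_waveform`, the parameter `half_width` (N in the text), and the choice to truncate the window at the ends of the data are hypothetical. The root-mean-square measure and the default threshold of 10,500 are taken from the description above.

```python
import math

def binary_waveform(samples, threshold=10500, half_width=2):
    """Return one 0/1 value per sample.

    For each sample S_n, the root-mean-square of S_{n-N}..S_{n+N}
    (N = half_width) is compared against the threshold.
    Assumption: the window is simply truncated at the ends of the data.
    """
    out = []
    count = len(samples)
    for n in range(count):
        lo = max(0, n - half_width)
        hi = min(count, n + half_width + 1)
        window = samples[lo:hi]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        out.append(1 if rms > threshold else 0)
    return out

# A single loud sample surrounded by silence does not flip the state,
# while a sustained loud passage does.
spike = [0, 0, 0, 0, 20000, 0, 0, 0, 0]
sustained = [20000, -20000] * 5
print(binary_waveform(spike))
print(binary_waveform(sustained))
```

With a five-sample window, the isolated spike has a windowed RMS of roughly 8,900, which stays below the default threshold, so the waveform remains at zero.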

A threshold may be provided globally, for all tracks, and clips within tracks, displayed. However, the same threshold may not always be applicable to all clips within a track because different audio sources with different levels are often edited together. Alternatively, a threshold can be made an attribute of a track, or of a clip within a track, to provide more flexibility. In a computer system where tracks and clips are represented as objects such a modification may be readily made. A threshold can also be associated with a master clip, i.e., source data, having the advantage of storing the threshold with the sound data to which it is applied, allowing for a more accurate determination of an appropriate threshold. Threshold adjustment therefore becomes a function on the media database and is not an attribute of the display. Background noise levels and signal-to-noise ratios can also be computed. Auto-correlation and similar techniques can be used to separate the desired audio signal from unwanted noise. Once the background noise has been characterized and measured, the signal-to-noise ratio (S/N) can be computed. A threshold can be determined from this S/N ratio.

A threshold for a track or a clip may also be calculated based on the history of data samples for the track or the clip, allowing adaptation to transient shifts in background noise levels. To do this, the sample values can be sorted by amplitude. The resulting function of amplitude to number of samples tends to have two local maxima: one indicating a noise level, the other indicating a desired sound level. See Fig. 5. The local minimum between these two levels may be used as an appropriate threshold. A steady state room tone recording could be used as a source of expected background noise levels to improve the accuracy of such calculations. That is, steady state room tone samples may be added to the sorted audio samples, thus increasing the number of samples at the noise level.
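One way to picture this calculation is with a coarse amplitude histogram. The following Python sketch is illustrative only: the function name `valley_threshold`, the `bin_width` parameter, the simple local-maximum test, and the optional `room_tone` argument are all assumptions, and a real histogram would likely need smoothing before its peaks could be trusted.

```python
def valley_threshold(samples, bin_width=1000, room_tone=()):
    """Estimate a threshold as the valley between the two tallest histogram peaks.

    Assumptions: amplitudes are binned by absolute value, peaks are plain
    local maxima, and optional room tone samples are simply appended to
    reinforce the noise peak.
    """
    counts = {}
    for s in list(samples) + list(room_tone):
        b = abs(s) // bin_width
        counts[b] = counts.get(b, 0) + 1
    n_bins = max(counts) + 1
    hist = [counts.get(b, 0) for b in range(n_bins)]

    # A bin is a peak if it is non-empty, strictly higher than its left
    # neighbour and at least as high as its right neighbour.
    peaks = [
        b for b in range(n_bins)
        if hist[b] > 0
        and (b == 0 or hist[b] > hist[b - 1])
        and (b == n_bins - 1 or hist[b] >= hist[b + 1])
    ]
    if len(peaks) < 2:
        return None  # no clear noise/sound separation

    tallest = sorted(peaks, key=lambda b: hist[b], reverse=True)[:2]
    first, second = sorted(tallest)
    valley = min(range(first, second + 1), key=lambda b: hist[b])
    return valley * bin_width + bin_width // 2  # centre of the valley bin

noise = [200] * 50 + [300] * 30
speech = [12000] * 20 + [13000] * 10
print(valley_threshold(noise + speech + [5000]))
```

The returned value is the centre of the emptiest bin lying between the noise peak and the sound peak; it would still be offered to the user as an adjustable starting point.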

The example shown in Fig. 4 is a five sample window (N=2) into the sound track. The number of samples which are summed per displayed sample (i.e., the size of the sample window) should be odd, so that the sample window is centered over the current sample S_n. A sample window is provided so that sporadic samples do not cause the binary waveform to change states too quickly. Rather, several samples in sequence must be either low or high in order to change the state of the binary waveform, thereby providing some state momentum. The number of samples considered may also be made user-selectable, allowing the user to control the level of state momentum in the averaging process. There is a slight delay introduced by this method which causes the binary waveform to change states a number of time units after the actual energy waveform changes. In general, this delay is not a problem, since there are tens of thousands of samples per second and the delay is correspondingly negligible. It has been found that a sample window of five samples provides a suitable smoothing filter for this purpose. A three sample window was found to be insufficient, particularly at higher time resolutions. The sample window should not be made too wide as it would tend to distort the waveform of the audio data.

Fig. 6 is a flow chart describing how this display is generated when a user selects a given audio track. The user first selects the audio data to be viewed in step 50. This audio data is usually available on the computer as an array of time-indexed 16-bit floating point words, wherein each word represents an instantaneous measurement of sound energy, sampled at 44.1 kHz. The data may also be 16-bit integers which enable faster computation. Audio data may be received at a number of different sampling rates; the sampling rate of 44.1 kHz is typical for compact disc audio data. Such sound information is typically received through a microphone which provides an analog signal. The analog signal is converted to a digital signal using an analog-to-digital converter, as is well known in the art, which provides a word of digital data at a given sampling rate. This data can be stored in a variety of different media, such as a floppy disk, hard disk or digital audio tape, as time-indexed information.

The selected audio data is mapped to the display space in step 52, as described above, using procedures which are well known in the art. For example, a number of audio data samples can be averaged to provide a corresponding sample to be displayed for each pixel in the display space. As described in connection with Fig. 4, for each sample to be displayed for a pixel in the display space, the sum of samples within a sample window is calculated and a threshold is applied in step 54 to obtain a binary value. A representation of this binary value is then displayed (step 56).
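Steps 52 through 56 can be sketched end to end in Python. The names `map_to_pixels`, `threshold_pixels` and `render` are hypothetical, as are the use of the mean absolute value as the per-pixel average and the text characters standing in for the displayed graphics; the sketch only illustrates the order of operations in the flow chart.

```python
def map_to_pixels(samples, n_pixels):
    """Step 52: average |sample| over the run of samples behind each pixel."""
    n = len(samples)
    if n <= n_pixels:
        return [float(abs(s)) for s in samples]
    pixels = []
    for p in range(n_pixels):
        start = p * n // n_pixels
        end = (p + 1) * n // n_pixels
        chunk = samples[start:end]
        pixels.append(sum(abs(s) for s in chunk) / len(chunk))
    return pixels

def threshold_pixels(pixels, threshold=10500, half_width=2):
    """Step 54: compare the windowed average around each pixel to the threshold."""
    bits = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - half_width):i + half_width + 1]
        bits.append(1 if sum(window) / len(window) > threshold else 0)
    return bits

def render(bits):
    """Step 56: stand-in for the display, '#' for sound and '.' for silence."""
    return "".join("#" if b else "." for b in bits)

# 100 silent samples, 100 loud samples, 100 silent samples on a 30-pixel strip.
track = [0] * 100 + [20000] * 100 + [0] * 100
print(render(threshold_pixels(map_to_pixels(track, 30))))
# prints "..........##########.........."
```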

When the data is filtered, and the resulting binary waveform is displayed, the location of word breaks, or other breaks, in the sound on the audio track can be readily determined simply by viewing the display. For example, as shown in Fig. 3, where the binary waveform is zero, the person speaking the indicated sentence is pausing between words. Using such a display for editing enables an editor to readily mark cut locations (mark-in and mark-out locations) in the audio track. How cuts are marked in such computerized multimedia editing systems involves techniques which are well-known in the art. In video editing systems, although editing granularity is at the video frame level, fairly accurate edits can be made on word break boundaries using this display mechanism.

Because the number of data samples displayed depends on the size of the window and the time resolution selected by the user, the granularity of the binary waveform also changes, so that it does not always indicate word breaks. At much higher resolutions, it has been found that syllables of words can be detected. At lower resolutions, breaks between sentences are detected, while word breaks are not. Such functionality is useful for editing because it allows high level selections to be made easily; later, finer editing can be performed.

Given a binary waveform which indicates the presence of word or any sound breaks in a sound track, new editing controls may also be provided. Some of these functions include going to a next word, going to the end of a word, selecting certain words, playing a word or a selection of words, or marking the start or end of a word as a cut location in the track. Such a display system may also be used for musical audio data. Given an appropriate threshold level, the binary waveform may be used to isolate volume peaks and crescendos in music. These and similar functions allow an editor to create multimedia programs based on the dialog or musical content of a sound track.
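Editing controls of this kind reduce to finding runs of ones in the binary waveform. The sketch below assumes the waveform is a Python list of 0/1 values; the function names `find_segments`, `next_word` and `end_of_word` are hypothetical and only illustrate how "go to next word" or "go to end of word" might be derived.

```python
def find_segments(bits):
    """Return (start, end) index pairs for each run of 1s; end is exclusive."""
    segments = []
    start = None
    for i, b in enumerate(bits):
        if b and start is None:
            start = i
        elif not b and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(bits)))
    return segments

def next_word(segments, position):
    """Index where the next sound segment begins after `position`, or None."""
    for start, _end in segments:
        if start > position:
            return start
    return None

def end_of_word(segments, position):
    """End index of the segment containing `position`, or None if in silence."""
    for start, end in segments:
        if start <= position < end:
            return end
    return None

bits = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
segments = find_segments(bits)
print(segments)                  # [(1, 3), (5, 8), (9, 10)]
print(next_word(segments, 2))    # 5
print(end_of_word(segments, 6))  # 8
```

The segment boundaries are natural candidates for mark-in and mark-out points, subject to the granularity of the selected time resolution.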

Binary waveforms may also be used to identify and locate the presence of a sound effect on a Foley track (for example, see Fig. 9) especially if the track includes large amounts of silence, greatly improving the ability to synchronize the sound effect with visual material. Sound effects may be quickly found, edited, and synchronized with other material. In stereo sound, synchronization may be repaired if lost when one track slips versus another.

This type of display may also be used to detect long pauses, which may be then used to identify and separate effects or music tracks captured from prerecorded sources such as records, tapes, compact discs and the like. (For example, see Fig. 8).
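Separating pieces at long pauses can be sketched as follows. The function name `split_at_long_pauses`, the `values_per_second` parameter (how many binary waveform values represent one second) and the two-second default are assumptions chosen for illustration.

```python
def split_at_long_pauses(bits, values_per_second, min_pause_seconds=2.0):
    """Split a binary waveform into (start, end) pieces separated by long silences.

    Short gaps (word or phrase breaks) stay inside a piece; only silences of
    at least `min_pause_seconds` start a new piece. `end` is exclusive.
    """
    min_gap = int(min_pause_seconds * values_per_second)
    pieces = []
    start = None       # where the current piece began
    last_sound = None  # index of the most recent 1
    for i, b in enumerate(bits):
        if not b:
            continue
        if start is None:
            start = i
        elif i - last_sound - 1 >= min_gap:
            pieces.append((start, last_sound + 1))
            start = i
        last_sound = i
    if start is not None:
        pieces.append((start, last_sound + 1))
    return pieces

# Two values per second: the one-value gap is kept, the five-value gap splits.
bits = [1, 1, 0, 1, 1] + [0] * 5 + [1, 1, 1]
print(split_at_long_pauses(bits, values_per_second=2))
# prints [(0, 5), (10, 13)]
```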

By using such a display system an editor may readily visualize sound pieces, and the editing process is accelerated and simplified. Sentences, phrases, words, syllables, transient noises, speech patterns, and even silence such as dramatic pauses and sentence and phrase breaks, may be quickly located and isolated in a long audio track and extracted for appropriate use. Thus, less experience is required to generate high quality multimedia productions. Such a display system facilitates the development of marketing and advertising multimedia programs by companies who have no personnel with experience in film editing.

The invention is not limited to generating merely a binary waveform, nor to amplitude data.

Two or more thresholds may also be provided to produce a discrete waveform, having a smaller range than a continuous waveform, but a broader range than a simple binary waveform. Two thresholds provide a hysteresis type state behavior to the display. For more thresholds, discrete color values may be used to identify levels of sound over different thresholds. For example, black may be used to indicate silence, purple for low volume, blue for mid-volume and green for high volume. A 16-bit continuous range of colors can completely represent a range of amplitudes from 0 to 32,000 represented by the sound data.
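The hysteresis behavior obtained with two thresholds can be illustrated with a short Python sketch. The description does not spell out the exact rule, so the rule used here is an assumption: the state turns on only when the smoothed level reaches the upper threshold and turns off only when it falls below the lower one. The name `hysteresis_waveform` and the example threshold values are hypothetical.

```python
def hysteresis_waveform(levels, low, high):
    """Binary waveform with hysteresis.

    Assumed rule: switch on at or above `high`, switch off below `low`,
    and otherwise keep the previous state.
    """
    state = 0
    out = []
    for v in levels:
        if state == 0 and v >= high:
            state = 1
        elif state == 1 and v < low:
            state = 0
        out.append(state)
    return out

levels = [0, 6000, 12000, 9000, 7000, 4000, 9000]
print(hysteresis_waveform(levels, low=5000, high=10000))
# prints [0, 0, 1, 1, 1, 0, 0]
```

Levels that hover between the two thresholds leave the state unchanged, which suppresses rapid flicker near a single threshold.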

A sample of a display using three thresholds, with threshold level colors identified by patterns, is shown in Fig. 7. In this figure, a discrete waveform is shown at 60, with a corresponding energy plot at 62. The threshold levels are indicated at 64, 65 and 66. When the sound energy is below the first threshold 64, the waveform takes on one value, indicated by black in Fig. 7 at, for example, 68. When sound energy is above the first threshold 64 but below the second threshold 65, the waveform takes a second value, indicated by white in Fig. 7 at, for example, 70. Similarly, the waveform takes on a third value when the sound energy is between the second and third thresholds, as indicated by horizontal lines, for example, at 71. Otherwise the waveform takes on a fourth value indicated by diagonal lines, for example, at 72.
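Mapping a smoothed level onto one of four discrete values with three thresholds amounts to counting how many thresholds the level has reached. The Python sketch below uses the standard `bisect` module; the threshold values and the name `discrete_level` are hypothetical, and the color names follow the example given earlier in the description rather than the patterns of Fig. 7.

```python
from bisect import bisect_right

# Hypothetical threshold values, sorted in ascending order.
THRESHOLDS = [4000, 10500, 20000]
COLORS = ["black", "purple", "blue", "green"]

def discrete_level(value, thresholds=THRESHOLDS):
    """Return 0..len(thresholds): the number of thresholds at or below `value`."""
    return bisect_right(thresholds, value)

for level in (0, 4000, 12000, 25000):
    index = discrete_level(level)
    print(level, index, COLORS[index])
```

A value exactly equal to a threshold is counted here as having reached it; the opposite convention would use `bisect_left`.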

Frequency data may also be used for this display system. Such frequency data can be obtained by applying a simple fast Fourier transform (FFT) with a limited frequency band to the audio data. A threshold can be applied to the amplitudes of the different frequency bands to determine if sounds within a certain frequency band are present. Such information may then also be displayed. A sample display is shown in Fig. 10.

Using frequency analysis, certain sounds can be detected, such as electronic slate beeps (signals used to separate one tape or scene from another during video and audio recording sessions) (for example, see Fig. 11). If the frequency of the desired signal is known (in this example the presumption is 1.2 kHz), its occurrence in a track can be identified, such as shown in black in Fig. 11.
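Detection of a tone at one known frequency can be sketched in Python. Note the substitution: instead of the band-limited FFT mentioned in the description, this sketch uses the Goertzel algorithm, which evaluates a single frequency bin and is a common choice when only one frequency is of interest. The function names, the 441-sample block size (10 ms at 44.1 kHz) and the amplitude threshold of 5,000 are assumptions.

```python
import math

def tone_amplitude(block, sample_rate, freq):
    """Estimate the amplitude of `freq` in `block` using the Goertzel algorithm."""
    n = len(block)
    k = round(n * freq / sample_rate)      # nearest frequency bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in block:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return 2.0 * math.sqrt(max(power, 0.0)) / n

def detect_tone(samples, sample_rate=44100, freq=1200.0,
                block_size=441, threshold=5000.0):
    """Return one 0/1 flag per block: 1 where the tone exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        amplitude = tone_amplitude(block, sample_rate, freq)
        flags.append(1 if amplitude > threshold else 0)
    return flags

# 10 ms of silence, 20 ms of a 1.2 kHz beep, 10 ms of silence.
silence = [0.0] * 441
beep = [10000.0 * math.sin(2.0 * math.pi * 1200.0 * i / 44100.0)
        for i in range(882)]
print(detect_tone(silence + beep + silence))
# prints [0, 1, 1, 0]
```

The resulting flags form a binary waveform in the same sense as the amplitude-based one, so they could be displayed or searched in the same way.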

It is also possible to show selected frequencies or a range of frequencies, using colors to denote the various frequency bands. Such frequency data allows certain aspects of music to be viewed, including beat detection. In some compositions with strong drum or other rhythm sounds, it is possible to isolate or determine the tempo of the music using such frequency data.

Frequency data may also be used to identify potential problem areas in a sound track. For example, repetitive background noise events, such as fans, light buzzing, etc. may be detected. Using both frequency and amplitude data, it is possible to avoid the loss of dialogue in noisy environments, when dialogue levels fall below the levels of background noise. It is then possible to differentiate further word breaks in this low level dialogue.

Having now described a few embodiments of the invention, it should be apparent to those skilled in the art that the foregoing is merely illustrative and not limiting, having been presented by way of example only. Numerous modifications and other embodiments are within the scope of one of ordinary skill in the art and are contemplated as falling within the scope of the invention as defined by the appended claims.

Claims

1. A display system for facilitating computerized editing of audio data, comprising: means for selecting at least a portion of the audio data; means for generating a discrete waveform representative of the selected portion of audio data; and means for displaying the discrete waveform on a video display.

2. The display system of claim 1 wherein the selected audio data comprises a plurality of samples and the means for generating includes means for applying a smoothing operation to the selected audio data to obtain an averaged value for each sample and means for applying a threshold operation to each of the averaged values to obtain a binary value.

3. A method of audio data for facilitating computerized editing of the audio data, the method comprising the steps of: selecting at least a portion of the audio data; generating a discrete waveform representative of the selected portion of the audio data; and displaying the discrete waveform on a video display.

4. The method of claim 3 wherein the selected audio data comprises a plurality of samples and the step of generating a discrete waveform includes the steps of applying a smoothing operation to the selected audio data to obtain an averaged value for each sample and applying a threshold operation to each of the averaged values to obtain a binary value.