Publication number: US3909532 A
Publication type: Grant
Publication date: Sep 30, 1975
Filing date: Mar 29, 1974
Priority date: Mar 29, 1974
Also published as: CA1036271A1
Publication number: US 3909532 A, US 3909532A, US-A-3909532, US3909532 A, US3909532A
Inventors: Lawrence Richard Rabiner, Lewis Hyman Rosenthal, Ronald William Schafer
Original Assignee: Bell Telephone Labor Inc
Apparatus and method for determining the beginning and the end of a speech utterance
US 3909532 A
Abstract
It has been discovered that the energy of the code words at the output of an adaptive speech encoder may be utilized to accurately determine the beginning and end of an encoded speech utterance. The beginning of an utterance is detected when the code word energy exceeds a predetermined threshold for a fixed duration of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed duration of time.
Description  (OCR text may contain errors)

United States Patent Rabiner et al.

[45] Sept. 30, 1975

[54] APPARATUS AND METHOD FOR DETERMINING THE BEGINNING AND THE END OF A SPEECH UTTERANCE

[75] Inventors: Lawrence Richard Rabiner, Berkeley Heights, N.J.; Lewis Hyman Rosenthal, Cambridge, Mass.; Ronald William Schafer, New Providence, N.J.

[73] Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, N.J.

[22] Filed: Mar. 29, 1974

[21] Appl. No.: 456,027

[52] U.S. Cl. 179/1 SC; 325/36 B

[51] Int. Cl. G10L 1/04

[58] Field of Search: 179/1 SA, 1 SC; 325/38 B, 325/62, 326

[56] References Cited

UNITED STATES PATENTS
3,750,024 7/1973 Dunn et al. 325/38 B

OTHER PUBLICATIONS
Johnson, C. et al., Adaptive Rate Delta Modulator, IBM Tech. Disclosure, April 1973.

Primary Examiner: Kathleen H. Claffy
Assistant Examiner: E. S. Kemeny
Attorney, Agent, or Firm: G. E. Murphy

27 Claims, 12 Drawing Figures

[Front-page drawing (FIG. 5A): speech input, adaptive encoder 501, code word energy detector 502, threshold detector 503, with BEGIN and END outputs.]

[Drawing sheets 1 through 5 (FIG. 1, prior art, through FIG. 11, including FIG. 5A) are not reproduced in this text.]

APPARATUS AND METHOD FOR DETERMINING THE BEGINNING AND THE END OF A SPEECH UTTERANCE

[...] of extensive research. Generally, for these applications, speech must be stored in digital form. Typically, a file of speech is created and stored in a suitable memory, e.g., a fixed head disk or drum. In order to efficiently store speech, it is necessary that individual words and phrases be stored in memory without intervening periods of silence between entries. Thus, the need to automatically locate the beginning and end of a speech utterance frequently arises in speech processing for man-machine communication.

DESCRIPTION OF THE PRIOR ART

Conventionally, the task of determining the endpoints of a speech utterance has been accomplished by manual editing, utilizing a combination of auditory and visual examinations of the speech waveform. However, manual editing is both time-consuming and subject to the inaccuracies concomitant with human judgment. Furthermore, repeatable results are not normally obtained. One reason for this is that the wide dynamic range of speech renders the combination of ear and eye a poor determinant of word boundaries. This is especially true when an unvoiced segment of speech, e.g., the fricative at the beginning of the word "three," appears at the beginning or end of a word. Consequently, manual editing usually results in shortening the speech, both at the beginning and at the end of the utterance. Thus, the words are chopped, and when they are concatenated to form a message, the effects are quite discernible and also distracting.

It is thus an object of this invention to efficiently, accurately, and automatically detect the beginning and end of a speech utterance.

SUMMARY OF THE INVENTION

This and other objects of this invention are accomplished by utilizing an adaptive speech encoder, e.g., an adaptive differential pulse code modulator (ADPCM), an adaptive delta modulator, etc. It has been discovered by us that because of the step size adaptation used in developing adaptive encoded speech, an adaptive speech encoder effectively exhibits a form of automatic gain control useful in determining the endpoints of an utterance. Coded output words of such a coder, it has been found, exhibit high energy during both voiced and unvoiced [...] The beginning of a speech utterance is detected when the code word energy exceeds a predetermined threshold for a fixed interval of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed interval of time.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a prior art ADPCM coder which may be used in the practice of this invention;

FIG. 2 displays the code word sequence for the utterance "oh";

FIG. 3 displays the decoded speech waveform corresponding to the code word sequence of FIG. 2;

FIG. 4 is a block diagram of apparatus used in the practice of this invention to determine code word energy;

FIG. 5 is a block diagram of apparatus used in the practice of this invention to determine the beginning and end of a speech utterance;

FIG. 5A is a block diagram depicting the system operation of this invention;

FIG. 6 displays the code word sequence for the beginning of the utterance "three";

FIG. 7 displays the code word energy corresponding to the code word sequence of FIG. 6;

FIG. 8 displays the decoded speech waveform corresponding to the code word sequence of FIG. 6;

FIG. 9 displays the energy of the speech waveform of FIG. 8;

FIG. 10 displays the code word sequence for the end of the utterance "three"; and

FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 depicts a prior art adaptive differential pulse code modulation circuit which is described in detail in the article by P. Cummiskey, N. S. Jayant, and J. L. Flanagan, entitled "Adaptive Quantization in Differential PCM Coding of Speech," Bell System Technical Journal, Vol. 52, pp. 1105-1118, September 1973. In the ADPCM coder of FIG. 1, differential input amplifier or network 11 develops an output signal proportional to the difference between an applied sampled speech signal and a signal which is an estimate of the incoming speech signal. This difference signal is quantized in adaptive quantizer 12 and applied to encoder 13 and to summing amplifier or network 14. Summing amplifier 14, in conjunction with first order prediction network 15, having a transfer function, for example, of $az^{-1}$, is utilized to develop an estimate of the incoming speech signal. If the estimate of the input speech signal is fairly accurate, then the difference signal emanating from network 11 will be small and thus more accurately represented by a fixed number of bits than the input speech samples themselves. The difference signal, although nowhere near as redundant as the original speech signal, still exhibits a wide amplitude range. In order to make efficient use of the available quantization levels of quantizer 12, the peak excursion of the [...] level voiced sounds.

Accordingly, the need for adaptive quantization is apparent; logic network 16 utilizes the coded speech signals (code words) emanating from encoder 13 to determine optimum quantization steps. That is, logic network 16 monitors the coded output of encoder 13 and provides for adaptation of the step size on the basis of the most recent encoded quantizer output. For example, if the code word corresponds to one of the higher levels, the quantizer is overloaded and the step size is increased. On the other hand, if the code word corresponds to one of the lower levels, the step size is decreased. Step size adaptation effectively compensates for amplitude variations to the extent that the quantizer treats low level unvoiced speech signals, e.g., fricatives, much the same as high level voiced speech signals. The objective, of course, is that each of the quantizer levels be used a significant portion of the time regardless of the absolute amplitude level of the incoming speech samples. However, when the amplitude of the input speech signal is of the order of the minimum step size, the adaptation logic insures that the step size will seek its minimum value and the difference signal will then fall within the lowest quantization levels. We have discovered that when no speech is present at the input, the code word energy will vary only slightly. It is this feature of ADPCM speech encoding, and of adaptive encoding in general, that is turned to account in the practice of this invention. It is to be understood that the principles of this invention are applicable to all forms of adaptive encoders including ADPCM and adaptive delta modulation.
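The step size adaptation described above can be illustrated with a minimal Python sketch. It is not the coder of FIG. 1 as built: the multiplier table, the step size limits, the predictor coefficient and the name `adpcm_encode` are all assumptions chosen for illustration. Only the 4-bit code word range (0 for the largest negative level, 15 for the largest positive level), the first order predictor, and the rule that higher levels increase the step while lower levels decrease it come from the text.

```python
# Hypothetical step-size multipliers indexed by magnitude level 0..7.
# Inner (low) levels shrink the step, outer (high) levels grow it.
# These values are assumptions, not taken from the patent.
MULTIPLIERS = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]
STEP_MIN, STEP_MAX = 1.0, 128.0   # assumed step size limits
A = 0.9                           # assumed predictor coefficient (a in a*z^-1)


def adpcm_encode(samples):
    """Return 4-bit code words (0..15) for a sequence of numeric samples."""
    step = STEP_MIN
    recon_prev = 0.0              # previous reconstructed sample
    codes = []
    for x in samples:
        estimate = A * recon_prev             # first order prediction
        diff = x - estimate                   # difference signal
        mag = min(int(abs(diff) / step), 7)   # magnitude level, clipped at 7
        sign = 1 if diff >= 0 else -1
        # Code 0 is the largest negative level, code 15 the largest positive.
        code = 8 + mag if sign > 0 else 7 - mag
        quantized = sign * (mag + 0.5) * step
        recon_prev = estimate + quantized     # summing network output
        # Adapt the step from the most recent code word, within the limits.
        step = min(max(step * MULTIPLIERS[mag], STEP_MIN), STEP_MAX)
        codes.append(code)
    return codes
```

With an all-zero input the step size stays at its minimum and this sketch emits only the two innermost code words, 7 and 8, which is the low-activity behavior the text associates with silence.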

FIG. 2 is an exemplary display of code word activity for the voiced utterance "oh." Each line, A, B, C, D of FIG. 2, corresponds to approximately 256 samples (6 kHz sampling rate) of the applied speech utterance, i.e., approximately 40 milliseconds of the signal. Line B is to be considered a continuation of line A, line C a continuation of line B, etc. It is noted that for line A, and for most of line B, the code words show little activity, remaining for the most part within a limited range of quantization levels. This first part of the code word sequence corresponds to background silence. However, at almost the end of line B, and then for the remainder of lines C and D, the code word sequence fluctuates much more rapidly and with greater amplitude. FIG. 3 illustrates the decoded speech waveform corresponding to the code word sequence of FIG. 2. It is noted in FIG. 3 that voiced speech apparently commences somewhere near the end of line B and continues for lines C and D. This property of code words to indicate the presence of speech activity is more accurately reflected in what we define as adaptation activity or code word energy. The code word energy may be defined as the number of code word adaptations per unit time. In one embodiment of this invention, we used as a measurement of energy the sum of the squares of the code words for one hundred and one samples, or code words, corresponding to a 16 millisecond window centered about a selected sample. That is, the code word energy may be defined as

$$E(n) = \sum_{i=n-50}^{n+50} c^2(i) \qquad (1)$$

where c(i) corresponds to a code word emanating from encoder 13 of FIG. 1. Of course, other equivalent definitions of energy may be utilized.
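As a rough illustration of the windowed sum of squares just defined, the following Python sketch evaluates the energy at one sample index directly. The function name `code_word_energy` and the choice to raise an error near the ends of the sequence are assumptions; the sketch also assumes the code words are already signed values with zero average.

```python
def code_word_energy(codes, n, half_window=50):
    """Sum of squared code words over the window centered on sample n.

    With half_window=50 the window holds 101 code words, as in the text.
    """
    lo, hi = n - half_window, n + half_window
    if lo < 0 or hi >= len(codes):
        raise IndexError("window extends past the code word sequence")
    return sum(c * c for c in codes[lo:hi + 1])
```

For example, a sequence that alternates between +1 and -1 gives an energy of 101 at every index where the full window fits, while a constant sequence of 3s gives 909.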

In the prior art ADPCM implementation of FIG. 1, the largest negative quantization level is represented by the binary code word 0000 while the largest positive quantization level is represented by the binary code word 1111, corresponding to the decimal number 15. Thus, it is necessary, if one is using such a symmetrical coding system, to subtract from the code words a number corresponding to the dc level or average value of the code words to make the average level of the code words equal to zero. Of course, a different coding implementation may be utilized which inherently has a zero average value. Since the number 7.5, corresponding to the average value, may not be conveniently represented in digital form, the following definition of energy may be utilized:

$$E(n) = \sum_{i=n-50}^{n+50} a(i) \qquad (2)$$

where $a(i) = [2c(i) - 15]^2$. By using this definition, the dc level is removed from consideration and the energy content of the code words differs from the definition of Eq. (1) by only a multiplicative constant. It may readily be shown that the energy term defined by Eq. (2) is equivalent to

$$E(n) = E(n-1) + a(n+50) - a(n-51) \qquad (3)$$

The code word energy, in accordance with this invention, is computed at each sample of the speech signal and compared with a threshold which is established at a level intermediate to the measured energy of silence and the average measured energy of the speech utterance. When the code word energy exceeds this threshold for approximately 320 consecutive samples, corresponding to about 50 milliseconds of speech, the word c(n) at which the energy first exceeded the threshold is defined as the beginning of an utterance. The code word energy-threshold comparison is continued, and when the code word energy falls below the threshold for approximately 1,024 consecutive samples, corresponding to about 160 milliseconds of speech, the point at which the energy first fell below the threshold is defined as the end of the utterance. The 160 millisecond criterion insures that a stop consonant within a word or phrase will not be mistaken for the end of the utterance.
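The duration rule described above (about 320 consecutive samples above the threshold to mark a beginning, about 1,024 consecutive samples below it to mark an end) can be sketched in Python as follows. This is a software illustration under stated assumptions, not the circuit of the patent: the function name `find_endpoints`, the return format, and the treatment of an energy exactly equal to the threshold as "below" are all hypothetical choices.

```python
def find_endpoints(energy, threshold, begin_run=320, end_run=1024):
    """Return a list of (begin, end) sample indices for detected utterances.

    begin is the index at which the energy first exceeded the threshold in a
    run of at least begin_run samples; end is the index at which it first
    fell below in a run of at least end_run samples (None if never reached).
    """
    utterances = []
    in_utterance = False
    run_start = None      # start of the current run that contradicts the state
    begin = None
    for n, e in enumerate(energy):
        above = e > threshold
        if above != in_utterance:
            # Sample contradicts the current state: extend the run.
            if run_start is None:
                run_start = n
            run_len = n - run_start + 1
            if not in_utterance and run_len >= begin_run:
                begin, in_utterance, run_start = run_start, True, None
            elif in_utterance and run_len >= end_run:
                utterances.append((begin, run_start))
                in_utterance, run_start = False, None
        else:
            # Sample agrees with the current state: the run is broken.
            run_start = None
    if in_utterance:
        utterances.append((begin, None))
    return utterances
```

A 50-sample dip below the threshold in the middle of an utterance, such as a stop consonant might produce, does not end the utterance in this sketch because the dip is shorter than `end_run`.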

Apparatus for determining the energy of the code words in accordance with Eq. (3) is illustrated in FIG. 4. A code word, c(i), emanating from encoder 13 of FIG. 1, is applied to digital doubler 17, wherein it is doubled in value to develop a signal 2c(i), which is twice the digital value of the applied code word. Digital doubler 17 may be of any well-known configuration, e.g., a shift left by one bit register will double the value of an applied binary signal. Digital subtractor 18 subtracts from signal 2c(i) a signal supplied by digital reference register 19. The signal stored in register 19 is proportional to the dc level or average of the code words. In a particular embodiment, the digital signal stored in register 19 is equal to fifteen as required by Eq. (2). Digital multiplier 21 multiplies the output signal of subtractor 18 by itself to achieve a squared signal which corresponds to the function a(i) of Eq. (2). Both subtractor 18 and multiplier 21 may be conventional digital arithmetic circuits. The signal output, a(i), of multiplier 21 is applied to shift register 22. Register 22, which preferably has a digital capacity of one hundred and two words, sequentially shifts digital signal a(i) through the register at the system clock rate. It is to be understood that in the circuitry of FIG. 4, and also in that of FIG. 5, all operations are performed in synchronism with the master sampling clock of the coder of FIG. 1, which has not been depicted in order not to obfuscate the operation of the instant invention. At any point of time, the last digital word stored in register 22, i.e., the oldest word in storage, corresponds to a(n-51), and the first word stored in register 22, i.e., the most recently stored word, corresponds to a(n+50). The first and last words of register 22 are combined in conventional digital subtractor 23 to form a difference signal, a(n+50) - a(n-51). This difference signal is applied to conventional digital adder 24 which, in conjunction with delay network 25, develops a signal representative of the code word energy as defined in Eq. (3). Delay network 25 may be of conventional design and is utilized to delay the output of adder 24 by one clock period.
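The data path of FIG. 4 can be mimicked in software to check that the running update gives the same result as summing the window directly. The sketch below is a Python illustration only: the name `running_energy` is hypothetical, a `deque` stands in for the 102-word shift register 22, and the register is assumed to start filled with zeros. Each squared term is taken as a(i) = (2c(i) - 15)^2, and the running sum adds the newest term and subtracts the one that is 101 samples older.

```python
from collections import deque


def running_energy(codes, window=101, dc_reference=15):
    """Return the windowed energies computed with a running update.

    Element k of the result is the sum of a(i) over codes[k : k + window],
    i.e. the energy of the window centered on index k + window // 2.
    """
    # 102 words: the newest word plus the 101 that precede it.
    register = deque([0] * (window + 1), maxlen=window + 1)
    energy = 0
    energies = []
    for i, c in enumerate(codes):
        a = (2 * c - dc_reference) ** 2        # doubler, subtractor, multiplier
        register.append(a)                     # newest in, oldest shifted out
        energy += register[-1] - register[0]   # newest minus oldest, accumulated
        if i >= window - 1:                    # window is full from here on
            energies.append(energy)
    return energies
```

The design point worth noting is that the update costs one subtraction and one addition per sample regardless of the window length, which is why the hardware needs only a shift register, a subtractor, an adder and a one-sample delay.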

The output signal, E(n), of adder 24 is applied to digital comparator 26 of FIG. 5. Comparator 26 compares the energy of each code word E(n) with a signal stored in register 27 to determine whether the energy of the code word is above or below a predetermined threshold. The threshold is generally empirically determined and may be approximately equal to a point midway between the measured energy of background silence and the average measured energy of the speech signal, which is readily obtained by averaging the output of the apparatus of FIG. 4. As discussed above, when the code word energy exceeds this threshold for approximately 50 milliseconds or 300 consecutive samples, the point at which the energy function first exceeded the threshold is defined as the beginning of an utterance. The apparatus of FIG. 5 is utilized to determine when this has occurred. Also, when an utterance has been determined to have begun, the apparatus of FIG. 5 continues to make a comparison of the energy of subsequent code words with the threshold signal stored in register 27. When the code word energy falls below this threshold for approximately 160 milliseconds or 1,000 consecutive samples, the point at which the energy function first passed below the threshold is recorded as the end of the utterance.
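One way to arrive at a threshold of the kind described, midway between the silence energy and the average speech energy, is sketched below in Python. The function name `midpoint_threshold` and the idea of passing two separately measured lists of energy values are assumptions made for illustration; the text says only that the threshold is determined empirically.

```python
def midpoint_threshold(silence_energy, speech_energy):
    """Return a threshold midway between average silence and speech energy.

    Both arguments are non-empty sequences of measured energy values.
    """
    silence_level = sum(silence_energy) / len(silence_energy)
    speech_level = sum(speech_energy) / len(speech_energy)
    return (silence_level + speech_level) / 2
```

For instance, silence measurements of 100 and 102 together with speech measurements of 5000 and 7000 give a threshold of 3050.5.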

To understand the operation of the circuit of FIG. 5, it is convenient to assume that speech is not present at the input to the ADPCM coder and, in fact, has not been present long enough so that the last indication encountered was an end of a speech utterance. This is indicated by certain states or levels for particular circuit components. Thus, it may be assumed that output lead 39 of flip-flop 34 is at a logical 0 state and that output lead 41 of flip-flop 34 is at a logical 1 state. It may also be assumed that output lead 43 of digital comparator 26 is at a 0 state and that output lead 45 of digital comparator 26 is at a 1 state. Accordingly, input lead 42 to NAND gate 28 is at a logical 1 state and input lead 44 to NAND gate 29 is at a logical 0 state. In accordance with the well-known logical rules for NAND circuits, input lead 46 to NAND gate 31 is at a logical 1 state and input lead 47 to NAND gate 31 is at a logical 1 state. Thus, lead 48, connecting the output of NAND gate 31 and one of the inputs to NAND gate 32, is at a 0 state and lead 51, one of the inputs to NAND gate 38, is also at a 0 state. Clock input 49 to NAND gate 32 is presumed to enable NAND gate 32 upon the presence of a logical 1 on lead 49. Accordingly, output lead 54 of NAND gate 32 is at a logical 1 state; counter 33 is presumed to be incremented upon the presence of a 0 level input on line 54. Thus, output leads 55, 56 and 57 of counter 33, which correspond to the 10th, 8th and 6th powers, respectively, of the binary base two, are at a logical 0 state. Output lead 58 of NAND gate 35 is thus at a logical 1 state as is output lead 59 of NAND gate 36. Input leads 53 and 52 to NAND gate 38 are also at a logical 1 state, thus establishing output lead 61 of NAND gate 38 at a logical 1 state and output lead 62 of inverter circuit 37 at a logical 0 state. Since this is the clear input to counter 33, a logical 0 state is presumed to clear the counter.

If it is now presumed that the energy signal applied to digital comparator 26 exceeds the output of digital threshold register 27, output lead 43 of comparator 26 assumes a logical 1 state and output lead 45 of comparator 26 assumes a 0 state. Output lead 46 of NAND gate 28 is then at a logical 0 state and output lead 47 of NAND gate 29 is at a logical 1 state. Output lead 48 of NAND gate 31 assumes a logical 1 state as does lead 51, which is one of the inputs to NAND gate 38. Since input leads 52 and 53 are already at a logical 1 state, the output lead 61 of NAND gate 38 assumes a logical 0 state and therefore output lead 62 of inverter 37 assumes a logical 1 state, thereby allowing counter 33 to be incremented. Upon the presence of a logical 1 signal at clock input 49 to NAND gate 32, output lead 54 of NAND gate 32 assumes a logical 0 state and counter 33 is incremented. Assuming that the input energy signal to comparator 26 remains above the predetermined threshold, then with each energy word, counter 33 will be incremented. When counter 33 reaches a level of 320, which corresponds to a 1 output on leads 56 and 57, output lead 59 of NAND gate 36 assumes a logical 0 state indicating the beginning of a speech utterance. The presence of a 0 level signal on output lead 59 resets flip-flop 34 so that a logical 1 signal appears on output lead 39 and a logical 0 signal appears on output lead 41. Output lead 58 of NAND gate 35 remains at a logical 1 state. The resetting of flip-flop 34 causes output lead 59 to return to a logical 1 state and in turn causes input lead 44 to NAND gate 29 to assume a logical 1 state and input lead 42 to NAND gate 28 to assume a logical 0 state. Assuming that the energy signal remains above the threshold, output lead 43 is still at a logical 1 state, but since input lead 42 to NAND gate 28 is now at a logical 0 state, output lead 46 of NAND gate 28 assumes a logical 1 state. Output lead 45 of comparator 26 is still at a 0 state, but input lead 44 to NAND gate 29 is now at a logical 1 state. Thus, output lead 47 of NAND gate 29 is at a logical 1 state. Accordingly, output lead 48 of NAND gate 31 assumes a logical 0 state as does input lead 51 to NAND gate 38. Input lead 54 to counter 33 assumes a logical 1 state and counter 33 is not incremented. Since input lead 51 is at a 0 state and input leads 52 and 53 of NAND gate 38 are at a logical 1 state, output lead 61 of NAND gate 38 is at a logical 1 state and the clear input to counter 33, lead 62, is at a logical 0 state. Thus, the counter is cleared and output leads 58, 59 remain at a logical 1 state.

When the energy of the applied code words to digital comparator 26 decreases to a level below the threshold level established by register 27, output lead 45 of comparator 26 assumes a logical 1 state and output lead 43 assumes a logical 0 state. Since input lead 42 to NAND gate 28 is at a 0 level, output lead 46 of NAND gate 28 assumes a logical 1 state. Similarly, since input lead 44 to NAND gate 29 is at a logical 1 state, output lead 47 of NAND gate 29 assumes a logical 0 state. Thus, output lead 48 of NAND gate 31 is at a logical 1 state as is input lead 51 to NAND gate 38. Upon the occurrence of a 1 level on clock input 49 to NAND gate 32, output lead 54 assumes a logical 0 state and increments counter 33. Assuming the input energy level of the code words remains below the predetermined threshold, counter 33 will be successively incremented but no change in the logic states of the circuit will occur until leads 55, 56, and 57 of counter 33 all assume a logical 1 state. This state corresponds to a count of 1024. Upon the occurrence of this condition, output lead 58 assumes a logical 0 state indicating the end of the speech utterance while output lead 59 remains at a logical 1 state. The occurrence of a 0 logic state on output lead 58 sets flip-flop 34 back to its original state, i.e., output lead 39 assumes a 0 state and output lead 41 assumes a 1 state. Output lead 58 accordingly returns to a logical 1 state and the apparatus of FIG. 5 has returned to the conditions initially assumed prior to the beginning of the speech utterance. The waveforms appearing at output leads 59 and 58 of the apparatus of FIG. 5 indicate the logic state transition, respectively, at the beginning and end of a speech utterance.

The output signals of the apparatus of FIG. 5 may be used in a variety of ways. For example, they may be used to gate a register which temporarily stores the code words of the apparatus of FIG. 1 so that the code words of the speech utterance, determined by the apparatus of FIG. 5, may be conveyed to a permanent store. Or, if so desired, the signals appearing on leads 58 and 59 may be utilized to activate an alarm circuit to indicate to an operator that the beginning and end of a speech utterance has occurred. Many other applications, of course, will be apparent to those skilled in the art.
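The gating performed by NAND gates 28, 29 and 31 in the walkthrough above can be checked with a small truth-table sketch in Python. The wiring is inferred from the logic states given in the text, since FIG. 5 is not reproduced here: input lead 42 is assumed to follow flip-flop output lead 41 and input lead 44 is assumed to follow flip-flop output lead 39. The function names are hypothetical.

```python
def nand(*inputs):
    """Logical NAND of any number of 0/1 inputs."""
    return 0 if all(inputs) else 1


def count_enable(above, in_utterance):
    """Return the level on lead 48 (1 means counter 33 is allowed to count).

    above:        True when the energy exceeds the threshold (comparator 26).
    in_utterance: True after a beginning has been detected (flip-flop 34).
    """
    lead43, lead45 = (1, 0) if above else (0, 1)          # comparator outputs
    lead39, lead41 = (1, 0) if in_utterance else (0, 1)   # flip-flop outputs
    lead46 = nand(lead41, lead43)   # NAND gate 28, lead 42 assumed tied to 41
    lead47 = nand(lead39, lead45)   # NAND gate 29, lead 44 assumed tied to 39
    return nand(lead46, lead47)     # NAND gate 31
```

Under these assumptions the three gates behave as an exclusive OR: the counter runs only while the comparator output disagrees with the stored state, that is, while the energy is above the threshold before a beginning has been declared, or below it after one has.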

FIG. 5A is a block diagram depicting the overall operation of this invention, as discussed above. Adaptive encoder 501 corresponds to the encoder shown in FIG. 1, code word energy detector 502 corresponds to the apparatus depicted in FIG. 4, and threshold detector 503 corresponds to the apparatus shown in FIG. 5.

The significant advantages of the instant invention, in determining the beginning and end of a speech utterance, are illustrated by FIGS. 6 through 11. FIG. 6 displays the sequence of code words corresponding to the beginning of the word three. The left-half of line A shows very little code word variation and corresponds to low level noise. The right-half of line A, and the next two lines, B and C, correspond to the initial fricative th of the word three. The code words show markedly greater variation as does the last line, D, which corresponds to the beginning of voicing, i.e., ree. The marker in the middle of line A denotes the beginning point of the speech utterance, as determined by this invention. FIG. 7 displays the energy of the code words of FIG. 6, as determined by this invention. The marker on line A denotes the point at which the energy of the code words exceeded the threshold and remained above the threshold for approximately 50 milliseconds, as discussed above. It is noted that the code word energy is roughly the same for both the voiced and unvoiced segments of the utterance while the energy is significantly lower when no speech is present. FIG. 8 displays the actual speech waveform represented by the code word sequence of FIG. 6. The beginning of the word three is not nearly as evident as in the code word sequence; indeed, it is hardly discernible. FIG. 9, which displays the energy of the speech waveform of FIG. 8, emphasizes the fact that the beginning of a speech utterance is not readily discernible from an examination of the energy of the speech waveform itself. FIG. 10 displays the code word sequence at the end of the word three. The marker on line B indicates the end point of the utterance as determined by the instant invention. FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10. The end point of the utterance is clearly not apparent from an examination of the speech waveform itself.

The instant invention has been tested extensively in determining the beginning and end of speech entries for a voice response system vocabulary, and has proved to be very reliable. Two other aspects of the coded speech signal, i.e., the energy of the difference signal of the coder of FIG. 1, and the energy of the quantizer output, were also studied as possible considerations for use in the instant invention. However, the results based on the coded word samples themselves were found to be far more accurate.

What is claimed is:

1. Apparatus for determining a boundary of an applied speech utterance comprising:

means for adaptive encoding said applied speech utterance to develop coded output signals;

means for developing a signal representative of the energy of said coded output signals; and

means for comparing said representative signal with a predetermined threshold signal.

2. The apparatus defined in claim 1 wherein said signal representative of the energy of said coded output signals is representative of the adaptation activity of said means for adaptive encoding.

3. The apparatus defined in claim 1 wherein said threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

4. Apparatus for determining a boundary of an applied speech utterance comprising:

means for adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;

means for developing a signal representative of the energy of said coded output signals; and

means for comparing said representative signal with a predetermined digital threshold signal.

5. The apparatus defined in claim 4 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

6. The apparatus defined in claim 4 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

7. Apparatus for detecting the beginning of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means responsive to said representative signal for developing an output signal when said representative signal is greater than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the beginning of said speech utterance.

8. The apparatus defined in claim 7 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

9. The apparatus defined in claim 7 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

10. The apparatus defined in claim 7 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

11. Apparatus for determining the beginning of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

' means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and means responsive to said representative signal for developing an indicator signal when said representative signal is greater than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the beginning of said speech signal.

12. Apparatus for detecting the end ofa speech utterance. including an adaptive differential pulse code modulation circuit responsive to said speech utterance,

comprising:

means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means responsive to said representative signal for developing an output signal when said representative signal is less than. for a predeterminedinterval of time. an applied digital threshold signal said output signal indicative of the end of said speech utterance.

13. The apparatus defined in claim 12 wherein said signal representative ofthe energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

14. The apparatus defined in claim 12 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
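
The claim only requires that the threshold lie between the silence energy and the average speech energy; it does not say where. One simple way to pick such a value, shown purely as an illustration, is linear interpolation. The function `intermediate_threshold` and the `weight` parameter are hypothetical and do not come from the patent.

```python
def intermediate_threshold(silence_energy, speech_energy, weight=0.5):
    """Pick a threshold strictly between the two energy levels.

    `weight` near 0 places the threshold close to the silence level,
    near 1 close to the average speech level (illustrative choice).
    """
    if not silence_energy < speech_energy:
        raise ValueError("speech energy must exceed silence energy")
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie strictly between 0 and 1")
    return silence_energy + weight * (speech_energy - silence_energy)


print(intermediate_threshold(10, 110, weight=0.25))
```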

15. The apparatus defined in claim 12 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

16. Apparatus for determining the end of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and

means responsive to said representative signal for developing an indicator signal when said representative signal is less than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the end of said speech signal.

17. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals;

comparator means for comparing said digital representative signal with an applied digital threshold signal; and

means responsive to said comparator means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than, for a second predetermined interval of time, said threshold signal.

18. Apparatus for determining the boundaries of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:

code word energy means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals;

comparator means for comparing said representative signal with an applied threshold signal; and

means responsive to said comparator means for developing a signal indicative of the beginning of said speech signal when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech signal when said representative signal is less than, for a second predetermined interval of time, said threshold signal.

19. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:

code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and

means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than an applied digital threshold signal for a first predetermined interval of time, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than said applied threshold signal for a second predetermined interval of time.

20. The apparatus defined in claim 19 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

21. The apparatus as defined in claim 19 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

22. The apparatus defined in claim 19 wherein said means for developing a signal representative of the energy of said coded output signals comprises:

first means for doubling each digitally coded output signal of said modulation circuit;

second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;

third means for squaring each output signal of said second means;

fourth means for sequentially storing a predetermined number of said squared signals;

fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;

sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and

seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.

23. The apparatus defined in claim 19 wherein said means for developing said indicative signals comprises:

digital comparator means responsive to said signal representative of the energy of said coded output signals and to said applied digital threshold signal for developing a signal at a first output terminal when said representative energy signal is greater than said threshold signal and for developing a signal at a second output terminal when said representative energy signal is less than said threshold signal;

a bistable circuit having first and second output terminals, and set and reset terminals;

a first logic circuit responsive to said comparator first output terminal signal and to the signal at the first output terminal of said bistable circuit;

a second logic circuit responsive to said comparator second output terminal signal and to the signal at the second output terminal of said bistable circuit;

a third logic circuit responsive to the output signals of said first and second logic circuits;

a fourth logic circuit responsive to the output signal of said third logic circuit and to an applied clock signal;

a counter circuit, having a plurality of output terminals, responsive to the output signal of said fourth logic circuit;

a fifth logic circuit, for developing said signal indicative of the end of said speech utterance, responsive to the signal at the second output terminal of said bistable circuit and to the signal at a preselected one of said plurality of counter circuit output terminals;

a sixth logic circuit, for developing said signal indicative of the beginning of said speech utterance, responsive to the signal at the first output terminal of said bistable circuit and to the signals at the other of said plurality of counter circuit output terminals;

a seventh logic circuit responsive to the output signals of said third, fifth, and sixth logic circuits for developing a control signal for said counter circuit, said control signal returning said counter to a predetermined initial state; and

means for connecting the output terminals of said fifth and sixth logic circuits, respectively, to said set and reset terminals of said bistable circuit.
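
The comparator, bistable circuit, and counter recited in claim 23 behave together like a small two-state machine. The sketch below is a software analogue of that behavior and is not a gate-level model of the seven logic circuits. The class name `BoundaryDetector`, the event labels, and the treatment of an energy exactly equal to the threshold as "not above" are assumptions.

```python
class BoundaryDetector:
    """Clocked begin/end detector: one call to step() per clock tick."""

    def __init__(self, threshold, begin_count, end_count):
        self.threshold = threshold
        self.begin_count = begin_count  # ticks above threshold needed for "begin"
        self.end_count = end_count      # ticks below threshold needed for "end"
        self.in_speech = False          # plays the role of the bistable circuit
        self.count = 0                  # plays the role of the counter circuit

    def step(self, energy):
        """Return 'begin', 'end', or None for this tick."""
        above = energy > self.threshold  # comparator output
        if above == self.in_speech:
            self.count = 0               # comparator agrees with state: reset counter
            return None
        self.count += 1                  # count ticks that contradict the state
        limit = self.end_count if self.in_speech else self.begin_count
        if self.count < limit:
            return None
        self.in_speech = not self.in_speech  # toggle the bistable
        self.count = 0                       # return counter to its initial state
        return "begin" if self.in_speech else "end"


detector = BoundaryDetector(threshold=10, begin_count=2, end_count=3)
events = [detector.step(e) for e in [0, 20, 0, 20, 20, 20, 0, 0, 20, 0, 0, 0]]
print(events)
```

Using separate counts for the two transitions mirrors the "first" and "second" predetermined intervals of claims 17 through 19, so a short pause inside an utterance need not be reported as its end.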

24. The method of determining a boundary of an applied speech utterance comprising the steps of:

adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;

developing a signal representative of the energy of said coded output signals; and

comparing said representative signal with a predetermined digital threshold signal.

25. The method defined in claim 24 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.

26. The method defined in claim 24 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.

27. The method of determining a boundary of an applied speech utterance comprising the steps of:

adaptive encoding said applied speech utterance to develop coded output signals;

developing a signal representative of the energy of said coded output signals; and

comparing said representative signal with a predetermined threshold signal.
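
The three method steps (adaptive encoding, developing an energy signal, comparing with a threshold) can be chained as in the sketch below. The encoder here is a toy adaptive differential quantizer written for illustration only; it is not the encoder described in the specification. The names `toy_adpcm_encode`, `code_word_energy`, and `MULTIPLIERS`, the 4-bit code words, the reference value 15, and all numeric settings are assumptions.

```python
import math

# Illustrative step-size multipliers, indexed by code word magnitude 0..7.
MULTIPLIERS = [0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4]


def toy_adpcm_encode(samples, min_step=1.0, max_step=512.0):
    """Encode samples into 4-bit code words (0..15) with an adaptive step."""
    codes, predicted, step = [], 0.0, min_step
    for x in samples:
        diff = x - predicted
        magnitude = min(int(abs(diff) / step), 7)
        code = 8 + magnitude if diff >= 0 else 7 - magnitude
        codes.append(code)
        predicted += (2 * code - 15) * step / 2  # same update a decoder would make
        step = min(max(step * MULTIPLIERS[magnitude], min_step), max_step)
    return codes


def code_word_energy(codes, window, reference=15):
    """Sum of squares of the last `window` re-centered code words."""
    squares = [(2 * c - reference) ** 2 for c in codes]
    return [sum(squares[max(0, i - window + 1): i + 1]) for i in range(len(squares))]


# Silence, a burst of a sinusoid standing in for speech, then silence.
burst = [1000.0 * math.sin(2 * math.pi * i / 20) for i in range(200)]
signal = [0.0] * 200 + burst + [0.0] * 200

energies = code_word_energy(toy_adpcm_encode(signal), window=32)
above = [e > 200 for e in energies]  # the comparison step
print(above[100], above[201], above[-1])
```

Because the step size adapts to the signal level, the code words stay small during silence and large during the burst, so their energy separates the two conditions without decoding the signal.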

Classifications
U.S. Classification: 704/215, 52/DIG.130, 704/E11.5, 375/247
International Classification: G10L11/02
Cooperative Classification: Y10S52/13, G10L25/87
European Classification: G10L25/87