US 3553372 A
Description (OCR text may contain errors)
United States Patent
Inventors: Esmond Philip Goodwin Wright, Bishop's Stortford; Wincenty Bezdel, Harlow, Essex, England
Appl. No.: 587,539
Filed: Oct. 18, 1966
Patented: Jan. 5, 1971
Assignee: International Standard Electric Corporation, New York, N.Y., a corporation of Delaware
Priority: Nov. 5, 1965, Great Britain, 46,984/65
SPEECH RECOGNITION APPARATUS
2 Claims, 12 Drawing Figs.
U.S. Cl.: 179/1, 324/77
Int. Cl.: G10l 1/00
Field of Search: 179/1(AS); 340/146.3 (Inquired); 324/77; 328/151
References Cited
UNITED STATES PATENTS
3,102,928  9/1963  Schroeder  179/1(AS)
3,278,685  10/1966  Harper  179/1(AS)
3,335,225  8/1967  Campanella  179/1(AS)
3,416,080  12/1968  Wright et al.  179/1(AS)X
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Charles W. Jirauch
Attorneys: Percy P. Lantzy, C. Cornell Remsen, Jr., Rayson P. Morris, Philip M. Bolton and Isidore Togut
SPEECH RECOGNITION APPARATUS
This invention relates to speech recognition equipment in which automatic adjustment takes place to enable the equipment to suit itself to the speech characteristics of different talkers.
In our copending application Ser. No. 437,349 filed Mar. 2, 1965 for Apparatus for the Analysis of Waveforms, now issued as U.S. Pat. No. 3,416,080, there is described apparatus for speech recognition in which speech recognition is accomplished by analysis of the zero crossing intervals in the speech wave. Every word has, within fairly wide limits, a recognizable pattern of zero crossings which can be divided into groups representing different sounds, the crossings making up a group being in turn identified by their number and timing relative to each other. Such a method of speech recognition can be distinguished from frequency spectrum analysis inasmuch as the information bearing parameters can be converted into a time or digital domain in the case of zero crossing analysis. The zero crossing intervals making up each group are counted under the control of a suitable nonlinear time scale.
According to the present invention there is provided speech recognition apparatus including means for detecting reversals of polarity in the speech waveform, means for generating a measuring time scale waveform when a reversal is detected, means for counting the number of time scale units generated between the detected reversal and the next detected reversal and means for altering the scale of the time scale waveform according to a characteristic of the speech waveform.
In a preferred embodiment of the present invention there is provided means for producing a voltage proportional to the fundamental frequency of the speech waveform and means for generating a nonlinear pulse train time scale, the initial time constant of the pulse generator being controlled by and proportional to the voltage derived from the fundamental frequency.
The above and other features of the invention will become more readily apparent and be better understood from the following description of an embodiment thereof, taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a typical speech waveform and the timing of the zero crossings contained therein,
FIG. 2 illustrates an alternative method of locating the zero crossings in the waveform,
FIG. 3 is a nonlinear timescale,
FIG. 4 is a block diagram of a circuit arranged to time the intervals between successive zero crossings in a waveform,
FIG. 5 illustrates a method of extracting zero crossings from the waveform,
FIG. 6 is a circuit by which the square wave shown in FIG. 5 may be obtained,
FIG. 7 is a block diagram of a circuit by which a limited number of parts of speech may be recognized,
FIG. 8 is a block diagram of an arrangement by which a larger vocabulary may be recognized, and
FIGS. 9 and 10 illustrate sections of FIG. 8,
FIG. 11 illustrates the nonlinear pulse train timescale generating circuit, and
FIG. 12 illustrates diagrammatically two nonlinear pulse time scales derived for different fundamental frequencies.
A fundamental aspect of speech recognition is the ability to extract from a speech waveform features such as frequencies, amplitudes, phase relationships etc., which can be recognized as conforming to certain known patterns for each type of speech sound. These features can be extracted and, with the aid of modern computers, measured, classified, stored and compared with various standards or reference patterns.
One method of analyzing speech waveforms for the purpose of extracting recognizable features therefrom is to count and measure the intervals between zero crossings of the waveform. A refinement of this technique is to count the number of combinations of zero crossing intervals that conform to a particular pattern. For example the speech waveform may be analyzed to ascertain the number of adjacent pairs of zero crossing intervals where the first interval falls within the range between 1 and 1.5 msec and is followed by an interval that falls within the range between 0.5 and 0.7 msec.
FIG. 1 illustrates a speech waveform 11 having zero crossings 12 to 20. The intervals between these zero crossings are represented as periods of time 21 to 28. The timing of these intervals is achieved by counting the number of timescale units generated by a timescale which is started when a zero crossing is detected. Thus interval 21 is timed as being 1 timescale unit in duration, while interval 24 is 3 timescale units in duration.
Whilst it has been assumed that the intervals between the actual zero crossings can be timed and counted, in practice it may be found that unwanted noise in the waveform will produce spurious zero crossings. To overcome this it can be arranged that instead of detecting the actual zero crossings, the analysis is based on the detection of those points where the waveform alternately exceeds positive and negative threshold amplitudes. This is illustrated in FIG. 2, in which the waveform 31 is depicted as crossing the positive threshold at points 32, 34, 36, 38 and 40, and crossing the negative threshold at points 33, 35, 37 and 39. This arrangement can be adopted because most of the noise in the waveform is of small amplitude compared with the speech waveform. Therefore the threshold values can be chosen so that the noise content of the waveform lies between them; and detection of the points 32 to 40 will not include spurious zero crossings. It will be noted that the threshold crossings do not depart significantly from the zero crossings, and in practice the intervals between the threshold crossings will be substantially the same as the intervals between the zero crossings.
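The threshold-crossing idea described above can be sketched in software. The following is a minimal illustration only, assuming a sampled waveform held in a list; the function name `threshold_crossings` and the sample values are hypothetical and do not come from the specification.

```python
def threshold_crossings(samples, pos_threshold, neg_threshold):
    """Return sample indices where the waveform alternately exceeds the
    positive threshold and falls below the negative threshold."""
    crossings = []
    state = None  # polarity of the last accepted crossing: +1, -1 or None
    for i, x in enumerate(samples):
        if x > pos_threshold and state != 1:
            crossings.append(i)
            state = 1
        elif x < neg_threshold and state != -1:
            crossings.append(i)
            state = -1
    return crossings


# Hypothetical waveform: small noise near zero plus larger speech swings.
wave = [0.0, 0.05, -0.04, 0.6, 0.9, 0.3, -0.02, 0.03, -0.7, -0.4, 0.02, 0.8]
print(threshold_crossings(wave, 0.2, -0.2))  # [3, 8, 11]
```

Because a crossing is accepted only when the polarity alternates, the small excursions around zero (indices 1, 2, 6, 7 and 10) produce no spurious crossings.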
Therefore, for the remainder of this specification the term zero crossings will be used to denote both actual zero crossings and threshold crossings.
It has been stated above that the intervals between zero crossings are timed by counting timescale units, the timescale being started afresh in each case when a zero crossing is detected.
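The relation between the measured interval $Z_i$, the counting period $t_c$, and the count number $n$ is:
$$n\,t_c \le Z_i < (n+1)\,t_c$$
It should be noted that $Z_i = 1/(2f)$, where $f$ is the frequency of the zero crossing wave.
Considering the lower and upper end frequencies of this wave, namely, $f_l$ and $f_u$, then
$$f_l = \frac{f_c}{2(n+1)}, \qquad f_u = \frac{f_c}{2n}$$
where $f_c = 1/t_c$ is the counting rate, or pulse repetition frequency in the case of a pulse timescale.
Thus
$$f_0 = \frac{f_l + f_u}{2} = \frac{f_c\,(2n+1)}{4n(n+1)}$$
where $f_0$ is the center frequency, and
$$B = f_u - f_l = \frac{f_c}{2n(n+1)} \quad \text{(Bandwidth).}$$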
In the previous discussion, it was assumed that the counting rate was constant during the measured interval or channel. The principal disadvantage of this technique is that the accuracy of measurement depends directly upon the frequency of the signal to be measured. It can be seen that a low frequency or long interval will be measured very accurately compared with the measurement of a high frequency or short interval.
In terms of frequency bands, each count number at the lower end of the measured spectrum will produce a bandwidth which is too narrow, and each count number at the higher end will produce a bandwidth which is too wide. For example, consider that the counting rate is 10 kc./s. The interval between two successive counts is equivalent to 5 kc./s. However, substitution of n in the preceding formulas shows that where n is equal to 1, the band is equivalent to 2,500 to 5,000 c./s. Similarly it is possible to show that for n = 15 the frequency band is approximately 300 to 330 c./s.
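The band figures quoted above can be checked numerically. The sketch below assumes that an interval Z receives count n when n·t_c ≤ Z < (n+1)·t_c, with Z = 1/(2f) and counting rate f_c = 1/t_c; this assumption reproduces the 2,500 to 5,000 c./s. band stated for n = 1. The function name `band_for_count` is hypothetical.

```python
def band_for_count(counting_rate_hz, n):
    """Frequency band (Hz) mapped to count n by a linear timescale.

    Assumes count n covers intervals n*t_c <= Z < (n+1)*t_c, where
    Z = 1/(2f) is a half period and t_c = 1/counting_rate_hz.
    """
    lower = counting_rate_hz / (2 * (n + 1))
    upper = counting_rate_hz / (2 * n)
    centre = (lower + upper) / 2
    bandwidth = upper - lower
    return lower, upper, centre, bandwidth


for n in (1, 15):
    lower, upper, centre, bandwidth = band_for_count(10_000, n)
    print(n, round(lower, 1), round(upper, 1), round(bandwidth, 1))
```

Under this assumption n = 1 spans a 2,500 c./s. band while n = 15 spans only about 21 c./s. (roughly 312 to 333 c./s., close to the rounded figures in the text), which shows how unevenly a constant counting rate divides the spectrum.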
In any practical application of this counting technique, it is most desirable to increase the number of counts for a high frequency, i.e. reduce the width of the band, and to decrease the number of counts for a lower frequency, i.e. increase the width of the band. A possible method of achieving this object is to use a nonlinear measuring scale so that the counting rate is effectively different in adjacent channels.
The formulas which were derived previously for counting frequency, count number, etc., still apply. However, instead of using $f_c$ one has to substitute a function relating $f_c$ to either time, or to count number.
This function has the form $f_c(n) = f_1\,(1 + \log f(n))$, where $f_1$ is the frequency of the first pulse.
FIG. 3 depicts a nonlinear timescale such as is used in FIGS. 1 and 2.
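Counting against a nonlinear timescale can be illustrated as follows. This is a sketch under an assumed law: each timescale unit is taken to be a fixed multiple (`growth`) of the previous one, which is chosen only for illustration and is not the function given in the specification. The names `nonlinear_ticks` and `count_units` are hypothetical.

```python
from bisect import bisect_right


def nonlinear_ticks(first_interval_ms, growth, count):
    """Tick times in msec, measured from the zero crossing that starts the
    timescale. Each gap is `growth` times the previous gap (assumed law)."""
    ticks, t, gap = [], 0.0, first_interval_ms
    for _ in range(count):
        t += gap
        ticks.append(t)
        gap *= growth
    return ticks


def count_units(interval_ms, ticks):
    """Number of timescale units generated before the next zero crossing."""
    return bisect_right(ticks, interval_ms)


ticks = nonlinear_ticks(0.1, 1.5, 6)
for interval in (0.05, 0.3, 1.0):
    print(interval, count_units(interval, ticks))
```

With these assumed values the ticks fall at about 0.1, 0.25, 0.475, 0.81, 1.32 and 2.08 msec, so short intervals are resolved finely while long intervals share wider channels.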
FIG. 4 illustrates by block diagrams a circuit for timing the intervals between successive zero crossings in a waveform such as that shown in either FIG. 1 or FIG. 2.
The equipments denoted by the various blocks in the drawings are known electronic circuits and do not in themselves constitute novel features of the invention.
The incoming speech waveform 50 is fed to a wave-shaping circuit 51 used to identify the zero crossings. The identification may be performed according to the procedures outlined with reference to FIG. 2. The output from the wave-shaping circuit may take the form of a square wave, as shown in FIG. 5. It will be seen that the waveform 61 in FIG. 5 can be used to produce a square wave 62 having the same zero crossing characteristics as the waveform 61. Since zero crossing analysis is independent of amplitude or other factors, a square wave of fixed amplitude having the necessary zero crossing intervals makes a suitable trigger waveform for operating counters and other circuits.
One method of producing the desired square wave is by utilizing the circuit shown in FIG. 6. In this FIG., transistor 70 operates as an amplifier for the speech input, which is limited by amplitude limiter diodes 68 and 69 so as to avoid overloading of the amplifier. Transistor 71 operates as a phase-splitter and converts the amplified and limited signal from transistor 70 into two outputs in opposite phase. These outputs are passed to two transistors 72 and 73 operating as emitter followers and arranged to reproduce negative going signals only. The waveform 63 of FIG. 5 represents the outputs of transistors 72 and 73 added together. These two outputs are taken to the inputs of a pair of trigger transistors 74 and 75. The trigger can be set to a threshold value which is adjustable by means of a potentiometer 76 in the common emitter connection of the two transistors. The outputs from the circuit are derived from two inverter transistors 77 and 78, and are represented by the square wave 62 in FIG. 5.
The circuit of FIG. 6 is biased where shown by voltages V+ or V-, all of equal amplitude with respect to ground.
Returning to FIG. 4, the output of the wave-shaping circuit is applied to a measuring circuit 55 which includes separate timescale counting circuits 52 and 53, and a timescale generating circuit 54.
As has been previously stated the timescale generated is nonlinear, and recommences when each zero crossing is detected. The counter 52 is arranged to count the timescale units following all zero crossings going positive, and the counter 53 is arranged to count the timescale units following all negative going zero crossings.
Switches 56 and 57 can be set to select the counts of either counter 52 or 53, and the selected count is passed through a gate 58 which is under the control of a threshold and control circuit 59. This threshold and control circuit is used to control the time during which an examination of zero crossings is made. The results of each examination are displayed in a display counter 60, which registers the total number of zero crossings which occur during examination time.
The equipment depicted in FIG. 4 can be arranged to make various types of examination of the speech waveform 50, for example:
I. It can count the number of zero crossing intervals that fall into the time range between 1 msec and 1.5 msec.
II. It can count the number of combinations of intervals, such as those combinations where an interval of between 1 msec and 1.5 msec is followed by an interval of between 0.5 msec and 0.7 msec.
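The two types of examination listed above can be sketched in software. The example assumes the zero crossing intervals have already been measured in msec and that range limits are inclusive; the function names and the interval values are hypothetical.

```python
def count_in_range(intervals_ms, low, high):
    """Examination I: count intervals falling within [low, high] msec."""
    return sum(1 for z in intervals_ms if low <= z <= high)


def count_pairs(intervals_ms, first_range, second_range):
    """Examination II: count adjacent pairs where the first interval lies in
    first_range and the following interval lies in second_range."""
    (lo1, hi1), (lo2, hi2) = first_range, second_range
    return sum(
        1
        for a, b in zip(intervals_ms, intervals_ms[1:])
        if lo1 <= a <= hi1 and lo2 <= b <= hi2
    )


# Hypothetical measured intervals in msec.
intervals = [1.2, 0.6, 0.9, 1.4, 0.55, 1.1, 0.3]
print(count_in_range(intervals, 1.0, 1.5))               # 3
print(count_pairs(intervals, (1.0, 1.5), (0.5, 0.7)))    # 2
```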
The recognition of simple parts of speech (not in the grammatical sense), such as digits zero to nine, as opposed to simple waveform analysis, can be achieved by an arrangement such as that shown in FIG. 7. It consists of a squaring circuit 80 which identifies the zero crossing intervals, a measuring circuit 81 which measures the zero crossing intervals, and a gating circuit 82 which sorts the zero crossing intervals into seven interval ranges, referred to as channels CH, as follows:
CH1: ∞ to 1.31 msec
CH2: 1.31 to 0.93 msec
CH3: 0.93 to 0.73 msec
CH4: 0.73 to 0.42 msec
CH5: 0.42 to 0.31 msec
CH6: 0.31 to 0.18 msec
CH7: 0.18 to 0 msec
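Sorting intervals into the seven channels can be sketched as below, using the boundary values 1.31, 0.93, 0.73, 0.42, 0.31 and 0.18 msec. Which channel receives an interval lying exactly on a boundary is not stated in the text; the sketch assumes it goes to the shorter-interval channel. The names `CHANNEL_LOWER_BOUNDS_MS` and `channel_for` are hypothetical.

```python
from collections import Counter

# Lower bound (msec) of channels CH1..CH7; CH1 has no upper bound.
CHANNEL_LOWER_BOUNDS_MS = [1.31, 0.93, 0.73, 0.42, 0.31, 0.18, 0.0]


def channel_for(interval_ms):
    """Return the channel number (1 to 7) for a zero crossing interval."""
    for channel, lower in enumerate(CHANNEL_LOWER_BOUNDS_MS, start=1):
        if interval_ms > lower:
            return channel
    raise ValueError("interval must be positive")


# Hypothetical measured intervals in msec.
intervals = [2.0, 1.0, 0.5, 0.1, 0.8, 0.35, 0.25]
histogram = Counter(channel_for(z) for z in intervals)
print(sorted(histogram.items()))
```

A per-channel histogram of this kind is the quantity that a group of threshold counters would compare against preset thresholds.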
A threshold circuit 83 provides on or off signals during the presence or absence of speech signals, and controls a timing circuit 84 which provides the following outputs:
(i) Output when speech signals persist for more than 100 msec. (beginning of the word)
(ii) Output when speech signal is absent for more than 200 msec. (end of word)
(iii) Output (D1) for the first 100 msec. of the word
(iv) Output (D2) for the 350 msec. following the first 100 msec. of speech signal
(v) Output (D3) for the first 100 msec. after a gap shorter than 200 msec.
A group of threshold counters 85 are set to count the number of zero crossing intervals in a given channel. Each threshold counter produces an output when a threshold to which the counter is preset is reached. The following threshold counters (TC) are provided:
TC1 for CH1
TC2 for CH1 + CH2
TC3 for CH3 + CH4
TC4 for CH5
TC5 for CH6 + CH7
Finally a gating circuit 86 is used to identify spoken digits according to the following patterns:
GATE CONDITION [table not recoverable from the source text]
[...] example, the unit marked 88 classifies the voiced or unvoiced characteristics. Units 89 and 90 isolate the first and second frequency ranges corresponding to formants of vowel sounds respectively and pass the vowel information in the form of zero crossings. Unit 91 extracts the fundamental frequency of a talker. Units marked 92 and 93 extract two groups of frequencies with respect to unvoiced sounds, and unit 94 detects consonant groups. The unit 95 is a threshold detector [...].
The complexity of the first stage in the classification of speech characteristics depends mainly on the size of vocabulary and the range of talkers. For example, for the recognition of vowels it may be sufficient to analyze only one frequency range.
In the second stage of the recognition process analysis is performed on the portions of speech which were separated in the first stage. This analysis leads to the recognition of specific voiced and unvoiced sounds by the recognition circuits 97 and 98. The analysis is performed during the time controlled by a sample A which covers a segment of sound. The same analysis is repeated for any subsequent segment of the speech wave. The length of each segment, e.g. sample A, is determined by the fundamental frequency of the talker. This is the function of the measuring and segmentation unit 99.
FIG. 9 shows in more detail a part of a vowel recognition arrangement. Information is derived from the zero crossings of the first formant and the analysis is done by measuring zero crossing distances and extracting only the significant ones. The zero crossing intervals are measured in the unit 102, and the timing control 103, controlled by sample pulse A, selects the period during which the zero crossing distances are measured. The significant zero crossing distances extracted by the unit 102 are stored in the storage units marked D1, D2 ... Dn. As has been stated above, the length of each sample of speech is determined by the fundamental frequency of the talker. The fundamental frequency also controls measurement of zero crossing distances. One sample constitutes the shortest recognizable portion of a sound. In the case of vowels these portions may be referred to as "little vowels." For example, during an utterance of the sound "a", recognition of a segment of the sound can consist of the following series of samples: [...]. This series is stored as three a's and two o's. The recognition of each sample is performed by the recognition circuit 104 under the control of the sample pulse A, and when a sufficient number of samples have been recognized a complete group of samples, i.e. a segment, is recognized by the recognition circuit 105 under the control of a segment pulse B. The recognition of the group of samples given above, under the control of the segment pulse B, indicates that the unknown sound was "a". The segment B covers a number of samples A which is sufficient to make a decision on the unknown sound.
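The step from recognized samples to a recognized segment can be sketched as a simple vote. The text does not state the decision rule used by recognition circuit 105, so the majority rule, the `min_samples` parameter and the ordering of the example series below are all assumptions made for illustration.

```python
from collections import Counter


def recognize_segment(sample_labels, min_samples=5):
    """Return the most frequent sample label once enough samples have been
    recognized, else None. A majority vote is an assumed decision rule."""
    if len(sample_labels) < min_samples:
        return None
    label, _count = Counter(sample_labels).most_common(1)[0]
    return label


# Hypothetical ordering of a series holding three a's and two o's.
series = ["a", "a", "o", "a", "o"]
print(recognize_segment(series))       # a
print(recognize_segment(["a", "o"]))   # None
```

When two labels tie, `Counter.most_common` returns the label encountered first; a practical circuit would need an explicit rule for that case.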
Recognition of a group of parameters, such as zero crossing distances or little vowels, and so on, can be accomplished by a straightforward threshold circuit followed by logical gating or by a statistical decision circuit.
An example of the latter is shown schematically in FIG. 10. The output from each parameter (a parameter can be represented as either 1 or 0 voltage levels, or as an analogue or quantized voltage level) is taken via resistor Ri to a point recognizing, for example, a, o, etc. The value of the resistor Ri represents a weighted contribution of a given parameter to the recognition of a, o, etc., and is such that $R_0/R_i \le 1$, where R0 is a constant of the adding circuit. Contributions of Ri should satisfy the expression [...] for all i's associated with a given point, say, a, o, etc.
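A resistor-weighted adding circuit of this kind can be modelled as a weighted sum. The sketch below assumes each parameter contributes in proportion to R0/Ri and that the sound with the largest sum is selected; the resistor values, the normalization of the weights and the function names are hypothetical and are not taken from FIG. 10.

```python
def weighted_score(parameters, resistors_ohm, r0_ohm):
    """Sum of parameter levels, each weighted by r0/Ri (assumed model)."""
    return sum(p * r0_ohm / r for p, r in zip(parameters, resistors_ohm))


def decide(parameters, networks, r0_ohm):
    """Score every candidate sound and return the best one with all scores."""
    scores = {
        sound: weighted_score(parameters, resistors, r0_ohm)
        for sound, resistors in networks.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical resistor networks; weights R0/Ri are 0.5, 0.25 and 0.25.
R0 = 10_000
networks = {
    "a": [20_000, 40_000, 40_000],
    "o": [40_000, 20_000, 40_000],
}
parameters = [1, 0, 1]  # binary parameter levels
print(decide(parameters, networks, R0))  # ('a', {'a': 0.75, 'o': 0.5})
```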
Similarly the unvoiced sounds are recognized by the recognition circuit 98.
As in the first stage, complexity of the remaining stages in the recognition process is mainly related to the size of vocabulary and the range of talkers. For example, voiced, unvoiced and phoneme recognition can be reduced to one unit. The phoneme recognition circuit 100 and the word recognition circuit 101 are arranged on the same lines as previously described with reference to FIGS. 9 and 10. The main difference is that in each succeeding recognition sequence another set of parameters is brought into use from the preceding stage. The number of stages in the recognition process is also related to the size of vocabulary and the range of talkers. In the recognition of a short selected vocabulary it may be quite feasible to recognize words directly, without dividing them into phonemes, voiced sounds, etc.
In the arrangement shown in FIG. 11 two complementary transistors 201 and 202 have their emitters connected together. The base of transistor 202 is connected to the collector of transistor 201 by a positive feedback connection 203.
The base of transistor 201 is connected to a bias voltage source at b via two resistors 210, 211 and is also connected to two grounded capacitors 212 and 213. Transistors 201 and [...]. Positive and negative DC bias supplies are connected as indicated to the collector and base of transistor 202 and the collector of transistor 201.
When the base of transistor 201 is driven negative sufficiently for it to begin to conduct then the action of the feedback circuit 203 will start to drive the base of transistor 202 positive. Transistor 202 then begins to conduct and its emitter-collector current reinforces the emitter-collector current of transistor 201 and the rise in emitter voltage of transistor 201 makes it conduct even more. This process continues until saturation is reached and the feedback voltage applied to the base of transistor 202 cannot rise any further.
The capacitors 212, 213 and resistors 210 and 211 control the voltage applied to the base of transistor 201 in response to a pulse at the input 204.
Initially a bias voltage b at point 208 is arranged to be at least equal to or more positive than the voltage a at point 209. The timing scale is initiated at time t by a negative going pulse at the input 204, applied to capacitor 212 by transistor 206. The amplitude of this pulse determines the duration T (see FIG. 12) of a succession of pulses in a timescale. This negative going pulse at 204 negatively charges capacitor 212 according to its amplitude. Capacitor 212 immediately starts to discharge according to the time constant of resistor 210 and capacitor 212. At the same time capacitor 213, via resistor 211, is charged negatively at a rate determined by the time constant of capacitor 213 and resistor 211. When the voltage on capacitor 213 drops to a point where it is equal to the voltage a at point 209 the base voltage of transistor 201 is sufficiently negative to cause the transistor to conduct. The positive feedback circuit 203 ensures that the rise in conduction of transistors 201 and 202 is very rapid and causes the first timing pulse to be delivered to the output 205. When transistor 201 is saturated the drain on capacitor 213 via the base of transistor [...]. Meanwhile capacitor 212 has lost some of its negative charge due to the potential b at point 208 and therefore the rate of negative charge of capacitor 213 is reduced. Thus the second pulse interval is longer than the first, and each succeeding interval is longer than the last. FIG. 12 illustrates a timescale P generated by the circuit of FIG. 11.
The negative-going pulses at point 204 are derived from the trigger output of the circuit of FIG. 6. This circuit will produce two square wave output waveforms which have positive-going trigger pulses, each trigger pulse in the one square wave output being representative of a positive-going zero crossing contained in the input speech wave and each trigger pulse in the other square wave output being representative of a negative-going zero crossing contained in the input speech wave. Each trigger output is conventionally inverted to give a negative-going pulse, the leading edge of which coincides with the positive-going edge of the relevant trigger output. These two sets of negative-going pulses have a constant width and amplitude to define the period T referred to above.
If the circuit is left untouched after the initial pulse at point 204 there will come a time when the output pulse interval becomes infinite. However, in practice the period T over which the timescale is required to function covers only a small number of pulses, and at the end of this period the timescale will be restarted by receipt of a new negative going pulse at point 204. To ensure that the timescale starts from zero, so to speak, at the start time t, capacitor 213 is fully discharged positively by a positive going pulse applied via the diode 207.
The value of the potential b at point 208 in relation to the potential at point 209 controls the number and distribution of output pulses during a given period T. To alter the scale, i.e. to increase or reduce T for the same number of pulses with the same pulse interval ratios, it is only necessary to alter the initial negative charge on the capacitor 212. The timescale q in FIG. 12 illustrates the effect of reducing the amplitude of the input pulse at point 204.
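The effect of altering the scale while keeping the pulse interval ratios can be illustrated numerically. The sketch below is not a model of the circuit of FIG. 11; it assumes only that the period T is stretched or compressed with the relative spacing of pulses unchanged. The ratios, periods and function names are hypothetical.

```python
from bisect import bisect_right


def timescale(period_ms, ratios):
    """Pulse times for a timescale of total duration period_ms whose
    successive pulse intervals are in the proportions given by ratios."""
    total = sum(ratios)
    times, t = [], 0.0
    for r in ratios:
        t += period_ms * r / total
        times.append(t)
    return times


def count_units(interval_ms, pulse_times):
    """Number of timescale pulses generated within the measured interval."""
    return bisect_right(pulse_times, interval_ms)


RATIOS = [1, 2, 3, 4]          # hypothetical pulse interval ratios
p = timescale(10.0, RATIOS)    # pulses at 1, 3, 6 and 10 msec
q = timescale(5.0, RATIOS)     # same ratios, half the period T

# An interval of 4 msec on scale p and one of 2 msec on scale q give the
# same count, so the count is unchanged when interval and scale shrink together.
print(count_units(4.0, p), count_units(2.0, q))  # 2 2
```

This is the sense in which a scale adjusted to the talker lets zero crossing intervals from different talkers fall into the same channels.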
As noted previously, reference is made to the use of a nonlinear timescale for counting zero crossing intervals. In the present invention the circuit of FIG. 11 is used to generate a nonlinear timescale the scale of which is automatically expanded or contracted according to the fundamental frequency or other characteristics of the talker. The derivation of a signal representing the fundamental frequency of a talker is well known and forms no part of the present invention; see for example "Automatic Extraction of the Excitation Function of Speech with Particular Reference to the Use of Correlation Methods" by J. S. Gill, Proceedings of the 3rd International Congress on Acoustics, Stuttgart 1959, page 217. The pitch analogue output of the system described therein can be converted by [...] means to provide a controlling voltage waveform for input 204 in the nonlinear timescale generator of FIG. 11, the amplitude of this voltage being related to the fundamental frequency or other characteristic of the talker.
It is to be understood that the foregoing description of specific examples of this invention is made by way of example only and is not to be considered as a limitation on its scope.
1. Speech recognition apparatus comprising:
a. a speech waveform source;
b. means coupled to said speech waveform source for detecting waveform reversals of polarity and generating therefrom a corresponding output waveform;
c. means coupled to said speech waveform source for detecting the waveform fundamental frequency and generating therefrom a voltage representative of said fundamental frequency;
d. means, responsive to said reversal detector and to said fundamental frequency detector, for generating a nonlinear measuring timescale waveform, said nonlinear timescale being initiated whenever a reversal is detected and altered according to variations in the fundamental frequency; and
e. means, coupled to said reversal detector and to said timescale generator for counting the number of timescale units generated between detected reversals.
2. Apparatus according to claim 1 in which the means for generating the nonlinear time scale includes first and second transistors having complementary symmetry with their emitters connected together, a positive feedback connection between the base of the first transistor and the collector of the second transistor, first and second capacitors connected to the base of the second transistor and means for charging the first and second capacitors at differential rates by the voltage related to the fundamental frequency.