US 3649765 A
An important step in speech signal analysis is the identification of formant frequencies of voiced speech. Formant data is necessary in the synthesizer used, for example, in a resonance vocoder. To derive these data, i.e., to obtain an estimate of the pitch period of the signal and its spectral envelope, a cepstrum of a speech signal is used. The lowest three formants of a voiced speech signal are then estimated from a smoothed spectral envelope using constraints on formant frequency ranges and relative levels of spectral peaks at the formant frequencies. These constraints allow detection in cases where formants are too close together to be resolved from the initial spectral envelope.
Description (OCR text may contain errors)
United States Patent Rabiner et al.
Mar. 14, 1972 SPEECH ANALYZER-SYNTHESIZER SYSTEM EMPLOYING IMPROVED FORMANT EXTRACTOR Inventors: Lawrence R. Rabiner, Chatham; Ronald W. Schafer, New Providence, both of N.J.
[21] Appl. No.: 872,050
References Cited
UNITED STATES PATENTS
2,938,079 5/1960 Flanagan 179/15.55
3,328,525 6/1967 Kelly 179/15.55
3,448,216 6/1969 Kelly 179/15.55
3,493,684 2/1970 Kelly 179/1 SA
3,190,963 6/1965 David 179/1 SA
3,268,660 8/1966 Flanagan 179/1 SA
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorney: R. J. Guenther and William L. Keefauver
9 Claims, 11 Drawing Figures
[FIG. 1: block diagram of the speech analyzer-synthesizer. FIG. 2: spectral envelope estimator. FIG. 3: pitch detector.]
[FIG. 4: flow chart of unvoiced spectrum coder 18. FIG. 5: arrangement of FIGS. 6 and 7. FIGS. 6 and 7: flow chart of voiced spectrum coder 19, covering the searches for the first, second, and third formants.]
SPEECH ANALYZER-SYNTHESIZER SYSTEM EMPLOYING IMPROVED FORMANT EXTRACTOR This invention relates to the analysis and synthesis of speech in bandwidth compression systems. Subordinately, it relates to the identification and extraction of formants from continuous human speech.
BACKGROUND OF THE INVENTION In order to make more economical use of the frequency bandwidth of speech transmission channels, a number of bandwidth compression arrangements have been devised for transmitting the information content of a speech wave over a channel whose bandwidth is substantially narrower than that required for analog transmission of the speech wave itself. Bandwidth compression systems typically include, at a transmitter terminal, an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals representative of selected information-bearing characteristics of the speech wave and, at a receiver terminal, a synthesizer for reconstructing from the control signals a replica of the original speech wave.
1. Field of the Invention It has been demonstrated that a speech waveform can be constructed by means of an arrangement that corresponds generally to the structure of the human vocal tract. Speech is produced in such an arrangement by exciting a series or parallel connection of resonators either by random noise, to produce unvoiced sounds, by a quasi-periodic pulse train, to produce voiced sounds, or in some cases by a mixture of these sources, to produce voiced fricatives. To produce natural sounding speech, the mode of operation of the human vocal tract is simulated by continuously tuning the natural frequencies of the resonators. As tuned, resonances are established at selected frequencies to produce peaks or maxima in the amplitude spectrum of the reconstructed signal which correspond to the principal resonances, or formants, of the human vocal tract. Since the first three formants, in order of frequency, contribute most to the intelligibility of speech, it is common practice to transmit at least three formant control signals to shape an artificial spectrum at the synthesizer.
2. Discussion of the Prior Art Since formants are effective parameters for the production of artificial human speech, they are used as control signals, for example, in such devices as the well-known resonance vocoder. A typical resonance vocoder is described in J. C. Steinberg, U.S. Pat. No. 2,635,146, issued Apr. 14, 1953. Further, since the quality of speech reconstructed by a resonance vocoder or the like is largely dependent on the proper identification of formant frequencies and locations, a number of techniques have been proposed for extracting formant information from a speech wave. One such proposal is described in J. L. Flanagan, U.S. Pat. No. 2,938,079, issued May 24, 1960. Further, electrical methods for speech synthesis, using formant data, are discussed in detail in Speech Analysis, Synthesis and Perception by J. L. Flanagan, Academic Press, Inc., 1965.
SUMMARY OF THE INVENTION It is an object of this invention to improve the accuracy and efficiency with which formants are derived from a speech signal. It is another object to use these formants and other selected parameters to transmit, over a narrow band communication circuit, sufficient information with which to produce an accurate replica of an input speech signal.
These and other objects are achieved, in accordance with this invention, by determining, at a transmitter station, as a function of time, the pitch period, the amplitude of voiced and unvoiced excitation, the location of the lowest three formants for voiced speech, and the locations of a single pole and zero necessary for the synthesis of unvoiced speech. These data are suitable for transmission to a receiver station for use in the synthesis of speech. Since the system is not pitch-synchronous, an exact determination of pitch period is not required. Instead, several periods of speech may be examined at a time. Averaging of this sort has the advantage of eliminating the difficult problem of accurately determining pitch periods in the acoustic waveform.
The analysis of applied voiced speech thus involves two basic parts, viz., initially, an estimation of pitch period and a computation of the spectral envelope of the applied signal, and, secondly, an estimation of formants from the spectral envelope. Estimation of the pitch period and the spectral envelope is accomplished through a computation of the cepstrum of a segment of the applied speech waveform. The cepstrum of a segment of sampled speech is defined as the inverse transform of the logarithm of the Fourier transform of that segment. Cepstral techniques for pitch period estimation have been described in "Cepstrum Pitch Determination" by A. M. Noll, Journal of the Acoustical Society of America, February, 1967, at page 293. Previous investigations have shown that it is reasonable to assume that the logarithm of the Fourier transform (actually the logarithm of the z-transform in the case of sampled data) of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of an acoustic waveform. These two additive components can be separated by linear filtering of the logarithm of the transform. The assumption that the log magnitude is composed of two separate components is supported by investigation of models of the production of speech waveforms.
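The definition above can be illustrated with a minimal sketch, assuming NumPy is available. The 10 kHz sampling rate, the synthetic impulse-train frame, the single 700 Hz resonance, and the function name `real_cepstrum` are illustrative assumptions and do not come from the specification.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse DFT of the log magnitude of the DFT of a frame."""
    spectrum = np.fft.fft(frame)
    # The small floor is an added safeguard against log(0), not part of the definition.
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    return np.fft.ifft(log_mag).real

# Synthetic "voiced" frame (assumed values): an impulse train with a period of
# 80 samples convolved with a short decaying 700 Hz resonance.
fs = 10_000
period = 80
n = np.arange(1024)
excitation = (n % period == 0).astype(float)
h = np.exp(-n[:200] / 20.0) * np.cos(2 * np.pi * 700 * n[:200] / fs)
frame = np.convolve(excitation, h)[:1024]

c = real_cepstrum(frame)
# The slowly varying (vocal tract) part sits at low quefrency; the periodic
# excitation shows up as a peak near the pitch period.
peak_index = 40 + int(np.argmax(c[40:200]))
print("cepstral peak at sample", peak_index)
```

The low-quefrency samples of `c` carry the spectral envelope, and the peak found beyond them reflects the excitation period, which is the separation the paragraph describes.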
Accordingly, the pitch period is determined by searching the cepstrum for a strong peak in a region encompassing the minimum expected pitch period. The spectral envelope is obtained by low pass filtering of the log magnitude of the discrete Fourier transform. Formants are derived from the smoothed spectral envelope by locating all of the peaks (maxima) and identifying the location and amplitude level of each peak. This collection of peak locations and peak levels contains the spectral information necessary for a satisfactory estimation of formant values. The frequency region expected to contain the first three formants of a speech signal is then segmented into three regions. The lowest formant is searched for first, looking primarily in the lowest region, then the second formant is sought, primarily in the next highest region, and finally the third formant is searched for in the highest of the three regions. Based on the amplitudes and frequencies of the peaks and their locations in the various regions or in regions of overlap, logical operations are performed by which spurious candidates are eliminated and the selected highest peaks are ordered and identified as speech formants. If the speech is unvoiced, only a single variable resonance peak and a single variable antiresonance are used to characterize the sound. They, too, are extracted from a cepstrally smoothed spectrum. A voiced-unvoiced decision additionally is obtained based on the presence or absence of a strong peak in the cepstrum together with a measure of a zero crossing count.
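The peak-locating step can be sketched as follows. This is a simple local-maximum search written for illustration; the function name `pick_peaks` and the sample values are assumptions, and the specification leaves the construction of the peak picker open.

```python
def pick_peaks(freqs_hz, levels_db):
    """Return (frequency, level) for every interior local maximum."""
    peaks = []
    for i in range(1, len(levels_db) - 1):
        # A sample counts as a peak if it rises above its left neighbor
        # and does not fall below its right neighbor.
        if levels_db[i] > levels_db[i - 1] and levels_db[i] >= levels_db[i + 1]:
            peaks.append((freqs_hz[i], levels_db[i]))
    return peaks

# Illustrative smoothed-envelope samples (made-up levels in dB).
freqs = [200, 300, 400, 500, 600, 700, 800]
levels = [10.0, 14.0, 12.0, 11.0, 15.0, 9.0, 8.0]
peaks = pick_peaks(freqs, levels)
print(peaks)
```

Each returned pair carries the location and level that the later region-by-region formant search would work from.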
In order to convert the control parameters of the analyzer to speech, a digital, serial, terminal analog speech synthesizer is employed. It models the transmission characteristic of the vocal tract from glottis to mouth. Synthesizers based on such models have been described previously in the art, for example, in Gerstman-Kelly, U.S. Pat. No. 3,158,685, issued Nov. 24, 1964, as well as elsewhere. The variable resonance circuits employed in the synthesis network and the manner of controlling them may be substantially identical to those described in the Gerstman-Kelly patent.
Certain other refinements to the generation of parameter signals are employed to improve the synthesis of speech, particularly in those cases in which formants in the applied speech are too close together in frequency to be resolved.
This invention will be more fully understood from the following detailed description taken together with the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block schematic diagram of a speech analyzer-synthesizer which illustrates the principles of this invention;
FIG. 2 illustrates the structure of a spectral envelope estimator suitable for use in the system of FIG. 1;
FIG. 3 depicts a pitch detector which may be used in the practice of the invention;
FIG. 4 illustrates the functional operation of unvoiced spectrum coder 18 used in the apparatus of FIG. 1;
FIG. 5 illustrates the manner in which FIGS. 6 and 7 are interconnected;
FIGS. 6 and 7 illustrate by way of a functional flow chart the operation of voiced spectrum coder 19 used in the analyzer of FIG. 1;
FIG. 8 depicts typical regions in the spectrum of a speech signal likely to contain formants;
FIG. 9 illustrates the threshold level of signal $F_2$ relative to signal $F_1$, useful in explaining the operation of a voiced spectrum coder;
FIG. 10 illustrates a characteristic cepstrally smoothed log spectrum of a speech signal; and
FIG. 11 illustrates the manner in which formants in the log spectrum of the signal of FIG. 10 are emphasized by virtue of the operation of the apparatus of this invention.
DETAILED DESCRIPTION OF THE INVENTION FIG. 1 illustrates a band compression system including an analyzer at a transmitter station, and a synthesizer at a receiver station, which illustrates the principles of the invention. At the analyzer, an incoming speech wave from source 10, which may be a conventional transducer for converting speech sounds into a corresponding electrical wave, is applied both by way of modulator 11 to cepstrum analyzer 12, and to zero crossing counter 13. The purpose of the analyzer station is to develop control signals representative of the pitch period and formant locations for voiced speech, the resonance and antiresonance locations for unvoiced speech, and an indication of the magnitude of the buzz or hiss components during voiced and unvoiced speech intervals, respectively. A cepstrum analysis is particularly suitable for this purpose since it permits all of these parameter signals to be developed with a minimum of equipment complexity. Thus, estimation of the pitch period and the spectral envelope of the applied signal is accomplished from the computation of the cepstrum of a segment of the speech waveform. As discussed by Noll, the cepstrum of a signal is the spectrum of the logarithm of the power spectrum of a signal and exhibits a number of distinct peaks at pitch period intervals. Previous investigations have shown that the logarithm of the Fourier transform of a segment of voiced speech consists of a slowly varying component attributable to the convolution of the glottal pulse with the vocal tract impulse response, plus a rapidly varying periodic component due to the repetitive nature of the acoustic waveform. These two additive components, available in the cepstrum signal, may be separated by linear filtering.
Preparatory to developing the cepstrum of the applied signal, a segment of input speech, $s(\xi + nT)$, is weighted, through the action of modulator 11, by a symmetric Hamming window function, $w(nT)$, such that

$$s(\xi + nT)\,w(nT) = [\,p(\xi + nT) * h(nT)\,]\,w(nT) \qquad (1)$$

where $*$ denotes a discrete convolution, where $\xi$ is the starting sample of the segment of the speech waveform, and where $T$ is the sampling period in seconds. In equation (1), $p(\xi + nT)$ represents a quasi-periodic impulse train appropriate for the particular segment being analyzed and $h(nT)$ represents the triple convolution of the vocal tract impulse response with the glottal pulse and the radiation load impulse response. The window function $w(nT)$ tapers toward zero at each end to minimize the effects of a nonintegral number of pitch periods within the window. Since the window function varies slowly with respect to variations in the pitch of the applied signal, it is convenient to develop it, in function generator 23, from the indication of pitch period developed by pitch detector 14. Thus, the purpose of modulating the applied speech wave from transducer 10 by the window function in modulator 11 is to improve the approximation that a segment of voiced speech can be represented as a convolution of a periodic impulse train with a time invariant, vocal tract impulse response sequence. Preferably, the window function is specified by the equation:

$$w(nT) = \begin{cases} 0.54 - 0.46\cos(2\pi nT/3\tau), & 0 \le nT \le 3\tau \\ 0, & \text{elsewhere.} \end{cases} \qquad (2)$$

The duration, $3\tau$, of the window is three times the previous estimate of pitch period. It is made dependent on the pitch period estimate, from detector 14, for two conflicting reasons. In order to obtain a strong peak in the cepstrum at the pitch period, it is necessary to have several periods of the waveform within the window. In contrast, in order to obtain strong peaks in the smooth spectrum, only about two periods should be within the window, i.e., formants should not have changed appreciably within the time interval spanned by the window. Thus, an adaptive width window assures better estimates of pitch and formants since it presents a wider window for finding a strong peak at the pitch period, and a narrower window for finding strong, unambiguous indications of formants. The choice for window duration of three times the previous pitch period represents a compromise which has proven to be satisfactory.
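A minimal sketch of the pitch-adaptive window follows, assuming the Hamming form $0.54 - 0.46\cos(2\pi nT/3\tau)$ over $0 \le nT \le 3\tau$. The sampling period of 0.1 msec and the 8 msec pitch period are illustrative assumptions, as is the function name.

```python
import math

def adaptive_hamming(tau, T):
    """Samples of w(nT) = 0.54 - 0.46*cos(2*pi*nT/(3*tau)) for 0 <= nT <= 3*tau."""
    # The tiny offset guards against floating-point round-off in 3*tau/T.
    n_max = int(math.floor(3.0 * tau / T + 1e-9))
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n * T / (3.0 * tau))
            for n in range(n_max + 1)]

T = 1e-4      # assumed sampling period (10 kHz)
tau = 8e-3    # assumed previous pitch estimate, 8 msec
w = adaptive_hamming(tau, T)
print(len(w), w[0], w[len(w) // 2])
```

With these coefficients the end samples equal 0.08 rather than exactly zero, and the window length tracks the pitch estimate: a longer pitch period yields a proportionally longer window.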
As noted earlier, the cepstrum developed at the output of analyzer 12 consists of two components. The component due primarily to the glottal wave and the vocal tract is concentrated in the region $|nT| < \tau$, while the component due to the pitch occurs in the region $|nT| \ge \tau$, where $\tau$ is the pitch period during the segment being analyzed. The component due to excitation consists mainly of sharp peaks at multiples of the pitch period. Thus, pitch period can be determined by searching the cepstrum for a strong peak in the region $nT \ge \tau_{min}$, where $\tau_{min}$ is the minimum expected pitch period. Signals from analyzer 12 are accordingly supplied as one input to pitch detector 14. Zero crossing count information developed by counter 13 is supplied as the other. This information is employed to provide an indication of the voiced or unvoiced character of the applied speech signal. Detector 14 produces a signal $P$, which may either be equal to $\tau$ for voiced signals, in which case $\tau$ denotes the pitch period of the input signal, or zero for unvoiced signals. Details of a suitable pitch detector are described hereinafter with reference to FIG. 3.
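The search described above might be sketched as below. The upper search limit `tau_max`, the peak `threshold`, and the toy cepstrum values are assumptions introduced for illustration; the specification names only the minimum expected pitch period and a threshold set elsewhere.

```python
def estimate_pitch_period(cepstrum, T, tau_min, tau_max, threshold):
    """Return the pitch period in seconds, or 0.0 when no strong peak is found."""
    lo = int(round(tau_min / T))
    hi = min(int(round(tau_max / T)), len(cepstrum) - 1)
    # Largest cepstral sample at or beyond the minimum expected pitch period.
    best = max(range(lo, hi + 1), key=lambda n: cepstrum[n])
    return best * T if cepstrum[best] > threshold else 0.0

T = 1e-4  # assumed sampling period
voiced = [0.0] * 512
voiced[5] = 1.0    # low-time (vocal tract) component, outside the search region
voiced[80] = 0.4   # excitation peak at 8 msec
unvoiced = [0.0] * 512
unvoiced[5] = 1.0

p_voiced = estimate_pitch_period(voiced, T, 2.5e-3, 15e-3, 0.1)
p_unvoiced = estimate_pitch_period(unvoiced, T, 2.5e-3, 15e-3, 0.1)
print(p_voiced, p_unvoiced)
```

Returning zero when no peak clears the threshold mirrors the convention that $P$ equals the pitch period for voiced input and zero otherwise.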
Similarly, a suitable examination of the cepstrum from analyzer 12 is performed to develop an estimate of the spectral envelope of the applied signal. Although a variety of techniques for deriving such an envelope signal are known in the art, one suitable arrangement is described hereinafter in the discussion of the arrangement of FIG. 2.
Peaks in the spectral envelope are identified in peak picker network 16. Suitable peak picking networks have been described variously in the art. Peaks of the spectral envelope are delivered by way of gate 17 either to unvoiced spectrum coder 18 or to voiced spectrum coder 19. The choice is dependent upon whether the input speech signal is voiced or unvoiced. Accordingly, gate 17 is actuated by the voiced-unvoiced signal character of the pitch period signal developed by detector 14. If the input signal is voiced, values of $\tau$, which appear as a "1" signal at the input of gate 17, open the gate so that peaks of the spectrum envelope are supplied to coder 19. If the input signal is unvoiced, a "0" pitch signal (absence of $\tau$) is applied [...] use in synthesizing the applied wave. Two control signals, $F_P$ and $F_Z$, are developed by coder 18, indicating for unvoiced speech the location of a single resonance and antiresonance in the speech signal, and three control signals, $F_1$, $F_2$, and $F_3$, are produced by coder 19, representative of the location of the first three formants of the applied signal. Coder 19, in addition to operating on the peaks of the spectrum envelope, also is supplied with cepstrum signals from analyzer 12.
Control signals $A_V$ and $A_N$, representative of the level of buzz and hiss signals to be used in synthesis, are developed in control network 20 from the first spectrum signal produced by cepstrum analyzer 12. Apparatus for developing such level control signals is well known in the vocoder art; any form of buzz-hiss level analyzer may be employed.
Signals $P$, $F_P$, $F_Z$, $F_1$, $F_2$, $F_3$, $A_V$, and $A_N$ constitute all of the controls necessary for characterizing applied speech, both when voiced and unvoiced. These signals together require considerably less transmission bandwidth than would analog transmission of the applied speech signal. Accordingly, they may be delivered to multiplex unit 21, of any desired construction, wherein the group of control signals is prepared for transmission to a receiver station. At the receiver station distributor unit 22, again of any desired construction, recovers the transmitted signals and makes them available for synthesis.
Received parameter signals may be used to control the production of artificial speech, using any well-known synthesis apparatus. For example, a formant vocoder synthesizer of the form described in the above-mentioned Gerstman and Kelly U.S. Pat. No. 3,158,685, is satisfactory. Typically, a formant vocoder synthesizer includes two systems of resonant circuits, one energized by a noise signal to produce unvoiced sounds, and the other energized by a periodic pulse signal to develop voiced sounds. In the illustrated apparatus, unvoiced resonant circuits 24 receive noise signals from generator 25 by way of modulator 26. The modulator is controlled by the hiss level control signal $A_N$ and serves to control the amplitude of noise signals supplied to the input of the resonant circuits. Spectrum signals $F_P$ and $F_Z$ tune the resonant circuits 24 to shape the noise signals.
Voiced resonant circuits 27 are supplied, by way of modulator 28, with signals from pulse generator 29. Pulse generator 29, responsive to control signal $P$, develops a train of unit samples with the spacing between samples equal to $\tau$, where $\tau$ is the value of $P$ during voiced intervals. Such pulses are similar to vocal pulses of air passing through the vocal cords at the fundamental frequency of vibration, $1/\tau$, of the vocal cords. The amplitude of the resulting pulse train is controlled in modulator 28 by buzz level control signal $A_V$. Signal $A_V$ represents the intensity of voicing. Resonant circuits 27 thus energized are controlled by formant control signals $F_1$, $F_2$, and $F_3$ to shape the train of pulse signals in a fashion not unlike the shaping of voiced excitation that takes place in the human vocal tract, and to produce voiced signals which correspond to those contained in the input signal. In the conventional manner, resonant system 27 includes additional fixed resonant circuits to provide high frequency shaping of the spectrum.
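The voiced branch can be sketched as a pulse train driving a cascade of two-pole digital resonators. This is an illustrative model only, assuming NumPy: the resonator difference equation, the formant bandwidths, the 10 kHz sampling rate, and the function names are assumptions, and the fixed high-frequency resonances are omitted.

```python
import numpy as np

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole digital resonator with unity gain at zero frequency."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    b = 2.0 * r * np.cos(2.0 * np.pi * freq_hz / fs)
    c = -r * r
    a = 1.0 - b - c
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for n, xn in enumerate(x):
        y[n] = a * xn + b * y1 + c * y2
        y2, y1 = y1, y[n]
    return y

def synthesize_voiced(pitch_samples, formants_hz, level, n_samples, fs,
                      bandwidths_hz=(60.0, 90.0, 120.0)):
    """Pulse train with spacing pitch_samples, scaled by level, shaped by formants."""
    pulses = np.zeros(n_samples)
    pulses[::pitch_samples] = level
    out = pulses
    for f, bw in zip(formants_hz, bandwidths_hz):
        out = resonator(out, f, bw, fs)
    return out

fs = 10_000
speech = synthesize_voiced(80, (500.0, 1500.0, 2500.0), 1.0, fs, fs)
spectrum = np.abs(np.fft.rfft(speech))
strongest_hz = np.fft.rfftfreq(len(speech), 1.0 / fs)[np.argmax(spectrum)]
print("strongest harmonic near", strongest_hz, "Hz")
```

In this toy setting the strongest harmonic of the output falls at the first formant, which shows the resonators imposing a formant structure on the flat spectrum of the pulse train.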
Voiced and unvoiced replica signals from circuits 24 and 27 are combined in adder 30 and delivered for use, for example, to energize loud speaker 31. Additional spectral balance for the synthetic speech signals preferably is obtained by passing the signals from adder 30 through fixed spectral shaping network 32 before delivering them for use. This refinement aids in restoring realism to the reconstructed speech.
A form of spectral envelope estimator 15, suitable for use in the practice of the invention, is shown in FIG. 2. Low pass filtering of the cepstrum signal $c(nT)$ is accomplished by first multiplying the supplied cepstrum by a function $l(nT)$ of the form [equation (3) illegible] where $\tau_1 + \Delta\tau$ is less than the minimum pitch period that will be encountered. The sequence $e(nT)$ is next added to the sequence $c(nT)\,l(nT)$. The purpose of adding this component to the cepstrum is to equalize formant amplitudes. The sequence $e(nT)$ consists of four nonzero values, as follows: [equation (4) illegible]
Functions $l(nT)$ and $e(nT)$ may be produced, respectively, by function generators 51 and 53, constructed to evaluate the above equations. Function generators suitable for making such evaluations are well known in the art. The signal from function generator 51 is applied to modulator 50 and the signal from function generator 53 is added to the resultant signal in adder 52. The sequence $c(nT)\,l(nT) + e(nT)$ is then transformed, in discrete Fourier transformer 54, of any well-known construction, to produce an equalized spectral envelope.
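A rough sketch of the cepstral low pass filtering follows, assuming NumPy. Because the defining equation for $l(nT)$ is not legible in this text, the raised-cosine taper between $\tau_1$ and $\tau_1 + \Delta\tau$ used here is an assumption, as are the function names; the equalizing sequence $e(nT)$ is left out because its four values are not recoverable.

```python
import numpy as np

def lowpass_lifter(N, T, tau1, dtau):
    """Assumed cepstral window: 1 below tau1, cosine taper to 0 at tau1 + dtau."""
    n = np.arange(N)
    q = np.minimum(n, N - n) * T          # |nT| with circular DFT indexing
    l = np.zeros(N)
    l[q < tau1] = 1.0
    taper = (q >= tau1) & (q <= tau1 + dtau)
    l[taper] = 0.5 * (1.0 + np.cos(np.pi * (q[taper] - tau1) / dtau))
    return l

def spectral_envelope(frame, T, tau1=2e-3, dtau=2e-3):
    """Smoothed log-magnitude spectrum: DFT of the windowed cepstrum."""
    log_mag = np.log(np.maximum(np.abs(np.fft.fft(frame)), 1e-12))
    c = np.fft.ifft(log_mag).real
    return np.fft.fft(c * lowpass_lifter(len(frame), T, tau1, dtau)).real

T = 1e-4  # assumed sampling period
l = lowpass_lifter(1024, T, 2e-3, 2e-3)
frame = np.random.default_rng(0).standard_normal(1024)
envelope = spectral_envelope(frame, T)
print(l[0], l[30], l[100], envelope.shape)
```

Keeping $\tau_1 + \Delta\tau$ below the shortest expected pitch period ensures that the excitation peaks in the cepstrum are removed while the slowly varying envelope component passes through.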
Since the component of the cepstrum due to voiced excitation consists mainly of sharp peaks at multiples of the pitch period, the pitch period of the applied speech wave can be determined by searching the cepstrum for strong peaks in the region of the minimum expected pitch period. A suitable manner of doing this is shown in the detailed illustration of pitch detector 14 by way of FIG. 3. A zero crossing count from counter 13 (FIG. 1) is supplied to compare network 34, where the total count is matched to a threshold signal, typically with a value of 1500 crossings per second. If the count is above the threshold, a signal $Y=0$ is delivered to logic OR gate 36. If the count is below the threshold, a signal $Y=1$ is delivered to gate 36. Cepstrum signals from analyzer 12 are delivered to peak picker network 37, which may be of the type described by Noll in U.S. Pat. No. 3,420,955, issued Jan. 7, 1969, or of any other desired form of construction. Cepstrum peaks are then compared in network 38 against a threshold established symbolically by potentiometer 39. If the amplitude of the detected peak is greater than the threshold, the comparator issues a signal $X=1$ to indicate that a voiced signal is present (because of the presence of a pitch period signal), but if the peak amplitude is below threshold, a signal $X=0$ is delivered to logic OR gate 36. Peak signals from peak picker 37 are also delivered to gate 40. Gate 40 is controlled by the output of OR gate 36 such that a cepstrum peak signal above threshold, or a zero crossing count signal below threshold, indicates a voiced signal. Gate 40 thereupon permits the peak location signal from picker 37 to be delivered as an output signal. It is designated $P=\tau$. If neither of the threshold criteria is met, logic OR gate 36 issues a zero, gate 40 is not actuated, and no signal appears at the output of the gate. This constitutes the signal $P=0$ and indicates that the applied signal is unvoiced.
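The decision logic of FIG. 3, as described above, reduces to an OR of two threshold tests. The sketch below is illustrative: the function name and the cepstral peak threshold value are assumptions, and a count exactly equal to the zero crossing threshold is treated as unvoiced evidence because the text does not specify that case.

```python
def pitch_detector_output(peak_level, peak_location, zero_crossings_per_sec,
                          peak_threshold, zc_threshold=1500.0):
    """Return P: the peak location (pitch period) if voiced, else 0.0."""
    x = 1 if peak_level > peak_threshold else 0           # cepstrum peak test
    y = 1 if zero_crossings_per_sec < zc_threshold else 0  # zero crossing test
    voiced = bool(x or y)                                 # logic OR gate
    return peak_location if voiced else 0.0

# Strong cepstral peak: voiced regardless of the zero crossing count.
p1 = pitch_detector_output(0.4, 8e-3, 3000.0, peak_threshold=0.1)
# Weak peak but low zero crossing count: still voiced.
p2 = pitch_detector_output(0.05, 8e-3, 900.0, peak_threshold=0.1)
# Weak peak and high zero crossing count: unvoiced.
p3 = pitch_detector_output(0.05, 8e-3, 3000.0, peak_threshold=0.1)
print(p1, p2, p3)
```

Using an OR rather than an AND means either cue alone is enough to pass the peak location through, which biases the detector toward declaring speech voiced.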
From the derived peaks in the spectral envelope of the applied signal, it is in accordance with the invention to develop both signals for control of unvoiced resonant circuits at a synthesizer, and signals representative of the formant frequencies and locations for use in the control of voiced resonant circuits at the synthesizer. If the speech is unvoiced, as indicated by the $P=0$ signal from pitch detector 14 applied to gate 17, then only a single variable resonance peak is used to characterize the sound. It has not been found necessary to estimate a second unvoiced resonance in order to synthesize unvoiced sounds. The resonance peak for unvoiced sounds is extracted from peaks in the spectral envelope in coder 18. Since there is no pitch period for these sounds, a fixed number of data points is analyzed. The resonance peak used is the strongest spectral peak above 1,000 Hz. Although coder 18 may be implemented in any desired fashion to select and process the desired spectral peak, it has been found convenient to employ a special purpose computer programmed, for example, in accordance with the flow chart of steps shown in FIG. 4.
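Selecting the unvoiced resonance can be sketched as a filter followed by a maximum. The function name and the peak values below are illustrative assumptions; only the rule "strongest spectral peak above 1,000 Hz" is taken from the text.

```python
def unvoiced_resonance(peaks, floor_hz=1000.0):
    """Return the (frequency, level) of the strongest peak above floor_hz, or None."""
    candidates = [p for p in peaks if p[0] > floor_hz]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])

# Illustrative (frequency in Hz, level in dB) pairs from a peak picker.
peaks = [(400.0, 30.0), (1800.0, 22.0), (3100.0, 27.0), (4200.0, 15.0)]
resonance = unvoiced_resonance(peaks)
print(resonance)
```

The 400 Hz peak is ignored even though it is the strongest overall, because it lies below the 1,000 Hz floor.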
As indicated in FIG. 4, peaks of the spectral envelope signal delivered to coder 18 are processed by defining the frequency of the highest peak above 1,000 Hz as $F_P$. The difference between $F_P$ and the incoming signal is set equal, in z-transform notation (discussed hereinafter), to

$$\Delta_P = \text{[equation (5) illegible]} \qquad (5)$$

If $\Delta_P$ is found to be greater than 13 dB, $F_Z$ is assumed to be 500 cycles and [illegible] is determined. If $\Delta_P$ is not greater than 13 dB above the reference, but is less than [illegible] dB below the reference, $F_Z$ is assumed to be equal to $F_P$ and [illegible] is determined. If $F_Z$ meets neither criterion, it is set equal to

$$F_Z = (0.0065\,F_P + 4.5 - \Delta_P)(0.014\,F_P + 28), \qquad (6)$$

[illegible] and zero in the unvoiced spectrum are available for use at the synthesizer in adjusting unvoiced resonant circuits 24. A suitable program listing for carrying out these operations on a computer is set forth in Appendix I, attached to this specification.
Before proceeding to the details of the process for estimating the formant frequencies from peaks in the spectral envelope, in coder 19, it is believed helpful to present data relating to the properties of the speech spectrum. FIG. 8 shows the frequency ranges of the first three formants as determined from experimental data. Individual speakers may have formant ranges somewhat different from those shown in the figure and, if known, these ranges may be used for that speaker. It is apparent that there is a high degree of overlap between ranges in which formants may be located. The first formant range is from 200 to 900 Hz. However, for approximately one-half of this range (500-900 Hz) the second formant can overlap the first. Simultaneously, the second and third formant regions overlap from 1,100 to 2,700 Hz. Thus, the estimation of the formants is not simply a matter of locating peaks of the spectrum in non-overlapping frequency bands. Another property of speech pertinent to formant estimation is the relationship between formant frequencies and relative amplitudes of formant peaks in the smooth spectrum. Considerable importance, therefore, is placed on a measurement of the level of the second formant peak ($F_2$) relative to the level of the first formant peak ($F_1$). The level measurement $\Delta$ is defined, again in z-transform notation, as:

$$\Delta = 20\log_{10}\left|H(e^{j2\pi F_2 T})\right| - 20\log_{10}\left|H(e^{j2\pi F_1 T})\right| \qquad (7)$$

where $F_1$ and $F_2$ are the frequencies of the first and second formants, and $|H(e^{j2\pi FT})|$ is the magnitude of the smoothed spectrum at $F$ Hz. A careful analysis shows that $\Delta$ depends primarily upon $F_1$ and $F_2$ and is fairly insensitive to the bandwidths of all the formants and to the higher formant frequencies. FIG. 9 shows a curve of the minimum difference in formant level (in dB) between $F_1$ and $F_2$ as a function of the frequency $F_2$. This curve takes into account equalization of the spectrum and serves as a threshold against which the difference between the level of a possible $F_2$ peak and the level of an $F_1$ peak is compared. The dependence of $\Delta$ on $F_1$ is eliminated by assuming that $F_1$ is fixed at its lower limit F1MN. If the $F_1$ dependence were to be accounted for, a family of curves similar in shape but displaced vertically from the one shown in FIG. 9 would be required. For a value of $F_1$ greater than F1MN, the corresponding curve is above the curve shown in FIG. 9. In FIG. 9, the curve is flat until 500 Hz because $F_2$ is assumed to be above this minimum value. The curve then decreases until about 1,500 Hz, reflecting the drop in $F_2$ level as it gets further away from $F_1$. However, above 1,500 Hz the curve rises again due to the increasing proximity of $F_2$ and $F_3$. The curve continues to rise until $F_2$ gets to its maximum value F2MX = 2,700 Hz, at which point $F_2$ and $F_3$ are maximally close (according to the simple model of fixed $F_3$). In order to estimate formants from the spectrum envelope, all peaks are located and the frequency and amplitude of each peak are recorded. The frequency region of the applied signal is segmented into three regions not unlike those depicted in FIG. 8. The lowest formant is first searched for, then $F_2$, and finally $F_3$. Based on the amplitudes and frequencies of the peaks, spurious candidates are eliminated and ambiguities resulting, for example, from closely spaced formants are eliminated by a logical examination of the detected peaks.
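The level comparison described above can be sketched as follows. The level of the second formant peak relative to the first is taken as a difference of decibel levels; the acceptance rule (difference at or above the threshold) is one reading of the text, and the threshold value passed in is a placeholder, since the curve of FIG. 9 is not reproduced here. Function names are assumptions.

```python
import math

def level_db(magnitude):
    """Level in dB of a spectral magnitude."""
    return 20.0 * math.log10(magnitude)

def f2_level_difference(h_at_f1, h_at_f2):
    """Level of the candidate F2 peak relative to the F1 peak, in dB."""
    return level_db(h_at_f2) - level_db(h_at_f1)

def accept_f2_candidate(delta_db, threshold_db):
    """Assumed rule: keep the candidate if it is not too far below the F1 level."""
    return delta_db >= threshold_db

delta = f2_level_difference(h_at_f1=10.0, h_at_f2=1.0)    # -20 dB
ok = accept_f2_candidate(delta, threshold_db=-25.0)        # placeholder threshold
rejected = accept_f2_candidate(delta, threshold_db=-15.0)  # placeholder threshold
# One unit on a natural-log magnitude scale corresponds to 20*log10(e) dB.
one_neper_db = level_db(math.e)
print(delta, ok, rejected, round(one_neper_db, 2))
```

In practice the threshold would be looked up as a function of the candidate frequency, so a peak near 1,500 Hz would be allowed to sit further below the first formant level than one near 500 Hz or 2,700 Hz.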
In cases where $F_1$, $F_2$, and $F_3$ are separated by more than about 300 Hz, there is no difficulty in resolving the corresponding peaks in the smoothed spectrum. However, when $F_1$ and $F_2$, or $F_2$ and $F_3$, get closer than about 300 Hz, the cepstral smoothing results in the peaks not being resolved. In these cases, a spectral analysis algorithm called the Chirp z-Transform (CZT) can be used to advantage. The CZT permits the computation of samples of the z-transform at equally spaced intervals along a circular or spiral contour in the z-plane. In particular, if $F_1$ and $F_2$ are close together, it is possible to compute the z-transform on a contour which passes closer to the pole locations than the unit circle contour, thereby enhancing the peaks in the spectrum and improving the resolution. For example, FIG. 10 shows a smoothed spectral envelope in which $F_1$ and $F_2$ are unresolved. In this case the parameters of the cepstral window function $l(nT)$ were $\tau_1 = 2$ msec and $\Delta\tau = 2$ msec. FIG. 11 shows the results of a CZT analysis along a circular contour of radius [illegible] over the frequency range 0 to 900 Hz with a resolution of about 10 Hz. The effect of the use of the contour which passes closer to the poles is evident in contrast to FIG. 10. A discussion of the CZT algorithm is given in "The Chirp z-Transform Algorithm and Its Application," by Rabiner, Schafer and Rader, Bell System Technical Journal, May-June 1969, at p. 1249.
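The peak-sharpening effect of a contour inside the unit circle can be illustrated with a direct evaluation of the z-transform, assuming NumPy. This sketch does not use the fast CZT algorithm itself; the test signal, the pole radius of 0.95, the contour radius of 0.97, and the function name are all illustrative assumptions.

```python
import numpy as np

def ztransform_on_circle(x, freqs_hz, fs, radius=1.0):
    """Evaluate X(z) = sum_n x[n] * z**(-n) at z = radius * exp(j*2*pi*f/fs)."""
    n = np.arange(len(x))
    z = radius * np.exp(2j * np.pi * np.asarray(freqs_hz) / fs)
    # Outer product builds z_k**(-n) for every frequency k and sample n.
    return (z[:, None] ** (-n[None, :])) @ x

fs = 10_000
n = np.arange(400)
# Decaying 500 Hz resonance whose poles lie at radius 0.95.
x = 0.95 ** n * np.cos(2 * np.pi * 500 * n / fs)

freqs = np.arange(0, 901, 10)          # 0 to 900 Hz in 10 Hz steps
on_unit_circle = np.abs(ztransform_on_circle(x, freqs, fs, radius=1.0))
inside_circle = np.abs(ztransform_on_circle(x, freqs, fs, radius=0.97))
print(on_unit_circle.max(), inside_circle.max(),
      freqs[np.argmax(inside_circle)])
```

Moving the contour from radius 1.0 to 0.97 brings it closer to the poles, so the resonance peak becomes both taller and narrower; the same mechanism is what allows two closely spaced formants to separate into distinct peaks.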
Voiced spectrum coder 19, supplied with peaks of the spectral envelope during voiced speech intervals from gate 17 and with cepstrum signals c(nT) from analyzer 12, is accordingly programmed to take these characteristics of voiced speech into account. It serves to derive control signals F1, F2, F3 which specify formant frequencies and which are sufficient for controlling voiced resonant circuits 27 at a synthesizer. Again, the logical operations performed on the cepstrum and peak signals may be carried out using any desired form of apparatus. In practice, however, it has been found most convenient to employ a computer programmed in accordance with the steps set forth in the flow chart of FIGS. 6 and 7. Program listings for the steps of the flow chart appear in Appendix II of this specification.
Referring to FIGS. 6 and 7, the formants are picked in sequence beginning with F1. To start the process, the highest level peak of the spectrum from the peak picker 16 in the frequency range 0 to F1MX is recorded as F0AMP. F1MX is the upper limit of the F1 region. Generally the value F0AMP will occur at a peak in the F1 region which will ultimately be chosen as the F1 peak. However, sometimes there is an especially strong peak below F1MN, the lower limit of the F1 region, which is due to the spectrum of the glottal source waveform. In such cases there may or may not be a clearly resolved F1 peak above F1MN. In order to avoid choosing a low level spurious peak, or possibly the F2 peak, for the F1 peak when in fact the F1 peak and the peak due to the source are not resolved, a peak in the F1 region is required to be less than 8.7 db. (1.0 on a natural log scale) below F0AMP to be considered as a possible F1 peak. The frequency of the highest level peak in the F1 region which exceeds this threshold is selected as the first formant, F1. The level of this peak is recorded as F1AMP. If no F1 can be selected this way, the spectral envelope in the region 0 to 900 Hz. is reevaluated. The spectral peaks are sharpened by weighting the cepstrum, c(nT), supplied to coder 19 directly from analyzer 12, with a window w1(nT), where

$$w_1(nT) = e^{100\pi nT} \qquad (8)$$

and performing a spectral analysis on the resultant. This has the effect of evaluating the spectrum on a contour which passes closer to the poles. As previously discussed, the CZT algorithm is an efficient way of performing this evaluation. The enhanced section of the spectrum is then searched for the highest level peak in the F1 region. The location of this peak is accepted as F1. If the enhancement has failed to bring about a resolution of the source peak and the F1 peak, F1 is arbitrarily set equal to F1MN, the lower limit of the F1 region.
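The first-formant decision described above, before any enhancement is attempted, might be sketched as follows. The function name `pick_f1`, the (frequency, level in db.) tuple format, and the region limits used in the example (200 Hz. and 900 Hz.) are assumptions for illustration and are not taken from the program listing.

```python
def pick_f1(peaks, f1mn, f1mx, threshold_db=8.7):
    """peaks: list of (freq_hz, level_db).
    Returns (f1, f1amp, f0amp), or None when no peak in the F1 region is
    within threshold_db of the strongest peak between 0 and f1mx."""
    low = [p for p in peaks if 0.0 <= p[0] <= f1mx]
    if not low:
        return None
    f0amp = max(level for _, level in low)        # strongest peak, 0..F1MX
    candidates = [p for p in low
                  if p[0] >= f1mn and f0amp - p[1] < threshold_db]
    if not candidates:
        return None      # caller would fall back to the enhanced spectrum
    f1, f1amp = max(candidates, key=lambda p: p[1])
    return f1, f1amp, f0amp


# Hypothetical peaks: a strong glottal-source peak at 120 Hz., then two others.
peaks = [(120.0, 30.0), (450.0, 25.0), (800.0, 15.0)]
print(pick_f1(peaks, f1mn=200.0, f1mx=900.0))
```

A `None` result corresponds to the case in which the text calls for reevaluating the 0 to 900 Hz. region with the weighted cepstrum.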
The quantity F1AMP is used in the estimation of F2. If the F1 peak is very low in frequency and is not clearly resolved from the lower frequency peak due to the glottal waveform, F1AMP is set equal to (F0AMP - 8.7 db.). This is done effectively to lower (because F1 is very low) the threshold which is used in searching for F2.
The first step in estimating F2 is to fix the frequency range to be searched. If F1 has been estimated to be less than F2MN, the lower limit of the F2 region, then only the region from F2MN to F2MX is searched. However, if F1 has been estimated to be greater than F2MN, it is possible that the F2 peak has in fact been chosen as the F1 peak. Therefore the combined F1-F2 region from F1MN to F2MX is searched to ensure that if this is the case, the F1 peak will be found as the F2 peak. After F2 has been estimated, F1 and F2 are compared and their values are interchanged if F2 is less than F1.
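The range selection and the final interchange can be written compactly. In this sketch the function names are hypothetical, the value 200 Hz. for F1MN is an assumption, and the value 500 Hz. for F2MN is inferred from the flat portion of FIG. 9; only F2MX = 2,700 Hz. is stated in the text. The text does not say what happens when F1 equals F2MN exactly, so the sketch treats that case like the "greater than" case.

```python
def f2_search_range(f1, f1mn, f2mn, f2mx):
    """Choose the frequency range searched for F2, given the F1 estimate."""
    if f1 < f2mn:
        return (f2mn, f2mx)          # F1 lies below the F2 region
    return (f1mn, f2mx)              # F1 may really be F2: search both regions


def order_formants(f_a, f_b):
    """Interchange two formant estimates if they are out of order."""
    return (f_a, f_b) if f_a <= f_b else (f_b, f_a)


print(f2_search_range(300.0, 200.0, 500.0, 2700.0))
print(f2_search_range(650.0, 200.0, 500.0, 2700.0))
print(order_formants(900.0, 700.0))
```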
In deciding whether a particular spectral peak under investigation is a possible candidate for an F2 peak, the threshold curve of FIG. 9 is used. The spectral peak is first checked to see if it is located in the proper frequency range. If so, the difference between the level of the peak under consideration and F1AMP is computed. If this difference exceeds the threshold of FIG. 9, that peak is a possible F2 peak; if not, that peak is not considered as a possible F2 peak. The value of F2 is chosen to be the frequency of the highest level peak to exceed the threshold. The level of this peak is recorded as F2AMP.
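A minimal sketch of this candidate test is given below. The threshold is passed in as a function of frequency so that any curve can be supplied; the constant thresholds used in the example are placeholders and do not reproduce the values of FIG. 9. The function name and the peak format are likewise assumptions.

```python
def pick_by_threshold(peaks, lo, hi, ref_amp, threshold):
    """Return the highest-level (freq_hz, level_db) peak in [lo, hi] whose
    level difference from ref_amp exceeds threshold(freq_hz), else None."""
    best = None
    for freq, level in peaks:
        if not (lo <= freq <= hi):
            continue                          # outside the search range
        if level - ref_amp > threshold(freq):
            if best is None or level > best[1]:
                best = (freq, level)
    return best


# Hypothetical peaks, with F1AMP = 40 db. as the reference level.
peaks = [(300.0, 40.0), (1200.0, 28.0), (1800.0, 10.0), (3000.0, 35.0)]
print(pick_by_threshold(peaks, 500.0, 2700.0, 40.0, lambda f: -20.0))
```

When the function returns `None`, no peak exceeded the threshold, which is the condition that triggers the further analysis described next.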
If no peaks are found which exceed the threshold, further analysis is called for. The fact that no peaks are located has been found to be a reliable indication that F1 and F2 are close together. Therefore the cepstrum is multiplied by the weighting function w1(nT) and a high resolution, narrow band spectrum is computed over the frequency range (F1 - 450) Hz. to (F1 + 450) Hz. (If F1 < 450 Hz., the range is 0 to 900 Hz.) This spectrum is evaluated along a circular arc of radius e' in the z-plane. This analysis generally produces a spectrum such as shown in FIG. 11, in which the two formants F1 and F2 are readily apparent.
The value of F1 is reassigned as the frequency of the highest level peak in the F1 region and F2 is the frequency of the next highest peak. If only one peak is found, F1 is arbitrarily set equal to the frequency of that peak and F2 = (F1 + 200) Hz.
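The reassignment rule for the enhanced spectrum can be sketched as below. For brevity the sketch ranks all peaks of the enhanced band by level and omits the check that the highest peak lies in the F1 region; that simplification, the function name, and the peak format are assumptions.

```python
def resolve_close_pair(enhanced_peaks, spacing_hz=200.0):
    """enhanced_peaks: list of (freq_hz, level_db) from the enhanced band.
    Returns (f_first, f_second), or None when the band contains no peak."""
    ranked = sorted(enhanced_peaks, key=lambda p: p[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        f = ranked[0][0]
        return (f, f + spacing_hz)       # single peak: second set 200 Hz. above
    return (ranked[0][0], ranked[1][0])  # highest peak, then next highest


print(resolve_close_pair([(400.0, 30.0), (600.0, 25.0)]))
print(resolve_close_pair([(500.0, 30.0)]))
```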
In searching for F3, a threshold on the difference in level between a possible F3 peak and the F2 peak is employed. In this case a fixed, frequency-independent threshold has been found satisfactory. If F1 is located without weighting the cepstrum with the w1(nT) function (i.e., F1 is not extremely low), the threshold on the difference is set at 17.3 db. (2.0 on a natural log scale). Otherwise, the threshold is effectively removed by setting it at 1,000 db.
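The correspondence between the db. figures and the natural log scale quoted in the text follows from the identity 20·log10(e) ≈ 8.686 db. per natural log unit. A short check, with helper names chosen here for illustration:

```python
import math

# One unit on a natural-log amplitude scale expressed in decibels.
DB_PER_NATURAL_LOG_UNIT = 20.0 / math.log(10.0)


def natural_log_to_db(x):
    """Convert a level difference on a natural-log scale to decibels."""
    return x * DB_PER_NATURAL_LOG_UNIT


print(round(natural_log_to_db(1.0), 2))   # threshold used for F1
print(round(natural_log_to_db(2.0), 2))   # threshold used for F3
```

The two values come out near 8.69 and 17.37 db., which the text quotes as 8.7 and 17.3 db.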
The estimation of F3 from the smoothed spectrum is then carried out. Because of equalization, there is a possibility of finding the F3 peak as F2. Thus, F2 is checked to see if it is greater than F3MN, the lower limit of the F3 region. If so, the search for F3 is extended to cover the combined F2-F3 region from F2MN to F3MX. Otherwise the frequency region F3MN to F3MX is searched. As before, a spectral peak is first checked to see if it is in the correct frequency range. Then the difference between the level of the peak being considered for an F3 peak and F2AMP is computed. The highest level peak which exceeds the threshold is chosen as the F3 peak. If no peak is found for F3, further analysis is again called for. It has been found that this situation is generally due to F2 and F3 being very close together. As before, an enhanced spectrum is computed by multiplying the cepstrum by window function w1(nT) and performing a spectrum analysis on the resultant, in this case over the frequency range (F2 - 450) Hz. to (F2 + 450) Hz. The result is normally a spectrum similar to that shown in FIG. 11, where F2 and F3 are clearly resolved. F2 is chosen to be the frequency of the highest peak and F3 to be the frequency of the next highest peak. If only one peak is found, that peak is arbitrarily called the F2 peak and F3 is set to (F2 + 200) Hz. (This may sometimes result in estimates of both F2 and F3 which are slightly high.) The final step in the process is to compare F2 and F3 and interchange their values if F2 is greater than F3.
The arrangement for estimating the three lowest formant frequencies of voiced speech, i.e., F1, F2, F3, has been found to perform well on vowels, glides, and semivowels. Although no attempt is made to deal with voiced stop consonants or nasal consonants, experience has shown that extremely natural sounding synthetic speech nevertheless may be produced with the limited class of control signals employed in this invention.
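The F3 search on the smoothed spectrum, described earlier in this passage, can be sketched as follows. Several points are assumptions rather than statements of the specification: the function name, the region limits in the example (F3MN = 1,500 Hz. and F3MX = 3,500 Hz. are placeholders), the reading of the fixed threshold as "no more than 17.3 db. below F2AMP", and the exclusion of the peak already chosen as F2 when the combined region is searched.

```python
def pick_f3(peaks, f2, f2amp, f2mn, f3mn, f3mx, f1_was_enhanced=False):
    """peaks: list of (freq_hz, level_db). Returns (f3, level) or None."""
    lo = f2mn if f2 > f3mn else f3mn            # extend range if F2 is high
    limit_db = 1000.0 if f1_was_enhanced else 17.3
    candidates = [(f, a) for f, a in peaks
                  if lo <= f <= f3mx and f != f2 and (f2amp - a) < limit_db]
    if not candidates:
        return None       # caller would compute the enhanced spectrum near F2
    return max(candidates, key=lambda p: p[1])


peaks = [(1000.0, 30.0), (2400.0, 22.0), (3300.0, 5.0)]
found = pick_f3(peaks, f2=1000.0, f2amp=30.0,
                f2mn=500.0, f3mn=1500.0, f3mx=3500.0)
print(found)

# If the F3 peak had been taken as F2, the combined search finds the true F2
# and the final interchange puts the two estimates in order.
swapped = pick_f3(peaks[:2], f2=2400.0, f2amp=22.0,
                  f2mn=500.0, f3mn=1500.0, f3mx=3500.0)
f2_final, f3_final = sorted([2400.0, swapped[0]])
print(f2_final, f3_final)
```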
Advantageously, the control signals may be stored or transmitted with greatly limited channel capacity, thus to achieve substantial economies.
Variations and modifications of the system described herein will occur to those skilled in the art.
FORTRAN SUBROUTINE FOR ENHANCING FORMANTS AND PICKING PEAKS
      SUBROUTINE ENHAN(C,X,NLCP,WR,WI,F0,FA,FB,Y,S0,FMN)
      DIMENSION C(1),X(1),WR(1),WI(1),Y(1)
      DIMENSION PLOCX(20),PAMPX(20)
      INTEGER ENV
      OMEG0=0.
      CALL ZERO(Y,1,128)
      CALL COPY(NLCP,C,X)
      CALL CZT(X,Y,NLCP,NOPTS,DSIG,DOMG,WR,WI,S0,OMEG0,0)
C     CZT IS A SUBROUTINE FOR SPECTRAL ANALYSIS WHICH IS BASED ON
C     THE PRINCIPLES SET FORTH IN RABINER, SCHAFER, AND RADER,
C     BSTJ, MAY-JUNE, 1969
      DO 5 Y= q2
      PLOCX(I)=2000.
      PAMPX(I)=2000.
      CALL PKFIND(ND,ND1,NOPTS,X,PLOCX,PAMPX,DOMG)
      PRINT 1,(PLOCX(I),PAMPX(I),I=1,Q)
      FORMAT(2F12.5)
      CALL PICK(TFA,FMN0,FMX0,PLOCX,PAMPX,THR,AMP,0)
      CALL PICK(TFB,FMN0,FMX0,PLOCX,PAMPX,THR,AMP,0)
      IF(TFB.EQ.0.0) GO TO 500
      IF(TFA.LT.TFB) GO TO 2000
      T=TFA
      TFA=TFB
      GO TO 2000
      CONTINUE
      TFB=TFA+200.
      CONTINUE
      FA=TFA+OMEG0
      FB=TFB+OMEG0
      CONTINUE
      CONTINUE
      FAMP=PKAMP(ILOC)
      F=TLOC
      RETURN
      END
FORTRAN SUBROUTINE FOR GROSS PEAK SEARCH
      SUBROUTINE GRGSP(NL,NU,ND,ND1,TAB,N1,N2)
      DIMENSION TAB(N2)
      ND2=ND/2
      DO 10 I=NL,NU,ND2
      I1=I-ND1
      SL1=TAB(I)-TAB(I1)
      SL2=TAB(I3)-TAB(I2)
      IF(SL1.GE.0.0.AND.SL2.LE.0.0) GO TO 20
      CONTINUE
      GO TO 30
      CONTINUE
      IF (SL1.EQ.0.0) IHSLMEQQOM) N2=I+2*ND
      CONTINUE
      RETURN
      END
FORTRAN SUBROUTINE FOR FINDING THE BIGGEST PEAK BETWEEN N1 & N2
      SUBROUTINE FINEPK(N1,N2,PKLOC,PKAMP,TAB)
      DIMENSION PKLOC(1),PKAMP(1),TAB(1)
      PKAMP=-10000.
      PKLOC=N1
      DO 10 I=N1,N2
      TMP=TAB(I)
      IF(TMP.LE.PKAMP) GO TO 10
      PKAMP=TMP
      CONTINUE
      RETURN
      END
      Imam/Lease; GO TO 3000
      CALL sPc'rENmoptstoomsevtx xmonso.0i
 3000 CONTINUE
      RETURN
      END
FORTRAN
      SUBROUTINE ZERO(TAB,NL,NU)
FORTRAN
C     SUBROUTINE FOR COPYING TABLES
      SUBROUTINE COPY(N,TAB1,TAB2)
      DIMENSION TAB1(1),TAB2(1)
      DO 10 I=1,N
      TAB2(I)=TAB1(I)
      CONTINUE
      RETURN
      END
means responsive to said peak representative signals for selecting as formants of said speech signal the highest amplitude peaks according to location within said ranges.
2. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
means for developing a signal representative of the cepstrum of an applied speech signal,
means for developing from said cepstrum signal a signal representative of the spectral envelope of said speech signal,
means for evaluating said spectral envelope signal along a contour close to the pole locations in the complex frequency plane thereby to produce a signal in which spectrum peaks are sharpened,
means responsive both to said spectral envelope signal and selectively to said cepstrum signal for developing signals representative of the location and amplitude of all peaks in said spectral envelope signal,
means responsive to said peak location signal for selecting and ordering in frequency the highest of said amplitude peaks, and
means for identifying said selected and ordered peak location signals as formants of said applied signal.
3. Speech analysis apparatus for locating formants of a voiced speech signal, which comprises:
means for developing a signal representative of the smoothed spectral envelope of an applied speech signal,
means for locating all peaks in said spectral envelope signal,
means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges, said ranges being selected to encompass a selected frequency range of said applied signal with prescribed segments of overlap,
means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,
means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and
means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
4. Apparatus as defined in claim 3, in combination with,
spectral analysis means for enhancing said peaks in said spectral envelope signal.
5. Speech analysis apparatus for locating formants of voiced speech signals, which comprises:
means for developing a signal representative of the pitch period of an applied speech signal,
means for selectively weighting said applied speech signals with a symmetric window function of said pitch period signal,
means supplied with said weighted speech signal for developing a signal representative of the smoothed spectral envelope of said applied speech signal,
means for locating all peaks in said spectral envelope signal,
means for developing signals representative of the location and amplitude of each of said located peaks within assigned frequency ranges,
means responsive to said peak location signals for selecting the highest amplitude peak in each of said ranges,
means for identifying as formants of said applied signal said selected peaks which occur in nonoverlapping segments of said ranges, and
means for identifying as formants of said applied signal the highest amplitude peaks according to their location, which occur in overlapping segments of said ranges.
6. Apparatus as defined in claim wherein said applied speech signals are weighted with a window function with a duration of approximately three times the pitch period of said applied speech signals.
8. A speech signal analyzer system for producing coded signals from applied speech signals, which comprises:
means for developing a signal representative of the smoothed spectrum of an applied speech signal,
means for locating all peaks in said spectrum,
means responsive to said located peaks and selectively to said spectrum for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said spectrum peaks in a prescribed order as formants of said applied signal,
means responsive to said spectrum for developing control signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,
means for developing a signal representative of the cepstrum of said applied speech signal,
means responsive to a count of zero axis crossings in said applied signal and to said cepstrum for developing a signal representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,
means responsive to said peak signals for developing a signal representative of the pole and zero locations for unvoiced intervals of said applied signal, and
means for utilizing all of said developed signals as a coded representation of said applied speech signal.
9. A speech signal analyzer-synthesizer system with reduced channel bandwidth requirements, which comprises:
at an analyzer station,
means for developing a signal representative of the smoothed spectrum of an applied speech signal,
means for locating all peaks in said spectrum,
means responsive to an indication of said located peaks and selectively to said spectrum signal for developing, during voiced intervals of said speech signal, control signals representative of the location of the highest of said amplitude peaks in a prescribed order as formants of said applied signal,
means responsive to said spectrum signal for developing signals representative of the level of said applied signal during voiced and unvoiced intervals, respectively,
means for developing a signal representative of the cepstrum of said applied signal,
means responsive to a count of zero axis crossings in said applied signal and to said cepstrum signal for developing signals representative of the voicing character of said applied signal and the pitch period of voiced intervals thereof,
means responsive to said peak signals for developing signals representative of the pole and zero locations for unvoiced intervals of said applied signal, and
means responsive to all of said developed signals for delivering them to a synthesizer station, and
at said synthesizer station,
means responsive to received unvoiced level control signals for adjusting the level of a source of noise signals,
a system of unvoiced resonant circuits energized by said adjusted noise signals,
means for adjusting said resonant system with said pole and zero location signals to produce an unvoiced signal,
generator means responsive to said pitch period control signal for developing pulses at pitch frequency,
means for adjusting the amplitude of said pulses according to said level control signal during voiced signals of said applied signal,
a system of resonant circuits energized by said control pulse signals and by said formant signals to produce a voiced signal,
means for combining said voiced and unvoiced signals,
means for shaping the spectrum of said combined signal,
means for utilizing said shaped spectrum signal as a replica of said applied speech signal.