Publication number: US H2172 H1
Publication type: Grant
Application number: US 10/186,605
Publication date: Sep 5, 2006
Filing date: Jul 2, 2002
Priority date: Jul 2, 2002
Publication number: 10186605, 186605, US H2172 H1, US H2172H1, US-H1-H2172, USH2172 H1, USH2172H1
Inventors: David H. Staelin, Carlos R. Cabrera-Mercader
Original Assignee: The United States Of America As Represented By The Secretary Of The Air Force
Pitch-synchronous speech processing
US H2172 H1
Abstract
The pitch-synchronous speech processing invention involves two main steps: 1) dividing the speech into pitch periods, or into pseudo pitch periods for unvoiced speech, where the breaks occur, for example, at the first zero-crossing preceding each glottal pulse for voiced speech and at any arbitrary point for unvoiced speech, and 2) computing the log-magnitude of the Discrete Fourier Transform (DFT) of each pitch-period waveform, and interpolating each log-magnitude spectrum to a common regular grid which can accommodate the spectrum of a waveform having the longest pitch period anticipated.
Claims(7)
1. A pitch-synchronous speech processing method for converting an acoustic data stream that contains periods of speech and periods of silence into a series of vectors that constitute a vector representation of the speech, the process comprising the steps of:
dividing the speech into pitch periods, or into pseudo pitch periods for unvoiced speech, where breaks occur, for example, at a first zero-crossing preceding each glottal pulse for voiced speech and at any arbitrary point for unvoiced speech, and
computing the log-magnitude of the Discrete Fourier Transform (DFT) of each pitch-period waveform, and interpolating each log-magnitude spectrum to a common regular grid which can accommodate a spectrum of a waveform having a pitch period.
2. A method as defined in claim 1, wherein said dividing step further comprises:
a silence detection substep in which periods of speech in the acoustic data stream are flagged with a speech identifier flag, and wherein the periods of silence in the acoustic data stream are flagged with a silence identifier flag.
3. A method as defined in claim 2, wherein said dividing step further comprises a pitch estimation substep in which samples of the acoustic data stream are taken and used to estimate pitch in the periods of speech identified with a speech identifier flag, and not in the periods of silence identified by a silence identifier flag, the pitch estimation substep outputting thereby a set of pitch estimates.
4. A method as defined in claim 3, wherein said dividing step further comprises a pitch period segmentor substep in which the acoustic data stream, pitch estimates, speech identifier flags and silence identifier flags are used to compute measurements of pitch period lengths and pitch period waveforms in the acoustic data stream.
5. A method as defined in claim 4, wherein said computing step further comprises:
a Fourier transform substep which produces output signals by performing Fourier transforms on the pitch period waveforms and outputting said Fourier transforms and pitch period lengths.
6. A method as defined in claim 5 wherein said computing step further comprises:
a log-magnitude computing substep which operates on the output signals of the Fourier transform substep to output thereby log-magnitude spectra of the acoustic data stream.
7. A method as defined in claim 6 wherein said computing step further comprises an interpolator substep which produces an output by interpolating the log-magnitude spectra of the acoustic data stream with the pitch period lengths of the acoustic data stream, the output signals of the interpolator substep being the series of vectors of the acoustic data thereon defined as a set of interpolated log-magnitude spectra values.
Description
STATEMENT OF GOVERNMENT INTEREST

The invention described herein may be manufactured and used by or for the Government for governmental purposes without the payment of any royalty thereon.

BACKGROUND OF THE INVENTION

The present invention relates generally to synthetic speech systems and more specifically to a pitch synchronous method of transforming speech into vectors for speech processing.

Signal processing for speech, speaker, or language recognition, or for other speech applications, generally consists of a pre-processing step that reduces the speech to a series of vectors, one per time interval, where that interval is typically chosen to lie between five and twenty msec, and successive intervals may overlap. The most commonly used vector representation is the mel cepstrum, which is the Discrete Fourier Transform (DFT) of the logarithm of the non-uniformly low-pass filtered sampled magnitude of the spectrum of that speech segment. The non-uniform filtering and sampling provide roughly constant Q for each channel. A typical output vector might have twenty-eight scalar elements.

The task of processing speech into preprocessing vectors is alleviated, to some extent, by the systems disclosed in the following U.S. Patents, the disclosures of which are incorporated herein by reference:

    • U.S. Pat. No. 5,008,941 issued to Sejnoha
    • U.S. Pat. No. 5,148,489 issued to Erell et al
    • U.S. Pat. No. 5,377,301 issued to Rosenberg et al
    • U.S. Pat. No. 5,469,529 issued to Bimbot et al
    • U.S. Pat. No. 5,598,505 issued to Austin et al
    • U.S. Pat. No. 5,727,124 issued to Lee et al
    • U.S. Pat. No. 5,745,872 issued to Sonmez et al
    • U.S. Pat. No. 5,768,474 issued to Neti
    • U.S. Pat. No. 5,924,065 issued to Eberman
    • U.S. Pat. No. 6,059,062 issued to Staelin

The Staelin patent is of interest because it describes a powered roller skating system that uses speech recognition sensors and synthesized speech data processing.

The best reference is the Eberman patent which shows a computerized speech processing system with speech signals stored in a vector codebook and processed to produce corrected vectors.

Generally, speech processing includes the following steps. In a first step, digitized speech signals are partitioned into time-aligned portions (frames) where acoustic features can generally be represented by linear predictive coefficient (LPC) “feature” vectors. In a second step, the vectors can be cleaned up using environmental acoustic data. That is, processes are applied to the vectors representing dirty speech signals so that a substantial amount of the noise and distortion is removed. The cleaned-up vectors, using statistical comparison methods, more closely resemble similar speech produced in a clean environment. Then in a third step, the cleaned feature vectors can be presented to a speech processing engine which determines how the speech is going to be used. Typically, the processing relies on the use of statistical models or neural networks to analyze and identify speech signal patterns.

In an alternative approach, the feature vectors remain dirty. Instead, the pre-stored statistical models or networks which will be used to process the speech are modified to resemble the characteristics of the feature vectors of dirty speech. This way a mismatch between clean and dirty speech, or their representative feature vectors can be reduced.

By applying the compensation to the processes (or speech processing engines) themselves, instead of to the data, i.e., the feature vectors, the speech analysis can be configured to solve a generalized maximum likelihood problem where the maximization is over both the speech signals and the environmental parameters.

The present invention is an alternate method and means for performing this first step of transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs.

SUMMARY OF THE INVENTION

The present invention is an alternate method and means for performing the first step of transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs, provided these subsequent steps are adapted to the new vectors with suitable training protocols and data.

The invention involves two main steps:

    • 1. divide the speech into pitch periods, or into pseudo pitch periods for unvoiced speech, where the breaks occur, for example, at the first zero-crossing preceding each glottal pulse for voiced speech and at any arbitrary point for unvoiced speech, and
    • 2. compute the log-magnitude of the Discrete Fourier Transform (DFT) of each pitch-period waveform, and interpolate each log-magnitude spectrum to a common regular grid which can accommodate the spectrum of a waveform having the longest pitch period anticipated.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration of a complete speech preprocessing system of the present invention;

FIG. 2 is a diagram of the pitch estimation component; and

FIG. 3 is a diagram of the output of the pitch period segmentor of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention is a speech processing system and process for transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs, provided these subsequent steps are adapted to the new vectors with suitable training protocols and data. A block diagram of the proposed process is illustrated in FIG. 1.

The process of FIG. 1 has two main steps:

    • 1. divide the speech into pitch periods, or into pseudo pitch periods for unvoiced speech, where the breaks occur, for example, at the first zero-crossing preceding each glottal pulse for voiced speech and at any arbitrary point for unvoiced speech, and
    • 2. compute the log-magnitude of the Discrete Fourier Transform (DFT) of each pitch-period waveform, and interpolate each log-magnitude spectrum to a common regular grid which can accommodate the spectrum of a waveform having the longest pitch period anticipated.

The process of FIG. 1 begins as acoustic data is processed for silence detection 100 to determine which part of the data stream has speech or silence. The speech sequence is converted into a stream of windows of LW speech samples each. The length LW should be comparable to the duration of a syllable. A given window is said to contain speech if its average power exceeds a suitably chosen threshold POW_TH and is otherwise classified as silence, e.g. POW_TH may equal the noise variance per sample.
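
The silence detection rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classify_windows`, the use of NumPy, the choice of non-overlapping windows, and the dropping of a trailing partial window are assumptions that the text does not specify.

```python
import numpy as np

def classify_windows(samples, lw, pow_th):
    """Flag each window of lw samples as speech (True) or silence (False).

    Assumptions (not stated in the text): windows do not overlap, and a
    trailing partial window is discarded.
    """
    samples = np.asarray(samples, dtype=float)
    n_windows = len(samples) // lw
    windows = samples[: n_windows * lw].reshape(n_windows, lw)
    # Average power of a window is the mean of its squared samples.
    avg_power = np.mean(windows ** 2, axis=1)
    # A window contains speech if its average power exceeds POW_TH.
    return avg_power > pow_th
```

For example, with LW = 768 and POW_TH = 1000, a window of zeros is flagged as silence, while a window of constant amplitude 100 (average power 10000) is flagged as speech.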

Once the portion of the data stream containing speech is flagged, the pitch estimator 200 can process the flagged data stream.

The pitch estimation component is illustrated in FIG. 2. The input data used to estimate the pitch is the stream of classified speech/silence windows, and the minimum and maximum anticipated pitch period, P_MIN and P_MAX respectively. A register of length K = ⌈2·P_MAX/LW⌉·LW is sequentially filled with samples from a contiguous sequence of windows containing speech until the capacity of the register is reached or a silence window is found on the input stream. Then the following operations are performed on the retrieved speech segment:

    • 1. The N-point DFT of the speech segment is computed with N = 2^(⌈log₂ K⌉ + 1), and the square-magnitude of each transform coefficient is computed to yield a power spectrum.
    • 2. The frequencies at which the power spectrum has local maxima are determined.
    • 3. A locally normalized spectral envelope is computed by dividing the value of the power spectrum at each peak by the geometric mean of the two adjacent peaks. For the first and last peaks the power spectrum is normalized by the value of the single adjacent peak.
    • 4. If there are no frequencies at which the normalized spectral envelope is greater than ten, the speech segment is declared to be unvoiced; otherwise it is declared to be voiced.
    • 5. For unvoiced speech segments the pitch is set to the default pitch P_DEF.
    • 6. For voiced speech segments a primary pitch estimate is extracted from the normalized spectral envelope using the following heuristic. If there are fewer than five normalized spectral peaks which exceed a threshold of ten, then the lowest frequency in that set of spectral values yields the primary pitch estimate. Alternatively, if there are five or more normalized spectral peaks greater than ten, one first finds the maximum normalized spectral peak from the set of frequencies which are lower than the lowest frequency satisfying the threshold condition. If such a maximum exists and is greater than five and occurs at a frequency which is within twenty percent of half the lower frequency at which the normalized spectrum is greater than ten, then the lower of the two frequencies gives the primary pitch estimate, otherwise the higher of the two frequencies is used as the primary pitch estimate.
    • 7. If the current and previous speech segments are not separated by silence and they were both declared as voiced, a secondary pitch estimate for the current segment is computed. First the mean and standard deviation of the ensemble of pitch period lengths of the previous speech segment are computed. If the standard deviation is less than ten percent of the mean and the mean is less than P_MAX, then the mean pitch period length for the previous segment is used as the secondary pitch estimate for the current speech segment.
    • 8. The final pitch estimate p_est for voiced speech segments is obtained as follows. If only the primary pitch estimate is available, it is used as the final estimate. When the secondary pitch estimate is also available the ratio of the primary estimate to the secondary estimate determines which of the two estimates is used as the final estimate. If the ratio is less than 1.3 and greater than 0.7, the primary estimate is used; otherwise the secondary estimate is used.
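
Parts of the pitch estimation procedure above can be sketched as follows. The sketch covers the power spectrum (step 1), the locally normalized peaks (steps 2 and 3), the voiced/unvoiced decision (step 4) and the choice between primary and secondary estimates (step 8). It does not implement the peak-selection heuristic of step 6. All function names are hypothetical, and the register and DFT lengths assume the readings K = ⌈2·P_MAX/LW⌉·LW and N = 2^(⌈log₂ K⌉ + 1), which are an interpretation of the formulas in the text rather than a confirmed transcription.

```python
import math
import numpy as np

VOICED_TH = 10.0  # threshold on the normalized spectral envelope (step 4)

def register_length(p_max, lw):
    # Assumed reading: a whole number of windows spanning at least 2*P_MAX.
    return math.ceil(2 * p_max / lw) * lw

def dft_length(k):
    # Assumed reading: power of two at least twice the register length.
    return 2 ** (math.ceil(math.log2(k)) + 1)

def power_spectrum(segment, n):
    # Square-magnitude of the N-point DFT (zero-padded), non-negative bins.
    return np.abs(np.fft.rfft(segment, n)) ** 2

def normalized_peaks(power):
    """Return (bin indices of local maxima, locally normalized peak values)."""
    power = np.asarray(power, dtype=float)
    idx = np.flatnonzero((power[1:-1] > power[:-2]) &
                         (power[1:-1] > power[2:])) + 1
    peaks = power[idx]
    norm = np.empty_like(peaks)
    if len(peaks) < 2:
        norm[:] = 1.0  # assumption: a lone peak has no neighbour to divide by
        return idx, norm
    # Interior peaks: divide by the geometric mean of the two adjacent peaks.
    norm[1:-1] = peaks[1:-1] / np.sqrt(peaks[:-2] * peaks[2:])
    # First and last peaks: divide by the single adjacent peak.
    norm[0] = peaks[0] / peaks[1]
    norm[-1] = peaks[-1] / peaks[-2]
    return idx, norm

def is_voiced(norm):
    # Voiced if any normalized peak exceeds ten, otherwise unvoiced.
    return bool(np.any(norm > VOICED_TH))

def final_pitch_estimate(primary, secondary=None):
    # Step 8: keep the primary estimate unless it deviates too far
    # from an available secondary estimate.
    if secondary is None:
        return primary
    ratio = primary / secondary
    return primary if 0.7 < ratio < 1.3 else secondary
```

With the Table 1 values at 48000 samples/sec (LW = 768, P_MAX = 1200), these readings give a register of 3072 samples and an 8192-point DFT.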

The speech segments are segmented further into pitch periods as follows.

    • 1. If the current speech segment is starting and the current and previous speech segments are separated by silence, find the maximum peak of the speech waveform in the time interval of duration P_MAX starting at the beginning of the current speech segment. Otherwise, find the maximum peak within the time interval starting 0.7*p_est time units ahead of the last located peak and ending 1.3*p_est time units ahead of the last located peak. Let s_max and t_max be the value and the time index of the located maximum, respectively.
    • 2. Find the minimum value of the speech waveform in the time interval of duration p_est/2 ending at t_max. Let s_min be the value of the located minimum.
    • 3. Position the time cursor at t_max.
    • 4. Move back along the time axis until a peak is found which lies above a line of slope 0.5*(s_max−s_min)/p_est passing through the current peak and is contained in the time interval of length 0.3*p_est ending at t_max.
    • 5. Repeat step 4 until another peak satisfying the specified conditions is not found. Let t_p be the time index of the last located peak.
    • 6. If the current speech segment is declared as unvoiced, the start of the current pseudo pitch period is the minimum of t_p and the start of the previous pitch period (pseudo pitch period) plus P_MAX if there is a preceding pitch period (pseudo pitch period), or the maximum of t_p and the start of the current speech segment if the current pseudo pitch period is the first one in the current speech segment and the current and previous speech segments are separated by silence.

TABLE 1
Parameter values used to generate the example discussed below. The symbol [*] denotes rounding to the nearest integer. The sampling rate was F_s = 48000 samples/sec.

Parameter    Value
POW_TH       1000
LW           [16 * F_s/1000]
P_MIN        [1.4 * F_s/1000]
P_MAX        [25 * F_s/1000]
P_DEF        [6 * F_s/1000]

    • 7. If the current speech segment is declared as voiced the following rules are used to determine the start of the current pitch period.
      • (a) If the current and previous speech segments are separated by silence and the current pitch period is the first one in the current speech segment, the start of the current pitch period is the maximum of the zero-crossing preceding t_p and the start of the current speech segment. If there is no zero-crossing, the start of the current pitch period is the start of the current speech segment.
      • (b) If the current and previous speech segments are adjacent in time and there is a zero-crossing between t_p and the start of the previous pitch period, the start of the current pitch period is the minimum of the zero-crossing immediately preceding t_p and the start of the previous pitch period plus P_MAX. If there is no zero-crossing between t_p and the start of the previous pitch period, the start of the current pitch period is the start of the previous pitch period plus p_est.

This procedure is repeated until the end of the current speech segment is reached. FIG. 3 shows the segmentation into pitch periods and pseudo pitch periods of a speech segment 100 msec long, where the breaks are indicated by asterisks.
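
Two building blocks of the segmentation procedure can be sketched as below: the search for the next waveform maximum between 0.7*p_est and 1.3*p_est ahead of the last located peak (step 1), and the search for the zero-crossing immediately preceding t_p (step 7). This is a simplified sketch, not the full procedure: the function names are hypothetical, time is measured in sample indices, the window bounds are rounded to integers, and the largest sample in the window stands in for the "maximum peak".

```python
import numpy as np

def next_peak(signal, last_peak, p_est):
    """Largest sample between 0.7*p_est and 1.3*p_est after last_peak.

    Returns (t_max, s_max), or None if the window lies past the signal end.
    """
    lo = last_peak + int(round(0.7 * p_est))
    hi = min(last_peak + int(round(1.3 * p_est)), len(signal) - 1)
    if lo > hi:
        return None
    t_max = lo + int(np.argmax(signal[lo:hi + 1]))
    return t_max, signal[t_max]

def preceding_zero_crossing(signal, t_p, start):
    """Latest index t with start < t <= t_p where the sign changes
    between samples t-1 and t, or None if there is no such index."""
    for t in range(t_p, start, -1):
        if (signal[t - 1] < 0) != (signal[t] < 0):
            return t
    return None
```

When `preceding_zero_crossing` returns `None`, rule 7(b) falls back to the start of the previous pitch period plus p_est, and rule 7(a) falls back to the start of the current speech segment.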

For each pitch period or pseudo pitch period the N-point DFT is computed with N equal to the length of the period in question and the log-magnitude of each transform coefficient is computed. Finally, each log-magnitude spectrum is linearly interpolated to a common regular grid with frequency resolution 1/P_MAX.
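
The final step can be sketched as follows, under several assumptions that the text leaves open: only the non-negative frequency bins are kept (the input is real, so the spectrum is symmetric), the natural logarithm is used, a small constant guards against log(0), and the common grid runs from zero to half the sampling rate in steps of 1/P_MAX cycles per sample. The function name `pitch_period_vector` is hypothetical.

```python
import numpy as np

def pitch_period_vector(period, p_max, eps=1e-12):
    """Log-magnitude spectrum of one pitch period on a common grid.

    The DFT length equals the period length, so its bins are spaced
    1/N cycles per sample; the output grid is spaced 1/P_MAX.
    """
    period = np.asarray(period, dtype=float)
    n = len(period)
    spectrum = np.fft.rfft(period)             # N-point DFT, bins 0..N//2
    log_mag = np.log(np.abs(spectrum) + eps)   # eps is an assumption
    freqs = np.arange(len(spectrum)) / n       # bin frequencies
    grid = np.arange(p_max // 2 + 1) / p_max   # common regular grid
    # Linear interpolation; grid points beyond the last bin take its value.
    return np.interp(grid, freqs, log_mag)
```

Every pitch period, whatever its length, then maps to a vector of the same size (601 elements for P_MAX = 1200), which is what allows the vectors to be fed to later processing stages.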

One example of the invention illustrated the pitch-synchronous spectral representation of the sentence “The little blankets lay around on the floor.” as delivered by a female speaker. The speech was sampled at a rate of F_s=48000 samples/sec with 16-bit resolution. The values of the parameters used to generate this example are listed in Table 1.
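
As a worked check of Table 1, the parameter values at this sampling rate can be computed directly. The helper name `table1_parameters` is hypothetical, and Python's built-in `round` is used for the [*] operator, which is an assumption about how ties would be handled (none occur at 48000 samples/sec).

```python
def table1_parameters(fs):
    """Evaluate the Table 1 expressions for a sampling rate fs (samples/sec)."""
    per_ms = fs / 1000  # samples per millisecond
    return {
        "POW_TH": 1000,
        "LW": round(16 * per_ms),      # window length: 16 msec
        "P_MIN": round(1.4 * per_ms),  # shortest pitch period: 1.4 msec
        "P_MAX": round(25 * per_ms),   # longest pitch period: 25 msec
        "P_DEF": round(6 * per_ms),    # default period for unvoiced speech
    }

params = table1_parameters(48000)
```

At F_s = 48000 this gives LW = 768, P_MIN = 67, P_MAX = 1200 and P_DEF = 288 samples; the 25 msec maximum period corresponds to a lowest anticipated pitch of 40 Hz.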

While the invention has been described in its presently preferred embodiment it is understood that the words which have been used are words of description rather than words of limitation and that changes within the purview of the appended claims may be made without departing from the scope and spirit of the invention in its broader aspects.

Patent Citations
Cited Patent | Filing date | Publication date | Applicant | Title
US4885790 * | Apr 18, 1989 | Dec 5, 1989 | Massachusetts Institute Of Technology | Processing of acoustic waveforms
US5008941 | Mar 31, 1989 | Apr 16, 1991 | Kurzweil Applied Intelligence, Inc. | Method and apparatus for automatically updating estimates of undesirable components of the speech signal in a speech recognition system
US5023910 * | Apr 8, 1988 | Jun 11, 1991 | At&T Bell Laboratories | Vector quantization in a harmonic speech coding arrangement
US5148489 | Mar 9, 1992 | Sep 15, 1992 | Sri International | Method for spectral estimation to improve noise robustness for speech recognition
US5377301 | Jan 21, 1994 | Dec 27, 1994 | At&T Corp. | Technique for modifying reference vector quantized speech feature signals
US5469529 | Sep 21, 1993 | Nov 21, 1995 | France Telecom Establissement Autonome De Droit Public | Process for measuring the resemblance between sound samples and apparatus for performing this process
US5548680 * | May 17, 1994 | Aug 20, 1996 | Sip-Societa Italiana Per L'esercizio Delle Telecomunicazioni P.A. | Method and device for speech signal pitch period estimation and classification in digital speech coders
US5598505 * | Sep 30, 1994 | Jan 28, 1997 | Apple Computer, Inc. | Cepstral correction vector quantizer for speech recognition
US5727124 | Jun 21, 1994 | Mar 10, 1998 | Lucent Technologies, Inc. | Method of and apparatus for signal recognition that compensates for mismatching
US5745872 | May 7, 1996 | Apr 28, 1998 | Texas Instruments Incorporated | Method and system for compensating speech signals using vector quantization codebook adaptation
US5768474 | Dec 29, 1995 | Jun 16, 1998 | International Business Machines Corporation | Method and system for noise-robust speech processing with cochlea filters in an auditory model
US5832437 * | Aug 16, 1995 | Nov 3, 1998 | Sony Corporation | Continuous and discontinuous sine wave synthesis of speech signals from harmonic data of different pitch periods
US5924065 * | Jun 16, 1997 | Jul 13, 1999 | Digital Equipment Corporation | Computerized method
US5933808 * | Nov 7, 1995 | Aug 3, 1999 | The United States Of America As Represented By The Secretary Of The Navy | Method and apparatus for generating modified speech from pitch-synchronous segmented speech waveforms
US6029133 * | Sep 15, 1997 | Feb 22, 2000 | Tritech Microelectronics, Ltd. | Pitch synchronized sinusoidal synthesizer
US6059062 | May 31, 1995 | May 9, 2000 | Empower Corporation | Powered roller skates
US6418408 * | Apr 4, 2000 | Jul 9, 2002 | Hughes Electronics Corporation | Frequency domain interpolative speech codec system
US6463406 * | May 20, 1996 | Oct 8, 2002 | Texas Instruments Incorporated | Fractional pitch method
US6678655 * | Nov 12, 2002 | Jan 13, 2004 | International Business Machines Corporation | Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope
US6871176 * | Jul 26, 2001 | Mar 22, 2005 | Freescale Semiconductor, Inc. | Phase excited linear prediction encoder
US6885986 * | May 7, 1999 | Apr 26, 2005 | Koninklijke Philips Electronics N.V. | Refinement of pitch detection
Referenced by
Citing Patent | Filing date | Publication date | Applicant | Title
US20140200889 * | Mar 17, 2014 | Jul 17, 2014 | Chengjun Julian Chen | System and Method for Speech Recognition Using Pitch-Synchronous Spectral Parameters
Classifications
U.S. Classification: 704/207
International Classification: G10L15/00
Cooperative Classification: G10L25/93, G10L19/097
European Classification: G10L19/097
Legal Events
Date | Code | Event | Description
Sep 18, 2002 | AS | Assignment | Owner name: GOVERNMENT OF THE UNITED STATES OF AMERICA AS REPR; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:STAELIN, DAVID H.;CABRERA-MERCADER, CARLOS R.;REEL/FRAME:013297/0546;SIGNING DATES FROM 20020521 TO 20020613