|Publication number||USH2172 H1|
|Application number||US 10/186,605|
|Publication date||Sep 5, 2006|
|Filing date||Jul 2, 2002|
|Priority date||Jul 2, 2002|
|Publication number||10186605, 186605, US H2172 H1, US H2172H1, US-H1-H2172, USH2172 H1, USH2172H1|
|Inventors||David H. Staelin, Carlos R. Cabrera-Mercader|
|Original Assignee||The United States Of America As Represented By The Secretary Of The Air Force|
The invention described herein may be manufactured and used by or for the Government for governmental purposes without the payment of any royalty thereon.
The present invention relates generally to synthetic speech systems and more specifically to a pitch synchronous method of transforming speech into vectors for speech processing.
Signal processing for speech, speaker, or language recognition, or for other speech applications, generally begins with a pre-processing step that reduces the speech to a series of vectors, one per time interval. That interval is typically chosen to lie between five and twenty msec, and successive intervals may overlap. The most commonly used vector representation is the mel cepstrum, which is the Discrete Fourier Transform (DFT) of the logarithm of the non-uniformly low-pass filtered sampled magnitude of the spectrum of that speech segment. The non-uniform filtering and sampling provide roughly constant Q for each channel. A typical output vector might have twenty-eight scalar elements.
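The conventional fixed-interval framing described above can be sketched as follows. This is a minimal illustration only: the function name `frame_signal`, the 20 msec frame length, and the 10 msec hop are assumptions chosen for the example and are not taken from the disclosure.

```python
def frame_signal(samples, fs, frame_ms=20.0, hop_ms=10.0):
    """Split a sampled signal into fixed-length, possibly overlapping frames.

    frame_ms and hop_ms are illustrative assumptions; a hop shorter than
    the frame length makes successive frames overlap.
    """
    frame_len = round(frame_ms * fs / 1000)  # samples per frame
    hop = round(hop_ms * fs / 1000)          # samples between frame starts
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


# One second of placeholder samples at 48000 samples/sec.
frames = frame_signal(list(range(48000)), fs=48000)
print(len(frames), len(frames[0]))  # 99 960
```

Each frame would then be reduced to one vector, for example a mel cepstrum. The frame boundaries here are set by the clock alone, independent of the pitch of the speech.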
The task of processing speech into preprocessing vectors is alleviated, to some extent, by the systems disclosed in the following U.S. Patents, the disclosures of which are incorporated herein by reference:
The Staelin patent is of interest in that it discloses a powered roller skating system that uses speech recognition sensors and synthesized speech data processing.
The best reference is the Eberman patent, which shows a computerized speech processing system in which speech signals are stored in a vector codebook and processed to produce corrected vectors.
Generally, speech processing includes the following steps. In a first step, digitized speech signals are partitioned into time-aligned portions (frames) where acoustic features can generally be represented by linear predictive coefficient (LPC) “feature” vectors. In a second step, the vectors can be cleaned up using environmental acoustic data. That is, processes are applied to the vectors representing dirty speech signals so that a substantial amount of the noise and distortion is removed. The cleaned-up vectors, using statistical comparison methods, more closely resemble similar speech produced in a clean environment. Then in a third step, the cleaned feature vectors can be presented to a speech processing engine which determines how the speech is going to be used. Typically, the processing relies on the use of statistical models or neural networks to analyze and identify speech signal patterns.
In an alternative approach, the feature vectors remain dirty. Instead, the pre-stored statistical models or networks which will be used to process the speech are modified to resemble the characteristics of the feature vectors of dirty speech. In this way, a mismatch between clean and dirty speech, or between their representative feature vectors, can be reduced.
By applying the compensation to the processes (or speech processing engines) themselves, instead of to the data, i.e., the feature vectors, the speech analysis can be configured to solve a generalized maximum likelihood problem where the maximization is over both the speech signals and the environmental parameters.
The present invention is an alternate method and means for performing this first step of transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs.
The present invention is an alternate method and means for performing the first step of transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs, provided these subsequent steps are adapted to the new vectors with suitable training protocols and data.
The invention involves two main steps:
The present invention is a speech processing system and process for transforming speech into a standard series of vectors where each vector represents the sampled magnitude of the spectrum of one pitch period for voiced speech or one pseudo pitch period for unvoiced speech. The subsequent speech processing steps can then be performed with these new vectors as inputs, provided these subsequent steps are adapted to the new vectors with suitable training protocols and data. A block diagram of the proposed process is illustrated in FIG. 1.
The process of
Once the portion of the data stream containing speech is flagged, the pitch estimator 200 can process the flagged data stream.
The pitch estimation component is illustrated in FIG. 2. The inputs used to estimate the pitch are the stream of classified speech/silence windows and the minimum and maximum anticipated pitch periods, P_MIN and P_MAX respectively. A register of length K=⌈2P_MAX/LW⌉ LW1 is sequentially filled with samples from a contiguous sequence of windows containing speech until the capacity of the register is reached or a silence window is found on the input stream. Then the following operations are performed on the retrieved speech segment:
The speech segments are segmented further into pitch periods as follows.
TABLE 1. Parameter values used to generate the example discussed below. The symbol [*] denotes rounding to the nearest integer. The sampling rate was F_s = 48000 samples/sec.
[16 * F_s/1000]
[1.4 * F_s/1000]
[25 * F_s/1000]
[6 * F_s/1000]
This procedure is repeated until the end of the current speech segment is reached.
For each pitch period or pseudo pitch period the N-point DFT is computed with N equal to the length of the period in question and the log-magnitude of each transform coefficient is computed. Finally, each log-magnitude spectrum is linearly interpolated to a common regular grid with frequency resolution 1/P_MAX.
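The spectral step described above can be sketched as follows, using only the Python standard library. This is a hedged illustration under stated assumptions, not the disclosed implementation: the function names are hypothetical, P_MAX is taken in samples so that the common grid has spacing 1/P_MAX cycles per sample, only the non-negative frequency bins are kept, a small `eps` guards against the logarithm of zero, and grid points beyond the last available bin hold that bin's value.

```python
import cmath
import math


def log_magnitude_spectrum(period, eps=1e-12):
    """Log-magnitude of the N-point DFT of one period, N = len(period).

    Returns bins 0..N//2. eps is an assumption that avoids log(0).
    """
    n = len(period)
    out = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(period))
        out.append(math.log(abs(acc) + eps))
    return out


def resample_to_common_grid(log_mag, n, p_max):
    """Linearly interpolate bins at k/n cycles/sample onto m/p_max."""
    last = len(log_mag) - 1
    out = []
    for m in range(p_max // 2 + 1):
        pos = m * n / p_max           # fractional source-bin index
        lo = min(int(pos), last)
        hi = min(lo + 1, last)        # hold the last bin past the end
        frac = min(pos - lo, 1.0)
        out.append((1.0 - frac) * log_mag[lo] + frac * log_mag[hi])
    return out


def pitch_synchronous_vectors(periods, p_max):
    """One equal-length vector per pitch period or pseudo pitch period."""
    return [resample_to_common_grid(log_magnitude_spectrum(p), len(p), p_max)
            for p in periods]


# Two toy "periods" of different lengths, with P_MAX = 8 samples.
periods = [[1.0] * 4, [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]]
vectors = pitch_synchronous_vectors(periods, p_max=8)
print([len(v) for v in vectors])  # [5, 5]
```

Because every period is interpolated onto the same grid, periods of different lengths yield vectors of equal dimension, which is what allows them to serve as a standard series of input vectors for later processing.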
In one example, the invention was used to produce the pitch-synchronous spectral representation of the sentence “The little blankets lay around on the floor.” as delivered by a female speaker. The speech was sampled at a rate of F_s=48000 samples/sec with 16-bit resolution. The values of the parameters used to generate this example are listed in Table 1.
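As a worked check, the four bracketed expressions of Table 1 can be evaluated at F_s = 48000 samples/sec. The sketch below is illustrative only: the helper name `ms_to_samples` is hypothetical, and the entries are left unnamed because the table as reproduced here gives only their values.

```python
F_S = 48000  # sampling rate in samples/sec, as stated for the example


def ms_to_samples(ms, fs=F_S):
    """Evaluate [ms * F_s / 1000], rounding to the nearest integer.

    Python's round() resolves exact .5 ties to the even integer; none of
    the Table 1 entries is an exact tie, so this does not matter here.
    """
    return round(ms * fs / 1000)


table_1_ms = [16, 1.4, 25, 6]  # durations in msec from Table 1
print([ms_to_samples(ms) for ms in table_1_ms])  # [768, 67, 1200, 288]
```

Only the 1.4 msec entry actually requires rounding (67.2 samples becomes 67); the other three are exact integers at this sampling rate.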
While the invention has been described in its presently preferred embodiment, it is understood that the words which have been used are words of description rather than words of limitation, and that changes within the purview of the appended claims may be made without departing from the scope and spirit of the invention in its broader aspects.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US4885790 *||Apr 18, 1989||Dec 5, 1989||Massachusetts Institute Of Technology||Processing of acoustic waveforms|
|US5008941||Mar 31, 1989||Apr 16, 1991||Kurzweil Applied Intelligence, Inc.||Method and apparatus for automatically updating estimates of undesirable components of the speech signal in a speech recognition system|
|US5023910 *||Apr 8, 1988||Jun 11, 1991||At&T Bell Laboratories||Vector quantization in a harmonic speech coding arrangement|
|US5148489||Mar 9, 1992||Sep 15, 1992||Sri International||Method for spectral estimation to improve noise robustness for speech recognition|
|US5377301||Jan 21, 1994||Dec 27, 1994||At&T Corp.||Technique for modifying reference vector quantized speech feature signals|
|US5469529||Sep 21, 1993||Nov 21, 1995||France Telecom Establissement Autonome De Droit Public||Process for measuring the resemblance between sound samples and apparatus for performing this process|
|US5548680 *||May 17, 1994||Aug 20, 1996||Sip-Societa Italiana Per L'esercizio Delle Telecomunicazioni P.A.||Method and device for speech signal pitch period estimation and classification in digital speech coders|
|US5598505 *||Sep 30, 1994||Jan 28, 1997||Apple Computer, Inc.||Cepstral correction vector quantizer for speech recognition|
|US5727124||Jun 21, 1994||Mar 10, 1998||Lucent Technologies, Inc.||Method of and apparatus for signal recognition that compensates for mismatching|
|US5745872||May 7, 1996||Apr 28, 1998||Texas Instruments Incorporated||Method and system for compensating speech signals using vector quantization codebook adaptation|
|US5768474||Dec 29, 1995||Jun 16, 1998||International Business Machines Corporation||Method and system for noise-robust speech processing with cochlea filters in an auditory model|
|US5832437 *||Aug 16, 1995||Nov 3, 1998||Sony Corporation||Continuous and discontinuous sine wave synthesis of speech signals from harmonic data of different pitch periods|
|US5924065 *||Jun 16, 1997||Jul 13, 1999||Digital Equipment Corporation||Environmently compensated speech processing|
|US5933808 *||Nov 7, 1995||Aug 3, 1999||The United States Of America As Represented By The Secretary Of The Navy||Method and apparatus for generating modified speech from pitch-synchronous segmented speech waveforms|
|US6029133 *||Sep 15, 1997||Feb 22, 2000||Tritech Microelectronics, Ltd.||Pitch synchronized sinusoidal synthesizer|
|US6059062||May 31, 1995||May 9, 2000||Empower Corporation||Powered roller skates|
|US6418408 *||Apr 4, 2000||Jul 9, 2002||Hughes Electronics Corporation||Frequency domain interpolative speech codec system|
|US6463406 *||May 20, 1996||Oct 8, 2002||Texas Instruments Incorporated||Fractional pitch method|
|US6678655 *||Nov 12, 2002||Jan 13, 2004||International Business Machines Corporation||Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope|
|US6871176 *||Jul 26, 2001||Mar 22, 2005||Freescale Semiconductor, Inc.||Phase excited linear prediction encoder|
|US6885986 *||May 7, 1999||Apr 26, 2005||Koninklijke Philips Electronics N.V.||Refinement of pitch detection|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8942977 *||Mar 17, 2014||Jan 27, 2015||Chengjun Julian Chen||System and method for speech recognition using pitch-synchronous spectral parameters|
|US9135923 *||Jan 26, 2015||Sep 15, 2015||Chengjun Julian Chen||Pitch synchronous speech coding based on timbre vectors|
|US9196263 *||Dec 29, 2010||Nov 24, 2015||Synvo Gmbh||Pitch period segmentation of speech signals|
|US20130144612 *||Dec 29, 2010||Jun 6, 2013||Synvo Gmbh||Pitch Period Segmentation of Speech Signals|
|US20140200889 *||Mar 17, 2014||Jul 17, 2014||Chengjun Julian Chen||System and Method for Speech Recognition Using Pitch-Synchronous Spectral Parameters|
|Cooperative Classification||G10L25/93, G10L19/097|
|Sep 18, 2002||AS||Assignment|
Owner name: GOVERNMENT OF THE UNITED STATES OF AMERICA AS REPR
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:STAELIN, DAVID H.;CABRERA-MERCADER, CARLOS R.;REEL/FRAME:013297/0546;SIGNING DATES FROM 20020521 TO 20020613