US 6226606 B1
Abstract
In a method for tracking pitch in a speech signal, first and second window vectors are created from samples taken across first and second windows of the speech signal. The first window is separated from the second window by a test pitch period. The energy of the speech signal in the first window is combined with the correlation between the first window vector and the second window vector to produce a predictable energy factor. The predictable energy factor is then used to determine a pitch score for the test pitch period. Based in part on the pitch score, a portion of the pitch track is identified.
Claims (36)
1. A method for tracking pitch in a speech signal, the method comprising:
sampling the speech signal across a first time window that is centered at a first time mark to produce a first window vector;
sampling the speech signal across a second time window that is centered at a second time mark to produce a second window vector, the second time mark separated from the first time mark by a test pitch period;
calculating an energy value indicative of the energy of the portion of the speech signal represented by the first window vector;
calculating a cross-correlation value based on the first window vector and the second window vector;
combining the energy value and the cross-correlation value to produce a predictable energy factor;
determining a pitch score for the test pitch period based in part on the predictable energy factor; and
identifying at least a portion of a pitch track based in part on the pitch score.
2. The method of claim 1 wherein sampling the speech signal across a first time window comprises sampling the speech signal across a first time window that is the same length as the test pitch period.
3. The method of claim 2 wherein sampling the speech signal across the second time window comprises sampling the speech signal across a second time window that is the same length as the test pitch period.
4. The method of claim 1 wherein calculating the cross-correlation value comprises dividing the scalar product of the first window vector and a second window vector by magnitudes of the first window vector and second window vector to produce an initial cross-correlation value.
5. The method of claim 4 wherein calculating the cross-correlation value further comprises setting the cross-correlation value equal to the initial cross-correlation value.
6. The method of claim 4 wherein calculating the cross-correlation value further comprises setting the cross-correlation value to zero if the initial cross-correlation value is less than zero.
7. The method of claim 4 further comprising sampling the speech signal across a third time window that is centered at a third time mark to produce a third window vector, the third time mark separated from the first time mark by the test pitch period.
8. The method of claim 7 wherein calculating the cross-correlation value further comprises:
calculating a second cross-correlation value based on the first window vector and the third window vector;
comparing the initial cross-correlation value to the second cross-correlation value; and
setting the cross-correlation value equal to the second cross-correlation value if the second cross-correlation value indicates more correlation than the initial cross-correlation value and otherwise setting the cross-correlation value equal to the initial cross-correlation value.
9. The method of claim 4 wherein calculating the cross-correlation value further comprises:
sampling the speech signal across a first harmonic time window that is centered at the first time mark to produce a first harmonic window vector;
sampling the speech signal across a second harmonic time window that is centered at a second harmonic time mark to produce a second harmonic window vector, the second harmonic time mark separated from the first time mark by one-half the test pitch period;
calculating a harmonic cross-correlation value based on the first harmonic window vector and the second harmonic window vector;
multiplying the harmonic cross-correlation value by a reduction factor to produce a harmonic reduction value; and
subtracting the harmonic reduction value from the initial cross-correlation value and setting the cross-correlation value equal to the difference.
10. The method of claim 1 wherein determining a pitch score comprises determining the probability that the test pitch period is an actual pitch period for a portion of the speech signal centered at the first time mark.
11. The method of claim 10 wherein determining the probability that the test pitch period is the actual pitch period comprises adding the predictable energy factor to a transition probability that indicates the probability of transitioning from a preceding pitch period to the test pitch period.
12. The method of claim 11 further comprising determining a plurality of pitch scores with one pitch score for each possible transition from a plurality of preceding pitch periods to the test pitch period.
13. The method of claim 12 further comprising combining the plurality of pitch scores with past pitch scores to produce pitch track scores, each pitch track score indicative of the probability that a test pitch track is equal to an actual pitch track of the speech signal.
14. The method of claim 13 wherein identifying the pitch track comprises identifying the pitch track associated with the highest pitch track score.
15. The method of claim 1 further comprising determining if the first time marker is in a voiced region of the speech signal.
16. The method of claim 15 wherein determining if the first time marker is in a voiced region of the speech signal comprises determining a probability that the first time marker is in a voiced region based on the energy value and the cross-correlation value.
17. In a computer speech system designed to perform speech functions, a pitch tracker comprising:
a window sampling unit for constructing a current window vector and a previous window vector from a respective current window and previous window of the speech signal, the center of the current window separated from the center of the previous window by a test pitch period;
an energy calculator for calculating the total energy of the current window;
a cross-correlation calculator for calculating a cross-correlation value based on the current window vector and the previous window vector;
a multiplier for multiplying the total energy by the cross-correlation value to produce a predictable energy factor;
a pitch score generator for generating a pitch score based on the predictable energy; and
a pitch track identifier for identifying at least a portion of a pitch track for the speech signal based at least in part on the pitch score.
18. The pitch tracker of claim 17 wherein the computer speech system is a speech synthesis system.
19. The pitch tracker of claim 17 wherein the computer speech system is a speech coder.
20. A method for tracking pitch in a speech signal, the method comprising:
sampling a first waveform in the speech signal;
sampling a second waveform in the speech signal, the center of the first waveform separated from the center of the second waveform by a test pitch period;
creating a correlation value indicative of the degree of similarity between the first waveform and the second waveform through steps comprising:
determining the cross-correlation between the first waveform and the second waveform;
determining the energy of the first waveform; and
multiplying the cross-correlation by the energy to produce the correlation value;
creating a pitch-contouring factor indicative of the similarity between the test pitch period and a previous pitch period;
combining the correlation value and the pitch-contouring factor to produce a pitch score for transitioning from the previous pitch period to the test pitch period; and
identifying a portion of a pitch track based on at least one pitch score.
21. The method of claim 20 wherein determining the cross-correlation comprises creating a first window vector based on samples of the first waveform and creating a second window vector based on samples of the second waveform.
22. The method of claim 21 wherein determining the cross-correlation further comprises dividing a scalar product of the first window vector and the second window vector by magnitudes of the first window vector and second window vector to produce an initial cross-correlation value.
23. The method of claim 22 wherein determining the cross-correlation further comprises setting the cross-correlation equal to the initial cross-correlation value.
24. The method of claim 22 wherein determining the cross-correlation further comprises setting the cross-correlation to zero if the initial cross-correlation value is less than zero.
25. The method of claim 22 further comprising:
sampling a third waveform in the speech signal, the center of the third waveform separated from the center of the first waveform by the test pitch period; and
creating a third window vector based on samples of the third waveform.
26. The method of claim 25 wherein determining the cross-correlation further comprises:
calculating a second cross-correlation value based on the first window vector and the third window vector;
comparing the initial cross-correlation value to the second cross-correlation value; and
setting the cross-correlation equal to the second cross-correlation value if the second cross-correlation value is higher than the initial cross-correlation value and otherwise setting the cross-correlation equal to the initial cross-correlation value.
27. The method of claim 22 wherein determining the cross-correlation further comprises:
sampling a first harmonic waveform and creating a first harmonic window vector based on samples of the first harmonic waveform;
sampling a second harmonic waveform and creating a second harmonic window vector based on samples of the second harmonic waveform, the center of the second harmonic waveform separated from the center of the first harmonic waveform by one-half the test pitch period;
calculating a harmonic cross-correlation value based on the first harmonic window vector and the second harmonic window vector;
multiplying the harmonic cross-correlation value by a reduction factor to produce a harmonic reduction value; and
subtracting the harmonic reduction value from the initial cross-correlation value and setting the cross-correlation equal to the difference.
28. The method of claim 20 wherein the length of the first waveform is equal to the test pitch period.
29. The method of claim 20 wherein creating the pitch-contouring factor comprises subtracting the test pitch period from the previous pitch period.
30. The method of claim 29 wherein combining the correlation value and the pitch-contouring factor comprises subtracting the pitch-contouring factor from the correlation value.
31. The method of claim 20 wherein identifying a portion of a pitch track comprises determining a plurality of pitch scores for at least two test pitch tracks, with one pitch score for each pitch transition in each test pitch track.
32. The method of claim 31 wherein identifying a portion of a pitch track further comprises summing together the pitch scores of each test pitch track and selecting the test pitch track with the highest sum as the pitch track for the speech signal.
33. For use in a computer system, a pitch tracker capable of determining if a region of a speech signal is a voiced region, the pitch tracker comprising:
a sampler for sampling a first waveform and a second waveform;
a correlation calculator for calculating a correlation between the first waveform and the second waveform;
an energy calculator for calculating the total energy of the first waveform; and
a region identifier for identifying a region of the speech signal as a voiced region if the correlation between the first waveform and the second waveform is high and the total energy of the first waveform is high.
34. A pitch tracking system for tracking pitch in a speech signal, the system comprising:
a window sampler for creating samples of a first waveform and a second waveform in the speech signal;
a correlation calculator for creating a correlation value indicative of the degree of similarity between the first waveform and the second waveform through steps comprising:
determining the cross-correlation between the first waveform and the second waveform;
determining the energy of the first waveform; and
multiplying the cross-correlation by the energy to produce the correlation value;
a pitch-contour calculator for calculating a pitch-contouring factor indicative of the similarity between a test pitch period and a previous pitch period;
a pitch score calculator for calculating a pitch score based on the correlation value and the pitch-contouring factor; and
a pitch track identifier for identifying a pitch track based on the pitch score.
35. A method of determining if a region of a speech signal is a voiced region, the method comprising:
sampling a first waveform and a second waveform of the speech signal;
determining the correlation between the first waveform and the second waveform;
determining the total energy of the first waveform; and
determining that the region is a voiced region if the total energy of the first waveform and the correlation between the first waveform and the second waveform are both high.
36. The method of claim 35 further comprising determining that a region of the speech signal is an unvoiced region if the total energy of the first waveform and the correlation between the first waveform and the second waveform are both low.
Description
The present invention relates to computer speech systems. In particular, the present invention relates to pitch tracking in computer speech systems.
Computers are currently being used to perform a number of speech-related functions including transmitting human speech over computer networks, recognizing human speech, and synthesizing speech from input text. To perform these functions, computers must be able to recognize the various components of human speech. One of these components is the pitch or melody of speech, which is created by the vocal cords of the speaker during voiced portions of speech. Examples of pitch can be heard in vowel sounds such as the “ih” sound in “six”.
The pitch in human speech appears in the speech signal as a nearly repeating waveform that is a combination of multiple sine waves at different frequencies. The period between these nearly repeating waveforms determines the pitch.
To identify pitch in a speech signal, the prior art uses pitch trackers. A comprehensive study of pitch tracking is presented in “A Robust Algorithm for Pitch Tracking (RAPT)”, D. Talkin, Speech Coding and Synthesis, pp. 495-518, Elsevier, 1995. One such pitch tracker identifies two portions of the speech signal that are separated by a candidate pitch period and compares the two portions to each other. If the candidate pitch period is equal to the actual pitch of the speech signal, the two portions will be nearly identical to each other. This comparison is generally performed using a cross-correlation technique that compares multiple samples of each portion to each other.
Unfortunately, such pitch trackers are not always accurate. This results in pitch tracking errors that can impair the performance of computer speech systems.
In particular, pitch-tracking errors can cause computer systems to misidentify voiced portions of speech as unvoiced portions and vice versa, and can cause speech systems to segment the speech signal poorly.
In a method for tracking pitch in a speech signal, first and second window vectors are created from samples taken across first and second windows of the speech signal. The first window is separated from the second window by a test pitch period. The energy of the speech signal in the first window is combined with the correlation between the first window vector and the second window vector to produce a predictable energy factor. The predictable energy factor is then used to determine a pitch score for the test pitch period. Based in part on the pitch score, a portion of the pitch track is identified.
In other embodiments of the invention, a method of pitch tracking takes samples of a first and second waveform in the speech signal. The centers of the first and second waveforms are separated by a test pitch period. A correlation value is determined that describes the similarity between the first and second waveforms, and a pitch-contouring factor is determined that describes the similarity between the test pitch period and a previous pitch period. The correlation value and the pitch-contouring factor are then combined to produce a pitch score for transitioning from the previous pitch period to the test pitch period. This pitch score is used to identify a portion of the pitch track.
Other embodiments of the invention provide a method of determining whether a region of a speech signal is a voiced region. The method involves sampling a first and second waveform and determining the correlation between the two waveforms. The energy of the first waveform is then determined. If the correlation and the energy are both high, the method identifies the region as a voiced region.
FIG. 1 is a plan view of an exemplary environment for the present invention.
FIG. 2 is a graph of a speech signal.
FIG. 3 is a graph of pitch as a function of time for a declarative sentence.
FIG. 4 is a block diagram of a speech synthesis system.
FIG. 5-1 is a graph of a speech signal.
FIG. 5-2 is a graph of the speech signal of FIG. 5-1 with its pitch properly lowered.
FIG. 5-3 is a graph of the speech signal of FIG. 5-1 with its pitch improperly lowered.
FIG. 6 is a block diagram of a speech coder.
FIG. 7 is a two-dimensional representation of window vectors for a speech signal.
FIG. 8 is a block diagram of a pitch tracker of the present invention.
FIG. 9 is a flow diagram of a pitch tracking method of the present invention.
FIG. 10 is a graph of a speech signal showing samples that form window vectors.
FIG. 11 is a graph of a Hidden Markov Model for identifying voiced and unvoiced regions of a speech signal.
FIG. 12 is a graph of the groupings of voiced and unvoiced samples as a function of energy and cross-correlation.
FIG. 13 is a flow diagram of a method for identifying voiced and unvoiced regions under the present invention.
FIG. [...]
With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer [...]
Although the exemplary environment described herein employs the hard disk, the removable magnetic disk [...]
A number of program modules may be stored on the hard disk, magnetic disk [...]
The personal computer [...]
When used in a LAN networking environment, the personal computer [...]
FIGS. 2 and 3 are graphs that describe the nature of pitch in human speech. FIG. 2 is a graph of a human speech signal [...]
FIG.
3 provides a graph [...]
Changes in pitch are tracked in a number of speech systems including speech synthesis systems such as speech synthesis system [...]
The analog signal from microphone [...]
Feature extraction component [...]
The digitized signal is also provided to pitch tracker [...]
Analysis engine [...]
Analysis engine [...]
Because storage is limited, analysis engine [...]
For each phonetic speech unit stored in unit storage [...]
Synthesis section [...]
The output of LTS [...]
Speech synthesizer [...]
The step of converting the pitch of the stored units into the pitch set by prosody engine [...]
If the pitch marks are not properly determined for the speech units, this segmentation technique will not result in a lower pitch. An example of this can be seen in FIG. 5-3, where the stored pitch marks used to segment the speech signal have incorrectly identified the pitch period. In particular, the pitch marks indicated a pitch period that was too long for the speech signal. This resulted in multiple peaks [...]
Pitch tracking is also used in speech coding to reduce the amount of speech data that is sent across a channel. Essentially, speech coding compresses speech data by recognizing that, in voiced portions, the speech signal consists of nearly repeating waveforms. Instead of sending the exact values of each portion of each waveform, speech coders send the values of one template waveform. Each subsequent waveform is then described by making reference to the waveform that immediately precedes it. An example of such a speech coder is shown in the block diagram of FIG. 6.
In FIG. 6, a speech coder [...]
The speech signal is also provided to a subtraction unit [...]
The delayed waveform is multiplied by a gain factor “g(n)” in a multiplication unit [...]
Once the gain factor is minimized, the difference from subtraction unit [...]
In the speech coder of FIG. 
6, the performance of the coder is improved if the difference from subtraction unit [...]
In the prior art, pitch tracking has been performed using cross-correlation, which provides an indication of the degree of similarity between the current sampling window and the previous sampling window. The cross-correlation can have values between −1 and +1. If the waveforms in the two windows are substantially different, the cross-correlation will be close to zero. However, if the two waveforms are similar, the cross-correlation will be close to +1.
In such systems, the cross-correlation is calculated for a number of different pitch periods. Generally, the test pitch period that is closest to the actual pitch period will generate the highest cross-correlation because the waveforms in the windows will be very similar. For test pitch periods that are different from the actual pitch period, the cross-correlation will be low because the waveforms in the two sample windows will not be aligned with each other.
Unfortunately, prior art pitch trackers do not always identify pitch correctly. For example, under cross-correlation systems of the prior art, an unvoiced portion of the speech signal that happens to have a semi-repeating waveform can be misinterpreted as a voiced portion providing pitch. This is a significant error since unvoiced regions do not provide pitch to the speech signal. By associating a pitch with an unvoiced region, prior art pitch trackers incorrectly calculate the pitch for the speech signal and misidentify an unvoiced region as a voiced region.
In an improvement upon the cross-correlation method of the prior art, the present inventors have constructed a probabilistic model for pitch tracking. The probabilistic model determines the probability that a test pitch track P is the actual pitch track for a speech signal. This determination is made in part by examining a sequence of window vectors X, where P and X are defined as:
$$P = \{P_0, P_1, \ldots, P_{M-1}\} \tag{1}$$
$$X = \{x_0, x_1, \ldots, x_{M-1}\} \tag{2}$$
where P_i [...]
Each window vector [...]
$$x_t = \{x[t-N/2], \ldots, x[t], \ldots, x[t+N/2-1]\} \tag{3}$$
where N is the size of the window, t is a time mark at the center of the window, and x[t] is the sample of the input signal at time t. In the discussion below, the window vector defined in Equation 3 is referred to as the current window vector x_t [...]
$$x_{t-P} = \{x[t-P-N/2], \ldots, x[t-P], \ldots, x[t-P+N/2-1]\} \tag{4}$$
where N is the size of the window, P is the pitch period describing the time period between the center of the current window and the center of the previous window, and t−P is the center of the previous window.
The probability of a test pitch track P being the actual pitch track given the sequence of window vectors X can be represented as ƒ(P|X). If this probability is calculated for a number of test pitch tracks, the probabilities can be compared to each other to identify the pitch track that is most likely to be equal to the actual pitch track. Thus, the maximum a posteriori (MAP) estimate of the pitch track is:
$$P_{MAP} = \arg\max_{P} f(P|X) \tag{5}$$
Using Bayes rule, the probability of EQ. 5 can be expanded to:
$$P_{MAP} = \arg\max_{P} \frac{f(P)\, f(X|P)}{f(X)} \tag{6}$$
where ƒ(P) is the probability of the pitch track P appearing in any speech signal, ƒ(X) is the probability of the sequence of window vectors X, and ƒ(X|P) is the probability of the sequence of window vectors X given the pitch track P. Since Equation 6 seeks a pitch track that maximizes the total probability represented by the factors of the right-hand side of the equation, only factors that are functions of the test pitch track need to be considered. Factors that are not a function of pitch track can be ignored. Since ƒ(X) is not a function of P, Equation 6 simplifies to:
$$P_{MAP} = \arg\max_{P} f(P)\, f(X|P) \tag{7}$$
Thus, to determine the most probable pitch track, the present invention determines two probabilities for each test pitch track. First, given a test pitch track P, the present invention determines the probability that a sequence of window vectors X will appear in a speech signal. Second, the present invention determines the probability of the test pitch track P occurring in any speech signal.
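The window vectors defined above, and the prior-art style cross-correlation between them, can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the synthetic sine-wave signal, and the list of candidate periods are hypothetical, and the window length is set equal to the test pitch period as recited in claims 2 and 3.

```python
import math

def window_vector(signal, center, size):
    """Return `size` samples around `center`: N/2 before it, N/2 - 1 after it."""
    start = center - size // 2
    if start < 0 or start + size > len(signal):
        raise ValueError("window falls outside the signal")
    return signal[start:start + size]

def normalized_cross_correlation(current, previous):
    """Scalar product of two window vectors divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(current, previous))
    norm = math.sqrt(sum(a * a for a in current) * sum(b * b for b in previous))
    return dot / norm if norm > 0.0 else 0.0

# Hypothetical test signal: a sine wave whose period is exactly 40 samples.
true_period = 40
signal = [math.sin(2.0 * math.pi * n / true_period) for n in range(400)]

t = 200  # time mark at the center of the current window
scores = {}
for test_period in (20, 30, 40):
    x_t = window_vector(signal, t, test_period)                   # current window
    x_prev = window_vector(signal, t - test_period, test_period)  # previous window
    scores[test_period] = normalized_cross_correlation(x_t, x_prev)

# The candidate closest to the true period yields the highest correlation.
best = max(scores, key=scores.get)
```

On this synthetic signal the candidate of 40 samples lines the two windows up exactly, while the half-period candidate of 20 samples produces windows that are mirror images of each other, so its correlation is strongly negative.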
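The probability of a sequence of window vectors X given a test pitch track P is approximated by the present invention as the product of a group of individual probabilities, with each probability in the group representing the probability that a particular window vector x_i [...]
$$f(X|P) = \prod_{i=0}^{M-1} f(x_i|P_i) \tag{8}$$
where M is the number of window vectors in the sequence of window vectors X and the number of pitches in the pitch track P.
The probability ƒ(x_i|P_i) [...]
where x_t [...]
From FIG. 7 it can be seen that the minimum prediction error |e_t|² [...]
$$|e_t|^2 = |x_t|^2 - |x_t|^2\cos^2(\theta) \tag{10}$$
where
$$\cos(\theta) = \frac{\langle x_t, x_{t-P}\rangle}{|x_t|\,|x_{t-P}|} \tag{11}$$
In Equation 11, <x_t, x_{t−P}> [...]
$$\langle x_t, x_{t-P}\rangle = \sum_{n=-N/2}^{N/2-1} x[t+n]\, x[t+n-P] \tag{12}$$
where x[t+n] is the sample of the input signal at time t+n, x[t+n−P] is the sample of the input signal at time t+n−P, and N is the size of the window. |x_t| [...]
$$|x_t| = \sqrt{\sum_{n=-N/2}^{N/2-1} x^2[t+n]} \tag{13}$$
$$|x_{t-P}| = \sqrt{\sum_{n=-N/2}^{N/2-1} x^2[t+n-P]} \tag{14}$$
Combining equations 11, 12, 13 and 14 produces:
$$\cos(\theta) = \frac{\sum_{n=-N/2}^{N/2-1} x[t+n]\, x[t+n-P]}{\sqrt{\sum_{n=-N/2}^{N/2-1} x^2[t+n]\; \sum_{n=-N/2}^{N/2-1} x^2[t+n-P]}} \tag{15}$$
The right-hand side of Equation 15 is equal to the cross-correlation α_t(P) [...]
$$|e_t|^2 = |x_t|^2 - |x_t|^2\,\alpha_t^2(P)$$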
Under an embodiment of the invention, the present inventors model the probability of an occurrence of a minimum prediction error |e_t|² [...]
$$f(e_t|P) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{|e_t|^2}{2\sigma^2}\right)$$
The log likelihood of |e_t|² [...]
$$\ln f(e_t|P) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{|e_t|^2}{2\sigma^2}$$
which can be simplified by representing the constants as a single constant V to produce:
$$\ln f(e_t|P) = V - \frac{|e_t|^2}{2\sigma^2}$$
Substituting for |e_t|² [...]
$$\ln f(e_t|P) = V - \frac{1}{2\sigma^2}\left(|x_t|^2 - \alpha_t^2(P)\,|x_t|^2\right)$$
The factors that are not a function of the pitch can be collected and represented by one constant K because these factors do not affect the optimization of the pitch. This simplification produces:
$$\ln f(e_t|P) = K + \frac{1}{2\sigma^2}\,\alpha_t^2(P)\,|x_t|^2 \tag{21}$$
The probability of having a specific prediction error given a pitch period P as described in Equation 21 is the same as the probability of the current window vector given the previous window vector and a pitch period P. Thus, Equation 21 can be rewritten as:
$$\ln f(x_t|P_t) = K + \frac{1}{2\sigma^2}\,\alpha_t^2(P_t)\,|x_t|^2 \tag{22}$$
where ƒ(x_t|P_t) [...]
As mentioned above, there are two probabilities that are combined under the present invention to identify the most likely pitch track. The first is the probability of a sequence of window vectors given a pitch track. That probability can be calculated by combining Equation 22 with Equation 8 above. The second probability is the probability of the pitch track occurring in the speech signal.
The present invention approximates the probability of the pitch track occurring in the speech signal by assuming that the a priori probability of a pitch period at a frame depends only on the pitch period for the previous frame. Thus, the probability of the pitch track becomes the product of the probabilities of each individual pitch occurring in the speech signal given the previous pitch in the pitch track. In terms of an equation:
$$f(P) = f(P_0) \prod_{i=1}^{M-1} f(P_i|P_{i-1}) \tag{23}$$
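One possible choice for the probability ƒ(P_i|P_{i−1}) [...]
$$\ln f(P_i|P_{i-1}) = k' - \frac{(P_i - P_{i-1})^2}{2\gamma^2} \tag{24}$$
where γ is the standard deviation of the Gaussian distribution and k′ is a constant.
Combining equations 7, 8 and 23, and rearranging the terms produces:
$$P_{MAP} = \arg\max_{P}\; f(P_0) \prod_{i=0}^{M-1} f(x_i|P_i) \prod_{i=1}^{M-1} f(P_i|P_{i-1}) \tag{25}$$
Since the logarithm is monotonic, the value of P that maximizes EQ. 25 also maximizes the logarithm of the right-hand side of EQ. 25:
$$P_{MAP} = \arg\max_{P} \left[\ln f(P_0) + \sum_{i=0}^{M-1} \ln f(x_i|P_i) + \sum_{i=1}^{M-1} \ln f(P_i|P_{i-1})\right] \tag{26}$$
Combining Equation 26 with equations 22 and 24, ignoring the constants K and k′ (and treating ƒ(P_0) as a constant), and scaling by 2σ² produces:
$$P_{MAP} = \arg\max_{P} \left[\alpha_0^2(P_0)\,|x_0|^2 + \sum_{i=1}^{M-1} \left(\alpha_i^2(P_i)\,|x_i|^2 - \lambda (P_i - P_{i-1})^2\right)\right] \tag{27}$$
where λ=σ²/γ² [...]
Thus, the probability of a test pitch track being the actual pitch track consists of three terms. The first is an initial energy term α_0²(P_0)|x_0|² [...]
The second term is a predictable energy term α_i²(P_i)|x_i|² [...]
The third term in the probability of a test pitch track is pitch transition term λ(P_i−P_{i−1})² [...]
The summation portion of Equation 27 can be viewed as the sum of a sequence of individual probability scores, with each score indicating the probability of a particular pitch transition at a particular time. These individual probability scores are represented as:
$$S_i(P_i, P_{i-1}) = \alpha_i^2(P_i)\,|x_i|^2 - \lambda (P_i - P_{i-1})^2 \tag{28}$$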
where S_i(P_i, P_{i−1}) [...]
Combining Equation 28 with Equation 27 produces:
$$P_{MAP} = \arg\max_{P} \left[\alpha_0^2(P_0)\,|x_0|^2 + \sum_{i=1}^{M-1} S_i(P_i, P_{i-1})\right] \tag{29}$$
Equation 29 provides the most likely pitch track ending at pitch P_{M−1} [...]
$$P_{MAP} = \arg\max_{P} \left[\alpha_0^2(P_0)\,|x_0|^2 + \sum_{i=1}^{M-1} S_i(P_i, P_{i-1}) + S_M(P_M, P_{M-1})\right] \tag{30}$$
Comparing Equation 30 to Equation 29, it can be seen that in order to calculate a most likely pitch path ending at a new pitch P_M [...]
Under an embodiment of the invention, pitch track scores are determined at a set of time marks t=iT such that the pitch track scores ending at pitch P [...]
Based on Equation 30, a pitch tracker [...]
Pitch tracker [...]
At a step [...]
The test pitch P [...]
Examples of the samples that are found in current window vector x [...]
Window sampler [...]
Window sampler [...]
In some embodiments of the invention, window sampler [...]
After calculating the backward cross-correlation at step [...]
If the backward cross-correlation is higher than the forward cross-correlation, the backward cross-correlation is compared to zero at step [...]
If the forward cross-correlation is larger than the backward cross-correlation at step [...]
In further embodiments of the present invention, the once modified cross-correlation α′_t(P) [...]
$$\alpha''_t(P) = \alpha'_t(P) - \beta\,\alpha'_t(P/2)$$
where β is the reduction factor such that 0<β<1. Under some embodiments, β is 0.2.
After steps [...]
The pitch transition terms λ(P_i−P_{i−1})² [...]
The separate pitch transition terms produced by pitch transition calculator [...]
At step [...]
If all of the current path scores have been calculated at step [...]
As part of this process, some embodiments of dynamic programming [...]
This most probable pitch track is then output at step [...]
The scores for surviving pitch tracks determined at time t=MT are stored at step [...]
In addition to identifying a pitch track, the present invention also provides a means for identifying voiced and unvoiced portions of a speech signal. To do this, the present invention defines a two-state Hidden Markov Model (HMM) shown as model [...]
The probability of being in either the voiced state or the unvoiced state at any time period is the combination of two probabilities. The first probability is a transition probability that represents the likelihood that a speech signal will transition from a voiced region to an unvoiced region and vice versa or that a speech signal will remain in a voiced region or an unvoiced region. Thus, the first probability indicates the likelihood that one of the transition paths [...]
The second probability used in determining whether the speech signal is in a voiced region or an unvoiced region is based on characteristics of the speech signal at the current time period. In particular, the second probability is based on a combination of the total energy of the current sampling window |x_t|² [...]
A method under the present invention for identifying the voiced and unvoiced regions of a speech signal is shown in the flow diagram of FIG. 13. [...]
After the cross-correlation has been calculated at step [...]
In step [...]
At step [...]
At step [...]
At step [...]
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
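The dynamic-programming search over pitch transitions described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: each transition is scored as a predictable energy term (squared cross-correlation times window energy) minus a penalty λ times the squared change in pitch period, negative correlations are clipped to zero, a fixed window size is used for every candidate, and the function names, the value of `lam`, the candidate list, and the synthetic signal are all hypothetical.

```python
import math

def window(signal, center, size):
    """Return `size` samples centered at `center`; the window must lie inside the signal."""
    start = center - size // 2
    if start < 0 or start + size > len(signal):
        raise ValueError("window falls outside the signal")
    return signal[start:start + size]

def predictable_energy(signal, t, period, size):
    """Squared cross-correlation times the energy of the current window."""
    x_t = window(signal, t, size)
    x_prev = window(signal, t - period, size)
    energy = sum(a * a for a in x_t)
    prev_energy = sum(b * b for b in x_prev)
    if energy == 0.0 or prev_energy == 0.0:
        return 0.0
    alpha = sum(a * b for a, b in zip(x_t, x_prev)) / math.sqrt(energy * prev_energy)
    alpha = max(alpha, 0.0)  # a negative correlation is treated as no correlation
    return alpha * alpha * energy

def track_pitch(signal, time_marks, candidates, size, lam):
    """Return the candidate pitch period per time mark with the highest total score."""
    # best[j]: highest score of any pitch track ending at candidates[j]
    best = [predictable_energy(signal, time_marks[0], p, size) for p in candidates]
    back_pointers = []
    for t in time_marks[1:]:
        new_best, pointers = [], []
        for p in candidates:
            local = predictable_energy(signal, t, p, size)
            # Predecessor that maximizes the earlier score minus the transition penalty.
            k = max(range(len(candidates)),
                    key=lambda j: best[j] - lam * (p - candidates[j]) ** 2)
            new_best.append(best[k] + local - lam * (p - candidates[k]) ** 2)
            pointers.append(k)
        best = new_best
        back_pointers.append(pointers)
    # Trace the best-scoring track back from the final time mark.
    j = max(range(len(candidates)), key=lambda i: best[i])
    track = [candidates[j]]
    for pointers in reversed(back_pointers):
        j = pointers[j]
        track.append(candidates[j])
    track.reverse()
    return track

# Hypothetical test signal: a sine wave whose period is exactly 40 samples.
signal = [math.sin(2.0 * math.pi * n / 40) for n in range(600)]
track = track_pitch(signal, time_marks=[200, 240, 280, 320],
                    candidates=[30, 35, 40, 45, 50], size=40, lam=0.01)
```

Only the best score per ending pitch is kept at each time mark, so the cost grows with the square of the number of candidates per time mark rather than exponentially with the number of time marks.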
In addition, although block diagrams have been used to describe the invention, those skilled in the art will recognize that the components of the invention can be implemented as computer instructions.