(12) United States Patent
Pechter et al.
(10) Patent No.: US 6,879,957 B1
(45) Date of Patent: Apr. 12, 2005

(54) METHOD FOR PRODUCING A SPEECH RENDITION OF TEXT FROM DIPHONE SOUNDS

(75) Inventors: William H. Pechter, 1295 Olde Doubloon Dr., Vero Beach, FL (US) 32963; Joseph E. Pechter, 1295 Olde Doubloon Dr., Vero Beach, FL (US) 32963

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 817 days.

(21) Appl. No.: 09/653,582
(22) Filed: Sep. 1, 2000

Related U.S. Application Data
(60) Provisional application No. 60/157,808, filed on Oct. 4, 1999.

(51) Int. Cl.7 .............. G10L 13/08
(52) U.S. Cl. .............. 704/267; 704/260
(58) Field of Search .............. 704/267, 260

(56) References Cited
U.S. PATENT DOCUMENTS
5,930,754 A * 7/1999 Karaali et al. ........ 704/259
6,088,666 A * 7/2000 Chang et al. ........ 704/258
6,148,285 A * 11/2000 Busardo ........ 704/260
6,175,821 B1 * 1/2001 Page et al. ........ 704/258
* Donovan et al. ..........

* Cited by examiner

Primary Examiner—Richemond Dorvil
Assistant Examiner—Donald L. Storm
(74) Attorney, Agent, or Firm—Kevin P. Crosby; Daniel C. Crilly; Brinkley, McNerney et al.

(57) ABSTRACT

A text-to-speech system utilizes a method for producing a speech rendition of text based on dividing some or all words of a sentence into component diphones. A phonetic dictionary is aligned so that each letter within each word has a single corresponding phoneme. The aligned dictionary is analyzed to determine the most common phoneme representation of each letter in the context of a string of letters before and after it. The results for each letter are stored in a phoneme rule matrix. A diphone database is created using a wav editor to cut 2,000 distinct diphones out of specially selected words. A computer algorithm selects a phoneme for each letter. Then, two phonemes are used to create a
diphone. Words are then read aloud by concatenating sounds from the diphone database. In one embodiment, diphones are used only when a word is not one of a list of pre-recorded words.
17 Claims, 1 Drawing Sheet
METHOD FOR PRODUCING A SPEECH RENDITION OF TEXT FROM DIPHONE SOUNDS
CROSS-REFERENCE TO A RELATED APPLICATION
This application claims priority from U.S. Provisional Application Ser. No. 60/157,808, filed Oct. 4, 1999, the disclosure of which is incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to speech synthesis systems and more particularly to algorithms and methods used to produce a viable speech rendition of text.
2. Description of the Prior Art
Phonology involves the study of speech sounds and the rule system for combining speech sounds into meaningful words. One must perceive and produce speech sounds and acquire the rules of the language used in one’s environment. In American English, a blend of two consonants such as “s” and “t” is permissible at the beginning of a word, but blending the two consonants “k” and “b” is not; “ng” is not produced at the beginning of words; and “w” is not produced at the end of words (words may end in the letter “w” but not the sound “w”). Marketing experts demonstrate their knowledge of phonology when they coin words for new products; product names, if chosen correctly using phonological rules, are recognizable to the public as rightful words. Slang also follows these rules. For example, the word “nerd” is recognizable as an acceptably formed noun.
Articulation usually refers to the actual movements of the speech organs that occur during the production of various speech sounds. Successful articulation requires (1) neurological integrity, (2) normal respiration, (3) normal action of the larynx (voice box or Adam’s apple), (4) normal movement of the articulators, which include the tongue, teeth, hard palate, soft palate, lips, and mandible (lower jaw), and (5) adequate hearing.
Phonics involves interdependence between the three cuing systems: semantics, syntax, and grapho-phonics. In order to program words and use phonics as the tool for doing that, one has to be familiar with these relationships. Semantic cues (context: what makes sense) and syntactic cues (structure and grammar: what sounds right grammatically) are strategies the reader needs to be using already in order for phonics (letter-sound relationships: what looks right visually and sounds right phonetically) to make sense. Phonics proficiency by itself cannot elicit comprehension of text. While phonics is integral to the reading process, it is subordinate to semantics and syntax.
There are many types of letter combinations that need to be understood in order to fully understand how programming a phonics dictionary would work. In simple terms, the following letter-sound relationships need to be developed: beginning consonants, ending consonants, consonant digraphs (“sh,” “th,” “ch,” “wh”), medial consonants, consonant blends, long vowels and short vowels.
Speech and language pathologists generally call a speech sound a “phoneme”. Technically, it is the smallest sound segment in a word that we can hear and that, when changed, modifies the meaning of a word. For example, the words “bit” and “bid” have different meanings, yet they differ in their respective sounds by only the last sound in each word (i.e., “t” and “d”). These two sounds would be considered phonemes because they are capable of changing meaning. Speech sounds or phonemes are classified as vowels and consonants. The number of letters in a word and the number of sounds in a word do not always have a one-to-one correspondence. For example, in the word “squirrel”, there are eight letters, but there are only five sounds: “s”-“k”-“w”-“r”-“l”.
A “diphthong” is the sound that results when the articulators move from one vowel to another within the same syllable. Each one of these vowels and diphthongs is called a speech sound or phoneme. The vowel letters are a, e, i, o, u, and sometimes y; when breaking words into sounds, however, those five or six vowel letters represent approximately 17 distinct vowel sounds. One should note that there are some variations in vowel usage due to regional or dialectal differences.
Speech-language pathologists often describe consonants by their place of articulation and manner of articulation as well as the presence or absence of voicing. Many consonant sounds are produced alike, except for the voicing factor. For instance, “p” and “b” are both bilabial stops. That is, the sounds are made with both lips and the flow of air in the vocal tract is completely stopped and then released at the place of articulation. It is important to note, however, that one type of consonant sound is produced with voicing (the vocal folds are vibrating) and the other type of consonant sound is produced without voicing (the vocal folds are not vibrating).
The concepts described above must be taken into account in order to enable a computer to generate speech which is understandable to humans. While computer-generated speech is known to the art, it often lacks the accuracy needed to render speech that is reliably understandable, or it relies on cumbersome implementations of the rules of English (or any other language’s) pronunciation. Other implementations require human annotation of the input text message to facilitate accurate pronunciation. The present invention has neither of these limitations.
It is a principal object of this invention to provide a text-to-speech program with a very high level of versatility, user friendliness and understandability.
In accordance with the present invention, there is provided a method for producing a viable speech rendition of text comprising the steps of parsing a sentence into a plurality of words and punctuation, comparing each word to a list of pre-recorded words, dividing a word not found on the list of pre-recorded words into a plurality of diphones and combining sound files corresponding to the plurality of diphones, and playing a sound file corresponding to the word.
The method may also include the step of adding inflection to the word in accordance with the punctuation of the sentence.
The method may further include using a database of diphones to divide the word into a plurality of diphones.
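The steps recited above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the word list, the diphone database, and the token handling are all assumed placeholders, and letters stand in for phonemes to keep the sketch self-contained.

```python
import re

# Hypothetical stand-ins for the pre-recorded word list and diphone database.
PRERECORDED = {"hello", "world"}          # words with whole-word wav recordings
DIPHONE_DB = {"he", "el", "ll", "lo"}     # diphones with recorded wav files

def parse_sentence(sentence):
    """Step 1: parse a sentence into words and punctuation."""
    return re.findall(r"[A-Za-z]+|[.,!?;]", sentence)

def split_into_diphones(word):
    """Divide a word into overlapping two-letter pieces (simplified:
    real diphones pair adjacent phonemes, not adjacent letters)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def render(sentence):
    """Steps 2-4: look up each word; fall back to diphone concatenation."""
    output = []
    for token in parse_sentence(sentence):
        if token in PRERECORDED:
            output.append(("word", token))          # play whole-word wav
        elif token.isalpha():
            parts = split_into_diphones(token.lower())
            output.append(("diphones", parts))      # concatenate diphone wavs
        else:
            output.append(("punctuation", token))   # adjust inflection
    return output
```

In the actual system the inflection step would use the punctuation tokens to shape pitch, as the preceding paragraph notes.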
These and other objects and features of the invention will be more readily understood from a consideration of the following detailed description, taken with the accompanying drawings.
In Phase 1 of our project, we developed: a parser program in Qbasic; a file of over 10,000 individually recorded common words; and a macro program to link a scanning and optical character recognition program to these and a wav
player so as to either say or spell each word in text. We tested many different articles by placing them into the scanner and running the program. We found that of the 20 articles we placed into the scanner, 86% of the words were recognized by our program from the 10,000-word list. Our major focus for Phase 2 of our project has been on the goal of increasing accuracy. Our 86% accuracy with Phase 1 was reasonable, but this still meant that, on average, one to two words per line had to be spelled out, which could interrupt the flow of the reading and make understanding the full meaning of a sentence more difficult. We found some dictionaries of the English language with up to 250,000 words. To record all of them would take over 1,000 hours and still would not cover names, places, nonsense words or expressions like “sheesh”, slang like “jumpin”, or new words that are constantly creeping into our language. If we recorded a more feasible 20,000 new words, it would probably only have increased the accuracy by 1 to 2%. A new approach was needed. We felt the most likely approach to make a more dramatic increase would involve phonetics. Any American English word can be reasonably reproduced as some combination of 39 phonetic sounds (phonemes). We researched phonetics and experimented with linking together different phonemes, trying to create understandable words with them. Unfortunately, the sounds did not sound close enough to the appropriate word, rendering the process infeasible. Most spoken words have a slurring transition from one phoneme to the next. When this transition is missing, the sounds are disjointed and the word is not easily recognized. Overlapping phoneme wav files by 20% helped, but not enough. Other possibilities then considered were the use of syllables or groupings of 2 or 3 phonemes (diphones and triphones). Concatenations of these produced a reasonable approximation of the desired word.
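The 20% overlap mentioned above amounts to an overlap-add of successive sound buffers. A minimal sketch with NumPy, assuming mono float samples and a linear cross-fade (neither detail is specified in the text):

```python
import numpy as np

def concatenate_with_overlap(clips, overlap_frac=0.20):
    """Join audio clips, overlapping each pair by a fraction of the
    incoming clip's length and cross-fading linearly in the overlap."""
    out = clips[0].astype(np.float64)
    for clip in clips[1:]:
        clip = clip.astype(np.float64)
        n = min(int(len(clip) * overlap_frac), len(out))
        if n > 0:
            fade = np.linspace(0.0, 1.0, n)
            # Blend the tail of the accumulated audio with the head of
            # the next clip so the transition is not disjointed.
            out[-n:] = out[-n:] * (1.0 - fade) + clip[:n] * fade
        out = np.concatenate([out, clip[n:]])
    return out
```

A simple linear fade is only one choice; any monotone fade curve would serve the same smoothing purpose.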
The decision to use diphones was based on practicality: only about 2,000 diphones are needed, as opposed to some 100,000 triphones. Due to these numbers, we elected to proceed with diphones. The number could be further reduced by avoiding combinations that never occur in real words, but we elected to include all combinations because names, places and nonsense words can have strange combinations of sounds and would otherwise need to be spelled out. By experimentation, we found that simply saying the sound did not work well; this produced too many accentuated sounds that did not blend well. What worked best was cutting the diphone from the middle of a word, using a good ear and a wav editor to cut the sound out of the word. We initially tried to cut the diphones from words of 12 or more letters, since long words would potentially have more diphones in them, but there was so much duplication that we soon switched to a more methodical approach: searching a phonetic dictionary for words with a specific diphone, cutting out that single diphone, and then going on to the next one on the list. If no word could be found, we would create a nonsense word with the desired diphone in the middle, and then extract it with the editor. A considerable amount of time was spent perfecting the process of extracting the diphones. We needed to get the tempo, pitch, and loudness of each recording as similar as possible to the others in order to allow good blending.
We decided to use a hybrid approach in our project. Our program uses both whole words (from our list of 10,000 words) and concatenated words (from the linking of diphones). Any word found on our main list would produce the wav recording of that entire word. All other words would be produced by concatenation of diphones, unless the word included a combination of letters and numbers (like B42), in which case it would be spelled out.
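The hybrid decision just described can be expressed as a small classifier; the function name and word list below are illustrative, not from the patent:

```python
def rendering_mode(word, prerecorded):
    """Decide how a token is spoken, per the hybrid scheme: whole-word
    wav if recorded, spell-out for letter/number mixes, else diphones."""
    if word.lower() in prerecorded:
        return "whole-word"           # play the recorded wav for the word
    has_letter = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    if has_letter and has_digit:      # e.g. "B42" mixes letters and numbers
        return "spell-out"            # say each character individually
    return "diphones"                 # concatenate diphone recordings
```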
We next needed an algorithm to determine which phonemes and diphones to use for a given word. We first explored English texts and chapters dealing with pronunciation rules. Though many rules were found, they were not all-inclusive and had many exceptions. We next searched the Internet for pronunciation rules and found an article by the Naval Research Laboratory (Document AD/A021 929, published by National Technical Information Services). Implementing its rules would have required hundreds of nested if-then statements, and reportedly it still had only mediocre performance. We decided to try to create our own set of pronunciation rules by working backwards from a phonetic dictionary. We were able to find such a dictionary (the CMU dictionary, v0.6) at the website identified by the uniform resource locator (URL) “ftp://ftp.cs.cmu.edu/project/speech/dict.” It had over 100,000 words followed by their phonetic representations. The site made it clear this was being made freely available for anyone’s use.
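Each line of the CMU Pronouncing Dictionary gives a word followed by its phoneme string (e.g. `HELLO  HH AH0 L OW1`), with `;;;` marking comment lines. A minimal loader, assuming that format and ignoring details such as alternate-pronunciation entries, might look like:

```python
def load_phonetic_dictionary(lines):
    """Parse CMU-dictionary-style lines into {word: [phonemes]}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue                    # skip blanks and comment lines
        parts = line.split()
        word, phonemes = parts[0], parts[1:]
        entries[word] = phonemes
    return entries
```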
Our strategy was to have every letter of each word represented by a single phoneme, and then to find the most common phoneme representation of a letter given certain letters that preceded and followed it. Not all words have the same number of letters as phonemes, so we first had to go through the list and insert blank phonemes when there were too many letters for the original number of phonemes (as with “ph,” “th,” “gh,” or a double consonant: the first letter carried the phoneme of the sound made and the second letter received the blank phoneme). In the less common case of too many phonemes for the number of letters in the word, we combined two phonemes into one. These manipulations left us with a dictionary of words and matching phonemes; each letter of each word now had a matching phoneme. We used this aligned dictionary as input for a Visual Basic program which determined the most common phoneme representation for a given letter, taking into account the one letter before and two letters after it. This was stored in 26×26×26×26 matrix form and output to a file so it could be read in and used by the next program. Our next program tested the effectiveness of this matrix in predicting the pronunciation of each word on the original phonetic dictionary list. This program applied the letter-to-phoneme rules of the matrix to each word and then directly compared the result with the original phoneme assigned to each letter by the dictionary. It found that 52% of the words were given the entirely correct pronunciation, 65% were either totally correct or had just one letter pronounced incorrectly, and overall 90% of all letters were assigned the correct pronunciation.
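The rule-building pass described above might be sketched like this, using a Python dictionary keyed by the four-letter context in place of the 26×26×26×26 matrix; the `_` padding character and the context encoding are illustrative choices, not the patent's:

```python
from collections import Counter, defaultdict

def build_rule_table(aligned_words):
    """From an aligned dictionary (each word paired with one phoneme per
    letter, '' marking a blank phoneme), record the most common phoneme
    for each (one letter before, the letter, two letters after) context."""
    counts = defaultdict(Counter)
    for word, phonemes in aligned_words:
        padded = "_" + word + "__"        # '_' pads the word boundaries
        for i, ph in enumerate(phonemes):
            context = padded[i:i + 4]     # 1 before, the letter, 2 after
            counts[context][ph] += 1
    # Keep only the winning phoneme for each context.
    return {ctx: ctr.most_common(1)[0][0] for ctx, ctr in counts.items()}
```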
In an attempt to obtain better accuracy, we looked at the 3 letters before and 3 letters after the given letter, but in order to put the results in a simple standard matrix by the same technique, we would have needed a 26×26×26×26×26×26×26 matrix, which required more space than our computer allowed. Instead, we created separate files for each letter of the alphabet. In our “a” file we included a list of 7-letter strings consisting of the 3 letters before and 3 letters after every “a” found in our phonetic dictionary. We made additional files for “b” through “z”. Again we found the most common phoneme representation of “a” for each distinct 7-letter string that had “a” as the middle letter. By reading these into 26 different one-dimensional matrix files, the additional run/search time of the program was minimized. We kept the one-before, two-after matrix as a backup to be used if letters in the input word did not have a 7-letter match to any word in the phonetic dictionary. Using this technique, accuracy improved dramatically. 98% of all letters (804,961/823,343) were assigned
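The two-tier lookup just described (7-letter context first, one-before/two-after backup second) can be sketched as follows; the table layout and the `_` padding are assumptions carried over from the earlier sketch, not the patent's actual data structures:

```python
def choose_phoneme(word, i, seven_rules, backup_rules):
    """Pick the phoneme for word[i]: try the 3-before/3-after 7-letter
    context first, then fall back to the 1-before/2-after table.
    Both tables map context strings to phonemes; '_' pads boundaries."""
    padded = "___" + word + "___"
    seven = padded[i:i + 7]               # 3 before, the letter, 3 after
    if seven in seven_rules:
        return seven_rules[seven]
    small = padded[i + 2:i + 6]           # 1 before, the letter, 2 after
    return backup_rules.get(small)        # None if neither table matches
```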