CA2119397C - Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation - Google Patents
Info
- Publication number
- CA2119397C
- Authority
- CA
- Canada
- Prior art keywords
- prosodic
- text
- salience
- major
- indicia
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
Improved automated synthesis of human audible speech from text is disclosed. Performance enhancement of the underlying text comprehensibility is obtained through prosodic treatment of the synthesized material, improved speaking rate treatment, and improved methods of spelling words or terms for the system user. Prosodic shaping of text sequences appropriate for the discourse in large groupings of text segments, with prosodic boundaries developed to indicate conceptual units within the text groupings, is implemented in a preferred embodiment.
Claims (22)
1. An automated system for synthesizing human audible speech from machine-readable representation of text wherein the system employs a synthesis device which has been designed for use with unrestricted text, said system including a prosody indicia generating means for automatically providing indicia of the text prosody to the synthesis device, said indicia being interpretable and executable by that device, and assigned on the basis of predetermined characteristics of restricted text, and wherein the prosody indicia are generated by identifying major prosodic groupings by utilizing major demarcation features to define the beginning and end of the major prosodic groupings.
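As a rough illustration of claim 1's first step, the sketch below treats the fields of a restricted name-and-address record as major prosodic groupings, with explicit begin/end boundary tokens standing in for the "major demarcation features." The field names, boundary tokens, and record shape are all illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch: splitting a restricted-text listing record into
# major prosodic groupings. Field boundaries stand in for the claim's
# "major demarcation features"; all names here are illustrative.

def major_groupings(record: dict) -> list:
    """Treat each populated field of a name-and-address record as one
    major prosodic grouping, bracketed with begin/end boundary markers
    so a synthesizer could place boundary prosody (e.g., pitch reset,
    final lengthening) at each edge."""
    groupings = []
    for field in ("name", "address", "telephone"):
        value = record.get(field)
        if value:
            groupings.append(["<BEGIN>", value, "<END>"])
    return groupings

record = {"name": "J. Smith", "address": "22 Main St, Ottawa", "telephone": "555-0142"}
print(major_groupings(record))
```

Each grouping is kept as a separate unit so that later passes (subgrouping, salience assignment) can operate within it.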
2. The synthesis system of Claim 1 wherein the indicia are generated by prosody rules associated with predetermined discourse constraints particular to the context of the synthesis of the text.
3. The system of Claim 2 wherein the restricted text consists of name and address information.
4. The system of Claim 3 wherein name and address information is arranged into fields containing respectively names and addresses associated with a telephone number or numbers.
5. The system of Claim 1, 2, 3 or 4 wherein the prosody indicia are further generated by:
a) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the text for predetermined textual markers indicative of prosodically isolatible subgroupings not delineated by the major demarcations dividing the prosodic major groupings,
b) within the prosodic subgroupings, identifying prosodically separable subgroup components, and
c) generating prosody indicia which include salience signifiers utilizable by the synthesis device to vary the salience of segments of the synthesized speech such that
(i) the salience signifiers within the prosodic subgroupings are first generated in accordance with predetermined salience placement rules solely relating to the components themselves,
(ii) thereafter the first generated salience signifiers are modified to increase the salience at the start of the prosodic subgroup and further signify the salience at the end of the prosodic subgroup, and
(iii) the salience signifiers are subsequently further modified to further increase the salience of the beginning of the major prosodic grouping and further signify the salience of the end of the major prosodic grouping.
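The three ordered salience passes of claim 5(c) can be sketched as follows. The integer salience levels, the baseline rule, and the increments are illustrative assumptions; the claim only fixes the ordering (component-local rules first, then subgroup edges, then major-grouping edges).

```python
# Hypothetical three-pass salience assignment following the claim's order:
# (i) component-local rules, (ii) subgroup-edge adjustment, (iii) major-
# grouping-edge adjustment. Levels and increments are illustrative.

def assign_salience(major_grouping: list) -> list:
    """major_grouping is a list of prosodic subgroups, each a list of
    component tokens. Returns a parallel structure of integer salience
    levels (higher = more prominent)."""
    # Pass (i): each component gets a baseline salience from rules that
    # look only at the component itself (a constant placeholder here).
    levels = [[1 for _ in subgroup] for subgroup in major_grouping]
    # Pass (ii): raise salience at the start of each subgroup and
    # signify its end.
    for sub in levels:
        sub[0] += 1
        sub[-1] += 1
    # Pass (iii): further raise the beginning of the whole major
    # grouping and signify its end.
    levels[0][0] += 1
    levels[-1][-1] += 1
    return levels

print(assign_salience([["twenty", "two"], ["Main", "Street"]]))
```

Claims 10, 16, 17, and 21 note that these salience levels would in practice be realized as indicia of pitch.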
6. The system of Claim 5 wherein the subgroup components are identified by:
a) identifying textual indicators which mark divisions of text groupings around them,
b) utilizing the textual indicators to separate the text within the prosodic subgrouping into units of nominal text which do not include the said predetermined textual markers, and
c) within the units of nominal text, identifying other indicators of textual groupings that are not predetermined textual markers of divisions, identifying nouns, and identifying qualifiers of nouns.
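A minimal sketch of claim 6's component identification: split a subgroup on division markers, then label qualifiers and nouns inside each nominal unit. The marker set and qualifier list are small illustrative assumptions (a real system would use larger lexicons and part-of-speech analysis).

```python
# Hypothetical sketch of claim 6: division markers separate a subgroup
# into units of nominal text; within each unit, qualifiers of nouns are
# distinguished from the nouns themselves. Word lists are illustrative.

DIVISION_MARKERS = {",", "&", "and"}                 # textual indicators of divisions
QUALIFIERS = {"north", "south", "east", "west", "old", "new"}

def nominal_units(tokens: list) -> list:
    """Separate tokens into units of nominal text containing no
    division markers."""
    units, current = [], []
    for tok in tokens:
        if tok.lower() in DIVISION_MARKERS:
            if current:
                units.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        units.append(current)
    return units

def tag_unit(unit: list) -> list:
    """Within a nominal unit, label qualifiers; everything else is
    treated as a noun in this simplified sketch."""
    return [(tok, "QUAL" if tok.lower() in QUALIFIERS else "NOUN") for tok in unit]

units = nominal_units(["North", "Main", "Street", "and", "Elm", "Road"])
print([tag_unit(u) for u in units])
```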
7. In the speech synthesis system of Claims 1, 2, 3, 4, or 6, said system having means for allowing users to obtain repeats of text segments and having means for adjusting a rate of annunciation of the synthesized segments of text by:
a) changing the rate of annunciation of a text segment after a first number of successive repeats of that segment for a first user, and
b) decreasing the rate of annunciation of a further text segment for a subsequent number of successive repeats of that further text segment for the first user, and increasing the rate of annunciation if no repeats are requested by that user, and
c) adjusting the initial annunciation rate for subsequent users in response to the number of consecutive prior users for whom the rate of annunciation of text had been altered.
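The rate-adjustment policy of claim 7 can be sketched as a small stateful class: slow down after repeated replays within a user's session, speed up when no repeats are requested, and lower the starting rate for later users after consecutive prior users needed slowing. All thresholds, step sizes, and the cap are illustrative assumptions.

```python
# Hypothetical sketch of the claim-7 annunciation-rate policy. The
# threshold, step size, and cap on cross-user adjustment are illustrative.

class RatePolicy:
    def __init__(self, base_rate=1.0, step=0.1, repeat_threshold=2):
        self.base_rate = base_rate            # nominal rate multiplier
        self.step = step
        self.repeat_threshold = repeat_threshold
        self.consecutive_slowed_users = 0     # state for claim 7(c)

    def start_user(self) -> float:
        """Initial rate for a new user, lowered when consecutive prior
        users required slowing (claim 7(c)); capped at three steps."""
        return self.base_rate - self.step * min(self.consecutive_slowed_users, 3)

    def after_segment(self, rate: float, repeats: int) -> float:
        """Per-segment adjustment within a session (claim 7(a)-(b))."""
        if repeats >= self.repeat_threshold:
            return rate - self.step           # user is struggling: slow down
        if repeats == 0:
            return rate + self.step           # no repeats requested: speed up
        return rate

    def end_user(self, was_slowed: bool):
        """Track how many users in a row needed a slower rate."""
        self.consecutive_slowed_users = (
            self.consecutive_slowed_users + 1 if was_slowed else 0
        )

policy = RatePolicy()
r = policy.start_user()                  # first user starts at the base rate
r = policy.after_segment(r, repeats=2)   # two repeats: rate is reduced
policy.end_user(was_slowed=True)
print(round(policy.start_user(), 2))     # next user begins one step slower
```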
8. An automated synthesis system wherein human audible speech is synthesized from text by a synthesis device in accordance with indicia of text prosody derived from rules relating to the underlying discourse context of the synthesis, said prosody indicia including features generated by:
a) identifying major prosodic groupings by utilizing major demarcation features to define the beginning and end of the major prosodic groupings;
b) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the text for predetermined textual markers indicative of prosodically isolatible subgroupings not delineated by the major demarcations dividing the prosodic major groupings,
c) within the prosodic subgroupings, identifying prosodically separable subgroup components, and
d) generating prosody indicia which include salience signifiers utilizable by the synthesis device to vary the salience of segments of the synthesized speech such that
(i) the salience signifiers within the prosodic subgroupings are first generated in accordance with predetermined salience placement rules solely relating to the components themselves,
(ii) modifying the first generated salience signifiers to increase the salience at the start of the prosodic subgroup and further signify the salience at the end of the prosodic subgroup, and
(iii) further modifying the salience signifiers to further increase the salience of the beginning of the major prosodic grouping and further signify the salience of the end of the major prosodic grouping.
9. The system of Claim 8 wherein the subgroup components are isolated by:
a) identifying textual indicators which mark relations of text groupings around them,
b) utilizing the textual indicators to separate the text within the prosodic subgrouping into units of nominal text which do not include the said predetermined textual markers, and
c) within the units of nominal text, identifying relational words that are not predetermined textual markers, nouns, or qualifiers of nouns.
10. The system of Claims 8 or 9 wherein the salience signifiers are indicia of pitch.
11. An automated system for synthesizing human audible speech from machine readable representation of restricted text having predetermined characteristics wherein the system employs a synthesis device which has been designed for use with unrestricted text, having a prosody indicia generator means for providing indicia of the text prosody to the synthesis device, said indicia being interpretable and executable by that device, and assigned on the basis of predetermined discourse constraints particular to the context of the synthesis of the text, and wherein the prosody indicia are generated by identifying major prosodic groupings by utilizing major demarcation features to define the beginning and end of the major prosodic groupings.
12. The system of Claim 11 wherein the restricted text consists of name and address information.
13. The system of Claim 12 wherein the name and address information is arranged into fields containing respectively names and addresses associated with a telephone number or numbers.
14. The system of Claim 11, 12, or 13 wherein the prosody indicia are further generated according to the following method:
a) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the text for predetermined textual markers indicative of prosodically isolatible subgroupings not delineated by the major demarcations dividing the prosodic major groupings,
b) within the prosodic subgroupings, identifying prosodically separable subgroup components, and
c) generating prosody indicia which include salience signifiers utilizable by the synthesis device to vary the salience of segments of the synthesized speech such that
(i) the salience signifiers within the prosodic subgroupings are first generated in accordance with predetermined salience placement rules solely relating to the components themselves,
(ii) the first generated salience signifiers are modified to increase the salience at the start of the prosodic subgroup and further signify the salience at the end of the prosodic subgroup, and
(iii) the salience signifiers are further modified to further increase the salience of the beginning of the major prosodic grouping and further signify the salience of the end of the major prosodic grouping.
15. The system of Claim 14 wherein the subgroup components are identified by:
a) identifying textual indicators which mark divisions of text groupings around them,
b) utilizing the textual indicators to separate the text within the prosodic subgrouping into units of nominal text which do not include the said predetermined textual markers, and
c) within the units of nominal text, identifying other indicators of textual groupings that are not predetermined textual markers of divisions, identifying nouns, and identifying qualifiers of nouns.
16. The system of Claim 14 wherein the salience signifiers are indicia of pitch.
17. The system of Claim 15 wherein the salience signifiers are indicia of pitch.
18. The system of Claims 1 or 11 wherein prosodic subgroupings are identified within the major prosodic groupings according to prosodic rules for analyzing the text for predetermined textual markers indicative of prosodically isolatible subgroupings not delineated by the major demarcations dividing the prosodic major groupings.
19. The system of Claim 18 wherein within the prosodic subgroupings, prosodically separable subgroup components are identified by:
a) identifying textual indicators which mark divisions of text groupings around them,
b) utilizing the textual indicators to separate the text within the prosodic subgrouping into units of nominal text which do not include the said predetermined textual markers, and
c) within the units of nominal text, identifying other indicators of textual groupings that are not predetermined textual markers of divisions, identifying nouns, and identifying qualifiers of nouns.
20. The system of Claim 19 wherein salience signifiers utilizable by the synthesis device to vary the salience of segments of the synthesized speech are generated such that
(i) the salience signifiers within the prosodic subgrouping are first generated in accordance with predetermined salience placement rules solely relating to the components themselves,
(ii) thereafter the first generated salience signifiers are modified to increase the salience at the start of the prosodic subgroup and further signify the salience at the end of the prosodic subgroup, and
(iii) the salience signifiers are further modified to further increase the salience of the major prosodic grouping and further signify the salience at the end of the major prosodic grouping.
21. The system of Claim 20 wherein the salience signifiers are indicia of pitch.
22. An automated system for synthesizing human audible speech from machine-readable representation of text wherein the system employs a synthesis device which has been designed for use with unrestricted text, said system including a prosody indicia generating means for automatically providing indicia of the text prosody to the synthesis device, said indicia being interpretable and executable by that device, and assigned on the basis of predetermined characteristics of restricted text, and wherein the indicia are generated by prosody rules associated with predetermined discourse constraints particular to the context of the synthesis of the text.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA2594071A CA2594071C (en) | 1993-03-19 | 1994-03-18 | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
CA2594073A CA2594073C (en) | 1993-03-19 | 1994-03-18 | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
US10/253,820 US20040057548A1 (en) | 1994-03-18 | 2002-09-25 | Quasi-synchronous multi-stage event synchronization apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US3352893A | 1993-03-19 | 1993-03-19 | |
US08/033,528 | 1993-03-19 |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA2594071A Division CA2594071C (en) | 1993-03-19 | 1994-03-18 | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
CA2594073A Division CA2594073C (en) | 1993-03-19 | 1994-03-18 | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2119397A1 CA2119397A1 (en) | 1994-09-20 |
CA2119397C true CA2119397C (en) | 2007-10-02 |
Family
ID=21870928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002119397A Expired - Lifetime CA2119397C (en) | 1993-03-19 | 1994-03-18 | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
Country Status (2)
Country | Link |
---|---|
US (6) | US5652828A (en) |
CA (1) | CA2119397C (en) |
Families Citing this family (319)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR950704772A (en) * | 1993-10-15 | 1995-11-20 | 데이비드 엠. 로젠블랫 | A method for training a system, the resulting apparatus, and method of use |
KR0153380B1 (en) * | 1995-10-28 | 1998-11-16 | 김광호 | Apparatus and method for guiding voice information of telephone switch |
DE69722277T2 (en) * | 1996-01-31 | 2004-04-01 | Canon K.K. | Billing device and an information distribution system using the billing device |
US5943648A (en) * | 1996-04-25 | 1999-08-24 | Lernout & Hauspie Speech Products N.V. | Speech signal distribution system providing supplemental parameter associated data |
US5832433A (en) * | 1996-06-24 | 1998-11-03 | Nynex Science And Technology, Inc. | Speech synthesis method for operator assistance telecommunications calls comprising a plurality of text-to-speech (TTS) devices |
JPH10153998A (en) * | 1996-09-24 | 1998-06-09 | Nippon Telegr & Teleph Corp <Ntt> | Auxiliary information utilizing type voice synthesizing method, recording medium recording procedure performing this method, and device performing this method |
US6961700B2 (en) * | 1996-09-24 | 2005-11-01 | Allvoice Computing Plc | Method and apparatus for processing the output of a speech recognition engine |
US6006187A (en) * | 1996-10-01 | 1999-12-21 | Lucent Technologies Inc. | Computer prosody user interface |
US5950162A (en) * | 1996-10-30 | 1999-09-07 | Motorola, Inc. | Method, device and system for generating segment durations in a text-to-speech system |
US6498921B1 (en) | 1999-09-01 | 2002-12-24 | Chi Fai Ho | Method and system to answer a natural-language question |
US5836771A (en) * | 1996-12-02 | 1998-11-17 | Ho; Chi Fai | Learning method and system based on questioning |
US5875427A (en) * | 1996-12-04 | 1999-02-23 | Justsystem Corp. | Voice-generating/document making apparatus voice-generating/document making method and computer-readable medium for storing therein a program having a computer execute voice-generating/document making sequence |
US6108630A (en) * | 1997-12-23 | 2000-08-22 | Nortel Networks Corporation | Text-to-speech driven annunciation of caller identification |
KR100236974B1 (en) | 1996-12-13 | 2000-02-01 | 정선종 | Sync. system between motion picture and text/voice converter |
US5915237A (en) * | 1996-12-13 | 1999-06-22 | Intel Corporation | Representing speech using MIDI |
JPH10260692A (en) * | 1997-03-18 | 1998-09-29 | Toshiba Corp | Method and system for recognition synthesis encoding and decoding of speech |
US6092044A (en) * | 1997-03-28 | 2000-07-18 | Dragon Systems, Inc. | Pronunciation generation in speech recognition |
KR100240637B1 (en) * | 1997-05-08 | 2000-01-15 | 정선종 | Syntax for tts input data to synchronize with multimedia |
JPH10319947A (en) * | 1997-05-15 | 1998-12-04 | Kawai Musical Instr Mfg Co Ltd | Pitch extent controller |
US6226614B1 (en) | 1997-05-21 | 2001-05-01 | Nippon Telegraph And Telephone Corporation | Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon |
BE1011892A3 (en) * | 1997-05-22 | 2000-02-01 | Motorola Inc | Method, device and system for generating voice synthesis parameters from information including express representation of intonation. |
JPH1138989A (en) * | 1997-07-14 | 1999-02-12 | Toshiba Corp | Device and method for voice synthesis |
JP3195279B2 (en) * | 1997-08-27 | 2001-08-06 | インターナショナル・ビジネス・マシーンズ・コーポレ−ション | Audio output system and method |
KR100238189B1 (en) * | 1997-10-16 | 2000-01-15 | 윤종용 | Multi-language tts device and method |
GB9723813D0 (en) * | 1997-11-11 | 1998-01-07 | Mitel Corp | Call routing based on caller's mood |
JP4267101B2 (en) * | 1997-11-17 | 2009-05-27 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Voice identification device, pronunciation correction device, and methods thereof |
JP2000163418A (en) * | 1997-12-26 | 2000-06-16 | Canon Inc | Processor and method for natural language processing and storage medium stored with program thereof |
JPH11265195A (en) * | 1998-01-14 | 1999-09-28 | Sony Corp | Information distribution system, information transmitter, information receiver and information distributing method |
EP1051701B1 (en) * | 1998-02-03 | 2002-11-06 | Siemens Aktiengesellschaft | Method for voice data transmission |
US6076060A (en) * | 1998-05-01 | 2000-06-13 | Compaq Computer Corporation | Computer method and apparatus for translating text to sound |
US6446040B1 (en) * | 1998-06-17 | 2002-09-03 | Yahoo! Inc. | Intelligent text-to-speech synthesis |
US6236967B1 (en) * | 1998-06-19 | 2001-05-22 | At&T Corp. | Tone and speech recognition in communications systems |
US6321226B1 (en) * | 1998-06-30 | 2001-11-20 | Microsoft Corporation | Flexible keyboard searching |
US6490563B2 (en) * | 1998-08-17 | 2002-12-03 | Microsoft Corporation | Proofreading with text to speech feedback |
US6338038B1 (en) * | 1998-09-02 | 2002-01-08 | International Business Machines Corp. | Variable speed audio playback in speech recognition proofreader |
US7272604B1 (en) * | 1999-09-03 | 2007-09-18 | Atle Hedloy | Method, system and computer readable medium for addressing handling from an operating system |
NO984066L (en) * | 1998-09-03 | 2000-03-06 | Arendi As | Computer function button |
DE19908137A1 (en) | 1998-10-16 | 2000-06-15 | Volkswagen Ag | Method and device for automatic control of at least one device by voice dialog |
US6188984B1 (en) * | 1998-11-17 | 2001-02-13 | Fonix Corporation | Method and system for syllable parsing |
US6260016B1 (en) | 1998-11-25 | 2001-07-10 | Matsushita Electric Industrial Co., Ltd. | Speech synthesis employing prosody templates |
US6208968B1 (en) | 1998-12-16 | 2001-03-27 | Compaq Computer Corporation | Computer method and apparatus for text-to-speech synthesizer dictionary reduction |
US6363342B2 (en) * | 1998-12-18 | 2002-03-26 | Matsushita Electric Industrial Co., Ltd. | System for developing word-pronunciation pairs |
US6400809B1 (en) * | 1999-01-29 | 2002-06-04 | Ameritech Corporation | Method and system for text-to-speech conversion of caller information |
US6185533B1 (en) | 1999-03-15 | 2001-02-06 | Matsushita Electric Industrial Co., Ltd. | Generation and synthesis of prosody templates |
CA2366952A1 (en) * | 1999-03-15 | 2000-09-21 | British Telecommunications Public Limited Company | Speech synthesis |
US6178402B1 (en) | 1999-04-29 | 2001-01-23 | Motorola, Inc. | Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network |
US6321196B1 (en) * | 1999-07-02 | 2001-11-20 | International Business Machines Corporation | Phonetic spelling for speech recognition |
US7013300B1 (en) | 1999-08-03 | 2006-03-14 | Taylor David C | Locating, filtering, matching macro-context from indexed database for searching context where micro-context relevant to textual input by user |
US7219073B1 (en) * | 1999-08-03 | 2007-05-15 | Brandnamestores.Com | Method for extracting information utilizing a user-context-based search engine |
US6622121B1 (en) | 1999-08-20 | 2003-09-16 | International Business Machines Corporation | Testing speech recognition systems using test data generated by text-to-speech conversion |
GB2353887B (en) * | 1999-09-04 | 2003-09-24 | Ibm | Speech recognition system |
US6807574B1 (en) | 1999-10-22 | 2004-10-19 | Tellme Networks, Inc. | Method and apparatus for content personalization over a telephone interface |
US7941481B1 (en) | 1999-10-22 | 2011-05-10 | Tellme Networks, Inc. | Updating an electronic phonebook over electronic communication networks |
GB2357943B (en) * | 1999-12-30 | 2004-12-08 | Nokia Mobile Phones Ltd | User interface for text to speech conversion |
US6571240B1 (en) | 2000-02-02 | 2003-05-27 | Chi Fai Ho | Information processing for searching categorizing information in a document based on a categorization hierarchy and extracted phrases |
JP2001293247A (en) * | 2000-02-07 | 2001-10-23 | Sony Computer Entertainment Inc | Game control method |
US7010489B1 (en) * | 2000-03-09 | 2006-03-07 | International Business Mahcines Corporation | Method for guiding text-to-speech output timing using speech recognition markers |
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US6272464B1 (en) * | 2000-03-27 | 2001-08-07 | Lucent Technologies Inc. | Method and apparatus for assembling a prediction list of name pronunciation variations for use during speech recognition |
US6697781B1 (en) * | 2000-04-17 | 2004-02-24 | Adobe Systems Incorporated | Method and apparatus for generating speech from an electronic form |
US7062098B1 (en) | 2000-05-12 | 2006-06-13 | International Business Machines Corporation | Method and apparatus for the scaling down of data |
US6970179B1 (en) | 2000-05-12 | 2005-11-29 | International Business Machines Corporation | Method and apparatus for the scaling up of data |
US20020120451A1 (en) * | 2000-05-31 | 2002-08-29 | Yumiko Kato | Apparatus and method for providing information by speech |
DE10031008A1 (en) * | 2000-06-30 | 2002-01-10 | Nokia Mobile Phones Ltd | Procedure for assembling sentences for speech output |
US7143039B1 (en) | 2000-08-11 | 2006-11-28 | Tellme Networks, Inc. | Providing menu and other services for an information processing system using a telephone or other audio interface |
US7092928B1 (en) * | 2000-07-31 | 2006-08-15 | Quantum Leap Research, Inc. | Intelligent portal engine |
US7269557B1 (en) * | 2000-08-11 | 2007-09-11 | Tellme Networks, Inc. | Coarticulated concatenated speech |
US7406657B1 (en) * | 2000-09-22 | 2008-07-29 | International Business Machines Corporation | Audible presentation and verbal interaction of HTML-like form constructs |
US6978239B2 (en) * | 2000-12-04 | 2005-12-20 | Microsoft Corporation | Method and apparatus for speech synthesis without prosody modification |
US7263488B2 (en) * | 2000-12-04 | 2007-08-28 | Microsoft Corporation | Method and apparatus for identifying prosodic word boundaries |
US6845358B2 (en) * | 2001-01-05 | 2005-01-18 | Matsushita Electric Industrial Co., Ltd. | Prosody template matching for text-to-speech systems |
US6845356B1 (en) * | 2001-01-31 | 2005-01-18 | International Business Machines Corporation | Processing dual tone multi-frequency signals for use with a natural language understanding system |
US6876968B2 (en) * | 2001-03-08 | 2005-04-05 | Matsushita Electric Industrial Co., Ltd. | Run time synthesizer adaptation to improve intelligibility of synthesized speech |
US6915261B2 (en) * | 2001-03-16 | 2005-07-05 | Intel Corporation | Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs |
US7177810B2 (en) * | 2001-04-10 | 2007-02-13 | Sri International | Method and apparatus for performing prosody-based endpointing of a speech signal |
US7020663B2 (en) * | 2001-05-30 | 2006-03-28 | George M. Hay | System and method for the delivery of electronic books |
JP4680429B2 (en) * | 2001-06-26 | 2011-05-11 | Okiセミコンダクタ株式会社 | High speed reading control method in text-to-speech converter |
GB2378877B (en) * | 2001-08-14 | 2005-04-13 | Vox Generation Ltd | Prosodic boundary markup mechanism |
US7069221B2 (en) * | 2001-10-26 | 2006-06-27 | Speechworks International, Inc. | Non-target barge-in detection |
US20030101045A1 (en) * | 2001-11-29 | 2003-05-29 | Peter Moffatt | Method and apparatus for playing recordings of spoken alphanumeric characters |
JP2003186490A (en) * | 2001-12-21 | 2003-07-04 | Nissan Motor Co Ltd | Text voice read-aloud device and information providing system |
US20040030554A1 (en) * | 2002-01-09 | 2004-02-12 | Samya Boxberger-Oberoi | System and method for providing locale-specific interpretation of text data |
US7177814B2 (en) * | 2002-02-07 | 2007-02-13 | Sap Aktiengesellschaft | Dynamic grammar for voice-enabled applications |
JP4150198B2 (en) * | 2002-03-15 | 2008-09-17 | ソニー株式会社 | Speech synthesis method, speech synthesis apparatus, program and recording medium, and robot apparatus |
KR100446627B1 (en) * | 2002-03-29 | 2004-09-04 | 삼성전자주식회사 | Apparatus for providing information using voice dialogue interface and method thereof |
US7136818B1 (en) | 2002-05-16 | 2006-11-14 | At&T Corp. | System and method of providing conversational visual prosody for talking heads |
US7076430B1 (en) | 2002-05-16 | 2006-07-11 | At&T Corp. | System and method of providing conversational visual prosody for talking heads |
US7305340B1 (en) * | 2002-06-05 | 2007-12-04 | At&T Corp. | System and method for configuring voice synthesis |
US7143037B1 (en) * | 2002-06-12 | 2006-11-28 | Cisco Technology, Inc. | Spelling words using an arbitrary phonetic alphabet |
US7386449B2 (en) | 2002-12-11 | 2008-06-10 | Voice Enabling Systems Technology Inc. | Knowledge-based flexible natural speech dialogue system |
US7324944B2 (en) * | 2002-12-12 | 2008-01-29 | Brigham Young University, Technology Transfer Office | Systems and methods for dynamically analyzing temporality in speech |
US8285537B2 (en) * | 2003-01-31 | 2012-10-09 | Comverse, Inc. | Recognition of proper nouns using native-language pronunciation |
US7496498B2 (en) * | 2003-03-24 | 2009-02-24 | Microsoft Corporation | Front-end architecture for a multi-lingual text-to-speech system |
US7313523B1 (en) * | 2003-05-14 | 2007-12-25 | Apple Inc. | Method and apparatus for assigning word prominence to new or previous information in speech synthesis |
US20050027523A1 (en) * | 2003-07-31 | 2005-02-03 | Prakairut Tarlton | Spoken language system |
JP3984207B2 (en) * | 2003-09-04 | 2007-10-03 | 株式会社東芝 | Speech recognition evaluation apparatus, speech recognition evaluation method, and speech recognition evaluation program |
US8886538B2 (en) * | 2003-09-26 | 2014-11-11 | Nuance Communications, Inc. | Systems and methods for text-to-speech synthesis using spoken example |
US8103505B1 (en) | 2003-11-19 | 2012-01-24 | Apple Inc. | Method and apparatus for speech synthesis using paralinguistic variation |
US7349836B2 (en) * | 2003-12-12 | 2008-03-25 | International Business Machines Corporation | Method and process to generate real time input/output in a voice XML run-time simulation environment |
US8583439B1 (en) * | 2004-01-12 | 2013-11-12 | Verizon Services Corp. | Enhanced interface for use with speech recognition |
AU2005207606B2 (en) * | 2004-01-16 | 2010-11-11 | Nuance Communications, Inc. | Corpus-based speech synthesis based on segment recombination |
WO2005076258A1 (en) * | 2004-02-03 | 2005-08-18 | Matsushita Electric Industrial Co., Ltd. | User adaptive type device and control method thereof |
US7542903B2 (en) * | 2004-02-18 | 2009-06-02 | Fuji Xerox Co., Ltd. | Systems and methods for determining predictive models of discourse functions |
US20050187772A1 (en) * | 2004-02-25 | 2005-08-25 | Fuji Xerox Co., Ltd. | Systems and methods for synthesizing speech using discourse function level prosodic features |
US20050234724A1 (en) * | 2004-04-15 | 2005-10-20 | Andrew Aaron | System and method for improving text-to-speech software intelligibility through the detection of uncommon words and phrases |
KR100590553B1 (en) * | 2004-05-21 | 2006-06-19 | 삼성전자주식회사 | Method and apparatus for generating dialog prosody structure and speech synthesis method and system employing the same |
US7788098B2 (en) * | 2004-08-02 | 2010-08-31 | Nokia Corporation | Predicting tone pattern information for textual information used in telecommunication systems |
US7580837B2 (en) | 2004-08-12 | 2009-08-25 | At&T Intellectual Property I, L.P. | System and method for targeted tuning module of a speech recognition system |
US20080154601A1 (en) * | 2004-09-29 | 2008-06-26 | Microsoft Corporation | Method and system for providing menu and other services for an information processing system using a telephone or other audio interface |
US7242751B2 (en) | 2004-12-06 | 2007-07-10 | Sbc Knowledge Ventures, L.P. | System and method for speech recognition-enabled automatic call routing |
US7751551B2 (en) | 2005-01-10 | 2010-07-06 | At&T Intellectual Property I, L.P. | System and method for speech-enabled call routing |
US7627096B2 (en) * | 2005-01-14 | 2009-12-01 | At&T Intellectual Property I, L.P. | System and method for independently recognizing and selecting actions and objects in a speech recognition system |
US7792264B2 (en) * | 2005-03-23 | 2010-09-07 | Alcatel-Lucent Usa Inc. | Ring tone selected by calling party of third party played to called party |
JP4570509B2 (en) * | 2005-04-22 | 2010-10-27 | 富士通株式会社 | Reading generation device, reading generation method, and computer program |
US20060245641A1 (en) * | 2005-04-29 | 2006-11-02 | Microsoft Corporation | Extracting data from semi-structured information utilizing a discriminative context free grammar |
US20080294433A1 (en) * | 2005-05-27 | 2008-11-27 | Minerva Yeung | Automatic Text-Speech Mapping Tool |
US7657020B2 (en) | 2005-06-03 | 2010-02-02 | At&T Intellectual Property I, Lp | Call routing system and method of using the same |
JP2007024960A (en) * | 2005-07-12 | 2007-02-01 | Internatl Business Mach Corp <Ibm> | System, program and control method |
US8429167B2 (en) | 2005-08-08 | 2013-04-23 | Google Inc. | User-context-based search engine |
US8027876B2 (en) | 2005-08-08 | 2011-09-27 | Yoogli, Inc. | Online advertising valuation apparatus and method |
US8977636B2 (en) * | 2005-08-19 | 2015-03-10 | International Business Machines Corporation | Synthesizing aggregate data of disparate data types into data of a uniform data type |
US20070055526A1 (en) * | 2005-08-25 | 2007-03-08 | International Business Machines Corporation | Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
TWI277947B (en) * | 2005-09-14 | 2007-04-01 | Delta Electronics Inc | Interactive speech correcting method |
CN1945693B (en) * | 2005-10-09 | 2010-10-13 | 株式会社东芝 | Method and device for training a prosodic statistical model, prosodic segmentation, and speech synthesis |
US20070094270A1 (en) * | 2005-10-21 | 2007-04-26 | Callminer, Inc. | Method and apparatus for the processing of heterogeneous units of work |
US8694319B2 (en) * | 2005-11-03 | 2014-04-08 | International Business Machines Corporation | Dynamic prosody adjustment for voice-rendering synthesized data |
US20070162430A1 (en) * | 2005-12-30 | 2007-07-12 | Katja Bader | Context display of search results |
JP4822847B2 (en) * | 2006-01-10 | 2011-11-24 | アルパイン株式会社 | Audio conversion processor |
US8509563B2 (en) | 2006-02-02 | 2013-08-13 | Microsoft Corporation | Generation of documents from images |
US9135339B2 (en) * | 2006-02-13 | 2015-09-15 | International Business Machines Corporation | Invoking an audio hyperlink |
US8036894B2 (en) * | 2006-02-16 | 2011-10-11 | Apple Inc. | Multi-unit approach to text-to-speech synthesis |
US20090319273A1 (en) * | 2006-06-30 | 2009-12-24 | Nec Corporation | Audio content generation system, information exchanging system, program, audio content generating method, and information exchanging method |
US8280734B2 (en) | 2006-08-16 | 2012-10-02 | Nuance Communications, Inc. | Systems and arrangements for titling audio recordings comprising a lingual translation of the title |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8027837B2 (en) * | 2006-09-15 | 2011-09-27 | Apple Inc. | Using non-speech sounds during text-to-speech synthesis |
US9318100B2 (en) | 2007-01-03 | 2016-04-19 | International Business Machines Corporation | Supplementing audio recorded in a media file |
US8380519B2 (en) * | 2007-01-25 | 2013-02-19 | Eliza Corporation | Systems and techniques for producing spoken voice prompts with dialog-context-optimized speech parameters |
US8626731B2 (en) * | 2007-02-01 | 2014-01-07 | The Invention Science Fund I, Llc | Component information and auxiliary information related to information management |
US8055648B2 (en) * | 2007-02-01 | 2011-11-08 | The Invention Science Fund I, Llc | Managing information related to communication |
JP4672686B2 (en) * | 2007-02-16 | 2011-04-20 | 株式会社デンソー | Voice recognition device and navigation device |
US8719027B2 (en) * | 2007-02-28 | 2014-05-06 | Microsoft Corporation | Name synthesis |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US7895041B2 (en) * | 2007-04-27 | 2011-02-22 | Dickson Craig B | Text to speech interactive voice response system |
US20080282153A1 (en) * | 2007-05-09 | 2008-11-13 | Sony Ericsson Mobile Communications Ab | Text-content features |
JP5029168B2 (en) * | 2007-06-25 | 2012-09-19 | 富士通株式会社 | Apparatus, program and method for reading aloud |
JP5029167B2 (en) * | 2007-06-25 | 2012-09-19 | 富士通株式会社 | Apparatus, program and method for reading aloud |
JP4973337B2 (en) * | 2007-06-28 | 2012-07-11 | 富士通株式会社 | Apparatus, program and method for reading aloud |
US20090083027A1 (en) * | 2007-08-16 | 2009-03-26 | Hollingsworth William A | Automatic text skimming using lexical chains |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
WO2009101837A1 (en) * | 2008-02-13 | 2009-08-20 | Nec Corporation | Mark insertion device and mark insertion method |
US20090209341A1 (en) * | 2008-02-14 | 2009-08-20 | Aruze Gaming America, Inc. | Gaming Apparatus Capable of Conversation with Player and Control Method Thereof |
JP4968147B2 (en) * | 2008-03-31 | 2012-07-04 | 富士通株式会社 | Communication terminal, audio output adjustment method of communication terminal |
EP2107553B1 (en) * | 2008-03-31 | 2011-05-18 | Harman Becker Automotive Systems GmbH | Method for determining barge-in |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
EP2148325B1 (en) * | 2008-07-22 | 2014-10-01 | Nuance Communications, Inc. | Method for determining the presence of a wanted signal component |
US10127231B2 (en) * | 2008-07-22 | 2018-11-13 | At&T Intellectual Property I, L.P. | System and method for rich media annotation |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
US20100057465A1 (en) * | 2008-09-03 | 2010-03-04 | David Michael Kirsch | Variable text-to-speech for automotive application |
US8219899B2 (en) | 2008-09-22 | 2012-07-10 | International Business Machines Corporation | Verbal description method and system |
WO2010067118A1 (en) | 2008-12-11 | 2010-06-17 | Novauris Technologies Limited | Speech recognition involving a mobile device |
US8799268B2 (en) * | 2008-12-17 | 2014-08-05 | International Business Machines Corporation | Consolidating tags |
US8494857B2 (en) | 2009-01-06 | 2013-07-23 | Regents Of The University Of Minnesota | Automatic measurement of speech fluency |
US8498866B2 (en) * | 2009-01-15 | 2013-07-30 | K-Nfb Reading Technology, Inc. | Systems and methods for multiple language document narration |
US8719004B2 (en) * | 2009-03-19 | 2014-05-06 | Ditech Networks, Inc. | Systems and methods for punctuating voicemail transcriptions |
JP5269668B2 (en) * | 2009-03-25 | 2013-08-21 | 株式会社東芝 | Speech synthesis apparatus, program, and method |
US20100299621A1 (en) * | 2009-05-20 | 2010-11-25 | Making Everlasting Memories, L.L.C. | System and Method for Extracting a Plurality of Images from a Single Scan |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US20120311585A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Organizing task items that represent tasks to perform |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
GB0922608D0 (en) | 2009-12-23 | 2010-02-10 | Vratskides Alexios | Message optimization |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
WO2011089450A2 (en) | 2010-01-25 | 2011-07-28 | Andrew Peter Nelson Jerram | Apparatuses, methods and systems for a digital conversation management platform |
US8447610B2 (en) | 2010-02-12 | 2013-05-21 | Nuance Communications, Inc. | Method and apparatus for generating synthetic speech with contrastive stress |
US8571870B2 (en) * | 2010-02-12 | 2013-10-29 | Nuance Communications, Inc. | Method and apparatus for generating synthetic speech with contrastive stress |
US8949128B2 (en) * | 2010-02-12 | 2015-02-03 | Nuance Communications, Inc. | Method and apparatus for providing speech output for speech-enabled applications |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
CN102237081B (en) * | 2010-04-30 | 2013-04-24 | 国际商业机器公司 | Method and system for estimating rhythm of voice |
US9798653B1 (en) * | 2010-05-05 | 2017-10-24 | Nuance Communications, Inc. | Methods, apparatus and data structure for cross-language speech adaptation |
US20110313762A1 (en) * | 2010-06-20 | 2011-12-22 | International Business Machines Corporation | Speech output with confidence indication |
US8731939B1 (en) | 2010-08-06 | 2014-05-20 | Google Inc. | Routing queries based on carrier phrase registration |
US9792640B2 (en) | 2010-08-18 | 2017-10-17 | Jinni Media Ltd. | Generating and providing content recommendations to a group of users |
US8688435B2 (en) | 2010-09-22 | 2014-04-01 | Voice On The Go Inc. | Systems and methods for normalizing input media |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
JP4996750B1 (en) * | 2011-01-31 | 2012-08-08 | 株式会社東芝 | Electronic apparatus |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US9092131B2 (en) * | 2011-12-13 | 2015-07-28 | Microsoft Technology Licensing, Llc | Highlighting of tappable web page elements |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
CN103295576A (en) * | 2012-03-02 | 2013-09-11 | 腾讯科技(深圳)有限公司 | Voice identification method and terminal of instant communication |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9418649B2 (en) * | 2012-03-06 | 2016-08-16 | Verizon Patent And Licensing Inc. | Method and apparatus for phonetic character conversion |
US9576593B2 (en) | 2012-03-15 | 2017-02-21 | Regents Of The University Of Minnesota | Automated verbal fluency assessment |
US9368104B2 (en) * | 2012-04-30 | 2016-06-14 | Src, Inc. | System and method for synthesizing human speech using multiple speakers and context |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US10395270B2 (en) | 2012-05-17 | 2019-08-27 | Persado Intellectual Property Limited | System and method for recommending a grammar for a message campaign used by a message optimization system |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9502050B2 (en) | 2012-06-10 | 2016-11-22 | Nuance Communications, Inc. | Noise dependent signal processing for in-car communication systems with multiple acoustic zones |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9536528B2 (en) | 2012-07-03 | 2017-01-03 | Google Inc. | Determining hotword suitability |
CN104704560B (en) | 2012-09-04 | 2018-06-05 | 纽昂斯通讯公司 | Formant-dependent speech signal enhancement |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
JP5999839B2 (en) * | 2012-09-10 | 2016-09-28 | ルネサスエレクトロニクス株式会社 | Voice guidance system and electronic equipment |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
US9064318B2 (en) | 2012-10-25 | 2015-06-23 | Adobe Systems Incorporated | Image matting and alpha value techniques |
US9613633B2 (en) | 2012-10-30 | 2017-04-04 | Nuance Communications, Inc. | Speech enhancement |
US9355649B2 (en) | 2012-11-13 | 2016-05-31 | Adobe Systems Incorporated | Sound alignment using timing information |
US10638221B2 (en) | 2012-11-13 | 2020-04-28 | Adobe Inc. | Time interval sound alignment |
US9201580B2 (en) | 2012-11-13 | 2015-12-01 | Adobe Systems Incorporated | Sound alignment user interface |
US9076205B2 (en) | 2012-11-19 | 2015-07-07 | Adobe Systems Incorporated | Edge direction and curve based image de-blurring |
US10249321B2 (en) * | 2012-11-20 | 2019-04-02 | Adobe Inc. | Sound rate modification |
US9451304B2 (en) | 2012-11-29 | 2016-09-20 | Adobe Systems Incorporated | Sound feature priority alignment |
US9135710B2 (en) | 2012-11-30 | 2015-09-15 | Adobe Systems Incorporated | Depth map stereo correspondence techniques |
US10455219B2 (en) | 2012-11-30 | 2019-10-22 | Adobe Inc. | Stereo correspondence and depth sensors |
US9208547B2 (en) | 2012-12-19 | 2015-12-08 | Adobe Systems Incorporated | Stereo correspondence smoothness tool |
US10249052B2 (en) | 2012-12-19 | 2019-04-02 | Adobe Systems Incorporated | Stereo correspondence model fitting |
US9214026B2 (en) | 2012-12-20 | 2015-12-15 | Adobe Systems Incorporated | Belief propagation and affinity measures |
TWI573129B (en) * | 2013-02-05 | 2017-03-01 | 國立交通大學 | Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing |
KR102516577B1 (en) | 2013-02-07 | 2023-04-03 | 애플 인크. | Voice trigger for a digital assistant |
US9123335B2 (en) * | 2013-02-20 | 2015-09-01 | Jinni Media Limited | System apparatus circuit method and associated computer executable code for natural language understanding and semantic content discovery |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
WO2014144949A2 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | Training an at least partial voice command system |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
WO2014148190A1 (en) * | 2013-03-19 | 2014-09-25 | Necソリューションイノベータ株式会社 | Note-taking assistance system, information delivery device, terminal, note-taking assistance method, and computer-readable recording medium |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
EP3008641A1 (en) | 2013-06-09 | 2016-04-20 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
CN105265005B (en) | 2013-06-13 | 2019-09-17 | 苹果公司 | System and method for emergency calls initiated by voice command |
WO2015020942A1 (en) | 2013-08-06 | 2015-02-12 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
WO2015105994A1 (en) | 2014-01-08 | 2015-07-16 | Callminer, Inc. | Real-time conversational analytics facility |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
EP3149728B1 (en) | 2014-05-30 | 2019-01-16 | Apple Inc. | Multi-command single utterance input method |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9606986B2 (en) | 2014-09-29 | 2017-03-28 | Apple Inc. | Integrated word N-gram and class M-gram language models |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US9472196B1 (en) | 2015-04-22 | 2016-10-18 | Google Inc. | Developer voice actions system |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10504137B1 (en) | 2015-10-08 | 2019-12-10 | Persado Intellectual Property Limited | System, method, and computer program product for monitoring and responding to the performance of an ad |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10832283B1 (en) | 2015-12-09 | 2020-11-10 | Persado Intellectual Property Limited | System, method, and computer program for providing an instance of a promotional message to a user based on a predicted emotional response corresponding to user characteristics |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US9740751B1 (en) | 2016-02-18 | 2017-08-22 | Google Inc. | Application keywords |
US9922648B2 (en) | 2016-03-01 | 2018-03-20 | Google Llc | Developer voice actions system |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US9691384B1 (en) | 2016-08-19 | 2017-06-27 | Google Inc. | Voice action biasing system |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US10586079B2 (en) * | 2016-12-23 | 2020-03-10 | Soundhound, Inc. | Parametric adaptation of voice synthesis |
WO2018175892A1 (en) * | 2017-03-23 | 2018-09-27 | D&M Holdings, Inc. | System providing expressive and emotive text-to-speech |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | Synchronization and task delegation of a digital assistant |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US11443646B2 (en) | 2017-12-22 | 2022-09-13 | Fathom Technologies, LLC | E-Reader interface system with audio and highlighting synchronization for digital books |
US10671251B2 (en) | 2017-12-22 | 2020-06-02 | Arbordale Publishing, LLC | Interactive eReader interface generation based on synchronization of textual and audial descriptors |
WO2020230924A1 (en) * | 2019-05-15 | 2020-11-19 | LG Electronics Inc. | Speech synthesis apparatus using artificial intelligence, operation method of speech synthesis apparatus, and computer-readable recording medium |
CN112309368A (en) * | 2020-11-23 | 2021-02-02 | Beijing Youzhuju Network Technology Co., Ltd. | Prosody prediction method, device, equipment and storage medium |
CN112820289A (en) * | 2020-12-31 | 2021-05-18 | Guangdong Midea Kitchen Appliances Manufacturing Co., Ltd. | Voice playing method, voice playing system, electric appliance and readable storage medium |
Family Cites Families (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3704345A (en) * | 1971-03-19 | 1972-11-28 | Bell Telephone Labor Inc | Conversion of printed text into synthetic speech |
US4685135A (en) * | 1981-03-05 | 1987-08-04 | Texas Instruments Incorporated | Text-to-speech synthesis system |
NL8200726A (en) * | 1982-02-24 | 1983-09-16 | Philips Nv | DEVICE FOR GENERATING THE AUDITIVE INFORMATION FROM A COLLECTION OF CHARACTERS. |
US4470150A (en) * | 1982-03-18 | 1984-09-04 | Federal Screw Works | Voice synthesizer with automatic pitch and speech rate modulation |
US4624012A (en) * | 1982-05-06 | 1986-11-18 | Texas Instruments Incorporated | Method and apparatus for converting voice characteristics of synthesized speech |
FR2553555B1 (en) * | 1983-10-14 | 1986-04-11 | Texas Instruments France | SPEECH CODING METHOD AND DEVICE FOR IMPLEMENTING IT |
US4797930A (en) * | 1983-11-03 | 1989-01-10 | Texas Instruments Incorporated | constructed syllable pitch patterns from phonological linguistic unit string data |
US4802223A (en) * | 1983-11-03 | 1989-01-31 | Texas Instruments Incorporated | Low data rate speech encoding employing syllable pitch patterns |
US4695962A (en) * | 1983-11-03 | 1987-09-22 | Texas Instruments Incorporated | Speaking apparatus having differing speech modes for word and phrase synthesis |
US4692941A (en) * | 1984-04-10 | 1987-09-08 | First Byte | Real-time text-to-speech conversion system |
US4783811A (en) * | 1984-12-27 | 1988-11-08 | Texas Instruments Incorporated | Method and apparatus for determining syllable boundaries |
US4831654A (en) * | 1985-09-09 | 1989-05-16 | Wang Laboratories, Inc. | Apparatus for making and editing dictionary entries in a text to speech conversion system |
US4829580A (en) * | 1986-03-26 | 1989-05-09 | American Telephone and Telegraph Company, AT&T Bell Laboratories | Text analysis system with letter sequence recognition and speech stress assignment arrangement |
US4884972A (en) * | 1986-11-26 | 1989-12-05 | Bright Star Technology, Inc. | Speech synchronized animation |
JPS63285598A (en) * | 1987-05-18 | 1988-11-22 | KDD Corporation | Phoneme connection type parameter rule synthesization system |
GB2207027B (en) * | 1987-07-15 | 1992-01-08 | Matsushita Electric Works Ltd | Voice encoding and composing system |
JP2623586B2 (en) * | 1987-07-31 | 1997-06-25 | 国際電信電話株式会社 | Pitch control method in speech synthesis |
US4908867A (en) * | 1987-11-19 | 1990-03-13 | British Telecommunications Public Limited Company | Speech synthesis |
DE68913669T2 (en) * | 1988-11-23 | 1994-07-21 | Digital Equipment Corp | Pronunciation of names by a synthesizer. |
US4979216A (en) * | 1989-02-17 | 1990-12-18 | Malsheen Bathsheba J | Text to speech synthesis system and method using context dependent vowel allophones |
JPH031200A (en) * | 1989-05-29 | 1991-01-07 | Nec Corp | Regulation type voice synthesizing device |
US5212731A (en) * | 1990-09-17 | 1993-05-18 | Matsushita Electric Industrial Co. Ltd. | Apparatus for providing sentence-final accents in synthesized american english speech |
KR940002854B1 (en) * | 1991-11-06 | 1994-04-04 | Korea Telecommunication Authority | Sound synthesizing system |
DE69232112T2 (en) * | 1991-11-12 | 2002-03-14 | Fujitsu Ltd | Speech synthesis device |
EP0543329B1 (en) * | 1991-11-18 | 2002-02-06 | Kabushiki Kaisha Toshiba | Speech dialogue system for facilitating human-computer interaction |
US5475796A (en) * | 1991-12-20 | 1995-12-12 | Nec Corporation | Pitch pattern generation apparatus |
JP3083640B2 (en) * | 1992-05-28 | 2000-09-04 | 株式会社東芝 | Voice synthesis method and apparatus |
US5384893A (en) * | 1992-09-23 | 1995-01-24 | Emerson & Stern Associates, Inc. | Method and apparatus for speech synthesis based on prosodic analysis |
US5636325A (en) * | 1992-11-13 | 1997-06-03 | International Business Machines Corporation | Speech synthesis and analysis of dialects |
US5642466A (en) * | 1993-01-21 | 1997-06-24 | Apple Computer, Inc. | Intonation adjustment in text-to-speech systems |
- 1994
  - 1994-03-18 CA CA002119397A patent/CA2119397C/en not_active Expired - Lifetime
- 1996
  - 1996-03-01 US US08/641,480 patent/US5652828A/en not_active Expired - Lifetime
- 1997
  - 1997-01-29 US US08/790,579 patent/US5751906A/en not_active Expired - Lifetime
  - 1997-01-29 US US08/790,580 patent/US5749071A/en not_active Expired - Lifetime
  - 1997-01-29 US US08/790,581 patent/US5732395A/en not_active Expired - Lifetime
  - 1997-01-29 US US08/790,578 patent/US5832435A/en not_active Expired - Lifetime
  - 1997-03-14 US US08/818,705 patent/US5890117A/en not_active Expired - Lifetime
Also Published As
Publication number | Publication date |
---|---|
US5652828A (en) | 1997-07-29 |
US5832435A (en) | 1998-11-03 |
CA2119397A1 (en) | 1994-09-20 |
US5751906A (en) | 1998-05-12 |
US5749071A (en) | 1998-05-05 |
US5890117A (en) | 1999-03-30 |
US5732395A (en) | 1998-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2119397C (en) | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation | |
Bendor-Samuel | Niger-Congo, Gur | |
AU675591B2 (en) | Speech synthesis | |
CA2221762C (en) | Ideal phonetic unit duration adjustment for text-to-speech system | |
EP0831460A3 (en) | Speech synthesis method utilizing auxiliary information | |
Frankish | Intonation and auditory grouping in immediate serial recall | |
EP0848372A3 (en) | Speech synthesizing system and redundancy-reduced waveform database therefor | |
EP0059880A3 (en) | Text-to-speech synthesis system | |
EP0376501A3 (en) | Speech recognition system | |
CA2076495A1 (en) | Dynamic routing-administration | |
EP0953970A3 (en) | Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word | |
JPS57158900A (en) | Text voice synthesizer | |
EP0749109A3 (en) | Speech recognition for tonal languages | |
TW314227U (en) | Language-information providing apparatus | |
Browman | Rules for demisyllable synthesis using LINGUA, a language interpreter | |
CA2294027A1 (en) | Method and apparatus for audibly indicating when a predetermined location has been encountered in stored data | |
Chenoweth et al. | Comparative‐Generative Models of a New Guinea Melodic Structure 1 | |
Baumann et al. | On the prosody of German telephone numbers | |
Ao | Non-uniqueness condition and the segmentation of the Chinese syllable | |
CA2594071A1 (en) | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation | |
Wagner | The role of prosody in laryngeal neutralization | |
Gates | Forging an American poetry from speech rhythms: Williams after Whitman | |
Gustafson | Transcribing names with foreign origin in the ONOMASTICA project | |
Belhoula | A concept for the synthesis of names | |
Ramsey | The invention of the alphabet and the history of the Korean language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| EEER | Examination request | |
| MKEX | Expiry | Effective date: 20140318 |