United States Patent
Patent Number: 5,146,405
Date of Patent: Sep. 8, 1992
 METHODS FOR PART-OF-SPEECH DETERMINATION AND USAGE
 Inventor: Kenneth W. Church, Chatham, N.J.
 Assignee: AT&T Bell Laboratories, Murray Hill, N.J.
 Appl. No.: 152,740
 Filed: Feb. 5, 1988
Int. Cl.5: G06F 15/38
U.S. Cl.: 364/419; 364/400
Field of Search: 381/51-53, 41, 43-45; 364/513.5, 900 MS File, 200 MS File, 419; 434/167, 169
 References Cited
U.S. PATENT DOCUMENTS
3,704,345 11/1972 Coker et al 179/1 SA
4,456,973 6/1984 Carlgren et al 364/900
4,580,218 4/1986 Raye 364/300
4,586,160 4/1986 Amano et al 364/900
4,661,924 4/1987 Okamoto et al 364/900
4,674,065 6/1987 Lange et al 364/900
4,688,195 8/1987 Thompson et al 364/300
4,692,941 9/1987 Jacks et al 381/52
OTHER PUBLICATIONS

DeRose, "Grammatical Category Disambiguation by Statistical Optimization", Computational Linguistics, vol. 14, No. 1, Jul. 1988, pp. 31-39.
Cherry et al., "Writing Tools—The Style and Diction Programs", AT&T Bell Laboratories, pp. 1-14.
Jelinek, "Markov Modeling of Text Generation", Proceedings of the NATO Advanced Study Institute, Martinus Nijhoff Publishers, 1985, pp. 569-598.
Vivalda, E., "Contextual Syntactic Analysis for Text-to-Speech Conversion", European Conf. on Speech Technology, vol. 1, Sep. 1987, Edinburgh, GB, pp. 389-392.
Merialdo, B., "Probabilistic Grammar for Phonetic to French Transcription", ICASSP 85 Proc., vol. 4, Mar. 1985, Florida, U.S., pp. 1577-1580.
Allen, J., "Machine-to-Man Communication by Speech, Part II: Synthesis of Prosodic Features of Speech by Rule", Proc. of the Spring Joint Computer Conf., Atlantic City, N.J., Apr. 30, 1968, pp. 339-344.
Wallraff, B., "The Literate Computer", The Atlantic Monthly, Jan. 1988, pp. 64 at 68.
Leech, G., et al., "The Automatic Grammatical Tagging of the LOB Corpus", ICAME News, 7, 13-33 (1983).
Marcus, M., A Theory of Syntactic Recognition for Natural Language, MIT Press, Cambridge, Mass., 1980, pp. 37, 38, 175, 199-201.
Fudge, E., English Word Stress, George Allen & Unwin (Publishers) Ltd., London, 1984.
Francis, W. N., et al., Frequency Analysis of English Usage, Houghton Mifflin Co., 1982, pp. 6-8 ("List of
Allen, J., From Text to Speech: The MITalk System, Cambridge University Press, Cambridge, Mass., 1987, Chapter 10, "The Fundamental Frequency Generator".
Cherry, L. L., "A System for Assigning Word Classes to English Text", Computer Science Technical Report No. 81, Jun. 1978.
Primary Examiner—Emanuel S. Kemeny
Attorney, Agent, or Firm—G. E. Nelson; W. L. Wisner
Methods for determination of parts of speech of words in a text or other non-verbal record are extended to include so-called Viterbi optimization based on stored statistical data relating to actual usage and to include noun-phrase parsing. The part-of-speech tagging method optimizes the product of individual word lexical probabilities and normalized three-word contextual probabilities. Normalization involves dividing by the contained two-word contextual probabilities. The method for noun phrase parsing involves optimizing the choices of, typically non-recursive, noun phrases by considering all possible beginnings and endings thereof, preferably based on the output of the part-of-speech tagging method.
6 Claims, 2 Drawing Sheets
U.S. Patent, Sep. 8, 1992, Sheet 2 of 2, 5,146,405
BACKGROUND OF THE INVENTION
It has long been recognized that the ability to determine the parts of speech of words, especially words that can be used as different parts of speech, is relevant to many different problems in the use of the English language. For example, it is known that speech "stress", including pitch, duration and energy, depends on the particular parts of speech of words and their sentence order. Accordingly, speech synthesis needs part-of-speech analysis of the input written or other non-verbal text to produce a result that sounds like human speech.
Moreover, automatic part-of-speech determination can play an important role in automatic speech recognition, in the education and training of writers by computer-assisted methods, in editing and proofreading of documents generated at a word-processing work station, in the indexing of a document, and in various forms of retrieval of word-dependent data from a data base.
For example, some of these uses can be found in various versions of AT&T's Writer's Workbench®. See the article by Barbara Wallraff, "The Literate Computer," in The Atlantic Monthly, January 1988, pp. 64ff, especially page 68, the last two paragraphs. The relationship of parts of speech to indexing can be found in U.S. Pat. No. 4,580,218 issued Apr. 1, 1986, to C. L. Raye.
Heretofore, two principal methods for automatic part-of-speech determination have been discussed in the literature and, to some extent, employed. The first depends on a variety of "ad hoc" rules designed to detect particular situations of interest. These rules may relate, for example, to using word endings to predict part of speech, or to some adaptation thereof. Some ad hoc rules for part-of-speech determination have been used in the Writer's Workbench® application program running under the Unix™ Operating System. These rules tend to be very limited in the situations they can successfully resolve and to lack underlying unity. That technique is described in Computer Science Technical Report No. 81, "PARTS—A System for Assigning Word Classes to English Text", by L. L. Cherry, Jun. 1978, Bell Telephone Laboratories, Incorporated.
The second principal method, which potentially has greater underlying unity, is the "n-gram" technique described in the article "The Automatic Grammatical Tagging of the LOB Corpus", in ICAME News, Vol. 7, pp. 13-33, by G. Leech et al., 1983, University of Lancaster, England. Part of the technique there described makes the assigned part of speech depend on the current best choices of parts of speech of certain preceding or following words, based on certain rules as to likely combinations of successive parts of speech. With this analysis, various ad hoc rules are also used, so that, overall, this method is still less accurate than desirable. In addition, this method fails to model lexical probabilities in a systematic fashion.
The foregoing techniques have not generated substantial interest among researchers in the art, both for the reasons just given and because the results have been disappointing.
Indeed, it has been speculated that any "n-gram" technique will yield poor results because it cannot take a sufficiently wide, or overall, view of the likely structure of the sentence. On the other hand, it has not been possible to program robustly into a computer the kind of overall view a human mind takes in analyzing the parts of speech in a sentence. See the book A Theory of Syntactic Recognition for Natural Language, by M. Marcus, MIT Press, Cambridge, Mass., 1980. Consequently, "n-gram" part-of-speech determination, as contrasted with "n-gram" word frequency-of-occurrence analysis, has been largely limited to tasks such as helping to generate larger bodies of fully "tagged" text to be used in further research. For that purpose, the results must be corrected by the intervention of a very capable human.
Nevertheless, it would be desirable to be able to identify parts-of-speech with a high degree of likelihood with relatively simple techniques, like the "n-gram" technique, so that it may be readily applied in all the applications mentioned at the outset, above.
SUMMARY OF THE INVENTION
According to one feature of my invention, parts of speech are assigned to words in a message by optimizing the product of individual word lexical probabilities and normalized three-word contextual probabilities. Normalization employs the contained two-word contextual probabilities. Endpoints of sentences (including multiple spaces between them), punctuation, and words occurring with low frequency are assigned lexical probabilities and are otherwise treated as if they were words, so that discontinuities encountered in prior n-gram part-of-speech assignment and the prior use of "ad hoc" rules tend to be avoided. The generality of the technique is thereby established.
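The optimization just described can be sketched as a Viterbi search over pairs of tags. This is only a minimal illustration, not the patented implementation: the tag set, the lexical table, and the contextual table (a three-tag probability already normalized by its contained two-tag probability) are all invented toy values.

```python
from math import log

# Toy illustrative tables (invented, not from the patent): lexical
# probabilities P(word | tag) and contextual probabilities
# P(tag | two preceding tags), the latter being a three-tag frequency
# normalized by the contained two-tag frequency.
LEX = {
    ("the", "DET"): 1.0,
    ("can", "NOUN"): 0.1, ("can", "AUX"): 0.9,
    ("rusted", "VERB"): 1.0,
    (".", "."): 1.0,
}
CONTEXT = {  # P(t3 | t1, t2)
    ("START", "START", "DET"): 0.6,
    ("START", "DET", "NOUN"): 0.7, ("START", "DET", "AUX"): 0.1,
    ("DET", "NOUN", "VERB"): 0.5, ("DET", "AUX", "VERB"): 0.2,
    ("NOUN", "VERB", "."): 0.6, ("AUX", "VERB", "."): 0.4,
}
TAGS = ["DET", "NOUN", "AUX", "VERB", "."]

def tag(words):
    # Viterbi over states = (previous tag, current tag); maximizes the
    # product of lexical and normalized contextual probabilities,
    # accumulated in log space to avoid underflow.
    states = {("START", "START"): (0.0, [])}
    for w in words:
        nxt = {}
        for (t1, t2), (score, path) in states.items():
            for t3 in TAGS:
                p = LEX.get((w, t3), 0.0) * CONTEXT.get((t1, t2, t3), 1e-6)
                if p == 0.0:
                    continue
                cand = (score + log(p), path + [t3])
                key = (t2, t3)
                if key not in nxt or cand[0] > nxt[key][0]:
                    nxt[key] = cand
        states = nxt
    return max(states.values())[1]

print(tag(["the", "can", "rusted", "."]))  # → ['DET', 'NOUN', 'VERB', '.']
```

Note that "can" is lexically more likely an auxiliary, yet the contextual probabilities following a determiner correctly force the noun reading, which is the point of combining the two probability sources.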
According to another feature of my invention, a message in which the words have had parts-of-speech previously assigned has its noun phrases identified in a way that facilitates their use for speech synthesis. This noun phrase parsing also may have other applications. Specifically, the noun phrase parsing method is a highly probabilistic method that initially assigns beginnings and ends of noun phrases at every start or end of a word and progressively eliminates such assignments by eliminating the lowest probability assignments, until only very high probability non-recursive assignments remain. By non-recursive assignments, I mean that no noun phrase assignment is retained that is partly or wholly within another noun phrase.
Alternatively, the method of this feature of my invention can also retain some high-probability noun phrases that occur wholly within other noun phrases, since such assignments are useful in practice, for example, in speech synthesis.
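The noun phrase parsing described above can be sketched as follows. This is a simplified illustration, not the patented method: the begin, end, and interior probabilities are invented, and a simple greedy elimination stands in for the progressive elimination of low-probability assignments.

```python
# Assumed probabilities (invented for illustration) that a tag begins,
# ends, or sits inside a noun phrase.
P_BEGIN = {"DET": 0.9, "ADJ": 0.4, "NOUN": 0.3, "VERB": 0.05, ".": 0.01}
P_END   = {"DET": 0.05, "ADJ": 0.2, "NOUN": 0.9, "VERB": 0.05, ".": 0.01}
P_IN    = {"DET": 0.9, "ADJ": 0.8, "NOUN": 0.95, "VERB": 0.05, ".": 0.01}

def parse_noun_phrases(tags, threshold=0.5):
    # Every word boundary is a candidate noun-phrase beginning or ending.
    # Score each span [i, j): beginning probability of its first tag,
    # ending probability of its last, in-phrase probability of interior tags.
    candidates = []
    for i in range(len(tags)):
        for j in range(i + 1, len(tags) + 1):
            p = P_BEGIN.get(tags[i], 0.1) * P_END.get(tags[j - 1], 0.1)
            for t in tags[i + 1:j - 1]:
                p *= P_IN.get(t, 0.1)
            candidates.append((p, i, j))
    # Retain the highest-probability spans, eliminating any assignment
    # that lies partly or wholly within one already retained
    # (the non-recursive condition) or falls below the threshold.
    kept = []
    for p, i, j in sorted(candidates, reverse=True):
        if p < threshold:
            break
        if all(j <= a or i >= b for _, a, b in kept):
            kept.append((p, i, j))
    return sorted((i, j) for _, i, j in kept)

# "The red dog chased the can." as part-of-speech tags.
print(parse_noun_phrases(["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "."]))
# → [(0, 3), (4, 6)]
```

The two surviving spans cover "the red dog" and "the can"; every shorter or overlapping candidate is eliminated as lower in probability, mirroring the progressive elimination of beginnings and endings described above.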
Some noun phrase assignments which are always eliminated are endings without corresponding beginnings (e.g., at the start of a sentence), or beginnings without endings (e.g., at the end of a sentence), but my method further eliminates low-probability assignments of the beginnings and ends of noun phrases; or, to put it