CA2331815A1 - System for creating a dictionary

Info

Publication number
CA2331815A1
Authority
CA
Canada
Prior art keywords
dictionary
word
entries
corpus
speech
Prior art date
Legal status
Granted
Application number
CA002331815A
Other languages
French (fr)
Other versions
CA2331815C (en)
Inventor
Joseph E. Pentheroudakis
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Publication of CA2331815A1
Application granted
Publication of CA2331815C
Anticipated expiration
Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/268: Morphological analysis

Abstract

A computer readable medium has computer executable components that include a morphological analyzer (104) capable of using a corpus of words (102) to automatically form a dictionary containing words associated with respective lemmas (154) and respective parts of speech (156). The computer executable components also include a dictionary analyzer (106) capable of automatically improving such a dictionary.
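The abstract describes dictionary entries that pair each word with a lemma and a part of speech. A minimal sketch of that data shape follows, assuming a toy suffix-stripping analyzer. The names `Entry`, `TOY_RULES`, `analyze`, and `build_dictionary` are hypothetical and do not come from the patent, and the suffix rules are illustrative stand-ins for the morphological analyzer (104), not its actual logic.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One dictionary entry: a word with a lemma and a part of speech."""
    word: str
    lemma: str
    pos: str


# Hypothetical suffix rules: (suffix, replacement, part of speech).
# A real morphological analyzer would be far richer than this table.
TOY_RULES = [
    ("ies", "y", "Noun"),
    ("s", "", "Noun"),
    ("s", "", "Verb"),
    ("ed", "", "Verb"),
    ("ing", "", "Verb"),
]


def analyze(word):
    """Return every candidate entry the toy rules propose for one word."""
    candidates = []
    for suffix, replacement, pos in TOY_RULES:
        # Require at least one character of stem before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            lemma = word[: len(word) - len(suffix)] + replacement
            candidates.append(Entry(word, lemma, pos))
    return candidates


def build_dictionary(corpus):
    """Analyze each distinct corpus word and collect the proposed entries."""
    entries = set()
    for word in set(corpus):
        entries.update(analyze(word))
    return entries
```

With these toy rules, `analyze("ponies")` proposes both a plausible lemma ("pony") and an implausible one ("ponie"), which illustrates why a later refinement pass over the dictionary is useful.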

Claims (10)

1. A method for creating a dictionary of words for a language, each entry in the dictionary indicating a part of speech for the word and a lemma for the word, the method comprising:
selecting a corpus of words; and
analyzing the corpus of words with a morphological analyzer to assign a part of speech and a lemma to the words of the corpus.
2. The method of claim 1 further comprising removing all but one lemma for each word in the dictionary.
3. The method of claim 1 further comprising generating multiple default entries for each word in the corpus by using the word itself as a lemma with multiple parts of speech, one part of speech per default entry.
4. The method of claim 3 further comprising, after generating the multiple default entries, deleting those entries having lemmas that only appear once in the dictionary as lemmas and that match their respective word in their respective entry.
5. The method of claim 4 further comprising deleting those entries having lemmas that do not appear in the corpus.
6. The method of claim 5 further comprising selecting one entry between multiple possible entries for a word on the basis of which entry contains a more probable part of speech for the word.
7. The method of claim 6 further comprising comparing the corpus to the dictionary and using the morphological analyzer to generate second pass entries for words that appear in the corpus but not in the dictionary.
8. The method of claim 7 further comprising eliminating all but one entry from multiple second pass entries that have the same word and part of speech by choosing the entry having a lemma that appears as a lemma in the most entries in the dictionary.
9. A computer readable medium having computer executable components comprising:
a morphological analyzer capable of using a corpus of words to form a dictionary containing words associated with a lemma and a part of speech; and
a dictionary analyzer capable of automatically improving the dictionary.
10. The computer readable medium of claim 9 wherein the dictionary analyzer is capable of improving the dictionary by creating multiple default dictionary entries for each word in the corpus, each of the multiple dictionary entries using the respective word as its own lemma, each default dictionary entry having a unique part of speech among the default entries for a particular word.
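The refinement steps recited in claims 3 through 6 can be sketched as a small pipeline. This is an illustrative reading, not the patented implementation. The tag set `POS_TAGS`, the probability table passed to `pick_most_probable`, and all function names are hypothetical. The sketch also assumes that "appear once as lemmas" in claim 4 is counted per lemma and part of speech pair, since each default entry set already repeats the word as a lemma once per tag.

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    word: str
    lemma: str
    pos: str


POS_TAGS = ("Noun", "Verb", "Adj")  # hypothetical tag set


def add_default_entries(entries, corpus):
    """Claim 3: add one entry per tag that uses the word as its own lemma."""
    result = set(entries)
    for word in set(corpus):
        for pos in POS_TAGS:
            result.add(Entry(word, word, pos))
    return result


def drop_unsupported_defaults(entries):
    """Claim 4 (assumed reading): delete entries whose lemma equals the word
    and whose (lemma, pos) pair occurs only once in the dictionary."""
    counts = Counter((e.lemma, e.pos) for e in entries)
    return {
        e for e in entries
        if not (e.word == e.lemma and counts[(e.lemma, e.pos)] == 1)
    }


def drop_unattested_lemmas(entries, corpus):
    """Claim 5: delete entries whose lemma never appears in the corpus."""
    vocabulary = set(corpus)
    return {e for e in entries if e.lemma in vocabulary}


def pick_most_probable(entries, pos_probability):
    """Claim 6: keep one entry per word, preferring the more probable tag.
    Ties keep the entry that sorts first, so the result is deterministic."""
    best = {}
    for e in sorted(entries):
        score = pos_probability.get(e.pos, 0.0)
        if e.word not in best or score > pos_probability.get(best[e.word].pos, 0.0):
            best[e.word] = e
    return set(best.values())


def refine(entries, corpus, pos_probability):
    """Run the four steps in the order the claims recite them."""
    entries = add_default_entries(entries, corpus)
    entries = drop_unsupported_defaults(entries)
    entries = drop_unattested_lemmas(entries, corpus)
    return pick_most_probable(entries, pos_probability)
```

For a corpus of "walk", "walked", and "walks" where an analyzer has already proposed "walk" as the lemma of the inflected forms, the default entries for "walk" as a noun and verb survive because other entries share those lemma and tag pairs, while the default entries for "walked" and "walks" are deleted.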
CA2331815A 1998-05-12 1999-05-12 System for creating a dictionary Expired - Fee Related CA2331815C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/076,163 US6192333B1 (en) 1998-05-12 1998-05-12 System for creating a dictionary
US09/076,163 1998-05-12
PCT/US1999/010402 WO1999059082A1 (en) 1998-05-12 1999-05-12 System for creating a dictionary

Publications (2)

Publication Number Publication Date
CA2331815A1 (en) 1999-11-18
CA2331815C (en) 2010-11-16

Family

ID=22130333

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2331815A Expired - Fee Related CA2331815C (en) 1998-05-12 1999-05-12 System for creating a dictionary

Country Status (6)

Country Link
US (1) US6192333B1 (en)
EP (1) EP1078322B1 (en)
AT (1) ATE450007T1 (en)
CA (1) CA2331815C (en)
DE (1) DE69941694D1 (en)
WO (1) WO1999059082A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680649B2 (en) * 2002-06-17 2010-03-16 International Business Machines Corporation System, method, program product, and networking use for recognizing words and their parts of speech in one or more natural languages
BR0215994A (en) * 2002-12-27 2005-11-01 Nokia Corp Mobile terminal, and predictive text input and data compression method on a mobile terminal
US7181396B2 (en) * 2003-03-24 2007-02-20 Sony Corporation System and method for speech recognition utilizing a merged dictionary
US7424467B2 (en) 2004-01-26 2008-09-09 International Business Machines Corporation Architecture for an indexer with fixed width sort and variable width sort
US7293005B2 (en) * 2004-01-26 2007-11-06 International Business Machines Corporation Pipelined architecture for global analysis and index building
US7499913B2 (en) * 2004-01-26 2009-03-03 International Business Machines Corporation Method for handling anchor text
US8296304B2 (en) * 2004-01-26 2012-10-23 International Business Machines Corporation Method, system, and program for handling redirects in a search engine
US7430716B2 (en) * 2004-07-28 2008-09-30 International Business Machines Corporation Enhanced efficiency in handling novel words in spellchecking module
US7785197B2 (en) * 2004-07-29 2010-08-31 Nintendo Co., Ltd. Voice-to-text chat conversion for remote video game play
US7491123B2 (en) * 2004-07-29 2009-02-17 Nintendo Co., Ltd. Video game voice chat with amplitude-based virtual ranging
US7461064B2 (en) 2004-09-24 2008-12-02 International Business Machines Corporation Method for searching documents for ranges of numeric values
US8417693B2 (en) * 2005-07-14 2013-04-09 International Business Machines Corporation Enforcing native access control to indexed documents
WO2007029348A1 (en) * 2005-09-06 2007-03-15 Community Engine Inc. Data extracting system, terminal apparatus, program of terminal apparatus, server apparatus, and program of server apparatus
RU2639280C2 (en) * 2014-09-18 2017-12-20 Общество с ограниченной ответственностью "Аби Продакшн" Method and system for generation of articles in natural language dictionary
US8812296B2 (en) 2007-06-27 2014-08-19 Abbyy Infopoisk Llc Method and system for natural language dictionary generation
DE102008010753A1 (en) * 2008-02-23 2009-08-27 Bayer Materialscience Ag Elastomeric polyurethane molded parts, obtained by reacting polyol formulation consisting of e.g. polyol component and optionally organic tin catalyst, and an isocyanate component consisting of e.g. prepolymer, useful e.g. as shoe sole
US8521516B2 (en) * 2008-03-26 2013-08-27 Google Inc. Linguistic key normalization
WO2009136440A1 (en) * 2008-05-09 2009-11-12 富士通株式会社 Speech recognition dictionary creating support device, processing program, and processing method
US20150347570A1 (en) * 2014-05-28 2015-12-03 General Electric Company Consolidating vocabulary for automated text processing

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4887212A (en) * 1986-10-29 1989-12-12 International Business Machines Corporation Parser for natural language text
US4862408A (en) 1987-03-20 1989-08-29 International Business Machines Corporation Paradigm-based morphological text analysis for natural languages
US5099426A (en) * 1989-01-19 1992-03-24 International Business Machines Corporation Method for use of morphological information to cross reference keywords used for information retrieval
US5229936A (en) * 1991-01-04 1993-07-20 Franklin Electronic Publishers, Incorporated Device and method for the storage and retrieval of inflection information for electronic reference products
US5940624A (en) * 1991-02-01 1999-08-17 Wang Laboratories, Inc. Text management system
US5251316A (en) * 1991-06-28 1993-10-05 Digital Equipment Corporation Method and apparatus for integrating a dynamic lexicon into a full-text information retrieval system
US5412567A (en) * 1992-12-31 1995-05-02 Xerox Corporation Augmenting a lexical transducer by analogy
US5724594A (en) * 1994-02-10 1998-03-03 Microsoft Corporation Method and system for automatically identifying morphological information from a machine-readable dictionary
JPH0844719A (en) * 1994-06-01 1996-02-16 Mitsubishi Electric Corp Dictionary access system
US5873660A (en) * 1995-06-19 1999-02-23 Microsoft Corporation Morphological search and replace
US5794177A (en) * 1995-07-19 1998-08-11 Inso Corporation Method and apparatus for morphological analysis and generation of natural language text
US5995922A (en) * 1996-05-02 1999-11-30 Microsoft Corporation Identifying information related to an input word in an electronic dictionary

Also Published As

Publication number Publication date
CA2331815C (en) 2010-11-16
EP1078322A1 (en) 2001-02-28
US6192333B1 (en) 2001-02-20
ATE450007T1 (en) 2009-12-15
EP1078322B1 (en) 2009-11-25
WO1999059082A1 (en) 1999-11-18
DE69941694D1 (en) 2010-01-07

Similar Documents

Publication Publication Date Title
CA2331815A1 (en) System for creating a dictionary
EP0805403A3 (en) Translating apparatus and translating method
US6823301B1 (en) Language analysis using a reading point
WO1997038376A3 (en) A system, software and method for locating information in a collection of text-based information sources
JPS5762460A (en) Inputting method for sentence to be translated by electronic translating machine
CA2433512A1 (en) File translation
EP0933713A3 (en) Method of and apparatus for processing an input text, method of and apparatus for performing an approximate translation and storage medium
Sarkar Practical experiments in parsing using tree adjoining grammars
JPS58192173A (en) System for selecting word used in translation in machine translation
JPS54139355A (en) Word separator
JP2828692B2 (en) Information retrieval device
JPS56147269A (en) Electronic translator
KR100886688B1 (en) Method and apparatus for creating quantifier of korean language
Kchaou et al. Arabic stemming with two dictionaries
Garvin et al. The conversion of phonetic into orthographic English: A machine-translation approach to the problem
CN109165392A (en) Interaction language translating method and device
JPS6441066A (en) Back-up device for consultation of dictionary
EP0562334A2 (en) Computer system and method for the automated analysis of texts
Hurskainen Information retrieval and two-directional word formation
JPH0728823A (en) Coocurrence word extracting method and device therefor
Gachot The systran renaissance
JPS63192130A (en) Automatic key word extracting device
JP2742059B2 (en) Dictionary editor for translation
JPH07319879A (en) Translation processor
Olney et al. Summary of some computational aids for obtaining a formal semantic description of english

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed (effective date: 2019-05-13)