CA2331815A1 - System for creating a dictionary - Google Patents
System for creating a dictionary
- Publication number
- CA2331815A1
- Authority
- CA
- Canada
- Prior art keywords
- dictionary
- word
- entries
- corpus
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F40/00—Handling natural language data
        - G06F40/20—Natural language analysis
          - G06F40/205—Parsing
            - G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F40/00—Handling natural language data
        - G06F40/20—Natural language analysis
          - G06F40/237—Lexical tools
            - G06F40/242—Dictionaries
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F40/00—Handling natural language data
        - G06F40/20—Natural language analysis
          - G06F40/268—Morphological analysis
Abstract
A computer readable medium has computer executable components that include a morphological analyzer (104) capable of using a corpus of words (102) to automatically form a dictionary containing words associated with respective lemmas (154) and respective parts of speech (156). The computer executable components also include a dictionary analyzer (106) capable of automatically improving such a dictionary.
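As a rough illustration of the idea in the abstract, the sketch below forms dictionary entries of the form (word, lemma, part of speech) from a small corpus. It is not the patented analyzer: the `Entry` type, the `SUFFIX_RULES` table, and the functions `analyze` and `build_dictionary` are hypothetical names, and the suffix-stripping rules are an assumed stand-in for whatever morphological analysis the system actually performs.

```python
from typing import List, NamedTuple, Set


class Entry(NamedTuple):
    """One dictionary entry: a word with a candidate lemma and part of speech."""
    word: str
    lemma: str
    pos: str


# Hypothetical suffix rules (suffix, replacement, part of speech).
# These are illustrative assumptions, not rules taken from the patent.
SUFFIX_RULES = [
    ("ies", "y", "noun"),
    ("ing", "", "verb"),
    ("ed", "", "verb"),
    ("s", "", "noun"),
]


def analyze(word: str) -> List[Entry]:
    """Return every candidate entry the toy rules produce for one word."""
    candidates = []
    for suffix, replacement, pos in SUFFIX_RULES:
        # Require a non-empty stem so that a bare suffix is never analyzed.
        if word.endswith(suffix) and len(word) > len(suffix):
            lemma = word[: len(word) - len(suffix)] + replacement
            candidates.append(Entry(word, lemma, pos))
    return candidates


def build_dictionary(corpus: List[str]) -> Set[Entry]:
    """Analyze each distinct corpus word (lowercased) and collect the entries."""
    entries: Set[Entry] = set()
    for word in set(corpus):
        entries.update(analyze(word.lower()))
    return entries
```

Note that a word such as "ponies" yields two candidates under these rules (lemma "pony" and lemma "ponie"), which is the kind of over-generation a later dictionary-improvement step would need to prune.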
Claims (10)
1. A method for creating a dictionary of words for a language, each entry in the dictionary indicating a part of speech for the word and a lemma for the word, the method comprising:
selecting a corpus of words; and analyzing the corpus of words with a morphological analyzer to assign a part of speech and a lemma to the words of the corpus.
2. The method of claim 1 further comprising removing all but one lemma for each word in the dictionary.
3. The method of claim 1 further comprising generating multiple default entries for each word in the corpus by using the word itself as a lemma with multiple parts of speech, one part of speech per default entry.
4. The method of claim 3 further comprising after generating the multiple default entries deleting those entries having lemmas that only appear once in the dictionary as lemmas and that match their respective word in their respective entry.
5. The method of claim 4 further comprising deleting those entries having lemmas that do not appear in the corpus.
6. The method of claim 5 further comprising selecting one entry between multiple possible entries for a word on the basis of which entry contains a more probable part of speech for the word.
7. The method of claim 6 further comprising comparing the corpus to the dictionary and using the morphological analyzer to generate second pass entries for words that appear in the corpus but not in the dictionary.
8. The method of claim 7 further comprising eliminating all but one entry from multiple second pass entries that have the same word and part of speech by choosing the entry having a lemma that appears as a lemma in the most entries in the dictionary.
9. A computer readable medium having computer executable components comprising:
a morphological analyzer capable of using a corpus of words to form a dictionary containing words associated with a lemma and a part of speech; and a dictionary analyzer capable of automatically improving the dictionary.
10. The computer readable medium of claim 9 wherein the dictionary analyzer is capable of improving the dictionary by creating multiple default dictionary entries for each word in the corpus, each of the multiple dictionary entries using the respective word as its own lemma, each default dictionary entry having a unique part of speech among the default entries for a particular word.
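The refinement steps recited in claims 3, 4, 5 and 8 can be sketched as set operations over (word, lemma, part of speech) entries. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Entry` type, the `PARTS_OF_SPEECH` tag set, and all function names are hypothetical. In particular, the claim 4 step here counts lemmas per part of speech, which is one possible reading of "appear once in the dictionary as lemmas", and the claim 8 step breaks ties alphabetically, which the claims do not specify.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Entry(NamedTuple):
    word: str
    lemma: str
    pos: str


# Assumed tag set; the claims do not enumerate the parts of speech.
PARTS_OF_SPEECH = ("noun", "verb", "adjective")


def add_default_entries(entries: Iterable[Entry], corpus: Iterable[str]) -> Set[Entry]:
    """Claim 3: add one default entry per part of speech, with the word as its own lemma."""
    result = set(entries)
    for word in corpus:
        for pos in PARTS_OF_SPEECH:
            result.add(Entry(word, word, pos))
    return result


def drop_unsupported_defaults(entries: Set[Entry]) -> Set[Entry]:
    """Claim 4 (as read here): drop entries whose lemma equals the word and whose
    (lemma, part of speech) pair is used by no other entry."""
    counts = Counter((e.lemma, e.pos) for e in entries)
    return {
        e for e in entries
        if not (e.lemma == e.word and counts[(e.lemma, e.pos)] == 1)
    }


def drop_lemmas_missing_from_corpus(entries: Set[Entry], corpus: Iterable[str]) -> Set[Entry]:
    """Claim 5: drop entries whose lemma never occurs as a word in the corpus."""
    corpus_words = set(corpus)
    return {e for e in entries if e.lemma in corpus_words}


def keep_most_common_lemma(entries: Set[Entry]) -> Set[Entry]:
    """Claim 8 (simplified): among entries sharing a word and part of speech, keep the
    one whose lemma serves as a lemma in the most entries. Ties go to the entry that
    sorts first, which is an assumption of this sketch."""
    lemma_counts = Counter(e.lemma for e in entries)
    best: Dict[Tuple[str, str], Entry] = {}
    for e in sorted(entries):
        key = (e.word, e.pos)
        if key not in best or lemma_counts[e.lemma] > lemma_counts[best[key].lemma]:
            best[key] = e
    return set(best.values())
```

Applied in order to a corpus containing "walk", "walked", "pony" and "ponies", the default "walk" as a verb survives because "walked" also points to it, while the default "walk" as an adjective has no support and is removed. A candidate lemma such as "ponie" is discarded because it never occurs in the corpus.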
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/076,163 US6192333B1 (en) | 1998-05-12 | 1998-05-12 | System for creating a dictionary |
US09/076,163 | 1998-05-12 | | |
PCT/US1999/010402 WO1999059082A1 (en) | 1998-05-12 | 1999-05-12 | System for creating a dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2331815A1 (en) | 1999-11-18 |
CA2331815C (en) | 2010-11-16 |
Family
ID=22130333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA2331815A, Expired - Fee Related, CA2331815C (en) | System for creating a dictionary | 1998-05-12 | 1999-05-12 |
Country Status (6)
Country | Link |
---|---|
US (1) | US6192333B1 (en) |
EP (1) | EP1078322B1 (en) |
AT (1) | ATE450007T1 (en) |
CA (1) | CA2331815C (en) |
DE (1) | DE69941694D1 (en) |
WO (1) | WO1999059082A1 (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680649B2 (en) * | 2002-06-17 | 2010-03-16 | International Business Machines Corporation | System, method, program product, and networking use for recognizing words and their parts of speech in one or more natural languages |
BR0215994A (en) * | 2002-12-27 | 2005-11-01 | Nokia Corp | Mobile terminal, and predictive text input and data compression method on a mobile terminal |
US7181396B2 (en) * | 2003-03-24 | 2007-02-20 | Sony Corporation | System and method for speech recognition utilizing a merged dictionary |
US7424467B2 (en) | 2004-01-26 | 2008-09-09 | International Business Machines Corporation | Architecture for an indexer with fixed width sort and variable width sort |
US7293005B2 (en) * | 2004-01-26 | 2007-11-06 | International Business Machines Corporation | Pipelined architecture for global analysis and index building |
US7499913B2 (en) * | 2004-01-26 | 2009-03-03 | International Business Machines Corporation | Method for handling anchor text |
US8296304B2 (en) * | 2004-01-26 | 2012-10-23 | International Business Machines Corporation | Method, system, and program for handling redirects in a search engine |
US7430716B2 (en) * | 2004-07-28 | 2008-09-30 | International Business Machines Corporation | Enhanced efficiency in handling novel words in spellchecking module |
US7785197B2 (en) * | 2004-07-29 | 2010-08-31 | Nintendo Co., Ltd. | Voice-to-text chat conversion for remote video game play |
US7491123B2 (en) * | 2004-07-29 | 2009-02-17 | Nintendo Co., Ltd. | Video game voice chat with amplitude-based virtual ranging |
US7461064B2 | 2004-09-24 | 2008-12-02 | International Business Machines Corporation | Method for searching documents for ranges of numeric values |
US8417693B2 (en) * | 2005-07-14 | 2013-04-09 | International Business Machines Corporation | Enforcing native access control to indexed documents |
WO2007029348A1 (en) * | 2005-09-06 | 2007-03-15 | Community Engine Inc. | Data extracting system, terminal apparatus, program of terminal apparatus, server apparatus, and program of server apparatus |
RU2639280C2 (en) * | 2014-09-18 | 2017-12-20 | Общество с ограниченной ответственностью "Аби Продакшн" | Method and system for generation of articles in natural language dictionary |
US8812296B2 (en) | 2007-06-27 | 2014-08-19 | Abbyy Infopoisk Llc | Method and system for natural language dictionary generation |
DE102008010753A1 (en) * | 2008-02-23 | 2009-08-27 | Bayer Materialscience Ag | Elastomeric polyurethane molded parts, obtained by reacting polyol formulation consisting of e.g. polyol component and optionally organic tin catalyst, and an isocyanate component consisting of e.g. prepolymer, useful e.g. as shoe sole |
US8521516B2 (en) * | 2008-03-26 | 2013-08-27 | Google Inc. | Linguistic key normalization |
WO2009136440A1 (en) * | 2008-05-09 | 2009-11-12 | 富士通株式会社 | Speech recognition dictionary creating support device, processing program, and processing method |
US20150347570A1 (en) * | 2014-05-28 | 2015-12-03 | General Electric Company | Consolidating vocabulary for automated text processing |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4887212A (en) * | 1986-10-29 | 1989-12-12 | International Business Machines Corporation | Parser for natural language text |
US4862408A (en) | 1987-03-20 | 1989-08-29 | International Business Machines Corporation | Paradigm-based morphological text analysis for natural languages |
US5099426A (en) * | 1989-01-19 | 1992-03-24 | International Business Machines Corporation | Method for use of morphological information to cross reference keywords used for information retrieval |
US5229936A (en) * | 1991-01-04 | 1993-07-20 | Franklin Electronic Publishers, Incorporated | Device and method for the storage and retrieval of inflection information for electronic reference products |
US5940624A (en) * | 1991-02-01 | 1999-08-17 | Wang Laboratories, Inc. | Text management system |
US5251316A (en) * | 1991-06-28 | 1993-10-05 | Digital Equipment Corporation | Method and apparatus for integrating a dynamic lexicon into a full-text information retrieval system |
US5412567A (en) * | 1992-12-31 | 1995-05-02 | Xerox Corporation | Augmenting a lexical transducer by analogy |
US5724594A (en) * | 1994-02-10 | 1998-03-03 | Microsoft Corporation | Method and system for automatically identifying morphological information from a machine-readable dictionary |
JPH0844719A (en) * | 1994-06-01 | 1996-02-16 | Mitsubishi Electric Corp | Dictionary access system |
US5873660A (en) * | 1995-06-19 | 1999-02-23 | Microsoft Corporation | Morphological search and replace |
US5794177A (en) * | 1995-07-19 | 1998-08-11 | Inso Corporation | Method and apparatus for morphological analysis and generation of natural language text |
US5995922A (en) * | 1996-05-02 | 1999-11-30 | Microsoft Corporation | Identifying information related to an input word in an electronic dictionary |
- 1998
  - 1998-05-12: US application US09/076,163, patent US6192333B1 (en); status: Expired - Lifetime (not active)
- 1999
  - 1999-05-12: EP application EP99922966A, patent EP1078322B1 (en); status: Expired - Lifetime (not active)
  - 1999-05-12: CA application CA2331815A, patent CA2331815C (en); status: Expired - Fee Related (not active)
  - 1999-05-12: WO application PCT/US1999/010402, publication WO1999059082A1 (en); status: Application Filing (active)
  - 1999-05-12: DE application DE69941694T, publication DE69941694D1 (en); status: Expired - Lifetime (not active)
  - 1999-05-12: AT application AT99922966T, publication ATE450007T1 (en); status: IP Right Cessation (not active)
Also Published As
Publication number | Publication date |
---|---|
CA2331815C (en) | 2010-11-16 |
EP1078322A1 (en) | 2001-02-28 |
US6192333B1 (en) | 2001-02-20 |
ATE450007T1 (en) | 2009-12-15 |
EP1078322B1 (en) | 2009-11-25 |
WO1999059082A1 (en) | 1999-11-18 |
DE69941694D1 (en) | 2010-01-07 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CA2331815A1 (en) | System for creating a dictionary | |
EP0805403A3 (en) | Translating apparatus and translating method | |
US6823301B1 (en) | Language analysis using a reading point | |
WO1997038376A3 (en) | A system, software and method for locating information in a collection of text-based information sources | |
JPS5762460A (en) | Inputting method for sentence to be translated by electronic translating machine | |
CA2433512A1 (en) | File translation | |
EP0933713A3 (en) | Method of and apparatus for processing an input text, method of and apparatus for performing an approximate translation and storage medium | |
Sarkar | Practical experiments in parsing using tree adjoining grammars | |
JPS58192173A (en) | System for selecting word used in translation in machine translation | |
JPS54139355A (en) | Word separator | |
JP2828692B2 (en) | Information retrieval device | |
JPS56147269A (en) | Electronic translator | |
KR100886688B1 (en) | Method and apparatus for creating quantifier of korean language | |
Kchaou et al. | Arabic stemming with two dictionaries | |
Garvin et al. | The conversion of phonetic into orthographic English: A machine-translation approach to the problem | |
CN109165392A (en) | Interaction language translating method and device | |
JPS6441066A (en) | Back-up device for consultation of dictionary | |
EP0562334A2 (en) | Computer system and method for the automated analysis of texts | |
Hurskainen | Information retrieval and two-directional word formation | |
JPH0728823A (en) | Coocurrence word extracting method and device therefor | |
Gachot | The systran renaissance | |
JPS63192130A (en) | Automatic key word extracting device | |
JP2742059B2 (en) | Dictionary editor for translation | |
JPH07319879A (en) | Translation processor | |
Olney et al. | Summary of some computational aids for obtaining a formal semantic description of english |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| EEER | Examination request | |
| MKLA | Lapsed | Effective date: 20190513 |