WO2002027524B1 - A method and system for describing and identifying concepts in natural language text for information retrieval and processing

Publication number
WO2002027524B1
Authority
WO
WIPO (PCT)
Prior art keywords
text
concepts
concept
documents
csl
Application number
PCT/CA2001/001398
Other languages
French (fr)
Other versions
WO2002027524A2 (en)
WO2002027524A3 (en)
Inventor
Daniel C Fass
Davide Turcato
Gordon W Tisher
James Devlan Nicholson
Milan Mosny
Frederick P Popowich
Janine T Toole
Paul G Mcfetridge
Frederick W Kroon
Original Assignee
Gavagai Technology Inc
Daniel C Fass
Davide Turcato
Gordon W Tisher
James Devlan Nicholson
Milan Mosny
Frederick P Popowich
Janine T Toole
Paul G Mcfetridge
Frederick W Kroon
Application filed by Gavagai Technology Inc, Daniel C Fass, Davide Turcato, Gordon W Tisher, James Devlan Nicholson, Milan Mosny, Frederick P Popowich, Janine T Toole, Paul G Mcfetridge, Frederick W Kroon
Priority to CA002423964A (CA2423964A1)
Priority to AU2001293595A (AU2001293595A1)
Priority to EP01973933A (EP1393200A2)
Priority to US10/398,129 (US7346490B2)
Publication of WO2002027524A2
Publication of WO2002027524A3
Publication of WO2002027524B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99932 Access augmentation or optimizing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Abstract

A method for information retrieval that matches occurrences of concepts in natural language text documents against descriptions of concepts in user queries. Said method, implemented in a computer system, includes a preferred version of the method that comprises (1) annotating natural language text in documents and other text-forms with linguistic information and Concepts and Concept Rules expressed in a Concept Specification Language (CSL) for a particular domain, (2) pruning and optimizing synonyms for a particular domain, (3) defining and learning said CSL Concepts and Concept Rules, (4) checking user-defined descriptions of Concepts represented in CSL (including user queries), and (5) retrieval by matching said user-defined descriptions (and queries) against said annotated text. CSL is a language for expressing linguistically-based patterns. Said patterns can represent the linguistic manifestations of concepts in text. Said concepts may derive from the sublanguages used by experts to analyze specialized domains including, but not limited to, insurance claims, police incident reports, medical reports, and aviation incident reports.

Claims

AMENDED CLAIMS
Received by the International Bureau on 24 October 2003 (24.10.2003); original claims 1-93 are replaced by amended claims 1-95.
1. A method of information retrieval, performed on a computer system that matches text in documents and other text-forms against user-defined descriptions of concepts, comprising: a) identification of linguistic entities in the text of documents and other text-forms; b) annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms; c) storage of said linguistically annotated documents and other text-forms; d) identification of concepts using linguistic information, where said concepts are represented in a concept specification language and said concepts occur in one of:
1) said text of documents and other text-forms in which linguistic entities have been identified in step a); or
2) said linguistically annotated documents and other text-forms of step b); or
3) stored linguistically annotated documents and other text-forms of step c); e) annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms; f) storage of said conceptually annotated documents and other text-forms; g) defining and learning concept representations of said concept specification language; h) checking user-defined descriptions of concepts represented in said concept specification language; and i) retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms.
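As an illustration only, the steps of claim 1 can be sketched end to end in a few lines. This is a minimal sketch under strong simplifying assumptions, not the claimed implementation: "linguistic entities" are reduced to lowercase word tokens, a concept is reduced to a set of trigger words, and the names `AnnotatedDoc`, `linguistic_annotate`, `conceptual_annotate`, `retrieve`, and the `CONCEPTS` table are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AnnotatedDoc:
    doc_id: str
    tokens: list[str]                                  # stand-in for linguistic annotation
    concepts: set[str] = field(default_factory=set)    # stand-in for conceptual annotation


def linguistic_annotate(doc_id: str, text: str) -> AnnotatedDoc:
    # Steps a) and b), simplified: the only "linguistic entities" are word tokens.
    return AnnotatedDoc(doc_id, re.findall(r"[a-z0-9']+", text.lower()))


def conceptual_annotate(doc: AnnotatedDoc, concept_defs: dict[str, set[str]]) -> AnnotatedDoc:
    # Steps d) and e), simplified: a concept is identified when a trigger word occurs.
    present = set(doc.tokens)
    for name, triggers in concept_defs.items():
        if triggers & present:
            doc.concepts.add(name)
    return doc


def retrieve(query: set[str], store: list[AnnotatedDoc]) -> list[str]:
    # Step i), simplified: keep documents annotated with every queried concept.
    return [d.doc_id for d in store if query <= d.concepts]


# Hypothetical concept definitions and documents (steps c) and f) are an in-memory list).
CONCEPTS = {"Collision": {"hit", "struck", "collided"}, "Vehicle": {"car", "truck"}}
store = [
    conceptual_annotate(linguistic_annotate(doc_id, text), CONCEPTS)
    for doc_id, text in [("d1", "The truck struck a pole."), ("d2", "The car was parked.")]
]
print(retrieve({"Collision", "Vehicle"}, store))  # ['d1']
```

Separating linguistic from conceptual annotation mirrors the claim's structure: new concept definitions can be applied to stored linguistically annotated documents without re-tokenizing the source text.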
2. The method according to claim 1 wherein said identification of linguistic entities in the text of documents and other text-forms comprises identification of morphological, syntactic, and semantic entities.
3. The method according to claim 2 wherein said identification of linguistic entities in the text of documents and other text-forms comprises identifying words and phrases, and establishing dependencies between words and phrases.
4. The method according to claim 3 wherein said identification of linguistic entities in the text of documents and other text-forms is accomplished by a method selected from one or more of: a) preprocessing of text of documents and other text-forms; b) tagging of text of documents and other text-forms; c) parsing of text of documents and other text-forms.
5. The method according to claim 4 wherein annotation of said identified linguistic entities in the text of documents and other text-forms is linguistic annotation and produces a representation of linguistically annotated documents and other text-forms in a text markup language.
6. The method according to claim 5 wherein said linguistically annotated documents and other text-forms are stored.
7. The method according to claim 1 wherein in said identification of concepts using linguistic information said concepts are represented in a concept specification language and said concepts occur in one of: a) said text of documents and other text-forms in which linguistic entities have been identified to produce said linguistically annotated documents and other text-forms by means of a method comprising 1) identification of morphological, syntactic, and semantic entities;
2) identification of words and phrases, and establishment of dependencies between words and phrases; and
3) at least one of: i) preprocessing of text of documents and other text-forms; ii) tagging of text of documents and other text-forms; iii) parsing of text of documents and other text-forms; or b) said linguistically annotated documents and other text-forms wherein said annotation of said identified linguistic entities in the text of documents and other text-forms comprises linguistic annotation and produces a representation of linguistically annotated documents and other text-forms in a text markup language; or c) said stored linguistically annotated documents and other text-forms in a text markup language.
8. The method according to claim 7 wherein said concept specification language allows representations to be defined for concepts in terms of a linguistics-based pattern or set of patterns, where each pattern consists of words, phrases, other concepts, and relationships between words, phrases, and concepts.
9. The method according to claim 8 wherein said identification of concepts using linguistic information, when used with said concept specification language, consists of applying representations of concepts for the purpose of identifying concepts.
10. The method according to claim 7 wherein annotation of said identified concepts in linguistically annotated documents and other text-forms is conceptual annotation and produces a representation of conceptually annotated documents and other text-forms in a text markup language.
11. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying comprising: a) compiling an expression from said concept specification language into finite state automata (FSAs); b) matching said FSAs against linguistic entities in said linguistically annotated text.
12. The method according to claim 11 wherein concepts from said concept specification language are compiled into finite state automata (FSAs) and said compilation into FSAs comprises one or both of the following: a) the grammar from the parser used within the method to parse linguistically annotated text; and b) sets of synonyms.
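The compile-then-match approach of claims 11 and 12 can be illustrated with a small sketch. It assumes a hypothetical, much-reduced pattern form (a linear sequence of word or part-of-speech items, some optional) rather than real CSL, and folds synonym sets into the automaton at compile time; `Item`, `FSA`, `compile_pattern`, and `match` are illustrative names, not the patent's.

```python
from typing import Callable, NamedTuple

Token = tuple[str, str]  # (word, part-of-speech tag) from linguistic annotation


class Item(NamedTuple):
    kind: str               # "word" or "tag"
    value: str
    optional: bool = False


class FSA(NamedTuple):
    tests: list[Callable[[Token], bool]]  # tests[i] guards the move from state i to i + 1
    skippable: list[bool]                 # True if state i may move to i + 1 without input


def compile_pattern(items: list[Item], synonyms: dict[str, set[str]]) -> FSA:
    tests, skippable = [], []
    for item in items:
        if item.kind == "word":
            accepted = frozenset({item.value} | synonyms.get(item.value, set()))
            tests.append(lambda tok, acc=accepted: tok[0] in acc)
        else:
            tests.append(lambda tok, tag=item.value: tok[1] == tag)
        skippable.append(item.optional)
    return FSA(tests, skippable)


def _closure(fsa: FSA, states: set[int]) -> set[int]:
    # Add every state reachable through input-free moves (optional items).
    out, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        if s < len(fsa.tests) and fsa.skippable[s] and s + 1 not in out:
            out.add(s + 1)
            stack.append(s + 1)
    return out


def match(fsa: FSA, tokens: list[Token]) -> list[tuple[int, int]]:
    """Return (start, end) token spans accepted by the automaton."""
    final, spans = len(fsa.tests), []
    for start in range(len(tokens)):
        states = _closure(fsa, {0})
        for end in range(start, len(tokens)):
            moved = {s + 1 for s in states if s < final and fsa.tests[s](tokens[end])}
            states = _closure(fsa, moved)
            if not states:
                break
            if final in states:
                spans.append((start, end + 1))
    return spans
```

For example, the pattern `[Item("word", "car"), Item("tag", "ADV", True), Item("word", "hit")]` with synonyms `{"car": {"truck"}, "hit": {"struck"}}` accepts the tokens for "truck suddenly struck" as one span.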
13. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts comprising recursive descent matching which consists of traversing an expression in said concept specification language and recursively matching constituents of said expression against linguistic entities in linguistically annotated text.
14. The method according to claim 13 wherein said identification of concepts uses recursive descent matching and wherein said recursive descent matching comprises sets of synonyms.
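Recursive descent matching as described in claims 13 and 14 can be sketched by walking an expression tree and matching each constituent against the token sequence. The node types `Word`, `Seq`, and `Or` below are a hypothetical miniature of an expression language, not CSL itself, and the synonym table is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str


@dataclass(frozen=True)
class Seq:
    parts: tuple      # sub-expressions that must match one after another


@dataclass(frozen=True)
class Or:
    options: tuple    # alternative sub-expressions


def match_at(expr, tokens, pos, synonyms):
    """Return the set of end positions where expr matches starting at pos."""
    if isinstance(expr, Word):
        accepted = {expr.text} | synonyms.get(expr.text, set())
        if pos < len(tokens) and tokens[pos] in accepted:
            return {pos + 1}
        return set()
    if isinstance(expr, Or):
        ends = set()
        for option in expr.options:
            ends |= match_at(option, tokens, pos, synonyms)
        return ends
    if isinstance(expr, Seq):
        ends = {pos}
        for part in expr.parts:
            # Recurse into each constituent from every position reached so far.
            ends = {e2 for e in ends for e2 in match_at(part, tokens, e, synonyms)}
        return ends
    raise TypeError(f"unknown expression node: {expr!r}")


def find_all(expr, tokens, synonyms):
    """Try the expression at every start position and collect (start, end) spans."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_at(expr, tokens, start, synonyms)
    )
```

Returning a set of end positions, rather than a single one, lets alternatives of different lengths coexist without explicit backtracking code.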
15. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts which comprise bottom-up matching comprising:
a) generating in a bottom-up fashion multiple spans, where each span is
1) a word or constituent and, optionally, structural information about the word or constituent, or
2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent; b) generating in a bottom-up fashion spans consumed by single-term patterns in an expression in said concept specification language; c) generating in a bottom-up fashion spans consumed by operators in an expression in said concept specification language; and d) matching in a bottom-up fashion said spans against linguistic entities in linguistically annotated text.
16. The method according to claim 15 wherein said identification of concepts uses bottom-up matching and wherein said bottom-up matching comprises sets of synonyms.
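A rough sketch of the bottom-up idea in claims 15 and 16: spans are first produced for single-term patterns across the whole text, then combined by operators. The expression encoding (a bare string for a word, a tuple `("seq", ...)` or `("or", ...)` for an operator) and the function name `spans` are assumptions made for this example; structural information about constituents is omitted.

```python
def spans(expr, tokens, synonyms):
    """Compute all (start, end) spans consumed by expr, children before operators."""
    if isinstance(expr, str):
        # Single-term pattern: one span per occurrence of the word or a synonym.
        accepted = {expr} | synonyms.get(expr, set())
        return {(i, i + 1) for i, tok in enumerate(tokens) if tok in accepted}

    op, *children = expr
    child_spans = [spans(child, tokens, synonyms) for child in children]

    if op == "or":
        # Any span consumed by any alternative.
        return set().union(*child_spans)
    if op == "seq":
        # Join spans that follow each other: the left end must equal the right start.
        result = child_spans[0]
        for nxt in child_spans[1:]:
            result = {(s, e2) for (s, e) in result for (s2, e2) in nxt if e == s2}
        return result
    raise ValueError(f"unknown operator: {op!r}")
```

Unlike the recursive descent approach, nothing here is anchored to a start position: every span for every sub-expression is computed once and then combined.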
17. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts that are index-based comprising use of an inverted index, where a) said inverted index contains words, constituents, and tags for linguistic information, comprising syntactic information, from linguistically annotated text; b) said inverted index contains spans for said words, constituents, and tags from linguistically annotated text; c) where each span is
1) a word or constituent and, optionally, structural information about the word or constituent, or
2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent.
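One way to picture the inverted index of claim 17 is a mapping from words, tags, and constituent labels to the spans where they occur. The sketch below is a minimal illustration under assumptions: tokens arrive as (word, tag) pairs, constituents as (label, start, end) triples, and the key scheme `("word", ...)`, `("tag", ...)`, `("constituent", ...)` is hypothetical.

```python
from collections import defaultdict


def build_index(tagged_tokens, constituents):
    """Map each word, tag, and constituent label to its list of (start, end) spans."""
    index = defaultdict(list)
    for i, (word, tag) in enumerate(tagged_tokens):
        index[("word", word.lower())].append((i, i + 1))
        index[("tag", tag)].append((i, i + 1))
    for label, start, end in constituents:
        # A constituent span covers a run of words that follow each other.
        index[("constituent", label)].append((start, end))
    return dict(index)
```

With such an index, a matcher can look up where a word or constituent occurs directly instead of scanning the annotated text.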
18. The method according to claim 17 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising index-based matching, where said index-based matching comprises: a) using backtracking to resolve the constraints of operators in an expression in said concept specification language; b) attaching iterators to all items in the expression in said concept specification language; c) using the iterators to produce matches of all items in the expression in said concept specification language against text in the inverted index; d) maintaining a state for the iterator for each item in the expression in said concept specification language where that state is used to determine whether or not it has been processed before in the match of said expression against said inverted index, and also relevant information about the progress of the match; e) maintaining a state for the iterator for each item that is a word in the expression in said concept specification language where that state comprises: a list of applicable synonyms of the word in question, and the current synonym being used for matching; an iterator into the inverted index that can enumerate all instances of the word in said index, and which records the current word; f) during the course of a match, each item in the expression in said concept specification language is tested, and if successful, returns a set of spans covering the match of its corresponding sub-expression (i.e., components of said expression).
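The iterator-and-backtracking scheme of claim 18 can be approximated with Python generators, which keep their own state between steps. This is a simplified sketch, not the claimed method: it handles only a sequence of words under an adjacency constraint, the index is assumed to map each word to a list of (start, end) spans, and `word_iter` and `match_sequence` are hypothetical names.

```python
def word_iter(word, index, synonyms):
    """Iterator for one word item: its state is the current synonym and index position."""
    for form in [word, *sorted(synonyms.get(word, ()))]:
        for span in index.get(form, ()):
            yield form, span


def match_sequence(words, index, synonyms):
    """Match words as an adjacent sequence, backtracking when a constraint fails."""
    if not words:
        return []
    results = []

    def extend(i, chosen):
        if i == len(words):
            results.append((chosen[0][0], chosen[-1][1]))
            return
        for _form, span in word_iter(words[i], index, synonyms):
            # Adjacency constraint: this span must start where the previous one ended.
            if not chosen or span[0] == chosen[-1][1]:
                extend(i + 1, chosen + [span])
            # Otherwise fall through: the iterator advances, which is the backtrack step.

    extend(0, [])
    return sorted(results)
```

Each call to `word_iter` creates a fresh iterator for its item, so re-entering an item after backtracking restarts its enumeration from the first synonym.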
19. The method according to claim 18 wherein said identification of concepts uses index-based matching, where said index-based matching comprises sets of synonyms.
20. The method according to claim 17 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching comprises identifying sets of candidate spans, where a) a candidate span is a span that may contain a concept to be identified; b) any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a concept to be identified; c) each sub-expression of an expression in the concept specification language is associated with a procedure; d) each such procedure is used to generate candidate spans or to check whether a given span is a candidate span.
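Candidate checking as in claim 20 can be illustrated by treating each sentence as a possible candidate span and attaching a cheap procedure to every sub-expression. The sketch below is an assumption-laden simplification: candidate spans are whole sentences identified by number, the expression is a nested tuple of `("word", w)`, `("or", ...)`, and `("seq", ...)`, and the function names are hypothetical.

```python
def build_sentence_index(sentences):
    """Map each lowercase word to the set of sentence numbers that contain it."""
    index = {}
    for sid, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            index.setdefault(word, set()).add(sid)
    return index


def candidates(expr, index, synonyms):
    """Return sentence numbers that may contain a match; all others cannot."""
    op, *args = expr
    if op == "word":
        forms = {args[0]} | synonyms.get(args[0], set())
        return set().union(*(index.get(form, set()) for form in forms))
    child_sets = [candidates(arg, index, synonyms) for arg in args]
    if op == "or":
        # A sentence is a candidate if any alternative could occur in it.
        return set().union(*child_sets)
    if op == "seq":
        # Every part must be present; word order is left to the full matcher.
        return set.intersection(*child_sets)
    raise ValueError(f"unknown operator: {op!r}")
```

The result is deliberately a superset of the true matches: a sentence containing the right words in the wrong order is still a candidate, and a precise matcher then confirms or rejects it.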
21. The method according to claim 20 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching produces candidate spans that serve as input to concept identification methods comprising compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.
22. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.
23. The method according to claim 10 wherein said conceptually annotated documents and other text-forms are stored.
24. The method according to claim 1 wherein said concept representations to be defined and learned comprise hierarchies, rules, operators, patterns, and macros.
25. The method according to claim 1, further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) marking up instances of concepts in the text of documents and other text-forms; b) creating new concept representations in the concept specification language from said highlighted instances of concepts; c) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.
26. The method according to claim 25 wherein creating new concept representations of said concept specification language comprises: a) using concept identification methods to match together concept specification language vocabulary specifications and highlighted linguistically annotated documents and other text-forms; b) defining linguistic variants; c) adding synonyms from a set of synonyms; d) adding parts of speech.
27. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) highlighting instances of concepts in the text of linguistically annotated documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; where b) said linguistically annotated documents and other text-forms are stored or produced on demand; and c) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; d) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and e) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.
28. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) marking up instances of concepts in the text of documents and other text-forms to produce highlighted documents and other text-forms; b) identification of linguistic entities in said highlighted documents and other text-forms and annotation of said documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; c) said highlighted text documents and other text-forms are stored or produced on demand;
d) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; e) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and f) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.
29. The method according to claim 1 wherein said user-defined descriptions of concepts represented in said concept specification language comprise user queries to an information retrieval system, said user queries being represented in said concept specification language.
30. The method according to claim 29 wherein, if all known queries are represented in said concept specification language, then a proposed query represented in said concept specification language is subsequently used by said retrieval method.
31. The method according to claim 29 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is found, then the query is subsequently used by said method of retrieval.
32. The method according to claim 29 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is not found, then the query is subsequently used by said method of conceptual annotation.
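The routing described in claims 31 and 32 amounts to a lookup in a repository of known queries. The following is a minimal sketch under assumptions: queries are compared as whitespace-normalized strings, the query strings shown are not real CSL syntax, and adding an unseen query to the repository once it has been sent to annotation is a design choice of this example rather than something the claims state.

```python
def normalize(query: str) -> str:
    """Collapse runs of whitespace so trivially different spellings compare equal."""
    return " ".join(query.split())


def route_query(query: str, repository: set[str]) -> str:
    """Decide which process a proposed query is passed to."""
    key = normalize(query)
    if key in repository:
        # Claim 31: a match is found, so the query goes straight to retrieval.
        return "retrieve"
    # Claim 32: no match, so the query is first used for conceptual annotation.
    repository.add(key)  # assumption: remember it so later uses can retrieve directly
    return "annotate"
```

A known query can go directly to retrieval because the documents already carry the corresponding concept annotations; an unseen one has to be applied to the documents first.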
33. The method according to claim 29 wherein retrieval matches said user-defined descriptions against said annotated text and retrieves matching documents and other text-forms.
34. The method according to claim 1 comprising: b) said annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms comprises annotation of said identified linguistic entities in a Text Markup Language (TML) to produce linguistically annotated documents and other text-forms; d) said identification of concepts using linguistic information comprises identification of Concepts and Concept Rules using linguistic information, where said Concepts and Concept Rules are represented in a Concept Specification Language (CSL) and said Concepts-to-be-identified and Concept Rules-to-be-identified occur in one of:
1) said text of documents and other text-forms in which linguistic entities have been identified, or
2) said linguistically annotated documents and other text-forms; or
3) said stored linguistically annotated documents and other text-forms; e) said annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms comprises annotation of said identified Concepts and Concept Rules in said TML to produce conceptually annotated documents and other text-forms; g) defining and learning CSL Concepts and Concept Rules;
h) said checking user-defined descriptions of concepts represented in said concept specification language comprises checking user-defined descriptions of Concepts and Concept Rules represented in CSL; and i) said retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms comprises retrieval by matching said user-defined descriptions of CSL Concepts and Concept Rules against said conceptually annotated documents and other text-forms.
35. A system for implementing said method according to claim 1 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.
36. A system for implementing said method according to claim 34 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.
37. The system of claim 35 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.
38. The system of claim 36 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.
39. The system according to claim 37 wherein predetermined information retrieval processes, accessed by said user interface, comprise: a) identification of linguistic entities in the text of documents and other text-forms; b) annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms;
c) storage of said linguistically annotated documents and other text-forms; d) identification of concepts using linguistic information, where said concepts are represented in a concept specification language and said concepts to be identified occur in one of:
1) said text of documents and other text-forms in which linguistic entities have been identified in step a), or
2) said linguistically annotated documents and other text-forms of step b); or
3) stored linguistically annotated documents and other text-forms of step c); e) annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms; f) storage of said conceptually annotated documents and other text-forms; g) defining and learning concept representations of said concept specification language; h) checking user-defined descriptions of concepts represented in said concept specification language; and i) retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms.
40. The system according to claim 38 wherein predetermined information retrieval processes, accessed by said user interface, comprise a text document annotator, CSL processor, CSL parser, and text document retriever.
41. The system according to claim 40 wherein said text document annotator, accessed by said user interface, comprises a document loader from a document database, which passes text documents to the annotator, and outputs one or more annotated documents.
42. The system according to claim 41 wherein said annotator takes as input one or more text documents, outputs one or more annotated documents, and is comprised of a linguistic annotator which passes linguistically annotated documents to a conceptual annotator.
43. The system according to claim 42 wherein said linguistically annotated documents are annotated with a representation in a Text Markup Language.
44. The system according to claim 42 wherein said Text Markup Language (TML) has the syntax of XML, and conversion to and from TML is accomplished with an XML converter.
45. The system according to claim 42 wherein said linguistic annotator, taking as input one or more text documents, and outputting one or more linguistically annotated documents, comprises one or more of the following: a) a preprocessor; b) a tagger; and c) a parser.
46. The system according to claim 45 wherein said preprocessor, taking as input one or more text documents or the documents output by any other appropriate linguistic identification process, and producing as output one or more preprocessed documents, comprises means for one or more of the following: a) breaking text into words; b) marking phrase boundaries; c) identifying numbers, symbols, and other punctuation; d) expanding abbreviations; and e) splitting apart contractions.
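The preprocessing steps listed in claim 46 can be illustrated with a minimal Python sketch. It covers items a), c), d) and e) only (phrase-boundary marking is omitted), and everything named here is an assumption made for illustration: the abbreviation table, the contraction rules, the token regular expression, and the function names do not come from the claims.

```python
import re

# Hypothetical abbreviation table; the claims do not list any entries.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately"}

# Hypothetical contraction rules: suffix -> expanded second token.
CONTRACTION_SUFFIXES = {"n't": "not", "'re": "are", "'ll": "will", "'ve": "have"}

# Numbers first, then words (optionally with an apostrophe part), then single symbols.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")


def expand_abbreviations(text):
    """Replace each known abbreviation with its long form (item d)."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def split_contraction(token):
    """Split a contraction such as "didn't" into two tokens (item e)."""
    for suffix, expansion in CONTRACTION_SUFFIXES.items():
        if token.lower().endswith(suffix) and len(token) > len(suffix):
            return [token[: -len(suffix)], expansion]
    return [token]


def classify(token):
    """Label a token as a number, punctuation/symbol, or word (item c)."""
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "NUMBER"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    return "WORD"


def preprocess(text):
    """Break text into classified tokens (item a), after expanding abbreviations."""
    text = expand_abbreviations(text)
    tokens = []
    for raw in TOKEN_RE.findall(text):
        for piece in split_contraction(raw):
            tokens.append((piece, classify(piece)))
    return tokens


tokens = preprocess("Dr. Lee didn't pay 42 dollars.")
```

In this sketch the output is a list of (token, kind) pairs, which a tagger such as the one in claim 47 could then consume.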
47. The system according to claim 45 wherein said tagger takes as input a set of tags, one or more preprocessed documents or the documents output by any other appropriate linguistic identification process and produces as output one or more documents tagged with the appropriate part of speech from a given tagset.
48. The system according to claim 45 wherein said parser takes as input one or more tagged documents or the documents output by any other appropriate linguistic identification process and produces as output one or more parsed documents.
49. The system according to claim 42 wherein said conceptual annotator takes as input one or more linguistically annotated documents, a list of CSL Concepts and Concept Rules for annotation, and optionally data from a synonym resource, and outputs one or more conceptually annotated documents.
50. The system according to claim 42 wherein said conceptually annotated documents are annotated with a representation in TML.
51. The system according to claim 42 wherein said input of one or more linguistically annotated documents to said conceptual annotator comprises at least one of the following sources: a) the linguistic annotator directly; b) storage in some linguistically annotated form such as the representation produced by the final linguistic identification process of the linguistic annotator; and c) storage in TML followed by conversion from TML to the representation produced by the final linguistic identification process of the linguistic annotator.
52. The system according to claim 42 wherein said conceptual annotator comprises a Concept identifier.
53. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of: a) compiling CSL into finite state automata (FSAs); b) matching said FSAs against linguistically annotated documents.
54. The system according to claim 53 wherein said compilation into FSAs also includes as part of compilation one or both of the following: a) the grammar from the parser used by the system to parse linguistically annotated documents; and b) sets of synonyms.
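As a rough illustration of claims 53 and 54 b), the following Python sketch compiles a very small pattern into a finite state automaton and runs it over a word sequence. It assumes the pattern is just a fixed sequence of words, which is far simpler than a full CSL expression, and it folds a hypothetical synonym table into the transitions at compile time. The function names and the data layout are assumptions, not the patent's implementation.

```python
def compile_fsa(pattern, synonyms=None):
    """Compile a word sequence into a linear automaton.

    pattern  -- list of words that must appear consecutively
    synonyms -- hypothetical mapping: word -> iterable of accepted synonyms
    Returns (transitions, accept_state), where transitions maps
    (state, word) to the next state.
    """
    synonyms = synonyms or {}
    transitions = {}
    for state, word in enumerate(pattern):
        # Each synonym gets the same transition as the word itself.
        for variant in {word, *synonyms.get(word, ())}:
            transitions[(state, variant.lower())] = state + 1
    return transitions, len(pattern)


def run_fsa(fsa, words):
    """Return (start, end) spans (end exclusive) where the automaton accepts."""
    transitions, accept = fsa
    matches = []
    for start in range(len(words)):
        state, pos = 0, start
        while state != accept and pos < len(words):
            next_state = transitions.get((state, words[pos].lower()))
            if next_state is None:
                break
            state, pos = next_state, pos + 1
        if state == accept:
            matches.append((start, pos))
    return matches


fsa = compile_fsa(
    ["hit", "the", "car"],
    synonyms={"hit": {"struck"}, "car": {"vehicle", "automobile"}},
)
spans = run_fsa(fsa, "the driver struck the vehicle".split())
```

Compiling synonyms into the automaton, as claim 54 b) suggests, means the matcher itself needs no synonym lookups at match time.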
55. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of recursive descent matching which consists of traversing an expression in CSL and recursively matching constituents of said expression against linguistic entities in linguistically annotated text.
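The recursive descent matching of claim 55 can be sketched as follows in Python. The expression classes `Word`, `Seq` and `Or` are hypothetical stand-ins for CSL constructs (a single-term pattern, consecutive sub-patterns, and a Boolean alternative); the actual CSL syntax and operator set are not reproduced here. The matcher traverses the expression and recursively matches each constituent against tagged tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """Single-term pattern: a word with an optional part-of-speech tag."""
    text: str
    tag: Optional[str] = None


@dataclass(frozen=True)
class Seq:
    """Sub-patterns that must match one after another."""
    parts: tuple


@dataclass(frozen=True)
class Or:
    """Alternatives, any one of which may match."""
    options: tuple


def match_at(expr, tokens, start):
    """Yield every end position at which expr matches, starting at start."""
    if isinstance(expr, Word):
        if start < len(tokens):
            word, tag = tokens[start]
            if word.lower() == expr.text.lower() and expr.tag in (None, tag):
                yield start + 1
    elif isinstance(expr, Or):
        for option in expr.options:
            yield from match_at(option, tokens, start)
    elif isinstance(expr, Seq):
        if not expr.parts:
            yield start
        else:
            head, rest = expr.parts[0], Seq(expr.parts[1:])
            for middle in match_at(head, tokens, start):
                yield from match_at(rest, tokens, middle)
    else:
        raise TypeError(f"unsupported expression: {expr!r}")


def find_matches(expr, tokens):
    """Return all (start, end) spans, end exclusive, where expr matches."""
    return [
        (start, end)
        for start in range(len(tokens))
        for end in match_at(expr, tokens, start)
    ]


tokens = [("the", "DT"), ("driver", "NN"), ("hit", "VBD"), ("the", "DT"), ("car", "NN")]
expr = Seq((Word("the"), Or((Word("car", "NN"), Word("driver", "NN")))))
spans = find_matches(expr, tokens)  # [(0, 2), (3, 5)]
```

Using generators keeps every alternative available to the enclosing expression, so a `Seq` can try each way its first part matched before giving up.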
56. The system according to claim 53 wherein said recursive descent matching comprises sets of synonyms.
57. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of bottom-up matching which comprises: a) generating in a bottom-up fashion multiple spans, where each span is
1) a word or constituent and, optionally, structural information about the word or constituent, or
2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent; b) generating in a bottom-up fashion spans consumed by single-term patterns in an expression in CSL; c) generating in a bottom-up fashion spans consumed by operators in an expression in CSL; and d) matching in a bottom-up fashion said spans against linguistic entities in linguistically annotated documents.
58. The system according to claim 57 wherein said bottom-up matching comprises sets of synonyms.
59. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts that are index-based comprising use of an inverted index, where a) said inverted index contains words, constituents, and tags for linguistic information from linguistically annotated text; b) said inverted index contains spans for said words, constituents, and tags from linguistically annotated text; c) where a span is
1) a word or constituent and, optionally, structural information about the word or constituent, or
2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent.
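The inverted index of claim 59 can be pictured with a short Python sketch. It assumes the linguistically annotated text is available as a list of (word, tag) tokens plus a list of (label, start, end) constituents, and it stores spans as (start, end) pairs with an exclusive end. The key layout and function name are illustrative assumptions.

```python
from collections import defaultdict


def build_inverted_index(tokens, constituents):
    """Map words, tags and constituent labels to the spans where they occur.

    tokens       -- list of (word, part_of_speech_tag)
    constituents -- list of (label, start, end) with end exclusive
    """
    index = defaultdict(list)
    for position, (word, tag) in enumerate(tokens):
        span = (position, position + 1)
        index[("word", word.lower())].append(span)
        index[("tag", tag)].append(span)
    for label, start, end in constituents:
        index[("constituent", label)].append((start, end))
    return dict(index)


tokens = [("the", "DT"), ("driver", "NN"), ("hit", "VBD"), ("the", "DT"), ("car", "NN")]
constituents = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
index = build_inverted_index(tokens, constituents)
# index[("word", "the")] == [(0, 1), (3, 4)]
# index[("constituent", "NP")] == [(0, 2), (3, 5)]
```

A matcher can then look up each item of a CSL expression directly in the index instead of scanning the text.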
60. The system according to claim 57 wherein said Concept identifier using index-based methods produces conceptually annotated documents as a result of index-based matching, where said index-based matching comprises: a) using backtracking to resolve the constraints of CSL operators in an expression in CSL; b) attaching iterators to all items in the CSL expression; c) using the iterators to produce matches of all items in the CSL expression against text in the inverted index; d) maintaining a state for the iterator for each item in the CSL expression where that state is used to determine whether or not it has been processed before in the match of said expression against said inverted index, and also relevant information about the progress of the match; e) maintaining a state for the iterator for each item that is a word in the expression in CSL where that state comprises the following information: a list of applicable synonyms of the word in question, and the current synonym being used for matching; an iterator into the inverted index that can enumerate all instances of the word in said index, and which records the current word; f) during the course of a match, each item in the CSL expression is tested, and if successful, returns a set of spans covering the match of its corresponding sub-expression (i.e., components of said CSL expression).
61. The system according to claim 60 wherein said index-based matching comprises sets of synonyms.
62. The method according to claim 57 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching comprises identifying sets of candidate spans, where a) a candidate span is a span that may contain a Concept to be identified (matched); b) any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a Concept to be identified (matched); c) each sub-expression of a CSL expression is associated with a procedure; d) each such procedure is used to generate candidate spans or to check whether a given span is a candidate span.
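One way to picture candidate spans is as a conservative prefilter, sketched below in Python. The sketch makes strong simplifying assumptions that are not in the claims: a Concept is reduced to a set of words that must all occur, matches never cross sentence boundaries, and the index maps each word to a list of word positions. Under those assumptions, any sentence missing a required word cannot contain a match, so only the remaining sentences need to be handed to a full matcher.

```python
def candidate_spans(required_words, word_index, sentence_spans):
    """Return the sentence spans that contain every required word.

    required_words -- words that a match must contain (an assumption of this sketch)
    word_index     -- hypothetical inverted index: word -> list of word positions
    sentence_spans -- list of (start, end) spans with end exclusive
    """
    candidates = []
    for start, end in sentence_spans:
        if all(
            any(start <= position < end for position in word_index.get(word, ()))
            for word in required_words
        ):
            candidates.append((start, end))
    return candidates


word_index = {"driver": [1, 7], "hit": [2], "car": [4, 9]}
sentences = [(0, 6), (6, 11)]
spans = candidate_spans({"driver", "hit"}, word_index, sentences)  # [(0, 6)]
```

The filter can produce false positives (a candidate that turns out not to match) but, under the stated assumptions, no false negatives, which is the property item b) of claim 62 describes.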
63. The system according to claim 62 wherein said candidate spans produced by said candidate checking index-based matching serve as input to Concept identification methods comprising compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.
64. The system according to claim 57 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.
65. The system according to claim 49 wherein said conceptually annotated documents are stored.
66. The system according to claim 20 wherein said CSL processor, accessed by said user interface, comprises a CSL Concept and Concept Rule learner, and a CSL query checker.
67. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises: a) highlighting instances of Concepts in the text of documents; b) creating new CSL Rules from said highlighted instances of Concepts; c) creating new CSL Concepts from said CSL Rules; d) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.
68. The system according to claim 67 wherein creating new CSL Rules comprises: a) using the Concept identifier to match together CSL vocabulary specifications and highlighted linguistically annotated documents; b) defining linguistic variants; c) adding synonyms from a set of synonyms; d) adding parts of speech.
69. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises means for: a) highlighting instances of Concepts in the text of linguistically annotated documents to produce highlighted linguistically annotated documents; where b) said linguistically annotated documents can be either produced on demand or stored in TML or other formats; and c) said highlighted linguistically annotated documents can be either produced on demand or stored in TML or other formats; d) producing new CSL Concept Rules from said highlighted instances of Concepts in said highlighted linguistically annotated document; and e) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.
70. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises means for: a) highlighting instances of Concepts in the text of documents to produce highlighted documents; b) linguistic annotation of said documents to produce highlighted linguistically annotated documents; c) said highlighted text documents can be either produced on demand or stored in TML or other formats; d) said highlighted linguistically annotated documents can be either produced on demand or stored in TML or other formats; e) producing new CSL Concept Rules from said highlighted instances of Concepts in said highlighted linguistically annotated documents; and f) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.
71. The system according to claim 34 wherein said user-defined descriptions of CSL Concepts and Concept Rules comprise user queries to an information retrieval system, said user queries being represented in CSL.
72. The system according to claim 66 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are known in advance, passes said query to the retriever.
73. The system according to claim 66 wherein said CSL query checker accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is found, then the query is parsed with a CSL parser and passed to the retriever.
74. The system according to claim 66 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is not found, then the query is parsed with a CSL parser and added to the list of CSL Concepts and Concept Rules to be annotated, which are then passed to the annotator.
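The three branches of the CSL query checker in claims 72 to 74 can be summarised in a small Python sketch. The function name, the tuple return values, the use of exact set membership as the "match" against known Concepts, and the `parse` callable standing in for the CSL parser are all assumptions made for illustration.

```python
def route_query(query, known_concepts, to_annotate, queries_known_in_advance, parse):
    """Decide where a proposed CSL query goes; returns (destination, payload).

    known_concepts -- set of already known CSL Concepts and Concept Rules
    to_annotate    -- list of parsed Concepts and Rules waiting for annotation
    parse          -- hypothetical stand-in for the CSL parser
    """
    if queries_known_in_advance:
        # Claim 72: the query is passed straight to the retriever.
        return ("retriever", query)
    parsed = parse(query)
    if query in known_concepts:
        # Claim 73: a match was found, so the parsed query goes to the retriever.
        return ("retriever", parsed)
    # Claim 74: no match, so the parsed query joins the list to be annotated,
    # and that list is passed to the annotator.
    to_annotate.append(parsed)
    return ("annotator", list(to_annotate))


def toy_parse(query):
    """Toy parser used only for this example."""
    return ("parsed", query)


known = {"Accident"}
pending = []
first = route_query("Accident", known, pending, False, toy_parse)
second = route_query("Theft", known, pending, False, toy_parse)
```

Here `first` is routed to the retriever and `second` to the annotator, after which `pending` holds the newly parsed query.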
75. The system according to claim 40 wherein said CSL parser takes as input a synonym database, CSL query, and CSL Concepts and Rules, and outputs CSL Concepts and Rules for annotation as a result of the following: a) word compilation; b) Concept compilation; c) downward synonym propagation; and d) upward synonym propagation.
76. The system according to claim 40 wherein said text document retriever, accessed by said user interface, comprises a retriever which takes one or more annotated documents as input, passes retrieved and categorized documents to a TML converter, which passes them to a document viewer.
77. The method according to claim 34 wherein a tag hierarchy in the CSL is a set of declarations, each declaration relating a tag to a set of tags, declaring that each of the latter tags is to be considered an instance of the former tag.
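The tag hierarchy of claim 77 maps naturally onto a dictionary of declarations, as in this Python sketch. The tag names are invented examples, and two behaviours are assumptions beyond the claim text: a tag counts as an instance of itself, and instance-of is followed transitively through nested declarations.

```python
def is_instance(tag, ancestor, hierarchy):
    """Return True if tag is declared, directly or indirectly, an instance of ancestor.

    hierarchy -- dict mapping a tag to the set of tags declared as its instances
    """
    if tag == ancestor:
        return True  # assumption: every tag is an instance of itself
    seen = set()
    stack = [ancestor]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cyclic declarations
        seen.add(current)
        for child in hierarchy.get(current, ()):
            if child == tag:
                return True
            stack.append(child)
    return False


# Hypothetical declarations: each key tag covers the tags in its set.
hierarchy = {
    "NOUN": {"NN", "NNS", "NNP"},
    "VERB": {"VB", "VBD"},
    "CONTENT": {"NOUN", "VERB"},
}
```

With such a hierarchy, a pattern written against `NOUN` would also accept tokens tagged `NNS`, so patterns can stay independent of the tagger's exact tagset.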
78. The method according to claim 34 wherein a Concept in the CSL is used to represent concepts.
79. The method according to claim 78 wherein a Concept in the CSL can either be global or internal to other Concepts.
80. The method according to claim 78 wherein a Concept in the CSL uses words and other Concepts in the definition of Concept Rules.
81. The method according to claim 80 wherein a Concept Rule in the CSL comprises an optional name internal to the Concept followed by a Pattern.
82. The method according to claim 81 wherein a Pattern in the CSL may match: a) single terms in an annotated text (a "single-term Pattern"); or b) some configuration in an annotated text (a "configurational Pattern").
83. The method according to claim 82 wherein a single-term Pattern in the CSL comprises a reference to: a) the name of a word; b) optionally, its part of speech tag; and c) optionally, synonyms of the word.
84. The method according to claim 82 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is Boolean:
85. The method according to claim 82 wherein a configurational Pattern in the CSL is any expression in the notation used to represent syntactic descriptions.
86. The method according to claim 85 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is of two types: a) Dominance, and b) Precedence.
87. The method according to claim 86 wherein a configurational Pattern in the CSL consists of the form A Dominates B, where a) A is a syntactic constituent (which can be identified by a phrasal tag, though not necessarily); b) B is any Pattern; and c) the entire Pattern matches any configuration where what B refers to is a subconstituent of A.
88. The method according to claim 87 wherein a configurational Pattern in the CSL of the form A Dominates B is wide-matched, where said wide-matching returns the interval of the dominant expression A in a text instead of the interval of the dominated expression B, and where said interval is a consecutive sequence of words in a text that is commonly though not necessarily represented as two integers separated by a dash.
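The difference between ordinary matching and wide-matching of A Dominates B (claims 87 and 88) can be shown with a brief Python sketch. It assumes constituents are given as (label, start, end) triples, that B has already been matched to a list of spans, and that spans use an exclusive end position; whether the patent's intervals are inclusive or exclusive is not stated in the claims, so that choice is an assumption.

```python
def dominates(constituents, label, inner_spans, wide=False):
    """Match "A Dominates B", where A is a constituent with the given label.

    constituents -- list of (label, start, end), end exclusive
    inner_spans  -- spans already matched by pattern B
    wide         -- if True, return the interval of A instead of that of B
    """
    results = []
    for c_label, c_start, c_end in constituents:
        if c_label != label:
            continue
        for b_start, b_end in inner_spans:
            # B must lie inside A to count as a subconstituent.
            if c_start <= b_start and b_end <= c_end:
                span = (c_start, c_end) if wide else (b_start, b_end)
                if span not in results:
                    results.append(span)
    return results


def format_interval(span):
    """Render an interval as two integers separated by a dash."""
    return f"{span[0]}-{span[1]}"


# "the driver hit the car": NP(0-2), VP(2-5), NP(3-5); B matched the word "car".
constituents = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
car_spans = [(4, 5)]
narrow = dominates(constituents, "NP", car_spans)           # [(4, 5)]
wide = dominates(constituents, "NP", car_spans, wide=True)  # [(3, 5)]
```

Wide-matching is useful when the surrounding phrase, rather than the dominated word alone, is what should be annotated or highlighted.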
89. The method according to claim 86 wherein a configurational Pattern in the CSL consists of the form A Precedes B, where a) A is any Pattern; b) B is any Pattern; and c) the entire Pattern matches any configuration where what B refers to is a subconstituent of A.
90. The method according to claim 84 wherein a Boolean operator in the CSL can be applied to any Patterns to obtain further Patterns.
91. The method according to claim 82 wherein any of the Patterns defined in the CSL is a CSL Expression.
92. The method according to claim 82 wherein a Pattern defined in the CSL is fully recursive.
93. The method according to claim 82 wherein a Macro in the CSL represents a Pattern in a compact, parameterized form and can be used wherever a Pattern is used.
94. The method according to claim 1 wherein said concepts, represented in said concept specification language, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.
95. The method according to claim 34 wherein said Concepts, represented in said CSL, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.
PCT/CA2001/001398 2000-09-29 2001-09-28 A method and system for describing and identifying concepts in natural language text for information retrieval and processing WO2002027524A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CA002423964A CA2423964A1 (en) 2000-09-29 2001-09-28 A method and system for describing and identifying concepts in natural language text for information retrieval and processing
AU2001293595A AU2001293595A1 (en) 2000-09-29 2001-09-28 A method and system for describing and identifying concepts in natural language text for information retrieval and processing
EP01973933A EP1393200A2 (en) 2000-09-29 2001-09-28 A method and system for describing and identifying concepts in natural language text for information retrieval and processing
US10/398,129 US7346490B2 (en) 2000-09-29 2001-09-28 Method and system for describing and identifying concepts in natural language text for information retrieval and processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US23634200P 2000-09-29 2000-09-29
US60/236,342 2000-09-29

Publications (3)

Publication Number Publication Date
WO2002027524A2 WO2002027524A2 (en) 2002-04-04
WO2002027524A3 WO2002027524A3 (en) 2003-11-20
WO2002027524B1 WO2002027524B1 (en) 2004-04-29

Family

ID=22889103

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CA2001/001398 WO2002027524A2 (en) 2000-09-29 2001-09-28 A method and system for describing and identifying concepts in natural language text for information retrieval and processing
PCT/CA2001/001399 WO2002027538A2 (en) 2000-09-29 2001-09-28 A method and system for adapting synonym resources to specific domains

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CA2001/001399 WO2002027538A2 (en) 2000-09-29 2001-09-28 A method and system for adapting synonym resources to specific domains

Country Status (5)

Country Link
US (2) US7346490B2 (en)
EP (2) EP1393200A2 (en)
AU (2) AU2001293596A1 (en)
CA (2) CA2423964A1 (en)
WO (2) WO2002027524A2 (en)

Families Citing this family (215)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6825844B2 (en) * 2001-01-16 2004-11-30 Microsoft Corp System and method for optimizing a graphics intensive software program for the user's graphics hardware
US7120868B2 (en) * 2002-05-30 2006-10-10 Microsoft Corp. System and method for adaptive document layout via manifold content
US20070265834A1 (en) * 2001-09-06 2007-11-15 Einat Melnick In-context analysis
US7139695B2 (en) * 2002-06-20 2006-11-21 Hewlett-Packard Development Company, L.P. Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging
US7266553B1 (en) * 2002-07-01 2007-09-04 Microsoft Corporation Content data indexing
US20040030540A1 (en) * 2002-08-07 2004-02-12 Joel Ovil Method and apparatus for language processing
US8335683B2 (en) * 2003-01-23 2012-12-18 Microsoft Corporation System for using statistical classifiers for spoken language understanding
US7174507B2 (en) * 2003-02-10 2007-02-06 Kaidara S.A. System method and computer program product for obtaining structured data from text
JP2004310691A (en) 2003-04-10 2004-11-04 Mitsubishi Electric Corp Text information processor
US20040243531A1 (en) * 2003-04-28 2004-12-02 Dean Michael Anthony Methods and systems for representing, using and displaying time-varying information on the Semantic Web
CA2523586A1 (en) * 2003-05-01 2004-11-11 Axonwave Software Inc. A method and system for concept generation and management
US7246311B2 (en) * 2003-07-17 2007-07-17 Microsoft Corporation System and methods for facilitating adaptive grid-based document layout
US20050108630A1 (en) * 2003-11-19 2005-05-19 Wasson Mark D. Extraction of facts from text
US20050283357A1 (en) * 2004-06-22 2005-12-22 Microsoft Corporation Text mining method
US8380484B2 (en) * 2004-08-10 2013-02-19 International Business Machines Corporation Method and system of dynamically changing a sentence structure of a message
US20060047636A1 (en) * 2004-08-26 2006-03-02 Mohania Mukesh K Method and system for context-oriented association of unstructured content with the result of a structured database query
US7716056B2 (en) * 2004-09-27 2010-05-11 Robert Bosch Corporation Method and system for interactive conversational dialogue for cognitively overloaded device users
US7610191B2 (en) * 2004-10-06 2009-10-27 Nuance Communications, Inc. Method for fast semi-automatic semantic annotation
US7779049B1 (en) 2004-12-20 2010-08-17 Tw Vericept Corporation Source level optimization of regular expressions
WO2006076398A2 (en) * 2005-01-12 2006-07-20 Metier Ltd Predictive analytic method and apparatus
US7937396B1 (en) 2005-03-23 2011-05-03 Google Inc. Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments
US7996356B2 (en) * 2005-03-24 2011-08-09 Xerox Corporation Text searching and categorization tools
WO2006110684A2 (en) * 2005-04-11 2006-10-19 Textdigger, Inc. System and method for searching for a query
US8280719B2 (en) * 2005-05-05 2012-10-02 Ramp, Inc. Methods and systems relating to information extraction
EP1889181A4 (en) 2005-05-16 2009-12-02 Ebay Inc Method and system to process a data search request
WO2006128183A2 (en) 2005-05-27 2006-11-30 Schwegman, Lundberg, Woessner & Kluth, P.A. Method and apparatus for cross-referencing important ip relationships
US8055608B1 (en) 2005-06-10 2011-11-08 NetBase Solutions, Inc. Method and apparatus for concept-based classification of natural language discourse
US8086605B2 (en) * 2005-06-28 2011-12-27 Yahoo! Inc. Search engine with augmented relevance ranking by community participation
US7809551B2 (en) * 2005-07-01 2010-10-05 Xerox Corporation Concept matching system
US7689411B2 (en) * 2005-07-01 2010-03-30 Xerox Corporation Concept matching
US7937265B1 (en) 2005-09-27 2011-05-03 Google Inc. Paraphrase acquisition
US8498999B1 (en) * 2005-10-14 2013-07-30 Wal-Mart Stores, Inc. Topic relevant abbreviations
WO2007081681A2 (en) 2006-01-03 2007-07-19 Textdigger, Inc. Search system with query refinement and search method
US7788088B2 (en) * 2006-01-11 2010-08-31 International Business Machines Corporation Natural language interaction with large databases
US20070185860A1 (en) * 2006-01-24 2007-08-09 Michael Lissack System for searching
US8977953B1 (en) * 2006-01-27 2015-03-10 Linguastat, Inc. Customizing information by combining pair of annotations from at least two different documents
US8195683B2 (en) * 2006-02-28 2012-06-05 Ebay Inc. Expansion of database search queries
US8423348B2 (en) * 2006-03-08 2013-04-16 Trigent Software Ltd. Pattern generation
WO2007114932A2 (en) 2006-04-04 2007-10-11 Textdigger, Inc. Search system and method with text function tagging
US8442965B2 (en) 2006-04-19 2013-05-14 Google Inc. Query language identification
US8380488B1 (en) 2006-04-19 2013-02-19 Google Inc. Identifying a property of a document
US8255376B2 (en) * 2006-04-19 2012-08-28 Google Inc. Augmenting queries with synonyms from synonyms map
US8762358B2 (en) * 2006-04-19 2014-06-24 Google Inc. Query language determination using query terms and interface language
US7835903B2 (en) * 2006-04-19 2010-11-16 Google Inc. Simplifying query terms with transliteration
US7853446B2 (en) * 2006-05-02 2010-12-14 International Business Machines Corporation Generation of codified electronic medical records by processing clinician commentary
US20070260478A1 (en) * 2006-05-02 2007-11-08 International Business Machines Corporation Delivery of Health Insurance Plan Options
US8041730B1 (en) * 2006-10-24 2011-10-18 Google Inc. Using geographic data to identify correlated geographic synonyms
JP4865526B2 (en) * 2006-12-18 2012-02-01 株式会社日立製作所 Data mining system, data mining method, and data search system
US7925498B1 (en) * 2006-12-29 2011-04-12 Google Inc. Identifying a synonym with N-gram agreement for a query phrase
US8356245B2 (en) * 2007-01-05 2013-01-15 International Business Machines Corporation System and method of automatically mapping a given annotator to an aggregate of given annotators
US8131536B2 (en) * 2007-01-12 2012-03-06 Raytheon Bbn Technologies Corp. Extraction-empowered machine translation
US9093073B1 (en) * 2007-02-12 2015-07-28 West Corporation Automatic speech recognition tagging
US7945438B2 (en) * 2007-04-02 2011-05-17 International Business Machines Corporation Automated glossary creation
US20080250008A1 (en) * 2007-04-04 2008-10-09 Microsoft Corporation Query Specialization
US8886521B2 (en) * 2007-05-17 2014-11-11 Redstart Systems, Inc. System and method of dictation for a speech recognition command system
JP2010532897A (en) 2007-07-10 2010-10-14 インターナショナル・ビジネス・マシーンズ・コーポレーション Intelligent text annotation method, system and computer program
US8037086B1 (en) 2007-07-10 2011-10-11 Google Inc. Identifying common co-occurring elements in lists
US8209321B2 (en) * 2007-08-31 2012-06-26 Microsoft Corporation Emphasizing search results according to conceptual meaning
US8280721B2 (en) 2007-08-31 2012-10-02 Microsoft Corporation Efficiently representing word sense probabilities
US8463593B2 (en) * 2007-08-31 2013-06-11 Microsoft Corporation Natural language hypernym weighting for word sense disambiguation
US8712758B2 (en) 2007-08-31 2014-04-29 Microsoft Corporation Coreference resolution in an ambiguity-sensitive natural language processing system
US8346756B2 (en) * 2007-08-31 2013-01-01 Microsoft Corporation Calculating valence of expressions within documents for searching a document index
US8316036B2 (en) * 2007-08-31 2012-11-20 Microsoft Corporation Checkpointing iterators during search
US20090070322A1 (en) * 2007-08-31 2009-03-12 Powerset, Inc. Browsing knowledge on the basis of semantic relations
US8229970B2 (en) * 2007-08-31 2012-07-24 Microsoft Corporation Efficient storage and retrieval of posting lists
US8229730B2 (en) * 2007-08-31 2012-07-24 Microsoft Corporation Indexing role hierarchies for words in a search index
US8868562B2 (en) * 2007-08-31 2014-10-21 Microsoft Corporation Identification of semantic relationships within reported speech
US20090063521A1 (en) * 2007-09-04 2009-03-05 Apple Inc. Auto-tagging of aliases
US8655868B2 (en) 2007-09-12 2014-02-18 Ebay Inc. Inference of query relationships based on retrieved attributes
US7890539B2 (en) * 2007-10-10 2011-02-15 Raytheon Bbn Technologies Corp. Semantic matching using predicate-argument structure
US7761471B1 (en) * 2007-10-16 2010-07-20 Jpmorgan Chase Bank, N.A. Document management techniques to account for user-specific patterns in document metadata
US8181165B2 (en) * 2007-10-30 2012-05-15 International Business Machines Corporation Using annotations to reuse variable declarations to generate different service functions
WO2009059297A1 (en) * 2007-11-01 2009-05-07 Textdigger, Inc. Method and apparatus for automated tag generation for digital content
BG66255B1 (en) * 2007-11-14 2012-09-28 Ivaylo Popov Natural language formalization
US7962486B2 (en) * 2008-01-10 2011-06-14 International Business Machines Corporation Method and system for discovery and modification of data cluster and synonyms
US20090210229A1 (en) * 2008-02-18 2009-08-20 At&T Knowledge Ventures, L.P. Processing Received Voice Messages
US8061142B2 (en) * 2008-04-11 2011-11-22 General Electric Company Mixer for a combustor
US9646078B2 (en) * 2008-05-12 2017-05-09 Groupon, Inc. Sentiment extraction from consumer reviews for providing product recommendations
US8037069B2 (en) * 2008-06-03 2011-10-11 Microsoft Corporation Membership checking of digital text
US20090313243A1 (en) * 2008-06-13 2009-12-17 Siemens Aktiengesellschaft Method and apparatus for processing semantic data resources
US20090326925A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Projecting syntactic information using a bottom-up pattern matching algorithm
US20090326924A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Projecting Semantic Information from a Language Independent Syntactic Model
US8180629B2 (en) * 2008-07-10 2012-05-15 Trigent Softward Ltd. Automatic pattern generation in natural language processing
US8935152B1 (en) 2008-07-21 2015-01-13 NetBase Solutions, Inc. Method and apparatus for frame-based analysis of search results
US9047285B1 (en) 2008-07-21 2015-06-02 NetBase Solutions, Inc. Method and apparatus for frame-based search
US9424339B2 (en) 2008-08-15 2016-08-23 Athena A. Smyros Systems and methods utilizing a search engine
US20100042589A1 (en) * 2008-08-15 2010-02-18 Smyros Athena A Systems and methods for topical searching
US9092517B2 (en) * 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US8584085B2 (en) * 2008-09-24 2013-11-12 Accenture Global Services Limited Identification of concepts in software
US8370128B2 (en) * 2008-09-30 2013-02-05 Xerox Corporation Semantically-driven extraction of relations between named entities
KR101045762B1 (en) * 2008-11-03 2011-07-01 한국과학기술원 Real-time semantic annotation device and method for generating natural language string input by user as semantic readable knowledge structure document in real time
JP4735726B2 (en) * 2009-02-18 2011-07-27 ソニー株式会社 Information processing apparatus and method, and program
US8090770B2 (en) * 2009-04-14 2012-01-03 Fusz Digital Ltd. Systems and methods for identifying non-terrorists using social networking
US8190601B2 (en) * 2009-05-22 2012-05-29 Microsoft Corporation Identifying task groups for organizing search results
US9189475B2 (en) * 2009-06-22 2015-11-17 Ca, Inc. Indexing mechanism (nth phrasal index) for advanced leveraging for translation
JP2011033680A (en) * 2009-07-30 2011-02-17 Sony Corp Voice processing device and method, and program
US20150006563A1 (en) * 2009-08-14 2015-01-01 Kendra J. Carattini Transitive Synonym Creation
US8392441B1 (en) 2009-08-15 2013-03-05 Google Inc. Synonym generation using online decompounding and transitivity
US8812297B2 (en) 2010-04-09 2014-08-19 International Business Machines Corporation Method and system for interactively finding synonyms using positive and negative feedback
US9026529B1 (en) 2010-04-22 2015-05-05 NetBase Solutions, Inc. Method and apparatus for determining search result demographics
US8161073B2 (en) 2010-05-05 2012-04-17 Holovisions, LLC Context-driven search
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
US8903798B2 (en) 2010-05-28 2014-12-02 Microsoft Corporation Real-time annotation and enrichment of captured video
US9703782B2 (en) 2010-05-28 2017-07-11 Microsoft Technology Licensing, Llc Associating media with metadata of near-duplicates
JP2012027723A (en) * 2010-07-23 2012-02-09 Sony Corp Information processor, information processing method and information processing program
US8548989B2 (en) 2010-07-30 2013-10-01 International Business Machines Corporation Querying documents using search terms
US9792640B2 (en) 2010-08-18 2017-10-17 Jinni Media Ltd. Generating and providing content recommendations to a group of users
US8838453B2 (en) * 2010-08-31 2014-09-16 Red Hat, Inc. Interactive input method
US11423029B1 (en) 2010-11-09 2022-08-23 Google Llc Index-side stem-based variant generation
US8375042B1 (en) 2010-11-09 2013-02-12 Google Inc. Index-side synonym generation
US8498972B2 (en) * 2010-12-16 2013-07-30 Sap Ag String and sub-string searching using inverted indexes
FR2970795A1 (en) * 2011-01-25 2012-07-27 Synomia Method for filtering of synonyms in electronic document database in information system for searching information in e.g. Internet, involves performing reduction of number of synonyms of keyword based on score value of semantic proximity
US8688453B1 (en) * 2011-02-28 2014-04-01 Nuance Communications, Inc. Intent mining via analysis of utterances
US9904726B2 (en) * 2011-05-04 2018-02-27 Black Hills IP Holdings, LLC. Apparatus and method for automated and assisted patent claim mapping and expense planning
US9678992B2 (en) * 2011-05-18 2017-06-13 Microsoft Technology Licensing, Llc Text to image translation
US20130144863A1 (en) * 2011-05-25 2013-06-06 Forensic Logic, Inc. System and Method for Gathering, Restructuring, and Searching Text Data from Several Different Data Sources
US10643355B1 (en) 2011-07-05 2020-05-05 NetBase Solutions, Inc. Graphical representation of frame instances and co-occurrences
US9390525B1 (en) 2011-07-05 2016-07-12 NetBase Solutions, Inc. Graphical representation of frame instances
US20130013616A1 (en) * 2011-07-08 2013-01-10 Jochen Lothar Leidner Systems and Methods for Natural Language Searching of Structured Data
US20130197938A1 (en) * 2011-08-26 2013-08-01 Wellpoint, Inc. System and method for creating and using health data record
US9614901B2 (en) * 2011-08-26 2017-04-04 Nimblestack Inc. Data infrastructure for providing interconnectivity between platforms, devices, and operating systems
US9128581B1 (en) 2011-09-23 2015-09-08 Amazon Technologies, Inc. Providing supplemental information for a digital work in a user interface
US9639518B1 (en) 2011-09-23 2017-05-02 Amazon Technologies, Inc. Identifying entities in a digital work
US9613003B1 (en) * 2011-09-23 2017-04-04 Amazon Technologies, Inc. Identifying topics in a digital work
US9449526B1 (en) 2011-09-23 2016-09-20 Amazon Technologies, Inc. Generating a game related to a digital work
US9075799B1 (en) 2011-10-24 2015-07-07 NetBase Solutions, Inc. Methods and apparatus for query formulation
US10872082B1 (en) 2011-10-24 2020-12-22 NetBase Solutions, Inc. Methods and apparatuses for clustered storage of information
US8745019B2 (en) 2012-03-05 2014-06-03 Microsoft Corporation Robust discovery of entity synonyms using query logs
US9275044B2 (en) 2012-03-07 2016-03-01 Searchleaf, Llc Method, apparatus and system for finding synonyms
US8874435B2 (en) * 2012-04-17 2014-10-28 International Business Machines Corporation Automated glossary creation
US9037591B1 (en) 2012-04-30 2015-05-19 Google Inc. Storing term substitution information in an index
US20140040302A1 (en) * 2012-05-08 2014-02-06 Patrick Sander Walsh Method and system for developing a list of words related to a search concept
US8949263B1 (en) 2012-05-14 2015-02-03 NetBase Solutions, Inc. Methods and apparatus for sentiment analysis
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US9594831B2 (en) 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US20140006373A1 (en) * 2012-06-29 2014-01-02 International Business Machines Corporation Automated subject annotator creation using subject expansion, ontological mining, and natural language processing techniques
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
US9460069B2 (en) * 2012-10-19 2016-10-04 International Business Machines Corporation Generation of test data using text analytics
RU2530268C2 (en) * 2012-11-28 2014-10-10 Общество с ограниченной ответственностью "Спиктуит" Method for user training of information dialogue system
US10430506B2 (en) 2012-12-10 2019-10-01 International Business Machines Corporation Utilizing classification and text analytics for annotating documents to allow quick scanning
US9244909B2 (en) * 2012-12-10 2016-01-26 General Electric Company System and method for extracting ontological information from a body of text
US9286280B2 (en) 2012-12-10 2016-03-15 International Business Machines Corporation Utilizing classification and text analytics for optimizing processes in documents
US9154629B2 (en) * 2012-12-14 2015-10-06 Avaya Inc. System and method for generating personalized tag recommendations for tagging audio content
US9043926B2 (en) * 2012-12-20 2015-05-26 Symantec Corporation Identifying primarily monosemous keywords to include in keyword lists for detection of domain-specific language
US10339452B2 (en) 2013-02-06 2019-07-02 Verint Systems Ltd. Automated ontology development
DE102013003055A1 (en) * 2013-02-18 2014-08-21 Nadine Sina Kurz Method and apparatus for performing natural language searches
US9123335B2 (en) * 2013-02-20 2015-09-01 Jinni Media Limited System apparatus circuit method and associated computer executable code for natural language understanding and semantic content discovery
US9311297B2 (en) * 2013-03-14 2016-04-12 Prateek Bhatnagar Method and system for outputting information
US9135243B1 (en) 2013-03-15 2015-09-15 NetBase Solutions, Inc. Methods and apparatus for identification and analysis of temporally differing corpora
US10191893B2 (en) 2013-07-22 2019-01-29 Open Text Holdings, Inc. Information extraction and annotation systems and methods for documents
US8856642B1 (en) 2013-07-22 2014-10-07 Recommind, Inc. Information extraction and annotation systems and methods for documents
US9633009B2 (en) * 2013-08-01 2017-04-25 International Business Machines Corporation Knowledge-rich automatic term disambiguation
US20150066506A1 (en) 2013-08-30 2015-03-05 Verint Systems Ltd. System and Method of Text Zoning
US9311300B2 (en) 2013-09-13 2016-04-12 International Business Machines Corporation Using natural language processing (NLP) to create subject matter synonyms from definitions
US8949283B1 (en) 2013-12-23 2015-02-03 Google Inc. Systems and methods for clustering electronic messages
US9542668B2 (en) 2013-12-30 2017-01-10 Google Inc. Systems and methods for clustering electronic messages
US20150186455A1 (en) * 2013-12-30 2015-07-02 Google Inc. Systems and methods for automatic electronic message annotation
US9767189B2 (en) 2013-12-30 2017-09-19 Google Inc. Custom electronic message presentation based on electronic message category
US10033679B2 (en) 2013-12-31 2018-07-24 Google Llc Systems and methods for displaying unseen labels in a clustering in-box environment
US9306893B2 (en) 2013-12-31 2016-04-05 Google Inc. Systems and methods for progressive message flow
US9124546B2 (en) 2013-12-31 2015-09-01 Google Inc. Systems and methods for throttling display of electronic messages
US9152307B2 (en) 2013-12-31 2015-10-06 Google Inc. Systems and methods for simultaneously displaying clustered, in-line electronic messages in one display
US10255346B2 (en) 2014-01-31 2019-04-09 Verint Systems Ltd. Tagging relations with N-best
US9977830B2 (en) 2014-01-31 2018-05-22 Verint Systems Ltd. Call summary
US20150254211A1 (en) * 2014-03-08 2015-09-10 Microsoft Technology Licensing, Llc Interactive data manipulation using examples and natural language
US10380203B1 (en) 2014-05-10 2019-08-13 NetBase Solutions, Inc. Methods and apparatus for author identification of search results
US9378204B2 (en) 2014-05-22 2016-06-28 International Business Machines Corporation Context based synonym filtering for natural language processing systems
US11250450B1 (en) 2014-06-27 2022-02-15 Groupon, Inc. Method and system for programmatic generation of survey queries
US9317566B1 (en) 2014-06-27 2016-04-19 Groupon, Inc. Method and system for programmatic analysis of consumer reviews
US10878017B1 (en) 2014-07-29 2020-12-29 Groupon, Inc. System and method for programmatic generation of attribute descriptors
US9720978B1 (en) * 2014-09-30 2017-08-01 Amazon Technologies, Inc. Fingerprint-based literary works recommendation system
US10977667B1 (en) 2014-10-22 2021-04-13 Groupon, Inc. Method and system for programmatic analysis of consumer sentiment with regard to attribute descriptors
US10915543B2 (en) 2014-11-03 2021-02-09 SavantX, Inc. Systems and methods for enterprise data search and analysis
US10360229B2 (en) 2014-11-03 2019-07-23 SavantX, Inc. Systems and methods for enterprise data search and analysis
US20160217127A1 (en) 2015-01-27 2016-07-28 Verint Systems Ltd. Identification of significant phrases using multiple language models
US20160343086A1 (en) * 2015-05-19 2016-11-24 Xerox Corporation System and method for facilitating interpretation of financial statements in 10k reports by linking numbers to their context
JP6583686B2 (en) 2015-06-17 2019-10-02 パナソニックIpマネジメント株式会社 Semantic information generation method, semantic information generation device, and program
US10628521B2 (en) * 2015-08-03 2020-04-21 International Business Machines Corporation Scoring automatically generated language patterns for questions using synthetic events
US10628413B2 (en) * 2015-08-03 2020-04-21 International Business Machines Corporation Mapping questions to complex database lookups using synthetic events
US10545920B2 (en) 2015-08-04 2020-01-28 International Business Machines Corporation Deduplication by phrase substitution within chunks of substantially similar content
US9760630B2 (en) 2015-08-14 2017-09-12 International Business Machines Corporation Generation of synonym list from existing thesaurus
US10832146B2 (en) 2016-01-19 2020-11-10 International Business Machines Corporation System and method of inferring synonyms using ensemble learning techniques
US9836451B2 (en) * 2016-02-18 2017-12-05 Sap Se Dynamic tokens for an expression parser
US10843080B2 (en) * 2016-02-24 2020-11-24 Virginia Tech Intellectual Properties, Inc. Automated program synthesis from natural language for domain specific computing applications
JP6589704B2 (en) * 2016-03-17 2019-10-16 日本電気株式会社 Sentence boundary estimation apparatus, method and program
US10191899B2 (en) 2016-06-06 2019-01-29 Comigo Ltd. System and method for understanding text using a translation of the text
US10037360B2 (en) * 2016-06-20 2018-07-31 Rovi Guides, Inc. Approximate template matching for natural language queries
US11200510B2 (en) 2016-07-12 2021-12-14 International Business Machines Corporation Text classifier training
US9940323B2 (en) * 2016-07-12 2018-04-10 International Business Machines Corporation Text classifier operation
JP6737117B2 (en) * 2016-10-07 2020-08-05 富士通株式会社 Encoded data search program, encoded data search method, and encoded data search device
US11188824B2 (en) * 2017-02-17 2021-11-30 Google Llc Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
US11373086B2 (en) 2017-02-17 2022-06-28 Google Llc Cooperatively training and/or using separate input and response neural network models for determining response(s) for electronic communications
US11328128B2 (en) 2017-02-28 2022-05-10 SavantX, Inc. System and method for analysis and navigation of data
EP3590053A4 (en) 2017-02-28 2020-11-25 SavantX, Inc. System and method for analysis and navigation of data
US10713519B2 (en) * 2017-06-22 2020-07-14 Adobe Inc. Automated workflows for identification of reading order from text segments using probabilistic language models
US10878194B2 (en) 2017-12-11 2020-12-29 Walmart Apollo, Llc System and method for the detection and reporting of occupational safety incidents
US10762142B2 (en) 2018-03-16 2020-09-01 Open Text Holdings, Inc. User-defined automated document feature extraction and optimization
US11048762B2 (en) 2018-03-16 2021-06-29 Open Text Holdings, Inc. User-defined automated document feature modeling, extraction and optimization
US11157538B2 (en) * 2018-04-30 2021-10-26 Innoplexus Ag System and method for generating summary of research document
US10783328B2 (en) * 2018-06-04 2020-09-22 International Business Machines Corporation Semi-automatic process for creating a natural language processing resource
US11763821B1 (en) * 2018-06-27 2023-09-19 Cerner Innovation, Inc. Tool for assisting people with speech disorder
US11288451B2 (en) * 2018-07-17 2022-03-29 Verint Americas Inc. Machine based expansion of contractions in text in digital media
AU2019366366A1 (en) 2018-10-22 2021-05-20 William D. Carlson Therapeutic combinations of TDFRPs and additional agents and methods of use
US11294913B2 (en) * 2018-11-16 2022-04-05 International Business Machines Corporation Cognitive classification-based technical support system
US11610277B2 (en) 2019-01-25 2023-03-21 Open Text Holdings, Inc. Seamless electronic discovery system with an enterprise data portal
US11769012B2 (en) 2019-03-27 2023-09-26 Verint Americas Inc. Automated system and method to prioritize language model and ontology expansion and pruning
CN110085290A (en) * 2019-04-01 2019-08-02 东华大学 Method for establishing a semantic tree model of breast molybdenum target reports supporting heterogeneous information integration
US11397854B2 (en) 2019-04-26 2022-07-26 International Business Machines Corporation Generation of domain thesaurus
US11693855B2 (en) * 2019-12-20 2023-07-04 International Business Machines Corporation Automatic creation of schema annotation files for converting natural language queries to structured query language
US11200033B2 (en) * 2020-01-13 2021-12-14 Fujitsu Limited Application programming interface (API) based object oriented software development and textual analysis
US11074402B1 (en) * 2020-04-07 2021-07-27 International Business Machines Corporation Linguistically consistent document annotation
CN111581329A (en) * 2020-04-23 2020-08-25 上海兑观信息科技技术有限公司 Short text matching method and device based on inverted index
US11915167B2 (en) 2020-08-12 2024-02-27 State Farm Mutual Automobile Insurance Company Claim analysis based on candidate functions
JP2023055152A (en) * 2021-10-05 2023-04-17 株式会社デンソーウェーブ robot system
CN116542136A (en) * 2023-04-13 2023-08-04 南京大学 Universal method and device for searching and multiplexing learning objects

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5265242A (en) 1985-08-23 1993-11-23 Hiromichi Fujisawa Document retrieval system for displaying document image data with inputted bibliographic items and character string selected from multiple character candidates
US4843389A (en) * 1986-12-04 1989-06-27 International Business Machines Corp. Text compression and expansion method and apparatus
US5742834A (en) * 1992-06-24 1998-04-21 Canon Kabushiki Kaisha Document processing apparatus using a synonym dictionary
US5625554A (en) 1992-07-20 1997-04-29 Xerox Corporation Finite-state transduction of related word forms for text indexing and retrieval
US5331556A (en) * 1993-06-28 1994-07-19 General Electric Company Method for natural language data processing using morphological and part-of-speech information
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5799268A (en) * 1994-09-28 1998-08-25 Apple Computer, Inc. Method for extracting knowledge from online documentation and creating a glossary, index, help database or the like
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5963940A (en) 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US6076088A (en) * 1996-02-09 2000-06-13 Paik; Woojin Information extraction system and method using concept relation concept (CRC) triples
US5987414A (en) 1996-10-31 1999-11-16 Nortel Networks Corporation Method and apparatus for selecting a vocabulary sub-set from a speech recognition dictionary for use in real time automated directory assistance
US6081774A (en) * 1997-08-22 2000-06-27 Novell, Inc. Natural language information retrieval system and method
US6175829B1 (en) 1998-04-22 2001-01-16 Nec Usa, Inc. Method and apparatus for facilitating query reformulation
US6480843B2 (en) 1998-11-03 2002-11-12 Nec Usa, Inc. Supporting web-query expansion efficiently using multi-granularity indexing and query processing

Also Published As

Publication number Publication date
US7346490B2 (en) 2008-03-18
EP1393200A2 (en) 2004-03-03
AU2001293596A1 (en) 2002-04-08
CA2423965A1 (en) 2002-04-04
CA2423964A1 (en) 2002-04-04
US20040078190A1 (en) 2004-04-22
AU2001293595A1 (en) 2002-04-08
EP1325430A2 (en) 2003-07-09
WO2002027538A3 (en) 2003-04-03
US20040133418A1 (en) 2004-07-08
WO2002027524A2 (en) 2002-04-04
WO2002027538A2 (en) 2002-04-04
WO2002027524A3 (en) 2003-11-20
US7330811B2 (en) 2008-02-12

Similar Documents

Publication Publication Date Title
WO2002027524B1 (en) A method and system for describing and identifying concepts in natural language text for information retrieval and processing
US6745161B1 (en) System and method for incorporating concept-based retrieval within boolean search engines
CN107291687B (en) Chinese unsupervised open type entity relation extraction method based on dependency semantics
US7174507B2 (en) System method and computer program product for obtaining structured data from text
Aït-Mokhtar et al. Robustness beyond shallowness: incremental deep parsing
US20070174041A1 (en) Method and system for concept generation and management
Shah et al. NLKBIDB-Natural language and keyword based interface to database
CN113609838A (en) Document information extraction and mapping method and system
Sonbol et al. A Machine Translation Like Approach to Generate Business Process Model from Textual Description
Potter A survey of knowledge acquisition from natural language
Choi TPEMatcher: A tool for searching in parsed text corpora
Haj et al. Automated generation of terminological dictionary from textual business rules
González et al. Semantic representations for knowledge modelling of a Natural Language Interface to Databases using ontologies
Bédaride et al. Semantic normalisation: a framework and an experiment
Du On the use of natural language processing for automated conceptual data modeling
Korobkin et al. Methods of Russian Patent Analysis
Van Halteren Excursions into syntactic databases
Wilks et al. LaSIE jumps the GATE
Vileiniškis et al. An approach for Semantic search over Lithuanian news website corpus
Han et al. Ontology extraction and conceptual modeling for web information
Le Duyen Sandra Vu CISQA: Corporate Smart Insights Question Answering System
Cedermark et al. Swedish noun and adjective morphology in a natural language interface to databases
Seco et al. Using CBR for semantic analysis of software specifications
Chen Learning information extraction patterns
Bolioli et al. From IR to IE through GL

Legal Events

Date Code Title Description
AK Designated states Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents Kind code of ref document: A2; Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 EP: the EPO has been informed by WIPO that EP was designated in this application
WWE WIPO information: entry into national phase Ref document number: 2423964; Country of ref document: CA
WWE WIPO information: entry into national phase Ref document number: 2001973933; Country of ref document: EP
REG Reference to national code Ref country code: DE; Ref legal event code: 8642
WWE WIPO information: entry into national phase Ref document number: 10398129; Country of ref document: US
WWP WIPO information: published in national office Ref document number: 2001973933; Country of ref document: EP
B Later publication of amended claims Effective date: 20031024
WWW WIPO information: withdrawn in national office Ref document number: 2001973933; Country of ref document: EP
NENP Non-entry into the national phase Ref country code: JP