WO2002027524B1

WO2002027524B1 - A method and system for describing and identifying concepts in natural language text for information retrieval and processing

Info

Publication number: WO2002027524B1
Application number: PCT/CA2001/001398
Authority: WO
Inventors: Daniel C Fass; Davide Turcato; Gordon W Tisher; James Devlan Nicholson; Milan Mosny; Frederick P Popowich; Janine T Toole; Paul G Mcfetridge; Frederick W Kroon
Original assignee: Gavagai Technology Inc; Daniel C Fass; Davide Turcato; Gordon W Tisher; James Devlan Nicholson; Milan Mosny; Frederick P Popowich; Janine T Toole; Paul G Mcfetridge; Frederick W Kroon
Priority date: 2000-09-29
Filing date: 2001-09-28
Publication date: 2004-04-29
Also published as: US7346490B2; EP1393200A2; AU2001293596A1; CA2423965A1; CA2423964A1; US20040078190A1; AU2001293595A1; EP1325430A2; WO2002027538A3; US20040133418A1; WO2002027524A2; WO2002027538A2; WO2002027524A3; US7330811B2

Abstract

A method for information retrieval that matches occurrences of concepts in natural language text documents against descriptions of concepts in user queries. Said method, implemented in a computer system, includes a preferred version of the method that comprises (1) annotating natural language text in documents and other text-forms with linguistic information and Concepts and Concept Rules expressed in a Concept Specification Language (CSL) for a particular domain, (2) pruning and optimizing synonyms for a particular domain, (3) defining and learning said CSL Concepts and Concept Rules, (4) checking user-defined descriptions of Concepts represented in CSL (including user queries), and (5) retrieval by matching said user-defined descriptions (and queries) against said annotated text. CSL is a language for expressing linguistically-based patterns. Said patterns can represent the linguistic manifestations of concepts in text. Said concepts may derive from the sublanguages used by experts to analyze specialized domains including, but not limited to, insurance claims, police incident reports, medical reports, and aviation incident reports.

Claims

AMENDED CLAIMS

Received by the International Bureau on 24 October 2003 (24.10.2003); original claims 1-93 are replaced by amended claims 1-95.

1. A method of information retrieval, performed on a computer system that matches text in documents and other text-forms against user-defined descriptions of concepts, comprising: a) identification of linguistic entities in the text of documents and other text-forms; b) annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms; c) storage of said linguistically annotated documents and other text-forms; d) identification of concepts using linguistic information, where said concepts are represented in a concept specification language and said concepts occur in one of:

1) said text of documents and other text-forms in which linguistic entities have been identified in step a); or

2) said linguistically annotated documents and other text-forms of step b); or

3) stored linguistically annotated documents and other text-forms of step c); e) annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms; f) storage of said conceptually annotated documents and other text-forms; g) defining and learning concept representations of said concept specification language; h) checking user-defined descriptions of concepts represented in said concept specification language; and i) retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms.

2. The method according to claim 1 wherein said identification of linguistic entities in the text of documents and other text-forms comprises identification of morphological, syntactic, and semantic entities.

3. The method according to claim 2 wherein said identification of linguistic entities in the text of documents and other text-forms comprises identifying words and phrases, and establishing dependencies between words and phrases.

4. The method according to claim 3 wherein said identification of linguistic entities in the text of documents and other text-forms is accomplished by a method selected from one or more of: a) preprocessing of text of documents and other text-forms; b) tagging of text of documents and other text-forms; c) parsing of text of documents and other text-forms.

5. The method according to claim 4 wherein annotation of said identified linguistic entities in the text of documents and other text-forms is linguistic annotation and produces a representation of linguistically annotated documents and other text-forms in a text markup language.

6. The method according to claim 5 wherein said linguistically annotated documents and other text-forms are stored.

7. The method according to claim 1 wherein in said identification of concepts using linguistic information said concepts are represented in a concept specification language and said concepts occur in one of: a) said text of documents and other text-forms in which linguistic entities have been identified to produce said linguistically annotated documents and other text-forms by means of a method comprising 1) identification of morphological, syntactic, and semantic entities;

2) identification of words and phrases, and establishment of dependencies between words and phrases; and

3) at least one of: i) preprocessing of text of documents and other text-forms; ii) tagging of text of documents and other text-forms; iii) parsing of text of documents and other text-forms; or b) said linguistically annotated documents and other text-forms wherein said annotation of said identified linguistic entities in the text of documents and other text-forms comprises linguistic annotation and produces a representation of linguistically annotated documents and other text-forms in a text markup language; or c) said stored linguistically annotated documents and other text-forms in a text markup language.

8. The method according to claim 7 wherein said concept specification language allows representations to be defined for concepts in terms of a linguistics-based pattern or set of patterns, where each pattern consists of words, phrases, other concepts, and relationships between words, phrases, and concepts.

9. The method according to claim 8 wherein said identification of concepts using linguistic information, when used with said concept specification language, consists of applying representations of concepts for the purpose of identifying concepts.

10. The method according to claim 7 wherein annotation of said identified concepts in linguistically annotated documents and other text-forms is conceptual annotation and produces a representation of conceptually annotated documents and other text-forms in a text markup language.

11. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying comprising: a) compiling an expression from said concept specification language into finite state automata (FSAs); b) matching said FSAs against linguistic entities in said linguistically annotated text.

12. The method according to claim 11 wherein concepts from said concept specification language are compiled into finite state automata (FSAs) and said compilation into FSAs comprises one or both of the following: a) the grammar from the parser used within the method to parse linguistically annotated text; and b) sets of synonyms.
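The compile-then-match approach of claims 11 and 12 can be illustrated with a minimal sketch. It compiles a flat sequence pattern into a nondeterministic finite state automaton and runs it over part-of-speech tagged tokens. The pattern notation (`("word", ...)`, `("tag", ...)`, `("opt", ...)`), the `SYNONYMS` table, and all function names are hypothetical; the claims do not fix the syntax of the concept specification language or the compilation procedure.

```python
from typing import Callable, Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); an assumed representation
Test = Callable[[Token], bool]

# Hypothetical synonym sets folded into the automaton at compile time (claim 12b).
SYNONYMS: Dict[str, Set[str]] = {"car": {"car", "automobile", "vehicle"}}


def compile_pattern(pattern: List[tuple]):
    """Compile a flat sequence pattern into an NFA.

    Returns (transitions, epsilons, accept_state); state 0 is the start state.
    """
    transitions: Dict[int, List[Tuple[Test, int]]] = {}
    epsilons: Dict[int, Set[int]] = {}
    state = 0
    for item in pattern:
        kind, value = item[0], item[1]
        optional = kind == "opt"
        if optional:
            kind, value = value  # ("opt", ("tag", "DT")) wraps another item
        if kind == "word":
            words = SYNONYMS.get(value, {value})
            test = lambda tok, words=words: tok[0].lower() in words
        elif kind == "tag":
            test = lambda tok, tag=value: tok[1] == tag
        else:
            raise ValueError(f"unknown pattern item: {item!r}")
        transitions.setdefault(state, []).append((test, state + 1))
        if optional:
            # An optional item may be skipped without consuming a token.
            epsilons.setdefault(state, set()).add(state + 1)
        state += 1
    return transitions, epsilons, state


def closure(states: Set[int], epsilons: Dict[int, Set[int]]) -> Set[int]:
    """Add every state reachable through epsilon (skip) transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in epsilons.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def find_matches(nfa, tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return every (start, end) token span, end exclusive, accepted by the NFA."""
    transitions, epsilons, accept = nfa
    spans = []
    for start in range(len(tokens)):
        current = closure({0}, epsilons)
        for pos in range(start, len(tokens)):
            nxt = {dst for s in current
                   for test, dst in transitions.get(s, ())
                   if test(tokens[pos])}
            current = closure(nxt, epsilons)
            if not current:
                break
            if accept in current:
                spans.append((start, pos + 1))
    return spans
```

For example, the pattern `[("opt", ("tag", "DT")), ("word", "car"), ("tag", "VBD")]` would match "The automobile crashed" both with and without the leading determiner, because "automobile" is listed as a synonym of "car" in the assumed table.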

13. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts comprising recursive descent matching which consists of traversing an expression in said concept specification language and recursively matching constituents of said expression against linguistic entities in linguistically annotated text.

14. The method according to claim 13 wherein said identification of concepts uses recursive descent matching and wherein said recursive descent matching comprises sets of synonyms.
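Recursive descent matching, as recited in claims 13 and 14, can be sketched as a function that traverses a pattern expression and recursively matches each constituent against tagged tokens. The nested-tuple expression format, the operator names `seq` and `or`, and the `SYNONYMS` table below are assumptions made for illustration only; they are not the concept specification language itself.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); an assumed representation

# Hypothetical synonym resource consulted during matching (claim 14).
SYNONYMS: Dict[str, Set[str]] = {"injure": {"injure", "hurt", "wound"}}


def match(expr: tuple, tokens: List[Token], pos: int) -> Set[int]:
    """Return every end position at which `expr` can match starting at `pos`."""
    kind = expr[0]
    if kind == "word":
        accepted = SYNONYMS.get(expr[1], {expr[1]})
        if pos < len(tokens) and tokens[pos][0].lower() in accepted:
            return {pos + 1}
        return set()
    if kind == "tag":
        if pos < len(tokens) and tokens[pos][1] == expr[1]:
            return {pos + 1}
        return set()
    if kind == "seq":
        # Each constituent must match where the previous one ended.
        ends = {pos}
        for sub in expr[1:]:
            ends = {e2 for e in ends for e2 in match(sub, tokens, e)}
        return ends
    if kind == "or":
        # Any alternative may match at the current position.
        return set().union(*(match(sub, tokens, pos) for sub in expr[1:]))
    raise ValueError(f"unknown expression kind: {kind!r}")


def find_concept(expr: tuple, tokens: List[Token]) -> List[Tuple[int, int]]:
    """Try the expression at every start position; return (start, end) spans."""
    return [(s, e) for s in range(len(tokens))
            for e in sorted(match(expr, tokens, s))]
```

Returning a set of end positions rather than a single one lets alternatives of different lengths coexist without explicit backtracking.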

15. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts which comprise bottom-up matching comprising:

a) generating in a bottom-up fashion multiple spans, where each span is

1) a word or constituent and, optionally, structural information about the word or constituent, or

2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent; b) generating in a bottom-up fashion spans consumed by single-term patterns in an expression in said concept specification language; c) generating in a bottom-up fashion spans consumed by operators in an expression in said concept specification language; and d) matching in a bottom-up fashion said spans against linguistic entities in linguistically annotated text.

16. The method according to claim 15 wherein said identification of concepts uses bottom-up matching, where said bottom-up matching comprises sets of synonyms.
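The bottom-up matching of claims 15 and 16 can be pictured as building spans for the smallest units first and then combining them through operators. The sketch below is a simplified illustration: the `followed_by` operator, its `max_gap` parameter, and the tuple expression format are hypothetical, and synonym handling is omitted for brevity.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def term_spans(tokens: List[str], term: str) -> Set[Span]:
    """Spans consumed by a single-term pattern: one span per matching token."""
    return {(i, i + 1) for i, tok in enumerate(tokens) if tok.lower() == term}


def followed_by(left: Set[Span], right: Set[Span], max_gap: int = 0) -> Set[Span]:
    """Hypothetical operator: a left span followed by a right span,
    with at most `max_gap` tokens between them."""
    return {(l_start, r_end)
            for l_start, l_end in left
            for r_start, r_end in right
            if 0 <= r_start - l_end <= max_gap}


def evaluate(expr, tokens: List[str]) -> Set[Span]:
    """Produce leaf spans first, then combine them upward through operators."""
    if isinstance(expr, str):
        return term_spans(tokens, expr)
    op, max_gap, left, right = expr
    if op != "followed_by":
        raise ValueError(f"unknown operator: {op!r}")
    return followed_by(evaluate(left, tokens), evaluate(right, tokens), max_gap)
```

With the tokens of "the driver lost control of the red car", the expression `("followed_by", 3, ("followed_by", 0, "lost", "control"), "car")` first yields the span for "lost control" and then widens it to cover everything up to "car".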

17. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts that are index-based comprising use of an inverted index, where a) said inverted index contains words, constituents, and tags for linguistic information, comprising syntactic information, from linguistically annotated text; b) said inverted index contains spans for said words, constituents, and tags from linguistically annotated text; c) where each span is

1) a word or constituent and, optionally, structural information about the word or constituent, or

2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent.
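An inverted index of the kind described in claim 17 might be organised as a mapping from each word, tag, or constituent label to the spans where it occurs. The following sketch assumes a simple input layout (a dictionary with `tokens` and `constituents` keys) and a span format of `(document id, start, end)`; both are illustrative choices, not taken from the claims.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int, int]  # (document id, start token, end token exclusive)
Key = Tuple[str, str]        # ("word" | "tag" | "constituent", value)


def build_index(docs: Dict[int, dict]) -> Dict[Key, List[Span]]:
    """Build an inverted index over linguistically annotated documents.

    Each document is assumed to look like:
      {"tokens": [(word, tag), ...], "constituents": [(label, start, end), ...]}
    """
    index: Dict[Key, List[Span]] = defaultdict(list)
    for doc_id, doc in docs.items():
        for i, (word, tag) in enumerate(doc["tokens"]):
            index[("word", word.lower())].append((doc_id, i, i + 1))
            index[("tag", tag)].append((doc_id, i, i + 1))
        for label, start, end in doc["constituents"]:
            index[("constituent", label)].append((doc_id, start, end))
    return index
```

Looking up `("constituent", "NP")` then returns every noun-phrase span in the collection without rescanning the text.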

18. The method according to claim 17 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising index-based matching, where said index-based matching comprises: a) using backtracking to resolve the constraints of operators in an expression in said concept specification language; b) attaching iterators to all items in the expression in said concept specification language; c) using the iterators to produce matches of all items in the expression in said concept specification language against text in the inverted index; d) maintaining a state for the iterator for each item in the expression in said concept specification language where that state is used to determine whether or not it has been processed before in the match of said expression against said inverted index, and also relevant information about the progress of the match; e) maintaining a state for the iterator for each item that is a word in the expression in said concept specification language where that state comprises: a list of applicable synonyms of the word in question, and the current synonym being used for matching; an iterator into the inverted index that can enumerate all instances of the word in said index, and which records the current word; f) during the course of a match, each item in the expression in said concept specification language is tested, and if successful, returns a set of spans covering the match of its corresponding sub-expression (i.e., components of said expression).

19. The method according to claim 18 wherein said identification of concepts uses index-based matching, where said index-based matching comprises sets of synonyms.

20. The method according to claim 17 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching comprises identifying sets of candidate spans, where a) a candidate span is a span that may contain a concept to be identified; b) any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a concept to be identified; c) each sub-expression of an expression in the concept specification language is associated with a procedure; d) each such procedure is used to generate candidate spans or to check whether a given span is a candidate span.

21. The method according to claim 20 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching produces candidate spans that serve as input to concept identification methods comprising compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.
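The candidate checking of claims 20 and 21 associates each sub-expression with a procedure that can either generate candidate spans or check a given span. A minimal sketch follows, assuming that candidate spans are sentence spans and that the index maps each word to its token positions. The class names, the sentence-level granularity, and the "all parts must occur" rule are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


class WordProcedure:
    """Procedure for a single-word sub-expression."""

    def __init__(self, index: Dict[str, List[int]], word: str):
        self.positions = index.get(word, [])

    def check(self, span: Span) -> bool:
        """A span is a candidate if the word occurs somewhere inside it."""
        return any(span[0] <= p < span[1] for p in self.positions)

    def generate(self, sentences: List[Span]) -> Set[Span]:
        return {s for s in sentences if self.check(s)}


class AllOfProcedure:
    """Procedure for a sub-expression whose parts must all occur in the span."""

    def __init__(self, parts: list):
        self.parts = parts

    def check(self, span: Span) -> bool:
        return all(p.check(span) for p in self.parts)

    def generate(self, sentences: List[Span]) -> Set[Span]:
        # Generate from the first part, then only check the remaining parts.
        candidates = self.parts[0].generate(sentences)
        return {s for s in candidates
                if all(p.check(s) for p in self.parts[1:])}
```

Candidate spans are an over-approximation: a span that passes may still fail the full pattern, so in the arrangement of claim 21 the surviving spans would be handed to a precise matcher such as one of the other identification methods.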

22. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.

23. The method according to claim 10 wherein said conceptually annotated documents and other text-forms are stored.

24. The method according to claim 1 wherein said concept representations to be defined and learned comprise hierarchies, rules, operators, patterns, and macros.

25. The method according to claim 1, further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) marking up instances of concepts in the text of documents and other text-forms; b) creating new concept representations in the concept specification language from said highlighted instances of concepts; c) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.

26. The method according to claim 25 wherein creating new concept representations of said concept specification language comprises: a) using concept identification methods to match together concept specification language vocabulary specifications and highlighted linguistically annotated documents and other text-forms; b) defining linguistic variants; c) adding synonyms from a set of synonyms; d) adding parts of speech.

27. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) highlighting instances of concepts in the text of linguistically annotated documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; where b) said linguistically annotated documents and other text-forms are stored or produced on demand; and c) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; d) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and e) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.

28. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) marking up instances of concepts in the text of documents and other text-forms to produce highlighted documents and other text-forms; b) identification of linguistic entities in said highlighted documents and other text-forms and annotation of said documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; c) said highlighted text documents and other text-forms are stored or produced on demand; d) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; e) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and f) adding and, if necessary, integrating said new concept representations in the concept specification language with preexisting concept representations in said language.

29. The method according to claim 1 wherein said user-defined descriptions of concepts represented in said concept specification language comprise user queries to an information retrieval system, said user queries being represented in said concept specification language.

30. The method according to claim 29 wherein, if all known queries are represented in said concept specification language, then a proposed query represented in said concept specification language is subsequently used by said retrieval method.

31. The method according to claim 29 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is found, then the query is subsequently used by said method of retrieval.

32. The method according to claim 29 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is not found, then the query is subsequently used by said method of conceptual annotation.
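The query-checking behaviour of claims 30 to 32 amounts to a three-way decision, which can be sketched as follows. The function name, the return labels, and the use of whitespace-normalised string equality as the notion of a "match" against the repository are assumptions; the claims do not say how a proposed query is compared with stored queries.

```python
from typing import Set


def route_query(query: str, repository: Set[str], all_known_in_advance: bool) -> str:
    """Decide which process should handle a proposed query.

    Returns "retrieval" or "conceptual_annotation" (hypothetical labels).
    """
    if all_known_in_advance:
        # Claim 30: every query is already known, so retrieval can proceed.
        return "retrieval"
    normalized = " ".join(query.split())
    if normalized in repository:
        # Claim 31: the query matches a pre-stored query.
        return "retrieval"
    # Claim 32: no match, so the text must first be annotated for this query.
    return "conceptual_annotation"
```

The practical effect is that only previously unseen queries incur the cost of a new conceptual annotation pass.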

33. The method according to claim 29 wherein retrieval matches said user-defined descriptions against said annotated text and retrieves matching documents and other text-forms.

34. The method according to claim 1 comprising: b) said annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms comprises annotation of said identified linguistic entities in a Text Markup Language (TML) to produce linguistically annotated documents and other text-forms; d) said identification of concepts using linguistic information comprises identification of Concepts and Concept Rules using linguistic information, where said Concepts and Concept Rules are represented in a Concept Specification Language (CSL) and said Concepts-to-be-identified and Concept Rules-to-be-identified occur in one of:

1) said text of documents and other text-forms in which linguistic entities have been identified, or

2) said linguistically annotated documents and other text-forms; or

3) said stored linguistically annotated documents and other text-forms; e) said annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms comprises annotation of said identified Concepts and Concept Rules in said TML to produce conceptually annotated documents and other text-forms; g) defining and learning CSL Concepts and Concept Rules; h) said checking user-defined descriptions of concepts represented in said concept specification language comprises checking user-defined descriptions of Concepts and Concept Rules represented in CSL; and i) said retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms comprises retrieval by matching said user-defined descriptions of CSL Concepts and Concept Rules against said conceptually annotated documents and other text-forms.

35. A system for implementing said method according to claim 1 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.

36. A system for implementing said method according to claim 34 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.

37. The system of claim 35 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.

38. The system of claim 36 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.

39. The system according to claim 37 wherein predetermined information retrieval processes, accessed by said user interface, comprise: a) identification of linguistic entities in the text of documents and other text-forms; b) annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms; c) storage of said linguistically annotated documents and other text-forms; d) identification of concepts using linguistic information, where said concepts are represented in a concept specification language and said concepts to be identified occur in one of:

1) said text of documents and other text-forms in which linguistic entities have been identified in step a), or

2) said linguistically annotated documents and other text-forms of step b); or

3) stored linguistically annotated documents and other text-forms of step c); e) annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms; f) storage of said conceptually annotated documents and other text-forms; g) defining and learning concept representations of said concept specification language; h) checking user-defined descriptions of concepts represented in said concept specification language; and i) retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms.

40. The system according to claim 38 wherein predetermined information retrieval processes, accessed by said user interface, comprise a text document annotator, CSL processor, CSL parser, and text document retriever.

41. The system according to claim 40 wherein said text document annotator, accessed by said user interface, comprises a document loader from a document database, which passes text documents to the annotator, and outputs one or more annotated documents.

42. The system according to claim 41 wherein said annotator takes as input one or more text documents, outputs one or more annotated documents, and is comprised of a linguistic annotator which passes linguistically annotated documents to a conceptual annotator.

43. The system according to claim 42 wherein said linguistically annotated documents are annotated with a representation in a Text Markup Language.

44. The system according to claim 42 wherein said Text Markup Language (TML) has the syntax of XML, and conversion to and from TML is accomplished with an XML converter.
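Because claim 44 states only that TML has the syntax of XML, conversion to and from TML can be sketched with a standard XML library. The element and attribute names below (`tml`, `s`, `w`, `pos`, `concept`, `start`, `end`) and the stand-off placement of concept annotations are hypothetical; the claims do not disclose the actual TML vocabulary.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Token = Tuple[str, str]          # (word, part-of-speech tag)
Concept = Tuple[str, int, int]   # (concept name, start token, end token exclusive)


def to_tml(tokens: List[Token], concepts: List[Concept]) -> str:
    """Serialise one annotated sentence to an XML string (assumed TML layout)."""
    root = ET.Element("tml")
    sentence = ET.SubElement(root, "s")
    for i, (word, tag) in enumerate(tokens):
        w = ET.SubElement(sentence, "w", {"id": str(i), "pos": tag})
        w.text = word
    for name, start, end in concepts:
        ET.SubElement(root, "concept",
                      {"name": name, "start": str(start), "end": str(end)})
    return ET.tostring(root, encoding="unicode")


def from_tml(xml_text: str) -> Tuple[List[Token], List[Concept]]:
    """Parse the assumed TML layout back into tokens and concept spans."""
    root = ET.fromstring(xml_text)
    tokens = [(w.text, w.get("pos")) for w in root.iter("w")]
    concepts = [(c.get("name"), int(c.get("start")), int(c.get("end")))
                for c in root.iter("concept")]
    return tokens, concepts
```

Keeping concept annotations as separate elements that point at token offsets is one possible design; it lets overlapping concepts be recorded without nesting conflicts.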

45. The system according to claim 42 wherein said linguistic annotator, taking as input one or more text documents, and outputting one or more linguistically annotated documents, comprises one or more of the following: a) a preprocessor; b) a tagger; and c) a parser.

46. The system according to claim 45 wherein said preprocessor, taking as input one or more text documents or the documents output by any other appropriate linguistic identification process, and producing as output one or more preprocessed documents, comprises means for one or more of the following: a) breaking text into words; b) marking phrase boundaries; c) identifying numbers, symbols, and other punctuation; d) expanding abbreviations; and e) splitting apart contractions.
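Several of the preprocessing operations listed in claim 46 can be illustrated with a short sketch: breaking text into words, separating numbers and punctuation, expanding abbreviations, and splitting contractions. The abbreviation table, the regular expression, and the contraction rules are assumptions chosen for the example, and phrase-boundary marking is not shown.

```python
import re
from typing import Dict, List

# Hypothetical abbreviation table; a real system would load a domain-specific one.
ABBREVIATIONS: Dict[str, str] = {"approx.": "approximately", "dept.": "department"}

# Numbers, words (optionally with an apostrophe suffix), or single symbols.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")


def split_contraction(piece: str) -> List[str]:
    """Split forms such as "didn't" -> ["did", "n't"] and "driver's" -> ["driver", "'s"]."""
    if piece.lower().endswith("n't") and len(piece) > 3:
        return [piece[:-3], piece[-3:]]
    if "'" in piece and len(piece) > 1:
        stem, suffix = piece.split("'", 1)
        return [stem, "'" + suffix]
    return [piece]


def preprocess(text: str) -> List[str]:
    """Break text into tokens, expanding abbreviations and splitting contractions."""
    tokens: List[str] = []
    for chunk in text.split():
        expanded = ABBREVIATIONS.get(chunk.lower())
        if expanded is not None:
            tokens.append(expanded)
            continue
        for piece in TOKEN_RE.findall(chunk):
            tokens.extend(split_contraction(piece))
    return tokens
```

Abbreviations are looked up before tokenisation so that the trailing period of "approx." is not mistaken for sentence-final punctuation.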

47. The system according to claim 45 wherein said tagger takes as input a set of tags, one or more preprocessed documents or the documents output by any other appropriate linguistic identification process and produces as output one or more documents tagged with the appropriate part of speech from a given tagset.

48. The system according to claim 45 wherein said parser takes as input one or more tagged documents or the documents output by any other appropriate linguistic identification process and produces as output one or more parsed documents.

49. The system according to claim 42 wherein said conceptual annotator takes as input one or more linguistically annotated documents, a list of CSL Concepts and Concept Rules for annotation, and optionally data from a synonym resource, and outputs one or more conceptually annotated documents.

50. The system according to claim 42 wherein said conceptually annotated documents are annotated with a representation in TML.

51. The system according to claim 42 wherein said input of one or more linguistically annotated documents to said conceptual annotator comprises at least one of the following sources: a) the linguistic annotator directly; b) storage in some linguistically annotated form such as the representation produced by the final linguistic identification process of the linguistic annotator; and c) storage in TML followed by conversion from TML to the representation produced by the final linguistic identification process of the linguistic annotator.

52. The system according to claim 42 wherein said conceptual annotator comprises a Concept identifier.

53. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of: a) compiling CSL into finite state automata (FSAs); b) matching said FSAs against linguistically annotated documents.

54. The system according to claim 53 wherein said compilation into FSAs also includes as part of compilation one or both of the following: a) the grammar from the parser used by the system to parse linguistically annotated documents; and b) sets of synonyms.

55. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of recursive descent matching which consists of traversing an expression in CSL and recursively matching constituents of said expression against linguistic entities in linguistically annotated text.

56. The system according to claim 53 wherein said recursive descent matching comprises sets of synonyms.

57. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of bottom-up matching which comprises: a) generating in a bottom-up fashion multiple spans, where each span is

2) a set of words and constituents that follow each other and, optionally, structural information about the words or word and constituents or constituent; b) generating in a bottom-up fashion spans consumed by single-term patterns in an expression in CSL; c) generating in a bottom-up fashion spans consumed by operators in an expression in CSL; and d) matching in a bottom-up fashion said spans against linguistic entities in linguistically annotated documents.

58. The system according to claim 57 wherein said bottom-up matching comprises sets of synonyms.

59. The system according to claim 52 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts that are index-based comprising use of an inverted index, where a) said inverted index contains words, constituents, and tags for linguistic information from linguistically annotated text; b) said inverted index contains spans for said words, constituents, and tags from linguistically annotated text; c) where a span is

60. The system according to claim 57 wherein said Concept identifier using index-based methods produces conceptually annotated documents as a result of index-based matching, where said index-based matching comprises: a) using backtracking to resolve the constraints of CSL operators in an expression in CSL; b) attaching iterators to all items in the CSL expression; c) using the iterators to produce matches of all items in the CSL expression against text in the inverted index; d) maintaining a state for the iterator for each item in the CSL expression where that state is used to determine whether or not it has been processed before in the match of said expression against said inverted index, and also relevant information about the progress of the match; e) maintaining a state for the iterator for each item that is a word in the expression in CSL where that state comprises the following information: a list of applicable synonyms of the word in question, and the current synonym being used for matching; an iterator into the inverted index that can enumerate all instances of the word in said index, and which records the current word; f) during the course of a match, each item in the CSL expression is tested, and if successful, returns a set of spans covering the match of its corresponding sub-expression (i.e., components of said CSL expression).

61. The system according to claim 60 wherein said index-based matching comprises sets of synonyms.

62. The method according to claim 57 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of index-based methods for identifying concepts comprising candidate checking index-based matching where said candidate checking index-based matching comprises identifying sets of candidate spans, where a) a candidate span is a span that may contain a Concept to be identified (matched); b) any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a Concept to be identified (matched); c) each sub-expression of a CSL expression is associated with a procedure; d) each such procedure is used to generate candidate spans or to check whether a given span is a candidate span.
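The candidate checking of claim 62 can be sketched as follows, assuming that candidate spans are sentence-sized units and that expressions are nested tuples built from `"word"`, `"and"` and `"or"`. This tuple encoding and the functions `candidates` and `is_candidate` are illustrative assumptions, not CSL syntax or procedures from the specification.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token positions
Expr = tuple            # ("word", w) or (op, left, right); hypothetical encoding


def candidates(expr: Expr, tokens: List[str], units: List[Span]) -> Set[Span]:
    """Generate the units (here, sentences) that may contain a match of expr."""
    op = expr[0]
    if op == "word":
        return {(s, e) for (s, e) in units if expr[1] in tokens[s:e + 1]}
    left = candidates(expr[1], tokens, units)
    right = candidates(expr[2], tokens, units)
    if op == "and":
        return left & right   # both parts must be possible in the same unit
    if op == "or":
        return left | right   # either part suffices
    raise ValueError(f"unknown operator: {op!r}")


def is_candidate(expr: Expr, span: Span, tokens: List[str],
                 units: List[Span]) -> bool:
    """Check whether span is covered by some candidate unit for expr."""
    return any(s <= span[0] and span[1] <= e
               for (s, e) in candidates(expr, tokens, units))


tokens = ["smoke", "filled", "cabin", ".",
          "engine", "fire", "reported", ".",
          "smoke", "and", "fire", "seen", "."]
sentences = [(0, 3), (4, 7), (8, 12)]
both = ("and", ("word", "smoke"), ("word", "fire"))
either = ("or", ("word", "smoke"), ("word", "fire"))
```

Only the third sentence is a candidate for `both`, so a more expensive matcher would need to examine only that span; any span outside it cannot contain a match.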

63. The system according to claim 62 wherein said candidate spans produced by said candidate checking index-based matching serve as input to Concept identification methods comprising compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.

64. The system according to claim 57 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index based matching.

65. The system according to claim 49 wherein said conceptually annotated documents are stored.

66. The system according to claim 20 wherein said CSL processor, accessed by said user interface, comprises a CSL Concept and Concept Rule learner, and a CSL query checker.

67. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises: a) highlighting instances of Concepts in the text of documents; b) creating new CSL Rules from said highlighted instances of Concepts; c) creating new CSL Concepts from said CSL Rules; d) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.

68. The system according to claim 67 wherein creating new CSL Rules comprises: a) using the Concept identifier to match together CSL vocabulary specifications and highlighted linguistically annotated documents; b) defining linguistic variants; c) adding synonyms from a set of synonyms; d) adding parts of speech.

69. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises means for: a) highlighting instances of Concepts in the text of linguistically annotated documents to produce highlighted linguistically annotated documents; where b) said linguistically annotated documents can be either produced on demand or stored in TML or other formats; and c) said highlighted linguistically annotated documents can be either produced on demand or stored in TML or other formats; d) producing new CSL Concept Rules from said highlighted instances of Concepts in said highlighted linguistically annotated document; and e) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.

70. The system according to claim 66 wherein said CSL Concept and Concept Rule learner comprises means for: a) highlighting instances of Concepts in the text of documents to produce highlighted documents; b) linguistic annotation of said documents to produce highlighted linguistically annotated documents; c) said highlighted text documents can be either produced on demand or stored in TML or other formats; d) said highlighted linguistically annotated documents can be either produced on demand or stored in TML or other formats; e) producing new CSL Concept Rules from said highlighted instances of Concepts in said highlighted linguistically annotated documents; and f) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.

71. The system according to claim 34 wherein said user-defined descriptions of CSL Concepts and Concept Rules comprise user queries to an information retrieval system, said user queries being represented in CSL.

72. The system according to claim 66 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are known in advance, passes said query to the retriever.

73. The system according to claim 66 wherein said CSL query checker accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is found, then the query is parsed with a CSL parser and passed to the retriever.

74. The system according to claim 66 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is not found, then the query is parsed with a CSL parser and added to the list of CSL Concepts and Concept Rules to be annotated, which are then passed to the annotator.
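The three routing cases of claims 72 to 74 can be summarized in a minimal sketch. It assumes that queries are plain strings, that matching against known Concepts is exact string equality, and that parsing is a whitespace-normalizing stand-in; `parse_csl`, `check_query` and the sample query names are hypothetical.

```python
from typing import List, Set, Tuple


def parse_csl(query: str) -> str:
    """Stand-in for a CSL parser: only normalizes whitespace (assumption)."""
    return " ".join(query.split())


def check_query(query: str, known: Set[str], all_known_in_advance: bool,
                to_annotate: List[str]) -> Tuple[str, str]:
    """Return (destination, payload) for a proposed CSL query."""
    if all_known_in_advance:
        # Claim 72: the query is passed to the retriever as-is.
        return ("retriever", query)
    if query in known:
        # Claim 73: a match is found, so the query is parsed and retrieved.
        return ("retriever", parse_csl(query))
    # Claim 74: no match, so the parsed query is queued for annotation.
    parsed = parse_csl(query)
    to_annotate.append(parsed)
    return ("annotator", parsed)


pending: List[str] = []
known_concepts = {"VehicleCollision"}
first = check_query("VehicleCollision", known_concepts, False, pending)
second = check_query("  Water   Damage ", known_concepts, False, pending)
```

Here `first` is routed to the retriever, while `second` is routed to the annotator and its parsed form is appended to `pending`.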

75. The system according to claim 40 wherein said CSL parser takes as input a synonym database, CSL query, and CSL Concepts and Rules, and outputs CSL Concepts and Rules for annotation as a result of the following: a) word compilation; b) Concept compilation; c) downward synonym propagation; and d) upward synonym propagation.

76. The system according to claim 40 wherein said text document retriever, accessed by said user interface, comprises a retriever which takes one or more annotated documents as input, passes retrieved and categorized documents to a TML converter, which passes them to a document viewer.

77. The method according to claim 34 wherein a tag hierarchy in the CSL is a set of declarations, each declaration relating a tag to a set of tags, declaring that each of the latter tags is to be considered an instance of the former tag.
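A tag hierarchy as described in claim 77 can be sketched as a mapping from a tag to the set of tags declared to be its instances. The tag names below are invented for illustration. Treating the relation as transitive and counting a tag as an instance of itself are both assumptions that the claim does not state.

```python
from typing import Dict, Optional, Set

TagHierarchy = Dict[str, Set[str]]  # tag -> tags declared as its instances


def is_instance(tag: str, ancestor: str, hierarchy: TagHierarchy,
                _seen: Optional[Set[str]] = None) -> bool:
    """Return True if tag counts as an instance of ancestor."""
    if tag == ancestor:
        return True  # assumption: every tag is an instance of itself
    seen = set() if _seen is None else _seen
    if ancestor in seen:
        return False  # guard against cyclic declarations
    seen.add(ancestor)
    # Assumption: declarations chain transitively through intermediate tags.
    return any(is_instance(tag, child, hierarchy, seen)
               for child in hierarchy.get(ancestor, ()))


hierarchy: TagHierarchy = {
    "noun": {"common_noun", "proper_noun"},
    "common_noun": {"singular_noun", "plural_noun"},
}
```

Under these declarations, a Pattern that asks for `noun` would also accept a token tagged `plural_noun`, but not the reverse.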

78. The method according to claim 34 wherein a Concept in the CSL is used to represent concepts.

79. The method according to claim 78 wherein a Concept in the CSL can either be global or internal to other Concepts.

80. The method according to claim 78 wherein a Concept in the CSL uses words and other Concepts in the definition of Concept Rules.

81. The method according to claim 80 wherein a Concept Rule in the CSL comprises an optional name internal to the Concept followed by a Pattern.

82. The method according to claim 81 wherein a Pattern in the CSL may match: a) single terms in an annotated text (a "single-term Pattern"); or b) some configuration in an annotated text (a "configurational Pattern").

83. The method according to claim 82 wherein a single-term Pattern in the CSL comprises a reference to: a) the name of a word; b) optionally, its part of speech tag; and c) optionally, synonyms of the word.
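The three components of a single-term Pattern in claim 83 map naturally onto a small record type. The sketch below assumes tokens are (word, part-of-speech tag) pairs and uses Penn Treebank style tags purely as an illustration; `SingleTermPattern` and `find` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


@dataclass(frozen=True)
class SingleTermPattern:
    word: str                                # a) the name of a word
    pos: Optional[str] = None                # b) optional part-of-speech tag
    synonyms: FrozenSet[str] = frozenset()   # c) optional synonyms of the word

    def matches(self, token: Token) -> bool:
        text, tag = token
        if self.pos is not None and tag != self.pos:
            return False
        lowered = text.lower()
        return lowered == self.word.lower() or lowered in self.synonyms


def find(pattern: SingleTermPattern, tokens: List[Token]) -> List[int]:
    """Return the positions of all tokens that the pattern matches."""
    return [i for i, tok in enumerate(tokens) if pattern.matches(tok)]


tokens = [("The", "DT"), ("fire", "NN"), ("crews", "NNS"),
          ("fire", "VB"), ("the", "DT"), ("blaze", "NN")]
noun_fire = SingleTermPattern("fire", pos="NN", synonyms=frozenset({"blaze"}))
any_fire = SingleTermPattern("fire")
```

With the tag and synonym set supplied, `noun_fire` skips the verb reading of "fire" and picks up "blaze"; `any_fire` matches both occurrences of "fire" regardless of tag.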

84. The method according to claim 82 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is Boolean:

85. The method according to claim 82 wherein a configurational Pattern in the CSL is any expression in the notation used to represent syntactic descriptions.

86. The method according to claim 85 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is of two types: a) Dominance, and b) Precedence.

87. The method according to claim 86 wherein a configurational Pattern in the CSL consists of the form A Dominates B, where a) A is a syntactic constituent (which can be identified by a phrasal tag, though not necessarily); b) B is any Pattern; and c) the entire Pattern matches any configuration where what B refers to is a subconstituent of A.

88. The method according to claim 87 wherein a configurational Pattern in the CSL of the form A Dominates B is wide-matched, where said wide-matching returns the interval of the dominant expression A in a text instead of the interval of the dominated expression B, and where said interval is a consecutive sequence of words in a text that is commonly though not necessarily represented as two integers separated by a dash.
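The difference between ordinary matching and wide-matching of A Dominates B can be shown with a minimal sketch. It assumes that syntactic constituents are available as tagged intervals and that the matches of B have already been computed as intervals; `Constituent`, `dominates` and `format_interval` are hypothetical names, and allowing B to coincide with the whole of A is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) word positions


@dataclass(frozen=True)
class Constituent:
    tag: str    # phrasal tag, e.g. "NP"
    start: int
    end: int


def dominates(constituents: List[Constituent], a_tag: str,
              b_matches: List[Interval], wide: bool = False) -> List[Interval]:
    """Match 'A Dominates B' for constituents tagged a_tag.

    Returns the interval of B by default, or the interval of the
    dominating constituent A when wide-matching is requested.
    """
    results: List[Interval] = []
    for c in constituents:
        if c.tag != a_tag:
            continue
        for (s, e) in b_matches:
            if c.start <= s and e <= c.end:  # B lies inside A
                results.append((c.start, c.end) if wide else (s, e))
    return results


def format_interval(interval: Interval) -> str:
    """Render an interval as two integers separated by a dash."""
    return f"{interval[0]}-{interval[1]}"


# "the red car hit a tree": NP over words 0-2, VP over 3-5, NP over 4-5.
tree = [Constituent("NP", 0, 2), Constituent("VP", 3, 5), Constituent("NP", 4, 5)]
car = [(2, 2)]  # matches of B, here the word "car"
narrow = dominates(tree, "NP", car)
wide = dominates(tree, "NP", car, wide=True)
```

For this sentence the ordinary match reports the interval of "car" alone, while the wide match reports the interval of the whole noun phrase "the red car".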

89. The method according to claim 86 wherein a configurational Pattern in the CSL consists of the form A Precedes B, where a) A is any Pattern; b) B is any Pattern; and c) the entire Pattern matches any configuration where what A refers to precedes what B refers to.

90. The method according to claim 84 wherein a Boolean operator in the CSL can be applied to any Patterns to obtain further Patterns.

91. The method according to claim 82 wherein any of the Patterns defined in the CSL is a CSL Expression.

92. The method according to claim 82 wherein a Pattern defined in the CSL is fully recursive.

93. The method according to claim 82 wherein a Macro in the CSL represents a Pattern in a compact, parameterized form and can be used wherever a Pattern is used.

94. The method according to claim 1 wherein said concepts, represented in said concept specification language, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.

95. The method according to claim 34 wherein said Concepts, represented in said CSL, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.