SYSTEM AND METHOD FOR CREATING A SEARCHABLE WORD INDEX OF A SCANNED DOCUMENT INCLUDING MULTIPLE INTERPRETATIONS OF A WORD AT A GIVEN DOCUMENT LOCATION
The present application is related to and claims priority from U.S. Provisional Patent Application No. 60/187,362, filed Mar. 6, 2000 for "System and Method for Converting Archived Data into Searchable Text," with inventors G. Bret Millar, Timothy L. Andersen, and E. Derek Rowley, which is incorporated herein by reference in its entirety. The present application also claims priority to PCT Application Ser. No. PCT/US01/07127, which was filed on Mar. 6, 2001 and is likewise incorporated herein by reference in its entirety.
1. The Field of the Invention
The present invention relates generally to the field of optical character recognition (OCR). More specifically, the present invention relates to a system and method for creating a searchable word index of a scanned document including multiple interpretations of a word at a given location within the document.
2. Technical Background
In the field of optical character recognition (OCR), analog documents (e.g., paper, microfilm, etc.) are digitally scanned, segmented, and converted into text that may be read, searched, and edited by means of a computer. In order to provide for rapid searching, each recognized word is typically stored in a searchable word index with links to the location (e.g., page number and page coordinates) at which the word may be found within the scanned document.
In some conventional OCR systems, multiple recognition engines are used to recognize each word in the document. The use of multiple recognition engines generally increases overall recognition accuracy, since the recognition engines typically use different OCR techniques, each having different strengths and weaknesses.
When the recognition engines produce differing interpretations of the same image of a word in the scanned document, one interpretation is typically selected as the "correct" interpretation. Often, OCR systems rely on a "voting" (winner takes all) strategy, with the majority interpretation being selected as the correct one. Alternatively, or in addition, confidence scores may be used. For example, suppose two recognition engines correctly recognize the word "may" with confidence scores of 80% and 70%, respectively, while a third recognition engine interprets the same input data as "way" with a 90% confidence score and a fourth recognizes the input data as "uuav" with a 60% confidence score. In such an example, a combination of voting and confidence scores may lead to a selection of "may" as the preferred interpretation.
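The selection described above can be illustrated with a minimal sketch. The combination rule shown here (summing the confidence scores of the engines that agree on a word) is an assumption made for illustration only; the function name and data layout are likewise hypothetical, and conventional systems may combine votes and confidence scores differently.

```python
from collections import defaultdict


def select_interpretation(results):
    """Pick a single interpretation from (word, confidence_percent) pairs.

    Assumed rule (not prescribed by the text): each engine's vote is
    weighted by its confidence score, and the word with the largest
    total is selected. All other interpretations are discarded.
    """
    totals = defaultdict(int)
    for word, confidence in results:
        totals[word] += confidence
    # On a tie, max() returns the word that was seen first.
    return max(totals, key=totals.get)


# The four engine outputs from the example above.
engine_results = [("may", 80), ("may", 70), ("way", 90), ("uuav", 60)]
selected = select_interpretation(engine_results)
print(selected)
```

Under this assumed rule, "may" accumulates a total of 150 against 90 for "way" and 60 for "uuav", so "may" is selected even though "way" has the highest individual confidence score.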
Unfortunately, by selecting a single interpretation and discarding the rest, the objectively correct interpretation is also frequently discarded. Often, image noise and other effects confuse a majority of the recognition engines, with only a minority of the recognition engines arriving at the correct interpretation. In the above example, the correct interpretation could have been "way," which would have been discarded using standard methods. Accordingly, conventional OCR systems have never been able to approach total accuracy, no matter how many recognition engines are employed.
What is needed, then, is a system and method for creating a searchable word index of a scanned document including multiple interpretations of a word at a given location within the document. What is also needed is a system and method for creating a searchable word index that selectively reduces the size of the index by eliminating interpretations that are not found in a dictionary or word list. In addition, what is needed is a system and method for creating a searchable word index that permits rescaling of a scanned document without requiring modification of location data within the word index.
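By way of illustration only, the first two needs stated above might be sketched as follows. This is a hypothetical sketch rather than the disclosed implementation: the function name, the (page, x, y) location tuple, and the use of a Python set as the dictionary or word list are all assumptions.

```python
from collections import defaultdict


def build_index(recognized, dictionary=None):
    """Map every retained interpretation to the locations where it may occur.

    recognized: iterable of (location, interpretations) pairs, where
        location is a hypothetical (page, x, y) tuple and interpretations
        is the list of words proposed by the engines for that location.
    dictionary: optional set of known words; interpretations not found
        in it are left out of the index.
    """
    index = defaultdict(list)
    for location, interpretations in recognized:
        # dict.fromkeys drops duplicate interpretations but keeps their order.
        for word in dict.fromkeys(interpretations):
            if dictionary is None or word in dictionary:
                index[word].append(location)
    return dict(index)


# One word image, four engine outputs, as in the background example.
recognized = [((1, 120, 340), ["may", "may", "way", "uuav"])]
index = build_index(recognized, dictionary={"may", "way"})
print(index)
```

In this sketch both "may" and "way" remain searchable and point to the same location, while "uuav" is eliminated because it does not appear in the assumed word list.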
BRIEF DESCRIPTION OF THE DRAWINGS
Non-exhaustive embodiments of the invention are described with reference to the figures, in which:
FIG. 1 is a block diagram of a conventional system for creating a searchable word index of a scanned document;
FIG. 2 is a block diagram of a system for creating a searchable word index of a scanned document including multiple interpretations for a word at a given location within the document;
FIG. 3 is a block diagram of linked word nodes;
FIG. 4 is a block diagram of a system for creating a searchable word index including a word filter in communication with a dictionary;
FIG. 5 is a physical block diagram of a computer system for creating a searchable word index of a scanned document including multiple interpretations for a word at a given location within the document; and
FIG. 6 is a flowchart of a method for creating a searchable word index of a scanned document including multiple interpretations for a word at a given location within the document.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of programming, user selections, network transactions, database queries, database structures, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Referring now to FIG. 1, there is shown a conventional optical character recognition (OCR) system 100 that produces a searchable word index 102 from an analog document 104 (such as a paper or microfilm document). Initially, the analog document 104 is scanned by a digital scanner