COMPUTER-IMPLEMENTED METHOD FOR
AUTOMATIC EXTRACTION OF DATA FROM
This invention relates to optical character recognition (OCR) devices and processes, and more particularly to a computer-implemented method for automatic extraction of data from printed forms.
BACKGROUND OF THE INVENTION
Automatic computer-implemented reading of data from printed forms is typically done in a sequence of three steps. First, a form is optically scanned to create an electronic image, which is then written in digital storage as a rectangular array of 0's and 1's representing white and black subareas or pixels. Then the image is processed to extract regions or fields containing the data to be read. Finally, the black and white subimage in each extracted region is interpreted and expressed as an alphanumeric code, such as ASCII or EBCDIC.
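The first of these steps can be illustrated with a minimal sketch. It assumes the scanner delivers 8-bit grayscale values (0 for black, 255 for white) and that a fixed threshold separates ink from paper; the function name `binarize` and the threshold of 128 are illustrative assumptions, not part of the described method.

```python
def binarize(gray_rows, threshold=128):
    """Convert grayscale rows into the 0/1 array described above.

    Assumption: 0 is black and 255 is white in the input, and any pixel
    darker than `threshold` is treated as black (1, an ON bit).
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


# A 2 x 3 grayscale patch: dark pixels become 1, light pixels become 0.
patch = [
    [0, 255, 30],
    [200, 100, 250],
]
bits = binarize(patch)
```

A real scanner or its driver may perform this conversion itself, in which case the image arrives already in 0/1 form.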
The data present in printed forms may be defined as having two aspects: a value and a significance. For example, the word "Yes" is a value that becomes data only when its significance, i.e., the question it answers, is made clear. Printed forms provide a conventional means for recording data in which significance is predefined as a background of text and graphics, such as boxed areas. Since forms are printed mechanically, the background is identical over different instances of the same form. Thus the position of data values on the form is in correspondence with the data significance. Optical character recognition (OCR) devices take advantage of this fact to read data from credit card receipts, billing statements, etc. Such "OCR forms" are designed with data values entered in spaces well separated from background printing to assure that the latter are not erroneously interpreted as data values. Data significance does not appear explicitly, but is stored in the computer and associated with the data values on the basis of position in the image. In some cases, forms are printed in a color invisible to the scanner to avoid a possibility of confusion. Data values are carefully positioned during printing, and the form precisely registered during scanning. All these steps serve to guarantee that the data values are exactly where the reading or scanning equipment performs its extraction process.
In recent years, demand has grown for a capability to capture data from printed forms that do not meet OCR constraints. Forms routinely used in government and commercial operations, such as birth and marriage certificates, are designed to be intelligible to the human eye and brain. While people are sophisticated processors of visual images, they also require that both attributes of a data element, the significance and the value, be present on the document. Thus background printing is provided to supply the meaning of each data field, and lines and boxes are imposed to make clear the association of data value and data significance. The crowded appearance of these "people forms", compared with OCR forms, is a necessary outcome of a requirement to pack a great deal of information into a limited space.
It is likewise difficult to enforce controls in the preparation of people forms. A birth or marriage certificate filled out with a typewriter is registered by eye, often with errors in translation and skew compared to the ideal orientation. Data values may superimpose on the form background as a result. The printing process itself is subject to mechanical slippage that may give the same effect. Finally, mechanical slippage and electronic noise occurring during the optical scanning process present a further source of registration error. This is particularly true if economical general-purpose scanners are used. The net result of all these factors is that printing of a given data value on people forms may be skewed, may overlap boundary lines separating data regions, and even when ideally positioned does not consistently appear in a fixed, predictable region in scanned images of different instances of the form. These difficulties pose severe problems for automatic computer-implemented data extraction, rendering inapplicable the sort of processing used for OCR forms.
SUMMARY OF THE INVENTION
The invention is a computer-implemented method operable with conventional OCR scanning equipment and software for the automatic extraction of data from printed forms.
A blank master form is scanned and its digital image stored. Clusters of ON bits of the master form image are first recognized as part of a line and then connected to form lines. All of the lines in the master form image are then identified by row and column start position and column end position, thereby creating a master-form-description. The resulting image, which consists only of lines in the master form, can then be displayed. Regions or masks in the displayed image of master form lines are then created, each mask corresponding to a field where data would be located in a filled-in form. Each data mask is spaced from nearby lines by a predetermined data margin, referred to as D.
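The line identification and mask margin described above can be sketched as follows. This is a simplified illustration only: it treats any horizontal run of ON bits at least `min_length` pixels long as a line, and it builds a mask by insetting a rectangular box by the margin D. The names `find_horizontal_lines`, `mask_inside_box`, and `min_length` are assumptions introduced for the example; the cluster recognition and connection steps of the actual method are not reproduced here.

```python
def find_horizontal_lines(image, min_length=4):
    """Return (row, col_start, col_end) for each long horizontal run of ON bits.

    `image` is a list of rows of 0/1 values. Runs shorter than `min_length`
    are assumed to be text or noise rather than form lines.
    """
    lines = []
    for r, row in enumerate(image):
        start = None
        # A trailing 0 acts as a sentinel that closes a run reaching the edge.
        for c, bit in enumerate(list(row) + [0]):
            if bit and start is None:
                start = c
            elif not bit and start is not None:
                if c - start >= min_length:
                    lines.append((r, start, c - 1))
                start = None
    return lines


def mask_inside_box(top, bottom, left, right, d):
    """Inset a box bounded by form lines by the data margin D on every side."""
    return (top + d, bottom - d, left + d, right - d)


master = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
master_form_description = find_horizontal_lines(master)
```

The list of (row, column start, column end) triples plays the role of the master-form-description; vertical lines could be handled the same way by scanning columns instead of rows.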
A filled-in or data form is then scanned and lines are also recognized and identified in a similar manner to create a data-form-description. The data-form-description is compared with the master-form-description by computing the horizontal and vertical offsets and skew of the two forms relative to one another. The created data masks, whose orientation with respect to the master form has been previously determined, are then transposed into the data form image using the computed values of horizontal and vertical offsets and skew. In this manner, the data masks are correctly located on the data form so that the actual data values in the data form reside within the corresponding data masks. Routines are then implemented for detecting extraneous data intruding into the data masks and for growing the masks, i.e., enlarging the masks to capture data which may extend beyond the perimeter of the masks. Thus, the data masks are adaptive in that they are grown if data does not lie entirely within the perimeter of the masks. During the mask growth routine, lines which are part of the background form are detected and removed by line removal algorithms.
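One way to picture the computation of offsets and skew, and the transposition of masks, is the sketch below. It assumes that a single form line has been matched between the master form and the data form and that its left and right endpoints are known as (row, column) pairs; the skew is taken as the difference in the line's angle, and the offsets as the displacement of its left endpoint. The functions `estimate_registration` and `transpose_point` are hypothetical names, and a practical implementation would likely combine many matched lines rather than one.

```python
import math


def estimate_registration(master_line, data_line):
    """Estimate offsets and skew from one matched line.

    Each line is ((row0, col0), (row1, col1)): its left and right endpoints.
    Returns the master anchor point, its position in the data form, and the
    skew angle in radians (measured in image coordinates, rows increasing
    downward).
    """
    (mr0, mc0), (mr1, mc1) = master_line
    (dr0, dc0), (dr1, dc1) = data_line
    skew = math.atan2(dr1 - dr0, dc1 - dc0) - math.atan2(mr1 - mr0, mc1 - mc0)
    return {"anchor": (mr0, mc0), "target": (dr0, dc0), "skew": skew}


def transpose_point(point, reg):
    """Map a (row, col) point of the master form into the data form image."""
    r, c = point
    ar, ac = reg["anchor"]
    tr, tc = reg["target"]
    dy, dx = r - ar, c - ac
    cos_s, sin_s = math.cos(reg["skew"]), math.sin(reg["skew"])
    # Rotate the displacement from the anchor by the skew, then translate.
    return (tr + dx * sin_s + dy * cos_s, tc + dx * cos_s - dy * sin_s)


# Pure translation: the data form is shifted 3 rows down and 5 columns right.
reg = estimate_registration(((10, 0), (10, 100)), ((13, 5), (13, 105)))
mask_corner_in_data_form = transpose_point((20, 30), reg)
```

Applying `transpose_point` to each corner of a data mask gives the mask's location in the data form image. Mask growth and line removal are separate steps and are not covered by this sketch.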
Following the removal of extraneous data from the masks, the growth of the masks to capture data, and any subsequent line removal, the remaining data from the masks is extracted and transferred to a new file. The new file then contains only data comprising characters of the data values in the desired regions, which can then be operated on by conventional OCR software to identify the specific character values.
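The growth of a mask and the transfer of its contents to a new file can be sketched as follows, under simplifying assumptions. The sketch assumes background form lines have already been removed, so any ON bit on a mask edge belongs to data; it grows each such edge outward one pixel at a time until the perimeter is clear, then copies only the pixels inside the masks to a blank image. The names `grow_mask`, `extract_masks`, and `max_steps` are illustrative and do not come from the described routines.

```python
def grow_mask(image, mask, max_steps=10):
    """Grow mask = (top, bottom, left, right), inclusive, while data touches an edge."""
    top, bottom, left, right = mask
    height, width = len(image), len(image[0])
    for _ in range(max_steps):
        grew = False
        if top > 0 and any(image[top][left:right + 1]):
            top -= 1
            grew = True
        if bottom < height - 1 and any(image[bottom][left:right + 1]):
            bottom += 1
            grew = True
        if left > 0 and any(image[r][left] for r in range(top, bottom + 1)):
            left -= 1
            grew = True
        if right < width - 1 and any(image[r][right] for r in range(top, bottom + 1)):
            right += 1
            grew = True
        if not grew:
            break
    return (top, bottom, left, right)


def extract_masks(image, masks):
    """Copy only the pixels inside the masks into a new, otherwise blank image."""
    out = [[0] * len(image[0]) for _ in image]
    for top, bottom, left, right in masks:
        for r in range(top, bottom + 1):
            out[r][left:right + 1] = image[r][left:right + 1]
    return out


data_form = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
grown = grow_mask(data_form, (2, 2, 2, 3))
new_file_image = extract_masks(data_form, [grown])
```

In this example the mask expands to enclose the whole character stroke, while the stray ON bit in the bottom-left corner stays outside the mask and does not reach the output image.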
For a fuller understanding of the nature and advantages of the present invention reference should be made