|Publication number||US7519221 B1|
|Application number||US 11/069,510|
|Publication date||Apr 14, 2009|
|Filing date||Feb 28, 2005|
|Priority date||Feb 28, 2005|
|Publication number||069510, 11069510, US 7519221 B1, US 7519221B1, US-B1-7519221, US7519221 B1, US7519221B1|
|Inventors||Dennis G. Nicholson|
|Original Assignee||Adobe Systems Incorporated|
1. Field of the Invention
The present invention relates to methods for representing documents within a computer system. More specifically, the present invention relates to a method and an apparatus for generating a synthetic font to facilitate creation of an electronic document from scanned page-images, wherein the resulting electronic document reproduces both the logical content and the physical appearance of the original document.
2. Related Art
As businesses and other organizations become increasingly computerized, they are beginning to store and maintain electronic versions of paper documents on computer systems. The process of storing paper documents on a computer system typically involves a “document imaging” process, which converts the paper documents into electronic documents. This document imaging process typically begins with an imaging step, wherein document page-images are generated using a scanner, a copier, or a camera. These page-images are typically analyzed and enhanced using a computer program before being assembled into a document container, such as an Adobe® Portable Document Format (PDF) file.
A number of formats are presently used for document imaging. These formats include: (1) plain image, (2) searchable image (SI), and (3) formatted text and graphics (FT&G). The “plain-image” format provides a bitmap representation of the image, which is quite useful for archival applications, such as check processing.
The searchable image (SI) format uses scanned images for document display (e.g., in a document viewer), and uses invisible text derived from the scanned images for document search and retrieval. There are two common flavors of searchable image: (1) SI (exact); and (2) SI (compact). SI (exact) maintains a bit-for-bit copy of the scanned pages, whereas SI (compact) applies lossy compression to the original page-images to produce smaller but nearly identical “perceptually lossless” page-images for document display.
Formatted text and graphics (FT&G) uses formatted text, graphical lines, and placed images to construct representations of the original page-images. FT&G can be “uncorrected,” which means it includes suspects (word images + hidden text) in place of formatted text for low-confidence optical character recognition (OCR) results. Alternatively, FT&G can be “corrected” by manually converting suspects to formatted text. (Note that the term “OCR” refers to the process of programmatically converting scanned blobs into corresponding ASCII characters.)
When determining which document imaging format to use, a user typically considers a number of attributes of interest. For example, the attributes of interest can include the following:
With respect to these attributes, the above-described image formats generally perform as follows:
As can be seen from the list above, each of these document imaging formats has unique advantages compared to the other formats. Hence, when a user has to choose one of the document imaging formats, the user typically has to forego advantages that the user would like to have from the other formats.
Hence, what is needed is a method and an apparatus for obtaining the advantages of all of the existing document imaging formats within a single document imaging format.
One embodiment of the present invention provides a system that creates an electronic version of a document from page-images of the document. During operation, the system receives the page-images for the document. Next, the system extracts character images from the page-images, and generates a synthetic font for the document from the extracted character images. Finally, the system constructs the electronic version of the document by using the synthetic font to represent text regions of the document, and by using image-segments extracted from the page-images to represent non-text regions of the document.
In a variation on this embodiment, generating the synthetic font involves: producing glyphs from the extracted character images; obtaining character labels for the glyphs; and using the glyphs and associated character labels to form the synthetic font.
In a further variation, obtaining character labels for the glyphs involves performing an optical character recognition (OCR) operation on the glyphs.
In a further variation, producing glyphs from the extracted character images involves statistically analyzing extracted character images which are similar to each other to ensure that the character images fall into homogeneous clusters.
In a further variation, the statistical analysis is based on an inter-character distance metric.
In a further variation, producing glyphs from the extracted character images involves converting the extracted character images to grayscale. Next, the system iteratively: registers extracted character images in each cluster with sub-pixel accuracy; extracts a high-resolution, noise-reduced prototype from the registered character images for each cluster; measures a distance from each registered character image to its associated prototype; and uses the measured distances to purify each cluster via histogram analysis of inter-cluster and intra-cluster distances.
In a further variation, extracting the noise-reduced prototype from the registered character images for a given cluster involves averaging registered character images in the given cluster to produce a reduced-noise glyph which is representative of the given cluster.
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices, such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs), and computer instruction signals embodied in a transmission medium (with or without a carrier wave upon which the signals are modulated). For example, the transmission medium may include a communications network, such as a LAN, a WAN, or the Internet.
Document Imaging Process
The present invention provides a technique for generating a new universal document imaging format, which provides the combined advantages of current document imaging formats. In particular, the new document imaging format provides the low production cost and reliable fidelity of image-based formats. At the same time, the new format provides the small file size, superior display quality, performance, reflowability, and accessibility of formatted-text based formats. Additionally, techniques to generate the new document format facilitate enhanced OCR accuracy, which in turn results in improved searchability.
During the electronic document creation process, character images are extracted from the page-images. (Note that the term “character images” and the process of extracting character images for optical character recognition (OCR) purposes are well-known in the art.) Similar character images are combined to statistically remove noise and other artifacts introduced by the printing and imaging (e.g., scanning) processes. The resulting high-resolution, typeset-quality glyphs are then labeled via OCR, and the labeled glyphs are used to construct synthetic document-specific fonts. Finally, the electronic document is constructed by using the synthetic fonts to precisely duplicate text regions, and by using image-segments extracted from the page-images to duplicate non-text regions. The result is a document that is perceptually identical to the original printed document, but is created using a common font mechanism so that the document text is searchable, selectable, reflowable, accessible, etc. This electronic document generally looks better than the scanned images due to statistical removal of noise from the imaged glyphs.
During the document imaging process, character images 114 are extracted from the text regions. These character images are analyzed to generate a synthetic font 116. This synthetic font 116 is then used to represent text regions 109-112 from page-images 104-106, thereby forming corresponding “converted” text regions 128-131 in the “imaged” document, which comprises page-images 124-126. Note that image-segments from non-text regions 107-108 are simply transferred without significant modification from page-images 104-105 to corresponding page-images 124-125. This process is described in more detail below with reference to the flow chart in
First, the system receives page-images for the document (step 202). Note that these page-images, which are also referred to as “scanned images,” can be created using a scanner, copier, camera, or other imaging device. Next, the system partitions the page-images into text regions and non-text regions (step 204). There exist a number of well-known techniques to differentiate text regions from non-text regions, so this step will not be discussed further.
The system subsequently extracts character images from the text regions (step 206). (This is a well-known process, which is widely used in OCR systems.) The system then generates a synthetic font from the character images, through a process which is described in more detail below with reference to
Finally, the system constructs the new electronic version of the document. This involves using the synthetic font to precisely duplicate the text regions of the document, and using image-segments extracted from the page-images to represent non-text regions of the document (step 210).
Note that OCR errors that arise during this process will have the same effect as they do in searchable image formats. That is, the glyph will appear as the noise-reduced scanned glyph, but that glyph will be mislabeled. For example, an “I” might be mislabeled as a “1”. In this case, viewers will see the scanned “I” but a search for an ASCII “I” will not find the “I”.
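As an illustration of this behavior, the following minimal Python sketch separates what a viewer draws (glyph identifiers) from what a search operates on (character labels). The glyph identifiers, the `glyph_labels` table, and the helper names are hypothetical and are not taken from the described system.

```python
# Hypothetical label table: glyph 0 looks like "I" on the page,
# but OCR has mislabeled it as the digit "1".
glyph_labels = {0: "1", 1: "n"}

# A text run is stored as glyph identifiers; the viewer draws these glyphs,
# so the reader still sees the scanned shapes regardless of the labels.
text_run = [0, 1]


def searchable_text(run, labels):
    """Build the string that a search operates on from the glyph labels."""
    return "".join(labels[glyph_id] for glyph_id in run)


def contains(run, labels, query):
    """Return True if the query occurs in the labeled text of the run."""
    return query in searchable_text(run, labels)
```

In this sketch a search for "In" fails even though the page visibly shows that word, because the search only ever sees the labels.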
Synthetic Font Creation
The system then performs an iterative process, which involves a number of steps. First, the system overlays the character images in each cluster with sub-pixel accuracy (step 408). Note that this involves registering the character images with each other at a resolution finer than a pixel. There are a number of ways to do this, such as up-sampling the pixels so that each pixel becomes 4 or 16 pixels.
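The up-sampling approach mentioned above can be sketched as follows. This is only an illustrative sketch under stated assumptions: images are lists of rows of grayscale values, up-sampling is plain pixel replication, candidate offsets are tried exhaustively, and mismatch is measured as a sum of absolute differences. The function names and the `factor` and `max_shift` parameters are hypothetical; a practical system might interpolate rather than replicate pixels.

```python
def upsample(img, factor):
    """Replicate pixels so each pixel becomes factor x factor pixels."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def shift(img, dy, dx, fill=0):
    """Move image content down by dy and right by dx, padding with fill."""
    height, width = len(img), len(img[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = img[src_y][src_x]
    return out


def mismatch(a, b):
    """Sum of absolute pixel differences between two equal-size images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def register(reference, image, factor=2, max_shift=2):
    """Find the offset that best aligns image with reference.

    The search runs on the up-sampled grid, so the returned (dy, dx) is in
    original-pixel units with a resolution of 1/factor pixel.
    """
    ref_up = upsample(reference, factor)
    img_up = upsample(image, factor)
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = mismatch(ref_up, shift(img_up, dy, dx))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    _, dy, dx = best
    return dy / factor, dx / factor
```

With `factor=2`, each pixel becomes 4 pixels and offsets are resolved in half-pixel steps; with `factor=4`, each pixel becomes 16 pixels and the step is a quarter pixel.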
Next, the system extracts a noise-reduced prototype from the character images for each cluster (step 410). The system then measures the distance from each registered character image to its associated prototype (step 412). Then, the system uses the measured distances to purify each cluster through a histogram analysis of inter-cluster and intra-cluster distances (step 414). This iterative process is repeated until the clusters are stable.
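One variation described above extracts the prototype by averaging the registered character images in a cluster. A minimal sketch of that averaging step is shown below, assuming that the images are already registered, share the same dimensions, and are stored as lists of rows of grayscale values. The function name is hypothetical.

```python
def average_prototype(images):
    """Average registered, same-size grayscale images pixel by pixel."""
    if not images:
        raise ValueError("a cluster must contain at least one image")
    height, width = len(images[0]), len(images[0][0])
    for img in images:
        if len(img) != height or any(len(row) != width for row in img):
            raise ValueError("all images in a cluster must share one size")
    count = len(images)
    return [
        [sum(img[y][x] for img in images) / count for x in range(width)]
        for y in range(height)
    ]
```

Because noise that is uncorrelated between samples tends to cancel in the mean, a stray dark pixel in one sample contributes only a fraction of its value to the prototype.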
Note that any one of a number of well-known distance metrics (from various pattern-recognition techniques) can be used to measure the distance between a given registered character image and its corresponding prototype. For example, the system can perform an exclusive-OR operation between the character image and the prototype, and can count the number of bits that differ between them. Of course, other, more-sophisticated distance metrics can be used instead of a simple bit difference. Ideally the distance metric correlates with perceived visual difference.
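The exclusive-OR example can be sketched as follows, assuming that both the character image and the prototype have first been thresholded to same-size bitmaps of 0 and 1 values (the thresholding step and the function name are assumptions made for this illustration).

```python
def xor_distance(image, prototype):
    """Count the pixels that differ between two same-size 0/1 bitmaps."""
    if len(image) != len(prototype):
        raise ValueError("bitmaps must have the same number of rows")
    distance = 0
    for row_a, row_b in zip(image, prototype):
        if len(row_a) != len(row_b):
            raise ValueError("bitmaps must have the same row length")
        # p ^ q is 1 exactly when the two bits differ.
        distance += sum(p ^ q for p, q in zip(row_a, row_b))
    return distance
```

This count treats every differing pixel equally, which is why a metric that tracks perceived visual difference more closely may be preferable.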
The histogram analysis generally ranks the character images by distance from the prototype. If necessary, the clusters are “purified” by removing character images that are a large distance from the prototype. These removed character images can then be re-clustered, so that they fall into different and/or new clusters.
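A simplified sketch of this ranking and purification step is given below. The description does not state how the cut-off is chosen, so the `max_distance` threshold, the `bin_width` parameter, the image names, and the function names are all hypothetical; in practice the cut-off would presumably be derived from the histogram itself.

```python
from collections import Counter


def distance_histogram(distances, bin_width):
    """Bucket integer distances into bins keyed by the bin's lower edge."""
    counts = Counter(d // bin_width for d in distances.values())
    return {b * bin_width: c for b, c in sorted(counts.items())}


def purify_cluster(distances, max_distance):
    """Rank images by distance to the prototype and split off outliers.

    distances maps an image name to its distance from the cluster prototype.
    Returns (kept, removed); the removed images can then be re-clustered.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    kept = [name for name, d in ranked if d <= max_distance]
    removed = [name for name, d in ranked if d > max_distance]
    return kept, removed
```

For example, a cluster of "e" images that has accidentally absorbed one "c" image would typically show that image isolated in a far bin of the histogram, well separated from the rest.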
Next, the system uses the final prototype for each cluster as the representative glyph for the cluster (step 416). The system also performs a standard OCR operation to obtain character labels for each representative glyph (step 418). Note that if this OCR operation is not accurate, it is possible for a glyph to be associated with an erroneous character label. Hence, if possible, it is desirable to perform a manual correction on these character label assignments. If it is not possible to correct character assignment, the representative glyph will still provide an accurate visual representation of the character, even if the assigned character label is not accurate.
Finally, the representative glyphs and associated character labels are used to form the synthetic font (step 420).
Note that synthetic fonts may have multiple glyphs for each “character” to preserve the perceptually lossless property.
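One possible in-memory representation of such a font is sketched below, in which every representative glyph keeps its own identifier so that several glyphs can share one character label. The `Glyph` and `SyntheticFont` classes and their fields are hypothetical and do not describe an actual font file format.

```python
from dataclasses import dataclass, field


@dataclass
class Glyph:
    glyph_id: int   # unique identifier within the font
    label: str      # character label assigned by OCR (may be wrong)
    bitmap: list    # representative prototype for the cluster


@dataclass
class SyntheticFont:
    glyphs: dict = field(default_factory=dict)  # glyph_id -> Glyph

    def add(self, label, bitmap):
        """Store a representative glyph and return its identifier."""
        glyph_id = len(self.glyphs)
        self.glyphs[glyph_id] = Glyph(glyph_id, label, bitmap)
        return glyph_id

    def glyphs_for(self, label):
        """Return the identifiers of all glyphs sharing a character label."""
        return [g.glyph_id for g in self.glyphs.values() if g.label == label]
```

Keeping distinct glyphs for visually distinct forms of the same character (for example, roman and italic forms of "e") allows each occurrence on the page to be drawn with the shape that was actually printed.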
The present invention uses traditional font machinery to construct an image which is perceptually identical to the original printed page. Because of its font-based construction, the electronic document has advantages (e.g., text extraction, reflow, accessibility) not available in image-based formats. Hence, the present invention combines desirable document imaging properties from different formats into a single format.
The techniques described above also include a number of refinements to the synthetic font generation process. These refinements involve: (1) working at increased resolution to achieve precise glyph registration; (2) working with enhanced grayscale glyphs to de-emphasize scanning artifacts; (3) iteratively refining clusters using histogram analysis and pre-computed font base analyses; and (4) employing OCR techniques within the clustering process. These refinements combine to facilitate a significant removal of printing and scanning artifacts resulting in very clean character prototypes, fewer clusters, and virtual elimination of clustering errors. The production of very clean prototypes significantly improves OCR accuracy. Furthermore, the refined techniques result in improved compression due to the smaller number of prototypes per character.
The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.
|U.S. Classification||382/181, 382/209, 382/186, 382/180|
|International Classification||G06K9/34, G06K9/18, G06K9/62, G06K9/00|
|Cooperative Classification||G06K2209/01, G06K9/34, G06T11/203|
|Feb 28, 2005||AS||Assignment|
Owner name: ADOBE SYSTEMS, INCORPORATED, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NICHOLSON, DENNIS G.;REEL/FRAME:016349/0565
Effective date: 20050225
|Sep 12, 2012||FPAY||Fee payment|
Year of fee payment: 4