CA2486528A1 - Document structure identifier - Google Patents

Document structure identifier

Info

Publication number
CA2486528A1
Authority
CA
Canada
Prior art keywords
document
token
segments
tokens
creating
Prior art date
Legal status
Granted
Application number
CA002486528A
Other languages
French (fr)
Other versions
CA2486528C (en)
Inventor
David Slocombe
Current Assignee
Tata Consultancy Services Ltd
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Publication of CA2486528A1
Application granted
Publication of CA2486528C
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/157 Transformation using dictionaries or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/123 Storage facilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

A method of automated document structure identification based on visual cues is disclosed herein. The two-dimensional layout of the document is analyzed to discern visual cues related to the structure of the document, and the text of the document is tokenized so that similarly structured elements are treated similarly. The method can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.
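
As a rough illustration of treating similarly structured elements similarly, the sketch below groups text segments by a simple visual signature. It is not the method disclosed in the patent: the `Segment` fields, the `visual_signature` function, and the choice of font size and left indent as the cues are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A run of text with layout properties (all fields are hypothetical)."""
    text: str
    x: float          # left edge on the page
    baseline: float   # vertical position of the text baseline
    font_size: float


def visual_signature(seg):
    """Reduce a segment to the visual cues compared here: size and indent."""
    return (round(seg.font_size), round(seg.x))


def group_by_signature(segments):
    """Group segments that look alike so they can be handled alike."""
    groups = defaultdict(list)
    for seg in segments:
        groups[visual_signature(seg)].append(seg.text)
    return dict(groups)


page = [
    Segment("1. Introduction", x=72, baseline=700, font_size=18),
    Segment("Documents carry structure.", x=72, baseline=670, font_size=11),
    Segment("2. Method", x=72, baseline=600, font_size=18),
    Segment("Layout is analyzed first.", x=72, baseline=570, font_size=11),
]
groups = group_by_signature(page)
```

Here the two 18-point lines fall into one group and the two 11-point lines into another, which is the kind of signal a later pass could use to tell headings from body text.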

Claims (13)

1. A method of creating a document structure model of a computer parsable document having contents on at least one page, the method comprising:
identifying the contents of the document as segments having defined characteristics and representing structure in the document;
creating tokens to characterize the content and structure of the document, each token associated with one of the at least one pages based on the position of each segment in relation to other segments on the same page, each token having characteristics defining a structure in the document determined in accordance with the structure of the page associated with the token; and
creating the document structure model in accordance with the characteristics of the tokens across all of the at least one pages of the document.
2. The method of claim 1, wherein the computer parsable document is a page description language file, and wherein the step of identifying the contents of the document includes the step of converting the page description language to a linearized, two dimensional format.
3. The method of claim 1, wherein a segment type for each segment is selected from a list including text segments, image segments and rule segments to represent character based text, vector and bitmapped images and rules respectively.
4. The method of claim 3, wherein the text segments represent strings of text having a common baseline.
5. The method of claim 1, wherein the characteristics of the tokens define a structure selected from a list including candidate paragraphs, table groups, list mark candidates, Dividers, and Zones.
6. The method of claim 5, wherein one token contains at least one segment, and the characteristics of the one token are determined in accordance with the characteristics of the contained segment.
7. The method of claim 1, wherein one token contains at least one other token, and the characteristics of the container token are determined in accordance with the characteristics of the contained token.
8. The method of claim 1, wherein each token is assigned an identification number which includes a geometric index for tracking the location of tokens in the document.
9. The method of claim 1 wherein the document structure model is created using rules based processing of the characteristics of the tokens.
10. The method of claim 5 wherein at least two disjoint Zones are represented in the document structure model as a Galley.
11. The method of claim 5 wherein the candidate paragraph is represented in the document structure model as a structure selected from a list including titles, bulleted lists, enumerated lists, inset blocks, paragraphs, block quotes, tables, footers, header, and footnotes.
12. A system for creating a document structure model using the method of claim 1, the system comprising:
a visual data acquirer for identifying the segments in the document;
a visual tokenizer connected to the visual data acquirer for receiving the identified segments, for creating the tokens characterizing the document, the visual tokenizer; and
a document structure identifier for creating the document structure model based on the tokens received from the visual tokenizer.
13. The system of claim 12 further including a translation engine for reading the document structure model created by the document structure identifier and creating file in a format selected from a list including Extensible Markup Language, Hypertext Markup Language and Standard Generalized Markup Language, in accordance with the content and structure of the document structure model.
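
The component chain in claims 12 and 13 (segments in, tokens, a structure model, then markup out) can be pictured with a toy pipeline. This is a minimal sketch under stated assumptions, not the patented implementation: the class fields, the token-ID format (a loose nod to the geometric index of claim 8), the three rules in `identify_structure`, and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Segment:
    page: int
    x: float
    baseline: float
    font_size: float
    text: str


@dataclass
class Token:
    token_id: str     # encodes page and position (hypothetical format)
    kind: str         # "candidate_paragraph" or "list_mark_candidate"
    font_size: float
    text: str


def tokenize(segments):
    """Hypothetical visual tokenizer: one token per segment, top to bottom."""
    ordered = sorted(segments, key=lambda s: (s.page, -s.baseline, s.x))
    tokens = []
    for seg in ordered:
        is_mark = seg.text.strip() in {"•", "-", "*"}
        kind = "list_mark_candidate" if is_mark else "candidate_paragraph"
        token_id = f"p{seg.page}-y{int(seg.baseline)}-x{int(seg.x)}"
        tokens.append(Token(token_id, kind, seg.font_size, seg.text))
    return tokens


def identify_structure(tokens, body_font_size=11.0):
    """Hypothetical rules-based pass: map tokens to labelled structures."""
    model = []
    pending_mark = False
    for tok in tokens:
        if tok.kind == "list_mark_candidate":
            pending_mark = True   # the next paragraph becomes a list item
            continue
        if pending_mark:
            label = "bulleted_list_item"
        elif tok.font_size > body_font_size:
            label = "title"
        else:
            label = "paragraph"
        pending_mark = False
        model.append((label, tok.text))
    return model


def to_xml(model):
    """Hypothetical translation engine: serialize the model as XML."""
    root = ET.Element("document")
    for label, text in model:
        ET.SubElement(root, label).text = text
    return ET.tostring(root, encoding="unicode")


segments = [
    Segment(1, 72, 700, 18, "Overview"),
    Segment(1, 72, 670, 11, "Body text."),
    Segment(1, 72, 640, 11, "•"),
    Segment(1, 90, 640, 11, "First item"),
]
tokens = tokenize(segments)
model = identify_structure(tokens)
xml_output = to_xml(model)
```

Keeping the stages as separate functions mirrors the separation of components in the claims, so the rule set in the middle stage could be swapped without touching tokenization or output.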
CA2486528A 2002-05-20 2003-05-20 Document structure identifier Expired - Fee Related CA2486528C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US38136502P 2002-05-20 2002-05-20
US60/381,365 2002-05-20
PCT/CA2003/000729 WO2003098370A2 (en) 2002-05-20 2003-05-20 Document structure identifier

Publications (2)

Publication Number Publication Date
CA2486528A1 (en) 2003-11-27
CA2486528C (en) 2010-04-27

Family

ID=29550111

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2486528A Expired - Fee Related CA2486528C (en) 2002-05-20 2003-05-20 Document structure identifier

Country Status (9)

Country Link
US (1) US20040006742A1 (en)
EP (1) EP1508080A2 (en)
JP (1) JP2005526314A (en)
AU (1) AU2003233278A1 (en)
CA (1) CA2486528C (en)
IS (1) IS7525A (en)
MX (1) MXPA04011507A (en)
NZ (1) NZ536775A (en)
WO (1) WO2003098370A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086263A (en) * 2017-06-14 2018-12-25 云拓科技有限公司 The claimed structure structuring device

Families Citing this family (93)

Publication number Priority date Publication date Assignee Title
AU2004282819B2 (en) * 2003-09-12 2009-11-12 Aristocrat Technologies Australia Pty Ltd Communications interface for a gaming machine
US7281005B2 (en) * 2003-10-20 2007-10-09 Telenor Asa Backward and forward non-normalized link weight analysis method, system, and computer program product
US8144360B2 (en) * 2003-12-04 2012-03-27 Xerox Corporation System and method for processing portions of documents using variable data
US20060004729A1 (en) * 2004-06-30 2006-01-05 Reactivity, Inc. Accelerated schema-based validation
US7493320B2 (en) 2004-08-16 2009-02-17 Telenor Asa Method, system, and computer program product for ranking of documents using link analysis, with remedies for sinks
US7913163B1 (en) * 2004-09-22 2011-03-22 Google Inc. Determining semantically distinct regions of a document
US20060085740A1 (en) * 2004-10-20 2006-04-20 Microsoft Corporation Parsing hierarchical lists and outlines
US7698637B2 (en) * 2005-01-10 2010-04-13 Microsoft Corporation Method and computer readable medium for laying out footnotes
US7818304B2 (en) * 2005-02-24 2010-10-19 Business Integrity Limited Conditional text manipulation
US7602972B1 (en) * 2005-04-25 2009-10-13 Adobe Systems, Incorporated Method and apparatus for identifying white space tables within a document
US7721198B2 (en) 2006-01-31 2010-05-18 Microsoft Corporation Story tracking for fixed layout markup documents
US7676741B2 (en) * 2006-01-31 2010-03-09 Microsoft Corporation Structural context for fixed layout markup documents
US8509563B2 (en) * 2006-02-02 2013-08-13 Microsoft Corporation Generation of documents from images
US7836399B2 (en) 2006-02-09 2010-11-16 Microsoft Corporation Detection of lists in vector graphics documents
US7739587B2 (en) * 2006-06-12 2010-06-15 Xerox Corporation Methods and apparatuses for finding rectangles and application to segmentation of grid-shaped tables
KR101058039B1 (en) * 2006-07-04 2011-08-19 삼성전자주식회사 Image Forming Method and System Using MMML Data
US7852499B2 (en) * 2006-09-27 2010-12-14 Xerox Corporation Captions detector
US7810026B1 (en) 2006-09-29 2010-10-05 Amazon Technologies, Inc. Optimizing typographical content for transmission and display
US7979785B1 (en) 2006-10-04 2011-07-12 Google Inc. Recognizing table of contents in an image sequence
US7912829B1 (en) 2006-10-04 2011-03-22 Google Inc. Content reference page
US8782551B1 (en) * 2006-10-04 2014-07-15 Google Inc. Adjusting margins in book page images
US8707167B2 (en) * 2006-11-15 2014-04-22 Ebay Inc. High precision data extraction
US8023740B2 (en) * 2007-08-13 2011-09-20 Xerox Corporation Systems and methods for notes detection
US8782516B1 (en) 2007-12-21 2014-07-15 Amazon Technologies, Inc. Content style detection
US7991709B2 (en) * 2008-01-28 2011-08-02 Xerox Corporation Method and apparatus for structuring documents utilizing recognition of an ordered sequence of identifiers
US7937338B2 (en) * 2008-04-30 2011-05-03 International Business Machines Corporation System and method for identifying document structure and associated metainformation
US8145654B2 (en) 2008-06-20 2012-03-27 Lexisnexis Group Systems and methods for document searching
US8126899B2 (en) 2008-08-27 2012-02-28 Cambridgesoft Corporation Information management system
US9229911B1 (en) * 2008-09-30 2016-01-05 Amazon Technologies, Inc. Detecting continuation of flow of a page
US9460063B2 (en) * 2009-01-02 2016-10-04 Apple Inc. Identification, selection, and display of a region of interest in a document
JP5412903B2 (en) * 2009-03-17 2014-02-12 コニカミノルタ株式会社 Document image processing apparatus, document image processing method, and document image processing program
US10303722B2 (en) 2009-05-05 2019-05-28 Oracle America, Inc. System and method for content selection for web page indexing
US20100287152A1 (en) 2009-05-05 2010-11-11 Paul A. Lipari System, method and computer readable medium for web crawling
US9135249B2 (en) * 2009-05-29 2015-09-15 Xerox Corporation Number sequences detection systems and methods
US8627203B2 (en) * 2010-02-25 2014-01-07 Adobe Systems Incorporated Method and apparatus for capturing, analyzing, and converting scripts
US8311331B2 (en) * 2010-03-09 2012-11-13 Microsoft Corporation Resolution adjustment of an image that includes text undergoing an OCR process
US8977955B2 (en) * 2010-03-25 2015-03-10 Microsoft Technology Licensing, Llc Sequential layout builder architecture
US8949711B2 (en) * 2010-03-25 2015-02-03 Microsoft Corporation Sequential layout builder
AU2011248243B2 (en) * 2010-05-03 2015-03-26 Perkinelmer Informatics, Inc. Method and apparatus for processing documents to identify chemical structures
US9251123B2 (en) * 2010-11-29 2016-02-02 Hewlett-Packard Development Company, L.P. Systems and methods for converting a PDF file
US8549399B2 (en) * 2011-01-18 2013-10-01 Apple Inc. Identifying a selection of content in a structured document
US8380753B2 (en) * 2011-01-18 2013-02-19 Apple Inc. Reconstruction of lists in a document
US9690770B2 (en) 2011-05-31 2017-06-27 Oracle International Corporation Analysis of documents using rules
AU2012281160B2 (en) 2011-07-11 2017-09-21 Paper Software LLC System and method for processing document
AU2012282688B2 (en) * 2011-07-11 2017-08-17 Paper Software LLC System and method for processing document
EP2732381A4 (en) 2011-07-11 2015-10-21 Paper Software LLC System and method for searching a document
WO2013009904A1 (en) 2011-07-11 2013-01-17 Paper Software LLC System and method for processing document
US9280525B2 (en) * 2011-09-06 2016-03-08 Go Daddy Operating Company, LLC Method and apparatus for forming a structured document from unstructured information
US8881002B2 (en) 2011-09-15 2014-11-04 Microsoft Corporation Trial based multi-column balancing
US8850305B1 (en) * 2011-12-20 2014-09-30 Google Inc. Automatic detection and manipulation of calls to action in web pages
US9047533B2 (en) * 2012-02-17 2015-06-02 Palo Alto Research Center Incorporated Parsing tables by probabilistic modeling of perceptual cues
US9977876B2 (en) 2012-02-24 2018-05-22 Perkinelmer Informatics, Inc. Systems, methods, and apparatus for drawing chemical structures using touch and gestures
JP5984439B2 (en) * 2012-03-12 2016-09-06 キヤノン株式会社 Image display device and image display method
US9384172B2 (en) 2012-07-06 2016-07-05 Microsoft Technology Licensing, Llc Multi-level list detection engine
US9632990B2 (en) * 2012-07-19 2017-04-25 Infosys Limited Automated approach for extracting intelligence, enriching and transforming content
US9280520B2 (en) 2012-08-02 2016-03-08 American Express Travel Related Services Company, Inc. Systems and methods for semantic information retrieval
US9516089B1 (en) * 2012-09-06 2016-12-06 Locu, Inc. Identifying and processing a number of features identified in a document to determine a type of the document
US9483740B1 (en) 2012-09-06 2016-11-01 Go Daddy Operating Company, LLC Automated data classification
US10013488B1 (en) * 2012-09-26 2018-07-03 Amazon Technologies, Inc. Document analysis for region classification
US20140101544A1 (en) * 2012-10-08 2014-04-10 Microsoft Corporation Displaying information according to selected entity type
KR101319966B1 (en) * 2012-11-12 2013-10-18 한국과학기술정보연구원 Apparatus and method for converting format of electric document
US9535583B2 (en) 2012-12-13 2017-01-03 Perkinelmer Informatics, Inc. Draw-ahead feature for chemical structure drawing applications
US8854361B1 (en) 2013-03-13 2014-10-07 Cambridgesoft Corporation Visually augmenting a graphical rendering of a chemical structure representation or biological sequence representation with multi-dimensional information
WO2014163749A1 (en) 2013-03-13 2014-10-09 Cambridgesoft Corporation Systems and methods for gesture-based sharing of data between separate electronic devices
US9430127B2 (en) 2013-05-08 2016-08-30 Cambridgesoft Corporation Systems and methods for providing feedback cues for touch screen interface interaction with chemical and biological structure drawing applications
US9751294B2 (en) 2013-05-09 2017-09-05 Perkinelmer Informatics, Inc. Systems and methods for translating three dimensional graphic molecular models to computer aided design format
CN104517106B (en) * 2013-09-29 2017-11-28 北大方正集团有限公司 A kind of list recognition methods and system
US10031836B2 (en) * 2014-06-16 2018-07-24 Ca, Inc. Systems and methods for automatically generating message prototypes for accurate and efficient opaque service emulation
US10275458B2 (en) * 2014-08-14 2019-04-30 International Business Machines Corporation Systematic tuning of text analytic annotators with specialized information
US10652739B1 (en) 2014-11-14 2020-05-12 United Services Automobile Association (Usaa) Methods and systems for transferring call context
US9648164B1 (en) 2014-11-14 2017-05-09 United Services Automobile Association (“USAA”) System and method for processing high frequency callers
US10360294B2 (en) * 2015-04-26 2019-07-23 Sciome, LLC Methods and systems for efficient and accurate text extraction from unstructured documents
US9959257B2 (en) * 2016-01-08 2018-05-01 Adobe Systems Incorporated Populating visual designs with web content
CA3055172C (en) 2017-03-03 2022-03-01 Perkinelmer Informatics, Inc. Systems and methods for searching and indexing documents comprising chemical information
US10339212B2 (en) * 2017-08-14 2019-07-02 Adobe Inc. Detecting the bounds of borderless tables in fixed-format structured documents using machine learning
US10891419B2 (en) 2017-10-27 2021-01-12 International Business Machines Corporation Displaying electronic text-based messages according to their typographic features
US10572587B2 (en) * 2018-02-15 2020-02-25 Konica Minolta Laboratory U.S.A., Inc. Title inferencer
US10691936B2 (en) * 2018-06-29 2020-06-23 Konica Minolta Laboratory U.S.A., Inc. Column inferencer based on generated border pieces and column borders
US10699112B1 (en) * 2018-09-28 2020-06-30 Automation Anywhere, Inc. Identification of key segments in document images
US11036916B2 (en) * 2018-11-30 2021-06-15 International Business Machines Corporation Aligning proportional font text in same columns that are visually apparent when using a monospaced font
US10824894B2 (en) * 2018-12-03 2020-11-03 Bank Of America Corporation Document content identification utilizing the font
US11468346B2 (en) * 2019-03-29 2022-10-11 Konica Minolta Business Solutions U.S.A., Inc. Identifying sequence headings in a document
US10956731B1 (en) * 2019-10-09 2021-03-23 Adobe Inc. Heading identification and classification for a digital document
US10949604B1 (en) 2019-10-25 2021-03-16 Adobe Inc. Identifying artifacts in digital documents
US11556852B2 (en) 2020-03-06 2023-01-17 International Business Machines Corporation Efficient ground truth annotation
US11361146B2 (en) * 2020-03-06 2022-06-14 International Business Machines Corporation Memory-efficient document processing
US11494588B2 (en) 2020-03-06 2022-11-08 International Business Machines Corporation Ground truth generation for image segmentation
US11495038B2 (en) 2020-03-06 2022-11-08 International Business Machines Corporation Digital image processing
US11194953B1 (en) * 2020-04-29 2021-12-07 Indico Graphical user interface systems for generating hierarchical data extraction training dataset
US10970458B1 (en) * 2020-06-25 2021-04-06 Adobe Inc. Logical grouping of exported text blocks
US11423206B2 (en) * 2020-11-05 2022-08-23 Adobe Inc. Text style and emphasis suggestions
US20230315799A1 (en) * 2022-04-01 2023-10-05 Wipro Limited Method and system for extracting information from input document comprising multi-format information
US11907643B2 (en) * 2022-04-29 2024-02-20 Adobe Inc. Dynamic persona-based document navigation

Family Cites Families (26)

Publication number Priority date Publication date Assignee Title
EP0382321B1 (en) * 1984-11-14 1999-02-03 Canon Kabushiki Kaisha Image processing system
US5220657A (en) * 1987-12-02 1993-06-15 Xerox Corporation Updating local copy of shared data in a collaborative system
US5131053A (en) * 1988-08-10 1992-07-14 Caere Corporation Optical character recognition method and apparatus
US5159667A (en) * 1989-05-31 1992-10-27 Borrey Roland G Document identification by characteristics matching
US5701500A (en) * 1992-06-02 1997-12-23 Fuji Xerox Co., Ltd. Document processor
AU5294293A (en) * 1992-10-01 1994-04-26 Quark, Inc. Publication system management and coordination
US5848184A (en) * 1993-03-15 1998-12-08 Unisys Corporation Document page analyzer and method
JP2618832B2 (en) * 1994-06-16 1997-06-11 日本アイ・ビー・エム株式会社 Method and system for analyzing logical structure of document
US5678053A (en) * 1994-09-29 1997-10-14 Mitsubishi Electric Information Technology Center America, Inc. Grammar checker interface
JPH1063744A (en) * 1996-07-18 1998-03-06 Internatl Business Mach Corp <Ibm> Method and system for analyzing layout of document
US5956737A (en) * 1996-09-09 1999-09-21 Design Intelligence, Inc. Design engine for fitting content to a medium
US6081262A (en) * 1996-12-04 2000-06-27 Quark, Inc. Method and apparatus for generating multi-media presentations
JPH10228473A (en) * 1997-02-13 1998-08-25 Ricoh Co Ltd Document picture processing method, document picture processor and storage medium
US5999664A (en) * 1997-11-14 1999-12-07 Xerox Corporation System for searching a corpus of document images by user specified document layout components
US6343377B1 (en) * 1997-12-30 2002-01-29 Netscape Communications Corp. System and method for rendering content received via the internet and world wide web via delegation of rendering processes
US6078924A (en) * 1998-01-30 2000-06-20 Aeneid Corporation Method and apparatus for performing data collection, interpretation and analysis, in an information platform
JP3692764B2 (en) * 1998-02-25 2005-09-07 株式会社日立製作所 Structured document registration method, search method, and portable medium used therefor
US6269188B1 (en) * 1998-03-12 2001-07-31 Canon Kabushiki Kaisha Word grouping accuracy value generation
JP3696731B2 (en) * 1998-04-30 2005-09-21 株式会社日立製作所 Structured document search method and apparatus, and computer-readable recording medium recording a structured document search program
US6243501B1 (en) * 1998-05-20 2001-06-05 Canon Kabushiki Kaisha Adaptive recognition of documents using layout attributes
US6343265B1 (en) * 1998-07-28 2002-01-29 International Business Machines Corporation System and method for mapping a design model to a common repository with context preservation
US6880122B1 (en) * 1999-05-13 2005-04-12 Hewlett-Packard Development Company, L.P. Segmenting a document into regions associated with a data type, and assigning pipelines to process such regions
US6542635B1 (en) * 1999-09-08 2003-04-01 Lucent Technologies Inc. Method for document comparison and classification using document image layout
US6694053B1 (en) * 1999-12-02 2004-02-17 Hewlett-Packard Development, L.P. Method and apparatus for performing document structure analysis
US6912555B2 (en) * 2002-01-18 2005-06-28 Hewlett-Packard Development Company, L.P. Method for content mining of semi-structured documents
US20030154071A1 (en) * 2002-02-11 2003-08-14 Shreve Gregory M. Process for the document management and computer-assisted translation of documents utilizing document corpora constructed by intelligent agents

Also Published As

Publication number Publication date
MXPA04011507A (en) 2005-09-30
JP2005526314A (en) 2005-09-02
US20040006742A1 (en) 2004-01-08
NZ536775A (en) 2007-11-30
WO2003098370A3 (en) 2004-08-05
CA2486528C (en) 2010-04-27
IS7525A (en) 2004-11-11
AU2003233278A1 (en) 2003-12-02
WO2003098370A2 (en) 2003-11-27
EP1508080A2 (en) 2005-02-23

Similar Documents

Publication Publication Date Title
CA2486528A1 (en) Document structure identifier
US20170235841A1 (en) Enterprise search method and system
US7433893B2 (en) Method and system for compression indexing and efficient proximity search of text data
Li et al. The role of discourse units in near-extractive summarization
Travis et al. The SGML implementation guide: a blueprint for SGML migration
TWI536181B (en) Language identification in multilingual text
Hu et al. Title extraction from bodies of HTML documents and its application to web page retrieval
Xue et al. Web page title extraction and its application
CN112231494B (en) Information extraction method and device, electronic equipment and storage medium
Généreux et al. Introducing the reference corpus of contemporary portuguese on-line
WO2008041367A1 (en) Document searching device, document searching method, document searching program
Hering The annual report algorithm: Retrieval of financial statements and extraction of textual information
Matsuoka et al. Examination of effective features for CRF-based bibliography extraction from reference strings
Malhotra et al. Web page segmentation towards information extraction for web semantics
JP2000250908A (en) Support device for production of electronic book
Shashirekha et al. Dictionary based Amharic-Arabic cross language information retrieval
Aizawa et al. Construction of a new ACL anthology corpus for deeper analysis of scientific paper
Osman et al. Opinion search in web logs
Balaji et al. Finding related research papers using semantic and co-citation proximity analysis
KR101140263B1 (en) Method, system and computer readable recording medium for refining web based documents using text pattern extraction
Chanod et al. From legacy documents to xml: A conversion framework
Pantelia ‘Noûs, INTO CHAOS’: THE CREATION OF THE THESAURUS OF THE GREEK LANGUAGE
Tsapatsoulis Web image indexing using WICE and a learning-free language model
Rajeswari et al. Development and customization of in-house developed OCR and its evaluation
Nitu et al. Reconstructing scanned documents for full-text indexing to empower digital library services

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed

Effective date: 20220301

MKLA Lapsed

Effective date: 20200831