WO2008057474A3 - Methods and systems for analyzing data in media material having a layout - Google Patents
Methods and systems for analyzing data in media material having a layout
- Publication number
- WO2008057474A3 (PCT/US2007/023234)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- media material
- layout
- systems
- methods
- block segments
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Abstract
The present invention relates to systems and methods for analyzing media material having a layout. A media material analyzer includes a segmenter and an article composer. The segmenter identifies block segments associated with columnar body text in the media material. The article composer determines which of the identified block segments belong to one or more articles in the media material. The article composer can determine whether candidate block segments belong to a same article based on language statistics information, layout transition information, or both language statistics information and layout transition information. A system for searching media material having a layout over a network is also provided.
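The article composer's decision can be illustrated with a minimal sketch. The abstract only states that language statistics information, layout transition information, or both are used; it does not say how. Everything below is therefore an assumption made for illustration: the `BlockSegment` fields, the use of cosine similarity over word counts as the language statistic, the hand-picked layout transition scores, the weight, and the threshold are all hypothetical and are not taken from the patent.

```python
from collections import Counter
from dataclasses import dataclass
import math
import re


@dataclass
class BlockSegment:
    """A block of columnar body text (hypothetical representation)."""
    text: str
    column: int  # index of the column that contains the block
    order: int   # top-to-bottom position of the block within its column


def word_counts(text: str) -> Counter:
    """Lowercase bag-of-words counts for a block's text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def language_score(a: BlockSegment, b: BlockSegment) -> float:
    """Cosine similarity of word counts (a stand-in language statistic)."""
    ca, cb = word_counts(a.text), word_counts(b.text)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def layout_score(a: BlockSegment, b: BlockSegment) -> float:
    """Toy layout transition score; the values are illustrative guesses."""
    if a.column == b.column and b.order == a.order + 1:
        return 1.0  # b sits directly below a in the same column
    if b.column == a.column + 1 and b.order == 0:
        return 0.8  # b starts the next column
    return 0.1      # any other transition is treated as unlikely


def same_article(a: BlockSegment, b: BlockSegment,
                 weight: float = 0.5, threshold: float = 0.6) -> bool:
    """Combine both signals; weight and threshold are assumed values."""
    score = weight * language_score(a, b) + (1 - weight) * layout_score(a, b)
    return score >= threshold


def compose_articles(blocks: list[BlockSegment]) -> list[list[BlockSegment]]:
    """Greedily chain blocks in reading order into articles."""
    articles: list[list[BlockSegment]] = []
    for block in sorted(blocks, key=lambda s: (s.column, s.order)):
        if articles and same_article(articles[-1][-1], block):
            articles[-1].append(block)
        else:
            articles.append([block])
    return articles
```

With the default threshold, a favorable layout transition alone is not enough to join two blocks: a block at the top of the next column is merged only if its wording also overlaps with the previous block. Setting `weight` to 1.0 or 0.0 gives the language-only and layout-only variants that the abstract also allows.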
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/592,268 (US7801358B2) | 2006-11-03 | 2006-11-03 | Methods and systems for analyzing data in media material having layout |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2008057474A2 (en) | 2008-05-15 |
WO2008057474A3 (en) | 2008-09-12 |
Family
ID=39359793
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2007/023234 (WO2008057474A2) | 2006-11-03 | 2007-11-05 | Methods and systems for analyzing data in media material having a layout |
Country Status (8)
Country | Link |
---|---|
US (2) | US7801358B2 (en) |
EP (1) | EP2080113B1 (en) |
JP (2) | JP5134628B2 (en) |
CN (1) | CN101573705B (en) |
AU (1) | AU2007317938B2 (en) |
CA (1) | CA2668413C (en) |
IL (1) | IL198507A (en) |
WO (1) | WO2008057474A2 (en) |
Families Citing this family (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7676744B2 (en) * | 2005-08-19 | 2010-03-09 | Vistaprint Technologies Limited | Automated markup language layout |
US7584424B2 (en) * | 2005-08-19 | 2009-09-01 | Vista Print Technologies Limited | Automated product layout |
JP4977452B2 (en) * | 2006-01-24 | 2012-07-18 | 株式会社リコー | Information management apparatus, information management method, information management program, recording medium, and information management system |
US7966557B2 (en) | 2006-03-29 | 2011-06-21 | Amazon Technologies, Inc. | Generating image-based reflowable files for rendering on various sized displays |
US7810026B1 (en) * | 2006-09-29 | 2010-10-05 | Amazon Technologies, Inc. | Optimizing typographical content for transmission and display |
US7801358B2 (en) * | 2006-11-03 | 2010-09-21 | Google Inc. | Methods and systems for analyzing data in media material having layout |
US8234277B2 (en) | 2006-12-29 | 2012-07-31 | Intel Corporation | Image-based retrieval for high quality visual or acoustic rendering |
US8250469B2 (en) * | 2007-12-03 | 2012-08-21 | Microsoft Corporation | Document layout extraction |
US8392816B2 (en) * | 2007-12-03 | 2013-03-05 | Microsoft Corporation | Page classifier engine |
US8126881B1 (en) | 2007-12-12 | 2012-02-28 | Vast.com, Inc. | Predictive conversion systems and methods |
US8782516B1 (en) | 2007-12-21 | 2014-07-15 | Amazon Technologies, Inc. | Content style detection |
WO2009084554A1 (en) * | 2007-12-27 | 2009-07-09 | Nec Corporation | Text segmentation device, text segmentation method, and program |
US8572480B1 (en) | 2008-05-30 | 2013-10-29 | Amazon Technologies, Inc. | Editing the sequential flow of a page |
US8218913B1 (en) * | 2008-08-12 | 2012-07-10 | Google Inc. | Identifying a front page in media material |
US8290268B2 (en) * | 2008-08-13 | 2012-10-16 | Google Inc. | Segmenting printed media pages into articles |
US9229911B1 (en) * | 2008-09-30 | 2016-01-05 | Amazon Technologies, Inc. | Detecting continuation of flow of a page |
US9032285B2 (en) * | 2009-06-30 | 2015-05-12 | Hewlett-Packard Development Company, L.P. | Selective content extraction |
US8499236B1 (en) | 2010-01-21 | 2013-07-30 | Amazon Technologies, Inc. | Systems and methods for presenting reflowable content on a display |
US8345978B2 (en) * | 2010-03-30 | 2013-01-01 | Microsoft Corporation | Detecting position of word breaks in a textual line image |
US8385652B2 (en) | 2010-03-31 | 2013-02-26 | Microsoft Corporation | Segmentation of textual lines in an image that include western characters and hieroglyphic characters |
US8625897B2 (en) * | 2010-05-28 | 2014-01-07 | Microsoft Corporation | Foreground and background image segmentation |
US8682075B2 (en) * | 2010-12-28 | 2014-03-25 | Hewlett-Packard Development Company, L.P. | Removing character from text in non-image form where location of character in image of text falls outside of valid content boundary |
CN102841900B (en) * | 2011-06-23 | 2016-01-20 | 腾讯科技(深圳)有限公司 | page processing method and device |
US9177199B2 (en) | 2011-08-03 | 2015-11-03 | Eastman Kodak Company | Semantic magazine pages |
US10025979B2 (en) * | 2012-01-23 | 2018-07-17 | Microsoft Technology Licensing, Llc | Paragraph property detection and style reconstruction engine |
US9372841B2 (en) * | 2012-02-27 | 2016-06-21 | Bert A. Silich | 4-dimensional geometric reading |
DE102012102797B4 (en) * | 2012-03-30 | 2017-08-10 | Beyo Gmbh | Camera-based mobile device for converting a document based on captured images into a format optimized for display on the camera-based mobile device |
US9946690B2 (en) | 2012-07-06 | 2018-04-17 | Microsoft Technology Licensing, Llc | Paragraph alignment detection and region-based section reconstruction |
US9852215B1 (en) * | 2012-09-21 | 2017-12-26 | Amazon Technologies, Inc. | Identifying text predicted to be of interest |
USD754162S1 (en) * | 2013-01-04 | 2016-04-19 | Level 3 Communications, Llc | Display screen or portion thereof with graphical user interface |
USD768659S1 (en) * | 2013-01-04 | 2016-10-11 | Level 3 Communications, Llc | Display screen or portion thereof with graphical user interface |
US10007946B1 (en) | 2013-03-07 | 2018-06-26 | Vast.com, Inc. | Systems, methods, and devices for measuring similarity of and generating recommendations for unique items |
US9104718B1 (en) | 2013-03-07 | 2015-08-11 | Vast.com, Inc. | Systems, methods, and devices for measuring similarity of and generating recommendations for unique items |
US9465873B1 (en) | 2013-03-07 | 2016-10-11 | Vast.com, Inc. | Systems, methods, and devices for identifying and presenting identifications of significant attributes of unique items |
US9830635B1 (en) | 2013-03-13 | 2017-11-28 | Vast.com, Inc. | Systems, methods, and devices for determining and displaying market relative position of unique items |
US9195782B2 (en) | 2013-06-26 | 2015-11-24 | Siemens Product Lifecycle Management Software Inc. | System and method for combining input tools into a composite layout |
US10296570B2 (en) * | 2013-10-25 | 2019-05-21 | Palo Alto Research Center Incorporated | Reflow narrative text objects in a document having text objects and graphical objects, wherein text object are classified as either narrative text object or annotative text object based on the distance from a left edge of a canvas of display |
US10127596B1 (en) | 2013-12-10 | 2018-11-13 | Vast.com, Inc. | Systems, methods, and devices for generating recommendations of unique items |
US11080777B2 (en) | 2014-03-31 | 2021-08-03 | Monticello Enterprises LLC | System and method for providing a social media shopping experience |
US10303745B2 (en) * | 2014-06-16 | 2019-05-28 | Hewlett-Packard Development Company, L.P. | Pagination point identification |
US11144994B1 (en) | 2014-08-18 | 2021-10-12 | Street Diligence, Inc. | Computer-implemented apparatus and method for providing information concerning a financial instrument |
US10474702B1 (en) | 2014-08-18 | 2019-11-12 | Street Diligence, Inc. | Computer-implemented apparatus and method for providing information concerning a financial instrument |
WO2016122556A1 (en) * | 2015-01-30 | 2016-08-04 | Hewlett-Packard Development Company, L.P. | Identification of a breakpoint based on a correlation measurement |
US10229314B1 (en) * | 2015-09-30 | 2019-03-12 | Groupon, Inc. | Optical receipt processing |
US10417516B2 (en) | 2017-08-24 | 2019-09-17 | Vastec, Inc. | System and method for preprocessing images to improve OCR efficacy |
US10268704B1 (en) | 2017-10-12 | 2019-04-23 | Vast.com, Inc. | Partitioned distributed database systems, devices, and methods |
FI20176151A1 (en) | 2017-12-22 | 2019-06-23 | Vuolearning Ltd | A heuristic method for analyzing content of an electronic document |
CN108959254A (en) * | 2018-06-29 | 2018-12-07 | 中教汇据(北京)科技有限公司 | A kind of analytic method for article content in periodical pdf document |
US11308268B2 (en) * | 2019-10-10 | 2022-04-19 | International Business Machines Corporation | Semantic header detection using pre-trained embeddings |
US11556610B2 (en) | 2019-11-08 | 2023-01-17 | Accenture Global Solutions Limited | Content alignment |
US20220335240A1 (en) * | 2021-04-15 | 2022-10-20 | Microsoft Technology Licensing, Llc | Inferring Structure Information from Table Images |
US20230274081A1 (en) * | 2022-02-07 | 2023-08-31 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for annotating line charts in the wild |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5848184A (en) * | 1993-03-15 | 1998-12-08 | Unisys Corporation | Document page analyzer and method |
US6577763B2 (en) * | 1997-11-28 | 2003-06-10 | Fujitsu Limited | Document image recognition apparatus and computer-readable storage medium storing document image recognition program |
US20070050406A1 (en) * | 2005-08-26 | 2007-03-01 | At&T Corp. | System and method for searching and analyzing media content |
US20070291288A1 (en) * | 2006-06-15 | 2007-12-20 | Richard John Campbell | Methods and Systems for Segmenting a Digital Image into Regions |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2902097B2 (en) * | 1990-10-29 | 1999-06-07 | 沖電気工業株式会社 | Information processing device and character recognition device |
US5335290A (en) * | 1992-04-06 | 1994-08-02 | Ricoh Corporation | Segmentation of text, picture and lines of a document image |
JPH06203020A (en) * | 1992-12-29 | 1994-07-22 | Hitachi Ltd | Method an device for recognizing and generating text format |
JP3302147B2 (en) * | 1993-05-12 | 2002-07-15 | 株式会社リコー | Document image processing method |
JPH08180131A (en) * | 1994-12-21 | 1996-07-12 | Canon Inc | Image processing method |
US5805731A (en) * | 1995-08-08 | 1998-09-08 | Apple Computer, Inc. | Adaptive statistical classifier which provides reliable estimates or output classes having low probabilities |
US5848186A (en) * | 1995-08-11 | 1998-12-08 | Canon Kabushiki Kaisha | Feature extraction system for identifying text within a table image |
US7437351B2 (en) * | 1997-01-10 | 2008-10-14 | Google Inc. | Method for searching media |
US6173073B1 (en) * | 1998-01-05 | 2001-01-09 | Canon Kabushiki Kaisha | System for analyzing table images |
US6941321B2 (en) * | 1999-01-26 | 2005-09-06 | Xerox Corporation | System and method for identifying similarities among objects in a collection |
ATE245278T1 (en) * | 1999-11-04 | 2003-08-15 | Meltec Multi Epitope Ligand Te | METHOD FOR AUTOMATICALLY ANALYZING MICROSCOPE IMAGEMENTS |
JP4608740B2 (en) * | 2000-02-21 | 2011-01-12 | ソニー株式会社 | Information processing apparatus and method, and program storage medium |
US7447771B1 (en) * | 2000-05-26 | 2008-11-04 | Newsstand, Inc. | Method and system for forming a hyperlink reference and embedding the hyperlink reference within an electronic version of a paper |
US6735335B1 (en) * | 2000-05-30 | 2004-05-11 | Microsoft Corporation | Method and apparatus for discriminating between documents in batch scanned document files |
AU2000278962A1 (en) * | 2000-10-19 | 2002-04-29 | Copernic.Com | Text extraction method for html pages |
US7376893B2 (en) * | 2002-12-16 | 2008-05-20 | Palo Alto Research Center Incorporated | Systems and methods for sentence based interactive topic-based text summarization |
US7756871B2 (en) * | 2004-10-13 | 2010-07-13 | Hewlett-Packard Development Company, L.P. | Article extraction |
US7624093B2 (en) * | 2006-01-25 | 2009-11-24 | Fameball, Inc. | Method and system for automatic summarization and digest of celebrity news |
US7792353B2 (en) * | 2006-10-31 | 2010-09-07 | Hewlett-Packard Development Company, L.P. | Retraining a machine-learning classifier using re-labeled training samples |
US7702680B2 (en) * | 2006-11-02 | 2010-04-20 | Microsoft Corporation | Document summarization by maximizing informative content words |
US7801358B2 (en) * | 2006-11-03 | 2010-09-21 | Google Inc. | Methods and systems for analyzing data in media material having layout |
2006
- 2006-11-03: US, application US11/592,268, publication US7801358B2, status: Expired - Fee Related (not active)
- 2006-12-22: US, application US11/644,009, publication US7899249B2, status: Active

2007
- 2007-11-05: CN, application CN2007800489054A, publication CN101573705B, status: Active
- 2007-11-05: WO, application PCT/US2007/023234, publication WO2008057474A2, status: Application Filing (active)
- 2007-11-05: AU, application AU2007317938A, publication AU2007317938B2, status: Active
- 2007-11-05: JP, application JP2009535346A, publication JP5134628B2, status: Active
- 2007-11-05: CA, application CA2668413A, publication CA2668413C, status: Active
- 2007-11-05: EP, application EP07861696.8A, publication EP2080113B1, status: Active

2009
- 2009-05-03: IL, application IL198507A, publication IL198507A, status: IP Right Grant (active)

2012
- 2012-03-26: JP, application JP2012069249A, publication JP2012123845A, status: Withdrawn (not active)
Non-Patent Citations (1)
Title |
---|
BRANTS T. ET AL.: "Topic-Based Document Segmentation with Probabilistic Latent Semantic Analysis", CIKM'02, ACM, 4 November 2002 (2002-11-04) - 9 November 2002 (2002-11-09), pages 211 - 218 * |
Also Published As
Publication number | Publication date |
---|---|
US7801358B2 (en) | 2010-09-21 |
JP5134628B2 (en) | 2013-01-30 |
WO2008057474A2 (en) | 2008-05-15 |
CN101573705B (en) | 2011-05-11 |
JP2010509656A (en) | 2010-03-25 |
US20080107337A1 (en) | 2008-05-08 |
CA2668413C (en) | 2015-06-23 |
CN101573705A (en) | 2009-11-04 |
AU2007317938A1 (en) | 2008-05-15 |
EP2080113B1 (en) | 2018-09-19 |
IL198507A (en) | 2014-06-30 |
CA2668413A1 (en) | 2008-05-15 |
US7899249B2 (en) | 2011-03-01 |
EP2080113A2 (en) | 2009-07-22 |
EP2080113A4 (en) | 2016-08-10 |
US20080107338A1 (en) | 2008-05-08 |
IL198507A0 (en) | 2010-02-17 |
JP2012123845A (en) | 2012-06-28 |
AU2007317938B2 (en) | 2011-04-28 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2008057474A3 (en) | Methods and systems for analyzing data in media material having a layout | |
WO2007019311A3 (en) | Systems for and methods of finding relevant documents by analyzing tags | |
WO2008144964A8 (en) | Detecting name entities and new words | |
WO2006084102A3 (en) | Recommender system for identifying a new set of media items responsive to an input set of media items and knowledge base metrics | |
WO2008036150A3 (en) | Notification system for source code discovery | |
WO2008039542A3 (en) | System and method of ad-hoc analysis of data | |
WO2008137086A3 (en) | Method and system for disambiguating informational objects | |
WO2006124243A3 (en) | System and method for utilizing the content of an online conversation to select advertising content and/or other relevant information for display | |
SG121934A1 (en) | Systems, methods, and interfaces for providing personalized search and information access | |
WO2011146391A3 (en) | Data collection, tracking, and analysis for multiple media including impact analysis and influence tracking | |
WO2006081474A3 (en) | Multi-path simultaneous xpath evaluation over data streams | |
WO2010002423A3 (en) | System and method of leveraging proximity data in a web-based socially-enabled knowledge networking environment | |
WO2007134293A3 (en) | Wordspotting system | |
WO2007106806A3 (en) | Methods and apparatus for using radar to monitor audiences in media environments | |
WO2005006283A3 (en) | Rendering advertisements with documents having one or more topics using user topic interest information | |
WO2007056344A3 (en) | Techiques for model optimization for statistical pattern recognition | |
BRPI0515950A (en) | systems and methods of providing information related to a document, graphical user interface, and computer readable media | |
WO2006113281A3 (en) | System and method for measuring display compliance | |
WO2008008339A3 (en) | System and method for analyzing web content | |
SE0103361D0 (en) | Object oriented data processing | |
WO2005098714A3 (en) | Systems and methods for determining user actions | |
WO2007044241A3 (en) | A data container for association with media items | |
WO2007139762A3 (en) | Methods and apparatus for managing retention of information assets | |
WO2007143614A3 (en) | Techniques to associate media information with related information | |
WO2007133625A3 (en) | Multi-lingual information retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07839926; Country of ref document: EP; Kind code of ref document: A2 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 07839926; Country of ref document: EP; Kind code of ref document: A2 |