|Publication number||US6968331 B2|
|Application number||US 10/055,586|
|Publication date||Nov 22, 2005|
|Filing date||Jan 22, 2002|
|Priority date||Jan 22, 2002|
|Also published as||US20030140307|
|Publication number||055586, 10055586, US 6968331 B2, US 6968331B2, US-B2-6968331, US6968331 B2, US6968331B2|
|Inventors||Ziv Bar-Yossef, Sridhar Rajagopalan|
|Original Assignee||International Business Machines Corporation|
1. Field of the Invention
This invention generally relates to the field of computer based search systems, and more particularly relates to a system and method for improving data quality in large hyperlinked text databases using pagelets and templates, and to the use of the cleaned data in hypertext information retrieval algorithms.
2. Description of Related Art
The explosive growth of content available on the World-Wide-Web has led to an increased demand and opportunity for tools to organize, search and effectively use the available information. People are increasingly finding it difficult to sort through the great mass of content available. New classes of information retrieval algorithms—link-based information retrieval algorithms—have been proposed and show increasing promise in addressing the problems caused by this information overload.
Three important principles (or assumptions)—collectively called Hypertext IR Principles—underlie most, if not all, link-based methods in information retrieval.
1. Relevant Linkage Principle: Links confer authority; by placing a link from a page p to a page q, the author of p recommends q or at least acknowledges the relevance of q to the subject of p.
2. Topical Unity Principle: Documents co-cited within the same document are related to each other.
3. Lexical Affinity Principle: Proximity of text and links within a page is a measure of the relevance of one to the other.
Each of these principles, while generally true, is frequently and systematically violated on the web. Moreover, these violations have an adverse impact on the quality of results produced by linkage based search and mining algorithms. This necessitates the use of several heuristic methods to deal with unreliable data that degrades performance and overall quality of searching and data mining.
Therefore a need exists to overcome the problems with the prior art as discussed above, and particularly for a method of cleaning the data prior to a search and eliminating violations of hypertext information retrieval principles.
According to a preferred embodiment of the present invention, a computing system and method clean a set of text documents to minimize violations of Hypertext IR Principles as a preparation step towards running an information retrieval/mining system. The cleaning process includes first, decomposing each page of the set of text documents into one or more pagelets; second, identifying possible templates; and finally, eliminating the templates from the data. Traditional IR search and mining algorithms can then be used to process the remaining data, as opposed to the original pages, to provide more precise results.
The present invention, according to a preferred embodiment, overcomes problems with the prior art by “cleaning” the underlying data so that violations of Hypertext Information Retrieval (IR) Principles are minimized, then applying conventional IR algorithms. This results in higher precision, better scalability, and more understandable algorithms for link-based information retrieval.
A preferred embodiment of the present invention presents a formal framework and introduces new methods for unifying a large number of these data cleaning heuristics. The violations of the hypertext information retrieval principles result in significant performance degradations in all linkage based search and mining algorithms. Therefore, eliminating these violations in a preprocessing step will result in a uniform improvement in quality across the board.
The web contains frequent violations of the Hypertext IR Principles. These violations are not random, but rather happen for systematic reasons. The web contains many navigational links (links that help navigating inside a web-site), download links (links to download pages, for instance, those which point to a popular Internet browser download page), links which point to business partners, links which are introduced to deliberately mislead link-based search algorithms, and paid advertisement links. Each of these auxiliary links violates the Relevant Linkage Principle. In algorithmic terms, these are a significant source of noise that search algorithms have to combat, and which can sometimes result in non-relevant pages being ranked as highly authoritative. An example of this would be that a highly popular, but very broad, homepage (e.g., Yahoo!) is ranked as a highly authoritative page regardless of the query because many pages contain a pointer to it.
Another common violation occurs from pages that cater to a mixture of topics. Bookmark pages and personal homepages are particularly frequent instances of this kind of violation. For example, suppose that a colleague is a fan of professional football, as well as an authority on finite model theory. Suppose further that these two interests are obvious from his homepage. Some linkage based information retrieval tools will then incorrectly surmise that these two broad topics are related. Since the web has a significantly larger amount of information about professional football than it has about finite model theory, it is possible, even probable, that a link-based search for resources about finite model theory returns pages about pro football.
Another issue arises from the actual construction of the web pages. HTML is a linearization of a document; however, the true structure is more like a tree. For constructs such as a two dimensional table, trees are not effective descriptions of document structure either. Thus, lexical affinity should be judged on the real structure of the document, not on the particular linearization of it as determined by the conventions used in HTML. Additionally, there are many instances of lists that are arranged in alphabetical order within a page. Assuming that links that are close to each other on such a list are more germane to each other than otherwise would be wrong.
The proliferation of the use of templates in creating web pages has also been a source of Hypertext IR Principles violations. A template is a pre-prepared master HTML shell page that is used as a basis for composing new web pages. The content of the new page is plugged into the template shell, resulting in a collection of pages that share a common look and feel. Templates can spread over several sister sites and contain links to other web sites. Since all pages that conform to a common template share many links, it is clear that these links cannot be relevant to the specific content on these pages.
According to a preferred embodiment of the invention, each page from a collection of documents is decomposed into one or more pagelets. These pagelets are screened to eliminate the ones that belong to templates. Traditional IR algorithms can then be used on the remaining pagelets to return a more precise result set. The collection of documents may reside locally; be located on an internal LAN; or may be the collection or a subset of the collection of documents located on the World Wide Web.
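The three stages described above (decompose, identify templates, eliminate) can be illustrated with a minimal sketch. It is not the patented implementation: the `Pagelet` type, the blank-line `decompose` rule, and the "same text owned by at least `min_pages` pages" template test are hypothetical stand-ins chosen only to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pagelet:
    page_id: str  # page that owns this pagelet
    text: str     # pagelet content


def decompose(page_id, body):
    """Hypothetical splitter: blank-line-separated blocks become pagelets."""
    blocks = (block.strip() for block in body.split("\n\n"))
    return [Pagelet(page_id, block) for block in blocks if block]


def find_template_texts(pagelets, min_pages=2):
    """Hypothetical test: text owned by >= min_pages distinct pages is a template."""
    owners = defaultdict(set)
    for pagelet in pagelets:
        owners[pagelet.text].add(pagelet.page_id)
    return {text for text, pages in owners.items() if len(pages) >= min_pages}


def clean(pages):
    """Decompose every page, then drop the pagelets that belong to templates."""
    pagelets = [pl for page_id, body in pages.items()
                for pl in decompose(page_id, body)]
    templates = find_template_texts(pagelets)
    return [pl for pl in pagelets if pl.text not in templates]


pages = {
    "a.html": "Home | Products | Contact\n\nFinite model theory notes",
    "b.html": "Home | Products | Contact\n\nFootball scores",
}
cleaned = clean(pages)
```

In this toy input the shared navigation block is removed from both pages, and only the page-specific pagelets are handed on to a downstream retrieval algorithm.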
Each computer system 102 may include, inter alia, one or more computers and at least a computer readable medium 108. The computers preferably include means for reading and/or writing to the computer readable medium. The computer readable medium allows a computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium, for example, may include non-volatile memory, such as Floppy, ROM, Flash memory, Disk drive memory, CD-ROM, and other permanent storage. It is useful, for example, for transporting information, such as data and computer instructions, between computer systems.
The computer system 102, according to the present example, includes a controller/processor 216 (shown in
Glue software 214 may include drivers, stacks, and low level application programming interfaces (API's) and provides basic functional components for use by the operating system platform 212 and by compatible applications that run on the operating system platform 212 for managing communications with resources and processes in the computing system 102.
The information retrieval tool 110 can work with a generic data gathering application 306 (such as a web crawler) and a generic hypertext information retrieval application 308 (such as a search engine, a similar page finder, a focused crawler, or a page classifier). The data gathering application 306 fetches a collection of hypertext documents 402. These documents can be fetched from the World-Wide Web 106, from a local intranet network, or from any other source. The documents are stored on database tables 408. The information retrieval application 308 processes the collection of hypertext documents 402 stored on the database tables 408, and based on a user's query 404 extracts results 406 from this collection matching the query. For example, when the information retrieval application 308 is a search engine, the application finds all the documents in the collection 402 that match the query terms given by the user.
The data cleaning application 112 processes the collection of hypertext documents 402 stored on the database tables, after they were fetched by the data gathering application 306 and before the information retrieval application 308 extracts results from them. The data cleaning application 112 assumes the data gathering application 306 stores all the pages it fetches on the PAGES database table 410 and all the links between these pages in the LINKS database table 412. The data cleaning application 112 stores the clean set of pages and pagelets on the PAGES 410, LINKS 412, and PAGELETS 414 tables. The information retrieval application 308 thus gets the clean data from these tables. An exemplary scheme for the database tables 408 used by the information retrieval tool is depicted in
An exemplary HTML page, illustrating the concept of the use of pagelets according to a preferred embodiment of the present invention, is shown in
A preferred embodiment of the template identifier 314 is as follows. A template is a collection of pagelets T satisfying the following two requirements:
(1) all the pagelets in T are identical or almost identical; and
(2) every two pages owning pagelets in T are reachable one from the other via other pages also owning pagelets in T; the path connecting each such two pages can be undirected.
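The two requirements can be read as "group by similarity, then split each group into connected components." The sketch below follows that reading under stated assumptions: pagelets arrive as hypothetical `(pagelet_id, page_id, shingle)` tuples, equality of shingle values stands in for "identical or almost identical," and the `min_pages` cut-off for reporting a component is an illustrative choice that the definition above does not specify.

```python
from collections import defaultdict


def find_templates(pagelets, links, min_pages=2):
    """pagelets: iterable of (pagelet_id, page_id, shingle) tuples.
    links: iterable of (source_page, target_page) pairs.
    Returns a list of sets of pagelet ids, one set per candidate template."""
    # Requirement (1): group pagelets that share a shingle value.
    by_shingle = defaultdict(list)
    for pagelet_id, page_id, shingle in pagelets:
        by_shingle[shingle].append((pagelet_id, page_id))

    # Links are treated as undirected, as the definition permits.
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    templates = []
    for group in by_shingle.values():
        pagelets_on = defaultdict(list)
        for pagelet_id, page_id in group:
            pagelets_on[page_id].append(pagelet_id)
        unvisited = set(pagelets_on)
        # Requirement (2): connected components, walking only through
        # pages that themselves own a pagelet of this group.
        while unvisited:
            start = unvisited.pop()
            stack, component = [start], {start}
            while stack:
                page = stack.pop()
                for nxt in neighbours[page] & unvisited:
                    unvisited.discard(nxt)
                    component.add(nxt)
                    stack.append(nxt)
            if len(component) >= min_pages:
                templates.append(
                    {pid for page in component for pid in pagelets_on[page]}
                )
    return templates


pagelets = [("p1", "A", 7), ("p2", "B", 7), ("p3", "C", 7),
            ("p4", "D", 7), ("p5", "A", 9)]
links = [("A", "B"), ("C", "B"), ("D", "E")]
templates = find_templates(pagelets, links)
```

Here pages A, B, and C form one template because they share a shingle and are linked through one another. Page D shares the shingle but is only connected through page E, which owns no matching pagelet, so it is left out.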
A preferred embodiment uses the concept of shingling, as taught by U.S. Pat. No. 6,119,124, “Method for Clustering Closely Resembling Data Objects,” filed Mar. 26, 1998, the entire teachings of which are hereby incorporated by reference, and applies it to cluster similar pagelets. A shingle is a hash value that is insensitive to small perturbations (i.e. two strings that are almost identical get the same shingle value with a high probability, whereas two very different strings have a low probability of receiving the same shingle value). A shingle calculator 318 calculates shingle values for each pagelet in the PAGELETS table 414 and also for each page in the PAGES table 410.
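To make the "insensitive to small perturbations" property concrete, the following sketch computes a single min-hash over sliding word windows. This is a swapped-in illustration, not necessarily the method of the incorporated patent: the function name `shingle`, the window size `k`, the lower-casing and whitespace normalization, and the use of SHA-1 are all assumptions. Under the usual idealization of the hash as random, two texts receive the same value with probability equal to the fraction of windows they share out of all windows that occur in either text.

```python
import hashlib


def shingle(text, k=4):
    """Return the smallest 64-bit hash over all k-word windows of the text.

    Case and whitespace are normalized first (an assumption of this sketch).
    """
    words = text.lower().split()
    if len(words) < k:
        windows = [" ".join(words)]  # short text: hash it as a single window
    else:
        windows = [" ".join(words[i:i + k])
                   for i in range(len(words) - k + 1)]
    return min(
        int.from_bytes(hashlib.sha1(w.encode("utf-8")).digest()[:8], "big")
        for w in windows
    )


a = shingle("Home  Products\nSupport Contact Us")
b = shingle("home products support contact us")
```

The two inputs differ only in case and spacing, so they normalize to the same windows and `a == b`. For texts that differ in a few words the values agree only with high probability, which is why a production system would typically combine several such hashes.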
The exemplary operational sequence shown in
The template identifier 314, at step 704, then sorts the pagelets by their shingle into clusters. Each such cluster contains pagelets sharing the same shingle, and therefore represents a set of pagelets that are identical or almost identical. The template identifier 314 enumerates the clusters at step 706, and outputs the pagelets belonging to each cluster at step 708.
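The sort, enumerate, and output steps can be sketched with a sort followed by a single grouping pass. The `(pagelet_id, shingle)` row shape and the function name `cluster_by_shingle` are hypothetical; the source only states that pagelets are sorted by shingle into clusters.

```python
from itertools import groupby
from operator import itemgetter


def cluster_by_shingle(rows):
    """rows: iterable of (pagelet_id, shingle) pairs.
    Returns a list of (shingle, [pagelet_id, ...]) clusters."""
    # Sort so that pagelets with equal shingles become adjacent (cf. step 704).
    ordered = sorted(rows, key=itemgetter(1))
    # Walk the sorted rows once, emitting one cluster per distinct shingle
    # (cf. steps 706 and 708).
    return [
        (value, [pagelet_id for pagelet_id, _ in group])
        for value, group in groupby(ordered, key=itemgetter(1))
    ]


rows = [("p1", 42), ("p2", 17), ("p3", 42), ("p4", 17), ("p5", 99)]
clusters = cluster_by_shingle(rows)
for value, members in clusters:
    print(value, members)
```

Sorting first means the grouping pass needs only constant extra state per cluster, which mirrors how the same operation would be expressed as an ordered scan over a database table.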
The present invention can be realized in hardware, software, or a combination of hardware and software. A system according to a preferred embodiment of the present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program means or computer program in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code, or notation; and b) reproduction in a different material form.
A computer system may include, inter alia, one or more computers and at least a computer readable medium, allowing a computer system to read data, instructions, messages or message packets, and other computer readable information from the computer readable medium. The computer readable medium may include non-volatile memory, such as ROM, Flash memory, Disk drive memory, CD-ROM, and other permanent storage. Additionally, a computer readable medium may include, for example, volatile storage such as RAM, buffers, cache memory, and network circuits. Furthermore, the computer readable medium may comprise computer readable information in a transitory state medium such as a network link and/or a network interface, including a wired network or a wireless network, that allow a computer system to read such computer readable information.
Although specific embodiments of the invention have been disclosed, those having ordinary skill in the art will understand that changes can be made to the specific embodiments without departing from the spirit and scope of the invention. The scope of the invention is not to be restricted, therefore, to the specific embodiments, and it is intended that the appended claims cover any and all such applications, modifications, and embodiments within the scope of the present invention.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5909677 *||Jun 18, 1996||Jun 1, 1999||Digital Equipment Corporation||Method for determining the resemblance of documents|
|US6119124 *||Mar 26, 1998||Sep 12, 2000||Digital Equipment Corporation||Method for clustering closely resembling data objects|
|US6138113 *||Aug 10, 1998||Oct 24, 2000||Altavista Company||Method for identifying near duplicate pages in a hyperlinked database|
|US6230155 *||Nov 23, 1998||May 8, 2001||Altavista Company||Method for determining the resemblance of documents|
|US6349296 *||Aug 21, 2000||Feb 19, 2002||Altavista Company||Method for clustering closely resembling data objects|
|US6614764 *||Feb 1, 2000||Sep 2, 2003||Hewlett-Packard Development Company, L.P.||Bridged network topology acquisition|
|US6615209 *||Oct 6, 2000||Sep 2, 2003||Google, Inc.||Detecting query-specific duplicate documents|
|US6658423 *||Jan 24, 2001||Dec 2, 2003||Google, Inc.||Detecting duplicate and near-duplicate files|
|US6665837 *||Aug 10, 1998||Dec 16, 2003||Overture Services, Inc.||Method for identifying related pages in a hyperlinked database|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US7389471 *||Jun 11, 2003||Jun 17, 2008||Microsoft Corporation||Utilizing common layout and functionality of multiple web pages|
|US7698317||Apr 20, 2007||Apr 13, 2010||Yahoo! Inc.||Techniques for detecting duplicate web pages|
|US7792821||Jun 29, 2006||Sep 7, 2010||Microsoft Corporation||Presentation of structured search results|
|US7831581 *||Feb 28, 2005||Nov 9, 2010||Radix Holdings, Llc||Enhanced search|
|US8583420 *||Apr 15, 2008||Nov 12, 2013||The European Community, Represented By The European Commission||Method for the extraction of relation patterns from articles|
|US20040255233 *||Jun 11, 2003||Dec 16, 2004||Croney Joseph K.||Utilizing common layout and functionality of multiple web pages|
|US20080005118 *||Jun 29, 2006||Jan 3, 2008||Microsoft Corporation||Presentation of structured search results|
|US20080263026 *||Apr 20, 2007||Oct 23, 2008||Amit Sasturkar||Techniques for detecting duplicate web pages|
|US20100138216 *||Apr 15, 2008||Jun 3, 2010||The European Community, Represented By The European Commission||Method for the extraction of relation patterns from articles|
|US20120072817 *||Sep 16, 2011||Mar 22, 2012||Oracle International Corporation||Enterprise application workcenter|
|U.S. Classification||1/1, 715/207, 715/234, 707/E17.013, 707/999.002|
|Cooperative Classification||Y10S707/99932, G06F17/30882|
|Jan 22, 2002||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BAR-YOSSEF, ZIV;RAJAGOPALAN, SRIDHAR;REEL/FRAME:012522/0894;SIGNING DATES FROM 20011017 TO 20020115
|Apr 17, 2009||FPAY||Fee payment|
Year of fee payment: 4
|Jul 5, 2013||REMI||Maintenance fee reminder mailed|
|Oct 11, 2013||FPAY||Fee payment|
Year of fee payment: 8
|Oct 11, 2013||SULP||Surcharge for late payment|
Year of fee payment: 7