|Publication number||US20020078087 A1|
|Application number||US 09/737,946|
|Publication date||Jun 20, 2002|
|Filing date||Dec 18, 2000|
|Priority date||Dec 18, 2000|
|Also published as||US7078766, US20020074599|
|Original Assignee||Stone Alan E.|
 The invention generally relates to web pages, browsers and search engines, and in particular, to a content indicator for accelerated detection of a changed web page.
 Today, web pages are commonly stored on web servers. A web server is a server that stores or provides web pages, typically in Hypertext Markup Language (HTML) format, and makes these web pages available to clients upon request, such as in response to a “Get” request using Hypertext Transfer Protocol (HTTP); see HTTP/1.1, Request For Comments (RFC) 2616, June 1999. A client may be any software program that may request access to the web pages. Two common web clients are web browsers and search engine indexers. A web browser is a program which can retrieve web pages from remote web servers and display them for the user.
 The Internet is typically indexed via search engine indexers, also known as web “spiders.” Typically, these spiders may be dedicated machines that relentlessly visit all the publicly addressable Internet addresses to gain access to the HyperText Transfer Protocol (HTTP) port number 80 to find “home pages” or “web pages.” Once found, the spider navigates through the content of each “page,” indexing both content and hyperlinks. The index may provide, for example, a correspondence between the subject matter of a web page and an address or Uniform Resource Identifier (URI) for each web page. This information is then provided to a search engine, to allow the search engine to identify addresses or locations of pertinent web pages in response to a particular search.
 Changes to web pages can create problems for browsers and search engine indexers or spiders. Web content is frequently changed, by adding new content to pages, removing or adding new pages, or changing a hyperlink to another page, etc. When a browser retrieves a web page, a copy of the web page is stored in a local cache. When a second request for the cached web page is received at the browser from a user, the browser determines whether to use the cached copy of the web page, or whether to retrieve the web page from the web server. In the HTTP/1.1 protocol, RFC 2068, a technique is described for the web server to provide a page content change indication. The content change indication is provided by either file size, file date, or a file digest specified by the MD5 message digest algorithm, described in RFC 1321. The client can request one or more of these values from the web server for a particular page. The web server then retrieves the page from memory, calculates the file digest, file size or file date, and then returns this information to the client, where the client may use this information to decide whether to use the cached copy or request a copy from the web server. However, this technique is slow and inefficient, because the web server must retrieve the page and calculate the indication for each such request. Also, in some instances, web pages may be stored at a location where a web server is not available. For example, it is common to store web pages on a server or a network accessible drive, without the additional burden of an HTTP server. Thus, in such cases, it is desirable to obtain a page content change indication without querying the web server.
 For the search engine, the changes in the web content can cause the web index to become outdated, which may create search results that include stale pages, pages that have moved or disappeared, broken links, etc. As a result, the web spider usually indexes web content relentlessly, constantly downloading and indexing the same web content over and over again in an attempt to provide updated indexes. This is very inefficient because this repetitive downloading of web pages consumes a large amount of bandwidth. As a result, it is desirable to provide a technique to obtain a page content change indication so that only the changed pages need to be downloaded and re-indexed.
 The foregoing and a better understanding of the present invention will become apparent from the following detailed description of exemplary embodiments and the claims when read in connection with the accompanying drawings, all forming a part of the disclosure of this invention. While the foregoing and following written and illustrated disclosure focuses on disclosing example embodiments of the invention, it should be clearly understood that the same is by way of illustration and example only and is not limited thereto. The spirit and scope of the present invention is limited only by the terms of the appended claims.
 The following represents brief descriptions of the drawings, wherein:
FIG. 1 is a diagram illustrating insertion of a digest or other content indicator into a file according to an example embodiment.
FIG. 2 is a diagram illustrating insertion of a digest into a file according to another example embodiment.
FIG. 3 is a diagram illustrating use of a digest according to yet another example embodiment.
FIG. 4 is a diagram illustrating an HTML document according to an example embodiment.
FIG. 5 is a block diagram that illustrates a network according to an example embodiment.
 Referring to the Figures in which like numerals indicate like elements, FIG. 1 is a diagram illustrating insertion of a digest or other content indicator into a file according to an example embodiment. As shown in FIG. 1, a web page or HTML page authoring tool 112 is provided to author or generate web pages or HTML pages. HTML authoring tool 112 typically may be a software program running on a processing node, such as a computer. The processing node or computer may include a processor, memory and other components. Web page authoring tool 112 may be, for example, software programs such as Front Page or Word, both available from Microsoft Corporation, Redmond, Washington.
 According to the embodiment shown in FIG. 1, a page-resident content indicator may be provided for each page to allow programs or clients to detect web page changes. For example, the authoring tool 112 may include an additional program that calculates or generates a content indicator for each file. The files may be, for example, a web page or HTML page, a graphic, a script, etc.
 According to an example embodiment, a content indicator is calculated or generated for each web page. The content indicator may then be stored in or with the file or web page. A content indicator may be anything that allows a client or other program to detect a change or update to the content of the web pages. According to an example embodiment, a content indicator, when compared to another content indicator for the same web page, provides an indication as to whether or not the content of the web page has been changed or updated.
 A content indicator may include, for example, a file size of the web page, a date and time that the web page was last modified or changed, or a file digest. When a file digest is calculated for a web page, a digest function takes an arbitrary sized message or file, such as a web page, and generates a number, which is typically a fixed length quantity. A hash algorithm or hash function, also known as a message digest, is typically a one-way function. It is considered a function because it takes an input message and produces an output. It may be considered one-way because it is not practical to figure out what input corresponds to a given output. If it is cryptographically secure, it should be computationally infeasible to find two messages or files that have the same file digest. Thus, if a change is made to a web page, the digest for that page will, in practice, change as well. The digest may be calculated, for example, using message digest algorithms, including MD2, MD4 and MD5, documented in Request for Comments 1319, 1320 and 1321, respectively. Other algorithms, such as other hash functions or Cyclic Redundancy Check (CRC) algorithms, may be used to generate the file digests. The term digest will be used hereinbelow in the various embodiments and examples. However, other types of content indicators may be used as well.
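 The three kinds of content indicator named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `content_indicators` and the returned dictionary keys are hypothetical, and MD5 is used simply because it is one of the algorithms the text lists.

```python
import hashlib
import os


def content_indicators(path):
    """Return the file size, last-modified time, and MD5 digest of one file.

    The function name and the dictionary keys are illustrative assumptions.
    """
    with open(path, "rb") as f:
        data = f.read()
    return {
        "size": len(data),                      # file size in bytes
        "mtime": os.path.getmtime(path),        # last-modified time (seconds since epoch)
        "md5": hashlib.md5(data).hexdigest().upper(),  # fixed-length digest of the content
    }
```

 Comparing the `md5` values obtained for two versions of the same page then indicates whether the content has changed, even when the file size happens to be identical.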
 Therefore, as shown in FIG. 1, the page authoring tool 112 includes a digest calculator 114 to calculate or generate a digest for each file or web page each time a web page or file is generated, created or updated, and then to store this digest with the corresponding web page. Files 120 include files 120A, 120B and 120C, which may be web pages, HTML pages or other types of files. Thus, according to an example embodiment, the digests may be page-resident, since the digests may reside with the corresponding web pages or files 120.
FIG. 4 is a diagram illustrating an HTML document according to an example embodiment. The web page authoring tool 112 (FIG. 1) generates or updates, or is used to generate or update, the HTML web page shown in FIG. 4, including the head and title of the message and the body of the message 410. The digest calculator 114 (FIG. 1) then calculates or generates the file digest 405 based on the HTML page shown in FIG. 4. The file digest 405 may then be prepended or attached to or stored within the HTML file. Thus, each file or web page may include a corresponding digest that is encoded onto the file or web page.
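 One possible way to prepend a digest to an HTML file is sketched below. The encoding of the digest as an HTML comment on the first line is an assumption made for illustration; the actual layout of file digest 405 in FIG. 4 is not reproduced here, and the names `embed_digest`, `read_digest` and `strip_digest` are hypothetical. The digest is computed over the page body only, so that re-stamping an unchanged page yields the same result.

```python
import hashlib

# Hypothetical digest-line format; not taken from FIG. 4.
DIGEST_PREFIX = '<!-- MD5="'
DIGEST_SUFFIX = '" -->'


def _is_digest_line(line):
    return line.startswith(DIGEST_PREFIX) and line.endswith(DIGEST_SUFFIX)


def strip_digest(page):
    """Return the page without a leading digest line, if one is present."""
    first, _, rest = page.partition("\n")
    return rest if _is_digest_line(first) else page


def embed_digest(page):
    """Prepend a digest line computed over the page body only."""
    body = strip_digest(page)
    digest = hashlib.md5(body.encode("utf-8")).hexdigest().upper()
    return f"{DIGEST_PREFIX}{digest}{DIGEST_SUFFIX}\n{body}"


def read_digest(page):
    """Return the embedded digest, or None if the page carries no digest line."""
    first = page.partition("\n")[0]
    if _is_digest_line(first):
        return first[len(DIGEST_PREFIX):-len(DIGEST_SUFFIX)]
    return None
```

 A client that only needs the digest can call `read_digest` on the first line of the file and ignore the rest of the page.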
 The page-resident file digests for each of the files or web pages allow web indexers to quickly index the web pages since the indexer can identify which pages have changed, and then update the index using only changed web pages. For example, the indexer can read the file digest for each web page. If the digest for a web page matches the digest for a previous version of the web page that has already been indexed, then the indexer can skip this page and move on to the next web page without downloading the web page. If the digest for a web page is different from a previous digest for that web page, this indicates that the web page has changed, and the indexer can download and index that page. This allows the indexer to selectively download only those web pages that have changed, resulting in a significant decrease in bandwidth usage to index a set of web pages.
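 The skip-or-download decision described above can be sketched as follows. How the indexer obtains a digest and a full page is left to two caller-supplied callables, `fetch_digest` and `fetch_page`; both, along with the shape of the `index` dictionary, are hypothetical and stand in for whatever retrieval mechanism an indexer actually uses.

```python
def reindex_changed(urls, index, fetch_digest, fetch_page):
    """Download and re-index only pages whose digest differs from the stored one.

    index maps url -> {"digest": str, "content": str} and is updated in place.
    fetch_digest(url) and fetch_page(url) are hypothetical callables supplied
    by the caller. Returns the list of URLs that were actually downloaded.
    """
    downloaded = []
    for url in urls:
        current = fetch_digest(url)
        known = index.get(url)
        if known is not None and known["digest"] == current:
            continue  # digest unchanged: skip the full download
        # New or changed page: download it and record the new digest.
        index[url] = {"digest": current, "content": fetch_page(url)}
        downloaded.append(url)
    return downloaded
```

 On a second pass over an unchanged set of pages the function downloads nothing, which is where the bandwidth saving comes from.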
 The page-resident digests for each of the stored web pages or files are also beneficial to the browsers that may be accessing these web pages. For example, in the event that the web pages are stored on a local storage drive or if a web server is not available, the browser may compare a digest from the cache-stored page to the digest from the page stored on the storage drive to determine if the cache-stored web page is invalid. If the cached copy of the page is invalid, as indicated by different digests, then the browser will retrieve the web page from the storage device. Otherwise, if the digests are the same, then this indicates that the cached copy of the page is still valid, and the browser may then use the cached copy, and need not download the entire web page from the network drive.
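 A browser-side cache check of this kind might look like the sketch below. It assumes, purely for illustration, that the digest is stored as an HTML comment on the first line of the file, so that only one line has to be read from the storage drive to validate the cache; the names `stored_digest` and `load_page` and the cache layout are hypothetical.

```python
import re

# Hypothetical first-line format; the encoding of the digest is an assumption.
DIGEST_LINE = re.compile(r'^<!-- MD5="([0-9A-Fa-f]+)" -->\s*$')


def stored_digest(path):
    """Read only the first line of the stored page and return its digest, if any."""
    with open(path, "r", encoding="utf-8") as f:
        match = DIGEST_LINE.match(f.readline())
    return match.group(1) if match else None


def load_page(path, cache):
    """Return the page text, using the cached copy when the digests match.

    cache maps path -> (digest, text) and is updated in place.
    """
    current = stored_digest(path)
    cached = cache.get(path)
    if current is not None and cached is not None and cached[0] == current:
        return cached[1]  # cached copy is still valid
    # Digests differ or nothing is cached: read the whole page from storage.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    cache[path] = (current, text)
    return text
```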
FIG. 2 is a diagram illustrating insertion of a digest into a file according to another example embodiment. As shown in FIG. 2, a user-programmable digest insertion tool 130, or a content indicator insertion tool in the general case, is provided. Rather than calculating a digest each time a file or web page is created, updated or saved, the digest insertion tool 130 can be programmed or directed to calculate updated digests for a plurality of files or web pages 120, and then replace the existing digest in each file with the updated digest. The digest insertion tool 130 may also include or use the digest calculator to calculate or generate a digest for each file or web page.
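 A batch tool of this kind could be approximated as shown below. The sketch walks a list of files, recomputes each digest, and rewrites a file only when its digest line is missing or stale. The first-line comment format and the name `refresh_digests` are assumptions for illustration, not the format used by digest insertion tool 130.

```python
import hashlib

# Hypothetical digest-line format, assumed for illustration only.
PREFIX, SUFFIX = '<!-- MD5="', '" -->'


def refresh_digests(paths):
    """Recompute each file's digest and replace a stale or missing digest line.

    Returns the list of paths that were rewritten.
    """
    rewritten = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        first, _, rest = text.partition("\n")
        has_line = first.startswith(PREFIX) and first.endswith(SUFFIX)
        body = rest if has_line else text  # the digest covers the body only
        digest = hashlib.md5(body.encode("utf-8")).hexdigest().upper()
        new_first = f"{PREFIX}{digest}{SUFFIX}"
        if has_line and first == new_first:
            continue  # digest is already current; leave the file untouched
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_first + "\n" + body)
        rewritten.append(path)
    return rewritten
```

 Such a function could be run on a schedule or on demand over a whole directory of pages, which matches the idea of updating digests in bulk rather than on every save.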
FIG. 3 is a diagram illustrating use of a digest according to yet another example embodiment. As shown in FIG. 3, a digest repository insertion tool 140 is provided to read each file or web page 120 and the file path. The file path for each file may be the path that identifies the location or address of the file in a network, for example. The file path may be a Uniform Resource Identifier (URI) or a Uniform Resource Locator (URL), for example. The digest repository insertion tool 140 includes a digest calculator 114. The digest repository insertion tool 140 then calculates or computes a digest for each web page or file, or uses the digest calculator 114 to perform these calculations. The digest repository insertion tool 140 then stores a file path and digest in a digest repository or storage 170, for each file or web page. Two example file path and digest pairs are shown below:
 1) home/stonea/new.html MD5=“CD25D86057DA6337090518B858D41E2”
 2) home/stonea/improved.html home/stonea/new.html
 Where “home/stonea/new.html” is the file path and “CD25D86057DA6337090518B858D41E2” is the digest for file 1), shown above as an example.
 It may be advantageous to store such an array or listing of file path and digest pairs for each of a plurality of files or web pages. This would allow a web indexer or a browser to retrieve entries from the digest repository 170, rather than retrieve portions of the web pages or files, to quickly obtain a current digest for each page or file. The client, whether an indexer or a web browser, may then compare the digest from the repository to a local copy of the digest for the same page to determine if the web page has changed, which would typically be indicated by digests that are different.
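 A digest repository and the client-side comparison can be sketched with a plain dictionary, as below. The entry text produced by `format_entry` loosely follows the `path MD5="digest"` style of the example pair given earlier; the function names, the in-memory dictionary, and the use of straight quotes are all assumptions made for illustration, not the storage format of digest repository 170.

```python
import hashlib


def build_repository(pages):
    """Map each file path to the MD5 digest of its content.

    pages maps path -> bytes content.
    """
    return {
        path: hashlib.md5(data).hexdigest().upper()
        for path, data in pages.items()
    }


def format_entry(path, digest):
    """Render one repository entry as text in a 'path MD5="digest"' style."""
    return f'{path} MD5="{digest}"'


def changed_paths(repository, local_digests):
    """Return the paths whose repository digest differs from the client's copy.

    A path the client has never seen counts as changed.
    """
    return sorted(
        path
        for path, digest in repository.items()
        if local_digests.get(path) != digest
    )
```

 A client would fetch the repository once, call `changed_paths` against its locally stored digests, and then retrieve only the pages that are returned.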
 The page authoring tool 112, digest insertion tools 130 or 140, the files or web pages 120 and the digest repository 170 may be provided on a single processing node, or spread across multiple processing nodes, where a processing node may be a computer, a server or similar system.
FIG. 5 is a block diagram that illustrates a network according to an example embodiment. For example, as shown in FIG. 5, web page authoring tool 112 may be a software program running on processing node 510, digest insertion tool 130 or 140 may be a software program running on processing node 515, files 120 may be stored in processing node 520, while digest repository 170 may be stored on processing node 525. This is just an example network; however, the invention is not limited in scope to such a network or arrangement.
 Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.
|U.S. Classification||715/255, 257/E27.112, 257/E27.098, 715/234, 257/E29.281|
|International Classification||H01L27/12, H01L27/11, H01L29/786|
|Cooperative Classification||H01L29/78612, H01L27/1203, H01L27/11|
|European Classification||H01L29/786B3, H01L27/12B|
|Dec 18, 2000||AS||Assignment|
Owner name: INTEL CORP., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:STONE, ALAN E.;REEL/FRAME:011366/0473
Effective date: 20001214
|Nov 11, 2003||AS||Assignment|
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DIALOGIC CORPORATION;REEL/FRAME:014120/0462
Effective date: 20031017
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DIALOGIC CORPORATION;REEL/FRAME:014120/0451
Effective date: 20031017