Search Images Maps Play YouTube News Gmail Drive More »
Sign in
Screen reader users: click this link for accessible mode. Accessible mode has the same essential features but works better with your reader.

Patents

  1. Advanced Patent Search
Publication numberUS20020078087 A1
Publication typeApplication
Application numberUS 09/737,946
Publication dateJun 20, 2002
Filing dateDec 18, 2000
Priority dateDec 18, 2000
Also published asUS7078766, US20020074599
Publication number09737946, 737946, US 2002/0078087 A1, US 2002/078087 A1, US 20020078087 A1, US 20020078087A1, US 2002078087 A1, US 2002078087A1, US-A1-20020078087, US-A1-2002078087, US2002/0078087A1, US2002/078087A1, US20020078087 A1, US20020078087A1, US2002078087 A1, US2002078087A1
InventorsAlan Stone
Original AssigneeStone Alan E.
Export CitationBiBTeX, EndNote, RefMan
External Links: USPTO, USPTO Assignment, Espacenet
Content indicator for accelerated detection of a changed web page
US 20020078087 A1
Abstract
Various embodiments are described for content indicators to accelerate detection of a changed web page or other file.
Images(6)
Previous page
Next page
Claims(28)
What is claimed is:
1. An apparatus comprising:
an authoring tool to generate files;
a calculator to calculate a content indicator for one or more of the files, the page authoring tool to store each of the calculated content indicators with a corresponding file.
2. The apparatus of claim 1 wherein the authoring tool comprises a HTML authoring tool or program.
3. The apparatus of claim 1 wherein the calculator comprises a digest calculator to calculate digests for each of the files.
4. The apparatus of claim 1 wherein the apparatus encodes each of the content indicators within a corresponding file.
5. An apparatus comprising:
an insertion tool comprising a calculator to calculate a content indicator for each of a plurality of files, the insertion tool to insert each of the calculated content indicators within a corresponding one of the files.
6. The apparatus of claim 5 wherein the calculator comprises a digest calculator to calculate digests for each of the files.
7. The apparatus of claim 5 wherein the apparatus encodes each of the content indicators within a corresponding file.
8. The apparatus of clam 6 wherein the insertion tool comprises an insertion tool to insert each of the calculated digests within a corresponding one of the files.
9. The apparatus of claim 5 wherein the files comprise one or more of the following:
web pages;
HTML pages;
Graphics; and
Scripts.
10. An apparatus comprising:
an insertion tool comprising a calculator to calculate a content indicator for each of a plurality of files, the insertion tool to obtain a path for each file and store a file path and the content indicator in a repository for each of a plurality of files.
11. The apparatus of claim 10 wherein the calculator comprises a digest calculator to calculate digests for each of a plurality of files.
12. The apparatus of claim 11 wherein the calculator comprises a digest calculator to calculate digests for each of a plurality of web pages.
13. The apparatus of claim 10 wherein the path file indicates a location or address of the file.
14. A method comprising:
generating a file;
calculating a content indicator for the file;
storing the content indicator with the file to provide content change indication for the file when compared to another content indicator.
15. The method of claim 14 wherein the calculating comprises calculating a digest for the file.
16. The method of claim 15 wherein the storing comprises storing the digest with the file to provide content change indication for the file when compared to another digest.
17. The method of claim 14 and further comprises:
retrieving the content indicator for the file;
comparing the content indicator for the file to a content indicator corresponding to a previous version of the file; and
determining whether the file has changed based on the comparing.
18. The method of claim 14 and further comprises:
retrieving the content indicator for the file;
comparing the content indicator for the file to a content indicator corresponding to a previous version of the file; and
retrieving the file if the content indicator for the file and the content indicator corresponding to a previous version of the file do not match.
19. The method of claim 18 and further comprising updating an index based on the retrieved file if there was not a match.
20. A method comprising:
obtaining a path for a file;
retrieving the file;
calculating a content indicator for the file;
storing the file path and the content indicator for the file in a repository or storage.
21. The method of claim 20 wherein the calculating comprises calculating a digest for the file.
22. The method of claim 21 wherein the storing comprises storing the file path and the digest in a digest repository or storage.
23. An apparatus comprising a readable media having instructions thereon, the instructions resulting in the following when executed:
generating a file;
calculating a content indicator for the file;
storing the content indicator with the file to provide content change indication for the file when compared to another content indicator.
24. The apparatus of claim 23 wherein the calculator comprises calculating a digest for the file.
25. The apparatus of claim 23 wherein the storing comprises storing the digest with the file to provide content change indication for the file when compared to another digest.
26. An apparatus comprising a readable media having instructions thereon, the instructions resulting in the following when executed:
obtaining a path for a file;
retrieving the file;
calculating a content indicator for the file;
storing the file path and the content indicator for the file in a repository or storage.
27. The apparatus of claim 26 wherein the calculating comprises calculating a digest for the file.
28. The apparatus of claim 27 wherein the storing comprises storing the file path and the digest in a digest repository or storage.
Description
BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The foregoing and a better understanding of the present invention will become apparent from the following detailed description of exemplary embodiments and the claims when read in connection with the accompanying drawings, all forming a part of the disclosure of this invention. While the foregoing and following written and illustrated disclosure focuses on disclosing example embodiments of the invention, it should be clearly understood that the same is by way of illustration and example only and is not limited thereto. The spirit and scope of the present invention is limited only by the terms of the appended claims.

[0007] The following represents brief descriptions of the drawings, wherein:

[0008]FIG. 1 is a diagram illustrating insertion of a digest or other content indicator into a file according to an example embodiment.

[0009]FIG. 2 is a diagram illustrating insertion of a digest into a file according to another example embodiment.

[0010]FIG. 3 is a diagram illustrating use of a digest according to yet another example embodiment.

[0011]FIG. 4 is a diagram illustrating a HTML document according to an example embodiment.

[0012]FIG. 5 is a block diagram that illustrates a network according to an example embodiment.

DETAILED DESCRIPTION

[0013] Referring to the Figures in which like numerals indicate like elements, FIG. 1 is a diagram illustrating insertion of a digest or other content indicator into a file according to an example embodiment. As shown in FIG. 1, a web page or HTML page authoring tool 112 is provided to author or generate web pages or HTML pages. HTML authoring tool 112 typically may be a software program running on a processing node, such as a computer. The processing node or computer may include a processor, memory and other components. Web page authoring tool 112 may be, for example, software programs such as Front Page or Word, both available from Microsoft Corporation, Redmond, Washington.

[0014] According to the embodiment shown in FIG. 1, a page-resident content indicator may be provided for each page to allow programs or clients to detect web page changes. For example, the authoring tool 112 may include an additional program that calculates or generates a content indicator for each file. The files may be, for example, a web page or HTML page, a graphic, a script, etc.

[0015] According to an example embodiment, a content indicator is calculated or generated for each web page. The content indicator may then be stored in or with the file or web page. A content indicator may be anything that allows a client or other program to detect a change or update to the content of the web pages. According to an example embodiment, a content indicator, when compared to another content indicator for the same web page, provides an indication as to whether or not the content of the web page has been changed or updated.

[0016] A content indicator may include, for example, a file size of the web page, a date and time that the web page was last modified or changed, and a file digest. When a file digest is calculated for a web page, a digest function takes an arbitrary sized message or file, such as a web page, and generates a number, which is typically a fixed length quantity. A hash algorithm or hash function, also known as a message digest is typically a one-way function. It is considered a function because it takes an input message and produces an output. It may be considered one-way because it is not practical to figure out what input corresponds to a given output. If it is cryptographically secure, it should be impossible to find two messages or files that have the same file digest. Thus, if a change is made to a web page, the digest for that page will change. The digest may be calculated, for example, using message digest algorithms, including MD2, MD4 and MD5, and documented in Request for Comments 1319, 1320, 1321, respectively. Other algorithms, such as hash functions or Cyclic Redundancy Checks (CRC) algorithms, etc. may be used to generate the file digests. The term digest will be used hereinbelow in the various embodiments and examples. However, other types of content indicators may be used as well.

[0017] Therefore, as shown in FIG. 1, the page authoring tool 112 includes a digest calculator 114 to calculate or generate a digest for each file or web page each time a web page or file is generated or created or updated, and then to store this digest with the corresponding web page. Files 120 includes files 120A, 120B and 120C, which may be web pages, HTML pages or other types of files. Thus, according to an example embodiment, the digests may be page-resident, since the digests may reside with the corresponding web pages or files 120.

[0018]FIG. 4 is a diagram illustrating a HTML document according to an example embodiment. The web page authoring tool 112 (FIG. 1) generates or updates, or is used to generate or update, the HTML web page shown in FIG. 4, including the head and title of the message and the body of the message 410. The digest calculator 114 (FIG. 1) then calculates or generates the file digest 405 based on the HTML page shown in FIG. 4. The file digest 405 may then be prepended or attached to or stored within the HTML file. Thus, each file or web page may include a corresponding digest that is encoded onto the file or web page.

[0019] The page-resident file digests for each of the files or web pages allows web indexers to quickly index the web pages since the indexer can identify which pages have changed, and then update the index using only changed web pages. For example, the indexer can read the file digest for each web page. If the digest for a web page matches the digest for a previous version of the web page that has already been indexed, then the indexer can skip this page and move on to the next web page without downloading the web page. If the digest for a web page is different from a previous digest for that web page, this indicates that the web page has changed, and the indexer can download and index that page. This allows the indexer to selectively download only those web pages that have changed, resulting in a significant decrease in bandwidth usage to index a set of web pages.

[0020] The page-resident digests for each of the stored web pages or files are also beneficial to the browsers that may be accessing these web pages. For example, in the event that the web pages are stored on a local storage drive or if a web server is not available, the browser may compare a digest from the cache-stored page to the digest from the page stored on the storage drive to determine if the cache-stored web page is invalid. If the cached copy of the page is invalid, as indicated by different digests, then the browser will retrieve the web page from the storage device. Otherwise, if the digests are the same, then this indicates that the cached copy of the page is still valid, and the browser may then use the cached copy, and need not download the entire web page from the network drive.

[0021]FIG. 2 is a diagram illustrating insertion of a digest into a file according to another example embodiment. As shown in FIG. 2, as user-programmable digest insertion tool 130, or a content indicator insertion tool in the general case, is provided. Rather than calculating a digest each time a file or web page is created, updated or saved, the digest insertion tool 130 can be programmed or directed to calculate updated digests for a plurality of files or web pages 120, and then replace the existing digest in each file with the updated digest. The digest insertion tool 130 may also include or use the digest calculator to calculate or generate a digest for each file or web page.

[0022]FIG. 3 is a diagram illustrating use of a digest according to yet another example embodiment. As shown in FIG. 3, a digest repository insertion tool 140 is provided to read each file or web page 120 and the file path. The file path for each file may be the path that identifies the location or address of the file in a network, for example. The file path may be a Universal Resource Identifier (URI) or a Universal Resource Location (URL), for example. The digest repository insertion tool 140 includes a digest calculator 114. The digest repository insertion tool 140 then calculates or computes a digest for each web page or file, or uses the digest calculator 114 to perform these calculations. The digest repository insertion tool 140 then stores a file path and digest in a digest repository or storage 170, for each file or web page. Two example file path and digest pairs are shown below:

[0023] 1) home/stonea/new.html MD5=“CD25D86057DA6337090518B858D41E2”

[0024] 2) home/stonea/improved.html home/stonea/new.html

[0025] Where “home/stonea/new.html” is the file path and “CD25D86057DA6337090518B858D41 E2” is the digest for file 1), shown above as an example.

[0026] It may be advantageous to store such an array or listing of file path and digest pairs for each of a plurality of files or web pages. This would allow a web indexer or a browser to retrieve entries from the digest repository 170, rather than retrieve portions of the web pages or files, to quickly obtain a current digest for each page or file. The client, indexer or web browser, may then compare the digest from the repository to a local copy of the digest for the same page to determine if the web page has changed, which would typically be indicated by digests that are different.

[0027] The page authoring tool 112, digest insertion tools 130 or 140, the files or web pages 120 and the digest repository 170 may be provided on a single processing node, or spread across multiple processing nodes, where a processing node may be a computer, a server or similar system.

[0028]FIG. 5 is a block diagram that illustrates a network according to an example embodiment. For example, as shown in FIG. 5, web page authoring tool 112 may be a software program running on processing node 510, digest insertion tool 130 or 140 may be a software program running on processing node 515, files 120 may be stored in processing node 520, while digest repository may be stored on processing node 525. This is just an example network, however, the invention is not limited in scope to such a network or arrangement.

[0029] Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.

FIELD

[0001] The invention generally relates to web pages, browsers and search engines, and in particular, to a content indicator for accelerated detection of a changed web page.

BACKGROUND

[0002] Today, web pages are commonly stored on web servers. A web server is a server that stores or provides web pages, typically in Hypertext Markup Language (HTML) format, and makes these web pages available to clients upon request, such as in response to a “Get” request using Hypertext Transfer Protocol (HTTP)—HTTP/1.1, Request For Comments 2616, June 1999. A client may be any software program that may request access the web pages. Two common web clients include a web browser and search engine indexers. A web browser is a program which can retrieve web pages from remote web servers and display the web page for the user.

[0003] The Internet is typically indexed via search engine indexers, also known as web “spiders.” Typically, these spiders may be dedicated machines that relentlessly visit all the publicly addressable Internet addresses to gain access to the HyperText Transfer Protocol (HTTP) port number 80 to find “home pages” or “web pages.” Once found, the spider navigates through the content of each ‘page’, indexing both content and hyperlinks. The index may provide, for example, a correspondence between the subject matter of a web page and an address or Universal Resource Identifier for each web page. This information is then provided to a search engine, to allow the search engine to identify addresses or locations of pertinent web pages in response to a particular search.

[0004] Changes to web pages can create problems for browsers and search engine indexers or spiders. Web content is frequently changed, by adding new content to pages, removing or adding new pages, or changing a hyperlink to another page, etc. When a browser retrieves a web page, a copy of the web page is stored in a local cache. When a second request for the cached web page is received at the browser from a user, the browser determines whether to use the cached copy of the web page, or whether to retrieve the web page from the web server. In HTTP/1.1 protocol, RFC 2068, a technique is described for the web server to provide a page content change indication. The content change indication is provided by either file size, file date, or a file digest specified by MD5 message digest algorithm, described in RFC 1321. The client can request one or more of these values from the web server for a particular page. The web server then retrieves the page from memory, calculates the file digest, file size or file date, and then returns this information to the client, where the client may use this information to decide whether to use the cached copy or request a copy from the web server. However, this is a slow and inefficient technique. Also, in some instances, web pages may be stored at a location where a web server is not available. For example, it is common to store web pages on a server or a network accessible drive, without the additional burden of an HTTP server. Thus, in such cases, it is desirable to obtain a page content change indication without querying the web server.

[0005] For the search engine, the changes in the web content can cause the web index to become outdated, which may create search results that include stale pages, pages that have moved or disappeared, broken links, etc. As a result, the web spider usually indexes web content relentlessly, constantly downloading indexing the same web content over and over again in attempt to provide updated indexes. This is very inefficient because this repetitive downloading of web pages consumes a large amount of bandwidth. As a result, it is desirable to provide a technique to obtain a page content change indication so that only the changed pages would be necessary to download and re-index.

Referenced by
Citing PatentFiling datePublication dateApplicantTitle
US7107336 *Feb 23, 2001Sep 12, 2006International Business Machines CorporationMethod and apparatus for enhanced server page execution
US7418661 *Sep 17, 2002Aug 26, 2008Hewlett-Packard Development Company, L.P.Published web page version tracking
US7584230 *Mar 27, 2007Sep 1, 2009At&T Intellectual Property, I, L.P.Method, systems and computer program products for monitoring files
US7788713 *Jun 23, 2004Aug 31, 2010Intel CorporationMethod, apparatus and system for virtualized peer-to-peer proxy services
US8219633 *Sep 26, 2011Jul 10, 2012Limelight Networks, Inc.Acceleration of web pages access using next page optimization, caching and pre-fetching
US8224964Jun 30, 2004Jul 17, 2012Google Inc.System and method of accessing a document efficiently through multi-tier web caching
US8250457Sep 26, 2011Aug 21, 2012Limelight Networks, Inc.Acceleration and optimization of web pages access by changing the order of resource loading
US8275790 *Oct 14, 2008Sep 25, 2012Google Inc.System and method of accessing a document efficiently through multi-tier web caching
US8321533Aug 2, 2010Nov 27, 2012Limelight Networks, Inc.Systems and methods thereto for acceleration of web pages access using next page optimization, caching and pre-fetching techniques
US8341177 *Dec 28, 2006Dec 25, 2012Symantec Operating CorporationAutomated dereferencing of electronic communications for archival
US8341711 *Nov 7, 2008Dec 25, 2012Whitehat Security, Inc.Automated login session extender for use in security analysis systems
US8346784May 29, 2012Jan 1, 2013Limelight Networks, Inc.Java script reductor
US8346885May 14, 2012Jan 1, 2013Limelight Networks, Inc.Systems and methods thereto for acceleration of web pages access using next page optimization, caching and pre-fetching techniques
US8370420 *Jul 11, 2002Feb 5, 2013Citrix Systems, Inc.Web-integrated display of locally stored content objects
US8495171May 29, 2012Jul 23, 2013Limelight Networks, Inc.Indiscriminate virtual containers for prioritized content-object distribution
US8549390 *Oct 26, 2006Oct 1, 2013International Business Machines CorporationVerifying content of resources in markup language documents
US20070277045 *May 24, 2007Nov 29, 2007Kabushiki Kaisha ToshibaData processing apparatus and a method for processing data
US20100050067 *Mar 19, 2007Feb 25, 2010International Business Machines CorporationBookmarking internet resources in an internet browser
US20120089695 *Sep 26, 2011Apr 12, 2012Fainberg LeonidAcceleration of web pages access using next page optimization, caching and pre-fetching
US20120210237 *Feb 16, 2011Aug 16, 2012Computer Associates Think, Inc.Recording A Trail Of Webpages
Classifications
U.S. Classification715/255, 257/E27.112, 257/E27.098, 715/234, 257/E29.281
International ClassificationH01L27/12, H01L27/11, H01L29/786
Cooperative ClassificationH01L29/78612, H01L27/1203, H01L27/11
European ClassificationH01L29/786B3, H01L27/12B
Legal Events
DateCodeEventDescription
Nov 11, 2003ASAssignment
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DIALOGIC CORPORATION;REEL/FRAME:014120/0462
Effective date: 20031017
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:DIALOGIC CORPORATION;REEL/FRAME:014120/0451
Dec 18, 2000ASAssignment
Owner name: INTEL CORP., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:STONE, ALAN E.;REEL/FRAME:011366/0473
Effective date: 20001214