An automated form filler and script executor is integrated with a web browser engine, which is communicatively coupled to a web crawler, thereby enabling the crawler to identify dynamic web content based on submission of forms completed by the form filler. The crawler is capable of identifying web pages containing forms that require submission, and JavaScript code that requires execution, respectively, for requesting dynamic web content from a server. The crawler passes a representation of such web pages to the browser engine. The form filler systematically completes the form based on various combinations of search parameter values provided by the web page for requesting dynamic content. Request messages are constructed by the browser engine and passed to the crawler for submission to the server. The dynamic content, received by the crawler from the server in response to the request, can be indexed according to conventional search engine indexing techniques.

Inventors: Bangalore Subbaramaiah Prabhakar, Shivakumar Ganesan, Yarram Sunil Kumar, Shreekanth Karvaje, Binu Raj
Original Assignee: Yahoo! Inc.
Primary Examiner: Pierre M Vital
Secondary Examiner: Fred I Ehichioya
Attorney: Hickman Palermo Truong & Becker
Current U.S. Classification: 1/1; 707/999.001; 707/999.01; 707/999.1; 707/999.2; 707/999.202; 709/203

Citations

Cited Patent | Filing date | Issue date | Original Assignee | Title
US6219818 | Feb 18, 1999 | Apr 17, 2001 | NetMind Technologies, Inc. | Checksum-comparing change-detection tool indicating degree and location of change of internet documents
US6738344 | Sep 27, 2000 | May 18, 2004 | Hewlett-Packard Development Company, L.P. | Link extenders with link alive propagation
US6871213 | Oct 11, 2000 | Mar 22, 2005 | Kana Software, Inc. | System and method for web co-navigation with dynamic content including incorporation of business rule into web document
US20020078136 | Dec 14, 2000 | - | International Business Machines Corporation | Method, apparatus and computer program product to crawl a web site
US20020083068 | Oct 29, 2001 | - | - | Method and apparatus for filling out electronic forms
US20020156779 | Sep 28, 2001 | - | - | Internet search engine
US20040083424 | Oct 16, 2003 | - | NEC CORPORATION | Apparatus, method, and computer program product for checking hypertext
US20050114319 | Mar 9, 2004 | - | Microsoft Corporation | System and method for checking a content site for efficacy
US20050120060 | Nov 26, 2004 | - | - | System and method for solving the dead-link problem of web pages on the Internet
US20050192936 | Feb 12, 2004 | - | - | Decision-theoretic web-crawling and predicting web-page change
US20050262063 | Apr 26, 2005 | - | Watchfire Corporation | Method and system for website analysis
US20060112089 | Nov 22, 2004 | - | - | Methods and apparatus for assessing web page decay
US20060294052 | Aug 13, 2005 | - | - | Unsupervised, automated web host dynamicity detection, dead link detection and prerequisite page discovery for search indexed web pages

Referenced by

Citing Patent | Filing date | Issue date | Original Assignee | Title
US7899807 | Dec 20, 2007 | Mar 1, 2011 | Yahoo! Inc. | System and method for crawl ordering by search impact

Claims

1. An automated machine-implemented method for crawling dynamic content, the method comprising:

based on first data identifying a plurality of pages to crawl, a processor
determining that information about a first page at a first location should be indexed;
wherein said first location is identified in said first data by a first link;

in response to determining that information about the first page should be indexed, prior to indexing said information, determining that the first link is associated with a first prerequisite uniform resource locator (URL) for the first page;
wherein the first prerequisite URL identifies the location of a first prerequisite page that must be visited prior to visiting the first page in order to retrieve dynamic content that the first page will contain;
wherein determining the first prerequisite URL for the first page is based on second data, said second data identifying, for each particular page of one or more pages, a particular prerequisite URL, the particular prerequisite URL identifying a location of a particular prerequisite page that must be visited prior to visiting the particular page;
in response to determining that the first link is associated with a first prerequisite URL for the first page, visiting the first prerequisite page at the location identified by the first prerequisite URL; and
after visiting the first prerequisite page, indexing said information about the first page;
wherein indexing said information about the first page comprises visiting the first page and indexing information based upon content received while visiting the first page.
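
Illustrative note (not part of the claims): the flow recited in claim 1 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The names `crawl`, `first_data`, `second_data`, `fetch`, and `fake_fetch`, as well as the example.com URLs, are hypothetical; the "first data" is modeled as a list of links and the "second data" as a mapping from a link to its prerequisite URL.

```python
from typing import Callable, Dict, List, Optional


def crawl(
    first_data: List[str],
    second_data: Dict[str, str],
    fetch: Callable[[str], str],
    index: Dict[str, str],
) -> List[str]:
    """Visit each link, visiting its recorded prerequisite page first.

    Returns the ordered list of URLs visited so the order can be inspected.
    """
    visited: List[str] = []
    for link in first_data:
        # Look up a prerequisite URL for this link in the second data.
        prerequisite: Optional[str] = second_data.get(link)
        if prerequisite is not None:
            # Visit the prerequisite page before the target page.
            fetch(prerequisite)
            visited.append(prerequisite)
        # Visit the target page and index what was received.
        content = fetch(link)
        visited.append(link)
        index[link] = content
    return visited


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP fetch (assumption: no network access here).
    return f"content of {url}"


pages_to_crawl = ["http://example.com/results?id=7", "http://example.com/about"]
prerequisites = {"http://example.com/results?id=7": "http://example.com/search"}
search_index: Dict[str, str] = {}
order = crawl(pages_to_crawl, prerequisites, fake_fetch, search_index)
```

In this sketch the page with a recorded prerequisite is fetched only after its prerequisite page, while a page with no entry in the mapping is fetched directly.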

2. The method of claim 1, further comprising:

prior to said step of visiting the first page and in response to visiting the first prerequisite page, receiving state information;

wherein visiting the first page comprises sending information based on the state information to a server at which the first page is stored.

3. The method of claim 2, wherein the state information is a cookie.

4. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 2.

5. The method of claim 2, wherein the state information is a session identifier.
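
Illustrative note (not part of the claims): claims 2, 3, and 5 describe receiving state information, such as a cookie or a session identifier, from the prerequisite page and sending information based on it when visiting the first page. The sketch below shows one way this might look using Python's standard library. The helper names, the `SESSIONID` cookie, and the `sid` query parameter are assumptions; the claims do not say how a session identifier is transmitted.

```python
from http.cookies import SimpleCookie
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def state_from_set_cookie(set_cookie_header: str) -> Dict[str, str]:
    """Extract cookie name/value pairs from a Set-Cookie header value."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    return {name: morsel.value for name, morsel in cookie.items()}


def cookie_request_header(state: Dict[str, str]) -> str:
    """Build the Cookie request header to send when visiting the first page."""
    return "; ".join(f"{name}={value}" for name, value in state.items())


def with_session_id(url: str, session_id: str, param: str = "sid") -> str:
    """Append a session identifier as a query parameter (hypothetical scheme)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [(param, session_id)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# State received in response to visiting the prerequisite page (made-up value).
state = state_from_set_cookie("SESSIONID=abc123; Path=/; HttpOnly")
header = cookie_request_header(state)
target = with_session_id("http://example.com/results?id=7", state["SESSIONID"])
```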

6. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 5.

7. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 2.

8. The method of claim 1, wherein the first page includes dynamically generated content.

9. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 8.

10. The method of claim 1, wherein:

the first prerequisite page includes a form.

11. The method of claim 10, wherein the first link includes form data responsive to the form.

12. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 11.

13. The method of claim 10, further comprising:

determining an encoded form data set for the first page;
wherein determining the encoded form data set is based on third data identifying, for each of one or more pages, an encoded form data set;

wherein visiting the first page comprises sending the encoded form data to a server at which the first page is stored.

14. The method of claim 13, wherein sending the encoded form data to the server at which the first page is stored comprises sending the encoded form data set as part of a POST transaction.
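
Illustrative note (not part of the claims): claims 13 and 14 describe looking up an encoded form data set for a page and sending it as part of a POST transaction. A minimal sketch using `urllib` is shown below. The `third_data` mapping, the field names, and the URL are hypothetical, and the request is only constructed, not sent.

```python
from typing import Dict
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url: str, form_fields: Dict[str, str]) -> Request:
    """Encode form fields and wrap them in a POST request object."""
    # application/x-www-form-urlencoded body, e.g. b"q=web+crawler&page=1"
    body = urlencode(form_fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical "third data": per-page form field values to submit.
third_data = {"http://example.com/results": {"q": "web crawler", "page": "1"}}

page_url = "http://example.com/results"
request = build_post_request(page_url, third_data[page_url])
# A real crawler would now pass `request` to urllib.request.urlopen.
```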

15. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 14.

16. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 13.

17. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 10.

18. The method of claim 1, further comprising maintaining the first data and the second data in a database at a web crawler or a search engine.

19. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 18.

20. The method of claim 1, further comprising:

at a first particular time prior to determining that information about the first page should be indexed:
visiting the first page;
determining that the first prerequisite page must be visited prior to visiting the first page in order to retrieve dynamic content that the first page contains; and
in response to said determining that the first prerequisite page must be visited prior to visiting the first page, storing the first prerequisite URL in the second data in association with the first link.
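
Illustrative note (not part of the claims): claim 20 describes an earlier pass in which the crawler learns that a prerequisite page is needed and stores the prerequisite URL in the second data in association with the link. The sketch below assumes a candidate prerequisite URL is already known (for example, the page on which the link was found) and that a hypothetical `visit(url, prior)` callback reports whether the dynamic content was retrieved; neither detail comes from the claims.

```python
from typing import Callable, Dict, Optional


def discover_prerequisite(
    link: str,
    candidate: str,
    visit: Callable[[str, Optional[str]], bool],
    second_data: Dict[str, str],
) -> bool:
    """Record `candidate` as the prerequisite for `link` if it is needed.

    `visit(url, prior)` returns True when the dynamic content was retrieved
    after first visiting `prior` (or directly, when `prior` is None).
    """
    if visit(link, None):
        return False  # Page is reachable directly; nothing to record.
    if visit(link, candidate):
        # Store the prerequisite URL in association with the link.
        second_data[link] = candidate
        return True
    return False  # The candidate did not help; leave the data unchanged.


def fake_visit(url: str, prior: Optional[str]) -> bool:
    # Stand-in server behavior: content is served only after the search page.
    return prior == "http://example.com/search"


learned: Dict[str, str] = {}
recorded = discover_prerequisite(
    "http://example.com/results?id=7",
    "http://example.com/search",
    fake_visit,
    learned,
)
```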

21. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 20.

22. The method of claim 1, wherein determining that the first prerequisite page must be visited prior to visiting the first page occurs in response to, during a first attempt to visit the first page, determining that the first link is a dead link.

23. The method of claim 22, wherein determining that the first link is a dead link occurs in response to receiving an error page during the first attempt to visit the first page.

24. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 23.

25. The method of claim 22, wherein determining that the first link is a dead link occurs in response to being redirected to a general page during the first attempt to visit the first page.
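
Illustrative note (not part of the claims): claims 22, 23, and 25 tie the need for a prerequisite page to finding that a link is dead, either because an error page was received or because the crawler was redirected to a general page. One possible check is sketched below. The `FetchResult` type, the use of an HTTP status of 400 or above as the sign of an error page, and the set of known general pages are all assumptions; a server may also return an error page with a 200 status, which this sketch would not catch.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class FetchResult:
    requested_url: str  # URL the crawler asked for
    final_url: str      # URL after following any redirects
    status: int         # HTTP status code of the final response


def is_dead_link(result: FetchResult, general_pages: Set[str]) -> bool:
    """Return True if the fetch looks like a dead link."""
    # Case 1: an error page was received (approximated by the status code).
    if result.status >= 400:
        return True
    # Case 2: the crawler was redirected to a general page such as a home page.
    redirected = result.final_url != result.requested_url
    return redirected and result.final_url in general_pages


general = {"http://example.com/", "http://example.com/search"}
error_case = FetchResult("http://example.com/r?id=7", "http://example.com/r?id=7", 404)
redirect_case = FetchResult("http://example.com/r?id=7", "http://example.com/", 200)
ok_case = FetchResult("http://example.com/about", "http://example.com/about", 200)
```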

26. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 25.

27. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 22.

28. The method of claim 1, wherein each of the steps of claim 1 are performed by a web crawler.

29. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 28.

30. A volatile or non-volatile machine-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, causes the one or more processors to perform the method recited in claim 1.