Admin Console Help

Content Sources > Web Crawl > Crawl Schedule

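Use the Content Sources > Web Crawl > Crawl Schedule page to perform the following tasks:

  • Selecting a crawl mode
  • Scheduling a crawl
  • Deleting a crawl schedule
  • Configuring index removal and backoff intervals for URLs in error states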

Before Starting these Tasks

Before selecting a crawl mode or scheduling a crawl, you must perform the following tasks:

  • Specify URLs from which the search appliance should start crawling on the Content Sources > Web Crawl > Start and Block URLs page
  • Configure servers for crawling

Selecting a Crawl Mode

Select one of the search appliance crawl modes, described in the following table. Both modes of crawling use the same URLs that are configured on the Content Sources > Web Crawl > Start and Block URLs page.

Mode | Description
Continuous crawl | In this mode, the crawler automatically locates and indexes updated content.
Scheduled crawl | This mode gives you control over the time and the duration of all crawls. A scheduled crawl proceeds until one of the following events happens:
  • The time limit that you specified has passed.
  • The crawler reaches the document limit specified by your license.
  • The crawler reaches the limit that you set on the Content Sources > Web Crawl > Host Load Schedule page, under Maximum Number of URLs to Crawl.
  • The crawler has crawled all reachable URLs.
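
The stop conditions for a scheduled crawl can be read as a simple check. The following Python sketch is illustrative only: the function and parameter names are hypothetical and do not correspond to any search appliance API, and the order in which the conditions are tested is an assumption.

```python
from typing import Optional


def scheduled_crawl_stop_reason(
    elapsed_minutes: int,
    time_limit_minutes: Optional[int],
    documents_indexed: int,
    license_document_limit: int,
    urls_crawled: int,
    max_urls_to_crawl: Optional[int],
    urls_remaining: int,
) -> Optional[str]:
    """Return why a scheduled crawl would stop, or None if it keeps running.

    All names here are hypothetical; they only mirror the four events
    listed in the table above.
    """
    # The time limit that you specified has passed.
    if time_limit_minutes is not None and elapsed_minutes >= time_limit_minutes:
        return "time limit passed"
    # The crawler reaches the document limit specified by your license.
    if documents_indexed >= license_document_limit:
        return "license document limit reached"
    # The crawler reaches Maximum Number of URLs to Crawl (Host Load Schedule page).
    if max_urls_to_crawl is not None and urls_crawled >= max_urls_to_crawl:
        return "maximum number of URLs to crawl reached"
    # The crawler has crawled all reachable URLs.
    if urls_remaining == 0:
        return "all reachable URLs crawled"
    return None
```

Passing None for a limit models the case where that limit is not set.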

To select a crawl mode:

  1. Click Content Sources > Web Crawl > Crawl Schedule.
  2. Click the radio button for either Continuous crawl or Scheduled crawl mode.
  3. Click Save.
    After you save your selection, the bottom part of the page displays information relevant to the chosen crawl mode: the crawl schedule for scheduled crawls, or freshness tuning for continuous crawls.

Scheduling a Crawl

Before you can schedule a crawl, the search appliance must be in scheduled crawl mode. A crawl schedule allows you to integrate the crawl with any other system activities that occur on your servers, such as routine system backups. The following table describes options for scheduling a crawl.

Option | Description
Begin Crawl on | Select a day of the week, or every day, for the crawl to begin.
Start Hour | Select an hour for the crawl to begin.
Start Minute | Select a minute for the crawl to begin.
Duration Hour | Select a duration in hours for the crawl. Together with Duration Minute, this option limits the crawl to a specific duration, expressed in hours and minutes. If you set a crawl time limit, the crawler runs for the specified number of hours and minutes or until it crawls all of the URLs, whichever comes first. For example, if you set a time limit of two hours and schedule a start time of 2 a.m., the crawler crawls your servers from 2 a.m. to 4 a.m., unless it finishes crawling before the two-hour limit.
Duration Minute | Select a duration in minutes for the crawl.
Delete | Click to delete the crawl schedule.
Add More Rows | Click to add more rows to the crawl schedule.
Save | Click to save the crawl schedule.

To schedule a crawl:

  1. Click Content Sources > Web Crawl > Crawl Schedule.
  2. Ensure that search appliance crawling is in scheduled crawl mode.
    If not, select Scheduled crawl mode.
  3. To select a day, select the day from the Begin Crawl on drop-down list.
  4. To select the time when you want the crawl to begin, select the hour from the Start Hour drop-down list and the minutes from the Start Minute drop-down list.
  5. To limit the duration of the crawl, select the hours from the Duration Hour drop-down list and the minutes from the Duration Minute drop-down list.
    You can select a length of time up to 24 hours and 45 minutes.
  6. Click Save.

To schedule more crawls, click Add More Rows.
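
As a rough illustration of how a start time and a duration define a crawl window, the following Python sketch computes the latest end time for one schedule row. The function name, the constant, and the calendar date are hypothetical; only the 24 hours and 45 minutes ceiling and the 2 a.m. example come from this page.

```python
from datetime import datetime, timedelta

# Longest duration that can be selected, per step 5: 24 hours and 45 minutes.
MAX_DURATION = timedelta(hours=24, minutes=45)


def crawl_window(start: datetime, duration_hours: int, duration_minutes: int):
    """Return (start, latest_end) for one crawl schedule row (hypothetical helper)."""
    duration = timedelta(hours=duration_hours, minutes=duration_minutes)
    if duration > MAX_DURATION:
        raise ValueError("duration exceeds 24 hours and 45 minutes")
    # The crawl may finish earlier if all URLs are crawled before the limit.
    return start, start + duration


# A two-hour limit with a 2 a.m. start ends by 4 a.m. (the date is arbitrary).
start, end = crawl_window(datetime(2024, 1, 1, 2, 0), 2, 0)
print(end.strftime("%H:%M"))  # 04:00
```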

Deleting a Crawl Schedule

To delete a crawl schedule:

  1. Click Content Sources > Web Crawl > Crawl Schedule.
  2. Select the Delete checkbox for the crawl schedule that you want to remove.
  3. Click Save.

Configuring Index Removal and Backoff Intervals for URLs in Error States

If the Google Search Appliance receives an error when fetching a URL, it records the error in Index > Diagnostics > Index Diagnostics and takes one of the following actions:

  • Immediately removes the URL from the search index (for 404 errors)
  • Schedules a series of retries after certain time intervals, known as "backoff" intervals, before removing the URL from the index (for other errors, such as 401)

You can either use the search appliance default settings for index removal and backoff intervals, or configure the following options for the selected error state:

  • Immediate Index Removal: Select this option to immediately remove the URL from the index.
  • Number of Failures for Index Removal: Use this option to specify the number of times the search appliance is to retry fetching a URL before removing it from the index. The value must be 3 or less.
  • Successive Backoff Intervals (hours): Use this option to specify the backoff intervals, in hours, between successive retries.
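
To make the relationship between the number of failures and the backoff intervals concrete, the following Python sketch computes when each retry would occur, in hours after the first error. It assumes that one interval applies to each retry and that the URL is removed after the last failed retry; this page does not spell out that mapping. The function name and the interval values are hypothetical.

```python
from itertools import accumulate
from typing import List

MAX_FAILURES = 3  # "The value must be 3 or less."


def retry_offsets(failures_for_removal: int, backoff_hours: List[int]) -> List[int]:
    """Return the hours after the first error at which each retry occurs.

    Assumption: backoff_hours[i] is the wait before retry i + 1.
    """
    if not 0 <= failures_for_removal <= MAX_FAILURES:
        raise ValueError("number of failures must be between 0 and 3")
    if len(backoff_hours) < failures_for_removal:
        raise ValueError("need one backoff interval per retry")
    # Running total of the intervals gives the offset of each retry.
    return list(accumulate(backoff_hours[:failures_for_removal]))


# Hypothetical intervals of 1, 6, and 24 hours with three retries.
print(retry_offsets(3, [1, 6, 24]))  # [1, 7, 31]
```

Under these assumptions, a URL that still fails on the third retry would be removed 31 hours after the first error.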

To configure settings:

  1. Click Content Sources > Web Crawl > Crawl Schedule.
  2. Clear the checkbox next to Use defaults for backoff retries and index removal settings.
  3. Select a URL State from the pull-down menu.
  4. Edit the settings that you want to change.
  5. Optionally, to add another setting, click Add More Rows and repeat steps 3 and 4.
  6. Click Save.

To restore default settings:

  1. Click Content Sources > Web Crawl > Crawl Schedule.
  2. Click the checkbox next to Restore Defaults.
  3. Click Save.

Related Tasks

The following table lists tasks related to scheduling a crawl.

Task | Method
Starting a crawl | If your search appliance is in continuous crawl mode, you can start a crawl by clicking Resume Crawl on the Content Sources > Diagnostics > Crawl Status page. The crawl starts in fifteen minutes, and the change in status appears at that time. If your search appliance is in scheduled crawl mode, the crawl begins at the time that you have selected.
Stopping a crawl | You can stop a continuous crawl at any time. To stop the crawl, click Pause Crawl on the Content Sources > Diagnostics > Crawl Status page. To stop a scheduled crawl, change the crawl mode to continuous crawl, and then pause the crawl by using the Content Sources > Diagnostics > Crawl Status page. When a crawl is stopped, the documents that were crawled remain in the index. The index contains some old documents and some newly crawled documents.
Viewing crawl status | You can view the status of a scheduled crawl on the Content Sources > Diagnostics > Crawl Status page. To view the most recent status, click Refresh in your web browser.
Specifying the maximum number of concurrent connections open on every web server for crawling | Specify host load by using the Content Sources > Web Crawl > Host Load Schedule page.

For More Information

For detailed information about crawling, see "Administering Crawl," which is linked to the Google Search Appliance help center.


 
© Google Inc.