Admin Console Help
Content Sources > Web Crawl > Crawl Schedule
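Use the Content Sources > Web Crawl > Crawl Schedule page to perform the following tasks:
- Selecting a crawl mode
- Scheduling a crawl
- Deleting a crawl schedule
- Configuring index removal and backoff intervals for URLs in error states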
Before Starting these Tasks
Before selecting a crawl mode or scheduling a crawl, you must perform the following tasks:
- Specify URLs from which the search appliance should start crawling on the Content Sources > Web Crawl > Start and Block URLs page
- Configure servers for crawling
Selecting a Crawl Mode
Select one of the search appliance crawl modes, described in the following table. Both modes of crawling use the same URLs that are configured on the Content Sources > Web Crawl > Start and Block URLs page.
Mode |
Description |
Continuous crawl |
In this mode, the crawler automatically locates and indexes updated content. |
Scheduled crawl |
This mode gives you control over the time and the duration of all crawls. A scheduled crawl proceeds until one of the following events occurs:
- The time limit that you specified has passed.
- The crawler reaches the document limit specified by your license.
- The crawler reaches the limit that you set on the Content Sources > Web Crawl > Host Load Schedule page, under Maximum Number of URLs to Crawl.
- The crawler has crawled all reachable URLs.
|
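The four stop conditions for a scheduled crawl can be read as a simple check that runs as the crawl progresses. The following Python sketch is illustrative only: the `CrawlProgress` class, its field names, and the `stop_reason` function are hypothetical and are not part of the search appliance. The order of the checks is also an assumption, because the table does not say which limit takes precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlProgress:
    # Hypothetical fields that mirror the four conditions in the table.
    elapsed_minutes: int
    time_limit_minutes: Optional[int]  # None means no time limit was set
    documents_indexed: int
    license_document_limit: int
    urls_crawled: int
    max_urls_to_crawl: int  # Maximum Number of URLs to Crawl
    reachable_urls_remaining: int


def stop_reason(p: CrawlProgress) -> Optional[str]:
    """Return why a scheduled crawl would stop, or None if it continues."""
    if p.time_limit_minutes is not None and p.elapsed_minutes >= p.time_limit_minutes:
        return "time limit passed"
    if p.documents_indexed >= p.license_document_limit:
        return "license document limit reached"
    if p.urls_crawled >= p.max_urls_to_crawl:
        return "Maximum Number of URLs to Crawl reached"
    if p.reachable_urls_remaining == 0:
        return "all reachable URLs crawled"
    return None
```

In this sketch, a crawl with no time limit still stops as soon as any of the other three conditions is met.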
To select a crawl mode:
- Click Content Sources > Web Crawl > Crawl Schedule.
- Click the radio button for either Continuous crawl or Scheduled crawl mode.
- Click Save.
Once your selection is saved, the bottom part of the page displays information relevant to the chosen crawl mode--either the crawl schedule, or freshness tuning for continuous crawls.
Scheduling a Crawl
Before you can schedule a crawl, the search appliance must be in scheduled crawl mode. A crawl schedule allows you to integrate the crawl with any other system activities that occur on your servers, such as routine system backups. The following table describes options for scheduling a crawl.
Option |
Description |
Begin Crawl on |
Select a day of the week, or every day, for the crawl to begin. |
Start Hour |
Select an hour for the crawl to begin. |
Start Minute |
Select a minute for the crawl to begin. |
Duration Hour |
Select a duration in hours for the crawl. Together with Duration Minute, this option limits the crawl to a specific duration, which is expressed in hours and minutes. If you set a crawl time limit, the crawler runs for the specified number of hours and minutes or until it crawls all of the URLs. For example, if you set a time limit of two hours and schedule a start time of 2 a.m., the crawler crawls your servers from 2 a.m. to 4 a.m., unless it finishes crawling before the two-hour limit. |
Duration Minute |
Select a duration in minutes for the crawl. |
Delete |
Click to delete the crawl schedule. |
Add More Rows |
Click to add more rows to the crawl schedule. |
Save |
Click to save the crawl schedule. |
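The relationship between the start time and the duration options can be worked through with ordinary date arithmetic. The following Python sketch reproduces the 2 a.m. example from the table. The `crawl_window` function is hypothetical and the calendar date is arbitrary; the search appliance itself is configured only through the Admin Console.

```python
from datetime import datetime, timedelta


def crawl_window(start: datetime, duration_hours: int, duration_minutes: int):
    """Return (start, latest_end) for one row of the crawl schedule.

    The crawl may finish earlier if all URLs are crawled before latest_end.
    """
    latest_end = start + timedelta(hours=duration_hours, minutes=duration_minutes)
    return start, latest_end


# Start Hour = 2, Start Minute = 0, Duration Hour = 2, Duration Minute = 0
start, latest_end = crawl_window(datetime(2024, 1, 1, 2, 0), 2, 0)
print(start.strftime("%H:%M"), "to", latest_end.strftime("%H:%M"))  # 02:00 to 04:00
```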
To schedule a crawl:
- Click Content Sources > Web Crawl > Crawl Schedule.
- Ensure that search appliance crawling is in scheduled crawl mode.
If not, select Scheduled crawl mode.
- To select a day, select the day from the Begin Crawl on drop-down list.
- To select the time when you want the crawl to begin, select the hour from the Start Hour drop-down list and the minutes from the Start Minute drop-down list.
- To limit the duration of the crawl, select the hours from the Duration Hour drop-down list and the minutes from the Duration Minute drop-down list.
You can select a length of time up to 24 hours and 45 minutes.
- Click Save.
To schedule more crawls, click Add More Rows.
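If you plan several schedule rows before entering them, a small check can catch out-of-range values early. The following Python sketch is a planning aid only. The `validate_row` function and the day labels are assumptions, and the real drop-down lists may offer a narrower set of values. The 24 hours and 45 minutes upper bound is the one stated in the procedure above.

```python
# Assumed labels for the Begin Crawl on drop-down list.
DAYS = {
    "Everyday", "Sunday", "Monday", "Tuesday",
    "Wednesday", "Thursday", "Friday", "Saturday",
}
MAX_DURATION_MINUTES = 24 * 60 + 45  # 24 hours and 45 minutes


def validate_row(day, start_hour, start_minute, duration_hour, duration_minute):
    """Return a list of problems found in one planned schedule row."""
    errors = []
    if day not in DAYS:
        errors.append(f"unknown day: {day}")
    if not 0 <= start_hour <= 23:
        errors.append("Start Hour must be between 0 and 23")
    if not 0 <= start_minute <= 59:
        errors.append("Start Minute must be between 0 and 59")
    if duration_hour < 0 or not 0 <= duration_minute <= 59:
        errors.append("duration values are out of range")
    elif duration_hour * 60 + duration_minute > MAX_DURATION_MINUTES:
        errors.append("duration exceeds 24 hours and 45 minutes")
    return errors
```

An empty list means the row passed every check in this sketch; it does not guarantee that the Admin Console accepts the same values.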
Deleting a Crawl Schedule
To delete a crawl schedule:
- Click Content Sources > Web Crawl > Crawl Schedule.
- Click the delete checkbox for the crawl schedule you want to remove.
- Click Save.
Configuring Index Removal and Backoff Intervals for URLs in Error States
If the Google Search Appliance receives an error when fetching a URL, it records the error in Index > Diagnostics > Index Diagnostics and takes one of the following actions:
- Immediately removes the URL from the search index (for 404 errors), or
- Schedules a series of retries after certain time intervals, known as "backoff" intervals, before removing the URL from the index (for other errors, such as 401)
You can either use the search appliance default settings for index removal and backoff intervals, or configure the following options for the selected error state:
- Immediate Index Removal--Select this option to immediately remove the URL from the index.
- Number of Failures for Index Removal--Use this option to specify the number of times the search appliance is to retry fetching a URL. The value must be 3 or less.
- Successive Backoff Intervals (hours)--Use this option to specify the number of hours that the search appliance waits before each successive retry.
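To see how these three options interact, it can help to lay out the retry times for a single URL. The following Python sketch is a rough model, not the search appliance's actual algorithm. The `retry_plan` function is hypothetical, and it assumes that each retry follows the previous attempt by the next backoff interval, that the last interval is reused if fewer intervals than retries are given, and that the URL is removed when the final retry fails.

```python
from datetime import datetime, timedelta

MAX_FAILURES = 3  # the page states that the value must be 3 or less


def retry_plan(first_failure, backoff_hours, failures_for_removal,
               immediate_removal=False):
    """Return (retry_times, removal_time) for one URL in an error state."""
    if immediate_removal:
        # Immediate Index Removal: no retries are scheduled.
        return [], first_failure
    # The lower bound of 1 is an assumption of this sketch.
    if not 1 <= failures_for_removal <= MAX_FAILURES:
        raise ValueError("Number of Failures for Index Removal must be 1 to 3")
    if not backoff_hours:
        raise ValueError("at least one backoff interval is required")
    retries = []
    when = first_failure
    for attempt in range(failures_for_removal):
        # Reuse the last interval if fewer intervals than retries were given.
        hours = backoff_hours[min(attempt, len(backoff_hours) - 1)]
        when = when + timedelta(hours=hours)
        retries.append(when)
    return retries, when
```

For example, under these assumptions, a first failure at midnight with intervals of 1, 2, and 4 hours and three failures allowed gives retries at 1 a.m., 3 a.m., and 7 a.m., with removal at 7 a.m. if every retry fails.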
To configure settings:
- Click Content Sources > Web Crawl > Crawl Schedule.
- Clear the checkbox next to Use defaults for backoff retries and index removal settings.
- Select a URL State from the pull-down menu.
- Edit the settings that you want to change.
- Optionally, to add another setting, click Add More Rows and repeat the previous two steps.
- Click Save.
To restore default settings:
- Click Content Sources > Web Crawl > Crawl Schedule.
- Click the checkbox next to Restore Defaults.
- Click Save.
Related Tasks
The following table lists tasks related to scheduling a crawl.
Task |
Method |
Starting a crawl |
If your search appliance is in continuous crawl mode, you can start a crawl immediately by clicking Resume Crawl on the Content Sources > Diagnostics > Crawl Status page. The crawl starts in fifteen minutes and the change in status appears at that time. |
If your search appliance is in scheduled crawl mode, the crawl begins at the time you have selected. |
Stopping a crawl |
You can stop a continuous crawl at any time. To stop the crawl, click Pause Crawl on the Content Sources > Diagnostics > Crawl Status page. If you want to stop a scheduled crawl, change the crawl mode to continuous crawl, and then pause the crawl by using the Content Sources > Diagnostics > Crawl Status page.
When a crawl is stopped, the documents that were crawled remain in the index. The index contains some old documents and some newly crawled documents. |
Viewing crawl status |
You can view the status of a scheduled crawl on the Content Sources > Diagnostics > Crawl Status page. To view the most recent status, click Refresh in your web browser. |
Specifying the maximum number of concurrent connections open on every web server for crawling |
Specify host load by using the Content Sources > Web Crawl > Host Load Schedule page. |
For More Information
For detailed information about crawling, see "Administering Crawl," which is linked from the Google Search Appliance help center.