Admin Console Help


Content Sources > Diagnostics > Crawl Queue

Use the Content Sources > Diagnostics > Crawl Queue page to define crawl queue snapshots and view information about the crawl queue. This help page contains information about the following topics:

  About the Crawl Queue and Crawl Queue Snapshots
  Before Starting these Tasks
  Creating Crawl Queue Snapshots
  Viewing Crawl Queue Snapshots

About the Crawl Queue and Crawl Queue Snapshots

The crawl queue is the set of URLs that the appliance is waiting to crawl or that are overdue to be crawled. Use this information to help you determine whether specific hosts are crawled at the right time and why information from certain documents is fresher than information from other documents.

Currently Queued Hosts shows a list of hosts that the search appliance is waiting to crawl and the associated host load. The queue continuously changes as the crawler processes the queued hosts. To ensure that the list of hosts is up-to-date, click Refresh.

The crawl queue is dynamic: it changes continuously as the crawler processes new URLs. A crawl queue snapshot therefore reflects the state of the queue at the moment of capture; it does not indicate what will happen in the future.

The table of crawl queue snapshots lists previously captured snapshots. You can view the contents of a snapshot, export it to a comma-separated value (.csv) file, or delete it.
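Exported snapshots can be processed with ordinary CSV tooling. The sketch below groups a snapshot's URLs by host; the column names are assumptions based on the fields described on this page (PageRank is not included in exports), so adjust them to match your actual export:

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical example of an exported crawl queue snapshot. The real
# export's column names may differ; these are taken from the fields
# described on this page.
SNAPSHOT_CSV = """\
URL,Last Crawled Time,Next Scheduled Time,Change Interval
http://intranet.example.com/index.html,2014-01-10 08:00,2014-01-12 08:00,2 days
http://intranet.example.com/news.html,2014-01-11 14:30,2014-01-11 20:30,6 hours
"""

def urls_by_host(csv_text):
    """Count the snapshot's queued URLs per host."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        host = urlparse(row["URL"]).netloc
        counts[host] = counts.get(host, 0) + 1
    return counts

print(urls_by_host(SNAPSHOT_CSV))  # → {'intranet.example.com': 2}
```

A per-host count like this can help confirm whether one host is dominating the queue.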

A snapshot appears in the table as soon as you start the capture by clicking Capture Crawl Queue. When a snapshot is in progress, its status is Capturing, and it has a Cancel link. Once the snapshot is complete, the Cancel link becomes a Delete link.

Before Starting these Tasks

Before you start these tasks, ensure that crawl patterns are entered on the Admin Console and the crawler is running. Note that creating a crawl queue snapshot is resource-intensive and reduces the crawler's performance.

Creating Crawl Queue Snapshots

Use these instructions to create a crawl queue snapshot. Note that the search appliance can create only one crawl queue snapshot at a time; after you start a snapshot, wait until it is complete before starting another.

To define a crawl queue snapshot:

  1. Click Content Sources > Diagnostics > Crawl Queue.
  2. In the Name field, type a report name of up to 20 characters, consisting of alphanumeric characters, hyphens, and underscores. The report name cannot start with a hyphen.
  3. In the Number of URLs to include field, specify the number of queued URLs to include in the snapshot. The number can be from 1 to 100,000.
  4. In the Forthcoming hours to include field, specify the number of future hours of scheduling to include in the snapshot, starting from the current time.
  5. To limit the snapshot to a single host, click Include URLs from this host only and enter a hostname. By default, the crawl queue snapshot includes all hosts.
  6. Click Capture Crawl Queue.
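If you generate snapshot names from a script, the naming rules in step 2 can be checked up front. This is a sketch assuming the allowed characters are alphanumerics, hyphens, and underscores; the appliance's actual validation may differ:

```python
import re

# Name rules described above: at most 20 characters, no leading hyphen.
# The allowed character set (alphanumerics, hyphens, underscores) is an
# assumption for this sketch.
NAME_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_-]{0,19}$")

def is_valid_snapshot_name(name):
    """Return True if the name satisfies the rules sketched above."""
    return bool(NAME_RE.fullmatch(name))
```

For example, `is_valid_snapshot_name("daily_crawl-check")` passes, while a name starting with a hyphen or longer than 20 characters is rejected.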

Viewing Crawl Queue Snapshots

To display snapshots for a particular host, click View, then click the host name.

Each snapshot contains the following information.

PageRank
The PageRank of the resource to be crawled. PageRank is one of the factors that influence a resource's position in the queue, allowing more important documents to be crawled more frequently than less important ones. Note that PageRank information is not included when you export a snapshot.

Last Crawled Time
The last time this URL was crawled.

Next Scheduled Time
The time at which the resource is scheduled to be crawled. This time can change, and is affected by PageRank, queue backlog, and other factors. Overdue items are shown in red until they are crawled. If the same item remains overdue across snapshots, you might want to investigate further. Next Scheduled Time appears only after the initial crawl of a URL.

Change Interval
The frequency of expected changes to this resource. After the initial crawl, the appliance initializes the change interval for all URLs to two days, then adjusts it based on the actual change frequency of the URL. Each time the appliance crawls a URL, it learns whether the resource has changed since the previous crawl: if the resource changed, the change interval is shortened; if it did not, the change interval is lengthened. You can influence this calculation by using the Freshness Tuning feature. Change Interval appears only after the initial crawl of a URL.

URL
The resource whose content is crawled.
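The adaptive change-interval behavior described above can be sketched as follows. The adjustment factors and bounds here are illustrative assumptions, not the appliance's documented values; only the two-day starting point comes from this page:

```python
# Illustrative sketch of the adaptive change interval: shortened when a
# page has changed, lengthened when it has not. The shrink/grow factors
# and the bounds are assumptions for this sketch.
INITIAL_INTERVAL_HOURS = 48  # two days after the initial crawl

def adjust_interval(hours, changed,
                    shrink=0.5, grow=1.5,
                    min_hours=1, max_hours=24 * 30):
    """Return the next change interval after one crawl observation."""
    hours = hours * shrink if changed else hours * grow
    return max(min_hours, min(max_hours, hours))

interval = INITIAL_INTERVAL_HOURS
for changed in (True, True, False):
    interval = adjust_interval(interval, changed)
# interval progresses 48 -> 24 -> 12 -> 18 hours
```

A frequently changing page thus converges toward short intervals (crawled often), while a static page drifts toward the maximum interval.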

© Google Inc.