Crawl Diagnostics
Use the Crawl Diagnostics page to determine the status of each URL that the search appliance is configured to crawl and to create reports on URL status.
Before Starting these Tasks
Before performing any of the tasks described on this page, examine the Crawl and Index > Crawl URLs page and ensure that the crawl is configured correctly.
About Crawl Diagnostics
The Crawl Diagnostics page provides reports on the status of each URL that the search appliance is configured to crawl. By default, it displays the status of all hosts in the default collection, in tree format. The default report shows the following information about the top-level URLs:
| Column | Description |
| --- | --- |
| Host Name | The names of host computers crawled by the search appliance. |
| Crawled URLs | The total number of URLs crawled successfully by the search appliance. |
| Retrieval Errors | The total number of URLs the search appliance did not crawl because of errors such as empty documents, connection failures, unreachable servers, authentication failures, and HTTP 404 (file not found) errors. To see a full list of errors that might occur, open the URL Status drop-down list. |
| Excluded URLs | The total number of files excluded from the crawl, either because the crawl patterns explicitly exclude all files with a certain file extension or because the URLs are not included in the crawl patterns. |
You can view crawl diagnostic information in two formats:
- Tree format, in which URLs are displayed hierarchically by host and directory. To view URLs in tree format, click a host and navigate through the directory structure that is displayed.
- List format, in which URLs are displayed in a list.
You can switch between tree and list formats at any time by clicking the appropriate radio button. The URLs can be sorted by domain, host name, directory, or URL. When you navigate to file-level diagnostic details, you see the following:
- PageRank - the relevancy rank in the index that this URL received.
- File/Directory - file names and directories.
- Crawl Status - click the links to view all of the URLs, the ones that were crawled successfully, the ones that had errors, or the ones that were excluded:
  - Successful - the total number of URLs crawled at the time of viewing this page.
  - Errors - the URLs that the crawler could not reach because the server where the crawl was attempted returned an error for them, possibly due to network problems. Depending on the error, the crawler retries crawling some URLs. When the crawl status reports an error, it displays the error, such as: "Retrying url: Host unreachable while trying to fetch robots.txt."
  - Excluded - the URLs that were discovered but dropped and not crawled at all. Some reasons for exclusion are the existence of a robots.txt file, an entry in Do Not Crawl URLs with the Following Patterns, or an excluded document type, such as a GIF file.
- Time Crawled - the date and time that the URL was crawled.
When you navigate to individual document details, you see More information about this page, which includes:
- Url After Redirects - a link to the final URL for the page, after redirects. If the URL is for a subfile in a compressed file, this field is not a link. For a link to the subfile, see Display URL.
- Link to this page - a link to the URL of the page.
- Cached version link - a link to the page in cache.
- PageRank - the relevancy rank in the index that this URL received.
- Last modified - the date when the document was last modified; this is not the document date determined by document date rules in the Crawl and Index > Document Dates page.
- Authentication Method at Crawl Time - if the URL was crawled with security, it shows the type of security.
- Security at Serve Time - if the URL is marked as secure, it shows the type of serving security.
- Number of links on this page to crawled pages - the number of links to pages crawled by the search appliance. The maximum value is 2000. If the search appliance crawls and indexes more than 2000 links in a document (for example, 2500), this value still shows only 2000.
- A link to a list of public crawled pages that link to this page.
- A link to a list of all crawled pages that link to this page.
- Crawl frequency for the URL.
- Download time in ms - the time to download the document for indexing, in milliseconds.
- Content Type of the URL.
- Preview Status - Status of the document preview image. Status can be NOT YET STARTED, INIT, PENDING, CONNECT, DOWNLOAD, CONVERT, READY, FAILED, CANCELLED, ABORTED.
- Content Size of the URL.
- Language - document language.
- Encoding - document encoding.
- Currently Inflight - whether or not the URL is currently being crawled.
- Display URL: The display URL for the subfile in a compressed file. Clicking on the link opens the subfile.
- Feed Info - shows the following information about the feed:
- Feed Type - the type of feed (full, incremental, and so on).
- Datasource - the feed datasource.
- Display URL - the URL that is displayed to the user in search results.
- This page is in the following collections - with a list of collections.
- This page has the following Access Control List (ACL) - an ACL can include the following components: permitted users, permitted groups, denied users, and denied groups. This section shows any components that are present in the ACL in the principal format (Namespace) :: [Domain] name; see the example after this list.
- ACL inheritance - a page can have ACLs that inherit permissions from a chain of parent ACLs. If ACL inheritance applies to this page, the ACL inheritance chain for the page is displayed. For information on ACL inheritance, see "Managing Search for Controlled-Access Content: Crawl, Index, and Serve," which is linked to the Google Search Appliance help center.
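As an illustration of the principal format, ACL components might be shown as follows; the namespace, domain, and account names here are hypothetical:

```
Permitted Users:   (Default) :: [SALES] jsmith
Permitted Groups:  (Default) :: [SALES] engineering
Denied Groups:     (Default) :: [SALES] contractors
```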
You can also export files in two different formats for offline analysis:
- As a .csv file that you can analyze in a spreadsheet program or other software that can open .csv files. This report includes only the data displayed on the Admin Console page, not the entire list of URLs.
- As an .xml file that follows the standard Google Sitemaps Protocol format. This report can have up to 10,000 URLs in a 10MB file.
Viewing a URL Report in Tree Format
Tree format reports display URLs in a tree structure.
To view a report in tree format:
- Set the URL display mode to Tree format.
- Click a URL in the Host Name column or enter the URL pattern and port of the host you want to see in the URLs starting with field. Most special characters are handled correctly when they are entered in the field. If you need diagnostic information for a file whose name contains a percent sign (%), enter %25 (the URL-encoded form of %) in the search box on this page to display all filenames that contain percent signs.
- Click the Show URLs button.
- In the URL Status drop-down list, select the state of the URLs that you want to see, or leave the selection at Any status.
- Click Include to include only states you are interested in or Exclude to filter out URLs with states that you are not interested in. For example, if you want to see only documents that have been successfully crawled, you can select Retrieval error in the drop-down list and click Exclude.
- Click the folder names to navigate to file-level details.
If you want the document recrawled immediately, click Recrawl this URL. This action submits an immediate recrawl request to the search appliance to download the URL. Also, the crawl status of this URL changes to "Crawled: New Document," even if the page has not been changed.
If you want this URL pattern recrawled immediately, click Recrawl this pattern. This action submits an immediate recrawl request to the search appliance to download the URLs that match this pattern. Also, the crawl status of those URLs changes to "Crawled: New Document," even if the pages have not been changed.
Viewing a URL Report in List Format
List format reports display URLs in a flat list. The diagnostic report shows 100 URLs per page. To see more URLs, click More at the bottom of the page.
To view a report in list format:
- Set the URL display mode to List format.
- In the URL Status drop-down list, select the state of the URLs that you want to see, or leave the selection at Any status.
- Click Include to include only states you are interested in or Exclude to filter out URLs with states that you are not interested in. For example, if you want to see only documents that have been successfully crawled, you can select Crawled in the drop-down list, and click Include.
- Click a file name to view file-level details.
If you want the document recrawled immediately, click Recrawl this URL. This action submits an immediate recrawl request to the search appliance to download the URL. Also, the crawl status of this URL changes to "Crawled: New Document," even if the page has not been changed.
If you want this URL pattern recrawled immediately, click Recrawl this pattern. This action submits an immediate recrawl request to the search appliance to download the URLs that match this pattern. Also, the crawl status of those URLs changes to "Crawled: New Document," even if the pages have not been changed.
Exporting Reports in .CSV Format
To export a report in a .csv file:
- Set the URL display mode to Tree format.
- Select the data that you want to include or exclude by using the URL Status drop-down list. For more detailed instructions, see Viewing a URL Report in Tree Format.
- Click the Export All Pages to a File button.
The File Download wizard appears.
- Click Save and browse to a location where you want to save the file. The file name offered describes the collection in this format: CrawlDiagnostics_<collection_name>_<host_name_port>_.csv
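If you analyze the exported file with a script rather than a spreadsheet, a minimal sketch along these lines can summarize it. The file name is an example, and because the exact column headers of the export are not listed here, the script discovers them from the header row; the presence of a status-like column is an assumption.

```python
# A minimal sketch: summarize an exported Crawl Diagnostics .csv offline.
import csv
from collections import Counter

# Example file name following the format described above (hypothetical values).
path = "CrawlDiagnostics_default_collection_www.example.com_80_.csv"

with open(path, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    print("Columns:", reader.fieldnames)  # discover the actual headers
    rows = list(reader)

print(f"{len(rows)} URLs exported")

# If the export includes a crawl-status column (an assumption), tally it.
status_col = next(
    (c for c in (reader.fieldnames or []) if "status" in c.lower()), None
)
if status_col:
    for status, count in Counter(r[status_col] for r in rows).most_common():
        print(f"{count:6d}  {status}")
```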
Exporting Reports in XML Format
The exported report is an .xml file that follows the standard Google Sitemaps Protocol format. The report can have up to 10,000 URLs in a 10MB file and might include the following information:
- Loc - the full URL of the document
- LastCrawled - the date when the document was last crawled
- ChangeFreq - the predicted frequency of the document update
- Priority - the priority of this URL relative to other URLs on your site (the priority ranges from 0.0 to 1.0; by default, documents are given a priority of 0.5)
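For reference, a single entry in the exported file might look like the following sketch. The element names follow the standard Sitemaps protocol (loc, lastmod, changefreq, priority); treat the exact tags and casing the appliance emits as an assumption inferred from the field list above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/docs/report.html</loc>  <!-- Loc -->
    <lastmod>2011-04-05</lastmod>                       <!-- LastCrawled -->
    <changefreq>weekly</changefreq>                     <!-- ChangeFreq -->
    <priority>0.5</priority>            <!-- Priority; 0.5 is the default -->
  </url>
</urlset>
```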
If your site contains more than 10,000 URLs or your Sitemap is bigger than 10MB, you must create multiple Sitemap files and use a Sitemap index file. You should use a Sitemap index file even if you have a small site but plan on growing beyond 10,000 URLs or a file size of 10MB.
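A Sitemap index file is itself a small .xml file that lists the individual Sitemap files, per the Sitemaps protocol. A minimal sketch, with placeholder file names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>
```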
When a URL is first discovered, it will not have ChangeFreq data. The lack of data for this field does not violate the Sitemap file protocol, since it only requires Loc data; all other XML tags are optional.
To export an .xml file that follows the Sitemaps format:
- Set the URL display mode to List format.
- Select the data that you want to include or exclude by using the URL Status drop-down list. For more detailed instructions, see Viewing a URL Report in List Format.
- Click the Export All Pages to a File button.
The File Download wizard appears.
- Click Save and browse to a location where you want to save the file. The file name offered describes the collection in this format: CrawlDiagnostics_<collection_name>_<host_name_port>_.xml
For More Information
For more information about Sitemaps, see the Google Sitemaps Protocol documentation, available from Google Webmaster Central.
For more information on crawl, see "Administering Crawl," which is linked to the Google Search Appliance help center.