Introduction
For information about specific feature limitations, see Specifications and Usage Limits.
Deprecation Notices
On-Board File System Crawling
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. It will be removed in a future release. If you have configured on-board file system crawling for your GSA, install and configure the Google Connector for File Systems 4.0.4 or later instead. For more information, see “Deploying the Connector for File Systems,” available from the Connector Documentation page.
On-Board Database Crawler
In GSA release 7.4, the on-board database crawler is deprecated. It will be removed in a future release. If you have configured on-board database crawling for your GSA, install and configure the Google Connector for Databases 4.0.4 or later instead. For more information, see “Deploying the Connector for Databases,” available from the Connector Documentation page.
What Is Search Appliance Crawling?
The search appliance visits the Missitucky University home page, then it:
Crawl Modes
The Google Search Appliance supports two modes of crawling:
For information about choosing a crawl mode and starting a crawl, see Selecting a Crawl Mode.
Continuous Crawl
In continuous crawl mode, the search appliance is crawling your enterprise content at all times, ensuring that newly added or updated content is added to the index as quickly as possible. After the Google Search Appliance is installed, it defaults to continuous crawl mode and establishes the default collection (see Default Collection).
The search appliance does not recrawl any URLs until all new URLs have been discovered or the license limit has been reached (see What Is the Search Appliance License Limit?). A URL in the index is recrawled even if there are no longer any links to that URL from other pages in the index.
Scheduled Crawl
What Content Can Be Crawled?
Crawling FTP is not supported on the Google Search Appliance.
Public Web Content
Secure Web Content
The search appliance can crawl and index content protected by forms-based single sign-on systems.
Content from Network File Shares
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices.
For a complete list of supported file formats, refer to Indexable File Formats.
Databases
In GSA release 7.4, the on-board database crawler is deprecated. For more information, see Deprecation Notices.
For information about crawling databases, refer to Database Crawling and Serving.
Compressed Files
For more information, refer to Crawling and Indexing Compressed Files.
What Content Is Not Crawled?
Also, the Google Search Appliance cannot:
The following sections describe all these exclusions.
Content Prohibited by Crawl Patterns
A Google Search Appliance administrator can prohibit the crawler from following and indexing particular URLs. For example, any URL that should not appear in search results or be counted as part of the search appliance license limit should be excluded from crawling. For more information, refer to Configuring a Crawl.
Content Prohibited by a robots.txt File
To prohibit any crawler from accessing all or some of the content on an HTTP or HTTPS site, a content server administrator or webmaster typically adds a robots.txt file to the root directory of the content server or Web site. This file tells crawlers to ignore all or some files and directories on the server or site. Documents crawled using other protocols, such as SMB, are not affected by the restrictions of robots.txt. For the Google Search Appliance to be able to access the robots.txt file, the file must be public. For examples of robots.txt files, see Using robots.txt to Control Access to a Content Server.
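For illustration only, a minimal robots.txt file might look like the following (the directory name is hypothetical):

```
User-agent: *
Disallow: /private/
```

This example tells all crawlers, including the search appliance, not to crawl anything under the /private/ directory; an empty Disallow line would instead permit crawling of the whole site.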
For detailed information about HTTP status codes, visit http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
Content Excluded by the nofollow Robots META Tag
The Google Search Appliance does not crawl a Web page if it has been marked with the nofollow Robots META tag (see Using Robots meta Tags to Control Access to a Web Page).
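For reference, the standard form of this tag appears in the head section of a page (a generic sketch; see the referenced section for the exact forms the search appliance honors):

```
<head>
  <meta name="robots" content="nofollow">
</head>
```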
Links within the area Tag
For example, the following HTML defines a region that contains a link:
<map name="n5BDE56.Body.1.4A70">
<area shape="rect" coords="0,116,311,138" id="TechInfoCenter"
href="http://www.bbb.com/main/help/ourcampaign/ourcampaign.htm" alt="">
</map>
Unlinked URLs
•	Using a jump page (see Ensuring that Unlinked URLs Are Crawled), which is a page that can provide links to pages that are not linked to from any other pages. List unlinked URLs on a jump page and add the URL of the jump page to the crawl path.
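A jump page can be sketched as a simple HTML file that does nothing but list the unlinked URLs (the host and page names below are hypothetical):

```
<html>
  <body>
    <!-- Links to pages that are not linked from any other page -->
    <a href="http://intranet.example.com/reports/q1.html">Q1 report</a>
    <a href="http://intranet.example.com/reports/q2.html">Q2 report</a>
  </body>
</html>
```

Adding this page's own URL to Start URLs makes the listed pages discoverable by the crawler.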
Configuring the Crawl Path and Preparing the Content
Before crawling starts, the Google Search Appliance administrator configures the crawl path (see Configuring a Crawl), which includes URLs where crawling should start, as well as URL patterns that the crawler should follow and should not follow. Other information that webmasters, content owners, and search appliance administrators typically prepare before crawling starts includes:
How Does the Search Appliance Crawl?
About the Diagrams in this Section
Crawl Overview
The following diagram provides an overview of the following major crawling processes:
The sections following the diagram provide details about each of these major processes.
Starting the Crawl and Populating the Crawl Queue
After configuring the crawl path and preparing content for crawling, the search appliance administrator starts a continuous or scheduled crawl (see Selecting a Crawl Mode). The following diagram provides an overview of starting the crawl and populating the crawl queue.
The crawl queue is initially populated with the start URLs that the search appliance administrator has configured. URLs in the queue are prioritized based on Enterprise PageRank, the last time each URL was crawled, and its estimated change frequency.
Attempting to Fetch a URL and Indexing the Document
If the search appliance successfully fetches a URL, it downloads the document. If you have enabled and configured infinite space detection, the search appliance uses the document's checksum to test whether there are already 20 documents with the same checksum in the index (20 is the default value, but you can change it when you configure infinite space detection). If there are already 20 documents with the same checksum in the index, the document is considered a duplicate and discarded (in Index Diagnostics, the document is shown as “Considered Duplicate”). If there are fewer than 20, the search appliance caches the document for indexing. For more information, refer to Enabling Infinite Space Detection.
When fetching documents from a slow server, the search appliance paces the process so that it does not cause server problems. The search appliance administrator can also adjust the number of concurrent connections to a server by configuring the web server host load schedule (see Configuring Web Server Host Load Schedules).
Determining Document Changes with If-Modified-Since Headers and the Content Checksum
To detect changes to a cached document when recrawling it, the search appliance:
If the checksum has changed since the last modification time, the search appliance determines the size of the file (see File Type and Size), modifies the file as necessary, follows newly discovered links within the document (see Following Links within the Document), and indexes the document.
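This check corresponds to a standard HTTP conditional request. A sketch of the exchange, with a hypothetical host, path, and date:

```
GET /docs/page.html HTTP/1.1
Host: www.example.com
If-Modified-Since: Mon, 02 Mar 2015 08:00:00 GMT

HTTP/1.1 304 Not Modified
```

A 304 Not Modified response lets the search appliance skip the download entirely; a 200 OK response returns the content, whose checksum is then compared against the cached copy as described above.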
Fetching URLs from File Shares
In GSA release 7.4, on-board file system crawling (File System Gateway) is deprecated. For more information, see Deprecation Notices.
When the Google Search Appliance fetches a URL from a file share, the object that it actually retrieves and the method of processing it depend on the type of object that is requested. For each type of object requested, the following table provides an overview of the process that the search appliance follows. For information on how these objects are counted as part of the search appliance license limit, refer to When Is a Document Counted as Part of the License Limit?.
Because of limitations of the share listing process, a share name is not returned if it uses non-ASCII characters or exceeds 12 characters in length. To work around this limitation, you can specify the share itself in Start URLs on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.
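For example, a start URL that specifies the share itself might look like the following (the host and share names are hypothetical, and this assumes the smb:// scheme used for file share crawling):

```
smb://fileserver.example.com/longsharename/
```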
File Type and Size
To change the maximum file size, enter new values on the Content Sources > Web Crawl > Host Load Schedule page. For more information about setting the maximum file size to download, click Admin Console Help > Content Sources > Web Crawl > Host Load Schedule.
By default, the search appliance indexes up to 2.5MB of each text or HTML document, including documents that have been truncated or converted to HTML. You can change the default by entering a new amount of up to 10MB. For more information, refer to Changing the Amount of Each Document that Is Indexed.
Compressed document types, such as Microsoft Office 2007 files, might not be converted properly if the uncompressed file size is greater than the maximum file size. In these cases, you see a conversion error message on the Index > Diagnostics > Index Diagnostics page.
LINK Tags in HTML Headers
The search appliance indexes LINK tags in HTML headers. However, it strips these headers from cached HTML pages to avoid cross-site scripting (XSS) attacks.
Following Links within the Document
For each document that it indexes, the Google Search Appliance follows newly discovered URLs (HTML links) within that document. When following URLs, the search appliance observes the index limit that is set on the Index > Index Settings page in the Admin Console. For example, if the index limit is 5MB, the search appliance only follows URLs within the first 5MB of a document. There is no limit to the number of URLs that can be followed from one document.
Before following a newly discovered link, the search appliance checks the URL against:
The search appliance crawler only follows HTML links in the following format:
<a href="/page2.html">link to page 2</a>
It follows HTML links in PDF files, Word documents, and Shockwave documents. The search appliance also supports JavaScript crawling (see JavaScript Crawling) and can detect links and content generated dynamically through JavaScript execution.
When Does Crawling End?
The Google Search Appliance administrator can end a continuous crawl by pausing it (see Stopping, Pausing, or Resuming a Crawl).
The search appliance administrator can configure a scheduled crawl to end at a specified time. A scheduled crawl also ends when the license limit is reached (see What Is the Search Appliance License Limit?). The following table provides more details about the conditions that cause a scheduled crawl to end.
When Is New Content Available in Search Results?
For both scheduled crawls and continuous crawls, documents usually appear in search results approximately 30 minutes after they are crawled. This period can increase if the system is under a heavy load, or if there are many non-HTML documents (see Non-HTML Content).
How Are URLs Scheduled for Recrawl?
1.	URLs that are designated for recrawl by the administrator. For example, URLs are recrawled when you request a certain URL pattern to be crawled by using the Content Sources > Web Crawl > Start and Block URLs, Content Sources > Web Crawl > Freshness Tuning, or Index > Diagnostics > Index Diagnostics page in the Admin Console, or when URLs are sent in web feeds where the crawl-immediately attribute for the record is set to true.
2.	URLs that are set to crawl frequently on the Content Sources > Web Crawl > Freshness Tuning page and have not been crawled in the last 23 hours.
If you need to give URLs high priority, you can do a few things to change their priority:
•	You can submit a recrawl request by using the Content Sources > Web Crawl > Start and Block URLs, Content Sources > Web Crawl > Freshness Tuning, or Index > Diagnostics > Index Diagnostics pages, which gives the URLs the highest priority possible.
•	You can add a URL to the Crawl Frequently list on the Content Sources > Web Crawl > Freshness Tuning page, which ensures that the URL gets crawled about every 24 hours.
To see how often a URL has been recrawled in the past, as well as the status of the URL, you can view the crawl history of a single URL by using the Index > Diagnostics > Index Diagnostics page in the Admin Console.
How Are Network Connectivity Issues Handled?
What Is the Search Appliance License Limit?
Google Search Appliance License Limit
Google recommends managing crawl patterns on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console to ensure that the total number of URLs that match the crawl patterns remains at or below the license limit.
When Is a Document Counted as Part of the License Limit?
If there are one or more robots meta tags embedded in the head of a document, they can affect whether the document is counted as part of the license limit. For more information about this topic, see Using Robots meta Tags to Control Access to a Web Page.
To view license information for your Google Search Appliance, use the Administration > License page. For more information about this page, click Admin Console Help > Administration > License in the Admin Console.
License Expiration and Grace Period
To configure your search appliance to receive email notifications, use the Administration > System Settings page. For more information about this page, click Admin Console Help > Administration > System Settings in the Admin Console.
How Many URLs Can Be Crawled?
If the Google Search Appliance has reached the maximum number of URLs that can be crawled, this number appears in URLs Found That Match Crawl Patterns on the Content Sources > Diagnostics > Crawl Status page in the Admin Console.
For an overview of the priorities assigned to URLs in the crawl queue, see Starting the Crawl and Populating the Crawl Queue.
How Are Document Dates Handled?
To enable search results to be sorted and presented based on dates, the Google Search Appliance extracts dates from documents according to rules configured by the search appliance administrator (see Defining Document Date Rules).
If no date is found, the search appliance indexes the document without a date.
Are Documents Removed From the Index?
The search appliance administrator can also manually remove documents from the index (see Removing Documents from the Index).
Removing all links to a document in the index does not remove the document from the index.
Document Removal Process
The following conditions cause documents to be removed from the index:
•	The number of URLs in the index exceeds its limit, which is the value of Maximum number of pages overall on the Administration > License page.
•	The URL patterns that determine which content is included in the index are modified. These are the start URLs, follow patterns, and do not follow patterns specified on the Content Sources > Web Crawl > Start and Block URLs page. If these URL patterns are modified, the search appliance examines each document in the index to determine whether it should be retained or removed.
•	Documents are removed from content servers (see What Happens When Documents Are Removed from Content Servers?).
Note: Search appliance software versions prior to 4.6 include a process called the “remove doc ripper.” This process removes documents from the index every six hours. If the appliance has crawled more documents than its license limit, the ripper removes documents that are below the Enterprise PageRank threshold. The ripper also removes documents that don’t match any follow patterns or that do match exclude patterns. If you want to remove documents from search results, use the Remove URLs feature on the Search > Search Features > Front Ends > Remove URLs page. When the remove doc ripper has run with your changes to the crawl patterns, you should delete all Remove URL patterns. The Remove URL patterns are checked at search query time and are expensive to process. A large number of Remove URLs patterns affects search query speed.