Admin Console Help


Content Sources > Web Crawl > Host Load Schedule

Use this page to control the crawling load on the search appliance and on the web or file servers where the crawled content is stored. This help page contains information on the following topics:

  • About the Maximum Number of URLs to Crawl
  • About the Maximum File Sizes to Download
  • About the ACL Size Limit
  • About the Load on the Web Server Host
  • Determining Exceptions to Web Server Host Loads
  • Before Starting these Tasks
  • Setting the Maximum Number of URLs to Crawl
  • Setting the Maximum File Size to Download
  • Setting the ACL Size Limit
  • Setting the Web Server Host Load
  • Setting Exceptions to Web Server Host Load

About the Maximum Number of URLs to Crawl

Use the Maximum Number of URLs to Crawl field to specify a number of URLs to crawl that is smaller than the limit permitted by your license. By default, the field is blank. If the field is blank, the search appliance crawls the number of URLs specified by your license.

Whether you change the maximum number of URLs to crawl depends on the number of documents you need crawled and served, and on whether all of your content files must be crawled and served. If you set the maximum number of URLs to crawl to a number smaller than the number of documents in your repositories, ensure that the crawl patterns include and prioritize the most important documents in the repositories.

Setting this number to lower than the current number of indexed documents might result in removing some documents from the index or preventing new content from being indexed.

About the Maximum File Sizes to Download

Use the Maximum File Sizes to Download field to change the file size limits that the downloader uses when crawling documents. The range of valid values for Text or HTML files and for all other document types is 0 to 2048 MB.

When the Google Search Appliance downloads a document during a crawl, it determines the type and size of the file. If a text or HTML document is larger than the maximum file size, the search appliance truncates the file and discards the remainder of the file. If any other type of document is larger than the maximum file size, the search appliance discards the file completely.
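The truncate-versus-discard rule above can be summarized in a short sketch. This is our own illustration, not appliance code; the function name, document-type labels, and limit parameters are assumptions for the example:

```python
def handle_download(doc_type, size_mb, text_html_limit_mb, other_limit_mb):
    """Decide what happens to a downloaded document, per the rules above.

    Illustrative only: text/HTML documents over the limit are truncated,
    while any other document type over the limit is discarded entirely.
    """
    if doc_type in ("text", "html"):
        if size_mb > text_html_limit_mb:
            return "truncated"  # content kept up to the limit; remainder discarded
        return "kept"
    if size_mb > other_limit_mb:
        return "discarded"      # non-text/HTML files over the limit are dropped
    return "kept"
```

For example, with both limits set to 20 MB, a 30 MB HTML page is truncated, while a 30 MB PDF is discarded completely.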

File size limits are based on content size, not network traffic. For example, downloading a 20MB file might result in 21MB of traffic listed in Apache logs. This is caused by normal networking overhead, lost packets, and so on.

About the ACL Size Limit

Use the ACL size limit field to change the number of users and groups to add to a per-URL ACL for all URLs. The default value is 10000. The maximum value for this field is 100000.

About the Load on the Web Server Host

The Web Server Host Load value specifies the maximum number of concurrent connections opened for crawling between the search appliance and each web server during any one-minute period. The default number of concurrent connections is four. Google recommends that you start with four connections, then increase the value after you determine that your web or file servers have sufficient capacity for a higher load. Consult the administrator of the sites that the search appliance crawls to determine a server's load capacity.

The search appliance attempts to maintain the specified number of concurrent connections. During the crawl, the search appliance dynamically analyzes the responses from file and web servers. If a server does not have sufficient capacity for the host load that is set, the search appliance reduces the crawl rate until an acceptable response time is achieved. The number of connections might drop below the specified number, depending on system activity.

You can set the host load to zero (0). Content residing on a computer whose host load is set to zero is not crawled.

You can specify the host load as a decimal value with up to two decimal places. For example, you can specify a value such as .5, 1, or 2.0.

The search appliance behaves differently depending on whether the value is set to a decimal value under 1.00 or to a value of 1.00 or more.

  • A value of 1 or more sets the average number of connections within any one-minute window. For example, a value of 2.0 indicates that, on average, the search appliance opens two concurrent connections to each host.
  • A decimal value greater than 1 also sets an average number of connections. For example, a value of 3.5 indicates that, on average, the search appliance opens 3.5 concurrent connections to each host. Further, a value of 3.5 indicates that 50% of the time, three connections are open and 50% of the time, four connections are open.
  • A decimal value under 1 sets the percentage of time during which the search appliance opens connections. For example, a value of .25 indicates that, on average, connections to the web or file server are opened 25% of the time, or for approximately 15 seconds per minute. The search appliance bases the number of connections on the responsiveness of the server.
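These rules can be illustrated with a small Python sketch. The function and its wording are our own, not part of the appliance; it only translates a host load value into the behavior described above:

```python
import math

def describe_host_load(host_load):
    """Interpret a Web Server Host Load value per the rules above.

    Illustrative sketch only; returns a human-readable summary.
    """
    if host_load == 0:
        return "host is not crawled"
    if host_load < 1:
        # A fractional value under 1 is the share of each one-minute
        # window during which connections are open.
        seconds_per_minute = host_load * 60
        return ("connections open %d%% of the time (~%d seconds per minute)"
                % (host_load * 100, seconds_per_minute))
    # A value of 1 or more is an average connection count; a fractional
    # part means the appliance alternates between the floor and ceiling
    # connection counts in proportion to the fraction.
    low, high = math.floor(host_load), math.ceil(host_load)
    frac = host_load - low
    if frac == 0:
        return "an average of %d concurrent connections" % low
    return ("an average of %.2f connections: %d connections %d%% of the time, "
            "%d connections %d%% of the time"
            % (host_load, low, (1 - frac) * 100, high, frac * 100))
```

For example, `describe_host_load(0.25)` reports connections open 25% of the time (about 15 seconds per minute), and `describe_host_load(3.5)` reports three connections half the time and four connections the other half.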

You can calculate the maximum crawl rate of the search appliance by dividing the host load by the average time to fetch documents. For example, with a host load of 0.5 and an average fetch time of 0.1 seconds, the maximum crawl rate is 5 documents per second. The search appliance continually recalculates the average fetch time, based on the most recent documents fetched.
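The calculation above reduces to a one-line formula. The function name is ours, chosen for the example:

```python
def max_crawl_rate(host_load, avg_fetch_seconds):
    """Maximum crawl rate in documents per second.

    Maximum crawl rate = host load / average time to fetch a document,
    as described in the text above.
    """
    return host_load / avg_fetch_seconds

# Example from the text: a host load of 0.5 and an average fetch time
# of 0.1 seconds give a maximum crawl rate of 5 documents per second.
rate = max_crawl_rate(0.5, 0.1)
```

Because the appliance continually recalculates the average fetch time, the effective maximum crawl rate varies over the course of a crawl.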

Warning: Some file servers and web servers do not have the capacity for a high host load. If the search appliance opens too many connections or makes too many HTTP requests to a file server or web server, the server might become unavailable.

The Host Load Setting in a Proxied Environment

In a proxied environment, set the host load of the proxy server as well as each server behind the proxy server. The search appliance respects the individual host load settings of a server, even if it is behind a proxy. The host load of the proxy server controls how many concurrent connections can be made to the proxy server. The host load of the proxy server also determines the crawl rate for servers that are behind the proxy. Servers with virtual host names are treated the same way as proxies.

When sites are crawled using a proxy, the same host load is used to crawl all sites behind the proxy. The host load used is the maximum host load specified for any URL pattern crawled using the proxy. You can either specify an overall host load setting that is small enough that it does not affect the performance of any proxied sites or you can use the Exceptions to Web Server Host Load fields to make a null entry for proxied sites, in which case the maximum host load is used.

Determining Exceptions to Web Server Host Loads

You can assign different maximum host loads to particular web servers or URLs during designated time periods. The default web server host load of four connections or the host load you specify elsewhere on this page applies to any time period during which you do not assign a host load exception.

For example, you might have three web servers with the capacity for a higher crawl load during the night. You can designate a higher host load for these three web servers between 12 a.m. and 6 a.m. To minimize the host load during the day, you might set a host load exception of zero between 9:00 a.m. and 5:00 p.m.

To set a period of 24 hours, use 12 a.m. as the start time and 12 a.m. as the end time.

Before Starting these Tasks

Before you modify the web server host load setting or set exceptions to the web server host load setting, confer with the administrator of the web or file servers where the crawled content is stored. The administrator can provide information on the capacity of the web and file servers to respond to requests from the search appliance and on when you can increase or decrease the host load settings.

If you are creating exceptions to the host load setting, determine the IP address or fully-qualified domain name for each host and the time periods for which you are creating the exceptions.

Setting the Maximum Number of URLs to Crawl

Use this section to set the maximum number of URLs to crawl. By default, the search appliance crawls enough URLs to reach your license limit. (To check the license limit for a particular search appliance, navigate to Administration > License.) You can set the number of URLs to a lower number on this page.

To set the maximum number of URLs to crawl:

  1. Click Content Sources > Web Crawl > Host Load Schedule.
  2. In the Maximum Number of URLs to Crawl field, type in a value lower than the license limit.
  3. Click Save.

Setting the Maximum File Size to Download

Use this section to change the file size limits for the downloader to use when crawling documents.

To set the file size limits:

  1. Click Content Sources > Web Crawl > Host Load Schedule.
  2. In the Text/HTML Documents and/or All Other Document Types fields, type integer values.
  3. Click Save.

Setting the ACL Size Limit

Use this section to change the number of users and groups to add to per-URL ACLs to a value other than the default value of 10000.

To set the ACL size limit:

  1. Click Content Sources > Web Crawl > Host Load Schedule.
  2. In the ACL size limit field, type an integer value.
  3. Click Save.

Setting the Web Server Host Load

Use this section to set the web server host load to a value other than the default value of four.

To set the web server host load:

  1. Click Content Sources > Web Crawl > Host Load Schedule.
  2. Type a new value in the Web Server Host Load field.
  3. Click Save.

Setting Exceptions to Web Server Host Load

Use this section to set host load exceptions for particular time periods, either for all servers or for particular servers. Use fully qualified domain names or IP addresses to designate the servers, with one host name entry per line.

In addition to specifying the host load by domain name or IP address, you can also specify the host load using crawl patterns for specific parts of a site. Regular expressions are not supported. List the patterns in order from most specific to least specific, as shown in the following example:

mycompany.com/marketing/
mycompany.com/engineering/
mycompany.com
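Because patterns are consulted in the order listed, a more specific pattern must appear before a more general one or it will never be reached. The following sketch (our own illustration, not appliance code) shows this first-match-wins ordering over an exceptions list like the one above; the host load values are hypothetical:

```python
def host_load_for_url(url, pattern_loads):
    """Return the host load for the first pattern that prefixes the URL.

    pattern_loads is an ordered list of (pattern, load) pairs, most
    specific first, mirroring the exceptions list on this page.
    Returns None when no pattern matches. Illustrative only.
    """
    for pattern, load in pattern_loads:
        if url.startswith(pattern):
            return load
    return None

# Hypothetical host loads for the example patterns above.
patterns = [
    ("mycompany.com/marketing/", 1.0),
    ("mycompany.com/engineering/", 2.0),
    ("mycompany.com", 4.0),
]
```

With this ordering, `mycompany.com/marketing/q3.html` matches the first, most specific pattern; if `mycompany.com` were listed first, it would match every URL on the site and the more specific entries would never apply.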

Note that having multiple host load exceptions for the same IP address can negatively affect crawling for that host. Generally, if several host load exceptions are added for the same IP address, the search appliance prioritizes crawling the URLs with higher importance or at a higher directory level.

For example, consider the following URL patterns:

mycompany.com/files/huge/, with a web server host load of 4.0
mycompany.com/1level/, with a web server host load of 2.0

The 1level directory is prioritized first because it is at a higher directory level than the other pattern.

Furthermore, if the default host load for the IP address is 1.0, then the host load applied to the above patterns would be 1.0 instead of 2.0 or 4.0.

To set exceptions to the web server host load:

  1. Click Content Sources > Web Crawl > Host Load Schedule.
  2. In the Hostload field, type the host load exception value.
  3. In the From and To drop-down lists, select the start and end times during which that host load applies.
  4. Designate the servers to which the host load exception value applies.
    • Select For all hosts if the setting applies to all hosts.
    • Select For these hosts if the value applies to particular hosts, then type in the IP addresses or fully-qualified domain names of the servers in the text field, one host per line. If the value applies to URLs, then type in the URL patterns in the text field, one pattern per line.
  5. To add additional entries, click Add More Host Load Exceptions.
  6. To change the order of host load entries, click Move Up or Move Down next to the entry that you want to move.
  7. Click Save.

For More Information

For more information, see "Administering Crawl," which is linked from the Google Search Appliance help center.
 
© Google Inc.