Crawl and Index > Duplicate Hosts

Use the Crawl and Index > Duplicate Hosts page to perform the following tasks:

  • Configure duplicate hosts, so that mirrored content is crawled only on the canonical host
  • Configure infinite space detection, to prevent crawling of duplicate content

Duplicate Hosts

By configuring duplicate hosts, you can prevent the search appliance from recrawling content on mirrored servers that duplicates content on a canonical host.

For example, if your system includes load-balanced servers that serve the same content, it is best not to crawl all of them, because they hold duplicate copies of the content rather than unique content. Entries on the Duplicate Hosts page identify the duplicate hosts so that links found during the crawl that point to a duplicate host are treated as though they point to the corresponding canonical host.

The following requirements apply to entries on this page:

  • Only one Canonical Host entry is permitted in each field in the Canonical Host column.
  • Canonical Host entries must be fully qualified host names.
  • Multiple Duplicate Host entries are permitted in the same field for the corresponding canonical host. Separate each host name with a space.
  • Each field in the Duplicate Host column must contain at least one entry.

In the following example, the canonical host www.your-company.com corresponds to the duplicate hosts www.offsite.com and web.offsite.com.

Canonical Host           Duplicate Host(s)
www.your-company.com     www.offsite.com web.offsite.com
www2.your-company.com    website.example
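
The effect of these entries on newly discovered links can be sketched as follows. This is a minimal, hypothetical illustration of the host-rewriting behavior, not the appliance's actual implementation; the mapping and function names are assumptions made for the example.

  from urllib.parse import urlsplit, urlunsplit

  # Hypothetical mapping built from the Duplicate Hosts entries above.
  CANONICAL_FOR = {
      "www.offsite.com": "www.your-company.com",
      "web.offsite.com": "www.your-company.com",
      "website.example": "www2.your-company.com",
  }

  def canonicalize(url):
      # Rewrite a discovered link so that a duplicate host is treated
      # as its canonical host; any other URL passes through unchanged.
      parts = urlsplit(url)
      host = CANONICAL_FOR.get(parts.hostname, parts.netloc)
      return urlunsplit((parts.scheme, host, parts.path,
                         parts.query, parts.fragment))

  # A link found during the crawl on a mirror:
  print(canonicalize("http://web.offsite.com/docs/faq.html"))
  # -> http://www.your-company.com/docs/faq.html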

Infinite Space

In "infinite space," the search appliance repeatedly crawls similar URLs with the same content while useful content goes uncrawled. For example, the search appliance might start crawling infinite space if a page that it fetches contains a link back to itself with a different url. The search appliance keeps crawling this page because, each time, the URL contains progressively more query parameters or a longer path. When a URL is in infinite space, the search appliance does not crawl links in the content.

By enabling infinite space detection, you can prevent the search appliance from crawling and indexing duplicate content. When you select Enable infinite space detection, the following two options appear:

  • Enable URL string check--If enabled, the search appliance checks the URL string for signals of infinite space, such as repetitive path segments or query strings. When such signals are present, the search appliance then checks whether the Number of URLs with identical content threshold is met before classifying the duplicate URLs as being in infinite space. If this option is not enabled, all duplicate URLs that meet the threshold set by the following option are considered to be in infinite space.
  • Number of URLs with identical content required to trigger classification of those URLs as an infinite space--If the number of URLs with the same content exceeds this threshold, all of those URLs except the first one crawled are considered duplicates, as sketched after this list.
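
A minimal sketch of the threshold logic, assuming pages are compared by a hash of their content; the names, the hashing choice, and the threshold value are illustrative assumptions, not the appliance's actual mechanism:

  import hashlib
  from collections import defaultdict

  THRESHOLD = 5  # the configured "Number of URLs with identical content"

  seen = defaultdict(list)  # content hash -> URLs in the order crawled

  def record(url, content):
      # Record a fetched page; return the URLs now classified as
      # infinite-space duplicates (every copy after the first one crawled).
      digest = hashlib.sha256(content.encode()).hexdigest()
      seen[digest].append(url)
      urls = seen[digest]
      return urls[1:] if len(urls) > THRESHOLD else []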

If a URL is a duplicate, you can use the Crawl and Index > Crawl Schedule page to configure a recrawl schedule and whether to remove the URL from the index. For more information, see the help page for Crawl and Index > Crawl Schedule.

If there is valid content in repetitive URLs, you need to remove the following regular expressions from the Do Not Crawl URLs with the Following Patterns field on the Crawl and Index > Crawl URLs page:

  • regexp:/([^/]*)/\\1/\\1/
  • regexp:/([^/]*)/([^/]*)/\\1/\\2/
  • regexp:&([^&]*)&\\1&\\1

While present, these patterns prevent repetitive URLs from being crawled at all, which in turn prevents infinite space detection from evaluating them.
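
To see what these patterns match, the following sketch tests equivalent Python regular expressions against repetitive URLs; the sample URLs are made up for illustration. Note that the patterns on the Crawl URLs page are written with doubled backslashes (\\1), whereas the equivalent Python raw-string backreference is \1.

  import re

  # /([^/]*)/\1/\1/  -- a path segment repeated three times in a row
  print(bool(re.search(r"/([^/]*)/\1/\1/", "/docs/a/a/a/index.html")))  # True

  # /([^/]*)/([^/]*)/\1/\2/  -- a two-segment sequence repeated
  print(bool(re.search(r"/([^/]*)/([^/]*)/\1/\2/", "/x/y/x/y/page")))   # True

  # &([^&]*)&\1&\1  -- a query parameter repeated three times
  print(bool(re.search(r"&([^&]*)&\1&\1", "/list?p=1&p=1&p=1&p=1")))    # True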

Before Starting These Tasks

Ensure that all canonical hosts you intend to list on this page are listed in the Follow and Crawl Only URLs with the Following Patterns field on the Crawl and Index > Crawl URLs page.
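
For example, for the canonical hosts in the earlier table, that field would need to contain entries such as the following (shown here in a simple host-path form; the full pattern syntax is described in the help page for Crawl and Index > Crawl URLs):

  • www.your-company.com/
  • www2.your-company.com/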

Configuring Duplicate Hosts

To configure duplicate hosts:

  1. Click Crawl and Index > Duplicate Hosts.
  2. In the Canonical Host column, type a fully qualified host name in the field in the top row.
  3. In the Duplicate Host(s) column of the corresponding row, type any number of fully qualified host names in the top row. Separate host names with a space.
  4. To add additional rows, click Add More Lines.
  5. When all rows are added, click Save Host Information.

Note that adding a duplicate host entry does not remove any duplicate URLs that are already indexed. A new entry prevents only newly discovered duplicate URLs from being indexed. To remove existing duplicate URLs from the index, enter the appropriate patterns in the Do Not Crawl URLs with the Following Patterns field on the Crawl and Index > Crawl URLs page, as in the example below.
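
For example, to remove already-indexed URLs from the duplicate hosts in the earlier table, patterns such as the following could be added (again in a simple host-path form; see the help page for Crawl and Index > Crawl URLs for the full pattern syntax):

  • www.offsite.com/
  • web.offsite.com/
  • website.example/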

Configuring Infinite Space Detection

To configure infinite space detection:

  1. Click Crawl and Index > Duplicate Hosts.
  2. Click the Enable infinite space detection checkbox.
  3. Optionally, click Enable URL string check.
  4. Optionally, enter a number in Number of URLs with identical content required to trigger classification of those URLs as an infinite space.
  5. Click Save Configuration.

Subsequent Tasks

There are no subsequent tasks associated with configuring duplicate hosts.


 
© Google Inc.