Admin Console Help


Content Sources > Web Crawl > Duplicate Hosts

Use the Content Sources > Web Crawl > Duplicate Hosts page to perform the following tasks:

  • Configure duplicate hosts
  • Configure infinite space detection

Duplicate Hosts

By configuring duplicate hosts, you can prevent the search appliance from recrawling content that it already crawls on a canonical host but that also resides on mirrored servers.

For example, if you have load-balanced servers in your system that serve the same content, it's best not to crawl all of the servers, because they contain duplicates of the content files, not unique content files. Entries on the Duplicate Hosts page identify the duplicate hosts so that links found during the crawl that point to a duplicate host are treated as though they point to the corresponding canonical host.

The following requirements apply to entries on this page:

  • Only one Canonical Host entry is permitted in each field in the Canonical Host column.
  • Canonical Host entries must be fully qualified host names.
  • Multiple Duplicate Host entries are permitted in the same field for the corresponding canonical host. Separate each host name with a space.
  • Each field in the Duplicate Host column must contain at least one entry.

In the following example, the canonical host www.your-company.com corresponds to the duplicate hosts www.offsite.com and web.offsite.com.

Canonical Host           Duplicate Host(s)
www.your-company.com     www.offsite.com web.offsite.com
www2.your-company.com    website.example
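
Conceptually, each row maps one or more duplicate hosts to a single canonical host, so links discovered during the crawl can be rewritten before they are fetched. The following Python sketch illustrates that rewrite; the lookup table mirrors the example rows above, and the function name is illustrative, not part of the appliance:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical lookup table mirroring the example rows above:
# each duplicate host maps to its canonical host.
DUPLICATE_TO_CANONICAL = {
    "www.offsite.com": "www.your-company.com",
    "web.offsite.com": "www.your-company.com",
    "website.example": "www2.your-company.com",
}

def canonicalize(url):
    """If the URL points at a known duplicate host, rewrite it to
    point at the corresponding canonical host; otherwise return it
    unchanged."""
    parts = urlsplit(url)
    canonical = DUPLICATE_TO_CANONICAL.get(parts.hostname)
    if canonical is None:
        return url  # host is not a known duplicate
    return urlunsplit((parts.scheme, canonical, parts.path,
                       parts.query, parts.fragment))

print(canonicalize("http://web.offsite.com/docs/index.html"))
# http://www.your-company.com/docs/index.html
```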

Infinite Space

In "infinite space," the search appliance repeatedly crawls near-identical URLs that serve the same content while useful content goes uncrawled. For example, the search appliance might start crawling infinite space if a page that it fetches contains a link back to itself with a slightly different URL. The search appliance keeps crawling this page because, each time, the URL contains progressively more query parameters or a longer path. When a URL is in infinite space, the search appliance does not crawl links in its content.
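
A self-referencing relative link is enough to produce this behavior. In the hypothetical sketch below, a page at /app/page/ links back to itself with the relative URL "page/", so every fetch yields a new, longer URL (the host and paths are made up for illustration):

```python
from urllib.parse import urljoin

# Hypothetical page at /app/page/ that contains a relative link
# "page/" back to itself; each fetch yields a new, longer URL.
url = "http://host.example/app/page/"
for _ in range(3):
    url = urljoin(url, "page/")  # follow the page's self-link
    print(url)
# http://host.example/app/page/page/
# http://host.example/app/page/page/page/
# http://host.example/app/page/page/page/page/
```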

By enabling infinite space detection, you can prevent crawling of duplicate content to avoid infinite space indexing. When you select Enable infinite space detection, the following two options appear:

  • Enable URL string check--If enabled, the search appliance checks the URL string for signals of infinite space, such as repetitive path segments or query strings. If such signals are present, the search appliance then checks whether the Number of URLs with identical content threshold is met to determine whether the duplicated URL is in infinite space. If this option is not enabled, all duplicate URLs identified by the following option are considered to be in infinite space.
  • Number of URLs with identical content required to trigger classification of those URLs as an infinite space--If the number of URLs with the same content exceeds this threshold, they are considered duplicates, except for the first one crawled.
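
The threshold rule can be sketched as follows: group crawled URLs by a hash of their content, and when a group grows beyond the threshold, flag every URL in it except the first one crawled. This is an illustrative sketch of the rule as described above, not the appliance's implementation:

```python
import hashlib

def find_infinite_space(pages, threshold):
    """Group URLs by a hash of their content; when more than
    `threshold` URLs share identical content, flag every URL in
    the group except the first one crawled."""
    by_hash = {}
    for url, content in pages:  # pages listed in crawl order
        digest = hashlib.sha256(content.encode()).hexdigest()
        by_hash.setdefault(digest, []).append(url)
    flagged = []
    for urls in by_hash.values():
        if len(urls) > threshold:
            flagged.extend(urls[1:])  # the first crawled URL is kept
    return flagged

dupes = find_infinite_space(
    [("/a?p=1", "same"), ("/a?p=2", "same"),
     ("/a?p=3", "same"), ("/b", "other")],
    threshold=2,
)
print(dupes)  # ['/a?p=2', '/a?p=3']
```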

If a URL is a duplicate, you can use the Content Sources > Web Crawl > Crawl Schedule page to configure a recrawl schedule and whether to remove the URL from the index. For more information, see the help page for Content Sources > Web Crawl > Crawl Schedule.

If there is valid content in repetitive URLs, you need to remove the following regular expressions from the Do Not Crawl URLs with the Following Patterns field on the Content Sources > Web Crawl > Start and Block URLs page:

  • regexp:/([^/]*)/\\1/\\1/
  • regexp:/([^/]*)/([^/]*)/\\1/\\2/
  • regexp:&([^&]*)&\\1&\\1

These patterns block the crawling of repetitive URLs outright, which also prevents infinite space detection from ever evaluating them.
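
Written in ordinary regular-expression syntax with single backslashes (the appliance's pattern fields double them), the three patterns match repeated path segments and repeated query parameters, as this Python check illustrates (the sample URLs are made up):

```python
import re

# The GSA patterns above, expressed as ordinary Python regexes.
one_segment  = re.compile(r"/([^/]*)/\1/\1/")           # /a/a/a/
two_segments = re.compile(r"/([^/]*)/([^/]*)/\1/\2/")   # /a/b/a/b/
query_param  = re.compile(r"&([^&]*)&\1&\1")            # &x&x&x

print(bool(one_segment.search("/docs/docs/docs/page.html")))   # True
print(bool(two_segments.search("/a/b/a/b/index.html")))        # True
print(bool(query_param.search("?q=1&id=2&id=2&id=2")))         # True
print(bool(one_segment.search("/docs/other/docs/")))           # False
```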

Before Starting these Tasks

Ensure that all canonical hosts you intend to list on this page are listed in the Follow Patterns field on the Content Sources > Web Crawl > Start and Block URLs page.

Configuring Duplicate Hosts

To configure duplicate hosts:

  1. Click Content Sources > Web Crawl > Duplicate Hosts.
  2. In the Canonical Host column, type a fully qualified host name in the field in the top row.
  3. In the Duplicate Host(s) column of the corresponding row, type any number of fully qualified host names in the top row. Separate host names with a space.
  4. To add additional rows, click Add More Lines.
  5. When all rows are added, click Save.

Note that adding a duplicate host entry does not remove duplicate URLs that are already indexed; a new entry only prevents newly discovered duplicate URLs from being indexed. To remove existing duplicate URLs from the index, enter the appropriate patterns in the Do Not Follow Patterns field on the Content Sources > Web Crawl > Start and Block URLs page.

Configuring Infinite Space Detection

To configure infinite space detection:

  1. Click Content Sources > Web Crawl > Duplicate Hosts.
  2. Click the Enable infinite space detection checkbox.
  3. Optionally, click Enable URL string check.
  4. Optionally, enter a number in Number of URLs with identical content required to trigger classification of those URLs as an infinite space.
  5. Click Save.

Subsequent Tasks

There are no subsequent tasks associated with configuring duplicate hosts.


 
© Google Inc.