Admin Console Help
Content Sources > Web Crawl > Duplicate Hosts

Use the Content Sources > Web Crawl > Duplicate Hosts page to perform the following tasks:

Duplicate Hosts

By configuring duplicate hosts, you can prevent the search appliance from recrawling content on a canonical host that resides on mirrored servers. For example, if you have load-balanced servers that serve the same content, it is best not to crawl all of the servers, because they contain duplicate copies of the content files rather than unique content.

Entries on the Duplicate Hosts page identify the duplicate hosts so that links found during the crawl that point to a duplicate host are treated as though they point to the corresponding canonical host.

The following requirements apply to entries on this page:
In the following example, the canonical host www.your-company.com corresponds to the duplicate hosts www.offsite.com and web.offsite.com.
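Assuming the page stores one canonical-to-duplicate mapping per entry (the exact field layout may vary by appliance version), the example above could be represented as:

```
Canonical Host          Duplicate Host
www.your-company.com    www.offsite.com
www.your-company.com    web.offsite.com
```

With these entries in place, a link to http://www.offsite.com/page.html discovered during the crawl is treated as a link to http://www.your-company.com/page.html.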
Infinite Space

In "infinite space," the search appliance repeatedly crawls similar URLs with the same content while useful content goes uncrawled. For example, the search appliance might start crawling infinite space if a page that it fetches contains a link back to itself with a different URL. The search appliance keeps crawling this page because, each time, the URL contains progressively more query parameters or a longer path. When a URL is in infinite space, the search appliance does not crawl links in the content.

By enabling infinite space detection, you can prevent crawling of duplicate content and avoid indexing infinite space. When you select Enable infinite space detection, the following two options appear:
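The repeated-crawl scenario described above can be sketched roughly as follows. This is an illustrative model only, not the appliance's actual algorithm: it flags a URL as being in infinite space when its content checksum matches a page that was already crawled under a different URL, so the flagged page's links would not be followed.

```python
import hashlib


class InfiniteSpaceDetector:
    """Illustrative sketch (assumption, not the appliance's real logic):
    a URL is treated as duplicate/infinite-space content when its body
    checksum matches a page already crawled under a different URL."""

    def __init__(self):
        # checksum -> first URL crawled with that content
        self.seen = {}

    def checksum(self, content):
        return hashlib.sha256(content.encode()).hexdigest()

    def is_duplicate(self, url, content):
        digest = self.checksum(content)
        if digest in self.seen and self.seen[digest] != url:
            return True  # duplicate content: do not crawl its links
        self.seen[digest] = url
        return False


detector = InfiniteSpaceDetector()
# First fetch of the page: new content, so it is crawled normally.
detector.is_duplicate("http://example.com/page", "<html>same body</html>")
# A self-link with an extra query parameter returns the same body,
# so the detector flags it and its links would not be followed.
dup = detector.is_duplicate("http://example.com/page?x=1", "<html>same body</html>")
```

In this sketch, `dup` is True for the second URL because the content checksum already maps to a different URL.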
If a URL is a duplicate, you can use the Content Sources > Web Crawl > Crawl Schedule page to configure a recrawl schedule and whether to remove the URL from the index. For more information, see the help page for Content Sources > Web Crawl > Crawl Schedule. If there is valid content in repetitive URLs, you need to remove the following regular expressions from the Do Not Crawl URLs with the Following Patterns field on the Content Sources > Web Crawl > Start and Block URLs page:
These patterns prevent crawling of repetitive URLs and prevent infinite space detection from working.

Before Starting These Tasks

Ensure that all canonical hosts you intend to list on this page are listed in the Follow Patterns field on the Content Sources > Web Crawl > Start and Block URLs page.

Configuring Duplicate Hosts

To configure duplicate hosts:
Note that adding a duplicate host entry does not remove any duplicate URLs that are already indexed. A new entry prevents only newly discovered duplicate URLs from being indexed. To remove existing duplicate URLs from the index, enter the appropriate patterns in the Do Not Follow Patterns field on the Content Sources > Web Crawl > Start and Block URLs page.

Configuring Infinite Space Detection

To configure infinite space detection:
Subsequent Tasks

There are no subsequent tasks associated with configuring duplicate hosts.
© Google Inc.