Help Center
Crawl and Index > Crawl URLs

Use the Crawl and Index > Crawl URLs page to perform the following tasks:

Before crawling starts, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid.

For a given URL to be crawled, it must match at least one URL pattern in the Follow and Crawl Only URLs with the Following Patterns field and none of the URL patterns in the Do Not Crawl URLs with the Following Patterns field. If a URL is matched by patterns from both Follow and Crawl Only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns, the URL is not crawled.

The crawler can access content over HTTP, HTTPS, and SMB protocols. More information about file system crawling appears below.

Before Starting this Task

Before specifying URLs to crawl, you and other people in your organization might prepare the content for crawling. This activity includes controlling crawling of documents by:
For more information about these tasks, refer to "Administering Crawl: Preparing for a Crawl," which is linked to the Google Search Appliance help center.

If you want to include secure URLs in a crawl, configure crawler access to content servers that require authentication before adding any secure URL to the starting URLs. To configure crawler access to secure servers, use the Crawl and Index > Crawler Access page.

Configuring a Crawl

The fields described in the following table enable you to control and refine crawls. In each field, type one URL per line, pressing Enter after each entry to add another. Empty lines and comments (starting with #) are permitted.

The URL patterns that you list in the fields on this page must conform to the "Rules for Valid URL Patterns" described in "Administering Crawl: Constructing URL Patterns," which is linked to the Google Search Appliance help center.

By default, the Google Search Appliance rewrites URLs ending with index.html or index.htm so that they end with "/". For example, if the search appliance crawls the URL http://company.com/index.html, the URL is rewritten as http://company.com/.

To configure a crawl:
File System Crawling

To crawl documents stored in an SMB file share, type a URI that uses the smb: protocol, in the following format:
For example:
Do not start the crawl at the top-level SMB path. For example, the following path is invalid for crawl:

Case-Insensitive URL Pattern Matching

URLs on the Crawl and Index > Crawl URLs page are case-sensitive. If you want case-insensitive URL pattern matching, use either the Crawl and Index > Case-Insensitive URLs page or the operator regexpIgnoreCase. For example, suppose you enter the following pattern:
That pattern would also match the following URLs:
Crawling and Indexing Compressed Files

The search appliance supports crawling and indexing compressed files in the following formats: .zip, .tar, .tar.gz, and .tgz. To enable the search appliance to crawl these types of compressed files:
Testing URL Patterns

To test which URLs are going to be matched by one of the patterns you have entered on this page, click a Test these patterns link to open the Pattern Tester Utility. This utility lets you specify a list of URLs on the left and a set of patterns on the right. It notifies you if each URL is matched by one of the patterns in the set.

When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to analyze your pattern sets. However, your modifications are not saved; you have to enter and save them explicitly on the Crawl and Index > Crawl URLs page.

After you click the Test these patterns link, the results appear on the same page. The green background indicates that at least one of the patterns does match the URLs you want to crawl. It also shows the first pattern that matched. The red background shows that none of the patterns matched this URL.

To return to the Crawl and Index > Crawl URLs page, click the Back to Crawl and Index > Crawl URLs link.

For More Information

For detailed information about crawling, see "Administering Crawl," which is linked to the Google Search Appliance help center.
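
Illustrative sketch: the rules on this page (one entry per line with empty lines and # comments allowed, the index.html rewrite, and the follow / do-not-crawl decision) can be modeled in a few lines of Python. This is not the search appliance's implementation. All function names are hypothetical, the /private/ path is an invented example, and plain substring matching stands in for the real pattern grammar described in "Administering Crawl: Constructing URL Patterns."

```python
def parse_patterns(field_text):
    """Return the entries of a pattern field, skipping empty lines and # comments."""
    patterns = []
    for line in field_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def first_match(url, patterns):
    """Return the first pattern found in url, or None.

    Assumption: substring matching only; the appliance's own pattern
    rules are richer than this.
    """
    for pattern in patterns:
        if pattern in url:
            return pattern
    return None


def should_crawl(url, follow, do_not_crawl):
    """A URL is crawled only if it matches a follow pattern and no do-not-crawl pattern."""
    return (first_match(url, follow) is not None
            and first_match(url, do_not_crawl) is None)


def rewrite_index(url):
    """Rewrite a URL ending in /index.html or /index.htm so that it ends with '/'."""
    for suffix in ("index.html", "index.htm"):
        if url.endswith("/" + suffix):
            return url[: -len(suffix)]
    return url


follow = parse_patterns("# starting site\nhttp://company.com/\n\n")
avoid = parse_patterns("http://company.com/private/")  # hypothetical path

print(should_crawl("http://company.com/docs/a.html", follow, avoid))
print(should_crawl("http://company.com/private/x.html", follow, avoid))
print(rewrite_index("http://company.com/index.html"))
```

A URL that matches both lists is rejected, which mirrors the rule that the do-not-crawl patterns take precedence. The `first_match` helper also mirrors how the Pattern Tester Utility reports the first pattern that matched.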
© Google Inc.