Admin Console Help
Content Sources > Web Crawl > Start and Block URLs

Use the Content Sources > Web Crawl > Start and Block URLs page to perform the following tasks:

- Specify the URLs where crawling starts
- Specify URL patterns that the crawler follows and URL patterns that it does not follow
- Test URL patterns
- Choose between Action view and Batch Edit view
Before crawling starts, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid. The crawler can access content over the HTTP, HTTPS, and SMB protocols. More information about file system crawling appears below. The Content Sources > Web Crawl > Start and Block URLs page gives you the option of working with URLs and patterns in one of two views: Action or Batch Edit. Action view provides several features for working with individual URLs and patterns, while Batch Edit view enables you to work with multiple URLs at once. For more details about this topic, see "Choosing a View Type."

Before Starting this Task

Before specifying URLs to crawl, you and other people in your organization might prepare the content for crawling. This activity includes controlling crawling of documents by:

- Using robots.txt files to exclude parts of a site from crawling (see the sketch following this list)
- Using robots meta tags in individual documents
- Using googleon/googleoff tags to exclude parts of a page from indexing

For more information about these tasks, refer to "Administering Crawl: Preparing for a Crawl," which is linked to the Google Search Appliance help center.
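As a rough illustration of how robots.txt exclusions affect what a crawler may fetch, the following Python sketch evaluates a hypothetical robots.txt file against a few URLs using the standard-library parser. The robots.txt content, user agent string, and URLs are illustrative assumptions, not values taken from this page.

    # Sketch: evaluating robots.txt rules the way a compliant crawler would.
    # The robots.txt content, user agent, and URLs below are hypothetical.
    from urllib import robotparser

    robots_txt = """\
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/
    """

    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    for url in ("http://www.example.com/docs/faq.html",
                "http://www.example.com/private/salaries.html"):
        # can_fetch() returns False for URLs a compliant crawler must skip.
        print(url, "->", "crawl" if rp.can_fetch("gsa-crawler", url) else "skip")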
If you want to include secure URLs in a crawl, configure crawler access to content servers that require authentication before adding any secure URLs to the start URLs. To configure crawler access to secure servers, use the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

About Start and Block URLs

The URLs and URL patterns that you enter on this page enable you to control and refine crawls. The page provides the following sections for entering URLs and URL patterns:

- Start URLs
- Follow Patterns
- Do Not Follow Patterns

For a given URL to be crawled, it must match at least one URL pattern in the Follow Patterns section and none of the URL patterns in the Do Not Follow Patterns section. If a URL is matched by patterns from both Follow Patterns and Do Not Follow Patterns, the URL is not crawled. The URLs and patterns that you list on this page must conform to the "Rules for Valid URL Patterns" described in "Administering Crawl: Constructing URL Patterns," which is linked to the Google Search Appliance help center.
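The follow/do-not-follow decision described above can be pictured with a short Python sketch. It treats each pattern as a plain substring test, which is only one of the pattern forms the appliance supports; the pattern lists and URLs are illustrative assumptions.

    # Sketch: the crawl decision described above, using plain substring
    # patterns only (the appliance also supports other pattern forms).
    def should_crawl(url, follow_patterns, do_not_follow_patterns):
        # A URL must match at least one follow pattern...
        if not any(p in url for p in follow_patterns):
            return False
        # ...and none of the do-not-follow patterns.
        if any(p in url for p in do_not_follow_patterns):
            return False
        return True

    follow = ["example.com/"]       # hypothetical follow patterns
    block = ["?", "/private/"]      # hypothetical do-not-follow patterns

    print(should_crawl("http://example.com/docs/faq.html", follow, block))  # True
    print(should_crawl("http://example.com/search?q=gsa", follow, block))   # False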
Start URLs

Start URLs control where the Google Search Appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following links from one or more of the start URLs. All entries in this section must be fully qualified URLs in the following format:

    <protocol>://<host>[:port]/[path]

In this format, the protocol can be HTTP, HTTPS (for secure content), or SMB (for fileshares). The information contained in square brackets [ ] is optional. The forward slash "/" after <host>[:port] is required. If you enter a URL with an IPv6 address, you must put square brackets around the address, as shown in the following format:

    <protocol>://[<IPv6 address>][:port]/[path]

For example: http://[2001:db8::1]:8080/
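As a rough check of the format above, the following Python sketch uses the standard library to verify that an entry is a fully qualified URL with a supported protocol and the required slash after <host>[:port]. It encodes only the rules stated on this page and is not part of the Admin Console.

    # Sketch: checking that a start URL matches the stated format.
    # Encodes only the rules from this page: a supported protocol, a host,
    # and the required "/" after <host>[:port].
    from urllib.parse import urlsplit

    def is_valid_start_url(url):
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https", "smb"):
            return False
        if not parts.netloc:
            return False
        # The path must begin with the required forward slash.
        return parts.path.startswith("/")

    print(is_valid_start_url("http://www.example.com/"))                  # True
    print(is_valid_start_url("https://intranet.example.com:8443/docs/"))  # True
    print(is_valid_start_url("www.example.com/"))                         # False: no protocol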
Follow Patterns

Enter all the URLs that appear in Start URLs in this section. Only URLs that match the patterns you specify in this section are followed and crawled, which enables you to control which files are crawled on your server. URLs that are discovered are checked against these patterns for inclusion in the index. For a URL to be crawled and indexed, there must be a sequence of links matching the follow patterns from one of the start URLs. If there is no valid link path, add the URL to the Start URLs section.

Do Not Follow Patterns

In this section, enter URL patterns for specific file types, directories, or other sets of pages that you do not want crawled. For example, entering the pattern contains:? in this section prevents many Common Gateway Interface (CGI) scripts from being crawled. Entering the pattern contains:? also excludes content feeds sent by connectors, because googleconnector URLs contain question marks. Even if you do not use googleconnector URLs, enterprise content management systems use query parameters that may include question marks in their URLs. If your installation includes connectors for content management systems, use caution when you enter patterns that contain question marks.

For your convenience, this section is pre-populated with many URL patterns and file types, some of which you might not want the crawler to index. Google does not recommend deleting any of the default patterns. If you find that parts of your site are currently being excluded by these rules, deactivate the corresponding rule by putting a "#" in front of it. This way, you can easily recover the default settings should you need them.

Rewrite URLs

By default, the Google Search Appliance rewrites URLs that end with index.html or index.htm so that they end with "/". For example, if the search appliance crawls the URL http://company.com/index.html, the URL is rewritten as http://company.com/.
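A minimal Python sketch of the rewrite just described, assuming the rule applies only to the literal filename suffixes index.html and index.htm:

    # Sketch: the default index.html/index.htm rewrite described above.
    def rewrite_url(url):
        for suffix in ("index.html", "index.htm"):
            if url.endswith("/" + suffix):
                # Drop the filename, keeping the trailing slash.
                return url[: -len(suffix)]
        return url

    print(rewrite_url("http://company.com/index.html"))     # http://company.com/
    print(rewrite_url("http://company.com/docs/page.html")) # unchanged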
File System Crawling

To crawl documents stored in an SMB file share, type a URI that uses the smb: protocol, in the following format:

    smb://<host>/<share>/[path/]

For example:

    smb://fileserver.example.com/marketing/documents/
Do not start the crawl at the top-level SMB path. For example, the following path is invalid for crawl:

    smb://fileserver.example.com/

Case-Insensitive URL Pattern Matching

URLs on the Content Sources > Web Crawl > Start and Block URLs page are case-sensitive. If you want case-insensitive URL pattern matching, use either the Content Sources > Web Crawl > Case-Insensitive Patterns page or the regexpIgnoreCase operator. For example, suppose you enter the following pattern:

    regexpIgnoreCase:example\.com/sales/
That pattern would also match the following URLs:

    http://EXAMPLE.COM/SALES/index.html
    http://example.com/Sales/brochure.pdf
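To see why those URLs match, the following Python sketch emulates the regexpIgnoreCase operator with the standard re module. The operator name comes from this page, but mapping it to a case-insensitive regular-expression search in Python is an assumption made for illustration.

    # Sketch: emulating regexpIgnoreCase with Python's re module.
    # Treating the operator as a case-insensitive regular-expression
    # search is an assumption made for illustration.
    import re

    pattern = re.compile(r"example\.com/sales/", re.IGNORECASE)

    for url in ("http://EXAMPLE.COM/SALES/index.html",
                "http://example.com/Sales/brochure.pdf",
                "http://example.com/support/"):
        result = "matches" if pattern.search(url) else "no match"
        print(url, "->", result)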
Crawling and Indexing Compressed Files

The search appliance supports crawling and indexing compressed files in the following formats: .zip, .tar, .tar.gz, and .tgz. In both search results and index diagnostics, a reference to a compressed file appears as:

    [path]/[compressed_file]#[extracted_file]

To enable the search appliance to crawl these types of compressed files, deactivate the corresponding default file-type entries in the Do Not Follow Patterns section and make sure the files match your follow patterns.
Choosing a View Type

When you are working with start and block URLs, you have the option of choosing the page view that works best for you: Action view or Batch Edit view. To choose a view, click your choice next to View Type. You can switch between page views without any loss of data; however, Batch Edit view does not provide all the options that Action view does.

Using Action View

Use Action view when you want to configure crawl, or to validate, troubleshoot, recrawl, or test a URL or pattern. In Action view, the Start URLs, Follow Patterns, and Do Not Follow Patterns are presented as tables in which each row contains an individual URL or pattern. This view enables you to add, edit, or delete individual URLs and patterns. Additionally, Action view provides the Actions pull-down menus and Filter boxes.

Using the Actions Pull-Down Menus

The Actions pull-down menus enable you to act on individual URLs and patterns. To use a menu, click Actions in the row for a URL or pattern. The Actions pull-down menu for each URL and pattern table contains different commands, as described in the following table.
Using Filters

In some cases, such as troubleshooting a URL, you might want to clear the URL and pattern tables of all but the relevant entries. In Action view, you can filter the tables by typing or pasting a value in any Filter box. For example, if you want to show only URLs that begin with "https" in the tables, enter that value in a Filter box. All of the tables on the page are filtered by that value.

Configuring Crawl in Action View

To configure crawl in Action view, add URLs and patterns:
To add follow patterns only:
To add do not follow patterns only:
Editing Individual URLs or Patterns

To edit an individual URL or pattern:
Deleting Individual URLs or Patterns

To delete an individual URL or pattern:
Testing URL Patterns

To test which URLs are matched by the patterns you have entered on this page, select Test Pattern from the Actions pull-down menu or click Test these patterns to open the Pattern Tester Utility. This utility lets you specify a list of URLs on the left and a set of patterns on the right, and it reports whether each URL is matched by one of the patterns in the set. When it opens, the Pattern Tester Utility is initialized with either the selected pattern or your saved entries from the Content Sources > Web Crawl > Start and Block URLs page. You can enter more URLs and patterns into the tester utility to analyze your pattern sets. However, your modifications are not saved; you must enter and save them explicitly on the Content Sources > Web Crawl > Start and Block URLs page. A green background indicates that at least one of the patterns matches the URL you want to crawl, and the utility also shows the first pattern that matched. A red background indicates that none of the patterns matched the URL.
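The tester's green/red report can be approximated with a short Python sketch that, for each URL, finds the first matching pattern. Substring matching is an assumption here, standing in for the appliance's full pattern syntax, and the URLs and patterns are hypothetical.

    # Sketch: approximating the Pattern Tester Utility's report.
    # Patterns are tested as plain substrings, an assumption standing in
    # for the appliance's full pattern syntax.
    def test_patterns(urls, patterns):
        for url in urls:
            first_match = next((p for p in patterns if p in url), None)
            if first_match:
                print(f"GREEN  {url}  (first match: {first_match})")
            else:
                print(f"RED    {url}")

    test_patterns(
        urls=["http://example.com/docs/", "http://other.example.org/"],
        patterns=["example.com/", "intranet.example.com/"],
    )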
Using Batch Edit View

Use Batch Edit view when you want to add, edit, or delete multiple URLs and patterns at once.

Configuring Crawl in Batch Edit View

When you are working in Batch Edit view, type one URL per line, pressing Enter after each one. Empty lines and comment lines (starting with #) are permitted. To configure crawl in Batch Edit view:

Editing or Deleting URLs or Patterns

You can edit or delete any URLs or patterns in Batch Edit view. After making any changes, click Save.

For More Information

For detailed information about crawling, see "Administering Crawl," which is linked to the Google Search Appliance help center.
© Google Inc.