Content Sources > Web Crawl > Start and Block URLs

Use the Content Sources > Web Crawl > Start and Block URLs page to perform the following tasks:

  • Specify start URLs, the locations where the search appliance begins crawling
  • Specify follow patterns, the URLs that the crawler is allowed to follow
  • Specify do not follow patterns, the URLs that the crawler must avoid
  • Control whether URLs ending with index.html or index.htm are rewritten to end with "/"

Before crawling starts, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid. The crawler can access content over HTTP, HTTPS, and SMB protocols. More information about file system crawling appears below.

The Content Sources > Web Crawl > Start and Block URLs page gives you the option of working with URLs and patterns in one of two views: Action or Batch Edit. Action view provides several features for working with individual URLs and patterns, while Batch Edit view enables you to work with multiple URLs at once. For more details about this topic, see "Choosing a View Type."

Before Starting this Task

Before specifying URLs to crawl, you and other people in your organization might need to prepare the content for crawling. This preparation includes controlling the crawling of documents by:

  • Adding a robots exclusion protocol (robots.txt) for each content server to be crawled
  • Embedding robots META tags in HTML headers of documents
  • Adding googleoff/googleon tags to HTML document bodies (see the example after this list)
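
The following minimal sketch shows what the last two items can look like in an HTML document. The tag values shown here are only examples; adjust them to your own crawling policy:

<head>
  <!-- robots META tag: ask crawlers not to index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  <p>This paragraph is indexed.</p>
  <!--googleoff: index-->
  <p>The search appliance does not index the words in this paragraph.</p>
  <!--googleon: index-->
</body>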

For more information about these tasks, refer to "Administering Crawl: Preparing for a Crawl," which is linked to the Google Search Appliance help center.

If you want to include secure URLs in a crawl, configure crawler access to content servers that require authentication before adding any secure URL to the starting URLs. To configure crawler access to secure servers, use the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

About Start and Block URLs

The URLs and URL patterns that you enter on this page enable you to control and refine crawls. This page provides the following sections for entering URLs and URL patterns:

  • Start URLs
  • Follow Patterns
  • Do Not Follow Patterns

For a given URL to be crawled, it must match at least one URL pattern in the Follow Patterns section and none of the URL patterns in the Do Not Follow Patterns section. If a URL is matched by patterns from both Follow Patterns and Do Not Follow Patterns, the URL is not crawled.
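
For example, with a hypothetical host name, the two lists might contain the following entries:

Follow Patterns:

http://www.example.com/

Do Not Follow Patterns:

http://www.example.com/archive/

With these entries, http://www.example.com/docs/page.html is crawled because it matches the follow pattern and no do not follow pattern, but http://www.example.com/archive/2001/report.html is not crawled because it matches a do not follow pattern.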

The URLs and patterns that you list on this page must conform to the "Rules for Valid URL Patterns" described in "Administering Crawl: Constructing URL Patterns," which is linked to the Google Search Appliance help center.

Start URLs

Start URLs control where the Google Search Appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs.

All entries in this section must be fully qualified URLs in the following format:

<protocol>://<host>.<domain>[:port]/[path]

In this format, the protocol can be HTTP, HTTPS (for secure content), or SMB (for file shares). The information contained in square brackets [ ] is optional. The forward slash "/" after <host>.<domain>[:port] is required.

If you enter a URL with an IPv6 address, you must put square brackets around the address, as shown in the following format:

http://[2001::1]:80/[path]
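
For example, the following start URLs (all host names hypothetical) use the supported protocols:

http://intranet.example.com/
https://portal.example.com:8443/docs/
smb://fileserver.example.com/share/reports/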

Follow Patterns

In this section, enter all the URLs that appear in Start URLs.

Only URLs matching the patterns you specify in this section are followed and crawled. This allows you to control which files are crawled on your server.

The URLs that are discovered are checked against these patterns for inclusion in the index. For a URL to be crawled and indexed, there must be a sequence of links matching the follow patterns from one of the start URLs. If there is no valid link path, add the URL to the Start URLs section.
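
For example, if http://intranet.example.com/ is a start URL (hypothetical host), the Follow Patterns section might contain entries such as:

http://intranet.example.com/
http://hr.example.com/policies/

In this sketch, the first entry covers every URL under the start URL; the second limits crawling on a second, hypothetical server to a single directory.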

Do Not Follow Patterns

Enter URL patterns for specific file types, directories, or other sets of pages that you do not want crawled in this section.

For example, entering the pattern contains:? in this section prevents many Common Gateway Interface (CGI) scripts from being crawled.

Entering the pattern contains:? also excludes content feeds sent by connectors because googleconnector URLs contain question marks:

googleconnector://connector-name.localhost/doc?docid=unique-id

Even if you do not use googleconnector URLs, many enterprise content management systems generate URLs that contain question marks as query parameter separators. If your installation includes connectors for content management systems, use caution when you enter patterns containing question marks.
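
If you want to block query-string URLs on your web servers while still allowing content fed through connectors, one possible alternative to contains:? is to limit the match to web protocols. This is a sketch, assuming the regexp: pattern prefix described in "Administering Crawl: Constructing URL Patterns":

regexp:^http://.*\?
regexp:^https://.*\?

These patterns match HTTP and HTTPS URLs that contain a question mark, but they do not match googleconnector:// URLs. Note that they still block HTTP and HTTPS URLs with question marks, so they do not help if your content management system itself serves such URLs.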

For your convenience, this section is pre-populated with many URL patterns and file types, some of which you may not want the crawler to index. Google does not recommend deleting any of the default patterns. If you find that parts of your site are currently being excluded by these rules, deactivate the corresponding rule by putting a "#" in front of it. This way you can easily recover the default settings, should you need them.

Rewrite URLs

By default, the Google Search Appliance rewrites URLs ending with index.html or index.htm so that they end with "/". For example, if the search appliance crawls the URL http://company.com/index.html, it rewrites it as http://company.com/.

You can disable this behavior so that the search appliance does not rewrite URLs ending with index.html or index.htm. To disable rewriting URLs, uncheck the Rewrite URLs ending with index.html or index.htm to / checkbox. To re-enable the behavior, check the checkbox.

This setting is global and applies to all the URLs crawled by the search appliance. You cannot apply this setting to a specific URL pattern.

File System Crawling

To crawl documents stored in an SMB file share, type a URI that uses the smb: protocol, in the following format:

smb://file-server.domain/your-sharename/folder/

For example:

smb://file-server.domain/myshare/myfolder/

Do not start the crawl at the top-level SMB path. For example, the following path is invalid for crawl:

smb://file-server.domain/

Case-Insensitive URL Pattern Matching

URLs on the Content Sources > Web Crawl > Start and Block URLs page are case-sensitive. If you want case-insensitive URL pattern matching, use either the Content Sources > Web Crawl > Case-Insensitive Patterns page or the operator regexpIgnoreCase. For example, suppose you enter the following pattern:

regexpIgnoreCase:http://www\.mycompany\.com/documents/

That pattern would also match the following URLs:

http://www.mycompany.com/Documents/
http://www.mycompany.com/DOCUMENTS/

Crawling and Indexing Compressed Files

The search appliance supports crawling and indexing compressed files in the following formats: .zip, .tar, .tar.gz, and .tgz. In both search results and index diagnostics, a reference to a compressed file will appear as:

[path]/[compressed_file]#[extracted_file]
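
For example, a file named summary.pdf extracted from a hypothetical archive at /shared/reports.zip would appear as:

/shared/reports.zip#summary.pdf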

To enable the search appliance to crawl these types of compressed files:

  1. Under Do Not Follow Patterns, put a "#" in front of the following patterns (see the example after these steps):
    • .tar$
    • .zip$
    • .tar.gz$
    • .tgz$
    • regexpIgnoreCase:([^.]..|[^p].|[^s])[.]gz$
  2. Click Save.
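
After step 1, the corresponding entries in Do Not Follow Patterns look like the following. The "#" turns each line into a comment, so the patterns no longer block crawling:

# .tar$
# .zip$
# .tar.gz$
# .tgz$
# regexpIgnoreCase:([^.]..|[^p].|[^s])[.]gz$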

Choosing a View Type

When you are working with start and block URLs, you have the option of choosing the page view that works best for you: Action view or Batch Edit view. To choose a view, click your choice next to View Type. You can switch between page views without any loss of data; however, Batch Edit view does not provide all the options that Action view does.

Using Action View

Use Action view when you want to configure crawl, or when you want to validate, troubleshoot, recrawl, or test a URL or pattern.

In Action view, the Start URLs, Follow Patterns, and Do Not Follow Patterns sections are presented as tables where each row contains an individual URL or pattern. This view enables you to add, edit, or delete individual URLs and patterns. Additionally, Action view provides the Actions pull-down menus and Filter boxes.

Using the Actions Pull-Down Menus

The Actions pull-down menus enable you to act on individual URLs and patterns. To use a menu, click Actions in the row for a URL or pattern. The Actions pull-down menu for each URL and pattern table contains different commands, as described in the following table.

Actions Pull-Down Menu Commands
Start URLs
  • Check Crawler Access--Navigate to the Content Sources > Diagnostics > Real-time Diagnostics page to validate that the crawler can fetch the selected URL using all the crawler settings (for example, security and proxy).
  • Index Diagnostics--Navigate to the Index > Diagnostics > Index Diagnostics page to view the index status for the selected URL.
  • Recrawl--Recrawl the selected URL.
Follow Patterns
  • Test Patterns--Open the Pattern Tester Utility for the selected pattern. For more information about the Pattern Tester Utility, see "Testing URL Patterns."
  • Recrawl--Recrawl the selected pattern.

Using Filters

In some cases, such as troubleshooting a URL, you might want to clear the URL and patterns tables of all but the relevant entries. In Action view, you can filter the tables by typing or pasting a value in any Filter box. For example, if you only want to show URLs that begin with "https" in the tables, enter that value in a Filter box. All of the tables on the page are filtered by that value.

Configuring Crawl in Action View

To configure crawl in Action view, add URLs and patterns:

  1. Click Content Sources > Web Crawl > Start and Block URLs.
  2. Click Action View.
  3. Under Start URLs, click Add.
  4. Enter one or more start URLs, as well as matching follow patterns and any do not follow patterns for the start URLs.
    Type one URL per line; press Enter to start a new line. Comments (starting with #) are permitted.
  5. Click Save.
  6. Optionally, uncheck or recheck Rewrite URLs ending with index.html or index.htm to /.

To add follow patterns only:

  1. Under Follow Patterns, click Add.
  2. Enter one or more URL patterns.
    Press Enter to add more patterns, one per line. Comments (starting with #) are permitted.
  3. Click Save.

To add do not follow patterns only:

  1. Under Do Not Follow Patterns, click Add.
  2. Enter one or more URL patterns.
    Press Enter to add more patterns, one per line. Comments (starting with #) are permitted.
  3. Click Save.

Editing Individual URLs or Patterns

To edit an individual URL or pattern:

  1. In the table, click on the URL or pattern.
  2. In the text box, make any changes.
  3. Click Save.

Deleting Individual URLs or Patterns

To delete an individual URL or pattern:

  1. In a table, click the trash can icon on the line that corresponds to the URL or pattern.
  2. Click OK.

Testing URL Patterns

To test which URLs are matched by the patterns you have entered on this page, select Test Patterns from the Actions pull-down menu or click Test these patterns to open the Pattern Tester Utility. This utility lets you specify a list of URLs on the left and a set of patterns on the right. It indicates whether each URL is matched by one of the patterns in the set.

When it opens, the Pattern Tester Utility is initialized with either the selected pattern or your saved entries from the Content Sources > Web Crawl > Start and Block URLs page. You can enter more URLs and patterns into the tester utility to analyze your pattern sets. However, your modifications are not saved; you have to enter and save them explicitly on the Content Sources > Web Crawl > Start and Block URLs page.

A green background indicates that at least one of the patterns matches the URL; the utility also shows the first pattern that matched. A red background indicates that none of the patterns match the URL.
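
For example, with a hypothetical URL list and the follow patterns saved earlier, the utility might display results such as:

http://intranet.example.com/docs/guide.html     matched by http://intranet.example.com/ (green background)
http://intranet.example.com/cgi-bin/run?q=1     no pattern matched (red background)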

Using Batch Edit View

Use Batch Edit view when you want to add, edit, or delete multiple URLs and patterns at once.

Configuring Crawl in Batch Edit View

When you are working in Batch Edit view, type one URL per line; press Enter to start a new line. Empty lines and comments (starting with #) are permitted.
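
For example, the Start URLs box in Batch Edit view might contain the following entries (hypothetical hosts), with comment lines used to group them:

# Intranet web servers
http://intranet.example.com/
https://portal.example.com:8443/docs/

# File shares
smb://fileserver.example.com/share/reports/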

To configure crawl in Batch Edit view:

  1. Click Content Sources > Web Crawl > Start and Block URLs.
  2. Click Batch Edit View.
  3. Type starting URLs in Start URLs.
  4. Type all starting URLs in Follow Patterns.
  5. Optionally, type URL patterns in Do Not Follow Patterns.
  6. Optionally, uncheck or recheck Rewrite URLs ending with index.html or index.htm to /.
  7. Click Save.

Editing or Deleting URLs or Patterns

You can edit or delete any URLs or patterns in Batch Edit view. After making any changes, click Save.

For More Information

For detailed information about crawling, see "Administering Crawl," which is linked to the Google Search Appliance help center.


 
© Google Inc.