
Crawl and Index > Crawl URLs

Use the Crawl and Index > Crawl URLs page to perform the following tasks:

  • Configuring a crawl
  • Crawling documents on file systems
  • Crawling and indexing compressed files
  • Testing URL patterns

Before crawling starts, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid.

For a given URL to be crawled, it must match at least one URL pattern in the Follow and Crawl Only URLs with the Following Patterns field and none of the URL patterns in the Do Not Crawl URLs with the Following Patterns field. If a URL matches patterns in both fields, the Do Not Crawl pattern takes precedence and the URL is not crawled.
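
For example, suppose (using hypothetical hostnames) that Follow and Crawl Only URLs with the Following Patterns contains:

http://www.mycompany.com/

and Do Not Crawl URLs with the Following Patterns contains:

contains:?

The URL http://www.mycompany.com/docs/welcome.html matches the follow pattern and no exclusion pattern, so it is crawled. The URL http://www.mycompany.com/search?q=report matches patterns in both fields, so it is not crawled.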

The crawler can access content over the HTTP, HTTPS, and SMB protocols. More information about crawling file systems appears in File System Crawling, below.

Before Starting this Task

Before specifying URLs to crawl, you and others in your organization might need to prepare the content for crawling. This preparation includes controlling the crawling of documents by:

  • Adding a robots exclusion protocol (robots.txt) for each content server to be crawled
  • Embedding robots META tags in the HEAD section of HTML documents
  • Adding googleoff/googleon tags to HTML document bodies
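
The following illustrative snippets show these three mechanisms; the paths and values are examples only, so adjust them to your own content:

# robots.txt on the content server: block crawling of a directory
User-agent: *
Disallow: /private/

<!-- robots META tag in the HEAD section of a document -->
<meta name="robots" content="noindex, nofollow">

<!-- googleoff/googleon tags in a document body -->
<!--googleoff: index-->This text is excluded from the index.<!--googleon: index-->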

For more information about these tasks, refer to "Administering Crawl: Preparing for a Crawl," which is linked to the Google Search Appliance help center.

If you want to include secure URLs in a crawl, configure crawler accesses to content servers that require authentication before adding any secure URL to the starting URLs. To configure crawler access to secure servers, use the Crawl and Index > Crawler Access page.

Configuring a Crawl

The fields described in the following table enable you to control and refine crawls. In each field, type one URL pattern per line, pressing Enter after each entry. Empty lines and comments (lines starting with #) are permitted. The URL patterns that you list in the fields on this page must conform to the "Rules for Valid URL Patterns" described in "Administering Crawl: Constructing URL Patterns," which is linked to the Google Search Appliance help center.

Start Crawling from the Following URLs

Type start URLs in this field. This field must contain at least one start URL.

Start URLs control where the Google Search Appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs.

All entries in this field must be fully qualified URLs, using the following format:

<protocol>://<host><domain>[:port]/[path]

In this format, the protocol can be HTTP, HTTPS (for secure content), or SMB (for file shares). The information contained in square brackets [ ] is optional. The forward slash "/" after <host><domain>[:port] is required.

If you enter a URL with an IPv6 address, you must put square brackets around the address, as shown in the following format:

http://[2001::1]:80/[path]
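
For example, the following are valid start URLs (the hostnames are hypothetical):

http://intranet.mycompany.com/
https://secure.mycompany.com/portal/
smb://fileserver.mycompany.com/share/documents/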

Follow and Crawl Only URLs with the Following Patterns

In this field, type all the start URLs that appear in Start Crawling from the Following URLs.

Only URLs matching the patterns you specify in this field are followed and crawled. This allows you to control which files are crawled on your server.

Discovered URLs are checked against these patterns to determine whether they are included in the index. For a URL to be crawled and indexed, there must be a sequence of links, each matching a follow pattern, leading from one of the start URLs to that URL. If no such link path exists, add the URL itself to the Start Crawling from the Following URLs field.
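
For example, the following patterns (hypothetical hostnames) crawl an entire intranet host but only the /reports/ directory on a second host:

http://intranet.mycompany.com/
http://finance.mycompany.com/reports/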

Do Not Crawl URLs with the Following Patterns

Type URL patterns for specific file types, directories, or other sets of pages that you do not want crawled in this field.

For example, entering the pattern contains:? in this box prevents many Common Gateway Interface (CGI) scripts from being crawled.

Entering the pattern contains:? also excludes content feeds sent by connectors because googleconnector URLs contain question marks:

googleconnector://connector-name.localhost/doc?docid=unique-id

Even if you do not use googleconnector URLs, many enterprise content management systems serve documents through URLs that contain question marks. If your installation includes connectors for content management systems, use caution when you enter patterns containing question marks.
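
For example, the following patterns (illustrative only) exclude CGI-style URLs, an archive directory, and backup files:

contains:?
contains:/archive/
.bak$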

For your convenience, this field is prepopulated with many URL patterns and file types, some of which you may not want the crawler to index. Google does not recommend deleting any of the default patterns. If you find that parts of your site are currently being excluded by these rules, deactivate the corresponding rule by putting a "#" in front of it. This way you can easily recover the default settings, should you need them.

By default, the Google Search Appliance rewrites URLs ending with index.html or index.htm so that they end with "/". For example, if the search appliance crawls the URL http://company.com/index.html, the URL is rewritten as http://company.com/.

To disable this behavior so that the search appliance does not rewrite URLs ending with index.html or index.htm, uncheck the Rewrite URLs ending with index.html or index.htm checkbox. To re-enable the behavior, check the checkbox.

This setting is global and applies to all the URLs crawled by the search appliance. You cannot apply this setting to a specific URL pattern.

To configure a crawl:

  1. Click Crawl and Index > Crawl URLs.
  2. Type starting URLs in Start Crawling from the Following URLs.
  3. Type all starting URLs in Follow and Crawl Only URLs with the Following Patterns.
  4. Optionally, type URL patterns in Do Not Crawl URLs with the Following Patterns.
  5. Optionally, uncheck or recheck Rewrite URLs ending with index.html or index.htm.
  6. Click Save URLs to Crawl.
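
For example, a minimal configuration (hostnames are hypothetical) might look like this:

Start Crawling from the Following URLs:

http://intranet.mycompany.com/

Follow and Crawl Only URLs with the Following Patterns:

http://intranet.mycompany.com/

Do Not Crawl URLs with the Following Patterns:

contains:?
contains:/archive/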

File System Crawling

To crawl documents stored in an SMB file share, type a URI that uses the smb: protocol, in the following format:

smb://file-server.domain/your-sharename/folder/

For example:

smb://file-server.domain/myshare/myfolder/

Do not start the crawl at the top-level SMB path. For example, the following path is not valid as a start URL:

smb://file-server.domain/

Case-Insensitive URL Pattern Matching

URL patterns on the Crawl and Index > Crawl URLs page are case-sensitive. If you want case-insensitive URL pattern matching, use either the Crawl and Index > Case-Insensitive Patterns page or the regexpIgnoreCase operator. For example, suppose you enter the following pattern:

regexpIgnoreCase:http://www\\.mycompany\\.com/documents/

That pattern would also match the following URLs:

http://www.mycompany.com/Documents/
http://www.mycompany.com/DOCUMENTS/

Crawling and Indexing Compressed Files

The search appliance supports crawling and indexing compressed files in the following formats: .zip, .tar, .tar.gz, and .tgz. To enable the search appliance to crawl these types of compressed files:

  1. Under Do Not Crawl URLs with the Following Patterns, put a "#" in front of the following patterns:
    • .tar$
    • .zip$
    • .tar.gz$
    • .tgz$
    • regexpIgnoreCase:([^.]..|[^p].|[^s])[.]gz$
  2. Click Save URLs to Crawl.
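
After step 1, the deactivated entries in the field look like this:

#.tar$
#.zip$
#.tar.gz$
#.tgz$
#regexpIgnoreCase:([^.]..|[^p].|[^s])[.]gz$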

Testing URL Patterns

To test which URLs are matched by the patterns you have entered on this page, click a Test these patterns link to open the Pattern Tester Utility. This utility lets you specify a list of URLs on the left and a set of patterns on the right. It indicates whether each URL is matched by any of the patterns in the set.

When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to analyze your pattern sets. However, your modifications are not saved; you have to enter and save them explicitly on the Crawl and Index > Crawl URLs page.

After you click the Test these patterns link, the results appear on the same page. A green background indicates that at least one pattern matches the URL, and the first matching pattern is shown. A red background indicates that none of the patterns matched the URL.
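
For example (hypothetical URLs), testing the following two URLs against the single pattern http://www.mycompany.com/ marks the first URL green, matched by pattern 1, and the second URL red:

http://www.mycompany.com/docs/welcome.html
http://www.othercompany.com/docs/welcome.html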

To return to the Crawl and Index > Crawl URLs page, click the Back to Crawl and Index > Crawl URLs link.

For More Information

For detailed information about crawling, see "Administering Crawl," which is linked to the Google Search Appliance help center.


 
© Google Inc.