Admin Console Help
Crawler Access

Use the Content Sources > Web Crawl > Secure Crawl > Crawler Access page to configure how the crawler accesses content servers that require authentication before granting access to confidential content. Configure crawler access to secure content servers before you specify any secure URLs as starting URLs on the Content Sources > Web Crawl > Start and Block URLs page.

Before Starting this Task

Before setting options for crawling secure content, have ready the URLs (or matching patterns), the domain used by the web server, and the user names and passwords. Before enabling Kerberos crawling, you must configure Kerberos for serving on the Kerberos tab of the Search > Secure Search > Universal Login Auth Mechanisms page.

Crawl and Serve Secure Content

You can index and serve results for content that is protected by authentication mechanisms (HTTP Basic authentication, NTLM, and Integrated Windows Authentication/Kerberos authentication), whether that content resides on a protected web server or a protected file share.

For HTTP Basic and NTLM, you must create an authentication rule instructing the crawler how to authenticate when crawling the protected content. An authentication rule consists of a URL pattern matching the protected files, a user name, a domain (if using NTLM), and a password. Using the Make Public checkbox, you can allow users to get results on both the public content (normally available to everyone) and the protected content. For Kerberos, you must enable Kerberos crawling, as described in "Enabling Kerberos Crawling."

Any public documents that match a URL pattern that you enter on this page are treated as access-controlled, and they appear only during secure search. When entering URL patterns, be sure to include only those that match access-controlled content.

If a content server supports both NTLM and Kerberos, you can configure either NTLM or Kerberos crawling. If you configure both, Kerberos takes precedence over NTLM. To make the search appliance bypass Kerberos and use NTLM with NTLM-only content servers, make sure those content servers send only NTLM as the supported mechanism in the WWW-Authenticate field of the response header, as shown in the following example:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: NTLM

The response should not include the Negotiate field that appears in the following example:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
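As a rough sketch of this negotiation (illustrative only, not the search appliance's implementation; the URL is a placeholder), the following Python fragment reads the WWW-Authenticate fields of a 401 response and prefers Kerberos (Negotiate) over NTLM:

# Illustrative sketch only -- not the search appliance's code.
# Shows how a crawler might pick an authentication scheme from the
# WWW-Authenticate fields of a 401 response, preferring Kerberos.
import urllib.request
import urllib.error

def advertised_schemes(url: str) -> list[str]:
    """Return the schemes listed in WWW-Authenticate for a 401 response."""
    try:
        urllib.request.urlopen(url)
        return []  # no challenge; the URL is public
    except urllib.error.HTTPError as err:
        if err.code != 401:
            raise
        # A server may send several WWW-Authenticate headers.
        return [v.split()[0] for v in err.headers.get_all("WWW-Authenticate") or []]

def choose_scheme(schemes: list[str]) -> str | None:
    # Kerberos is negotiated via "Negotiate" and takes precedence over NTLM.
    for preferred in ("Negotiate", "NTLM", "Basic"):
        if preferred in schemes:
            return preferred
    return None

# A server that sends only "WWW-Authenticate: NTLM" forces the NTLM path.
print(choose_scheme(advertised_schemes("http://intranet.example.com/")))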
Setting Options for Crawling Secure Content

To set options for crawling secure content, add entries in the Users and Passwords for Crawling section of this page. For information about setting up crawling of secure content using Kerberos, see "Enabling Kerberos Crawling."

Example: This example shows how to configure the crawler to authenticate to various servers:
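(The URL patterns, user names, and domains below are placeholders; substitute values for your own servers.)

URL Pattern                                  User Name   Domain   Make Public
http://intranet.example.com/hr/              crawler1             No
http://reports.example.com/quarterly/        crawler2    SALES    No
smb://fileserver.example.com/specs/          svc_crawl   CORP     Yes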
The user names listed in the previous table are created by an authentication administrator, not by the search appliance.

Important: The entries you make in the Users and Passwords for Crawling section are sequential rules. Always enter more specific rules before general rules. For example, first enter a specific pattern such as http://www.example.com/engineering/procedures/ and then enter a more general pattern such as http://www.example.com/engineering/. To change the order of statements, click the Move Up or Move Down link next to the statement that you want to move.

If incorrect access information or credentials are entered here, retrieval or exclusion errors appear on the Index > Diagnostics > Index Diagnostics page of the Admin Console.

Important Security Note: Unless your Crawler Access patterns are correctly written, you risk sending the Basic Authentication credentials of both the crawler and your users to an untrusted web server. The crawler's authentication credentials may be sent to a web server when the crawler fetches a URL that matches a pattern that you have set up for crawling secure content. A user's authentication credentials may be sent to a web server when a URL that matches a pattern appears as a relevant search result. To prevent a web server from collecting the crawler's and your users' Basic Authentication credentials, ensure that Crawler Access patterns match only those URLs that actually require authentication.

Access to 'robots.txt' File

If a web server is configured to require authentication for all HTTP or HTTPS requests, be sure to create an authentication rule with a pattern that matches the '/robots.txt' file. In order to obey the Robots Exclusion Protocol, the crawler attempts to retrieve a site's /robots.txt file before crawling any other URLs on the site. If a site requires authentication for all requests and no authentication rule matches /robots.txt, the crawler receives an HTTP 401 response code and is unable to crawl any other URLs on the site.
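As a rough illustration of why this rule matters, the following Python sketch (using the third-party requests library; the URL and credentials are placeholders) checks whether a site's robots.txt is reachable with and without credentials:

# Illustrative sketch only: shows why a site that protects everything,
# including /robots.txt, blocks a well-behaved crawler that has no
# matching authentication rule. URL and credentials are placeholders.
import requests

site = "http://intranet.example.com"

# Without credentials, a fully protected server answers 401 for
# /robots.txt, so a crawler that obeys the Robots Exclusion Protocol
# must stop before fetching anything else on the site.
anonymous = requests.get(f"{site}/robots.txt")
print(anonymous.status_code)  # 401 on a fully protected server

# With an authentication rule whose pattern matches /robots.txt, the
# crawler can present credentials and read the file.
authed = requests.get(f"{site}/robots.txt", auth=("crawler1", "secret"))
print(authed.status_code)  # 200 if the credentials are accepted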
Enabling Kerberos Crawling

After configuring Kerberos for serving on the Kerberos tab of the Search > Secure Search > Universal Login Auth Mechanisms page, you can enable Kerberos crawling on the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

File System Crawling and Serving

Documents located on SMB (Server Message Block) file shares are indexed and served in search results as public or secure content. If access to the file share requires authentication, be sure to include the file share's URL pattern in the crawler access configuration. To ensure that content from a file share that requires authentication is served as secure content, clear the Make Public checkbox.

By default, if no authentication is specified on the Crawler Access page, the search appliance crawls the file shares that are listed on the Content Sources > Web Crawl > Start and Block URLs page as a "Guest" user. To crawl the files successfully as a Guest, the file server must be configured to allow guest access to those files.
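For example, a crawler access rule for a protected file share might use a URL pattern such as the following, where the server and share names are placeholders:

smb://fileserver.example.com/finance/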
About Secure Search Results

The search appliance can serve results over both plaintext HTTP and encrypted HTTPS. When secure content results are displayed, the total number of results and the number of pages returned are hidden, to avoid exposing information about secure documents to users who do not have access to them.

Although the crawl does not overload secure servers, a search request adds some load to servers containing secure content, because the search appliance checks the user's authorization for secure results at serve time.

In the Search Box section of Page Layout, you can add option buttons to your search page that let your users decide, at the time of their search, whether to search public content only or the complete index (both public and secure content). A query against public and secure content requires that the user be authenticated by entering the user name and password for the secure area.
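For illustration, the choice between public-only and full-index search is carried in the access parameter of the search protocol request. In the following example URL (the host, collection, and front-end names are placeholders), access=a asks for both public and secure results, while access=p would ask for public results only:

http://search.example.com/search?q=benefits&site=default_collection&client=default_frontend&output=xml_no_dtd&access=a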
If your servers require a domain name for authentication, users should enter it in the form domain\username (for example, sales\jsmith).
Crawling Password-Protected PDF Files

To crawl password-protected PDF files, configure the rules in the Password Protected PDF Files section.

Important Security Note: This interface does not change the security of the document. If the document is marked as public during crawl, anyone who searches for it can retrieve it and see its cached version.

Example: The following example shows how to configure the PDF owner password for some patterns:
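(The URL patterns below are placeholders; each password is the owner password that was set when the PDF file was created.)

URL Pattern                                   PDF Owner Password
http://docs.example.com/reports/              ********
http://intranet.example.com/manuals/          ********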
The entries listed in the previous table are created by an authentication administrator, not by the search appliance.

Important: The entries you add are sequential rules. Always enter more specific rules before general rules. If you are unsure, use the Test these patterns link to check whether your rules match the URLs you provide. To set up the rules for password-protected PDF files, add entries in the Password Protected PDF Files section of this page.
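To make the role of the owner password concrete, here is a minimal Python sketch, assuming the third-party pypdf library; the file name and password are placeholders, and this is not the appliance's implementation. It opens an encrypted PDF for text extraction, which is conceptually what happens at indexing time:

# Illustrative sketch only: shows what an owner password is used for
# during indexing. The file name and password are placeholders; the
# search appliance does not use pypdf.
from pypdf import PdfReader

reader = PdfReader("quarterly-report.pdf")
if reader.is_encrypted:
    # The configured owner password unlocks the document so that its
    # text can be extracted and indexed.
    reader.decrypt("owner-password")

text = "".join(page.extract_text() or "" for page in reader.pages)
print(text[:200])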
To change the order of rules, click the Move Up or Move Down link next to the rule that you want to move. To remove a rule, delete its entry from the rules list.
For More Information

For more information about crawler access, see "Managing Search for Controlled-Access Content," which is linked from the Google Search Appliance help center.
© Google Inc.