Admin Console Help
Crawler Access

Use the Content Sources > Web Crawl > Secure Crawl > Crawler Access page to configure how the crawler accesses content servers that require authentication before granting access to confidential content. Configure crawler access to secure content servers before you specify any secure URLs as starting URLs on the Content Sources > Web Crawl > Start and Block URLs page.

Before Starting this Task

Before setting options for crawling secure content, have ready the URLs (or matching patterns), the domain used by the web server, and the user names and passwords. Before enabling Kerberos crawling, you must configure Kerberos for serving on the Kerberos tab of the Search > Secure Search > Universal Login Auth Mechanisms page.

Crawl and Serve Secure Content

You can index and serve results for content that is protected by authentication mechanisms (HTTP Basic authentication, NTLM, and Integrated Windows Authentication/Kerberos authentication), whether that content resides on a protected web server or a protected file share.

For HTTP Basic and NTLM, you must create an authentication rule instructing the crawler how to authenticate when crawling the protected content. An authentication rule consists of a URL pattern matching the protected files, a user name, a domain (if using NTLM), and a password. Using the Make Public checkbox, you can allow users to get results on both the public content (normally available to everyone) and the protected content. For Kerberos, you must enable Kerberos crawling, as described in "Enabling Kerberos Crawling."

Any public documents that match a URL pattern that you enter on this page are treated as access-controlled, and they appear only during secure search. When entering URL patterns, be sure to include only those that match access-controlled content.

If a content server supports both NTLM and Kerberos, you can configure either NTLM or Kerberos crawling. If you configure both, Kerberos takes precedence over NTLM. To make the search appliance bypass Kerberos and use NTLM with NTLM-only content servers, make sure those content servers send only NTLM as the supported mechanism in the WWW-Authenticate field of the response header, as shown in the following example:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: NTLM

The response should not include the Negotiate field that appears in the following example:

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
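As a rough sketch of this negotiation (illustrative only, not the search appliance's implementation; the URL is a placeholder), the following Python fragment reads the WWW-Authenticate fields of a 401 response and prefers Kerberos (Negotiate) over NTLM:

# Illustrative sketch only -- not the search appliance's code.
# Shows how a crawler might pick an authentication scheme from the
# WWW-Authenticate fields of a 401 response, preferring Kerberos.
import urllib.request
import urllib.error

def advertised_schemes(url: str) -> list[str]:
    """Return the schemes listed in WWW-Authenticate for a 401 response."""
    try:
        urllib.request.urlopen(url)
        return []  # no challenge; the URL is public
    except urllib.error.HTTPError as err:
        if err.code != 401:
            raise
        # A server may send several WWW-Authenticate headers.
        return [v.split()[0] for v in err.headers.get_all("WWW-Authenticate") or []]

def choose_scheme(schemes: list[str]) -> str | None:
    # Kerberos is negotiated via "Negotiate" and takes precedence over NTLM.
    for preferred in ("Negotiate", "NTLM", "Basic"):
        if preferred in schemes:
            return preferred
    return None

# A server that sends only "WWW-Authenticate: NTLM" forces the NTLM path.
print(choose_scheme(advertised_schemes("http://intranet.example.com/")))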
Setting Options for Crawling Secure Content

To set options for crawling secure content, add entries in the Users and Passwords for Crawling section of this page. For information about setting up crawling of secure content using Kerberos, see "Enabling Kerberos Crawling."

Example: This example shows how to configure the crawler to authenticate to various servers:
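(The URL patterns, user names, and domains below are placeholders; substitute values for your own servers.)

URL Pattern                                  User Name   Domain   Make Public
http://intranet.example.com/hr/              crawler1             No
http://reports.example.com/quarterly/        crawler2    SALES    No
smb://fileserver.example.com/specs/          svc_crawl   CORP     Yes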
The user names listed in the previous table are created by an authentication administrator, not by the search appliance.

Important: The entries you make in the Users and Passwords for Crawling section are sequential rules. Always enter more specific rules before general rules. For example, first enter a specific pattern such as http://www.example.com/engineering/procedures/ and then enter a more general pattern such as http://www.example.com/engineering/. To change the order of statements, click the Move Up or Move Down link next to the statement that you want to move.

If incorrect access information or credentials are entered here, retrieval or exclusion errors appear on the Index > Diagnostics > Index Diagnostics page of the Admin Console.

Important Security Note: Unless your Crawler Access patterns are correctly written, you risk sending the Basic Authentication credentials of both the crawler and your users to an untrusted web server. The crawler's authentication credentials may be sent to a web server when the crawler fetches a URL that matches a pattern that you have set up for crawling secure content. A user's authentication credentials may be sent to a web server when a URL that matches a pattern appears as a relevant search result. To prevent a web server from collecting the crawler's and your users' Basic Authentication credentials, ensure that Crawler Access patterns match only those URLs that actually require authentication.

Access to 'robots.txt' File

If a web server is configured to require authentication for all HTTP or HTTPS requests, be sure to create an authentication rule with a pattern that matches the '/robots.txt' file. In order to obey the Robots Exclusion Protocol, the crawler attempts to retrieve a site's /robots.txt file before crawling any other URLs on the site. If a site requires authentication for all requests and no authentication rule matches /robots.txt, the crawler receives an HTTP 401 response code and is unable to crawl any other URLs on the site.
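As a rough illustration of why this rule matters, the following Python sketch (using the third-party requests library; the URL and credentials are placeholders) checks whether a site's robots.txt is reachable with and without credentials:

# Illustrative sketch only: shows why a site that protects everything,
# including /robots.txt, blocks a well-behaved crawler that has no
# matching authentication rule. URL and credentials are placeholders.
import requests

site = "http://intranet.example.com"

# Without credentials, a fully protected server answers 401 for
# /robots.txt, so a crawler that obeys the Robots Exclusion Protocol
# must stop before fetching anything else on the site.
anonymous = requests.get(f"{site}/robots.txt")
print(anonymous.status_code)  # 401 on a fully protected server

# With an authentication rule whose pattern matches /robots.txt, the
# crawler can present credentials and read the file.
authed = requests.get(f"{site}/robots.txt", auth=("crawler1", "secret"))
print(authed.status_code)  # 200 if the credentials are accepted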
Enabling Kerberos Crawling

After configuring Kerberos for serving on the Kerberos tab of the Search > Secure Search > Universal Login Auth Mechanisms page, you can enable Kerberos crawling on the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

File System Crawling and Serving

Documents located on SMB (Server Message Block) file shares are indexed and served in search results as public or secure content. If access to the file share requires authentication, be sure to include the file share's URL pattern in the crawler access configuration. To ensure that content from a file share that requires authentication is served as secure content, clear the Make Public checkbox.

By default, if no authentication is specified on the Crawler Access page, the search appliance crawls the file shares that are listed on the Content Sources > Web Crawl > Start and Block URLs page as a "Guest" user. To crawl the files successfully as a Guest, the file server must be configured to allow guest access to those files.
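For example, a crawler access rule for a protected file share might use a URL pattern such as the following, where the server and share names are placeholders:

smb://fileserver.example.com/finance/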
About Secure Search Results

The search appliance can serve results over both plaintext HTTP and encrypted HTTPS. When secure content results are displayed, the total number of results and the number of pages returned are hidden, to avoid exposing information about secure documents to users who do not have access to them.

Although the crawl does not overload secure servers, a search request adds some load to servers containing secure content, because the search appliance checks the user's authorization for secure results at serve time.

In the Search Box section of Page Layout, you can add option buttons to your search page that let your users decide, at the time of their search, whether to search public content only or the complete index (both public and secure content). A query against public and secure content requires that the user be authenticated by entering the user name and password for the secure area.
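For illustration, the choice between public-only and full-index search is carried in the access parameter of the search protocol request. In the following example URL (the host, collection, and front-end names are placeholders), access=a asks for both public and secure results, while access=p would ask for public results only:

http://search.example.com/search?q=benefits&site=default_collection&client=default_frontend&output=xml_no_dtd&access=a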
If your servers require a domain name for authentication, users should enter it in the form domain\username (for example, sales\jsmith).
Crawling Password-Protected PDF Files

To crawl password-protected PDF files, configure the rules in the Password Protected PDF Files section.

Important Security Note: This interface does not change the security of the document. If the document is marked as public during crawl, anyone who searches for it can retrieve it and see its cached version.

Example: The following example shows how to configure the PDF owner password for some patterns:
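(The URL patterns below are placeholders; each password is the owner password that was set when the PDF file was created.)

URL Pattern                                   PDF Owner Password
http://docs.example.com/reports/              ********
http://intranet.example.com/manuals/          ********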
The entries listed in the previous table are created by an authentication administrator, not by the search appliance.

Important: The entries you add are sequential rules. Always enter more specific rules before general rules. If you are unsure, use the Test these patterns link to check whether your rules match the URLs you provide. To set up the rules for password-protected PDF files, add entries in the Password Protected PDF Files section of this page.
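To make the role of the owner password concrete, here is a minimal Python sketch, assuming the third-party pypdf library; the file name and password are placeholders, and this is not the appliance's implementation. It opens an encrypted PDF for text extraction, which is conceptually what happens at indexing time:

# Illustrative sketch only: shows what an owner password is used for
# during indexing. The file name and password are placeholders; the
# search appliance does not use pypdf.
from pypdf import PdfReader

reader = PdfReader("quarterly-report.pdf")
if reader.is_encrypted:
    # The configured owner password unlocks the document so that its
    # text can be extracted and indexed.
    reader.decrypt("owner-password")

text = "".join(page.extract_text() or "" for page in reader.pages)
print(text[:200])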
To change the order of rules, click the Move Up or Move Down link next to the rule that you want to move. To remove a rule, delete its entry from the rules list.
For More Information

For more information about crawler access, see "Managing Search for Controlled-Access Content," which is linked from the Google Search Appliance help center.
© Google Inc.