Crawling and Indexing
After the Google Search Appliance has been set up (see Setting Up a Search Appliance), you can configure the search appliance to crawl the content sources that you identified during the planning phase, as described in Planning.
The Google Search Appliance can crawl:
The Google Search Appliance is also capable of indexing:
This section briefly describes how the Google Search Appliance indexes each type of content.
Crawling Public Content
The following figure provides an overview of crawling public content.
What Content Is Not Crawled?
Do-not-crawl URLs: The search appliance does not follow or crawl URLs that you exclude by using the Crawl and Index > Crawl URLs page in the Admin Console.
robots.txt file: The Google Search Appliance always obeys the rules in robots.txt (see Content Prohibited by a robots.txt File in Administering Crawl), and it is not possible to override this behavior. Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content (see Identifying the User Agent in Administering Crawl).
Configuring Crawl of Public Content
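As a pre-check before you configure the crawl, the robots.txt verification described for content that is not crawled can be sketched with Python's standard library. This is a minimal illustration, not part of the product: the user agent string `gsa-crawler`, the host `intranet.example.com`, and the sample rules are assumptions, so confirm the real user agent in Identifying the User Agent in Administering Crawl.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the crawler's user agent string. Confirm the actual value in
# "Identifying the User Agent" in Administering Crawl.
USER_AGENT = "gsa-crawler"

# Hypothetical robots.txt content for an example content server.
SAMPLE_ROBOTS = """\
User-agent: gsa-crawler
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed_urls(robots_txt: str, urls: list[str],
                 user_agent: str = USER_AGENT) -> dict[str, bool]:
    """Report, per URL, whether the given robots.txt permits the user agent."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    checks = allowed_urls(SAMPLE_ROBOTS, [
        "http://intranet.example.com/docs/index.html",
        "http://intranet.example.com/private/report.html",
    ])
    for url, ok in checks.items():
        print("allowed" if ok else "blocked", url)
```

In practice you would fetch the robots.txt file from each content server and pass its text to `allowed_urls` together with the start URLs you plan to configure.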
To configure a search appliance to crawl a content source, you specify top-level URLs and directory addresses and links that the search appliance should follow by using the Crawl and Index > Crawl URLs page in the Admin Console. In addition to specifying start URLs, you can also specify URLs that the search appliance should not follow and crawl.
Configure continuous crawl by performing the following steps with the Admin Console:
1. Specifying where to start the crawl by listing top-level URLs and directory addresses in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page, shown in the following figure.
2. Specifying links for the search appliance to follow and index by listing patterns in the Follow and Crawl Only URLs with the Following Patterns section.
3. Listing any URLs that you don’t want the search appliance to crawl in the Do Not Crawl URLs with the Following Patterns section.
After you save the URL patterns, the search appliance begins crawling in continuous mode.
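The interaction of the three pattern lists can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the appliance's algorithm: it treats every pattern as a plain URL prefix (the real pattern syntax is richer, see Administering Crawl), it assumes that do-not-crawl patterns take precedence over follow patterns, and the function names and example URLs are hypothetical.

```python
from collections import deque


def should_crawl(url: str, follow_patterns: list[str],
                 do_not_crawl_patterns: list[str]) -> bool:
    """Return True if the URL matches a follow pattern and no do-not-crawl pattern.

    Simplification: patterns are matched as plain URL prefixes only.
    """
    if any(url.startswith(p) for p in do_not_crawl_patterns):
        return False  # Assumption: exclusions win over follow patterns.
    return any(url.startswith(p) for p in follow_patterns)


def plan_crawl(start_urls: list[str], links: dict[str, list[str]],
               follow_patterns: list[str],
               do_not_crawl_patterns: list[str]) -> list[str]:
    """Walk an in-memory link graph breadth-first, keeping only permitted URLs."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in seen or not should_crawl(url, follow_patterns,
                                           do_not_crawl_patterns):
            continue
        seen.add(url)
        order.append(url)
        # Links found on a permitted page are queued for the same check.
        queue.extend(links.get(url, []))
    return order


# Hypothetical site: one start page linking to three other URLs.
EXAMPLE_LINKS = {
    "http://intranet.example.com/": [
        "http://intranet.example.com/docs/a.html",
        "http://intranet.example.com/private/b.html",
        "http://other.example.org/",
    ],
}
```

With `http://intranet.example.com/` as both the start URL and the only follow pattern, and `http://intranet.example.com/private/` as a do-not-crawl pattern, `plan_crawl` keeps the start page and `docs/a.html` and skips the other two links.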
If you prefer to have the search appliance crawl at scheduled times, you must also perform the following additional tasks by using the Crawl and Index > Crawl Schedule page in the Admin Console:
To schedule crawling times for a specific host, change the host load and times on the Crawl and Index > Host Load Schedule page. If you set a host load of 0, the crawler does not crawl that host during the configured time period.
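The effect of a host load of 0 can be pictured with a short model. This is only a sketch of the idea: the `HostLoadRule` structure, the hour-based time windows, and the `default_load` value are hypothetical and do not reflect how the appliance stores or evaluates its schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostLoadRule:
    """Hypothetical schedule entry: a load that applies to a host in a time window."""
    host: str
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 1-24
    load: float


def effective_load(rules: list[HostLoadRule], host: str, hour: int,
                   default_load: float = 4.0) -> float:
    """Return the load of the first rule covering this host and hour.

    default_load is an arbitrary placeholder for hosts without a matching rule.
    """
    for rule in rules:
        if rule.host == host and rule.start_hour <= hour < rule.end_hour:
            return rule.load
    return default_load


def may_crawl(rules: list[HostLoadRule], host: str, hour: int) -> bool:
    """A load of 0 means the host is not crawled during that period."""
    return effective_load(rules, host, hour) > 0


# Example: pause crawling of one host during business hours.
EXAMPLE_RULES = [HostLoadRule("intranet.example.com", 9, 17, 0.0)]
```

Here `may_crawl(EXAMPLE_RULES, "intranet.example.com", 10)` is false because the 09:00 to 17:00 rule sets the load to 0, while the same call for hour 20 falls back to the default load and is true.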
To add a document to the crawl queue right away, enter its URL in Re-Crawl These URL Patterns on the Crawl and Index > Freshness Tuning page.
Learn More about Public Crawl
For in-depth information about public crawl, configuring a search appliance to crawl, and starting a crawl, refer to the introduction in Administering Crawl.
For a complete list of file types that the search appliance can crawl, refer to Indexable File Formats.
Crawling and Serving Controlled-Access Content
The following figure provides an overview of crawling controlled-access content.
Configuring Crawl of Controlled-Access Content
1. Configuring the crawl as described in Configuring Crawl of Public Content, but also providing the search appliance with URL patterns that match the controlled content.
For HTTP Basic and NTLM HTTP, use the Crawl and Index > Crawler Access page.
For HTTPS web sites, the search appliance uses a serving certificate as a client certificate when crawling. Upload a new certificate by using the Administration > Certificate Authorities page.
The following figure shows the Crawl and Index > Crawler Access page.
Managing Serve of Controlled-Access Content
A search appliance can use the following methods to establish the user’s identity:
The search appliance applies the rules in the order in which they appear in the authorization routing table on the Serving > Flexible Authorization page.
Configuring Serve of Controlled-Access Content
To configure a search appliance to perform forms authentication, use the Serving > Universal Login Auth Mechanisms > Cookie page.
To configure a search appliance to perform HTTP Basic or NTLM HTTP authentication, use the Serving > Universal Login Auth Mechanisms > HTTP page.
To configure the search appliance to require X.509 Certificate Authentication for search requests from users, use the Serving > Universal Login Auth Mechanisms > Client Certificate page.
To enable the search appliance to use IWA/Kerberos authentication during secure serve, use the Serving > Universal Login Auth Mechanisms > Kerberos page.
To configure the search appliance to use the Authentication SPI, use the Serving > Universal Login Auth Mechanisms > SAML page.
To configure the search appliance to use connectors, use the Serving > Universal Login Auth Mechanisms > Connectors page.
To enable the search appliance to authenticate credentials against an LDAP server, use the Serving > Universal Login Auth Mechanisms > LDAP page in the Admin Console.
To configure the search appliance to use the Authorization SPI, use the Serving > Access Control page.
To configure flexible authorization rules, use the Serving > Flexible Authorization page.
Learn More about Controlled-Access Content
For complete information about configuring a search appliance to crawl and serve controlled-access content, refer to Managing Search for Controlled-Access Content.
Indexing Content in Non-Web Repositories
Also, Google partners have developed connectors for other non-web repositories. For information about these connectors, visit the Google Solutions Marketplace (http://www.google.com/enterprise/marketplace/).
The following figure provides an overview of indexing content in non-web repositories.
You can also create a custom connector for the Google Search Appliance, as described in Developing Custom Connectors.
Serving Results from a Content Management System
To authorize access to private or protected content from a repository, the Google Search Appliance creates a connector instance at query time. The connector instance forwards authentication credentials to the repository for authorization checking. The connector manager recognizes identities passed from basic authentication, SAML authentication (see Authentication SPI), and client certificates. If a SAML authentication provider is set up to support single sign-on (SSO), the connector manager also recognizes identities passed from the SSO provider.
Obtaining the Connector Manager and Connectors
Configuring a Connector
Before you configure a connector, install the following software components:
2. Registering a connector manager by using the Connector Administration > Connector Managers page in the Admin Console.
3. Adding a connector by using the Connector Administration > Connectors page, shown in the following figure.
4. Configuring crawl patterns by using the Crawl and Index > Crawl URLs page.
5. If required by the connector, configuring feeds by using the Crawl and Index > Feeds page.
8. Verifying that the search appliance is indexing URLs from the connector by using the Status and Reports > Crawl Diagnostics page.
Learn More about Connectors
For in-depth information about connectors, refer to the Google Search Appliance connector documents.
Indexing Hard-to-Find Content
You can also use feeds to delete data from the index on the search appliance.
The Google Search Appliance supports two types of feeds, as described in the following table.
The following figure provides an overview of indexing hard-to-find content by using feeds.
Pushing a Feed to the Search Appliance
To push a content feed to the search appliance, you must provide the following components:
You can use one of the feed clients described in the Feeds Protocol Developer’s Guide or write your own. For information about writing a feed client, refer to Writing Applications with the Feeds Protocol.
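If you write your own feed client, one of its jobs is to build the XML feed document. The following is a rough sketch of that step only. The element and attribute names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `content`, `url`, `mimetype`) and the `incremental` feed type are assumptions based on a recollection of the feed format, and the function name and example data are hypothetical, so verify everything against the Feeds Protocol Developer’s Guide before relying on it. Sending the document to the search appliance is not shown.

```python
import xml.etree.ElementTree as ET


def build_content_feed(datasource: str, records: list[tuple[str, str]],
                       feedtype: str = "incremental") -> str:
    """Build a feed document as a string from (url, html) pairs.

    Assumption: the element layout below mirrors the feed format described in
    the Feeds Protocol Developer's Guide; check the guide for the exact schema.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype

    group = ET.SubElement(root, "group")
    for url, html in records:
        record = ET.SubElement(group, "record",
                               {"url": url, "mimetype": "text/html"})
        # ElementTree escapes markup characters in the text on serialization.
        ET.SubElement(record, "content").text = html

    return ET.tostring(root, encoding="unicode")


# Hypothetical example document that is hard to reach by following links.
EXAMPLE_FEED = build_content_feed(
    "example_source",
    [("http://intranet.example.com/orphan.html",
      "<html><body>Orphan page</body></html>")],
)
```

Each URL placed in the feed still has to match the crawl patterns on the Crawl and Index > Crawl URLs page, as the steps that follow describe.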
Push a content feed to the search appliance by performing the following steps:
1. Adding the URL for the document defined in the Feed Client to crawl patterns by using the Crawl and Index > Crawl URLs page. URLs specified in the feed will only be crawled if they pass through the patterns specified on the Crawl and Index > Crawl URLs page.
2. Configuring the search appliance to accept the feed by using the Crawl and Index > Feeds page, shown in the following figure. To prevent unauthorized additions to your index, feeds are only accepted from machines that are specified on this page.
Learn More about Feeds
For complete documentation on feeds, refer to the Feeds Protocol Developer’s Guide.
Indexing Database Content
The following figure provides an overview of indexing content in databases.
Synchronizing a Database
Synchronize a database by performing the following tasks with the Admin Console:
1. Creating a new database source on the Crawl and Index > Databases page, shown in the following figure.
2. Setting URL patterns that enable the search appliance to crawl the database by using the Crawl and Index > Crawl URLs page.
3. Starting a database synchronization by using the Crawl and Index > Databases page.
Learn More about Database Synchronization
For in-depth information about how the Google Search Appliance indexes and serves database content, as well as a complete list of databases and JDBC adapter versions that the Google Search Appliance supports, refer to Database Crawling and Serving in Administering Crawl.
Testing Indexed Content
Once the content has been crawled and indexed, you can verify that it is searchable by using the Test Center. The Test Center enables you to test search across the indexed content, limited to specific collections (see Segmenting the Index) or using specific front ends (see Using Front Ends), and to verify that the correct content is indexed and that the results are what you expect.
You can find a link to the Test Center at the upper right side of the Admin Console. When you click the Test Center link, a new browser window opens and displays the Test Center page, as shown in the following figure.