

Crawling and Indexing

After the Google Search Appliance has been set up (see Setting Up a Search Appliance), you can configure the search appliance to crawl the content sources that you identified during the planning phase, as described in Planning.

Crawl is the process by which the Google Search Appliance discovers enterprise content and creates a master index. The resulting index consists of all of the words, phrases, and metadata in the crawled documents. When users search for information, their queries are executed against the index rather than against the documents themselves. Searching against content that is already indexed is not interrupted, even as new content continues to be indexed.

The Google Search Appliance can crawl:

*
Content on web sites
*
Content in file systems

The Google Search Appliance is also capable of indexing:

*
Content in non-web repositories, by using connectors
*
Hard-to-find content, by using feeds
*
Content in relational databases

This section briefly describes how the Google Search Appliance indexes each type of content.

Crawling Public Content

Public content is not restricted in any way; users don’t need credentials to view it. Some of the most common forms of public content include:

*
*
*
*
*
*

The Google Search Appliance supports crawling of many types of formats, including word processing, spreadsheet, presentation, and others.

The Google Search Appliance crawls content on web sites or file systems according to crawl patterns that you specify by using the Admin Console. As the search appliance crawls public content sources, it indexes documents that it finds. To find more documents, the crawler follows links within the documents that it indexes. The search appliance does not crawl content that you choose to exclude from the index.

The following figure provides an overview of crawling public content.

What Content Is Not Crawled?

The Google Search Appliance does not crawl unlinked URLs or links that are embedded within an area tag. Also, the search appliance does not crawl or index content that is excluded by these mechanisms:

*
Do not follow and crawl URLs that you specify by using the Crawl and Index > Crawl URLs page in the Admin Console
*
robots.txt file—The Google Search Appliance always obeys the rules in robots.txt (see Content Prohibited by a robots.txt File in Administering Crawl) and it is not possible to override this feature. Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content (see Identifying the User Agent in Administering Crawl)
*
Robots META tags, such as noindex and nofollow, that are added to documents

Typically, webmasters, content owners, and search appliance administrators create robots.txt files and add META tags to documents before a search appliance starts crawling.
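As an illustration, a robots.txt file at the root of a content server might grant the search appliance's crawler access while blocking all other robots. The user-agent value shown here, gsa-crawler, is the default one; confirm the exact string for your appliance as described in Identifying the User Agent in Administering Crawl.

```
# Allow the search appliance crawler full access
User-agent: gsa-crawler
Disallow:

# Block all other robots from the entire site
User-agent: *
Disallow: /
```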

Configuring Crawl of Public Content

To configure a search appliance to crawl a content source, you specify top-level URLs and directory addresses and links that the search appliance should follow by using the Crawl and Index > Crawl URLs page in the Admin Console. In addition to specifying start URLs, you can also specify URLs that the search appliance should not follow and crawl.

By default, the search appliance crawls in continuous crawl mode. This means that after the Google Search Appliance creates the index, it always crawls content sources looking for new or modified content and updates the index to ensure that it contains the freshest listings. The search appliance can also crawl content according to a schedule.

Configure continuous crawl by performing the following steps with the Admin Console:

1.
Specifying where to start the crawl by listing top-level URLs and directory addresses in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page, shown in the following figure.
2.
Specifying links for the search appliance to follow and index by listing patterns in the Follow and Crawl Only URLs with the Following Patterns section.
3.
Listing any URLs that you don’t want the search appliance to crawl in the Do Not Crawl URLs with the Following Patterns section.
4.
Saving the URL patterns.
After you save the URL patterns, the search appliance begins crawling in continuous mode.
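To make the three pattern sections concrete, entries like the following could appear on the Crawl and Index > Crawl URLs page. The host name is hypothetical, and the exact pattern syntax that your software version accepts is documented on the page itself.

```
Start Crawling from the Following URLs:
  http://intranet.example.com/

Follow and Crawl Only URLs with the Following Patterns:
  http://intranet.example.com/

Do Not Crawl URLs with the Following Patterns:
  contains:logout
```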

If you prefer to have the search appliance crawl according to scheduled times, you must also perform the additional following tasks by using the Crawl and Index > Crawl Schedule page in the Admin Console:

1.
2.
3.

To schedule crawling times for a specific host, you can change the host load and times in the Crawl and Index > Host Load Schedule page. If you set a host load of 0, the crawler does not crawl that host during the configured time period.

If you want a document added to the crawl queue right away, enter its URL in Re-Crawl These URL Patterns on the Crawl and Index > Freshness Tuning page.

Learn More about Public Crawl

For in-depth information about public crawl, configuring a search appliance to crawl, and starting a crawl, refer to the introduction in Administering Crawl.

For a complete list of file types that the search appliance can crawl, refer to Indexable File Formats.

Crawling and Serving Controlled-Access Content

Controlled-access content is secure content—it is restricted so that not all users have access to it. For access to controlled-access content, users need authorization.

A search appliance discovers and indexes controlled-access content in the same way that it indexes all other content: by performing a crawl through the content sources. However, the search appliance requires access credentials to discover and index controlled-access content. Once you set up the search appliance with access credentials, it maintains a copy of all crawled content in the index.

The following figure provides an overview of crawling controlled-access content.

The following table lists the access-control methods that the search appliance supports and whether the methods are supported for crawl, serve, or both.

Method                                         Crawl   Serve

HTTP Basic                                       X       X
NTLM HTTP                                        X       X
LDAP (Lightweight Directory Access Protocol)             X
Forms Authentication                             X       X
X.509 Certificates                               X       X
Integrated Windows Authentication/Kerberos       X       X
SAML Service Provider Interfaces (SPIs)                  X
Connectors                                       X       X

Configuring Crawl of Controlled-Access Content

If the content files you want crawled and indexed are in a location that requires a login, create a special user account on your network for the search appliance. When you configure crawl on the Admin Console, provide the user name and password for that account. The search appliance presents those credentials before crawling files in that location.

Configure a search appliance to crawl controlled-access content by performing the following steps with the Admin Console:

1.
Configuring the crawl as described in Configuring Crawl of Public Content, but also providing the search appliance with URL patterns that match the controlled content.
2.
*
For HTTP Basic and NTLM HTTP, use the Crawl and Index > Crawler Access page
*

The following figure shows the Crawl and Index > Crawler Access page.

Managing Serve of Controlled-Access Content

When a user issues a search request for controlled-access content, the search appliance verifies the user’s identity and determines whether the user is authorized to view the content. This check is performed before the search appliance displays any content in search results. By performing these access-control checks in real time, the Google Search Appliance ensures that users see only results that they are authorized to view.

A search appliance can use the following methods to establish the user’s identity:

*
*
*
*
*
*
*

Once the user’s identity has been established, a search appliance attempts to determine whether the user has access to the secure content that matches their search. The search appliance performs authorization checks by applying flexible authorization rules. You can configure rules for:

*
*
*
*
*
*
*

The search appliance applies the rules in the order in which they appear in the authorization routing table on the Serving > Flexible Authorization page.

If the authorization check is successful, the secure content that matches the search query is included in the user’s search results.

Configuring Serve of Controlled-Access Content

The process for configuring serve of controlled-access content depends on the security method that you want to use, as described in the following list:

*
To configure a search appliance to perform forms authentication, use the Serving > Universal Login Auth Mechanisms > Cookie page.
*
*
*
*
To configure the search appliance to use the Authentication SPI, use the Serving > Universal Login Auth Mechanisms > SAML page.
*
To configure the search appliance to use connectors, use the Serving > Universal Login Auth Mechanisms > Connectors page.
*
To enable the search appliance to authenticate credentials against an LDAP server, use the Serving > Universal Login Auth Mechanisms > LDAP page in the Admin Console.
*
*
To configure flexible authorization rules, use the Serving > Flexible Authorization page.

Learn More about Controlled-Access Content

For complete information about configuring a search appliance to crawl and serve controlled-access content, refer to Managing Search for Controlled-Access Content.

Indexing Content in Non-Web Repositories

If your organization has content that is stored in non-web repositories, such as Enterprise Content Management (ECM) systems, you can enable the Google Search Appliance to index and serve this content by using the connector framework.

The Google Search Appliance provides the indexing capabilities for the following content management systems:

*
*
*
*
*
*
*
*

Also, Google partners have developed connectors for other non-web repositories. For information about these connectors, visit the Google Solutions Marketplace (http://www.google.com/enterprise/marketplace/).

The connector manager is the central part of the connector framework for the Google Search Appliance. It manages the creation, instantiation, scheduling, and monitoring of connectors that supply content and provide authentication and authorization services to the Google Search Appliance. Connectors run on connector managers residing in servlet containers installed on computers on your network. All Google-supported connectors are certified on Apache Tomcat.

When connecting to a document repository through an enterprise connector, the Google Search Appliance uses a process called “traversal.” During traversal, the connector issues queries to the repository to retrieve document data to feed to the Google Search Appliance for indexing. The connector manager formats the content and any associated metadata for a feed to the Google Search Appliance, which then creates an index of the documents.

The following figure provides an overview of indexing content in non-web repositories.

You can also create a custom connector for the Google Search Appliance, as described in Developing Custom Connectors.

Serving Results from a Content Management System

For public content in a repository, searches work the same way as they do with web and file-system content. The Google Search Appliance searches its index and returns relevant result sets to the user without any involvement by the connector.

To authorize access to private or protected content from a repository, the Google Search Appliance creates a connector instance at query time. The connector instance forwards authentication credentials to the repository for authorization checking. The connector manager recognizes identities passed from basic authentication, SAML authentication (see Authentication SPI), and client certificates. If a SAML authentication provider is set up to support single sign-on (SSO), the connector manager also recognizes identities passed from the SSO provider.

Obtaining the Connector Manager and Connectors

To run a connector, you need the software for the connector manager and the connector. The following table lists methods for obtaining the software components that you need to use connectors, as well as the support provided for each component.

Source code for the connector manager and connectors

Download the code from the Google Enterprise Connector Manager project (http://code.google.com/p/google-enterprise-connector-manager/).

The open-source software is for the development of third-party connectors. Developers using the resources provided in this project can create connectors for virtually any type of document-based repository. Google does not support the open-source software or changes you make to the open-source software.

An installer package that deploys Apache Tomcat, a connector manager, and a particular connector type

Download the package from the Google Enterprise Support web site.

Google supports the installer and the software packaged with the installer.

Configuring a Connector

Before you configure a connector, install the following software components:

*
*
*

The specific process that you follow for configuring a connector depends on the type of connector. Generally, you can configure a connector by performing the following steps:

1.
2.
Registering a connector manager by using the Connector Administration > Connector Managers page in the Admin Console.
3.
Adding a connector by using the Connector Administration > Connectors page, shown in the following figure.

4.
Configuring crawl patterns by using the Crawl and Index > Crawl URLs page.
5.
6.
7.
8.

Learn More about Connectors

For in-depth information about connectors, refer to the Google Search Appliance connector documents.

Indexing Hard-to-Find Content

During crawl, the search appliance finds most of the content that it indexes by following links within documents. However, many organizations have content that cannot be found this way because it is not linked from other documents. If your organization has content that cannot be found through links on crawled web pages, you can ensure that the Google Search Appliance indexes it by using Feeds. Feeds are also useful for the following types of content:

*
*

You can also use feeds to delete data from the index on the search appliance.

The Google Search Appliance supports two types of feeds, as described in the following table.

Web feed

A web feed does not provide content to the Google Search Appliance. Instead, a web feed provides a list of URLs to the search appliance. Optionally, a web feed may include metadata. The crawler queues the URLs listed in the web feed and fetches content for each document listed in the feed. Web feeds are incremental. The search appliance recrawls web feeds periodically, based on the crawl settings for your search appliance.

Content feed

A content feed provides both URLs and their content to the search appliance. A content feed may include metadata. A content feed can be either full or incremental. The search appliance only crawls content feeds when they are pushed.
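A feed is an XML document that conforms to the gsafeed DTD described in the Feeds Protocol Developer’s Guide. The following sketch shows what a minimal incremental content feed might look like; the datasource name and the record URL are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>sample_source</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://intranet.example.com/doc1" mimetype="text/html">
      <content><![CDATA[<html><body>Hello, feed</body></html>]]></content>
    </record>
  </group>
</gsafeed>
```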

The following figure provides an overview of indexing hard-to-find content by using feeds.

Pushing a Feed to the Search Appliance

To push a content feed to the search appliance, you must provide the following components:

*
*

You can use one of the feed clients described in the Feeds Protocol Developer’s Guide or write your own. For information about writing a feed client, refer to Writing Applications with the Feeds Protocol.

URL Patterns and Trusted IP lists that you define with the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined on the Admin Console.
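A feed client ultimately performs an HTTP POST of the feed XML to the search appliance. The following minimal sketch builds the form-encoded request body and sends it; the host name is hypothetical, and the port and endpoint shown (19900, /xmlfeed) are the conventional feed interface, which you should verify against the Feeds Protocol Developer’s Guide for your software version.

```python
from urllib import parse, request


def build_feed_request(datasource: str, feedtype: str, xml: str) -> bytes:
    """Form-encode the fields that the feed interface expects."""
    return parse.urlencode({
        "datasource": datasource,
        "feedtype": feedtype,
        "data": xml,
    }).encode("utf-8")


def push_feed(appliance_host: str, body: bytes) -> str:
    """POST the encoded feed to the appliance and return its response text."""
    req = request.Request(
        f"http://{appliance_host}:19900/xmlfeed",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Build (but do not send) a request body for a hypothetical feed.
    body = build_feed_request("sample_source", "incremental",
                              "<gsafeed>...</gsafeed>")
    print(body.decode("utf-8"))
```

Because the appliance accepts feeds only from trusted IP addresses, a client like this must run on a machine listed on the Crawl and Index > Feeds page.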

Push a content feed to the search appliance by performing the following steps:

1.
Adding the URLs for the documents defined in the feed to the crawl patterns by using the Crawl and Index > Crawl URLs page. URLs specified in the feed are crawled only if they match the patterns specified on the Crawl and Index > Crawl URLs page.
2.
Configuring the search appliance to accept the feed by using the Crawl and Index > Feeds page, shown in the following figure. To prevent unauthorized additions to your index, feeds are only accepted from machines that are specified on this page.

3.
4.
5.

Learn More about Feeds

For complete documentation on feeds, refer to the Feeds Protocol Developer’s Guide.

Indexing Database Content

The Google Search Appliance can also index records in a relational database. The Google Search Appliance supports indexing of the following relational database management systems:

*
*
*
*
*

The search appliance provides access to data stored in relational databases by crawling the content directly from the database and serving the content. The process of crawling a database is called “synchronizing a database.” To access content in a database, the Google Search Appliance sends SQL (Structured Query Language) queries using JDBC (Java Database Connectivity) adapters provided by database companies. It crawls the contents of the database and then pushes records from a database into the search appliance’s index using feeds.
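The SQL that the search appliance issues is configured for each database source as a crawl query. As a hedged illustration, a hypothetical articles table might be synchronized with a query like the following, where the primary-key column lets the appliance build a distinct URL for each record (table and column names are illustrative only):

```sql
SELECT article_id, title, body, last_modified
FROM articles
```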

The following figure provides an overview of indexing content in databases.

Synchronizing a Database

Synchronize a database by performing the following tasks with the Admin Console:

1.
Creating a new database source on the Crawl and Index > Databases page, shown in the following figure.

2.
3.
Starting a database synchronization by using the Crawl and Index > Databases page.

Learn More about Database Synchronization

For in-depth information about how the Google Search Appliance indexes and serves database content, as well as a complete list of databases and JDBC adapter versions that the Google Search Appliance supports, refer to Database Crawling and Serving in Administering Crawl.

Testing Indexed Content

After content has been crawled and indexed, you can use the Test Center to ensure that it is searchable. The Test Center enables you to test search across the indexed content, limit searches to specific collections (see Segmenting the Index) or specific front ends (see Using Front Ends), and verify that the correct content is indexed and that the results are what you expect.
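Under the hood, a Test Center search is an ordinary search request against the appliance. As an assumption for illustration, the following sketch builds an equivalent spot-check URL using common search parameters (search terms, a collection, a front end, and XML output); the host, collection, and front end names are hypothetical.

```python
from urllib.parse import urlencode


def spot_check_url(host: str, terms: str, collection: str, frontend: str) -> str:
    """Build a search URL equivalent to a Test Center query."""
    params = urlencode({
        "q": terms,              # the search terms
        "site": collection,      # restrict the search to one collection
        "client": frontend,      # serve results through a specific front end
        "output": "xml_no_dtd",  # machine-readable results for inspection
    })
    return f"http://{host}/search?{params}"


if __name__ == "__main__":
    print(spot_check_url("gsa.example.com", "quarterly report",
                         "default_collection", "default_frontend"))
```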

You can find a link to the Test Center at the upper right side of the Admin Console. When you click the Test Center link, a new browser window opens and displays the Test Center page, as shown in the following figure.
