Crawl and Index > Entity Recognition

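Use the Crawl and Index > Entity Recognition page to define entities with dictionaries and composite entities, manage the entity blacklist, and enable or disable entity recognition.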

About Entity Recognition

Entity recognition enables the Google Search Appliance to discover interesting entities in documents with missing or poor metadata and store these entities in the search index.

For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata.

After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index. Once the entities are indexed, you can enhance keyword search by adding the entities to dynamic navigation. Dynamic navigation uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by specific attributes. For information about this topic, see "Adding Entities to Dynamic Navigation."

Additionally, by default, entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain-text URLs that appear in documents. As a result, you can match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For more information about this topic, see "Use Case: Matching URLs for Dynamic Navigation," in Administering Crawl: Advanced Topics.

The Crawl and Index > Entity Recognition page enables you to specify the entities that you want the search appliance to discover in your documents. However, before you can specify entities on this page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries are required for entity recognition: they enable the search appliance to discover specific entities in the content and annotate them as entities.

Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but they enable the search appliance to discover more complex terms by defining an entity as a sequence of terms. Because composite entities run on annotated terms, every word in a sequence must already be tagged with an entity, so composite entities depend on dictionaries.

The search appliance provides sample dictionaries and composite entities, as shown on the Crawl and Index > Entity Recognition page.

If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.

Google recommends that you perform the tasks for setting up entity recognition in the following order:

  1. Creating dictionaries and, optionally, composite entities.
  2. Adding new entities by adding dictionaries and, optionally, composite entities.
  3. Enabling entity recognition.

Entity recognition only runs on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents already in the index, force the search appliance to recrawl URL patterns by using the Status and Reports > Crawl Diagnostics page.

Entity recognition works for languages that are read left to right. It does not work for languages that are read right to left.

Dictionaries of Terms and Regular Expressions

You must define each entity by at least one dictionary, which contains one or more terms. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on.

After you create a dictionary, you can upload it to the search appliance as described in "Adding New Entities or Updating Existing Entities." You can also delete entities, download dictionaries, and edit dictionaries, as described in the corresponding sections.

Entity recognition accepts dictionaries in either of two formats, .txt and .xml, which are described in the following sections.

Dictionaries in TXT Format

A dictionary in .txt format contains one entity term on each line. A term can consist of several words separated by spaces. The following example shows an excerpt from the .txt file for the entity "Country."

...
United Arab Emirates
United Kingdom
United States
United States of America
Uruguay
Vanuatu
Vatican
...
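
A .txt dictionary like the one above can be generated from a list of terms. The following Python sketch is one possible way to do that. The function name `write_txt_dictionary`, the output file name, the UTF-8 encoding, and the whitespace and duplicate cleanup are all assumptions made for illustration; the only requirement taken from this page is one term per line.

```python
from pathlib import Path
import tempfile


def write_txt_dictionary(terms, path):
    """Write one entity term per line, skipping blanks and exact duplicates."""
    seen = set()
    lines = []
    for term in terms:
        cleaned = " ".join(term.split())  # collapse runs of whitespace
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    # UTF-8 is an assumption; this page does not state a required encoding.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


country_terms = ["United Kingdom", "United States", "United  States", "", "Uruguay"]
out_path = Path(tempfile.gettempdir()) / "country.txt"  # hypothetical file name
written = write_txt_dictionary(country_terms, out_path)
```

The resulting file can then be uploaded as described in "Adding New Entities or Updating Existing Entities."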

Dictionaries in XML Format

XML format enables a richer definition of an entity than .txt format. In particular, XML dictionaries enable you to define synonyms and regular expressions. The following code shows the XML structure of a dictionary.

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Entity name </name> 
   <term> Term 1 for the entity </term> 
   <term> Term 2 for the entity </term> 
   ...
   <pattern> Regular expression 1 </pattern> 
   <pattern> Regular expression 2 </pattern>
  ...
   <store_term_or_name> term/name </store_term_or_name>
   <store_regex_or_name>name/regex</store_regex_or_name>
   </instance>
   <instance>
   ...
   </instance>
</instances>

Each instance can contain the tags described in the following list. Each instance must contain at least one <term> or <pattern>.

  • <name>--The name that identifies the group of synonyms. The text in this tag is used to annotate the content source. (required)
  • <term>--A term in the list of synonyms. (zero or more)
  • <pattern>--A regular expression. (zero or more)
  • <store_term_or_name>--When a term is matched, specifies whether the term or the name is stored. By default, the term is stored internally. (optional)
  • <store_regex_or_name>--When a regular expression is matched, specifies whether the matching text or the name is stored. By default, the matching text is stored internally. (optional)

The following example shows the dictionary for the entity "Water."

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Water </name> 
   <term> water </term>
   <term> H2O </term> 
   </instance>
</instances>

Both water and H2O are recognized as Water. Because Water is in the name tag, Water is the string used in dynamic navigation to refer to pages that contain water and H2O. Because <store_term_or_name> is not defined, the default applies: each occurrence of a term in the content sources is stored internally as the term itself.
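
An XML dictionary with the structure described above can also be built programmatically. The following Python sketch uses the standard `xml.etree.ElementTree` module. The helper name `build_dictionary_xml` and the input layout (a list of dicts with `name`, `terms`, and `patterns` keys) are assumptions for illustration. The sketch writes tag text without the surrounding spaces shown in the examples; whether the search appliance treats that whitespace as significant is not stated on this page.

```python
import xml.etree.ElementTree as ET


def build_dictionary_xml(instances):
    """Build an <instances> document from a list of entity descriptions."""
    root = ET.Element("instances")
    for inst in instances:
        terms = inst.get("terms", [])
        patterns = inst.get("patterns", [])
        # Each instance must contain at least one <term> or <pattern>.
        if not terms and not patterns:
            raise ValueError(f"instance {inst['name']!r} has no term or pattern")
        node = ET.SubElement(root, "instance")
        ET.SubElement(node, "name").text = inst["name"]
        for term in terms:
            ET.SubElement(node, "term").text = term
        for pattern in patterns:
            # ElementTree escapes characters such as & and < in the pattern text.
            ET.SubElement(node, "pattern").text = pattern
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


capital_xml = build_dictionary_xml(
    [{"name": "Capital", "terms": ["Abu Dhabi", "Abuja"]}]
)
```

Generating the file through an XML library rather than by string concatenation avoids malformed output when a regular expression contains characters that are special in XML.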

The following example shows an internal project defined as a regular expression.

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Internal project </name> 
   <pattern> P[A-Z][0-9]*/{0,1}[0-9]* </pattern> 
   </instance>
</instances>

Because <store_regex_or_name> is not defined, the matching text is stored internally every time entity recognition finds a match for the regular expression.
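
Before uploading a dictionary, it can help to try a pattern against sample strings. The following Python sketch does this for the pattern above. Note the assumptions: this page refers to RE2 patterns, while Python's `re` module is a different engine, so the check is only approximate (this particular pattern uses constructs that both support). Requiring the whole string to match, through `fullmatch`, is also an assumption; this page does not describe how the search appliance anchors matches.

```python
import re

# The "Internal project" pattern from the example above.
PROJECT_PATTERN = re.compile(r"P[A-Z][0-9]*/{0,1}[0-9]*")


def matches_project_code(text):
    """Return True if the whole string matches the project pattern."""
    return PROJECT_PATTERN.fullmatch(text) is not None


samples = ["PA", "PX123", "PB12/7", "P123", "pa12"]
results = {sample: matches_project_code(sample) for sample in samples}
```

In this sketch, "PA", "PX123", and "PB12/7" match, while "P123" and "pa12" do not. Because every part after the second letter is optional, the pattern also accepts a bare two-letter code such as "PA", which may be broader than intended for some naming schemes.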

Composite Entities that Run on Annotated Terms

Optionally, you can define a composite entity, which enables you to define an entity as a sequence of terms. That is, a composite entity takes a sequence of annotated words, rather than the words themselves, as input. Every word in the sequence must already be annotated with an entity. Also, all the words must appear on the same line of a document, in the order defined in the composite entities file. If a sequence of words matches the composite entity, the full sequence is identified as an instance of the entity that the composite entity represents.

For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middle names, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.

For information about adding a new composite entity to the search appliance, see "Adding New Composite Entities." All new composite entities are appended to the Composite Entities Definition file. In addition to adding a new composite entity, you can update and delete composite entities, and download or upload the Composite Entities Definition file.

LL1 Grammars

A composite entity is written as an LL1 grammar. Entity recognition accepts LL1 grammars that take the following format:

     {Production Name} ::= {Production Name 2} [Terminal Name 1]...

The following restrictions apply to an LL1 grammar:

  • All productions are enclosed in {}. A production rule is a rewrite rule that specifies a symbol substitution that can be performed recursively.
  • All productions in any rule consequent (right-hand side) need to be defined at some point.
  • All productions in any rule antecedent (left-hand side) need to be used by some rule.
  • All terminals are enclosed in []. A terminal corresponds to an entity name that has been previously defined in a dictionary.
  • The implication sign is ::=. It means "consists of," that is, the production to the left of the symbol consists of the elements to the right.
  • The epsilon symbol is represented by [epsilon]. Epsilon represents the end of the production.
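
Some of these restrictions can be checked mechanically before a grammar is entered in the Admin Console. The following Python sketch parses rules of the form shown above and reports productions that are used but never defined, or defined but never used. The function names are hypothetical, and one assumption is made that this page does not state: the production on the left of the first rule is treated as the entry point, so it is exempt from the "must be used" check.

```python
import re

RULE_RE = re.compile(r"^\s*\{([^{}]+)\}\s*::=\s*(.*)$")
SYMBOL_RE = re.compile(r"\{([^{}]+)\}|\[([^\[\]]+)\]")


def parse_rules(grammar_text):
    """Return a list of (head, body) pairs; body items are (kind, name)."""
    rules = []
    for line in grammar_text.splitlines():
        if not line.strip():
            continue
        match = RULE_RE.match(line)
        if match is None:
            raise ValueError(f"Not a valid rule: {line!r}")
        head = match.group(1).strip()
        body = [
            ("production", prod.strip()) if prod else ("terminal", term.strip())
            for prod, term in SYMBOL_RE.findall(match.group(2))
        ]
        rules.append((head, body))
    return rules


def check_grammar(grammar_text):
    """Return a list of human-readable problems; empty if none were found."""
    rules = parse_rules(grammar_text)
    defined = {head for head, _ in rules}
    used = {name for _, body in rules for kind, name in body if kind == "production"}
    problems = []
    for name in sorted(used - defined):
        problems.append(f"production {{{name}}} is used but never defined")
    # Assumption: the first rule's head is the entry point and need not be used.
    root = rules[0][0] if rules else None
    for name in sorted(defined - used - {root}):
        problems.append(f"production {{{name}}} is defined but never used")
    return problems


sample = "{Location} ::= [City] {Separator} [Country]\n{Unused} ::= [epsilon]"
problems = check_grammar(sample)
```

For the sample grammar, the check reports that {Separator} is never defined and that {Unused} is never used. A rule with an unbalanced brace on its left side is rejected by `parse_rules` with a `ValueError`.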

FullName Composite Entity Example

The following example shows the FullName grammar.

{FullName} ::= {Title set} {Name set} {Middlenames} {Surname set}
{Title set} ::= [Title] {Title set}
{Title set} ::= [epsilon]
{Name set} ::= [Name] {Name set2} 
{Name set2} ::= [Name] {Name set2} 
{Name set2} ::= [epsilon] 
{Middlenames} ::= [Middlename] 
{Middlenames} ::= [epsilon] 
{Surname set} ::= [Surname] {Surname set2} 
{Surname set2} ::= [Surname] {Surname set2} 
{Surname set2} ::= [epsilon] 

This grammar accepts sequences of annotated words that contain, in the following order:

  1. Zero or more terms annotated as Title, where the entity Title has been previously defined.
  2. A set of one or more Name, where the entity Name has been previously defined.
  3. Zero or one Middlename, where the entity Middlename has been previously defined.
  4. A set of one or more Surname, where the entity Surname has been previously defined.
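
To make the behavior of the FullName grammar concrete, the following Python sketch is a hand-written recognizer with one function per production, following the rules as written above. It is not how the search appliance implements composite entities, which this page does not describe. It takes a list of entity names, one per annotated word, and requires the whole list to match; both choices are assumptions for illustration.

```python
def accepts_full_name(tags):
    """Return True if the sequence of entity names matches {FullName}."""
    tokens = list(tags)

    def peek(i):
        # One token of lookahead; None marks the end of the input.
        return tokens[i] if i < len(tokens) else None

    def title_set(i):
        # {Title set} ::= [Title] {Title set} | [epsilon]
        return title_set(i + 1) if peek(i) == "Title" else i

    def name_set2(i):
        # {Name set2} ::= [Name] {Name set2} | [epsilon]
        return name_set2(i + 1) if peek(i) == "Name" else i

    def name_set(i):
        # {Name set} ::= [Name] {Name set2}
        return name_set2(i + 1) if peek(i) == "Name" else None

    def middlenames(i):
        # {Middlenames} ::= [Middlename] | [epsilon]
        return i + 1 if peek(i) == "Middlename" else i

    def surname_set2(i):
        # {Surname set2} ::= [Surname] {Surname set2} | [epsilon]
        return surname_set2(i + 1) if peek(i) == "Surname" else i

    def surname_set(i):
        # {Surname set} ::= [Surname] {Surname set2}
        return surname_set2(i + 1) if peek(i) == "Surname" else None

    i = name_set(title_set(0))
    if i is None:
        return False
    i = surname_set(middlenames(i))
    return i is not None and i == len(tokens)


with_title = accepts_full_name(["Title", "Name", "Surname"])
with_middle = accepts_full_name(["Name", "Middlename", "Surname", "Surname"])
missing_name = accepts_full_name(["Title", "Surname"])
missing_surname = accepts_full_name(["Name"])
```

Each function decides what to do by looking only at the next entity name, which is the property that the LL1 format relies on. The first two calls return True; the last two return False because a Name and a Surname are both mandatory.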

Entity Blacklist File

The entity blacklist is a text file that contains terms not to store in the index. If the search appliance discovers an entity that is present in the blacklist when indexing new documents, it discards the entity.

By default, the entity blacklist file is empty. To add terms to the entity blacklist, create a .txt file in which each line contains a term that represents a blacklisted entity. A term can include any number of words, but cannot contain regular expressions.
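
The effect of the blacklist can be pictured as a simple filter over discovered entities. The following Python sketch illustrates the idea only; the function names are hypothetical, and it assumes exact, case-sensitive matching of whole terms, which this page does not specify.

```python
def load_blacklist(text):
    """Parse blacklist file content: one term per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_blacklisted(entities, blacklist):
    """Keep only the discovered entities that are not in the blacklist."""
    # Assumption: a term is discarded only on an exact, case-sensitive match.
    return [entity for entity in entities if entity not in blacklist]


blacklist = load_blacklist("Washington\n\nNew York\n")
kept = drop_blacklisted(["Paris", "New York", "washington"], blacklist)
```

Under these assumptions, "New York" is discarded, while "Paris" and the differently cased "washington" are kept.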

After you create the entity blacklist file, you can upload it to the search appliance. You can also download the entity blacklist file.

The entity blacklist is applied to documents that are indexed after you upload the file. Documents already in the index are not affected. To apply the entity blacklist to documents already in the index, force the search appliance to recrawl URL patterns by using the Status and Reports > Crawl Diagnostics page.

Enabling and Disabling Entity Recognition

Entity recognition is disabled by default. To enable it, click Enable. To disable it, click Disable.

Adding New Entities or Updating Existing Entities

Entities defined by dictionaries have the following two flags:

  • Case-sensitive--Indicates whether the terms in the dictionary are case-sensitive. This option applies only to dictionaries that contain simple terms (.txt dictionaries). For regular expression patterns (.xml dictionaries), case sensitivity is set in the RE2 pattern definition itself.
  • Transient--Indicates whether terms found in the content for this entity are discarded rather than stored in the entity data. For example, suppose that you want to identify full names, that is, sequences of titles, names, middle names, and surnames, but you do not want to store individual titles, names, middle names, or surnames. In this case, you would mark the Title, Name, Middlename, and Surname entities as transient, but not the FullName entity.

To add a new entity defined by a dictionary:

  1. In the Entity name field, enter the name of a new or existing entity.
  2. Click Browse to navigate to the dictionary file in its location and select it.
  3. Under Case sensitive? indicate whether the entity terms in the dictionary file are case-sensitive.
  4. Under Transient? indicate whether or not the entity should be stored.
    If the entity already exists, the existing Transient setting takes precedence over the new value.
  5. Click Upload.
    If the entity name is new, it appears in the list of entities. If the entity name already exists, the dictionary is added to the list of dictionaries for the entity.

Deleting Entities

To delete an entity:

  1. In the List of entities, click the Delete link for the entity.
  2. When you are prompted, confirm the deletion.

Downloading Dictionaries

To download the dictionary file for an entity:

  1. In the List of entities, click the Download link for the entity.
  2. Save or open the file.

Editing Dictionaries

If you want to make changes to a dictionary for an entity, follow these steps:

  1. Download the dictionary for the entity.
  2. Edit the dictionary outside the search appliance.
  3. Delete the existing entity.
  4. Add the entity and the updated dictionary.

Adding New Composite Entities

When you add a new composite entity, it gets appended to the Composite Entities Definition file.

To add a new composite entity:

  1. In the Composite Entities name field, enter a unique name for the composite entity.
  2. In the Composite Entities in LL1 Format box, enter the composite entity.
  3. Click Upload.
    The new composite entity appears in the scrolling list of composite entities.

Updating Composite Entities

When you upload a composite entity file, the input file must be in the following format:

  • For each grammar, start with grammars {
  • Add the entity_name
  • Add one line with each production
  • Close with }

The following example shows the valid format for a composite entity file:

grammars {
entity_name: "Location"
productions: "{Location} ::= [City] [Country]" }
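
A file in this format can be produced with a few lines of code. The following Python sketch renders one `grammars { ... }` block from an entity name and a list of productions. The function name is hypothetical. Repeating the `productions:` key once per production is an assumption based on the instruction to add one line for each production; the example above shows only a single production.

```python
def render_grammar_block(entity_name, productions):
    """Render one grammars { ... } block in the layout of the example above."""
    if not productions:
        raise ValueError("a composite entity needs at least one production")
    lines = ["grammars {", f'entity_name: "{entity_name}"']
    # Assumption: each production goes on its own "productions:" line.
    lines += [f'productions: "{production}"' for production in productions]
    lines[-1] += " }"  # close the block on the last line, as in the example
    return "\n".join(lines)


location_block = render_grammar_block(
    "Location", ["{Location} ::= [City] [Country]"]
)
```

For the Location entity, the sketch reproduces the three-line example shown above. It does not escape double quotes inside a production, so productions that contain them would need additional handling.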

To update a composite entity:

  1. In the scrolling list of user-defined composite entities, select the one that you want to update.
  2. In the Composite Entities in LL1 Format box, make changes to it.
  3. Click Update.

You can also update composite entities by following this procedure:

  1. Download the Composite Entities Definition file.
  2. Update the Composite Entities Definition file outside the search appliance.
  3. Upload the updated Composite Entities Definition file.

Deleting Composite Entities

To delete a composite entity:

  1. In the scrolling list of composite entities, select the composite entity that you want to delete.
  2. Click Delete.

Downloading the Composite Entities Definition File

The Composite Entities Definition file contains all the user-defined composite entities.

To download the Composite Entities Definition file:

  1. Next to Download the composite entities definition file, click Download.
  2. Save or open the file.

Uploading a Composite Entities Definition File

Uploading a new or updated Composite Entities Definition file overwrites the current one.

To upload a new Composite Entities Definition file:

  1. Next to Upload a new composite entities definition file, click Choose File to navigate to the file in its location and select it.
  2. Click Upload.

Downloading the Entity Blacklist File

To download the entity blacklist file:

  1. Next to Download the entity blacklist file, click Download.
  2. Save or open the file.

Uploading the Entity Blacklist File

Uploading a new or updated entity blacklist file overwrites the current one. The entity blacklist file must be a .txt file.

To upload an entity blacklist file:

  1. Next to Upload the entity blacklist file, click Choose File to navigate to the file in its location and select it.
  2. Click Upload.

Adding Entities to Dynamic Navigation

Once the entities are indexed, you can add the entities to dynamic navigation.

To add entities to dynamic navigation, use the Serving > Dynamic Navigation page to perform the following steps:

  1. Click Serving > Dynamic Navigation.
  2. Under Existing Configurations, click Add.
  3. In the Name box, type a name for the new configuration.
  4. Under Attributes, click Add Entity.
  5. In the Display Label box, enter the entity name that you want displayed in the dynamic navigation menu.
  6. From the Attribute Name drop-down menu, select the entity name.
  7. From the Type drop-down menu, select the entity type.
  8. Select options for sorting entities in the dynamic navigation panel.
  9. Click OK.

For information about applying the dynamic navigation configuration to a front end and showing dynamic navigation options in a front end, click Help Center > Serving > Dynamic Navigation.

For More Information

For more information about entity recognition, see "Creating the Search Experience: Best Practices," which is linked to the Google Search Appliance help center.

© Google Inc.