Admin Console Help

Index > Entity Recognition

Use the tabs on the Index > Entity Recognition page to perform the tasks described in this topic. The page contains the following tabs:

  • Simple Entities tab
  • Composite Entities tab
  • Blacklist tab
  • Entity Diagnostics tab
  • Adjustments tab

About Entity Recognition

Entity recognition enables the Google Search Appliance to discover interesting entities in documents with missing or poor metadata and store these entities in the search index.

For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata.

The search appliance can extract entities from the content of documents, and from the metadata associated with a document.

After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index. Once the entities are indexed, you can enhance keyword search by adding the entities in dynamic navigation, which uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by using specific attributes. For information about this topic, see "Adding Entities to Dynamic Navigation."

Additionally, by default, entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain text URLs that appear in documents. As a result, you can match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For more information about this topic, see "Use Case: Matching URLs for Dynamic Navigation," in Administering Crawl: Advanced Topics.

The Index > Entity Recognition page enables you to specify the simple entities that you want the search appliance to discover in your documents. However, before you can specify entities on this page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries are required for entity recognition: they enable the search appliance to discover specific entities in the content and annotate them as entities. Generally, a dictionary defines a simple entity with lists of terms and regular expressions.

Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but composite entities enable the search appliance to discover more complex terms. In a composite entity, you can define an entity with a sequence of terms. Because composite entities run on annotated terms, all the words in a sequence must be tagged with an entity and so depend on dictionaries.

The search appliance provides sample dictionaries and composite entities, as shown on the Index > Entity Recognition page.

If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.

You can test your entity recognition configuration and fine-tune it, enhancing it and correcting any mistakes.

Google recommends that you perform the tasks for setting up entity recognition in the following order:

  1. Creating dictionaries and, optionally, composite entities.
  2. Adding new simple entities by adding dictionaries and, optionally, composite entities.
  3. Testing your entity recognition configuration.
  4. Enabling entity recognition.

Entity recognition only runs on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.

Entity recognition works for languages that are read left to right. It does not work for languages that are read right to left.

Dictionaries of Terms and Regular Expressions

You must define each entity by at least one dictionary, which contains one or more terms. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on.

After you create a dictionary, you can upload it to the search appliance as described in "Adding New Simple Entities or Updating Existing Entities." In addition to adding new entities or updating existing entities, you can perform the following tasks:

  • Deleting entities
  • Downloading dictionaries
  • Editing dictionaries

Entity recognition accepts dictionaries in either of the following two formats:

  • TXT format
  • XML format

Dictionaries in TXT Format

A dictionary in .txt format contains one entity term on each line. A term can consist of several words separated by spaces. The following example shows an excerpt from the .txt file for the entity "Country."

...
United Arab Emirates
United Kingdom
United States
United States of America
Uruguay
Vanuatu
Vatican
...

Dictionaries in XML Format

XML format enables a richer definition of an entity than .txt format. In particular, XML dictionaries enable you to define synonyms and regular expressions. The following code shows the XML schema for a dictionary.

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Entity name </name> 
   <term> Term 1 for the entity </term> 
   <term> Term 2 for the entity </term> 
   ...
   <pattern> Regular expression 1 </pattern> 
   <pattern> Regular expression 2 </pattern>
  ...
   <store_term_or_name> term/name </store_term_or_name>
   <store_regex_or_name>name/regex/regex_tagged_as_first_group</store_regex_or_name>
   </instance>
   <instance>
   ...
   </instance>
</instances>

Each instance can contain the tags described in the following list. Each instance must contain at least one <term> or <pattern>.

  • <name>--The name that identifies the group of synonyms. The text in this tag is used to annotate the content source. (required)
  • <term>--A term in the list of synonyms. (zero or more)
  • <pattern>--A regular expression. (zero or more)
  • <store_term_or_name>--When a term is identified, indicates whether the term or the name is stored. Valid values are term and name. By default, the term is stored internally. (optional)
  • <store_regex_or_name>--When a regular expression matches, indicates whether the text that matched the regular expression or the name is stored. Valid values are name, regex, and regex_tagged_as_first_group. By default, the matched text is stored internally. With regex_tagged_as_first_group, only the text that matches the first group in the pattern is stored. For example, if the pattern is http://www.mycompany.com/(\w+)/[^\s]* and the text is http://www.mycompany.com/services/services.html, the first group matches "services," which is stored. (optional)
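
As a rough illustration of the regex_tagged_as_first_group behavior, the following Python sketch applies the example pattern to the example URL. Python's re module is only a stand-in here: the search appliance uses RE2, and the two are assumed to agree for this simple pattern. The function name stored_value and its mode argument are hypothetical.

```python
import re

# Pattern and text taken from the example in the tag description.
PATTERN = r"http://www.mycompany.com/(\w+)/[^\s]*"
TEXT = "http://www.mycompany.com/services/services.html"


def stored_value(pattern, text, mode="regex"):
    """Return the text that would be stored for a match (hypothetical helper).

    mode mirrors the <store_regex_or_name> values "regex" and
    "regex_tagged_as_first_group"; the "name" value is not modeled.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    if mode == "regex_tagged_as_first_group":
        return match.group(1)  # only the first capturing group
    return match.group(0)      # the full matched text (default)


print(stored_value(PATTERN, TEXT))
print(stored_value(PATTERN, TEXT, "regex_tagged_as_first_group"))
```

The first call returns the whole URL and the second returns only the path segment captured by (\w+).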

The following example shows the dictionary for the simple entity "Water."

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Water </name> 
   <term> water </term>
   <term> H2O </term> 
   </instance>
</instances>

Both water and H2O are recognized as Water. Because Water is in the <name> tag, "Water" is the string used in dynamic navigation to refer to pages that contain water and H2O. Because <store_term_or_name> is not defined, any appearance of either term in the content sources is stored internally.
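
If you maintain term lists elsewhere, an XML dictionary can be generated rather than written by hand. The following Python sketch is one possible approach, not a tool provided by the search appliance: build_dictionary is a hypothetical helper, and it assumes that the appliance accepts element text without surrounding spaces. The "Capital" entity and its terms come from the example earlier in this topic.

```python
import xml.etree.ElementTree as ET


def build_dictionary(entries):
    """Build an XML dictionary from (name, terms, patterns) tuples."""
    root = ET.Element("instances")
    for name, terms, patterns in entries:
        # Each instance must contain at least one <term> or <pattern>.
        if not terms and not patterns:
            raise ValueError(f"instance {name!r} needs a term or a pattern")
        instance = ET.SubElement(root, "instance")
        ET.SubElement(instance, "name").text = name
        for term in terms:
            ET.SubElement(instance, "term").text = term
        for pattern in patterns:
            ET.SubElement(instance, "pattern").text = pattern
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root, space="   ")
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


capitals = ["Abu Dhabi", "Abuja", "Accra", "Addis Ababa"]
print(build_dictionary([("Capital", capitals, [])]))
```

Building the file through an XML library also escapes characters such as & and < that can appear in regular expressions.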

The following example shows an internal project defined as a regular expression.

<?xml version="1.0"?> 
<instances> 
   <instance> 
   <name> Internal project </name> 
   <pattern> P[A-Z][0-9]*/{0,1}[0-9]* </pattern> 
   </instance>
</instances>

Because <store_regex_or_name> is not defined, the matching text is stored internally every time entity recognition finds a match for the regular expression.
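
Before uploading a pattern, it can help to check which strings it accepts. The following Python sketch tests whole tokens against the "Internal project" pattern. It assumes that each candidate is matched in full, because this topic does not specify how the search appliance tokenizes or anchors matches, and it uses Python's re module in place of RE2. The helper name is_project_code is hypothetical.

```python
import re

# The pattern from the "Internal project" dictionary above.
PROJECT = re.compile(r"P[A-Z][0-9]*/{0,1}[0-9]*")


def is_project_code(token):
    """Return True if the whole token matches the pattern."""
    return PROJECT.fullmatch(token) is not None


for token in ["PA123/45", "PB7", "PA", "P1", "pa12", "PA12/3/4"]:
    print(token, is_project_code(token))
```

Because both digit groups and the slash are optional, a bare two-letter token such as PA also matches, which may be broader than intended for a real project naming scheme.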

Entity recognition for metadata processes the following terms, splitting the values into terms according to the separator configuration:

"meta_name=value1 value2 ..."
"meta_name"
"value1"
"value2"
"..."

The following example shows a dictionary for metadata in the format <meta name="author" value="smith and other text"> or <meta name="creator" value="jones and other text">.

<?xml version="1.0"?>
<instances><instance>
  <name> writer </name>
  <pattern> author=(.*)</pattern>
  <pattern> creator=(.*)</pattern>
  <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
</instance></instances>

The value matched to pattern (.*) will be the value of the recognized entity.
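
The following Python sketch shows how the metadata terms and the "writer" patterns fit together. It is an approximation under stated assumptions: the separator is assumed to be a space, each pattern is assumed to match a whole term, and the function names metadata_terms and recognize_writer are hypothetical.

```python
import re


def metadata_terms(name, value, separator=" "):
    """List the terms processed for one meta tag, in the order shown above."""
    values = [v for v in value.split(separator) if v]
    return [f"{name}={value}", name] + values


# Patterns from the "writer" dictionary above.
WRITER_PATTERNS = [re.compile(r"author=(.*)"), re.compile(r"creator=(.*)")]


def recognize_writer(terms):
    """Return the first-group value of the first term that matches a pattern."""
    for term in terms:
        for pattern in WRITER_PATTERNS:
            match = pattern.fullmatch(term)
            if match:
                return match.group(1)
    return None


terms = metadata_terms("author", "smith and other text")
print(terms)
print(recognize_writer(terms))
```

Only the combined "meta_name=value" term contains the equals sign, so it is the only term that the author= and creator= patterns can match.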

Composite Entities that Run on Annotated Terms

Optionally, you can define a composite entity, which enables you to define an entity as a sequence of terms. That is, a composite entity takes a sequence of annotated words, rather than the words themselves, as input. Every word in the sequence must already be annotated with an entity. Also, all the words must appear within the same line of a document, in the order defined in the composite entities file. If a sequence of words matches the composite entity, the full sequence is identified as an instance of the entity that the composite entity represents.

For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middle names, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.

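For information about adding a new composite entity to the search appliance, see "Adding New Composite Entities." All new composite entities are appended to the Composite Entities Definition file. In addition to adding a new composite entity, you can also perform the following tasks:

  • Updating composite entities
  • Deleting composite entities
  • Downloading the Composite Entities Definition file
  • Uploading a Composite Entities Definition file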

LL1 Grammars

A composite entity is written as an LL1 grammar. Entity recognition accepts LL1 grammars that take the following format:

     {Production Name} ::= {Production Name 2} [Terminal Name 1]...

The following restrictions apply to an LL1 grammar:

  • All productions are enclosed in {}. A production rule is a rewrite rule that specifies a symbol substitution that can be performed recursively.
  • All productions in any rule consequent need to be defined at some point.
  • All productions in any rule antecedent need to be used by some rules.
  • All terminals are enclosed in []. A terminal corresponds to an entity name that has been previously defined in a dictionary.
  • The implication sign is ::=, which means "consists of," that is, the production to the left of the symbol consists of the elements to the right.
  • The epsilon symbol is represented by [epsilon], which represents the end of the production.

FullName Composite Entity Example

The following example shows the FullName grammar.

{FullName} ::= {Title set} {Name set} {Middlenames} {Surname set}
{Title set} ::= [Title] {Title set}
{Title set} ::= [epsilon]
{Name set} ::= [Name] {Name set2} 
{Name set2} ::= [Name] {Name set2} 
{Name set2} ::= [epsilon] 
{Middlenames} ::= [Middlename] 
{Middlenames} ::= [epsilon] 
{Surname set} ::= [Surname] {Surname set2} 
{Surname set2} ::= [Surname] {Surname set2} 
{Surname set2} ::= [epsilon] 
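
The grammar restrictions can be checked mechanically before you upload a composite entity. The following Python sketch is a simplified check, not the validation that the search appliance performs. The function check_grammar is hypothetical, and it assumes that the start production (here FullName) is exempt from the rule that every production must be used. The FullName rules below are written with every production name enclosed in braces.

```python
import re

# A rule has the form {Production} ::= symbols, where each symbol is
# either a {production} or a [terminal].
RULE = re.compile(r"^\s*\{([^{}]+)\}\s*::=\s*(.+?)\s*$")
SYMBOL = re.compile(r"\{([^{}]+)\}|\[([^\[\]]+)\]")


def check_grammar(rules, start):
    """Return (undefined productions, unused productions) for the rules."""
    defined, used = set(), set()
    for rule in rules:
        parsed = RULE.match(rule)
        if parsed is None:
            raise ValueError(f"malformed rule: {rule!r}")
        head, body = parsed.groups()
        if SYMBOL.sub("", body).strip():
            raise ValueError(f"unexpected text in rule body: {rule!r}")
        defined.add(head)
        used.update(prod for prod, _terminal in SYMBOL.findall(body) if prod)
    return used - defined, defined - used - {start}


FULL_NAME = [
    "{FullName} ::= {Title set} {Name set} {Middlenames} {Surname set}",
    "{Title set} ::= [Title] {Title set}",
    "{Title set} ::= [epsilon]",
    "{Name set} ::= [Name] {Name set2}",
    "{Name set2} ::= [Name] {Name set2}",
    "{Name set2} ::= [epsilon]",
    "{Middlenames} ::= [Middlename]",
    "{Middlenames} ::= [epsilon]",
    "{Surname set} ::= [Surname] {Surname set2}",
    "{Surname set2} ::= [Surname] {Surname set2}",
    "{Surname set2} ::= [epsilon]",
]

print(check_grammar(FULL_NAME, "FullName"))
```

Both returned sets are empty for the FullName grammar. A rule with an unbalanced brace raises ValueError instead of being silently accepted.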

This grammar accepts sequences of annotated words that contain, in the following order:

  1. Zero or more terms annotated as Title, where the entity Title has been previously defined.
  2. A set of one or more Name, where the entity Name has been previously defined.
  3. Zero or one Middlename, where the entity Middlename has been previously defined.
  4. A set of one or more Surname, where the entity Surname has been previously defined.
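
To make the accepted sequences concrete, the following Python sketch checks a list of entity annotations against the FullName productions as written; the recursive {Title set} rule admits repeated Title terms. This is a hand-written recognizer for illustration only, the function name accepts_full_name is hypothetical, and it does not reflect how the search appliance implements composite entities.

```python
def accepts_full_name(tags):
    """Return True if the annotation sequence matches the FullName grammar."""
    position = 0
    count = len(tags)

    def consume(tag, at_most=None):
        """Advance past consecutive occurrences of tag; return how many."""
        nonlocal position
        taken = 0
        while position < count and tags[position] == tag:
            if at_most is not None and taken == at_most:
                break
            position += 1
            taken += 1
        return taken

    consume("Title")                      # {Title set}: may be empty
    if consume("Name") == 0:              # {Name set}: at least one Name
        return False
    consume("Middlename", at_most=1)      # {Middlenames}: at most one
    if consume("Surname") == 0:           # {Surname set}: at least one Surname
        return False
    return position == count              # nothing may be left over


print(accepts_full_name(["Title", "Name", "Surname"]))
print(accepts_full_name(["Name", "Middlename", "Surname", "Surname"]))
print(accepts_full_name(["Title", "Surname"]))
```

The first two sequences are accepted; the third is rejected because it contains no Name.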

Entity Blacklist File

The entity blacklist is a text file that contains terms not to store in the index. If the search appliance discovers an entity that is present in the blacklist when indexing new documents, it discards the entity.

By default, the entity blacklist file is empty. To add terms to the entity blacklist, create a .txt file in which each line contains a term that represents a blacklisted entity. A term can include any number of words, but cannot contain regular expressions.
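
The effect of the blacklist can be pictured as a filter over discovered entities. The following Python sketch assumes an exact, case-sensitive comparison between the discovered value and each blacklist line, and it ignores blank lines; this topic does not state either behavior, so treat both as assumptions. The function names are hypothetical.

```python
def load_blacklist(text):
    """Parse blacklist file contents: one term per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_blacklisted(entities, blacklist):
    """Keep only (entity_name, value) pairs whose value is not blacklisted."""
    return [(name, value) for name, value in entities if value not in blacklist]


blacklist = load_blacklist("Vatican\nUnited States of America\n")
found = [("Country", "Uruguay"), ("Country", "Vatican")]
print(drop_blacklisted(found, blacklist))
```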

After you create the entity blacklist file, you can upload it to the search appliance. You can also download the entity blacklist file.

The entity blacklist is applied to documents that are indexed after you upload the file. Documents already in the index are not affected. To apply the entity blacklist to documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.

Enabling and Disabling Entity Recognition

Entity recognition is disabled by default. To enable it, click Enable. To disable it, click Disable.

Adding New Simple Entities or Updating Existing Entities

Simple entities defined by dictionaries have the following three flags:

  • Application area--Indicates where to apply the uploaded dictionary. This flag has the following options:
    • All--Get entities for the dictionary from document URLs, content, and metadata (default).
    • Url--Get entities for the dictionary only from document URLs, not from content or metadata.
    • Content--Get entities for the dictionary only from document content, not from URLs or metadata.
    • Metadata--Get entities for the dictionary only from document metadata, not from content or URLs.
  • Case-sensitive--Indicates whether the terms in the dictionary are case-sensitive. This option applies only to dictionaries that contain simple terms (.txt dictionaries). For regular expression patterns (.xml dictionaries), case sensitivity is set in the RE2 pattern definition itself.
  • Transient--Indicates whether terms found in the content for this entity are excluded from the stored entity data. For example, suppose that you want to identify full names, that is, occurrences of a sequence of titles, names, middle names, and surnames, but you do not want to store individual titles, names, middle names, or surnames. In this case, you would indicate that titles, names, middle names, and surnames are transient, but full names are not.
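
For regular expression patterns, RE2 syntax supports the inline (?i) flag, which makes a pattern case-insensitive. The following Python sketch shows the difference; Python's re module accepts the same flag and stands in for RE2 here. The project-code pattern is a hypothetical variant used only for illustration.

```python
import re

# Without a flag, the pattern is case-sensitive.
case_sensitive = re.compile(r"P[A-Z][0-9]+")
# The inline (?i) flag at the start makes the whole pattern case-insensitive.
case_insensitive = re.compile(r"(?i)P[A-Z][0-9]+")

print(bool(case_sensitive.fullmatch("pa12")))
print(bool(case_insensitive.fullmatch("pa12")))
```

In an XML dictionary, the flag would go at the start of the text in the <pattern> tag.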

To add a new simple entity defined by a dictionary:

  1. In the Entity name field, enter the name of a new or existing entity.
  2. Click Choose File to navigate to the dictionary file in its location and select it.
  3. Under Application area, choose where to apply the uploaded dictionary.
  4. Under Case sensitive? indicate whether the entity terms in the dictionary file are case-sensitive.
  5. Under Transient? indicate whether or not the entity should be stored.
    If the entity already exists, its existing Transient setting takes precedence over the new value.
  6. Click Upload.
    If the entity name is new, it appears in the list of entities. If the entity name already exists, the dictionary is added to the list of dictionaries for the entity.

Deleting Entities

To delete an entity:

  1. In the List of entities, click the Delete link for the entity.
  2. When you are prompted, confirm the deletion.

Downloading Dictionaries

To download the dictionary file for an entity:

  1. In the List of entities, click the Download link for the entity.
  2. Save or open the file.

Editing Dictionaries

If you want to make changes to a dictionary for an entity, follow these steps:

  1. Download the dictionary for the entity.
  2. Edit the dictionary outside the search appliance.
  3. Delete the existing entity.
  4. Add the entity and the updated dictionary.

Adding New Composite Entities

When you add a new composite entity, it gets appended to the Composite Entities Definition file.

To add a new composite entity:

  1. In the Composite Entities name field, enter a unique name for the composite entity.
  2. In the Composite Entities in LL1 Format box, enter the composite entity.
  3. Click Upload.
    The new composite entity appears in the scrolling list of composite entities.

Updating Composite Entities

To update a composite entity:

  1. In the scrolling list of user-defined composite entities, select the one that you want to update.
  2. In the Composite Entities in LL1 Format box, make changes to it.
  3. Click Update.

You can also update composite entities by following this procedure:

  1. Download the Composite Entities Definition file.
  2. Update the Composite Entities Definition file outside the search appliance.
  3. Upload the updated Composite Entities Definition file.

Deleting Composite Entities

To delete a composite entity:

  1. In the scrolling list of composite entities, select the composite entity that you want to delete.
  2. Click Delete.

Downloading the Composite Entities Definition File

The Composite Entities Definition file contains all the user-defined composite entities.

To download the Composite Entities Definition file:

  1. Under Download the composite entities definition file, click Download.
  2. Save or open the file.

Uploading a Composite Entities Definition File

When you upload a composite entity file, the input file must be in the following format:

  • For each grammar, start with grammars {
  • Add the entity_name
  • Add one line with each production
  • Close with }

The following example shows the valid format for a composite entity file:

grammars {
entity_name: "Location"
productions: "{Location} ::= [City] [Country]" }
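
If you generate the Composite Entities Definition file from another source, a small formatter can keep the layout consistent. The following Python sketch reproduces the layout of the example above. It assumes that a grammar with several productions repeats the productions line once per production, which is an interpretation of "one line with each production" rather than a documented fact. The function format_grammar is hypothetical, and it rejects double quotes because this topic does not describe any escaping.

```python
def format_grammar(entity_name, productions):
    """Format one composite entity in the layout shown in the example."""
    if not productions:
        raise ValueError("a grammar needs at least one production")
    for text in [entity_name, *productions]:
        if '"' in text:
            raise ValueError(f"double quotes are not supported: {text!r}")
    lines = ["grammars {", f'entity_name: "{entity_name}"']
    lines += [f'productions: "{production}"' for production in productions]
    return "\n".join(lines) + " }"


print(format_grammar("Location", ["{Location} ::= [City] [Country]"]))
```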

Uploading a new or updated Composite Entities Definition file overwrites the current one.

To upload a new Composite Entities Definition file:

  1. Under Upload a new composite entities definition file, click Choose File to navigate to the file in its location and select it.
  2. Click Upload.

Downloading the Entity Blacklist File

To download the entity blacklist file:

  1. Under Download the entity blacklist file, click Download.
  2. Save or open the file.

Deleting the Entity Blacklist File

To delete the entity blacklist file, click Delete.

Uploading the Entity Blacklist File

Uploading a new or updated entity blacklist file overwrites the current one. The entity blacklist file must be a .txt file.

To upload an entity blacklist file:

  1. Under Upload the entity blacklist file, click Choose File to navigate to the file in its location and select it.
  2. Click Upload.

Adding Entities to Dynamic Navigation

Once the entities are indexed, you can add the entities to dynamic navigation.

To add entities to dynamic navigation, use the Search > Search Features > Dynamic Navigation page to perform the following steps:

  1. Click Search > Search Features > Dynamic Navigation.
  2. Under Existing Configurations, click Add.
  3. In the Name box, type a name for the new configuration.
  4. Under Attributes, click Add Entity.
  5. In the Display Label box, enter the entity name that you want displayed in the dynamic navigation menu.
  6. From the Attribute Name drop-down menu, select the entity name.
  7. From the Type drop-down menu, select the entity type.
  8. Select options for sorting entities in the dynamic navigation panel.
  9. Click OK.

For information about applying the dynamic navigation configuration to a front end and showing dynamic navigation options in a front end, click Admin Console Help > Search > Search Features > Dynamic Navigation.

Testing Entity Recognition

Testing entity recognition enables you to fine-tune your configuration and correct any mistakes. Also, by testing, you can gain an understanding of your dictionaries and use this understanding to develop the best ones for your corpus.

You can test your entity recognition configuration for a specific document by using the options on the Entity Diagnostics tab. To test your configuration, use a document from one of the following sources:

  • A URL
  • A local file
  • A cached public document

For this document, you can see highlighted entities that the search appliance has extracted from it. If there are issues with extracted entities, you can modify your entity recognition configuration and retest.

You can run tests when entity recognition is enabled or disabled. Any tests that you run do not affect the index or search appliance in any way.

Note that entity diagnostics does not consider the <head> section of a document, so any entities that appear in the <head> section are not reported. For example, if the document title appears in the <head> section, it does not show up in entity diagnostics.

Testing a URL

To test entity recognition using a URL:

  1. Enter the URL of a public document.
  2. Click Go.

Testing a Local File

To test entity recognition using a local file:

  1. Click Choose File and navigate to a file smaller than 1MB.
  2. Click Go.

Testing a Cached Public Document

To test entity recognition using a public document that is cached in the search appliance index:

  1. Click Index > Diagnostics > Index Diagnostics.
  2. Click List Format.
  3. Under All Hosts, click the document that you want to test.
  4. Under More information about this page, locate cached version in the list and click Open in Entity Diagnostics.

Submitting a Test Search

You can also verify that entity recognition is working by submitting a test search for content that you know should match your entity recognition rules. Request the results in XML and add the search parameter &getfields=* to see whether the results contain the new metadata.

The format of the metatags would be:

<MT N="gsaentity_[entity_name]" V="[entity recognized value]" />
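
Once you have the XML results, the recognized entities can be pulled out by looking for metatags whose name starts with gsaentity_. The following Python sketch does this for a simplified, hypothetical results fragment; the sample XML, the URL in it, and the function name recognized_entities are illustrative assumptions, and real results contain many more elements.

```python
import xml.etree.ElementTree as ET

PREFIX = "gsaentity_"


def recognized_entities(results_xml):
    """Return {entity_name: [values]} collected from <MT> elements."""
    entities = {}
    for meta in ET.fromstring(results_xml).iter("MT"):
        name = meta.get("N", "")
        if name.startswith(PREFIX):
            value = meta.get("V", "")
            entities.setdefault(name[len(PREFIX):], []).append(value)
    return entities


# Hypothetical, heavily simplified search results.
SAMPLE = """<GSP><RES><R N="1">
  <U>http://www.mycompany.com/services/services.html</U>
  <MT N="gsaentity_Country" V="Uruguay" />
  <MT N="gsaentity_Country" V="Vanuatu" />
  <MT N="author" V="smith" />
</R></RES></GSP>"""

print(recognized_entities(SAMPLE))
```

Metatags without the gsaentity_ prefix, such as author in the sample, are ordinary document metadata and are skipped.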

Adjusting Parameters

The Adjustments tab enables expert users to fine-tune entity recognition parameters. Google discourages you from making changes on this tab unless necessary.

The following table lists the parameter flags that can be adjusted on this tab.

Flag: Maximum number of entities per document
Default: 50
Comments: Valid values are 1-100.

Flag: Maximum number of words of the longest entity
Default: 20
Comments: Valid values are 1-100.

Flag: Store entities as
Default: As found in text
Comments: Upper- and lower-case variants of the same word can be stored according to one of the following selections:

  • As found in text -- Case is kept as found in the text.
  • Lowercase -- All letters are lowercase.
  • Uppercase -- All letters are uppercase.
  • Capitalize each word -- The first letter of the word is uppercase. In the case of multiple words, the first letter of each word is uppercase. All other letters are lowercase.
  • Capitalize the first word only -- The first letter of the word is uppercase. In the case of multiple words, only the first letter of the first word is uppercase. All other letters are lowercase.

Flag: Punctuation marks when followed by space
Default:

!
?
'
.
;
:
.

Comments: Do not change this flag unless advised to do so by Google Search Appliance support. This flag lists all symbols that entity recognition must treat as punctuation marks when followed by whitespace. Each line must contain a single punctuation mark.

Flag: Punctuation marks
Default: None
Comments: Do not change this flag unless advised to do so by Google Search Appliance support. This flag lists all symbols that entity recognition must treat as punctuation marks any time they appear. Each line must contain a single punctuation mark.
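
The selections for storing upper- and lower-case variants correspond to simple string transformations. The following Python sketch shows one way to picture them; the function store_as and its selection strings are hypothetical, and the sketch assumes that words are separated by single spaces.

```python
def store_as(entity, selection="as found in text"):
    """Apply one of the case selections to an entity value."""
    if selection == "as found in text":
        return entity
    if selection == "lowercase":
        return entity.lower()
    if selection == "uppercase":
        return entity.upper()
    if selection == "capitalize each word":
        # First letter of every word uppercase, all other letters lowercase.
        return " ".join(word.capitalize() for word in entity.split(" "))
    if selection == "capitalize the first word only":
        # First letter of the first word uppercase, everything else lowercase.
        return entity.capitalize()
    raise ValueError(f"unknown selection: {selection!r}")


for choice in ["as found in text", "lowercase", "uppercase",
               "capitalize each word", "capitalize the first word only"]:
    print(choice, "->", store_as("united STATES", choice))
```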

To adjust parameter flags:

  1. Change the value of one or more flags.
  2. Click Update.

For More Information

For more information about entity recognition, see "Creating the Search Experience: Best Practices," which is linked from the Google Search Appliance help center.

 


 
© Google Inc.