Admin Console Help
Index > Entity Recognition

Use the tabs on the Index > Entity Recognition page to perform the tasks listed in the following table.
About Entity Recognition

Entity recognition enables the Google Search Appliance to discover interesting entities in documents with missing or poor metadata and store these entities in the search index. For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata. The search appliance can extract entities from the content of documents and from the metadata associated with a document.

After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index.

Once the entities are indexed, you can enhance keyword search by adding the entities to dynamic navigation, which uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by specific attributes. For information about this topic, see "Adding Entities to Dynamic Navigation."

Additionally, by default entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain text URLs that appear in documents. You can therefore match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For more information about this topic, see "Use Case: Matching URLs for Dynamic Navigation," in Administering Crawl: Advanced Topics.

The Index > Entity Recognition page enables you to specify the simple entities that you want the search appliance to discover in your documents. However, before you can specify entities on this page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries for terms are required for entity recognition.
Dictionaries enable entity recognition to annotate entities, that is, to discover specific entities in the content and annotate them as entities. Generally, with dictionaries, you define a simple entity with lists of terms and regular expressions.

Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but composite entities enable the search appliance to discover more complex terms. In a composite entity, you can define an entity with a sequence of terms. Because composite entities run on annotated terms, every word in a sequence must already be tagged with an entity, so composite entities depend on dictionaries.

The search appliance provides sample dictionaries and composite entities, as shown on the Index > Entity Recognition page.

If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.

You can test your entity recognition configuration and fine-tune it, enhancing it and correcting any mistakes.

Google recommends that you perform the tasks for setting up entity recognition in the following order:
Entity recognition only runs on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.

Entity recognition works for languages that are read left to right. It does not work for languages that are read right to left.

Dictionaries of Terms and Regular Expressions

You must define each entity by at least one dictionary, which contains one or more terms. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on.

After you create a dictionary, you can upload it to the search appliance as described in "Adding New Entities or Updating Existing Entities." In addition to adding new entities or updating existing entities, you can delete entities, download dictionaries, and edit dictionaries, as described in the sections of the same names.

Entity recognition accepts dictionaries in either of two formats, TXT or XML, which are described in the following sections.

Dictionaries in TXT Format

A dictionary in .txt format contains an entity term on each line. Terms can be formed by several words separated by spaces. The following example shows an excerpt from the .txt file for the entity "Country."

    ...
    United Arab Emirates
    United Kingdom
    United States
    United States of America
    Uruguay
    Vanuatu
    Vatican
    ...

Dictionaries in XML Format

XML format enables a richer definition of an entity than .txt format. In particular, XML dictionaries enable you to define synonyms and regular expressions. The following code shows the XML schema for a dictionary.

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Entity name </name>
        <term> Term 1 for the entity </term>
        <term> Term 2 for the entity </term>
        ...
        <pattern> Regular expression 1 </pattern>
        <pattern> Regular expression 2 </pattern>
        ...
        <store_term_or_name> term/name </store_term_or_name>
        <store_regex_or_name>name/regex/regex_tagged_as_first_group</store_regex_or_name>
      </instance>
      <instance>
        ...
      </instance>
    </instances>

Each instance can contain the tags described in the following table. Each instance must contain at least one
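To make the structure of an XML dictionary concrete, the following is a minimal sketch that reads a dictionary with Python's standard xml.etree.ElementTree module and lists the name, terms, and patterns of each instance. It is not part of the search appliance. The function name read_dictionary is hypothetical, and trimming the whitespace around tag values is an assumption made for readability, not documented appliance behavior. The sample data is the "Water" dictionary used as an example in this document.

```python
import xml.etree.ElementTree as ET

# Sample dictionary in the XML format described in this section.
DICTIONARY_XML = """<?xml version="1.0"?>
<instances>
  <instance>
    <name> Water </name>
    <term> water </term>
    <term> H20 </term>
  </instance>
</instances>"""


def read_dictionary(xml_text):
    """Return one dict per <instance> with its name, terms, and patterns.

    Assumption: whitespace around tag values is not significant, so it
    is stripped here. The appliance itself may treat it differently.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for instance in root.findall("instance"):
        entries.append({
            "name": (instance.findtext("name") or "").strip(),
            "terms": [(t.text or "").strip() for t in instance.findall("term")],
            "patterns": [(p.text or "").strip() for p in instance.findall("pattern")],
        })
    return entries


entries = read_dictionary(DICTIONARY_XML)
for entry in entries:
    print(entry["name"], entry["terms"], entry["patterns"])
```

A check like this can catch a malformed file (for example, an unclosed tag raises xml.etree.ElementTree.ParseError) before you upload the dictionary.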
The following example shows the dictionary for the simple entity "Water."

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Water </name>
        <term> water </term>
        <term> H20 </term>
      </instance>
    </instances>

Both

The following example shows an internal project defined as a regular expression.

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Internal project </name>
        <pattern> P[A-Z][0-9]*/{0,1}[0-9]* </pattern>
      </instance>
    </instances>

Because

Entity recognition for metadata processes the following terms, splitting the values into terms according to the separator configuration:

    "meta_name=value1 value2 ..."
    "meta_name"
    "value1"
    "value2"
    ".."

The following example shows a dictionary for metadata in the format <meta name="author" value="smith and other text"> or <meta name="creator" value="jones and other text">.

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> writer </name>
        <pattern> author=(.*)</pattern>
        <pattern> creator=(.*)</pattern>
        <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
      </instance>
    </instances>

The value matched by the group (.*) becomes the value of the recognized entity.

Composite Entities that Run on Annotated Terms

Optionally, you can define a composite entity, which enables you to define each entity as a sequence of terms. That is, a composite entity takes a sequence of words, rather than the words themselves, as input. All the words of the sequence must already have been annotated with an entity. Also, all the words must appear within the same line of a document, in the order defined in the composite entities file. If the sequence of words then matches the composite entity, the full sequence is identified as an instance of the entity represented by the composite entity.

For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middle names, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.

For information about adding a new composite entity to the search appliance, see "Adding New Composite Entities." All new composite entities are appended to the Composite Entities Definition file. In addition to adding a new composite entity, you can update and delete composite entities, and download or upload the Composite Entities Definition file, as described in the corresponding sections.
LL1 Grammars

A composite entity is written as an LL1 grammar. Entity recognition accepts LL1 grammars that take the following format:

    {Production Name} ::= {Production Name 2} [Terminal Name 1]...

The restrictions listed in the following table apply to an LL1 grammar.
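The production format can be read mechanically: a name in braces is a production, and a name in square brackets is a terminal. The following minimal sketch splits one production line into its head and body symbols. It is an illustration only, not the appliance's parser. The function name parse_production and the regular expression are hypothetical, and the sketch does not check the restrictions that the appliance applies to a grammar.

```python
import re

# Matches either a {production name} or a [terminal name].
SYMBOL = re.compile(r"\{([^{}]+)\}|\[([^\[\]]+)\]")


def parse_production(line):
    """Split '{Head} ::= symbols...' into (head, [(kind, name), ...])."""
    left, sep, right = line.partition("::=")
    if not sep:
        raise ValueError("missing '::=' in production: %r" % line)
    head = SYMBOL.fullmatch(left.strip())
    if head is None or head.group(1) is None:
        raise ValueError("left-hand side must be a {production name}: %r" % line)
    body = []
    # Text on the right-hand side that is not a symbol is ignored here.
    for match in SYMBOL.finditer(right):
        if match.group(1) is not None:
            body.append(("production", match.group(1).strip()))
        else:
            body.append(("terminal", match.group(2).strip()))
    return head.group(1).strip(), body


for text in ("{Name set} ::= [Name] {Name set2}", "{Name set2} ::= [epsilon]"):
    print(parse_production(text))
```

A line whose left-hand side is missing a closing brace, such as "{Title set ::= [epsilon]", raises ValueError in this sketch, which makes that kind of typo easy to spot before the grammar is added.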
FullName Composite Entity Example

The following example shows the FullName grammar.

    {FullName} ::= {Title set} {Name set} {Middlenames} {Surname set}
    {Title set} ::= [Title] {Title set}
    {Title set} ::= [epsilon]
    {Name set} ::= [Name] {Name set2}
    {Name set2} ::= [Name] {Name set2}
    {Name set2} ::= [epsilon]
    {Middlenames} ::= [Middlename]
    {Middlenames} ::= [epsilon]
    {Surname set} ::= [Surname] {Surname set2}
    {Surname set2} ::= [Surname] {Surname set2}
    {Surname set2} ::= [epsilon]

This grammar accepts sequences of annotated words that contain, in the following order: zero or more titles, one or more names, an optional middle name, and one or more surnames.
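To see which tag sequences the FullName grammar accepts, the following minimal sketch encodes the productions as a Python dictionary and checks a sequence of entity tags against them. It is a simple backtracking top-down recognizer written for illustration, not a table-driven LL1 parser and not the appliance's implementation. The names GRAMMAR, derives, and is_full_name are hypothetical.

```python
EPSILON = "epsilon"

# FullName grammar: each production name maps to its alternative bodies.
# Any symbol that is not a key of GRAMMAR is an entity tag (a terminal).
GRAMMAR = {
    "FullName": [["Title set", "Name set", "Middlenames", "Surname set"]],
    "Title set": [["Title", "Title set"], [EPSILON]],
    "Name set": [["Name", "Name set2"]],
    "Name set2": [["Name", "Name set2"], [EPSILON]],
    "Middlenames": [["Middlename"], [EPSILON]],
    "Surname set": [["Surname", "Surname set2"]],
    "Surname set2": [["Surname", "Surname set2"], [EPSILON]],
}


def derives(symbols, tags):
    """Return True if the symbols can produce exactly the given tags."""
    if not symbols:
        return not tags
    head, rest = symbols[0], symbols[1:]
    if head == EPSILON:
        # The empty production consumes no input.
        return derives(rest, tags)
    if head in GRAMMAR:
        # Try each alternative body for this production in turn.
        return any(derives(body + rest, tags) for body in GRAMMAR[head])
    # Terminal: the next tag must be this entity.
    return bool(tags) and tags[0] == head and derives(rest, tags[1:])


def is_full_name(tags):
    """Check a sequence of entity tags against the FullName grammar."""
    return derives(["FullName"], list(tags))


print(is_full_name(["Title", "Name", "Middlename", "Surname"]))
print(is_full_name(["Title", "Surname"]))
```

The input here is the sequence of entity tags that the dictionaries assigned to consecutive words, not the words themselves, which mirrors how composite entities run on annotated terms.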
Entity Blacklist File

The entity blacklist is a text file that contains terms not to store in the index. If the search appliance discovers an entity that is present in the blacklist when indexing new documents, it discards the entity. By default, the entity blacklist file is empty.

To add terms to the entity blacklist, create a .txt file in which each line contains a term that represents a blacklisted entity. A term can include any number of words, but cannot contain regular expressions.

After you create the entity blacklist file, you can upload it to the search appliance. You can also download the entity blacklist file.

The entity blacklist is applied to documents that are indexed after you upload the file. Documents already in the index are not affected. To apply the entity blacklist to documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.

Enabling and Disabling Entity Recognition

Entity recognition is disabled by default. To enable it, click Enable. To disable it, click Disable.

Adding New Simple Entities or Updating Existing Entities

Simple entities defined by dictionaries have the following two flags:
To add a new simple entity defined by a dictionary:
Deleting Entities

To delete an entity:

Downloading Dictionaries

To download the dictionary file for an entity:

Editing Dictionaries

If you want to make changes to a dictionary for an entity, follow these steps:

Adding New Composite Entities

When you add a new composite entity, it is appended to the Composite Entities Definition file. To add a new composite entity:

Updating Composite Entities

To update a composite entity:
You can also update composite entities by following this procedure:
Deleting Composite Entities

To delete a composite entity:

Downloading the Composite Entities Definition File

The Composite Entities Definition file contains all the user-defined composite entities. To download the Composite Entities Definition file:

Uploading a Composite Entities Definition File

When you upload a composite entity file, the input file must be in the following format:

The following example shows the valid format for a composite entity file:

    grammars {

Uploading a new or updated Composite Entities Definition file overwrites the current one. To upload a new Composite Entities Definition file:
Downloading the Entity Blacklist File

To download the entity blacklist file:

Deleting the Entity Blacklist File

To delete the entity blacklist file, click Delete.

Uploading the Entity Blacklist File

Uploading a new or updated entity blacklist file overwrites the current one. The entity blacklist file must be a .txt file. To upload an entity blacklist file:

Adding Entities to Dynamic Navigation

Once the entities are indexed, you can add the entities to dynamic navigation. To add entities to dynamic navigation, use the Search > Search Features > Dynamic Navigation page to perform the following steps:
For information about applying the dynamic navigation configuration to a front end and showing dynamic navigation options in a front end, click Admin Console Help > Search > Search Features > Dynamic Navigation.

Testing Entity Recognition

Testing entity recognition enables you to fine-tune your configuration and correct any mistakes. Also, by testing, you can gain an understanding of your dictionaries and use this understanding to develop the best ones for your corpus.

You can test your entity recognition configuration for a specific document by using the options on the Entity Diagnostics tab. To test your configuration, use a document from one of the following sources: a URL, a local file, or a public document that is cached in the search appliance index.

For this document, you can see highlighted entities that the search appliance has extracted from it. If there are issues with extracted entities, you can modify your entity recognition configuration and retest.

You can run tests when entity recognition is enabled or disabled. Any tests that you run do not affect the index or search appliance in any way.

Take note that entity diagnostics does not consider the

Testing a URL

To test entity recognition using a URL:
Testing a Local File

To test entity recognition using a local file:

Testing a Cached Public Document

To test entity recognition using a public document that is cached in the search appliance index:

Submitting a Test Search

You can also verify that entity recognition is working by submitting a test search for some content that you know should match entity recognition rules. Request results in XML and use the search parameter

The format of the metatags would be:

Adjusting Parameters

The Adjustments tab enables expert users to fine-tune entity recognition parameters. Google discourages you from making changes on this tab unless necessary. The following table lists the parameter flags that can be adjusted on this tab.
To adjust parameter flags:
For More Information

For more information about entity recognition, see "Creating the Search Experience: Best Practices," which is linked to the Google Search Appliance help center.
© Google Inc.