Help Center
Crawl and Index > Entity Recognition

Use the Crawl and Index > Entity Recognition page to perform the following tasks:
About Entity Recognition

Entity recognition enables the Google Search Appliance to discover interesting entities in documents with missing or poor metadata and store these entities in the search index. For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata.

After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index.

Once the entities are indexed, you can enhance keyword search by adding the entities to dynamic navigation, which uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by using specific attributes. For information about this topic, see "Adding Entities to Dynamic Navigation."

Additionally, by default entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain text URLs that appear in documents. As a result, you can match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For more information about this topic, see "Use Case: Matching URLs for Dynamic Navigation," in Administering Crawl: Advanced Topics.

The Crawl and Index > Entity Recognition page enables you to specify the entities that you want the search appliance to discover in your documents. However, before you can specify entities on this page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries for terms are required for entity recognition.
Dictionaries enable entity recognition to annotate entities, that is, to discover specific entities in the content and annotate them as entities. Generally, with dictionaries, you define an entity with lists of terms and regular expressions.

Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but composite entities enable the search appliance to discover more complex terms. In a composite entity, you can define an entity with a sequence of terms. Because composite entities run on annotated terms, every word in a sequence must already be tagged with an entity, so composite entities depend on dictionaries.

The search appliance provides sample dictionaries and composite entities, as shown on the Crawl and Index > Entity Recognition page.

If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.

Google recommends that you perform the tasks for setting up entity recognition in the following order:
Entity recognition runs only on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents already in the index, force the search appliance to recrawl URL patterns by using the Status and Reports > Crawl Diagnostics page.

Entity recognition works for languages that are read left to right. It does not work for languages that are read right to left.

Dictionaries of Terms and Regular Expressions

You must define each entity by at least one dictionary, which contains one or more terms. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on.

After you create a dictionary, you can upload it to the search appliance as described in "Adding New Entities or Updating Existing Entities." In addition to adding new entities or updating existing entities, you can perform the following tasks:

Entity recognition accepts dictionaries in either of the following two formats:

Dictionaries in TXT Format

A dictionary in .txt format contains an entity term on each line. Terms can be formed by several words separated by spaces. The following example shows an excerpt from the .txt file for the entity "Country."

    ...
    United Arab Emirates
    United Kingdom
    United States
    United States of America
    Uruguay
    Vanuatu
    Vatican
    ...

Dictionaries in XML Format

XML format enables a richer definition of an entity than .txt format. In particular, XML dictionaries enable you to define synonyms and regular expressions. The following code shows the XML schema for a dictionary.

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Entity name </name>
        <term> Term 1 for the entity </term>
        <term> Term 2 for the entity </term>
        ...
        <pattern> Regular expression 1 </pattern>
        <pattern> Regular expression 2 </pattern>
        ...
        <store_term_or_name> term/name </store_term_or_name>
        <store_regex_or_name>name/regex</store_regex_or_name>
      </instance>
      <instance>
        ...
      </instance>
    </instances>

Each instance can contain the tags described in the following table. Each instance must contain at least one
The following example shows the dictionary for the entity "Water."

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Water </name>
        <term> water </term>
        <term> H2O </term>
      </instance>
    </instances>

The following example shows an internal project defined as a regular expression.

    <?xml version="1.0"?>
    <instances>
      <instance>
        <name> Internal project </name>
        <pattern> P[A-Z][0-9]*/{0,1}[0-9]* </pattern>
      </instance>
    </instances>

Composite Entities that Run on Annotated Terms

Optionally, you can define a composite entity, which enables you to define each entity as a sequence of terms. That is, a composite entity takes a sequence of annotated words, rather than the words themselves, as input. Every word in the sequence must already have been annotated with an entity. Also, all the words must appear within the same line of a document, in the order defined in the composite entities file. Then, if the sequence of words matches the composite entity, the full sequence is identified as an instance of the entity represented by the composite entity.

For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middle names, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.

For information about adding a new composite entity to the search appliance, see "Adding New Composite Entities." All new composite entities are appended to the Composite Entities Definition file.

In addition to adding a new composite entity, you can also perform the following tasks:
LL1 Grammars

A composite entity is written as an LL1 grammar. Entity recognition accepts LL1 grammars that take the following format:

    {Production Name} ::= {Production Name 2} [Terminal Name 1]...

The restrictions listed in the following table apply to an LL1 grammar.
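As a rough illustration of the production format above, the following Python sketch splits one production line into its left-hand production name and its right-hand symbols. The function name parse_production and the (kind, name) tuple layout are hypothetical, chosen only for this example. The sketch does not enforce the restrictions that the search appliance applies to LL1 grammars, and it is not the appliance's own parser.

```python
import re

# A symbol is either a {production name} or a [terminal name].
_SYMBOL = re.compile(r"\{([^{}]+)\}|\[([^\[\]]+)\]")


def parse_production(line):
    """Split '{Head} ::= symbols...' into (head, [(kind, name), ...]).

    Hypothetical helper for illustration; raises ValueError on lines
    that do not follow the documented production format.
    """
    left, sep, right = line.partition("::=")
    if not sep:
        raise ValueError("missing '::=' in production: %r" % line)

    # The left side must be exactly one {production name}.
    head = _SYMBOL.fullmatch(left.strip())
    if head is None or head.group(1) is None:
        raise ValueError("left side must be one {production}: %r" % line)

    right = right.strip()
    body = []
    pos = 0
    for match in _SYMBOL.finditer(right):
        # Anything between symbols other than whitespace is an error.
        if right[pos:match.start()].strip():
            raise ValueError("unexpected text in production: %r" % line)
        if match.group(1) is not None:
            body.append(("production", match.group(1).strip()))
        else:
            body.append(("terminal", match.group(2).strip()))
        pos = match.end()
    if right[pos:].strip():
        raise ValueError("unexpected trailing text in production: %r" % line)

    return head.group(1).strip(), body
```

A check like this can catch an unbalanced brace or a stray word in a grammar file before you upload it, but whether the appliance reports such errors in the same way is not covered here.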
FullName Composite Entity Example

The following example shows the FullName grammar.

    {FullName}::= {Title set}{Name set}{Middlenames}{Surname set}
    {Title set}::=[Title] {Title set}
    {Title set} ::= [epsilon]
    {Name set} ::= [Name] {Name set2}
    {Name set2} ::= [Name] {Name set2}
    {Name set2} ::= [epsilon]
    {Middlenames} ::= [Middlename]
    {Middlenames} ::= [epsilon]
    {Surname set} ::= [Surname] {Surname set2}
    {Surname set2} ::= [Surname] {Surname set2}
    {Surname set2} ::= [epsilon]

This grammar accepts sequences of annotated words that contain, in the following order: zero or more titles, one or more names, at most one middle name, and one or more surnames.
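To make the accepted sequences concrete, here is a minimal Python sketch that checks whether a list of entity tags (one tag per annotated word on a line) is accepted by the FullName grammar. The function name matches_full_name and the representation of annotations as a plain list of tag strings are assumptions made for this example. The sketch also assumes that the whole list must match, and it says nothing about how the search appliance itself scans a line for matches.

```python
def matches_full_name(tags):
    """Return True if the tag sequence is accepted by the FullName grammar.

    tags is a list such as ["Title", "Name", "Surname"]: the entity
    assigned to each annotated word, in document order (hypothetical
    representation chosen for this sketch).
    """
    pos = 0

    def take_while(tag):
        # Consume consecutive occurrences of one terminal; return the count.
        nonlocal pos
        count = 0
        while pos < len(tags) and tags[pos] == tag:
            pos += 1
            count += 1
        return count

    take_while("Title")                # {Title set}: zero or more [Title]
    if take_while("Name") < 1:         # {Name set}: one or more [Name]
        return False
    if pos < len(tags) and tags[pos] == "Middlename":
        pos += 1                       # {Middlenames}: at most one [Middlename]
    if take_while("Surname") < 1:      # {Surname set}: one or more [Surname]
        return False
    return pos == len(tags)            # no unmatched tags may remain
```

For instance, the tags Title, Name, Surname are accepted, while Title followed directly by Surname is rejected because the grammar requires at least one name.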
Entity Blacklist File

The entity blacklist is a text file that contains terms not to store in the index. If the search appliance discovers an entity that is present in the blacklist when indexing new documents, it discards the entity. By default, the entity blacklist file is empty.

To add terms to the entity blacklist, create a .txt file in which each line contains a term that represents a blacklisted entity. A term can include any number of words, but cannot contain regular expressions. After you create the entity blacklist file, you can upload it to the search appliance. You can also download the entity blacklist file.

The entity blacklist is applied to documents that are indexed after you upload the file. Documents already in the index are not affected. To apply the entity blacklist to documents already in the index, force the search appliance to recrawl URL patterns by using the Status and Reports > Crawl Diagnostics page.

Enabling and Disabling Entity Recognition

Entity recognition is disabled by default. To enable it, click Enable. To disable it, click Disable.

Adding New Entities or Updating Existing Entities

Entities defined by dictionaries have the following two flags:
To add a new entity defined by a dictionary:
Deleting Entities

To delete an entity:

Downloading Dictionaries

To download the dictionary file for an entity:

Editing Dictionaries

If you want to make changes to a dictionary for an entity, follow these steps:

Adding New Composite Entities

When you add a new composite entity, it gets appended to the Composite Entities Definition file. To add a new composite entity:
Updating Composite Entities

When you upload a composite entity file, the input file must be in the following format:

The following example shows the valid format for a composite entity file:

    grammars {

To update a composite entity:
You can also update composite entities by following this procedure:
Deleting Composite Entities

To delete a composite entity:

Downloading the Composite Entities Definition File

The Composite Entities Definition file contains all the user-defined composite entities. To download the Composite Entities Definition file:

Uploading a Composite Entities Definition File

Uploading a new or updated Composite Entities Definition file overwrites the current one. To upload a new Composite Entities Definition file:

Downloading the Entity Blacklist File

To download the entity blacklist file:

Uploading the Entity Blacklist File

Uploading a new or updated entity blacklist file overwrites the current one. The entity blacklist file must be a .txt file. To upload an entity blacklist file:

Adding Entities to Dynamic Navigation

Once the entities are indexed, you can add the entities to dynamic navigation. To add entities to dynamic navigation, use the Serving > Dynamic Navigation page to perform the following steps:
For information about applying the dynamic navigation configuration to a front end and showing dynamic navigation options in a front end, click Help Center > Serving > Dynamic Navigation.

For More Information

For more information about entity recognition, see "Creating the Search Experience: Best Practices," which is linked to the Google Search Appliance help center.
© Google Inc.