US 7580827 B1
Abstract
A semantic locator determines whether input sequences form semantically meaningful units. The semantic locator includes a coherence component that calculates a coherence of the terms in the sequence and a variation component that calculates the variation in terms that surround the sequence. A heuristics component may additionally refine results of the coherence component and the variation component. A decision component may make the determination of whether the sequence is a semantic unit based on the results of the coherence component, variation component, and heuristics component.
Claims (41)
1. A method performed by a server device, of identifying whether a sequence of terms is a semantic unit, the method comprising:
receiving, by a communication interface or an input device of the server device, the sequence of terms in a memory;
calculating, by a processor of the server device, a first value representing a coherence of terms in the sequence;
calculating, by the processor, a second value representing variation of context in which the sequence occurs;
comparing, by the processor, the first value to a first threshold and the second value to a second threshold;
identifying, by the processor, that the sequence is a semantic unit based at least in part on the first value satisfying the first threshold and the second value satisfying the second threshold; and
outputting, by the communication interface or an output device of the server device, an indication that the sequence is a semantic unit based on identifying that the sequence is a semantic unit.
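The comparison and identification steps of claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `Thresholds` and `is_semantic_unit` are hypothetical, and the sketch assumes that "satisfying" a threshold means meeting or exceeding it, which the claim does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the two thresholds named in claim 1."""
    coherence: float  # first threshold, compared against the first value
    variation: float  # second threshold, compared against the second value


def is_semantic_unit(coherence: float, variation: float, thresholds: Thresholds) -> bool:
    """Return True when both values satisfy their thresholds.

    Assumption: "satisfying" is read as "greater than or equal to".
    """
    return coherence >= thresholds.coherence and variation >= thresholds.variation
```

With thresholds of 10.0 and 1.0 (arbitrary illustrative numbers), a sequence scoring 12.0 for coherence and 1.5 for variation would be identified as a semantic unit, while one scoring 12.0 and 0.2 would not.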
2. The method of
3. The method of
4. The method of
where f(A) defines a number of occurrences of term A in the collection of documents, f(˜A) defines a number of occurrences of a term other than term A in the collection of documents, f(B) defines a number of occurrences of term B in the collection of documents, N defines a total number of events in the collection of documents, f(AB) defines a number of times term A is followed by term B in the collection of documents, and f(˜AB) is a number of times a term other than A is followed by term B in the collection of documents, where
where n and k are integers.
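The equation that claim 4 refers to does not survive in this text. The symbols it defines (f(A), f(˜A), f(B), f(AB), f(˜AB), N, and a function of integers n and k) fit the standard log-likelihood ratio test for bigram association, so the sketch below assumes that statistic, with L(k, n, x) = x^k (1 − x)^(n − k). Treat it as one plausible reading, not as the claimed formula; the function names are hypothetical.

```python
import math


def log_binomial_likelihood(k: int, n: int, x: float) -> float:
    """log of L(k, n, x) = x**k * (1 - x)**(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - x)
    return total


def likelihood_ratio(f_a: int, f_b: int, f_ab: int, n_events: int) -> float:
    """-2 * log(lambda) for the hypothesis that B occurs independently of A."""
    f_not_a = n_events - f_a      # f(~A): occurrences of a term other than A
    f_not_a_b = f_b - f_ab        # f(~AB): B preceded by a term other than A
    if not (0 < f_a < n_events and 0 < f_b < n_events):
        raise ValueError("f(A) and f(B) must lie strictly between 0 and N")
    if not (0 <= f_ab <= f_a and 0 <= f_not_a_b <= f_not_a):
        raise ValueError("inconsistent co-occurrence counts")
    p = f_b / n_events            # P(B) under independence
    p1 = f_ab / f_a               # P(B | A)
    p2 = f_not_a_b / f_not_a      # P(B | not A)
    log_lambda = (
        log_binomial_likelihood(f_ab, f_a, p)
        + log_binomial_likelihood(f_not_a_b, f_not_a, p)
        - log_binomial_likelihood(f_ab, f_a, p1)
        - log_binomial_likelihood(f_not_a_b, f_not_a, p2)
    )
    return -2.0 * log_lambda
```

Under this reading the statistic is zero when B follows A exactly as often as chance predicts and grows as the observed counts depart from that expectation in either direction.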
5. The method of
where f(A) defines a number of occurrences of term A in the collection of documents, f(B) defines a number of occurrences of term B in the collection of documents, N defines a total number of events in the collection of documents, and f(AB) defines a number of times term A is followed by term B in the collection of documents.
6. The method of
7. The method of
8. The method of
H(S)=MIN(HL(S),HR(S)),and
where MIN defines a minimum operation, S represents the sequence, HL(S) represents an entropy to the left of the sequence S, HR(S) represents an entropy to the right of the sequence S, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence S, f(Sw) refers to a number of times the sequence S is followed by w in the collection of documents, and f(S) refers to a number of times the sequence S is present in the collection of documents.
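The defining equations for HL(S) and HR(S) are missing from this text. A minimal sketch, assuming each is the Shannon entropy of the neighboring-term distribution, with probabilities f(wS)/f(S) on the left and f(Sw)/f(S) on the right, is shown below. The logarithm base (2) and the function names are assumptions.

```python
import math
from typing import Mapping


def side_entropy(neighbor_counts: Mapping[str, int], f_s: int) -> float:
    """Entropy of the terms on one side of S.

    neighbor_counts maps each term w to f(wS) (left side) or f(Sw) (right side);
    f_s is f(S), the number of occurrences of the sequence.
    """
    entropy = 0.0
    for count in neighbor_counts.values():
        if count > 0:
            p = count / f_s
            entropy -= p * math.log2(p)  # base 2 is an assumption
    return entropy


def context_entropy(left: Mapping[str, int], right: Mapping[str, int], f_s: int) -> float:
    """H(S) = MIN(HL(S), HR(S))."""
    return min(side_entropy(left, f_s), side_entropy(right, f_s))
```

Taking the minimum means a sequence scores high only when its context varies on both sides. A sequence that is always followed by the same term has a right entropy of zero, so H(S) is zero however varied its left context is.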
9. The method of
HM(S)=MIN(HLM(S),HRM(S)),where MIN defines a minimum operation, S represents the sequence, HLM(S) is defined as a minimum of
for each term w in the collection of documents, HRM(S) is defined as a minimum of
for each term w in the collection of documents, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence, f(Sw) refers to a number of times the sequence is followed by w in the collection of documents, and f(S) refers to a number of times the sequence S is present in the collection of documents.
10. The method of
HC(S)=MIN(HLC(S),HRC(S)),where MIN defines a minimum operation, S represents the sequence, HLC(S) is defined as
and HRC(S) is defined as
where δ(X) is defined as one if a sequence X occurs in the collection of documents and zero otherwise, where wS refers to word w followed by the sequence S, and where Sw refers to the sequence S followed by the word w.
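Because δ(X) is an occurrence indicator, one natural reading of the missing HLC(S) and HRC(S) equations is a sum of δ(wS), respectively δ(Sw), over all terms w, that is, the number of distinct terms seen on each side of S. The sketch below assumes that reading; the function name is hypothetical.

```python
from typing import Mapping


def distinct_context_count(left: Mapping[str, int], right: Mapping[str, int]) -> int:
    """HC(S) = MIN(HLC(S), HRC(S)) under the assumed counting interpretation.

    left maps each term w to f(wS); right maps each term w to f(Sw).
    A term contributes 1 (delta = 1) only if its count is positive.
    """
    hlc = sum(1 for count in left.values() if count > 0)
    hrc = sum(1 for count in right.values() if count > 0)
    return min(hlc, hrc)
```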
11. The method of
HP(S)=MIN(HLP(S),HRP(S))where MIN defines a minimum operation, S represents the sequence, HLP(S) is defined as the number of continuations to the left of the sequence that cover a predetermined percentage of all cases in the collection of documents and HRP(S) is defined as the number of continuations to the right of the sequence that cover the predetermined percentage of all cases in the collection of documents.
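The coverage measure of claim 11 can be sketched as below. The sketch assumes continuations are taken in order of decreasing frequency, so that the count is the smallest number of continuations reaching the predetermined percentage, and that "all cases" means all observed continuations on that side. Neither assumption is stated in the claim, and the function names are hypothetical.

```python
from typing import Mapping


def coverage_count(counts: Mapping[str, int], fraction: float) -> int:
    """Smallest number of most-frequent continuations covering `fraction` of cases."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    ranked = sorted(counts.values(), reverse=True)
    for taken, count in enumerate(ranked, start=1):
        covered += count
        if covered >= fraction * total:
            return taken
    return len(ranked)


def coverage_variation(left: Mapping[str, int], right: Mapping[str, int], fraction: float) -> int:
    """HP(S) = MIN(HLP(S), HRP(S))."""
    return min(coverage_count(left, fraction), coverage_count(right, fraction))
```

For example, with left counts of 5, 3, and 2 and a fraction of 0.75, the two most frequent continuations cover 8 of 10 cases, so HLP(S) is 2.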
12. The method of
13. The method of
applying one or more rules to the sequence, and
where identifying that the sequence is a semantic unit is further based at least in part on the application of the one or more rules.
14. A device comprising:
a memory to store instructions; and
a processor to execute the instructions to implement:
a receiving component to receive a sequence of terms;
a coherence component to calculate a coherence of multiple terms in the sequence of terms;
a variation component to calculate a variation of context terms in a collection of documents in which the sequence occurs, where the variation of context terms is calculated as a measure of entropy of the context of the sequence; and
a decision component to determine whether the sequence constitutes a semantic unit based at least in part on results of the coherence component and the variation component, and output an indication of whether the sequence constitutes a semantic unit for use in a processor.
15. The device of
16. The device of
17. The device of
18. The device of
H(S)=MIN(HL(S),HR(S)),and
where MIN defines a minimum operation, S represents the sequence, HL(S) represents an entropy to the left of the sequence S, HR(S) represents an entropy to the right of the sequence S, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence, f(Sw) refers to a number of times the sequence S is followed by w in the collection of documents, and f(S) refers to a number of times the sequence S is present in the collection of documents.
19. The device of
HM(S)=MIN(HLM(S),HRM(S)),where MIN defines a minimum operation, S represents the sequence, HLM(S) is defined as a minimum of
for each term w in the collection of documents, HRM(S) is defined as a minimum of
for each term w in the collection of documents, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence S, f(Sw) refers to a number of times the sequence S is followed by w in the collection of documents, and f(S) refers to a number of times the sequence S is present in the collection of documents.
20. The device of
HC(S)=MIN(HLC(S),HRC(S)),where MIN defines a minimum operation, S represents the sequence, HLC(S) is defined as
and HRC(S) is defined as
where δ(X) is defined as one if sequence X occurs in the document collection and zero otherwise, where wS refers to word w followed by the sequence S, and where Sw refers to the sequence S followed by the word w.
21. The device of
HP(S)=MIN(HLP(S),HRP(S))where MIN defines a minimum operation, S represents the sequence, HLP(S) is defined as the number of continuations to the left of the sequence that cover a predetermined percentage of all cases in the collection of documents and HRP(S) is defined as the number of continuations to the right of the sequence that cover the predetermined percentage of all cases in the collection of documents.
22. The device of
23. The device of
a heuristics component to apply one or more predefined rules to the sequence, where the decision component is further to determine whether the sequence constitutes a semantic unit based at least in part on application of the one or more rules.
24. The device of
25. A device comprising:
a memory to store instructions; and
a processor to execute the instructions to implement:
means for receiving a sequence of terms;
means for calculating a first value representing a coherence of terms in the sequence of terms;
means for calculating a second value representing variation of context in which the sequence occurs;
means for comparing the first value to a first threshold and the second value to a second threshold;
means for identifying that the sequence is a semantic unit based at least in part on the first value satisfying the first threshold and second value satisfying the second threshold; and
means for outputting an indication that the sequence is a semantic unit based on identifying that the sequence is a semantic unit.
26. The system of
27. The system of
28. A computer-readable memory device that includes programming instructions to control at least one processor, the computer-readable memory device comprising:
instructions for calculating a first value representing a coherence of terms in a sequence of terms;
instructions for calculating a second value representing variation of context in which the sequence occurs, where the variation of context in which the sequence occurs is calculated as a measure of entropy of the context of the sequence;
instructions for identifying that the sequence is a semantic unit based on the first and second values; and
instructions for outputting an indication that the sequence is a semantic unit.
29. The computer-readable memory device of
30. The computer-readable memory device of
31. The computer-readable memory device of
where f(A) defines a number of occurrences of term A in the collection of documents, f(˜A) defines a number of occurrences of a term other than term A in the collection of documents, f(B) defines a number of occurrences of term B in the collection of documents, N defines a total number of events in the collection of documents, f(AB) defines a number of times term A is followed by term B in the collection of documents, and f(˜AB) is a number of times a term other than A is followed by term B in the collection of documents, where
where n and k are integers.
32. The computer-readable memory device of
33. The computer-readable memory device of
where f(A) defines a number of occurrences of term A in the collection of documents, f(B) defines a number of occurrences of term B in the collection of documents, N defines a total number of events in the collection of documents, and f(AB) defines a number of times term A is followed by term B in the collection of documents.
34. The computer-readable memory device of
35. The computer-readable memory device of
H(S)=MIN(HL(S),HR(S)),and
where MIN defines a minimum operation, S represents the sequence, HL(S) represents an entropy to the left of the sequence S, HR(S) represents an entropy to the right of the sequence S, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence, f(Sw) refers to a number of times the sequence is followed by w in the collection of documents, and f(S) refers to a number of times the sequence S is present in the collection of documents.
36. The computer-readable memory device of
HM(S)=MIN(HLM(S),HRM(S)),where MIN defines a minimum operation, S represents the sequence, HLM(S) is defined as a minimum of
for each term w in the collection of documents, HRM(S) is defined as a minimum of
for each term w in the collection of documents, f(wS) defines a number of times a particular term, w, appears in the collection of documents followed by the sequence, f(Sw) refers to a number of times the sequence is followed by w in the collection of documents, and f(S) refers to a number of times the sequence is present in the collection of documents.
37. The computer-readable memory device of
HC(S)=MIN(HLC(S),HRC(S)),where MIN defines a minimum operation, S represents the sequence, HLC(S) is defined as
and HRC(S) is defined as
where δ(X) is defined as one if sequence X occurs in the collection of documents and zero otherwise, where wS refers to word w followed by the sequence S, and where Sw refers to the sequence S followed by the word w.
38. The computer-readable memory device of
HP(S)=MIN(HLP(S),HRP(S))where MIN defines a minimum operation, S represents the sequence, HLP(S) is defined as the number of continuations to the left of the sequence that cover a predetermined percentage of all cases in the collection of documents and HRP(S) is defined as the number of continuations to the right of the sequence that cover the predetermined percentage of all cases in the collection of documents.
39. The computer-readable memory device of
40. The computer-readable memory device of
41. The computer-readable memory device of
instructions for applying one or more rules to the sequence, and
where the instructions for identifying that the sequence is a semantic unit are further based at least in part on the application of the one or more rules.
Description
A. Field of the Invention
The present invention relates generally to information processing and, more particularly, to identifying multi-word text sequences that are semantically meaningful.
B. Description of Related Art
In some text processing applications, it can be advantageous to process multiple words in a sequence as a single semantically meaningful unit. For example, the author of the phrase “Labrador retriever” intends to refer to a specific type of dog. If this phrase was present in a search query, such as a search query input to an Internet search engine, it may be desirable to process the phrase as a single semantic unit rather than as the two separate words “Labrador” and “retriever.”
Applications other than search engines may benefit from knowledge of semantic units. Named entity learning, segmentation in languages that do not separate words with spaces (e.g., Japanese and Chinese), and article summarization, for example, are some applications that may use semantic units.
Thus, there is a need in the art to be able to automatically recognize semantic units from within one or more textual documents.
Consistent with aspects of the invention, multi-word text sequences are classified as semantic units based on the coherence of terms in the sequence and based on variation in the context of the sequence.
One aspect of the invention is directed to a method of identifying whether a sequence is a semantic unit. The method includes calculating a first value representing a coherence of terms in the sequence, calculating a second value representing variation of context in which the sequence occurs, and determining whether the sequence is a semantic unit based at least in part on the first and second values.
Another aspect of the invention is directed to a device that includes a coherence component, a variation component, and a decision component. The coherence component calculates a coherence of multiple terms in a sequence of terms.
The variation component calculates a variation of context terms in a collection of documents in which the sequence occurs. The decision component determines whether the sequence constitutes a semantic unit based at least in part on results of the coherence component and the variation component.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, explain the invention. In the drawings,
The following detailed description of the invention refers to the accompanying drawings. The detailed description does not limit the invention.
A semantic locator is described herein that identifies word sequences that form semantically meaningful units. The operation of the semantic locator is based on one or more factors calculated by comparing the terms in a candidate sequence to a collection of documents. In particular, the factors may include the coherence of the words in the sequence and the variation of the context surrounding the sequence.
Clients
In an implementation consistent with the principles of the invention, server
A document, as the term is used herein, is to be broadly interpreted to include any machine-readable and machine-storable work product. A document may be an e-mail, a news article, a file, a combination of files, one or more files with embedded links to other files, a news group posting, etc. In the context of the Internet, a common document is a web page. Web pages often include textual information and may include embedded information (such as meta information, images, hyperlinks, etc.) and/or embedded instructions (such as Javascript, etc.).
Processor
Input device(s)
As will be described in detail below, server
The software instructions defining semantic locator
The software instructions contained in memory 125
Semantic locator
Coherence component
From equations (1) and (2), coherence component
Coherence component
There may be some sequences in which the LR value is relatively high, but nevertheless have a poor coherence. To filter out these sequences, coherence component
Variation component
As an example of the application of equations (3)-(5) by variation component
In one alternate implementation, variation component
Other techniques, in addition to the two discussed above, may be used to measure variation. In another possible implementation, variation component
Variation component
Heuristics component
Decision component
Based on the result from decision component
As described above, a semantic locator performs operations consistent with aspects of the invention to determine whether a sequence of terms forms a semantically meaningful unit. The determination may be based on one or more of the coherence of the terms in the sequence and the variation of context surrounding the sequence. Additionally, in some implementations, heuristics may be applied to further refine the determination of a semantic unit.
It will be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures. The actual software code or specialized control hardware used to implement aspects consistent with the present invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code, it being understood that a person of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.
The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed.
Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. For example, although many of the operations described above were described in a particular order, many of the operations are amenable to being performed simultaneously or in different orders to still achieve the same or equivalent results.
No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used.