US 20010013035 A1 Abstract A system and method are provided for answering queries concerning information stored in a set of collections. Each collection includes a structured entity, and each structured entity includes a field. A query is received that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field. The probability that the first field matches the second field is determined automatically based upon the contents of the fields. A collection of lists is generated in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query.
Claims(25) 1. A method for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising the steps of:
a. receiving a query that specifies
i. a subset of the set of collections;
ii. a logical constraint between fields that includes a requirement that a first field match a second field;
b. automatically determining the probability that the first field matches the second field based upon the contents of the fields; and
c. generating a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query.
2. The method of claim 1
3. The method of claim 1
4. The method of claim 1
5. The method of claim 4
6. The method of claim 4
7. The method of claim 4
8. The method of claim 7
9. The method of claim 8
10. The method of claim 9
11. The method of claim 7
12. The method of claim 1
13. The method of claim 1
14. The method of claim 13
i. choosing a partial list with an extreme heuristic value;
ii. determining if the partial list is complete;
iii. if the partial list is complete, then presenting the partial list to the user as the answer to the query;
iv. if the partial list is not complete, then extending the partial list by adding a member of the set of collections specified in the query to the partial list;
v. assessing the heuristic value of the extended partial list; and
vi. repeating steps i. through iii. until at least K lists have been presented to the user, where K is a parameter supplied by the user.
15. The method of claim 14
16. The method of claim 14
17. The method of claim 14
18. The method of claim 14
i. selecting a logical constraint from the query that a first field match a second field, where a member of the set of collections specified in the query corresponding to the first field is included in the partial list;
ii. selecting a term that is included in the member of the partial list that corresponds to the first field;
iii. finding a potential member that includes the selected term; and
iv. adding the potential member that includes the selected term to the existing partial list.
19.
An apparatus for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising:
a. a processor; b. a memory that stores search instructions adapted to be executed by said processor to receive a query that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field, automatically determine the probability that the first field matches the second field based upon the contents of the fields, and generate a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query, said memory coupled to said processor. 20. The apparatus of claim 19 21. The apparatus of claim 19 22. A medium that stores instructions adapted to be executed by a processor to:
a. receive a query that specifies
i. a subset of the set of collections;
ii. a logical constraint between fields that includes a requirement that a first field match a second field;
b. automatically determine the probability that the first field matches the second field based upon the contents of the fields; and c. generate a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query. 23. A medium that stores instructions adapted to be executed by a processor to:
i. choose a partial list with an extreme heuristic value; ii. determine if the partial list is complete; iii. if the partial list is complete, then present the partial list to the user as the answer to the query; iv. if the partial list is not complete, then extend the partial list by adding a member of the set of collections specified in the query to the partial list; v. assess the heuristic value of the extended partial list; and vi. repeat steps i. through iii. until at least K lists have been presented to the user, where K is a parameter supplied by the user. 24. A system for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising:
a. means for receiving a query that specifies
i. a subset of the set of collections;
ii. a logical constraint between fields that includes a requirement that a first field match a second field;
b. means for automatically determining the probability that the first field matches the second field based upon the contents of the fields; and c. means for generating a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query. 25. A system for searching through a space of partial lists, comprising:
i. means for choosing a partial list with an extreme heuristic value;
ii. means for determining if the partial list is complete;
iii. means for presenting the partial list to the user as the answer to the query if the partial list is complete;
iv. means for extending the partial list by adding a member of the set of collections specified in the query to the partial list if the partial list is not complete;
v. means for assessing the heuristic value of the extended partial list; and
vi. means for determining if at least K lists have been presented to the user, where K is a parameter supplied by the user.
Description
[0001] This application claims the benefit of U.S. Provisional Application No. 60/039,576 filed Feb. 25, 1997.
[0002] This invention relates to accessing databases, and particularly to accessing heterogeneous relational databases.
[0003] Databases are the principal way in which information is stored. The most commonly used type of database is a relational database, in which information is stored in tables called relations. Relational databases are described in
[0004] Each entry in a relation is typically a character string or a number. Generally relations are thought of as sets of tuples, a tuple corresponding to a single row in the table. The columns of a relation are called fields.
[0005] Commonly supported operations on relations include selection and join. Selection is the extraction of tuples that meet certain conditions. Two relations are joined on fields F
[0006] Joining relations is the principal means of aggregating information that is spread across several relations. For example, FIG. 1 shows two sample relations Q
[0007] In most databases, each tuple corresponds to an assertion about the world. For instance, the tuple <12:30, 11, "Queen of Outer Space (ZsaZsa Gabor)",
[0008] Known systems can represent information that is uncertain in a database.
One known method associates every tuple in the database with a real number indicating the probability that the corresponding assertion about the world is true. For instance, the tuple described above might be associated with the probability 0.9 if the preceding program was a major sporting event, such as the World Series. The uncertainty represented in this probability includes the possibility, for example, that the World Series program may extend beyond its designated time slot. Extensions to the database operations of join and selection useful for relations with uncertain information are also known. One method for representing uncertain information in a database is described in [0009] Another way of storing information is with a text database. Here information is stored as a collection of documents, also known as a corpus. Each document is simply a textual document, typically in English or some other human language. One standard method for representing text in such a database so that it can be easily accessed by a computer is to represent each document as a so-called document vector. A document vector representation of a document is a vector with one component for each term appearing in the corpus. A term is typically a single word, a prefix of a word, or a phrase containing a small number of words or prefixes. The value of the component corresponding to a term is zero if that term does not appear in the document, and non-zero otherwise. [0010] Generally the non-zero values are chosen so that words that are likely to be important have larger weights. For instance, words that occur many times in a document, or words that are rare in the corpus, have large weights. A similarity function can then be defined for document vectors, such that documents with similar term weights have high similarities, and documents with different term weights have low similarity. Such a similarity function is called a term-based similarity metric.
[0011] An operation commonly supported by such text databases is called ranked retrieval. The user enters a query, which is a textual description of the documents he or she desires to be retrieved. This query is then converted into a document vector. The database system then presents to the user a list of documents in the database, ordered (for example) by decreasing similarity to the document vector that corresponds to the query. [0012] As an example, the Review column (the column indicated by [0013] In general, the user will only be interested in seeing a small number of the documents that are highly similar. Techniques are known for efficiently generating a reduced list of documents, say of size K, that contains all or most of the K documents that are most similar to the query vector, without generating as an intermediate result a list of all documents that have non-zero similarity to the query. Such techniques are described in Chapters 8 and 9 of [0014] In some relational database management systems (RDBMS) relations are stored in a distributed fashion, i.e., different relations are stored on different computers. One issue which arises in distributed databases pertains to joining relations stored at different sites. In order for this join to be performed, it is necessary for the two relations to use comparable keys. For instance, consider two relations M and E, where each tuple in M encodes a single person's medical history, and each tuple in E encodes data pertaining to a single employee of some large company. Joining these relations is feasible if M and E both use social security numbers as keys. However, if E uses some entirely different identifier (say an employee number), then the join cannot be carried out, and there is no known way of aligning the tuples in E with those in M. 
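For illustration only, the top-K ranked-retrieval idea described above can be sketched with an inverted index, so that only documents sharing at least one term with the query are ever scored. The toy corpus, the raw overlap-count scoring, and all names below are hypothetical; a real system would rank by a weighted similarity function rather than overlap.

```python
import heapq

# Hypothetical toy corpus: each document is a list of terms.
corpus = {
    "d1": ["armadillos", "inc"],
    "d2": ["armadillos"],
    "d3": ["software", "inc"],
}

# Inverted index: term -> set of documents containing that term.
index = {}
for doc_id, terms in corpus.items():
    for t in terms:
        index.setdefault(t, set()).add(doc_id)

def top_k(query_terms, k):
    """Return up to k documents ranked by term overlap with the query."""
    # Candidate generation: union of posting lists for the query terms,
    # so documents with zero similarity are never examined.
    candidates = set()
    for t in query_terms:
        candidates |= index.get(t, set())
    overlap = lambda d: len(set(corpus[d]) & set(query_terms))
    return heapq.nlargest(k, candidates, key=overlap)
```

For example, top_k(["armadillos", "inc"], 1) ranks "d1" first, since it shares both query terms with the query.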
To take another example, the relations Q [0015] In practice, the presence of incomparable key fields is often a problem in merging relations that are maintained by different organizations. A collection of relations that are maintained separately is called heterogeneous. The problem of providing access to a collection of heterogeneous relations is called data integration. The process of finding pairs of keys that are likely to be equivalent is called key matching. [0016] Techniques are known for coping with some sorts of key mismatches that arise in accessing heterogeneous databases. One technique is to normalize the keys. For instance, in the relations Q [0017] A data integration system based on normalization of keys is described in [0018] Another known technique for handling key mismatches is to use an equality predicate, a function which, when called with arguments Key [0019] It is often the case that the keys to be matched are strings that name certain real-world entities. (In our example, for instance, they are the names of movies.) Techniques are known for examining pairs of names and assessing the probability that they refer to the same entity. Once this has been done, then a human can make a decision about what pairs of names should be considered equal for all subsequent queries that require key matching. Such techniques are described in [0020] Many of these techniques require information about the types of objects that are being named. For instance, Soundex is often used to match surnames. An exception to this is the use of the Smith-Waterman edit distance, which provides a general similarity metric for any pairs of strings. The use of the Smith-Waterman edit distance metric for key matching is described in an [0021] It is also known how to use term-based similarity functions, closely related to IR similarity metrics, for key matching.
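The Smith-Waterman edit distance mentioned above can be sketched as the standard local-alignment dynamic program over two strings. The scoring parameters (match, mismatch, and gap values) are illustrative assumptions, not values prescribed by the cited techniques, and only the score is returned, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between strings a and b (score only)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never goes below zero, so an
            # alignment can start fresh at any position in either string.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Used as a key-matching similarity, higher scores suggest a higher probability that two name strings are coreferent; for example, smith_waterman("AT&T Labs", "AT&T Bell Labs") scores far above the score for an unrelated pair of names.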
Use of term-based similarity metrics for key matching, as an alternative to Smith-Waterman, is described in [0022] In summary, known methods require that data from heterogeneous sources be preprocessed in some manner. In particular, the data fields that will be used as keys must be normalized, using a domain-specific procedure, or a domain-specific equality test must be written, or a determination as to which keys are in fact matches must be made by a user, perhaps guided by some previously computed assessment of the probability that each pair of keys matches. [0023] All of these known procedures require human intervention, potentially for each pair of data sources. Furthermore, all of these procedures are prone to error. Errors in the process of determining which keys match will lead to incorrect answers to queries to the resulting database. [0024] What is needed is a way of accessing data from many heterogeneous sources without any preprocessing steps that must be guided by a human. Furthermore, when pairs of keys from different sources are assumed to match, the end user should be alerted to these assumptions, and provided with some estimate of the likelihood that the assumptions are correct, or other information with which the end user can assess the quality of the result. [0025] An embodiment of the present invention accesses information stored in heterogeneous databases by using probabilistic database analysis techniques to answer database queries. The embodiment uses uncertain information about possible key matches obtained by using general-purpose similarity metrics to assess the probability that pairs of keys from different databases match. This advantageously allows a user to access heterogeneous sources of information without requiring any preprocessing steps that must be guided by a human.
Furthermore, when pairs of keys from different sources are assumed to match, the user is apprised of these assumptions, and provided with some estimate of the likelihood that the assumptions are correct. This likelihood information can help the user to assess the quality of the answer to the user's query. [0026] Data from heterogeneous databases is collected and stored in relations. In one embodiment, the data items in these relations that will be used as keys are represented as text. A query is received by a database system. This query can pertain to any subset of the relations collected from the heterogeneous databases mentioned above. The query may also specify data items from these relations that must or should refer to the same entity. [0027] A set of answer tuples is computed by the database system. These tuples are those that are determined in accordance with the present invention to be most likely to satisfy the user's query. A tuple is viewed as likely to satisfy the query if those data items that should refer to the same entity (according to the query) are judged to have a high probability of referring to the same entity. The probability that two data items refer to the same entity is determined using problem-independent similarity metrics that advantageously do not require active human intervention to formulate for any particular problem. [0028] In computing the join of two relations, each of size N, N [0029] In some cases, many pairs of keys will be weakly similar, and hence will have some small probability of referring to the same entity. Thus, the answer to a query could consist of a small number of tuples with a high probability of being correct answers, and a huge number of tuples with a small but non-zero probability of being correct answers. Known probabilistic database methods would disadvantageously generate all answer tuples with non-zero probability, which often would be an impractically large set.
The present invention advantageously solves this problem by computing and returning to the user only a relatively small set of tuples that are most likely to be correct answers, rather than all tuples that could possibly be correct answers. [0030] In one embodiment of the present invention, the answer tuples are returned to the user in the order of their computed likelihood of being correct answers, i.e., the tuples judged to be most likely to be correct are presented first, and the tuples judged less likely to be correct are presented later. [0031] In accordance with one embodiment of the present invention, queries concerning information stored in a set of collections are answered. Each collection includes a structured entity. Each structured entity in turn includes a field. [0032] In accordance with an embodiment of the present invention, a query is received that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field. The probability that the first field matches the second field based upon the contents of the fields is automatically determined. A collection of lists is generated in response to the query, where each list includes members of the subset of collections specified in the query. Each list also has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query. [0033] The present invention advantageously combines probabilistic database techniques with probabilistic assessments of similarity to provide a means for automatically and efficiently accessing heterogeneous data sources without the need for human intervention in identifying similar keys. [0034] FIG. 1 shows a prior art example of two relations Q and R and a join of relations Q and R. [0035] FIG. 2 shows an embodiment of a system and apparatus in accordance with the present invention. [0036] FIG.
3 shows a table of relations upon which experiments were performed to determine properties of the present invention. [0037] An embodiment of an apparatus and system in accordance with the present invention is shown in FIG. 2. A search server [0038] As shown in FIG. 2, search server [0039] One embodiment of the present invention is a medium that stores search instructions. As used herein, the phrase "adapted to be executed" is meant to encompass instructions stored in a compressed and/or encrypted format, as well as instructions that have to be compiled or installed by an installer before being executed by processor [0040] In one embodiment, the search server further comprises a port [0041] In one embodiment, network [0042] In one embodiment, the user is a personal computer. In one embodiment, database servers A [0043] As discussed above, many databases contain many fields in which the individual constants correspond to entities in the real world. Examples of such name domains include course numbers, personal names, company names, movie names, and place names. In general, the mapping from name constants to real entities can differ in subtle ways from database to database, making it difficult to determine if two constants are co-referent (i.e., refer to the same entity). [0044] For instance, in two Web databases listing educational software companies, one finds the name constants "Microsoft" and "Microsoft Kids." Do these denote the same company, or not? In another pair of Web sources, the names "Kestrel" and "American Kestrel" appear. Likewise, it is unclear as to whether these denote the same type of bird. Other examples of this problem include "MIT" and "MIT Media Labs"; and "AT&T Bell Labs," "AT&T Labs", "AT&T Labs—Research," "AT&T Research," "Bell Labs," and "Bell Telephone Labs." [0045] As can be seen from the above examples, determining if two name constants are co-referent is far from trivial in many real-world data sources.
Frequently it requires detailed knowledge of the world, the purpose of the user's query, or both. These generally necessitate human intervention in preprocessing or otherwise handling a user query. [0046] Unfortunately, answering most database queries requires understanding which names in a database are coreferent. Two phrases are coreferent if each refers to the same or approximately the same external entity. An external entity is an entity in the real world to which a phrase refers. For example, Microsoft and Microsoft, Inc. are two phrases that are coreferent in the sense that they refer to the same company. As used herein, the term "phrase" means any fragment of text down to a single character, e.g., a word, a collection of words, a letter, several letters, a number, a punctuation mark or set of punctuation marks, etc. [0047] This requirement of understanding which names in a database are coreferent poses certain problems. For example, to join two databases on Company_name fields, where the values of the company names are Microsoft and Microsoft Kids, one must know in advance if these two names are meant to refer to the same company. This suggests extending database systems to represent the names explicitly so as to compute the probability that two names are coreferent. This in turn requires that the database includes an appropriate way of representing text (phrases). [0048] One widely used method for representing text briefly described above is the vector space model. Assume a vocabulary T of terms, each of which will be treated as atomic, i.e., unbreakable. Terms can include words, phrases, or word stems, which are morphologically derived word prefixes. A fragment of text is represented as a DocumentVector, which is a vector of real numbers v ∈ R [0049] A number of schemes have been proposed for assigning weights to terms, as discussed above. An embodiment of the present invention uses the TF-IDF weighting scheme with unit length normalization.
Assuming that the document represented by v is a member of a document collection C, define v̂_t = log(TF_{v,t}+1)·log(IDF_t), [0050] where TF_{v,t} is the number of times that term t occurs in the document represented by v, IDF_t = |C|/|C_t|, and C_t is the subset of documents in C that contain the term t. The weight v_t is then obtained by normalizing v̂ to unit length, i.e., v_t = v̂_t/√(Σ_{t′∈T} v̂_{t′}²). [0051] The "similarity" of two document vectors v and w is given by the formula: sim(v, w) = Σ_{t∈T} v_t·w_t
[0052] which is usually interpreted as the cosine of the angle between v and w. Since every document vector v has unit length, sim(v, w) is always between zero and one. [0053] Although these vectors are conceptually very long, they are also very sparse: if a document contains only k terms, then all but k components of its vector representation will have zero weight. Methods for efficiently manipulating these sparse vectors are known. The vector space representation for documents is described in [0054] The general idea behind this scheme is that the magnitude of the component v [0055] The present invention operates on data stored in relations, where the primitive elements of each relation are document vectors, rather than atoms. This data model is called STIR, which stands for Simple Texts In Relations. The term "simple" indicates that no additional structure is assumed for the texts. [0056] More precisely, an extensional database (EDB) consists of a term vocabulary T and set of relations {p [0057] An embodiment of a language for accessing these relations in accordance with the present invention is called WHIRL, which stands for Word-based Heterogeneous Information Retrieval Logic. A conjunctive WHIRL query is written B_1 ∧ . . . ∧ B_k, where each B_i is a literal. There are two types of literals. An EDB literal is written p(X_1, . . . , X_k) where p is the name of an EDB relation, and the X_i's are variables. A similarity literal is written X˜Y, where X and Y are variables. Intuitively, this can be interpreted as a requirement that documents X and Y be similar. If X appears in a similarity literal in a query Q, then X also appears in some EDB literal in Q.
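The TF-IDF weighting with unit-length normalization and the cosine similarity sim(v, w) described above might be sketched as follows. The log(TF+1)·log(IDF) weighting shown is one common variant, the sparse vectors are represented as dictionaries, and the function names are illustrative.

```python
import math

def tfidf_vectors(docs):
    """docs: list of term lists; returns one unit-length weight dict per doc."""
    n = len(docs)
    df = {}  # document frequency of each term
    for terms in docs:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    vecs = []
    for terms in docs:
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1
        # Raw weight: log(TF+1) * log(IDF), with IDF = n / document frequency.
        raw = {t: math.log(c + 1) * math.log(n / df[t]) for t, c in tf.items()}
        # Normalize to unit length so cosine similarity is a plain dot product.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vecs.append({t: w / norm for t, w in raw.items()})
    return vecs

def sim(v, w):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(weight * w.get(t, 0.0) for t, weight in v.items())
```

Because only the terms actually present in a document are stored, the sparsity noted above comes for free: the dot product iterates over one document's nonzero components only.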
[0058] To take another example, consider two relations R and S, where tuples of R contain a company name and a brief description of the industry associated with that company, and tuples of S contain a company name and the location of the World Wide Web homepage for that company. The join of the relations R and S might be approximated by the query: Q1: r(Company1, Industry) ∧ s(Company2, WebSite) ∧ Company1˜Company2
[0059] This is different from an equijoin of R and S, which could be written: r(Company, Industry) ∧ s(Company, WebSite). [0060] To find Web sites for companies in the telecommunications industry one might use the query: Q2: r(Company1, Industry) ∧ s(Company2, WebSite) ∧ Company1˜Company2 ∧ const1(IO) ∧ Industry˜IO
[0061] where the relation {const [0062] The semantics of WHIRL are defined in part by extending the notion of score to single literals, and then to conjunctions. The semantics of WHIRL are best described in terms of substitutions. A substitution θ is a mapping from variables to document vectors. A substitution is denoted as θ={X [0063] Suppose B is a literal, and θ is a substitution such that Bθ is ground. If B is an EDB literal p(X [0064] If Q=B_1 ∧ . . . ∧ B_k is a query and Qθ is ground, then define score(Qθ) = Π_{i=1}^{k} score(B_i θ). In other words, conjunctive queries are scored by combining the scores of literals as if they were independent probabilities.
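The rule just given, scoring a ground conjunctive query as the product of its literal scores as if they were independent probabilities, reduces to a short sketch:

```python
def conjunction_score(literal_scores):
    """Score of a ground conjunction: the product of its literal scores.

    Under the scoring scheme described above, an EDB literal contributes
    1.0 when the bound tuple is present in the relation (0.0 otherwise),
    and a similarity literal X~Y contributes sim(x, y)."""
    score = 1.0
    for s in literal_scores:
        score *= s
    return score
```

For example, a query with one satisfied EDB literal (score 1.0) and two similarity literals scoring 0.8 and 0.5 receives the overall score 0.4.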
[0065] Recall that the answer to a conventional conjunctive query is the set of ground substitutions that make the query "true," i.e., provable against the EDB. In WHIRL, the notion of provability has been replaced with the "soft" notion of score: substitutions with a high score are intended to be better answers than those with a low score. It seems reasonable to assume that users will be most interested in seeing the high-scoring substitutions, and will be less interested in the low-scoring substitutions. This is formalized as follows: Given an EDB, the "full answer set" S_Q for a query Q contains every ground substitution θ with score(Qθ)>0, and an r-answer R_Q is an ordered list of r ground substitutions θ_1, . . . , θ_r from S_Q such that: [0066] for all θ_i ∈ R_Q and all σ ∈ S_Q−R_Q, score(Qθ_i)≧score(Qσ); and
[0067] for all θ_i, θ_j ∈ R_Q with i<j, score(Qθ_i)≧score(Qθ_j). [0068] In other words, R_Q contains r of the highest-scoring substitutions, ordered by non-increasing score. [0069] It is assumed that the output of a query-answering algorithm given the query Q will not be a full answer set, but rather an r-answer for Q, where r is a parameter fixed by the user. To understand the notion of an r-answer, observe that in typical situations the full answer set for WHIRL queries will be very large. For example, the full answer set for the query Q [0070] of company names contain the term "Inc.", and that R and S each contain a random selection of n company names, then one would expect the size of the full answer set to contain
[0071] substitutions simply due to the matches on the term “Inc.” Further, the full answer set for the join of m relations of this sort would be of size at least
[0072] To further illustrate this point, I computed the pairwise similarities of two lists R and S of company names with R containing 1163 names, S containing 976 names. These lists are the relations Hoovers Web [0073] The scoring scheme given above for conjunctive queries can be fairly easily extended to certain more expressive languages in accordance with the present invention. Below, I consider such an extension, which corresponds to projections of unions of conjunctive queries. [0074] A "basic WHIRL clause" is written p(X [0075] Now, consider a ground instance a=p(x [0076] support(a) = {(A←Q, θ, s) : (A←Q) ∈ υ and Aθ=a and score(Qθ)=s and s>0} The score of (x_1, . . . , x_k) in p is then defined by Equation (1): score(a) = 1 − Π_{(C,θ,s)∈support(a)} (1−s) [0077] To understand this formula, note that it is in some sense a dual of multiplication: if e_1 and e_2 are independent events with probabilities p_1 and p_2, then the probability of (e_1 ∧ e_2) is p_1·p_2, and the probability of (e_1 ∨ e_2) is 1−(1−p_1)(1−p_2). The "materialization of the view υ" is defined to be a relation with name p which contains all tuples (x_1, . . . , x_k) such that score((x_1, . . . , x_k) ∈ p)>0.
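The dual-of-multiplication combination described above, in which the support scores s of a ground fact combine as 1−Π(1−s), can be sketched as:

```python
def combine_support(scores):
    """Noisy-or combination: 1 - prod(1 - s) over the support scores s.

    Supports with scores near zero contribute factors near one, so they
    barely change the result -- the observation that justifies computing
    this from an r-answer of high-scoring substitutions only."""
    remainder = 1.0
    for s in scores:
        remainder *= (1.0 - s)
    return 1.0 - remainder
```

Note that any single support with score 1.0 forces the combined score to 1.0, and adding a low-scoring support raises the combined score only slightly.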
[0078] Unfortunately, while this definition is natural, there is a difficulty with using it in practice. In a conventional setting, it is easy to materialize a view of this sort, given a mechanism for solving a conjunctive query. In WHIRL, one would prefer to assume only a mechanism for computing r-answers to conjunctive queries. However, since Equation (1) involves a support set of unbounded size, it appears that r-answers are not enough to even score a single ground instance a. [0079] Fortunately, however, low-scoring substitutions have only a minimal impact on the score of a. Specifically, if (C,θ,s) is such that s is close to zero, then the corresponding factor of (1−s) in the score for a is close to one. One can thus approximate the score of Equation (1) using a smaller set of high-scoring substitutions, such as those found in an r-answer for moderately large r. [0080] In particular, let υ contain the clauses A {(A←Q,θ,s): (A←Q,θ,s) εsupport(a) and θεR} [0081] Also define the r-score for a from R by replacing support (a) in Equation (1) with the r-support set for a. Finally, define the “r-materialization of υ from R” to contain all tuples with non-zero r-score, with the score of x [0082] Clearly, the r-materialization of a view can be constructed using only an r-answer for each clause body involved in the view. As r is increased, the r-answers will include more and more high-scoring substitutions, and the r-materialization will become a better and better approximation to the full materialized view. Thus, given an efficient mechanism for computing r-answers for conjunctive views, one can efficiently approximate the answers to more complex queries. [0083] One embodiment of WHIRL implements the operations of finding the r-answer to a query and the r-materialization of a view. As noted above, r-materialization of a view can be implemented easily given a routine for constructing r-answers. 
First, however, I will give a short overview of the main ideas used in the process. [0084] In an embodiment of WHIRL, finding an r-answer is viewed as an optimization problem. In particular, the query processing algorithm uses a general method called A* search to find the highest-scoring r substitutions for a query. The A* search method is described in [0085] To understand the use of search, consider finding an r-answer to the WHIRL query insiderTip(X) ∧ publiclyTraded(Y) ∧ X˜Y, where the relation publiclyTraded is very large, but the relation insiderTip is very small. In processing the corresponding equijoin insiderTip(X) ∧ publiclyTraded(Y) ∧ X=Y with a known database system, one would first construct a query plan. [0086] For example, one might first find all bindings for X, and then use an index to find all values Y in the first column of publiclyTraded that are equivalent to some X. It is tempting to extend such a query plan to WHIRL, by simply changing the second step to find all values Y that are similar to some X. However, this natural extension can be quite inefficient. Imagine that insiderTip contains the vector x_1, corresponding to the document "Armadillos, Inc." Due to the frequent occurrence of the term "Inc.", there will be many documents Y that have non-zero similarity to x [0087] To find the Y's most similar to the document "The American Software Company" (in which every term is somewhat frequent), a very different type of subplan might be required. These observations suggest that query processing should proceed in small steps, and that these steps should be scheduled dynamically, in a manner that depends on the specific document vectors being processed. [0088] The query processing method described below searches through a space of partial substitutions. Each substitution is a list of values that could be assigned to some, but not necessarily all, of the variables appearing in the query.
For example, one state in the search space for the query given above would correspond to the substitution that maps X to x_i and leaves Y unbound. [0089] A* search is a graph search method which attempts to find the highest-scoring path between a given start state s_0 and a goal state. A pseudo-code embodiment of A* search as used in an embodiment of the present invention is as follows:

[0090] procedure A*(r, s_0)
begin
  OPEN := {s_0}
  while OPEN ≠ Ø do
    s := argmax_{s′ ∈ OPEN} h(s′)
    OPEN := OPEN − {s}
    if goalState(s) then
      output <s, h(s)>
      exit if r answers printed
    else
      OPEN := OPEN ∪ children(s)
    endif
  endwhile
end

[0104] Initial state: s_0 = <Ø, Ø>.

[0105] goalState(<θ, E>): true iff Qθ is ground.

[0106] children(<θ, E>):
  if constrain(<θ, E>) ≠ Ø then return constrain(<θ, E>)
  else return explode(<θ, E>)

[0109] constrain(<θ, E>):
  1. pick X, Y, t where X∼Y is a constraining literal, Xθ = x, Y is unbound in θ with generator p and generation index l (see text), x^t > 0, and <t, Y> ∉ E
  2. if no such X, Y, t exists then return Ø
  3. return {<θ, E′>} ∪ {<θ_1, E>, …, <θ_m, E>},
  where E′ = E ∪ {<t, Y>}, and each θ_i is θ ∪ {Y_1 = v_i1, …, Y_k = v_ik} for a tuple <v_i1, …, v_ik> ∈ index(t, p, l)

[0119] explode(<θ, E>):
  pick p(Y_1, …, Y_k) such that every Y_j is unbound by θ
  return the set of all <θ ∪ {Y_1 = v_1, …, Y_k = v_k}, E>
  such that <v_1, …, v_k> is a tuple of p

[0123] h(<θ, E>) = Π_{B_i ∈ Q} h′(B_i θ), where h′(B_i θ) = score(B_i θ) if B_i θ is ground, and for a constraining literal
  h′((X∼Y)θ) = Σ_{t : <t,Y> ∉ E} x^t · maxweight(t, p, l),
  where Xθ = x, and Y is unbound with generator p and generation index l (see text).

[0129] As can be seen in the above pseudo-code, goal states are defined by a goalState predicate. The graph being searched is defined by a function children(s), which returns the set of states directly reachable from state s. To conduct the search, the A* algorithm maintains a set OPEN of states that might lie on a path to some goal state.
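The search loop itself can be sketched in Python as follows; the heap-based OPEN set and the tie-breaking counter are implementation choices of this sketch, not details from the text:

```python
import heapq
import itertools

def a_star(r, s0, h, goal_state, children):
    """Best-first search: repeatedly remove the OPEN state with the
    largest heuristic value h(s); output goal states until r answers
    have been produced or the search space is exhausted."""
    tie = itertools.count()                    # tie-breaker for equal h values
    open_heap = [(-h(s0), next(tie), s0)]      # max-heap simulated by negating h
    answers = []
    while open_heap and len(answers) < r:
        neg_h, _, s = heapq.heappop(open_heap)
        if goal_state(s):
            answers.append((s, -neg_h))        # output <s, h(s)>
        else:
            for c in children(s):
                heapq.heappush(open_heap, (-h(c), next(tie), c))
    return answers
```

As a usage illustration (an invented toy, not the WHIRL state space): with states that are partial bit-strings, h(s) = sum(s) plus the number of unfilled positions (an admissible upper bound on the final sum), and children that append a 0 or a 1, a_star returns complete strings in descending score order.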
Initially OPEN contains only the start state s_0. [0130] At each subsequent step of the algorithm, a single state is removed from the OPEN set; in particular, the state s that is "best" according to a heuristic function, h(s), is removed from OPEN. If s is a goal state, then this state is output; otherwise, all children of s are added to the OPEN set. The search continues until r goal states have been output, or the search space is exhausted. [0131] I will now explain how this general search method has been instantiated in WHIRL in accordance with an embodiment of the present invention. I will assume that each variable in the query Q appears exactly once in an EDB literal. In other words, the variables in EDB literals are distinct from each other, and also distinct from variables appearing in other EDB literals, and both variables appearing in a similarity literal also appear in some EDB literal. (This restriction is made innocuous by an additional predicate eq(X,Y) which is true when X and Y are bound to the same document vector. The implementation of the eq predicate is straightforward and known in the art, and will be ignored in the discussion below.) In processing queries, the following data structures will be used. An inverted index will map terms t ∈ T to the tuples that contain them: specifically, I assume a function index(t, p, l) which returns the set of tuples <v_1, …, v_k> of p whose l-th component v_l contains the term t (that is, v_l^t > 0). I also assume a precomputed function maxweight(t, p, l), which returns the largest weight of the term t in the l-th column of any tuple of p. [0132] The states of the graph searched will be pairs <θ, E>, where θ is a substitution, and E is a set of exclusions, each exclusion being a pair <t, Y> of a term and a variable. Goal states will be those for which θ is ground for Q, and the initial state s_0 is <Ø, Ø>. [0133] I will adopt the following terminology. Given a substitution θ and query Q, a similarity literal X∼Y is constraining if and only if exactly one of Xθ and Yθ is ground. Without loss of generality, I assume that Xθ is ground and Yθ is not. For any variable Y, the EDB literal of Q that contains Y is the generator for Y, and the position l of Y within this literal is Y's generation index.
For well-formed queries, there will be only one generator for a variable Y. [0134] Children are generated in two ways: by exploding a state, or by constraining a state. Exploding a state corresponds to picking all possible bindings of some unbound EDB literal. To explode a state s = <θ, E>, pick some EDB literal p(Y_1, …, Y_k) such that every Y_j is unbound by θ; the children of s are then all states <θ ∪ {Y_1 = v_1, …, Y_k = v_k}, E> such that <v_1, …, v_k> is a tuple of p. [0135] The second operation of constraining a state implements a sort of sideways information passing. To constrain a state s = <θ, E>, pick some constraining literal X∼Y and some term t with non-zero weight in the document Xθ such that <t, Y> ∉ E. Let p(Y_1, …, Y_k) be the generator for Y, and let l be Y's generation index. The children of s are the single state <θ, E ∪ {<t, Y>}>, in which the pair <t, Y> is excluded from further use, together with one state <θ ∪ {Y_1 = v_1, …, Y_k = v_k}, E> for each tuple <v_1, …, v_k> ∈ index(t, p, l). [0136] It is easy to see that if a substitution θ is ground for Q and gives Q a non-zero score, then a state containing θ can be reached from s_0 by some sequence of explode and constrain operations. [0137] Given the operations above, there will typically be many ways to "constrain" or "explode" a state. In the current implementation of WHIRL, a state is always constrained using the pair <t, Y> for which x^t · maxweight(t, p, l) is largest, where x = Xθ for the corresponding constraining literal X∼Y. [0138] It remains to define the heuristic function, which, when evaluated, produces a heuristic value. Recall that the heuristic function h(<θ, E>) must be admissible, and must coincide with the scoring function score(Qθ) on ground substitutions. This implies that h(<θ, E>) must be an upper bound on score(q) for any ground instance q of Qθ. I thus define h(<θ, E>) to be the product over the literals B_i of Q of h′(B_i θ), where h′(B_i θ) = score(B_i θ) if B_i θ is ground, and, for a constraining literal X∼Y, h′((X∼Y)θ) = Σ_{t : <t,Y> ∉ E} x^t · maxweight(t, p, l), [0139] where p and l are the generator and generation index for Y. Note that this is an upper bound on the score of (X∼Y)θ′ for any ground extension θ′ of θ consistent with the exclusions in E. [0140] In the current implementation of WHIRL, the terms of a document are stems produced by the Porter stemming algorithm. The Porter stemming algorithm is described in "An Algorithm for Suffix Stripping", by M. F. Porter, Program, 14(3):130-137, 1980. In general, the term weights for a document v are computed from statistics of the collection from which v is drawn, using a TF-IDF weighting scheme. [0141] To set these weights, every query is checked before invoking the query algorithm to see if it contains any EDB literals p(X_1, …, X_k) for a relation p whose term weights have not yet been computed; if so, those weights are computed before the search begins. [0142] The current implementation of WHIRL keeps all indices and document vectors in main memory. [0143] In the following examples of the procedure in accordance with the present invention, it is assumed that terms are words.
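As an illustration of the term-weighting step, the sketch below assumes a common log-TF × log-IDF scheme with unit-length (cosine) normalization; the exact weighting formula of the implementation is not given in this excerpt, and stemming and tokenization are omitted:

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document (a list of terms, e.g. Porter stems) into a
    unit-length vector of TF-IDF weights.  The (log TF + 1) * log IDF
    variant used here is one standard choice, assumed for illustration."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = {t: (math.log(tf[t]) + 1.0) * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors

def sim(x, y):
    """Cosine similarity of two unit vectors: the sum, over terms the
    documents share, of the products of the corresponding weights."""
    return sum(w * y[t] for t, w in x.items() if t in y)
```

A document is then similar to exactly those documents with which it shares at least one term of non-zero weight, which is what makes an inverted index an effective access path.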
[0144] Consider the query const1(IO) ∧ p(Company, Industry) ∧ Industry∼IO, where the relation const1 contains the single document "telecommunications services and/or equipment". With θ = Ø, there are no constraining literals, so the first step in answering this query will be to explode the smallest relation, in this case const1. This will produce one child, s_1, containing the appropriate binding for IO, which will be placed on the OPEN list.
[0145] Next, s_1 will be removed from the OPEN list. The literal Industry∼IO is now constraining, so s_1 will be constrained, most plausibly using the high-weight term "telecommunications". This replaces s_1 on the OPEN list with a state s′_1 in which the pair <telecommunications, Industry> is excluded, together with one new state for each tuple in index(telecommunications, p, 2); in each of these new states Company and Industry are bound, so each is a goal state. [0146] Next, a state will again be removed from the OPEN list. It may be that h(s′_1) is the largest value on the list, in which case s′_1 will be constrained again, using some other term such as "services"; or it may be that one of the goal states has the largest value, in which case it will be output as an answer. [0147] This process will continue until r answers have been generated. Note that it is quite likely that low-weight terms such as "or" will not be used at all. [0148] In another example of the present invention, consider the query p(Company_1, Industry_1) ∧ q(Company_2, Industry_2) ∧ Company_1∼Company_2. [0149] In solving this query, the first step will be to explode the smaller of these relations. Assume that this is p, and that p contains 1000 tuples. This will add 1000 states s_1, …, s_1000 to the OPEN list; in each, Company_1 and Industry_1 are bound, and the literal Company_1∼Company_2 is constraining. [0150] However, the h(·) values for the states s_1, …, s_1000 will not in general be equal: the value of each state depends on the weights of the terms in the document bound to Company_1, since those weights bound the score that any binding for Company_2 can achieve. [0151] The result is that the next step of the algorithm will be to choose a promising state from the OPEN list, a state that could result in a good final score. A term from the Company_1 document of that state will then be used to access the inverted index and generate bindings for Company_2 and Industry_2. [0152] In short, the operation of WHIRL is somewhat similar to time-sharing 1000 simpler queries on a machine for which the basic unit of computation is to access a single inverted index. However, WHIRL's use of the h(·) function will schedule the computation of these queries in an intelligent way: queries unlikely to produce good answers can be discarded, and low-weight terms are unlikely to be used. [0153] In yet another example, consider the query p(Company_1, Industry_1) ∧ q(Company_2, Industry_2) ∧ Industry_1∼IO ∧ Industry_1∼Industry_2 ∧ const1(IO), [0154] where the relation const1 again contains the single document "telecommunications services and/or equipment". Processing begins by exploding const1 and then constraining the resulting state, say with the term "telecommunications". [0155] At this point there will be two types of states on the OPEN list. There will be one state s′ in which only IO is bound, and <telecommunications, Industry_1> is excluded. There will also be several states s_1, …, s_m in which IO, Company_1 and Industry_1 are bound. [0156] However, if some s_i has the largest heuristic value, it will be constrained next, using a term from the document bound to Industry_1 to generate bindings for Company_2 and Industry_2. [0157] This example illustrates how bindings can be propagated through similarity literals. The binding for IO is first used to generate bindings for Company_1 and Industry_1; these bindings are then used, through the literal Industry_1∼Industry_2, to generate bindings for Company_2 and Industry_2. [0158] Embodiments of the invention have been evaluated on data collected from a number of sites on the World Wide Web. I have evaluated the run-time performance with CPU time measurements on a specific class of queries, which I will henceforth call similarity joins.
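A brute-force sketch of such a similarity join follows; the jaccard token-overlap function in the usage example is an illustrative stand-in for the TF-IDF cosine similarity used by WHIRL:

```python
def similarity_join(p, q, i, j, sim, r):
    """Score every pairing of a tuple from p with a tuple from q by the
    similarity of field i of the p-tuple and field j of the q-tuple,
    and keep the r best-scoring pairs.  This scores all |p|*|q| pairs;
    the inverted-index and A*-based strategies described in the text
    exist precisely to avoid this exhaustive enumeration."""
    scored = [((x, y), sim(x[i], y[j])) for x in p for y in q]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:r]

def jaccard(a, b):
    """Token-overlap similarity, used here only for illustration."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)
```

For example, joining a small company relation against another on the name field returns the exactly-matching pair first, with score 1.0.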
A similarity join is a query of the form p(X_1, …, X_k) ∧ q(Y_1, …, Y_b) ∧ X_i∼Y_j. [0159] An answer to this query will consist of the r tuples from p and q for which the paired fields X_i and Y_j are most similar. [0160] 1) The naive method for similarity joins takes each document in the i-th column of relation p in turn, and submits it as an IR ranked-retrieval query to a corpus corresponding to the j-th column of relation q. The top r results from each of these IR queries are then merged to find the best r pairs overall. This might more appropriately be called a "semi-naive" method; on each IR query, I use inverted indices, but I employ no special query optimizations. [0161] 2) WHIRL is closely related to the maxscore optimization, which is described in the literature. [0162] I computed the top 10 answers for the similarity join of subsets of the IMDB relation. [0163] To evaluate the accuracy of the answers produced by WHIRL, I adopted the following methodology. Again focusing on similarity joins, I selected pairs of relations which contained two or more plausible "key" fields. One of these fields, the "primary key", was used in the similarity literal in the join. The second key field was then used to check the correctness of proposed pairings; specifically, a pairing was marked as "correct" if the secondary keys matched (using an appropriate matching procedure) and "incorrect" otherwise. [0164] I then treated "correct" pairings in the same way that "relevant" documents are typically treated in the evaluation of a ranking proposed by a standard IR system. In particular, I measured the quality of a ranking using non-interpolated average precision. To motivate this measurement, assume the end user will scan down the list of answers and stop at some particular target answer that he or she finds to be of interest. The answers listed below this "target" are not relevant, since they are not examined by the user.
Above the target, one would like to have a high density of correct pairings; specifically, one would like the set S of answers above the target to have high precision, where the precision of S is the ratio of the number of correct answers in S to the number of total answers in S. Average precision is the average precision for all "plausible" target answers, where an answer is considered a plausible target only if it is correct. To summarize, letting a_k equal 1 if the k-th answer in the ranking is correct and 0 otherwise, the non-interpolated average precision of a ranking of n answers is (Σ_{k=1..n} a_k · (Σ_{j=1..k} a_j)/k) / (Σ_{k=1..n} a_k). [0165] I used three pairs of relations from three different domains. In the business domain, I joined the Iontech relation with a second relation of companies, using the company name as the primary key. [0166] On these domains, similarity joins are extremely accurate. In the movie domain, the performance is actually identical to the hand-coded normalization procedure, and thus has an average precision of 100%. In the animal domain, the average precision is 92.1%, and in the business domain, average precision is 84.6%. These results contrast with the typical performance of statistical IR systems on retrieval problems, where the average precision of a state-of-the-art IR system is usually closer to 50% than 90%. In other words, the tested embodiment of the present invention was able to achieve results in an efficient, automatic fashion that were just as good as the results obtained using a substantially more expensive technique involving hand-coding, i.e., human intervention. [0167] The foregoing has disclosed to those skilled in the arts of information retrieval and databases how to integrate information from many heterogeneous sources using the method of the invention. While the techniques disclosed herein are the best presently known to the inventor, other techniques could be employed without departing from the spirit and scope of the invention. For example, representations other than relational representations could be used to store data; some of these representations are described in the literature. [0168] In the process of finding answers with high score, the invention employs A* search.
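Returning to the evaluation methodology, the non-interpolated average precision described above can be computed as follows (the standard IR definition, in which precision is taken at each correct answer and then averaged, assumed to match the measure intended in the text):

```python
def average_precision(correct):
    """correct[k] is True if the (k+1)-th ranked answer is a correct
    pairing.  For each correct answer (a 'plausible target'), take the
    precision of the set of answers at or above it, then average."""
    precisions = []
    num_correct = 0
    for k, is_correct in enumerate(correct, start=1):
        if is_correct:
            num_correct += 1
            precisions.append(num_correct / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

A ranking whose correct answers all appear before any incorrect ones scores 1.0; interleaving an incorrect answer among the correct ones lowers the average.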
Many variants of this search algorithm are known and many of these could be used. The current invention also outputs answer tuples in an order that is strictly dictated by score; some variants of A* search are known that require less compute time, but output answers in an order that is largely, but not completely, consistent with this ordering. [0169] Methods are also known for finding pairs of similar keys by using Monte Carlo sampling methods; these methods are described in the literature. [0170] Many different term-based similarity functions have been proposed by researchers in information retrieval. Many of these variants could be employed instead of the function employed in the invention. [0171] Finally, while the problem that motivated the development of this invention is integration of data from heterogeneous databases, there are potentially other problems to which the present invention can be advantageously applied. That being the case, the description of the present invention set forth herein is to be understood as being in all respects illustrative and exemplary, but not restrictive.