United States Patent [19]
Voorhees et al.
US005864845A [11] Patent Number: 5,864,845  Date of Patent: Jan. 26, 1999
 FACILITATING WORLD WIDE WEB
SEARCHES UTILIZING A MULTIPLE
SEARCH ENGINE QUERY CLUSTERING
 Inventors: Ellen M. Voorhees, North Potomac,
Md.; Narendra K. Gupta, Dayton, N.J.
 Assignee: Siemens Corporate Research, Inc., Princeton, N.J.
 Appl. No.: 674,644
 Filed: Jun. 28, 1996
 Int. Cl. G06F 17/30
 U.S. Cl. 707/5; 707/1; 707/2; 707/3;
 Field of Search 707/3, 4, 2, 5,
 References Cited
U.S. PATENT DOCUMENTS
5,325,298 6/1994 Gallant 704/9
5,442,778 8/1995 Pedersen et al 707/5
5,706,497 1/1998 Takahashi et al 707/5
Bartell et al., "Automatic Combination of Multiple Ranked Retrieval Systems", Proceedings of SIGIR '94, Jul. 1994, pp. 173-181.
Belkin et al., "The Effect of Multiple Query Representations on Information System Performance", Proceedings of SIGIR '93, Jun. 1993, pp. 339-346.
Fox et al., "Combination of Multiple Searches", Proceedings of TREC-3, Apr. 1995, pp. 105-108.
A method implemented on a computer for facilitating World Wide Web Searches and like database searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a single document with a ranked list of pages, includes the steps of: (a) training the computer for each search engine by clustering training queries and building cluster centroids; (b) assigning weights to each cluster reflecting the number of relevant pages expected to be obtained by this search engine for queries similar to those in that cluster; (c) processing an incoming query by selecting, for each search engine, that cluster centroid that is most similar to the incoming query and returning the weight associated with the selected cluster as the weight of the current search engine; and (d) apportioning the N slots in the retrieved set according to the weights returned by each search engine.
15 Claims, 2 Drawing Sheets
[FIG. 1, flow chart: (1) train the computer for each search engine by clustering training queries and building cluster centroids; (2) assign weights to each cluster reflecting the number of relevant pages expected to be obtained by this search engine for queries similar to those in that cluster; (3) process an incoming query by selecting, for each search engine, that cluster centroid that is most similar to said incoming query and returning the weight associated with the selected cluster as the weight of the current search engine; (4) apportion the N slots in the retrieved set according to the weights returned by each search engine.]
[FIG. 2 (Sheet 2 of 2, Jan. 26, 1999), flow chart: training for each search engine in accordance with the following steps: deriving a plurality of outputs from respective search engines; deriving a similarity measure from a number of documents retrieved in common between two queries; creating a query vector for a current query; determining the centroid of a query cluster by averaging vectors of queries contained within said cluster; assigning to a cluster a weight that reflects how effective queries in the cluster are for the corresponding search engine, whereby the larger the weight, the more effective the queries are expected to be; selecting that cluster whose centroid vector is most similar to the query vector for the query; returning the weight associated with the selected cluster as the weight of the current search engine; apportioning the N slots in the retrieved set according to the weights returned by each search engine.]
The present invention relates to an automatic method for facilitating World Wide Web Searches and, more specifically, to an automatic method for facilitating World Wide Web Searches by exploiting the differences in the search results of multiple search engines to produce a single list that is more accurate than any of the individual lists from which it is built.
Text retrieval systems accept a statement of information need in the form of a query, assign retrieval status values to documents in the collection based on how well the documents match the query, and return a ranked list of the documents ordered by retrieval status value. Data fusion methods that combine the search results of different queries representing a single information need to produce a final ranking that is more effective than the component rankings are well-known. See Bartell, B. T., Cottrell, G. W., and Belew, R. K.: Automatic combination of multiple ranked retrieval systems; Proceedings of SIGIR-94; July, 1994. Belkin, N. J. et al.: The effect of multiple query representations on information system performance; Proceedings of SIGIR-93; June, 1993. Fox, E. A. and Shaw, J. A. Combination of multiple searches. Proceedings of TREC-2; March 1994.
However, these fusion methods determine the rank of a document in the final list by computing a function of the retrieval status values of that document in each of the component searches. The methods are therefore not applicable when the component searches return only the ordered list of documents and not the individual status values.
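To make the dependence on retrieval status values concrete, the following is a minimal sketch of one score-based fusion rule. The choice of simple summation, the function name, and the sample scores are assumptions made for illustration only and are not taken from the cited references.

```python
from collections import defaultdict


def sum_of_scores_fusion(runs):
    """Fuse component searches by summing each document's status values.

    `runs` is a list of dicts mapping document id -> retrieval status value.
    The rule cannot be applied if a component search returns only an
    ordered list, because there is then no value to sum.
    """
    fused = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] += score
    # Highest combined value first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical status values from two component searches.
run_a = {"d1": 0.9, "d2": 0.4}
run_b = {"d2": 0.7, "d3": 0.5}
fused_ranking = sum_of_scores_fusion([run_a, run_b])
```

Here "d2" is ranked first because it is the only document scored by both component searches and its summed value is the largest.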
The World Wide Web is a collection of information-bearing units called "pages" interconnected by a set of links. To help users find pages on topics that are of interest to them, several groups provide search engines that accept a statement of user need (in either English or a more formal query language) and return a list of pages that match the query. A list is usually ordered by a similarity measure computed between the query and the pages. While each of the search engines in principle searches over the same set of pages (the entire Web), the size of the Web and the imprecise nature of the search algorithms frequently cause different search engines to return different lists of pages for the same query.
Search engines such as Excite and Alta Vista provide a query interface to the information in these pages and, like traditional text retrieval systems, return a ranked list of pages ordered by the similarity of the page to the query. See Steinberg, Steve G.: Seek and Ye Shall Find (Maybe); Wired; May, 1996. Because the search engines process queries in different ways, and because their coverage of the Web differs, the same query statement given to different engines often produces different results. Submitting the same query to multiple search engines, as Quarterdeck's WebCompass product does, can improve overall search effectiveness. See QuarterDeck. URL: http://arachnid.qdeck.com/qdeck/products/webcompass.
In accordance with an aspect of the invention, a method provides for combining the results of the separate search engines into a single integrated ranked list of pages in response to a query. Unlike WebCompass, the method does not keep the search results separated by the search engine that produced the result, but forms a single ranked list. Unlike the traditional fusion methods, the method in accordance with the invention can produce a single ranking despite the fact that most search engines do not return the similarities that are computed for individual pages.
FIGS. 1 and 2 show flow charts helpful to a fuller understanding of the invention.
The method in accordance with the invention utilizes a particular application of algorithms developed to combine the results of searches on potentially disjoint databases. See Towell, G., et al.: Learning Collection Fusion Strategies for Information Retrieval; Proceedings of the 12th Annual Machine Learning Conference; July, 1995. Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B.: The Collection Fusion Problem; Proceedings of TREC-3, NIST Special Publication 500-225; April, 1995; pp. 95-104. Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B.: Learning Collection Fusion Strategies; Proceedings of SIGIR-95; July, 1995; pp.
An object of the present invention is to approximate the effectiveness of a single text retrieval system despite the collection being physically separated. Another object of the present invention is to combine the results of multiple searches of essentially the same database so as to improve the performance over any single search.
In accordance with another aspect of the invention, a method implemented on a computer for facilitating World Wide Web Searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a single document with a ranked list of pages, includes the steps of: (a) training the computer for each search engine by clustering training queries and building cluster centroids; (b) assigning weights to each cluster reflecting the number of relevant pages expected to be obtained by this search engine for queries similar to those in that cluster; (c) processing an incoming query by selecting, for each search engine, that cluster centroid that is most similar to the incoming query and returning the weight associated with the selected cluster as the weight of the current search engine; and (d) apportioning the N slots in the retrieved set according to the weights returned by each search engine.
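Steps (c) and (d) can be sketched as follows, assuming the training of steps (a) and (b) has already produced, for each search engine, a list of (centroid, weight) pairs. The use of cosine similarity, the largest-remainder rounding used to obtain whole slots, and all names and sample values are assumptions made for illustration; the text above does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def engine_weight(query_vec, clusters):
    """Step (c): return the weight of the cluster whose centroid is most
    similar to the query. `clusters` is a list of (centroid, weight)."""
    _, weight = max(clusters, key=lambda c: cosine(query_vec, c[0]))
    return weight


def apportion(weights, n_slots):
    """Step (d): split n_slots among engines in proportion to their weights.

    Whole slots are assigned by largest remainder (an assumed rounding rule).
    """
    total = sum(weights.values())
    if total <= 0:
        return {engine: 0 for engine in weights}
    exact = {e: n_slots * w / total for e, w in weights.items()}
    slots = {e: int(x) for e, x in exact.items()}
    leftover = n_slots - sum(slots.values())
    by_remainder = sorted(exact, key=lambda e: (-(exact[e] - slots[e]), e))
    for engine in by_remainder[:leftover]:
        slots[engine] += 1
    return slots


# Hypothetical trained clusters for two engines.
trained = {
    "engine_a": [({"web": 1.0, "search": 1.0}, 6.0), ({"cooking": 1.0}, 1.0)],
    "engine_b": [({"web": 1.0}, 2.0), ({"cooking": 1.0, "recipe": 1.0}, 5.0)],
}
query = {"web": 1.0, "search": 1.0}
weights = {name: engine_weight(query, cl) for name, cl in trained.items()}
slots = apportion(weights, 8)
```

With these sample values the query selects the first cluster of each engine, giving weights of 6.0 and 2.0, so six of the eight slots go to engine_a and two to engine_b.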
In accordance with another aspect of the invention, the present method for facilitating World Wide Web searches utilizing a query clustering fusion strategy uses relevance data—judgments by the user as to whether a page is appropriate for the query which retrieved it—from past queries to compute the number of pages to select from each search engine for the current query. In the present description, the set of queries for which relevance data is known is called the training queries. The terms "page" and "document" are used interchangeably.
The function F_sq(N), called a relevant document distribution, returns the number of relevant pages retrieved by search engine s for query q in the ranked list of size N.
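As an illustrative sketch, a relevant document distribution for one search engine and one query can be computed from the engine's ranked list and the user's relevance judgments. The function name, the data layout, and the sample pages are assumptions, not part of the disclosure.

```python
def relevant_document_distribution(ranked_pages, relevant_pages):
    """Return a function F where F(n) is the number of relevant pages
    among the top n entries of `ranked_pages`."""
    cumulative = [0]
    for page in ranked_pages:
        cumulative.append(cumulative[-1] + (1 if page in relevant_pages else 0))

    def distribution(n):
        # Clamp n to the length of the list actually returned.
        return cumulative[min(max(n, 0), len(ranked_pages))]

    return distribution


# Hypothetical ranked list and relevance judgments for one query.
F = relevant_document_distribution(["p1", "p2", "p3", "p4"], {"p1", "p3"})
```

For this sample list, F(1) and F(2) are both 1, and F(3) is 2, since the second relevant page appears at rank 3.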
A fusion method, Modeling Relevant Document Distributions (MRDD) is disclosed in a copending patent application by the present Inventor, entitled Method for facilitating World Wide Web Searches Utilizing a Document Distribution Fusion Strategy and filed on even date herewith and whereof the disclosure is herein incorporated by reference to the extent it is not incompatible with the present invention. As therein disclosed, the fusion method builds an explicit model of the relevant document distribution of the joint search. The model is created by computing the average relevant document distribution of the k nearest neighbors of the current query, q. The nearest neighbors of q are the training queries that have the highest similarity with q.
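The averaging step described above can be sketched as follows for a single search engine. Cosine similarity over sparse query vectors, the tabulated form of each distribution, and all names and sample values are assumptions for illustration; the copending application, not this sketch, defines the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def average_distribution(query_vec, training, k, max_n):
    """Average the relevant document distributions of the k training
    queries most similar to `query_vec`.

    `training` is a list of (query_vector, distribution) pairs, where
    distribution[n] is the number of relevant pages in the top n
    for n = 0..max_n.
    """
    neighbors = sorted(
        training, key=lambda t: cosine(query_vec, t[0]), reverse=True
    )[:k]
    if not neighbors:
        return [0.0] * (max_n + 1)
    return [
        sum(dist[n] for _, dist in neighbors) / len(neighbors)
        for n in range(max_n + 1)
    ]


# Hypothetical training queries with tabulated distributions.
training = [
    ({"web": 1.0}, [0, 1, 2, 2]),
    ({"web": 1.0, "search": 1.0}, [0, 1, 1, 2]),
    ({"cooking": 1.0}, [0, 0, 0, 1]),
]
model = average_distribution({"web": 1.0, "search": 1.0}, training, k=2, max_n=3)
```

With k=2, the two Web-related training queries are the nearest neighbors, and the cooking query does not contribute to the average.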
As disclosed in the above-referenced application, the method utilizes a vector representation of the queries to