METHOD AND APPARATUS FOR SEARCHING A
DATABASE AND PROVIDING RELEVANCE
CROSS REFERENCE TO RELATED
 This application claims priority from Provisional Application Serial No. 60/194/669, filed Apr. 4, 2000, the full disclosure of which is hereby incorporated by reference.
FIELD OF THE INVENTION
 The present invention relates generally to a database and a method and apparatus for searching such a database. More particularly, the invention relates to a method and search apparatus for searching a database comprised of both Internet and premium content information (or any other set of labeled information records).
BACKGROUND OF THE INVENTION
 1. The Prior Art
 According to a study by the NEC Research Institute, conducted at the beginning of 1999, the Internet at the time consisted of a total of 800 million publicly accessible pages containing 180 million images. The same study estimates that the total number of publicly accessible pages in 2003 will be at least 2 billion. To find their way through this enormous collection of information, users often use one of several search-services available on the Internet.
 However, search-services suffer from a host of problems that limit their usability and effectiveness in assisting people to find what they are looking for. These problems stem from the method employed by search-engines to build their document databases, and from the way in which people perform a search.
 There are two basic methods used by search-services to gather information and build their databases, each with its own problems. The first method is to classify documents automatically using a classification algorithm. Such an algorithm tries to determine the subject of a document by processing the document's content. The second method is to let humans (usually a staff of editors) determine the subject of documents and add them to a database.
 Although the first method can result in a very large database, the database is usually of marginal quality. This is due to the fact that automatic algorithms are notoriously incapable of accurately determining a document's subject.
 The second method yields a high-quality database, but the staffs of search-services are unable to keep up with the growth and size of the Internet. Even the most successful and largest venture in this category (The Open Directory Project) contains no more than a fraction of the total amount of information available on the Internet.
 Apart from the difficulties in creating a database that is both complete and of a high quality, existing search-services have dated methods of performing searches. The scientific community that researches the field of Information Retrieval has long since improved and replaced these methods. Generally speaking, Information Retrieval ("IR") concerns itself with finding specific information in a collection of data/documents. This includes, for example, systems to search through library catalogues, scientific databases or, indeed, the Internet.
 One of the most prominent developments in IR is the use of Relevance Feedback ("RF"). RF is a general term used to indicate any process (with or without interaction with the user) that uses the results of a query to construct a new, more refined, query.
 There are several ways to generate RF for an IR-system. A completely automatic system can perform a query and from the results of that query extract the most relevant words/terms, the top 10 or 20 of which can then be added to the query. An interactive model can for example require a user to select one or more documents in the query-result that are relevant to the user's information need, and use these documents to determine the most relevant terms.
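 By way of illustration only, the completely automatic variant described above can be sketched as follows. The function name expand_query, the use of document frequency as the measure of term relevance, and the whitespace tokenization are assumptions made for this sketch; the text does not prescribe how the most relevant terms are determined.

```python
from collections import Counter

def expand_query(query_terms, result_docs, top_n=10):
    """Return the query plus the top_n most frequent new terms in the results."""
    counts = Counter()
    for doc in result_docs:
        # Count each term once per document (document frequency).
        counts.update(set(doc.lower().split()))
    original = {t.lower() for t in query_terms}
    candidates = [(term, n) for term, n in counts.items() if term not in original]
    # Sort by descending frequency, then alphabetically for a deterministic order.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in candidates[:top_n]]

docs = ["italian food wine", "food wine dining", "food desserts wine"]
expanded = expand_query(["food"], docs, top_n=2)
print(expanded)
```

 An interactive model would differ only in that result_docs would hold the documents the user marked as relevant rather than the full query result.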
 Many systems that employ Relevance Feedback have been developed and tested, mainly for research purposes. Relevance Feedback was introduced in the early 1970s to optimise the performance of Information Retrieval systems. Despite the success of RF in academic and research settings, there are few public or commercial systems that offer the use of RF. Some researchers point out problems with the implementation of such large-scale systems, such as complexity and unexpected user-behaviour.
 At present, an Internet search-engine employing a system that could be classified as a true RF system is provided by Northern Light. The Northern Light system groups documents that are relevant to a query into candidate categories. The most relevant candidate categories are then presented to the user for selection. Selecting categories is an efficient form of RF, because with a single mouse-click, a user can mark an entire group of documents as relevant to his information need. In many systems, a user must select multiple separate documents, or parts of documents, to provide RF to an IR-system. The system then determines relevant terms in these documents and uses those terms to expand the query.
 2. Comparison to Present Invention
 The present invention is a variation of Relevance Feedback, which features certain extensions. While traditional RF is only concerned with the actual content of documents, the present invention utilizes "Meta-data." This is data about a document, and can describe the content of a document, but also the author, length, size, publisher, date of publication and any other piece of information about the document. This allows the expansion from text-only to any type of content. No IR system in existence today produces meaningful RF when dealing with a picture, a movie or a song. The present invention deals with meta-data, which can be applied on any type of information, text-based or not. The user produces Relevance Feedback by marking one or more meta-data elements as relevant to his information need. This extension of classic RF is inspired by the realization that the content of a document does not necessarily determine a document's relevance to a user's information need. This is especially true for Internet documents, which tend to contain less and less text, but more images and other non-text content instead.
 A limited form of relevance feedback known as "related searches" is provided by Internet search-engines like Hotbot and Altavista. In these implementations, if a user searches for "food", he is presented with a list of frequently occurring combinations with the word food, such as "Italian food". Such a system will not, however, offer query extensions that may be relevant to "food" but do not occur in combination with that word, such as "cooking", "restaurants", "cutlery" and the like. The present invention has no such limitations, and in the preceding example a search would also produce terms like "wine", "dining" and "desserts."
 Also, the level to which these systems produce meaningful results is disappointing. Again using the preceding example, "food" can be extended only twice. After that, no more "related searches" are available. The present invention dynamically generates possibly relevant query expansions, and offers up to fifteen expansions or more, depending on the maximum number of keywords allowed for a record.
SUMMARY OF THE INVENTION
 The present invention features a database and a method and apparatus for searching the database, which can include Internet and premium content records or any other set of labeled information records (like machine parts in a factory or project information in a consultancy firm). The invention provides users with access to information on the Internet or to premium content information on local networks, and the like.
 The invention is especially useful in environments with large numbers of different documents or entries. The invention uses sophisticated relevance rating algorithms and methods to provide meaningful relevance feedback information about the current query in the form of a set of relevant meta-data elements (usually keywords). This relevance feedback information is presented to the user as a small list that includes only the N most relevant meta-data elements. N stands for the number of elements shown and has a value between 0 and, for instance, 50. The invention also generates a relevance-ranked list of records that match the query.
 The invention consists of both a database and a mechanism/method to select and sort information from this database. The database is based on data structures that are specifically designed and constructed to meet the specifications and conditions set by the mechanism/method that selects and sorts the information from the database.
 In addition to the records, the database includes meta-data attributes. It contains meta-data about every individual record, about the individual elements a record consists of and about individual sets of records.
 In response to a query/user request, the apparatus selects and sorts a set of records and a set of items that provide the user with feedback on what is relevant to his query/user request. These items can consist of meta-information such as author, keywords, subject, type, source, language characteristics, etc. The apparatus can also easily use other types of meta-information, such as the length of a song, the resolution of an image, the price of an item, the expiration date of an item or document, etc. Usually the user is provided with the keywords most relevant to his/her query/user request. The set of records is ranked according to relevance to the user's query/user input.
 The mechanism/method uses the weights of the meta-data attributes associated with the records to determine how relevant records and meta-data elements are to a query/user request.
 The internal hierarchy/order in the sets generated by the apparatus represents a hierarchy/order of relevance of this information to the query/user request.
BRIEF DESCRIPTION OF THE DRAWINGS
 FIGS. 1 through 4 of the drawings depict a version of the current embodiment of the present invention for the purpose of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
 FIG. 1 is a block diagram illustrating the functional elements of a search apparatus and database incorporating the principles of the invention.
 FIG. 2 is a flow chart illustrating the sequence of steps used by the apparatus in performing the described behavior.
 FIG. 3 is a flow chart illustrating the flow chart of FIG. 2 in greater detail.
 FIG. 4 shows the user display of the present invention.
DETAILED DESCRIPTION OF THE
 In one aspect, the invention features a method for searching a database of records. The database can include Internet and premium content records (or any other set of labeled information records). In response to a search instruction from a user, the database is searched and a set of relevant meta-data elements (keywords for instance) is dynamically generated to provide the user with feedback on his query. These meta-data elements are presented as a relevance-ranked list (usually 20-50 long). The elements of this relevance feedback-list can be added to a new query, for instance by using hyperlinks. By default, terms can be added using the AND operator to achieve "intersection." The mechanism can also perform queries containing NOT or OR operators to achieve "difference" or "union." The interface can feature easy icons/buttons to add an element to the query with the AND, NOT or OR operator.
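 By way of illustration only, the correspondence between the AND, NOT and OR operators and the set operations "intersection", "difference" and "union" can be sketched as follows. The function apply_operator and the toy index of record identifiers are hypothetical and serve only to illustrate the narrowing-down of a query.

```python
def apply_operator(current, matches, operator="AND"):
    """Combine the current result set with the records matching a new term."""
    if operator == "AND":
        return current & matches   # intersection
    if operator == "OR":
        return current | matches   # union
    if operator == "NOT":
        return current - matches   # difference
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical index: keyword -> set of record identifiers.
index = {"food": {1, 2, 3, 4}, "wine": {2, 3, 5}, "fast": {3, 4}}

result = apply_operator(index["food"], index["wine"], "AND")  # food AND wine
result = apply_operator(result, index["fast"], "NOT")         # ... NOT fast
print(result)
```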
 The method also generates a search result-list of relevant records from the database. The elements of this list can be hyperlinks that function as an input medium for the apparatus. The mechanism responds to selection of a record by adjusting database values and importance factors which are related to that record. This means that a record that is selected often can eventually rank higher in a result-list. Another response of the mechanism is the redirection to or fetching of the requested site/document or information. The length of the result list can be for example 200 (if available), but can be adjusted easily to other lengths. The interface can present the user a part of this result-list and provide links that lead to the presentation of other parts of the result-list. This can be done for example with an interval length of 10 results.
 To present an accurate feedback list of meta-data elements, the collection of records used to generate the feedback list needs to be a valid sample of the total 'population' of records. A sample is 'valid' if the distribution of its records matches the distribution of the entire population of records. This means that if 10% of the records in the entire population contain the keyword "science", a sample is valid if 10% of its records contain the keyword "science". The mechanism determines how many records need to be processed in order to obtain a valid list of relevant meta-data elements and thus a valid representation of the subject context that is related to all records that match the user's query. During calculation of the feedback list, the rate of change in the ranking of all list-elements is continuously monitored. If this rate of change falls below a certain threshold value, the feedback list is considered to be of sufficient quality. Every processed record that matched the query contributes to a pre-result-list that is used in the next step of the process to generate the search result-list.
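 By way of illustration only, the monitoring of the rate of change can be sketched as follows. The batch size, the threshold value and the particular measure of change (the fraction of list positions whose element differs from the previous ranking) are assumptions made for this sketch; the text does not specify them. Keyword counts stand in for the weights used in the actual ranking.

```python
from collections import Counter

def ranking(counts, k):
    """Top-k keywords by count; ties broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:k]]

def change_rate(old, new):
    """Fraction of list positions whose element differs between two rankings."""
    length = max(len(old), len(new))
    if length == 0:
        return 0.0
    differing = sum(
        1 for i in range(length)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    )
    return differing / length

def build_feedback_list(records, k=3, batch_size=2, threshold=0.1):
    """Process matching records in batches until the ranking stabilizes."""
    counts = Counter()
    previous = []
    processed = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        for keywords in batch:
            counts.update(keywords)
        processed += len(batch)
        current = ranking(counts, k)
        if previous and change_rate(previous, current) < threshold:
            return current, processed   # ranking is considered stable
        previous = current
    return previous, processed          # ran out of matching records

records = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "b"], ["a", "c"], ["a", "b"]]
feedback, processed = build_feedback_list(records, k=2)
print(feedback, processed)
```

 In this sketch the records processed before the stopping point would also form the pre-result-list used in the next step.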
 The feedback-list consists of the most relevant meta-data attributes within the collection of matching documents. To be able to calculate the feedback list, every record within the database has one or more associated meta-data attributes. These attributes are predicates that consist either of a single term or of multiple terms in a Boolean expression. These terms can be parameters like author, keywords, subject, type, source, language characteristics, etc. An example of a predicate consisting of multiple terms is 'Kids'. This predicate can be constructed using the terms 'Toys', 'Hobbies', 'School' and 'Adult' from the keyword type and could lead to the predicate: ((Toys AND Hobbies AND School) NOT Adult). This means that selecting a single meta-data attribute (like "Kids") from the feedback list can result in a complex query with several constraints on matching documents. This can include constraints on keywords, but also on attributes like date, size, type, etc.
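 By way of illustration only, the 'Kids' predicate can be represented and evaluated as a nested Boolean expression. The tuple representation and the function matches are assumptions made for this sketch; NOT is treated as a binary operator ("left but not right"), following the way the predicate is written in the text.

```python
def matches(expr, keywords):
    """Evaluate a nested Boolean predicate against a record's keyword set."""
    if isinstance(expr, str):
        return expr in keywords
    op, *operands = expr
    if op == "AND":
        return all(matches(e, keywords) for e in operands)
    if op == "OR":
        return any(matches(e, keywords) for e in operands)
    if op == "NOT":
        # Binary NOT: the first operand must hold, the second must not.
        left, right = operands
        return matches(left, keywords) and not matches(right, keywords)
    raise ValueError(f"unknown operator: {op}")

# ((Toys AND Hobbies AND School) NOT Adult)
KIDS = ("NOT", ("AND", "Toys", "Hobbies", "School"), "Adult")

print(matches(KIDS, {"Toys", "Hobbies", "School"}))
print(matches(KIDS, {"Toys", "Hobbies", "School", "Adult"}))
```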
 Every record has multiple scores, which are used to rank the records in a result list. One of these scores represents a record's popularity among other records. This is called link popularity and is measured by how often a record is referred to by other records in any relevant context. For example, if many documents dealing with basketball refer to the same document when talking about the rules of the game, this document will have a high popularity score in that context. Another score, called selection popularity, represents how often prior users have selected the record from a result-list in the past. For instance, if many people select the same document after viewing the results for a query on "basketball", then this document will have a high selection popularity score.
 The records in the result-list are ranked according to their final score. The invention features several techniques to influence this final score. The mechanism can apply an arbitrary combination of these techniques to obtain a final score.
 One technique to influence a record's final score is to use the ratio between:
 1. The summed weight of the various matching predicates (meta-data attributes) of a record, and
 2. The summed weight of the predicates the query consists of.
 This is a measure of how well the subject-context of a record matches the subject-context supplied by the user (query).
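 By way of illustration only, this ratio can be computed as follows. The sketch assumes that each record stores a weight per predicate and that the query assigns a weight to each of its predicates; the names match_ratio, record_weights and query_weights are hypothetical.

```python
def match_ratio(record_weights, query_weights):
    """Summed weight of the record's matching predicates divided by the
    summed weight of the predicates the query consists of."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w for pred, w in record_weights.items() if pred in query_weights)
    return matched / total

record = {"basketball": 0.5, "rules": 0.25, "nba": 0.9}
query = {"basketball": 1.0, "rules": 0.5}
ratio = match_ratio(record, query)
print(ratio)
```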
 Another technique to influence a record's final score is to use its 'context-score'. This score is a measure of how well the subject-context of a record matches the relevance feedback list, which represents the 'average' subject-context of all matching records. This means that the records that best match the elements of the relevance feedback list will also rank highest in the search result-list. Another technique to influence a record's final score is to use its popularity scores described above. There are several other factors that can be used to influence a record's final score, such as the size of a document or the amount paid by the author to be ranked higher.
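 By way of illustration only, a context-score and one possible combination of the techniques into a final score can be sketched as follows. The text leaves the combination open ("an arbitrary combination"), so the linear weighting and its coefficients are purely assumptions, as are the function names and the normalization of all inputs to the range 0 to 1.

```python
def context_score(record_predicates, feedback_list):
    """Share of the feedback list's total score covered by the record.

    feedback_list holds (predicate, score) pairs, most relevant first.
    """
    total = sum(score for _, score in feedback_list)
    if total == 0:
        return 0.0
    overlap = sum(score for pred, score in feedback_list if pred in record_predicates)
    return overlap / total

def final_score(match_ratio, context, link_pop, selection_pop,
                weights=(0.4, 0.3, 0.2, 0.1)):
    """Hypothetical linear combination of the individual scores."""
    parts = (match_ratio, context, link_pop, selection_pop)
    return sum(w * p for w, p in zip(weights, parts))

feedback = [("wine", 3.0), ("dining", 1.0)]
ctx = context_score({"wine", "food"}, feedback)
print(ctx, final_score(0.5, ctx, 0.2, 0.1))
```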
 The invention features a thesaurus-like collection of items. Each item in this collection represents a predicate and consists of a number of data-elements that are used by the invention. All predicates in the database have one or more scores that can be used to influence the ranking of a predicate in the relevance feedback list.
 One of these scores is a global weight. Every record consists of multiple predicates that contribute to the ranking process of both the relevance feedback-list and the search result-list. The global weight of a predicate can be used to influence the contribution that similar instances of a predicate have on these ranking processes. Another score can be used to influence how much weight a list of related predicates has on this predicate. Yet another score represents how often users have selected the predicate from a relevance feedback-list.
 The predicates in the relevance feedback list are ranked according to a final score. The mechanism can use the different scores a predicate consists of and apply several different calculations to obtain the final score. For example, the invention limits the influence of occurrence when generating the relevance-feedback list. Some words occur very often while having a relatively low weight. This results in an "undeserved" high ranking in the feedback-list. To prevent this from happening, the weight of words that occur exceptionally often is re-calculated to reflect this. This is a process called "branching."
 Another data-element a thesaurus item consists of is a pre-compiled list of records that are associated with that item. The methodology first identifies the pre-compiled list of records that is associated with one of the query predicates and that is most suitable for generating the first result-list. By using this list and the complementary part of the query predicates (the rest), it can dynamically compile the first result-list, which matches the whole query and is a sub-set of the pre-compiled list.
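 By way of illustration only, the compilation of the first result-list can be sketched as follows for a query whose predicates are all combined with AND. The text does not state how the most suitable pre-compiled list is chosen; this sketch assumes it is the shortest one. All names are hypothetical.

```python
def first_result_list(query_predicates, precompiled, record_predicates):
    """Filter the chosen pre-compiled list by the remaining query predicates."""
    # Assumption: the "most suitable" list is the shortest one.
    seed = min(query_predicates, key=lambda p: len(precompiled.get(p, [])))
    rest = set(query_predicates) - {seed}
    # Keep only the records that also carry every remaining predicate.
    return [rid for rid in precompiled.get(seed, [])
            if rest <= record_predicates[rid]]

# Hypothetical data: predicate -> pre-compiled record list, record -> predicates.
precompiled = {"food": [1, 2, 3, 4], "wine": [2, 3, 5]}
record_predicates = {
    1: {"food"}, 2: {"food", "wine"}, 3: {"food", "wine"},
    4: {"food"}, 5: {"wine"},
}
first = first_result_list(["food", "wine"], precompiled, record_predicates)
print(first)
```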
 The fact that the mechanism determines how many records need to be processed from a pre-compiled list (as long as there are still matching records available in the pre-compiled list) guarantees the availability of a valid relevance feedback-list and a complete search result-list. Both lists will be incomplete when the process only works with a subset of the last search result-list. This occurs in other systems that, contrary to the invention, use a fixed-length starting search result-list to obtain sub-sets of this list in the next cycles of a narrowing down process. The invention first processes enough records to obtain a valid relevance feedback list and then takes the records that matched the query during that step of the process to generate a search result list. This requires the multi-pass processing of this list in case the search result-list is also ranked according to the matching ratio between the relevance feedback list and the records of the search result-list.
 At certain points during execution of the method, a process called stemming is applied to the words of the search-request. Many words in a language have many forms in which they occur. Examples are single and plural forms, but also conjugations. Because in the vast majority of cases these different forms have the same semantic meaning, a mechanism is needed that recognizes these different forms and translates them to a common form. This is called the stem, although it is not necessarily the linguistic stem of a word.
 The stem is rarely a linguistically correct word, so the method features a set of rules which are followed to determine what word the display processor should show when a stem (bucket) needs to be displayed:
 1. When a user manually enters a word to add to a search-request, the displayed word should be that same word, regardless of the preferred form stored with that word's stem.
 2. If the feedback generator selects a word to be displayed in the relevance feedback list, a predetermined form is used. This form can be, for example, the form used most often in a reference population of documents.
 By way of example only, a certain stemming algorithm reduces both the words "computer" and its plural "computers" to "comput". Because in a population of documents the word "computer" occurs more often than the plural form "computers", the former is stored as the preferred form for the display-processor for the stem "comput". However, when a user enters the word "computers" manually, the display processor should use that form instead. It should be noted that stemming is a language specific operation. An algorithm that performs well for English will in all likelihood fail for any other language. There are many different stemming algorithms for different languages.
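 By way of illustration only, the two display rules can be sketched as follows. The function toy_stem is a deliberately crude suffix stripper written for this sketch and is not the stemming algorithm referred to above; it merely reproduces the "computer"/"computers" example. The other names are likewise hypothetical.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Toy suffix stripper for illustration only; not a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preferred_forms(reference_words):
    """Map each stem to the form used most often in the reference population."""
    forms = defaultdict(Counter)
    for word in reference_words:
        forms[toy_stem(word)][word.lower()] += 1
    return {stem: counter.most_common(1)[0][0] for stem, counter in forms.items()}

def display_word(stem, preferred, typed_by_user=None):
    # Rule 1: a word entered manually by the user is shown as entered.
    # Rule 2: otherwise the stored preferred form for the stem is shown.
    return typed_by_user if typed_by_user is not None else preferred[stem]

reference = ["computer", "computer", "computers"]
preferred = preferred_forms(reference)
print(display_word("comput", preferred))
print(display_word("comput", preferred, typed_by_user="computers"))
```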
 Every predicate in the database can also have an attributed list of other related predicates. These lists can be used to influence the final configuration of the relevance feedback list that is presented to the user. Another possibility is the use of 'sidesteps'. A sidestep is related to a certain predicate and provides the user with another predicate (or link) that is (closely) related to the predicate that is referring to the sidestep. This element is a software module and has been so identified merely to illustrate the functionality of the invention. By way of example only, it can be used to influence or tune a user's query. In an HTML-like interface environment, for example, a sidestep can be displayed on the user's screen when the user moves the mouse cursor over an item of the feedback list. This way a user is provided with feedback on what is relevant to the item the mouse cursor was on.
 In another aspect, the invention features an information retrieval system or search apparatus for searching a database of records, as well as the database itself. The database comprises a plurality of records, including Internet records and premium content records (or any other set of labeled information records).
 The apparatus includes a database and an information retrieval system. The database includes different elements that all store a different part of a record when it is stored. These elements can be split into different intervals that can be distributed over different computers or storage media. The elements the database consists of are a meta-data allocation table, a record storage base, and a meta-data storage base. The information retrieval system includes an instruction parser, a token processor, a command processor, a stemming processor, a context processor, a record processor, a feedback generator, a result-list generator, a display processor, and a database manager. In the preferred embodiment, each of these elements is a software module. Alternatively, each element could possibly be a hardware module or a combined hardware/software module.
 The information retrieval system receives search instructions from a user. Responsive to a search instruction, the information retrieval system searches the database to generate a search result list that includes a selected set of the records from the database. The information retrieval system also produces a list with relevant meta-data attributes (e.g. keywords) to provide the user with feedback on what is relevant to the records that matched the user's query.
 Turning to the drawings, FIG. 1 is a block diagram illustrating the functional elements of a search apparatus and database incorporating the principles of the invention. System 42 includes a database 4 and an information retrieval sub-system 37. The information retrieval sub-system 37 comprises an instruction parser 6, a token processor 7, a command processor 8, a stemming processor 9, a context processor 10, a record processor 11, a feedback generator 12, a result-list generator 13, a display processor 14, and a database manager 15. The database 4 consists of a meta-data allocation table 1, a record storage base 2, and a meta-data storage base 3. A user 5 of the system/apparatus is coupled to database 4 and the information retrieval system by system/apparatus I/O bus 16. These elements are software modules and have been so identified merely to illustrate the functionality of the invention.
 System 42 performs a plurality of processes to dynamically create the search result list and the feedback list. These processes are generally described below with respect to FIG. 1. Instruction parser 6, token processor 7 and command processor 8 are used to transform the user request into one or more commands that can be used by the apparatus during the next steps of the search cycle. The instruction parser 6 takes the user request (the query) and parses it in order to obtain the different elements (tokens) of which it is constructed. The token processor 7 then identifies the different variables and instructions the user request comprises by selecting and sorting the tokens obtained from the instruction parser 6. The command processor 8 then determines if the generated command is a valid command. According to the type of command (i.e. search command or fetch command), the process continues. The database manager 15 takes care of updates of weights of predicates and records if necessary.
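 By way of illustration only, the cooperation of the instruction parser, the token processor and the command processor can be sketched as follows. The request syntax (a command word followed by terms and optional AND, OR or NOT operators), the function names and the validity rule are assumptions made for this sketch and are not taken from the drawings.

```python
OPERATORS = {"AND", "OR", "NOT"}
COMMANDS = {"search", "fetch"}

def parse_instruction(request):
    """Instruction parser: split the user request into tokens."""
    return request.split()

def process_tokens(tokens):
    """Token processor: identify the command and the (operator, term) pairs."""
    if not tokens:
        return {"command": None, "terms": []}
    command, rest = tokens[0].lower(), tokens[1:]
    terms, operator = [], "AND"   # AND is the default operator
    for token in rest:
        if token.upper() in OPERATORS:
            operator = token.upper()
        else:
            terms.append((operator, token.lower()))
            operator = "AND"
    return {"command": command, "terms": terms}

def is_valid(command):
    """Command processor: accept only known commands with at least one term."""
    return command["command"] in COMMANDS and len(command["terms"]) > 0

command = process_tokens(parse_instruction("search food NOT fast"))
print(command, is_valid(command))
```

 In this sketch the target of a fetch request is simply carried as a single term; a fuller implementation would presumably distinguish the two command types at this point.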
 If the user request is a fetch request, System 42 fetches the requested information and calls a display pro