DECISION TREE DATA STRUCTURE FOR
USE IN CASE-BASED REASONING
FIELD OF THE INVENTION
The invention is generally related to computers and computer software. In particular, the invention is related to case-based reasoning and decision tree data structures for use therewith.
BACKGROUND OF THE INVENTION
Case-based reasoning is but one of a number of types of computer analysis approaches for drawing conclusions from input data. Case-based reasoning typically uses a decision tree to "prune" a library of past cases, also referred to herein as a "search space." A decision tree is created by inductive reasoning, which draws generalizations from past data and applies those generalizations to new data to draw specific conclusions about the new data. Inductive reasoning is the complement of deductive reasoning, where responses to input data are developed from known general principles.
Case-based reasoning typically relies upon nearest-neighbor matching to attempt to predict a result for an unknown case based upon the results of past cases stored in a search space or library. As an example, case-based reasoning may be used by a bank to predict the likelihood that a particular customer would default on a loan, and thus whether a loan should be approved. Cases within a search space might include information such as the anticipated monthly payment, the length of time that a customer was employed at a certain job, the customer's monthly income, etc. Each case in the library would also include an indication of whether that customer eventually defaulted on his or her loan. Then, whenever a new customer was presented to the bank, information about that customer could be presented as an unknown case, with nearest-neighbor matching used to locate those cases in the library that most closely resembled the data associated with the new case. Then, based upon whether those nearest-neighbor cases resulted in defaults, a determination could be made as to whether a loan should be approved for the new customer.
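By way of illustration only, the loan example above might be sketched as follows. The attribute layout, the sample values, the range-scaled Euclidean distance and the majority vote over k neighbors are all assumptions made for this sketch; the text above does not prescribe a particular distance measure or voting rule.

```python
import math

# Hypothetical case library; each case is
# (monthly_payment, years_at_job, monthly_income, defaulted).
CASES = [
    (500.0, 10.0, 5000.0, False),
    (900.0, 1.0, 2500.0, True),
    (450.0, 8.0, 4200.0, False),
    (1200.0, 0.5, 3000.0, True),
    (700.0, 3.0, 3500.0, False),
]


def attribute_ranges(cases):
    """Range of each attribute, used to put attributes on a comparable scale."""
    n_attrs = len(cases[0]) - 1  # last field is the outcome, not an attribute
    ranges = []
    for i in range(n_attrs):
        values = [case[i] for case in cases]
        ranges.append((max(values) - min(values)) or 1.0)
    return ranges


def nearest_neighbors(unknown, cases, k=3):
    """Return the k cases closest to the unknown case (assumed distance measure)."""
    ranges = attribute_ranges(cases)

    def distance(case):
        return math.sqrt(
            sum(((u - a) / r) ** 2 for u, a, r in zip(unknown, case[:-1], ranges))
        )

    return sorted(cases, key=distance)[:k]


def predict_default(unknown, cases, k=3):
    """Predict a default if most of the k nearest cases defaulted."""
    neighbors = nearest_neighbors(unknown, cases, k)
    defaults = sum(1 for case in neighbors if case[-1])
    return defaults * 2 > len(neighbors)


risky = predict_default((1100.0, 1.0, 2800.0), CASES)
safe = predict_default((480.0, 9.0, 4800.0), CASES)
```

Note that every case in the library is compared against the unknown case, which is the cost that the decision tree discussed below is intended to reduce.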
One difficulty associated with nearest-neighbor matching in case-based reasoning is the fact that nearest-neighbor matching can be extremely computationally intensive, particularly when a large number of cases exist in a library and a large number of characteristics, or attributes, need to be analyzed for each case. For this reason, often a logical construct known as a decision tree is utilized to narrow the search space with which nearest-neighbor matching is performed during case-based analysis of an unknown case. A decision tree is typically stored in a decision tree data structure, and is essentially used to prune a search space into a smaller subset of cases most likely to be relevant to an unknown case.
A conventional decision tree typically includes a collection of decision nodes arranged into a tree data structure, thus defining a plurality of paths that each identify different subsets of the cases from a search space. At each decision node, a test question is provided that queries a particular attribute of an unknown case and selects one of a plurality of test answers based upon the result of the query. Associated with each test answer is either a reference to another "child" decision node, from which another relevant query is performed, or a "leaf" node, which identifies a subset of cases from the search space, and which represents the end, or termination point, for a particular path in the decision tree. As such, a unique path is defined in the decision tree for each unique combination of test answers to the test questions presented in the decision tree, such that a relevant subset of cases may be identified for each combination of test answers.
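A conventional decision tree of the type just described might be sketched as follows. This is a minimal illustration only; the class names, the attribute names, the thresholds and the case identifiers stored in the leaf nodes are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Leaf:
    """Termination point of a path; stores identifiers of specific cases."""
    case_ids: List[int]


@dataclass
class DecisionNode:
    """Test question on one attribute; each test answer leads to a child or a leaf."""
    attribute: str
    test: Callable[[float], str]
    children: Dict[str, Union["DecisionNode", Leaf]] = field(default_factory=dict)


def prune(node, unknown):
    """Follow the unique path for the unknown case and return its stored case ids."""
    while isinstance(node, DecisionNode):
        answer = node.test(unknown[node.attribute])
        node = node.children[answer]
    return node.case_ids


# Hypothetical two-level tree: income first, then time at current job.
TREE = DecisionNode(
    "monthly_income",
    lambda v: "low" if v < 3000 else "high",
    {
        "low": DecisionNode(
            "years_at_job",
            lambda v: "short" if v < 2 else "long",
            {"short": Leaf([2, 4]), "long": Leaf([5])},
        ),
        "high": Leaf([1, 3]),
    },
)

subset = prune(TREE, {"monthly_income": 2500.0, "years_at_job": 1.0})
```

Because the leaf nodes hold fixed identifiers, a case added to the library after the tree was built does not appear in any subset until the tree is regenerated.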
By "pruning" the search space in this manner with a decision tree, the most likely subset of cases in the search space is quickly identified, so nearest-neighbor matching can then be performed on a smaller number of cases. As a result, case-based analysis may be performed significantly more quickly and with generally comparable results to those generated without the use of a decision tree.
The accuracy of a case-based reasoning system that incorporates a decision tree, however, can be significantly impacted by the manner in which a decision tree partitions a search space. As a result, a significant amount of effort has been directed to the automated generation of decision trees and the arrangement of decision nodes and test queries therein to maximize the accuracy of a decision tree.
One problem associated with the use of decision trees, in particular, stems from the relatively dynamic nature of case-based reasoning analysis. In particular, a case-based reasoning system is only as good as the data provided to the system, and it is therefore desirable to update a case library relatively frequently to build a comprehensive and current library with which nearest-neighbor matching may be performed. However, given that conventional decision trees store specific identifiers to the cases that match each path in the decision tree, anytime a new case is added to a case library or search space, the decision tree used to access that library will typically need to be regenerated. Generating a decision tree is computationally expensive, however, and as such, whenever a case library is updated, case matching cannot proceed until the decision tree is modified in view of the cases in the updated case library. As a consequence, for frequently updated libraries, system availability may be adversely impacted by the need to frequently regenerate the decision trees associated with such libraries.
Therefore, a significant need exists in the art for a manner of increasing the availability of a case-based analysis system, and in particular, for a manner of reducing the need to update decision trees utilized in such systems.
SUMMARY OF THE INVENTION
The invention addresses these and other problems associated with the prior art by providing an apparatus, computer-readable medium and method for use in association with case-based reasoning and the like that utilize a novel decision tree data structure. The data structure incorporates a search criterion in association with each test answer to a test criterion defined within a decision node, for use in selecting cases from a search space that match the associated test answer to the test criterion. Rather than storing identifiers to the actual cases in a case library, or search space, within a decision tree data structure, search criteria are used to provide the mechanism by which those cases that represent the nearest-neighbors for each path of the decision tree data structure can be dynamically selected.
Among other benefits, associating search criteria with test answers within a decision tree data structure takes advantage of the fact that the partitioning of a search space on a relatively coarse level, as is done with a decision tree data structure, typically does not require complete synchronization and currency with respect to a search space. As such, the utilization of search criteria in lieu of actual case identifiers eliminates the need to regenerate a decision tree after each modification (e.g., the addition of a new case) to the search space. While it still may be desirable in some embodiments to regenerate a decision tree data structure from time to time, the need to do so is significantly reduced, thereby increasing the availability of a case-based reasoning system for analyzing unknown cases.
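The difference between stored case identifiers and a search criterion may be illustrated with a small sketch. The table layout, the column names and the query text below are assumptions made for illustration, with an in-memory SQLite database standing in for the case library.

```python
import sqlite3

# Hypothetical case library held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER PRIMARY KEY, monthly_income REAL, defaulted INTEGER)"
)
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [(1, 5000.0, 0), (2, 2500.0, 1), (3, 2800.0, 1)],
)

# Conventional approach: identifiers frozen when the tree was generated.
STORED_IDS = {2, 3}

# Search criterion associated with a hypothetical "low income" test answer.
LOW_INCOME_CRITERION = "SELECT id FROM cases WHERE monthly_income < 3000"


def select_cases(connection, criterion):
    """Apply a search criterion to the search space and return matching case ids."""
    return {row[0] for row in connection.execute(criterion)}


before = select_cases(conn, LOW_INCOME_CRITERION)

# A new case is added to the library; the tree itself is left untouched.
conn.execute("INSERT INTO cases VALUES (4, 2200.0, 0)")

after = select_cases(conn, LOW_INCOME_CRITERION)
```

In this sketch the criterion picks up the new case on its next use, whereas the stored identifier set would remain stale until the tree was regenerated.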
Consistent with one aspect of the invention, a method is provided for applying case-based reasoning on an unknown case. The method includes traversing a path among a plurality of paths defined in a decision tree data structure to identify a subset of cases from a search space suitable for performing nearest-neighbor matching on the unknown case. Each path includes a plurality of decision nodes, and each decision node includes a test criterion defining a plurality of test answers. Each test answer has associated therewith a search criterion that selects cases in the search space that match the associated test answer. In addition, traversing the path includes, at each decision node in the path, selecting a test answer among the plurality of test answers defined by the test criterion for such decision node based upon an attribute associated with the unknown case, and applying the search criterion associated with the selected test answer to the search space to select cases in the search space that match the selected test answer. The method also includes performing nearest-neighbor matching on the identified subset of cases.
Consistent with another aspect of the invention, a method is provided for accessing a search space that includes a plurality of cases. The method includes analyzing a test criterion resident in a decision tree data structure to select a test answer from a plurality of test answers associated with the test criterion, retrieving a search criterion associated with the selected test answer, and applying the retrieved search criterion to the search space to select cases from the search space that match the selected test answer.
Consistent with an additional aspect of the invention, a method is provided for generating a decision tree data structure for use in accessing a plurality of cases in a search space. The method includes generating a plurality of decision nodes, each decision node including a test criterion that defines a plurality of test answers, and associating a search criterion with each test answer defined by each test criterion, wherein each search criterion is configured to select cases from the search space that match the associated test answer.
Consistent with a further aspect of the invention, a computer-readable medium is provided including a decision tree data structure for use in accessing a plurality of cases in a search space. The decision tree data structure includes a test criterion configured to test an attribute associated with the cases, the test criterion defining a plurality of test answers, and a plurality of search criteria, each associated with a test answer from the plurality of test answers, and each configured to select cases from the search space that match the associated test answer.
Consistent with an additional aspect of the invention, an apparatus is provided, including a memory and a decision tree data structure resident therein for use in accessing a plurality of cases in a search space. The decision tree data structure includes a test criterion configured to test an attribute associated with at least a portion of the plurality of cases, the test criterion defining a plurality of test answers, and a plurality of search criteria, each associated with a test answer from the plurality of test answers, and each configured to select cases from the search space that match the associated test answer.
Consistent with yet another aspect of the invention, an apparatus is provided, including a memory and a decision tree data structure resident therein for use in identifying a subset of cases from a search space suitable for performing nearest-neighbor matching on an unknown case. The decision tree data structure includes a plurality of decision nodes defining a plurality of paths in the decision tree data structure, each decision node including a test criterion defining a plurality of test answers, and each test answer having associated therewith a search criterion that selects cases in the search space that match the associated test answer.
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an apparatus implementing a case-based reasoning system consistent with the invention.
FIG. 2 is a block diagram of an exemplary decision tree data structure organization consistent with the invention.
FIG. 3 illustrates the program flow of an exemplary generate decision tree routine executed by the decision tree generator of FIG. 1.
FIG. 4 illustrates the program flow of an exemplary case-based reasoning routine executed by the case-based reasoning engine of FIG. 1.
FIG. 5 is a block diagram of an exemplary decision tree data structure generated by the generate decision tree routine of FIG. 3.
The embodiments discussed hereinafter generally operate by embedding within a decision tree data structure search criteria that permit database queries to be utilized in the dynamic generation of a subset of cases from a search space with which to perform case-based reasoning.
As is well known in the art, a decision tree typically is represented using a plurality of decision nodes, each incorporating a test criterion, and organized into a plurality of paths, or "branches", that are selectively traversed for an unknown case based upon the application of the attributes of the unknown case to the test criteria defined within the tree. In a conventional decision tree, the leaf nodes, representing the termination points of each possible path through the decision tree, include identifiers (e.g., pointers or record IDs) of the actual cases that best meet the test criteria for a particular unknown case. It is then with these identified cases that nearest-neighbor matching is performed to attempt to predict an outcome for the unknown case based upon the outcomes of the cases in the subset of cases identified by the decision tree.
Consistent with the invention, rather than storing case identifiers within leaf nodes, each answer within a decision tree path is associated with a particular search criterion, e.g., a structured query language (SQL) or other form of database query that will retrieve the case identifiers that satisfy each test and answer combination. Thus, at each decision node, a set of case identifiers that meet the test criterion for that node are dynamically generated. Then, using set intersection, the cases that meet all of the criteria in a path may be dynamically selected.
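The traversal just described, in which a search criterion is applied at each decision node and the resulting sets are intersected, might be sketched as follows. The class name, the table layout, the thresholds and the query strings are hypothetical, and an in-memory SQLite database is assumed as the case library purely for illustration.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class DecisionNode:
    attribute: str
    test: Callable[[float], str]  # maps an attribute value to a test answer
    # test answer -> (search criterion, child node, or None at the end of a path)
    answers: Dict[str, Tuple[str, Optional["DecisionNode"]]] = field(default_factory=dict)


def make_library():
    """Build a small hypothetical case library."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cases (id INTEGER PRIMARY KEY, monthly_income REAL, years_at_job REAL)"
    )
    conn.executemany(
        "INSERT INTO cases VALUES (?, ?, ?)",
        [(1, 5000.0, 10.0), (2, 2500.0, 1.0), (3, 4200.0, 8.0),
         (4, 2800.0, 0.5), (5, 2600.0, 3.0)],
    )
    return conn


def select_subset(root, unknown, conn):
    """Traverse one path, intersecting the case ids selected at each node."""
    result = None
    node = root
    while node is not None:
        answer = node.test(unknown[node.attribute])
        criterion, child = node.answers[answer]
        ids = {row[0] for row in conn.execute(criterion)}
        result = ids if result is None else result & ids
        node = child
    return result or set()


JOB_NODE = DecisionNode(
    "years_at_job",
    lambda v: "short" if v < 2 else "long",
    {
        "short": ("SELECT id FROM cases WHERE years_at_job < 2", None),
        "long": ("SELECT id FROM cases WHERE years_at_job >= 2", None),
    },
)
ROOT = DecisionNode(
    "monthly_income",
    lambda v: "low" if v < 3000 else "high",
    {
        "low": ("SELECT id FROM cases WHERE monthly_income < 3000", JOB_NODE),
        "high": ("SELECT id FROM cases WHERE monthly_income >= 3000", None),
    },
)

library = make_library()
subset = select_subset(ROOT, {"monthly_income": 2700.0, "years_at_job": 1.0}, library)
```

Here the "low" answer selects cases 2, 4 and 5, the "short" answer selects cases 2 and 4, and the intersection leaves cases 2 and 4 as the subset on which nearest-neighbor matching would then be performed.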
As an added benefit, in some embodiments, dynamically generating a subset permits only a portion of a path in a decision tree to be used, e.g., until a candidate case set is small enough to perform efficient nearest-neighbor matching. Put another way, a result set of matching cases may be dynamically "pared down" from the entire search space at each decision node in a path, until a moderate number of cases remain in the result set, whereby effectively variable-length decision tree paths are defined. As an additional benefit, in many instances, a decision tree need not be modified each time a new case is added to the case library. Such an advantage can be realized based upon the fact that generalizations often do not need to be completely in synchronization with the most current data in a case library to be useful. Thus, in contrast to conventional decision tree data structures, reduced maintenance, and thus increased availability of a case library, is typically provided.
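The "paring down" of a result set along only a portion of a path might be sketched as follows. For brevity the search criteria here are assumed to be plain Python predicates rather than database queries, and the function name, the sample cases and the size threshold are hypothetical.

```python
# Hypothetical search space; each case is a dictionary of attributes.
SEARCH_SPACE = [
    {"id": 1, "income": 5000.0, "years": 10.0, "payment": 500.0},
    {"id": 2, "income": 2500.0, "years": 1.0, "payment": 900.0},
    {"id": 3, "income": 4200.0, "years": 8.0, "payment": 450.0},
    {"id": 4, "income": 2800.0, "years": 0.5, "payment": 1200.0},
    {"id": 5, "income": 2600.0, "years": 3.0, "payment": 700.0},
    {"id": 6, "income": 2900.0, "years": 1.5, "payment": 600.0},
]

# Search criteria for the test answers selected along one path, in path order.
PATH_CRITERIA = [
    lambda case: case["income"] < 3000,
    lambda case: case["years"] < 2,
    lambda case: case["payment"] > 800,
]


def pare_down(path_criteria, search_space, max_cases):
    """Apply criteria in path order, stopping once the result set is small enough.

    Returns the remaining cases and the number of decision nodes actually used.
    """
    result = list(search_space)
    depth = 0
    for criterion in path_criteria:
        if len(result) <= max_cases:
            break  # candidate set is already small enough for matching
        result = [case for case in result if criterion(case)]
        depth += 1
    return result, depth


cases, nodes_used = pare_down(PATH_CRITERIA, SEARCH_SPACE, max_cases=3)
remaining_ids = [case["id"] for case in cases]
```

With a threshold of three cases, traversal in this sketch stops after the second node and the third criterion is never applied, so the effective path length depends on the current contents of the search space.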
Turning now to the Drawings, wherein like numbers denote like parts throughout the several views, FIG. 1 illustrates an apparatus 10 implementing case-based reasoning consistent with the invention. For the purposes of the invention, apparatus 10 may represent practically any type of computer, computer system or other programmable electronic device, including a client or other single-user computer (e.g., a desktop computer, a laptop computer, a handheld computer, etc.), a server or other multi-user computer (e.g., an enterprise server, a midrange computer, a mainframe computer, etc.), an embedded controller, etc. Apparatus 10 may be coupled to other computers via a network, or may be a stand-alone device in the alternative. Apparatus 10 will hereinafter also be referred to as a "computer", although it should be appreciated the term "apparatus" may also include other suitable programmable electronic devices as well.
Computer 10 includes one or more central processing units (CPUs), or processors, 12 coupled to a memory 14. Memory 14 typically represents the random access memory (RAM) devices comprising the main storage of computer 10, as well as any supplemental levels of memory, e.g., cache memories, non-volatile or backup memories (e.g., programmable or flash memories), read-only memories, etc. In addition, memory 14 may be considered to include memory storage physically located elsewhere in computer 10, as well as any storage capacity used as a virtual memory, e.g., as stored on a mass storage device or on another computer coupled to computer 10 via a network.
Computer 10 also typically receives a number of inputs and outputs for communicating information externally. For interface with a user or operator, computer 10 typically includes one or more user input devices 16 (e.g., a keyboard, a mouse, a trackball, a joystick, a touchpad, and/or a microphone, among others) and a display 18 (e.g., a CRT monitor, an LCD display panel, and/or a speaker, among others). In the alternative, e.g., for a multi-user computer, computer 10 may include a workstation or other user terminal interface through which user input and output is exchanged.
Computer 10 may also include an interface with one or more networks 20 (e.g., a LAN, a WAN, a wireless network, and/or the Internet, among others) to permit the communication of information with other computers coupled to the network. Furthermore, for additional storage, computer 10 may also include one or more mass storage devices 22, e.g., a floppy or other removable disk drive, a hard disk drive, a direct access storage device (DASD), an optical drive (e.g., a CD drive, a DVD drive, etc.), and/or a tape drive, among others. It will also be appreciated that computer 10 typically includes suitable analog and/or digital interfaces between processor 12 and each of components 14, 16, 18, 20 and 22 as is well known in the art.
Computer 10 operates under the control of an operating system 24, and executes or otherwise relies upon various computer software applications, components, programs, objects, modules, data structures, etc. (e.g., case-based reasoning engine 26 and decision tree generator 28 shown as resident in memory 14, and search space or case library 30 and decision tree data structure 32 shown resident in mass storage device 22). Moreover, various applications, components, programs, objects, modules, etc. may also execute on one or more processors in another computer coupled to computer 10 via a network, e.g., in a distributed or client-server computing environment, whereby the processing required to implement the functions of a computer program may be allocated to multiple computers over a network.
In general, the routines executed to implement the embodiments of the invention, whether implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions will be referred to herein as "computer programs", or simply "programs". The computer programs typically comprise one or more instructions that are resident at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, cause that computer to perform the steps necessary to execute steps or elements embodying the various aspects of the invention. Moreover, while the invention has been and hereinafter will be described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of existing in a computer-readable medium, which may include recordable media such as volatile/non-volatile memory devices, floppy and other removable disks, hard disk drives, magnetic tape, optical disks, etc., and/or transmission media such as digital and analog communication links. Furthermore, embodiments of the invention may also exist in the form of a signal borne on a carrier wave, either within a computer or external therefrom along a communications path.
Those skilled in the art will recognize that the exemplary environment illustrated in FIG. 1 is not intended to limit the invention. Indeed, those skilled in the art will recognize that other alternative hardware and/or software environments may be used without departing from the scope of the invention.
In the illustrated embodiment, case-based reasoning consistent with the invention is implemented principally in a case-based reasoning engine program 26 and a decision tree generator program 28. Each of programs 26, 28 relies upon a case library or search space 30 within which is stored a plurality of cases. In this context, a case may incorporate any suitable data structure representing a set of attributes, features or characteristics that define a particular occurrence or instance to be used in the performance of inductive reasoning.
As an example, for a system that attempts to predict whether a loan would default, each case may represent a customer that has previously applied for a loan, as well as whether that loan was approved or not, and if so, whether that loan eventually went into default. Each case in such a system might incorporate various attributes about the customer such as income level, time at their current job, monthly payment, other debts, etc. As another example, for