United States Patent [19]
Agrawal et al.
US006138115A [11] Patent Number: 6,138,115  Date of Patent:
 METHOD AND SYSTEM FOR GENERATING A DECISION-TREE CLASSIFIER IN PARALLEL IN A MULTI-PROCESSOR SYSTEM
 Inventors: Rakesh Agrawal; Manish Mehta, both of San Jose, Calif.; John Christopher Shafer, Amherst, Mass.
 Assignee: International Business Machines Corporation, Armonk, N.Y.
[ * ] Notice: This patent is subject to a terminal disclaimer.
 Appl. No.: 09/245,765  Filed: Feb. 5, 1999
Related U.S. Application Data
 Division of application No. 08/641,404, May 1, 1996, Pat. No. 5,870,735.
 Int. Cl. G06F 17/30
L. Breiman (Univ. of CA-Berkeley) et al. Classification and
Regression Trees (Book) Chapter 2. Introduction to Tree
Classification pp. 18-58, Wadsworth International Group,
Belmont, CA 1984.
J. Catlett, Megainduction: Machine Learning on Very Large
Databases, PhD thesis, Univ. of Sydney, Jun./Dec. 1991.
P. K. Chan et al., Experiments on Multistrategy Learning by
Meta-learning. In Proc. Second Intl. Conf. on Info. and
Knowledge Mgmt., pp. 314-323, 1993.
D. J. DeWitt, J. F. Naughton and D. A. Schneider, Parallel
Sorting on Shared-Nothing Architecture Using Probabilistic
Splitting, In Proc. of the 1st Int'l Conf. on Parallel and
Distributed Information Systems, pp. 280-291, Dec. 1991.
U. Fayyad et al., The Attribute Selection Problem in Decision
Tree Generation. In 10th Nat'l Conf. on AI AAAI-92,
Learning: Inductive, 1992.
Primary Examiner: John E. Breene
Assistant Examiner: Cheryl Lewis
Attorney, Agent, or Firm: Khanh Q. Tran
A method and system are disclosed for generating a decision-tree classifier in parallel in a multi-processor system, from a training set of records. The method comprises the steps of: partitioning the records among the processors, each processor generating an attribute list for each attribute, and the processors cooperatively generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, each processor determines its best split test and, along with other processors, selects the best overall split for the records at that node. Preferably, the gini-index and class histograms are used in determining the best splits. Also, each processor builds a hash table using the attribute list of the split attribute and shares it with other processors. The hash tables are used for splitting the remaining attribute lists. The created tree is then pruned based on the MDL principle, which encodes the tree and split tests in an MDL-based code, and determines whether to prune and how to prune each node based on the code length of the node.
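The split-selection and list-splitting steps summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation: the exchange of candidate splits and hash tables among processors is omitted, the function names, the `(value, class_label, rid)` entry layout, the midpoint threshold, and the sample records are all assumptions made for illustration.

```python
from collections import Counter

def gini(hist, n):
    """Gini index of a class histogram that covers n records."""
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in hist.values())

def best_numeric_split(attr_list):
    """Scan a value-sorted attribute list of (value, class_label, rid) entries.

    Two class histograms are kept: `below` for records already scanned and
    `above` for the rest. Returns (weighted_gini, threshold) for the best test
    `value <= threshold`, or None when every value is identical.
    """
    n = len(attr_list)
    below = Counter()
    above = Counter(label for _, label, _ in attr_list)
    best = None
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # a threshold cannot separate equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below / n) * gini(below, n_below) \
              + (n_above / n) * gini(above, n_above)
        if best is None or score < best[0]:
            # Assumption: place the threshold midway between adjacent values.
            best = (score, (value + next_value) / 2)
    return best

def build_split_table(attr_list, threshold):
    """Hash table mapping record id -> True if the record goes to the left child."""
    return {rid: value <= threshold for value, _, rid in attr_list}

def split_list(attr_list, goes_left):
    """Split any other attribute list by probing the hash table with each rid."""
    left = [entry for entry in attr_list if goes_left[entry[2]]]
    right = [entry for entry in attr_list if not goes_left[entry[2]]]
    return left, right

# Invented sample data: an age list (sorted by value) and a car-type list.
age_list = [(17, "high", 0), (20, "high", 1), (23, "high", 2),
            (32, "low", 3), (43, "high", 4), (68, "low", 5)]
car_list = [("family", "high", 0), ("sports", "high", 1), ("sports", "high", 2),
            ("family", "low", 3), ("truck", "high", 4), ("family", "low", 5)]

score, threshold = best_numeric_split(age_list)
goes_left = build_split_table(age_list, threshold)
car_left, car_right = split_list(car_list, goes_left)
```

In a multi-processor setting, each processor would run a scan like this over its own portion of each attribute list, and the processors would then compare their candidates to agree on the best overall split. That communication step is outside the scope of this sketch.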
36 Claims, 10 Drawing Sheets
M. James, Classification Algorithms (book), Chapters 1-3,
QA278.65, J281 Wiley-Interscience Pub., 1985.
M. Mehta et al., MDL-based Decision Tree Pruning. Int'l
Conference on Knowledge Discovery in Databases and Data
Mining (KDD-95) Montreal, Canada, pp. 216-221, Aug.
J. R. Quinlan et al., Inferring Decision Trees Using Minimum Description Length Principle, Information and Computation 80, pp. 227-248, 1989. (0890-5401/89 Academic Press, Inc.).
Wallace et al., Coding Decision Trees, Machine Learning, 11, pp. 7-22, 1993. (Kluwer Academic Pub., Boston. Mfg. in the Netherlands.).
S. M. Weiss et al., Computer Systems that Learn, Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, pp. 113-143, 1991. Q325.5, W432, C2, Morgan Kaufmann Pub. Inc., San Mateo, CA.
MPI: A Message-Passing Interface Standard, Message Passing Interface Forum May 5, 1994.
M. Mehta, R. Agrawal & J. Rissanen, SLIQ: Fast Scalable Classifier for Data Mining, In EDBT 96, Avignon, France, Mar. 1996.
R. P. Lippmann, An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, pp. 4-22, 0740-7467/87/0400, Apr. 1987.
D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Chapter 6, Intro. to Genetics Based Machine Learning, pp. 218-257, (Book), 1989.
D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao & R. Rasmussen, The Gamma Database Machine Project, IEEE Transactions on Knowledge and Data Eng. vol. 2, No. 1, pp. 44-62, Mar. 1990.
No. 08/500,717, filed Jul. 11, 1995, for System and Method for Parallel Mining of Association Rules in Databases, Pat. No. 5,842,200.
No. 08/541,665, filed Oct. 10, 1995, for Method and System for Mining Generalized Sequential Patterns in a Large Database, Pat. No. 5,742,811.
No. 08/564,694, filed Nov. 29, 1995, for Method and System for Generating a Decision-tree Classifier for Data Records, Pat. No. 5,787,274.