(12) United States Patent (10) Patent No.: US 6,564,197 B2
Sahami et al. (45) Date of Patent: *May 13, 2003
(54) METHOD AND APPARATUS FOR
SCALABLE PROBABILISTIC CLUSTERING
USING DECISION TREES
(75) Inventors: Mehran Sahami, Mountain View, CA (US); George Harrison John, San
Mateo, CA (US)
(73) Assignee: E.piphany, Inc., San Mateo, CA (US)
( * ) Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).
Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/304,509
(22) Filed: May 3, 1999
(65) Prior Publication Data
US 2003/0065635 A1 Apr. 3, 2003
(51) Int. Cl. G06N 5/02
(57) ABSTRACT

Some embodiments of the invention include methods for identifying clusters in a database, data warehouse, or data mart. The identified clusters can be meaningfully understood by a list of the attributes and corresponding values for each of the clusters. Some embodiments of the invention include a method for scalable probabilistic clustering using a decision tree. Some embodiments of the invention perform linearly in the size of the set of data and require only a single access to the set of data. Some embodiments of the invention produce interpretable clusters that can be described in terms of a set of attributes and attribute values for that set of attributes. In some embodiments, the cluster can be interpreted by reading the attribute values and attributes on the path from the root node of the decision tree to the node of the decision tree corresponding to the cluster. In some embodiments, it is not necessary for there to be a domain specific distance function for the attributes. In some embodiments, a cluster is determined by identifying an attribute with the highest influence on the distribution of the other attributes. Each of the values assumed by the identified attribute corresponds to a cluster, and a node in the decision tree. In some embodiments, the CUBE operation is used to access the set of data a single time and the result is used to compute the influence and other calculations.
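The clustering step described in the abstract can be sketched as follows. The abstract does not fix a formula for "influence," so this sketch assumes an illustrative measure: the sum of pairwise empirical mutual information between a candidate attribute and each remaining attribute. The function names (`mutual_information`, `split_by_influence`) are hypothetical and not from the patent:

```python
from collections import Counter
from math import log2

def mutual_information(rows, i, j):
    """Empirical mutual information (in bits) between attributes i and j,
    estimated from the joint and marginal value counts of the rows."""
    n = len(rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    pij = Counter((r[i], r[j]) for r in rows)
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

def split_by_influence(rows, attrs):
    """Pick the attribute whose values most influence the distribution of
    the remaining attributes (here: largest summed mutual information),
    then partition the rows by that attribute's values -- one cluster,
    and one decision-tree node, per value."""
    best = max(
        attrs,
        key=lambda i: sum(mutual_information(rows, i, j)
                          for j in attrs if j != i),
    )
    clusters = {}
    for r in rows:
        clusters.setdefault(r[best], []).append(r)
    return best, clusters
```

Applied recursively to each resulting cluster, this yields a decision tree whose root-to-node paths read out as attribute/value descriptions of the clusters, as the abstract describes. In the patented method the counts feeding such a calculation would come from a single CUBE pass over the data rather than repeated scans.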
59 Claims, 13 Drawing Sheets
Chow, C.K. 1968. Approximating Discrete Probability Distributions With Dependence Trees, IEEE Transactions On Information Theory, vol. IT-14, No. 3, pp. 462-467.
Fisher, D.H. 1986. Knowledge Acquisition Via Incremental Conceptual Clustering, Unsupervised Concept Learning and Discovery, pp. 267-283.
Chickering, D.M. 1996. Learning Bayesian Networks is NP-Complete. pp. 121-130.
John, G.H., Lent, B. 1997. SIPping From the Data Firehose, American Association for Artificial Intelligence, pp. 199-201.
Sahami, M. 1999. Using Machine Learning To Improve Information Access, Dissertation, Stanford University, Dec. 1998.
McAlpine, G. et al., "Integrated Information Retrieval in a Knowledge Worker Support System", Proc. of the Intl. Conf. on Research and Development In Information Retrieval (SIGIR), Cambridge, MA, Jun. 25-28, 1989, Conf. 12, pp. 48-57.
Tsuda, K. et al., "IconicBrowser: An Iconic Retrieval System for Object-Oriented Databases", Proc. of the IEEE Workshop on Visual Languages, Oct. 4, 1989, pp. 130-137.
"Multiple Selection List Presentation Aids Complex Search", IBM Technical Disclosure Bulletin, vol. 36, No. 10, Oct. 1993, pp. 317-318.
Han, J.: "Towards On-Line Analytical Mining in Large Databases" SIGMOD Record, Mar. 1998, ACM, USA, vol. 27, No. 1, pp. 97-107, XP000980233, ISSN: 0163-5808.
* cited by examiner