(12) United States Patent (10) Patent No.: US 6,651,058 B1
Sundaresan et al. (45) Date of Patent: Nov. 18, 2003
A computer program product is provided as an automatic mining system to discover terms that are relevant to a given target topic from a large database of unstructured information such as the World Wide Web. The operation of the automatic mining system is performed in three stages: the first stage is carried out by a new terms discoverer for discovering the terms in a document, the second stage is carried out by a candidate terms discoverer for discovering potentially relevant terms, and the third stage is carried out by a relevant terms discoverer for refining or testing the discovered relevance to filter false relevance. The new terms discoverer includes a system for the automatic mining of patterns and relations, a system for the automatic mining of new relationships, and a system for selecting new terms from relations. In one embodiment, the system for the automatic mining of patterns and relations identifies a set of related terms on the WWW with a high degree of confidence, using a duality concept, and includes a terms database and two identifiers: a relation identifier and a pattern identifier. The system for the automatic mining of new relationships includes a database, a knowledge module, and a statistics module. The knowledge module includes a stemming unit, a synonym check unit, and a domain knowledge check unit. The candidate terms discoverer includes a metadata extractor, a document vector module, an association module, a filtering module, and a database. The relevant terms discoverer includes a stop word filter and a system for the automatic construction of a generalization-specialization hierarchy of terms comprised of a terms database, an augmentation module, a generalization detection module, and a hierarchy database.
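The three-stage flow described in the abstract can be pictured with a small sketch. This is not the patented method: each stage is replaced by a deliberately simple stand-in (plain tokenization, document co-occurrence counting, and a stop word filter), and it omits the pattern and relation mining, the stemming and synonym checks, and the generalization-specialization hierarchy. All function names, the `STOP_WORDS` set, and the `min_support` threshold are assumptions made only for illustration.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption, not taken from the patent).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "its"}


def discover_new_terms(document: str) -> list[str]:
    """Stage 1 stand-in: return the lowercase word tokens found in one document."""
    return re.findall(r"[a-z]+", document.lower())


def discover_candidate_terms(documents: list[str], topic: str,
                             min_support: int = 2) -> Counter:
    """Stage 2 stand-in: count, per term, the documents in which it
    co-occurs with the target topic, keeping terms seen often enough."""
    support = Counter()
    for doc in documents:
        terms = set(discover_new_terms(doc))
        if topic in terms:
            support.update(terms - {topic})
    return Counter({t: n for t, n in support.items() if n >= min_support})


def discover_relevant_terms(candidates: Counter) -> list[str]:
    """Stage 3 stand-in: drop stop words, then rank by support (ties by name)."""
    kept = [(t, n) for t, n in candidates.items() if t not in STOP_WORDS]
    return [t for t, _ in sorted(kept, key=lambda pair: (-pair[1], pair[0]))]


def mine_relevant_terms(documents: list[str], topic: str,
                        min_support: int = 2) -> list[str]:
    """Chain the three stand-in stages for one target topic."""
    candidates = discover_candidate_terms(documents, topic, min_support)
    return discover_relevant_terms(candidates)


docs = [
    "XML is a markup language for documents",
    "The DTD of an XML document defines its markup",
    "Java is a programming language",
]
print(mine_relevant_terms(docs, "xml"))
```

In this toy corpus only "markup" appears alongside "xml" in two documents, so it is the only term that passes the default support threshold; lowering `min_support` to 1 admits more candidates, which is where the stop word filter starts to matter.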
22 Claims, 9 Drawing Sheets
6,175,829 B1 * 1/2001 Li et al. 382/230
6,182,091 B1 * 1/2001 Pitkow et al. 707/102
6,185,550 B1 * 2/2001 Snow et al. 707/1
6,377,947 B1 * 4/2002 Evans 707/5
6,389,436 B1 * 5/2002 Chakrabarti et al. 707/3
D. Gibson et al., "Inferring Web Communities from Link Topology," Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, 1998.
D. Turnbull, "Bibliometrics and the World Wide Web," Technical Report, University of Toronto, 1996.
K. McCain, "Mapping Authors in Intellectual Space: A Technical Overview," Journal of the American Society for Information Science, 41(6):433-443, 1990.
S. Brin, "Extracting Patterns and Relations from the World
Wide Web," WebDB, Valencia, Spain, 1998.
R. Agrawal et al., "Fast Algorithms for Mining Association Rules," Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, Sep. 1994.
R. Agrawal et al., "Mining Association Rules Between Sets of Items in Large Databases," Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993.
S. Chakrabarti et al., "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," Proc. of the 8th International World Wide Web Conference, Toronto, Canada, May 1999.
B. Huberman et al., "Strong Regularities in World Wide Web Surfing," Xerox Palo Alto Research Center.
A. Hutchinson, "Metrics on Terms and Clauses," Department of Computer Science, King's College London.
J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997.
R. Srikant et al., "Mining Generalized Association Rules," Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.
W. Li et al., "Facilitating complex Web queries through visual user interfaces and query relaxation," published on the World Wide Web at URL: http://www.7scu.edu.au/programme/fullpapers/1936/coml936.htm as of Aug. 16, 1999.
G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," pp. 229-248.
R. Miller et al., "SPHINX: A Framework for Creating Personal, Site-specific Web Crawlers," published on the World Wide Web at URL: http://www.7scu.edu.au/programme/fullpapers/1875/coml875.htm as of Aug. 16, 1999.
S. Soderland, "Learning to Extract Text-based Information from the World Wide Web," American Association for Artificial Intelligence (www.aaai.org), pp. 251-254.
G. Plotkin, "A Note on Inductive Generalization," pp. 153-163.
R. Feldman et al., "Mining Associations in Text in the Presence of Background Knowledge," Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Aug. 2-4, 1996, Portland, Oregon.
R. Kumar et al., "Trawling the Web for Emerging Cyber-Communities," published on the World Wide Web at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999.
"Acronym Finder," published on the World Wide Web at URL: http://acronymfinder.com/ as of Sep. 4, 1999.
* cited by examiner