ADAPTIVE PROBLEM DETERMINATION AND RECOVERY IN A COMPUTER SYSTEM
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present invention is related to the following applications entitled: "Method and Apparatus for Publishing and Monitoring Entities Providing Services in a Distributed Data Processing System", Ser. No. , attorney docket no. YOR920020173US1; "Method and Apparatus for Automatic Updating and Testing of Software", Ser. No. , attorney docket no. YOR920020174US1; "Composition Service for Autonomic Computing", Ser. No. , attorney docket no. YOR920020176US1; and "Self-Configuring Computing System", Ser. No. , attorney docket no. YOR920020181US1; all filed even date hereof, assigned to the same assignee, and incorporated herein by reference.
BACKGROUND OF THE INVENTION
[0002] 1. Technical Field
[0003] The present invention relates generally to an improved data processing system, and in particular, to a method and apparatus for managing hardware and software components. Still more particularly, the present invention provides a method and apparatus for automatically recognizing, tracing, diagnosing, and recovering from problems in hardware and software components to achieve functionality requirements.
[0004] 2. Description of Related Art
[0005] Modern computing technology has resulted in immensely complicated and ever-changing environments. One such environment is the Internet, which is also referred to as an "internetwork." The Internet is a set of computer networks, possibly dissimilar, joined together by means of gateways that handle data transfer and the conversion of messages from a protocol of the sending network to a protocol used by the receiving network. When capitalized, the term "Internet" refers to the collection of networks and gateways that use the TCP/IP suite of protocols. Currently, the most commonly employed method of transferring data over the Internet is to employ the World Wide Web environment, also called simply "the Web". Other Internet resources exist for transferring information, such as File Transfer Protocol (FTP) and Gopher, but have not achieved the popularity of the Web. In the Web environment, servers and clients effect data transactions using the Hypertext Transfer Protocol (HTTP), a known protocol for handling the transfer of various data files (e.g., text, still graphic images, audio, motion video, etc.). The information in various data files is formatted for presentation to a user by a standard page description language, the Hypertext Markup Language (HTML). The Internet also is widely used to transfer applications to users using browsers. Oftentimes, users may search for and obtain software packages through the Internet. While computer technology has become more powerful, it has also become more complex. As the complexity and heterogeneity of computer systems continue to increase, it is becoming increasingly difficult to diagnose and correct hardware and software problems. As computing systems become more autonomic (i.e., self-regulating), this challenge will become even greater for several reasons. First, autonomic computing systems, being self-configuring, will tend to work around such problems, making it difficult to recognize that anything is wrong. Second, problems will become harder to trace to their source because of the more ephemeral relationships among elements in the autonomic system. In other words, the set of elements that participated in the failure may no longer be connected to one another by the time the problem is noticed, making reconstruction of the problem very difficult. For instance, a number of publications address the topic of problem identification, but do so in a statically-configured system, such as Tang, D.; Iyer, R. K., "Analysis and modeling of correlated failures in multicomputer systems," IEEE Transactions on Computers, Vol. 41 Issue 5, May 1992, pp. 567-577; Lee, I.; Iyer, R. K.; Tang, D., "Error/failure analysis using event logs from fault tolerant systems," Digest of Papers, Twenty-First International Symposium on Fault-Tolerant Computing (FTCS-21), 1991, pp. 10-17; and Thottan, M.; Chuanyi Ji, "Proactive anomaly detection using distributed intelligent agents," IEEE Network, Vol. 12 Issue 5, September-October 1998, pp. 21-27.
[0006] Today, human technical support personnel or system administrators perform most of the tasks associated with recognizing, diagnosing, and repairing hardware or software problems manually, often employing a good deal of trial and error, and relying on their own memory or ability to recognize similar patterns of behavior. This is a laborious process, and as system complexity increases there are progressively fewer system administrators who can do it competently. Thus, a need exists for technology to automate the recognition, tracing, diagnosis, and repair of problems in autonomic systems.
SUMMARY OF THE INVENTION
[0007] The present invention is directed toward a method, computer program product, and data processing system for recognizing, tracing, diagnosing, and repairing problems in an autonomic computing system. Rules and courses of actions to follow in logging data, in diagnosing faults (or threats of faults), and in treating faults (or threats of faults) are formulated using an adaptive inference and action system. The adaptive inference and action system includes techniques for conflict resolution that generate, prioritize, modify, and remove rules based on environment-specific information, accumulated time-sensitive data, actions taken, and the effectiveness of those actions. Thus, the present invention enables a dynamic, autonomic computing system to formulate its own strategy for self-administration, even in the face of changes in the configuration of the system.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
[0009] FIG. 1 is a diagram of a networked data processing system in which the present invention may be implemented;
[0010] FIG. 2 is a block diagram of a server system within the networked data processing system of FIG. 1;
[0011] FIG. 3 is a block diagram of a client system within the networked data processing system of FIG. 1;
[0012] FIG. 4 is a diagram of an autonomic element in accordance with a preferred embodiment of the present invention;
[0013] FIG. 5 is a diagram of a mechanism for establishing service-providing relationships between autonomic elements in accordance with a preferred embodiment of the present invention;
[0014] FIG. 6 is an overall view of a problem detection and correction system in accordance with a preferred embodiment of the present invention; and
[0015] FIG. 7 is a detailed view of a problem detection and correction system in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0016] With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented. Network data processing system 100 is a network of computers in which the present invention may be implemented. Network data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.
[0017] In the depicted example, server 104 is connected to network 102 along with storage unit 106. In addition, clients 108, 110, and 112 are connected to network 102. These clients 108, 110, and 112 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 108-112. Clients 108, 110, and 112 are clients to server 104. Network data processing system 100 may include additional servers, clients, and other devices not shown. In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for the present invention.
[0018] Referring to FIG. 2, a block diagram of a data processing system that may be implemented as a server, such as server 104 in FIG. 1, is depicted in accordance with a preferred embodiment of the present invention. Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 connected to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 is connected to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.
[0019] Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. A number of modems may be connected to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 108-112 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in boards.
[0020] Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI local buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, data processing system 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted, either directly or indirectly.
[0021] Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.
[0022] The data processing system depicted in FIG. 2 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.
[0023] With reference now to FIG. 3, a block diagram illustrating a data processing system is depicted in which the present invention may be implemented. Data processing system 300 is an example of a client computer. Data processing system 300 employs a peripheral component interconnect (PCI) local bus architecture. Although the depicted example employs a PCI bus, other bus architectures such as Accelerated Graphics Port (AGP) and Industry Standard Architecture (ISA) may be used. Processor 302 and main memory 304 are connected to PCI local bus 306 through PCI bridge 308. PCI bridge 308 also may include an integrated memory controller and cache memory for processor 302. Additional connections to PCI local bus 306 may be made through direct component interconnection or through add-in boards. In the depicted example, local area network (LAN) adapter 310, SCSI host bus adapter 312, and expansion bus interface 314 are connected to PCI local bus 306 by direct component connection. In contrast, audio adapter 316, graphics adapter 318, and audio/video adapter 319 are connected to PCI local bus 306 by add-in boards inserted into expansion slots. Expansion bus interface 314 provides a connection for a keyboard and mouse adapter 320, modem 322, and additional memory 324. Small computer system interface (SCSI) host bus adapter 312 provides a connection for hard disk drive 326, tape drive 328, and CD-ROM drive 330. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.
[0024] An operating system runs on processor 302 and is used to coordinate and provide control of various components within data processing system 300 in FIG. 3. The operating system may be a commercially available operating system, such as Windows XP, which is available from Microsoft Corporation. An object-oriented programming system such as Java may run in conjunction with the operating system and provide calls to the operating system from Java programs or applications executing on data processing system 300. "Java" is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 326, and may be loaded into main memory 304 for execution by processor 302.
[0025] Those of ordinary skill in the art will appreciate that the hardware in FIG. 3 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 3. Also, the processes of the present invention may be applied to a multiprocessor data processing system.
[0026] As another example, data processing system 300 may be a stand-alone system configured to be bootable without relying on some type of network communication interfaces. As a further example, data processing system 300 may be a personal digital assistant (PDA) device, which is configured with ROM and/or flash ROM in order to provide non-volatile memory for storing operating system files and/or user-generated data.
[0027] The depicted example in FIG. 3 and above-described examples are not meant to imply architectural limitations. For example, data processing system 300 also may be a notebook computer or hand held computer in addition to taking the form of a PDA. Data processing system 300 also may be a kiosk or a Web appliance.
[0028] The present invention is directed to a method and apparatus for problem determination and correction in a self-managing, autonomic computing system. The hardware and software components making up such a computing system (e.g., databases, storage systems, Web servers, file servers, and the like) are self-managing components called "autonomic elements." Autonomic elements couple conventional computing functionality (e.g., a database) with additional self-management capabilities. FIG. 4 is a diagram of an autonomic element in accordance with a preferred embodiment of the present invention. According to the preferred embodiment depicted in FIG. 4, an autonomic element 400 comprises a management unit 402 and a functional unit 404. One of ordinary skill in the art will recognize that an autonomic element need not be clearly divided into separate units as in FIG. 4, as the division between management and functional units is merely conceptual.
[0029] Management unit 402 handles the self-configuration features of autonomic element 400. In particular, management unit 402 is responsible for adjusting and maintaining functional unit 404 pursuant to a set of goals for autonomic element 400, as indicated by monitor/control interface 414. Management unit 402 is also responsible for limiting access to functional unit 404 to those other system components (e.g., other autonomic elements) that have permission to use functional unit 404, as indicated by access control interfaces 416. Management unit 402 is also responsible for establishing and maintaining relationships with other autonomic elements (e.g., via input channel 406 and output channel 408).
[0030] Functional unit 404 consumes services provided by other system components (e.g., via input channel 410) and provides services to other system components (e.g., via output channel 412), depending on the intended functionality of autonomic element 400. For example, an autonomic database element provides database services and an autonomic storage element provides storage services. It should be noted that an autonomic element, such as autonomic element 400, may be a software component, a hardware component, or some combination of the two. One goal of autonomic computing is to provide computing services at a functional level of abstraction, without making rigid distinctions between the underlying implementations of a given functionality.
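By way of illustration only, the conceptual division between a management unit and a functional unit might be sketched in Python as follows. The class names, the `permitted` set, and the use of `str.upper` as a stand-in service are assumptions made for this sketch and are not taken from FIG. 4; only the access-limiting role of the management unit is modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class FunctionalUnit:
    """Hypothetical functional unit: wraps the conventional service."""
    service: Callable[[str], str]

    def serve(self, request: str) -> str:
        return self.service(request)


@dataclass
class ManagementUnit:
    """Hypothetical management unit: decides who may use the functional unit."""
    permitted: Set[str] = field(default_factory=set)

    def may_use(self, caller_id: str) -> bool:
        return caller_id in self.permitted


@dataclass
class AutonomicElement:
    management: ManagementUnit
    functional: FunctionalUnit

    def request(self, caller_id: str, payload: str) -> str:
        # Access to the functional unit is limited to permitted callers.
        if not self.management.may_use(caller_id):
            raise PermissionError(f"{caller_id} may not use this element")
        return self.functional.serve(payload)


# Illustrative element whose "service" simply upper-cases its input.
element = AutonomicElement(
    management=ManagementUnit(permitted={"element-602"}),
    functional=FunctionalUnit(service=str.upper),
)
```

In this sketch a call such as `element.request("element-602", "abc")` is served, while a call from an identifier outside the permitted set raises `PermissionError`.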
[0031] Autonomic elements operate by providing services to other components (which may themselves be autonomic elements) and/or obtaining services from other components. In order for autonomic elements to cooperate in such a fashion, one requires a mechanism by which an autonomic element may locate and enter into relationships with additional components providing needed functionality. FIG. 5 is a diagram depicting such a mechanism constructed in accordance with a preferred embodiment of the present invention.
[0032] A "requesting component" 500, an autonomic element, requires services of another component in order to accomplish its function. In a preferred embodiment, such function may be defined in terms of a policy of rules and goals. Policy server component 502 is an autonomic element that establishes policies for other autonomic elements in the computing system. In FIG. 5, policy server component 502 establishes a policy of rules and goals for requesting component 500 to follow and communicates this policy to requesting component 500. In the context of network communications, for example, a required standard of cryptographic protection may be a rule contained in a policy, while a desired quality of service (QoS) may be a goal of a policy.
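The distinction between a rule and a goal in such a policy may be illustrated by the following Python sketch, offered as one possible reading rather than a definitive implementation. The field names `min_key_bits` and `target_latency_ms`, the `Offer` type, and the sample values are all hypothetical: the rule is treated as a hard filter, and the goal as a ranking preference among whatever satisfies the rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    min_key_bits: int          # rule: a hard requirement (assumed representation)
    target_latency_ms: float   # goal: a preference, not a requirement


@dataclass(frozen=True)
class Offer:
    provider: str
    key_bits: int
    latency_ms: float


def select_offer(policy: Policy, offers: List[Offer]) -> Optional[Offer]:
    # Rules filter: discard every offer that violates the required protection.
    compliant = [o for o in offers if o.key_bits >= policy.min_key_bits]
    if not compliant:
        return None
    # Goals rank: prefer the compliant offer that misses the target by the least.
    return min(
        compliant,
        key=lambda o: max(0.0, o.latency_ms - policy.target_latency_ms),
    )


policy = Policy(min_key_bits=128, target_latency_ms=50.0)
offers = [
    Offer("fast-but-weak", key_bits=56, latency_ms=5.0),
    Offer("strong-slow", key_bits=256, latency_ms=80.0),
    Offer("strong-fast", key_bits=128, latency_ms=60.0),
]
chosen = select_offer(policy, offers)
```

Here the fastest offer is rejected outright because it violates the rule, and the goal then selects between the two compliant offers.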
[0033] In furtherance of requesting component 500's specified policy, requesting component 500 requires a service from an additional component (for example, encryption of data). In order to acquire such a service, requesting component 500 consults directory component 504, another autonomic element. Directory component 504 is preferably a type of database that maps functional requirements into components providing the required functionality.
[0034] In a preferred embodiment, directory component 504 may provide directory services through the use of standardized directory service schemes such as Web Services Description Language (WSDL) and systems such as Universal Description, Discovery, and Integration (UDDI), which allow a program to locate entities that offer particular services and to automatically determine how to communicate and conduct transactions with those services. WSDL is a proposed standard being considered by the World-Wide Web Consortium, authored by representatives of companies, such as International Business Machines Corporation, Ariba, Inc., and Microsoft Corporation. UDDI version 3 is the current specification being used for Web service applications and services. Future development and changes to UDDI will be handled by the Organization for the Advancement of Structured Information Standards (OASIS).
[0035] Directory component 504 provides requesting component 500 information to allow requesting component 500 to make use of the services of a needed component 506. Such information may include an address (such as a network address) to allow an existing component to be communicated with, downloadable code or the address to downloadable code to allow a software component to be provisioned, or any other suitable information to allow requesting component 500 to make use of the services of needed component 506.
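As a minimal sketch of the mapping performed by a directory component, and not of WSDL or UDDI themselves, the lookup from a functional requirement to component information might be modeled in Python as below. The `Directory` and `ComponentInfo` names, the `"encryption"` key, and the example address and URL are hypothetical placeholders introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ComponentInfo:
    name: str
    address: Optional[str] = None    # address of an already-running component
    code_url: Optional[str] = None   # where provisionable code could be fetched


class Directory:
    """Hypothetical in-memory directory keyed by functional requirement."""

    def __init__(self) -> None:
        self._by_function: Dict[str, List[ComponentInfo]] = {}

    def register(self, function: str, info: ComponentInfo) -> None:
        self._by_function.setdefault(function, []).append(info)

    def lookup(self, function: str) -> List[ComponentInfo]:
        # Return a copy so callers cannot mutate the directory's own list.
        return list(self._by_function.get(function, []))


directory = Directory()
directory.register(
    "encryption",
    ComponentInfo(name="needed-component-506", address="crypto.example:8443"),
)
directory.register(
    "encryption",
    ComponentInfo(name="provisionable", code_url="https://code.example/crypto.jar"),
)
matches = directory.lookup("encryption")
```

A requesting component would then either contact the returned address or provision a component from the returned code location, depending on which kind of information the entry carries.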
[0036] FIG. 6 is a diagram providing an overall view of a system for problem determination and error recovery in an autonomic computing system in accordance with a preferred embodiment of the present invention. An autonomic computing system is comprised of a number of autonomic elements 600, 602, and 604, which are hardware and software components that are self-contained inasmuch as they are self-managed, but which operate in a cooperative manner by utilizing each other's services. Each of autonomic elements 600, 602, and 604 maintains an event log (606, 608, and 610, respectively) and interacts with the Problem Determination and Error Recovery system 612, which may be contained within the individual autonomic elements 600, 602, and 604 or may be contained within another autonomic element or component that specializes in this function.
[0037] The Logic Module 614 in the Problem Determination and Error Recovery system 612 directs each of autonomic elements 600, 602, and 604 to log events of specified types, under specified conditions, with a specified level of detail. Based on the event logs received from the autonomic elements, the Logic Module 614 attempts to diagnose the problem (or the threat of a problem). Once a problem (or potential problem) is detected, the Logic Module suggests courses of action to the autonomic elements to recover from the problem. The specifics of the types of available elements, events, problems and actions are stored in the Database 616 associated with the Logic Module 614.
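One way to picture a directive to log events of specified types, under specified conditions, with a specified level of detail is the following Python sketch. The `LoggingDirective` type, the integer detail scale, and the `cpu` state variable are assumptions made for illustration and are not part of the described embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class LoggingDirective:
    event_type: str                                 # which events to log
    detail_level: int                               # hypothetical scale: 0 terse, 2 verbose
    condition: Callable[[Dict[str, float]], bool]   # when the directive applies


def detail_for(
    directives: List[LoggingDirective],
    event_type: str,
    state: Dict[str, float],
) -> Optional[int]:
    """Return the detail level at which to log this event, or None to skip it."""
    levels = [
        d.detail_level
        for d in directives
        if d.event_type == event_type and d.condition(state)
    ]
    # When several directives apply, the most detailed one wins (a design choice).
    return max(levels) if levels else None


directives = [
    LoggingDirective("E1", 1, lambda state: True),
    LoggingDirective("E1", 2, lambda state: state["cpu"] > 0.9),
]
```

With these directives, events of type E1 are logged at level 1 in normal operation and at level 2 while the hypothetical `cpu` measure exceeds 0.9; events of other types are not logged.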
[0038] In terms of logging of events, it can thus be said that the Logic Module 614 establishes a policy by which each of autonomic elements 600, 602, and 604 logs events. Events may include actions taken by any of autonomic elements 600, 602, and 604, inputs received by autonomic elements 600, 602, and 604, or other events observable by autonomic elements 600, 602, and 604. It is understood that event logs 606, 608, and 610 may also include various system configurations, workload characteristics, and performance measures in autonomic elements 600, 602, and 604, respectively. Event logs 606, 608, and 610 may be written in any suitable data format, although in a preferred embodiment a structured, machine-readable format, such as XML, may be used. Alternatively, event logs 606, 608, and 610 may be represented in any database or data storage format, such as a relational database, object-oriented database, object-relational database, deductive database, or any other suitable storage format. Typically, the information stored in event logs 606, 608, and 610 will identify particular occurrences of events, their time of occurrence, and any parameters or other data useful for the interpretation of events. One of ordinary skill in the art will recognize that a wide variety of items of information regarding events within the computing system may be stored in event logs 606, 608, and 610 without departing from the scope and spirit of the present invention. Moreover, any number of autonomic elements, comprising hardware components, software components, or both, may be used.
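For illustration, an event log entry in a structured, machine-readable format such as XML might be written and read back as in the Python sketch below, which uses the standard `xml.etree.ElementTree` module. The element and attribute names (`event`, `param`, `element`, `type`, `time`), the timestamp, and the `queue_length` parameter are hypothetical; no particular schema is prescribed by the embodiment.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def make_event(element_id: str, event_type: str, timestamp: str,
               params: Dict[str, object]) -> ET.Element:
    """Build one log entry: which event occurred, when, and with what parameters."""
    event = ET.Element(
        "event", {"element": element_id, "type": event_type, "time": timestamp}
    )
    for name, value in sorted(params.items()):
        ET.SubElement(event, "param", {"name": name}).text = str(value)
    return event


def parse_event(xml_text: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Recover the identifying fields and parameters from a serialized entry."""
    event = ET.fromstring(xml_text)
    params = {p.get("name"): p.text for p in event.findall("param")}
    return event.get("element"), event.get("type"), event.get("time"), params


record = make_event("600", "E1", "2003-01-01T00:00:00Z", {"queue_length": 42})
xml_text = ET.tostring(record, encoding="unicode")
```

Note that parameter values come back as strings after parsing; a schema or typed attribute would be needed if the Logic Module required numeric values directly.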
[0039] In a preferred embodiment of the present invention, the Logic Module 614 is divided into three separate logic modules, namely the Logging Logic Module 704, the Problem Determination Logic Module 710 and the Error Recovery Logic Module 715 as shown in FIG. 7. Each of these logic modules includes an inference engine (not shown) that applies sets of rules including those of the form
[0040] IF <Condition(s)> THEN <Action(s)> or
[0041] WHILE <Condition(s)> DO <Action(s)>
[0042] to observed data to make their decisions. Thus, each of Logging Logic Module 704, Problem Determination Logic Module 710, and Error Recovery Logic Module 715 may be thought of as a kind of expert system having an inference engine that applies a knowledge base of logical inference rules to make decisions regarding the logging of data, diagnosis of problems, and recovery from errors, respectively. In the rules making up each knowledge base, a condition clause (e.g., the "X" in "if X then Y") may include variables and be as simple as "element X is of type T1", or complex, involving statistical, machine learning, or artificial intelligence techniques. Examples of statistical techniques to define the condition clause may include (but are not limited to) the application of Student's T-test, correlation analysis, or regression analysis. Examples of machine learning and artificial intelligence techniques to define the condition clause may include (but are not limited to) supervised learning methods such as neural networks, Bayesian networks, or Support Vector Machines, and unsupervised learning methods such as k-means clustering, hierarchical clustering, or principal component analysis.
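The two rule forms may be sketched in Python as follows. This is a deliberately small illustration under stated assumptions, not the inference engine of the embodiment: the `Rule` type, the `run_rules` function, the `max_steps` guard, and the fact names `element_type`, `workload`, and `log_types` are all hypothetical, and only simple conditions are shown rather than statistical or learned ones.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Facts = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]
    action: Callable[[Facts], None]
    repeat: bool = False   # False: IF ... THEN ...; True: WHILE ... DO ...


def run_rules(rules: List[Rule], facts: Facts, max_steps: int = 100) -> List[str]:
    """Apply each rule to the observed facts; return the names of fired rules."""
    fired: List[str] = []
    for rule in rules:
        steps = 0
        # max_steps guards against a WHILE rule whose action never clears its condition.
        while steps < max_steps and rule.condition(facts):
            rule.action(facts)
            fired.append(rule.name)
            steps += 1
            if not rule.repeat:
                break
    return fired


def log_e1(facts: Facts) -> None:
    facts["log_types"].add("E1")


def shed_load(facts: Facts) -> None:
    # Reduce workload by 10% using integer arithmetic.
    facts["workload"] = facts["workload"] * 9 // 10


rules = [
    Rule("log-E1", lambda f: f["element_type"] == "T1", log_e1),
    Rule("shed-load", lambda f: f["workload"] > 80, shed_load, repeat=True),
]
facts: Facts = {"element_type": "T1", "workload": 100, "log_types": set()}
fired = run_rules(rules, facts)
```

The IF rule fires once, whereas the WHILE rule keeps firing until its condition no longer holds, taking the workload from 100 to 90, 81, and finally 72.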
[0043] In a similar fashion, the action clause may include variables and be as simple as "Log events of type E1 in element X". Actions may also involve the delegation of a complex task to an element such as "Reduce workload in element X by 10%," where the details of how to reduce the workload are determined by the autonomic element X. Moreover, actions may include the creation, modification, or removal of rules from the rule sets in the three Logic Modules. This rule learning process may be performed using any appropriate machine learning technique, including those previously enumerated herein. For example, upon the discovery of a new diagnosis, an action clause may state "Add a new rule R in the Problem Determination Logic Module where R: IF events of type E5 and E6 occur simultaneously in an element of type T1, THEN problem of type P9 has occurred."
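An action that itself adds a rule to a rule set might look like the Python sketch below. The `RuleSet`, `DiagnosisRule`, and `Observation` types and the `learn_rule` function are hypothetical names chosen for this illustration, "simultaneously" is simplified to mean that the event types appear together in one observation, and how the new diagnosis is discovered in the first place is outside the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Observation:
    element_type: str
    event_types: FrozenSet[str]   # event types observed at the same time


@dataclass
class DiagnosisRule:
    name: str
    condition: Callable[[Observation], bool]
    problem: str


class RuleSet:
    """Hypothetical rule set of a problem determination module."""

    def __init__(self) -> None:
        self._rules: Dict[str, DiagnosisRule] = {}

    def add(self, rule: DiagnosisRule) -> None:
        self._rules[rule.name] = rule

    def remove(self, name: str) -> None:
        self._rules.pop(name, None)

    def diagnose(self, obs: Observation) -> List[str]:
        return [r.problem for r in self._rules.values() if r.condition(obs)]


def learn_rule(rule_set: RuleSet, element_type: str,
               events: Iterable[str], problem: str) -> str:
    """Action clause: add a rule R to the given rule set and return its name."""
    required = frozenset(events)
    name = f"{element_type}:{'+'.join(sorted(required))}->{problem}"
    rule_set.add(DiagnosisRule(
        name=name,
        # The rule matches when all required events co-occur in the right element type.
        condition=lambda obs: obs.element_type == element_type
        and required <= obs.event_types,
        problem=problem,
    ))
    return name


rule_set = RuleSet()
rule_name = learn_rule(rule_set, "T1", ["E5", "E6"], "P9")
```

Because rules are stored by name, the same mechanism supports later modification (adding a rule under an existing name) and removal via `rule_set.remove(rule_name)`.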
[0044] The specifics of the types of available (or known) elements, events, known problems, conditions, and actions are all stored in the Databases 705, 711, and 716 associated with the three logic modules 704, 710, and 715, respectively. These Databases can be updated (e.g., when a new class of element is introduced) by both machines and humans.
[0045] It should be noted from the above description that the preferred embodiment of the present invention allows for at least two levels of adaptivity in the rule sets contained in