WO2001039461A2 - Network event correlation system using protocol models

Network event correlation system using protocol models

Info

Publication number
WO2001039461A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
node
model
network nodes
alarm
Prior art date
Application number
PCT/US2000/032214
Other languages
French (fr)
Other versions
WO2001039461A3 (en)
Inventor
Deepak K. Kakadia
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to EP00983761A priority Critical patent/EP1234407B1/en
Priority to AU20473/01A priority patent/AU2047301A/en
Priority to DE60032801T priority patent/DE60032801D1/en
Publication of WO2001039461A2 publication Critical patent/WO2001039461A2/en
Publication of WO2001039461A3 publication Critical patent/WO2001039461A3/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/34 Signalling channels for network management communication

Abstract

An event correlation system for network management has computer code for at least one model of a process to be run on a node of a network, where said process is intended to be run on a different node of the network than the model. The correlation system also has code for an alarm monitor comparing the apparent behavior of the model to actual events generated by the process and generating alarm messages when the actual events of the process do not match expected events from the model. It also has code for an event correlation utility; and means for communicating alarms from the alarm monitor to the event correlation utility.

Description

NETWORK EVENT CORRELATION SYSTEM USING FORMALLY SPECIFIED MODELS OF PROTOCOL BEHAVIOR
FIELD OF THE INVENTION
The invention relates to the field of network administration, and in particular to the field of software tools for assisting network management through diagnosis of network problems so that appropriate enhancements and repairs may be made.
BACKGROUND OF THE INVENTION
Many voice and data networks occasionally have problems. These problems may include a noisy or failed link; an overloaded link; a damaged cable, repeater, switch, router, or channel bank causing multiple failed links; or a crashed or overloaded server; a link being a channel interconnecting two or more nodes of the network. Networks may also have software problems: a packet may have an incorrect format or be excessively delayed, or packets may arrive out of order, corrupt, or missing. Routing tables may become corrupt, and critical name servers may crash. Problems may be intermittent or continuous; intermittent problems tend to be more difficult to diagnose than continuously present ones. Networks are often distributed over wide areas, frequently involving repeaters, wires, or fibers in remote locations, and involve a high degree of concurrency. The wide distribution of network hardware and software makes diagnosis difficult because running diagnostic tests may require extensive travel, delay, and expense. Many companies operating large networks have installed network operations centers where network managers monitor network activity by observing reported alarms. These managers attempt to relate the alarms to common causes and dispatch appropriate repair personnel to the probable location of the fault inducing the alarms.
Y. Nygate, in Event Correlation using Rule and Object Based Techniques, Proceedings of the Fourth International Symposium on Integrated Network Management, Chapman & Hall, London, 1995, pp. 279-289, reports that typical network operations centers receive thousands of alarms each hour. This large number of alarms results because a single fault in a data or telecommunications network frequently induces the reporting of many alarms to network operators. For example, a failed repeater may cause alarms at the channel banks, nodes, or hosts at each end of the affected link through the repeater, as well as alarms from the switches, routers, and servers attempting to route data over the link. Each server may also generate alarms at the hardware level and at various higher levels; for example, a server running TCP-IP over Ethernet may generate an error at the Ethernet level, then at the IP level, then at the TCP level, and again at the application level.
Alarms from higher levels of protocol may be generated by network nodes not directly connected to a problem node. For example, when node C's connection to node D fails, alarms may arise from the TCP or application layer on node A of a network, where node A connects only to node B, node B routes packets on to node C, and node D was being used by an application on node A.
Alarms may be routed to a network management center through the network itself, if sufficient nodes and links remain operational to transport them, or through a separate control and monitor data network. The telephone network, for example, frequently passes control information relating to telephone calls - such as the number dialed and the originating number - over a control and monitor data network separate from the trunks over which calls are routed. It is known that many networks have potential failure modes such that some alarms will be unable to reach the network management center.
Loss of one link may cause overloading of remaining links or servers in the network, causing additional alarms for slow or denied service. The multitude of alarms can easily obscure the real cause of a fault, or mask individual faults in the profusion of alarms caused by other problems. This phenomenon increases the skill, travel, and time needed to resolve failures.
The number of alarms reported from a single fault tends to increase with the complexity of the network, which grows at a rate greater than linear in the number of nodes. For example, the complexity of the network of networks known as the Internet, and the potential for reported faults, increase exponentially with the number of servers and computers connected to it, or to the networks connected to the Internet.
Event correlation is the process of efficiently determining the occurrence of and the source of problems in a complex system or network based on observable events.
Yemini, in United States Patent No. 5,661,668, used a rule-based expert system for event correlation. Yemini's approach requires construction of a causality matrix, which relates observable symptoms to likely problems in the system. A weakness of this approach is that a tremendous amount of expertise is needed to cover all cases and, as the number of network nodes, and therefore the number of permutations and combinations, increases, the complexity of the causality matrix increases exponentially.
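For illustration, the following is a minimal sketch in C (the implementation language this description later names for its own models) of the codebook idea behind such a causality matrix. The symptoms, problems, and best-match rule are hypothetical stand-ins, not Yemini's actual method; the point is only that observed symptom patterns are matched against columns of the matrix.

    /* Minimal sketch of the codebook idea: codebook[s][p] == 1 when
     * problem p is expected to raise symptom s. Symptom and problem
     * names are hypothetical. */
    #include <stdio.h>

    #define N_SYMPTOMS 4
    #define N_PROBLEMS 3

    static const char *problems[N_PROBLEMS] = {
        "link failure", "router crash", "server overload"
    };

    static const int codebook[N_SYMPTOMS][N_PROBLEMS] = {
        {1, 1, 0},   /* host unreachable        */
        {1, 0, 0},   /* ethernet frames unacked */
        {0, 1, 0},   /* routing updates absent  */
        {0, 0, 1},   /* slow responses          */
    };

    /* Return the problem whose expected symptom pattern best matches
     * the observed symptom vector. */
    static int correlate(const int observed[N_SYMPTOMS]) {
        int best = 0, best_score = -1;
        for (int p = 0; p < N_PROBLEMS; p++) {
            int score = 0;
            for (int s = 0; s < N_SYMPTOMS; s++)
                score += (observed[s] == codebook[s][p]);
            if (score > best_score) { best_score = score; best = p; }
        }
        return best;
    }

    int main(void) {
        int observed[N_SYMPTOMS] = {1, 1, 0, 0};
        printf("probable root cause: %s\n", problems[correlate(observed)]);
        return 0;
    }

Every new node or failure mode adds rows and columns that an expert must fill in, which illustrates why the matrix, and the expertise needed to maintain it, grow quickly as the network grows.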
A first node in a network may run one or several processes. These processes run concurrently with, and interact with, additional processes running on each node with which the first node communicates. One of the small minority of computer languages capable of specifying and modeling multiple concurrent processes and their interactions is C.A.R. Hoare's Communicating Sequential Processes, or "CSP", as described in M. Hinchey and S. Jarvis, Concurrent Systems: Formal Development in CSP, The McGraw-Hill International Series in Software Engineering, London, 1995.
The CSP language allows a programmer to formally describe the response of processes to stimuli and how those processes are interconnected. It is known that timed derivatives of CSP can be used to model network processes; Chapter 7 of Hinchey et al. proposes using a CSP model to verify a reliable network protocol.
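For illustration, a minimal sketch of the form such a timed model might take after translation into C, as this specification later describes for its own models. The stop-and-wait sender, its states and events, and the timeout bound are assumptions for the example, not taken from this patent or from Hinchey's text.

    /* Minimal sketch of a timed model of a stop-and-wait sender:
     * after a SEND, the specification allows at most TIMEOUT ticks
     * before an ACK must arrive. */
    #include <stdio.h>

    #define TIMEOUT 3

    typedef enum { IDLE, AWAIT_ACK } state_t;
    typedef enum { EV_SEND, EV_ACK, EV_TICK } event_t;

    typedef struct { state_t state; int ticks; } sender_model;

    /* Advance the model by one event; return 1 on a spec violation. */
    static int model_step(sender_model *m, event_t ev) {
        if (m->state == IDLE && ev == EV_SEND) {
            m->state = AWAIT_ACK;
            m->ticks = 0;
        } else if (m->state == AWAIT_ACK) {
            if (ev == EV_ACK)
                m->state = IDLE;
            else if (ev == EV_TICK && ++m->ticks > TIMEOUT)
                return 1;                 /* ack overdue */
        }
        return 0;
    }

    int main(void) {
        sender_model m = { IDLE, 0 };
        event_t trace[] = { EV_SEND, EV_TICK, EV_TICK, EV_TICK, EV_TICK };
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            if (model_step(&m, trace[i]))
                printf("deviation: no ACK within %d ticks\n", TIMEOUT);
        return 0;
    }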
Timed CSP processes may be used to model network node behavior implemented as logic gates as well as processes running on a processor in a network node. For purposes of this application, a process running on a network node may be implemented as logic gates, as a process running in firmware on a microprocessor, microcontroller, or CPU chip, or as a hybrid of the two.
Other languages that provide for concurrency may be used to model network node behavior. For example, Verilog and VHDL both provide for concurrent execution of multiple modules and for communication between modules. Further, Occam incorporates an implementation of Timed CSP.
With the recent and continuing expansion of computing, data communications, and communications networks, the volume of repetitive, duplicate, and superfluous alarm messages reaching network operations centers makes understanding root causes of network problems difficult, and has the potential to overwhelm system operators. It is desirable that these alarm messages be automatically correlated to make root causes of network problems more apparent to the operators than with prior event correlation tools.
SUMMARY OF THE INVENTION
Each node A in a network has a model, based upon a formal specification written in a formal specification language such as a timed derivative of CSP, of the expected behavior of each node B to which it is connected. Each node of the network also has an alarm monitor, again written in a formal specification language, which monitors at least one process running on that node. The model or models of expected behavior of each node B are compared with the actual responses of each node B by an additional alarm monitor. Alarms from the alarm monitors are filtered by a local alarm filter in the node, such that the node reports what it believes to be the root cause of a failure to a network management center. In this way the number of meaningless alarms generated and reported to the network management center is reduced and the accuracy of the alarms is improved.
Alarms reported to the network management center are filtered again by an intelligent event correlation utility to reduce redundant fault information before presentation to network administration staff.
The foregoing and other features, utilities and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention as illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a network of computers for which an event correlation system may be useful;
Figure 2, a diagram of the hierarchy of network protocol processes running on a processor node of a network, as is known in the prior art;
Figure 3, a diagram illustrating a hierarchy of protocol processes running on a processor node and a router node of a network, having a formal model of behavior of additional nodes, alarm monitors for protocol layers, and an alarm filter as per the current invention; and
Figure 4, a diagram illustrating a hierarchy of protocol processes running on a processor node, having a multiplicity of processes at certain levels, with multiple instantiations of formal models, alarm monitors for protocol layers, and an alarm filter as per the current invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Typically, large data networks have at least one network management node 100 (Figure 1), connected into the network at one or more points. The network may have one or more routers 101, 101A, and 110 capable of passing packets between separate sections of network 102, one or more servers 103 and 111, and one or more hubs 104. Failure of a router 110, a link 112, or a hub 104 can cause isolation of particular servers 111 or user workstations (not shown) from the network, causing loss or degradation of service to users. It is often desired that failures be reported through alarm messages to the network management node 100, so that corrective action may be taken and service restored quickly. Failed hardware is not always capable of sending alarm messages; failures must often be inferred from alarm messages generated by operating nodes affected by the failure. For example, should link 112 fail, hub 104 may report a loss of communication with server 111. Additionally, other nodes 103 that were communicating with the now-isolated server 111 may report loss of communication, or that server 111, which may be a critical name server for Internet communications, is no longer reachable. To reduce the likelihood that a functioning server will be unable to forward error messages to the network management station 100, some networks are organized with a separate management subnetwork 120 for diagnostic, management, and control messages, which subnetwork often operates at a lower speed than many of the data links of the network. For example, should node 103 detect a failure of a network adapter that connects it to data network segment 102, it may be possible for an error message from node 103 to reach the network management station 100 by way of the management subnetwork 120. Many networks interconnect with the Internet through one or more firewall or other dedicated routing and filtering nodes 121.
An application 201 (Figure 2) running on a first network node 203 typically communicates with the network through several layers of network protocol. When application 201 requests communication with a second application 202 on another network node 210, it passes requests to a top layer of network protocol handler, which may be a TCP protocol handler 205. This layer modifies and encapsulates the request into another level of request passed to a lower-level protocol handler, which may be an IP handler 206, which in turn modifies and encapsulates the request into a packet passed on to a physical layer protocol handler, which may be an Ethernet physical layer handler 207. Other physical layer handlers may also be used for any available computer interconnection scheme, including but not limited to CI, Fibre Channel, 100-BaseT, 10-BaseT, a T-1 line, a DSL line, ATM, Token Ring, or a modem onto an analog phone line. Some network technologies are known to utilize seven or more layers of protocol handlers between an application process and the physical link. Protocols other than TCP-IP may be used, including IPX-SPX and NetBEUI; it is common for routers to translate protocols so that systems from different vendors may communicate with each other. The physical layer link may connect directly or through a hub to the node 210, on which the target application 202 runs, or may frequently be passed through a router 211 or firewall. A firewall is a special case of a router in that packets are filtered in addition to being passed between network segments. In router 211, the packet is received through a physical layer handler, here Ethernet handler 215, which passes the packet to an IP protocol handler 216 that decodes the IP address of the packet. The IP address is then mapped by a routing module 217 to determine where the packet should next be sent; the packet is then modified as needed and returned through an IP protocol handler 216 to a physical layer handler 218.
When the packet reaches the node 210 on which the target application 202 runs, the packet is similarly processed through a physical layer handler 220, then through higher levels of protocol handler including IP and TCP handlers 221 and 222 before being passed to the target application 202.
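For illustration, a minimal sketch in C of the encapsulation path just described: each layer prepends its own header before handing the buffer down. The bracketed tags are simplified placeholders, not the real TCP, IP, or Ethernet header formats.

    /* Minimal sketch of layered encapsulation: each handler wraps the
     * buffer it receives from the layer above. */
    #include <stdio.h>
    #include <string.h>

    static size_t wrap(char *buf, size_t len, const char *hdr) {
        size_t h = strlen(hdr);
        memmove(buf + h, buf, len);   /* make room for the header */
        memcpy(buf, hdr, h);
        return len + h;
    }

    int main(void) {
        char pkt[256] = "app-request";
        size_t len = strlen(pkt);
        len = wrap(pkt, len, "[TCP]");   /* TCP handler, like 205 */
        len = wrap(pkt, len, "[IP]");    /* IP handler, like 206 */
        len = wrap(pkt, len, "[ETH]");   /* physical handler, like 207 */
        printf("%.*s\n", (int)len, pkt); /* [ETH][IP][TCP]app-request */
        return 0;
    }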
In a preferred embodiment, the protocol handlers 205, 206, and 207 of the first network node 203 have been instrumented to provide state information to drive a formally specified model 301 (Figure 3) of the node's 302 own TCP protocol layer 303. Similarly, the protocol handlers have been instrumented to drive a formally specified model 305 of the node's own IP layer 306. An event comparator and alarm monitor 307 compares signals passed between the TCP 303, IP 306, and Ethernet 308 handlers with those of their corresponding formally specified models 301 and 305, generating alarm messages when significant discrepancies arise. These alarm messages are passed to an event correlator module 310 within the node 302.
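For illustration, a minimal sketch in C of the comparator pattern just described, with the formally specified model stubbed out as a function returning its next expected event. The event structure, codes, and function names are hypothetical.

    /* Minimal sketch of an event comparator and alarm monitor such
     * as 307: each inter-layer event reported by the instrumented
     * handlers is checked against the model's prediction. */
    #include <stdio.h>

    typedef struct { int layer; int code; } net_event;

    /* Stand-in for the formally specified model's next expected event. */
    static net_event model_expected_next(void) {
        net_event e = { 2 /* IP */, 7 /* forward packet */ };
        return e;
    }

    static void post_alarm(net_event actual, net_event expected) {
        printf("alarm: layer %d event %d, expected layer %d event %d\n",
               actual.layer, actual.code, expected.layer, expected.code);
    }

    /* Called by the instrumented protocol handlers on every event. */
    static void comparator_observe(net_event actual) {
        net_event expected = model_expected_next();
        if (actual.layer != expected.layer || actual.code != expected.code)
            post_alarm(actual, expected);
    }

    int main(void) {
        net_event ok = {2, 7}, bad = {2, 9};
        comparator_observe(ok);   /* silent: matches the model */
        comparator_observe(bad);  /* raises an alarm */
        return 0;
    }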
Similarly, a model 315 of the Ethernet protocol layer and a model 316 of the IP protocol layer may be invoked for each other machine with which the node 302 is in current communication. An event comparator and alarm monitor 317 watches for significant deviations between the expected behavior of the target and its actual behavior, such as repeated failures of a target to acknowledge a packet despite retries.
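For illustration, a minimal sketch in C of such a per-target deviation check; the retry bound and all names are hypothetical assumptions for the example.

    /* Minimal sketch: the model tolerates a bounded number of
     * unacknowledged retries before flagging a deviation. */
    #include <stdio.h>

    #define MAX_RETRIES 3

    typedef struct { int pending_retries; } target_model;

    /* Returns 1 when the target deviates from expected behavior. */
    static int on_retry(target_model *t) {
        return ++t->pending_retries > MAX_RETRIES;
    }

    static void on_ack(target_model *t) { t->pending_retries = 0; }

    int main(void) {
        target_model peer = {0};
        for (int i = 0; i < 5; i++) {
            if (on_retry(&peer)) {
                printf("alarm: peer unresponsive after %d retries\n",
                       peer.pending_retries);
                on_ack(&peer);  /* reset after reporting, as a late ack would */
            }
        }
        return 0;
    }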
It is advantageous if at least some of the network nodes, such as router 320, are modified in a similar manner. For example, a router may have a formally specified model 321 of its own IP layer 322, with an event comparator and alarm monitor 323 observing deviations between expected and actual behavior. A router may have instances of a model 326 of each node's physical layer with which it communicates; similarly it may have models 327 of IP and TCP (not shown) layers, with an event comparator and alarm monitor 328 to track differences between reality and model behavior.
Alarm messages from the router's event comparators and alarm monitors 323 and 328 are transmitted to an event correlator 330, where they are filtered. Filtered alarms, with precise fault identification, are then addressed to the network management station 100 and passed through the router's TCP 331, IP 322, and an appropriate physical layer 332 for transmission. These alarms typically report symptoms: for example, a particular host may be unreachable, or all transmissions attempted over a link may fail to be acknowledged at the Ethernet level.
In the preferred embodiment, the local event correlator 310 or 330 is a rules-based correlation engine that generally accepts the earliest and lowest-level alarm caused by a fault as the root cause of the alarms. Since each layer supports the layer above it, the lowest layer reporting an alarm in a given communications context is the most likely source of the fault. The root cause as determined by these correlators is transmitted to the network management node 100 for further analysis and dispatch of repair personnel as needed. Once formally specified and verified in a language that supports concurrency, such as Timed CSP or Verilog, individual models, such as the model of the TCP layer 301, may be manually or automatically translated into an implementation language like C. In the preferred embodiment, the models 301, 305, 315, 316, 321, and 326 have been translated into C.
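For illustration, a minimal sketch in C of the earliest, lowest-layer rule just described: among the alarms gathered for one communications context, the alarm from the lowest protocol layer wins, with time as the tie-breaker. The alarm record layout and layer numbering are assumptions for the example.

    /* Minimal sketch of the local correlation rule: lowest layer
     * first, earliest timestamp breaking ties. */
    #include <stdio.h>

    typedef struct { int layer; long t; const char *text; } alarm_t;

    static const alarm_t *root_cause(const alarm_t *a, int n) {
        const alarm_t *best = &a[0];
        for (int i = 1; i < n; i++)
            if (a[i].layer < best->layer ||
                (a[i].layer == best->layer && a[i].t < best->t))
                best = &a[i];
        return best;
    }

    int main(void) {
        alarm_t batch[] = {
            {4 /* app */, 120, "application timeout"},
            {3 /* tcp */, 110, "tcp retransmit limit"},
            {1 /* eth */, 100, "ethernet not acknowledged"},
        };
        printf("report: %s\n", root_cause(batch, 3)->text);
        return 0;
    }

Only the single reported root cause travels to the management node, which is how the node-local filter cuts the alarm volume described in the background.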
In a first alternative embodiment of a server according to the present invention, a model 400 (Figure 4) of the interactions of a critical application 401 with the TCP-IP layers has also been constructed. This application model 400 is used in conjunction with an event monitor and comparator 402 to generate alarm messages should the critical application 401 fail.
In the alternative embodiment of Figure 4, there is a plurality of applications running; application 405 is a separate application from critical application 401. In this case, each application may also have its own instances of the TCP protocol layer 410 and 411, TCP event monitor and comparator 412 and 413, and TCP models 414 and 415. This alternative embodiment is implemented on the Solaris™ (a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries) multithreaded operating system, which also permits multiple instantiations of the IP layer as well as the TCP layer, so that a single server or router may appear at several IP addresses in the network. In this embodiment, a first IP layer 420 is monitored by an event monitor and comparator 421 and an IP model 422 for errors. Errors generated by all the internal event monitors and comparators 402, 413, 412, 421, and the target event comparators and alarm monitors 425, 426, and 427 are filtered by an Event Correlator 430, a separate instance of the TCP layer 431, and a separate IP layer 432, such that the server alarm monitor and control appears at a separate IP address from the applications 405 and 401. While it is anticipated that servers and routers will often have multiple network adapters, and hence multiple physical layer drivers, this particular embodiment has a single physical layer driver, here Ethernet layer 440.
The alternative embodiment of Figure 4 also embodies a plurality of target physical layer models, here Ethernet layer models 450, 451, and 452, and target IP layer models 455, 456, and 457, organized as paired Ethernet and IP layer models. Each pair models the expected reactions of a machine, or of a process on a machine, with which the server is in communication; deviations from expected behavior also generate alarm messages that are fed to the local event correlator 430.
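For illustration, a minimal sketch in C of keeping one paired physical-layer/IP model per communicating peer, as with models 450 through 457: a pair is looked up by peer address and instantiated on first contact. The fixed-size table and all names are hypothetical simplifications.

    /* Minimal sketch: one Ethernet/IP model pair per peer, created
     * lazily when traffic to that peer is first observed. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PEERS 8

    typedef struct { int eth_state; int ip_state; } model_pair;
    typedef struct { char addr[16]; model_pair models; int used; } peer_slot;

    static peer_slot peers[MAX_PEERS];

    /* Find the model pair for a peer, instantiating it on first contact. */
    static model_pair *models_for_peer(const char *addr) {
        for (int i = 0; i < MAX_PEERS; i++)
            if (peers[i].used && strcmp(peers[i].addr, addr) == 0)
                return &peers[i].models;
        for (int i = 0; i < MAX_PEERS; i++)
            if (!peers[i].used) {
                peers[i].used = 1;
                snprintf(peers[i].addr, sizeof peers[i].addr, "%s", addr);
                return &peers[i].models;
            }
        return NULL;  /* table full */
    }

    int main(void) {
        if (models_for_peer("10.0.0.7"))
            printf("model pair instantiated for peer 10.0.0.7\n");
        return 0;
    }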
It is expected that the models, as with TCP model 414 and IP model 422, may be implemented as separate code and as separate threads from the corresponding TCP and IP protocol layers 430 and 420. This is a nonintrusive implementation of the invention, in that the code for the TCP and IP protocol layers need not be modified for use with the invention.
In a second alternative embodiment, an intrusive implementation of the invention, code for TCP model 414 and IP model 422 is embedded in the code of the corresponding TCP 430 and IP 420 protocol layers. This implementation is advantageous for speed, but requires that the code of the TCP and IP protocol layers, which are often components of the operating system, be modified.
In the alternative embodiment of Figure 4, the Event Correlator module 412 also has a thread activated whenever TCP and IP protocol layer tasks are created or destroyed. When these layer tasks are created, the Event Correlator creates appropriate model tasks and configures them as required; and when these layer tasks are destroyed, the Event Correlator destroys the corresponding model tasks.
While the invention has been particularly shown and described with reference to a preferred embodiment thereof, it will be understood by those skilled in the art that various other changes in the form and details may be made without departing from the spirit and scope of the invention.
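For illustration, a minimal sketch of the task-lifecycle hook described above, using POSIX threads as a stand-in for the Solaris threading named earlier: creating a protocol-layer task starts a matching model task, and destroying the layer task tears the model task down. All names are hypothetical.

    /* Minimal sketch of the lifecycle hook (build with: cc -pthread). */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct { int layer_id; volatile int stop; pthread_t tid; } model_task;

    static void *model_main(void *arg) {
        model_task *m = (model_task *)arg;
        while (!m->stop)
            ;  /* here the model would be stepped against observed events */
        printf("model task for layer %d destroyed\n", m->layer_id);
        return NULL;
    }

    /* Called when a TCP or IP layer task is created. */
    static void on_layer_created(model_task *m, int layer_id) {
        m->layer_id = layer_id;
        m->stop = 0;
        pthread_create(&m->tid, NULL, model_main, m);
    }

    /* Called when the corresponding layer task is destroyed. */
    static void on_layer_destroyed(model_task *m) {
        m->stop = 1;                 /* signal the model task to exit */
        pthread_join(m->tid, NULL);  /* and wait for it to finish */
    }

    int main(void) {
        model_task m;
        on_layer_created(&m, 3);     /* e.g. a new TCP layer instance */
        on_layer_destroyed(&m);
        return 0;
    }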

Claims

CLAIMS
What is claimed is:
1. An event correlation system for network management comprising: computer readable code for at least one model of a process to be run on a node of a network, where said process is intended to be run on a different node of the network than the model; computer readable code for an alarm monitor comparing the apparent behavior of the model to actual events generated by the process; computer readable code for an event correlation utility; and means for communicating alarms from the alarm monitor to the event correlation utility.
2. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, at least one model of an additional process running on said node, where said additional process may or may not be running on said node, and an alarm monitor comparing the apparent behavior of the model to actual events of the additional process; a plurality of links, each link connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; and at least one event correlation utility running on a node of the network for filtering alarms generated by the alarm monitors of the plurality of network nodes.
3. The network of Claim 2 wherein the model of the additional process is written in a formal specification language.
4. The network of Claim 2 wherein the formal specification language is selected from the set consisting of Occam, Timed CSP, Verilog, and VHDL.
5. The network of Claim 2, wherein the at least one model of an additional process running on said node comprises at least a first model and a second model, where the first model is a model of a layer of a communications protocol, and the second model is a model of a higher layer of the communications protocol.
6. The network of Claim 5, wherein the communications protocol is TCP-IP, and wherein the layer of a communications protocol is an IP layer, and wherein the higher layer of the communications protocol is TCP.
7. The network of Claim 5, wherein the layers of the communications protocol having corresponding models are a subset of a set of layers of communications protocol on the network nodes on which the models and their corresponding communications protocol layers run.
8. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, at least one alarm monitor for generating alarm messages, and at least one event correlator for filtering the alarm messages, such that fewer alarm messages are passed to a network management node of the network than are generated by the at least one alarm monitor; a plurality of links, each link connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; and at least one event correlation utility running on a network management node of the network for further filtering alarms passed to the network management station by the plurality of network nodes.
9. The network of Claim 8 wherein the at least two of the plurality of network nodes further comprise a model of at least one of the at least one process, wherein the model of the additional process is originally written in a formal specification language.
10. The network of Claim 8 wherein the formal specification language is selected from the set consisting of Occam, Timed CSP, Verilog, and VHDL.
11. The network of Claim 8 wherein the at least two of the plurality of network nodes further comprise a model of at least one of the at least one process, wherein the model of the process is coupled to at least one of the at least one alarm monitors, the alarm monitor being coupled to generate an alarm message when the process does not behave according to the model.
12. The network of Claim 8 wherein at least one of the at least two of the plurality of network nodes further comprises a second model, the second model being of a second at least one process running on a node other than the node on which the model is run, wherein the second model is coupled to a second alarm monitor, the second alarm monitor being coupled to generate an alarm message when the second process does not behave according to the second model, the alarm message being filtered by the at least one event correlator prior to transmission to a network management node of the network.
13. The network of Claim 12, further comprising a separate subnetwork for transmission of alarm messages to the network management node of the network, and for transmission of network control messages from the network management node of the network to at least two of the plurality of network nodes.
14. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, means for modeling an additional process running on said node, where said additional process may or may not be running on said node, and means for generating alarms when the means for modeling an additional process mismatches actual events of the additional process; a plurality of links, each link comprising means for connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; means for transmission of alarms from the plurality of network nodes to a network management network node; and means for correlating the alarms running on the network management network node to filter alarms generated by the alarm monitors of the plurality of network nodes.
15. The network of Claim 14, wherein the means for generating alarms further comprises a local means for correlating events such that the alarms transmitted to the network management network node are filtered to indicate a root cause of a problem.
PCT/US2000/032214 1999-11-29 2000-11-22 Network event correlation system using protocol models WO2001039461A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00983761A EP1234407B1 (en) 1999-11-29 2000-11-22 Network event correlation system using protocol models
AU20473/01A AU2047301A (en) 1999-11-29 2000-11-22 Network event correlation system using formally specified models of protocol behavior
DE60032801T DE60032801D1 (en) 1999-11-29 2000-11-22 DEPENDENT NETWORK EXAMINATION DEVICE SYSTEM USING PROTOCOL MODELS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/450,296 US6532554B1 (en) 1999-11-29 1999-11-29 Network event correlation system using formally specified models of protocol behavior
US09/450,296 1999-11-29

Publications (2)

Publication Number Publication Date
WO2001039461A2 true WO2001039461A2 (en) 2001-05-31
WO2001039461A3 WO2001039461A3 (en) 2001-12-13

Family

ID=23787529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/032214 WO2001039461A2 (en) 1999-11-29 2000-11-22 Network event correlation system using protocol models

Country Status (6)

Country Link
US (1) US6532554B1 (en)
EP (1) EP1234407B1 (en)
AT (1) ATE350830T1 (en)
AU (1) AU2047301A (en)
DE (1) DE60032801D1 (en)
WO (1) WO2001039461A2 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7161945B1 (en) * 1999-08-30 2007-01-09 Broadcom Corporation Cable modem termination system
US6789257B1 (en) * 2000-04-13 2004-09-07 International Business Machines Corporation System and method for dynamic generation and clean-up of event correlation circuit
US7111205B1 (en) * 2000-04-18 2006-09-19 Siemens Communications, Inc. Method and apparatus for automatically reporting of faults in a communication network
US6701459B2 (en) * 2000-12-27 2004-03-02 Egurkha Pte Ltd Root-cause approach to problem diagnosis in data networks
US20040017779A1 (en) * 2002-07-25 2004-01-29 Moxa Technologies Co., Ltd. Remote equipment monitoring system with active warning function
JP3983138B2 (en) * 2002-08-29 2007-09-26 富士通株式会社 Failure information collection program and failure information collection device
US20040049714A1 (en) * 2002-09-05 2004-03-11 Marples David J. Detecting errant conditions affecting home networks
US7548897B2 (en) * 2002-10-02 2009-06-16 The Johns Hopkins University Mission-centric network defense systems (MCNDS)
EP1460801B1 (en) * 2003-03-17 2006-06-28 Tyco Telecommunications (US) Inc. System and method for fault diagnosis using distributed alarm correlation
US7389345B1 (en) 2003-03-26 2008-06-17 Sprint Communications Company L.P. Filtering approach for network system alarms
US7711811B1 (en) * 2003-03-26 2010-05-04 Sprint Communications Company L.P. Filtering approach for network system alarms based on lifecycle state
US7421493B1 (en) 2003-04-28 2008-09-02 Sprint Communications Company L.P. Orphaned network resource recovery through targeted audit and reconciliation
CN100416493C (en) * 2003-06-05 2008-09-03 中兴通讯股份有限公司 Apparatus and method for realizing inserting multiple alarming process
US8694475B2 (en) * 2004-04-03 2014-04-08 Altusys Corp. Method and apparatus for situation-based management
US20050222895A1 (en) * 2004-04-03 2005-10-06 Altusys Corp Method and Apparatus for Creating and Using Situation Transition Graphs in Situation-Based Management
US7788109B2 (en) * 2004-04-03 2010-08-31 Altusys Corp. Method and apparatus for context-sensitive event correlation with external control in situation-based management
US20050222810A1 (en) * 2004-04-03 2005-10-06 Altusys Corp Method and Apparatus for Coordination of a Situation Manager and Event Correlation in Situation-Based Management
GB0410047D0 (en) 2004-05-05 2004-06-09 Silverdata Ltd An analytical software design system
EP2677691A1 (en) 2004-05-25 2013-12-25 Rockstar Consortium US LP Connectivity Fault Notification
US7719965B2 (en) * 2004-08-25 2010-05-18 Agilent Technologies, Inc. Methods and systems for coordinated monitoring of network transmission events
US7408441B2 (en) * 2004-10-25 2008-08-05 Electronic Data Systems Corporation System and method for analyzing user-generated event information and message information from network devices
US20060168170A1 (en) * 2004-10-25 2006-07-27 Korzeniowski Richard W System and method for analyzing information relating to network devices
US7408440B2 (en) * 2004-10-25 2008-08-05 Electronics Data Systems Corporation System and method for analyzing message information from diverse network devices
US7792049B2 (en) * 2005-11-30 2010-09-07 Novell, Inc. Techniques for modeling and evaluating protocol interactions
US8205215B2 (en) * 2007-05-04 2012-06-19 Microsoft Corporation Automated event correlation
US7958386B2 (en) * 2007-12-12 2011-06-07 At&T Intellectual Property I, L.P. Method and apparatus for providing a reliable fault management for a network
US8533533B2 (en) * 2009-02-27 2013-09-10 Red Hat, Inc. Monitoring processes via autocorrelation
US9723501B2 (en) 2014-09-08 2017-08-01 Verizon Patent And Licensing Inc. Fault analytics framework for QoS based services

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69126666T2 (en) * 1990-09-17 1998-02-12 Cabletron Systems Inc NETWORK MANAGEMENT SYSTEM WITH MODEL-BASED INTELLIGENCE
US5473596A (en) 1993-12-09 1995-12-05 At&T Corp. Method and system for monitoring telecommunication network element alarms
US5528516A (en) 1994-05-25 1996-06-18 System Management Arts, Inc. Apparatus and method for event correlation and problem reporting
US5539877A (en) 1994-06-27 1996-07-23 International Business Machine Corporation Problem determination method for local area network systems
US5777549A (en) * 1995-03-29 1998-07-07 Cabletron Systems, Inc. Method and apparatus for policy-based alarm notification in a distributed network management environment
US6017143A (en) * 1996-03-28 2000-01-25 Rosemount Inc. Device in a process system for detecting events
US6208955B1 (en) * 1998-06-12 2001-03-27 Rockwell Science Center, Llc Distributed maintenance system based on causal networks
US6360333B1 (en) * 1998-11-19 2002-03-19 Compaq Computer Corporation Method and apparatus for determining a processor failure in a multiprocessor computer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GARDNER R D ET AL: "METHODS AND SYSTEMS FOR ALARM CORRELATION" COMMUNICATIONS: THE KEY TO GLOBAL PROSPERITY. GLOBECOM 1996. LONDON, NOV. 18 - 22, 1996, GLOBAL TELECOMMUNICATIONS CONFERENCE (GLOBECOM), NEW YORK, IEEE, US, vol. 1, 18 November 1996 (1996-11-18), pages 136-140, XP000742141 ISBN: 0-7803-3337-3 *
LEHMANN A ET AL: "KNOWLEDGE-BASED ALARM SURVEILLANCE FOR TMN" PROCEEDINGS OF THE 1996 IEEE FIFTEENTH ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON COMPUTERS AND COMMUNICATIONS. SCOTTSDALE, MAR. 27 - 29, 1996, PROCEEDINGS OF THE IEEE ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON COMPUTERS AND COMMUNICATIONS, NEW YORK,, vol. CONF. 15, 27 March 1996 (1996-03-27), pages 494-500, XP000594817 ISBN: 0-7803-3256-3 *

Also Published As

Publication number Publication date
EP1234407B1 (en) 2007-01-03
EP1234407A2 (en) 2002-08-28
DE60032801D1 (en) 2007-02-15
AU2047301A (en) 2001-06-04
ATE350830T1 (en) 2007-01-15
WO2001039461A3 (en) 2001-12-13
US6532554B1 (en) 2003-03-11

Similar Documents

Publication Publication Date Title
US6532554B1 (en) Network event correlation system using formally specified models of protocol behavior
US5727157A (en) Apparatus and method for determining a computer network topology
US6813634B1 (en) Network fault alerting system and method
AU675362B2 (en) Determination of network topology
KR100617344B1 (en) Reliable fault resolution in a cluster
US8812649B2 (en) Method and system for processing fault alarms and trouble tickets in a managed network services system
US8738760B2 (en) Method and system for providing automated data retrieval in support of fault isolation in a managed services network
EP0898822B1 (en) Method and apparatus for integrated network management and systems management in communications networks
US8732516B2 (en) Method and system for providing customer controlled notifications in a managed network services system
US7281170B2 (en) Help desk systems and methods for use with communications networks
US7974219B2 (en) Network troubleshooting using path topology
US8676945B2 (en) Method and system for processing fault alarms and maintenance events in a managed network services system
US8924533B2 (en) Method and system for providing automated fault isolation in a managed services network
AU700018B2 (en) Method and apparatus for testing the responsiveness of a network device
EP0403414A2 (en) Method and system for automatic non-disruptive problem determination and recovery of communication link problems
AU2001241700B2 (en) Multiple network fault tolerance via redundant network control
CN115102865A (en) Network security device topology management method and system
Cisco Troubleshooting DECnet Connectivity
Cisco Troubleshooting Overview
Cho et al. A study on the classified model and the agent collaboration model for network configuration fault management
Katchabaw et al. Policy-driven fault management in distributed systems
Agre A message-based fault diagnosis procedure
GECKIL Apparatus and method for determining a computer network topology
Jones et al. Towards decentralized network management and reliability
Watanabe et al. A reliability design method for private networks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWE Wipo information: entry into national phase

Ref document number: 2000983761

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2000983761

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWG Wipo information: grant in national office

Ref document number: 2000983761

Country of ref document: EP