WO2001039461A2 - Network event correlation system using protocol models

Network event correlation system using protocol models

Info

Publication number
WO2001039461A2
Authority
WO
WIPO (PCT)
Prior art keywords
network
node
model
network nodes
alarm
Prior art date
Application number
PCT/US2000/032214
Other languages
French (fr)
Other versions
WO2001039461A3 (en)
Inventor
Deepak K. Kakadia
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to EP00983761A priority Critical patent/EP1234407B1/en
Priority to AU20473/01A priority patent/AU2047301A/en
Priority to DE60032801T priority patent/DE60032801D1/en
Publication of WO2001039461A2 publication Critical patent/WO2001039461A2/en
Publication of WO2001039461A3 publication Critical patent/WO2001039461A3/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/34 Signalling channels for network management communication

Abstract

An event correlation system for network management has computer code for at least one model of a process to be run on a node of a network, where said process is intended to be run on a different node of the network than the model. The correlation system also has code for an alarm monitor comparing the apparent behavior of the model to actual events generated by the process and generating alarm messages when the actual events of the process do not match expected events from the model. It also has code for an event correlation utility; and means for communicating alarms from the alarm monitor to the event correlation utility.

Description

NETWORK EVENT CORRELATION SYSTEM USING FORMALLY SPECIFIED MODELS OF PROTOCOL BEHAVIOR
FIELD OF THE INVENTION
The invention relates to the field of network administration, and in particular to the field of software tools for assisting network management through diagnosis of network problems so that appropriate enhancements and repairs may be made.
BACKGROUND OF THE INVENTION
Many voice and data networks occasionally have problems. These problems may include a noisy or failed link; an overloaded link; a damaged cable, repeater, switch, router, or channel bank causing multiple failed links; or a crashed or overloaded server; a link being a channel interconnecting two or more nodes of the network. Networks may also have software problems: a packet may have an incorrect format or be excessively delayed, or packets may arrive out of order, corrupt, or missing. Routing tables may become corrupt, and critical name servers may crash. Problems may be intermittent or continuous; intermittent problems tend to be more difficult to diagnose than continuously present ones. Networks are often distributed over wide areas, frequently involving repeaters, wires, or fibers in remote locations, and involve a high degree of concurrency. The wide distribution of network hardware and software makes diagnosis difficult because running diagnostic tests may require extensive travel, delay, and expense. Many companies operating large networks have installed network operations centers where network managers monitor network activity by observing reported alarms. These managers attempt to relate the alarms to common causes and dispatch appropriate repair personnel to the probable location of the fault inducing the alarms.
Y. Nygate, in Event Correlation using Rule and Object Based Techniques, Proceedings of the Fourth International Symposium on Integrated Network Management, Chapman & Hall, London, 1995, pp. 279-289, reports that typical network operations centers receive thousands of alarms each hour. This large number of alarms results because a single fault in a data or telecommunications network frequently induces the reporting of many alarms to network operators. For example, a failed repeater may cause alarms at the channel banks, nodes, or hosts at each end of the affected link through the repeater, as well as alarms from the switches, routers, and servers attempting to route data over the link. Each server may also generate alarms at the hardware level and at various higher levels; for example, a server running TCP-IP over Ethernet may generate an error at the Ethernet level, then at the IP level, then at the TCP level, and again at the application level.
Alarms from higher levels of protocol may be generated by network nodes not directly connected to a problem node. For example, when node C's connection to node D fails, alarms may arise from the TCP or application layer on node A of a network, where node A connects only to node B, node B routes packets on to node C, and node D was being used by an application on node A.
Alarms may be routed to a network management center through the network itself, if sufficient nodes and links remain operational to transport them, or through a separate control and monitor data network. The telephone network, for example, frequently passes control information relating to telephone calls - such as the number dialed and the originating number - over a control and monitor data network separate from the trunks over which calls are routed. It is known that many networks have potential failure modes such that some alarms will be unable to reach the network management center.
Loss of one link may cause overloading of remaining links or servers in the network, causing additional alarms for slow or denied service. The multitude of alarms can easily obscure the real cause of a fault, or mask individual faults in the profusion of alarms caused by other problems. This phenomenon increases the skill, travel, and time needed to resolve failures.
The number of alarms reported from a single fault tends to increase with the complexity of the network, which grows at a rate greater than linear in the number of nodes. For example, the complexity of the network of networks known as the Internet, and the potential for reported faults, increase exponentially with the number of servers and computers connected to it, or to the networks connected to the Internet.
Event correlation is the process of efficiently determining the occurrence of and the source of problems in a complex system or network based on observable events.
Yemini, in United States Patent No. 5,661,668, used a rule-based expert system for event correlation. Yemini's approach requires construction of a causality matrix, which relates observable symptoms to likely problems in the system. A weakness of this approach is that a tremendous amount of expertise is needed to cover all cases and, as the number of network nodes, and therefore the number of permutations and combinations, increases, the complexity of the causality matrix increases exponentially.
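For illustration, the following is a minimal sketch in C (the implementation language this description later names for its own models) of the codebook idea behind such a causality matrix. The symptoms, problems, and best-match rule are hypothetical stand-ins, not Yemini's actual method; the point is only that observed symptom patterns are matched against columns of the matrix.

    /* Minimal sketch of the codebook idea: codebook[s][p] == 1 when
     * problem p is expected to raise symptom s. Symptom and problem
     * names are hypothetical. */
    #include <stdio.h>

    #define N_SYMPTOMS 4
    #define N_PROBLEMS 3

    static const char *problems[N_PROBLEMS] = {
        "link failure", "router crash", "server overload"
    };

    static const int codebook[N_SYMPTOMS][N_PROBLEMS] = {
        {1, 1, 0},   /* host unreachable        */
        {1, 0, 0},   /* ethernet frames unacked */
        {0, 1, 0},   /* routing updates absent  */
        {0, 0, 1},   /* slow responses          */
    };

    /* Return the problem whose expected symptom pattern best matches
     * the observed symptom vector. */
    static int correlate(const int observed[N_SYMPTOMS]) {
        int best = 0, best_score = -1;
        for (int p = 0; p < N_PROBLEMS; p++) {
            int score = 0;
            for (int s = 0; s < N_SYMPTOMS; s++)
                score += (observed[s] == codebook[s][p]);
            if (score > best_score) { best_score = score; best = p; }
        }
        return best;
    }

    int main(void) {
        int observed[N_SYMPTOMS] = {1, 1, 0, 0};
        printf("probable root cause: %s\n", problems[correlate(observed)]);
        return 0;
    }

Every new node or failure mode adds rows and columns that an expert must fill in, which illustrates why the matrix, and the expertise needed to maintain it, grow quickly as the network grows.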
A first node in a network may run one or several processes. These processes run concurrently with, and interact with, additional processes running on each node with which the first node communicates. One of the small minority of computer languages capable of specifying and modeling multiple concurrent processes and their interactions is C.A.R. Hoare's Communicating Sequential Processes, or "CSP", as described in M. Hinchey and S. Jarvis, Concurrent Systems: Formal Development in CSP, The McGraw-Hill International Series in Software Engineering, London, 1995.
The CSP language allows a programmer to formally describe the response of processes to stimuli and how those processes are interconnected. It is known that timed derivatives of CSP can be used to model network processes; Chapter 7 of Hinchey et al. proposes using a CSP model to verify a reliable network protocol.
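For illustration, a minimal sketch of the form such a timed model might take after translation into C, as this specification later describes for its own models. The stop-and-wait sender, its states and events, and the timeout bound are assumptions for the example, not taken from this patent or from Hinchey's text.

    /* Minimal sketch of a timed model of a stop-and-wait sender:
     * after a SEND, the specification allows at most TIMEOUT ticks
     * before an ACK must arrive. */
    #include <stdio.h>

    #define TIMEOUT 3

    typedef enum { IDLE, AWAIT_ACK } state_t;
    typedef enum { EV_SEND, EV_ACK, EV_TICK } event_t;

    typedef struct { state_t state; int ticks; } sender_model;

    /* Advance the model by one event; return 1 on a spec violation. */
    static int model_step(sender_model *m, event_t ev) {
        if (m->state == IDLE && ev == EV_SEND) {
            m->state = AWAIT_ACK;
            m->ticks = 0;
        } else if (m->state == AWAIT_ACK) {
            if (ev == EV_ACK)
                m->state = IDLE;
            else if (ev == EV_TICK && ++m->ticks > TIMEOUT)
                return 1;                 /* ack overdue */
        }
        return 0;
    }

    int main(void) {
        sender_model m = { IDLE, 0 };
        event_t trace[] = { EV_SEND, EV_TICK, EV_TICK, EV_TICK, EV_TICK };
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
            if (model_step(&m, trace[i]))
                printf("deviation: no ACK within %d ticks\n", TIMEOUT);
        return 0;
    }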
Timed CSP processes may be used to model network node behavior implemented as logic gates as well as processes running on a processor in a network node. For purposes of this application, a process running on a network node may be implemented as logic gates, as a process running in firmware on a microprocessor, microcontroller, or CPU chip, or as a hybrid of the two.
Other languages that provide for concurrency may be used to model network node behavior. For example, Verilog and VHDL both provide for concurrent execution of multiple modules and for communication between modules. Further, Occam incorporates an implementation of Timed CSP.
With the recent and continuing expansion of computing, data communications, and communications networks, the volume of repetitive, duplicate, and superfluous alarm messages reaching network operations centers makes understanding root causes of network problems difficult, and has the potential to overwhelm system operators. It is desirable that these alarm messages be automatically correlated to make root causes of network problems more apparent to the operators than with prior event correlation tools.
SUMMARY OF THE INVENTION
Each node A in a network has a model, based upon a formal specification written in a formal specification language such as a timed derivative of CSP, of the expected behavior of each node B to which it is connected. Each node of the network also has an alarm monitor, again written in a formal specification language, which monitors at least one process running on that node. The model or models of expected behavior of each node B are compared with the actual responses of each node B by an additional alarm monitor. Alarms from the alarm monitors are filtered by a local alarm filter in the node, such that the node reports what it believes to be the root cause of a failure to a network management center. In this way the number of meaningless alarms generated and reported to the network management center is reduced and the accuracy of the alarms is improved.
Alarms reported to the network management center are filtered again by an intelligent event correlation utility to reduce redundant fault information before presentation to network administration staff.
The foregoing and other features, utilities and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention as illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates a network of computers for which an event correlation system may be useful;
Figure 2, a diagram of the hierarchy of network protocol processes running on a processor node of a network, as is known in the prior art;
Figure 3, a diagram illustrating a hierarchy of protocol processes running on a processor node and a router node of a network, having a formal model of behavior of additional nodes, alarm monitors for protocol layers, and an alarm filter as per the current invention; and
Figure 4, a diagram illustrating a hierarchy of protocol processes running on a processor node, having a multiplicity of processes at certain levels, with multiple instantiations of formal models, alarm monitors for protocol layers, and an alarm filter as per the current invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Typically, large data networks have at least one network management node 100 (Figure 1), connected into the network at one or more points. The network may have one or more routers 101, 101A, and 110 capable of passing packets between separate sections of network 102, one or more servers 103 and 111, and one or more hubs 104. Failure of a router 110, a link 112, or a hub 104 can cause isolation of particular servers 111 or user workstations (not shown) from the network, causing loss or degradation of service to users. It is often desired that failures be reported through alarm messages to the network management node 100, so that corrective action may be taken and service restored quickly. Failed hardware is not always capable of sending alarm messages; failures must often be inferred from alarm messages generated by operating nodes affected by the failure. For example, should link 112 fail, hub 104 may report a loss of communication with server 111. Additionally, other nodes 103 that were communicating with the now-isolated server 111 may report loss of communication, or that server 111, which may be a critical name server for Internet communications, is no longer reachable. To reduce the likelihood that a functioning server will be unable to forward error messages to the network management station 100, some networks are organized with a separate management subnetwork 120 for diagnostic, management, and control messages, which subnetwork often operates at a lower speed than many of the data links of the network. For example, should node 103 detect a failure of a network adapter that connects it to data network segment 102, it may be possible for an error message from node 103 to reach the network management station 100 by way of the management subnetwork 120. Many networks interconnect with the Internet through one or more firewall or other dedicated routing and filtering nodes 121.
An application 201 (Figure 2) running on a first network node 203 typically communicates with the network through several layers of network protocol. When application 201 requests communication with a second application 202 on another network node 210, it passes requests to a top layer of network protocol handler, which may be a TCP protocol handler 205. This layer modifies and encapsulates the request into another level of request passed to a lower-level protocol handler, which may be an IP handler 206, which in turn modifies and encapsulates the request into a packet passed on to a physical layer protocol handler, which may be an Ethernet physical layer handler 207. Other physical layer handlers may also be used for any available computer interconnection scheme, including but not limited to CI, Fibre Channel, 100-BaseT, 10-BaseT, a T-1 line, a DSL line, ATM, Token Ring, or a modem onto an analog phone line. Some network technologies are known to utilize seven or more layers of protocol handlers between an application process and the physical link. Protocols other than TCP-IP may be used, including IPX-SPX and NetBEUI; it is common for routers to translate protocols so that systems from different vendors may communicate with each other. The physical layer link may connect directly or through a hub to the node 210, on which the target application 202 runs, or may frequently be passed through a router 211 or firewall. A firewall is a special case of a router in that packets are filtered in addition to being passed between network segments. In router 211, the packet is received through a physical layer handler, here Ethernet handler 215, which passes the packet to an IP protocol handler 216 that decodes the IP address of the packet. The IP address is then mapped by a routing module 217 to determine where the packet should next be sent; the packet is then modified as needed and returned through an IP protocol handler 216 to a physical layer handler 218.
When the packet reaches the node 210 on which the target application 202 runs, the packet is similarly processed through a physical layer handler 220, then through higher levels of protocol handler including IP and TCP handlers 221 and 222 before being passed to the target application 202.
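For illustration, a minimal sketch in C of the encapsulation path just described: each layer prepends its own header before handing the buffer down. The bracketed tags are simplified placeholders, not the real TCP, IP, or Ethernet header formats.

    /* Minimal sketch of layered encapsulation: each handler wraps the
     * buffer it receives from the layer above. */
    #include <stdio.h>
    #include <string.h>

    static size_t wrap(char *buf, size_t len, const char *hdr) {
        size_t h = strlen(hdr);
        memmove(buf + h, buf, len);   /* make room for the header */
        memcpy(buf, hdr, h);
        return len + h;
    }

    int main(void) {
        char pkt[256] = "app-request";
        size_t len = strlen(pkt);
        len = wrap(pkt, len, "[TCP]");   /* TCP handler, like 205 */
        len = wrap(pkt, len, "[IP]");    /* IP handler, like 206 */
        len = wrap(pkt, len, "[ETH]");   /* physical handler, like 207 */
        printf("%.*s\n", (int)len, pkt); /* [ETH][IP][TCP]app-request */
        return 0;
    }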
In a preferred embodiment, the protocol handlers 205, 206, and 207 of the first network node 203 have been instrumented to provide state information to drive a formally specified model 301 (Figure 3) of the node's 302 own TCP protocol layer 303. Similarly, the protocol handlers have been instrumented to drive a formally specified model 305 of the node's own IP layer 306. An event comparator and alarm monitor 307 compares signals passed between the TCP 303, IP 306, and Ethernet 308 handlers with those of their corresponding formally specified models 301 and 305, generating alarm messages when significant discrepancies arise. These alarm messages are passed to an event correlator module 310 within the node 302.
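For illustration, a minimal sketch in C of the comparator pattern just described, with the formally specified model stubbed out as a function returning its next expected event. The event structure, codes, and function names are hypothetical.

    /* Minimal sketch of an event comparator and alarm monitor such
     * as 307: each inter-layer event reported by the instrumented
     * handlers is checked against the model's prediction. */
    #include <stdio.h>

    typedef struct { int layer; int code; } net_event;

    /* Stand-in for the formally specified model's next expected event. */
    static net_event model_expected_next(void) {
        net_event e = { 2 /* IP */, 7 /* forward packet */ };
        return e;
    }

    static void post_alarm(net_event actual, net_event expected) {
        printf("alarm: layer %d event %d, expected layer %d event %d\n",
               actual.layer, actual.code, expected.layer, expected.code);
    }

    /* Called by the instrumented protocol handlers on every event. */
    static void comparator_observe(net_event actual) {
        net_event expected = model_expected_next();
        if (actual.layer != expected.layer || actual.code != expected.code)
            post_alarm(actual, expected);
    }

    int main(void) {
        net_event ok = {2, 7}, bad = {2, 9};
        comparator_observe(ok);   /* silent: matches the model */
        comparator_observe(bad);  /* raises an alarm */
        return 0;
    }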
Similarly, a model 315 of the Ethernet protocol layer and a model 316 of the IP protocol layer may be invoked for each other machine with which the node 302 is in current communication. An event comparator and alarm monitor 317 watches for significant deviations between the expected behavior of the target and its actual behavior, such as repeated failures of a target to acknowledge a packet despite retries.
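For illustration, a minimal sketch in C of such a per-target deviation check; the retry bound and all names are hypothetical assumptions for the example.

    /* Minimal sketch: the model tolerates a bounded number of
     * unacknowledged retries before flagging a deviation. */
    #include <stdio.h>

    #define MAX_RETRIES 3

    typedef struct { int pending_retries; } target_model;

    /* Returns 1 when the target deviates from expected behavior. */
    static int on_retry(target_model *t) {
        return ++t->pending_retries > MAX_RETRIES;
    }

    static void on_ack(target_model *t) { t->pending_retries = 0; }

    int main(void) {
        target_model peer = {0};
        for (int i = 0; i < 5; i++) {
            if (on_retry(&peer)) {
                printf("alarm: peer unresponsive after %d retries\n",
                       peer.pending_retries);
                on_ack(&peer);  /* reset after reporting, as a late ack would */
            }
        }
        return 0;
    }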
It is advantageous if at least some of the network nodes, such as router 320, are modified in a similar manner. For example, a router may have a formally specified model 321 of its own IP layer 322, with an event comparator and alarm monitor 323 observing deviations between expected and actual behavior. A router may have instances of a model 326 of each node's physical layer with which it communicates; similarly it may have models 327 of IP and TCP (not shown) layers, with an event comparator and alarm monitor 328 to track differences between reality and model behavior.
Alarm messages from the router's event comparators and alarm monitors 323 and 328 are transmitted to an event correlator 330, where they are filtered. Filtered alarms, with precise fault identification, are then addressed to the network management station 100 and passed through the router's TCP 331, IP 322, and an appropriate physical layer 332 for transmission. These alarms typically report symptoms: for example, a particular host may be unreachable, or all transmissions attempted over a link may fail to be acknowledged at the Ethernet level.
In the preferred embodiment, the local event correlator 310 or 330 is a rules-based correlation engine that generally accepts the earliest and lowest-level alarm caused by a fault as the root cause of the alarms. Since each layer supports the layer above it, the lowest layer reporting an alarm in a given communications context is the most likely source of the fault. The root cause as determined by these correlators is transmitted to the network management node 100 for further analysis and dispatch of repair personnel as needed. Once formally specified and verified in a language that supports concurrency, such as Timed CSP or Verilog, individual models, such as the model of the TCP layer 301, may be manually or automatically translated into an implementation language like C. In the preferred embodiment, the models 301, 305, 315, 316, 321, and 326 have been translated into C.
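For illustration, a minimal sketch in C of the earliest, lowest-layer rule just described: among the alarms gathered for one communications context, the alarm from the lowest protocol layer wins, with time as the tie-breaker. The alarm record layout and layer numbering are assumptions for the example.

    /* Minimal sketch of the local correlation rule: lowest layer
     * first, earliest timestamp breaking ties. */
    #include <stdio.h>

    typedef struct { int layer; long t; const char *text; } alarm_t;

    static const alarm_t *root_cause(const alarm_t *a, int n) {
        const alarm_t *best = &a[0];
        for (int i = 1; i < n; i++)
            if (a[i].layer < best->layer ||
                (a[i].layer == best->layer && a[i].t < best->t))
                best = &a[i];
        return best;
    }

    int main(void) {
        alarm_t batch[] = {
            {4 /* app */, 120, "application timeout"},
            {3 /* tcp */, 110, "tcp retransmit limit"},
            {1 /* eth */, 100, "ethernet not acknowledged"},
        };
        printf("report: %s\n", root_cause(batch, 3)->text);
        return 0;
    }

Only the single reported root cause travels to the management node, which is how the node-local filter cuts the alarm volume described in the background.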
In a first alternative embodiment of a server according to the present invention, a model 400 (Figure 4) of the interactions of a critical application 401 with the TCP-IP layers has also been constructed. This application model 400 is used in conjunction with an event monitor and comparator 402 to generate alarm messages should the critical application 401 fail.
In the alternative embodiment of Figure 4, there is a plurality of applications running; application 405 is a separate application from critical application 401. In this case, each application may also have its own instances of the TCP protocol layer 410 and 411, TCP event monitor and comparator 412 and 413, and TCP models 414 and 415. This alternative embodiment is implemented on the Solaris™ (a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries) multithreaded operating system, which also permits multiple instantiations of the IP layer as well as the TCP layer, so that a single server or router may appear at several IP addresses in the network. In this embodiment, a first IP layer 420 is monitored by an event monitor and comparator 421 and an IP model 422 for errors. Errors generated by all the internal event monitors and comparators 402, 413, 412, 421, and the target event comparators and alarm monitors 425, 426, and 427 are filtered by an Event Correlator 430, a separate instance of the TCP layer 431, and a separate IP layer 432, such that the server alarm monitor and control appears at a separate IP address from the applications 405 and 401. While it is anticipated that servers and routers will often have multiple network adapters, and hence multiple physical layer drivers, this particular embodiment has a single physical layer driver, here Ethernet layer 440.
The alternative embodiment of Figure 4 also embodies a plurality of target physical layer models, here Ethernet layer models 450, 451, and 452, and target IP layer models 455, 456, and 457, organized as paired Ethernet and IP layer models. Each pair models the expected reactions of a machine, or of a process on a machine, with which the server is in communication; deviations from expected behavior also generate alarm messages that are fed to the local event correlator 430.
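For illustration, a minimal sketch in C of keeping one paired physical-layer/IP model per communicating peer, as with models 450 through 457: a pair is looked up by peer address and instantiated on first contact. The fixed-size table and all names are hypothetical simplifications.

    /* Minimal sketch: one Ethernet/IP model pair per peer, created
     * lazily when traffic to that peer is first observed. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PEERS 8

    typedef struct { int eth_state; int ip_state; } model_pair;
    typedef struct { char addr[16]; model_pair models; int used; } peer_slot;

    static peer_slot peers[MAX_PEERS];

    /* Find the model pair for a peer, instantiating it on first contact. */
    static model_pair *models_for_peer(const char *addr) {
        for (int i = 0; i < MAX_PEERS; i++)
            if (peers[i].used && strcmp(peers[i].addr, addr) == 0)
                return &peers[i].models;
        for (int i = 0; i < MAX_PEERS; i++)
            if (!peers[i].used) {
                peers[i].used = 1;
                snprintf(peers[i].addr, sizeof peers[i].addr, "%s", addr);
                return &peers[i].models;
            }
        return NULL;  /* table full */
    }

    int main(void) {
        if (models_for_peer("10.0.0.7"))
            printf("model pair instantiated for peer 10.0.0.7\n");
        return 0;
    }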
It is expected that the models, as with TCP model 414 and IP model 422, may be implemented as separate code and as separate threads from the corresponding TCP and IP protocol layers 430 and 420. This is a nonintrusive implementation of the invention, in that the code for the TCP and IP protocol layers need not be modified for use with the invention.
In a second alternative embodiment, an intrusive implementation of the invention, code for TCP model 414 and IP model 422 is embedded in the code of the corresponding TCP 430 and IP 420 protocol layers. This implementation is advantageous for speed, but requires that the code of the TCP and IP protocol layers, which are often components of the operating system, be modified.
In the alternative embodiment of Figure 4, the Event Correlator module 412 also has a thread activated whenever TCP and IP protocol layer tasks are created or destroyed. When these layer tasks are created, the Event Correlator creates appropriate model tasks and configures them as required; and when these layer tasks are destroyed, the Event Correlator destroys the corresponding model tasks.
While the invention has been particularly shown and described with reference to a preferred embodiment thereof, it will be understood by those skilled in the art that various other changes in the form and details may be made without departing from the spirit and scope of the invention.
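For illustration, a minimal sketch of the task-lifecycle hook described above, using POSIX threads as a stand-in for the Solaris threading named earlier: creating a protocol-layer task starts a matching model task, and destroying the layer task tears the model task down. All names are hypothetical.

    /* Minimal sketch of the lifecycle hook (build with: cc -pthread). */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct { int layer_id; volatile int stop; pthread_t tid; } model_task;

    static void *model_main(void *arg) {
        model_task *m = (model_task *)arg;
        while (!m->stop)
            ;  /* here the model would be stepped against observed events */
        printf("model task for layer %d destroyed\n", m->layer_id);
        return NULL;
    }

    /* Called when a TCP or IP layer task is created. */
    static void on_layer_created(model_task *m, int layer_id) {
        m->layer_id = layer_id;
        m->stop = 0;
        pthread_create(&m->tid, NULL, model_main, m);
    }

    /* Called when the corresponding layer task is destroyed. */
    static void on_layer_destroyed(model_task *m) {
        m->stop = 1;                 /* signal the model task to exit */
        pthread_join(m->tid, NULL);  /* and wait for it to finish */
    }

    int main(void) {
        model_task m;
        on_layer_created(&m, 3);     /* e.g. a new TCP layer instance */
        on_layer_destroyed(&m);
        return 0;
    }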

Claims

CLAIMS
What is claimed is:
1. An event correlation system for network management comprising: computer readable code for at least one model of a process to be run on a node of a network, where said process is intended to be run on a different node of the network than the model; computer readable code for an alarm monitor comparing the apparent behavior of the model to actual events generated by the process; computer readable code for an event correlation utility; and means for communicating alarms from the alarm monitor to the event correlation utility.
2. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, at least one model of an additional process running on said node, where said additional process may or may not be running on said node, and an alarm monitor comparing the apparent behavior of the model to actual events of the additional process; a plurality of links, each link connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; and at least one event correlation utility running on a node of the network for filtering alarms generated by the alarm monitors of the plurality of network nodes.
3. The network of Claim 2 wherein the model of the additional process is written in a formal specification language.
4. The network of Claim 2 wherein the formal specification language is selected from the set consisting of Occam, Timed CSP, Verilog, and VHDL.
5. The network of Claim 2, wherein the at least one model of an additional process running on said node comprises at least a first model and a second model, where the first model is a model of a layer of a communications protocol, and the second model is a model of a higher layer of the communications protocol.
6. The network of Claim 5, wherein the communications protocol is TCP-IP, and wherein the layer of a communications protocol is an IP layer, and wherein the higher layer of the communications protocol is TCP.
7. The network of Claim 5, wherein the layers of the communications protocol having corresponding models are a subset of a set of layers of communications protocol on the network nodes on which the models and their corresponding communications protocol layers run.
8. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, at least one alarm monitor for generating alarm messages, and at least one event correlator for filtering the alarm messages, such that fewer alarm messages are passed to a network management node of the network than are generated by the at least one alarm monitor; a plurality of links, each link connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; and at least one event correlation utility running on a network management node of the network for further filtering alarms passed to the network management station by the plurality of network nodes.
9. The network of Claim 8 wherein the at least two of the plurality of network nodes further comprise a model of at least one of the at least one process, wherein the model of the additional process is originally written in a formal specification language.
10. The network of Claim 8 wherein the formal specification language is selected from the set consisting of Occam, Timed CSP, Verilog, and VHDL.
11. The network of Claim 8 wherein the at least two of the plurality of network nodes further comprise a model of at least one of the at least one process, wherein the model of the process is coupled to at least one of the at least one alarm monitors, the alarm monitor being coupled to generate an alarm message when the process does not behave according to the model.
12. The network of Claim 8 wherein at least one of the at least two of the plurality of network nodes further comprises a second model, the second model being of a second at least one process running on a node other than the node on which the model is run, wherein the second model is coupled to a second alarm monitor, the second alarm monitor being coupled to generate an alarm message when the second process does not behave according to the second model, the alarm message being filtered by the at least one event correlator prior to transmission to a network management node of the network.
13. The network of Claim 12, further comprising a separate subnetwork for transmission of alarm messages to the network management node of the network, and for transmission of network control messages from the network management node of the network to at least two of the plurality of network nodes.
14. A network, comprising: a plurality of network nodes, where at least two of the plurality of network nodes further comprise: at least one process running on a node of the plurality of network nodes, means for modeling an additional process running on said node, where said additional process may or may not be running on said node, and means for generating alarms when the means for modeling an additional process mismatches actual events of the additional process; a plurality of links, each link comprising means for connecting a process of the at least one process on one of the plurality of network nodes to a process on another node, such that each of the plurality of network nodes is connected to at least one other of the plurality of network nodes; means for transmission of alarms from the plurality of network nodes to a network management network node; and means for correlating the alarms running on the network management network node to filter alarms generated by the alarm monitors of the plurality of network nodes.
15. The network of Claim 14, wherein the means for generating alarms further comprises a local means for correlating events such that the alarms transmitted to the network management network node are filtered to indicate a root cause of a problem.
PCT/US2000/032214 1999-11-29 2000-11-22 Network event correlation system using protocol models WO2001039461A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00983761A EP1234407B1 (en) 1999-11-29 2000-11-22 Network event correlation system using protocol models
AU20473/01A AU2047301A (en) 1999-11-29 2000-11-22 Network event correlation system using formally specified models of protocol behavior
DE60032801T DE60032801D1 (en) 1999-11-29 2000-11-22 DEPENDENT NETWORK EXAMINATION DEVICE SYSTEM USING PROTOCOL MODELS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/450,296 US6532554B1 (en) 1999-11-29 1999-11-29 Network event correlation system using formally specified models of protocol behavior
US09/450,296 1999-11-29

Publications (2)

Publication Number Publication Date
WO2001039461A2 true WO2001039461A2 (en) 2001-05-31
WO2001039461A3 WO2001039461A3 (en) 2001-12-13

Family

ID=23787529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/032214 WO2001039461A2 (en) 1999-11-29 2000-11-22 Network event correlation system using protocol models

Country Status (6)

Country Link
US (1) US6532554B1 (en)
EP (1) EP1234407B1 (en)
AT (1) ATE350830T1 (en)
AU (1) AU2047301A (en)
DE (1) DE60032801D1 (en)
WO (1) WO2001039461A2 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7161945B1 (en) * 1999-08-30 2007-01-09 Broadcom Corporation Cable modem termination system
US6789257B1 (en) * 2000-04-13 2004-09-07 International Business Machines Corporation System and method for dynamic generation and clean-up of event correlation circuit
US7111205B1 (en) * 2000-04-18 2006-09-19 Siemens Communications, Inc. Method and apparatus for automatically reporting of faults in a communication network
US6701459B2 (en) * 2000-12-27 2004-03-02 Egurkha Pte Ltd Root-cause approach to problem diagnosis in data networks
US20040017779A1 (en) * 2002-07-25 2004-01-29 Moxa Technologies Co., Ltd. Remote equipment monitoring system with active warning function
JP3983138B2 (en) * 2002-08-29 2007-09-26 富士通株式会社 Failure information collection program and failure information collection device
US20040049714A1 (en) * 2002-09-05 2004-03-11 Marples David J. Detecting errant conditions affecting home networks
US7548897B2 (en) * 2002-10-02 2009-06-16 The Johns Hopkins University Mission-centric network defense systems (MCNDS)
EP1460801B1 (en) * 2003-03-17 2006-06-28 Tyco Telecommunications (US) Inc. System and method for fault diagnosis using distributed alarm correlation
US7389345B1 (en) 2003-03-26 2008-06-17 Sprint Communications Company L.P. Filtering approach for network system alarms
US7711811B1 (en) * 2003-03-26 2010-05-04 Sprint Communications Company L.P. Filtering approach for network system alarms based on lifecycle state
US7421493B1 (en) 2003-04-28 2008-09-02 Sprint Communications Company L.P. Orphaned network resource recovery through targeted audit and reconciliation
CN100416493C (en) * 2003-06-05 2008-09-03 中兴通讯股份有限公司 Apparatus and method for realizing inserting multiple alarming process
US8694475B2 (en) * 2004-04-03 2014-04-08 Altusys Corp. Method and apparatus for situation-based management
US20050222895A1 (en) * 2004-04-03 2005-10-06 Altusys Corp Method and Apparatus for Creating and Using Situation Transition Graphs in Situation-Based Management
US7788109B2 (en) * 2004-04-03 2010-08-31 Altusys Corp. Method and apparatus for context-sensitive event correlation with external control in situation-based management
US20050222810A1 (en) * 2004-04-03 2005-10-06 Altusys Corp Method and Apparatus for Coordination of a Situation Manager and Event Correlation in Situation-Based Management
GB0410047D0 (en) 2004-05-05 2004-06-09 Silverdata Ltd An analytical software design system
EP2677691A1 (en) 2004-05-25 2013-12-25 Rockstar Consortium US LP Connectivity Fault Notification
US7719965B2 (en) * 2004-08-25 2010-05-18 Agilent Technologies, Inc. Methods and systems for coordinated monitoring of network transmission events
US7408441B2 (en) * 2004-10-25 2008-08-05 Electronic Data Systems Corporation System and method for analyzing user-generated event information and message information from network devices
US20060168170A1 (en) * 2004-10-25 2006-07-27 Korzeniowski Richard W System and method for analyzing information relating to network devices
US7408440B2 (en) * 2004-10-25 2008-08-05 Electronics Data Systems Corporation System and method for analyzing message information from diverse network devices
US7792049B2 (en) * 2005-11-30 2010-09-07 Novell, Inc. Techniques for modeling and evaluating protocol interactions
US8205215B2 (en) * 2007-05-04 2012-06-19 Microsoft Corporation Automated event correlation
US7958386B2 (en) * 2007-12-12 2011-06-07 At&T Intellectual Property I, L.P. Method and apparatus for providing a reliable fault management for a network
US8533533B2 (en) * 2009-02-27 2013-09-10 Red Hat, Inc. Monitoring processes via autocorrelation
US9723501B2 (en) 2014-09-08 2017-08-01 Verizon Patent And Licensing Inc. Fault analytics framework for QoS based services

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69126666T2 (en) * 1990-09-17 1998-02-12 Cabletron Systems Inc NETWORK MANAGEMENT SYSTEM WITH MODEL-BASED INTELLIGENCE
US5473596A (en) 1993-12-09 1995-12-05 At&T Corp. Method and system for monitoring telecommunication network element alarms
US5528516A (en) 1994-05-25 1996-06-18 System Management Arts, Inc. Apparatus and method for event correlation and problem reporting
US5539877A (en) 1994-06-27 1996-07-23 International Business Machine Corporation Problem determination method for local area network systems
US5777549A (en) * 1995-03-29 1998-07-07 Cabletron Systems, Inc. Method and apparatus for policy-based alarm notification in a distributed network management environment
US6017143A (en) * 1996-03-28 2000-01-25 Rosemount Inc. Device in a process system for detecting events
US6208955B1 (en) * 1998-06-12 2001-03-27 Rockwell Science Center, Llc Distributed maintenance system based on causal networks
US6360333B1 (en) * 1998-11-19 2002-03-19 Compaq Computer Corporation Method and apparatus for determining a processor failure in a multiprocessor computer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GARDNER R D ET AL: "METHODS AND SYSTEMS FOR ALARM CORRELATION" COMMUNICATIONS: THE KEY TO GLOBAL PROSPERITY. GLOBECOM 1996. LONDON, NOV. 18 - 22, 1996, GLOBAL TELECOMMUNICATIONS CONFERENCE (GLOBECOM), NEW YORK, IEEE, US, vol. 1, 18 November 1996 (1996-11-18), pages 136-140, XP000742141 ISBN: 0-7803-3337-3 *
LEHMANN A ET AL: "KNOWLEDGE-BASED ALARM SURVEILLANCE FOR TMN" PROCEEDINGS OF THE 1996 IEEE FIFTEENTH ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON COMPUTERS AND COMMUNICATIONS. SCOTTSDALE, MAR. 27 - 29, 1996, PROCEEDINGS OF THE IEEE ANNUAL INTERNATIONAL PHOENIX CONFERENCE ON COMPUTERS AND COMMUNICATIONS, NEW YORK,, vol. CONF. 15, 27 March 1996 (1996-03-27), pages 494-500, XP000594817 ISBN: 0-7803-3256-3 *

Also Published As

Publication number Publication date
EP1234407B1 (en) 2007-01-03
EP1234407A2 (en) 2002-08-28
DE60032801D1 (en) 2007-02-15
AU2047301A (en) 2001-06-04
ATE350830T1 (en) 2007-01-15
WO2001039461A3 (en) 2001-12-13
US6532554B1 (en) 2003-03-11

Similar Documents

Publication Publication Date Title
US6532554B1 (en) Network event correlation system using formally specified models of protocol behavior
US5727157A (en) Apparatus and method for determining a computer network topology
US6813634B1 (en) Network fault alerting system and method
AU675362B2 (en) Determination of network topology
KR100617344B1 (en) Reliable fault resolution in a cluster
US8812649B2 (en) Method and system for processing fault alarms and trouble tickets in a managed network services system
US8738760B2 (en) Method and system for providing automated data retrieval in support of fault isolation in a managed services network
EP0898822B1 (en) Method and apparatus for integrated network management and systems management in communications networks
US8732516B2 (en) Method and system for providing customer controlled notifications in a managed network services system
US7281170B2 (en) Help desk systems and methods for use with communications networks
US7974219B2 (en) Network troubleshooting using path topology
US8676945B2 (en) Method and system for processing fault alarms and maintenance events in a managed network services system
US8924533B2 (en) Method and system for providing automated fault isolation in a managed services network
AU700018B2 (en) Method and apparatus for testing the responsiveness of a network device
EP0403414A2 (en) Method and system for automatic non-disruptive problem determination and recovery of communication link problems
AU2001241700B2 (en) Multiple network fault tolerance via redundant network control
CN115102865A (en) Network security device topology management method and system
Cisco Troubleshooting DECnet Connectivity
Cisco Troubleshooting Overview
Cho et al. A study on the classified model and the agent collaboration model for network configuration fault management
Katchabaw et al. Policy-driven fault management in distributed systems
Agre A message-based fault diagnosis procedure
GECKIL Apparatus and method for determining a computer network topology
Jones et al. Towards decentralized network management and reliability
Watanabe et al. A reliability design method for private networks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWE Wipo information: entry into national phase

Ref document number: 2000983761

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2000983761

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWG Wipo information: grant in national office

Ref document number: 2000983761

Country of ref document: EP