WO2002103524A1 - System and method for isolating faults in network - Google Patents

System and method for isolating faults in network

Info

Publication number
WO2002103524A1
Authority
WO
WIPO (PCT)
Prior art keywords
error
failure region
detected
network
write
Prior art date
Application number
PCT/US2002/019118
Other languages
French (fr)
Inventor
Stephen A. Wiley
John C. Schell
Christian Cadieux
Original Assignee
Sun Microsystems, Inc.
Priority date
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Publication of WO2002103524A1 publication Critical patent/WO2002103524A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0677Localisation of faults

Abstract

A fault isolation system in a network is disclosed, particularly suited for use in a unidirectional fibre channel arbitrated loop (100).

Description

Background of the Invention
The present invention relates to a system and methods for isolating faults in a network, particularly in a fibre channel arbitrated loop (FCAL) or other multidevice computer system in which one or more components may fail. Locating faults in a network such as an FCAL is a challenging and time-consuming undertaking, in particular where the loop includes many devices, each of which may undergo intermittent failures. In systems currently in use, logs are kept of failed commands, data transfers, responses, etc., so that diagnostics may be performed in attempting to locate the sources of failures. Such diagnostics typically involve attempts to replicate a given fault condition, often in a trial-and-error manner, removing and/or replacing components until a faulty component is identified. This is a nondeterministic approach, since intermittent faults by definition do not occur every time a given state of the network occurs (e.g. a given FCAL configuration with a given I/O command). Thus, an engineer may spend a considerable amount of time and resources fruitlessly attempting to isolate an FCAL error, and additionally may replace more components than necessary, i.e. may replace nonfailing components along with a failed component, due to insufficient knowledge about the location of a failure.
Thus, current fault isolation techniques can result in wasted time, effort and equipment. In addition, the difficulties inherent in fault isolation using current techniques can lead to extended periods of down time for a system or subsystem, and a local error can thus have a broad effect, affecting the productivity of all users on the loop or network.
Accordingly, a system is needed that can isolate errors in a network in a manner that is ideally deterministic and that, at a minimum, reduces trial-and-error attempts to locate failing components relative to current methods.
Summary of the Invention
A system and method are presented for isolating fault regions in a processor-based network or loop, such as a fibre channel arbitrated loop; they include logic and program modules that are configured to analyze error logs relating to read and write errors that occur in the network. Rapid isolation of a failure region can be accomplished by determining those devices for which data read errors occur, and by the observation that read data errors are likely to occur for failure regions downstream of a given device, while write data errors are likely to occur for failure regions upstream of a given device. Information concerning command and response errors is not needed to determine the location of the failure region.
Brief Description of the Drawings
Figure 1 is a block diagram of a fibre channel arbitrated loop (FCAL) appropriate for use in connection with the present invention.
Figure 2 shows data structures or packets usable in a fibre channel arbitrated loop as shown in Figure 1.
Figure 3 is a logic table indicating likely locations of component failures based upon types of errors.
Figure 4 is a flow chart reflecting implementation of an embodiment of the method of the invention.
Figures 5-7 are histograms showing examples of error logs under different circumstances for an FCAL.
Description of the Preferred Embodiments
As shown in Figure 1, a fibre channel arbitrated loop (FCAL) or other network 10 in connection with which the present invention may be used includes a processor-based host 20, which may be a conventional computer, server or the like, having at least one processor 30 executing instructions or program modules stored in memory 40. It will be understood that other components conventionally used in a processor-based system will be used, though not shown in Figure 1, such as input-output (I/O) components and logic, disk or other storage, networking components and logic, user interface components and logic, etc. For the purposes of the present application, the term "logic" may refer to hardware, software, firmware or any combination of these, as appropriate for a given function. Similarly, a "program module" or a program may be a software application or portion of an application, applets, data or logic (as defined above), again as appropriate for the desired functions. Program modules are typically stored on hard disk or other storage medium and loaded into memory for execution.
The host 20 includes or communicates with a host bus adapter (HBA) 50, which controls packets (commands, data, responses, idle packets, etc.) transmitted over the FCAL 10.
In this exemplary configuration, the HBA communicates over a first loop segment 60 via a receive module 70 with a first device 80. The devices on the FCAL 10 may be any devices suitable for communication with a network, including wired or wireless devices that may be in a given loop. Typical devices at nodes in a network would be storage devices in a storage area network (SAN), servers, and so on.
Device 80 passes data, commands, responses and other network packets via a transmit module 90 over segment 100 to a receive module 110 of device 120, which likewise transmits packets via transmit module 130 over segment 140 to receive module 150 of device 160. Finally, device 160 sends packets via transmit module 170 over segment 180 back to the HBA 50.
As many or as few nodes or devices, such as devices 80, 120 and 160, may be used in the system. The receive and transmit modules may be components separate from the network devices, or may be in at least some cases integral to some of the devices.
Example 1: Failure region upstream of a target device.
In a typical FCAL system, a command such as a read or write command addressed to a specific target is transmitted from the HBA 50. If, for instance, a read command is sent to device 120, then it is sent to device 80 along segment 60, and device 80 forwards it to device 120 without taking any other substantive action, since the command packet is not addressed to device 80.
When the read command packet reaches device 120, the device executes the command (e.g. by retrieving the requested information from a file) and sends response and data packets to the host 20 (via segments 140 and 180). This may be done in a variety of suitable fashions in different types of networks, and as shown in Figure 2 at a minimum each data packet will include a frame with associated cyclic redundancy check (CRC) information, used to detect errors in a given packet. Thus, a command packet 200 includes a command frame 210 and CRC data 220; a data packet 230 includes a data frame 240 and CRC data 250; and a status packet 270 includes a status frame and CRC data 280. If there is a failure in the FCAL, such as somewhere along segment 100 as indicated by the failure region 190 in Figure 1, then packets passing through that failure region may not be transmitted correctly. Since failures are often intermittent, it may be that some packets are transmitted successfully while others are not. Thus, it may be that a read command targeting device 120 arrives without corruption, and the device 120 can detect this by checking the CRC information. If this occurs, then the device 120 can send a response packet indicating successful receipt of the read command, and can additionally send the read data packet(s) themselves back to the HBA.
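As a generic illustration of the CRC check just described, a receiver can recompute a checksum over the received frame and compare it with the transmitted CRC trailer. The sketch below is illustrative only: fibre channel defines its own 32-bit CRC, while this example simply uses the generic CRC-32 of the CPAN Digest::CRC module, and the function and variable names are assumptions rather than anything taken from the patent.

    #!/usr/bin/perl
    use strict; use warnings;
    use Digest::CRC qw(crc32);   # any CRC illustrates the principle

    # Recompute the CRC over the received frame and compare with the trailer.
    sub frame_intact {
        my ($frame, $crc_trailer) = @_;
        return crc32($frame) == $crc_trailer;
    }

    my $frame = "example read data frame";
    my $crc   = crc32($frame);                      # trailer computed by sender
    print frame_intact($frame, $crc) ? "intact\n" : "corrupted\n";

    my $bad = $frame;
    substr($bad, 3, 1) = "X";                       # simulate corruption in transit
    print frame_intact($bad, $crc) ? "intact\n" : "corrupted\n";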
In this example, since the failure region 190 is earlier or "upstream" in the loop from the device 120, once the read command reaches device 120 there will be no problem in sending the response and data packets, if there are no other failure regions at this time in the FCAL 10. However, if the host 20 sends a write command to the device 120, this will be followed by the write data itself, and even if the write command arrives at device 120 uncorrupted, it may be that some portion of the write data is sent over segment 100 while a failure is occurring. In this case, the write data will likely be corrupted, so that device 120 either does not receive the data at all (e.g. because the target address was corrupted) or determines from the CRC data that one or more packets were damaged.
If the target address was corrupted, a timeout error is likely, i.e. the host 20 determines that it has not received a suitable response from device 120 within some predetermined time period. If the device receives the data, but the CRC information indicates an error, then a response indicating this is sent to the host 20. In either case, a log is kept of each error, the time of occurrence, and the associated type of action, packet and device, segment, node or other type of component in connection with which the failure occurred. Subsets and variations on the exact error data logged are possible, as will be appreciated below. The error data may be stored at the host 20 or in some other storage device or medium connected to the network. In general, any device (including any storage medium) for storing data, such as disks, tapes, RAM or other volatile or nonvolatile memory may be used. Similarly, logic and/or a program configured to implement the invention may be stored on any storage device and/or computer-readable medium, and may be executed on the host or any other suitable processor-based system.
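By way of illustration, such a log entry might be represented as a simple per-error record along the following lines (a sketch only; the field names are hypothetical and are not drawn from the patent or from Appendix A):

    # Hypothetical per-error log record; all field names are illustrative.
    my @error_log;
    sub log_error {
        my (%e) = @_;
        push @error_log, {
            time   => $e{time},     # time of occurrence
            device => $e{device},   # target device of the command
            action => $e{action},   # "read" or "write"
            stage  => $e{stage},    # "command", "data", or "response"
            detail => $e{detail},   # e.g. "CRC" or "timeout"
        };
    }
    # A corrupted write-data packet of the kind described above might be logged as:
    log_error(time => time(), device => "device 120", action => "write",
              stage => "data", detail => "CRC");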
In this example, a read request was successfully completed if it reached device 120 without error, but a write request was unsuccessful, even though the request itself arrived without corruption. This is behavior that is likely to occur when the failure region is upstream of the target device. Of course, it is also possible that in the case of a write request, the write data would successfully traverse the failure region 190, reaching device 120 without corruption. Thus, the error log for this example could show several types of entries:
(1) read requests that were sent by the host 20 but failed to reach device 120 intact;
(2) read requests that arrived at device 120 intact and were successfully responded to;
(3) write requests that arrived at device 120 and were followed by successfully transmitted write data and a successful write response;
(4) write requests that arrived at device 120 and were followed by write data that was corrupted in transmission or otherwise did not arrive successfully at device 120; and
(5) write requests that failed to reach device 120 intact.
Example 2: Failure region downstream of a target device.
If instead the target device for read and write requests is device 80, or some other device upstream of a failure region such as failure region 190, then different error data will be exhibited. In this case, a read request that successfully reaches device 80 may not successfully be fulfilled, since the transmitted data may be corrupted in traversing the failure region 190. If the failure region is not failing at the time of transmission, however, the read data may be transmitted successfully.
However, if a write command is sent to the device 80, the write data itself will be transmitted successfully, since there is no failure region between the HBA 50 and the device 80. In this case, device 80 will send a response indicating successful receipt of the data. That response packet may or may not reach the HBA intact, because of the failure region 190; but for any write response from device 80 that does arrive intact, a successful write operation will be reflected, absent a failure region upstream of device 80. Thus, an error log relating to commands sent to a device 80 upstream of a failure region 190 could include entries indicating:
(1) read requests that were sent by the host 20 to the device 80, which then sent read data that was corrupted in transmission or otherwise did not arrive successfully at the HBA 50;
(2) read requests that arrived at device 80 and were successfully responded to;
(3) read requests that arrived at device 80 intact and were followed by successfully transmitted read data, but where the read response failed to reach the HBA 50 intact;
(4) write requests that arrived at device 80 and were followed by successfully transmitted write data, where the write response reached the HBA 50 intact; and
(5) write requests that arrived at device 80 and were followed by successfully transmitted write data, but where the write response did not reach the HBA 50 intact.
In both Example 1 and Example 2, it is possible to store many other types of error data in the log. However, the foregoing is sufficient to illustrate the operation of embodiments of the present invention.
It will be noted that when a target device (e.g. device 80) is upstream of a failure region, write requests are successful (Example 2, cases 4 and 5), while read requests may be unsuccessful (Example 2, case 1) or, even if they are successful, the read response may fail (Example 2, case 3). When a target device (e.g. device 120) is downstream of a failure region, read requests that arrive intact are successful (Example 1, case 1), while write requests that arrive intact may still be unsuccessful because of corruption of the ensuing write data (Example 1, case 4).
For each of Examples 1 and 2, two cases involve successful operations while three cases involve some kind of failure. The failures (Example 1, cases 1, 4 and 5; and Example 2, cases 1, 2 and 5) are logged, and are summarized in the logic table of Figure 3. As will be seen in the following discussion, it is sufficient to isolate the fault region if Example 2, case 2 and Example 1, case 4 are examined, and thus the other failure cases (involving failures of the commands or of the responses) need not be inspected.
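The decision rule summarized in Figure 3 can be sketched as follows (an illustrative sketch reusing the hypothetical record fields above, not the patent's own code): command and response errors are ignored, a data read error points to a failure region downstream of the target, and a data write error points to a failure region upstream of the target.

    # Sketch of the Figure 3 decision rule; names are illustrative.
    sub classify_error {
        my ($e) = @_;
        return "ignore" unless $e->{stage} eq "data";   # command/response errors
        return $e->{action} eq "read"
            ? "failure region downstream of $e->{device}"
            : "failure region upstream of $e->{device}";
    }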
The location of a given target device is relevant to the analysis of failure data relative to that device in isolating a fault region. In the present example, a simplex (unidirectional) fibre channel network or loop is discussed. However, the concepts of the invention may also be applied to full duplex or half-duplex (i.e. bidirectional) network connections, in which case the directionality data relating to each error should be stored and used in the fault analysis. For example, in an Ethernet network packets are sent in both directions along the Ethernet segments, which could make fault isolation difficult, unless transmission and target node information is used to determine the directionality of the operations that result in errors. If this is done, then the principles of the present invention can be applied to isolate failing components.
Figure 4 is a flow chart reflecting an appropriate method according to the invention. The method may be implemented as a program or other logic executed within the host 20 or any other processor-based system, based upon failure data generated and stored due to operation of the FCAL 10 or another network. At step (box) 300, if a failure has occurred, the first device in the loop downstream of the transmitting device (such as an HBA) is considered (see box 310). At box 320, the method inspects failure information for read and/or write requests where the commands themselves were successfully transmitted, but the read or write data did not arrive intact at the requesting device (here, the HBA 50).
The cases where there was a failure due to a command failing to reach the device, or a response failing to arrive intact back at the HBA, are not used in this implementation, though they may provide additional diagnostics in alternative embodiments. Thus, the errors illustrated in Figure 3 relating to write response, read response, write command and read command are marked "Ignore", while the failures on data read and data write are used in the embodiment of the invention relating to the flow chart of Figure 4.
At box 330, the method determines whether at least one write data packet for the current device (i.e. a given device in the loop, such as, in this first iteration, the first device 80 downstream of the HBA 50) had an associated error. If so, then as seen in Figure 3, that may be an indication that a failure region is located upstream of the target device. In the example of Figure 1, failure region 190 is downstream of device 80, so the outcome of the determination at box 330 is negative for that example, and the method proceeds to box 340. If, at box 340, no read data packets contained errors, then the method proceeds to box 320, and the next device is considered.
However, if enough error data has been logged, it is likely that there will be read errors recorded for device 80 due to failure region 190. In this case, the method determines that at least one read packet contained an error, and proceeds to step 400. A system according to the invention preferably stores the information that a failure has occurred downstream of the device currently under consideration; see line connecting boxes 340 and 400 in Figure 4, as well as Example 2, case 2 discussed above and depicted in Figure 3.
If there is at least one additional device on the loop (box 410), the method proceeds to box 320 to inspect the failure data for that next device. If there are no more devices (i.e. components of any kind) on the loop (box 420), then the failure region has been located: since it is downstream of the device under consideration and there are no additional components in the system, the failure must have occurred between the current device (which is the "last" device in the loop) and the HBA. Conventional loop and/or component/device diagnostics may then be executed to specifically identify the failure, and the failed component can be repaired or replaced. In the present example, there are additional devices (120, 160) on the loop, so the method proceeds to box 320, and the failure data relating to device 120 is considered. In this case, it will be determined that Example 1, case 4 is met, i.e. failures on data write packets will be identified. Thus, at box 330, the method proceeds to box 350, where it is determined that there were no data read packet errors. The method proceeds to box 360, where it is determined that the failure region is upstream of the current device and downstream of the previous device. If the current device is the first device on the loop, then the "previous device" is simply the HBA or other transmitting device. At this point, diagnostics are executed at box 380.
There is no need at this point to further inspect the errors logged (e.g. data write errors) with respect to device 160, because the failure region has already been located as being upstream of device 120.
It is possible, if insufficient data is gathered, that no data read errors would have been logged with respect to device 80, although the failure region 190 is downstream of that device. In this case, the data write failures for device 120 would still locate the failure region as being upstream of device 120, but device 80 would not contribute to narrowing down the possible failure location. Thus, logging more error data will lead to a greater likelihood of quickly locating a failure region.
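The walk through the flow chart of Figure 4 can be summarized in a short sketch, assuming per-device counts of data read and data write errors aggregated from the log in loop order. The names below are illustrative assumptions, and this is not the code of Appendix A; the within-component case discussed in the next paragraph is included for completeness.

    # Sketch of the Figure 4 walk; devices are given in loop order from the HBA.
    sub locate_failure_region {
        my (@devs) = @_;   # each: { name => ..., read_errs => N, write_errs => N }
        my $prev = "the HBA";
        foreach my $d (@devs) {
            if ($d->{read_errs} && $d->{write_errs}) {
                return "within $d->{name}";              # box 370
            }
            if ($d->{write_errs}) {                      # boxes 330, 350, 360
                return "between $prev and $d->{name}";
            }
            $prev = $d->{name};                          # boxes 340, 400, 410
        }
        return "between $prev and the HBA";              # box 420
    }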
Note that if both data read and data write errors occur in connection with a given component, then the failure region must be within that component. This case is indicated at box 370. This method thus deterministically locates failure regions as being between two devices on a loop, where error data has been logged sufficient to identify read data and write data errors with respect to both devices. Examples of such data are shown in Figures 5-7, in which errors relating to a system with twelve components (dev 1 - dev 12) are depicted. Thus, an error log might indicate that there were 8 read data errors for device 1, 10 read data errors for device 2, etc. In Figure 5, the errors for devices 1-7 are all read data errors, while the errors for devices 8-12 are all write data errors. According to the above method of the invention, the failure region must therefore be between devices 7 and 8.
In Figure 6, the errors for all twelve devices are read errors (except for device 8, which in this example exhibited no errors), and thus the failure region must be downstream of the last device, i.e. between the last device and the HBA. In this example, the method of Figure 4 would iterate through boxes 320, 330, 340, 400 and 410 until the last device on the loop was reached, at which point it would proceed to box 420, identifying the failure region as noted.
In Figure 7, the errors for all twelve devices are write errors. Thus, the failure region must be upstream of all twelve devices, and is accordingly between the first device and the HBA. In this example, the method of Figure 4 would follow boxes 320, 330, 350 and 360, and in the first iteration would locate the failure region.
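Using the sketch above, the three scenarios of Figures 5-7 work out as follows (the error counts are invented for illustration):

    # Figure 5: read errors on devices 1-7, write errors on devices 8-12.
    my @fig5 = map { +{ name => "dev $_",
                        read_errs  => ($_ <= 7 ? 5 : 0),
                        write_errs => ($_ >= 8 ? 5 : 0) } } 1 .. 12;
    print locate_failure_region(@fig5), "\n";   # between dev 7 and dev 8

    # Figure 6: read errors only (none on device 8).
    my @fig6 = map { +{ name => "dev $_",
                        read_errs  => ($_ == 8 ? 0 : 5),
                        write_errs => 0 } } 1 .. 12;
    print locate_failure_region(@fig6), "\n";   # between dev 12 and the HBA

    # Figure 7: write errors only.
    my @fig7 = map { +{ name => "dev $_", read_errs => 0, write_errs => 5 } } 1 .. 12;
    print locate_failure_region(@fig7), "\n";   # between the HBA and dev 1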
In the case of a bidirectional network where directionality is stored for the individual errors, it is possible to apply the concepts of the invention in a straightforward extension by considering either two read errors or two write errors that were sent in different directions on the network, but whose intersection locates a fault. For instance, a fault region could lie between a first location and a second location, where a read error occurred downstream of the first location for a packet being sent in a first direction, but another read error occurred downstream of the second location relative to the opposite of the first direction (downstream for the second direction being upstream relative to the first direction). This is analogous to the concept of locating a fault region downstream of a read error but upstream of a write error for a unidirectional loop, as in Figure 4.
As a further extension of the bidirectional embodiment, other combinations of read and write errors (or other types of errors) can be considered, as long as the fault location is identified as being at the intersection of the errors, taking directionality into account, as taught generally by the method of Figure 4.
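A sketch of this intersection idea follows (illustrative only: positions are treated as indices along the medium, and the record fields are assumptions rather than anything specified by the patent):

    # Each directional error bounds the fault on one side; the fault lies in
    # the intersection of all such bounds.
    sub intersect_bounds {
        my (@errs) = @_;  # each: { pos => N, dir => +1 or -1, action => "read" or "write" }
        my ($lo, $hi) = (-1e18, 1e18);
        foreach my $e (@errs) {
            # Along its direction of travel, a read error places the fault
            # downstream of the reader, a write error upstream of the writer.
            my $down = ($e->{action} eq "read") ? 1 : -1;
            if ($e->{dir} * $down > 0) {
                $lo = $e->{pos} if $e->{pos} > $lo;
            } else {
                $hi = $e->{pos} if $e->{pos} < $hi;
            }
        }
        return ($lo, $hi);   # the fault lies between positions $lo and $hi
    }
    # E.g. a read error seen at position 3 for traffic travelling in the +1
    # direction and a read error at position 9 for traffic travelling in the
    # -1 direction bound the fault to the interval (3, 9).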
A system and method according to this invention are particularly useful in identifying intermittent failures in a network before complete failure of a device. It is while the failures are transient that the important data can be logged, and then used in the above-described method. This can lead to rapid and efficient failure isolation. Note that if the loop itself is completely down, due e.g. to catastrophic failure of a component, nothing at all gets through: no idle frames, commands, etc. It is more typical, however, that a failing loop will undergo transient failures of a few microseconds, resulting in intermittent errors of the types discussed above. Even such a short down time will likely result in a CRC error.
Perl code appropriate for an implementation of the present invention is attached hereto as Appendix A. Other specific implementations of program modules or logic may be equally appropriate for various embodiments of the invention.
APPENDIX A
#!/usr/bin/perl
use Getopt::Std;
use Data::Dumper;
# /net/rune.central/Tools/FCloop/TarFiles/
sub processA5K { my($com, $DB, $ENC, $WWN) = @_; my($enc, $in, $num, $x, $wwn, $state, $status); foreach my $l (@$com) { if ($l =~ /Node WWN:(.+) Enclosure Name: (.*)/) { $enc = Util->rtrim($2); $enc = $1 if (!$enc); last;
} } my($slot); foreach my $l (@$com) { if ($l =~ /^SLOT/) { $in = "slot"; } elsif ($in eq "slot") { if (substr($l,0,1) =~ /\d/) { $num = substr($l,0,2) + 0; $x = substr($l,7,13); $x =~ /(\w+) (.*)/; $wwn = lc(substr($l,25,16));
$slot = sprintf("f%2.2d", $num); $DB->{substr($wwn,2)} = [$enc, $slot]; $WWN->{"$enc.$slot"} = $wwn;
$ENC->{$enc}{total} = $num if ($num > $ENC->{$enc}{total});
$x = substr($l,45,13); $x =~ /(\w+) (.*)/; $wwn = lc(substr($l,63,16)); $slot = sprintf("r%2.2d", $num); $WWN->{"$enc.$slot"} = $wwn;
$DB->{substr ($wwn,2) } = [$enc, $slot] ;
} } } $ok = 1;
} sub short { my($p) = @_; my $last = $p; my $ix = rindex($p, "/");
$last = substr ($p , 0, $ix) if ($ix > 0) ; return $last;
} sub help { print "fcloop -d <message-files base dir> \n" .
" -l <luxadm display prefix> [optional] \n" . " The prefix for all 'luxadm_display' files \n" .
" -p <luxadm dump_map prefix> [optional] \n" . " The prefix to all 'luxadm -e dump_map' files \n" .
" -P <path_to_list location> \n" .
" -f <message file to use> \n" .
" -A [use all message files] \n" . " -v [0/1/2] 0=quiet, 1=show loop, 2=show debug info. \n" .
" -h [help] \n" .
" \n" . " Using fcloop without the -p and -l will make live luxadm calls to\n" .
" get the loop order of devices and their wwn names. \n" .
" This should be done locally on the host needing analysis. \n" .
" \n" .
" Example: fcloop -d /tmp/savel -l luxadm -p sbus \n" .
" Example for current directory: fcloop -d . -l display -p sbus \n" .
" Orphan-Devices are devices that have errors in the message files but cannot be\n" .
" found in the dump_map output. \n" . " Orphan-Paths are paths that are in the message files but are not in the \n" .
" dump_map and path_to_inst files . " .
"\n" ; }
#
################### START ########################## # f :V:hd:l:p:", \%opts)) {
|| "/var/adm";
$WWN = {}; $DB = {}; $ENC = {}; $PT = $opts{P}; opendir(O, $DIR); my @a = readdir(O); closedir(O); my $today = Util->today("YMDH"); my($created, $size, $first, $year, $month, $day, $firstD); $cnt = 1;
$list = ""; foreach my $f (@a) { if (substr ($f,0, 8) eq "messages") { $list .= " $cnt = $f \n"; $FILE[$cnt] = $f; $cnt++; } } if ($opts{f}) {
$ans = $opts{f};
} elsif ($cnt == 2) { print "\nReading messages from '$FILE[1]' \n" if ($VERB); $ans = 1;
} else { if (!$list) { print "Warning: No message file to select in dir '$DIR', aborting \n"; exit(1);
} print "\nSelect a message file to parse: \n$list"; print " A = All Files\n" if ($cnt > 2); print " Select: "; if ($ALL) { $ans = "A"; } else {
$ans = <STDIN>;
} chop($ans); # chop removes last char of string
exit if (lc($ans) eq "q");
} print "\n" if ($VERB); my $count = 0; if ($PT) { if (open(O, "$PT")) { while ($l = <O>) { chop($l);
@a = split(/\s+/, $l); if ($a[2] eq "\"ssd\"") { $l = substr($a[0],1);
$l =~ /(.*)\/ssd\@w([0-9a-f]+)/; $path = $1; $wwn = $2;
$PTI{$wwn} = [$path, $a[1]]; }
} } else { print "Cannot open $PT: $!\n"; $PT = undef; }
} if ($LUX) { my $ix = rindex($LUX, "/"); my($dirl, $pat); if ($ix < 0) { $dirl = "."; $pat = $LUX; } else {
$dirl = substr($LUX,0,$ix); $pat = substr($LUX, $ix+1);
} opendir(O, $dirl); my @files = readdir(O); closedir(O); my @lall; my $count = 0; foreach my $f (@files) { next if (substr($f,0,1) eq "."); next if ($f !~ /^$pat/); open(O, "$dirl/$f"); $empty = 1; $count++; while ($l = <O>) {
$empty = 0; chop($l); push(@lall, $l);
} close(O); if ($empty) { print "Warning: luxadm_display for $dirl/$f is empty\n"; } } if (!$count) { print "Warning: No luxadm_display output found in dir $dirl, aborting \n"; exit(1); } for ($x = 0; $x <= $#lall; $x++) { $l = $lall[$x]; if ($l =~ /\s+SENA/) { if ($count > 0) {
&processA5K(\@com, $DB, $ENC, $WWN); @com = ();
}
$count++; } else { push(@com, $l); } }
&processA5K(\@com, $DB, $ENC, $WWN); @com = ();
} else { my($err, $out) = &run_command("/usr/sbin/luxadm probe") ; if ($#$out <= 0) { print "luxadm probe returned nothing, aborting! \n" ; exit ;
foreach $l (@$out) { if ($l =~ /Name:(.*) Node WWN: (.+)/) { $wwn = $2; print "luxadm display on $wwn ...<br>\n" if ($VERB); my($err, $com) = &run_command("/usr/sbin/luxadm display $wwn");
&processA5K($com, $DB, $ENC, $WWN) ; }
my (%DUMP) ; if ($DMAP) { my $ix = rindex($DMAP, "/") ; my($dirl, $pat) ; if ($ix < 0) { $dirl = " . " ;
$pat = $DMAP; } else {
$dirl = substr($DMAP, 0, $ix); $pat = substr($DMAP, $ix+1);
} opendir(O, $dirl); my @files = readdir(O); closedir(O); my $count = 0; foreach my $f (@files) { next if (substr($f,0,1) eq "."); next if ($f !~ /^$pat/); my $in = 0; my $path = $f;
$count++; if ($PT) { # get the path
open(O, "$dirl/$f"); my $path0;
$empty = 1; while ($l = <O>) { chop($l); $empty = 0; @xx = split(/\s+/, $l); if ($xx[7] =~ /Disk/) { if ($PTI{$xx[4]}) {
$p0 = $PTI{$xx[4]}->[0]; if ($path0) { if ($p0 ne $path0) { print "Warning: 2 paths in $f ($path0 and $p0) \n";
} } else { $path0 = $path = $p0;
}
} else { print "Warning: Cannot find path for $xx[4], ignoring -P option\n" ;
$PT = undef;
} last; }
} close (O) ; if ($empty) { print "Warning: dump_map for $dirl/$f is empty\n"; }
} print "path of $f is $path: " if ($VERB > 1); if ($DUMP{$path}) { print " already done \n" if ($VERB > 1); next;
} else { print "\n" if ($VERB > 1); } open(O, "$dirl/$f"); while ($l = <O>) { chop($l); if ($l =~ /Pos AL_PA/) { $in = 1; } elsif ($in) { my(@a) = split(/\s+/, $l); $DUMP{$path} .= "$a[4],"; #my $wwn = substr($a[4], 2); #my $e = $DB->{$wwn};
#if ($e) {
# $DUMP{$path} .= $e->[0] . ":$e->[1],";
#} }
} close (O) ;
} if ( ! $count) { print "Warning: No dump_file found in dir $dirl \n" ;
} } else { my($err, $paths) = &run_command("/usr/sbin/luxadm probe -p") ; my ($x , $list) ; for ($x=0; $x <= $#$paths; $x++) { if ($paths-> [$x] =~ /Physical Path:/) { $x++ if (index ($paths->[$x] , "/") < 0) ; my $fl = index ($paths->[$x] , "/"); my $f2 = rindex($paths->[$x] , "/") ; my $pa = substr ($paths-> [$x] , $fl, $f2-$fl) ; next if (index($list, $pa) >= 0) ; $list .= "$pa,"; print "luxadm dump_map on $pa ...<br>\n" if ($VERB) ; my($err, $out) = &run_command("/usr/sbin/luxadm -e dump_map
$pa"); my($in, $e); foreach $l (@$out) { if ($l =~ /Pos AL_PA/) { $in = 1;
} elsif ($in) { my(@a) = split(/\s+/, $l); my $wwn = substr($a[4], 2); $e = $DB->{$wwn}; if ($e) {
$DUMP{$pa} .= $e->[0] . ":$e->[1],";
} }
} my %M = (); my %REASON = (); my %NF; my %ASCDESC; my $no_read2 = 0; my $pop; my($last_wwn, $last_path, $asc, $ascq, $asc_desc); my($no_read, $no_write, $cnt); for ($x=1; $x <= $#FILE; $x++) { $file = $FILE[$x]; if ($ans eq $x || $ans eq "A" || $ans eq "a" || $ans eq $file) { # added lower case for A is o.k. (SW-8/20/01) $ok = 1; } else { next;
} print "Reading $file ..\n" if ($VERB); open(O, "$DIR/$file"); $cnt++; my($l0); while (1) { last if (!($l0 = <O>)); my $date0 = substr($l0, 0, 15); my @A; if ($pop) { push(@A, $pop);
$pop = undef;
} push(@A, $l0); my $done = 1; while ($l = <O>) { if (substr($l,0,15) eq $date0) { push(@A, $l); } else {
$pop = $l; $done = 0; last;
} } last if ($done);
# 0 = CRC-read
# 1 = write_command (47)
# 2 = read_command
# 3 = WARNING
my($x, $xl, $reason, $vl); for ($x=0; $x <= $#A; $x++) { if ($A[$x] =~ /WARNING: (.*)\@w([0-9a-f]+).*\(ssd(\d+)\)/)
{
$wwn = $2; $last_path = &short ($1) ;
$ssd = $3;
#$vl = $DB->{$wwn};
$SSD{$wwn} = $ssd;
$M{$last_path}{$wwn}[3]++; # warning
$warn_found++;
} elsif ($A[$x] =~ /Transport error.*Channel CRC/) {
$last_path = $wwn = $reason = ""; for ($xl=$x; $xl <= $#A; $xl++) { if ($A[$xl] =~ /WARNING: (.*)\@w([0-9a-f]+).*\(ssd/) { if (!$wwn) {
$last_path = &short($1); $wwn = $2;
} } elsif ($A[$xl] =~ /SCSI transport failed: reason '(\w+)'/) {  # (reason pattern approximate)
$reason = $1 if (!$reason); }
} if ($wwn) {
$read_found++;
$M{$last_path}{$wwn}[0]++; # read
$REASON{$last_path}{$wwn}{$reason}++; # save reason
#last; } elsif ($A[$x] =~ /Error for Command: write\(/) {
$wwn = $last_path = $asc = $desc = $sense = $ascq = "";
for ($xl=0; $xl <= $#A; $xl++) { if ($A[$xl] =~ /WARNING: (.*)\@w([0-9a-f]+).*\(ssd/) { $last_path = &short($1); $wwn = $2;
} elsif ($A[$xl] =~ /Sense Key: (\w+)/) { $sense = $1;  # (sense-key pattern approximate)
} elsif ($A[$xl] =~ /ASC: 0x(\d+) \((.*)\), ASCQ: 0x([0-9a-fx]+)/) {
$asc = sprintf("%2.2d", $1); $ascq = sprintf("%2.2d", $3); $desc = $2; $desc =~ s/<//g;
$ASCDESC{"$asc-$ascq"} = $desc; last; } } if ($wwn) { if ($asc eq "47") { $write_found++;
$M{$last_path}{$wwn}[1]++; # write
} else { $WRITE_REASON{"$sense-${asc}-$ascq-$desc"}++;
#$M{$last_path}{$wwn}[1]++; # asc++
$REASON{$last_path}{$wwn}{"$asc-$ascq"}++;
} } else { print "wwn not found for $A[$x]<br>" if ($VERB);
} #last;
} elsif (0) { if ($A[$x] =~ /Error for Command: read\(/) {
$wwn = $last_path = $ascq = $asc = $asc_desc = ""; for ($xl=0; $xl <= $#A; $xl++) { if ($A[$xl] =~ /WARNING: (.*)\@w([0-9a-f]+).*\(ssd/) { $last_path = &short($1);
$wwn = $2; } elsif ($A[$xl] =~ /ASC: 0x(\d+) \((.*)\), ASCQ: 0x([0-9a-fx]+)/) {
$asc = sprintf ("%2.2d", $1) ; $ascq = sprintf ("%2.2d" , $3) ; my $desc = $2; $desc =~ s/<//g; $ASCDESC{"$asc-$ascq"} = $desc;
} } if ($wwn) {
$rcfound++;
$M{$last_path}{$wwn}[2]++; # asc++
$REASON{$last_path}{$wwn}{"$asc-$ascq"}++; } last;
} } }
} } $Data::Dumper::Indent = 1;
# $DATA = {DB => $DB, wwn => $WWN, enc => $ENC, data => \%M, reason => \%REASON,
#          PTI => \%PTI,
#          dump => \%DUMP, ascdesc => \%ASCDESC,
#          missing => [$no_read, $no_write, $no_read2], notFound => \%NF};
$warn_found += 0; $read_found += 0; $write_found += 0; print "\n Ssd-warnings: $warn_found, CRC-reads: $read_found, writes: $write_found \n"; foreach $x (keys %WRITE_REASON) { print " SENSE-ASC-ASCQ-DESC=$x: $WRITE_REASON{$x}\n"; } print "\n";
&graph($DB, $WWN, $ENC, \%M, \%REASON, \%PTI, \%DUMP);
sub get_date { my($first) = @_; my(@b) = split (/\s+/, $first) ; my $today = Util->today ("YMDH") ; my($month) = $Util : :MTH{$b[0] } ; my($day) = $b [1] ; my($year) = ($month <= substr ($today, 5, 2) ) ? 2001:2000; return ($year, $month, $day) ; }
sub graph { my($DB, $WWN, $ENC, $data, $REASON, $PTI, $dump) = @_; my($ENC, $REASON, $title, @a) ;
$HEAD = " Encl-Disk        SSD  WWN              ReadErrors WriteErrs  SSD-Warn \n" .
" ======================================================================\n";
$path_cnt = 0; foreach $path (keys %$dump) { $head_printed = 0; print "\n" if ($VERB); print " Path: $path \n"; $path_cnt++;
$pos = "L"; $text = ""; $lastL = ""; $read_cnt = 0; $write_cnt = 0; $info = ""; $tran_cnt = 0;
$order = $dump->{$path} ;
@wwns = split(/\,/, $order); foreach $wwn (@wwns) { my $v = $DB->{substr($wwn,2)}; next if (!$v); my $dev = $v ? "$v->[0]:$v->[1]" : $wwn; my $ssd = $PTI->{$wwn}->[1]; my $x = $data->{$path}{$wwn};
$xl = $x->[0]; $x2 = $x->[1];
$warn = $x->[3];
$data->{$path}{$wwn} = [-1,-1,-1,-1]; $last = $dev if ($x->[0]); $read_cnt++ if ($x->[0]); $write_cnt++ if ($x->[1]); if ($pos eq "L" && $x->[1]) { if ($last && $last ne $dev) {
$text .= "Problem between $last and $dev, <<<<<<<<<"; } else {
$text .= "Problem at $dev, <<<<<<<<<";
}
$pos = "R"; $tran_cnt++; } elsif ($pos eq "R" && $x->[0]) { $pos = " ";
$text .= "Problem at $dev, <<<<<<<<<"; $tran_cnt++;
} next if (!$VERB && ($x->[0] + $x->[1] == 0)); if (!$head_printed) { print $HEAD;
$head_printed = 1;
} print sprintf(" %-15.15s %-4s %-16s %-10s %-10s %-10s\n", $dev, $ssd, $wwn, $xl, $x2, $warn);
}
$p = $data->{$path} ;
$orphans = 0;
@L = (); foreach $wwn (keys %$p) { my $v = $p->{$wwn}; next if ($v->[0] == -1); my $v = $DB->{substr($wwn,2)}; next if (!$v); my $dev = $v ? "$v->[0]:$v->[1]" : $wwn; push(@L, "$dev\t$wwn");
} foreach $x (sort @L) { my($dev, $wwn) = split(/\t/, $x); my $ssd = $PTI->{$wwn}->[1]; my $x = $data->{$path}{$wwn}; $xl = $x->[0]; $x2 = $x->[1]; $warn = $x->[3]; $data->{$path}{$wwn} = [-1,-1,-1,-1]; if (!$orphans) { print " Orphan-device(s)\n";
$orphans = 1;
} print sprintf(" %-15.15s %-4s %-16s %-10s %-10s %-10s\n", $dev, $ssd, $wwn, $xl, $x2, $warn);
} next if ($orphans); if ($read_cnt + $write_cnt == 0) { $text = "No read or write errors found, look at the Warnings ... \n";
} if ($write_cnt && !$read_cnt) {
$text = "All errors were Write Errors, problem outside enclosure (s) between HBA and first device. <<<<<<<<<";
} elsif ($tran_cnt == 0 && $read_cnt) {
$text = "All errors were Read Errors, problem outside enclosure (s) between last device and HBA. <<<<<<<<<"; if ($tran_cnt > 1) {
$info = "multiple transition often means more than one problem! Fix first transition in loop first !\n";
} if ($info || $text) { print "\n" if ($VERB); print " -> TransitionCount: $tran_cnt \n$info"; print " -> $text \n";
} }
push(@X, "$dev\t$wwn\t$x->[0]\t$x->[1]\t$x->[2]\t$x->[3]"); }
}
$new = 1; foreach $d (sort @X) { my($dev, $wwn, $xl, $x2, $x3, $warn) = split(/\t/, $d); if ($new) { print "\n Orphan-Path: $p\n"; print $HEAD; $new = 0;
} $xl = '.' if (!$xl);
$x2 = '.' if (!$x2);
$warn = '.' if (!$warn); print sprintf(" %-15.15s %-4s %-16s %-10s %-10s %-10s\n", $dev, $ssd, $wwn, $xl, $x2, $warn);
} print " --\n" if ($VERB);
} sub run_command { my($com) = @_; my(@out); print "Running $com ..\n" if ($VERB); open(O, "$com|"); while ($l = <O>) { chop($l); push(@out, $l);
} return ("", \@out) ;
}
###############################################################
package Util;
sub rtrim { my($this, $s) = @_;
$s =~ /(\s*)$/; if ($1) { substr($s, 0, 0 - length($1)); } else { $s; } } sub today { my($class, $format) = @_; my($this, @now, $today);
©now = localtime; $now[4]++; $now[5] += 1900; if ($format eq "YMDH") { $today = sprintf ("%4.4d-%2.2d-%2.2d %2.2d:%2.2d:%2.2d" ,
$now[5], $now[4], $now[3] , $now[2], $now[l] , $now[θ] ) ; } else {
$today = sprintf ("%2.2d-%2.2d %2.2d:%2.2d:%2.2d" , $now[4], $now[3], $now[2], $now[l], $now[θ]);
}
$today;
}
i;
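For orientation, a minimal driver for the helpers above might look like the sketch below. It is an assumption for illustration only: the command string is arbitrary, and $VERB is the global verbosity flag used throughout the listing.

    # Hypothetical driver; not part of the published listing.
    $VERB = 1;
    my $stamp = Util->today("YMDH");            # e.g. "2002-06-14 10:30:00"
    print "Run started: $stamp\n";
    my($err, $out) = run_command("ls /dev");    # any shell command works here
    print "Captured ", scalar(@$out), " lines of output\n";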

Claims

What is claimed is:
1. An error isolation system adapted for use in a computer network in communication with a storage device configured to store error information relating to at least one detected error that occurs in the network, the system having a plurality of program modules configured to execute on at least one processor, the program modules including: an error identification module configured to determine whether the detected error is a write error or a read error; and a failure region detection module configured to identify a failure region on the network, the failure region being between a first location on the network at which a write error is detected and a second location on the network at which a read error is detected.
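Purely to illustrate the division of labor in claim 1, and not as the claimed implementation, the two modules can be sketched in Perl (the language of the appendix listing); the record format and subroutine names below are assumptions.

    # Each stored error record is assumed to look like
    # { loc => <position on the network>, type => "read" | "write" }.
    sub classify_error {                    # "error identification module"
        my($err) = @_;
        return ($err->{type} eq "write") ? "write" : "read";
    }

    sub failure_region {                    # "failure region detection module"
        my(@errors) = @_;
        my($write_loc, $read_loc);
        foreach my $e (@errors) {
            $write_loc = $e->{loc} if (!defined($write_loc) && classify_error($e) eq "write");
            $read_loc  = $e->{loc} if (!defined($read_loc)  && classify_error($e) eq "read");
        }
        # The suspect region lies between these two locations.
        return ($write_loc, $read_loc);
    }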
2. The system of claim 1, wherein: the network includes at least one unidirectional loop; and the failure region detection module is configured to isolate errors occurring within the unidirectional loop.
3. The system of claim 2, wherein the unidirectional loop comprises a fibre channel arbitrated loop.
4. The system of claim 2, wherein the failure region detection module is further configured to identify the failure region as being within an intersection of at least two detected errors.
5. The system of claim 4, wherein the two detected errors include a read error and a write error.
6. The system of claim 2, wherein the failure region detection module is further configured to identify the failure region as being downstream of a first read error and upstream of a first write error.
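On a unidirectional loop, claims 4 through 6 reduce to intersecting the half-loops implied by each error: a read error points downstream of its device, a write error upstream of its device. A sketch, with devices indexed in loop order; the pair representation is an assumption.

    # $errors is a list of [index, type] pairs, in loop order.
    sub unidirectional_region {
        my(@errors) = @_;
        my($read_idx, $write_idx);
        foreach my $e (@errors) {
            my($idx, $type) = @$e;
            $read_idx  = $idx if ($type eq "read"  && !defined($read_idx));
            $write_idx = $idx if ($type eq "write" && !defined($write_idx));
        }
        # Downstream of the first read error, upstream of the first write error.
        return ($read_idx, $write_idx);
    }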
7. The system of claim 1, wherein: the network includes at least one bidirectional loop; the failure region detection module is configured to isolate errors occurring within the bidirectional loop; and the error identification module is configured to identify, for each detected error, a directionality on the bidirectional loop for that detected error.
8. The system of claim 7, wherein the failure region detection module is further configured to identify the failure region as being within an intersection of at least two detected errors.
9. The system of claim 8, wherein: the failure region detection module is further configured to identify the failure region as being downstream of a first read error and upstream of a first write error; wherein the directionality of the first read error is the same as the directionality of the first write error.
10. The system of claim 8, wherein: the failure region detection module is further configured to identify the failure region as being downstream, relative to a first directionality, of a first read error and downstream, relative to a second directionality, of a second read error.
11. The system of claim 8, wherein: the failure region detection module is further configured to identify the failure region as being upstream, relative to a first directionality, of a first write error and upstream, relative to a second directionality, of a second write error.
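For the bidirectional case of claims 8 through 11, each error also carries a directionality, and the two directions bound the region from opposite sides. Claim 10's variant (two read errors of opposite directionality), for instance, might be sketched as follows, with the hash layout assumed.

    # Each error is assumed to be { idx => n, type => "read" | "write",
    # dir => "first" | "second" } for the two loop directionalities.
    sub bidirectional_region_from_reads {
        my(@errors) = @_;
        my(%first_read);
        foreach my $e (@errors) {
            next unless ($e->{type} eq "read");
            $first_read{ $e->{dir} } = $e->{idx}
                unless defined($first_read{ $e->{dir} });
        }
        # The fault is downstream of each read error in its own
        # direction, so the two indexes bracket it from both sides.
        return ($first_read{first}, $first_read{second});
    }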
12. A method for locating a fault region in a network in communication with a storage device configured to store error information relating to detected errors that occur in the network, the method including the steps of: determining whether each of a plurality of detected errors is a read error or a write error; and identifying a failure region on the network between a first location on the network where a first read error is detected and a second location on the network where a first write error is detected.
13. The method of claim 12, wherein the identifying step includes identifying the failure region as being within a unidirectional portion of the network.
14. The method of claim 12, wherein the identifying step includes identifying the failure region as being within a fibre channel arbitrated loop.
15. The method of claim 12, wherein the identifying step includes identifying the failure region as being downstream of a detected read error and upstream of a detected write error.
16. The method of claim 12, wherein the identifying step includes identifying the failure region as being within a bidirectional portion of the network.
17. The method of claim 16, wherein the identifying step includes determining an associated directionality for each of the plurality of detected errors.
18. A computer program product stored on a computer-usable medium, comprising a computer-readable program configured to cause a computer to control execution of an application to determine a fault region associated with a plurality of detected errors in a network, the computer-readable program including: an error identification module configured to determine whether each of the plurality of detected errors is a write error or a read error; and a failure region detection module configured to identify a failure region on the network, the failure region being between a first location on the network at which a write error is detected and a second location on the network at which a read error is detected.
19. The computer program product of claim 18, wherein the failure region detection module is configured to identify the failure region as being upstream of a first detected write error and downstream of a first detected read error.
20. A computer network, including: a host including a processor and a host bus adapter; a loop coupled to the host bus adapter and configured to carry packets from the host bus adapter to at least one device on the loop and from the device to the host bus adapter; first logic coupled to the host and configured to detect a plurality of errors in the packets on the loop, the errors including at least one detected read error and one detected write error; and second logic coupled to the host and configured to locate a failure region on the loop downstream of the detected read error and upstream of the detected write error.
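The appendix listing above realizes the second logic of claim 20 with a left-to-right walk of the loop; stripped to its core, and with the device-tuple format assumed, that walk looks like this:

    # Each device is [name, read_errors, write_errors], in loop order.
    sub locate_on_loop {
        my(@devs) = @_;
        my($pos, $last, $region) = ("L", "", "");
        foreach my $d (@devs) {
            my($name, $rd, $wr) = @$d;
            $last = $name if ($rd);        # last device seeing read errors
            if ($pos eq "L" && $wr) {      # first device seeing write errors
                $region = ($last && $last ne $name)
                          ? "between $last and $name" : "at $name";
                $pos = "R";
            }
        }
        return $region;
    }

    print locate_on_loop(["d0", 1, 0], ["d1", 1, 0], ["d2", 0, 1]), "\n";
    # prints: between d1 and d2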
PCT/US2002/019118 2001-06-15 2002-06-14 System and method for isolating faults in network WO2002103524A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US60/298,259 2001-06-15
US10/172,302 US6990609B2 (en) 2001-06-12 2002-06-13 System and method for isolating faults in a network
US10/172,302 2002-06-13

Publications (1)

Publication Number Publication Date
WO2002103524A1 (en) 2002-12-27

Family

ID=22627122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/019118 WO2002103524A1 (en) 2001-06-15 2002-06-14 System and method for isolating faults in network

Country Status (2)

Country Link
US (1) US6990609B2 (en)
WO (1) WO2002103524A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7243264B2 (en) * 2002-11-01 2007-07-10 Sonics, Inc. Method and apparatus for error handling in networks
US7496657B2 (en) * 2003-06-13 2009-02-24 Hewlett-Packard Development Company, L.P. Method and system for determining a source of a virtual circuit fault
US7619979B2 (en) * 2004-06-15 2009-11-17 International Business Machines Corporation Fault isolation in a network
US7661017B2 (en) * 2007-01-30 2010-02-09 International Business Machines Corporation Diagnostic operations within a switched fibre channel arbitrated loop system
DE102014219512A1 (en) * 2014-09-26 2016-03-31 Dr. Johannes Heidenhain Gmbh Method and device for serial data transmission via a bidirectional data transmission channel
US10516625B2 (en) * 2016-10-24 2019-12-24 Hewlett Packard Enterprise Development Lp Network entities on ring networks
US10613919B1 (en) 2019-10-28 2020-04-07 Capital One Services, Llc System and method for data error notification in interconnected data production systems

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001156778A (en) * 1999-11-30 2001-06-08 Ntt Comware Corp Health check system for communication network

Family Cites Families (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04154242A (en) 1990-10-17 1992-05-27 Nec Corp Network failure recovery system
US5333301A (en) * 1990-12-14 1994-07-26 International Business Machines Corporation Data transfer bus system and method serving multiple parallel asynchronous units
DE69221338T2 (en) 1991-01-18 1998-03-19 Nat Semiconductor Corp Repeater interface control device
US5379411A (en) 1991-11-15 1995-01-03 Fujitsu Limited Fault indication in a storage device array
US5303302A (en) * 1992-06-18 1994-04-12 Digital Equipment Corporation Network packet receiver with buffer logic for reassembling interleaved data packets
US5448722A (en) 1993-03-10 1995-09-05 International Business Machines Corporation Method and system for data processing system error diagnosis utilizing hierarchical blackboard diagnostic sessions
US5519830A (en) 1993-06-10 1996-05-21 Adc Telecommunications, Inc. Point-to-multipoint performance monitoring and failure isolation system
US5390188A (en) * 1993-08-02 1995-02-14 Synoptics Method and apparatus for measuring and monitoring the performance within a ring communication network
US6006016A (en) 1994-11-10 1999-12-21 Bay Networks, Inc. Network fault correlation
US5664093A (en) 1994-12-27 1997-09-02 General Electric Company System and method for managing faults in a distributed system
US5636203A (en) 1995-06-07 1997-06-03 Mci Corporation Method and system for identifying fault locations in a communications network
US5768300A (en) * 1996-02-22 1998-06-16 Fujitsu Limited Interconnect fault detection and localization method and apparatus
US6151688A (en) * 1997-02-21 2000-11-21 Novell, Inc. Resource management in a clustered computer system
US6516435B1 (en) * 1997-06-04 2003-02-04 Kabushiki Kaisha Toshiba Code transmission scheme for communication system using error correcting codes
US6105068A (en) * 1998-02-10 2000-08-15 3Com Corporation Method and apparatus for determining a protocol type on a network connection using error detection values stored within internetworking devices
US6442694B1 (en) 1998-02-27 2002-08-27 Massachusetts Institute Of Technology Fault isolation for communication networks for isolating the source of faults comprising attacks, failures, and other network propagating errors
US6556540B1 (en) * 1998-05-29 2003-04-29 Paradyne Corporation System and method for non-intrusive measurement of service quality in a communications network
US6460070B1 (en) 1998-06-03 2002-10-01 International Business Machines Corporation Mobile agents for fault diagnosis and correction in a distributed computer environment
US6016510A (en) 1998-06-24 2000-01-18 Siemens Pyramid Information Systems, Inc. TORUS routing element error handling and self-clearing with programmable watermarking
US6249887B1 (en) 1998-09-21 2001-06-19 William G. Gray Apparatus and method for predicting failure of a disk drive
JP3266126B2 (en) 1999-01-14 2002-03-18 日本電気株式会社 Network fault information management system and storage medium
US6609167B1 (en) * 1999-03-17 2003-08-19 Adaptec, Inc. Host and device serial communication protocols and communication packet formats
US6353902B1 (en) 1999-06-08 2002-03-05 Nortel Networks Limited Network fault prediction and proactive maintenance system
US6324659B1 (en) 1999-10-28 2001-11-27 General Electric Company Method and system for identifying critical faults in machines
US6587960B1 (en) 2000-01-11 2003-07-01 Agilent Technologies, Inc. System model determination for failure detection and isolation, in particular in computer systems
US6684250B2 (en) 2000-04-03 2004-01-27 Quova, Inc. Method and apparatus for estimating a geographic location of a networked entity
US6414595B1 (en) 2000-06-16 2002-07-02 Ciena Corporation Method and system for processing alarm objects in a communications network
US6829729B2 (en) 2001-03-29 2004-12-07 International Business Machines Corporation Method and system for fault isolation methodology for I/O unrecoverable, uncorrectable error
US20020191602A1 (en) * 2001-06-13 2002-12-19 Woodring Sherrie L. Address mapping and identification
US7058844B2 (en) 2001-06-15 2006-06-06 Sun Microsystems, Inc. System and method for rapid fault isolation in a storage area network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001156778A (en) * 1999-11-30 2001-06-08 Ntt Comware Corp Health check system for communication network

Also Published As

Publication number Publication date
US6990609B2 (en) 2006-01-24
US20030233598A1 (en) 2003-12-18

Similar Documents

Publication Publication Date Title
US10649838B2 (en) Automatic correlation of dynamic system events within computing devices
JP4759574B2 (en) Method and apparatus for network packet capture distributed storage system
US7058844B2 (en) System and method for rapid fault isolation in a storage area network
US6691209B1 (en) Topological data categorization and formatting for a mass storage system
US6678788B1 (en) Data type and topological data categorization and ordering for a mass storage system
US10289694B1 (en) Method and system for restoring encrypted files from a virtual machine image
US7685459B1 (en) Parallel backup
US8769342B2 (en) Redirecting data generated by network devices
EP1315074A2 (en) Storage system and control method
US20080077752A1 (en) Storage system and audit log management method
US7467235B2 (en) Data transfer method and system
JP2005301497A (en) Storage management system, restoration method and its program
JP2004038928A (en) System and method for determining change between two snapshots and transmitting the change to destination snapshot
US10261696B2 (en) Performance during playback of logged data storage operations
US10756952B2 (en) Determining a storage network path utilizing log data
CN110535692A (en) Fault handling method, device, computer equipment, storage medium and storage system
TWI709865B (en) Operation and maintenance data reading device and reading method thereof
US20090037655A1 (en) System and Method for Data Storage and Backup
WO2002103524A1 (en) System and method for isolating faults in network
CN111435286A (en) Data storage method, device and system
CN110244904A (en) A kind of data-storage system, method and device
US7873963B1 (en) Method and system for detecting languishing messages
US8850117B2 (en) Storage apparatus and method maintaining at least an order of writing data
CN110134572B (en) Validating data in a storage system
JPH047650A (en) Fault information log method and data processor

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP