|Publication number||US7299385 B2|
|Application number||US 10/900,948|
|Publication date||Nov 20, 2007|
|Filing date||Jul 28, 2004|
|Priority date||Jul 28, 2004|
|Also published as||US20060026451|
|Inventors||Douglas L. Voigt|
|Original Assignee||Hewlett-Packard Development Company, L.P.|
The described subject matter relates to electronic computing, and more particularly to managing a fault tolerant system.
A disk array is a type of turnkey, high-availability system. A disk array is designed to be inherently fault tolerant with little or no configuration effort. It responds automatically to faults, repair actions, and configuration in a manner that preserves system availability. These characteristics of disk arrays are achieved by encoding fault recovery and configuration change responses into embedded software, i.e., firmware that executes on the array controller. This encoding is often specific to the physical packaging of the array.
Since the software embedded in disk arrays is complex and expensive to develop, it is desirable to foster as much reuse as possible across an array product portfolio. Different scales of systems targeted at various market segments have distinct ways of integrating the components that make up the system.
For example, some array controllers and disks are distributed in a single package with shared power supplies, while other array controllers are packaged separately from disks, and each controller has its own power supply. In the future, turnkey, fault tolerant systems may include loosely-integrated storage networking elements. The patterns of redundancy and common mode failure differ across these integration styles. Unfortunately, these differences directly affect the logic that governs fault and configuration change responses.
Therefore, there remains a need for systems and methods for managing a fault tolerant system.
In one exemplary implementation, a system for modeling and managing a fault tolerant system comprises a configuration manager that receives configuration events from the fault tolerant system; a fault normalizer that receives fault events from the fault tolerant system; and a fault tolerance logic engine that constructs a model of the fault tolerant system based on inputs from the configuration manager and generates reporting events in response to inputs from the fault normalizer.
Described herein are exemplary architectures and techniques for managing a fault-tolerant system. The methods described herein may be embodied as logic instructions on a computer-readable medium, firmware, or as dedicated circuitry. When executed on a processor, the logic instructions (or firmware) cause a processor to be programmed as a special-purpose machine that implements the described methods. The processor, when configured by the logic instructions (or firmware) to execute the methods recited herein, constitutes structure for performing the described methods.
In an exemplary implementation data storage system 100 may implement RAID (Redundant Array of Independent Disks) data storage techniques. RAID storage systems are disk array systems in which part of the physical storage capacity is used to store redundant data. RAID systems are typically characterized as one of six architectures, enumerated under the acronym RAID. A RAID 0 architecture is a disk array system that is configured without any redundancy. Since this architecture is really not a redundant architecture, RAID 0 is often omitted from a discussion of RAID systems.
A RAID 1 architecture involves storage disks configured according to mirror redundancy. Original data is stored on one set of disks and a duplicate copy of the data is kept on separate disks. The RAID 2 through RAID 5 architectures involve parity-type redundant storage. Of particular interest, a RAID 5 system distributes data and parity information across a plurality of the disks 130 a-130 c. Typically, the disks are divided into equally sized address areas referred to as “blocks”. A set of blocks from each disk that have the same unit address ranges are referred to as “stripes”. In RAID 5, each stripe has N blocks of data and one parity block which contains redundant information for the data in the N blocks.
In RAID 5, the parity block is cycled across different disks from stripe-to-stripe. For example, in a RAID 5 system having five disks, the parity block for the first stripe might be on the fifth disk; the parity block for the second stripe might be on the fourth disk; the parity block for the third stripe might be on the third disk; and so on. The parity block for succeeding stripes typically rotates around the disk drives in a helical pattern (although other patterns are possible). RAID 2 through RAID 4 architectures differ from RAID 5 in how they compute and place the parity block on the disks. The particular RAID class implemented is not important.
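The parity rotation described above can be illustrated with a minimal sketch. It assumes the rotation from the five-disk example (first stripe on the last disk, then moving down one disk per stripe) and assumes byte-wise XOR parity, which is the usual RAID 5 choice but is not specified in this description. The function names `parity_disk` and `xor_parity` are hypothetical.

```python
from functools import reduce


def parity_disk(stripe, num_disks):
    """Return the 0-based index of the disk holding parity for a stripe.

    Assumed rotation: stripe 0 uses the last disk, stripe 1 the one before
    it, and so on, wrapping around after num_disks stripes.
    """
    return (num_disks - 1 - stripe) % num_disks


def xor_parity(blocks):
    """XOR equally sized blocks byte by byte (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data blocks of one stripe in a five-disk array (illustrative values).
data = [b"\x0f\x10", b"\xf0\x01", b"\xaa\x55", b"\x00\xff"]
parity = xor_parity(data)

# If the disk holding data[2] is lost, XOR of the survivors and the parity
# block reproduces the missing block.
rebuilt = xor_parity([data[0], data[1], data[3], parity])
```

Because XOR is its own inverse, the same routine serves both to compute the parity block and to rebuild a single missing block.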
In a RAID implementation, the storage management system 110 optionally may be implemented as a RAID management software module that runs on a processing unit of the data storage device, or on the processor unit of a computer 130.
The disk array controller module 120 coordinates data transfer to and from the multiple storage disks 130 a-130 f. In an exemplary implementation, the disk array module 120 has two identical controllers or controller boards: a first disk array controller 122 a and a second disk array controller 122 b. Parallel controllers enhance reliability by providing continuous backup and redundancy in the event that one controller becomes inoperable. Parallel controllers 122 a and 122 b have respective mirrored memories 124 a and 124 b. The mirrored memories 124 a and 124 b may be implemented as battery-backed, non-volatile RAMs (e.g., NVRAMs). Although only dual controllers 122 a and 122 b are shown and discussed generally herein, aspects of this invention can be extended to other multi-controller configurations where more than two controllers are employed.
The mirrored memories 124 a and 124 b store several types of information. The mirrored memories 124 a and 124 b maintain duplicate copies of a cohesive memory map of the storage space in multiple storage disks 130 a-130 f. This memory map tracks where data and redundancy information are stored on the disks, and where available free space is located. The view of the mirrored memories is consistent across the hot-plug interface, appearing the same to external processes seeking to read or write data.
The mirrored memories 124 a and 124 b also maintain a read cache that holds data being read from the multiple storage disks 130 a-130 f. Every read request is shared between the controllers. The mirrored memories 124 a and 124 b further maintain two duplicate copies of a write cache. Each write cache temporarily stores data before it is written out to the multiple storage disks 130 a-130 f.
The controllers' mirrored memories 124 a and 124 b are physically coupled via a hot-plug interface 126. Generally, the controllers 122 a and 122 b monitor data transfers between them to ensure that data is accurately transferred and that transaction ordering is preserved (e.g., read/write ordering).
Each controller 210 a, 210 b has a converter 230 a, 230 b connected to receive signals from the host via respective I/O modules 240 a, 240 b. Each converter 230 a and 230 b converts the signals from one bus format (e.g., Fibre Channel) to another bus format (e.g., peripheral component interconnect (PCI)). A first PCI bus 228 a, 228 b carries the signals to an array controller memory transaction manager 226 a, 226 b, which handles all mirrored memory transaction traffic to and from the NVRAM 222 a, 222 b in the mirrored controller. The array controller memory transaction manager maintains the memory map, computes parity, and facilitates cross-communication with the other controller. The array controller memory transaction manager 226 a, 226 b is preferably implemented as an integrated circuit (IC), such as an application-specific integrated circuit (ASIC).
The array controller memory transaction manager 226 a, 226 b is coupled to the NVRAM 222 a, 222 b via a high-speed bus 224 a, 224 b and to other processing and memory components via a second PCI bus 220 a, 220 b. Controllers 210 a, 210 b may include several types of memory connected to the PCI bus 220 a and 220 b. The memory includes a dynamic RAM (DRAM) 214 a, 214 b, flash memory 218 a, 218 b, and cache 216 a, 216 b.
The array controller memory transaction managers 226 a and 226 b are coupled to one another via a communication interface 250. The communication interface 250 supports bi-directional parallel communication between the two array controller memory transaction managers 226 a and 226 b at a data transfer rate commensurate with the NVRAM buses 224 a and 224 b.
The array controller memory transaction managers 226 a and 226 b employ a high-level packet protocol to exchange transactions in packets over hot-plug interface 250. The array controller memory transaction managers 226 a and 226 b perform error correction on the packets to ensure that the data is correctly transferred between the controllers.
The array controller memory transaction managers 226 a and 226 b provide a memory image that is coherent across the hot plug interface 250. The managers 226 a and 226 b also provide an ordering mechanism to support an ordered interface that ensures proper sequencing of memory transactions.
In an exemplary implementation each controller 210 a, 210 b includes multiple central processing units (CPUs) 212 a, 213 a, 212 b, 213 b, also referred to as processors. The processors on each controller may be assigned specific functionality to manage. For example, a first set of processing units 212 a, 212 b may manage storage operations for the plurality of disks 130 a-130 f, while a second set of processing units 213 a, 213 b may manage networking operations with host computers or software modules that request storage services from data storage system 100.
When fault tolerance system 300 is implemented in a data storage system such as the system depicted in
The fault symptom catalog table 320 is a data table that provides a mapping between failure information for specific components and fault events in the context of a larger data storage system. In an exemplary implementation the fault symptom catalog table 320 may be constructed by the manufacturer or administrator of the data storage system.
The event generation registry table 330 is a data table that maps reporting events for particular fault events or configuration change events to particular modules/devices in the data storage system. By way of example, a disk array in a storage system that utilizes the services of a power supply may register with the event generation registry table to receive notification of a failure event for the power supply.
As the data storage system is constructed (or modified), the configuration manager 315 and the fault tolerance logic engine 340 cooperate to build a system relationship and state table 325 that describes relationships between components in the data storage system. In operation, fault events and configuration change events are delivered to the fault tolerance logic engine 340, which uses the system relationship and state table 325 to generate fault reporting and recovery action events. The fault tolerance logic engine 340 uses the mapping information in the event generation registry table 330 to propagate these reporting and recovery action events to other modules/devices in the storage system.
Operation of the system 300 will be described with reference to the flowcharts of
In operation, fault tolerance system 300 constructs an abstract model representative of a real, physical fault tolerant system such as, e.g., a storage system. The abstract model is configured with relationships and properties representative of the real fault tolerant system. Then the fault tolerance system 300 monitors operation of the real fault tolerant system for changes in its status. The abstract model is updated to reflect changes in the status of one or more components of the real fault tolerant system.
More particularly, the configuration manager 315 receives configuration events (e.g., when the real system is powered on or when a new component is added to the system) and translates information about the construction of the real fault tolerant system and its components into a description usable by the fault tolerance logic engine 340. In an exemplary implementation configuration events may be formatted in a manner specific to the component that generates the configuration event. Accordingly, the configuration manager 315 associates a hardware-specific event with a component class in the component classes table 310 to translate the configuration event from a hardware-specific code into a code that is compatible with the model developed by system 300. The fault tolerance logic engine 340 receives configuration information from the configuration manager 315 and constructs a model of the real physical system. In an exemplary implementation the real physical system may be implemented as a storage system and the model may depict the system as a graph.
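The translation step performed by the configuration manager can be sketched as a table lookup. The hardware codes, class names, and the `ModelEvent` and `translate` names below are all hypothetical; the description does not give the actual contents or layout of the component classes table 310.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the component classes table: it maps a
# hardware-specific code to a component class understood by the model.
COMPONENT_CLASSES = {
    "PSU-0x21": "power_supply",
    "FAN-0x07": "fan",
    "HDD-0x40": "storage_device",
}


@dataclass(frozen=True)
class ModelEvent:
    """A configuration event expressed in model terms (assumed shape)."""
    action: str            # "add" or "remove"
    component_class: str
    name: str


def translate(hw_code, action, name):
    """Translate a hardware-specific configuration event for the model."""
    try:
        component_class = COMPONENT_CLASSES[hw_code]
    except KeyError:
        raise ValueError(f"unknown hardware code: {hw_code}") from None
    return ModelEvent(action, component_class, name)


event = translate("FAN-0x07", "add", "fan0")
```

Keeping the hardware-specific codes in a table means that supporting a new package style would only require new table entries, which matches the reuse goal stated earlier.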
By way of overview, in one exemplary implementation, the following configuration modification process is implemented using the configuration manager 315. First, an object is created for each primitive physical component in the real physical system by informing the fault tolerance logic engine 340 of the existence of the component, its name, its state and other parametric information that may be interpretable either by the fault tolerance logic engine 340 or by recipients of outbound events. Each of these objects is likely to represent a field replaceable unit such as a power supply, fan, PC board, or storage device.
Second, a dependency group is created for each set of primitive devices that depend on each other for continued operation. A dependency group is a logical association between a group (or groups) of components in which all of the components must be functioning correctly in order for the group to perform its collective function. For example, an array controller with its own non-redundant power supply and fan would represent a dependency group.
Third, a redundancy group is created for each set of components or groups that exhibits some degree of fault tolerant behavior. Redundancy parameters of the group may be set to represent the group's behavior within the bounds of a common fault tolerance relationship.
Fourth, additional layers of dependency and redundancy groups are created to represent the topology, construction and fault tolerance behavior of the entire system. This may be done by creating groups for whole rack mountable components that comprise a system, followed by additional layers indicating how those components are integrated into the particular system.
Fifth, component additions, deletions and other system modifications are tracked by destroying groups and creating others. Each time a group is created or destroyed, the state of all other components whose membership has changed as a result of the new configuration is recomputed. This is implemented by querying the states of all of the constituent components in the group after the change and combining them according to the type (dependency vs. redundancy) and parameters of the group. Any internal state changes that result are propagated throughout the system model within the apparatus.
These operations are described in greater detail in connection with the following text and the accompanying flowcharts.
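The layered model built by the five steps above can be sketched as a small graph of components and groups. This is only an illustration under stated assumptions: the class names, the Boolean `ok` state, and the `tolerated_failures` parameter are hypothetical, and the dual-controller redundancy group is an assumed example built from the dependency group example given in the second step.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A primitive physical component, e.g. a field replaceable unit."""
    name: str
    ok: bool = True

    def is_ok(self):
        return self.ok


@dataclass
class Group:
    """A dependency or redundancy group over components or other groups."""
    name: str
    kind: str                                    # "dependency" or "redundancy"
    members: list = field(default_factory=list)
    tolerated_failures: int = 0                  # assumed redundancy parameter

    def is_ok(self):
        # Query every constituent and combine according to the group type.
        bad = sum(1 for member in self.members if not member.is_ok())
        if self.kind == "dependency":
            return bad == 0                      # every member must work
        return bad <= self.tolerated_failures


def make_controller(tag):
    """Controller with its own non-redundant power supply and fan."""
    parts = [Component(f"{kind}_{tag}") for kind in ("psu", "fan", "board")]
    return Group(f"controller_{tag}", "dependency", parts)


ctrl_a = make_controller("a")
ctrl_b = make_controller("b")
controllers = Group("controllers", "redundancy", [ctrl_a, ctrl_b],
                    tolerated_failures=1)

ctrl_a.members[1].ok = False                     # the fan in controller a fails
one_fan_down = (ctrl_a.is_ok(), controllers.is_ok())
```

Here the failed fan takes down its dependency group, while the redundancy group above it remains operational because one controller failure is tolerated.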
The operations of
By contrast, if at operation 512 the event is not an add component event (i.e., if the event is a remove component event), then control passes to operation 522 and the remove node process is executed to remove a component node from the model constructed by the fault tolerance logic engine 340. The remove node process is described in detail below with reference to
The operations of
By contrast, if at operation 542 the event is not an add group event (i.e., if the event is a remove group event), then control passes to operation 552 and the remove node process is executed to remove the group from the model constructed by the fault tolerance logic engine 340. The remove node process is described in detail below with reference to
The operations of
The operations of
Control then passes back to operation 612 and operations 612-616 are repeated until there are no more dependent nodes, whereupon control passes to operation 618. If, at operation 618, the specific node is a component, then control passes to operation 622 and the component node data structure is deleted. By contrast, if the specific node is not a component node (i.e., it is a group node) then control passes to operation 620 and the logical links between the group and its constituent members are dissolved. Control then passes to operation 622, and the specific group node data structure is deleted. At operation 624 control returns to the calling routine.
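The remove node process can be sketched roughly as follows. The text describing operations 612-616 is incomplete here, so this sketch assumes that each dependent node is a group containing the node being removed, and that the node is detached from that group and the group's state is recomputed. The `Node`, `link`, and `remove_node` names and the `recompute` callback are hypothetical.

```python
class Node:
    """A node in the model: a component, or a group when is_group is True."""

    def __init__(self, name, is_group=False):
        self.name = name
        self.is_group = is_group
        self.members = []    # constituents (group nodes only)
        self.parents = []    # groups that contain this node


def link(group, member):
    """Create the logical link between a group and one of its members."""
    group.members.append(member)
    member.parents.append(group)


def remove_node(model, node, recompute=lambda group: None):
    """Remove a component or group node from the model dictionary."""
    # Assumed loop: detach the node from every group that depends on it.
    while node.parents:
        parent = node.parents.pop()
        parent.members.remove(node)
        recompute(parent)
    # A group node also dissolves the links to its constituent members.
    if node.is_group:
        for member in node.members:
            member.parents.remove(node)
        node.members.clear()
    # Finally the node data structure itself is deleted.
    del model[node.name]


model = {n.name: n for n in (Node("disk0"), Node("disk1"),
                             Node("mirror", is_group=True))}
link(model["mirror"], model["disk0"])
link(model["mirror"], model["disk1"])

recomputed = []
remove_node(model, model["disk0"],
            recompute=lambda group: recomputed.append(group.name))
```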
If, at operation 712, the specific node is a component node, then control passes to operation 714 and the fault tolerance logic engine 340 generates an event for the event generation registry 330. This process is described in detail below, with reference to
By contrast, if at operation 712 the specific node is not a component node (i.e., it is a redundancy node or a dependency node) then control passes to operation 722 and a node state accumulator is initialized. In an exemplary implementation a node state accumulator may be embodied as a numerical indicator of the number of bad node states in a redundancy node or a dependency node. In an alternate implementation a node state accumulator may be implemented as a Boolean state indicator.
If, at operation 724, the specific node is a redundancy node, then at operation 726 control passes to the operations of
When all the members of the redundancy group have been processed, then control passes from operation 740 to operation 746 and the number of bad node states is compared to a redundancy level threshold. If the number of bad node states exceeds the redundancy level threshold, then the node state is changed to “failed” at operation 750, and control returns to the calling routine at operation 754. By contrast, if the number of bad node states remains below the redundancy level threshold, then control passes to operation 748. If, at operation 748 the number of bad node states is zero, then control returns to the calling routine at operation 754. By contrast, if the number of bad node states is not zero, then control passes to operation 752 and the node state is changed to reflect a partial failure.
In an exemplary implementation the redundancy level may be implemented as a threshold that represents the maximum number of bad node states allowable in a redundancy node before the state of the redundancy node changes from active to inactive (or failed). The threshold may be set, e.g., by a manufacturer of a device or by a system administrator as part of configuring a system. By way of example, a redundancy group that represents a disk array having eight disks that is configured to implement RAID 5 storage may be further configured to change its state from active to failed if three or more disks become inactive.
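The threshold comparison for a redundancy node can be sketched as a small function. The state names and the function name `redundancy_state` are assumptions; the threshold of 2 follows from the eight-disk example above, in which the group fails once three or more disks are inactive.

```python
def redundancy_state(member_ok, threshold):
    """Combine member states of a redundancy node into one node state.

    member_ok is a list of booleans, one per member; threshold is the
    maximum number of bad members allowed before the node fails.
    """
    bad = 0                          # node state accumulator
    for ok in member_ok:
        if not ok:
            bad += 1
    if bad > threshold:
        return "failed"
    if bad == 0:
        return "active"
    return "partial"                 # some members bad, still within threshold


# Eight-disk group that fails when three or more disks are inactive.
healthy = redundancy_state([True] * 8, threshold=2)
degraded = redundancy_state([False] * 2 + [True] * 6, threshold=2)
failed = redundancy_state([False] * 3 + [True] * 5, threshold=2)
```

A numeric accumulator is used here because it distinguishes a partial failure from a full one; a Boolean accumulator, mentioned above as an alternative, could only report whether any member is bad.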
Referring back to
If, at operation 816 the new node state matches criteria specified in the event generation registry table, then control passes to operation 818 and a new event is generated. At operation 820 the new node state is added to the new event, and at operation 822 the new event is reported to the device(s) that are registered to receive the event. In an exemplary implementation the event may be reported by transmitting an event message using conventional electronic message transmitting techniques.
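A minimal sketch of the registry lookup and reporting steps is shown below. It assumes that the matching criteria are a (node, state) pair and that registered devices are represented by callbacks; the class and method names are hypothetical and the description does not specify the actual table layout or message format.

```python
from collections import defaultdict


class EventGenerationRegistry:
    """Maps (node, state) criteria to the recipients registered for them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, node, state, callback):
        """Register a recipient for a given node entering a given state."""
        self._subscribers[(node, state)].append(callback)

    def report(self, node, new_state):
        """Generate and deliver an event if the new state matches criteria.

        Returns the number of recipients that were notified.
        """
        callbacks = self._subscribers.get((node, new_state), [])
        if not callbacks:
            return 0
        event = {"node": node}           # a new event is generated
        event["state"] = new_state       # the new node state is added to it
        for callback in callbacks:
            callback(dict(event))        # each recipient gets its own copy
        return len(callbacks)


registry = EventGenerationRegistry()
received = []
# A disk array registers to be notified when its power supply fails.
registry.register("power_supply_0", "failed", received.append)
delivered = registry.report("power_supply_0", "failed")
ignored = registry.report("power_supply_0", "active")
```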
At operation 910 a fault event is received in the fault normalizer 335. At operation 912 the fault normalizer 335 determines the system fault description, e.g., by looking up the system fault description in the fault symptom catalog table 320 based on the received hardware-specific event. At operation 914 the system fault description (e.g., a node handle for use in the fault tolerance logic engine 340 and a new node state) is associated with the hardware specific event. At operation 916 the event is logged, e.g., in a memory location associated with the system. And at operation 918 the event component state is delivered to the fault tolerance logic engine 340, which processes the event.
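The fault normalizer flow of operations 910-918 can be sketched as follows. The catalog keys, event fields, and the `normalize_fault` name are hypothetical, and the handling of an event that is missing from the catalog is an added assumption rather than part of the described process.

```python
# Hypothetical stand-in for the fault symptom catalog table: it maps a
# hardware-specific (source, code) pair to a node handle and a new node state.
FAULT_SYMPTOM_CATALOG = {
    ("psu_driver", 0x12): ("power_supply_0", "failed"),
    ("disk_driver", 0x31): ("disk_3", "failed"),
}


def normalize_fault(hw_event, catalog, log, deliver):
    """Translate a hardware-specific fault event and pass it to the engine."""
    # Look up the system fault description for the hardware-specific event.
    description = catalog.get((hw_event["source"], hw_event["code"]))
    if description is None:
        log.append(("unrecognized", hw_event))   # assumed fallback behavior
        return None
    node_handle, new_state = description
    # Associate the system fault description with the hardware event.
    normalized = {**hw_event, "node": node_handle, "state": new_state}
    log.append(("fault", normalized))            # log the event
    deliver(node_handle, new_state)              # hand off to the logic engine
    return normalized


event_log = []
delivered = []
result = normalize_fault(
    {"source": "disk_driver", "code": 0x31},
    FAULT_SYMPTOM_CATALOG,
    event_log,
    lambda node, state: delivered.append((node, state)),
)
```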
Although the described arrangements and procedures have been described in language specific to structural features and/or methodological operations, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or operations described. Rather, the specific features and operations are disclosed as preferred forms of implementing the claimed present subject matter.
|Cooperative Classification||G06F11/0727, G06F11/0769|
|European Classification||G06F11/07P4A, G06F11/07P1F|
|Jul 28, 2004||AS||Assignment|
Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:VOIGT, DOUGLAS L.;REEL/FRAME:015637/0168
Effective date: 20040727
|May 20, 2011||FPAY||Fee payment|
Year of fee payment: 4
|Apr 28, 2015||FPAY||Fee payment|
Year of fee payment: 8
|Nov 9, 2015||AS||Assignment|
Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.;REEL/FRAME:037079/0001
Effective date: 20151027