|Publication number||US7069465 B2|
|Application number||US 10/205,769|
|Publication date||Jun 27, 2006|
|Filing date||Jul 26, 2002|
|Priority date||Jul 26, 2002|
|Also published as||CN1234071C, CN1480843A, US20040019821|
|Publication number||10205769, 205769, US 7069465 B2, US 7069465B2, US-B2-7069465, US7069465 B2, US7069465B2|
|Inventors||Davis Qi-Yu Chu, Allen King|
|Original Assignee||International Business Machines Corporation|
1. The Field of the Invention
The invention relates to redundant arrays of independent disks (RAID) in client/server computing environments, and more specifically to systems and methods for reliable failover capabilities involving write operations of a failed node within a cluster.
2. The Relevant Art
In contemporary client/server computing environments, a cluster is a set of coupled, independent computer systems called nodes, which behave as a single system. A client interacts with a cluster as though it were a single server. The combined computing power and storage of a cluster make the cluster a valuable tool in many applications ranging from on-line business to scientific modeling. In many instances, the reliability of these systems is critical to the overall success of a business endeavor or scientific experiment.
The most vulnerable components of a computer system, including cluster systems, are the hard disk drives, which contain essentially the only mechanical, moving parts in the otherwise electronic assembly. Data written to a single drive is only as reliable as that drive, and many drives eventually do fail. The data stored on these hard disk drives in many cases represents critical client information, investment information, academic information, or the like. In an age when information storage and access are becoming increasingly important to all enterprises, more reliable methods of data storage are needed.
One existing storage method is a redundant array of independent disks (RAID). RAID systems store and access multiple individual hard disk drives as if the array were a single, larger disk. Distributing data over these multiple disks reduces the risk of losing the data if one drive fails, and it also improves access time. RAID was developed for use in transaction or applications servers and large file servers. Currently, RAID is also utilized in desktop or workstation systems where high transfer rates are needed.
In a cluster environment, such as the one described above, RAID and similar shared disk arrays are implemented to provide a client with access to the computing power of the combined nodes together with the large storage capacity of the disk array.
Depicted within each node 110 is a RAID controller 112, which will be discussed in greater detail below with respect to
The cluster system 102 connects to a Local Area Network (LAN) 120 or a private network cable or interconnect 118. Under the depicted embodiment, the cluster system 102, cluster administrator 106, and the plurality of clients 108 are connected by the network hub 104. The cluster administrator 106 preferably monitors and manages cluster operations. Occasionally, a RAID controller 112 in one of the nodes 110 fails, generally due to a component or power failure. When this occurs, non-cached write operations may be underway and incomplete. As a consequence, critical data may be lost.
Referring now to
Occasionally, a RAID controller 112 may fail. In such a case, other functioning RAID controllers have no access to the MRT 214 of the failing controller 112. The remaining RAID controllers 112 cannot identify or make consistent incomplete write operations of the failed controller 112. However, the remaining RAID controllers 112 can identify the logical drives of the RAID 114 pertaining to the failed controller 112. A remaining controller 112 will initiate a background consistency check (BGCC) on those logical drives, each from beginning to end, and, if necessary, a consistency restoration wherever data inconsistency due to an incomplete write is found.
With the logical drive sizes currently in use, a BGCC of said logical drives of the RAID array 114 may take several hours. During this period, foreground read and write operations are allowed on the logical drives that are undergoing the BGCC but have not yet been completely checked. Data corruption may occur if a physical drive of one of said logical drives of the RAID disk array 114 fails a read request that happens to be located in a cache line group that has not yet been made consistent. This data corruption results from a RAID controller 112 regenerating data from the other physical drives of the logical drive without realizing the data was inconsistent to begin with. This problem is commonly known as a “write hole.”
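The write hole can be illustrated with a small sketch. It assumes a simple XOR-parity stripe of two data blocks and one parity block; the block values and the helper name `xor_blocks` are hypothetical and are not taken from the patent.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


# A consistent stripe: two data blocks and their parity.
d0_old = b"\x11\x11"
d1 = b"\x22\x22"
parity = xor_blocks(d0_old, d1)

# Incomplete write: the new d0 reaches the disk, but the controller
# fails before the parity block is updated.
d0_new = b"\x44\x44"

# Later the drive holding d1 fails a read. Regenerating d1 from the
# surviving blocks silently yields wrong data, because the parity
# still describes d0_old.
regenerated_d1 = xor_blocks(d0_new, parity)
```

In this sketch `regenerated_d1` differs from the original `d1`, which is the silent corruption described above. Had the stripe been made consistent before the drive failure, the regeneration would have returned `d1`.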
Thus, it can be seen from the above discussion that a need exists in the art for an improved reliable failover method and apparatus for resolving incomplete RAID disk writes after a disk failure.
The apparatus and method of the present invention have been developed in response to the present state of the art, and in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available apparatus and methods. Accordingly, it is an overall object of the present invention to provide an apparatus and method that overcomes many or all of the above-discussed shortcomings in the art.
To achieve the foregoing object, and in accordance with the invention as embodied and broadly described herein in the preferred embodiments, an improved reliable failover apparatus and method is provided.
These and other objects, features, and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter. In certain embodiments, the invention comprises a plurality of shared disks and a plurality of RAID controllers each configured to access the plurality of disks. In order to distribute and maintain the location of the beginning logical block addresses (LBA) of the data on the disks undergoing write operations, a mirror race table (MRT) is implemented along with a common MRT storage location accessible by each of the RAID controllers. The common MRT storage location also duplicates all MRTs located on each RAID controller.
Under a preferred embodiment of the present invention, the common MRT storage location comprises a non-volatile random access memory (NVRAM) module. This NVRAM module may be implemented within a shared disk enclosure. Alternatively, the NVRAM module may be located on each of the RAID controllers. In one embodiment, the shared disk enclosure comprises a SCSI accessed fault-tolerant enclosure (SAF-TE).
Preferably, on each RAID controller, the apparatus also comprises an MRT search module configured to search the MRT for the first free entry and an MRT entry module configured to create an entry in the MRT by entering the logical block address (LBA) of the first cache line group to be written to the plurality of shared disks. Also provided may be an MRT pointer module configured to save an MRT pointer for the entry, an MRT retrieve module configured to find and retrieve the MRT pointer for the entry, an MRT read module configured to locate the cache line group of data on disks for consistency restoration when necessary (after a node bootup or cluster failover), an MRT clear module configured to find and clear the entry from the MRT, and an MRT transfer module configured to transfer the MRT from the shared disk enclosure to at least one RAID controller.
A method of the present invention is provided for establishing a common MRT storage location accessible by each of the RAID controllers. In one embodiment, the method comprises accessing the MRT from the common storage location, updating the MRT, detecting a failure of at least one RAID controller, and reliably distributing the work load of the failed RAID controller.
In one embodiment, the method also comprises searching the MRT for the first free entry from the top of the table to the bottom; creating an entry in the MRT by entering the logical block address (LBA) of the first cache line group to be written to the plurality of shared disks; saving an MRT pointer for the entry; finding and retrieving the MRT pointer for the entry; and finding and clearing the entry from the MRT. In order to provide reliable failover capabilities, the method may also comprise transferring the MRT from the shared disk enclosure to at least one RAID controller.
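The search, entry, pointer, and clear steps above can be sketched as a small in-memory table. This is a minimal illustration under stated assumptions: the class name `MirrorRaceTable`, its method names, and the use of a list index as the "MRT pointer" are hypothetical, and the sketch models neither NVRAM nor the shared disk enclosure.

```python
from typing import List, Optional


class MirrorRaceTable:
    """Hypothetical fixed-size table of LBAs for writes in flight."""

    def __init__(self, size: int) -> None:
        # None marks a free entry (the valid flag is not set).
        self._entries: List[Optional[int]] = [None] * size

    def find_first_free(self) -> Optional[int]:
        """Search from the top of the table to the bottom."""
        for index, lba in enumerate(self._entries):
            if lba is None:
                return index
        return None

    def create_entry(self, lba: int) -> int:
        """Record the LBA of the first cache line group being written
        and return a pointer (here, the index) to the new entry."""
        index = self.find_first_free()
        if index is None:
            raise RuntimeError("MRT is full")
        self._entries[index] = lba
        return index

    def retrieve(self, pointer: int) -> Optional[int]:
        """Return the LBA stored at the saved pointer, if any."""
        return self._entries[pointer]

    def clear_entry(self, pointer: int) -> None:
        """Clear the entry once the write has completed."""
        self._entries[pointer] = None

    def pending_lbas(self) -> List[int]:
        """LBAs whose writes may be incomplete."""
        return [lba for lba in self._entries if lba is not None]
```

A caller would save the pointer returned by `create_entry` before starting the disk write and pass it to `clear_entry` when the write completes. Any entries still present after a failure identify the cache line groups that need consistency restoration.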
In order that the manner in which the advantages and objects of the invention are obtained will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
The present invention is claimed and described herein in terms of “modules.” As used herein, this term is used to refer to lines of software code instructions or to electronic hardware configured to achieve the given purpose of the module. As such, a module is a structural element. As will be readily understood to one skilled in the art of software development, more than one instruction may exist within a module. The instructions may not necessarily be located contiguously, and could be spread out among various different portions of one or more software programs, including within different objects, routines, functions, and the like. Similarly, the hardware components of a subsystem or module, such as integrated circuits, logic gates, discrete devices, and the like, need not be organized into a single circuit, but could be distributed among one or more circuits. Unless stated otherwise, hardware or software implementations may be used interchangeably to achieve the structure and function of the disclosed modules.
Referring now to
Each entry in the MRT 400 comprises a valid flag bit 402, an LBA 404 that represents the beginning address of a cache line group, and a reserved field 406 that may be used for an optional checksum to validate entry information. In one embodiment, the MRT 400 requires only six bytes of storage space for each clustered controller's entry.
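One way such a six-byte entry could be laid out is sketched below. The text gives only the three fields and the six-byte total, so the field widths here are assumptions: one flag byte whose low bit is the valid flag, a four-byte big-endian LBA, and a one-byte reserved field used as an additive checksum. The function names are hypothetical.

```python
import struct

ENTRY_SIZE = 6      # six bytes per entry, as stated in the text
VALID_FLAG = 0x01   # assumed position of the valid flag bit


def pack_entry(lba: int, valid: bool = True) -> bytes:
    """Pack one entry: flag byte, 32-bit LBA, checksum byte (assumed layout)."""
    body = struct.pack(">BI", VALID_FLAG if valid else 0, lba)
    checksum = sum(body) & 0xFF  # hypothetical additive checksum
    return body + bytes([checksum])


def unpack_entry(raw: bytes):
    """Return (valid, lba), or raise ValueError on a checksum mismatch."""
    flags, lba, checksum = struct.unpack(">BIB", raw)
    if (sum(raw[:5]) & 0xFF) != checksum:
        raise ValueError("MRT entry failed checksum validation")
    return bool(flags & VALID_FLAG), lba
```

Under this assumed layout an all-zero entry unpacks as not valid with a matching checksum, so writing zeroes over an entry is one way to clear it.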
The SAF-TE 500 generally comprises a SCSI target interface 502, a plurality of SAF-TE registers 504, a plurality of status registers 506, a CPU 508, an erasable programmable read-only memory (EPROM) 510 module, and a dynamic random access memory (DRAM) module 512. In accordance with the present invention, a non-volatile random access memory (NVRAM) module 514 is also shown within the SAF-TE 500. A detailed description will not be made of each component of the SAF-TE 500, as one skilled in the art will readily recognize the function and purpose for the separate components. The configuration of the SAF-TE 500 is given herein by way of example and is not to be considered limiting, as one skilled in the art can readily modify the configuration while maintaining the intention of the enclosure.
In a multi-node cluster, when a node fails, an automatic failover occurs. The cluster software operating in the functioning node(s), in response to a failure, disperses the work from the failed system to the remaining systems in the cluster. However, prior to the invention, RAID controllers 112 at remaining nodes 110 have no access to the MRT 214 of the failed system. In order to overcome limitations in the art, an MRT 516 is provided within the NVRAM 514. The configuration of the MRT 516 will be described in greater detail below with reference to
Under a preferred embodiment of the present invention, the MRT 516 resides within the NVRAM module 514 and is configured to be accessed by functioning RAID controllers 112 of
Referring now to
Under a preferred embodiment of the present invention, the logical drive is made consistent 904 by reading the data of the cache line groups, as found in the MRT of the node, and calculating the parity that is required to make the cache line group consistent. This calculated parity is compared against the recorded parity. If the calculated parity does not match the recorded parity then the newly calculated parity is recorded to the logical drive. Following the consistency check, the MRT entries in both the RAID controller 112 and the enclosure are cleared 906 to ensure that data in a potentially malfunctioning member disk can be correctly regenerated.
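The compare-and-rewrite step can be sketched as follows, assuming XOR parity over the data blocks of one cache line group. The function names and the in-memory representation of blocks are hypothetical. Real controller firmware would read the blocks from the member disks and write any corrected parity back.

```python
from functools import reduce
from typing import Sequence, Tuple


def compute_parity(data_blocks: Sequence[bytes]) -> bytes:
    """XOR the data blocks of a cache line group column by column."""
    return bytes(
        reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks)
    )


def restore_consistency(
    data_blocks: Sequence[bytes], recorded_parity: bytes
) -> Tuple[bytes, bool]:
    """Return the parity that should be on disk and whether it must be
    rewritten because the recorded parity did not match."""
    calculated = compute_parity(data_blocks)
    return calculated, calculated != recorded_parity
```

Only cache line groups whose LBAs appear in the MRT need this treatment, which is what lets the restoration avoid scanning each logical drive from beginning to end.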
The completion pending status of consistency restorations is then tracked 908. The pending consistency restorations are stopped 910 if the logical drive ownership changed during node failure. The method 800 then ends 912. Referring again to
In the absence of a node bootup 804, when a node failover 814 is detected, remaining nodes determine if logical drives must be made consistent 816. If a consistency restoration is necessary, the logical drives are identified 822. Under a preferred embodiment of the present invention, logical drive identification is determined by the drive ownership table 310. The ownership of the failed node is then changed 824, and the entries pertaining to the failed node are retrieved 902 from the MRT of the shared disk enclosure. The logical drive is then made consistent 904, as described above.
Following the consistency restoration, the MRT entries in both the RAID controller 112 and the enclosure are cleared 906 to ensure that data in a potentially malfunctioning member disk can be correctly regenerated. The method 800 continues, and the completion pending status of consistency restorations is tracked 908. The pending consistency restorations are stopped 910 if the logical drive ownership changed during node failure. The method 800 then ends 912. Alternatively, if a consistency restore of logical drives is determined 816 to not be necessary, the method 708 ends 912.
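The failover branch described above can be sketched as a single function. The step numbers in the comments follow the text. Everything else is an assumption made for illustration: the drive ownership table is modeled as a dictionary from logical drive to node, the shared enclosure MRT as a dictionary from node to a list of pending LBAs, and `restore` stands in for the consistency restoration of one cache line group.

```python
from typing import Callable, Dict, List, Tuple


def handle_failover(
    failed_node: str,
    surviving_node: str,
    ownership: Dict[str, str],
    shared_mrt: Dict[str, List[int]],
    restore: Callable[[int], None],
) -> Tuple[List[str], List[int]]:
    # 822: identify the failed node's logical drives from the ownership table.
    drives = [d for d, owner in ownership.items() if owner == failed_node]
    # 824: change ownership to the surviving node.
    for drive in drives:
        ownership[drive] = surviving_node
    # 902: retrieve the failed node's entries from the shared enclosure MRT.
    pending = list(shared_mrt.get(failed_node, []))
    # 904: make each recorded cache line group consistent.
    for lba in pending:
        restore(lba)
    # 906: clear the entries once restoration is done.
    shared_mrt[failed_node] = []
    return drives, pending
```

The sketch omits the tracking and stopping of pending restorations (steps 908 and 910) and does not associate each LBA with a particular logical drive, both of which a real implementation would need.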
Returning to determination 814, if a node failover is not detected and new logical drives are assigned 818, the method determines whether there is any pending consistency restore activity for the assigned logical drives. If pending consistency restore operations exist 820, then the segments pertaining to the logical drives are retrieved 902 from the MRT of the shared disk enclosure. The logical drives are then made consistent 904, as described above. Following the consistency restoration, the entries in both the shared disk enclosure MRT and in the RAID controller MRT are cleared 906 to ensure that the data in a potentially malfunctioning member disk can be correctly regenerated.
The method 708 continues, and the completion pending status of consistency restorations is tracked 908. The pending consistency restorations are stopped 910 if the logical drive ownerships have changed. If there is no pending 820 consistency restoration, the method 708 ends 912. If no new logical drives are detected 818, the method 800 ends 912.
In one embodiment, the method 800 described above with reference to
Referring now to
Under a preferred embodiment of the present invention, if the buffer capacity is determined 1012 to be sufficient, then the feature support status and buffer capacity are recorded 1014. Alternatively, if the buffer capacity is determined 1012 to not be sufficient, or the NVRAM buffer feature is determined 1008 to not be installed, the node then operates in a manner as described with reference to the prior art
When it is determined 1114 that no other cache line groups are present, typical cluster operations are continued 1116. Alternatively, when a cache line group writing is completed 1118, respective entries in the node controller are cleared 1120. The shared disk enclosure MRT is cleared 1122 by issuing WRITE BUFFER commands with the appropriate NVRAM addresses. Under an alternative embodiment, the node controller is interrupt driven rather than configured to wait for activity completion. In one embodiment, the WRITE BUFFER command to clear entries in the MRT comprises issuing a WRITE BUFFER command with write data containing all zeroes.
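For illustration, the command descriptor block (CDB) for such a clear operation might be assembled as below. The 10-byte layout (operation code 3Bh, mode, buffer ID, 3-byte buffer offset, 3-byte parameter list length, control) follows the SCSI WRITE BUFFER command. The choice of data mode (02h), the buffer ID, the NVRAM offset, and the six-byte entry size are assumptions, and the function name is hypothetical.

```python
WRITE_BUFFER_OPCODE = 0x3B
DATA_MODE = 0x02  # assumed mode; the text does not say which mode is used


def build_write_buffer_cdb(
    buffer_offset: int, length: int, buffer_id: int = 0, mode: int = DATA_MODE
) -> bytes:
    """Build a 10-byte WRITE BUFFER CDB addressing an NVRAM region."""
    if not 0 <= buffer_offset < (1 << 24):
        raise ValueError("buffer offset must fit in 3 bytes")
    if not 0 <= length < (1 << 24):
        raise ValueError("parameter list length must fit in 3 bytes")
    return (
        bytes([WRITE_BUFFER_OPCODE, mode & 0x1F, buffer_id & 0xFF])
        + buffer_offset.to_bytes(3, "big")
        + length.to_bytes(3, "big")
        + bytes([0x00])  # control byte
    )


# Clearing one hypothetical six-byte MRT entry at NVRAM offset 0x000120:
# the CDB is followed by a data-out payload of all zeroes.
clear_cdb = build_write_buffer_cdb(buffer_offset=0x000120, length=6)
clear_payload = bytes(6)
```

Sending the CDB and payload to the enclosure requires a platform-specific SCSI pass-through mechanism, which is outside the scope of this sketch.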
Referring now to
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
|U.S. Classification||714/6.22, 714/E11.092|
|International Classification||G06F3/06, G06F9/312, G06F11/20, G06F12/00, G06F11/00, G06F11/10, H04L1/22, G06F11/07|
|Cooperative Classification||G06F2201/82, G06F3/0601, G06F2003/0697, G06F11/2092|
|Jul 26, 2002||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHU, DAVID Q.;KING, ALLEN;REEL/FRAME:013149/0724
Effective date: 20020724
|Oct 21, 2002||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: CORRECTED RECORDATION FORM COVER SHEET TO CORRECT INVENTOR S NAME, PREVIOUSLY RECORDED AT REEL/FRAME 013149/0724 (ASSIGNMENT OF ASSIGNOR S INTEREST);ASSIGNORS:CHU, DAVIS Q.;KING, ALLEN;REEL/FRAME:013411/0925
Effective date: 20020724
|Oct 21, 2009||FPAY||Fee payment|
Year of fee payment: 4
|Feb 7, 2014||REMI||Maintenance fee reminder mailed|
|Apr 11, 2014||FPAY||Fee payment|
Year of fee payment: 8
|Apr 11, 2014||SULP||Surcharge for late payment|
Year of fee payment: 7