METHOD AND APPARATUS FOR RECOVERING FROM FAILURE OF A MIRRORED BOOT DEVICE
BACKGROUND
 This invention relates to the field of computer systems. More particularly, a system and methods are provided for recovering from the failure of a disk in a set of mirrored boot disks.
 Many computing environments require high availability of storage resources. This is often accomplished by mirroring data across multiple physical storage devices (e.g., disk drives). A computer system may read from and write to a set of mirrored disks as a single logical disk drive. However, the contents that are written to the logical drive are actually written to each of the physical devices, and a read may be made from any of the devices. As a result, if one disk in the set fails, another can be used in its place, and provide the same data, without halting the system.
 Traditionally, at least three devices have been included in a mirror set in order to allow the host computer system to continue operating after one device fails, or to reboot before a failed device is replaced. Systems usually require a minimum of three devices because they operate on a quorum basis, with each device having one "vote." A majority of devices (e.g., two) must be available so that the system can identify which device(s) contain valid data or distinguish stale data (or a device holding stale data) from fresh data. If only one disk is available (e.g., two have failed), the system cannot determine whether it contains fresh or stale data, and the system may be configured to cease or prevent operation.
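 By way of illustration only, the one-vote-per-device quorum rule described above can be sketched as follows. The function name and signature are hypothetical and are not taken from any particular mirroring product; the sketch assumes a strict majority is required.

```python
def has_quorum(available_devices: int, total_devices: int) -> bool:
    """Return True when a strict majority of one-vote devices is available."""
    return 2 * available_devices > total_devices


# Three-device mirror set: one failure is tolerated, two are not.
three_disk_one_failed = has_quorum(2, 3)
three_disk_two_failed = has_quorum(1, 3)

# Two-device mirror set: a single survivor is never a strict majority.
two_disk_one_failed = has_quorum(1, 2)

print(three_disk_one_failed, three_disk_two_failed, two_disk_one_failed)
```

 Under this rule a two-device set cannot keep a quorum after a single failure, which is why such schemes traditionally call for a third device.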
 Further, existing disk-mirroring schemes suffer from the possibility that the host computer system may boot from a device having stale data. In a computer system that can boot from multiple devices (e.g., any one of a set of mirrored boot disks), the order in which the system should attempt to boot from each device is usually specified. With existing schemes, if the first device in the order fails during operation of the computer system and is not repaired or replaced before the system reboots, the system may attempt to boot from the device. And, if the device exhibits only intermittent failures, the system may be able to boot from it. However, because the device was considered failed before rebooting, it may not have received all data updates or configuration changes, in which case the system will boot and operate with stale data.
 Also, some computing environments that require high availability of storage or boot devices may have space limitations that make it difficult to install or accommodate more than two devices. For example, many computer systems employed in such environments are configured to contain two internal disk drives and, if more are needed, they must be attached or housed in a separate enclosure.
 The procedures or instructions for mirroring a set of disks are often encoded in hardware (e.g., firmware), such as within a controller that controls the disks or within a subsystem that includes the disks. However, this arrangement limits the flexibility of the mirroring operations. If, for example, the instructions include the automatic performance of one or more procedures or commands, a system operator typically cannot override them in order to perform them manually (e.g., with different parameters or in a different order).
 Thus, in one embodiment of the invention a system and methods are provided for facilitating the mirroring of a limited number of storage devices (e.g., two), such that the system can continue operation if one device fails and, if the system is rebooted prior to repair or replacement of the failed device, the system will not attempt to boot from the failed device, thus preventing it from operating with stale data. Further, the methods may be implemented in software, thus enabling flexibility in the operation of the mirroring and recovery from a failed device. For example, a recovery procedure may be performed manually or automatically, or selected portions of the procedure may be accomplished manually, while other portions are performed automatically.
 In one embodiment of the invention, a method of recovering from the failure of a mirrored boot device includes a set of compensating actions that are performed after the failure is detected. Then, after the device is repaired or replaced, a set of reintegrating actions is performed.
 Compensating actions may include removing the failed device from a set of devices from which the system may boot, and attempting to remove the failed device from the mirroring scheme. Removing a device from the mirroring scheme may require the deletion of mirror configuration or status data from the device and the updating of such data on the remaining device(s) in the scheme.
 Reintegrating actions may include retrieving mirror status data from another mirrored device and recreating the necessary configuration on the repaired or replacement device. After the device joins the mirroring scheme, it may then be added to the set of devices from which the system may boot.
 In various embodiments of the invention, different phases of a recovery procedure (e.g., detection of failure, compensating actions, reintegrating actions) may be performed manually or automatically. A system administrator or operator may, therefore, select an appropriate policy specifying the manner (e.g., automatic or manual) in which each phase should be performed.
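 One possible representation of such a policy is sketched below, for illustration only. The enumeration names, the dictionary layout, and the choice to treat an unspecified phase as manual are all assumptions of the sketch rather than features of any described embodiment.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    COMPENSATION = "compensation"
    REINTEGRATION = "reintegration"


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"


def phases_requiring_operator(policy: dict[Phase, Mode]) -> list[Phase]:
    """List the phases an operator must perform by hand under this policy.

    Phases missing from the policy default to MANUAL (an assumption of
    this sketch, chosen as the more conservative behavior).
    """
    return [p for p in Phase if policy.get(p, Mode.MANUAL) is Mode.MANUAL]


# Example policy: detect and compensate automatically, reintegrate by hand.
policy = {
    Phase.DETECTION: Mode.AUTOMATIC,
    Phase.COMPENSATION: Mode.AUTOMATIC,
    Phase.REINTEGRATION: Mode.MANUAL,
}
print([p.value for p in phases_requiring_operator(policy)])
```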
DESCRIPTION OF THE FIGURES
 FIGS. 1A-B are block diagrams depicting illustrative computer systems for mirroring boot devices, wherein the systems may recover from the failure of a mirrored device, in accordance with an embodiment of the present invention.
 FIG. 2 is a state diagram illustrating recovery from the failure of a mirrored boot device in accordance with an embodiment of the invention.
 FIG. 3 is a flowchart demonstrating one method of recovering from the failure of a mirrored boot device in accordance with an embodiment of the present invention.
 FIG. 4 demonstrates the decreased vulnerability to attempting to boot from a failed device in accordance with an embodiment of the invention.
 The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of particular applications of the invention and their requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
 The program environment in which a present embodiment of the invention is executed illustratively incorporates a general-purpose computer or a special purpose device such as a clustered system configured for high availability, small footprint, low maintenance cost, etc. Details of such devices (e.g., processor, memory, display) may be omitted for the sake of clarity.
 It should also be understood that the techniques of the present invention might be implemented using a variety of technologies. For example, the methods described herein may be implemented in software executing on a computer system, or implemented in hardware utilizing either a combination of microprocessors or other specially designed application specific integrated circuits, programmable logic devices, or various combinations thereof. In particular, the methods described herein may be implemented by a series of computer-executable instructions residing on a storage medium such as a carrier wave, disk drive, or computer-readable medium. Exemplary forms of carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.
 In one embodiment of the invention, a system and methods are provided for mirroring multiple computer system boot devices (e.g., disk drives) and recovering from the failure of one of the mirrored devices without allowing the system to boot from the failed device and thereby operate with stale data. This embodiment is particularly configured for a mirror set consisting of just two devices, although other embodiments may be configured for mirror sets containing more than two devices. Regardless of the number of devices that may be included in a mirror set, however, the system can continue to operate and can be rebooted while only one device is available. Further, the methods of this embodiment may be implemented in software, while the methods of other embodiments of the invention may be implemented partially in software and/or hardware.
 In one implementation of this embodiment, the mirrored boot devices participate in a set of one or more mirrors, with each mirror representing one logical device partition. Each mirror comprises multiple sub-mirrors (e.g., one on each device), corresponding to physical partitions on the individual devices. Mirrors and sub-mirrors may be referred to as metadevices. Further, each device stores one or more replicas or copies of a state database that stores information regarding the mirror set. The data in the state database may identify the participating boot devices and defined metadevices, the status or configuration of each device and metadevice, etc. Each mirrored boot device, and its metadevice components (e.g., sub-mirrors), may be identified to the system by a unique label or serial number. Device identifiers may be obtained by the system by polling or reading from each device, while metadevice identifiers may be assigned by users or a software module performing the mirroring.
 The partitions of each boot device may contain various types of system files and/or data. Because the host computer system boots from the mirrored devices, it may be very important to ensure that the system only boots from a device having up-to-date (fresh) information. Otherwise, the system may operate with stale data that may affect its performance, security, accounting, etc.
 Illustratively, a boot device may be considered as having failed if one or more sub-mirrors or state database replicas on the device have failed.
 In one embodiment, when a boot device in a set of mirrored devices fails, that device is removed from a list of devices from which the host computer system will attempt to boot. The device is not returned to the list until it is repaired or replaced and its partitions reintegrated into the mirroring scheme. As a result, if the system is rebooted or crashes after a device fails, but before it is repaired or replaced, the system will only boot from a device that has not failed. The system thus cannot boot or even attempt to boot from a device that has failed. Further, when a boot device fails, it may be dropped from the mirroring scheme. As a result, despite the inaccessibility of the failed device, a single remaining device can constitute a quorum. This embodiment may be well suited for a computer system or environment that demands high availability and yet requires the system to always operate with fresh data.
 FIGS. 1A-1B depict illustrative computer systems in which embodiments of the invention may be implemented. Computer system 100 comprises multiple storage devices (e.g., disks) 110a-110n from which the system may boot. Controller 102 controls operation of the multiple devices. Mirror module 104 of system 100 performs or monitors the mirroring of the boot devices.
 Computer system 150 of FIG. 1B also comprises multiple boot devices 160a-160n, but each boot device is coupled to its own controller (e.g., controller 152a, controller 152m). As in system 100, the mirroring of boot devices in system 150 is performed or monitored by mirror module 154. The mirror modules may comprise sets of software instructions or, in alternative embodiments, may be encoded in firmware or hardware.
 In a computer system such as system 100 of FIG. 1A or system 150 of FIG. 1B, the boot devices are mirrored, meaning that data stored on one device is also stored on the other. As a result, the contents of the devices will be substantially identical, thereby allowing the system to boot and operate from either device.
 One embodiment of the invention is implemented within a computer system that includes just two mirrored boot devices (e.g., disk drives). Each boot device comprises one or more similarly configured partitions. Each such partition defines a sub-mirror and each pair of corresponding sub-mirrors constitutes a mirror. Thus, a given mirror may be considered a logical partition of a boot device, with each sub-mirror in that mirror corresponding to one physical representation of that partition. Multiple mirrors may be defined (i.e., one mirror for each partition that is common to both devices) and, further, both boot devices may be partitioned identically. In alternative embodiments of the invention, a sub-mirror may include multiple partitions on a single device and the devices need not be partitioned identically.
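 For illustration only, the relationship between partitions, sub-mirrors and mirrors described above may be modelled as follows. The device and partition names (`disk0`, `s0`, and so on) and the function name are hypothetical; a sub-mirror is represented simply as a (device, partition) pair.

```python
def build_mirrors(
    partitions_by_device: dict[str, list[str]],
) -> dict[str, list[tuple[str, str]]]:
    """Map each partition common to every device to its sub-mirrors.

    One mirror is defined per common partition, and each mirror holds one
    sub-mirror, modelled as a (device, partition) pair, per device.
    """
    if not partitions_by_device:
        return {}
    devices = sorted(partitions_by_device)
    common = set.intersection(*(set(p) for p in partitions_by_device.values()))
    return {part: [(dev, part) for dev in devices] for part in sorted(common)}


# Two boot devices; partition "s7" exists on only one, so it is not mirrored.
layout = {"disk0": ["s0", "s1", "s7"], "disk1": ["s0", "s1"]}
mirrors = build_mirrors(layout)
for name, sub_mirrors in mirrors.items():
    print(name, sub_mirrors)
```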
 As described above, configuration and/or status information is stored in the form of a state database that is replicated on the mirrored devices. The state database replicas may include information identifying the partition configuration of a boot device, the metadevices (e.g., mirrors, sub-mirrors) defined for a mirroring scheme, the number and/or locations of state database replicas, etc. Illustratively, multiple copies of the state database may be stored on each device. In one embodiment, at least three replicas are stored on each device, possibly in different partitions or submirrors. Each replica is updated when the status or configuration of a mirror changes, and the updates may be performed sequentially (e.g., rather than in parallel) to avoid having multiple replicas corrupted in the event of a system failure.
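 The benefit of updating replicas sequentially can be illustrated with the following minimal sketch. The `write` callback and the simulated failure are hypothetical stand-ins for real device input/output; the sketch only shows that a failure part-way through a sequential update leaves at most one replica damaged.

```python
def update_replicas_sequentially(replicas: list[dict], new_state: dict, write) -> int:
    """Apply new_state to each replica in turn; return how many were updated.

    `write(replica, new_state)` is a hypothetical callback that may raise
    OSError to simulate a failure part-way through the update.
    """
    updated = 0
    for replica in replicas:
        try:
            write(replica, new_state)
        except OSError:
            break  # stop: the remaining replicas keep their last consistent state
        updated += 1
    return updated


def failing_write(fail_on_call: int):
    """Build a write callback that corrupts the replica on the Nth call."""
    calls = {"n": 0}

    def write(replica: dict, new_state: dict) -> None:
        calls["n"] += 1
        replica.clear()  # the old contents are overwritten first
        if calls["n"] == fail_on_call:
            raise OSError("simulated failure during write")
        replica.update(new_state)

    return write


replicas = [{"mirror_d0": "ok"} for _ in range(3)]
count = update_replicas_sequentially(
    replicas, {"mirror_d0": "degraded"}, failing_write(2)
)
print(count, replicas)
```

 In this run the first replica holds the new state, the second is damaged, and the third still holds the previous consistent state.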
 In one embodiment of the invention, the method of recovery from a boot device failure cooperates with, but is separate from, the actual mirroring operations involving the boot devices. In particular, a present embodiment is configured for operation in a SunCluster™ environment running the Solaris™ operating system and using Solstice™ DiskSuite for device mirroring. Illustratively, the various mirroring metadevices are created by or with DiskSuite.
 In this environment, a plurality of state database replicas (e.g., three) is required on each boot device, and at least half of the replicas stored on the devices should be available in order to determine which device, if any, possesses valid data. Without at least half of them available, the system may not be able to reboot, because it may not be able to determine which device has current data. Therefore, as described below, when a mirrored boot device fails in this environment, its sub-mirrors and state database replicas are removed, and the remaining replicas (on the good device) are updated to reflect the change in configuration of the mirroring scheme and the number and location of state database replicas. Information concerning the number of state database replicas and/or their location may be saved (i.e., on the unfailed device), and may be used to restore the same configuration after the failed device is repaired or replaced. Further, a failed device is removed from a list or collection of devices the system may boot from. Thus, even if the state database replicas could not be removed from the device (e.g., it is completely inaccessible), the system cannot attempt to boot from it.
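 The arithmetic behind the "at least half" requirement may be sketched as follows, for illustration only. The function name is hypothetical and the replica counts are the example values given above (two devices, three replicas each); no actual mirroring utility is invoked.

```python
def replicas_sufficient(available: int, total: int) -> bool:
    """Return True when at least half of the known replicas are available."""
    return total > 0 and 2 * available >= total


# Two devices with three replicas each: six replicas are known in total.
# One device fails, leaving exactly half of the known replicas available.
one_device_failed = replicas_sufficient(3, 6)

# If one surviving replica also becomes unavailable before the failed
# device's replicas are removed, fewer than half remain.
before_removal = replicas_sufficient(2, 6)

# Once the failed device's three replicas are removed from the
# configuration, the same two surviving replicas are two of three.
after_removal = replicas_sufficient(2, 3)

print(one_device_failed, before_removal, after_removal)
```

 The example suggests why removing the failed device's replicas matters: it shrinks the total against which availability is measured, so the surviving device alone can satisfy the requirement.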
 In an environment such as that just described, the mirroring metadevices and state database replicas may be configured by the software or hardware module performing the mirroring (e.g., Solstice DiskSuite), in which case the recovery method is designed to work within the configuration parameters (e.g., by invoking the appropriate procedure of the mirroring module). For example, the recovery procedure may invoke the mirroring module to recreate submirrors and state database replicas and resynchronize mirrors after a failed device is replaced, to resume mirroring, etc. In an alternative embodiment, however, the recovery method may perform any of these actions.
 In one embodiment of the invention in which there is an even number of boot devices being mirrored (e.g., two), and each device stores an equal number of state database replicas (e.g., three), an additional measure may be taken to keep the system operational in the event that half of the devices (e.g., one) fail and their state database replicas cannot be removed. In particular, if one of two mirrored boot devices becomes totally inaccessible, the system may not be able to remove its replicas. Although that device is then removed from the boot device list, the mirroring module (e.g., Solstice DiskSuite) may not be able to easily determine which device contains good data, because each device has three state database replicas that agree with each other, but which differ from the other device's replicas. Thus, in this embodiment, when this situation occurs the mirroring module will examine the boot device list and assume that the device that is in the list is the valid device.
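 A minimal sketch of this tie-breaking measure follows. The function name and the device names are hypothetical, and returning no result when the boot device list does not single out exactly one device is an assumption of the sketch, not a behavior stated for any embodiment.

```python
from typing import Optional


def pick_valid_device(candidates: list[str], boot_list: list[str]) -> Optional[str]:
    """Return the one candidate still present in the boot device list.

    `candidates` are devices whose replicas are internally consistent but
    disagree with each other. None is returned if the boot list does not
    single out exactly one of them (an assumption of this sketch).
    """
    listed = [dev for dev in candidates if dev in boot_list]
    return listed[0] if len(listed) == 1 else None


# disk0 failed and was removed from the boot list, but its replicas could
# not be deleted, so both devices still appear to hold valid replicas.
chosen = pick_valid_device(["disk0", "disk1"], boot_list=["disk1"])
print(chosen)
```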
 FIG. 2 is a state diagram demonstrating a method of recovering from the failure of a mirrored boot device, according to one embodiment of the invention. In this embodiment, in state 210 the status of a set of multiple (e.g., two) boot devices (e.g., disk drives) is monitored. Monitoring the mirror set may include activity such as testing or waiting for a device failure (e.g., a read or write error), comparing the status or contents of state database replicas stored on the devices, etc. State 210 may be preceded by a Start state in which configuration and/or status information is stored in the state database replicas. The state database may include information regarding mirrors and/or submirrors (e.g., the device partitions to be mirrored, devices to participate in mirroring), the number and/or locations of state database replicas, etc.
 In one embodiment of the invention, at least two boot devices are mirrored, each boot device comprises one or more partitions, and one mirror may comprise any number of partitions that are mirrored across the participating devices. Illustratively, however, each device participating in a mirror will possess the same partition arrangement or structure. Thus, each defined mirror will comprise at least two sub-mirrors—one for each of the participating devices.
 In FIG. 2, when the failure of a mirrored boot device is detected (e.g., a read or write failure is detected), the system transitions from monitoring state 210 to compensation activity state 220. The compensation activity state is described in further detail with reference to FIG. 3, and includes activity intended to remove the failed device from its mirror set(s) and allow the system to reboot without the device. State 220 may also include the testing of the device to ensure that it has failed. If the compensation activities are unsuccessful, the system may transition to exit state 250. Otherwise, if the activities are successful, the system transitions to repair state 230, in which the failed device is repaired or replaced. After a successful repair state, the system transitions to reintegration activity state 240.
 In reintegration activity state 240, the new or repaired boot device is configured to rejoin a mirror set. The reintegration activity state is described further in conjunction with FIG. 3, but may include creating and attaching sub-mirrors, resynchronizing with other mirrored devices, etc. If reintegration activity state 240 is successful, the system transitions back to monitor state 210. If unsuccessful, the system transitions to exit state 250.
 In exit state 250, the system may continue to operate, but without a sufficient number of mirrored boot devices to continue operation in the event of another failure.
For example, in a system consisting of two mirrored boot devices, if one device fails and cannot be repaired or replaced, or if reintegration of a repaired or replaced device fails, the system may continue to operate with the single remaining boot device.
 As shown by the dashed lines in FIG. 2, a system may transition from repair state 230 back to compensation activity state 220. This may occur if a second mirrored boot device fails while a first failed device is being repaired or replaced, or while waiting for the first failed device to be repaired or replaced. Further, the system may transition back from reintegration activity state 240 to either compensation activity state 220 or repair state 230 in similar circumstances—if a second device fails while reintegrating a first device, or a repaired/replacement device exhibits signs of failure while being reintegrated.
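 The transitions described for FIG. 2 can be summarized, for illustration only, as a transition table. The event names are hypothetical labels introduced for this sketch, and the assignment of the two dashed-line causes out of state 240 to particular target states (a second device failure leading to state 220, a failing replacement leading to state 230) is an assumption, since either target is permitted by the description.

```python
from enum import Enum


class State(Enum):
    MONITOR = 210
    COMPENSATE = 220
    REPAIR = 230
    REINTEGRATE = 240
    EXIT = 250


TRANSITIONS = {
    (State.MONITOR, "failure_detected"): State.COMPENSATE,
    (State.COMPENSATE, "succeeded"): State.REPAIR,
    (State.COMPENSATE, "failed"): State.EXIT,
    (State.REPAIR, "succeeded"): State.REINTEGRATE,
    (State.REINTEGRATE, "succeeded"): State.MONITOR,
    (State.REINTEGRATE, "failed"): State.EXIT,
    # Dashed-line transitions of FIG. 2 (event-to-target mapping assumed).
    (State.REPAIR, "second_device_failed"): State.COMPENSATE,
    (State.REINTEGRATE, "second_device_failed"): State.COMPENSATE,
    (State.REINTEGRATE, "replacement_failed"): State.REPAIR,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise ValueError for an undefined transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events: list[str], start: State = State.MONITOR) -> State:
    """Apply a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


# A complete, successful recovery returns to monitoring.
final = run(["failure_detected", "succeeded", "succeeded", "succeeded"])
print(final.name, final.value)
```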
 FIG. 3 is a flowchart demonstrating a method of recovering from the failure of one device in a set of mirrored boot devices (e.g., disk drives), according to one embodiment of the invention. The method of the invention illustrated in FIG. 3 is described here as it may be applied within a computer system consisting of two boot devices. This method may, however, be applied for virtually any number of mirrored boot devices greater than or equal to two.
 In the illustrated embodiment of the invention, each boot device is partitioned identically, into one or more partitions, with each physical partition corresponding to one sub-mirror. Each pair of sub-mirrors (i.e., one from each device) defines one mirror. Multiple copies or replicas of a state database (e.g., three) are stored on each device (e.g., in different partitions). The state database includes configuration information regarding the mirror(s) and sub-mirrors and their status. The state database and/or the system module that performs the mirroring (e.g., Solstice DiskSuite) may also track the number, location and status of state database replicas. When the status or configuration of a mirror changes, each replica is updated. The system's mirroring module may handle the creation and maintenance of the state database replicas as well as the mirroring of data. In particular, the system mirroring module may be configured to update state database replicas when there is a configuration change (e.g., removal of a sub-mirror) or a status change (e.g., failure of a sub-mirror).
 In one suitable environment for implementing this method, the computer system is one node of a SunCluster™ operating the Solaris operating system and the Solstice DiskSuite mirroring utility. In this environment, each of the two mirrored boot devices will normally store three copies of the state database.
 In state 300 of the illustrated method, a failure is detected in one of the boot devices (e.g., a read or write failure). Illustratively, the determination of failure may be made by software or hardware that is performing the mirror operations or that is configured to execute this method of recovery, or by a separate software or hardware module that is configured to monitor the boot devices and/or the mirroring operations. For example, the computer system implementing this method may operate a program or other set of computer executable instructions for monitoring or examining the status of the system. Yet further, a human operator may detect the failure and then initiate one or more scripts or programs to perform the illustrated recovery procedure.
Thus, in state 300, the software or hardware module(s) performing this method of recovery may be notified of a device failure by an internal or external process or procedure.
 In state 302, the failed device is removed from a list or other data structure that specifies the order in which the computer system will attempt to boot from the mirrored boot devices. The list of boot devices may, for example, be stored on an NVRAM (Non-Volatile Random Access Memory). Thus, unless and until the failed device is returned to the list, the system will not attempt to boot from the failed device, and can only boot from the other device.
 In state 304, the state database replicas stored on the failed device are removed. This action, and any other part of the recovery procedure, may be logged to facilitate the computer system's recovery after the failure is corrected, or to return the system to an appropriate state in the event the system crashes, power is cut off, or there is some other failure.
 In state 306, the device's sub-mirror(s) is/are deleted from their mirror sets. As a result, in the stated configuration for this method (i.e., only two boot devices) only one sub-mirror will remain for each mirror. State database replicas on the other device may be updated to indicate this change in configuration.
 In state 308, information concerning the removed state database replicas (e.g., the number of replicas, their locations) and the change in sub-mirror configuration is stored. Illustratively, the information may be stored on the other, unfailed, device, and may be used to restore the original configuration during reintegration.
 In state 310, the system ensures that the device's state database replicas and sub-mirror(s) have been deleted. In an alternative embodiment of the invention, each activity may be separately confirmed. Confirmation of these activities helps ensure that the system will not attempt to write to or read from the failed device. If the confirmation is successful, the illustrated recovery procedure continues at state 312. Otherwise, the procedure exits with a failure. In this method, the compensation activity described in conjunction with FIG. 2 may comprise states 302-310.
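 For illustration only, states 302-310 may be sketched against a simple in-memory model of the configuration. The `MirrorConfig` class, its field names, and the device and mirror names are hypothetical; a real implementation would instead invoke the mirroring module and update the stored boot device list, and any of those operations might fail.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorConfig:
    boot_list: list[str]                # boot order, e.g. as kept in NVRAM
    replicas: dict[str, list[str]]      # device -> partitions holding replicas
    mirrors: dict[str, set[str]]        # mirror -> devices holding a sub-mirror
    saved: dict[str, dict] = field(default_factory=dict)


def compensate(cfg: MirrorConfig, failed: str) -> bool:
    """Perform the compensating actions; return the state 310 confirmation."""
    # State 302: remove the failed device from the boot order.
    cfg.boot_list = [dev for dev in cfg.boot_list if dev != failed]
    # State 304: remove the failed device's state database replicas.
    removed = cfg.replicas.pop(failed, [])
    # State 306: delete the device's sub-mirror from each mirror.
    detached = sorted(m for m, devs in cfg.mirrors.items() if failed in devs)
    for mirror in detached:
        cfg.mirrors[mirror].discard(failed)
    # State 308: save what was removed, for use during reintegration.
    cfg.saved[failed] = {"replica_partitions": removed, "mirrors": detached}
    # State 310: confirm that the replicas and sub-mirrors are gone.
    return failed not in cfg.replicas and all(
        failed not in devs for devs in cfg.mirrors.values()
    )


cfg = MirrorConfig(
    boot_list=["disk0", "disk1"],
    replicas={"disk0": ["s7", "s7", "s7"], "disk1": ["s7", "s7", "s7"]},
    mirrors={"d0": {"disk0", "disk1"}, "d1": {"disk0", "disk1"}},
)
confirmed = compensate(cfg, "disk0")
print(confirmed, cfg.boot_list, cfg.saved["disk0"])
```

 In this in-memory model the confirmation always succeeds; with real devices the removal steps may fail, which is the case in which the procedure exits with a failure.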
 In state 312 the failed device is repaired or replaced. As part of the repair/replacement procedure, the repaired/replacement device may be partitioned and/or formatted in the same manner as the failed device. The appropriate partition configuration may, for example, be copied from the unfailed boot device. In one embodiment of the invention, the failed device is repaired or replaced without powering down or rebooting the computer system.
 Because the contents of the unfailed device are valid and up-to-date, the system can continue to operate and can even be rebooted without the failed device. Illustratively, if the mirroring operations require a quorum, such a quorum is available in the form of the multiple state database replicas stored on the unfailed device.
 In state 314, configuration information is retrieved (e.g., from the unfailed device) and written to the repaired/replacement device. In particular, the retrieved information identifies the number of state database replicas that were stored on the failed device and their location (e.g., partition),