|Publication number||US7076645 B2|
|Application number||US 10/606,645|
|Publication date||Jul 11, 2006|
|Filing date||Jun 25, 2003|
|Priority date||Jun 25, 2003|
|Also published as||CN1864134A, CN100481004C, EP1644828A2, EP1644828A4, US20040268112, WO2004114570A2, WO2004114570A3|
|Publication number||10606645, 606645, US 7076645 B2, US 7076645B2, US-B2-7076645, US7076645 B2, US7076645B2|
|Inventors||Ajay Mittal, Laura Xu, Srikanth Koneru|
|Original Assignee||Nokia Inc.|
Equipment that provides a high degree of reliability is a prime consideration of organizations that supply Internet and Intranet services. To help meet this need, technology has become available to combine several devices into a cluster that is configured to act as a single device. Using the cluster arrangement, it is intended that the failure of one device does not significantly affect the remaining components within the cluster.
The term for starting software on a device is ‘booting’ (short for ‘bootstrapping’); when this is performed on a device that is already active, the term is ‘rebooting’. A reboot is performed for a variety of reasons, including activating new versions of the software and restoring the functionality of the device after a fatal software error that prevents the device from operating.
In a cluster environment, the reboot of devices requires special consideration, since maintenance of the cluster functionality is of utmost importance. Rebooting the cluster, however, may interfere with its operation. What is needed is a way to reboot members of a cluster such that the cluster operation is maintained.
The present invention is directed at rebooting a cluster while maintaining cluster operation.
According to one aspect of the invention, cluster operation is automatically maintained during the reboot. During the cluster reboot process at least one member of the cluster remains active during the rebooting of the other members.
According to another aspect of the invention, a user, such as an administrator, triggers the cluster reboot process. The administrator does not have to manually reboot each member of the cluster; instead, the cluster reboot process handles the reboots of the members.
According to another aspect, an algorithm is executed which reboots members of the cluster at different times. Rebooting all cluster members at the same time would cause the operation of the cluster to be lost until at least one member is restored to operation.
In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific exemplary embodiments in which the invention may be practiced. Each embodiment is described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise.
The term “IP” means any type of Internet Protocol. The term “node” means a device that implements IP. The term “router” means a node that forwards IP packets not explicitly addressed to itself. The term “routable address” means an identifier for an interface such that a packet is sent to the interface identified by that address. The term “link” means a communication facility or medium over which nodes can communicate. The term “cluster” refers to a group of nodes configured to act as a single node.
The following abbreviations are used throughout the specification and claims: RMB=Remote Management Broker; CS=Configuration Subsystem; CLI=Command Line Interface; CM=Cluster Management; GUI=Graphical User Interface; MAC=Message Authentication Code; and NM=Network Management.
Referring to the drawings, like numbers indicate like parts throughout the views. Additionally, a reference to the singular includes a reference to the plural unless otherwise stated or inconsistent with the disclosure herein.
The present invention is directed at rebooting a cluster while maintaining cluster operation. At least one member of the cluster stays active during the reboot process. An administrator triggers the reboot process and then does not have to perform any other steps during the reboot process. An algorithm is executed which reboots members of the cluster at different times while always maintaining operation of at least one member of the cluster.
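The staggered schedule can be pictured as an ordered list of reboot "waves". The sketch below is a minimal illustration under stated assumptions, not the claimed algorithm: the function names and the `batch_size` parameter are hypothetical, and it assumes each wave finishes rebooting before the next one starts. It places the initiating member last, as one embodiment in this description does, and checks that some member is always outside the wave being rebooted.

```python
from typing import List


def staggered_reboot_plan(members: List[str], initiator: str,
                          batch_size: int = 1) -> List[List[str]]:
    """Split a cluster reboot into ordered waves (hypothetical helper).

    Every member other than the initiator is rebooted first, batch_size at a
    time; the initiator forms the final wave on its own. Assuming each wave
    completes before the next starts, at least one member stays up during
    every wave, provided the cluster has two or more members.
    """
    if initiator not in members:
        raise ValueError("initiator must be a cluster member")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    others = [m for m in members if m != initiator]
    waves = [others[i:i + batch_size] for i in range(0, len(others), batch_size)]
    waves.append([initiator])  # rebooted last, after all the others are back
    return waves


def members_up_during(waves: List[List[str]], members: List[str]) -> List[List[str]]:
    """For each wave, list the members that are not rebooting at that time."""
    return [[m for m in members if m not in wave] for wave in waves]
```

With `batch_size=1` the plan is fully sequential; a larger batch shortens the overall reboot but leaves fewer members carrying the cluster's work while each wave is down.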
As illustrated, inside network 145 is an IP packet based backbone network that includes routers, such as routers 125, to connect the support nodes in the network. Routers are intermediary devices on a communications network that expedite message delivery. On a single network linking many computers through a mesh of possible connections, a router receives transmitted messages and forwards them to their correct destinations over available routes. On an interconnected set of LANs, including those based on differing architectures and protocols, a router acts as a link between LANs, enabling messages to be sent from one to another. Communication links within LANs typically include twisted wire pair, fiber optics, or coaxial cable, while communication links between networks may utilize analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links, or other communications links.
Management computer 105 is coupled to management network 120 through communication media. Management computer 108 is coupled to inside network 145 through communication media. Management computers 105 and 108 may be used to manage a cluster, such as cluster 130, as well as to trigger a cluster reboot.
Furthermore, computers and other related electronic devices may be connected to network 110, network 120, and network 145. The public Internet itself may be formed from a vast number of such interconnected networks, computers, and routers. IP network 100 may include many more components than those shown in
The media used to transmit information in the communication links described above illustrate one type of computer-readable media, namely communication media. Generally, computer-readable media include any media that can be accessed by a computing device. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, communication media include wired media such as twisted pair, coaxial cable, fiber optics, waveguides, and other wired media, and wireless media such as acoustic, RF, infrared, and other wireless media.
Depending on the exact configuration and type of computing device, system memory 204 may include volatile memory, non-volatile memory, data storage devices, or the like. These examples of system memory 204 are all considered computer storage media. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by node 200. Any such computer storage media may be part of node 200.
Node 200 may include input component 212 for receiving input. Input component 212 may include a keyboard, a touch screen, a mouse, or other input devices. Output component 214 may include a display, speakers, printer, and the like.
Node 200 may also include network component 216 for communicating with other devices in an IP network. In particular, network component 216 enables node 200 to communicate with mobile nodes and corresponding nodes. Node 200 may be configured to use network component 216 to receive and send packets to and from the corresponding nodes and the mobile nodes. The communication may be wired or wireless.
Signals sent and received by network component 216 are one example of communication media. The term “computer-readable media” as used herein includes both storage media and communication media.
Software components of node 200 are typically stored in system memory 204. System memory 204 typically includes an operating system 205, one or more applications 206, and data 207. As shown in the figure, system memory 204 may also include cluster rebooting program 208. Program 208 is a component for performing operations relating to rebooting a cluster as described herein. Program 208 includes computer-executable instructions for performing processes relating to cluster rebooting.
The GUI and CLI may be configured to present a view of one or more nodes within the cluster. RMB 350 distributes information between the nodes within the cluster.
According to one embodiment, GUI 320 is configured to execute on a workstation (not shown) and to interact with the Configuration Subsystem of device 305. GUI 320 provides a graphical interface for performing operations relating to device 305, one of which is rebooting a cluster. CLI 325 provides a command line interface that allows the user to perform operations on device 305 through an application executing on device 305. The GUI and CLI associated with device 305 may also be used to trigger a cluster reboot.
RMB 350 is configured to communicate with device 305 and other devices (device 310 and device 315) within the cluster. RMB 350 may be included within device 305 or it may be separate from device 305. Generally, RMB 350 is used to communicate information between the members of the cluster.
According to one embodiment, the system acquires exclusive authority over the cluster during the reboot process. This helps to prevent more than one user or system from affecting the devices during the reboot.
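One way to picture this exclusive authority is a cluster-wide lock held for the whole reboot. The sketch below keeps the lock inside a single process; `ClusterLock` and `exclusive_cluster_access` are hypothetical names, and a real cluster would have to hold such a lock across devices (for example through the RMB's configuration-lock messages) rather than with a local `threading.Lock`.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional


class ClusterLock:
    """Hypothetical cluster-wide configuration lock with a single owner."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owner: Optional[str] = None

    @property
    def owner(self) -> Optional[str]:
        with self._guard:
            return self._owner

    def acquire(self, owner: str) -> bool:
        """Take the lock if nobody holds it; return False otherwise."""
        with self._guard:
            if self._owner is not None:
                return False
            self._owner = owner
            return True

    def release(self, owner: str) -> None:
        with self._guard:
            if self._owner != owner:
                raise RuntimeError(f"{owner} does not hold the cluster lock")
            self._owner = None


@contextmanager
def exclusive_cluster_access(lock: ClusterLock, owner: str) -> Iterator[None]:
    """Hold the lock for the duration of a cluster reboot, then release it."""
    if not lock.acquire(owner):
        raise RuntimeError(f"cluster is locked by {lock.owner}")
    try:
        yield
    finally:
        lock.release(owner)  # released even if the reboot fails
```

A second administrator who tries to start a reboot while the first one is running gets an error instead of interleaving operations on the same devices.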
According to one embodiment, GUI 320 is implemented as a set of Web pages in a browser and a Web server. The server may operate on a device within the cluster or a device separate from the cluster. The server may operate on all or some of the cluster members.
CLI 325 is a management CLI that presents information about the device and the cluster to a user in text form.
When the reboot process is initiated, RMB 350 interacts with the configuration subsystems of the devices being rebooted. According to one embodiment, when an error occurs while rebooting one of the cluster members, the reboot process is stopped. According to one embodiment, RMB 350 may be configured to restore the devices to the configurations they had before the reboot process began. This helps to ensure that all the members of the cluster maintain the same attributes. When a problem occurs, RMB 350 may report the failure to the GUI and CCLI, or send the error to some other location. When the rebooting is complete, the administrator may perform other operations.
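The stop-and-restore behaviour can be sketched as follows, assuming each member's configuration is a simple key/value mapping and that a failed reboot surfaces as an exception. The function name and data layout are hypothetical, and pushing the restored values back to real devices is left out.

```python
import copy
from typing import Callable, Dict

Config = Dict[str, str]


def reboot_with_restore(configs: Dict[str, Config],
                        reboot: Callable[[str], None]) -> None:
    """Reboot members in order (hypothetical helper).

    On the first error, stop, put every member's configuration back to its
    pre-reboot snapshot, and re-raise so the caller (GUI or CLI) can report
    the failure.
    """
    snapshot = copy.deepcopy(configs)  # taken before any member is touched
    for name in list(configs):
        try:
            reboot(name)
        except Exception:
            for member, saved in snapshot.items():
                configs[member] = copy.deepcopy(saved)
            raise
```

Because the snapshot covers every member, including those not yet reached, all members end up with the same attributes they had before the reboot started.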
The reboot action is triggered by a control in an application using the Graphical User Interface (GUI) or a command in a Command Line Interface (CLI) shell.
The control or command causes a script to be run that performs the cluster rebooting process. The script initiates a reboot by contacting each cluster member, providing an attribute that causes each member to temporarily be removed from the cluster, and then providing an attribute that causes the reboot operation to begin. The script then detects the loss of contact with the device and attempts to re-establish contact. When the script has established contact, it internally indicates that that device is now rebooted and informs the administrator which device has been rebooted. According to one embodiment, the device from which the rebooting process is initiated is not rebooted until all of the other devices have been rebooted.
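The per-member steps of the script (remove the member from the cluster, start the reboot, wait for contact to be lost, then wait for contact to return) are sketched below. The `Member` interface and the attribute names `cluster.membership` and `system.reboot` are hypothetical placeholders, since the text does not name the actual attributes; the polling counts and intervals are also assumptions.

```python
import time
from typing import Callable, Protocol


class Member(Protocol):
    """What the sketch needs from a cluster member (hypothetical interface)."""
    name: str

    def set_attribute(self, key: str, value: str) -> None: ...

    def is_reachable(self) -> bool: ...


def wait_until(condition: Callable[[], bool], attempts: int, interval: float) -> bool:
    """Poll condition up to `attempts` times, sleeping `interval` seconds between polls."""
    for _ in range(attempts):
        if condition():
            return True
        time.sleep(interval)
    return False


def reboot_member(member: Member, notify: Callable[[str], None],
                  attempts: int = 60, interval: float = 1.0) -> bool:
    # 1. Temporarily take the member out of the cluster (attribute name is hypothetical).
    member.set_attribute("cluster.membership", "leave")
    # 2. Tell the member to start its reboot (attribute name is hypothetical).
    member.set_attribute("system.reboot", "start")
    # 3. The reboot has begun once contact with the member is lost.
    if not wait_until(lambda: not member.is_reachable(), attempts, interval):
        return False
    # 4. Try to re-establish contact; success means the member is back up.
    if not wait_until(member.is_reachable, attempts, interval):
        return False
    # 5. Record the result and tell the administrator which member was rebooted.
    notify(f"{member.name} has been rebooted")
    return True
```

A `False` result covers both failure cases: the member never went down, or it went down and did not come back within the polling window.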
The reboot for all of the devices, except for the one on which the reboot is initiated, can either be performed sequentially (one device at a time) or in parallel. The parallel method reduces the overall time needed to restore the cluster to full operation.
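The two execution modes can be contrasted in a short sketch. `reboot_one` is a hypothetical callback that reboots one member and reports whether contact was re-established. In sequential mode the loop stops at the first failure, in line with the halting behaviour described for failed reboots; parallel mode starts every reboot at once using a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def reboot_others(others: List[str], reboot_one: Callable[[str], bool],
                  parallel: bool = False) -> Dict[str, bool]:
    """Reboot every member except the initiator and return success per member.

    Sequential mode reboots one member at a time and stops at the first
    failure. Parallel mode starts all reboots together, which shortens the
    total time but takes every non-initiator member down simultaneously.
    """
    if not parallel:
        results: Dict[str, bool] = {}
        for name in others:
            results[name] = reboot_one(name)
            if not results[name]:
                break  # leave the remaining members untouched
        return results
    with ThreadPoolExecutor(max_workers=max(1, len(others))) as pool:
        # map() yields results in input order, so zip pairs them correctly.
        return dict(zip(others, pool.map(reboot_one, others)))
```

In parallel mode the initiating member is the only one carrying cluster operation until the others return, which is the price paid for the shorter overall reboot.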
If the reboot fails on any of the devices, as indicated by failure to re-establish contact with the device, the reboot process halts, thereby preserving the state of the devices not rebooted. The administrator is informed that the cluster reboot has been stopped prematurely along with the identity of the device or devices that have failed.
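A minimal sketch of the halting rule and the report shown to the administrator follows, assuming the reboot order already places the initiating member last. The `RebootReport` fields and the message wording are illustrative and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RebootReport:
    rebooted: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    not_rebooted: List[str] = field(default_factory=list)

    @property
    def completed(self) -> bool:
        return not self.failed and not self.not_rebooted

    def message(self) -> str:
        """Text for the administrator (wording is illustrative)."""
        if self.completed:
            return "Cluster reboot completed: " + ", ".join(self.rebooted)
        return ("Cluster reboot stopped prematurely; failed: " + ", ".join(self.failed)
                + "; not rebooted: " + ", ".join(self.not_rebooted))


def run_cluster_reboot(order: List[str], reboot_one: Callable[[str], bool]) -> RebootReport:
    """Reboot members in `order`, halting at the first member that does not come back."""
    report = RebootReport()
    for i, name in enumerate(order):
        if reboot_one(name):
            report.rebooted.append(name)
        else:
            report.failed.append(name)
            report.not_rebooted = order[i + 1:]  # their current state is preserved
            break
    return report
```

Since the initiator is last in the order, a failure on any other member means the initiator is never rebooted and keeps the cluster running.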
Remote Management Broker (RMB) 400 acts as the backbone for the nodes within the cluster. RMB 400 provides base mechanisms including: discovering the members within the cluster; delivering queries and operations relating to NM attributes to the devices in the cluster; ensuring message integrity; providing an interface for management applications; and providing an interface to each device's local configuration subsystem. RMB 400 also includes a secure mechanism for transporting the information in the messages sent between the nodes within the cluster.
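The text does not specify how message integrity is ensured. Since the abbreviation list includes Message Authentication Code, one plausible approach is a MAC over each message; the sketch below assumes HMAC-SHA256 and a key already shared by all cluster members, both of which are assumptions rather than details from the patent.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag computed over it."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def unseal(key: bytes, message: bytes) -> bytes:
    """Return the payload if its tag verifies; raise ValueError otherwise."""
    if len(message) < TAG_LEN:
        raise ValueError("message too short to carry a MAC")
    tag, payload = message[:TAG_LEN], message[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("message failed integrity check")
    return payload
```

A MAC detects tampering but does not hide the contents; confidentiality is the job of the secure transport.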
RMB 400 is also configured to automatically query the nodes it is coupled with in order to determine the cluster members. These queries are performed periodically to help ensure that all cluster members are available at any given time.
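Periodic membership discovery might look like the following. The candidate list and the `probe` callback, which stands in for whatever query the RMB actually sends to a node, are hypothetical; the sketch only shows the polling loop and a thread-safe snapshot of the current members.

```python
import threading
from typing import Callable, Iterable, Set


class MemberDiscovery:
    """Periodically probe candidate nodes and keep the set that answered."""

    def __init__(self, candidates: Iterable[str], probe: Callable[[str], bool],
                 interval: float) -> None:
        self._candidates = list(candidates)
        self._probe = probe  # hypothetical: True if the node answers a membership query
        self._interval = interval
        self._members: Set[str] = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()

    @property
    def members(self) -> Set[str]:
        with self._lock:
            return set(self._members)

    def poll_once(self) -> Set[str]:
        found = {node for node in self._candidates if self._probe(node)}
        with self._lock:
            self._members = found
        return set(found)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval)  # wakes early if stop() is called

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self._run, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```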
According to one embodiment, RMB 400 ensures consistency of the configuration by using database transactions. For example, a transaction is begun whenever an attribute is to be changed; a ‘commit’ database operation is applied if the change succeeds on all devices, and a ‘rollback’ operation is applied if the change fails on any device. The RMB may implement these transactions either internally or by using the transaction capabilities of the Configuration Subsystem. According to one embodiment, the Configuration Subsystem's transactions are used, since these may be complicated operations.
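The commit-or-rollback rule can be illustrated with in-memory devices. This is a simplified all-or-nothing update, not a full two-phase commit; `DeviceConfig` and `set_cluster_attribute` are hypothetical, and a device failing the change is simulated with a set of attribute names it rejects.

```python
from typing import Dict, List, Optional, Set


class DeviceConfig:
    """Hypothetical per-device configuration store with a simple transaction."""

    def __init__(self, name: str, values: Dict[str, str],
                 reject: Optional[Set[str]] = None) -> None:
        self.name = name
        self.values = dict(values)
        self.reject = reject or set()  # attributes this device refuses (simulated failure)
        self._saved: Optional[Dict[str, str]] = None

    def begin(self) -> None:
        self._saved = dict(self.values)

    def set_value(self, key: str, value: str) -> None:
        if key in self.reject:
            raise ValueError(f"{self.name} rejected {key}")
        self.values[key] = value

    def commit(self) -> None:
        self._saved = None

    def rollback(self) -> None:
        if self._saved is not None:
            self.values = self._saved
            self._saved = None


def set_cluster_attribute(devices: List[DeviceConfig], key: str, value: str) -> bool:
    """Apply one attribute change to every device, or to none of them."""
    for device in devices:
        device.begin()
    try:
        for device in devices:
            device.set_value(key, value)
    except ValueError:
        for device in devices:
            device.rollback()  # undo devices that already accepted the change
        return False
    for device in devices:
        device.commit()
    return True
```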
RMB Client 420 uses Cluster API 425 to discover the cluster's member devices.
RMB 400 uses messages to perform system and NM operations. The system operations include acquiring and releasing the configuration lock. When a message is to be sent, the RMB fills in the header and delivers the message. When a message is received, the RMB checks the header and accepts the message only if the values in the header fields are valid; it discards any message whose header contains invalid values.
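The text does not list the header fields, so the sketch below assumes four: a protocol version, a cluster identifier, a message type, and a sequence number used to reject replays. All field names, type names and the version constant are hypothetical, and the sequence check treats incoming messages as a single stream for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, List

PROTOCOL_VERSION = 1  # hypothetical
MESSAGE_TYPES = {"lock_acquire", "lock_release", "nm_query", "nm_operation"}  # hypothetical


@dataclass(frozen=True)
class Header:
    version: int
    cluster_id: str
    msg_type: str
    sequence: int


@dataclass(frozen=True)
class Message:
    header: Header
    body: dict


def make_message(cluster_id: str, msg_type: str, sequence: int, body: dict) -> Message:
    """Sender side: fill in the header, then attach the body."""
    return Message(Header(PROTOCOL_VERSION, cluster_id, msg_type, sequence), body)


def header_is_valid(h: Header, cluster_id: str, last_sequence: int) -> bool:
    return (h.version == PROTOCOL_VERSION
            and h.cluster_id == cluster_id
            and h.msg_type in MESSAGE_TYPES
            and h.sequence > last_sequence)  # older or repeated numbers look like replays


def accept_messages(incoming: Iterable[Message], cluster_id: str) -> List[Message]:
    """Receiver side: keep messages with valid headers and discard the rest."""
    accepted: List[Message] = []
    last_sequence = -1
    for msg in incoming:
        if header_is_valid(msg.header, cluster_id, last_sequence):
            accepted.append(msg)
            last_sequence = msg.header.sequence
    return accepted
```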
RMB Client 420 composes the body of an RMB message and uses Cluster API 425 to deliver the message to the cluster members, receive the responses from the members, and extract the result of the operation from the message. Remote API 430 delivers the message to a particular cluster member and checks that a response message is received for every request message sent. Secure Transport 435 is the transport mechanism that actually sends and receives the messages.
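The layering between Cluster API 425 and Remote API 430 can be sketched with a pluggable transport. The transport callback is a hypothetical stand-in for Secure Transport 435, and matching each response to its request by an id is one assumed way for the Remote API to check that every request gets a response.

```python
import itertools
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for Secure Transport 435: send a request to a member, return its reply.
Transport = Callable[[str, dict], Optional[dict]]


class RemoteAPI:
    """Delivers one request to one member and insists on a matching response."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport
        self._next_id = itertools.count(1)

    def call(self, member: str, body: dict) -> dict:
        request_id = next(self._next_id)
        reply = self._transport(member, {"id": request_id, "body": body})
        if reply is None or reply.get("id") != request_id:
            raise ConnectionError(f"no response from {member} for request {request_id}")
        return reply["body"]


class ClusterAPI:
    """Fans a request out to every member and collects the per-member results."""

    def __init__(self, remote: RemoteAPI, members: List[str]) -> None:
        self._remote = remote
        self._members = list(members)

    def broadcast(self, body: dict) -> Dict[str, dict]:
        return {member: self._remote.call(member, body) for member in self._members}
```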
The RMB Client can be implemented as a collection of shared-object libraries with well-defined Application Programming Interfaces (APIs). CGUI and CCLI can use these APIs to interact with the RMB to perform NM operations.
The RMB Server can be implemented as a daemon that is launched during system start-up.
RMB's Secure Transport can be implemented as a Secure Sockets Layer (SSL) socket. This provides an extra layer of security by making it possible to encrypt the RMB messages.
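Python's standard `ssl` module gives a rough idea of such a transport; today it negotiates TLS, the successor to SSL. The host, port and optional CA file are caller-supplied assumptions, as is the choice of TLS 1.2 as the minimum version; the patent names only SSL.

```python
import socket
import ssl
from typing import Optional


def make_client_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """TLS client context that verifies the peer's certificate and hostname."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum
    return ctx


def open_secure_channel(host: str, port: int, ctx: ssl.SSLContext) -> ssl.SSLSocket:
    """Connect to a cluster member and wrap the socket in TLS."""
    raw = socket.create_connection((host, port), timeout=10)
    try:
        return ctx.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()  # do not leak the plain socket if the handshake fails
        raise
```

Verifying certificates matters as much as encryption here: without it, a node impersonating a cluster member could receive reboot commands.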
The above specification, examples and data provide a complete description of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US6044461 *||Sep 16, 1997||Mar 28, 2000||International Business Machines Corporation||Computer system and method of selectively rebooting the same in response to a system program code update|
|US6202097 *||Jan 17, 1995||Mar 13, 2001||International Business Machines Corporation||Methods for performing diagnostic functions in a multiprocessor data processing system having a serial diagnostic bus|
|US6324692 *||Jul 28, 1999||Nov 27, 2001||Data General Corporation||Upgrade of a program|
|US6691244 *||Mar 14, 2000||Feb 10, 2004||Sun Microsystems, Inc.||System and method for comprehensive availability management in a high-availability computer system|
|US6757836 *||Jan 10, 2000||Jun 29, 2004||Sun Microsystems, Inc.||Method and apparatus for resolving partial connectivity in a clustered computing system|
|US6779176 *||Dec 13, 1999||Aug 17, 2004||General Electric Company||Methods and apparatus for updating electronic system programs and program blocks during substantially continued system execution|
|US20030149735 *||Jun 22, 2001||Aug 7, 2003||Sun Microsystems, Inc.||Network and method for coordinating high availability system services|
|US20040153704 *||Jan 23, 2002||Aug 5, 2004||Jurgen Bragulla||Automatic startup of a cluster system after occurrence of a recoverable error|
|US20040158575 *||Jun 24, 2003||Aug 12, 2004||Christian Jacquemot||Distributed computer platform with flexible configuration|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US7469279||Aug 5, 2003||Dec 23, 2008||Cisco Technology, Inc.||Automatic re-provisioning of network elements to adapt to failures|
|US7661025 *||Jan 19, 2006||Feb 9, 2010||Cisco Technology, Inc.||Method of ensuring consistent configuration between processors running different versions of software|
|US8145789 *||Sep 15, 2003||Mar 27, 2012||Cisco Technology, Inc.||Method providing a single console control point for a network device cluster|
|US8209403||Aug 18, 2009||Jun 26, 2012||F5 Networks, Inc.||Upgrading network traffic management devices while maintaining availability|
|US8438253||May 25, 2012||May 7, 2013||F5 Networks, Inc.||Upgrading network traffic management devices while maintaining availability|
|US8812635||Dec 14, 2004||Aug 19, 2014||Cisco Technology, Inc.||Apparatus and method providing unified network management|
|US20040141461 *||Jan 22, 2003||Jul 22, 2004||Zimmer Vincent J.||Remote reset using a one-time pad|
|US20050152288 *||Dec 14, 2004||Jul 14, 2005||Krishnam Datla||Apparatus and method providing unified network management|
|US20060075001 *||Sep 30, 2004||Apr 6, 2006||Canning Jeffrey C||System, method and program to distribute program updates|
|US20060104151 *||Dec 30, 2005||May 18, 2006||Rambus Inc.||Single-clock, strobeless signaling system|
|US20070174685 *||Jan 19, 2006||Jul 26, 2007||Banks Donald E||Method of ensuring consistent configuration between processors running different versions of software|
|U.S. Classification||713/1, 714/12, 713/2, 714/11, 714/13, 714/4.4|
|International Classification||G06F11/00, G06F15/177, H04L, G06F9/445|
|Cooperative Classification||G06F9/4405, G06F11/1441, G06F15/177|
|European Classification||G06F9/44A2, G06F15/177, G06F11/14A8P|
|Jun 25, 2003||AS||Assignment|
Owner name: NOKIA INC., TEXAS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MITTAL, AJAY;XU, LAURA;KONERU, SRIKANTH;REEL/FRAME:014243/0781;SIGNING DATES FROM 20030624 TO 20030625
|Dec 26, 2006||CC||Certificate of correction|
|Feb 21, 2008||AS||Assignment|
Owner name: NOKIA CORPORATION, FINLAND
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NOKIA INC;REEL/FRAME:020540/0061
Effective date: 20070326
|Feb 25, 2008||AS||Assignment|
Owner name: NOKIA SIEMENS NETWORKS OY, FINLAND
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NOKIA CORPORATION;REEL/FRAME:020550/0521
Effective date: 20070907
|Jan 8, 2010||FPAY||Fee payment|
Year of fee payment: 4
|Jan 3, 2014||FPAY||Fee payment|
Year of fee payment: 8
|Nov 19, 2014||AS||Assignment|
Owner name: NOKIA SOLUTIONS AND NETWORKS OY, FINLAND
Free format text: CHANGE OF NAME;ASSIGNOR:NOKIA SIEMENS NETWORKS OY;REEL/FRAME:034294/0603
Effective date: 20130819
|Jul 27, 2015||AS||Assignment|
Owner name: RPX CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NOKIA SOLUTIONS AND NETWORKS OY;REEL/FRAME:036187/0312
Effective date: 20150630