|Publication number||US20110167158 A1|
|Application number||US 12/683,747|
|Publication date||Jul 7, 2011|
|Priority date||Jan 7, 2010|
|Also published as||US7962615|
|Inventors||Douglas L. Lehr, Franklin E. McCune, David C. Reed, Max D. Smith|
|Original Assignee||International Business Machines Corporation|
1. Field of the Invention
The present invention relates in general to computers, and more particularly to a method, system, and computer program product for reducing deadlock in multi-system computing environments.
2. Description of the Related Art
Computers and computer systems are found in a variety of settings in today's society. Computing environments and networks may be found at home, at work, at school, in government, and in other settings. Computing environments increasingly store data in one or more storage environments, which in many cases are remote from the local interface presented to a user.
These computing storage environments may use many storage devices such as disk drives, often working in concert, to store, retrieve, and update a large body of data, which may then be provided to a host computer requesting or sending the data. In some cases, a number of data storage subsystems are collectively managed as a single data storage system. These subsystems may be managed by host “sysplex” (system complex) configurations that combine several processing units or clusters of processing units. In this way, multi-system computing environments may be used to organize and process large quantities of data.
In many of today's applications, operable in such multi-system computing environments, there arises a need to obtain system resources that may be held by other threads within the same application, or by other applications across a sysplex. One such application that utilizes this methodology is VSAM (Virtual Storage Access Method) Record Level Sharing (RLS). This mechanism allows for sharing of VSAM datasets down to the record level, and allows such sharing across an entire sysplex.
While mechanisms such as RLS include a deadlock detection scheme, these mechanisms are useful to detect problems with individual records being deadlocked by multiple transactions. Currently there is no mechanism to serialize between a variety of resource types, such as record locks, special locks, enqueues, internal latches, buffers, device reserves, cache entries, and the like. This may lead to deficiencies in deadlock detection and reduction, as a heavily used sysplex utilizing mechanisms such as RLS may have any number of requests for resources pending for various resources on various sysplexes at any one time. In view of the foregoing, a need exists for a mechanism to detect, recover from, and reduce system deadlocks in multiple-system/multiple-sysplex computing environments that more adequately addresses the potential issues described previously.
Accordingly, various embodiments for reducing deadlock in multi-system computing environments are provided. In one such embodiment, by way of example only, a method for reducing deadlock in multi-system computing environments is provided. A set of default, current wait times is initialized for resource requests of each of a plurality of resources. A plurality of resource holders and resource waiters is monitored within an address space of the multi-system computing environment. If one resource holder of the plurality of resource holders of one of the plurality of resources is determined to be one resource waiter on another of the plurality of resources, a current wait time for the one resource holder is incremented and a deadlock indicator for both the one resource holder and the one resource waiter is activated. Subsequent to the expiration of a predefined interval, which of the plurality of resource holders and resource waiters having an active deadlock indicator is aggregated. The plurality of resource holders and resource waiters is parsed through to determine an original resource holder, indicating a system deadlock. Pursuant to detecting the system deadlock, which of the plurality of resource holders associated with the system deadlock having a lowest current wait time is restarted.
In addition to the foregoing exemplary embodiment, various system and computer program embodiments are provided and supply related advantages.
In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
The illustrated embodiments below provide mechanisms for reducing deadlock in multiple system computing environments. In one embodiment, these mechanisms monitor requests for all resources held by various address spaces in the computing environment, such as all VSAM RLS address spaces, along with current Program Status Words (PSW) associated with the spaces. These requests may then be transferred to all instances of VSAM RLS jobs, for example, across multiple sysplexes.
Using user-defined timeout limits for the various resources, each instance may determine whether it is holding resources that are needed by other threads on either the same system or a different system. If the PSW associated with the offending threads does not change in the allotted time, these threads may then be restarted to allow waiting threads to move forward.
Examples of disk storage comprise one or more disk drives, for example, arranged as a redundant array of independent disks (RAID) or just a bunch of disks (JBOD), or solid state disk (SSD), etc. Herein, a data storage system having both disk storage 50 and an automated tape library 70 is called a “composite library.” An example of a data storage system 20 which may implement the present invention comprises the IBM® TS7700 Virtual Tape Server, which may provide disaster recoverable virtual tape storage functionality to users.
Each host system 18, 19 may be considered a host sysplex as previously described. As can be seen, and as one of ordinary skill in the art will appreciate, each of the systems 18 and 19 is interconnected between storage clusters 30 over network 80. Accordingly, and as previously described, an application operable on one sysplex 18 may hold or wait for a system resource waited for or held by an additional sysplex 19. In addition, one thread of the application may wait on the same resource held by another thread of the same application.
Generally, the server 202 operates under control of an operating system (OS) 208 (e.g. z/OS, OS/2, LINUX, UNIX, WINDOWS, MAC OS) stored in the memory 206, and interfaces with the user to accept inputs and commands and to present results, for example through a graphical user interface (GUI) module 232. In one embodiment of the present invention, the monitoring mechanisms as will be further described are facilitated by the OS 208. Although the GUI module 232 is depicted as a separate module, the instructions performing the GUI functions can be resident or distributed in the operating system 208, the application program 210, or implemented with special purpose memory and processors. Server 202 also optionally comprises an external data communication device 230 such as a modem, satellite link, Ethernet card, wireless link or other device for communicating with other computers, e.g. via the Internet or other network.
The server 202 implements a compiler 212 that allows the application program 210 written in a programming language such as COBOL, PL/1, C, C++, JAVA, ADA, BASIC, VISUAL BASIC or any other programming language to be translated into code that is readable by the processor 204. After completion, the application program 210 accesses and manipulates data stored in the memory 206 of the server 202 using the relationships and logic that were generated using the compiler 212. A number of threads may be operational on application 210 as previously indicated. Each thread may hold or wait to hold a system resource of server 202 or of another portion of the multi-system computing environment. Since server 202 is intended to indicate only a portion of the multi-system computing environment, one of ordinary skill in the art will appreciate that additional application programs 210 may be operational in other systems, etc.
Operating system 208 includes a monitoring module 240. The monitoring module may operate in conjunction with program(s) 210, and other components 200, to implement deadlock monitoring operations for a multi-system computing environment such as that shown in
The mechanisms of the present invention allow complex resource allocation applications, such as SMSVSAM (the job name of VSAM RLS), to process at full capacity without the potential for deadlocks. In addition, the mechanisms avoid transactions being held because a subsequent thread cannot, for whatever reason, complete. As a result, the efficiency and overall reliability of the multi-system computing environment are enhanced.
In one embodiment of the present invention, a set of default, current wait times is initialized for various resource requests. These resource requests may include enqueues, buffers, locks, caches, latches, and the like, as one of ordinary skill in the art will appreciate. An interval value is defined between deadlock checks. The default wait times, current wait times, and interval values may be adjusted by a user, for example, depending on how long the user wishes the mechanisms of the present invention to wait before performing various actions.
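The adjustable settings described above can be pictured as a small configuration object. The following is a minimal sketch only: the class name `DeadlockMonitorConfig`, the `override` method, and every numeric value are hypothetical placeholders rather than values given in this description, and it assumes the per-resource limits are counted in monitoring intervals.

```python
from dataclasses import dataclass, field


def _default_wait_limits() -> dict[str, int]:
    # Placeholder limits (assumed to be counted in monitoring intervals);
    # these numbers are illustrative and do not come from the description.
    return {"enqueue": 10, "buffer": 5, "lock": 10, "cache": 5, "latch": 3}


@dataclass
class DeadlockMonitorConfig:
    # Time between deadlock checks; the value is a placeholder.
    interval_seconds: float = 15.0
    # Per-resource-type wait limits, adjustable by the user.
    wait_limits: dict[str, int] = field(default_factory=_default_wait_limits)

    def override(self, resource_type: str, limit: int) -> None:
        """Let a user adjust the limit for one resource type."""
        if limit <= 0:
            raise ValueError("wait limit must be positive")
        self.wait_limits[resource_type] = limit


config = DeadlockMonitorConfig()
config.override("latch", 6)  # wait longer before acting on stalled latch holders
```

Keeping the limits per resource type mirrors the text's point that a user may want the mechanisms to wait different lengths of time for different kinds of resources.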
As was previously indicated, various tables (as will be described in detail, following) may be constructed. In one embodiment, these tables may be organized by resource type. A thread updates the table for a given resource type with the name of the respective resource, the Task Control Block (TCB) address, the system to which the thread belongs, and whether the thread holds the resource or is waiting for the resource. The current wait time may be initially set to zero, although this value may also be varied for a particular situation.
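One way to picture a table entry is as a small record keyed by resource type. This is a sketch under assumptions: the names `TableEntry`, `Role`, and `register`, the field layout, and the sample TCB address and system name are hypothetical, chosen only to mirror the fields listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    HOLDER = "holder"
    WAITER = "waiter"


@dataclass
class TableEntry:
    """One row in a per-resource-type monitoring table (hypothetical layout)."""
    resource_name: str                # name of the resource
    tcb_address: int                  # Task Control Block address of the thread
    system: str                       # system to which the thread belongs
    role: Role                        # holds the resource, or waits for it
    wait_count: int = 0               # current wait time, initially zero
    deadlock_bit: bool = False        # set later by the monitoring thread
    stored_psw: Optional[int] = None  # last PSW recorded by the monitor


# Tables organized by resource type.
tables: dict[str, list[TableEntry]] = {"enqueue": [], "latch": []}


def register(resource_type: str, entry: TableEntry) -> None:
    """A thread records that it holds or waits for a resource."""
    tables.setdefault(resource_type, []).append(entry)


# Illustrative values only: the TCB address and system name are made up.
register("enqueue", TableEntry("SYSZSCM7", 0x7F1000, "SYS1", Role.HOLDER))
register("latch", TableEntry("CLEANUP", 0x7F1000, "SYS1", Role.WAITER))
```

The same TCB address appearing as a holder in one table and a waiter in another is exactly the condition the monitoring thread looks for.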
The monitoring module 240, for example, may execute functionality added to jobs such as SMSVSAM. This functionality may include implementing a monitoring thread that monitors all tables within a particular address space. If a particular thread is a new resource holder, then the current PSW may be stored in the table. For all other resource holders, the monitor may check if the TCB is a waiter on a different table for another resource. If it is waiting, then a deadlock bit may be turned on in both the holding and waiting entry to notify the thread that this holder is, in fact, waiting on another resource, and the wait count is incremented. If the holder is not waiting on any other resources, then the PSW stored for the holder may be compared against the TCB's current PSW. If the TCB's current PSW is the same, and it is determined that a waiter exists for the resource, then the wait count may be incremented. If this is not the case, then the new PSW may be stored, and the wait count reset.
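The per-holder checks in this paragraph can be sketched as a single sweep over the tables. This is an illustration under assumptions, not the SMSVSAM implementation: the `Entry` record, the `monitor_pass` function, and the `current_psw` callback (standing in for reading a TCB's Program Status Word) are hypothetical, and the sweep searches all tables for a waiting entry rather than only a different table.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    resource: str
    tcb: int
    is_holder: bool
    wait_count: int = 0
    deadlock_bit: bool = False
    stored_psw: Optional[int] = None


def monitor_pass(tables: dict[str, list[Entry]],
                 current_psw: Callable[[int], int]) -> None:
    """One sweep of a monitoring thread over all tables in an address space."""
    all_entries = [e for rows in tables.values() for e in rows]
    for holder in (e for e in all_entries if e.is_holder):
        if holder.stored_psw is None:
            # New resource holder: record its current PSW.
            holder.stored_psw = current_psw(holder.tcb)
            continue
        # Is the same TCB a waiter on another resource?
        waiting = next((e for e in all_entries
                        if not e.is_holder and e.tcb == holder.tcb
                        and e.resource != holder.resource), None)
        if waiting is not None:
            holder.deadlock_bit = True
            waiting.deadlock_bit = True
            holder.wait_count += 1
            continue
        # Not waiting elsewhere: compare the stored PSW with the current one.
        psw = current_psw(holder.tcb)
        has_waiter = any(not e.is_holder and e.resource == holder.resource
                         for e in all_entries)
        if psw == holder.stored_psw and has_waiter:
            holder.wait_count += 1      # no progress while others wait
        else:
            holder.stored_psw = psw     # progress was made: reset
            holder.wait_count = 0


def fake_psw(tcb: int) -> int:
    # Stand-in for reading the PSW; it never changes, simulating a stall.
    return 0x1000 + tcb


tables = {
    "enqueue": [Entry("RES_A", tcb=1, is_holder=True),
                Entry("RES_A", tcb=2, is_holder=False)],
    "latch": [Entry("RES_B", tcb=2, is_holder=True),
              Entry("RES_B", tcb=1, is_holder=False)],
}
monitor_pass(tables, fake_psw)  # first sweep records each holder's PSW
monitor_pass(tables, fake_psw)  # second sweep finds each holder waiting elsewhere
```

In this two-task example, each holder also waits on the other's resource, so the second sweep sets the deadlock bit on all four entries and raises each holder's wait count to one.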
If the monitoring thread detects a resource waiter, it may parse through entries in the same table for the corresponding holder. If none are detected, this indicates that the holder is on a different system. This instance of the monitoring thread will then send the waiter to all other instances of monitoring threads, so that each system will have a copy of the waiting thread. This informs the other systems that their requests might be having an effect on other monitoring threads, and that action should be taken to start incrementing the wait counts on the resource holders. Once a system detects that it has the thread holding up the resource, the holder may be sent back to the system that sent the waiter. This allows every system to have a complete list of resource holders/waiters that are relevant to processing.
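The exchange between monitoring instances can be sketched with in-process objects standing in for systems. This is a sketch only: the `MonitorInstance` class and its methods are hypothetical, direct method calls stand in for whatever cross-system messaging is actually used, and the task and resource names follow the example tables discussed in this description.

```python
from dataclasses import dataclass, field


@dataclass
class MonitorInstance:
    """Hypothetical stand-in for one system's monitoring thread."""
    name: str
    holders: dict[str, str] = field(default_factory=dict)        # resource -> task
    waiters: dict[str, list[str]] = field(default_factory=dict)  # resource -> tasks
    peers: list["MonitorInstance"] = field(
        default_factory=list, repr=False, compare=False)

    def add_waiter(self, resource: str, task: str) -> None:
        self.waiters.setdefault(resource, []).append(task)
        if resource not in self.holders:
            # No holder in the local table: the holder is on another
            # system, so send the waiter to every other instance.
            for peer in self.peers:
                peer.receive_waiter(self, resource, task)

    def receive_waiter(self, origin: "MonitorInstance",
                       resource: str, task: str) -> None:
        self.waiters.setdefault(resource, []).append(task)
        if resource in self.holders:
            # This system has the holding thread: send the holder back.
            origin.receive_holder(resource, self.holders[resource])

    def receive_holder(self, resource: str, task: str) -> None:
        self.holders[resource] = task


sys1 = MonitorInstance("System 1", holders={"SYSZSCM7": "Task 1"})
sys2 = MonitorInstance("System 2")
sys1.peers.append(sys2)
sys2.peers.append(sys1)

sys2.add_waiter("SYSZSCM7", "Task 4")
```

After the call, both instances list Task 4 as a waiter on SYSZSCM7 and System 2 has learned that Task 1 is the holder, so neither system needs to poll the other later.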
A thread may be restarted in one of two ways. First, if the interval has expired, and no evidence is found that the thread is waiting on any of the other known resources, then the thread may go through recovery to back out any changes made, and then rerun using the input parameters originally given. This allows other threads behind the stalled thread to continue on and have a chance to access the resource. In some cases, this repairs a deadlock not detected by any detection algorithms.
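The first restart path (back out changes, then rerun with the original inputs) can be illustrated with an undo log. This is a minimal sketch with hypothetical names (`RestartableTask`, `undo_log`, `ledger`, `debit`); the description does not specify how recovery is implemented, so the undo-log approach is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RestartableTask:
    """Hypothetical wrapper pairing work with the inputs it was first given."""
    work: Callable[..., Any]
    original_args: tuple
    undo_log: list[Callable[[], None]] = field(default_factory=list)

    def restart(self) -> Any:
        # Recovery: back out changes in reverse order of application.
        while self.undo_log:
            self.undo_log.pop()()
        # Rerun using the input parameters originally given.
        return self.work(*self.original_args)


ledger = {"balance": 100}


def debit(amount: int) -> int:
    ledger["balance"] -= amount
    return ledger["balance"]


task = RestartableTask(debit, (30,))
debit(30)  # first attempt applies a change, then (in this scenario) stalls
task.undo_log.append(lambda: ledger.update(balance=100))
result = task.restart()  # backs out the debit, then reruns it
```

Because the partial change is backed out before the rerun, the balance is debited once rather than twice.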
The thread may also be restarted pursuant to a deadlock detection algorithm, which will be further explained, following. In one embodiment, for example, subsequent to the expiration of the time interval, this algorithm aggregates all holders and waiters (from each table) having an active deadlock indicator. Since all waiters are transferred to every system at the time the request is issued, there is no need to poll the other systems to gather their holders and waiters. The algorithm parses down a list of all holders collected, and for each entry the algorithm looks for a corresponding wait entry. If one is found, then it checks that resource's holder and looks for another waiter, and so on, until it either doesn't detect another waiter, or it comes back to the original holder.
If a thread holds a resource and is not waiting on any other resource, then that holder is taken off the list of aggregated holders and waiters. In addition, all of the holders searched through may be taken off as well in order to speed up the process. As a next step, the next holder may be selected and the process repeated. However, if a deadlock is detected, the algorithm may then restart the thread with the lowest wait count (since it was the newest request), in the same way as it restarts a thread that timed out. The corresponding threads are taken off the holding list and the deadlock algorithm continues with the remaining entries. Finally, if the deadlock algorithm decides that a thread on another system needs to be terminated, then a request may be sent to that system requesting that it be restarted.
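The chain-following search described above can be sketched as a walk over a holder/waiter graph. This is an illustration under assumptions: the function `find_deadlocks` and its three input mappings are hypothetical, each task is assumed to wait on at most one resource, and the sketch reports a cycle found anywhere along the chain, which is slightly more general than returning only to the original holder.

```python
def find_deadlocks(holds: dict[str, set[str]],
                   waits_on: dict[str, str],
                   wait_count: dict[str, int]) -> list[str]:
    """Return the tasks chosen for restart, one per detected cycle.

    holds:      task -> resources it holds
    waits_on:   task -> the single resource it is waiting on
    wait_count: task -> current wait count
    """
    holder_of = {res: task for task, ress in holds.items() for res in ress}
    remaining = set(holds)            # aggregated list of holders
    victims: list[str] = []
    while remaining:
        start = min(remaining)        # deterministic choice of next holder
        chain = [start]
        current = start
        cycle = None
        while True:
            res = waits_on.get(current)
            nxt = holder_of.get(res) if res is not None else None
            if nxt is None or nxt not in remaining:
                break                 # chain ends: no deadlock through here
            if nxt in chain:
                cycle = chain[chain.index(nxt):]
                break                 # came back to an earlier holder
            chain.append(nxt)
            current = nxt
        if cycle:
            # Restart the task with the lowest wait count (newest request);
            # ties are broken by name to keep the result deterministic.
            victims.append(min(cycle, key=lambda t: (wait_count.get(t, 0), t)))
        # Every holder searched through is taken off the list.
        remaining -= set(chain)
    return victims


holds = {"Task 1": {"A"}, "Task 2": {"B"}, "Task 3": {"C"}, "Task 4": {"D"}}
waits_on = {"Task 1": "B", "Task 2": "C", "Task 3": "A"}
wait_count = {"Task 1": 5, "Task 2": 2, "Task 3": 7, "Task 4": 0}

to_restart = find_deadlocks(holds, waits_on, wait_count)
```

Here Tasks 1, 2, and 3 form a cycle and Task 2 has the lowest wait count among them, so it is the one selected; Task 4 holds a resource without waiting and is simply dropped from the list.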
Turning now to
Continuing to table 322, a number of latches are collectively gathered. Here again, the designated resource (column 324), along with the resource holder (column 326) and resource waiter (column 328) is shown. For example, a certain cleanup latch 330 is held by Task 7 (332) and waited to be held by Task 3 (334).
In a separate system 304 (System 2) of the multi-system computing environment, a table 336 of enqueues is maintained. Here again, a specific resource is designated in column 338, the holder in column 340, and the waiters in column 342. In the first example entry, for resource SYSZSCM7 (the same resource delineated in table 306), system 302 knows that Task 1 is the holder (ostensibly because Task 1 is operable in System 1), but the holder is unknown to system 304, although Task 4 waits for the resource (as shown in boxes 344, 346, and 348). Similarly, while resource SYSEOM in table 306 has an unknown holder (i.e., box 320), it is known in table 336 that the resource is held by Task 8.
The example entries in tables 306, 322, and 336 demonstrate that tasks may wait on resources held by other systems, and resources on other systems may be held without the knowledge that additional waiters and/or holders may exist on different systems. Additionally, tasks may hold more than one resource, but may only wait on one, since waiting on a resource effectively stalls the thread until the resource is granted.
Turning now to
Once the various tables 306, 322, and 336 are updated pursuant to the monitoring mechanisms of the present invention, the mechanisms may then act to detect similar waiters/holders having the active deadlock indicator. In this regard, the mechanisms parse through the holders/waiters to, for example, take a particular resource waiter and find the holder, then move to the resource the holder is waiting on, and so on, until the mechanisms eventually find the original resource holder and waiter, if any. Finding the original resource holder indicates a system deadlock. The mechanisms then operate to recover from the deadlock by various means as will be subsequently described, such as restarting a particular stalled thread.
Method 400 begins (step 402) with the initialization of a set of default, current wait times for each of a number of resource requests within the multi-system computing environment (step 404). In step 408, following, an interval value is set between deadlock checks. A set of monitoring tables is constructed for various resource types as previously depicted in
If a new resource holder is found in a particular system (step 414), its current PSW is stored in the table, and is sent to the remaining tables across the multi-system environment (step 416). If, instead, the TCB indicates that the thread is a resource waiter on a differing table for another resource (step 418), then the deadlock indicator (in this case, a bit) is activated for both the holding and the waiting entry in the table (step 420) and the current wait count is incremented (step 422). Next (or alternatively to step 418), the PSW of the examined thread is compared against the TCB's current PSW (step 424). If the PSWs are the same, and there is a waiting resource (step 426), then the current wait count is incremented (step 428). If not, the new PSW is stored in the table and the current wait count is reset (step 430).
If the monitoring mechanisms detect a resource waiter (step 432), the system table the resource waiter corresponds to is also examined for a corresponding resource holder (step 434). If no holder is detected in the same table (step 436), then the resource waiter information is sent to all other job instances (e.g., all SMSVSAM instances) on additional systems (step 438) as previously described.
The method 400 waits (step 442) for an expiration of the interval value. When this occurs (step 440), the method 400 queries whether any of the resource waiters show no proof of waiting on any of the other known resources (step 444). If this is true, then the resource waiter (the corresponding thread) is restarted using the thread's original input parameters (step 446). The method aggregates all of the holders and waiters having an active deadlock indicator (step 448) as previously described. As a following step, the list of aggregated holders/waiters is parsed through to determine an original resource holder and waiter, if any (step 450). If a holder is determined to be not waiting on another resource (step 452), it is removed from the aggregate list (step 454). If the original resource holder is ultimately located (step 456) (again, indicating a system deadlock), the resource holder associated with the system deadlock having a lowest current wait time is restarted (step 458) to resolve the deadlock. Threads corresponding to the identified deadlock are removed (step 460) and the method 400 continues parsing through remaining entries in the aggregated list (step 462, returning to step 450) until no additional entries are found. The method then ends (step 464).
As will be appreciated by one of ordinary skill in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the above figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While one or more embodiments of the present invention have been illustrated in detail, one of ordinary skill in the art will appreciate that modifications and adaptations to those embodiments may be made without departing from the scope of the present invention as set forth in the following claims.
|Jan 27, 2010||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEHR, DOUGLAS L.;MCCUNE, FRANKLIN E.;REED, DAVID C.;AND OTHERS;SIGNING DATES FROM 20091130 TO 20091208;REEL/FRAME:023859/0904
|Jan 15, 2015||SULP||Surcharge for late payment|
|Jan 15, 2015||FPAY||Fee payment|
Year of fee payment: 4
|Mar 17, 2015||AS||Assignment|
Owner name: RAKUTEN, INC., JAPAN
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:035176/0373
Effective date: 20141229