US 20060161737 A1
In some embodiments, a Hat Trick deque requires only a single DCAS for most pushes and pops. The left and right ends do not interfere with each other until there is one or fewer items in the queue, and then a DCAS adjudicates between competing pops. By choosing a granularity greater than a single node, the user can amortize the costs of adding additional storage over multiple push (and pop) operations that employ the added storage. A suitable removal strategy can provide similar amortization advantages. The technique of leaving spare nodes linked in the structure allows an indefinite number of pushes and pops at a given deque end to proceed without the need to invoke memory allocation or reclamation so long as the difference between the number of pushes and the number of pops remains within given bounds. Both garbage collection dependent and explicit reclamation implementations are described.
1. A method of efficiently managing concurrent access to a double-ended data structure, the method comprising:
encoding the double-ended data structure using a subset of a bi-directional referencing chain of nodes instantiated in addressable memory storage;
distinguishing between nodes that encode current state of the data structure and at least one pool of one or more spare nodes thereof;
accessing the current state of the double-ended data structure in the course of plural concurrent computations, wherein two of the plural computations operate at opposing ends of the double-ended data structure; and
coordinating the accessing and any removal of nodes from the spare node pool using a synchronization mechanism that coordinates lock-free access to at least two independently addressable locations in the memory storage.
2. The method of
wherein the synchronization mechanism includes use of transactional memory.
3. The method of
wherein the synchronization mechanism includes use of an operation that linearizably updates N locations in the memory storage, N≧2.
4. The method of
wherein the synchronization mechanism employs a non-blocking software emulation.
5. The method of
wherein the synchronization mechanism employs a DCAS operation implemented in processor hardware.
6. The method of
employed in the implementation of a garbage collector.
7. An implementation of a garbage collector, the implementation comprising:
a double-ended concurrent shared object organized as a dynamically-sized bi-directional referencing chain of nodes, the double-ended concurrent shared object distinguishing spare nodes thereof; and
a lock-free mechanism for coordinating adding and deleting the spare nodes with concurrent opposing-end accesses for states of two or more values.
8. The garbage collector implementation of
embodied as instruction sequences encoded in one or more computer readable media.
9. The garbage collector implementation of
embodied as a programmed computational machine.
10. The garbage collector implementation of
in combination with addressable memory storage in which the bi-directional referencing chain of nodes is instantiated,
wherein the lock-free mechanism coordinates access to at least two independently addressable locations in the memory storage.
11. The garbage collector implementation of
wherein the lock-free mechanism employs transactional memory.
12. The garbage collector implementation of
wherein the lock-free mechanism includes use of an operation that linearizably updates N locations in memory storage, N≧2.
13. The garbage collector implementation of
wherein the lock-free mechanism employs a software emulation.
14. The garbage collector implementation of
wherein the lock-free mechanism employs a DCAS operation implemented in processor hardware.
The present application is a continuation of U.S. patent application Ser. No. 09/837,669, filed Apr. 18, 2001, which is itself a continuation-in-part of U.S. application Ser. No. 09/551,113, filed Apr. 18, 2000. Application Ser. No. 09/837,669 is incorporated herein by reference in its entirety.
In addition, this application is related to U.S. patent application Ser. No. 09/837,671, filed Apr. 18, 2001, now U.S. Pat. No. 6,993,770.
1. Field of the Invention
The present invention relates generally to coordination amongst execution sequences in a multiprocessor computer, and more particularly, to structures and techniques for facilitating non-blocking access to concurrent shared objects.
2. Description of the Related Art
An important abstract data structure in computer science is the “double-ended queue” (abbreviated “deque” and pronounced “deck”), which is a linear sequence of items, usually initially empty, that supports the four operations of inserting an item at the left-hand end (“left push”), removing an item from the left-hand end (“left pop”), inserting an item at the right-hand end (“right push”), and removing an item from the right-hand end (“right pop”).
Sometimes an implementation of such a data structure is shared among multiple concurrent processes, thereby allowing communication among the processes. It is desirable that the data structure implementation behave in a linearizable fashion; that is, as if the operations that are requested by various processes are performed atomically in some sequential order.
One way to achieve this property is with a mutual exclusion lock (sometimes called a semaphore). For example, when any process issues a request to perform one of the four deque operations, the first action is to acquire the lock, which has the property that only one process may own it at a time. Once the lock is acquired, the operation is performed on the sequential list; only after the operation has been completed is the lock released. This clearly enforces the property of linearizability.
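The lock-based approach just described can be sketched as follows. This is a minimal illustration only: the class name LockedDeque, the use of Python's collections.deque as the underlying sequential list, and the choice of None as the empty indication are all assumptions made for the example, not part of the description above.

```python
import threading
from collections import deque


class LockedDeque:
    """Hypothetical deque in which every operation holds one global lock."""

    def __init__(self):
        self._lock = threading.Lock()   # only one thread may own it at a time
        self._items = deque()           # the underlying sequential list

    def push_left(self, v):
        with self._lock:                # acquire, operate, release
            self._items.appendleft(v)

    def push_right(self, v):
        with self._lock:
            self._items.append(v)

    def pop_left(self):
        with self._lock:
            # None is an assumed "empty" indication for this sketch
            return self._items.popleft() if self._items else None

    def pop_right(self):
        with self._lock:
            return self._items.pop() if self._items else None

    def __len__(self):
        with self._lock:
            return len(self._items)
```

Because both ends share the same lock, a thread working at the right-hand end must wait whenever another thread holds the lock for the left-hand end, even when the two operations touch unrelated items.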
However, it is generally desirable for operations on the left-hand end of the deque to interfere as little as possible with operations on the right-hand end of the deque. Using a mutual exclusion lock as described above, it is impossible for a request for an operation on the right-hand end of the deque to make any progress while the deque is locked for the purposes of performing an operation on the left-hand end. Ideally, operations on one end of the deque would never impede operations on the other end of the deque unless the deque were nearly empty (containing two items or fewer) or, in some implementations, nearly full.
In some computational systems, processes may proceed at very different rates of execution; in particular, some processes may be suspended indefinitely. In such circumstances, it is highly desirable for the implementation of a deque to be “non-blocking” (also called “lock-free”); that is, if a set of processes are using a deque and an arbitrary subset of those processes are suspended indefinitely, it is always still possible for at least one of the remaining processes to make progress in performing operations on the deque.
Certain computer systems provide primitive instructions or operations that perform compound operations on memory in a linearizable form (as if atomically). The VAX computer, for example, provided instructions to directly support the four deque operations. Most computers or processor architectures provide simpler operations, such as “test-and-set” (IBM 360), “fetch-and-add” (NYU Ultracomputer), or “compare-and-swap” (SPARC). SPARC® architecture based processors are available from Sun Microsystems, Inc., Mountain View, Calif. SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems.
The “compare-and-swap” operation (CAS) typically accepts three values or quantities: a memory address A, a comparison value C, and a new value N. The operation fetches and examines the contents V of memory at address A. If those contents V are equal to C, then N is stored into the memory location at address A, replacing V. Whether or not V matches C, V is returned or saved in a register for later inspection. All this is implemented in a linearizable, if not atomic, fashion. Such an operation may be notated as “CAS(A, C, N)”.
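The semantics of CAS(A, C, N) can be modeled in a few lines. The sketch below is an assumption-laden illustration: the Cell class stands in for a memory address, and a lock stands in for the atomicity that a real processor provides in a single instruction. The atomic_increment helper is likewise hypothetical and only shows the usual retry pattern built on CAS.

```python
import threading


class Cell:
    """Hypothetical stand-in for the memory location at address A."""

    def __init__(self, value):
        self.value = value


# Models atomicity only; a hardware CAS does not take a lock.
_memory_lock = threading.Lock()


def cas(cell, compare, new):
    """CAS(A, C, N): store N if the contents V equal C; always return V."""
    with _memory_lock:
        old = cell.value
        if old == compare:
            cell.value = new
        return old


def atomic_increment(cell):
    """Retry until the CAS observes exactly the value that was read."""
    while True:
        seen = cell.value
        if cas(cell, seen, seen + 1) == seen:
            return seen + 1
```

The caller learns whether the store happened by comparing the returned value with the comparison value it supplied.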
Non-blocking algorithms can deliver significant performance benefits to parallel systems. However, there is a growing realization that existing synchronization operations on single memory locations, such as compare-and-swap (CAS), are not expressive enough to support design of efficient non-blocking algorithms. As a result, stronger synchronization operations are often desired. One candidate among such operations is a double-word (“extended”) compare-and-swap (implemented as a CASX instruction in some versions of the SPARC architecture), which is simply a CAS that uses operands of two words in length. It thus operates on two memory addresses, but they are constrained to be adjacent to one another. A more powerful and convenient operation is “double compare-and-swap” (DCAS), which accepts six values: memory addresses A1 and A2, comparison values C1 and C2, and new values N1 and N2. The operation fetches and examines the contents V1 of memory at address A1 and the contents V2 of memory at address A2. If V1 equals C1 and V2 equals C2, then N1 is stored into the memory location at address A1, replacing V1, and N2 is stored into the memory location at address A2, replacing V2. Whether or not V1 matches C1 and whether or not V2 matches C2, V1 and V2 are returned or saved in registers for later inspection. All this is implemented in a linearizable, if not atomic, fashion. Such an operation may be notated as “DCAS(A1, A2, C1, C2, N1, N2)”.
Massalin and Pu disclose a collection of DCAS-based concurrent algorithms. See e.g., H. Massalin and C. Pu, A Lock-Free Multiprocessor OS Kernel, Technical Report TR CUCS-005-9, Columbia University, New York, N.Y., 1991, pages 1-19. In particular, Massalin and Pu disclose a lock-free operating system kernel based on the DCAS operation offered by the Motorola 68040 processor, implementing structures such as stacks, FIFO-queues, and linked lists. Unfortunately, the disclosed algorithms are centralized in nature. In particular, the DCAS is used to control a memory location common to all operations and therefore limits overall concurrency.
Greenwald discloses a collection of DCAS-based concurrent data structures that improve on those of Massalin and Pu. See e.g., M. Greenwald. Non-Blocking Synchronization and System Design, Ph.D. thesis, Stanford University Technical Report STAN-CS-TR-99-1624, Palo Alto, Calif., August 1999, 241 pages. In particular, Greenwald discloses implementations of the DCAS operation in software and hardware and discloses two DCAS-based concurrent double-ended queue (deque) algorithms implemented using an array. Unfortunately, Greenwald's algorithms use DCAS in a restrictive way. The first, described in Greenwald, Non-Blocking Synchronization and System Design, at pages 196-197, uses a two-word DCAS as if it were a three-word operation, storing two deque end pointers in the same memory word, and performing the DCAS operation on the two-pointer word and a second word containing a value. Apart from the fact that Greenwald's algorithm limits applicability by cutting the index range to half a memory word, it also prevents concurrent access to the two ends of the deque. Greenwald's second algorithm, described in Greenwald, Non-Blocking Synchronization and System Design, at pages 217-220, assumes an array of unbounded size, and does not deal with classical array-based issues such as detection of when the deque is empty or full.
Arora et al. disclose a CAS-based deque with applications in job-stealing algorithms. See e.g., N. S. Arora, R. D. Blumofe, and C. G. Plaxton, Thread Scheduling For Multiprogrammed Multiprocessors, in Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, 1998. Unfortunately, the disclosed non-blocking implementation restricts one end of the deque to access by only a single processor and restricts the other end to only pop operations.
Accordingly, improved techniques are desired that provide linearizable and non-blocking (or lock-free) behavior for implementations of concurrent shared objects such as a deque, and which do not suffer from the above-described drawbacks of prior approaches.
A set of structures and techniques is described herein whereby an exemplary concurrent shared object, namely a double-ended queue (deque), is implemented. Although non-blocking, linearizable deque implementations exemplify several advantages of realizations in accordance with the present invention, the present invention is not limited thereto. Indeed, based on the description herein and the claims that follow, persons of ordinary skill in the art will appreciate a variety of concurrent shared object implementations. For example, although the described deque implementations exemplify support for concurrent push and pop operations at both ends thereof, other concurrent shared object implementations in which concurrency requirements are less severe, such as LIFO or stack structures and FIFO or queue structures, may also be implemented using the techniques described herein. Accordingly, subsets of the functional sequences and techniques described herein for exemplary deque realizations may be employed to support any of these simpler structures.
Furthermore, although various non-blocking, linearizable deque implementations described herein employ a particular synchronization primitive, namely a double compare and swap (DCAS) operation, the present invention is not limited to DCAS-based realizations. Indeed, a variety of synchronization primitives may be employed that allow linearizable, if not atomic, update of at least a pair of storage locations. In general, N-way Compare and Swap (NCAS) operations (N≧2) may be employed.
Choice of an appropriate synchronization primitive is typically affected by the set of alternatives available in a given computational system. While direct hardware- or architectural-support for a particular primitive is preferred, software emulations that build upon an available set of primitives may also be suitable for a given implementation. Accordingly, any synchronization primitive that allows the access and spare node maintenance operations described herein to be implemented with substantially equivalent semantics to those described herein is suitable.
Accordingly, a novel linked-list-based concurrent shared object implementation has been developed that provides non-blocking and linearizable access to the concurrent shared object. In an application of the underlying techniques to a deque, non-blocking completion of access operations is achieved without restricting concurrency in accessing the deque's two ends. While providing a non-blocking and linearizable implementation, embodiments in accordance with the present invention combine some of the most attractive features of array-based and linked-list-based structures. For example, like an array-based implementation, addition of a new element to the deque can often be supported without allocation of additional storage. However, when spare nodes are exhausted, embodiments in accordance with the present invention allow expansion of the linked-list to include additional nodes. The cost of splicing a new node into the linked-list structure may be amortized over the set of subsequent push and pop operations that use that node to store deque elements. Some realizations also provide for removal of excess spare nodes. In addition, an explicit reclamation implementation is described, which facilitates use of the underlying techniques in environments or applications where automatic reclamation of storage is unavailable or impractical.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
The use of the same reference symbols in different drawings indicates similar or identical items.
The description that follows presents a set of techniques, objects, functional sequences and data structures associated with concurrent shared object implementations employing linearizable synchronization operations in accordance with an exemplary embodiment of the present invention. An exemplary non-blocking, linearizable concurrent double-ended queue (deque) implementation that employs double compare-and-swap (DCAS) operations is illustrative. A deque is a good exemplary concurrent shared object implementation in that it involves all the intricacies of LIFO-stacks and FIFO-queues, with the added complexity of handling operations originating at both of the deque's ends. Accordingly, techniques, objects, functional sequences and data structures presented in the context of a concurrent deque implementation will be understood by persons of ordinary skill in the art to describe a superset of support and functionality suitable for less challenging concurrent shared object implementations, such as LIFO-stacks, FIFO-queues or concurrent shared objects (including deques) with simplified access semantics.
In view of the above, and without limitation, the description that follows focuses on an exemplary linearizable, non-blocking concurrent deque implementation that behaves as if access operations on the deque are executed in a mutually exclusive manner, despite the absence of a mutual exclusion mechanism. Advantageously, and unlike prior approaches, deque implementations in accordance with some embodiments of the present invention are dynamically-sized and allow concurrent operations on the two ends of the deque to proceed independently. Since synchronization operations are relatively slow and/or impose overhead, it is generally desirable to minimize their use. Accordingly, one advantage of some implementations in accordance with the present invention is that in typical execution paths of both access and spare node maintenance operations, only a single synchronization operation is required.
One realization of the present invention is as a deque implementation employing the DCAS operation on a shared memory multiprocessor computer. This realization, as well as others, will be understood in the context of the following computation model, which specifies the concurrent semantics of the deque data structure.
In general, a concurrent system consists of a collection of n processors. Processors communicate through shared data structures called objects. Each object has an associated set of primitive operations that provide the mechanism for manipulating that object. Each processor P can be viewed in an abstract sense as a sequential thread of control that applies a sequence of operations to objects by issuing an invocation and receiving the associated response. A history is a sequence of invocations and responses of some system execution. Each history induces a “real-time” order of operations where an operation A precedes another operation B, if A's response occurs before B's invocation. Two operations are concurrent if they are unrelated by the real-time order. A sequential history is a history in which each invocation is followed immediately by its corresponding response. The sequential specification of an object is the set of legal sequential histories associated with it. The basic correctness requirement for a concurrent implementation is linearizability, which requires that every concurrent history is “equivalent” to some legal sequential history which is consistent with the real-time order induced by the concurrent history. In a linearizable implementation, an operation appears to take effect atomically at some point between its invocation and response. In the model described herein, the collection of shared memory locations of a multiprocessor computer's memory (including location L) is a linearizable implementation of an object that provides each processor Pi with the following set of sequentially specified machine operations:
Implementations described herein are non-blocking (also called lock-free). Let us use the term higher-level operations in referring to operations of the data type being implemented, and lower-level operations in referring to the (machine) operations in terms of which it is implemented. A non-blocking implementation is one in which, even though individual higher-level operations may be delayed, the system as a whole continuously makes progress. More formally, a non-blocking implementation is one in which any infinite history containing a higher-level operation that has an invocation but no response must also contain infinitely many responses. In other words, if some processor performing a higher-level operation continuously takes steps and does not complete, it must be because some operations invoked by other processors are continuously completing their responses. This definition guarantees that the system as a whole makes progress and that individual processors cannot be blocked, only delayed by other processors continuously taking steps. Using locks would violate the above condition, hence the alternate name: lock-free.
Double Compare-and-Swap Operation
Double compare-and-swap (DCAS) operations are well known in the art and have been implemented in hardware, such as in the Motorola 68040 processor, as well as through software emulation. Accordingly, a variety of suitable implementations exist and the descriptive code that follows is meant to facilitate later description of concurrent shared object implementations in accordance with the present invention and not to limit the set of suitable DCAS implementations. For example, order of operations is merely illustrative and any implementation with substantially equivalent semantics is also suitable. Similarly, some formulations (such as described above) may return previous values while others may return success/failure indications. The illustrative formulation that follows is of the latter type. In general, any of a variety of formulations are suitable.
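A formulation of DCAS that returns a success/failure indication can be sketched as shown here. This is a blocking, lock-based emulation chosen purely for readability; the Cell class, the lock, and the function name dcas are assumptions of the example and are not drawn from any particular hardware or library interface.

```python
import threading


class Cell:
    """Hypothetical stand-in for one independently addressable location."""

    def __init__(self, value):
        self.value = value


# A single lock models the atomicity of the whole operation.
_dcas_lock = threading.Lock()


def dcas(a1, a2, c1, c2, n1, n2):
    """Update both locations only if both hold their expected values.

    Returns True if both stores were performed, False if neither was.
    """
    with _dcas_lock:
        if a1.value == c1 and a2.value == c2:
            a1.value = n1
            a2.value = n2
            return True
        return False
```

The two locations need not be adjacent, which is what distinguishes this operation from a double-word CAS.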
The above sequences of operations implementing the DCAS operation are executed atomically using support suitable to the particular realization. For example, in various realizations, atomicity is provided through hardware support (e.g., as implemented by the Motorola 68040 microprocessor or as described in M. Herlihy and J. Moss, Transactional memory: Architectural Support For Lock-Free Data Structures, Technical Report CRL 92/07, Digital Equipment Corporation, Cambridge Research Lab, 1992, 12 pages), through non-blocking software emulation (such as described in G. Barnes, A Method For Implementing Lock-Free Shared Data Structures, in Proceedings of the 5th ACM Symposium on Parallel Algorithms and Architectures, pages 261-270, June 1993 or in N. Shavit and D. Touitou, Software transactional memory, Distributed Computing, 10(2):99-116, February 1997), or via a blocking software emulation (such as described in U.S. patent application Ser. No. 09/207,940, entitled “PLATFORM INDEPENDENT DOUBLE COMPARE AND SWAP OPERATION,” naming Cartwright and Agesen as inventors, and filed Dec. 9, 1998).
Although the above-referenced implementations are presently preferred, other DCAS implementations that substantially preserve the semantics of the descriptive code (above) are also suitable. Furthermore, although much of the description herein is focused on double compare-and-swap (DCAS) operations, it will be understood that N-location compare-and-swap operations (N≧2) or transactional memory may be more generally employed, though often at some increased overhead.
A Double-Ended Queue (Deque)
A deque object S is a concurrent shared object that, in an exemplary realization, is created by a constructor operation, e.g., make_deque( ), and which allows each processor P_i, 0≦i≦n-1, of a concurrent system to perform the following types of operations on S: push_right_i(v), push_left_i(v), pop_right_i( ), and pop_left_i( ). Each push operation has an input, v, where v is selected from a range of values. Each pop operation returns an output from the range of values. Push operations on a full deque object and pop operations on an empty deque object return appropriate indications. In the case of a dynamically-sized deque, “full” refers to the case where the deque is observed to have no available nodes to accommodate a push and the system storage allocator reports that no more storage is available to the process.
A concurrent implementation of a deque object is one that is linearizable to a standard sequential deque. This sequential deque can be specified using a state-machine representation that captures all of its allowable sequential histories. These sequential histories include all sequences of push and pop operations induced by the state machine representation, but do not include the actual states of the machine. In the following description, we abuse notation slightly for the sake of clarity.
The state of a deque is a sequence of items S=<v0, . . . , vk> from the range of values, having cardinality 0≦|S|≦max_length_S. The deque is initially in the empty state (following invocation of make_deque ( )), that is, has cardinality 0, and is said to have reached a full state if its cardinality is max_length_S. In general, for deque implementations described herein, cardinality is unbounded except by limitations (if any) of an underlying storage allocator.
The four possible push and pop operations, executed sequentially, induce the following state transitions of the sequence S=<v0, . . . , vk>, with appropriate returned values:
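One way to picture a sequential specification of this kind is the small state machine sketched here. It is an illustrative model only: the class name, the optional max_length parameter, and the literal indications "okay", "full", and "empty" are assumptions introduced for the example.

```python
class SequentialDeque:
    """Hypothetical sequential model of the state S = <v0, ..., vk>."""

    def __init__(self, max_length=None):
        self.items = []                 # initially the empty state
        self.max_length = max_length    # None models an unbounded deque

    def _full(self):
        return self.max_length is not None and len(self.items) >= self.max_length

    def push_right(self, v):
        if self._full():
            return "full"               # state unchanged
        self.items.append(v)            # <v0..vk> becomes <v0..vk, v>
        return "okay"

    def push_left(self, v):
        if self._full():
            return "full"
        self.items.insert(0, v)         # <v0..vk> becomes <v, v0..vk>
        return "okay"

    def pop_right(self):
        if not self.items:
            return "empty"              # state unchanged
        return self.items.pop()         # returns vk, leaving <v0..vk-1>

    def pop_left(self):
        if not self.items:
            return "empty"
        return self.items.pop(0)        # returns v0, leaving <v1..vk>
```

A concurrent implementation is linearizable if every concurrent history it admits is equivalent to some history this sequential model could produce.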
Many programming languages and execution environments have traditionally placed responsibility for dynamic allocation and deallocation of memory on the programmer. For example, in the C programming language, memory is allocated from the heap by the malloc procedure (or its variants). Given a pointer variable, p, execution of machine instructions corresponding to the statement p=malloc (sizeof (SomeStruct)) causes pointer variable p to point to newly allocated storage for a memory object of size necessary for representing a SomeStruct data structure. After use, the memory object identified by pointer variable p can be deallocated, or freed, by calling free (p). Other languages provide analogous facilities for explicit allocation and deallocation of memory.
Dynamically-allocated storage becomes unreachable when no chain of references (or pointers) can be traced from a “root set” of references (or pointers) to the storage. Memory objects that are no longer reachable, but have not been freed, are called garbage. Similarly, storage associated with a memory object can be deallocated while still referenced. In this case, a dangling reference has been created. In general, dynamic memory can be hard to manage correctly. Because of this difficulty, garbage collection, i.e., automatic reclamation of dynamically-allocated storage, can be an attractive model of memory management. Garbage collection is particularly attractive for languages such as the JAVA™ language (JAVA and all Java-based marks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries), Prolog, Lisp, Smalltalk, Scheme, Eiffel, Dylan, ML, Haskell, Miranda, Oberon, etc. See generally, Jones & Lins, Garbage Collection: Algorithms for Automatic Dynamic Memory Management, pp. 1-41, Wiley (1996) for a discussion of garbage collection and of various algorithms and implementations for performing garbage collection.
In general, the availability of particular memory management facilities is language, implementation and execution environment dependent. Accordingly, for some realizations in accordance with the present invention, it is acceptable to assume that storage is managed by a “garbage collector” that returns (to a “free pool”) that storage for which it can be proven that no process will, in the future, access data structures contained therein. Such a storage management scheme allows operations on a concurrent shared object, such as a deque, to simply eliminate references or pointers to a removed data structure and rely upon operation of the garbage collector for automatic reclamation of the associated storage.
However, for some realizations, a garbage collection facility may be unavailable or impractical. For example, one realization in which automatic reclamation may be unavailable or impractical is a concurrent shared object implementation (e.g., a deque) employed in the implementation of a garbage collector itself. Accordingly, in some realizations in accordance with the present invention, storage is explicitly reclaimed or “freed” when no longer used. For example, in some realizations, removal operations include explicit reclamation of the removed storage.
Deque with Amortized Node Allocation
One embodiment in accordance with the present invention includes a linked-list-based implementation of a lock-free double-ended queue (deque). The implementation includes both structures (e.g., embodied as data structures in memory and/or other storage) and techniques (e.g., embodied as operations, functional sequences, instructions, etc.) that allow costs associated with allocation of additional storage to be amortized over multiple access operations. The exemplary implementation employs double compare and swap (DCAS) operations to provide linearizable behavior. However, as described elsewhere herein, other synchronization primitives may be employed in other realizations. In general, the exemplary implementation exhibits a number of features that tend to improve its performance:
Although all of these features are provided in some realizations, fewer than all may be provided in others.
The organization and structure of a doubly-linked list 102 and deque 101 encoded therein are now described with reference to
Each node encodes two pointers and a value field. The first pointer of a node points to the node to its right in a linked-list of such nodes, and the second pointer points to the node to its left. There are two shared variables LeftHat and RightHat, which always point to nodes within the doubly-linked list. LeftHat always points to a node that is to the left (though not necessarily immediately left) of the node to which RightHat points. The node to which LeftHat points at a given instant of time is sometimes called the left sentinel and the node to which RightHat points at a given instant of time is sometimes called the right sentinel. The primary invariant of this scheme is that the nodes, including both sentinels, always form a consistent doubly-linked list. Each node has a left pointer to its left neighbor and a right pointer to its right neighbor. The doubly-linked chain is terminated at its end nodes by a null in the right pointer of the rightmost node and a null in the left pointer of the leftmost node.
It is assumed that there are three distinguishing null values (called “nullL”, “nullR”, and “nullX”) that can be stored in the value field of a node but which are never requested to be pushed onto the deque. The left sentinel is always to the left of the right sentinel, and the zero or more nodes falling between the two sentinels always have non-null values in their value fields. Both sentinels and all nodes “beyond” the sentinels in the linked structure always have null values in their value cells. Except as described below, left sentinel and the spare nodes (if any) to the logical left thereof have nullL in their value fields, while right sentinel and spare nodes (if any) to the logical right thereof have a corresponding nullR in their value fields. Notwithstanding the above, the most extreme node at each end of the linked-list structure holds the nullX in its value field rather than the usual left or right null value.
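The node layout and the value-field conventions described above can be made concrete with a small checker. The sketch is a single-threaded model of the quiescent layout only; it does not model the exceptions the text alludes to, and the names Null, Node, link, and check_invariants are hypothetical helpers introduced for illustration.

```python
class Null:
    """Distinguished null marker that is never pushed as a value."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NULL_L, NULL_R, NULL_X = Null("nullL"), Null("nullR"), Null("nullX")
NULLS = (NULL_L, NULL_R, NULL_X)


class Node:
    def __init__(self, value):
        self.R = None       # pointer to the right neighbor
        self.L = None       # pointer to the left neighbor
        self.V = value      # value field


def link(values):
    """Build a doubly-linked chain, left to right, from a list of values."""
    nodes = [Node(v) for v in values]
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    return nodes


def check_invariants(left_hat, right_hat):
    """Return True if the chain matches the layout described in the text."""
    node = left_hat
    while node.L is not None:           # walk to the leftmost node
        node = node.L
    region = "left"
    while node is not None:
        if region == "middle" and node is right_hat:
            region = "right"
        extreme = node.L is None or node.R is None
        if region == "middle":
            ok = not any(node.V is n for n in NULLS)
        else:
            side_null = NULL_L if region == "left" else NULL_R
            ok = node.V is (NULL_X if extreme else side_null)
        if not ok:
            return False
        if node.R is not None and node.R.L is not node:
            return False                # links must be mutually consistent
        if region == "left" and node is left_hat:
            region = "middle"
        node = node.R
    return region == "right"            # the right sentinel must have been reached
```

For example, an empty deque with one spare node on each side corresponds to the chain nullX, nullL, nullR, nullX, with LeftHat on the second node and RightHat on the third.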
Terms such as always, never, all, none, etc. are used herein to describe sets of consistent states presented by a given computational system. Of course, persons of ordinary skill in the art will recognize that certain transitory states may and do exist in physical implementations even if not presented by the computational system. Accordingly, such terms and invariants will be understood in the context of consistent states presented by a given computational system rather than as a requirement for precisely simultaneous effect of multiple state changes. This “hiding” of internal states is commonly referred to by calling the composite operation “atomic”, and by allusion to a prohibition against any process seeing any of the internal states partially performed.
Referring more specifically to
Most operations on a deque are performed by “moving a hat” (i.e., redirecting a sentinel pointer) between a sentinel node and an adjacent node, taking advantage of the presence of spare nodes to avoid the expense of frequent memory allocation calls. One way to understand operation of the deque is to contrast its operation with other algorithms that push a new element onto a linked-list implemented data structure by creating a new node and then splicing the new node onto one end. In contrast, embodiments in accordance with the present invention treat a doubly-linked list structure more as if it were an array. For example, addition of a new element to the deque can often be supported by simple adjustment of a pointer and installation of the new value into a node that is already present in the linked list. However, unlike a typical array-based algorithm, which, on exhaustion of pre-allocated storage, must report a full deque, embodiments in accordance with the present invention allow expansion of the linked-list to include additional nodes. In this way, the cost of splicing a new node into the doubly-linked structure may be amortized over the set of subsequent push and pop operations that use that node to store deque elements. In this manner, embodiments in accordance with the present invention combine some of the most attractive features of array-based and linked-list-based implementations.
In addition to value-encoding nodes (if any), two sentinel nodes are also included in a linked-list representation in accordance with the present invention. The sentinel nodes are simply the nodes of the linked-list identified by LeftHat and RightHat. Otherwise, the sentinel nodes are structurally-indistinguishable from other nodes of the linked-list. When the deque is empty, the sentinel nodes are a pair of nodes linked adjacent to one another.
Besides the sentinels and the nodes that are logically “in the deque,” additional spare nodes may also be linked into the list. These spare nodes, e.g., nodes 201, 202, 203 and 204 (see
An empty deque is created or initialized by stringing together a convenient number of nodes into a doubly-linked list that is terminated at the left end with a null left pointer and at the right end with a null right pointer. A pair of adjacent nodes are designated as the sentinels, with the one pointed to by the left sentinel pointer having its right pointer pointing to the one designated by the right sentinel pointer, and vice versa. Both spare and sentinel nodes have null value fields that distinguish them from nodes of the deque.
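The initialization just described can be sketched as follows. This is a minimal, single-threaded Python illustration rather than a reproduction of any listing of the present description; the names `Null`, `Node`, `Deque`, `make_empty_deque` and the `spares` parameter are assumptions introduced only for this sketch.

```python
class Null:
    """A distinguishing null value that is never pushed onto the deque."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NULL_L, NULL_R, NULL_X = Null("nullL"), Null("nullR"), Null("nullX")


class Node:
    """One list cell: left pointer, right pointer and a value field."""
    def __init__(self, value):
        self.L = None          # None plays the role of a null pointer
        self.R = None
        self.value = value


class Deque:
    """Holds the two shared hat pointers (attribute names are assumed)."""
    def __init__(self, left_hat, right_hat):
        self.left_hat = left_hat
        self.right_hat = right_hat


def make_empty_deque(spares=2):
    """Build an empty deque with `spares` spare nodes beyond each sentinel."""
    left = [Node(NULL_L) for _ in range(spares + 1)]   # spares, then left sentinel
    right = [Node(NULL_R) for _ in range(spares + 1)]  # right sentinel, then spares
    chain = left + right
    for a, b in zip(chain, chain[1:]):                 # link neighbors both ways
        a.R, b.L = b, a
    chain[0].value = NULL_X                            # extreme nodes hold nullX
    chain[-1].value = NULL_X
    return Deque(left_hat=left[-1], right_hat=right[0])
```

With `spares=0` the two sentinels are themselves the extreme nodes and therefore hold nullX, which corresponds to a deque with no spare cell available at either end.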
The description that follows presents an exemplary non-blocking implementation of a deque based on an underlying doubly-linked-list data structure wherein access operations (illustratively, push_right, pop_right, push_left and pop_left) facilitate concurrent access. Exemplary code and illustrative drawings will provide persons of ordinary skill in the art with detailed understanding of one particular realization of the present invention; however, as will be apparent from the description herein and the breadth of the claims that follow, the invention is not limited thereto. Exemplary right-hand-side code is described with the understanding that left-hand-side operations are symmetric. Use herein of directional signals (e.g., left and right) will be understood by persons of ordinary skill in the art to be somewhat arbitrary. Accordingly, many other notational conventions, such as top and bottom, first-end and second-end, etc., and implementations denominated therein are also suitable. With the foregoing in mind, pop_right and push_right access operations and related right-end spare node maintenance operations are now described.
An illustrative push_right access operation in accordance with the present invention follows:
To perform a push_right access operation, a processor uses the DCAS in lines 4-5 to attempt to move the right hat to the right and replace the right null value formerly under it (nullR) with the new value passed to the push operation (newVal). If the DCAS succeeds, the push has succeeded. Otherwise, the DCAS failed either because the hat was moved by some other operation or because there is not an available cell—a condition indicated by a nullX in the value cell of the sentinel node.
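By way of illustration only, the push just described may be sketched in Python as shown below. The sketch is not the listing referred to by the line numbers in the text. It assumes that DCAS is emulated with a lock (the helper `dcas`), that `None` stands for a null pointer, and that a sentinel holding nullX simply causes a "full" result, whereas the described implementation may instead attempt to add nodes. All names in the sketch are hypothetical.

```python
import threading

NULL_L, NULL_R, NULL_X = object(), object(), object()  # distinguishing nulls
_atomic = threading.Lock()  # stands in for the atomicity of a hardware DCAS


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


class Deque:
    def __init__(self, left_hat, right_hat):
        self.left_hat, self.right_hat = left_hat, right_hat


def make_empty_deque(spares=2):
    """Empty deque with `spares` spare nodes beyond each sentinel."""
    nodes = ([Node(NULL_L) for _ in range(spares + 1)] +
             [Node(NULL_R) for _ in range(spares + 1)])
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    nodes[0].value = nodes[-1].value = NULL_X
    return Deque(nodes[spares], nodes[spares + 1])


def dcas(obj1, attr1, obj2, attr2, old1, old2, new1, new2):
    """Emulated DCAS: update both locations only if both hold the old values."""
    with _atomic:
        if getattr(obj1, attr1) is old1 and getattr(obj2, attr2) is old2:
            setattr(obj1, attr1, new1)
            setattr(obj2, attr2, new2)
            return True
        return False


def push_right(dq, new_val):
    while True:
        rh = dq.right_hat
        # Move the hat one node to the right and replace nullR with new_val.
        if dcas(dq, "right_hat", rh, "value", rh, NULL_R, rh.R, new_val):
            return "okay"
        if rh.value is NULL_X:
            return "full"  # simplification: no attempt is made to add nodes
        # Otherwise the hat was moved by another operation; retry.
```

With two spare nodes to the right of the right sentinel, two pushes succeed and the third finds nullX under the hat.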
An illustrative pop_right access operation in accordance with the present invention follows:
To perform a pop_right access operation, a processor first tests for an empty deque (see lines 3-6). Note that checking for empty does not access the other hat, and therefore does not create contention with operations at the other end of the deque. Because changes are possible between the time we read the RightHat and the time we read the L pointer, we use a DCAS (line 7) to verify that these two pointers are, at the same moment, equal to the values individually read. If the deque is non-empty, execution of the pop_right operation uses a DCAS to insert a nullR value in the value field of a node immediately left of the right sentinel (i.e., into rhL→value, where rhL=rh→L) and to move the right sentinel hat onto that node.
There is one instance in this implementation of the deque where access operations at opposing ends of the deque may conflict, namely, if the deque contains just one element and both a pop_right and pop_left are attempted ‘simultaneously’.
Spare Node Maintenance Operations
The push operations described above work smoothly so long as the linked-list includes sufficient spare nodes. However, pushes that exceed pops at a given end of the deque will eventually require addition of nodes to the linked-list. As with access operations, exemplary code and illustrative drawings will provide persons of ordinary skill in the art with detailed understanding of one particular realization of the present invention; however, as will be apparent from the description herein and the breadth of the claims that follow, the invention is not limited thereto. Exemplary right-hand-side code is described with the understanding that left-hand-side operations are symmetric.
An illustrative add_right_nodes operation in accordance with the present invention follows:
A service routine, allocate_right_nodes, is used to allocate storage and initialize a doubly-linked node chain with null values in each value field (nullX in the rightmost one, nullR in the rest). The chain of nodes 801 is terminated by a null right pointer in the rightmost node (see
In line 8 of add_right_nodes, we see the left pointer of the leftmost node of new chain 801 being set to point to the likely tail node of the deque structure. A DCAS in lines 9-10 then attempts to replace the null right pointer of the likely tail node with a link to the new structure and replaces the nullX in its value cell with a nullR. If the DCAS succeeds, the new storage is spliced onto the node chain as illustrated in
While the above-described implementation of a dynamically sized deque illustrates some aspects of some realizations in accordance with the present invention, a variation now described provides certain additional benefits. For example, spare node maintenance facilities are extended to allow removal of excess spare nodes and a possible behavior that results in creation of a “spur” is handled. Related modifications have been made to push and pop access operations and to the set of distinguishing values stored in the value field of a node but which are not pushed onto the deque.
As before, the deque implementation is based on a doubly-linked list representation. Each node contains left and right pointers and a value field, which can store a value pushed onto the deque or store one of several special distinguishing values that are never pushed onto the deque. In addition to the distinguishing values nullL and nullR (hereafter LN and RN), left and right variants LX and RX of a terminal value (previously nullX) and two additional distinguishing values, LY and RY have been defined. As before, the list contains one node for each value in the deque, plus additional nodes that can be used for values in the future, and which are used to synchronize additions to and removals from the list.
Values currently in the deque are stored in consecutive nodes in the list, and there is at least one additional node on each side of the nodes that encode the deque's state. Each node other than those containing values of the deque state is distinguished by one of the special distinguishing values listed above. The node directly to the right (left) of the rightmost (leftmost) value is called the right (left) sentinel. Except in a special case, which is described later, two shared pointers, hereafter RHat and LHat, point to the right and left sentinels, respectively. For an empty state of the deque, left and right sentinels are adjacent. Thus,
As before, the implementation is completely symmetric. Accordingly, we restrict our presentation to the right side with the understanding that left side representations and operations are symmetric. Referring then to
In the illustration of
We begin by describing the operation of “normal” push and pop operations that do not encounter any boundary cases or concurrent operations. Later, we describe special cases for these operations, interaction with concurrent operations, and operations for growing and shrinking the list. As before, exemplary right-hand-side code is described with the understanding that left-hand-side operations are symmetric and the choice of a naming convention, i.e., “right” (and “left”), is arbitrary.
An illustrative implementation of a pushRight access operation follows:
The pushRight access operation turns the current right sentinel into a value node and changes RHat to point to the node to the right of the current right sentinel, thereby making it the new right sentinel. In the illustrated implementation, this objective is achieved by reading RHat to locate the right sentinel (line 3), by determining the next node to the right of the right sentinel (line 4) and then using a DCAS primitive (line 6) to atomically move RHat to that next node and to change the value in the previous right sentinel from the distinguishing value RN to the new value, v.
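The normal path just described may be sketched as follows. The Python below is illustrative only and is not the listing referred to by the line numbers in the text. DCAS is emulated with a lock, `None` stands for a null pointer, and the boundary cases are reduced to returning the hypothetical strings "full" (an RX value under the hat) and "spur" (an RY value under the hat) instead of performing any maintenance. All identifiers are assumptions of the sketch.

```python
import threading

# Distinguishing values that are never pushed onto the deque.
LN, LX, RN, RX, RY = object(), object(), object(), object(), object()
_atomic = threading.Lock()  # stands in for the atomicity of a hardware DCAS


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


class Deque:
    def __init__(self, l_hat, r_hat):
        self.l_hat, self.r_hat = l_hat, r_hat


def make_empty_deque(spares=1):
    nodes = ([Node(LN) for _ in range(spares + 1)] +
             [Node(RN) for _ in range(spares + 1)])
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    nodes[0].value, nodes[-1].value = LX, RX
    return Deque(nodes[spares], nodes[spares + 1])


def dcas(obj1, attr1, obj2, attr2, old1, old2, new1, new2):
    """Emulated DCAS: update both locations only if both hold the old values."""
    with _atomic:
        if getattr(obj1, attr1) is old1 and getattr(obj2, attr2) is old2:
            setattr(obj1, attr1, new1)
            setattr(obj2, attr2, new2)
            return True
        return False


def push_right(dq, v):
    while True:
        rh = dq.r_hat      # locate the right sentinel
        rh_r = rh.R        # candidate for the new right sentinel
        # Atomically move RHat to rh_r and change the old sentinel from RN to v.
        if rh_r is not None and dcas(dq, "r_hat", rh, "value", rh, RN, rh_r, v):
            return "okay"
        if rh.value is RX:
            return "full"  # simplification: spare nodes exhausted, none added
        if rh.value is RY:
            return "spur"  # simplification: the spur is not repaired here
        # Otherwise a concurrent operation changed the state; retry.
```

Returning a status for the boundary cases keeps the sketch short; an implementation that grows the list or repairs a spur would do so at those two points and then retry.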
For example, starting from the deque and list state shown in
First, the DCAS primitive can fail due to the effect of a concurrent operation, in which case, pushRight simply retries. Such a DCAS failure can occur only if another operation (possibly including another pushRight operation) succeeds in changing the deque state during execution of the pushRight operation. Accordingly, lock-freedom is not compromised by retrying. Otherwise, execution of the pushRight operation may fail because it detects that there is no node available to become the new right sentinel (line 5), or because the distinguishing value in the old sentinel is not RN (in which case the DCAS of line 6 will fail). In such a case, it may be that we have exhausted the right spare nodes as illustrated in
An illustrative implementation of a popRight access operation follows:
The popRight access operation locates the rightmost value node of the deque and turns this node into the right sentinel by atomically changing its value to RN and moving the RHat to point to it. For example, a successful popRight access operation operating on the list and deque state shown in
The popRight access operation begins by reading the pointer RHat to locate the right sentinel (line 3), and then reads (at line 4) the left pointer of this node to locate the node containing the rightmost value of the deque. The popRight operation reads the value stored in this node (line 6). It can be shown that the value read can be one of the distinguishing values RN, RX, LY, or RY only in the presence of successful execution of a concurrent operation. Accordingly, the popRight operation retries if it detects any of these values (lines 7-8). However, if the popRight operation read either a left null or left terminating value (i.e., either LN or LX), then either the deque is empty (i.e., there are no values between the two sentinels) or the popRight operation read values that did not exist simultaneously due to execution of a concurrent operation.
To disambiguate, the popRight access operation uses a DCAS primitive (at line 10) to check whether the values read from RHat and the value field of the rightmost value node exist simultaneously in the list representation. Note that the last two arguments to the DCAS are the same as the second two, so the DCAS does not change any values. Instead, the DCAS checks that the values are the same as those read previously. If so, the popRight operation returns “empty” at line 11. Otherwise, the popRight operation retries. Finally, if the popRight operation finds a value other than a distinguishing value in the node to the left of the right sentinel, then it uses a DCAS primitive (at line 12) to attempt to atomically change this value to RN (thereby making the node that stored the value to be popped available for subsequent pushRight operations) and to move RHat to point to the popped value node. If the DCAS succeeds in atomically removing the rightmost value and making the node that stored it become the new right sentinel, the value can be returned (line 13). Otherwise, the popRight operation retries.
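The pop sequence described above, including the use of a DCAS that changes nothing in order to confirm an empty deque, may be sketched as follows. This Python is illustrative only and is not the listing referred to by the line numbers in the text. DCAS is emulated with a lock, and the `EMPTY` marker and all other identifiers are assumptions of the sketch.

```python
import threading

LN, LX, LY, RN, RX, RY = (object() for _ in range(6))  # distinguishing values
EMPTY = object()            # hypothetical marker returned for an empty deque
_atomic = threading.Lock()  # stands in for the atomicity of a hardware DCAS


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


class Deque:
    def __init__(self, l_hat, r_hat):
        self.l_hat, self.r_hat = l_hat, r_hat


def make_empty_deque(spares=1):
    nodes = ([Node(LN) for _ in range(spares + 1)] +
             [Node(RN) for _ in range(spares + 1)])
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    nodes[0].value, nodes[-1].value = LX, RX
    return Deque(nodes[spares], nodes[spares + 1])


def dcas(obj1, attr1, obj2, attr2, old1, old2, new1, new2):
    """Emulated DCAS: update both locations only if both hold the old values."""
    with _atomic:
        if getattr(obj1, attr1) is old1 and getattr(obj2, attr2) is old2:
            setattr(obj1, attr1, new1)
            setattr(obj2, attr2, new2)
            return True
        return False


def pop_right(dq):
    while True:
        rh = dq.r_hat            # locate the right sentinel
        rhl = rh.L               # node expected to hold the rightmost value
        if rhl is None:
            continue             # stale view of the list; retry
        v = rhl.value
        if v is RN or v is RX or v is LY or v is RY:
            continue             # only possible after a concurrent change; retry
        if v is LN or v is LX:
            # New values equal old values: this DCAS only checks that both
            # reads were valid at the same instant.
            if dcas(dq, "r_hat", rhl, "value", rh, v, rh, v):
                return EMPTY
            continue
        # Ordinary value: replace it with RN and move RHat onto its node.
        if dcas(dq, "r_hat", rhl, "value", rh, v, rhl, RN):
            return v
```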
As before, if the deque state includes two or more values, symmetric left and right variants of the above-described pop operation execute independently. This independence is an advantage over some DCAS-based deque implementations, which do not allow left and right operations to execute concurrently without interfering with one another. However, when there are zero or one values in the deque, concurrently executed popLeft and popRight access operations do interact. For example, if popLeft and popRight access operations operate concurrently on a deque state such as that illustrated in
Spare Node Maintenance Operations
As before, the push operations described above work smoothly so long as the linked-list includes sufficient spare nodes. However, pushes that exceed pops at a given end of the deque will eventually require addition of nodes to the linked-list. In addition, removal of some of the unused nodes that are beyond the sentinels (e.g., to the right of the right sentinel) may be desirable at a deque end that has accumulated an excessive number of spare nodes resulting from pops that exceed pushes.
We next describe an implementation of an add_right operation that can be used to add spare nodes to the right of the linked list for use by subsequently executed pushRight access operations. One suitable implementation is as follows:
The add_right operation can be called directly if desired. However, as illustrated above, the add_right operation is called by the pushRight access operation if execution thereof determines that there are no more right null nodes available (e.g., based on observation of a terminating RX value in the right sentinel at line 9 of the illustrated pushRight implementation). In the illustrated implementation, the add_right operation takes an argument that indicates the number of nodes to be added. In the illustrated implementation, add_right begins by calling alloc_right to construct a doubly-linked chain of the desired length. Any of a variety of implementations are suitable and one such suitable implementation follows:
where we have assumed a constructor that initializes a new node with a value passed thereto. Accordingly, the rightmost node of a newly allocated chain encodes an RX value, and all others encode an RN value. Implementation of an alloc_right operation is straightforward because no other process can concurrently access the chain as it is constructed from newly-allocated nodes.
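A chain with the properties stated above may be built as in the following sketch. The Python is illustrative and makes several assumptions: the function returns the leftmost node of the new chain, `None` stands for a null pointer, and the names `Node`, `RN`, `RX` and `alloc_right` as written here are introduced for the sketch only.

```python
RN, RX = object(), object()  # right null and right terminating values


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


def alloc_right(n):
    """Return the leftmost node of a new doubly-linked chain of n nodes."""
    if n < 1:
        raise ValueError("chain must contain at least one node")
    leftmost = Node(RX)          # built right to left: the terminator first
    for _ in range(n - 1):
        new = Node(RN)
        new.R = leftmost         # link the new node in front of the chain
        leftmost.L = new
        leftmost = new
    return leftmost
```

No synchronization is needed here because the chain is not yet reachable from the shared list.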
Next, the add_right operation attempts to splice the new chain onto the existing list. For example, given the list and deque state illustrated by
In preparation for a splice, the add_right operation first traverses the right referencing chain of the list from the right sentinel, past the nodes encoding a right null distinguishing value RN (lines 5-7), looking for an RX terminating value (line 10). When the executing add_right operation finds the RX terminating value, it attempts to splice the new chain onto the existing list, as described above, by using a DCAS primitive (line 13). In preparation, the add_right operation first sets the left pointer of the leftmost node of its new chain to point back to the previously found node with an RX terminating value (line 11) so that, if the DCAS succeeds, the doubly-linked list will be complete, and then reads (line 12) the current right pointer, rrptr, for use in the DCAS.
Because of the possibility of concurrent operations, traversal of the right referencing chain may encounter any value, e.g., a deque value or one of the other distinguishing values, before finding a node containing the RX terminating value. In most such cases, the add_right operation simply repeats its search after re-reading the RHat value (at line 5). As usual, a retry does not compromise lock-freedom because a concurrent operation that altered the list state must have succeeded. However, a special case can arise even in the absence of concurrent operations. This case is handled at lines 8-9 and is explained below following discussion of a remove_right operation.
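The traversal and splice described above may be sketched as follows. The Python is illustrative only and is not the listing referred to by the line numbers in the text. DCAS is emulated with a lock, and the DCAS is assumed to operate on the right pointer and value field of the terminating node. The special case involving an RY value is not handled; the sketch simply returns False when it meets one. All identifiers are assumptions.

```python
import threading

LN, LX, RN, RX, RY = (object() for _ in range(5))  # distinguishing values
_atomic = threading.Lock()  # stands in for the atomicity of a hardware DCAS


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


class Deque:
    def __init__(self, l_hat, r_hat):
        self.l_hat, self.r_hat = l_hat, r_hat


def make_empty_deque(spares=1):
    nodes = ([Node(LN) for _ in range(spares + 1)] +
             [Node(RN) for _ in range(spares + 1)])
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    nodes[0].value, nodes[-1].value = LX, RX
    return Deque(nodes[spares], nodes[spares + 1])


def dcas(obj1, attr1, obj2, attr2, old1, old2, new1, new2):
    with _atomic:
        if getattr(obj1, attr1) is old1 and getattr(obj2, attr2) is old2:
            setattr(obj1, attr1, new1)
            setattr(obj2, attr2, new2)
            return True
        return False


def alloc_right(n):
    """New chain of n >= 1 nodes: RN everywhere except RX in the rightmost."""
    leftmost = Node(RX)
    for _ in range(n - 1):
        new = Node(RN)
        new.R, leftmost.L = leftmost, new
        leftmost = new
    return leftmost


def add_right(dq, n):
    new_chain = alloc_right(n)
    while True:
        node = dq.r_hat
        while node is not None and node.value is RN:   # skip right null nodes
            node = node.R
        if node is not None and node.value is RY:
            return False        # simplification: spur case not handled here
        if node is None or node.value is not RX:
            continue            # interference; search again from RHat
        new_chain.L = node      # back link first, so a successful splice is complete
        rrptr = node.R          # current right pointer of the terminator
        # Splice: terminator's right pointer -> new chain, and RX -> RN.
        if dcas(node, "R", node, "value", rrptr, RX, new_chain, RN):
            return True
```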
Some realizations may also include an operation to remove excess spare nodes. In the implementation described below, a remove_right operation is used to remove all but a specified number of the spare right nodes. Such a remove_right operation can be invoked with a number that indicates the maximum number of spare nodes that should remain on the right of the list. If such a remove_right operation fails to chop off part of the list due to concurrent operations, it may be that the decision to chop off the additional nodes was premature. Therefore, rather than insisting that a remove_right implementation retry until it is successful in ensuring there are no more spare nodes than specified, we allow it to return false in the case that it encounters concurrent operations. Such an implementation leaves the decision of whether to retry the remove_right operation to the user. In fact, decisions regarding when to invoke the remove operations, and how many nodes are to remain, may also be left to the user.
In general, storage removal strategies are implementation-dependent. For example, in some implementations it may be desirable to link the need to add storage at one end to an attempt to remove some (for possible reuse) from the other end. Determination that excessive spare nodes lie beyond a sentinel can be made with a counter of pushes and pops from each end. In some realizations, a probabilistically accurate (though not necessarily precise) counter may be employed. In other realizations, a synchronization primitive such as a CAS can be used to ensure a precise count. Alternatively excess pops may be counted by noting the relevant sentinel crossing successive pseudo-boundaries in the link chain. A special node or marker can be linked in to indicate such boundaries, but such an approach typically complicates implementation of the other operations.
Whatever the particular removal strategy, the remove_right operation implementation that follows is illustrative.
We begin by discussing a straightforward (and incorrect) approach to removing spare nodes, and then explain how this approach fails and how the implementation above addresses the failure. In such a straightforward approach, execution of a remove_right operation (such as illustrated above) traverses the right referencing chain of the list beginning at the right sentinel, counting the null nodes that will not be removed, as specified by the argument, n (see lines 2-7). If the traversal reaches the end of the list (at line 7) or a node containing the terminating value RX (see line 4) before counting the specified number of right null nodes, then the remove_right operation returns true, indicating that no excess nodes needed to be excised. Otherwise, the traversal reaches a chop point node that contains the distinguishing right null value RN.
A straightforward approach is to simply use a DCAS primitive to change the right pointer of this chop point node to null, thereby making nodes to the right of the chop point available for garbage collection, and to change the value in the chop point node from RN to RX, thereby preserving the invariant that an RX terminator exists. However, careful examination of the resulting algorithm reveals a problem, illustrated by the following scenario. For purposes of illustration, use in the drawings of a special distinguishing value, RY, (which turns out to be part of a solution) should be ignored.
Consider a pushRight (E) access operation that runs alone from the list and deque state illustrated in
Suppose now that remove_right (0) is invoked (using the straightforward, but incorrect approach) and that it runs alone to completion, resulting in the state shown in
Our approach to dealing with this problem is not to avoid it, but rather to modify our algorithm so that we can detect and correct it. We separate the removal of nodes into two steps. In the first step, in addition to marking the node that will become the new right terminator with the terminating value RX, we also mark its successor with the special distinguishing value RY.
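A two-step removal of this kind may be sketched as follows. The Python is a hedged illustration, not the remove_right listing referred to elsewhere in the text. It assumes that the first step is a DCAS on the value fields of the chop point and its successor, that the second step clears the right pointer of the chop point with a CAS, and that a successor holding RX may be marked RY in the same way as one holding RN. DCAS and CAS are emulated with a lock, cycle breaking is omitted, and all identifiers are hypothetical.

```python
import threading

LN, LX, RN, RX, RY = (object() for _ in range(5))  # distinguishing values
_atomic = threading.Lock()  # stands in for hardware CAS/DCAS atomicity


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


class Deque:
    def __init__(self, l_hat, r_hat):
        self.l_hat, self.r_hat = l_hat, r_hat


def make_empty_deque(spares=1):
    nodes = ([Node(LN) for _ in range(spares + 1)] +
             [Node(RN) for _ in range(spares + 1)])
    for a, b in zip(nodes, nodes[1:]):
        a.R, b.L = b, a
    nodes[0].value, nodes[-1].value = LX, RX
    return Deque(nodes[spares], nodes[spares + 1])


def cas(obj, attr, old, new):
    with _atomic:
        if getattr(obj, attr) is old:
            setattr(obj, attr, new)
            return True
        return False


def dcas(obj1, attr1, obj2, attr2, old1, old2, new1, new2):
    with _atomic:
        if getattr(obj1, attr1) is old1 and getattr(obj2, attr2) is old2:
            setattr(obj1, attr1, new1)
            setattr(obj2, attr2, new2)
            return True
        return False


def remove_right(dq, n):
    """Try to leave at most n spare nodes to the right of the right sentinel."""
    chop = dq.r_hat
    for _ in range(n):                 # count the nodes that are to remain
        if chop.value is RX:
            return True                # terminator reached: nothing in excess
        chop = chop.R
        if chop is None:
            return True
    rptr = chop.R
    if rptr is None:
        return True
    succ = rptr.value
    if succ is not RN and succ is not RX:
        return False                   # unexpected value: concurrent activity
    # Step 1: chop point RN -> RX and, atomically, its successor -> RY.
    if not dcas(chop, "value", rptr, "value", RN, succ, RX, RY):
        return False
    # Step 2 (assumed form): detach the marked chain from the list.
    cas(chop, "R", rptr, None)
    return True
```

After the first step, a push that later reaches the old chain finds RY rather than RN under the hat, so its DCAS cannot succeed there.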
By employing the distinguishing value RY, an implementation prevents further pushes from proceeding down the old chain. In particular, consider the case of the (i) previously described pushRight (E) access operation that runs alone from the list and deque state illustrated in
The implementation of the pushRight access operation further allows processes that are attempting to push values to detect that RHat has gone onto a spur (e.g., as illustrated in
The unspur_right operation verifies that RHat still points to a node labeled with the distinguishing value RY (lines 1-2), follows (line 3) the still-existing pointer from the spur back to the list (e.g., from node 1313 to node 1312, in
The break_cycles_right operation, which is invoked at line 13 of remove_right and described below, is optional (and may be omitted) in implementations for execution environments that provide a facility, such as garbage collection, for automatic reclamation of storage.
Explicit Reclamation of Storage
While the above description has focused on implementations for execution environments that provide intrinsic support for automatic reclamation of storage, or garbage collection, some implementations in accordance with the present invention support explicit reclamation. This is important for several reasons. First, many common programming environments do not support garbage collection. Second, almost all of those that do provide garbage collection introduce excessive levels of synchronization overhead, such as locking and/or stop-the-world collection mechanisms. Accordingly, the scaling of such implementations is questionable. Finally, designs and implementations that depend on existence of a garbage collector cannot be used in the implementation of the garbage collector itself.
It has been discovered that a variation on the above-described techniques may be employed to provide explicit reclamation of nodes as they are severed from the deque as a result of remove operations. The variation builds on a lock-free reference counting technique described in greater detail in U.S. patent application Ser. No. 09/837,671, entitled “LOCK FREE REFERENCE COUNTING,” naming David L. Detlefs, Paul A. Martin, Mark S. Moir, and Guy L. Steele Jr. as inventors, and filed on Apr. 18, 2001, which is incorporated herein in its entirety by reference.
By applying the lock-free reference counting technique, a deque implementation has been developed that provides explicit reclamation of storage. As before, the deque is represented as a doubly-linked list of nodes, but includes an additional facility that ensures that nodes removed from the list (e.g., by operation of a remove_right_nodes operation) are free of cyclic referencing chains.
As described so far, our deque implementation allows cycles in garbage, because the chains cut off the list by a remove_right_nodes operation are doubly linked. Therefore, in preparation for applying the LFRC methodology, we modified our implementation so that cycles are broken in chains that are cut off from the list. This is achieved (on the right side) by the break_cycles_right operation, which is invoked after successfully performing the DCAS at line 11 of the remove_right operation. The following implementation of a break_cycles_right operation is illustrative.
The approach employed is straightforward. We simply walk down the referencing chain, setting the forward pointers (e.g., right pointers in the case of break_cycles_right) to null. However, there are some subtleties involved with breaking these cycles. In particular, we need to deal with the possibility of concurrent accesses to the broken off chain while we are breaking it, because some processes may have been accessing it before it was cut off. Concurrent pop and push operations pose no problem. Their DCASs will not succeed when trying to push onto or pop from a cut-off node because the relevant hat (RHat or LHat) no longer points to it. Also, these operations check for null pointers and take appropriate action, so there is no risk that setting the forward pointers to null will cause concurrent push and pop operations to de-reference a null pointer. However, dealing with concurrent remove and add operations is more challenging, as it is possible for both types of operation to modify the chain while we are attempting to break it up.
First, simply walking down the list, setting forward pointers to null (presumably using the detection of a null forward pointer as a termination condition), does nothing to prevent another process from adding a new chain onto any node in the chain that contains a terminating RX value. If this happens, the cycles in this newly added chain will never be broken, resulting in a memory leak. Second, a simplistic approach can result in multiple processes concurrently breaking links in the same chain in the case that one process executing remove_right succeeds at line 11 in chopping off some nodes within an already chopped-off chain. This results in unnecessary work and more difficulty in reasoning about correctness.
In the illustrated break_cycles_right implementation, we address both of these problems with one technique. Before setting the forward pointer of a node to null (at line 8), we first use a CAS primitive to set the next node's value field to the distinguishing value RY (lines 5-7). The reason for using a CAS instead of an ordinary store is that we can determine the value overwritten when storing the RY value. If the CAS changes a terminating value RX to an RY value, then the loop terminates (see line 4). It is safe to terminate in this case because either the terminating value RX was in the rightmost node of the chain (and changing the RX to an RY prevents a new chain from being added subsequently), or some process executing a remove_right operation set the value of this node to the terminating value RX (see line 11, remove_right), in which case that process has the responsibility to break the cycles in the remainder of the chain.
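The walk described above may be sketched as follows. The Python is illustrative only and is not the listing referred to by the line numbers in the text. CAS is emulated with a lock, the explicit check for a null forward pointer is an addition of the sketch, and all identifiers are assumptions.

```python
import threading

RN, RX, RY = object(), object(), object()  # right-side distinguishing values
_atomic = threading.Lock()  # stands in for the atomicity of a hardware CAS


class Node:
    def __init__(self, value):
        self.L, self.R, self.value = None, None, value


def cas(obj, attr, old, new):
    """Emulated CAS: store new only if the location still holds old."""
    with _atomic:
        if getattr(obj, attr) is old:
            setattr(obj, attr, new)
            return True
        return False


def break_cycles_right(p):
    """Null the forward (right) pointers of a chain that has been cut off."""
    while True:
        q = p.R
        if q is None:
            return                      # nothing further to unlink
        while True:                     # mark the next node RY, noting old value
            v = q.value
            if cas(q, "value", v, RY):
                break
        p.R = None                      # now safe to break the forward link
        if v is RX:
            return                      # chain end, or another remover's chop point
        p = q
```

Because the marking uses a CAS, the value that was overwritten is known, which is what allows the loop to stop exactly where an RX value was replaced.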
Since the break_cycles_right implementation ensures that referencing cycles are broken in chopped node chains, the implementation described is amenable to transformation to a GC-independent form using the lock-free reference counting (LFRC) methodology described in detail in the above-incorporated U.S. patent application. However, to summarize, (1) we added a reference count field rc to the node object, (2) we implemented an LFRCDestroy (v) function, (3) we ensured (using the break_cycles_right implementation) that the implementation does not result in referencing cycles in or among garbage objects, (4, 5) we replaced accesses and manipulations of pointer variables with corresponding LFRC pointer operations and (6) we ensured that local pointer variables are initialized to NULL before being used with any of the LFRC operations and are properly destroyed using LFRCDestroy upon return (or when such local pointer variables otherwise go out of scope). LFRC pointer operations employed include LFRCLoad, LFRCStore, LFRCCopy, LFRCPass, LFRCStoreAlloc, LFRCDCAS, LFRCDCAS1 and LFRCDestroy. An illustrative implementation of each is described in detail in the above-incorporated U.S. patent application.
The illustrative object definitions that follow, including constructor and destructor methods provide the reference counts and ensure proper initialization of a deque and reclamation thereof.
wherein the notation LFRCDestroy (p, q) is shorthand for invocation of the LFRCDestroy operation on each of the listed operands.
Corresponding pushRight and popRight access operations follow naturally from the above-described GC-dependent implementations thereof. Initialization of local pointer values, replacement of pointer operations and destruction of local variables as they go out of scope are all straightforward. The following transformed pushRight and popRight access operation implementations are illustrative.
Corresponding spare node maintenance operations also follow naturally from the above-described implementations thereof. As before, initialization of local pointer values, replacement of pointer operations and destruction of local variables as they go out of scope are all straightforward. The following transformed add_right_nodes and allocate_right_nodes operation implementations are illustrative.
wherein the LFRCDCAS1 pointer operation provides LFRC pointer operation support only for a first addressed location. Because the second addressed location is a literal value, the LFRCDCAS1 operation is employed rather than a LFRCDCAS. The LFRCCopyAlloc pointer operation, like the LFRCStoreAlloc pointer operation described in the above-incorporated U.S. patent application, is a variant that forgoes certain reference count manipulations for a newly allocated node.
The transformed remove_right_nodes operation implementation that follows is also illustrative.
wherein the DCAS primitive at line 22 operates on literals, rather than pointers. Accordingly, replacement with an LFRC pointer operation is not implicated.
Finally, transformed versions of the previously described unspur_right and break_cycles_right operations are as follows:
where, as before, the CAS primitive at line 20 operates on a literal, rather than a pointer value. Accordingly, replacement with an LFRC pointer operation is not implicated.
While the invention has been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention is not limited to them.
Many variations, modifications, additions, and improvements are possible. For example, while various full-function deque realizations have been described in detail, realizations of other shared object data structures, including realizations that forgo some of the access operations, e.g., for use as a FIFO, queue, LIFO, stack or hybrid structure, will also be appreciated by persons of ordinary skill in the art. In addition, more complex shared object structures may be defined that exploit the techniques described herein. Other synchronization primitives may be employed and a variety of distinguishing values may be employed. In general, the particular data structures, synchronization primitives and distinguishing values employed are implementation specific and, based on the description herein, persons of ordinary skill in the art will appreciate suitable selections for a given implementation.
Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of claims that follow. Structures and functionality presented as discrete components in the exemplary configurations may be implemented as a combined structure or component. These and other variations, modifications, additions, and improvements may fall within the scope of the invention as defined in the claims that follow.