|Publication number||US20050071841 A1|
|Application number||US 10/676,581|
|Publication date||Mar 31, 2005|
|Filing date||Sep 30, 2003|
|Priority date||Sep 30, 2003|
|Also published as||CN1853166A, CN100578453C, DE602004026750D1, EP1668500A1, EP1668500B1, US7398521, US20050081207, WO2005033936A1|
|Publication number||10676581, 676581, US 2005/0071841 A1, US 2005/071841 A1, US 20050071841 A1, US 20050071841A1, US 2005071841 A1, US 2005071841A1, US-A1-20050071841, US-A1-2005071841, US2005/0071841A1, US2005/071841A1, US20050071841 A1, US20050071841A1, US2005071841 A1, US2005071841A1|
|Inventors||Gerolf Hoflehner, Shih-Wei Liao, Xinmin Tian, Hong Wang, Daniel Lavery, Perry Wang, Dongkeun Kim, Milind Girkar, John Shen|
|Original Assignee||Hoflehner Gerolf F., Shih-Wei Liao, Xinmin Tian, Hong Wang, Lavery Daniel M., Perry Wang, Dongkeun Kim, Milind Girkar, Shen John P.|
Embodiments of the invention relate to information processing systems and, more specifically, to thread management for multi-threading.
Memory latency has become the critical bottleneck to achieving high performance on modern processors. Many large applications today are memory intensive, because their memory access patterns are difficult to predict and their working sets are becoming quite large. Despite continued advances in cache design and new developments in prefetching techniques, the memory bottleneck problem still persists. This problem worsens when executing pointer-intensive applications, which tend to defy conventional stride-based prefetching techniques.
One solution is to overlap memory stalls in one program with the execution of useful instructions from another program, thus effectively improving system performance in terms of overall throughput. Improving throughput of multitasking workloads on a single processor has been the primary motivation behind the emerging simultaneous multithreading (SMT) techniques. An SMT processor can issue instructions from multiple hardware contexts, or logical processors (also referred to as hardware threads), to the functional units of a super-scalar processor in the same cycle. SMT achieves higher overall throughput by increasing overall instruction-level parallelism available to the architecture via the exploitation of the natural parallelism between independent threads during each cycle.
SMT can also improve the performance of applications that are multithreaded. However, SMT does not directly improve the performance, in terms of reducing latency, of single-threaded applications. Since the majority of desktop applications in the traditional PC environment are still single-threaded, it is important to investigate if and how SMT resources can be exploited to enhance single-threaded code performance by reducing its latency. In addition, current compilers typically cannot automatically allocate resources for the threads they create.
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
Methods and apparatuses for compiler-creating helper threads for multi-threading systems are described. According to one embodiment, a compiler, also referred to as AutoHelper, implements thread-based prefetching helper threads on a multi-threading system, such as, for example, the Intel Pentium™ 4 Hyper-Threading systems, available from Intel Corporation. In one embodiment, the compiler automates the generation of helper threads for Hyper-Threading processors. The techniques focus on identifying and generating helper threads of minimal sizes that can be executed to achieve timely and effective data prefetching, while incurring minimal communication overhead. A runtime system is also implemented to efficiently manage the helper threads and the synchronization between threads. Consequently, helper threads are able to issue timely prefetches for the sequential pointer-intensive applications.
In addition, hardware resources such as register contexts may be managed for helper threads within a compiler. Specifically, the register set may be statically or dynamically partitioned between the main thread and helper threads, and between multiple helper threads. As a result, the live-in/live-out register copies via memory for threads may be avoided, and the threads may be destroyed at compile-time, when the compiler runs out of resources, or at runtime when infrequent cases of certain main thread events occur.
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar data processing device, that manipulates and transforms data represented as physical (e.g. electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Embodiments of the present invention also relate to apparatuses for performing the operations described herein. An apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs) such as Dynamic RAM (DRAM), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic instructions, with each of the above storage components coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the methods. The structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments of the invention as described herein.
A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (“ROM”); random access memory (“RAM”); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
Note that while
As shown in
According to one embodiment, processor 103 may include one or more logical hardware contexts, also referred to as logical processors, for handling multiple threads simultaneously, including a main thread, also referred to as a non-speculative thread, and one or more helper threads, also referred to as speculative threads, of an application. Processor 103 may be a Hyper-Threading processor, such as a Pentium 4 or a Xeon processor from Intel Corporation, capable of performing multithreading processes. During an execution of an application, the main thread and one or more helper threads are executed in parallel. The helper threads are speculatively executed in association with, but somewhat independently of, the main thread to perform some precomputations, such as speculative prefetches of addresses or data, for the main thread to reduce the memory latency incurred by the main thread.
According to one embodiment, the code of the helper threads (e.g., the source code and the binary executable code) is generated by a compiler, such as the AutoHelper compiler available from Intel Corporation, and is loaded and executed in a memory, such as volatile RAM 105, by an operating system (OS) executed by a processor, such as processor 103. The operating system running within the exemplary system 100 may be a Windows operating system from Microsoft Corporation or a Mac OS from Apple Computer. Alternatively, the operating system may be a Linux or Unix operating system. Other operating systems, such as embedded real-time operating systems, may be utilized.
Current Hyper-Threading processors typically provide two hardware contexts, or logical processors. To improve the performance of a single-threaded application, Hyper-Threading technology can utilize its second context to perform prefetching for the main thread. Having a separate context allows the helper threads' execution to be decoupled from the control flow of the main thread, unlike software prefetching. By running far ahead of the main thread to perform long-range prefetches, the helper threads can trigger prefetches early, and eliminate or reduce the cache miss penalties experienced by the main thread.
With AutoHelper, a compiler is able to automatically generate prefetching helper threads for Hyper-Threading machines. The helper threads aim at bringing the latency-hiding benefit of multithreading to sequential workloads. Unlike threads produced by the conventional parallelizing compilers, the helper threads only prefetch for the main thread, which does not reuse the computed results from the helper threads. According to one embodiment, the program correctness is still maintained by the main thread's execution, while the helper threads do not affect program correctness and are used solely for performance improvement. This attribute permits the use of more aggressive forms of optimization in generating helper threads. For example, when the main thread does not need help, certain optimizations may be performed, which are not possible with the conventional throughput threading paradigm.
In one embodiment, if it is predicted that a helper is not needed for a certain period of time, the helper may terminate and release all the resources associated with the helper to the main thread. According to another embodiment, if it is predicted that a helper may be needed shortly, the helper may be in a pause mode, which still consumes some resources on Hyper-Threading hardware. Exponential back-off (via halting) will be invoked if the helper stays in the pause mode too long (e.g., exceeding a programmable timeout period). According to a further embodiment, if the compiler cannot predict when the helper thread will be needed, the helper may be in a snooze mode and may relinquish the occupied processor resources to the main thread.
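The three policies above can be read as a decision on how soon the helper is predicted to be needed again. The following is a minimal sketch under stated assumptions: the function names, the `short_idle_threshold` parameter, and the doubling schedule for the back-off are hypothetical and are not specified in this description.

```python
from typing import List, Optional


def choose_helper_mode(predicted_idle_cycles: Optional[int],
                       short_idle_threshold: int = 1_000) -> str:
    """Pick a helper state from a (hypothetical) prediction of idle time.

    None means the compiler cannot predict when the helper is needed next.
    """
    if predicted_idle_cycles is None:
        return "snooze"      # relinquish processor resources to the main thread
    if predicted_idle_cycles <= short_idle_threshold:
        return "pause"       # needed shortly: stay resident, keep some resources
    return "terminate"       # not needed for a while: release all resources


def backoff_halts(base_cycles: int, rounds: int) -> List[int]:
    """Halt durations that double each round (assumed back-off schedule)."""
    return [base_cycles * (2 ** i) for i in range(rounds)]
```

Under this reading, a helper that overstays the pause timeout would halt for progressively longer intervals, for example `backoff_halts(100, 4)`.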
Furthermore, according to one embodiment, performance monitoring and on-the-fly adjustments are made possible under the helper-threading paradigm, because the helper thread does not contribute to the semantics of the main program. When a main thread needs a helper, it will wake up the helper thread. For example, with respect to a run-away helper or a run-behind thread, one of the processes described above may be invoked to adjust the run-away helper thread.
For at least one embodiment, the front end 221 includes a fetch/decode unit 222 that includes logically independent sequencers 220 for each of a plurality of thread contexts. The logically independent sequencer(s) 220 may include marking logic 280 to mark the instruction information for speculative threads as being “speculative.” One skilled in the art will recognize that, for an embodiment implemented in a multiple processor multithreading environment, only one sequencer 220 may be included in the fetch/decode unit 222.
As used herein, the term “instruction information” is meant to refer to instructions that can be understood and executed by the execution core 230. Instruction information may be stored in a cache 225. The cache 225 may be implemented as an execution instruction cache or an execution trace cache. For embodiments that utilize an execution instruction cache, “instruction information” includes instructions that have been fetched from an instruction cache and decoded. For embodiments that utilize a trace cache, the term “instruction information” includes traces of decoded micro-operations. For embodiments that utilize neither an execution instruction cache nor a trace cache, “instruction information” also includes raw bytes for instructions that may be stored in an instruction cache such as I cache 244.
Memory system 302 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry. Memory system 302 may store instructions 310 and/or data 312 represented by data signals that may be executed by processor 304. The instructions 310 and/or data 312 may include code for performing any or all of the techniques discussed herein.
Specifically, compiler 308 may include a delinquent load identifier 320 that, when executed by the processor 304, identifies one or more delinquent load regions of a main thread. The compiler 308 may also include a parallelization analyzer 324 that, when executed by the processor 304, performs one or more parallelization analyses for the helper threads. Also, the compiler 308 may include a slicer 322 that identifies one or more slices to be executed by a helper thread in order to perform speculative precomputation. The compiler 308 may further include a code generator 328 that, when executed by the processor 304, generates the code (e.g., source and executable code) for the helper threads.
Executing helper threads in an SMT machine is a form of asymmetric multithreading, as shown in
At block 504, at least a portion of the code within the region of the main thread is executed using in part the data (e.g., prefetched or precomputed) provided by the one or more helper threads. According to one embodiment, the results computed by a helper thread are not integrated into the main thread. The benefit of a helper thread lies in its side effects of prefetching, not in reusing its computation results. This allows the compiler to aggressively optimize the code generation for helper threads. The main thread handles the correctness issue, while the helper threads target the performance of a program. This also allows the helper thread invoking statement, such as invoke_helper, to drop requests whenever deemed appropriate. Finally, non-faulting instructions, such as the prefetch instructions, may be used to avoid disruptions to the main thread if exceptions are signaled in a helper thread.
At block 505, the one or more helper threads associated with the main thread are terminated (via a function call, such as finish_helper) when the main thread is about to exit the delinquent load region and the resources, such as logical thread contexts, associated with the terminated helper threads are released back to the thread pool. This enables future requests to immediately recycle the logical thread contexts from the thread pool. Other operations apparent to those with ordinary skill in the art may be included.
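The recycling of logical thread contexts and the freedom to drop helper requests can be illustrated with a small model. This is a sketch only: `invoke_helper` and `finish_helper` are the call names used in this description, while the `ContextPool` class and its internals are assumptions made for illustration.

```python
from typing import Optional


class ContextPool:
    """Hypothetical pool of logical thread contexts available to helper threads."""

    def __init__(self, num_contexts: int) -> None:
        self._free = list(range(num_contexts))
        self._busy = set()

    def invoke_helper(self) -> Optional[int]:
        """Return a context id, or None when the request is dropped."""
        if not self._free:
            # Dropping is safe: helper threads never affect program correctness.
            return None
        ctx = self._free.pop()
        self._busy.add(ctx)
        return ctx

    def finish_helper(self, ctx: int) -> None:
        """Release a context so that a later request can recycle it at once."""
        self._busy.remove(ctx)
        self._free.append(ctx)
```

With a single spare context, a second `invoke_helper` call returns `None` until `finish_helper` has released the first context.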
Hyper-Threading technology is well suited for supporting the execution of one or more helper threads. According to one embodiment, in each processor cycle, instructions from either of the logical processors can be scheduled and executed simultaneously on shared execution resources. This allows helper threads to issue timely prefetches. In addition, the entire on-chip cache hierarchy is shared between the logical processors, which is useful for helper threads to effectively prefetch for the main thread at all levels of the cache hierarchy. Furthermore, although the physical execution resources are shared between the logical processors, the architecture state is duplicated in a Hyper-Threading processor. The execution of helper threads will not alter the architecture state in the logical processor executing the main thread.
However, on Hyper-Threading technology enabled machines, helper threads can still impact the execution of main thread due to the writes to memory. Because helper threads share memory with the main thread, the execution of helper threads should be guaranteed not to write to the data structures of the main thread. In one embodiment, the compiler (e.g., AutoHelper) provides memory protection between the main thread and the helper threads. The compiler removes stores to non-local variables in the helper threads.
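The removal of stores to non-local variables can be pictured as a filter over the helper thread's statements. The sketch below assumes a deliberately simplified statement representation (`Stmt` with a `kind` and a `target`); it is not the compiler's actual intermediate form.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Stmt:
    kind: str      # "load", "store", or "compute" (assumed categories)
    target: str    # variable written by a store, or defined otherwise


def strip_nonlocal_stores(helper_body: List[Stmt],
                          helper_locals: Set[str]) -> List[Stmt]:
    """Keep every statement except stores whose target is not helper-local."""
    return [s for s in helper_body
            if not (s.kind == "store" and s.target not in helper_locals)]
```

Stores to the helper's own temporaries survive, so address computation is unaffected, while writes to the main thread's data structures disappear from the helper code.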
Unlike a conventional approach, AutoHelper (e.g., the compiler) eliminates the profile-instrumentation pass to make the tool easier to use. According to one embodiment, the compiler can directly analyze the output from profiling results, such as those generated by Intel's VTune™ Performance Analyzer, which is enabled for Hyper-Threading technology. Because it is a middle-end pass instead of a post-pass tool, the compiler is able to utilize several product-quality analyses, such as array dependence analysis and global scalar optimization, etc. These analyses, invoked after the compiler, perform aggressive optimizations on the helper threads' code.
According to one embodiment, the compiler generates one or more helper threads to precompute and prefetch the address accessed by a load that misses the cache frequently, also referred to as a delinquent load. The compiler also generates one or more triggers in the main thread that spawn one or more helper threads. The compiler implements the trigger as an invoking function, such as the invoke_helper function call. Once the trigger is reached, the load is expected to appear later in the instruction stream of the main thread, hence the speculatively executed helper threads can reduce the number of cache misses in the main thread.
According to one embodiment, the compiler identifies the most delinquent loads in an application source code using one or more run-time profiles. Traditional compilers collect the profiles in two steps: profile-instrumentation and profile-generation. However, because cache miss is not an architecture feature that is exposed to the compilers, profile-instrumentation pass does not permit instrumentation of cache misses for the compiler to identify delinquent loads. The profiles for each cache hierarchy are collected via a utility, such as the VTune™ Analyzer from Intel Corporation. In one embodiment, the application may be executed with debugging information in a separate profiling run prior to the compiler. During the profiling run, cache misses are sampled and the hardware counters are accumulated for each static load in the application.
The compiler identifies the candidates for thread-based prefetching. In a particular embodiment, the VTune™ summarizes the cache behavior on a per-load basis. Because the binary for the profiling run is compiled with the debug information (e.g., debug symbols), it is possible to correlate the profiles back to source line numbers and the statements. Certain loads that contribute more than a predetermined threshold may be identified as delinquent loads. In a particular embodiment, the top loads that contribute to 90% of cache misses are denoted as delinquent loads.
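The 90% rule of the particular embodiment above can be sketched as a ranking over per-load miss counts. The input format (a mapping from a source location string to a sampled miss count) and the function name are assumptions for illustration; the actual profile format is not given here.

```python
from typing import Dict, List


def select_delinquent_loads(miss_counts: Dict[str, int],
                            coverage: float = 0.90) -> List[str]:
    """Return the top loads, ranked by misses, that together reach `coverage`."""
    total = sum(miss_counts.values())
    if total == 0:
        return []
    ranked = sorted(miss_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for load, misses in ranked:
        if covered >= coverage * total:
            break                      # the chosen loads already cover enough misses
        selected.append(load)
        covered += misses
    return selected
```

For a hypothetical profile in which two loads account for 950 of 1000 sampled misses, only those two loads would be marked delinquent.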
In addition to identifying delinquent load instructions, the compiler generates helper threads that compute the addresses of delinquent loads accurately. In one embodiment, separate code for helper threads is generated. The separation between the main thread and the helper thread's code prevents transformations on a helper thread's code from affecting the main thread. In one embodiment, the compiler uses multi-entry threading, instead of conventional out-lining, in the Intel product compiler to generate separate codes for helper threads.
Furthermore, according to one embodiment, the compiler performs multi-entry threading at the granularity of a compiler-selected code region, denoted as precomputation region. This region encompasses a set of delinquent loads and defines the scope for speculative precomputation. In one embodiment, the implementation usually targets loop regions, because loops are usually the hot spots in program execution, and the delinquent loads are the loads that were executed many times, usually in a loop.
Referring back to
According to one embodiment, slicing in the compiler extracts a minimal sequence of instructions to produce the addresses of delinquent loads by transitively traversing the dependence edges backwards. The leaf nodes on the dependence graph of the resulting slices can be converted to prefetch instructions, because no further instructions are dependent on those leaf nodes. Those prefetch instructions executed by a processor, such as the Pentium™ 4 from Intel Corporation, are both non-blocking and non-faulting. Different prefetch instructions exist for bringing data into different levels of cache in the memory hierarchy.
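Backward slicing over dependence edges can be sketched as a worklist traversal. In this illustration the dependence graph is an assumed plain mapping from an instruction to the instructions it directly depends on, and the instruction strings are hypothetical.

```python
from typing import Dict, List, Set


def backward_slice(deps: Dict[str, List[str]], delinquent_load: str) -> Set[str]:
    """Collect every instruction the delinquent load transitively depends on."""
    slice_set: Set[str] = set()
    worklist = [delinquent_load]
    while worklist:
        inst = worklist.pop()
        if inst in slice_set:
            continue                   # already visited (handles loop-carried cycles)
        slice_set.add(inst)
        worklist.extend(deps.get(inst, ()))
    return slice_set


def leaf_nodes(deps: Dict[str, List[str]], slice_set: Set[str]) -> Set[str]:
    """Slice members that no other slice member depends on (prefetch candidates)."""
    used = {d for inst in slice_set for d in deps.get(inst, ())}
    return slice_set - used
```

For a pointer-chasing loop, the slice keeps the pointer update and drops unrelated work such as accumulating a sum; the delinquent load itself is the leaf and would become the prefetch.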
According to one embodiment, slicing operations may be performed with respect to a given code region. Traversal on the dependence graph in a given region must terminate when it reaches code outside of that region. Thus, slicing must be terminated during traversal instead of after traversal, because the graph traversal may span to the outside of a region and then back to the inside of a region. Simply collecting the slices according to regions after the traversal may lose precision.
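The difference between terminating during traversal and filtering afterwards can be shown on a small graph whose dependence path leaves the region and then re-enters it. The sketch assumes the same plain-mapping graph representation as before; both function names are hypothetical.

```python
from typing import Dict, List, Set


def region_slice(deps: Dict[str, List[str]], start: str,
                 region: Set[str]) -> Set[str]:
    """Backward slice that stops a path as soon as it leaves `region`."""
    slice_set: Set[str] = set()
    worklist = [start]
    while worklist:
        inst = worklist.pop()
        if inst in slice_set or inst not in region:
            continue                   # terminate this path during traversal
        slice_set.add(inst)
        worklist.extend(deps.get(inst, ()))
    return slice_set


def filter_after_traversal(deps: Dict[str, List[str]], start: str,
                           region: Set[str]) -> Set[str]:
    """Contrast case: slice without bounds, then keep what lies in `region`."""
    seen: Set[str] = set()
    worklist = [start]
    while worklist:
        inst = worklist.pop()
        if inst not in seen:
            seen.add(inst)
            worklist.extend(deps.get(inst, ()))
    return seen & region
```

With the chain A → B → C, where only A and C are inside the region, the bounded traversal yields {A}, whereas filtering afterwards also keeps C even though C is reachable only through code outside the region.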
In a further embodiment, the compiler slices each delinquent load instruction one by one. To minimize the duplication of code in helper threads and reduce the overhead of thread invocation and synchronization, the compiler merges slices into one helper thread if they are in the same precomputation region.
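Merging slices that fall in the same precomputation region amounts to a union keyed by region. A minimal sketch, assuming each slice is a set of instruction labels tagged with a hypothetical region identifier:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def merge_slices_by_region(
        slices: Iterable[Tuple[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Union the slices that share a precomputation region.

    Each input pair is (region_id, instructions_in_slice); shared
    instructions appear once in the merged helper thread body.
    """
    merged: Dict[str, Set[str]] = defaultdict(set)
    for region_id, insts in slices:
        merged[region_id] |= insts
    return dict(merged)
```

Two slices in the same loop that share an induction-variable update would thus produce one helper thread containing that update once.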
Referring back to
Referring back to
Because the typical Hyper-Threading processors issue three micro-ops per processor cycle and use some hard-partitioned resources, the compiler has to be judicious so as not to let helper threads slow down the main thread's execution, especially if the main thread already issues three micro-ops for execution per cycle. For the loop nest encompassing delinquent loads, the compiler makes a trade-off between re-computation and communication in choosing the loop level for performing speculative precomputation. For each loop level, starting from the innermost one, according to one embodiment, the compiler selects either the communication-based scheme or the computation-based scheme.
According to one embodiment, the communication-based scheme communicates the live-in values from the main thread to the helper thread in each iteration, so the helper thread does not need to re-compute the live-in values. The compiler will select this scheme if there exists an inner loop encompassing most delinquent loads and if slicing for the inner loop significantly decreases the size of a helper thread. However, this scheme will be disabled if the communication cost for the inner loop level is very large. The compiler will give a smaller estimate of the communication cost if the live-in values are computed early and the number of live-ins is small.
Communication-based scheme will create multiple communication points between the main thread and its helper thread at runtime. Communication-based scheme is important for Hyper-Threading processors, because relying on only one communication point by re-computing the slice in the helper thread may create too much resource contention between threads. This scheme is similar to constructing a do-across loop in that the main thread initiates the next iteration after it finishes computing the live-in values for that iteration. The scheme trades communication for less computation.
According to one embodiment, the computation-based scheme assumes only one communication point between two threads to pass in the live-in values in the beginning. Afterwards, the helper thread needs to compute everything it needs to generate accurate prefetch addresses. The compiler will select this scheme if there is no inner loop, or if slicing for this loop level does not significantly increase the size of a helper thread. The computation-based scheme gives the helper thread more independence in execution, once the single communication point is reached.
According to one embodiment, to select the loop level for speculative precomputation, the compiler selects the outermost loop that benefits from communication-based scheme. Hence the scheme-selection algorithm described above can terminate once it finds a loop with communication-based scheme. If the compiler does not find any loop with communication-based scheme, the outermost loop will be the targeted region for speculative precomputation. After the compiler selects the precomputation regions and their communication schemes, locating good trigger points in the main thread would ensure timely prefetches, while minimizing the communication between the main thread and the helper threads. Liveness information helps locate triggers, which are the points at which the backward slicing ends. Slicing beyond the precomputation region ends when the number of live-ins increases.
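One reading of the scheme selection walk is sketched below: visit the loop levels from the innermost outward, stop at the first level for which the communication-based scheme is chosen, and otherwise fall back to the outermost loop with the computation-based scheme. The `LoopLevel` fields, the numeric cost, and the `max_comm_cost` cutoff are assumptions standing in for the compiler's actual estimates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LoopLevel:
    name: str
    inner_loop_has_most_delinquent_loads: bool
    inner_slice_shrinks_helper: bool
    comm_cost: int                      # assumed abstract cost estimate


def choose_scheme(level: LoopLevel, max_comm_cost: int = 100) -> str:
    """Pick a scheme for one loop level from the criteria described above."""
    if (level.inner_loop_has_most_delinquent_loads
            and level.inner_slice_shrinks_helper
            and level.comm_cost <= max_comm_cost):
        return "communication"
    return "computation"


def select_precomputation_region(levels_innermost_first: List[LoopLevel],
                                 max_comm_cost: int = 100) -> Tuple[str, str]:
    """Stop at the first communication-based level, else use the outermost loop."""
    for level in levels_innermost_first:
        if choose_scheme(level, max_comm_cost) == "communication":
            return level.name, "communication"
    return levels_innermost_first[-1].name, "computation"
```

A level whose communication cost exceeds the cutoff is treated as computation-based, which models the scheme being disabled when communication is too expensive.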
Referring back to
If the synchronization period is too large, the prefetch induced by the helper thread could not only displace temporally important data to be used by the main thread but also potentially displace earlier prefetched data that have not been used by the main thread. On the other hand, if the synchronization period is too small, the prefetch could be too late to be useful. To decide on the value of synchronization period, according to one embodiment, the compiler first computes the difference between the length of the slice and the length of program schedule in the main thread. If the difference is small, the run-ahead distance induced by the helper thread in one iteration is consequently small. Multiple iterations may be needed by the helper thread to maintain enough run-ahead distance. Hence, the compiler increases the synchronization period if the difference is small, and vice versa.
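The inverse relationship between the slice/schedule difference and the synchronization period can be made concrete with a simple model. The formula below is an assumption introduced for illustration, not a formula from this description: it treats the difference as the run-ahead gained per iteration and asks how many iterations are needed to reach a hypothetical target run-ahead distance.

```python
import math


def synchronization_period(main_len: int, slice_len: int,
                           target_run_ahead: int) -> int:
    """Iterations per sync-up so the helper accumulates `target_run_ahead`.

    Assumed model: in each iteration the helper gains
    (main_len - slice_len) schedule units on the main thread.
    """
    gain = main_len - slice_len
    if gain <= 0:
        raise ValueError("slice must be shorter than the main-thread schedule")
    return math.ceil(target_run_ahead / gain)
```

Under this model a slice that is nearly as long as the main schedule (small difference) yields a long period, while a much shorter slice yields a short one, matching the direction of the adjustment described above.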
Thereafter, the compiler generates code for the main thread and the helper thread during a code generation stage. During the code generation stage, the compiler builds a thread graph as the interface between the analysis phase and code generation phase. Each graph node denotes a sequence of instructions, or a code region. The invocation edge between the nodes denotes the thread-spawning relationship, which is important for specifying chaining helper threads. Having a thread graph enables code reuse because, according to one embodiment, the compiler also allows the user to insert pragmas in the source program to specify the code for helper threads and the live-ins. Both the pragma-based approach and the automatic approach share the same graph abstraction. As a result, the helper thread code generation module may be shared.
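A thread graph of this kind can be modeled with nodes that hold a code region and invocation edges that record which thread spawns which. The class and field names below are hypothetical; the sketch only illustrates how chained helper threads appear in such a graph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadNode:
    name: str
    instructions: List[str]                       # the code region of this node
    spawns: List["ThreadNode"] = field(default_factory=list)  # invocation edges


def spawn_chain(node: ThreadNode) -> List[str]:
    """Follow the first invocation edge repeatedly to list a helper chain."""
    chain = [node.name]
    while node.spawns:
        node = node.spawns[0]
        chain.append(node.name)
    return chain
```

Because both the pragma-based and the automatic approach would populate the same structure, a single code generator could consume it in either case.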
The helper thread code generation leverages multi-entry threading technology in the compiler to generate helper thread code. In contrast to the conventional, well-known outlining, the compiler does not create a separate compilation unit (or routine) for the helper thread. Instead, the compiler generates a threaded entry and a threaded return for the helper thread code. The compiler keeps all newly generated helper thread codes intact or inlined within the same user-defined routine without splitting them into independent subroutines. This method provides later compiler optimizations with more opportunities for performing optimization on the newly generated helper threads. Fewer instructions in the helper thread means less resource contention on a hyper-threaded processor. This demonstrates that using helper threads for hiding latency incurs fewer instructions and less resource contention than the traditional symmetric multithreading model, which is important especially because the hyper-threaded processor issues three micro-ops per processor cycle and has some hard-partitioned resources.
According to one embodiment, the generated codes for helper threads will be re-ordered and optimized by the later phases in the compiler, such as partial dead-store elimination (PDSE), partial redundancy elimination (PRE), and other scalar optimizations. In that sense, the helper thread code needs to be optimized to minimize the resource contention due to the helper thread. However, those further optimizations may remove prefetching code as well. Therefore, the leaf delinquent loads may be converted to volatile-assign statements in the compiler. The leaf node in the dependence graph of a slice implies that no further instructions in the helper thread depend on the loaded value. Hence, the destination of the volatile-assign statement is changed to a register temp in the representation to speed up the resulting code. Using volatile-assign may prevent all later global optimizations in the compiler from removing generated prefetches for delinquent loads.
According to one embodiment, the compiler aims to ensure that the helper thread runs neither too far ahead of nor too far behind the main thread, using a self-counting mechanism. According to one embodiment, a value X is pre-set for run-ahead distance control. X can be modified by users through a compiler switch, or set based on program analysis of the length of the slice (or helper code) and the length of the main code. In one embodiment, the compiler generates mc (M-counter) with an initial value X for the main thread and hc (H-counter) with an initial value 0 for the helper thread, and the compiler generates the counters M and H for counting the sync-up periods in the main and helper code. The idea is that all four counters (mc, M, hc, H) perform self-counting. The helper thread does not interfere with the main thread. If the helper thread runs too far ahead of the main thread, it will issue a wait; if the helper thread runs behind the main thread, it will perform a catch-up.
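The wait and catch-up behavior can be sketched as a check the helper performs on the two iteration counters. This is a simplified model under stated assumptions: `hc` and `mc` follow the counter names above, while the `max_lead` threshold and the exact comparisons are illustrative and do not reproduce the generated code.

```python
def helper_action(hc: int, mc: int, max_lead: int) -> str:
    """Decide what the helper does at a sync-up check.

    hc / mc are the helper's and the main thread's counters;
    max_lead is an assumed bound on how far ahead the helper may run.
    """
    if hc < mc:
        return "catch_up"    # helper fell behind: jump ahead of the main thread
    if hc - mc > max_lead:
        return "wait"        # helper ran too far ahead: sleep until posted
    return "run"             # within the window: keep issuing prefetches
```

Since each thread only increments and compares its own counters, the check itself does not require the helper to write any state owned by the main thread.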
In a particular embodiment, for every X loop-iterations, the main thread issues a post to ensure that the helper is not waiting and can go ahead to perform non_faulting_load. At this point, if the helper thread waits for the main thread after issuing a number of non_faulting_loads in chunks of sync-up period, it will wake up to perform non_faulting_loads. In another particular embodiment, for every X loop-iterations, the helper thread examines whether its hc counter is greater than the main thread's mc counter and whether the hc counter is greater than a sync-up period H*X of the helper thread; if so, the helper will issue a wait and go to sleep. This prevents the helper thread from running too far ahead of the main thread. In a further embodiment, before iterating over another chunk of sync-up period, the helper thread examines whether its hc counter is smaller than the main thread's mc counter. If so, the helper thread has fallen behind, and must “catch-up and jump ahead” by updating its counters hc and H and capturing all private and live-in variables from the main thread. FIGS. 9A-9C are diagrams illustrating exemplary pseudo code of an application, a main thread, and a helper thread according to one embodiment. Referring to
After the code for the helper threads has been created, the compiler may further allocate, statically or dynamically, resources for each helper thread and the main thread to ensure that there is no resource conflict between the main thread and the helper threads, or among the helper threads. Hardware resources, such as register contexts, may be managed for helper threads within the compiler. Specifically, the register set may be statically or dynamically partitioned between the main thread and the helper threads, and between multiple helper threads. As a result, live-in/live-out register copies via memory may be avoided, and threads may be destroyed at compile time when the compiler runs out of resources, or at runtime when infrequent cases of certain main thread events occur.
According to one embodiment, the compiler may “walk through” the helper threads in a bottom-up order and communicate the resource utilization in a data structure, such as a resource table shown in
The threads are created by the compiler during a thread creation phase, such as those operations shown in
It is crucial that a thread can only share incoming registers (or resources in general) with a parent thread. For example, referring to
According to one embodiment, the compiler allocates resources for the helper threads and the main thread in a bottom-up order.
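As a rough illustration of bottom-up allocation with a resource table, the following Python sketch walks a thread tree in post-order so that each helper thread receives its registers before its parent does. The `Thread` class, the `regs_needed` field, and the simple disjoint-partition policy are hypothetical simplifications; in particular, the sketch does not model the sharing of incoming (live-in) registers between a thread and its parent.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    regs_needed: int                      # registers the thread writes (assumed input)
    children: list = field(default_factory=list)


def allocate_bottom_up(root: Thread, num_regs: int) -> dict:
    """Return a resource table mapping thread name -> set of register numbers.

    Children are allocated before their parent, so the main thread (the
    root) is served last and sees every register its helpers already took.
    """
    table = {}
    taken = set()

    def visit(thread: Thread) -> None:
        for child in thread.children:
            visit(child)                  # bottom-up: deepest threads first
        free = [r for r in range(num_regs) if r not in taken]
        if len(free) < thread.regs_needed:
            raise RuntimeError(f"out of registers while allocating {thread.name}")
        regs = set(free[:thread.regs_needed])
        table[thread.name] = regs         # record usage for the parent to consult
        taken.update(regs)

    visit(root)
    return table


# Hypothetical chain: main spawns helper_a, which spawns helper_b.
chain = Thread("main", 3, [Thread("helper_a", 2, [Thread("helper_b", 2)])])
table = allocate_bottom_up(chain, num_regs=8)
```

With eight registers, the deepest helper in this example is assigned registers 0 and 1, its parent 2 and 3, and the main thread 4 through 6. The `RuntimeError` marks the point where a compiler would have to spill or drop helper threads.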
For the purposes of illustration, the resources used by the threads are assumed to be the hardware registers. However, similar concepts may be applied to other resources apparent to one with ordinary skill in the art, such as memory or interrupts. Referring to
In addition, according to one embodiment, when the compiler runs out of registers, it can delete one or more helper threads within the chain. This can happen, for example, when the main thread runs out of registers because the helper thread chain is too deep, or because a single helper thread needs too many registers and the main thread has to spill/fill registers. The compiler can apply heuristics to either allow a certain number of spills or delete the entire helper thread chain or some threads in the chain. An alternative to deleting helper threads is to explicitly configure the weight of context save/restore, so that upon a context switch, the parent's live registers that could be written by the helper thread's execution are saved automatically by the hardware. Even though this context switch is relatively expensive, such cases are potentially infrequent. Moreover, such a fine-grain context switch still has much lower overhead than a full-context switch as used in most OS-enabled thread switches or in a traditional hardware-based full-context thread switch.
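One possible form of such a heuristic is sketched below in Python. The function name, its parameters, and the policy of dropping the deepest helper thread first are assumptions made for illustration; the text does not prescribe which threads to delete or how spills are counted.

```python
def prune_chain(main_regs, helper_regs, num_regs, max_spills):
    """Return (kept_helpers, spills) for a helper thread chain.

    helper_regs lists the register demand of each helper, ordered from the
    helper spawned by the main thread down to the deepest one in the chain.
    Helpers are deleted from the deep end until the number of registers
    that would have to be spilled is within max_spills.
    """
    kept = list(helper_regs)
    while True:
        demand = main_regs + sum(kept)
        spills = max(0, demand - num_regs)
        if spills <= max_spills or not kept:
            return kept, spills
        kept.pop()  # delete the deepest remaining helper thread
```

For instance, with 10 registers, a main thread needing 6, helpers needing 3 and 4, and at most one tolerated spill, the sketch drops the deeper helper and keeps the first one without any spill.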
Furthermore, a conflict may arise for live-in registers. For example, if helper thread 1003 overwrites a live-in register (e.g., mov v5=...) and this register is also used in helper thread 1002 after the spawn of helper thread 1003, there is a resource conflict for the register assigned to v5 (in this example, register R2). To handle this conflict, the compiler would use availability analysis and insert compensation code, such as inserting a mov v5′=v5 instruction before spawning helper thread 1003 and replacing v5 by v5′ after the spawn.
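The compensation step can be sketched as a small rewrite over the parent thread's instruction list. In the Python below, the textual instruction format, the `spawn helper_child` marker, and the function name are hypothetical, and an ASCII apostrophe stands in for the prime mark. The sketch shows only the copy-and-rename transformation and does not perform the availability analysis that decides when it is needed.

```python
import re


def insert_compensation(parent_code, spawn_marker, var):
    """Copy `var` before the spawn and use the copy after the spawn.

    parent_code is a list of instruction strings for the parent thread;
    spawn_marker is the instruction that spawns the child helper thread.
    """
    idx = parent_code.index(spawn_marker)
    copy = var + "'"
    # Match the variable as a whole word that is not already primed.
    pattern = re.compile(r"\b" + re.escape(var) + r"\b(?!')")
    before = parent_code[:idx]
    after = [pattern.sub(copy, ins) for ins in parent_code[idx + 1:]]
    return before + [f"mov {copy}={var}", spawn_marker] + after


code = ["mov v5=r1", "spawn helper_child", "add v7=v5,v6"]
fixed = insert_compensation(code, "spawn helper_child", "v5")
```

After the rewrite, the parent reads the preserved copy following the spawn, so a write to v5 by the child no longer affects it.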
The above described techniques have been tested against a variety of benchmark tools based on a system similar to the following configurations:
A Processor with Hyper-Threading Technology
|Threading||2 logical processors|
|Trace cache||12k micro-ops. 8-way associative. 6 micro-ops per line.|
|L1 D cache||8k bytes. 4-way associative. 64-byte line size. 2-cycle integer access. 4-cycle FP access.|
|L2 unified cache||256k bytes. 8-way associative. 128-byte line size. 7-cycle access latency.|
|Load buffers||48|
|Store buffers||24|
The benchmark tools include at least one of the following:
|Benchmark||Description||Input Set|
|nbody_walker||Traverses nearest bodies from any node in Nbody graph||20k bodies|
|mst||Computes Minimal Spanning Tree for data clustering||3k nodes|
|em3d||Solves electromagnetic propagation in 3D||20k 5-degree nodes|
|health||Hierarchical database modeling health care system||5 levels|
|mcf||Integer programming algorithm used for bus scheduling||Lite|
Thus, methods and apparatuses for thread management for multi-threading have been described. In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
|International Classification||G06F9/46, G06F9/45|
|Sep 30, 2003||AS||Assignment|
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HOFLEHNER, GEROLF F.;LIAO, SHIH-WEI;TIAN, XINMIN;AND OTHERS;REEL/FRAME:014572/0309;SIGNING DATES FROM 20030919 TO 20030924