|Publication number||US6477637 B1|
|Application number||US 09/409,802|
|Publication date||Nov 5, 2002|
|Filing date||Sep 30, 1999|
|Priority date||Sep 30, 1999|
|Inventors||Ravi Kumar Arimilli, Robert Alan Cargnoni, Guy Lynn Guthrie|
|Original Assignee||International Business Machines Corporation|
The present patent application is related to copending application U.S. Ser. No. 09/409,801, filed on even date, entitled “METHOD AND APPARATUS FOR DISPATCHING MULTIPLE STORE REQUESTS BETWEEN FUNCTIONAL UNITS WITHIN A PROCESSOR.”
1. Technical Field
The present invention relates to a method and apparatus for data processing in general, and in particular to a method and apparatus for transporting access requests within a data processing system. Still more particularly, the present invention relates to a method and apparatus for queuing and transporting store requests between functional units within a processor.
2. Description of the Prior Art
Designers of modern state-of-the-art processors are continually attempting to enhance performance aspects of such processors. One technique for enhancing data processing efficiency is the achievement of shorter cycle times and a lower cycles-per-instruction ratio by issuing multiple instructions concurrently. In conjunction, separate execution units that can operate concurrently may be utilized to execute issued instructions. For example, some superscalar processors employ pipelined branch, fixed-point, and floating-point execution units to execute multiple instructions concurrently. As a result of the concurrent issuance and execution of multiple instructions, instruction execution performance is increased.
In addition, processor designers are faced with the challenge of constructing efficient means for sending pipeline commands, requests, or instructions between various functional units within a processor. Because multiple cycles are required to transport a command between two functional units within a large processor, it is important that the transport protocol maximize the rate at which commands can be sent, even with the added transport latency that may exist between the two functional units. This is because even with multiple execution units, the performance of a processor still depends upon the rate at which instructions, commands, and requests can be transported between functional units. Thus, it should be apparent that a need exists for an improved method and apparatus for transporting instructions, commands, or requests between functional units within a processor such that transport delay among functional units is minimized.
In accordance with a preferred embodiment of the present invention, a data processing system includes a data dispatching unit, a data receiving unit, and a segmented data pipeline along with a segmented feedback line coupled between the data dispatching unit and the data receiving unit. Having multiple latches interconnected between segments, the segmented data pipeline systolically transfers data from the data dispatching unit to the data receiving unit. The segmented feedback line has multiple control latches interconnected between segments. Each of the control latches sends a control signal to a respective one of the latches in the segmented data pipeline to forward data to a next segment within the segmented data pipeline.
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.
The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 is a block diagram of a processor in which a preferred embodiment of the present invention is implemented;
FIG. 2 is a detailed block diagram of a high-speed request pipeline connected between the load/store unit and the data cache from FIG. 1, in accordance with a preferred embodiment of the present invention;
FIG. 3 is a high-level control flow diagram of the operation of the load/store unit from FIG. 2, in accordance with a preferred embodiment of the present invention; and
FIG. 4 is a high-level control flow diagram of the operation of the control unit within the data cache from FIG. 2, in accordance with a preferred embodiment of the present invention.
Referring now to the drawings and in particular to FIG. 1, there is depicted a block diagram of a processor 10 in which a preferred embodiment of the present invention is implemented. Within processor 10, a bus interface unit 12 is coupled to a data cache 13 and an instruction cache 14. Instruction cache 14 is further coupled to an instruction unit 11 that fetches instructions from instruction cache 14 during each execution cycle.
Processor 10 also includes at least three execution units, namely, an integer unit 15, a load/store unit 16, and a floating-point unit 17. Each of execution units 15-17 can execute one or more classes of instructions, and all execution units 15-17 can operate concurrently during each processor cycle. After execution of an instruction has terminated, the associated one of execution units 15-17 stores data results to a respective rename buffer (not shown), depending upon the instruction type. Then, any one of execution units 15-17 may signal a completion unit 20 that the execution of an instruction has finished. Finally, each instruction is completed in program order, and the result data are transferred from a respective rename buffer to a general purpose register 18 or a floating-point register 19, accordingly.
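As a rough illustration of this in-order completion scheme, here is a minimal Python sketch (the data structures and names are hypothetical; the patent describes these units only at block level) that retires finished results from rename buffers into the architected registers in program order:

```python
# Hypothetical model of completion unit 20: results wait in rename buffers
# until the oldest instruction has finished, then retire in program order.

def complete_in_order(program_order, rename_buffers, gpr, fpr):
    """program_order: list of (tag, is_float, dest_reg), oldest first.
    rename_buffers: dict mapping a finished instruction's tag to its result."""
    while program_order:
        tag, is_float, dest = program_order[0]
        if tag not in rename_buffers:   # oldest instruction still executing,
            break                       # so younger results must keep waiting
        value = rename_buffers.pop(tag)
        (fpr if is_float else gpr)[dest] = value   # transfer to GPR 18/FPR 19
        program_order.pop(0)            # instruction is now complete
```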
With reference now to FIG. 2, there is depicted a detailed block diagram of a high-speed request pipeline, in accordance with a preferred embodiment of the present invention. As shown, a request pipeline 21 is connected between, for example, load/store unit 16 and data cache 13 from FIG. 1. Executed but uncompleted store requests can be transported between load/store unit 16 and data cache 13 via request pipeline 21. Request pipeline 21 has multiple segments, which are interconnected by latches 23a-23n. Latches 23a-23n are needed to maintain the required cycle time to transport store requests between load/store unit 16 and data cache 13. In one embodiment, each store request has 101 bits, preferably in the form of one request_valid bit, 32 address bits, four command bits, and 64 data bits. Store requests are gathered in a store queue 25 upon reaching data cache 13.
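As an illustration, the 101-bit format can be sketched as follows; the bit ordering and helper names are assumptions, since the patent specifies only the field widths:

```python
# Sketch of the 101-bit store request: 1 request_valid bit, 32 address bits,
# 4 command bits, 64 data bits. The field positions chosen here (valid in the
# most significant bit, data in the least) are an assumption for illustration.

def pack_store_request(valid: int, address: int, command: int, data: int) -> int:
    assert valid in (0, 1) and address < 1 << 32
    assert command < 1 << 4 and data < 1 << 64
    return (valid << 100) | (address << 68) | (command << 64) | data

def unpack_store_request(req: int) -> dict:
    return {
        "request_valid": (req >> 100) & 0x1,
        "address": (req >> 68) & 0xFFFF_FFFF,
        "command": (req >> 64) & 0xF,
        "data": req & 0xFFFF_FFFF_FFFF_FFFF,
    }

req = pack_store_request(1, 0x1000_0040, 0b0010, 0xDEAD_BEEF)
assert unpack_store_request(req)["address"] == 0x1000_0040
```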
In addition to request pipeline 21, a feedback line 22 is also connected between load/store unit 16 and data cache 13. Feedback line 22 informs load/store unit 16 when data cache 13 has removed an entry from request pipeline 21. Load/store unit 16 maintains a numerical count of store requests currently loaded in request pipeline 21. With the numerical count, load/store unit 16 is able to detect whether request pipeline 21 is full (the maximum count of store requests that request pipeline 21 can hold is equal to the number of latches 23a-23n plus the number of entries in store queue 25). For example, after a store request has been removed from store queue 25, a feedback signal is sent to load/store unit 16 via feedback line 22 to inform load/store unit 16 that the total number of store requests in request pipeline 21 has been reduced by one and that a new store request can be sent via request pipeline 21. Similar to request pipeline 21, feedback line 22 also has multiple segments, which are interconnected by control latches 24a-24n. The feedback signal preferably travels between data cache 13 and load/store unit 16 within the same cycle time as store requests.
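In software terms, this counting scheme behaves like a credit counter; a minimal sketch, with illustrative names, follows:

```python
# Credit-counter sketch of the flow control described above: load/store
# unit 16 tracks how many store requests are in flight and treats the
# pipeline as full at (number of latches 23a-23n + entries in store queue 25).

class PipelineCredit:
    def __init__(self, num_latches: int, queue_entries: int):
        self.capacity = num_latches + queue_entries  # maximum count
        self.count = 0                               # requests in flight

    def can_dispatch(self) -> bool:
        return self.count < self.capacity            # pipeline not full?

    def on_dispatch(self) -> None:   # a new request sent down pipeline 21
        assert self.can_dispatch()
        self.count += 1

    def on_feedback(self) -> None:   # a pulse received on feedback line 22
        assert self.count > 0
        self.count -= 1
```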
In accordance with a preferred embodiment of the present invention, each of control latches 24a-24n within feedback line 22 is associated with a respective one of latches 23a-23n within request pipeline 21. As shown in FIG. 2, control latch 24a is connected to latch 23a via an OR gate 25a, control latch 24b is connected to latch 23b via an OR gate 25b, etc. Furthermore, each of latches 23a-23n is associated with a preceding one of the latches. For example, latch 23b is connected with latch 23a via a latch 26a and OR gate 25a, latch 23c is connected with latch 23b via a latch 26b and OR gate 25b, etc. At the final latch stage, data cache 13 is connected with latch 23n.
With the above-mentioned associations, store requests can be propagated down request pipeline 21 in three separate ways, namely, by the prompting of a feedback signal from each of control latches 24a-24n, by the prompting of a return signal from each of latches 23a-23n, and by the prompting of a return signal from a latch stage “below.” Each of these is explained below in turn. First, store requests can be propagated down request pipeline 21 by the prompting of a feedback signal from each of control latches 24a-24n on feedback line 22. For example, control latch 24n sends a feedback signal to latch 23n via a feedback line 31n and OR gate 25n to move a store request from latch 23n to store queue 25; control latch 24b sends a feedback signal to latch 23b via a feedback line 31b and OR gate 25b to move a store request from latch 23b to latch 23c; etc.
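Before turning to the details of each way, note that each prompt reaches its latch through the corresponding OR gate 25x, so a latch's load enable is effectively the OR of the three signals; a hypothetical rendering:

```python
# Hypothetical rendering of the OR-gate combination at each latch stage:
# any one of the three prompts enables latch 23x to load on the next cycle.

def latch_load_enable(feedback_signal: bool,      # first way: lines 31a-31n
                      own_slot_empty: bool,       # second way: lines 32a-32n
                      below_just_latched: bool    # third way: lines 33a-33n
                      ) -> bool:
    return feedback_signal or own_slot_empty or below_just_latched
```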
During operation, when the feedback signal is transferred from control latch 24n to the next control latch, the feedback signal is also utilized to enable latch 23n to load data from a latch “above,” since the store request in latch 23n has just been propagated to store queue 25. As the feedback signal is traveling “up” feedback line 22, a store request (if present) in each of latches 23a-23n is propagating “down” request pipeline 21, until the feedback signal reaches load/store unit 16.
Second, store requests can be propagated down request pipeline 21 by the prompting of a return signal from each of latches 23a-23n via a respective one of return lines 32a-32n. For example, latch 23a sends a return signal to itself via a return line 32a and OR gate 25a to load a store request into latch 23a; latch 23b sends a return signal to itself via a return line 32b and OR gate 25b to load a store request into latch 23b; etc. This propagation movement serves to fill in any empty latch (i.e., a latch that does not contain a valid request) within request pipeline 21. Preferably, the request_valid bit (i.e., bit 0 of the 101 bits) is used as the return signal on return lines 32a-32n to determine whether a latch within request pipeline 21 is empty.
Third, store requests can be propagated down request pipeline 21 by the prompting of a return signal from latches 26b-26n. Each of latches 26b-26n serves as an enable control for a latch preceding each of latches 23a-23n. For example, OR gate 25b sends a return signal to latch 23a via a latch 26b, a return line 33a, and OR gate 25a to cause a store request to be loaded into latch 23a from load/store unit 16; OR gate 25c sends a return signal to latch 23b via a latch 26c, a return line 33b, and OR gate 25b to cause a store request to be loaded into latch 23b from latch 23a; etc. The purpose of this return signal is to detect when a latch stage “below” has just latched a store request from the current latch stage. Because this indication must itself be latched (i.e., latches 26b-26n are also required to key off OR gates 25b-25n), gates 27a-27n are utilized to prevent the same store request from being loaded twice. For example, when OR gate 25b is active, latch 23b is enabled such that in the next cycle it will contain what was in latch 23a in the previous cycle. But because it takes a cycle (via latch 26b) for latch 23a to realize its request has already been loaded into latch 23b, latch 23a must gate off its request via gate 27a for one cycle to prevent the same request from being loaded twice by latch 23b.
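The interplay of the three mechanisms and the one-cycle gating can be seen in the following behavioral sketch; it is a cycle-level approximation in Python with illustrative names, not a description of the actual circuit:

```python
# Behavioral model of request pipeline 21. The `masked` list plays the role
# of gates 27a-27n: a stage whose request was just taken hides its output for
# one cycle, because the "taken" indication only reaches it a cycle later
# (through latches 26b-26n) and the same request must not be latched twice.

class RequestPipeline:
    def __init__(self, n: int):
        self.n = n
        self.value = [None] * n      # payload held in latches 23a..23n
        self.valid = [False] * n     # request_valid bit of each latch
        self.masked = [False] * n    # gates 27a..27n (one-cycle output mask)
        self.pending = [False] * n   # latches 26b..26n (delayed load enable)
        self.to_queue = []           # requests captured by store queue 25

    def step(self, lsu_req, queue_ready: bool) -> bool:
        n, taken = self.n, [False] * self.n
        # First way: the feedback-driven drain into store queue 25.
        if queue_ready and self.valid[n - 1] and not self.masked[n - 1]:
            self.to_queue.append(self.value[n - 1])
            taken[n - 1] = True
        # Second and third ways: a stage loads when it is empty (its own
        # request_valid bit) or when the stage below took its request last
        # cycle (the delayed enable through latches 26b-26n).
        for i in range(n - 1, 0, -1):
            if not self.valid[i] or self.pending[i]:
                ok = self.valid[i - 1] and not self.masked[i - 1]
                self.value[i], self.valid[i] = self.value[i - 1], ok
                if ok:
                    taken[i - 1] = True
        # The first latch loads a new request from load/store unit 16.
        accepted = False
        if not self.valid[0] or self.pending[0]:
            self.value[0], self.valid[0] = lsu_req, lsu_req is not None
            accepted = self.valid[0]
        # Mask every just-taken stage for exactly one cycle.
        self.masked, self.pending = taken, list(taken)
        return accepted

pipe, sent = RequestPipeline(4), 0
for _ in range(20):
    if pipe.step(f"st#{sent}", queue_ready=True):
        sent += 1
# Requests arrive at the store queue in order, each exactly once.
assert pipe.to_queue == [f"st#{i}" for i in range(len(pipe.to_queue))]
```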
Referring now to FIG. 3, there is depicted a high-level control flow diagram of the operation of load/store unit 16, in accordance with a preferred embodiment of the present invention. Starting at block 30, a determination is made whether or not a new store request needs to be dispatched (for example, when execution of a store instruction by load/store unit 16 is completed), as shown in block 31. If no new store request needs to be dispatched, the process returns to block 31; otherwise, a determination is made whether or not request pipeline 21 (from FIG. 2) is full, as depicted in block 32. Load/store unit 16 maintains a counter to keep track of the number of store requests that have been dispatched down request pipeline 21. The counter is incremented every time a new store request is dispatched down request pipeline 21 until the maximum count (i.e., the number of latches within request pipeline 21 plus the number of entries in store queue 25) is reached. On the other hand, the counter is decremented when load/store unit 16 receives a feedback signal from feedback line 22.
A new store request is dispatched to request pipeline 21 when request pipeline 21 is not full, as illustrated in block 33, and the process returns to block 31. Otherwise, load/store unit 16 waits until request pipeline 21 becomes available for data dispatching.
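A per-cycle sketch of this dispatch logic, reusing the PipelineCredit counter from the earlier example (again with illustrative names):

```python
# One cycle of the FIG. 3 flow for load/store unit 16. `pending_requests`
# holds executed store instructions awaiting dispatch; `send_down_pipeline`
# stands in for driving the first latch of request pipeline 21.

def lsu_cycle(credit, pending_requests, send_down_pipeline, feedback_seen):
    if feedback_seen:                       # pulse arrived on feedback line 22
        credit.on_feedback()
    # Blocks 31-33: dispatch one request if one exists and the pipeline
    # is not full; otherwise simply wait for the next cycle.
    if pending_requests and credit.can_dispatch():
        send_down_pipeline(pending_requests.pop(0))
        credit.on_dispatch()
```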
With reference now to FIG. 4, there is depicted a high-level control flow diagram of the operation of control unit 26 within data cache 13 (from FIG. 2), in accordance with a preferred embodiment of the present invention. Starting at block 40, a determination is made whether or not store queue 25 (from FIG. 2) is full, as shown in block 41. If store queue 25 is full, the process returns to block 41; otherwise, a store request is captured from latch 23n (from FIG. 2), as depicted in block 42. A determination is then made whether or not the captured store request is a valid request, as shown in block 43. If the captured store request is a valid request, then the captured data is placed within store queue 25, and a feedback signal is sent to feedback line 22, as depicted in block 44, indicating that another store request can be dispatched down request pipeline 21. Otherwise, if the captured store request is not a valid request, the process returns to block 42.
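The receiving side can be sketched the same way; store_queue, last_latch, and send_feedback here are hypothetical stand-ins for store queue 25, latch 23n, and feedback line 22:

```python
# One cycle of the FIG. 4 flow for control unit 26 inside data cache 13.

def cache_control_cycle(store_queue, queue_capacity, last_latch, send_feedback):
    if len(store_queue) >= queue_capacity:   # block 41: store queue 25 full?
        return                               # wait until an entry drains
    request = last_latch.capture()           # block 42: capture from latch 23n
    if request is None or not request["request_valid"]:
        return                               # block 43: discard invalid requests
    store_queue.append(request)              # block 44: enqueue the request and
    send_feedback()                          # pulse feedback line 22
```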
As has been described, the present invention provides an improved method and apparatus for transporting requests within a processor that minimizes transport latency. Although an execution unit and a store queue are utilized to illustrate the present invention, it is understood by those skilled in the art that the illustrated principles can be applied to data transfer between any two functional units within a data processing system.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5404550 *||Jul 25, 1991||Apr 4, 1995||Tandem Computers Incorporated||Method and apparatus for executing tasks by following a linked list of memory packets|
|US5574933 *||Oct 26, 1994||Nov 12, 1996||Tandem Computers Incorporated||Task flow computer architecture|
|US5659780 *||Jun 15, 1994||Aug 19, 1997||Wu; Chen-Mie||Pipelined SIMD-systolic array processor and methods thereof|
|US5758139 *||Apr 23, 1996||May 26, 1998||Sun Microsystems, Inc.||Control chains for controlling data flow in interlocked data path circuits|
|US5799134 *||May 15, 1995||Aug 25, 1998||Industrial Technology Research Institute||One dimensional systolic array architecture for neural network|
|US5819308 *||Feb 27, 1997||Oct 6, 1998||Industrial Technology Research Institute||Method for buffering and issuing instructions for use in high-performance superscalar microprocessors|
|US5937177 *||Oct 1, 1996||Aug 10, 1999||Sun Microsystems, Inc.||Control structure for a high-speed asynchronous pipeline|
|US6298423 *||Aug 26, 1996||Oct 2, 2001||Advanced Micro Devices, Inc.||High performance load/store functional unit and data cache|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8429384 *||Nov 15, 2006||Apr 23, 2013||Harman International Industries, Incorporated||Interleaved hardware multithreading processor architecture|
|US20080016321 *||Nov 15, 2006||Jan 17, 2008||Pennock James D||Interleaved hardware multithreading processor architecture|
|U.S. Classification||712/218, 710/61|
|International Classification||G06F3/00, G06F5/08|
|Sep 30, 1999||AS||Assignment|
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ARIMILLI, RAVI K.;CARGNONI, ROBERT ALAN;GUTHRIE, GUY LYNN;REEL/FRAME:010292/0141
Effective date: 19990930
|Jan 9, 2006||FPAY||Fee payment|
Year of fee payment: 4
|Jun 14, 2010||REMI||Maintenance fee reminder mailed|
|Oct 28, 2010||FPAY||Fee payment|
Year of fee payment: 8
|Oct 28, 2010||SULP||Surcharge for late payment|
Year of fee payment: 7
|Jun 13, 2014||REMI||Maintenance fee reminder mailed|
|Oct 24, 2014||FPAY||Fee payment|
Year of fee payment: 12
|Oct 24, 2014||SULP||Surcharge for late payment|
Year of fee payment: 11
|Mar 13, 2015||AS||Assignment|
Owner name: LINKEDIN CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:035201/0479
Effective date: 20140331