Publication number: US 5467476 A
Publication type: Grant
Application number: US 08/293,164
Publication date: Nov 14, 1995
Filing date: Aug 19, 1994
Priority date: Apr 30, 1991
Fee status: Paid
Inventors: Takashi Kawasaki
Original Assignee: Kabushiki Kaisha Toshiba
Superscalar processor having bypass circuit for directly transferring result of instruction execution between pipelines without being written to register file
US 5467476 A
Abstract
In a superscalar parallel processor, the execution time for instructions can be reduced, and the performance of instruction processing can be improved. A superscalar parallel processor having a plurality of pipelines arranged to parallelly execute a maximum of N (N>1) instructions includes a bypass circuit for transferring a data output of each step of at least two pipelines between the pipelines.
Claims (4)
What is claimed is:
1. A superscalar processing system for parallelly processing a plurality of instructions using a superscalar method, comprising:
first and second instruction execution operating units;
a register file having first and second read ports and first and second write ports; and
first and second instruction decoders, said first and second instruction decoders responding to said first and second instruction execution operating units, respectively;
said first instruction execution operating unit including:
a first pipeline having a first terminal connected to said first write port,
a first arithmetic and logic unit having an output port connected to a second terminal of said first pipeline,
two two-input selector circuits for selectively receiving data to be processed in accordance with instructions,
two flip-flop circuits, each positioned between respective ones of said two-input selector circuits and an input port of said first arithmetic and logic unit, and
a first group of flip-flop circuits arranged on said first pipeline for storing data;
said second instruction execution operating unit including:
a second pipeline having a first terminal connected to said second write port,
a second arithmetic and logic unit having an output port connected to a second terminal of said second pipeline,
two two-input selector circuits for selectively receiving data to be processed in accordance with instructions,
two flip-flop circuits, each positioned between respective ones of said two-input selector circuits and an input port of said second arithmetic and logic unit, and
a second group of flip-flop circuits arranged on said second pipeline for storing data;
a first group of bypass circuits, each connecting said first pipeline to one of said two two-input selector circuits of said second instruction execution operating unit, for transferring data stored in said first group of flip-flop circuits to said one of said two two-input selector circuits of said second instruction execution operating unit; and
a second group of bypass circuits, each connecting said second pipeline to one of said two two-input selector circuits of said first instruction execution operating unit, for transferring data stored in said second group of flip-flop circuits to said one of said two two-input selector circuits of said first instruction execution operating unit.
2. A system according to claim 1, wherein said instruction execution operating units, said register file and said instruction decoders are formed on a common semiconductor chip.
3. A system according to claim 1, wherein each of said first group of said bypass circuits comprises a tri-state buffer circuit.
4. A system according to claim 1, wherein each of said second group of said bypass circuits comprises a tri-state buffer circuit.
Description

This application is a continuation of application Ser. No. 07/874,135, filed Apr. 27, 1992, now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a RISC (Reduced Instruction Set Computer) microprocessor and, more particularly, to a parallel processor system for parallelly processing a plurality of instructions using a superscalar method.

2. Description of the Related Art

In a conventional data processor, an SISD (Single Instruction Single Data) method for sequentially processing single instructions is usually used. To improve the performance of such a processor, several measures are taken: the width of processable data is increased, the operating frequency is raised, a pipeline method is used that processes a plurality of data simultaneously by dividing the processing into several stages, or hardware dedicated to special processing such as floating-point arithmetic is added.

FIG. 1 shows a conventional pipeline processor having only one operating unit. Reference numeral 51 denotes a register file. Reference symbol RP denotes a read port of the register file 51; and WP, a write port of the register file 51. Reference numeral 52 denotes an arithmetic and logic unit (to be referred to as an ALU hereinafter); 531 and 532, two-input selector circuits; 54a to 54d, flip-flop circuits; 55a to 55c and 56a to 56c, tri-state buffer circuits; and 17, an instruction decoder.

In this pipeline processor, when the instructions 1 to 4 shown in FIG. 2 are to be executed, as shown in FIG. 3, instruction decode D, instruction execute E, memory access M, and register write W of the instruction 1 are sequentially executed in four steps I to IV, the instruction 2 is executed in steps II to V, the instruction 3 is executed in steps III to VI, and the instruction 4 is executed in steps IV to VII. Therefore, a total of 7 cycles is required before the operation result of the instruction 4 is written in the register.
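
The schedule of FIG. 3 can be summarized by a short model. The following sketch (illustrative Python, not part of the patent; the function and stage labels are assumptions for illustration) computes the cycle in which each instruction occupies each stage of a single four-stage pipeline that issues one instruction per cycle:

    # Illustrative model of the single-pipeline schedule of FIG. 3:
    # one instruction enters decode per cycle and then advances through
    # the stages D, E, M, W in consecutive cycles.
    STAGES = ["D", "E", "M", "W"]

    def single_pipeline_schedule(num_instructions):
        """Return {instruction: {stage: cycle}} for a 1-issue, 4-stage pipeline."""
        schedule = {}
        for i in range(num_instructions):
            # Instruction i+1 enters decode in cycle i+1 (cycles are 1-based,
            # matching steps I, II, III, ... of the figures).
            schedule[i + 1] = {stage: i + 1 + s for s, stage in enumerate(STAGES)}
        return schedule

    sched = single_pipeline_schedule(4)
    print(sched)                                          # instruction 4: D=4, E=5, M=6, W=7
    print(max(stages["W"] for stages in sched.values()))  # 7 cycles in total, as in FIG. 3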

In order to further improve the performance of a processor, an MIMD (Multiple Instruction stream, Multiple Data stream) method for simultaneously (parallelly) executing a plurality of instructions can effectively be used. In this method, a plurality of operation processing units are arranged, and these units are operated simultaneously. Examples include an array processor, which has an array of identical operating units, and a superscalar parallel processor, which has a plurality of different operating units and a plurality of pipelines.

Since it is difficult to apply the former to general data processing, its application field is limited. In contrast, since the control method of the superscalar processor is an extension of the control method of a conventional processor, the superscalar processor can be applied to general data processing relatively easily.

In the superscalar parallel processor, a plurality of instructions are executed in parallel in one clock (cycle) by simultaneously operating a plurality of operating units. To do so, the plurality of instructions are fetched and decoded simultaneously. For this reason, the superscalar parallel processor has a processing capacity larger than that of a conventional processor.

FIG. 4 shows a conventional superscalar parallel processor having two operating units and two pipelines arranged to parallelly execute two instructions.

In FIG. 4, reference numeral 71 denotes a register file, and reference symbols RP and WP denote a read port and a write port of the register file 71, respectively. Reference numerals 721 and 722 denote ALUs; 731a, 731b, 732a, and 732b, two-input selector circuits; 741a to 741d and 742a to 742d, flip-flop circuits; 751a to 751c, 761a to 761c, 752a to 752c, and 762a to 762c, tri-state buffer circuits; and 771 and 772, instruction decoders.

When this parallel processor is to execute instructions 1 to 4 (shown in FIG. 2), as shown in FIG. 5, instruction decode D, instruction execute E, memory access M, and register write W of the instructions 1 and 2 are sequentially executed in four steps I to IV, and instruction decode D, instruction execute E, memory access M, and register write W of the instructions 3 and 4 are sequentially executed in four steps V to VIII.

At this time, instructions 3 and 4 cannot be executed until the operation results of the instructions 1 and 2 have been written in the register. Therefore, the time required for writing the operation results of the instructions 3 and 4 in the register is 8 cycles: the 4 cycles required for executing the instructions 1 and 2 plus the 4 cycles required for executing the instructions 3 and 4.
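
The stall described above can be modelled in the same way (again an illustrative Python sketch, not from the patent): a dependent pair of instructions cannot begin decoding until the preceding pair has completed its register write W, which yields the 8-cycle total of FIG. 5.

    # Illustrative model of the two-issue schedule of FIG. 5 without a bypass:
    # instructions 3 and 4 need $3 and $6, so they may only start after those
    # results have been written back to the register file.
    STAGES = ["D", "E", "M", "W"]

    def dual_issue_no_bypass(instruction_pairs):
        """Each later pair here depends on the previous pair's results."""
        schedule, start = {}, 1
        for pair in instruction_pairs:
            for insn in pair:
                schedule[insn] = {stage: start + s for s, stage in enumerate(STAGES)}
            # The next (dependent) pair may only decode after this pair's write-back.
            start = schedule[pair[0]]["W"] + 1
        return schedule

    sched = dual_issue_no_bypass([(1, 2), (3, 4)])
    print(max(stages["W"] for stages in sched.values()))  # 8 cycles, as in FIG. 5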

Therefore, in the above conventional superscalar parallel processor, although the amount of hardware is increased, whenever the number of instructions to be executed is larger than the number that can be executed in parallel, an instruction that needs the operation result of an earlier instruction cannot be executed until that result has been written in the register. As a result, the execution time for the instructions may be increased.

In other words, in the conventional superscalar parallel processor, when the number of instructions to be executed is larger than the number that can be executed in parallel, the execution time for the instructions may be increased.

SUMMARY OF THE INVENTION

The present invention has been made to solve the above problems, and has as its object to provide a superscalar parallel processor capable of decreasing the execution time for instructions and improving the performance of instruction processing.

According to the present invention, a superscalar parallel processor has a plurality of pipelines arranged to be able to execute a maximum of N (N>1) instructions in parallel, and comprises a bypass circuit for transferring the data output of each step of at least one instruction pipeline to and from the other instruction pipelines.

Before the operation result of an arbitrary step of one instruction pipeline is written in the register file, the result can be transferred to another instruction pipeline and used there as the operand of another operation. Therefore, even when the number of instructions to be executed is larger than the number that can be executed in parallel, the execution time for the instructions is decreased, and the performance of instruction processing can be improved.
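
As an informal illustration of this idea (a Python sketch with assumed names and values, not the claimed circuit), each operand selector chooses between the value read from the register file and a result forwarded from a pipeline stage that has not yet been written back:

    # Illustrative operand selection: prefer a forwarded (bypassed) result
    # over the value obtained from the register file read port.
    def select_operand(reg_index, register_file, forwarded):
        """forwarded maps register index -> result still held in a pipeline stage."""
        if reg_index in forwarded:
            return forwarded[reg_index]      # bypass path enabled
        return register_file[reg_index]      # normal read-port path

    register_file = {1: 10, 2: 20, 4: 30, 5: 40}     # assumed contents
    forwarded = {3: 10 + 20, 6: 30 + 40}             # $3 and $6 not yet written back
    print(select_operand(3, register_file, forwarded))   # 30, taken from the bypass
    print(select_operand(1, register_file, forwarded))   # 10, taken from the register file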

Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate a presently preferred embodiment of the invention and, together with the general description given above and the detailed description of the preferred embodiment given below, serve to explain the principles of the invention.

FIG. 1 is a logical circuit diagram showing a conventional pipeline processor having only one operating unit;

FIG. 2 is a view for explaining instructions 1 to 4 which are parallelly executed in a parallel processor in FIG. 6;

FIG. 3 is a view showing the conditions of steps when the instructions 1 to 4 in FIG. 2 are executed in the pipeline processor in FIG. 1;

FIG. 4 is a logical circuit diagram showing two pipelines in a conventional superscalar parallel processor in the case where two operating units are provided;

FIG. 5 is a view showing the conditions of each step when the instructions 1 to 4 in FIG. 2 are executed in the parallel processor in FIG. 4;

FIG. 6 is a logical circuit diagram showing pipelines of a superscalar parallel processor according to an embodiment of the present invention;

FIG. 7 is a circuit diagram showing a tri-state buffer circuit in FIG. 6; and

FIG. 8 is a view showing the conditions of steps when the instructions 1 to 4 in FIG. 2 are executed in the parallel processor in FIG. 6.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention will be described below with reference to the accompanying drawings.

FIG. 6 shows a part of a superscalar parallel processor. This parallel processor has a plurality (e.g., two) of operating units and a plurality (in this embodiment, two) of instruction pipelines. The pipelines are arranged such that a maximum of two instructions can be executed in parallel in one clock cycle by simultaneously operating the two operating units.

In FIG. 6, reference numeral 11 denotes a register file, and reference symbols RP and WP denote a read port and a write port of the register file 11, respectively. Reference numerals 121 and 122 denote ALUs; 131a, 131b, 132a, and 132b, two-input selector circuits; 141a to 141d and 142a to 142d, flip-flop circuits; 151a to 151f, 161a to 161f, 152a to 152f, and 162a to 162f, tri-state buffer circuits; and 171 and 172, instruction decoders.

All the elements shown in FIG. 6 are formed on a single semiconductor chip.

FIG. 7 shows one of the tri-state buffer circuits 151a to 151f, 161a to 161f, 152a to 152f, and 162a to 162f. Reference symbol Vcc denotes a power source potential; Vss, a ground potential; D, input data from the instruction pipeline; and E, an activating (enable) control signal from the instruction decoder 171 or 172. Reference numeral 21 denotes a two-input NAND gate; 22, an inverter circuit; 23, a two-input NOR gate; 24, a PMOS transistor; and 25, an NMOS transistor. When the enable control signal E is at "H" level and the input data D is at "H" level, the output from the two-input NAND gate 21 goes to "L" level, the output from the two-input NOR gate 23 goes to "L" level, the PMOS transistor 24 is turned on, the NMOS transistor 25 is turned off, and the buffer output goes to "H" level.

In contrast to this, when the input data D goes to "L" level, the output from the two-input NAND gate 21 goes to "H" level, the output from the two-input NOR gate 23 goes to "H" level, the PMOS transistor 24 is turned off, the NMOS transistor 25 is turned on, and the buffer output goes to "L" level.

On the other hand, when the enable control signal E is at "L" level, the output from the two-input NAND gate 21 goes to "H" level independently of the level of the input data D, the output from the two-input NOR gate 23 goes to "L" level independently of the level of the input data D, and the PMOS transistor 24 and the NMOS transistor 25 are both turned off, thereby setting the buffer output in a high-impedance state.
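
The behavior described above can be summarized by a truth-table model (an illustrative Python sketch, not RTL for the circuit of FIG. 7): the NAND gate drives the PMOS pull-up, the inverter and NOR gate drive the NMOS pull-down, and a low enable leaves both transistors off.

    # Behavioural model of the tri-state buffer of FIG. 7.
    HIGH_Z = "Z"

    def tristate_buffer(d, enable):
        """d and enable are 0 or 1; the return value is 0, 1, or HIGH_Z."""
        nand_out = 0 if (d and enable) else 1        # two-input NAND of D and E
        nor_out = 0 if (d or not enable) else 1      # NOR of D and the inverted E
        pmos_on = (nand_out == 0)                    # PMOS conducts on a low gate
        nmos_on = (nor_out == 1)                     # NMOS conducts on a high gate
        if pmos_on and not nmos_on:
            return 1                                 # output pulled up to Vcc
        if nmos_on and not pmos_on:
            return 0                                 # output pulled down to Vss
        return HIGH_Z                                # both transistors off

    for enable in (1, 0):
        for d in (1, 0):
            print(f"E={enable} D={d} -> {tristate_buffer(d, enable)}")
    # E=1: the output follows D; E=0: the output is high impedance.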

When the instructions 1 to 4 shown in FIG. 2 are to be executed in the above parallel processor, as shown in FIG. 8, the parallel processor is controlled as follows. Instruction decode D, instruction execute E, memory access M, and register write W of the instructions 1 and 2 are sequentially executed in four steps I to IV, and instruction decode D, instruction execute E, memory access M, and register write W of the instructions 3 and 4 are sequentially executed in four steps II to V.

In step I, in response to the instruction 1, register values $1 and $2 are read from the register file 11, and they are stored in the flip-flop circuits 141a and 141b through the two-input selector circuits 131a and 131b, respectively. In response to the instruction 2, register values $4 and $5 are read from the register file 11, and they are stored in the flip-flop circuits 142a and 142b through the two-input selector circuits 132a and 132b, respectively.

In step II, in response to the instruction 1, an addition of $1+$2 is executed by the ALU 121, and the resultant value $3 is stored in the flip-flop circuit 141c. In response to the instruction 2, an addition of $4+$5 is executed by the ALU 122, and the resultant value $6 is stored in the flip-flop circuit 142c. At this time, these resultant values $3 and $6 are not written in the register file 11.

In response to the instruction 3, the operation result $3 of the ALU 121 is stored in the flip-flop circuit 141a as an operand through the tri-state buffer circuit 151a and the two-input selector circuit 131a, and the operation result $6 of the ALU 122 is stored in the flip-flop circuit 141b as an operand through the tri-state buffer circuit 162d and the two-input selector circuit 131b.

In response to the instruction 4, the operation result $3 of the ALU 121 is stored in the flip-flop circuit 142a as an operand through the tri-state buffer circuit 151d and the two-input selector circuit 132a, and the operation result $6 of the ALU 122 is stored in the flip-flop circuit 142b as an operand through the tri-state buffer circuit 162a and the two-input selector circuit 132b.

In step III, in response to the instruction 1, the operation result $3 of the instruction 1 stored in the flip-flop circuit 141c is transferred to the flip-flop circuit 141d. In response to the instruction 2, the operation result $6 stored in the flip-flop circuit 142c is transferred to the flip-flop circuit 142d.

In response to the instruction 3, an OR operation of the operation results $3 and $6 is executed by the ALU 121, and the resultant value $7 is stored in the flip-flop circuit 141c. In response to the instruction 4, an OR operation of the operation results $3 and $6 is executed by the ALU 122, and the resultant value $8 is stored in the flip-flop circuit 142c.
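
The data flow of steps II and III can be traced with a small sketch (Python, with register values assumed purely for illustration; FIG. 2 does not give any): instructions 3 and 4 consume $3 and $6 directly from the ALU outputs, before those values ever reach the register file.

    # Illustrative data flow through the bypass paths (assumed register values).
    regs = {"$1": 3, "$2": 5, "$4": 1, "$5": 8}

    # Step II: both ALUs execute; the results stay in flip-flops 141c and 142c.
    bypass = {
        "$3": regs["$1"] + regs["$2"],   # ALU 121, instruction 1
        "$6": regs["$4"] + regs["$5"],   # ALU 122, instruction 2
    }

    # Step III: instructions 3 and 4 take their operands from the bypass paths.
    result_7 = bypass["$3"] | bypass["$6"]   # instruction 3, ALU 121
    result_8 = bypass["$3"] | bypass["$6"]   # instruction 4, ALU 122
    print(bypass["$3"], bypass["$6"], result_7, result_8)   # 8 9 9 9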

In step IV, the operation result $3 of the instruction 1 stored in the flip-flop circuit 141d is written in the register file 11, and the operation result $6 of the instruction 2 stored in the flip-flop circuit 142d is written in the register file 11.

In response to the instruction 3, the operation result $7 of the instruction 3 stored in the flip-flop circuit 141c is transferred to the flip-flop circuit 141d. In response to the instruction 4, the operation result $8 stored in the flip-flop circuit 142c is transferred to the flip-flop circuit 142d.

In step V, the operation result $7 of the instruction 3 stored in the flip-flop circuit 141d is written in the register file 11, and the operation result $8 of the instruction 4 stored in the flip-flop circuit 142d is written in the register file 11.

Therefore, the time required for writing the operation results of the instructions 3 and 4 in the register file is a total of 5 cycles, which is 3 cycles fewer than the 8 cycles required by the conventional parallel processor shown in FIG. 4.
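
The cycle count can be checked with the same kind of schedule model used above (an illustrative Python sketch, not part of the patent): with the bypass paths in place, the dependent pair issues only one cycle behind the first pair.

    # Illustrative model of the two-issue schedule of FIG. 8 with the bypass:
    # operands are forwarded at the execute stage, so the dependent pair need
    # not wait for the write-back of the first pair.
    STAGES = ["D", "E", "M", "W"]

    def dual_issue_with_bypass(instruction_pairs):
        schedule, start = {}, 1
        for pair in instruction_pairs:
            for insn in pair:
                schedule[insn] = {stage: start + s for s, stage in enumerate(STAGES)}
            start += 1          # the dependent pair issues on the very next cycle
        return schedule

    sched = dual_issue_with_bypass([(1, 2), (3, 4)])
    print(max(stages["W"] for stages in sched.values()))  # 5 cycles, versus 8 without bypass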

According to the above embodiment, the parallel processor comprises a bypass circuit (the tri-state buffer circuits 151a to 151f, 161a to 161f, 152a to 152f, and 162a to 162f) for transferring the data output of each step of the two instruction pipelines between the pipelines.

By virtue of this circuit, before the operation result of an arbitrary step of one instruction pipeline is written in the register file, the result can be transferred to another instruction pipeline and therefore be used as the operand of another operation.

Therefore, even when the number of instructions to be executed is larger than the number that can be executed in parallel, the execution time for the instructions is decreased, and the performance of instruction processing can be remarkably improved.

When the bypass circuit is placed on the same semiconductor chip as that of an instruction execution operating unit, the system arrangement of the parallel processor can be simplified.

As described above, in the superscalar parallel processor according to the present invention, the execution time for instructions can be reduced, and the performance of instruction processing can be improved.

In the pipelines, stages are respectively formed between the ALUs 121 and 122 and the flip-flop circuits 141c and 142c, between the flip-flop circuits 141c and 142c and the flip-flop circuits 141d and 142d, and between the flip-flop circuits 141d and 142d and the write port WP of the register file 11.

Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative devices shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.

Patent Citations
Cited Patent | Filing date | Publication date | Applicant | Title
US4541046 * | Mar 23, 1982 | Sep 10, 1985 | Hitachi, Ltd. | Data processing system including scalar data processor and vector data processor
US4626989 * | Aug 5, 1983 | Dec 2, 1986 | Hitachi, Ltd. | Data processor with parallel-operating operation units
US4639866 * | Jan 11, 1985 | Jan 27, 1987 | International Computers Limited | Pipelined data processing apparatus
US4742454 * | Aug 30, 1983 | May 3, 1988 | Amdahl Corporation | Apparatus for buffer control bypass
US4782441 * | Jun 10, 1986 | Nov 1, 1988 | Hitachi, Ltd. | Vector processor capable of parallely executing instructions and reserving execution status order for restarting interrupted executions
US4851990 * | Feb 9, 1987 | Jul 25, 1989 | Advanced Micro Devices, Inc. | High performance processor interface between a single chip processor and off chip memory means having a dedicated and shared bus structure
US5051940 * | Apr 4, 1990 | Sep 24, 1991 | International Business Machines Corporation | Data dependency collapsing hardware apparatus
US5067069 * | Feb 3, 1989 | Nov 19, 1991 | Digital Equipment Corporation | Control of multiple functional units with parallel operation in a microcoded execution unit
US5123108 * | Sep 11, 1989 | Jun 16, 1992 | Wang Laboratories, Inc. | Improved cpu pipeline having register file bypass and working register bypass on update/access address compare
US5133077 * | Jun 5, 1990 | Jul 21, 1992 | International Business Machines Corporation | Data processor having multiple execution units for processing plural classs of instructions in parallel
US5222240 * | Feb 14, 1990 | Jun 22, 1993 | Intel Corporation | Method and apparatus for delaying writing back the results of instructions to a processor
US5333284 * | Oct 9, 1992 | Jul 26, 1994 | Honeywell, Inc. | Repeated ALU in pipelined processor design
Non-Patent Citations
1. Popescu et al., "The Metaflow Architecture," IEEE Micro, Jun. 1991, pp. 10-13, 63-73.
2. Smith et al., "Implementing Precise Interrupts in Pipelined Processors," IEEE Transactions on Computers, vol. 37, No. 5, May 1988, pp. 562-573.
Referenced by
Citing Patent | Filing date | Publication date | Applicant | Title
US5564056 * | Nov 2, 1994 | Oct 8, 1996 | Intel Corporation | Method and apparatus for zero extension and bit shifting to preserve register parameters in a microprocessor utilizing register renaming
US5592679 * | Nov 14, 1994 | Jan 7, 1997 | Sun Microsystems, Inc. | Apparatus and method for distributed control in a processor architecture
US5615385 * | Nov 29, 1995 | Mar 25, 1997 | Intel Corporation | Method and apparatus for zero extension and bit shifting to preserve register parameters in a microprocessor utilizing register renaming
US5778248 * | Jun 17, 1996 | Jul 7, 1998 | Sun Microsystems, Inc. | Fast microprocessor stage bypass logic enable
US5813037 * | Mar 30, 1995 | Sep 22, 1998 | Intel Corporation | Multi-port register file for a reservation station including a pair of interleaved storage cells with shared write data lines and a capacitance isolation mechanism
US5996065 * | Mar 31, 1997 | Nov 30, 1999 | Intel Corporation | Apparatus for bypassing intermediate results from a pipelined floating point unit to multiple successive instructions
US6035388 * | Jun 27, 1997 | Mar 7, 2000 | Sandcraft, Inc. | Method and apparatus for dual issue of program instructions to symmetric multifunctional execution units
US6112289 * | Aug 28, 1998 | Aug 29, 2000 | Mitsubishi Denki Kabushiki Kaisha | Data processor
US6115730 * | Nov 17, 1997 | Sep 5, 2000 | Via-Cyrix, Inc. | Reloadable floating point unit
US6178492 * | Nov 9, 1995 | Jan 23, 2001 | Mitsubishi Denki Kabushiki Kaisha | Data processor capable of executing two instructions having operand interference at high speed in parallel
US6298367 * | Apr 6, 1998 | Oct 2, 2001 | Advanced Micro Devices, Inc. | Floating point addition pipeline including extreme value, comparison and accumulate functions
US6397239 | Feb 6, 2001 | May 28, 2002 | Advanced Micro Devices, Inc. | Floating point addition pipeline including extreme value, comparison and accumulate functions
US6405307 * | Jun 2, 1998 | Jun 11, 2002 | Intel Corporation | Apparatus and method for detecting and handling self-modifying code conflicts in an instruction fetch pipeline
US6430679 * | Oct 2, 1998 | Aug 6, 2002 | Intel Corporation | Pre-arbitrated bypasssing in a speculative execution microprocessor
US6571268 * | Oct 1, 1999 | May 27, 2003 | Texas Instruments Incorporated | Multiplier accumulator circuits
US6594753 | Mar 6, 2000 | Jul 15, 2003 | Sandcraft, Inc. | Method and apparatus for dual issue of program instructions to symmetric multifunctional execution units
US6615338 | Dec 3, 1998 | Sep 2, 2003 | Sun Microsystems, Inc. | Clustered architecture in a VLIW processor
US6633971 * | Oct 1, 1999 | Oct 14, 2003 | Hitachi, Ltd. | Mechanism for forward data in a processor pipeline using a single pipefile connected to the pipeline
US6735611 * | Dec 21, 2001 | May 11, 2004 | Certicom Corp. | Arithmetic processor
US6839831 * | Dec 8, 2000 | Jan 4, 2005 | Texas Instruments Incorporated | Data processing apparatus with register file bypass
US7093107 * | Dec 29, 2000 | Aug 15, 2006 | Stmicroelectronics, Inc. | Bypass circuitry for use in a pipelined processor
US7114056 | Dec 3, 1998 | Sep 26, 2006 | Sun Microsystems, Inc. | Local and global register partitioning in a VLIW processor
US7117342 | Dec 3, 1998 | Oct 3, 2006 | Sun Microsystems, Inc. | Implicitly derived register specifiers in a processor
US7139899 * | Sep 3, 1999 | Nov 21, 2006 | Cisco Technology, Inc. | Selected register decode values for pipeline stage register addressing
US7424504 | May 4, 2004 | Sep 9, 2008 | Certicom Corp. | Arithmetic processor for accomodating different field sizes
US8672592 * | May 16, 2012 | Mar 18, 2014 | Iscar, Ltd. | Milling collet having pull-out preventer for retaining a fluted milling tool
US20130309035 * | May 16, 2012 | Nov 21, 2013 | Iscar, Ltd. | Milling Collet Having Pull-Out Preventer for Retaining a Fluted Milling Tool
WO2000033176A2 * | Dec 2, 1999 | Jun 8, 2000 | Sun Microsystems Inc | Clustered architecture in a vliw processor
WO2013101114A1 * | Dec 29, 2011 | Jul 4, 2013 | Intel Corporation | Later stage read port reduction
Classifications
U.S. Classification: 712/23, 712/41, 712/218, 712/E09.046, 712/E09.071
International Classification: G06F9/38, G06F7/00, G06F15/16, G06F15/80
Cooperative Classification: G06F9/3824, G06F9/3885
European Classification: G06F9/38T, G06F9/38D
Legal Events
Date | Code | Event | Description
Apr 20, 2007 | FPAY | Fee payment | Year of fee payment: 12
Apr 23, 2003 | FPAY | Fee payment | Year of fee payment: 8
May 3, 1999 | FPAY | Fee payment | Year of fee payment: 4