WO2001050252A1 - Store to load forwarding predictor with untraining - Google Patents

Store to load forwarding predictor with untraining

Info

Publication number
WO2001050252A1
Authority
WO
WIPO (PCT)
Prior art keywords
store
load
memory operation
dependency
execution
Prior art date
Application number
PCT/US2000/021752
Other languages
French (fr)
Inventor
James B. Keller
Thomas S. Green
Wei-Han Lien
Ramsey W. Haddad
Original Assignee
Advanced Micro Devices, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices, Inc. filed Critical Advanced Micro Devices, Inc.
Priority to EP00951015A priority Critical patent/EP1244961B1/en
Priority to DE60009151T priority patent/DE60009151T2/en
Priority to JP2001550545A priority patent/JP4920156B2/en
Publication of WO2001050252A1 publication Critical patent/WO2001050252A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/3834Maintaining memory consistency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/3826Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3854Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856Reordering of instructions, e.g. using queues or age tags

Definitions

  • This invention is related to the field of processors and, more particularly, to store to load forward mechanisms within processors.
  • Processors often include store queues to buffer store memory operations which have been executed but which are still speculative.
  • the store memory operations may be held in the store queue until they are retired. Subsequent to retirement, the store memory operations may be committed to the cache and/or memory.
  • a memory operation is an operation specifying a transfer of data between a processor and a main memory (although the transfer may be completed in cache).
  • Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory.
  • Memory operations may be an implicit part of an instruction which includes a memory operation, or may be explicit load/store instructions. Load memory operations may be more succinctly referred to herein as "loads". Similarly, store memory operations may be more succinctly referred to as "stores".
  • While executing stores speculatively and queueing them in the store queue may allow for increased performance (by removing the stores from the instruction execution pipeline and allowing other, subsequent instructions to execute), subsequent loads may access the memory locations updated by the stores in the store queue.
  • While processor performance is not necessarily directly affected by having stores queued in the store queue, performance may be affected if subsequent loads are delayed due to accessing memory locations updated by stores in the store queue.
  • store queues are designed to forward data stored therein if a load hits the store queue.
  • a store queue entry storing a store memory operation is referred to as being "hit" by a load memory operation if at least one byte updated by the store memory operation is accessed by the load memory operation.
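  • As an illustration of the "hit" definition above, a minimal C sketch follows. The structure and function names are hypothetical, and real hardware compares address ranges and byte enables in parallel rather than calling a function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical store queue entry: the first byte written by the store
 * and the number of bytes written, both byte-granular. */
typedef struct {
    uint64_t addr;
    unsigned size;
} store_entry;

/* A store queue entry is "hit" by a load if at least one byte updated
 * by the store is accessed by the load, i.e. the two ranges overlap. */
static bool load_hits_store(uint64_t load_addr, unsigned load_size,
                            const store_entry *st)
{
    return load_addr < st->addr + st->size &&
           st->addr < load_addr + load_size;
}
```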
  • the younger loads may often have no dependency on the older stores, and thus need not await the execution of the older stores. Since the loads provide operands for execution of dependent instructions, executing the loads allows for still other instructions to be executed. However, merely detecting hits in the store queue as loads are executing may not lead to correct program execution if younger loads are allowed to execute out of order with respect to older stores, since certain older stores may not have executed yet (and thus the store addresses of those stores may not be known and dependencies of the loads on the certain older stores may not be detectable as the loads are executed).
  • detection of cases in which a younger load executes prior to an older store on which that younger load is dependent may be required, and then corrective action may be taken in response to the detection. For example, instructions may be purged and refetched or reexecuted in some other suitable fashion.
  • a load is "dependent" on a store if the store updates at least one byte of memory accessed by the load, is older than the load, and is younger than any other stores updating that byte.
  • improperly executing the load out of order, and the subsequent corrective actions needed to achieve correct execution, may reduce performance.
  • loads, stores, and other instruction operations may be referred to herein as being older or younger than other instruction operations
  • a first instruction is older than a second instruction if the first instruction precedes the second instruction in program order (i.e. the order of the instructions in the program being executed)
  • a first instruction is younger than a second instruction if the first instruction is subsequent to the second instruction in program order
  • the problems outlined above are in large part solved by a processor as described herein
  • the processor generally may schedule and/or execute younger loads ahead of older stores. Additionally, the processor may detect and take corrective action for situations in which an older store interferes with the execution of the younger load
  • the processor employs a store to load forward (STLF) predictor which may indicate, for dispatching loads, a dependency on a store
  • the dependency is indicated for a store which, during a previous execution, interfered with the execution of the load. Since a dependency is indicated on the store, the load is prevented from scheduling and/or executing prior to the store. Performance may be increased due to the decreased interference between loads and stores
  • the STLF predictor is trained with information for a particular load and store in response to executing the load and store and detecting the interference. Additionally, the STLF predictor may be untrained (e.g. information for a particular load and store may be deleted) if a load is indicated by the STLF predictor as dependent upon a particular store and the dependency is not actually detected during execution of the load
  • the STLF predictor records at least a portion of the PC of a store which interferes with the load in a first table indexed by the load PC
  • a second table maintains a corresponding portion of the store PCs of recently dispatched stores, along with tags identifying the recently dispatched stores
  • the PC of a dispatching load is used to select a store PC from the first table
  • the selected store PC is compared to the PCs stored in the second table. If a match is detected, the corresponding tag is read from the second table and used to indicate a dependency for the load, as sketched below
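  • The two-table lookup just described can be modeled by the following C sketch, under assumed table sizes (a 1K-entry first table and a 12-entry second table, consistent with sizes mentioned later in the text); names such as stlf_lookup are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LSDT_ENTRIES 1024        /* first table; 1K entries per the text */
#define STORE_TABLE_ENTRIES 12   /* second table; 8-12 entries per the text */

typedef struct { bool valid; uint32_t store_pc; } lsdt_entry;
typedef struct { bool valid; uint32_t store_pc; uint32_t r_num; } store_pc_entry;

static lsdt_entry lsdt[LSDT_ENTRIES];                    /* indexed by load PC */
static store_pc_entry store_table[STORE_TABLE_ENTRIES];  /* recently dispatched stores */

/* On dispatch of a load: select a store PC from the first table using
 * the load PC, then search the second table for that store PC. On a
 * match, return the store's tag so a dependency can be recorded. */
static bool stlf_lookup(uint32_t load_pc, uint32_t *store_rnum)
{
    const lsdt_entry *e = &lsdt[load_pc % LSDT_ENTRIES]; /* low-order PC bits */
    if (!e->valid)
        return false;
    for (int i = 0; i < STORE_TABLE_ENTRIES; i++) {
        if (store_table[i].valid && store_table[i].store_pc == e->store_pc) {
            *store_rnum = store_table[i].r_num;  /* tag of the matching store */
            return true;
        }
    }
    return false;
}
```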
  • the STLF predictor records a difference between the tags assigned to a load and a store which interferes with the load in a first table indexed by the load PC
  • the PC of the dispatching load is used to select a difference from the table, and the difference is added to the tag assigned to the load. Accordingly, a tag of the store may be generated and a dependency of the load on the store may be indicated, as in the sketch below
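  • A corresponding sketch of this second scheme follows; the delta is assumed to be stored as the (modular) difference between the store and load tags at training time, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LSDT_ENTRIES 1024  /* 1K entries per the text; indexed by load PC */

/* Each entry holds the difference between the tags (R#s) of the load
 * and the interfering store, captured at training time. */
typedef struct { bool valid; uint32_t delta_r; } lsdt_delta_entry;

static lsdt_delta_entry lsdt_delta[LSDT_ENTRIES];

/* On dispatch of a load with tag load_rnum: add the stored delta to
 * the load's tag to regenerate the store's tag for this execution.
 * Tag arithmetic is assumed to be modular, as R#s wrap in hardware. */
static bool stlf_lookup_delta(uint32_t load_pc, uint32_t load_rnum,
                              uint32_t *store_rnum)
{
    const lsdt_delta_entry *e = &lsdt_delta[load_pc % LSDT_ENTRIES];
    if (!e->valid)
        return false;
    *store_rnum = load_rnum + e->delta_r;  /* adder circuit */
    return true;
}
```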
  • a processor is contemplated comprising an STLF predictor and an execution pipeline coupled to the STLF predictor
  • the STLF predictor is coupled to receive an indication of dispatch of a first load memory operation, and is configured to indicate a dependency of the first load memory operation on a first store memory operation responsive to information stored within the STLF predictor indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation
  • the execution pipeline is configured to inhibit execution of the first load memory operation prior to the first store memory operation responsive to the dependency
  • the execution pipeline is configured to detect a lack of the dependency during execution of the first load memory operation
  • the execution pipeline is configured to generate an untrain signal responsive to the lack of dependency. Coupled to receive the untrain signal, the STLF predictor is configured to update the information stored therein to not indicate that the first store memory operation interfered with the first load memory operation during the previous execution.
  • a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable
  • a method is contemplated. A dependency of a first load memory operation on a first store memory operation is indicated responsive to information indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation. Scheduling of the first load memory operation is inhibited prior to scheduling the first store memory operation.
  • a lack of the dependency is detected during execution of the first load memory operation
  • the information indicating that, during the previous execution, the first store memory operation interfered with the first load memory operation is updated to not indicate that, during the previous execution, the first store memory operation interfered with the first load memory operation.
  • the updating is performed responsive to the detecting of the lack of dependency.
  • FIG. 1 is a block diagram of one embodiment of a processor.
  • Fig. 2 is a pipeline diagram of an exemplary pipeline which may be employed in the processor shown in Fig. 1.
  • Fig. 3 is a block diagram illustrating one embodiment of a map unit, a scheduler, an AGU/TLB, and a load/store unit in greater detail.
  • Fig. 4 is a block diagram of one embodiment of a store to load forward (STLF) predictor shown in Fig. 3.
  • Fig. 5 is a block diagram of a second embodiment of an STLF predictor shown in Fig. 3.
  • Fig. 6 is a flowchart illustrating training and untraining of loads in one embodiment of an STLF predictor shown in Figs. 4 or 5.
  • Fig. 7 is a block diagram illustrating one embodiment of a control circuit which may be employed in an STLF predictor shown in Figs. 4 or 5.
  • Fig. 8 is a block diagram of one embodiment of a dependency unit shown in Fig. 3.
  • Fig. 9 is a block diagram of one embodiment of a computer system including the processor shown in Fig. 1.
  • Fig. 10 is a block diagram of a second embodiment of a computer system including the processor shown in Fig. 1.
  • processor 10 includes a line predictor 12, an instruction cache (I-cache) 14, an alignment unit 16, a branch prediction/fetch PC generation unit 18, a plurality of decode units 24A-24D, a predictor miss decode unit 26, a microcode unit 28, a map unit 30, a retire queue 32, an architectural renames file 34, a future file 20, a scheduler 36, an integer register file 38A, a floating point register file 38B, an integer execution core 40A, a floating point execution core 40B, a load/store unit 42, a data cache (D-cache) 44, an external interface unit 46, and a PC silo 48. Line predictor 12 is coupled to predictor miss decode unit 26, branch prediction/fetch PC generation unit 18, PC silo 48, and alignment unit 16. Line predictor 12 may also be coupled to I-cache 14.
  • processor 10 employs a variable byte length, complex instruction set computing (CISC) instruction set architecture
  • processor 10 may employ the x86 instruction set architecture (also referred to as IA-32)
  • Other embodiments may employ other instruction set architectures including fixed length instruction set architectures and reduced instruction set computing (RISC) instruction set architectures
  • Branch prediction/fetch PC generation unit 18 is configured to provide a fetch address (fetch PC) to I-cache 14, line predictor 12, and PC silo 48
  • Branch prediction/fetch PC generation unit 18 may include a suitable branch prediction mechanism used to aid in the generation of fetch addresses
  • line predictor 12 provides alignment information corresponding to a plurality of instructions to alignment unit 16, and may provide a next fetch address for fetching instructions subsequent to the instructions identified by the provided instruction information
  • the next fetch address may be provided to branch prediction/fetch PC generation unit 18 or may be directly provided to I-cache 14, as desired
  • Branch prediction/fetch PC generation unit 18 may receive a trap address from PC silo 48 (if a trap is detected) and the trap address may comprise the fetch PC generated by branch prediction/fetch PC generation unit 18. Otherwise, the fetch PC may be generated using the branch prediction information and information from line predictor 12
  • line predictor 12 stores information corresponding to instructions previously speculatively fetched by processor 10
  • I-cache 14 is a high speed cache memory for storing instruction bytes. According to one embodiment, I-cache 14 may comprise, for example, a 128 Kbyte, four way set associative organization employing 64 byte cache lines. However, any I-cache structure may be suitable (including direct-mapped structures)
  • Alignment unit 16 receives the instruction alignment information from line predictor 12 and instruction bytes corresponding to the fetch address from I-cache 14. Alignment unit 16 selects instruction bytes into each of decode units 24A-24D according to the provided instruction alignment information. More particularly, line predictor 12 provides an instruction pointer corresponding to each decode unit 24A-24D. The instruction pointer locates an instruction within the fetched instruction bytes for conveyance to the corresponding decode unit 24A-24D. In one embodiment, certain instructions may be conveyed to more than one decode unit 24A-24D. Accordingly, in the embodiment shown, a line of instructions from line predictor 12 may include up to 4 instructions, although other embodiments may include more or fewer decode units 24 to provide for more or fewer instructions within a line. Decode units 24A-24D decode the instructions provided thereto, and each decode unit 24A-24D generates information identifying the instruction operations (ROPs) corresponding to the instructions
  • Upon detecting a miss in line predictor 12, alignment unit 16 routes the corresponding instruction bytes from I-cache 14 to predictor miss decode unit 26. Predictor miss decode unit 26 decodes the instruction, enforcing any limits on a line of instructions for which processor 10 is designed (e.g. maximum number of instruction operations, maximum number of instructions, terminate on branch instructions, etc.). Upon terminating a line, predictor miss decode unit 26 provides the information to line predictor 12 for storage. It is noted that predictor miss decode unit 26 may be configured to dispatch instructions as they are decoded. Alternatively, predictor miss decode unit 26 may decode the line of instruction information and provide it to line predictor 12 for storage. Subsequently, the missing fetch address may be reattempted in line predictor 12 and a hit may be detected. In addition to decoding instructions upon a miss in line predictor 12, predictor miss decode unit 26 may be configured to decode instructions if
  • Map unit 30 is configured to perform register renaming by assigning physical register numbers (PR#s) to each destination register operand and source register operand of each instruction operation
  • the physical register numbers identify registers within register files 38A-38B
  • Map unit 30 additionally provides an indication of the dependencies for each instruction operation by providing R#s of the instruction operations which update each physical register number assigned to a source operand of the instruction operation
  • Map unit 30 updates future file 20 with the physical register numbers assigned to each destination register (and the R# of the corresponding instruction operation) based on the corresponding logical register number
  • map unit 30 stores the logical register numbers of the destination registers, assigned physical register numbers, and the previously assigned physical register numbers in retire queue 32. As instructions are retired (indicated to map unit 30 by scheduler 36), retire queue 32 updates architectural renames file 34 and frees any registers which are no longer in use. Accordingly, the physical register numbers in
  • the line of instruction operations, source physical register numbers, and destination physical register numbers are stored into scheduler 36 according to the R#s assigned by map unit 30. Furthermore, dependencies for a particular instruction operation may be noted as dependencies on other instruction operations which are stored in the scheduler. In one embodiment, instruction operations remain in scheduler 36 until retired. Scheduler 36 stores each instruction operation until the dependencies noted for that instruction operation have been satisfied. In response to scheduling a particular instruction operation for execution, scheduler 36 may determine at which clock cycle that particular instruction operation will update register files 38A-38B. Different execution units within execution cores 40A-40B may employ different numbers of pipeline stages (and hence different latencies). Furthermore, certain instructions may experience more latency within a pipeline than others. Accordingly, a countdown is generated which measures the latency for the particular instruction operation (in numbers of clock cycles).
  • Scheduler 36 awaits the specified number of clock cycles (until the update will occur prior to or coincident with the dependent instruction operations reading the register file), and then indicates that instruction operations dependent upon that particular instruction operation may be scheduled. It is noted that scheduler 36 may schedule an instruction once its dependencies have been satisfied (i.e. out of order with respect to its order within the scheduler queue)
  • Integer and load/store instruction operations read source operands according to the source physical register numbers from register file 38A and are conveyed to execution core 40A for execution.
  • Execution core 40A executes the instruction operation and updates the physical register assigned to the destination within register file 38A. Additionally, execution core 40A reports the R# of the instruction operation and exception information regarding the instruction operation (if any) to scheduler 36.
  • Register file 38B and execution core 40B may operate in a similar fashion with respect to floating point instruction operations (and may provide store data for floating point stores to load/store unit 42).
  • execution core 40A may include, for example, two integer units, a branch unit, and two address generation units (with corresponding translation lookaside buffers, or TLBs)
  • Execution core 40B may include a floating point/multimedia multiplier, a floating point/multimedia adder, and a store data unit for delivering store data to load/store unit 42.
  • Other configurations of execution units are possible
  • Load/store unit 42 provides an interface to D-cache 44 for performing memory operations and for scheduling fill operations for memory operations which miss D-cache 44
  • Load memory operations may be completed by execution core 40A performing an address generation and forwarding data to register files 38A-38B (from D-cache 44 or a store queue within load/store unit 42).
  • Store addresses may be presented to D-cache 44 upon generation thereof by execution core 40A (directly via connections between execution core 40A and D-cache 44).
  • the store addresses are allocated a store queue entry.
  • the store data may be provided concurrently, or may be provided subsequently, according to design choice.
  • load/store unit 42 may include a load/store buffer for storing load/store addresses which miss D-cache 44 for subsequent cache fills (via external interface unit 46) and re-attempting the missing load/store operations
  • Load/store unit 42 is further configured to handle load/store memory dependencies.
  • D-cache 44 is a high speed cache memory for storing data accessed by processor 10. While D-cache 44 may comprise any suitable structure (including direct mapped and set-associative structures), one embodiment of D-cache 44 may comprise a 128 Kbyte, 2 way set associative cache having 64 byte lines
  • External interface unit 46 is configured to communicate to other devices via external interface 52. Any suitable external interface 52 may be used, including interfaces to L2 caches and an external bus or buses for connecting processor 10 to other devices. External interface unit 46 fetches fills for I-cache 14 and D-cache 44, as well as writing discarded updated cache lines from D-cache 44 to the external interface. Furthermore, external interface unit 46 may perform non-cacheable reads and writes generated by processor 10 as well
  • Fig. 2 is a pipeline diagram illustrating an exemplary set of pipeline stages which may be employed by one embodiment of processor 10
  • Other embodiments may employ different pipelines, pipelines including more or fewer pipeline stages than the pipeline shown in Fig. 2.
  • the stages shown in Fig. 2 are delimited by vertical lines. Each stage is one clock cycle of a clock signal used to clock storage elements (e.g. registers, latches, flops, and the like) within processor 10
  • the exemplary pipeline includes a CAM0 stage, a CAM1 stage, a line predictor (LP) stage, an instruction cache (IC) stage, an alignment (AL) stage, a decode (DEC) stage, a map1 (M1) stage, a map2 (M2) stage, a write scheduler (WR SC) stage, a read scheduler (RD SC) stage, a register file read (RF RD) stage, an execute (EX) stage, a register file write (RF WR) stage, and a retire (RET) stage
  • Some instructions utilize multiple clock cycles in the execute stage. For example, memory operations, floating point operations, and integer multiply operations are illustrated in exploded form in Fig. 2
  • Memory operations include an address generation (AGU) stage, a translation (TLB) stage, a data cache 1 (DC1) stage, and a data cache 2 (DC2) stage
  • floating point operations include up to four floating point execute (FEX1-FEX4) stages, and integer multiply operations similarly include up to four execute stages
  • line predictor 12 compares the fetch address provided by branch prediction/fetch PC generation unit 18 to the addresses of lines stored therein. Additionally, the fetch address is translated from a virtual address (e.g. a linear address in the x86 architecture) to a physical address during the CAM0 and CAM1 stages. In response to detecting a hit during the CAM0 and CAM1 stages, the corresponding line information is read from the line predictor during the line predictor stage. Also, I-cache 14 initiates a read (using the physical address) during the line predictor stage. The read completes during the instruction cache stage
  • line predictor 12 provides a next fetch address for I-cache 14 and a next entry in line predictor 12 for a hit, and therefore the CAM0 and CAM1 stages may be skipped for fetches resulting from a previous hit in line predictor 12
  • Instruction bytes provided by I-cache 14 are aligned to decode units 24A-24D by alignment unit 16 during the alignment stage in response to the corresponding line information from line predictor 12
  • Decode units 24A-24D decode the provided instructions, identifying ROPs corresponding to the instructions as well as operand information during the decode stage. Map unit 30 generates ROPs from the provided information during the map1 stage, and performs register renaming (updating future file 20). During the map2 stage, the ROPs and assigned renames are recorded in retire queue 32. Furthermore, the ROPs upon which each ROP is dependent are determined. Each ROP may be register dependent upon earlier ROPs as recorded in the future file, and may also exhibit other types of dependencies (e.g. dependencies on a previous serializing instruction, etc.)
  • the generated ROPs are written into scheduler 36 during the write scheduler stage. Up until this stage, the ROPs located by a particular line of information flow through the pipeline as a unit. However, subsequent to being written into scheduler 36, the ROPs may flow independently through the remaining stages, at different times. Generally, a particular ROP remains at this stage until selected for execution by scheduler 36 (e.g. after the ROPs upon which the particular ROP is dependent have been selected for execution, as described above). Accordingly, a particular ROP may experience one or more clock cycles of delay between the write scheduler stage and the read scheduler stage. During the read scheduler stage, the particular ROP participates in the selection logic within scheduler 36, is selected for execution, and is read from scheduler 36. The particular ROP then proceeds to read register file operands from one of register files 38A-38B (depending upon the type of ROP) in the register file read stage
  • ROPs are provided to the corresponding execution core 40A or 40B, and the instruction operation is performed on the operands during the execution stage
  • some ROPs have several pipeline stages of execution
  • memory instruction operations (e.g. loads and stores)
  • an address generation stage in which the data address of the memory location accessed by the memory instruction operation is generated
  • a translation stage in which the virtual data address provided by the address generation stage is translated
  • Floating point operations may employ up to 4 clock cycles of execution, and integer multiplies may similarly employ up to 4 clock cycles of execution
  • Upon completing the execution stage or stages, the particular ROP updates its assigned physical register during the register file write stage. Finally, the particular ROP is retired after each previous ROP is retired (in the retire stage). Again, one or more clock cycles may elapse for a particular ROP between the register file write stage and the retire stage. Furthermore, a particular ROP may be stalled at any stage due to pipeline stall conditions, as is well known in the art
  • Fig. 3 is a block diagram illustrating one embodiment of map unit 30, scheduler 36, an address generation unit/translation lookaside buffer (AGU/TLB) 40AA, and load/store unit 42 in greater detail
  • scheduler 36, AGU/TLB 40AA, and load/store unit 42 are collectively referred to as execution pipeline 72
  • Map unit 30 includes a store to load forward (STLF) predictor 60, a dependency unit 62, and an R# assign unit 64 (which assigns R#s to instruction operations)
  • Scheduler 36 includes a scheduler buffer 66 and a physical address (PA) buffer 70
  • Load/store unit 42 includes a store queue 68
  • Map unit 30 is coupled to receive instruction operations and corresponding program counter addresses (PCs) from decode units 24, a retire signal from scheduler 36, and a train/untrain interface (including train/untrain (T/UT) signals, a load PC, and a store ID) from execution pipeline 72
  • STLF predictor 60 determines if it has any information indicating that, during a previous execution, a store memory operation interfered with the load memory operation. If a store memory operation did interfere, STLF predictor 60 provides an indication of that store memory operation to dependency unit 62. Dependency unit 62 indicates a dependency for the load memory operation on that store memory operation (in addition to any dependencies for address operands, etc.), and thus the load memory operation does not get scheduled prior to the store memory operation. Accordingly, during the current execution of the load memory operation, the store memory operation may not interfere. On the other hand, if no information regarding interference from a store memory operation is recorded by STLF predictor 60 for a particular load memory operation, STLF predictor 60 does not indicate a dependency to dependency unit 62. The particular load memory operation may receive dependencies for source register operands but not for any store memory operations
  • a store memory operation "interferes" with a load memory operation if the store memory operation causes additional clock cycles to be added to the execution of the load memory operation
  • the additional clock cycles may be added in the form of pipeline stalls or may be added via reexecution of the load memory operation
  • the remainder of this disclosure will focus on an embodiment in which a store memory operation interferes with a load memory operation if the store memory operation is older than the load memory operation, the load memory operation has a dependency on the store memory operation, and the load memory operation is scheduled and/or executed prior to the store memory operation
  • Other embodiments are contemplated
  • an embodiment is contemplated in which load memory operations are not scheduled prior to the address generation of a store memory operation, but which may be scheduled prior to the store data being provided
  • the store may interfere with the load if there is a dependency and the store data is not available when the load memory operation executes (see the sketch below)
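  • The interference condition of the focused embodiment can be stated compactly in C; the mem_op fields below are hypothetical stand-ins for the tag and scheduling state tracked by the hardware.

```c
#include <stdbool.h>

/* Fields are assumed for illustration: a smaller r_num means older in
 * program order, and sched_time records when the operation was
 * scheduled. */
typedef struct {
    unsigned r_num;
    unsigned sched_time;
} mem_op;

/* A store interferes with a load if the store is older, the load
 * depends on the store, and the load was scheduled and/or executed
 * prior to the store. */
static bool store_interferes(const mem_op *store, const mem_op *load,
                             bool load_depends_on_store)
{
    return store->r_num < load->r_num &&
           load_depends_on_store &&
           load->sched_time < store->sched_time;
}
```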
  • Map unit 30 passes the instruction operations, PCs, and dependencies to scheduler 36, which writes the instruction operations into scheduler buffer 66
  • Scheduler buffer 66 includes multiple entries, each entry capable of storing information regarding one instruction operation
  • An exemplary memory operation entry is illustrated in scheduler buffer 66, including a valid bit and a type field (identifying the entry as storing a memory operation and which type of memory operation is stored, either load or store)
  • the PC of the load memory operation (LPC)
  • Additional information may be stored as well to aid in training STLF predictor 60 with information regarding a store memory operation which interferes with the load memory operation
  • a store ID field (SID)
  • a retry indication (R)
  • scheduler 36 may schedule a memory operation for execution once each of its recorded dependencies is satisfied; younger loads may be scheduled prior to older stores if STLF predictor 60 does not indicate a dependency of the younger load on the older store.
  • Map unit 30 may detect each source register operand dependency, but may not be capable of detecting all load dependencies on earlier stores.
  • the dependency of a load on a store is based on the memory addresses affected by the load and store, respectively, generated from source operands of the load and store during execution of the load and store.
  • STLF predictor 60 detects certain dependencies of loads on stores (as described herein), but others may not be detected. Accordingly, processor 10 employs PA buffer 70 as described below to detect cases in which a younger load scheduled prior to an older store is dependent on the older store.
  • AGU/TLB 40AA receives the memory operation and operands (read from register file 38A in response to PR#s from scheduler 36). AGU/TLB 40AA adds the operands to produce a virtual address, and translates the virtual address to a physical address using translations cached in the TLB. AGU/TLB 40AA provides the physical address and other control information to store queue 68. Store data is also provided, if the operation is a store. Among the control information provided by AGU/TLB 40AA may be the load or store nature of the operation. The physical address and other control information is also provided by AGU/TLB 40AA to D-cache 44 and to PA buffer 70.
  • PA buffer 70 is used in the present embodiment to detect stores which interfere with loads.
  • PA buffer 70 includes multiple entries, one entry for each entry in scheduler buffer 66. Each entry is capable of storing physical address information.
  • the physical address provided to PA buffer 70 is stored into an entry corresponding to the scheduler buffer entry storing the load.
  • the physical address is compared to the physical addresses stored in PA buffer 70. If a match is found, and the corresponding instruction operation is a load which is younger than the store, then the load is retried.
  • a memory operation is referred to herein as "retried” if the operation's state within scheduler 36 is reset to a not executed state. Retrying the memory operation subsequently leads to the memory operation being rescheduled and reexecuted.
  • the retry indication in the corresponding scheduler buffer entry is set.
  • the store ID used by STLF predictor 60 to identify the store is stored in the scheduler buffer entry's SID field.
  • the store ID may be the store PC.
  • the store ID may be the R# of the store or the difference between the R# of the store and the R# of the load (the delta R#). Embodiments using each store ID are described in more detail below.
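  • Putting the above together, a C sketch of the PA buffer check performed when a store executes follows. It assumes one entry per scheduler entry and hypothetical hook functions for the scheduler-state updates; the hardware performs the comparison associatively rather than with a loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCHED_ENTRIES 64  /* one PA buffer entry per scheduler entry; size illustrative */

typedef struct {
    bool load_executed;  /* entry holds a load which has executed */
    uint64_t phys_addr;  /* physical address accessed by the load */
    unsigned r_num;      /* tag of the load (program order) */
} pa_entry;

static pa_entry pa_buffer[SCHED_ENTRIES];

/* Hypothetical hooks into scheduler state; stand-ins for hardware. */
static void retry_load(int idx) { pa_buffer[idx].load_executed = false; }
static void record_retry(int idx, unsigned store_rnum) { (void)idx; (void)store_rnum; }

/* When a store executes, its physical address is compared against the
 * PA buffer; any younger load which already executed and matches is
 * retried, and its R bit and SID field are recorded for training. */
static void store_executed(uint64_t store_pa, unsigned store_rnum)
{
    for (int i = 0; i < SCHED_ENTRIES; i++) {
        if (pa_buffer[i].load_executed &&
            pa_buffer[i].phys_addr == store_pa &&
            pa_buffer[i].r_num > store_rnum) {  /* load is younger than the store */
            retry_load(i);                /* reset to not-executed state */
            record_retry(i, store_rnum);  /* set R, store SID for later training */
        }
    }
}
```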
  • the retry indication being set causes execution pipeline 72 to train the load and the corresponding store into STLF predictor 60 using the train/untrain interface (so that subsequent executions may avoid the retry of the load by making the load dependent on the store). More particularly, a train signal within the interface may be asserted, and the load PC and the store ID from the corresponding scheduler entry are provided to STLF predictor 60 as well. It is noted that the training may occur from any stage of the execution pipeline 72, according to design choice.
  • execution pipeline 72 may detect these situations as well and use the train/untrain interface to untrain the load and corresponding store from STLF predictor 60. More particularly, if a load is scheduled and its train indication in scheduler buffer 66 is set, execution pipeline 72 determines if the load receives forwarded data from store queue 68. If no forwarding occurs, then a dependency on a store may not have been warranted for the load. Accordingly, execution pipeline 72 may assert an untrain signal and provide the load PC to STLF predictor 60. STLF predictor 60 may untrain the information corresponding to the load
  • training refers to storing information which identifies the occurrence of a store which interferes with a load, and may include updating information which indicates the likelihood of the interference recurring (e.g. if the situation has occurred repeatedly in the past, it may be more likely to occur again). Thus, training may include creating a stronger correlation between the load and the store.
  • untraining refers to deleting information which identifies the occurrence of a store interfering with a load, and may include creating a weaker correlation between the load and the store prior to deleting the information. It is noted that the training and untraining of STLF predictor 60 may occur from any pipeline stage, and training may be performed at a different stage than untraining. For example, in the present embodiment, training is performed in response to the retry of the load
  • other configurations of AGUs and TLBs are possible
  • a load AGU and a separate store AGU are contemplated
  • the store AGU may be coupled to a write port on store queue 68
  • the load AGU may be coupled to a compare port on store queue 68
  • Other embodiments may include any number of AGUs for loads, stores, or loads and stores, as desired
  • map unit 30 may perform register renaming, as described above with respect to Fig. 1
  • STLF predictor 60 operates during the map2 stage of the pipeline shown in Fig. 2 in terms of indicating dependencies for loads on earlier stores
  • STLF predictor 60 may operate at any pipeline stage prior to the selection of the load for execution, according to various embodiments
  • the above description describes training during the reexecution of the load
  • alternative embodiments may perform the training at different times
  • an alternative embodiment may train in response to detecting the retry situation (e.g. during execution of the store upon which the load is dependent)
  • PC is used to refer to the program counter address of an instruction
  • the PC is the address of the instruction in memory
  • the PC is the address used to fetch the instruction from memory
  • the PC of the instruction is also the PC of each of the instruction operations (e.g. load and store memory operations)
  • R# is used in certain embodiments described above and below to identify instruction operations. Generally, any suitable tag may be used
  • the R# identifies relative program order of instruction operations, and may identify the entry in scheduler buffer 66 assigned to the instruction operations. Other embodiments may employ reorder buffer tags or any other tag to identify the instruction operations
  • R#s or tags may be assigned at any point in the pipeline of processor 10 prior to or coincident with operation of STLF predictor 60
  • STLF predictor 60a includes a load/store dependency table 80, a store PC/R# table 82, a load/store dependency table (LSDT) control circuit 84, a ST/LD dependency circuit 86, a store table control circuit 88, an intraline dependency check circuit 90, and a multiplexor (mux) 92
  • Load/store dependency table 80 is coupled to receive the PCs of dispatching instruction operations from decode units 24, and is coupled to LSDT control circuit 84. Additionally, load/store dependency table 80 is coupled to receive a load PC and store PC from execution pipeline 72 for training. Load/store dependency table 80 is coupled to provide store PCs to intraline dependency check circuit 90 and store PC/R# table 82, and valid indications to ST/LD dependency circuit 86. Intraline dependency check circuit 90 is further coupled to ST/LD dependency circuit 86 and mux 92
  • load/store dependency table 80 is indexed by a load PC to select one of multiple entries
  • the entry stores a valid indication and a store PC (SPC in Fig. 4) of a store which may have interfered with that load during a prior execution
  • the store PC/R# table includes multiple entries which store the store PC of recently dispatched stores, along with the corresponding R# for that store. If the store PC from the entry selected in load/store dependency table 80 hits in store PC/R# table 82, a dependency of the load on the store is noted for the load. In this manner, the load is prevented from scheduling (and thus executing) ahead of the store. Accordingly, the interference may be avoided during the present execution. More particularly, as instruction operations are dispatched, the PCs of the instruction operations are used to index into load/store dependency table 80. The remainder of this discussion will focus on the response of STLF predictor 60a to one input PC corresponding to one dispatching instruction operation, unless otherwise noted
  • store table control circuit 88 signals ST/LD dependency circuit 86 with an indication of whether or not a hit was detected in store PC/R# table 82 for that instruction operation
  • ST/LD dependency circuit 86 provides a dependency valid signal to dependency unit 62
  • the dependency valid signal, if asserted, indicates that dependency unit 62 is to record a dependency for the instruction operation on the store identified by the store R# provided by mux 92. If the dependency valid signal is deasserted, the signal indicates that no dependency is to be recorded by dependency unit 62
  • ST/LD dependency circuit 86 may assert the dependency valid signal if: (i) the instruction operation is a load (determined from the load/store indications from decode units 24); (ii) the valid indication from the indexed entry of load/store dependency table 80 indicates valid; and (iii) the store PC from the indexed entry hits in store PC/R# table 82
  • intraline dependency check circuit 90 compares the store PC output from load/store dependency table 80 to the PCs of each concurrently dispatched instruction operation which is prior to the given instruction operation in program order. If the prior instruction operation's PC matches the store PC from load/store dependency table 80 and the prior instruction operation is a store (indicated by the load/store indications provided by decode units 24), intraline dependency check circuit 90 may: (i) indicate a hit to ST/LD dependency circuit 86 for the corresponding load; and (ii) control mux 92 to override the store R# provided by store PC/R# table 82 with the R# of the instruction operation upon which the hit is detected. In this manner, the store R# output to dependency unit 62 identifies the concurrently dispatched store
  • ST/LD dependency circuit 86 may assert the dependency valid signal for the load if: (i) the instruction operation is a load (determined from the load/store indications from decode units 24); (ii) if the valid indication from the indexed entry of load/store dependency table 80 indicates valid; and (iii) if the hit signal from intraline dependency check circuit 90 for the load is asserted.
  • ST/LD dependency circuit 86 may further assert the depend all signal for the instruction operation.
  • the depend all signal if asserted, indicates to dependency unit 62 to record dependencies for the instruction operation on each outstanding (dispatched and not retired) store.
  • the depend all signal is used to handle a situation in which a particular entry is repeatedly trained with store PCs of stores which interfere with loads. Since load/store dependency table 80 selects an entry in response to a PC of an instruction operation and the entry stores one store PC, loads for which different stores interfere on different executions may still be interfered with even though STLF predictor 60a indicates a dependency on a store. To better handle such cases, the valid indication in load/store dependency table 80 may be a bit vector.
  • during training, a bit in the bit vector may be placed in the valid state (e.g. set or clear, depending upon design choice). If each of the bits is in the valid state, the entry may be repeatedly being trained because the load is being interfered with by different stores during various executions. Accordingly, the depend all signal may be asserted if: (i) each bit in the bit vector is in the valid state; and (ii) the instruction operation is a load.
  • the bit vector, and the placing of bits in the valid or invalid state, is described in more detail below; a brief sketch follows.
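  • A minimal C sketch of the bit-vector form, assuming a 4-bit vector and hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Each LSDT entry carries a 4-bit valid vector in the low bits; each
 * training event places one bit in the valid state. */
typedef struct { uint8_t valid_vec; } lsdt_bv_entry;

/* The entry is valid if any bit is in the valid state. */
static bool entry_valid(const lsdt_bv_entry *e)
{
    return (e->valid_vec & 0xF) != 0;
}

/* Assert depend all when the op is a load and every bit is valid,
 * i.e. the entry has been trained repeatedly, likely by different
 * interfering stores on different executions. */
static bool assert_depend_all(const lsdt_bv_entry *e, bool is_load)
{
    return is_load && (e->valid_vec & 0xF) == 0xF;
}
```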
  • STLF predictor 60a may indicate dependencies for loads on stores which may have interfered with the loads on prior executions. Additionally, STLF predictor 60a may be trained with the information on the loads and stores.
  • Store PC/R# table 82 stores the store PCs and R#s of the most recently dispatched stores. Thus, store table control circuit 88 may allocate entries in store PC/R# table 82 to stores which are being dispatched.
  • Store table control circuit 88 receives the load/store indications for each instruction operation from decode units 24 and allocates entries to each dispatching store. The allocated entries are updated with the store PC (received from decode units 24) and the corresponding R# (received from R# assign unit 64).
  • store table control circuit 88 may operate store PC/R# table 82 as a first-in, first-out (FIFO) buffer of the most recently dispatched stores.
  • Load/store dependency table 80 is trained in response to the train/untrain interface from execution pipeline 72. More particularly, if the train signal is asserted by execution pipeline 72, LSDT control circuit 84 causes load/store dependency table 80 to be updated. Execution pipeline 72 provides the PC of the load to be trained (LPC in Fig. 4) and the corresponding store PC which interferes with the load as input to load/store dependency table 80. Load/store dependency table 80 updates the entry indexed by the load PC with the store PC, and LSDT control circuit 84 places the valid indication into a valid state. In one embodiment, the valid indication may be a bit, and the valid state may be set (or clear) and the invalid state clear (or set). In another embodiment as described above, the valid indication may be a bit vector. In such an embodiment, LSDT control circuit 84 may select a bit within the bit vector and place that bit in the valid state during training
  • LSDT control circuit 84 may untrain an entry in response to the assertion of the untrain signal by execution pipeline 72
  • execution pipeline 72 may provide the load PC of the load to be untrained, but the store PC may be a don't care in the untraining case. Load/store dependency table 80 indexes the entry indicated by the load PC, and LSDT control circuit 84 causes the valid indication in the indexed entry to be placed in the invalid state
  • in embodiments employing a valid bit as the valid indication, the bit may be cleared (or set) to indicate invalid
  • in embodiments employing a bit vector, a selected bit may be placed in the invalid state. The entry may still remain valid in the bit vector case if other bits remain in the valid state
  • multiple untrain events may eventually cause each of the other bits to become invalid as well
  • the portion of the PC used to index load/store dependency table 80 may be determined by the number of entries employed within the table
  • load/store dependency table 80 may be 1K entries and thus 10 bits of the PC may be used as an index (e.g. the least significant 10 bits)
  • the number of entries may generally be selected as a design choice based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software
  • the number of bits used for the store PCs stored in load/store dependency table 80 and store PC/R# table 82 may differ from the number of bits used in the index, and again may be selected as a design choice based, in part, on the area occupied by the tables versus the accuracy of the tables in general for the loads
  • the number of entries in store PC/R# table 82 may be a matter of design choice as well, based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software. In one particular implementation, 8-12 entries may be used
  • the PCs and R#s input to STLF predictor 60a may be muxed in response to the load/store indications from decode units 24, such that only the PCs of loads are input to load/store dependency table 80 and only the PCs of stores are input to store PC/R# table 82 for storage
  • predictor miss decode unit 26 may terminate a line of instruction operations once the load and/or store limit is reached
  • each entry in load/store dependency table 80 may provide storage for multiple store PCs and corresponding valid bits. Each store PC from a selected entry may be compared to store PC/R# table 82, and a dependency may be recorded for the load on each store which is a hit in store PC/R# table 82
  • STLF predictor 60b includes a load/store dependency table 100, an adder circuit 102, a load/store dependency table (LSDT) control circuit 104, a ST/LD dependency circuit 106, and an optional store validation circuit 108.
  • Load/store dependency table 100 is coupled to receive PCs of dispatching instruction operations from decode units 24, and is further coupled to receive a load PC and delta R# from execution pipeline 72. Additionally, load/store dependency table 100 is coupled to LSDT control circuit 104 and is coupled to provide valid indications to ST/LD dependency circuit 106 and delta R#s to adder circuit 102. Adder circuit 102 is further coupled to receive R#s of the dispatching instruction operations from R# assign unit 64. Adder circuit 102 is coupled to provide store R#s to dependency unit 62 and to store validation circuit 108, which is coupled to receive a valid store R# indication from dependency unit 62. Store validation circuit 108 is coupled to provide store valid signals to ST/LD dependency circuit 106, which is further coupled to receive load/store indications corresponding to the dispatching instruction operations from decode units 24. ST/LD dependency circuit 106 is coupled to provide dependency valid signals and depend all signals to dependency unit 62. LSDT control circuit 104 is coupled to receive the train/untrain interface from execution pipeline 72
  • STLF predictor 60b may respond to a dispatching load as follows
  • the load PC is used to index into load/store dependency table 100, thereby selecting one of multiple entries.
  • the selected entry stores a valid indication and a delta R#.
  • the valid indication indicates whether or not STLF predictor 60b has been trained with information regarding a load having the indexing PC, and thus whether or not the delta R# is valid.
  • the delta R# is the difference between the R# of the load and the R# of a store which interfered with the load during a previous execution.
  • Adder circuit 102 adds the delta R# to the R# assigned to the dispatching load to generate a store R#, which is provided to dependency unit 62
  • Dependency unit 62 may then record a dependency for the load on the store. In this manner, the load is prevented from scheduling (and thus executing) ahead of the store.
  • the interference may be avoided during the present execution. More particularly, as instruction operations are dispatched, the PCs of the instruction operations are used to index into load/store dependency table 100. The remainder of this discussion will focus on the response of STLF predictor 60b to one input PC corresponding to one dispatching instruction operation, unless otherwise noted.
  • STLF predictor 60b may respond in parallel to each PC of each dispatching instruction operation
  • Load/store dependency table 100 outputs a delta R# and valid indication from the selected entry
  • Adder 102 adds the delta R# to the R# corresponding to the dispatching instruction operation and thus generates a store R# which is conveyed to dependency unit 62.
  • adder circuit 102 may include an adder for each dispatching instruction operation, receiving the corresponding delta R# output from load/store dependency table 100 and the R# assigned to that dispatching instruction operation by R# assign unit 64.
  • ST/LD dependency circuit 106 receives the valid indication and an indication of whether or not the instruction operation is a load or a store from decode units 24. ST/LD dependency circuit 106 provides a dependency valid signal to dependency unit 62, similar to ST/LD dependency circuit 86 above. ST/LD dependency circuit 106 may assert the dependency valid signal if: (i) the instruction operation is a load; and (ii) the valid indication from the indexed entry of load/store dependency table 100 indicates valid
  • STLF predictor 60b may employ store validation circuit 108
  • Store validation circuit 108 receives an indication of which R#s correspond to outstanding stores from dependency unit 62
  • the indication may be a bit vector having one bit per R#, indicating whether or not the R# corresponds to a store
  • Store validation circuit 108 determines whether or not the R# generated by adder circuit 102 corresponds to a store, and signals ST/LD dependency circuit 106 with the store valid signal.
  • an additional condition for ST/LD dependency circuit 106 to assert the dependency valid signal is that the store valid signal from store validation circuit 108 is asserted.
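  • A sketch of the store validation step, assuming the bit-vector form of the indication and hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_RNUMS 64  /* illustrative number of in-flight tags */

/* One bit per R#, set while that tag corresponds to an outstanding
 * (dispatched and not retired) store, per the indication received
 * from dependency unit 62. */
static uint64_t outstanding_store_vec;

/* Validate the tag produced by the adder: the dependency valid signal
 * is only asserted if the computed R# actually names a store. */
static bool store_valid(uint32_t store_rnum)
{
    return (outstanding_store_vec >> (store_rnum % NUM_RNUMS)) & 1;
}
```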
  • ST/LD dependency circuit 106 may be configured to provide the depend all signal in embodiments in which the valid indication is a bit vector. Operation of ST/LD dependency circuit 106 may be similar to ST/LD dependency circuit 86 in this regard.
  • Load/store dependency table 100 is trained in response to the train/untrain interface from execution pipeline 72. More particularly, if the train signal is asserted by execution pipeline 72, LSDT control circuit 104 causes load/store dependency table 100 to be updated. Execution pipeline 72 provides the PC of the load to be trained (LPC in Fig. 5) and the corresponding delta R# as input to load/store dependency table 100. Load/store dependency table 100 updates the entry indexed by the load PC with the delta R#, and LSDT control circuit 104 places the valid indication into a valid state. In one embodiment, the valid indication may be a bit, and the valid state may be set (or clear) and the invalid state clear (or set). In another embodiment as described above, the valid indication may be a bit vector. In such an embodiment, LSDT control circuit 104 may select a bit within the bit vector and place that bit in the valid state during training
  • LSDT control circuit 104 may untrain an entry in response to the assertion of the untrain signal by execution pipeline 72.
  • execution pipeline 72 may provide the load PC of the load to be untrained, but the delta R# may be a don't care in the untraining case. Load/store dependency table 100 indexes the entry indicated by the load PC, and LSDT control circuit 104 causes the valid indication in the indexed entry to be placed in the invalid state.
  • in embodiments employing a valid bit as the valid indication, the bit may be cleared (or set) to indicate invalid.
  • in embodiments employing a bit vector, a selected bit may be placed in the invalid state. The entry may still remain valid in the bit vector case if other bits remain in the valid state. However, multiple untrain events may eventually cause each of the other bits to become invalid as well
  • with regard to indexing load/store dependency table 100, various embodiments may index with only a portion of the PCs.
  • the portion used to index load/store dependency table 100 may be determined by the number of entries employed within the table
  • in one particular implementation, load/store dependency table 100 may be 1K entries and thus 10 bits of the PC may be used as an index (e.g. the least significant 10 bits)
  • the number of entries may generally be selected as a design choice based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software
  • a flowchart is shown illustrating operation of one embodiment of execution pipeline 72 with respect to load memory operations. Other embodiments are possible and contemplated. While the steps shown in Fig. 6 are illustrated in a particular order for ease of understanding, other orders may be used, and steps may be performed in parallel by combinatorial logic within execution pipeline 72. Still further, various steps may be performed at different stages within execution pipeline 72. Information regarding other steps may be pipelined to the stages at which steps are performed.
  • Execution pipeline 72 determines if a load has been scheduled for execution (decision block 110). If a load is not scheduled, then no training operations are possible in this embodiment. If a load is scheduled, execution pipeline 72 determines if the load was retried due to a hit in physical address buffer 70 (decision block 112). More particularly, execution pipeline 72 may examine the retry indication from the scheduler buffer entry allocated to the load. If the load was retried due to a physical address buffer hit, the execution pipeline 72 asserts the train signal to STLF predictor 60 and provides the load PC and store ID of the load and store to be trained into STLF predictor 60 (block 114).
• Otherwise, execution pipeline 72 determines if the load received a dependency on a store due to operation of STLF predictor 60 (decision block 116). In other words, execution pipeline 72 determines if the train indication in the scheduler buffer entry allocated to the load indicates that the load was trained. If the load was trained, execution pipeline 72 determines if data is forwarded from the store queue for the load (decision block 118). If data is not forwarded, it is likely that the load would not have been interfered with by a store. Accordingly, in this case, execution pipeline 72 may assert the untrain signal to STLF predictor 60 and provide the load PC of the load for untraining (block 120). It is noted that training may also be performed during execution of a store which interferes with a load, rather than during the reexecution of the load due to the retry.
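The decision structure of this flowchart reduces to two tests. The sketch below is a hypothetical software mirror of it; the dataclass fields are invented names standing in for the scheduler buffer entry fields described elsewhere in this document.

```python
from dataclasses import dataclass

@dataclass
class ScheduledLoad:
    pc: int           # load PC (LPC)
    store_id: int     # SID recorded when the load was retried
    retried: bool     # retry indication (R) from the scheduler entry
    trained: bool     # train indication (T) from the scheduler entry
    forwarded: bool   # did the store queue forward data this execution?

def train_or_untrain(load, predictor):
    # Retried due to a physical address buffer hit: train the predictor
    # with the load PC and the ID of the interfering store.
    if load.retried:
        predictor.train(load.pc, load.store_id)
    # A predicted dependency that produced no store-queue forwarding was
    # likely unwarranted: untrain the entry for this load PC.
    elif load.trained and not load.forwarded:
        predictor.untrain(load.pc)
```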
  • LSDT control circuit 130 may be used as LSDT control circuit 84 and/or LSDT control circuit 104, in various embodiments. Other embodiments are possible and contemplated.
  • LSDT control circuit 130 includes a control circuit 132 and a counter circuit 134 coupled to the control circuit.
  • Control circuit 132 is coupled to receive the train and untrain signals from execution pipeline 72, and is coupled to provide Set_V[3:0] signals and Clear_V[3:0] signals to load/store dependency table 80 or 100 (depending upon the embodiment).
  • LSDT control circuit 130 is configured to manage the valid indications in the load/store dependency table during training and untraining for embodiments in which the valid indications are bit vectors.
• In one embodiment, each bit in the bit vector is in the valid state if set and in the invalid state if clear, although alternative embodiments may have each bit in the bit vector in the valid state if clear and in the invalid state if set. Still other embodiments may encode valid states in the bits.
• If an entry is being trained, control circuit 132 selects a bit in the bit vector to set responsive to the value maintained by counter circuit 134. Similarly, if an entry is being untrained, control circuit 132 selects a bit in the bit vector to clear responsive to the value maintained by counter circuit 134.
• Each value of counter circuit 134 selects one of the bits in the bit vector. Counter circuit 134 includes a counter register and an incrementor which increments the value in the counter register.
• Counter circuit 134 increments each clock cycle. Accordingly, the selected bit for a given training or untraining may be pseudo-random in the present embodiment.
• In the illustrated embodiment, the valid indications are 4-bit vectors. Accordingly, one signal within Set_V[3:0] and Clear_V[3:0] corresponds to each bit in the vector. If an entry is being trained, control circuit 132 asserts the Set_V[3:0] signal corresponding to the bit selected based on counter circuit 134; similarly, if an entry is being untrained, control circuit 132 asserts the Clear_V[3:0] signal corresponding to the selected bit.
• While counter circuit 134 increments each clock cycle in the present embodiment, alternative configurations may increment the count after each train or untrain event, if desired. Still further, alternative configurations may select a bit which is in the invalid state to change to the valid state during training, and may select a bit which is in the valid state to change to the invalid state during untraining. The sketch below illustrates the counter-based selection.
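A minimal sketch of the counter-based bit selection, assuming the 4-bit valid vectors and the increment-every-clock behavior described above (class and method names are illustrative):

```python
class LSDTControl:
    def __init__(self, width=4):
        self.width = width
        self.counter = 0                       # counter register

    def clock(self):
        # The incrementor advances the counter once per clock cycle, so the
        # bit chosen for any given train/untrain event is pseudo-random.
        self.counter = (self.counter + 1) % self.width

    def set_v(self):
        # Set_V[3:0]: one-hot assertion of the bit selected by the counter.
        return [i == self.counter for i in range(self.width)]

    def clear_v(self):
        # Clear_V[3:0]: same selection, used when untraining an entry.
        return [i == self.counter for i in range(self.width)]
```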
• Dependency unit 62 includes a control circuit 140 and a bit vector storage 142 coupled to control circuit 140.
• Control circuit 140 is further coupled to receive an indication of the load/store nature of dispatching instruction operations from decode units 24 and assigned R#s from R# assign unit 64. Additionally, control circuit 140 is coupled to receive retired R#s and an abort indication from scheduler 36.
• The store bit vector from bit vector storage 142 is conveyed to store validation circuit 108.
• As stores are dispatched, control circuit 140 receives indications of the store memory operations from decode units 24.
• The corresponding R#s are provided from R# assign unit 64.
• The store bit vector in bit vector storage 142 includes a bit for each R#.
• Control circuit 140 sets the bits in the store bit vector which correspond to dispatching stores.
• As stores are retired, control circuit 140 resets the corresponding bits in the store bit vector.
• In one embodiment, aborts may be signalled when the instruction operation causing the abort is retired.
• In such an embodiment, the abort indication may be a signal used to clear the store bit vector.
• In other embodiments, the abort indication may identify the R# of the aborting instruction, and only younger stores may be aborted (as sketched below).
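A minimal sketch of the store bit vector maintenance just described. The vector size and the younger-than test are simplifying assumptions; in particular, real hardware must handle R# wraparound, which is ignored here.

```python
class StoreBitVector:
    def __init__(self, num_rnums=256):
        self.bits = [False] * num_rnums        # one bit per R#

    def dispatch_store(self, rnum):
        self.bits[rnum] = True                 # store is now outstanding

    def retire_store(self, rnum):
        self.bits[rnum] = False                # store retired

    def abort(self, aborting_rnum=None):
        if aborting_rnum is None:
            # Abort signalled at retirement of the aborting operation:
            # every outstanding store is younger, so clear the vector.
            self.bits = [False] * len(self.bits)
        else:
            # Alternative: clear only stores younger than the aborting R#
            # (simplified to numerically greater R#s, ignoring wraparound).
            for r in range(aborting_rnum + 1, len(self.bits)):
                self.bits[r] = False
```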
• As used herein, the term "control circuit" refers to circuitry which operates on inputs to produce outputs as described.
• A control circuit may include any combination of combinatorial logic (static or dynamic), state machines, custom circuitry, and clocked storage devices (such as flops, registers, etc.).

Computer Systems
• Turning now to FIG. 9, a block diagram of one embodiment of a computer system 200 including processor 10 coupled to a variety of system components through a bus bridge 202 is shown.
• A main memory 204 is coupled to bus bridge 202 through a memory bus 206.
• A graphics controller 208 is coupled to bus bridge 202 through an AGP bus 210.
• A plurality of PCI devices 212A-212B are coupled to bus bridge 202 through a PCI bus 214.
• A secondary bus bridge 216 may further be provided to accommodate an electrical interface to one or more EISA or ISA devices 218 through an EISA/ISA bus 220.
  • Processor 10 is coupled to bus bridge 202 through a CPU bus 224 and to an optional L2 cache 228. Together, CPU bus 224 and the interface to L2 cache 228 may comprise external interface 52.
  • Bus bridge 202 provides an interface between processor 10, main memory 204, graphics controller 208, and devices attached to PCI bus 214.
• When an operation is received, bus bridge 202 identifies the target of the operation (e.g. a particular device or, in the case of PCI bus 214, that the target is on PCI bus 214).
  • Bus bridge 202 routes the operation to the targeted device.
  • Bus bridge 202 generally translates an operation from the protocol used by the source device or bus to the protocol used by the target device or bus.
• Secondary bus bridge 216 may further incorporate additional functionality, as desired.
  • An input output controller (not shown), either external from or integrated with secondary bus bridge 216, may also be included within computer system 200 to provide operational support for a keyboard and mouse 222 and for various serial and parallel ports, as desired.
  • An external cache unit (not shown) may further be coupled to CPU bus 224 between processor 10 and bus bridge 202 in other embodiments. Alternatively, the external cache may be coupled to bus bridge 202 and cache control logic for the external cache may be integrated into bus bridge 202.
  • L2 cache 228 is further shown in a backside configuration to processor 10. It is noted that L2 cache 228 may be separate from processor 10, integrated into a cartridge (e.g. slot 1 or slot A) with processor 10, or even integrated onto a semiconductor substrate with processor 10.
  • Main memory 204 is a memory in which application programs are stored and from which processor 10 primarily executes.
• A suitable main memory 204 comprises DRAM (Dynamic Random Access Memory). For example, SDRAM (Synchronous DRAM) or RDRAM (Rambus DRAM) may be suitable.
  • PCI devices 212A-212B are illustrative of a variety of peripheral devices such as, for example, network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards.
  • ISA device 218 is illustrative of various types of peripheral devices, such as a modem, a sound card, and a variety of data acquisition cards such as GPIB or field bus interface cards.
  • Graphics controller 208 is provided to control the rendering of text and images on a display 226.
• Graphics controller 208 may embody a typical graphics accelerator generally known in the art to render three-dimensional data structures which can be effectively shifted into and from main memory 204. Graphics controller 208 may therefore be a master of AGP bus 210 in that it can request and receive access to a target interface within bus bridge 202 to thereby obtain access to main memory 204.
• A dedicated graphics bus accommodates rapid retrieval of data from main memory 204.
• Graphics controller 208 may further be configured to generate PCI protocol transactions on AGP bus 210.
• The AGP interface of bus bridge 202 may thus include functionality to support both AGP protocol transactions as well as PCI protocol target and initiator transactions.
  • Display 226 is any electronic display upon which an image or text can be presented.
  • a suitable display 226 includes a cathode ray tube ("CRT"), a liquid crystal display (“LCD”), etc.
• Computer system 200 may be a multiprocessing computer system including additional processors (e.g. processor 10a shown as an optional component of computer system 200).
• Processor 10a may be similar to processor 10. More particularly, processor 10a may be an identical copy of processor 10.
• Processor 10a may be connected to bus bridge 202 via an independent bus (as shown in Fig. 9) or may share CPU bus 224 with processor 10.
• Furthermore, processor 10a may be coupled to an optional L2 cache 228a similar to L2 cache 228.
• Turning now to FIG. 10, another embodiment of a computer system 300 is shown. Other embodiments are possible and contemplated.
  • computer system 300 includes several processing nodes 312A, 312B, 312C, and 312D. Each processing node is coupled to a respective memory 314A-314D via a memory controller 316A-316D included within each respective processing node 312A-312D. Additionally, processing nodes 312A-312D include interface logic used to communicate between the processing nodes 312A- 312D.
• Processing node 312A includes interface logic 318A for communicating with processing node 312B, interface logic 318B for communicating with processing node 312C, and a third interface logic 318C for communicating with yet another processing node (not shown).
• Similarly, processing node 312B includes interface logic 318D, 318E, and 318F;
• processing node 312C includes interface logic 318G, 318H, and 318I; and
• processing node 312D includes interface logic 318J, 318K, and 318L.
• Processing node 312D is coupled to communicate with a plurality of input/output devices (e.g. devices 320A-320B in a daisy chain configuration) via interface logic 318L.
  • Other processing nodes may communicate with other I/O devices in a similar fashion.
  • Processing nodes 312A-312D implement a packet-based link for inter-processing node communication.
  • the link is implemented as sets of unidirectional lines (e.g. lines 324A are used to transmit packets from processing node 312A to processing node 312B and lines 324B are used to transmit packets from processing node 312B to processing node 312A).
  • Other sets of lines 324C-324H are used to transmit packets between other processing nodes as illustrated in Fig. 10.
• Each set of lines 324 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed.
  • the link may be operated in a cache coherent fashion for communication between processing nodes or in a noncoherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction such as the PCI bus or ISA bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown. It is noted that a packet to be transmitted from one processing node to another may pass through one or more intermediate nodes. For example, a packet transmitted by processing node 312A to processing node 312D may pass through either processing node 312B or processing node 312C as shown in Fig. 10. Any suitable routing algorithm may be used.
• Embodiments of computer system 300 may include more or fewer processing nodes than the embodiment shown in Fig. 10.
• The packets may be transmitted as one or more bit times on the lines 324 between nodes.
• A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines.
• The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.
• Processing nodes 312A-312D, in addition to a memory controller and interface logic, may include one or more processors.
• Generally, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. More particularly, a processing node 312A-312D may comprise processor 10. External interface unit 46 may include the interface logic 318 within the node, as well as the memory controller 316. Memories 314A-314D may comprise any suitable memory devices. For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), static RAM, etc.
• The address space of computer system 300 is divided among memories 314A-314D.
• Each processing node 312A-312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed.
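A small sketch of such memory-map routing. The range-list representation and the contiguous 1 GiB regions in the example are assumptions purely for illustration; the patent does not specify the map's format.

```python
def route_request(address, memory_map):
    """Return the node that owns the memory backing `address`."""
    for base, limit, node_id in memory_map:
        if base <= address < limit:
            return node_id                 # forward the request toward this node
    raise ValueError("address not mapped to any node")

# Example: four nodes, each owning a contiguous 1 GiB region (hypothetical).
memory_map = [(i << 30, (i + 1) << 30, node) for i, node in
              enumerate(["312A", "312B", "312C", "312D"])]
assert route_request(0x4000_0000, memory_map) == "312B"
```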
• The coherency point for an address within computer system 300 is the memory controller 316A-316D coupled to the memory storing bytes corresponding to the address.
• The memory controller 316A-316D is responsible for ensuring that each memory access to the corresponding memory 314A-314D occurs in a cache coherent fashion.
• Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D. Additionally, memory controllers 316A-316D may include request queues for queuing memory requests.
• Interface logic 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link.
• Computer system 300 may employ any suitable flow control mechanism for transmitting packets.
• In one embodiment, each interface logic 318 stores a count of the number of each type of buffer within the receiver at the other end of the link to which that interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet.
• As a receiving buffer is freed, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed.
• Such a mechanism may be referred to as a "coupon-based" system.
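A minimal sketch of the coupon-based scheme from the sender's point of view. The buffer counts and packet-type names are illustrative assumptions, not values from the patent.

```python
class CouponLink:
    """Sender-side view: one coupon per free buffer at the receiver."""

    def __init__(self, buffer_counts):
        self.coupons = dict(buffer_counts)     # e.g. {"command": 4, ...}

    def can_send(self, pkt_type):
        return self.coupons[pkt_type] > 0

    def send(self, pkt_type):
        if not self.can_send(pkt_type):
            raise RuntimeError("no free buffer at receiver; must wait")
        self.coupons[pkt_type] -= 1            # consume a coupon

    def buffer_freed(self, pkt_type):
        self.coupons[pkt_type] += 1            # receiver returned a coupon

link = CouponLink({"command": 4, "probe": 2, "response": 4})
link.send("command")                           # ok: a coupon was available
```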
• I/O devices 320A-320B may be any suitable I/O devices.
  • I/O devices 320A-320B may include network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, modems, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards.

Abstract

A processor (10) employs a store to load forward (STLF) predictor (60) which may indicate, for dispatching loads, a dependency on a store. The dependency is indicated for a store which, during a previous execution, interfered with the execution of the load. Since a dependency is indicated on the store, the load is prevented from scheduling and/or executing prior to the store. The STLF predictor (60) is trained with information for a particular load and store in response to executing the load and store and detecting the interference. Additionally, the STLF predictor (60) may be untrained (e.g. information for a particular load and store may be deleted) if a load is indicated by the STLF predictor (60) as dependent upon a particular store and the dependency does not actually occur.

Description

STORE TO LOAD FORWARDING PREDICTOR WITH UNTRAINING
BACKGROUND OF THE INVENTION
1. Technical Field
This invention is related to the field of processors and, more particularly, to store to load forward mechanisms within processors.
2. Background Art
Processors often include store queues to buffer store memory operations which have been executed but which are still speculative. The store memory operations may be held in the store queue until they are retired. Subsequent to retirement, the store memory operations may be committed to the cache and/or memory. As used herein, a memory operation is an operation specifying a transfer of data between a processor and a main memory (although the transfer may be completed in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Memory operations may be an implicit part of an instruction which includes a memory operation, or may be explicit load/store instructions. Load memory operations may be more succinctly referred to herein as "loads". Similarly, store memory operations may be more succinctly referred to as "stores".
While executing stores speculatively and queueing them in the store queue may allow for increased performance (by removing the stores from the instruction execution pipeline and allowing other, subsequent instructions to execute), subsequent loads may access the memory locations updated by the stores in the store queue. While processor performance is not necessarily directly affected by having stores queued in the store queue, performance may be affected if subsequent loads are delayed due to accessing memory locations updated by stores in the store queue. Often, store queues are designed to forward data stored therein if a load hits the store queue. As used herein, a store queue entry storing a store memory operation is referred to as being "hit" by a load memory operation if at least one byte updated by the store memory operation is accessed by the load memory operation.
To further increase performance, it is desirable to execute younger loads out of order with respect to older stores. The younger loads may often have no dependency on the older stores, and thus need not await the execution of the older stores. Since the loads provide operands for execution of dependent instructions, executing the loads allows for still other instructions to be executed. However, merely detecting hits in the store queue as loads are executing may not lead to correct program execution if younger loads are allowed to execute out of order with respect to older stores, since certain older stores may not have executed yet (and thus the store addresses of those stores may not be known and dependencies of the loads on the certain older stores may not be detectable as the loads are executed). Accordingly, hardware to detect scenarios in which a younger load executes prior to an older store on which that younger load is dependent may be required, and then corrective action may be taken in response to the detection. For example, instructions may be purged and refetched or reexecuted in some other suitable fashion. As used herein, a load is "dependent" on a store if the store updates at least one byte of memory accessed by the load, is older than the load, and is younger than any other stores updating that byte. Unfortunately, executing the load out of order improperly and the subsequent corrective actions to achieve correct execution may reduce performance. It is noted that loads, stores, and other instruction operations may be referred to herein as being older or younger than other instruction operations. A first instruction is older than a second instruction if the first instruction precedes the second instruction in program order (i.e. the order of the instructions in the program being executed). A first instruction is younger than a second instruction if the first instruction is subsequent to the second instruction in program order.
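The dependency definition above translates directly into a predicate. The following Python sketch is illustrative: memory operations are modeled with invented fields (a start address, a byte count, and a program-order sequence number), and byte-granular overlap is assumed.

```python
from dataclasses import dataclass

@dataclass
class MemOp:
    addr: int   # first byte address
    size: int   # number of bytes accessed/updated
    seq: int    # program-order sequence number (smaller = older)

def bytes_of(op):
    return set(range(op.addr, op.addr + op.size))

def depends_on(load, store, all_stores):
    # The store must be older than the load...
    if store.seq >= load.seq:
        return False
    # ...and, for at least one byte the load accesses, be the youngest
    # older store updating that byte (no intervening store to that byte).
    for b in bytes_of(load) & bytes_of(store):
        if not any(store.seq < s.seq < load.seq and b in bytes_of(s)
                   for s in all_stores):
            return True
    return False
```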
DISCLOSURE OF INVENTION
The problems outlined above are in large part solved by a processor as described herein. The processor generally may schedule and/or execute younger loads ahead of older stores. Additionally, the processor may detect and take corrective action for scenarios in which an older store interferes with the execution of the younger load. The processor employs a store to load forward (STLF) predictor which may indicate, for dispatching loads, a dependency on a store. The dependency is indicated for a store which, during a previous execution, interfered with the execution of the load. Since a dependency is indicated on the store, the load is prevented from scheduling and/or executing prior to the store. Performance may be increased due to the decreased interference between loads and stores. The STLF predictor is trained with information for a particular load and store in response to executing the load and store and detecting the interference. Additionally, the STLF predictor may be untrained (e.g. information for a particular load and store may be deleted) if a load is indicated by the STLF predictor as dependent upon a particular store and the dependency does not actually occur. For example, in one embodiment, the STLF predictor is untrained if the load is indicated as dependent upon the particular store but store data is not forwarded from a store queue within the processor when the load executes.
In one implementation, the STLF predictor records at least a portion of the PC of a store which interferes with the load in a first table indexed by the load PC. A second table maintains a corresponding portion of the store PCs of recently dispatched stores, along with tags identifying the recently dispatched stores. The PC of a dispatching load is used to select a store PC from the first table. The selected store PC is compared to the PCs stored in the second table. If a match is detected, the corresponding tag is read from the second table and used to indicate a dependency for the load.
In another implementation, the STLF predictor records a difference between the tags assigned to a load and a store which interferes with the load in a first table indexed by the load PC. The PC of the dispatching load is used to select a difference from the table, and the difference is added to the tag assigned to the load. Accordingly, a tag of the store may be generated and a dependency of the load on the store may be indicated.
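A sketch of this second (delta R#) implementation, reusing the hypothetical table class from the earlier sketch. The sign convention (delta = load R# minus store R#) is an assumption, and R# wraparound is ignored.

```python
def predict_store_rnum(load_pc, load_rnum, lsdt):
    # Look up the trained delta R# for this load PC; None means no
    # dependency is predicted for this load.
    delta = lsdt.lookup(load_pc)
    if delta is None:
        return None
    # Applying the delta to the dispatching load's R# regenerates the
    # older store's tag, on which a dependency can then be recorded.
    return load_rnum - delta
```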
Broadly speaking, a processor is contemplated comprising an STLF predictor and an execution pipeline coupled to the STLF predictor. The STLF predictor is coupled to receive an indication of dispatch of a first load memory operation, and is configured to indicate a dependency of the first load memory operation on a first store memory operation responsive to information stored within the STLF predictor indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation. The execution pipeline is configured to inhibit execution of the first load memory operation prior to the first store memory operation responsive to the dependency. The execution pipeline is configured to detect a lack of the dependency during execution of the first load memory operation, and to generate an untrain signal responsive to the lack of dependency. Coupled to receive the untrain signal, the STLF predictor is configured to update the information stored therein to not indicate that the first store memory operation interfered with the first load memory operation during the previous execution. Additionally, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable. Moreover, a method is contemplated. A dependency of a first load memory operation on a first store memory operation is indicated responsive to information indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation. Scheduling of the first load memory operation is inhibited prior to scheduling the first store memory operation. A lack of the dependency is detected during execution of the first load memory operation. The information indicating that, during the previous execution, the first store memory operation interfered with the first load memory operation is updated to not indicate that interference, responsive to the detecting of the lack of dependency.
BRIEF DESCRIPTION OF DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of one embodiment of a processor.
Fig. 2 is a pipeline diagram of an exemplary pipeline which may be employed in the processor shown in Fig. 1.
Fig. 3 is a block diagram illustrating one embodiment of a map unit, a scheduler, an AGU/TLB, and a load/store unit in greater detail.
Fig. 4 is a block diagram of one embodiment of a store to load forward (STLF) predictor shown in Fig. 3.
Fig. 5 is a block diagram of a second embodiment of an STLF predictor shown in Fig. 3.
Fig. 6 is a flowchart illustrating training and untraining of loads in one embodiment of an STLF predictor shown in Figs. 4 or 5.
Fig. 7 is a block diagram illustrating one embodiment of a control circuit which may be employed in an STLF predictor shown in Figs. 4 or 5.
Fig. 8 is a block diagram of one embodiment of a dependency unit shown in Fig. 3.
Fig. 9 is a block diagram of one embodiment of a computer system including the processor shown in Fig. 1.
Fig. 10 is a block diagram of a second embodiment of a computer system including the processor shown in Fig. 1.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
MODE(S) FOR CARRYING OUT THE INVENTION
Processor Overview
Turning now to Fig. 1, a block diagram of one embodiment of a processor 10 is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 1, processor 10 includes a line predictor 12, an instruction cache (I-cache) 14, an alignment unit 16, a branch prediction/fetch PC generation unit 18, a plurality of decode units 24A-24D, a predictor miss decode unit 26, a microcode unit 28, a map unit 30, a retire queue 32, an architectural renames file 34, a future file 20, a scheduler 36, an integer register file 38A, a floating point register file 38B, an integer execution core 40A, a floating point execution core 40B, a load/store unit 42, a data cache (D-cache) 44, an external interface unit 46, and a PC silo 48. Line predictor 12 is coupled to predictor miss decode unit 26, branch prediction/fetch PC generation unit 18, PC silo 48, and alignment unit 16. Line predictor 12 may also be coupled to I-cache 14. I-cache 14 is coupled to alignment unit 16 and branch prediction/fetch PC generation unit 18, which is further coupled to PC silo 48. Alignment unit 16 is further coupled to predictor miss decode unit 26 and decode units 24A-24D. Decode units 24A-24D are further coupled to map unit 30, and decode unit 24D is coupled to microcode unit 28. Map unit 30 is coupled to retire queue 32 (which is coupled to architectural renames file 34), future file 20, scheduler 36, and PC silo 48. Architectural renames file 34 is coupled to future file 20. Scheduler 36 is coupled to register files 38A-38B, which are further coupled to each other and respective execution cores 40A-40B. Execution cores 40A-40B are further coupled to load/store unit 42 and scheduler 36. Execution core 40A is further coupled to D-cache 44. Load/store unit 42 is coupled to scheduler 36, D-cache 44, and external interface unit 46. D-cache 44 is coupled to register files 38. External interface unit 46 is coupled to an external interface 52 and to I-cache 14. Elements referred to herein by a reference numeral followed by a letter will be collectively referred to by the reference numeral alone. For example, decode units 24A-24D will be collectively referred to as decode units 24.
In the embodiment of Fig. 1, processor 10 employs a variable byte length, complex instruction set computing (CISC) instruction set architecture. For example, processor 10 may employ the x86 instruction set architecture (also referred to as IA-32). Other embodiments may employ other instruction set architectures including fixed length instruction set architectures and reduced instruction set computing (RISC) instruction set architectures. Certain features shown in Fig. 1 may be omitted in such architectures.
Branch prediction/fetch PC generation unit 18 is configured to provide a fetch address (fetch PC) to I-cache 14, line predictor 12, and PC silo 48. Branch prediction/fetch PC generation unit 18 may include a suitable branch prediction mechanism used to aid in the generation of fetch addresses. In response to the fetch address, line predictor 12 provides alignment information corresponding to a plurality of instructions to alignment unit 16, and may provide a next fetch address for fetching instructions subsequent to the instructions identified by the provided instruction information. The next fetch address may be provided to branch prediction/fetch PC generation unit 18 or may be directly provided to I-cache 14, as desired. Branch prediction/fetch PC generation unit 18 may receive a trap address from PC silo 48 (if a trap is detected), and the trap address may comprise the fetch PC generated by branch prediction/fetch PC generation unit 18. Otherwise, the fetch PC may be generated using the branch prediction information and information from line predictor 12. Generally, line predictor 12 stores information corresponding to instructions previously speculatively fetched by processor 10. In one embodiment, line predictor 12 includes 2K entries, each entry locating a group of one or more instructions referred to herein as a "line" of instructions. The line of instructions may be concurrently processed by the instruction processing pipeline of processor 10 through being placed into scheduler 36.
I-cache 14 is a high speed cache memory for storing instruction bytes. According to one embodiment, I-cache 14 may comprise, for example, a 128 Kbyte, four way set associative organization employing 64 byte cache lines. However, any I-cache structure may be suitable (including direct-mapped structures).
Alignment unit 16 receives the instruction alignment information from line predictor 12 and instruction bytes corresponding to the fetch address from I-cache 14. Alignment unit 16 selects instruction bytes into each of decode units 24A-24D according to the provided instruction alignment information. More particularly, line predictor 12 provides an instruction pointer corresponding to each decode unit 24A-24D. The instruction pointer locates an instruction within the fetched instruction bytes for conveyance to the corresponding decode unit 24A-24D. In one embodiment, certain instructions may be conveyed to more than one decode unit 24A-24D. Accordingly, in the embodiment shown, a line of instructions from line predictor 12 may include up to 4 instructions, although other embodiments may include more or fewer decode units 24 to provide for more or fewer instructions within a line. Decode units 24A-24D decode the instructions provided thereto, and each decode unit 24A-24D generates information identifying one or more instruction operations (or ROPs) corresponding to the instructions. In one embodiment, each decode unit 24A-24D may generate up to two instruction operations per instruction. As used herein, an instruction operation (or ROP) is an operation which an execution unit within execution cores 40A-40B is configured to execute as a single entity. Simple instructions may correspond to a single instruction operation, while more complex instructions may correspond to multiple instruction operations. Certain of the more complex instructions may be implemented within microcode unit 28 as microcode routines (fetched from a read-only memory therein via decode unit 24D in the present embodiment). Furthermore, other embodiments may employ a single instruction operation for each instruction (i.e. instruction and instruction operation may be synonymous in such embodiments). PC silo 48 stores the fetch address and instruction information for each instruction fetch, and is responsible for redirecting instruction fetching upon exceptions (such as instruction traps defined by the instruction set architecture employed by processor 10, branch mispredictions, and other microarchitecturally defined traps). PC silo 48 may include a circular buffer for storing fetch address and instruction information corresponding to multiple lines of instructions which may be outstanding within processor 10. In response to retirement of a line of instructions, PC silo 48 may discard the corresponding entry. In response to an exception, PC silo 48 may provide a trap address to branch prediction/fetch PC generation unit 18. Retirement and exception information may be provided by scheduler 36. In one embodiment, map unit 30 assigns a sequence number (R#) to each instruction to identify the order of instructions outstanding within processor 10. Scheduler 36 may return R#s to PC silo 48 to identify instruction operations experiencing exceptions or retiring instruction operations.
Upon detecting a miss in line predictor 12, alignment unit 16 routes the corresponding instruction bytes from I-cache 14 to predictor miss decode unit 26. Predictor miss decode unit 26 decodes the instruction, enforcing any limits on a line of instructions as processor 10 is designed for (e.g. maximum number of instruction operations, maximum number of instructions, terminate on branch instructions, etc.). Upon terminating a line, predictor miss decode unit 26 provides the information to line predictor 12 for storage. It is noted that predictor miss decode unit 26 may be configured to dispatch instructions as they are decoded. Alternatively, predictor miss decode unit 26 may decode the line of instruction information and provide it to line predictor 12 for storage. Subsequently, the missing fetch address may be reattempted in line predictor 12 and a hit may be detected. In addition to decoding instructions upon a miss in line predictor 12, predictor miss decode unit 26 may be configured to decode instructions if the instruction information provided by line predictor 12 is invalid. In one embodiment, processor 10 does not attempt to keep information in line predictor 12 coherent with the instructions within I-cache 14 (e.g. when instructions are replaced or invalidated in I-cache 14, the corresponding instruction information may not actively be invalidated). Decode units 24A-24D may verify the instruction information provided, and may signal predictor miss decode unit 26 when invalid instruction information is detected. According to one particular embodiment, the following instruction operations are supported by processor 10: integer (including arithmetic, logic, shift/rotate, and branch operations), floating point (including multimedia operations), and load/store.
The decoded instruction operations and source and destination register numbers are provided to map unit 30. Map unit 30 is configured to perform register renaming by assigning physical register numbers (PR#s) to each destination register operand and source register operand of each instruction operation. The physical register numbers identify registers within register files 38A-38B. Map unit 30 additionally provides an indication of the dependencies for each instruction operation by providing R#s of the instruction operations which update each physical register number assigned to a source operand of the instruction operation. Map unit 30 updates future file 20 with the physical register numbers assigned to each destination register (and the R# of the corresponding instruction operation) based on the corresponding logical register number. Additionally, map unit 30 stores the logical register numbers of the destination registers, assigned physical register numbers, and the previously assigned physical register numbers in retire queue 32. As instructions are retired (indicated to map unit 30 by scheduler 36), retire queue 32 updates architectural renames file 34 and frees any registers which are no longer in use. Accordingly, the physical register numbers in architectural renames file 34 identify the physical registers storing the committed architectural state of processor 10, while future file 20 represents the speculative state of processor 10. In other words, architectural renames file 34 stores a physical register number corresponding to each logical register, representing the committed register state for each logical register. Future file 20 stores a physical register number corresponding to each logical register, representing the speculative register state for each logical register.
The line of instruction operations, source physical register numbers, and destination physical register numbers are stored into scheduler 36 according to the R#s assigned by map unit 30. Furthermore, dependencies for a particular instruction operation may be noted as dependencies on other instruction operations which are stored in the scheduler. In one embodiment, instruction operations remain in scheduler 36 until retired. Scheduler 36 stores each instruction operation until the dependencies noted for that instruction operation have been satisfied. In response to scheduling a particular instruction operation for execution, scheduler 36 may determine at which clock cycle that particular instruction operation will update register files 38A-38B. Different execution units within execution cores 40A-40B may employ different numbers of pipeline stages (and hence different latencies). Furthermore, certain instructions may experience more latency within a pipeline than others. Accordingly, a countdown is generated which measures the latency for the particular instruction operation (in numbers of clock cycles). Scheduler 36 awaits the specified number of clock cycles (until the update will occur prior to or coincident with the dependent instruction operations reading the register file), and then indicates that instruction operations dependent upon that particular instruction operation may be scheduled. It is noted that scheduler 36 may schedule an instruction once its dependencies have been satisfied (i.e. out of order with respect to its order within the scheduler queue).
Integer and load/store instruction operations read source operands according to the source physical register numbers from register file 38A and are conveyed to execution core 40A for execution. Execution core 40A executes the instruction operation and updates the physical register assigned to the destination within register file 38A. Additionally, execution core 40A reports the R# of the instruction operation and exception information regarding the instruction operation (if any) to scheduler 36. Register file 38B and execution core 40B may operate in a similar fashion with respect to floating point instruction operations (and may provide store data for floating point stores to load/store unit 42).
In one embodiment, execution core 40A may include, for example, two integer units, a branch unit, and two address generation units (with corresponding translation lookaside buffers, or TLBs). Execution core 40B may include a floating point/multimedia multiplier, a floating point/multimedia adder, and a store data unit for delivering store data to load/store unit 42. Other configurations of execution units are possible.
Load/store unit 42 provides an interface to D-cache 44 for performing memory operations and for scheduling fill operations for memory operations which miss D-cache 44. Load memory operations may be completed by execution core 40A performing an address generation and forwarding data to register files 38A-38B (from D-cache 44 or a store queue within load/store unit 42). Store addresses may be presented to D-cache 44 upon generation thereof by execution core 40A (directly via connections between execution core 40A and D-cache 44). The store addresses are allocated a store queue entry. The store data may be provided concurrently, or may be provided subsequently, according to design choice. Upon retirement of the store instruction, the data is stored into D-cache 44 (although there may be some delay between retirement and update of D-cache 44). Additionally, load/store unit 42 may include a load/store buffer for storing load/store addresses which miss D-cache 44 for subsequent cache fills (via external interface unit 46) and re-attempting the missing load/store operations. Load/store unit 42 is further configured to handle load/store memory dependencies.
D-cache 44 is a high speed cache memory for storing data accessed by processor 10. While D-cache 44 may comprise any suitable structure (including direct mapped and set-associative structures), one embodiment of D-cache 44 may comprise a 128 Kbyte, 2 way set associative cache having 64 byte lines.
External interface unit 46 is configured to communicate to other devices via external interface 52. Any suitable external interface 52 may be used, including interfaces to L2 caches and an external bus or buses for connecting processor 10 to other devices. External interface unit 46 fetches fills for I-cache 14 and D-cache 44, as well as writing discarded updated cache lines from D-cache 44 to the external interface. Furthermore, external interface unit 46 may perform non-cacheable reads and writes generated by processor 10 as well.
Turning next to Fig. 2, an exemplary pipeline diagram illustrating an exemplary set of pipeline stages which may be employed by one embodiment of processor 10 is shown. Other embodiments may employ different pipelines, including pipelines with more or fewer stages than the pipeline shown in Fig. 2. The stages shown in Fig. 2 are delimited by vertical lines. Each stage is one clock cycle of a clock signal used to clock storage elements (e.g. registers, latches, flops, and the like) within processor 10.
As illustrated in Fig. 2, the exemplary pipeline includes a CAM0 stage, a CAM1 stage, a line predictor (LP) stage, an instruction cache (IC) stage, an alignment (AL) stage, a decode (DEC) stage, a map1 (M1) stage, a map2 (M2) stage, a write scheduler (WR SC) stage, a read scheduler (RD SC) stage, a register file read (RF RD) stage, an execute (EX) stage, a register file write (RF WR) stage, and a retire (RET) stage. Some instructions utilize multiple clock cycles in the execute stage. For example, memory operations, floating point operations, and integer multiply operations are illustrated in exploded form in Fig. 2. Memory operations include an address generation (AGU) stage, a translation (TLB) stage, a data cache 1 (DC1) stage, and a data cache 2 (DC2) stage. Similarly, floating point operations include up to four floating point execute (FEX1-FEX4) stages, and integer multiplies include up to four (IM1-IM4) stages.
During the CAM0 and CAM1 stages, line predictor 12 compares the fetch address provided by branch prediction/fetch PC generation unit 18 to the addresses of lines stored therein. Additionally, the fetch address is translated from a virtual address (e.g. a linear address in the x86 architecture) to a physical address during the CAM0 and CAM1 stages. In response to detecting a hit during the CAM0 and CAM1 stages, the corresponding line information is read from the line predictor during the line predictor stage. Also, I-cache 14 initiates a read (using the physical address) during the line predictor stage. The read completes during the instruction cache stage.
It is noted that, while the pipeline illustrated in Fig. 2 employs two clock cycles to detect a hit in line predictor 12 for a fetch address, other embodiments may employ a single clock cycle (and stage) to perform this operation. Moreover, in one embodiment, line predictor 12 provides a next fetch address for I-cache 14 and a next entry in line predictor 12 for a hit, and therefore the CAM0 and CAM1 stages may be skipped for fetches resulting from a previous hit in line predictor 12.
Instruction bytes provided by I-cache 14 are aligned to decode units 24A-24D by alignment unit 16 during the alignment stage in response to the corresponding line information from line predictor 12. Decode units 24A-24D decode the provided instructions, identifying ROPs corresponding to the instructions as well as operand information during the decode stage. Map unit 30 generates ROPs from the provided information during the map1 stage, and performs register renaming (updating future file 20). During the map2 stage, the ROPs and assigned renames are recorded in retire queue 32. Furthermore, the ROPs upon which each ROP is dependent are determined. Each ROP may be register dependent upon earlier ROPs as recorded in the future file, and may also exhibit other types of dependencies (e.g. dependencies on a previous serializing instruction, etc.).
The generated ROPs are written into scheduler 36 during the write scheduler stage. Up until this stage, the ROPs located by a particular line of information flow through the pipeline as a unit. However, subsequent to being written into scheduler 36, the ROPs may flow independently through the remaining stages, at different times. Generally, a particular ROP remains at this stage until selected for execution by scheduler 36 (e.g. after the ROPs upon which the particular ROP is dependent have been selected for execution, as described above). Accordingly, a particular ROP may experience one or more clock cycles of delay between the write scheduler stage and the read scheduler stage. During the read scheduler stage, the particular ROP participates in the selection logic within scheduler 36, is selected for execution, and is read from scheduler 36. The particular ROP then proceeds to read register file operands from one of register files 38A-38B (depending upon the type of ROP) in the register file read stage.
The particular ROP and operands are provided to the corresponding execution core 40A or 40B, and the instruction operation is performed on the operands during the execution stage. As mentioned above, some ROPs have several pipeline stages of execution. For example, memory instruction operations (e.g. loads and stores) are executed through an address generation stage (in which the data address of the memory location accessed by the memory instruction operation is generated), a translation stage (in which the virtual data address provided by the address generation stage is translated), and a pair of data cache stages in which D-cache 44 is accessed. Floating point operations may employ up to 4 clock cycles of execution, and integer multiplies may similarly employ up to 4 clock cycles of execution.
Upon completing the execution stage or stages, the particular ROP updates its assigned physical register during the register file write stage. Finally, the particular ROP is retired after each previous ROP is retired (in the retire stage). Again, one or more clock cycles may elapse for a particular ROP between the register file write stage and the retire stage. Furthermore, a particular ROP may be stalled at any stage due to pipeline stall conditions, as is well known in the art.
Store to Load Forwarding
Turning now to Fig. 3, a block diagram illustrating one embodiment of map unit 30, scheduler 36, an address generation unit/translation lookaside buffer (AGU/TLB) 40AA, and load/store unit 42 in greater detail is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 3, scheduler 36, AGU/TLB 40AA, and load/store unit 42 are collectively referred to as execution pipeline 72. Map unit 30 includes a store to load forward (STLF) predictor 60, a dependency unit 62, and an R# assign unit 64 (which assigns R#s to instruction operations). Scheduler 36 includes a scheduler buffer 66 and a physical address (PA) buffer 70. Load/store unit 42 includes a store queue 68. Map unit 30 is coupled to receive instruction operations and corresponding program counter addresses (PCs) from decode units 24, a retire signal from scheduler 36, and a train/untrain interface (including train/untrain (T/UT) signals, a load PC (LPC), and a store identifier (SID)) from execution pipeline 72. Map unit 30 is coupled to provide the instruction operations, PCs, and dependency information to scheduler 36. More particularly, STLF predictor 60 is coupled to receive the instruction operations, PCs, R#s from R# assign unit 64, and the train/untrain interface, and to provide information regarding load dependencies on stores to dependency unit 62, which is also coupled to receive the instruction operations and R#s from R# assign unit 64. Dependency unit 62 is coupled to provide the dependency information to scheduler 36. Scheduler 36 is coupled to provide loads and stores to AGU/TLB 40AA, along with corresponding control information. AGU/TLB 40AA is coupled to receive corresponding operands from register file 38A and to provide a physical address and other control information to store queue 68, along with, in the case of a store, store data. Additionally, AGU/TLB 40AA is coupled to provide the physical address and control information to PA buffer 70, which is coupled to scheduler buffer 66. Store queue 68 is coupled to receive a retire signal from scheduler 36 and to provide a store to commit and store forward data to D-cache 44. In one embodiment, AGU/TLB 40AA is part of integer execution core 40A.
Generally, instruction operations are received by map unit 30 from decode units 24. For each load memory operation, STLF predictor 60 determines if it has any information indicating that, during a previous execution, a store memory operation interfered with the load memory operation. If a store memory operation did interfere, STLF predictor 60 provides an indication of that store memory operation to dependency unit 62. Dependency unit 62 indicates a dependency for the load memory operation on that store memory operation (in addition to any dependencies for address operands, etc.), and thus the load memory operation does not get scheduled prior to the store memory operation. Accordingly, during the current execution of the load memory operation, the store memory operation may not interfere. On the other hand, if no information regarding interference from a store memory operation is recorded by STLF predictor 60 for a particular load memory operation, STLF predictor 60 does not indicate a dependency to dependency unit 62. The particular load memory operation may receive dependencies for source register operands but not for any store memory operations.
As used herein, a store memory operation "interferes" with a load memory operation if the store memory operation causes additional clock cycles to be added to the execution of the load memory operation. The additional clock cycles may be added in the form of pipeline stalls or may be added via reexecution of the load memory operation. The remainder of this disclosure will focus on an embodiment in which a store memory operation interferes with a load memory operation if the store memory operation is older than the load memory operation, the load memory operation has a dependency on the store memory operation, and the load memory operation is scheduled and/or executed prior to the store memory operation. Other embodiments are contemplated. For example, an embodiment is contemplated in which load memory operations are not scheduled prior to the address generation of a store memory operation, but which may be scheduled prior to the store data being provided. In such an embodiment, the store may interfere with the load if there is a dependency and the store data is not available when the load memory operation executes.
Map unit 30 passes the instruction operations, PCs, and dependencies to scheduler 36, which writes the instruction operations into scheduler buffer 66. Scheduler buffer 66 includes multiple entries, each entry capable of storing information regarding one instruction operation. An exemplary memory operation entry is illustrated in scheduler buffer 66, including a valid bit and a type field (identifying the entry as storing a memory operation and which type of memory operation is stored, either load or store). For load memory operations, the PC of the load memory operation (LPC) is stored. Additional information may be stored as well to aid in training STLF predictor 60 with information regarding a store memory operation which interferes with the load memory operation. For example, a store ID field (SID) may be included to store an indication of the store memory operation which has interfered with the load memory operation during the present execution, and a retry indication (R) indicating that the load memory operation has been retried (due to the interference by the store memory operation) and thus is to be rescheduled for re-execution. A train indication (T) is also stored to indicate that the load was detected, by STLF predictor 60 on dispatch of the load to scheduler 36, as being dependent on an older store. In one embodiment, the retry indication may be a bit indicating retry when the bit is set. Similarly, the train indication may be a bit indicating that the dependency was detected when set. The opposite sense may be used, and other encodings may be used, in other embodiments. Still further, additional information may be stored as desired (e.g. size information, operand PR#s, etc.), and other types of entries (e.g. integer, floating point, etc.) may have different formats. Scheduler 36 schedules the memory operation for execution subsequent to each of its recorded dependencies being satisfied (including any dependencies identified by STLF predictor 60), and conveys the load/store nature of the operation and other control information to AGU/TLB 40AA. A sketch of such an entry follows.
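The sketch below models the exemplary memory operation entry as a plain record; the field names abbreviate the fields named above and are otherwise invented.

```python
from dataclasses import dataclass

@dataclass
class MemOpEntry:
    valid: bool       # entry holds a memory operation
    is_store: bool    # type field: load or store
    lpc: int          # PC of the load memory operation (loads only)
    sid: int          # store ID of the interfering store, if any
    retry: bool       # R: retried due to interference; reschedule and train
    train: bool       # T: STLF predictor imposed a dependency at dispatch
```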
More particularly, since scheduler 36 may schedule a memory operation for execution once each of its recorded dependencies are satisfied, younger loads may be scheduled prior to older stores if STLF predictor 60 does not indicate a dependency of the younger load on the older store. Map unit 30 may detect each source register operand dependency, but may not be capable of detecting all load dependencies on earlier stores. The dependency of a load on a store is based on the memory addresses affected by the load and store, respectively, generated from source operands of the load and store during execution of the load and store. STLF predictor 60 detects certain dependencies of loads on stores (as described herein), but others may not be detected. Accordingly, processor 10 employs PA buffer 70 as described below to detect cases in which a younger load scheduled prior to an older store is dependent on that store.
AGU/TLB 40AA receives the memory operation and operands (read from register file 38A in response to PR#s from scheduler 36). AGU/TLB 40AA adds the operands to produce a virtual address, and translates the virtual address to a physical address using translations cached in the TLB. AGU/TLB 40AA provides the physical address and other control information to store queue 68. Store data is also provided, if the operation is a store. Among the control information provided by AGU/TLB 40AA may be the load or store nature of the operation. The physical address and other control information is also provided by AGU/TLB 40AA to D-cache 44 and to PA buffer 70.
PA buffer 70 is used in the present embodiment to detect stores which interfere with loads. PA buffer 70 includes multiple entries, one entry for each entry in scheduler buffer 66. Each entry is capable of storing physical address information. When a load is executed, the physical address provided to PA buffer 70 is stored into an entry corresponding to the scheduler buffer entry storing the load. On the other hand, when a store is executed, the physical address is compared to the physical addresses stored in PA buffer 70. If a match is found, and the corresponding instruction operation is a load which is younger than the store, then the load is retried. Generally, a memory operation is referred to herein as "retried" if the operation's state within scheduler 36 is reset to a not executed state. Retrying the memory operation subsequently leads to the memory operation being rescheduled and reexecuted.
If a retry situation is detected in PA buffer 70, the retry indication in the corresponding scheduler buffer entry is set. Additionally, the store ID used by STLF predictor 60 to identify the store is stored in the scheduler buffer entry's SID field. In one embodiment, the store ID may be the store PC. In another embodiment, the store ID may be the R# of the store or the difference between the R# of the store and the R# of the load (the delta R#). Embodiments using each store ID are described in more detail below. Subsequently, when the load is rescheduled and reexecuted, the retry indication being set causes execution pipeline 72 to train the load and the corresponding store into STLF predictor 60 using the train/untrain interface (so that subsequent executions may avoid the retry of the load by making the load dependent on the store). More particularly, a train signal within the interface may be asserted, and the load PC and the store ID from the corresponding scheduler entry are provided to STLF predictor 60 as well. It is noted that the training may occur from any stage of the execution pipeline 72, according to design choice.
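A minimal sketch of the PA buffer's role, under simplifying assumptions: full physical addresses are compared exactly (real hardware may compare a portion of the address), ages are modeled as sequence numbers, and the `scheduler.retry` call stands in for the retry/SID update of the scheduler buffer entry.

```python
class PABuffer:
    """One entry per scheduler buffer entry; loads record their physical
    address, and executing stores search for matching younger loads."""

    def __init__(self, n_entries):
        self.entries = [None] * n_entries

    def record_load(self, sched_idx, phys_addr, load_seq):
        self.entries[sched_idx] = (phys_addr, load_seq)

    def store_executed(self, phys_addr, store_seq, store_id, scheduler):
        for idx, entry in enumerate(self.entries):
            if entry is None:
                continue
            load_addr, load_seq = entry
            if load_addr == phys_addr and load_seq > store_seq:
                # A younger load already executed: retry it and record the
                # interfering store's ID so reexecution trains the predictor.
                scheduler.retry(idx, store_id)
```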
By indicating a dependency of the load upon the store which interfered with the load during a previous execution of the load, scheduling of the load may be inhibited until after the store is scheduled. In this fashion, the dependency of the load upon the store may be detected during the first execution of the load memory operation, and store data may be forwarded in response to the load memory operation. Thus, rescheduling and reexecution of the load may be avoided. Performance may be increased due to the lack of rescheduling and reexecution of the load.
On the other hand, if the load is trained to be dependent on a store and there is no actual dependency during an execution, performance may be lost due to the delayed scheduling of the load. Accordingly, execution pipeline 72 may detect these situations as well and use the train/untrain interface to untrain the load and corresponding store from STLF predictor 60. More particularly, if a load is scheduled and its train indication in scheduler buffer 66 is set, execution pipeline 72 determines if the load receives forwarded data from store queue 68. If no forwarding occurs, then a dependency on a store may not have been warranted for the load. Accordingly, execution pipeline 72 may assert an untrain signal and provide the load PC to STLF predictor 60. STLF predictor 60 may untrain the information corresponding to the load.
As used herein, the term "train" refers to storing information which identifies the occurrence of a store which interferes with a load, and may include updating information which indicates the likelihood of the interference recurring (e.g. if the situation has occurred repeatedly in the past, it may be more likely to occur again). Thus, training may include creating a stronger correlation between the load and the store. The term "untrain" refers to deleting information which identifies the occurrence of a store interfering with a load, and may include creating a weaker correlation between the load and the store prior to deleting the information. It is noted that the training and untraining of STLF predictor 60 may occur from any pipeline stage, and training may be performed at a different stage than untraining. For example, in the present embodiment, training is performed in response to the retry indication when the load is rescheduled, and thus could be performed at any stage after the scheduler read stage in Fig. 2. Untraining is performed in response to the train indication and the lack of store forwarding for the load, and thus may occur later in the pipeline (e.g. the DC2 stage in Fig. 2).

Returning to the execution of memory operations, if the memory operation is a store, store queue 68 stores the information provided by AGU/TLB 40AA. On the other hand, if the memory operation is a load, store queue 68 compares the load information to the information in the store queue entries. If a hit on a store queue entry is detected, the corresponding store queue data is read and provided to D-cache 44 for forwarding (store forward data in Fig. 3). Store queue 68 retains the stores at least until they are retired by scheduler 36. Scheduler 36 signals store queue 68 via the retire signal to indicate retirement of one or more stores. Store queue 68 conveys the retired stores, in order, using the store commit path to D-cache 44. Thus, stores may remain in store queue 68 until they are committed to D-cache 44, which may be delayed from the retirement of the stores.
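The store queue check on a load's execution reduces to the following sketch. The field names (`rnum`, `addr`, `data`) are assumptions; the point is only that the youngest older store whose address matches supplies the forwarded data.

```python
def store_queue_lookup(store_queue, load_rnum, load_addr):
    """Illustrative store-queue check for an executing load (a sketch
    under assumed field names, not the patent's circuit). Each entry
    is a dict with 'rnum', 'addr', and 'data' keys; entries persist
    at least until the store retires and commits to the D-cache."""
    hit = None
    for entry in store_queue:
        # Only stores older than the load (smaller R#) may forward.
        if entry['addr'] == load_addr and entry['rnum'] < load_rnum:
            if hit is None or entry['rnum'] > hit['rnum']:
                hit = entry  # keep the youngest qualifying store
    return hit['data'] if hit else None  # None: load reads the D-cache
```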
It is noted that various combinations of AGUs and TLBs are possible. For example, in one embodiment, a load AGU and a separate store AGU are contemplated. The store AGU may be coupled to a write port on store queue 68, and the load AGU may be coupled to a compare port on store queue 68. Other embodiments may include any number of AGUs for loads, stores, or loads and stores, as desired.
It is noted that, while certain details of the various units shown in Fig. 3 are illustrated, other details and features unrelated to the detection of loads hitting in the store queue may have been omitted for simplicity. For example, map unit 30 may perform register renaming, as described above with respect to Fig. 1.
In one embodiment, STLF predictor 60 operates during the map2 stage of the pipeline shown in Fig. 2 in terms of indicating dependencies for loads on earlier stores. However, STLF predictor 60 may operate at any pipeline stage prior to the selection of the load for execution, according to various embodiments. The above description describes training during the reexecution of the load. However, alternative embodiments may perform the training at different times. For example, an alternative embodiment may train in response to detecting the retry situation (e.g. during execution of the store upon which the load is dependent).
The PCs of loads (and stores, in one embodiment) have been discussed in the context of training and untraining loads and stores in STLF predictor 60. However, it is noted that only a portion of the PC may be used in some embodiments for training and untraining loads and stores in STLF predictor 60. For example, the 10 least significant bits of the load PC and store PC may be used in one embodiment of the STLF predictor 60 described below.
As used herein, the acronym "PC" is used to refer to the program counter address of an instruction. The PC is the address of the instruction in memory. In other words, the PC is the address used to fetch the instruction from memory. In embodiments in which multiple instruction operations may correspond to an instruction, the PC of the instruction is also the PC of each of the instruction operations (e.g. load and store memory operations).
It is noted that the R# is used in certain embodiments described above and below to identify instruction operations. Generally, any suitable tag may be used. The R# identifies relative program order of instruction operations, and may identify the entry in scheduler buffer 66 assigned to the instruction operations. Other embodiments may employ reorder buffer tags or any other tag to identify the instruction operations. Furthermore, R#s or tags may be assigned at any point in the pipeline of processor 10 prior to or coincident with operation of STLF predictor 60.
Turning now to Fig. 4, a block diagram of a first embodiment of STLF predictor 60 (STLF predictor 60a) is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 4, STLF predictor 60a includes a load/store dependency table 80, a store PC/R# table 82, a load/store dependency table (LSDT) control circuit 84, a ST/LD dependency circuit 86, a store table control circuit 88, an intraline dependency check circuit 90, and a multiplexor (mux) 92. Load/store dependency table 80 is coupled to receive the PCs of dispatching instruction operations from decode units 24, and is coupled to LSDT control circuit 84. Additionally, load/store dependency table 80 is coupled to receive a load PC and store PC from execution pipeline 72 for training. Load/store dependency table 80 is coupled to provide store PCs to intraline dependency check circuit 90 and store PC/R# table 82, and valid indications to ST/LD dependency circuit 86. Intraline dependency check circuit 90 is coupled to receive the PCs of dispatching instruction operations and an indication of the load or store nature of each instruction operation from decode units 24, and is coupled to provide hit signals to ST/LD dependency circuit 86 and a selection control to mux 92. Store PC/R# table 82 is coupled to receive the PCs of dispatching instruction operations and the corresponding R#s assigned to the instruction operations. Store PC/R# table 82 is coupled to provide store R#s to mux 92, and mux 92 is further coupled to receive the R#s assigned to the dispatching instruction operations and to provide store R#s to dependency unit 62. Store PC/R# table 82 is coupled to provide hit signals to store table control circuit 88 and is coupled to receive control information from store table control circuit 88. ST/LD dependency circuit 86 is coupled to store table control circuit 88 and is coupled to provide dependency valid and depend all signals to dependency unit 62. LSDT control circuit 84 is coupled to receive train/untrain signals from execution pipeline 72.
Generally, load/store dependency table 80 is indexed by a load PC to select one of multiple entries. The entry stores a valid indication and a store PC (SPC in Fig. 4) of a store which may have interfered with that load during a prior execution. The store PC/R# table includes multiple entries which store the store PCs of recently dispatched stores, along with the corresponding R# for each store. If the store PC from the entry selected in load/store dependency table 80 hits in store PC/R# table 82, a dependency of the load on the store is noted for the load. In this manner, the load is prevented from scheduling (and thus executing) ahead of the store. Accordingly, the interference may be avoided during the present execution. More particularly, as instruction operations are dispatched, the PCs of the instruction operations are used to index into load/store dependency table 80. The remainder of this discussion will focus on the response of STLF predictor 60a to one input PC corresponding to one dispatching instruction operation, unless otherwise noted. However, it is noted that STLF predictor 60a may respond in parallel to each PC of each dispatching instruction operation. Responsive to the input PC, load/store dependency table 80 outputs a valid indication and a store PC from the indexed entry. The store PC is input to store PC/R# table 82, and is compared to the store PCs stored in store PC/R# table 82. For example, store PC/R# table 82 may comprise a content addressable memory (CAM). Store PC/R# table 82 outputs hit signals for each entry, indicating whether or not that entry is hit by the store PC. Store table control circuit 88 receives the hit signals and selects the youngest store represented in store PC/R# table 82 which is hit by the store PC. The selected entry outputs a store R# to mux 92, which generally selects that store R# to output to dependency unit 62.
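The lookup path just described can be sketched in a few lines. This is an illustrative model, not the circuit: the table sizes follow the 1K-entry, 10-bit-index implementation mentioned later, and the tuple layouts are assumptions.

```python
LSDT_INDEX_BITS = 10  # 1K-entry table indexed by the PC's low 10 bits

def stlf_lookup_60a(lsdt, store_table, load_pc):
    """lsdt: list of (valid, store_pc) tuples. store_table: recently
    dispatched stores as (store_pc, rnum) tuples, oldest first.
    Returns the R# of the predicted store, or None."""
    valid, predicted_pc = lsdt[load_pc & ((1 << LSDT_INDEX_BITS) - 1)]
    if not valid:
        return None
    # CAM-style compare: the youngest store whose PC matches wins.
    for store_pc, rnum in reversed(store_table):
        if store_pc == predicted_pc:
            return rnum
    return None  # predicted store not outstanding: no dependency
```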
Additionally, store table control circuit 88 signals ST/LD dependency circuit 86 with an indication of whether or not a hit was detected in store PC/R# table 82 for that instruction operation. ST/LD dependency circuit 86 provides a dependency valid signal to dependency unit 62. The dependency valid signal, if asserted, indicates that dependency unit 62 is to record a dependency for the instruction operation on the store identified by the store R# provided by mux 92. If the dependency valid signal is deasserted, the signal indicates that no dependency is to be recorded by dependency unit 62. More particularly, in one embodiment, ST/LD dependency circuit 86 may assert the dependency valid signal if: (i) the instruction operation is a load (determined from the load/store indications from decode units 24); (ii) the valid indication from the indexed entry of load/store dependency table 80 indicates valid; and (iii) the store PC from the indexed entry hits in store PC/R# table 82. Processor 10 as shown in Fig. 1 attempts to dispatch multiple instruction operations per clock cycle.
Thus, it is possible that the youngest store which matches the store PC provided from load/store dependency table 80 is being concurrently dispatched with the corresponding load. Accordingly, for a given instruction operation, intraline dependency check circuit 90 compares the store PC output from load/store dependency table 80 to the PCs of each concurrently dispatched instruction operation which is prior to the given instruction operation in program order. If the prior instruction operation's PC matches the store PC from load/store dependency table 80 and the prior instruction operation is a store (indicated by the load/store indications provided by decode units 24), intraline dependency check circuit 90 may: (i) indicate a hit to ST/LD dependency circuit 86 for the corresponding load; and (ii) control mux 92 to override the store R# provided by store PC/R# table 82 with the R# of the instruction operation upon which the hit is detected. In this manner, the store R# output to dependency unit 62 is the R# of the store which is concurrently dispatched with the load. Additionally, ST/LD dependency circuit 86 may assert the dependency valid signal for the load if: (i) the instruction operation is a load (determined from the load/store indications from decode units 24); (ii) the valid indication from the indexed entry of load/store dependency table 80 indicates valid; and (iii) the hit signal from intraline dependency check circuit 90 for the load is asserted.
In one embodiment, ST/LD dependency circuit 86 may further assert the depend all signal for the instruction operation. The depend all signal, if asserted, indicates to dependency unit 62 to record dependencies for the instruction operation on each outstanding (dispatched and not retired) store. The depend all signal is used to handle a situation in which a particular entry is repeatedly trained with store PCs of stores which interfere with loads. Since load/store dependency table 80 selects an entry in response to a PC of an instruction operation and the entry stores one store PC, loads for which different stores interfere on different executions may still be interfered with even though STLF predictor 60a indicates a dependency on a store. To better handle such cases, the valid indication in load/store dependency table 80 may be a bit vector. Each time an entry is trained by execution pipeline 72, a bit in the bit vector may be placed in the valid state (e.g. set or clear, depending upon design choice). If each of the bits is in the valid state, the entry may be repeatedly being trained because the load is being interfered with by different stores during various executions. Accordingly, the depend all signal may be asserted if: (i) each bit in the bit vector is in the valid state; and (ii) the instruction operation is a load. One embodiment of the bit vector and placing bits in the valid or invalid state is described in more detail below.
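The dependency valid and depend all decisions above reduce to two boolean expressions. The sketch below is illustrative; it models the bit vector as a Python list (4 bits, matching the Fig. 7 embodiment discussed later), and the function name is an assumption.

```python
def dependency_signals(is_load, valid_bits, store_table_hit):
    """valid_bits: the entry's bit-vector valid indication, e.g. four
    0/1 flags. Returns (dependency_valid, depend_all)."""
    # Dependency valid: a load, a valid entry, and a store table hit.
    dependency_valid = is_load and any(valid_bits) and store_table_hit
    # Every bit valid suggests the entry is being retrained repeatedly
    # by different stores, so depend on all outstanding stores instead.
    depend_all = is_load and all(valid_bits)
    return dependency_valid, depend_all
```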
The above has described the use of STLF predictor 60a to indicate dependencies for loads on stores which may have interfered with the loads on prior executions. Additionally, STLF predictor 60a may be trained with the information on the loads and stores. Store PC/R# table 82 stores the store PCs and R#s of the most recently dispatched stores. Thus, store table control circuit 88 may allocate entries in store PC/R# table 82 to stores which are being dispatched. Store table control circuit 88 receives the load/store indications for each instruction operation from decode units 24 and allocates entries to each dispatching store. The allocated entries are updated with the store PC (received from decode units 24) and the corresponding R# (received from R# assign unit 64). In one embodiment, store table control circuit 88 may operate store PC/R# table 82 as a first-in, first-out (FIFO) buffer of the most recently dispatched stores. Thus, once store PC/R# table 82 is filled with stores, subsequently dispatched stores displace the oldest stores within store PC/R# table 82. Additionally, it is possible that a store may retire prior to being deleted from store PC/R# table 82 via subsequently dispatched stores. Accordingly, store table control circuit 88 may receive the R#s of retiring stores and may delete entries having the corresponding R#.
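The FIFO management of the store table can be modeled briefly. This is a sketch under assumed names; the text gives 8-12 entries as the implementation range, so 8 is assumed here.

```python
from collections import deque

class StorePCTable:
    """Illustrative FIFO of recently dispatched stores (8-12 entries
    per the text; 8 assumed here). A sketch, not the actual circuit."""

    def __init__(self, size=8):
        self.size = size
        self.fifo = deque()

    def dispatch(self, store_pc, rnum):
        if len(self.fifo) == self.size:
            self.fifo.popleft()          # displace the oldest store
        self.fifo.append((store_pc, rnum))

    def retire(self, rnum):
        # A store may retire before being displaced; drop its entry.
        self.fifo = deque(e for e in self.fifo if e[1] != rnum)
```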
Load/store dependency table 80 is trained in response to the train/untrain interface from execution pipeline 72. More particularly, if the train signal is asserted by execution pipeline 72, LSDT control circuit 84 causes load/store dependency table 80 to be updated. Execution pipeline 72 provides the PC of the load to be trained (LPC in Fig. 4) and the corresponding store PC which interferes with the load as input to load/store dependency table 80. Load/store dependency table 80 updates the entry indexed by the load PC with the store PC, and LSDT control circuit 84 places the valid indication into a valid state. In one embodiment, the valid indication may be a bit, and the valid state may be set (or clear) and the invalid state clear (or set). In another embodiment as described above, the valid indication may be a bit vector. In such an embodiment, LSDT control circuit 84 may select a bit within the bit vector and place that bit in the valid state during training.
Additionally, LSDT control circuit 84 may untrain an entry in response to the assertion of the untrain signal by execution pipeline 72. Again, execution pipeline 72 may provide the load PC of the load to be untrained, but the store PC may be a don't care in the untraining case. Load/store dependency table 80 indexes the entry indicated by the load PC, and LSDT control circuit 84 causes the valid indication in the indexed entry to be placed in the invalid state. In an embodiment employing a valid bit as a valid indication, the bit may be cleared (or set) to indicate invalid. In an embodiment employing the above-described bit vector, a selected bit may be placed in the invalid state. The entry may still remain valid in the bit vector case if other bits remain in the valid state. However, multiple untrain events may eventually cause each of the other bits to become invalid as well.
As mentioned above with respect to Fig. 3, while PCs have been described as indexing load/store dependency table 80 and being stored in load/store dependency table 80 and store PC/R# table 82, various embodiments may index with and/or store only a portion of the PCs. The portion used to index load/store dependency table 80 may be determined by the number of entries employed within the table. For example, in one particular implementation load/store dependency table 80 may be 1K entries, and thus 10 bits of the PC may be used as an index (e.g. the least significant 10 bits). The number of entries may generally be selected as a design choice based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software. The number of bits used for the store PCs stored in load/store dependency table 80 and store PC/R# table 82 may differ from the number of bits used in the index, and again may be selected as a design choice based, in part, on the area occupied by the tables versus the accuracy of the tables in general for the loads in targeted software. In one particular implementation, the least significant 10 bits of the store PC are stored.
Furthermore, the number of entries in store PC/R# table 82 may be a matter of design choice as well, based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software. In one particular implementation, 8-12 entries may be used.
It is noted that, while the above embodiment may respond to each PC of each dispatching instruction operation, other embodiments may limit the number of concurrent instruction operations to which STLF predictor 60a responds. In such embodiments, the PCs and R#s input to STLF predictor 60a may be muxed in response to the load/store indications from decode units 24, such that only the PCs of loads are input to load/store dependency table 80 and only the PCs of stores are input to store PC/R# table 82 for storage. In such an embodiment, predictor miss decode unit 26 may terminate a line of instruction operations once the load and/or store limit is reached.
It is noted that, as an alternative to the bit vector used for the valid indication and the depend all signal for handling loads which are interfered with by different stores on different executions, each entry in load/store dependency table 80 may provide storage for multiple store PCs and corresponding valid bits. Each store PC from a selected entry may be compared to store PC/R# table 82, and a dependency may be recorded for the load on each store which is a hit in store PC/R# table 82.
Turning now to Fig. 5, a second embodiment of STLF predictor 60 (STLF predictor 60b) is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 5, STLF predictor 60b includes a load/store dependency table 100, an adder circuit 102, a load/store dependency table (LSDT) control circuit 104, a ST/LD dependency circuit 106, and an optional store validation circuit 108. Load/store dependency table 100 is coupled to receive PCs of dispatching instruction operations from decode units 24, and is further coupled to receive a load PC and delta R# from execution pipeline 72. Additionally, load/store dependency table 100 is coupled to LSDT control circuit 104 and is coupled to provide valid indications to ST/LD dependency circuit 106 and delta R#s to adder circuit 102. Adder circuit 102 is further coupled to receive R#s of the dispatching instruction operations from R# assign unit 64. Adder circuit 102 is coupled to provide store R#s to dependency unit 62 and to store validation circuit 108, which is coupled to receive a valid store R# indication from dependency unit 62. Store validation circuit 108 is coupled to provide store valid signals to ST/LD dependency circuit 106, which is further coupled to receive load/store indications corresponding to the dispatching instruction operations from decode units 24. ST/LD dependency circuit 106 is coupled to provide dependency valid signals and depend all signals to dependency unit 62. LSDT control circuit 104 is coupled to receive train/untrain signals from execution pipeline 72.
Generally, STLF predictor 60b may respond to a dispatching load as follows. The load PC is used to index into load/store dependency table 100, thereby selecting one of multiple entries. The selected entry stores a valid indication and a delta R#. The valid indication indicates whether or not STLF predictor 60b has been trained with information regarding a load having the indexing PC, and thus whether or not the delta R# is valid. The delta R# is the difference between the R# of the load and the R# of a store which interfered with the load during a previous execution. Since instruction sequences typically do not change during execution, the difference between the R# of the load and the R# of the store during the present execution may typically be the same as the difference during the previous execution. Adder circuit 102 adds the delta R# to the R# assigned to the dispatching load to generate a store R#, which is provided to dependency unit 62. Dependency unit 62 may then record a dependency for the load on the store. In this manner, the load is prevented from scheduling (and thus executing) ahead of the store. Accordingly, the interference may be avoided during the present execution. More particularly, as instruction operations are dispatched, the PCs of the instruction operations are used to index into load/store dependency table 100. The remainder of this discussion will focus on the response of STLF predictor 60b to one input PC corresponding to one dispatching instruction operation, unless otherwise noted. However, it is noted that STLF predictor 60b may respond in parallel to each PC of each dispatching instruction operation. Load/store dependency table 100 outputs a delta R# and valid indication from the selected entry. Adder 102 adds the delta R# to the R# corresponding to the dispatching instruction operation and thus generates a store R#, which is conveyed to dependency unit 62. It is noted that adder circuit 102 may include an adder for each dispatching instruction operation, receiving the corresponding delta R# output from load/store dependency table 100 and the R# assigned to that dispatching instruction operation by R# assign unit 64.
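The delta R# scheme can be sketched as follows. This is illustrative only: the table size follows the 1K-entry implementation mentioned later, addition is used (the text notes add or subtract is a design choice), and the store bit vector check models the optional store validation circuit 108.

```python
LSDT_INDEX_BITS = 10

def stlf_lookup_60b(lsdt, load_pc, load_rnum, store_bits):
    """lsdt: list of (valid, delta_rnum) tuples. store_bits: per-R#
    flags marking outstanding stores (the store bit vector from
    dependency unit 62). Returns the predicted store R#, or None."""
    valid, delta = lsdt[load_pc & ((1 << LSDT_INDEX_BITS) - 1)]
    if not valid:
        return None
    # Adder circuit 102: combine the delta with the load's R#.
    store_rnum = (load_rnum + delta) % len(store_bits)
    # Optional store validation (circuit 108): record the dependency
    # only if the generated R# really is an outstanding store.
    return store_rnum if store_bits[store_rnum] else None
```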
ST/LD dependency circuit 106 receives the valid indication and an indication of whether or not the instruction operation is a load or a store from decode units 24. ST/LD dependency circuit 106 provides a dependency valid signal to dependency unit 62, similar to ST/LD dependency circuit 86 above. ST/LD dependency circuit 106 may assert the dependency valid signal if: (i) the instruction operation is a load (as determined from the load/store indications provided by decode units 24); and (ii) the valid indication from the indexed entry indicates valid. Additionally, in one embodiment, STLF predictor 60b may employ store validation circuit 108. Store validation circuit 108 receives an indication of which R#s correspond to outstanding stores from dependency unit 62. In one embodiment, the indication may be a bit vector having one bit per R#, indicating whether or not the R# corresponds to a store. Store validation circuit 108 determines whether or not the R# generated by adder circuit 102 corresponds to a store, and signals ST/LD dependency circuit 106 with the store valid signal. If the store valid signal is asserted, the generated R# corresponds to a store. On the other hand, if the store valid signal is deasserted, the generated R# does not correspond to a store. For embodiments employing store validation circuit 108, an additional condition for ST/LD dependency circuit 106 to assert the dependency valid signal is that the store valid signal from store validation circuit 108 is asserted. Again similar to ST/LD dependency circuit 86, ST/LD dependency circuit 106 may be configured to provide the depend all signal in embodiments in which the valid indication is a bit vector. Operation of ST/LD dependency circuit 106 may be similar to ST/LD dependency circuit 86 in this regard.
Load/store dependency table 100 is trained in response to the train/untrain interface from execution pipeline 72. More particularly, if the train signal is asserted by execution pipeline 72, LSDT control circuit 104 causes load/store dependency table 100 to be updated. Execution pipeline 72 provides the PC of the load to be trained (LPC in Fig. 5) and the corresponding delta R# as input to load/store dependency table 100. Load/store dependency table 100 updates the entry indexed by the load PC with the delta R#, and LSDT control circuit 104 places the valid indication into a valid state. In one embodiment, the valid indication may be a bit, and the valid state may be set (or clear) and the invalid state clear (or set). In another embodiment as described above, the valid indication may be a bit vector. In such an embodiment, LSDT control circuit 104 may select a bit within the bit vector and place that bit in the valid state during training.
Additionally, LSDT control circuit 104 may untrain an entry in response to the assertion of the untrain signal by execution pipeline 72. Again, execution pipeline 72 may provide the load PC of the load to be untrained, but the delta R# may be a don't care in the untraining case. Load/store dependency table 100 indexes the entry indicated by the load PC, and LSDT control circuit 104 causes the valid indication in the indexed entry to be placed in the invalid state. In an embodiment employing a valid bit as a valid indication, the bit may be cleared (or set) to indicate invalid. In an embodiment employing the above-described bit vector, a selected bit may be placed in the invalid state. The entry may still remain valid in the bit vector case if other bits remain in the valid state. However, multiple untrain events may eventually cause each of the other bits to become invalid as well.
As mentioned above with respect to Fig. 4, while PCs have been described as indexing load/store dependency table 100, various embodiments may index with only a portion of the PCs. The portion used to index load/store dependency table 100 may be determined by the number of entries employed within the table. For example, in one particular implementation load/store dependency table 100 may be 1K entries, and thus 10 bits of the PC may be used as an index (e.g. the least significant 10 bits). The number of entries may generally be selected as a design choice based, in part, on the area occupied by the table versus the accuracy of the table in general for the loads in targeted software.
It is noted that, while in the present embodiment the delta R# is provided to STLF predictor 60b during training, other embodiments may provide the load and store R#s, and the delta R# may be calculated in STLF predictor 60b. Furthermore, embodiments may either add or subtract the delta R# and the R# of the load to generate the R# of the store. Still further, an alternative configuration for store validation circuit 108 may be to look up the store R# generated by adder circuit 102 in scheduler 36 to determine if the instruction operation is a store.

Turning now to Fig. 6, a flowchart is shown illustrating operation of one embodiment of execution pipeline 72 with respect to load memory operations. Other embodiments are possible and contemplated. While the steps shown in Fig. 6 are illustrated in a particular order for ease of understanding, any suitable order may be used. Particularly, steps may be performed in parallel by combinatorial logic within execution pipeline 72. Still further, various steps may be performed at different stages within execution pipeline 72. Information regarding other steps may be pipelined to the stages at which those steps are performed.
Execution pipeline 72 determines if a load has been scheduled for execution (decision block 110). If a load is not scheduled, then no training operations are possible in this embodiment. If a load is scheduled, execution pipeline 72 determines if the load was retried due to a hit in physical address buffer 70 (decision block 112). More particularly, execution pipeline 72 may examine the retry indication from the scheduler buffer entry allocated to the load. If the load was retried due to a physical address buffer hit, the execution pipeline 72 asserts the train signal to STLF predictor 60 and provides the load PC and store ID of the load and store to be trained into STLF predictor 60 (block 114).
On the other hand, if the load was not retried due to a physical address buffer hit, execution pipeline 72 determines if the load received a dependency on a store due to operation of STLF predictor 60 (decision block 116). In other words, execution pipeline 72 determines if the train indication in the scheduler buffer entry allocated to the load indicates that the load was trained. If the load was trained, execution pipeline 72 determines if data is forwarded from the store queue for the load (decision block 118). If data is not forwarded, it is likely that the load would not have been interfered with by a store. Accordingly, in this case, execution pipeline 72 may assert the untrain signal to STLF predictor 60 and provide the load PC of the load for untraining (block 120). It is noted that training may also be performed during execution of a store which interferes with a load, rather than during the reexecution of the load due to the retry.
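The Fig. 6 decision flow condenses to a few lines. The sketch below is illustrative; the `LoadStatus` field names are assumptions standing in for the retry and train indications in the scheduler buffer entry, and the block numbers in the comments refer to Fig. 6.

```python
from dataclasses import dataclass

@dataclass
class LoadStatus:
    pc: int
    store_id: int                 # store PC, R#, or delta R#
    retried_on_pa_hit: bool       # retry indication in scheduler entry
    trained: bool                 # train indication in scheduler entry
    forwarded: bool               # data forwarded from store queue

def on_load_execution(load: LoadStatus, predictor) -> None:
    # Condensed Fig. 6 decision flow (illustrative only).
    if load.retried_on_pa_hit:
        predictor.train(load.pc, load.store_id)   # block 114
    elif load.trained and not load.forwarded:
        predictor.untrain(load.pc)                # block 120
```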
Turning now to Fig. 7, a block diagram of a portion of one embodiment of an LSDT control circuit 130 is shown. LSDT control circuit 130 may be used as LSDT control circuit 84 and/or LSDT control circuit 104, in various embodiments. Other embodiments are possible and contemplated. In the embodiment of Fig. 7, LSDT control circuit 130 includes a control circuit 132 and a counter circuit 134 coupled to the control circuit. Control circuit 132 is coupled to receive the train and untrain signals from execution pipeline 72, and is coupled to provide Set_V[3:0] signals and Clear_V[3:0] signals to load/store dependency table 80 or 100 (depending upon the embodiment).
LSDT control circuit 130 is configured to manage the valid indications in the load/store dependency table during training and untraining for embodiments in which the valid indications are bit vectors. In the present embodiment, each bit in the bit vector is in the valid state if set and in the invalid state if clear, although alternative embodiments may have each bit in the bit vector in the valid state if clear and the invalid state if set. Still other embodiments may encode valid states in the bits.
If an entry is being trained, control circuit 132 selects a bit in the bit vector to set responsive to the value maintained by counter circuit 134. Similarly, if an entry is being untrained, control circuit 132 selects a bit in the bit vector to clear responsive to the value maintained by counter circuit 134. Each value of the counter circuit 134 selects one of the bits in the bit vector. Counter circuit 134 includes a counter register and an incrementor which increments the value in the counter register. Thus, counter circuit 134 increments each clock cycle. Accordingly, the selected bit for a given training or untraining may be pseudo-random in the present embodiment. In the present embodiment, valid indications are 4-bit vectors. Accordingly, one signal within Set_V[3:0] and Clear_V[3:0] corresponds to each bit in the vector. If an entry is being trained, control circuit 132 asserts the Set_V[3:0] signal corresponding to the bit selected based on counter circuit 134. In response, load/store dependency table 80 or 100 sets the corresponding bit in the bit vector of the indexed entry. On the other hand, if an entry is being untrained, control circuit 132 asserts the Clear_V[3:0] signal corresponding to the bit selected based on counter circuit 134. In response, load/store dependency table 80 or 100 clears the corresponding bit in the bit vector of the indexed entry. Control circuit 132 may also provide a write enable signal to enable updating of the indexed entry, if desired.
Rather than incrementing the count each clock cycle, alternative configurations may increment the count after each train or untrain event, if desired. Still further, alternative configurations may select a bit which is in the invalid state to change to the valid state during training, and may select a bit which is in the valid state to change to invalid during untraining.
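The Fig. 7 selection mechanism reduces to the following sketch (illustrative only; the class name is an assumption, and the 4-bit vector matches the present embodiment).

```python
class LSDTValidControl:
    """Sketch of the Fig. 7 mechanism: a free-running counter picks
    which of the four valid bits Set_V/Clear_V will touch."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # Incremented every clock cycle in the present embodiment, so
        # the selected bit is effectively pseudo-random.
        self.counter = (self.counter + 1) % 4

    def train(self, bit_vector):
        bit_vector[self.counter] = 1   # assert Set_V[counter]

    def untrain(self, bit_vector):
        bit_vector[self.counter] = 0   # assert Clear_V[counter]
```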
Turning now to Fig. 8, a block diagram of a portion of one embodiment of dependency unit 62 is shown. Other embodiments are possible and contemplated. The portion illustrated in Fig. 8 may be related to maintaining a store bit vector indicating outstanding stores. Other portions (not shown) may be configured to record dependencies for instruction operations for dispatch to scheduler 36. In the embodiment of Fig. 8, dependency unit 62 includes a control circuit 140 and a bit vector storage 142 coupled to control circuit 140. Control circuit 140 is further coupled to receive an indication of the load/store nature of dispatching instruction operations from decode units 24 and assigned R#s from R# assign unit 64. Additionally, control circuit 140 is coupled to receive retired R#s and an abort indication from scheduler 36. The store bit vector from bit vector storage 142 is conveyed to store validation circuit 108.
Generally, as instruction operations are dispatched, control circuit 140 receives indications of the store memory operations from decode units 24. The corresponding R#s are provided from R# assign unit 64. The store bit vector in bit vector storage 142 includes a bit for each R#. Control circuit 140 sets the bits in the store bit vector which correspond to dispatching stores. Similarly, as stores are retired by scheduler 36 and indicated via the retire R#s, control circuit 140 resets the corresponding bits in the store bit vector. Finally, if an abort is signalled, control circuit 140 resets the bits of the aborted stores. In one embodiment, aborts may be signalled when the instruction operation causing the abort is retired. Thus, the abort indication may be a signal used to clear the store bit vector. In other embodiments, the abort indication may identify the R# of the aborting instruction, and only younger stores may be aborted.
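The bit vector maintenance can be modeled in a few lines. This sketch models the embodiment in which the abort indication clears the entire vector; the method names are assumptions.

```python
class StoreBitVector:
    """Sketch of the Fig. 8 store bit vector: one bit per R#, set at
    store dispatch and cleared at retirement or abort."""

    def __init__(self, num_rnums):
        self.bits = [0] * num_rnums

    def dispatch(self, rnum, is_store):
        if is_store:
            self.bits[rnum] = 1          # mark outstanding store

    def retire(self, rnum):
        self.bits[rnum] = 0              # store retired by scheduler

    def abort(self):
        # Models the embodiment in which the abort signal clears the
        # whole vector (aborts signalled at retirement of the aborting op).
        self.bits = [0] * len(self.bits)
```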
As used herein, the term "control circuit" refers to circuitry which operates on inputs to produce outputs as described. Generally, a control circuit may include any combination of combinatorial logic (static or dynamic), state machines, custom circuitry, and clocked storage devices (such as flops, registers, etc.).

Computer Systems
Turning now to Fig. 9, a block diagram of one embodiment of a computer system 200 including processor 10 coupled to a variety of system components through a bus bridge 202 is shown. Other embodiments are possible and contemplated. In the depicted system, a main memory 204 is coupled to bus bridge 202 through a memory bus 206, and a graphics controller 208 is coupled to bus bridge 202 through an AGP bus 210. Finally, a plurality of PCI devices 212A-212B are coupled to bus bridge 202 through a PCI bus 214. A secondary bus bridge 216 may further be provided to accommodate an electrical interface to one or more EISA or ISA devices 218 through an EISA/ISA bus 220. Processor 10 is coupled to bus bridge 202 through a CPU bus 224 and to an optional L2 cache 228. Together, CPU bus 224 and the interface to L2 cache 228 may comprise external interface 52.
Bus bridge 202 provides an interface between processor 10, main memory 204, graphics controller 208, and devices attached to PCI bus 214. When an operation is received from one of the devices connected to bus bridge 202, bus bridge 202 identifies the target of the operation (e.g. a particular device or, in the case of PCI bus 214, that the target is on PCI bus 214). Bus bridge 202 routes the operation to the targeted device. Bus bridge 202 generally translates an operation from the protocol used by the source device or bus to the protocol used by the target device or bus.
In addition to providing an interface to an ISA/EISA bus for PCI bus 214, secondary bus bridge 216 may further incorporate additional functionality, as desired. An input/output controller (not shown), either external from or integrated with secondary bus bridge 216, may also be included within computer system 200 to provide operational support for a keyboard and mouse 222 and for various serial and parallel ports, as desired. An external cache unit (not shown) may further be coupled to CPU bus 224 between processor 10 and bus bridge 202 in other embodiments. Alternatively, the external cache may be coupled to bus bridge 202 and cache control logic for the external cache may be integrated into bus bridge 202. L2 cache 228 is further shown in a backside configuration to processor 10. It is noted that L2 cache 228 may be separate from processor 10, integrated into a cartridge (e.g. slot 1 or slot A) with processor 10, or even integrated onto a semiconductor substrate with processor 10.
Main memory 204 is a memory in which application programs are stored and from which processor 10 primarily executes. A suitable main memory 204 comprises DRAM (Dynamic Random Access Memory). For example, a plurality of banks of SDRAM (Synchronous DRAM) or Rambus DRAM (RDRAM) may be suitable. PCI devices 212A-212B are illustrative of a variety of peripheral devices such as, for example, network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards. Similarly, ISA device 218 is illustrative of various types of peripheral devices, such as a modem, a sound card, and a variety of data acquisition cards such as GPIB or field bus interface cards. Graphics controller 208 is provided to control the rendering of text and images on a display 226.
Graphics controller 208 may embody a typical graphics accelerator generally known in the art to render three-dimensional data structures which can be effectively shifted into and from main memory 204. Graphics controller 208 may therefore be a master of AGP bus 210 in that it can request and receive access to a target interface within bus bridge 202 to thereby obtain access to main memory 204. A dedicated graphics bus accommodates rapid retrieval of data from main memory 204. For certain operations, graphics controller 208 may further be configured to generate PCI protocol transactions on AGP bus 210. The AGP interface of bus bridge 202 may thus include functionality to support both AGP protocol transactions as well as PCI protocol target and initiator transactions. Display 226 is any electronic display upon which an image or text can be presented. A suitable display 226 includes a cathode ray tube ("CRT"), a liquid crystal display ("LCD"), etc.
It is noted that, while the AGP, PCI, and ISA or EISA buses have been used as examples in the above description, any bus architectures may be substituted as desired. It is further noted that computer system 200 may be a multiprocessing computer system including additional processors (e.g. processor 10a shown as an optional component of computer system 200). Processor 10a may be similar to processor 10. More particularly, processor 10a may be an identical copy of processor 10. Processor 10a may be connected to bus bridge 202 via an independent bus (as shown in Fig. 9) or may share CPU bus 224 with processor 10. Furthermore, processor 10a may be coupled to an optional L2 cache 228a similar to L2 cache 228.
Turning now to Fig. 10, another embodiment of a computer system 300 is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 10, computer system 300 includes several processing nodes 312A, 312B, 312C, and 312D. Each processing node is coupled to a respective memory 314A-314D via a memory controller 316A-316D included within each respective processing node 312A-312D. Additionally, processing nodes 312A-312D include interface logic used to communicate between the processing nodes 312A-312D. For example, processing node 312A includes interface logic 318A for communicating with processing node 312B, interface logic 318B for communicating with processing node 312C, and a third interface logic 318C for communicating with yet another processing node (not shown). Similarly, processing node 312B includes interface logic 318D, 318E, and 318F; processing node 312C includes interface logic 318G, 318H, and 318I; and processing node 312D includes interface logic 318J, 318K, and 318L. Processing node 312D is coupled to communicate with a plurality of input/output devices (e.g. devices 320A-320B in a daisy chain configuration) via interface logic 318L. Other processing nodes may communicate with other I/O devices in a similar fashion. Processing nodes 312A-312D implement a packet-based link for inter-processing node communication.
In the present embodiment, the link is implemented as sets of unidirectional lines (e.g. lines 324A are used to transmit packets from processing node 312A to processing node 312B and lines 324B are used to transmit packets from processing node 312B to processing node 312A). Other sets of lines 324C-324H are used to transmit packets between other processing nodes as illustrated in Fig. 10. Generally, each set of lines 324 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed. The link may be operated in a cache coherent fashion for communication between processing nodes or in a noncoherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction such as the PCI bus or ISA bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown. It is noted that a packet to be transmitted from one processing node to another may pass through one or more intermediate nodes. For example, a packet transmitted by processing node 312A to processing node 312D may pass through either processing node 312B or processing node 312C as shown in Fig. 10. Any suitable routing algorithm may be used. Other embodiments of computer system 300 may include more or fewer processing nodes than the embodiment shown in Fig. 10. Generally, the packets may be transmitted as one or more bit times on the lines 324 between nodes. A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines. The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.

Processing nodes 312A-312D, in addition to a memory controller and interface logic, may include one or more processors. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. More particularly, a processing node 312A-312D may comprise processor 10. External interface unit 46 may include the interface logic 318 within the node, as well as the memory controller 316.

Memories 314A-314D may comprise any suitable memory devices. For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), static RAM, etc. The address space of computer system 300 is divided among memories 314A-314D. Each processing node 312A-312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed. In one embodiment, the coherency point for an address within computer system 300 is the memory controller 316A-316D coupled to the memory storing bytes corresponding to the address. In other words, the memory controller 316A-316D is responsible for ensuring that each memory access to the corresponding memory 314A-314D occurs in a cache coherent fashion. Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D. Additionally, memory controllers 316A-316D may include request queues for queuing memory requests.
Generally, interface logic 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link. Computer system 300 may employ any suitable flow control mechanism for transmitting packets. For example, in one embodiment, each interface logic 318 stores a count of the number of each type of buffer within the receiver at the other end of the link to which that interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed. Such a mechanism may be referred to as a "coupon-based" system.
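The coupon-based scheme amounts to credit accounting at the sender. The sketch below is illustrative; the class name and the per-type buffer counts are assumptions, not values from the text.

```python
class LinkSender:
    """Sketch of the "coupon-based" flow control described above: the
    sender tracks free receiver buffers per packet type."""

    def __init__(self, free_buffers):
        # e.g. {"command": 4, "probe": 2, "response": 4} (assumed counts)
        self.credits = dict(free_buffers)

    def try_send(self, packet_type):
        # Transmit only if the receiver has a free buffer of this type.
        if self.credits[packet_type] == 0:
            return False
        self.credits[packet_type] -= 1
        return True

    def buffer_freed(self, packet_type):
        # Receiver routed a packet onward and signalled the free buffer.
        self.credits[packet_type] += 1
```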
I/O devices 320A-320B may be any suitable I/O devices. For example, I/O devices 320A-320B may include network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, modems, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
INDUSTRIAL APPLICABILITY

This invention may be applicable to processors and computer systems.

Claims

WHAT IS CLAIMED IS:
1. A processor (10) comprising: a store to load forward (STLF) predictor (60) coupled to receive an indication of dispatch of a first load memory operation, wherein said STLF predictor (60) is configured to indicate a dependency of said first load memory operation on a first store memory operation responsive to information stored within said STLF predictor (60) indicating that, during a previous execution, said first store memory operation interfered with said first load memory operation; and an execution pipeline (72) coupled to said STLF predictor (60), wherein said execution pipeline (72) is configured to inhibit execution of said first load memory operation prior to said first store memory operation responsive to said dependency, and wherein said execution pipeline (72) is configured to detect a lack of said dependency during execution of said first load memory operation, and wherein said execution pipeline (72) is configured to generate an untrain signal responsive to said lack of said dependency, wherein said STLF predictor (60) is coupled to receive said untrain signal and is configured to update said information stored therein to not indicate that said first store memory operation interfered with said first load memory operation during said previous execution.
2. The processor as recited in claim 1 wherein said information includes a valid indication corresponding to said first load memory operation, and wherein said STLF predictor (60) is configured to place said valid indication in an invalid state to update said information.
3. The processor as recited in claim 1 wherein said information includes a valid indication corresponding to said first load memory operation, and wherein said valid indication comprises a bit vector, and wherein said valid indication is in a valid state if at least one bit is in a valid state, and wherein said STLF predictor (60) is configured to place a first bit of said bit vector into an invalid state to update said information.
4. The processor as recited in claim 1 wherein said first store memory operation interferes with said first load memory operation if said first load memory operation is scheduled prior to scheduling said first store memory operation and said first load memory operation is dependent on said first store memory operation.
5. The processor as recited in claim 1 wherein said execution pipeline (72) includes a store queue (68), and wherein said execution pipeline (72) is configured to detect said lack of dependency if forwarding of data from said store queue (68) for said first load memory operation does not occur.
6. A method comprising: indicating a dependency of a first load memory operation on a first store memory operation responsive to information indicating that, during a previous execution, said first store memory operation interfered with said first load memory operation; inhibiting scheduling of said first load memory operation prior to scheduling said first store memory operation; detecting a lack of said dependency during execution of said first load memory operation; and updating said information indicating that, during said previous execution, said first store memory operation interfered with said first load memory operation to not indicate that, during said previous execution, said first store memory operation interfered with said first load memory operation responsive to said detecting.
7. The method as recited in claim 6 wherein said information includes a valid indication corresponding to said first load memory operation, and wherein said updating comprises placing said valid indication in an invalid state.
8. The method as recited in claim 6 wherein said information includes a valid indication corresponding to said first load memory operation, and wherein said valid indication comprises a bit vector, and wherein said indicating is performed if at least one bit in said bit vector is in a valid state, and wherein said updating comprises selecting a first bit of said bit vector and placing said first bit in an invalid state.
9. The method as recited in claim 6 wherein said detecting comprises detecting a lack of forwarding of data from a store queue (68) for said first load memory operation.
10. A store to load forward (STLF) predictor configured to indicate a dependency of a first load memory operation on a first store memory operation.
PCT/US2000/021752 2000-01-03 2000-08-08 Store to load forwarding predictor with untraining WO2001050252A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00951015A EP1244961B1 (en) 2000-01-03 2000-08-08 Store to load forwarding predictor with untraining
DE60009151T DE60009151T2 (en) 2000-01-03 2000-08-08 FORECASTING DATA TRANSPORT FROM STORAGE TO LOADING COMMAND WITH UNTRAINING
JP2001550545A JP4920156B2 (en) 2000-01-03 2000-08-08 Store-load transfer predictor with untraining

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/476,937 2000-01-03
US09/476,937 US6651161B1 (en) 2000-01-03 2000-01-03 Store load forward predictor untraining

Publications (1)

Publication Number Publication Date
WO2001050252A1 true WO2001050252A1 (en) 2001-07-12

Family

ID=23893860

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/021752 WO2001050252A1 (en) 2000-01-03 2000-08-08 Store to load forwarding predictor with untraining

Country Status (7)

Country Link
US (1) US6651161B1 (en)
EP (1) EP1244961B1 (en)
JP (1) JP4920156B2 (en)
KR (1) KR100764920B1 (en)
CN (1) CN1209706C (en)
DE (1) DE60009151T2 (en)
WO (1) WO2001050252A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2660716A1 (en) * 2012-05-04 2013-11-06 Apple Inc. Load-store dependency predictor content management
WO2013181012A1 (en) * 2012-05-30 2013-12-05 Apple Inc. Load-store dependency predictor using instruction address hashing
WO2013188701A1 (en) 2012-06-15 2013-12-19 Soft Machines, Inc. A method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
US9158691B2 (en) 2012-12-14 2015-10-13 Apple Inc. Cross dependency checking logic
US9710268B2 (en) 2014-04-29 2017-07-18 Apple Inc. Reducing latency for pointer chasing loads
US9904552B2 (en) 2012-06-15 2018-02-27 Intel Corporation Virtual load store queue having a dynamic dispatch window with a distributed structure
US9965277B2 (en) 2012-06-15 2018-05-08 Intel Corporation Virtual load store queue having a dynamic dispatch window with a unified structure
US9990198B2 (en) 2012-06-15 2018-06-05 Intel Corporation Instruction definition to implement load store reordering and optimization
US10019263B2 (en) 2012-06-15 2018-07-10 Intel Corporation Reordered speculative instruction sequences with a disambiguation-free out of order load store queue
US10048964B2 (en) 2012-06-15 2018-08-14 Intel Corporation Disambiguation-free out of order load store queue
US10437595B1 (en) 2016-03-15 2019-10-08 Apple Inc. Load/store dependency predictor optimization for replayed loads
US10514925B1 (en) 2016-01-28 2019-12-24 Apple Inc. Load speculation recovery
US11048506B2 (en) 2016-08-19 2021-06-29 Advanced Micro Devices, Inc. Tracking stores and loads by bypassing load store units
US11113056B2 (en) * 2019-11-27 2021-09-07 Advanced Micro Devices, Inc. Techniques for performing store-to-load forwarding

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7739483B2 (en) * 2001-09-28 2010-06-15 Intel Corporation Method and apparatus for increasing load bandwidth
US7165167B2 (en) * 2003-06-10 2007-01-16 Advanced Micro Devices, Inc. Load store unit with replay mechanism
US7321964B2 (en) * 2003-07-08 2008-01-22 Advanced Micro Devices, Inc. Store-to-load forwarding buffer using indexed lookup
US7376817B2 (en) * 2005-08-10 2008-05-20 P.A. Semi, Inc. Partial load/store forward prediction
US7590825B2 (en) * 2006-03-07 2009-09-15 Intel Corporation Counter-based memory disambiguation techniques for selectively predicting load/store conflicts
US8099582B2 (en) * 2009-03-24 2012-01-17 International Business Machines Corporation Tracking deallocated load instructions using a dependence matrix
US20100306509A1 (en) * 2009-05-29 2010-12-02 Via Technologies, Inc. Out-of-order execution microprocessor with reduced store collision load replay reduction
US9405542B1 (en) * 2012-04-05 2016-08-02 Marvell International Ltd. Method and apparatus for updating a speculative rename table in a microprocessor
US10114794B2 (en) * 2014-12-14 2018-10-30 Via Alliance Semiconductor Co., Ltd Programmable load replay precluding mechanism
WO2016097797A1 (en) * 2014-12-14 2016-06-23 Via Alliance Semiconductor Co., Ltd. Load replay precluding mechanism
US9606805B1 (en) * 2015-10-19 2017-03-28 International Business Machines Corporation Accuracy of operand store compare prediction using confidence counter
US9996356B2 (en) * 2015-12-26 2018-06-12 Intel Corporation Method and apparatus for recovering from bad store-to-load forwarding in an out-of-order processor
US10331357B2 (en) * 2016-08-19 2019-06-25 Advanced Micro Devices, Inc. Tracking stores and loads by bypassing load store units
US10684859B2 (en) * 2016-09-19 2020-06-16 Qualcomm Incorporated Providing memory dependence prediction in block-atomic dataflow architectures
US11243774B2 (en) 2019-03-20 2022-02-08 International Business Machines Corporation Dynamic selection of OSC hazard avoidance mechanism
US10929142B2 (en) * 2019-03-20 2021-02-23 International Business Machines Corporation Making precise operand-store-compare predictions to avoid false dependencies

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4521851A (en) 1982-10-13 1985-06-04 Honeywell Information Systems Inc. Central processor
US4594660A (en) 1982-10-13 1986-06-10 Honeywell Information Systems Inc. Collector
US5488729A (en) 1991-05-15 1996-01-30 Ross Technology, Inc. Central processing unit architecture with symmetric instruction scheduling to achieve multiple instruction launch and execution
JPH0820949B2 (en) 1991-11-26 1996-03-04 Matsushita Electric Industrial Co., Ltd. Information processing device
US5467473A (en) 1993-01-08 1995-11-14 International Business Machines Corporation Out of order instruction load and store comparison
EP0651321B1 (en) 1993-10-29 2001-11-14 Advanced Micro Devices, Inc. Superscalar microprocessors
US5465336A (en) 1994-06-30 1995-11-07 International Business Machines Corporation Fetch and store buffer that enables out-of-order execution of memory instructions in a data processing system
US5625789A (en) 1994-10-24 1997-04-29 International Business Machines Corporation Apparatus for source operand dependency analyses register renaming and rapid pipeline recovery in a microprocessor that issues and executes multiple instructions out-of-order in a single cycle
US5717883A (en) 1995-06-28 1998-02-10 Digital Equipment Corporation Method and apparatus for parallel execution of computer programs using information providing for reconstruction of a logical sequential program
US5710902A (en) 1995-09-06 1998-01-20 Intel Corporation Instruction dependency chain identifier
US5835747A (en) * 1996-01-26 1998-11-10 Advanced Micro Devices, Inc. Hierarchical scan logic for out-of-order load/store execution control
US5799165A (en) 1996-01-26 1998-08-25 Advanced Micro Devices, Inc. Out-of-order processing that removes an issued operation from an execution pipeline upon determining that the operation would cause a lengthy pipeline delay
US5748978A (en) 1996-05-17 1998-05-05 Advanced Micro Devices, Inc. Byte queue divided into multiple subqueues for optimizing instruction selection logic
US6016540A (en) 1997-01-08 2000-01-18 Intel Corporation Method and apparatus for scheduling instructions in waves
US5923862A (en) 1997-01-28 1999-07-13 Samsung Electronics Co., Ltd. Processor that decodes a multi-cycle instruction into single-cycle micro-instructions and schedules execution of the micro-instructions
US5996068A (en) 1997-03-26 1999-11-30 Lucent Technologies Inc. Method and apparatus for renaming registers corresponding to multiple thread identifications
US5850533A (en) 1997-06-25 1998-12-15 Sun Microsystems, Inc. Method for enforcing true dependencies in an out-of-order processor
US6108770A (en) * 1998-06-24 2000-08-22 Digital Equipment Corporation Method and apparatus for predicting memory dependence using store sets
US6212623B1 (en) 1998-08-24 2001-04-03 Advanced Micro Devices, Inc. Universal dependency vector/queue entry
US6122727A (en) 1998-08-24 2000-09-19 Advanced Micro Devices, Inc. Symmetrical instructions queue for high clock frequency scheduling
US6212622B1 (en) 1998-08-24 2001-04-03 Advanced Micro Devices, Inc. Mechanism for load block on store address generation
US6266744B1 (en) * 1999-05-18 2001-07-24 Advanced Micro Devices, Inc. Store to load forwarding using a dependency link file
US6393536B1 (en) * 1999-05-18 2002-05-21 Advanced Micro Devices, Inc. Load/store unit employing last-in-buffer indication for rapid load-hit-store
US6502185B1 (en) * 2000-01-03 2002-12-31 Advanced Micro Devices, Inc. Pipeline elements which verify predecode information
US6542984B1 (en) * 2000-01-03 2003-04-01 Advanced Micro Devices, Inc. Scheduler capable of issuing and reissuing dependency chains

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5619662A (en) * 1992-11-12 1997-04-08 Digital Equipment Corporation Memory reference tagging
EP0709770A2 (en) * 1994-10-24 1996-05-01 International Business Machines Corporation Apparatus to control load/store instructions
US5781752A (en) * 1996-12-26 1998-07-14 Wisconsin Alumni Research Foundation Table based data speculation circuit for parallel processing computer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRYSOS G Z ET AL: "MEMORY DEPENDENCE PREDICTION USING STORE SETS", ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE,US,LOS ALAMITOS, CA: IEEE COMPUTER SOC, 27 June 1998 (1998-06-27), pages 142 - 153, XP000849912, ISBN: 0-8186-8492-5 *
MOSHOVOS A ET AL: "DYNAMIC SPECULATION AND SYNCHRONIZATION OF DATA DEPENDENCES", ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE,US,NEW YORK, ACM, vol. CONF. 24, 2 June 1997 (1997-06-02), pages 181 - 193, XP000738156, ISBN: 0-7803-4175-9 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2660716A1 (en) * 2012-05-04 2013-11-06 Apple Inc. Load-store dependency predictor content management
US9128725B2 (en) 2012-05-04 2015-09-08 Apple Inc. Load-store dependency predictor content management
US9600289B2 (en) 2012-05-30 2017-03-21 Apple Inc. Load-store dependency predictor PC hashing
WO2013181012A1 (en) * 2012-05-30 2013-12-05 Apple Inc. Load-store dependency predictor using instruction address hashing
US10019263B2 (en) 2012-06-15 2018-07-10 Intel Corporation Reordered speculative instruction sequences with a disambiguation-free out of order load store queue
US9990198B2 (en) 2012-06-15 2018-06-05 Intel Corporation Instruction definition to implement load store reordering and optimization
US10592300B2 (en) 2012-06-15 2020-03-17 Intel Corporation Method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
US10048964B2 (en) 2012-06-15 2018-08-14 Intel Corporation Disambiguation-free out of order load store queue
US9904552B2 (en) 2012-06-15 2018-02-27 Intel Corporation Virtual load store queue having a dynamic dispatch window with a distributed structure
US9928121B2 (en) 2012-06-15 2018-03-27 Intel Corporation Method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
US9965277B2 (en) 2012-06-15 2018-05-08 Intel Corporation Virtual load store queue having a dynamic dispatch window with a unified structure
EP2862084A4 (en) * 2012-06-15 2016-11-30 Soft Machines Inc A method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
WO2013188701A1 (en) 2012-06-15 2013-12-19 Soft Machines, Inc. A method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
US9158691B2 (en) 2012-12-14 2015-10-13 Apple Inc. Cross dependency checking logic
US9710268B2 (en) 2014-04-29 2017-07-18 Apple Inc. Reducing latency for pointer chasing loads
US10514925B1 (en) 2016-01-28 2019-12-24 Apple Inc. Load speculation recovery
US10437595B1 (en) 2016-03-15 2019-10-08 Apple Inc. Load/store dependency predictor optimization for replayed loads
US11048506B2 (en) 2016-08-19 2021-06-29 Advanced Micro Devices, Inc. Tracking stores and loads by bypassing load store units
US11113056B2 (en) * 2019-11-27 2021-09-07 Advanced Micro Devices, Inc. Techniques for performing store-to-load forwarding

Also Published As

Publication number Publication date
EP1244961B1 (en) 2004-03-17
KR100764920B1 (en) 2007-10-09
CN1209706C (en) 2005-07-06
CN1415088A (en) 2003-04-30
DE60009151D1 (en) 2004-04-22
DE60009151T2 (en) 2004-11-11
EP1244961A1 (en) 2002-10-02
JP2003519832A (en) 2003-06-24
US6651161B1 (en) 2003-11-18
KR20020097148A (en) 2002-12-31
JP4920156B2 (en) 2012-04-18

Similar Documents

Publication Publication Date Title
US6622237B1 (en) Store to load forward predictor training using delta tag
EP1244961B1 (en) Store to load forwarding predictor with untraining
US6694424B1 (en) Store load forward predictor training
US6542984B1 (en) Scheduler capable of issuing and reissuing dependency chains
US6481251B1 (en) Store queue number assignment and tracking
US6523109B1 (en) Store queue multimatch detection
EP1228426B1 (en) Store buffer which forwards data based on index and optional way match
US6502185B1 (en) Pipeline elements which verify predecode information
US7213126B1 (en) Method and processor including logic for storing traces within a trace cache
JP5294632B2 (en) A processor with a dependency mechanism that predicts whether a read depends on a previous write
EP2674856B1 (en) Zero cycle load instruction
EP1244962B1 (en) Scheduler capable of issuing and reissuing dependency chains
US6564315B1 (en) Scheduler which discovers non-speculative nature of an instruction after issuing and reissues the instruction
US6622235B1 (en) Scheduler which retries load/store hit situations
US20090077560A1 (en) Strongly-Ordered Processor with Early Store Retirement
US6721877B1 (en) Branch predictor that selects between predictions based on stored prediction selector and branch predictor index generation
US20090198981A1 (en) Data processing system, processor and method of data processing having branch target address cache storing direct predictions
WO2004099977A2 (en) System and method for operation replay within a data-speculative microprocessor
US6704854B1 (en) Determination of execution resource allocation based on concurrently executable misaligned memory operations
US6363471B1 (en) Mechanism for handling 16-bit addressing in a processor
US7043626B1 (en) Retaining flag value associated with dead result data in freed rename physical register with an indicator to select set-aside register instead for renaming
US7321964B2 (en) Store-to-load forwarding buffer using indexed lookup
KR20020039689A (en) Apparatus and method for caching alignment information
US6721876B1 (en) Branch predictor index generation using varied bit positions or bit order reversal
US7555633B1 (en) Instruction cache prefetch based on trace cache eviction

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN JP KR SG

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 EP: The EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2000951015

Country of ref document: EP

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2001 550545

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 008181551

Country of ref document: CN

Ref document number: 1020027008671

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2000951015

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1020027008671

Country of ref document: KR

WWG Wipo information: grant in national office

Ref document number: 2000951015

Country of ref document: EP