TITLE: BRANCH PREDICTION MECHANISM EMPLOYING BRANCH SELECTORS
TO SELECT A BRANCH PREDICTION

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to branch prediction mechanisms within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.

An important feature of a superscalar microprocessor (and a superpipelined microprocessor as well) is its branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction, allowing subsequent instruction fetching to continue within the predicted instruction stream indicated by the branch prediction. A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction, and a target address identifying an instruction stream beginning at an arbitrary location in memory. Unconditional branch instructions always branch to the target address, while conditional branch instructions may select either the sequential or the target address based on the outcome of a prior instruction. Instructions from the predicted instruction stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
In order to be effective, the branch prediction mechanism must be highly accurate such that the predicted instruction stream is correct as often as possible. Typically, increasing the accuracy of the branch prediction mechanism is achieved by increasing the complexity of the branch prediction mechanism. For example, a cache-line based branch prediction scheme may be employed in which branch predictions are stored with a particular cache line of instruction bytes in an instruction cache. A cache line is a number of contiguous bytes which are treated as a unit for allocation and deallocation of storage space within the instruction cache. When the cache line is fetched, the corresponding branch predictions are also fetched. Furthermore, when the particular cache line is discarded, the corresponding branch predictions are discarded as well. The cache line is aligned in memory. A cache-line based branch prediction scheme may be made more accurate by storing a larger number of branch predictions for each cache line. A given cache line may include multiple branch instructions, each of which is represented by a different branch prediction. Therefore, more branch predictions allocated to a cache line allows for more branch instructions to be represented and predicted by the branch prediction mechanism. A branch instruction which cannot be represented within the branch prediction mechanism is not predicted, and subsequently a "misprediction" may be detected if the branch is found to be taken. However, the complexity of the branch prediction mechanism is increased by the need to select between additional branch predictions. As used herein, a "branch prediction" is a value which may be interpreted by the branch prediction mechanism as a prediction of whether a branch instruction is taken or not taken. Furthermore, a branch prediction may include the target address. For cache-line based branch prediction mechanisms, a prediction of a line sequential to the cache line being fetched is a branch prediction when no branch instructions are within the instructions being fetched from the cache line.
A problem related to increasing the complexity of the branch prediction mechanism is that the increased complexity generally requires an increased amount of time to form the branch prediction. For example, selecting among multiple branch predictions may require a substantial amount of time. The offset of the fetch address identifies the first byte being fetched within the cache line: a branch prediction for a branch instruction prior to the offset should not be selected. The offset of the fetch address within the cache line may need to be compared to the offsets of the branch instructions represented by the branch predictions stored for the cache line in order to determine which branch prediction to use. The branch prediction corresponding to a branch instruction subsequent to the fetch address offset, and nearer to the fetch address offset than other branch instructions which are subsequent to the fetch address offset, should be selected. As the number of branch predictions is increased, the complexity (and time required) for the selection logic increases. When the amount of time needed to form a branch prediction for a fetch address exceeds the clock cycle time of the microprocessor, performance of the microprocessor may be decreased. Because the branch prediction cannot be formed in a single clock cycle, "bubbles" are introduced into the instruction processing pipeline during clock cycles in which instructions cannot be fetched due to the lack of a branch prediction corresponding to a previous fetch address. The bubble occupies various stages in the instruction processing pipeline during subsequent clock cycles, and no work occurs at the stage containing the bubble because no instructions are included in the bubble. Performance of the microprocessor may thereby be decreased.
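To make the selection problem concrete, the comparison-based scheme described above can be sketched as follows. The data structure (a mapping from branch byte offsets within the cache line to stored predictions) is illustrative only; a hardware implementation would perform the greater-than/less-than comparisons in parallel rather than in a loop.

```python
def select_prediction(fetch_offset, branch_predictions):
    """Comparison-based selection: compare the fetch address offset
    against the offset of every branch represented in the line's
    stored predictions, and pick the nearest branch at or after the
    fetch offset. Returns None when no such branch remains (i.e. a
    sequential prediction applies). Structures are illustrative.
    """
    best_offset, best_prediction = None, None
    for offset, prediction in branch_predictions.items():
        # A branch prior to the fetch offset is not being fetched.
        if offset < fetch_offset:
            continue
        # Keep the branch nearest to the fetch offset.
        if best_offset is None or offset < best_offset:
            best_offset, best_prediction = offset, prediction
    return best_prediction
```

Each stored prediction requires its own comparison against the fetch offset, which is the source of the selection delay the disclosure addresses.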
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a branch prediction apparatus in accordance with the present invention. The branch prediction apparatus stores multiple branch selectors corresponding to instruction bytes within a cache line of instructions or a portion thereof. The branch selectors identify a branch prediction to be selected if the corresponding instruction byte is the byte indicated by the offset of the fetch address used to fetch the cache line. Instead of comparing pointers to the branch instructions with the offset of the fetch address, the branch prediction is selected simply by decoding the offset of the fetch address and choosing the corresponding branch selector. Advantageously, the branch prediction apparatus may operate at higher frequencies (i.e. shorter clock cycles) than if the pointers to the branch instructions and the fetch address were compared (a greater-than or less-than comparison). The branch selectors directly determine which branch prediction is appropriate according to the instructions being fetched, thereby decreasing the amount of logic employed to select the branch prediction.
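The contrast with comparison-based selection can be sketched as follows. The selector table and prediction storage below are illustrative names, not structures defined by the disclosure; the point is that selection reduces to a single indexed lookup.

```python
def select_prediction(fetch_offset, branch_selectors, predictions):
    """Selector-based selection: one branch selector is stored per
    instruction byte, so choosing a prediction is a direct table
    lookup rather than a set of magnitude comparisons.

    branch_selectors: list with one entry per byte of the group.
    predictions: maps a selector value to a stored prediction.
    """
    selector = branch_selectors[fetch_offset]  # decode offset, pick selector
    return predictions[selector]
```

In hardware this corresponds to decoding the least significant fetch address bits and multiplexing one selector out of the stored set, which is why the scheme tolerates shorter clock cycles.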
Broadly speaking, the present invention contemplates a method for selecting a branch prediction corresponding to a group of contiguous instruction bytes including a plurality of instructions. A plurality of branch selectors are stored in a branch prediction storage, wherein at least one of the plurality of branch selectors corresponds to a first one of the plurality of instructions. The branch selector identifies a particular branch prediction to be selected if the first one of the plurality of instructions is fetched. The group of contiguous instruction bytes is fetched concurrent with fetching the plurality of branch selectors. The fetch address identifies the group of contiguous instruction bytes. One of the plurality of branch selectors is selected in response to the fetch address. The branch prediction is selected in response to the one of the plurality of branch selectors.
The present invention further contemplates a branch prediction apparatus, comprising a branch prediction storage and a selection mechanism. The branch prediction storage is coupled to receive a fetch address corresponding to a group of contiguous instruction bytes being fetched from an instruction cache. The branch prediction storage is configured to store a plurality of branch selectors, wherein at least one of the plurality of branch selectors corresponds to a first instruction within the group of contiguous instruction bytes. The at least one of the plurality of branch selectors identifies a particular branch prediction to be selected if the first instruction is fetched. Coupled to the branch prediction storage to receive the plurality of branch selectors, the selection mechanism is configured to select a particular one of the plurality of branch selectors in response to a plurality of least significant bits of the fetch address used to fetch the group of contiguous instruction bytes.
The present invention still further contemplates a microprocessor comprising an instruction cache and a branch prediction unit. The instruction cache is configured to store a plurality of cache lines of instruction bytes and to provide a group of instruction bytes upon receipt of a fetch address to an instruction processing pipeline of the microprocessor. Coupled to the instruction cache and coupled to receive the fetch address concurrent with the instruction cache, the branch prediction unit is configured to store a plurality of branch selectors with respect to the group of instruction bytes and is configured to select one of the plurality of branch selectors in response to the fetch address. The one of the plurality of branch selectors identifies a branch prediction which is used as a subsequent fetch address by the instruction cache.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of one embodiment of a superscalar microprocessor.

Fig. 2 is a block diagram of one embodiment of a pair of decode units shown in Fig. 1.

Fig. 3 is a diagram of a group of contiguous instruction bytes and a corresponding set of branch selectors.

Fig. 4 is a block diagram of a portion of one embodiment of a branch prediction unit shown in Fig. 1.

Fig. 5 is a diagram of a prediction block for a group of contiguous instruction bytes as stored in the branch prediction unit shown in Fig. 4.

Fig. 6 is a table showing an exemplary encoding of a branch selector.

Fig. 7 is a flowchart depicting steps performed in order to update a set of branch selectors corresponding to a group of contiguous instruction bytes.

Fig. 8 is a first example of updating the set of branch selectors.

Fig. 9 is a second example of updating the set of branch selectors.

Fig. 10 is a third example of updating the set of branch selectors.

Fig. 11 is a block diagram of a computer system including the microprocessor shown in Fig. 1.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
DETAILED DESCRIPTION OF THE INVENTION
Turning now to Fig. 1, a block diagram of one embodiment of a microprocessor 10 is shown. Microprocessor 10 includes a prefetch/predecode unit 12, a branch prediction unit 14, an instruction cache 16, an instruction alignment unit 18, a plurality of decode units 20A-20C, a plurality of reservation stations 22A-22C, a plurality of functional units 24A-24C, a load/store unit 26, a data cache 28, a register file 30, a reorder buffer 32, and an MROM unit 34. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, decode units 20A-20C will be collectively referred to as decode units 20. Prefetch/predecode unit 12 is coupled to receive instructions from a main memory subsystem (not shown), and is further coupled to instruction cache 16 and branch prediction unit 14. Similarly, branch prediction unit 14 is coupled to instruction cache 16. Still further, branch prediction unit 14 is coupled to decode units 20 and functional units 24. Instruction cache 16 is further coupled to MROM unit 34 and instruction alignment unit 18. Instruction alignment unit 18 is in turn coupled to decode units 20. Each decode unit 20A-20C is coupled to load/store unit 26 and to respective reservation stations 22A-22C.
Reservation stations 22A-22C are further coupled to respective functional units 24A-24C. Additionally, decode units 20 and reservation stations 22 are coupled to register file 30 and reorder buffer 32. Functional units 24 are coupled to load/store unit 26, register file 30, and reorder buffer 32 as well. Data cache 28 is coupled to load/store unit 26 and to the main memory subsystem. Finally, MROM unit 34 is coupled to decode units 20.
Generally speaking, branch prediction unit 14 employs a cache-line based branch prediction mechanism for predicting branch instructions. Multiple branch predictions may be stored for each cache line. Additionally, a branch selector is stored for each byte within the cache line. The branch selector for a particular byte indicates which of the branch predictions which may be stored with respect to the cache line is the branch prediction appropriate for an instruction fetch address which fetches the particular byte. The appropriate branch prediction is the branch prediction for the first predicted-taken branch instruction encountered within the cache line subsequent to the particular byte. As used herein, the terms "subsequent" and "prior to" refer to an ordering of bytes within the cache line. A byte stored at a memory address which is numerically smaller than the memory address at which a second byte is stored is prior to the second byte. Conversely, a byte stored at a memory address which is numerically larger than the memory address of a second byte is subsequent to the second byte. Similarly, a first instruction is prior to a second instruction in program order if the first instruction is encountered before the second instruction when stepping one at a time through the sequence of instructions forming the program.
In one embodiment, microprocessor 10 employs a microprocessor architecture in which the instruction set is a variable byte length instruction set (e.g. the x86 microprocessor architecture). When a variable byte length instruction set is employed, any byte within the cache line may be identified as the first byte to be fetched by a given fetch address. For example, a branch instruction may have a target address at byte position two within a cache line. In such a case, the bytes at byte positions zero and one are not being fetched during the current cache access. Additionally, bytes subsequent to a predicted-taken branch which is subsequent to the first byte are not fetched during the current cache access. Since branch selectors are stored for each byte, the branch prediction for the predicted-taken branch can be located by selecting the branch selector of the first byte to be fetched from the cache line. The branch selector is used to select the appropriate branch prediction, which is then provided to the instruction fetch logic in instruction cache 16. During the succeeding clock cycle, the branch prediction is used as the fetch address. Advantageously, the process of comparing the byte position of the first byte being fetched to the byte positions of the predicted-taken branch instructions is eliminated from the generation of a branch prediction in response to a fetch address. The amount of time required to form a branch prediction may be reduced accordingly, allowing the branch prediction mechanism to operate at higher clock frequencies (i.e. shorter clock cycles) while still providing a single cycle branch prediction.
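The per-byte assignment of selectors described above can be sketched as follows. This is a simplified model, keyed only to branch byte offsets; the disclosed embodiments account for instruction boundaries and byte ranges in the manner discussed with Fig. 3, which this illustration does not fully capture.

```python
def build_selectors(group_size, taken_branches):
    """Build one branch selector per byte: each byte's selector names
    the first predicted-taken branch at or after that byte, or the
    sequential selector 0 when no such branch remains in the group.

    taken_branches: sorted byte offsets of predicted-taken branches;
    selector value k (1-based) names the k-th branch's prediction.
    """
    selectors = []
    for byte in range(group_size):
        selector = 0  # sequential: no taken branch remains
        for k, branch_offset in enumerate(taken_branches, start=1):
            if branch_offset >= byte:
                selector = k
                break
        selectors.append(selector)
    return selectors
```

Fetching at any byte then yields the correct prediction by the single lookup shown earlier, with no magnitude comparisons at fetch time; the comparisons have in effect been precomputed when the selectors were stored.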
It is noted that although the term "cache line" has been used in the preceding discussion, some embodiments of instruction cache 16 may not provide an entire cache line at its output during a given clock cycle. For example, in one embodiment, instruction cache 16 is configured with 32 byte cache lines. However, only 16 bytes are fetched in a given clock cycle (either the upper half or the lower half of the cache line). The branch prediction storage locations and branch selectors are allocated to the portion of the cache line being fetched. As used herein, the term "group of contiguous instruction bytes" is used to refer to the instruction bytes which are provided by the instruction cache in a particular clock cycle in response to a fetch address. A group of contiguous instruction bytes may be a portion of a cache line or an entire cache line, according to various embodiments. When a group of contiguous instruction bytes is a portion of a cache line, it is still an aligned portion of a cache line. For example, if a group of contiguous instruction bytes is half a cache line, it is either the upper half of the cache line or the lower half of the cache line. A number of branch prediction storage locations are allocated to each group of contiguous instruction bytes, and branch selectors indicate one of the branch prediction storage locations associated with that group. Furthermore, branch selectors may indicate a return stack address from a return stack structure, or a sequential address if no branch instructions are encountered between the corresponding byte and the last byte in the group of contiguous instruction bytes.

Instruction cache 16 is a high speed cache memory provided to store instructions. Instructions are fetched from instruction cache 16 and dispatched to decode units 20. In one embodiment, instruction cache 16 is configured to store up to 32 kilobytes of instructions in a 4 way set associative structure having 32 byte lines (a byte comprises 8 binary bits). Instruction cache 16 may additionally employ a way prediction scheme in order to speed access times to the instruction cache. Instead of accessing tags identifying each line of instructions and comparing the tags to the fetch address to select a way, instruction cache 16 predicts the way that is accessed. In this manner, the way is selected prior to accessing the instruction storage. The access time of instruction cache 16 may be similar to a direct-mapped cache. A tag comparison is performed and, if the way prediction is incorrect, the correct instructions are fetched and the incorrect instructions are discarded. It is noted that instruction cache 16 may be implemented as a fully associative, set associative, or direct mapped configuration.
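For the 32-byte-line, 16-byte-group embodiment described above, the relevant fields of a fetch address are simple bit slices, which can be sketched as follows. This is a minimal illustration; the cache's tag and index fields are omitted, and the constants are the embodiment's example sizes.

```python
def split_fetch_address(addr, line_size=32, group_size=16):
    """Decompose a fetch address into (line_base, which_half,
    offset_in_group) for aligned half-line fetch groups. Because the
    sizes are powers of two, each field is a bit slice of the address.
    """
    line_base = addr & ~(line_size - 1)        # line-aligned address
    offset_in_line = addr & (line_size - 1)
    which_half = offset_in_line // group_size  # 0 = lower, 1 = upper half
    offset_in_group = addr & (group_size - 1)  # indexes the branch selectors
    return line_base, which_half, offset_in_group
```

The `offset_in_group` field is the quantity decoded to choose among the branch selectors stored for the fetched group.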
Instructions are fetched from main memory and stored into instruction cache 16 by prefetch/predecode unit 12. Instructions may be prefetched prior to the request thereof from instruction cache 16 in accordance with a prefetch scheme. A variety of prefetch schemes may be employed by prefetch/predecode unit 12. As prefetch/predecode unit 12 transfers instructions from main memory to instruction cache 16, prefetch/predecode unit 12 generates three predecode bits for each byte of the instructions: a start bit, an end bit, and a functional bit. The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information, such as whether a given instruction can be decoded directly by decode units 20 or whether the instruction is executed by invoking a microcode procedure controlled by MROM unit 34, as will be described in greater detail below. Still further, prefetch/predecode unit 12 may be configured to detect branch instructions and to store branch prediction information corresponding to the branch instructions into branch prediction unit 14.
One encoding of the predecode tags for an embodiment of microprocessor 10 employing the x86 instruction set will next be described. If a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. Instructions which may be directly decoded by decode units 20 are referred to as "fast path" instructions. The remaining x86 instructions are referred to as MROM instructions, according to one embodiment. For fast path instructions, the functional bit is set for each prefix byte included in the instruction, and cleared for other bytes. Alternatively, for MROM instructions, the functional bit is cleared for each prefix byte and set for other bytes. The type of instruction may be determined by examining the functional bit corresponding to the end byte. If that functional bit is clear, the instruction is a fast path instruction. Conversely, if that functional bit is set, the instruction is an MROM instruction. The opcode of an instruction which may be directly decoded by decode units 20 may thereby be located as the byte associated with the first clear functional bit in the instruction. For example, a fast path instruction including two prefix bytes, a Mod R/M byte, and an SIB byte would have start, end, and functional bits as follows:
Start bits        10000
End bits          00001
Functional bits   11000
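The decoding rule just described can be sketched as follows, using bit strings for a single instruction. The helper name and string representation are illustrative conveniences, not circuitry from the disclosure.

```python
def classify_instruction(start_bits, end_bits, functional_bits):
    """Interpret the predecode encoding for one instruction spanning
    the whole bit string: the functional bit of the end byte
    distinguishes fast path (clear) from MROM (set), and for fast
    path instructions the opcode is the byte associated with the
    first clear functional bit. Bits are strings such as "11000".
    """
    end_byte = end_bits.index("1")              # position of the end bit
    is_mrom = functional_bits[end_byte] == "1"
    opcode_byte = None
    if not is_mrom:
        # Fast path: prefix bytes have set functional bits, so the
        # first clear functional bit marks the opcode byte.
        opcode_byte = functional_bits.index("0")
    return ("MROM" if is_mrom else "fast path"), opcode_byte
```

Applied to the five-byte example above, the end byte's functional bit is clear (fast path) and the first clear functional bit is at byte two, locating the opcode after the two prefix bytes.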
MROM instructions are instructions which are determined to be too complex for decode by decode units 20. MROM instructions are executed by invoking MROM unit 34. More specifically, when an MROM instruction is encountered, MROM unit 34 parses and issues the instruction into a subset of defined fast path instructions to effectuate the desired operation. MROM unit 34 dispatches the subset of fast path instructions to decode units 20. A listing of exemplary x86 instructions categorized as fast path instructions will be provided further below.
Microprocessor 10 employs branch prediction in order to speculatively fetch instructions subsequent to conditional branch instructions. Branch prediction unit 14 is included to perform branch prediction operations. In one embodiment, up to two branch target addresses are stored with respect to each 16 byte portion of each cache line in instruction cache 16. Prefetch/predecode unit 12 determines initial branch targets when a particular line is predecoded. Subsequent updates to the branch targets corresponding to a cache line may occur due to the execution of instructions within the cache line. Instruction cache 16 provides an indication of the instruction address being fetched, so that branch prediction unit 14 may determine which branch target addresses to select for forming a branch prediction. Decode units 20 and functional units 24 provide update information to branch prediction unit 14. Because branch prediction unit 14 stores two targets per 16 byte portion of the cache line, some branch instructions within the line may not be stored in branch prediction unit 14. Decode units 20 detect branch instructions which were not predicted by branch prediction unit 14. Functional units 24 execute the branch instructions and determine if the predicted branch direction is incorrect. The branch direction may be "taken", in which case subsequent instructions are fetched from the target address of the branch instruction. Conversely, the branch direction may be "not taken", in which case subsequent instructions are fetched from memory locations consecutive to the branch instruction. When a mispredicted branch instruction is detected, instructions subsequent to the mispredicted branch are discarded from the various units of microprocessor 10. A variety of suitable branch prediction algorithms may be employed by branch prediction unit 14.
Instructions fetched from instruction cache 16 are conveyed to instruction alignment unit 18. As instructions are fetched from instruction cache 16, the corresponding predecode data is scanned to provide information to instruction alignment unit 18 (and to MROM unit 34) regarding the instructions being fetched. Instruction alignment unit 18 utilizes the scanning data to align an instruction to each of decode units 20. In one embodiment, instruction alignment unit 18 aligns instructions from three sets of eight instruction bytes to decode units 20. Instructions are selected independently from each set of eight instruction bytes into preliminary issue positions. The preliminary issue positions are then merged to a set of aligned issue positions corresponding to decode units 20, such that the aligned issue positions contain the three instructions which are prior to other instructions within the preliminary issue positions in program order. Decode unit 20A receives an instruction which is prior to instructions concurrently received by decode units 20B and 20C (in program order). Similarly, decode unit 20B receives an instruction which is prior to the instruction concurrently received by decode unit 20C in program order.
Decode units 20 are configured to decode instructions received from instruction alignment unit 18. Register operand information is detected and routed to register file 30 and reorder buffer 32. Additionally, if the instructions require one or more memory operations to be performed, decode units 20 dispatch the memory operations to load/store unit 26. Each instruction is decoded into a set of control values for functional units 24, and these control values are dispatched to reservation stations 22 along with operand address information and displacement or immediate data which may be included with the instruction.
Microprocessor 10 supports out of order execution, and thus employs reorder buffer 32 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. A temporary storage location within reorder buffer 32 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 30. Similarly, if a particular instruction causes an exception, instructions subsequent to the particular instruction may be discarded. In this manner, exceptions are "precise" (i.e. instructions subsequent to the particular instruction causing the exception are not completed prior to the exception). It is noted that a particular instruction is speculatively executed if it is executed prior to instructions which precede the particular instruction in program order. Preceding instructions may be a branch instruction or an exception-causing instruction, in which case the speculative results may be discarded by reorder buffer 32.
The instruction control values and immediate or displacement data provided at the outputs of decode units 20 are routed directly to respective reservation stations 22. In one embodiment, each reservation station 22 is capable of holding instruction information (i.e., instruction control values as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of Fig. 1, each reservation station 22 is associated with a dedicated functional unit 24. Accordingly, three dedicated "issue positions" are formed by reservation stations 22 and functional units 24. In other words, issue position 0 is formed by reservation station 22A and functional unit 24A. Instructions aligned and dispatched to reservation station 22A are executed by functional unit 24A. Similarly, issue position 1 is formed by reservation station 22B and functional unit 24B, and issue position 2 is formed by reservation station 22C and functional unit 24C.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 32 and register file 30 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). In embodiments of microprocessor 10 which employ the x86 microprocessor architecture, register file 30 comprises storage locations for each of the 32 bit real registers. Additional storage locations may be included within register file 30 for use by MROM unit 34. Reorder buffer 32 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 32 is reserved for each instruction which, upon decode, is determined to modify the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 32 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 32 has a previous location or locations assigned to a register used as an operand in the given instruction, the reorder buffer 32 forwards to the corresponding reservation station either 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If reorder buffer 32 has a location reserved for a given register, the operand value (or reorder buffer tag) is provided from reorder buffer 32 rather than from register file 30. If there is no location reserved for a required register in reorder buffer 32, the value is taken directly from register file 30. If the operand corresponds to a memory location, the operand value is provided to the reservation station through load/store unit 26.
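The operand-routing decision described above can be sketched as follows. The reorder buffer is modeled here as a simple list of (register, value) entries in allocation order, which is an illustrative simplification rather than the disclosed structure.

```python
def fetch_operand(register, reorder_buffer, register_file):
    """Route a register operand: if the reorder buffer holds a
    location assigned to the register, forward the most recently
    assigned entry's value, or its tag when the value is still
    unproduced (None); otherwise read the register file.

    reorder_buffer: list of (register, value) entries, oldest first.
    """
    # Scan newest-first for the most recently assigned matching entry.
    for idx in range(len(reorder_buffer) - 1, -1, -1):
        reg, value = reorder_buffer[idx]
        if reg == register:
            if value is not None:
                return ("value", value)
            return ("tag", idx)   # reservation station waits for forwarding
    return ("value", register_file[register])
```

A returned tag is later matched against forwarded results, so the reservation station captures the value the cycle it is produced.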
In one particular embodiment, reorder buffer 32 is configured to store and manipulate concurrently decoded instructions as a unit. This configuration will be referred to herein as "line-oriented". By manipulating several instructions together, the hardware employed within reorder buffer 32 may be simplified. For example, a line-oriented reorder buffer included in the present embodiment allocates storage sufficient for instruction information pertaining to three instructions (one from each decode unit 20) whenever one or more instructions are dispatched by decode units 20. By contrast, a variable amount of storage is allocated in conventional reorder buffers, dependent upon the number of instructions actually dispatched. A comparatively larger number of logic gates may be required to allocate the variable amount of storage. When each of the concurrently decoded instructions has executed, the instruction results are stored into register file 30 simultaneously. The storage is then free for allocation to another set of concurrently decoded instructions. Additionally, the amount of control logic circuitry employed per instruction is reduced because the control logic is amortized over several concurrently decoded instructions. A reorder buffer tag identifying a particular instruction may be divided into two fields: a line tag and an offset tag. The line tag identifies the set of concurrently decoded instructions including the particular instruction, and the offset tag identifies which instruction within the set corresponds to the particular instruction. It is noted that storing instruction results into register file 30 and freeing the corresponding storage is referred to as "retiring" the instructions. It is further noted that any reorder buffer configuration may be employed in various embodiments of microprocessor 10.
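The two-field tag can be sketched arithmetically. With three instructions per line (one per decode unit), a flat tag splits as shown; the field widths are illustrative, and hardware would use fixed bit fields rather than division.

```python
def split_rob_tag(tag, line_width=3):
    """Split a line-oriented reorder buffer tag into (line_tag,
    offset_tag): the line tag names a set of concurrently decoded
    instructions, the offset tag names the instruction within it.
    """
    return tag // line_width, tag % line_width


def make_rob_tag(line_tag, offset_tag, line_width=3):
    """Inverse operation: rebuild the flat tag from its two fields."""
    return line_tag * line_width + offset_tag
```

Retirement logic then operates on line tags alone, which is one source of the per-instruction control logic savings noted above.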
As noted earlier, reservation stations 22 store instructions until the instructions are executed by the corresponding functional unit 24. An instruction is selected for execution if: (i) the operands of the instruction have been provided; and (ii) the operands have not yet been provided for instructions which are within the same reservation station 22A-22C and which are prior to the instruction in program order. It is noted that when an instruction is executed by one of the functional units 24, the result of that instruction is passed directly to any reservation stations 22 that are waiting for that result at the same time the result is passed to update reorder buffer 32 (this technique is commonly referred to as "result forwarding"). An instruction may be selected for execution and passed to a functional unit 24A-24C during the clock cycle that the associated result is forwarded. Reservation stations 22 route the forwarded result to the functional unit 24 in this case. In one embodiment, each of the functional units 24 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. The operations are performed in response to the control values decoded for a particular instruction by decode units 20. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations. The floating point unit may be operated as a coprocessor, receiving instructions from MROM unit 34 and subsequently communicating with reorder buffer 32 to complete the instructions. Additionally, functional units 24 may be configured to perform address generation for load and store memory operations performed by load/store unit 26.
Each of the functional units 24 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 14. If a branch prediction was incorrect, branch prediction unit 14 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes fetch of the required instructions from instruction cache 16 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 26 and reorder buffer 32. Results produced by functional units 24 are sent to reorder buffer 32 if a register value is being updated, and to load/store unit 26 if the contents of a memory location are changed. If the result is to be stored in a register, reorder buffer 32 stores the result in the location reserved for the value of the register when the instruction was decoded. A plurality of result buses 38 are included for forwarding of results from functional units 24 and load/store unit 26. Result buses 38 convey the result generated as well as the reorder buffer tag identifying the instruction being executed.
Load/store unit 26 provides an interface between functional units 24 and data cache 28. In one embodiment, load/store unit 26 is configured with a load/store buffer having eight storage locations for data and address information for pending loads or stores. Decode units 20 arbitrate for access to the load/store unit 26. When the buffer is full, a decode unit must wait until load/store unit 26 has room for the pending load or store request information. Load/store unit 26 also performs dependency checking for load memory operations against pending store memory operations to ensure that data coherency is maintained. A memory operation is a transfer of data between microprocessor 10 and the main memory subsystem. Memory operations may be the result of an instruction which utilizes an operand stored in memory, or may be the result of a load/store instruction which causes the data transfer but no other operation. Additionally, load/store unit 26 may include a special register storage for special registers such as the segment registers and other registers related to the address translation mechanism defined by the x86 microprocessor architecture.
In one embodiment, load/store unit 26 is configured to perform load memory operations speculatively. Store memory operations are performed in program order, but may be speculatively stored into the predicted way. If the predicted way is incorrect, the data prior to the store memory operation is subsequently restored to the predicted way and the store memory operation is performed to the correct way. In another embodiment, stores may be executed speculatively as well. Speculatively executed stores are placed into a store buffer, along with a copy of the cache line prior to the update. If the speculatively executed store is later discarded due to branch misprediction or exception, the cache line may be restored to the value stored in the buffer. It is noted that load/store unit 26 may be configured to perform any amount of speculative execution, including no speculative execution.
Data cache 28 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 26 and the main memory subsystem. In one embodiment, data cache 28 has a capacity of storing up to sixteen kilobytes of data in an eight way set associative structure. Similar to instruction cache 16, data cache 28 may employ a way prediction mechanism. It is understood that data cache 28 may be implemented in a variety of specific memory configurations, including a set associative configuration.
In one particular embodiment of microprocessor 10 employing the x86 microprocessor architecture, instruction cache 16 and data cache 28 are linearly addressed. The linear address is formed from the offset specified by the instruction and the base address specified by the segment portion of the x86 address translation mechanism. Linear addresses may optionally be translated to physical addresses for accessing a main memory. The linear to physical translation is specified by the paging portion of the x86 address translation mechanism. It is noted that a linearly addressed cache stores linear address tags. A set of physical tags (not shown) may be employed for mapping the linear addresses to physical addresses and for detecting translation aliases. Additionally, the physical tag block may perform linear to physical address translation.

Turning now to Fig. 2, a block diagram of one embodiment of decode units 20B and 20C is shown.
Each decode unit 20 receives an instruction from instruction alignment unit 18. Additionally, MROM unit 34 is coupled to each decode unit 20 for dispatching fast path instructions corresponding to a particular MROM instruction. Decode unit 20B comprises early decode unit 40B, multiplexor 42B, and opcode decode unit 44B. Similarly, decode unit 20C includes early decode unit 40C, multiplexor 42C, and opcode decode unit 44C.
Certain instructions in the x86 instruction set are both fairly complicated and frequently used. In one embodiment of microprocessor 10, such instructions include more complex operations than the hardware included within a particular functional unit 24A-24C is configured to perform. Such instructions are classified as a special type of MROM instruction referred to as a "double dispatch" instruction. These instructions are dispatched to a pair of opcode decode units 44. It is noted that opcode decode units 44 are coupled to respective reservation stations 22. Each of opcode decode units 44A-44C forms an issue position with the corresponding reservation station 22A-22C and functional unit 24A-24C. Instructions are passed from an opcode decode unit 44 to the corresponding reservation station 22 and further to the corresponding functional unit 24.
Multiplexor 42B is included for selecting between the instructions provided by MROM unit 34 and by early decode unit 40B. During times in which MROM unit 34 is dispatching instructions, multiplexor 42B selects instructions provided by MROM unit 34. At other times, multiplexor 42B selects instructions provided by early decode unit 40B. Similarly, multiplexor 42C selects between instructions provided by MROM unit 34, early decode unit 40B, and early decode unit 40C. The instruction from MROM unit 34 is selected during times in which MROM unit 34 is dispatching instructions. During times in which the early decode unit within decode unit 20A (not shown) detects a double dispatch instruction, the instruction from early decode unit 40B is selected by multiplexor 42C. Otherwise, the instruction from early decode unit 40C is selected. Selecting the instruction from early decode unit 40B into opcode decode unit 44C allows a fast path instruction decoded by decode unit 20B to be dispatched concurrently with a double dispatch instruction decoded by decode unit 20A.
According to one embodiment employing the x86 instruction set, early decode units 40 perform the following operations: (i) merge the prefix bytes of the instruction into an encoded prefix byte; (ii) decode unconditional branch instructions (which may include the unconditional jump, the CALL, and the RETURN) which were not detected during branch prediction; (iii) decode source and destination flags; (iv) decode the source and destination operands which are register operands and generate operand size information; and (v) determine the displacement and/or immediate size so that displacement and immediate data may be routed to the opcode decode unit. Opcode decode units 44 are configured to decode the opcode of the instruction, producing control values for functional unit 24. Displacement and immediate data are routed with the control values to reservation stations 22.
Since early decode units 40 detect operands, the outputs of multiplexors 42 are routed to register file 30 and reorder buffer 32. Operand values or tags may thereby be routed to reservation stations 22. Additionally, memory operands are detected by early decode units 40. Therefore, the outputs of multiplexors 42 are routed to load/store unit 26. Memory operations corresponding to instructions having memory operands are stored by load/store unit 26.
Turning now to Fig. 3, a diagram of an exemplary group of contiguous instruction bytes 50 and a corresponding set of branch selectors 52 are shown. In Fig. 3, each byte within an instruction is illustrated by a short vertical line (e.g. reference number 54). Additionally, the vertical lines separating instructions in group 50 delimit bytes (e.g. reference number 56). The instructions shown in Fig. 3 are variable in length, and therefore the instruction set including the instructions shown in Fig. 3 is a variable byte length instruction set. In other words, a first instruction within the variable byte length instruction set may occupy a first number of bytes which is different than a second number of bytes occupied by a second instruction within the instruction set. Other instruction sets may be fixed-length, such that each instruction within the instruction set occupies the same number of bytes as each other instruction.
As illustrated in Fig. 3, group 50 includes non-branch instructions IN0-IN5. Instructions IN0, IN3, IN4, and IN5 are two byte instructions. Instruction IN1 is a one byte instruction and instruction IN2 is a three byte instruction. Two predicted-taken branch instructions PB0 and PB1 are illustrated as well, each shown as occupying two bytes. It is noted that both non-branch and branch instructions may occupy various numbers of bytes.
The end byte of each predicted-taken branch PB0 and PB1 provides a division of group 50 into three regions: a first region 58, a second region 60, and a third region 62. If a fetch address identifying group 50 is presented, and the offset of the fetch address within the group identifies a byte position within first region 58, then the first predicted-taken branch instruction to be encountered is PB0 and therefore the branch prediction for PB0 is selected by the branch prediction mechanism. Similarly, if the offset of the fetch address identifies a byte within second region 60, the appropriate branch prediction is the branch prediction for PB1. Finally, if the offset of the fetch address identifies a byte within third region 62, then there is no predicted-taken branch instruction within the group of instruction bytes and subsequent to the identified byte. Therefore, the branch prediction for third region 62 is sequential. The sequential address identifies the group of instruction bytes which immediately follows group 50 within main memory.
As used herein, the offset of an address comprises a number of least significant bits of the address. The number is sufficient to provide different encodings of the bits for each byte within the group of bytes to which the offset relates. For example, group 50 is 16 bytes. Therefore, four least significant bits of an address within the group form the offset of the address. The remaining bits of the address identify group 50 from other groups of instruction bytes within the main memory. Additionally, a number of least significant bits of the remaining bits form an index used by instruction cache 16 to select a row of storage locations which are eligible for storing group 50.
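The offset/index division described above may be sketched as follows. The 16 byte group size (and hence the four offset bits) is taken from the example; the number of index bits is an illustrative assumption.

```python
# Decompose a fetch address into offset, index, and remaining (tag) bits,
# following the 16 byte group example above. The 8-bit index width is an
# illustrative assumption, not a value given in the specification.

GROUP_BYTES = 16           # group size from the example, giving 4 offset bits
OFFSET_BITS = 4
INDEX_BITS = 8             # assumed number of cache index bits

def decompose(addr):
    offset = addr & (GROUP_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return offset, index, tag

print(decompose(0x12345))  # offset 0x5, index 0x34, tag 0x12
```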
Set 52 is an exemplary set of branch selectors for group 50. One branch selector is included for each byte within group 50. The branch selectors within set 52 use the encoding shown in Fig. 6 below. In the example, the branch prediction for PB0 is stored as the second of two branch predictions associated with group 50 (as indicated by a branch selector value of "3"). Therefore, the branch selector for each byte within first region 58 is set to "3". Similarly, the branch prediction for PB1 is stored as the first of the branch predictions (as indicated by a branch selector value of "2"). Therefore, the branch selector for each byte within second region 60 is set to "2". Finally, the sequential branch prediction is indicated for the branch selectors for bytes within third region 62 by a branch selector encoding of "0".
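A minimal sketch of the per-byte selector set for this example follows. The selector values for each region are taken from the description above; the end-byte offsets of PB0 and PB1 are illustrative assumptions, since the exact byte positions come from Fig. 3.

```python
# Build a per-byte branch selector set for a 16 byte group, mirroring the
# example above: first region 58 -> selector 3 (second prediction, PB0),
# second region 60 -> selector 2 (first prediction, PB1), third region 62 ->
# selector 0 (sequential). Branch end offsets are illustrative assumptions.

GROUP_BYTES = 16
PB0_END = 5    # assumed offset of the end byte of PB0
PB1_END = 11   # assumed offset of the end byte of PB1

selectors = []
for byte in range(GROUP_BYTES):
    if byte <= PB0_END:
        selectors.append(3)   # fetches here encounter PB0 first
    elif byte <= PB1_END:
        selectors.append(2)   # fetches here encounter PB1 first
    else:
        selectors.append(0)   # no predicted-taken branch remains: sequential

print(selectors)
```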
It is noted that due to the variable byte length nature of the x86 instruction set, a branch instruction may begin within one group of contiguous instruction bytes and end within a second group of contiguous instruction bytes. In such a case, the branch prediction for the branch instruction is stored with the second group of contiguous instruction bytes. Among other things, the bytes of the branch instruction which are stored within the second group of contiguous instruction bytes need to be fetched and dispatched. Forming the branch prediction in the first group of contiguous instruction bytes would cause the bytes of the branch instruction which lie within the second group of instruction bytes not to be fetched.

Turning now to Fig. 4, a portion of one embodiment of branch prediction unit 14 is shown. Other embodiments of branch prediction unit 14 and the portion shown in Fig. 4 are contemplated. As shown in Fig. 4, branch prediction unit 14 includes a branch prediction storage 70, a way multiplexor 72, a branch selector multiplexor 74, a branch prediction multiplexor 76, a sequential/return multiplexor 78, a final prediction multiplexor 80, an update logic block 82, and a decoder 84. Branch prediction storage 70 and decoder 84 are coupled to a fetch address bus 86 from instruction cache 16. A fetch address concurrently provided to the instruction bytes storage within instruction cache 16 is conveyed upon fetch address bus 86. Decoder block 84 provides selection controls to prediction selector multiplexor 74. Prediction controls for way multiplexor 72 are provided via a way selection bus 88 from instruction cache 16. Way selection bus 88 provides the way of instruction cache 16 which is storing the cache line corresponding to the fetch address provided on fetch address bus 86. Additionally, a selection control is provided by decoder 84 based upon which portion of the cache line is being fetched. Way multiplexor 72 is coupled to receive the contents of each storage location within the row of branch prediction storage 70 which is indexed by the fetch address upon fetch address bus 86. Branch selector multiplexor 74 and branch prediction multiplexor 76 are coupled to receive portions of the output of way multiplexor 72 as inputs. Additionally, the output of branch selector multiplexor 74 provides selection controls for multiplexors 76, 78, and 80. Sequential/return multiplexor 78 selects between a sequential address provided upon a sequential address bus 90 from instruction cache 16 and a return address provided upon a return address bus 92 from a return stack. The output of multiplexors 76 and 78 is provided to final prediction multiplexor 80,
which provides a branch prediction bus 94 to instruction cache 16. Instruction cache 16 uses the branch prediction provided upon branch prediction bus 94 as the fetch address for the subsequent clock cycle. Update logic block 82 is coupled to branch prediction storage 70 via an update bus 96 used to update branch prediction information stored therein. Update logic block 82 provides updates in response to a misprediction signalled via a mispredict bus 98 from functional units 24 and decode units 20. Additionally, update logic block 82 provides updates in response to newly predecoded instructions indicated by prefetch/predecode unit 12 upon a predecode bus 100.

Branch prediction storage 70 is arranged with a number of ways equal to the number of ways in instruction cache 16. For each way, a prediction block is stored for each group of contiguous instruction bytes existing within a cache line. In the embodiment of Fig. 4, two groups of instruction bytes are included in each cache line. Therefore, prediction block P00 is the prediction block corresponding to the first group of contiguous instruction bytes in the first way and prediction block P01 is the prediction block corresponding to the second group of contiguous instruction bytes in the first way. Similarly, prediction block P10 is the prediction block corresponding to the first group of contiguous instruction bytes in the second way and prediction block P11 is the prediction block corresponding to the second group of contiguous instruction bytes in the second way, etc. Each prediction block P00 to P31 in the indexed row is provided as an output of branch prediction storage 70, and hence as an input to way multiplexor 72. The indexed row is similar to indexing into a cache: a number of bits which are not part of the offset portion of the fetch address are used to select one of the rows of branch prediction storage 70. It is noted that branch prediction storage 70 may be configured with fewer rows than instruction cache 16. For example, branch prediction storage 70 may include 1/4 the number of rows of instruction cache 16. In such a case, the address bits which are index bits of instruction cache 16 but which are not index bits of branch prediction storage 70 may be stored with the branch prediction information and checked against the corresponding bits of the fetch address to confirm that the branch prediction information is associated with the row of instruction cache 16 which is being accessed.

Way multiplexor 72 selects one of the sets of branch prediction information P00-P31 based upon the way selection provided from instruction cache 16 and the group of instruction bytes referenced by the fetch address. In the embodiment shown, for example, a 32 byte cache line is divided into two 16 byte groups. Therefore, the fifth least significant bit of the address is used to select which of the two groups contains the fetch address. If the fifth least significant bit is zero, then the first group of contiguous instruction bytes is selected. If the fifth least significant bit is one, then the second group of contiguous instruction bytes is selected. It is noted that the way selection provided upon way select bus 88 may be a way prediction produced by a branch prediction from the previous clock cycle, according to one embodiment. Alternatively, the way selection may be generated via tag comparisons between the fetch address and the address tags identifying the cache lines stored in each way of the instruction cache. It is noted that an address tag is the portion of the address which is not an offset within the cache line nor an index into the instruction cache.
The selected prediction block provided by way multiplexor 72 includes branch selectors for each byte in the group of contiguous instruction bytes, as well as branch predictions BP1 and BP2. The branch selectors are provided to branch selector multiplexor 74, which selects one of the branch selectors based upon selection controls provided by decoder 84. Decoder 84 decodes the offset of the fetch address into the group of contiguous instruction bytes to select the corresponding branch selector. For example, if a group of contiguous instruction bytes is 16 bytes, then decoder 84 decodes the four least significant bits of the fetch address. In this manner, a branch selector is chosen.
The selected branch selector is used to provide selection controls to branch prediction multiplexor 76, sequential/return multiplexor 78, and final prediction multiplexor 80. In one embodiment, the encoding of the branch selector can be used directly as the multiplexor select controls. In other embodiments, a logic block may be inserted between branch selector multiplexor 74 and multiplexors 76, 78, and 80. For the embodiment shown, branch selectors comprise two bits. One bit of the selected branch selector provides the selection control for branch prediction multiplexor 76 and sequential/return multiplexor 78. The other bit provides a selection control for final prediction multiplexor 80. A branch prediction is thereby selected from the multiple branch predictions stored in branch prediction storage 70 corresponding to the group of contiguous instruction bytes being fetched, the sequential address of the group of contiguous instruction bytes sequential to the group of contiguous instruction bytes being fetched, and a return stack address from a return stack structure. It is noted that multiplexors 76, 78, and 80 may be combined into a single 4 to 1 multiplexor for which the selected branch selector provides selection controls to select between the two branch predictions from branch prediction storage 70, the sequential address, and the return address.

The return stack structure (not shown) is used to store return addresses corresponding to subroutine call instructions previously fetched by microprocessor 10. In one embodiment, the branch predictions stored by branch prediction storage 70 include an indication that the branch prediction corresponds to a subroutine call instruction. Subroutine call instructions are a subset of branch instructions which save the address of the sequential instruction (the return address) in addition to redirecting the instruction stream to the target address of the subroutine call instruction. For example, in the x86 microprocessor architecture, the subroutine call instruction (CALL) pushes the return address onto the stack indicated by the ESP register.
A subroutine return instruction is another subset of the branch instructions. The subroutine return instruction uses the return address saved by the most recently executed subroutine call instruction as a target address. Therefore, when a branch prediction includes an indication that the branch prediction corresponds to a subroutine call instruction, the sequential address to the subroutine call instruction is placed at the top of the return stack. When a subroutine return instruction is encountered (as indicated by a particular branch selector encoding), the address nearest the top of the return stack which has not previously been used as a prediction is used as the prediction of the address. The address nearest the top of the return stack which has not previously been used as a prediction is conveyed by the return stack upon return address bus 92 (along with the predicted way of the return address, provided to the return stack similar to its provision upon way select bus 88). Branch prediction unit 14 informs the return stack when the return address is selected as the prediction. Additional details regarding an exemplary return stack structure may be found in the commonly assigned, co-pending patent application entitled "Speculative Return Address Prediction Unit for a Superscalar Microprocessor", Serial No. 08/550,296, filed October 30, 1995 by Mahalingaiah et al. The disclosure of the referenced patent application is incorporated herein by reference in its entirety.
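The return stack behavior described above, including the marking of entries that have already been used as predictions, can be sketched roughly as follows. The structure and names are illustrative assumptions; only the push-on-call and predict-nearest-unused behavior follows from the description.

```python
# Rough sketch of the return stack described above. Each entry records a
# return address plus a flag noting whether it has already been used as a
# prediction. Names and structure are illustrative assumptions.

class ReturnStack:
    def __init__(self):
        self.entries = []  # list of [return_address, used_as_prediction]

    def push_call(self, sequential_address):
        """On a predicted subroutine call, save the return address."""
        self.entries.append([sequential_address, False])

    def predict_return(self):
        """Return the address nearest the top not yet used as a prediction."""
        for entry in reversed(self.entries):
            if not entry[1]:
                entry[1] = True  # branch prediction unit informs the stack
                return entry[0]
        return None

rs = ReturnStack()
rs.push_call(0x1000)
rs.push_call(0x2000)
print(hex(rs.predict_return()))  # 0x2000 (nearest the top)
print(hex(rs.predict_return()))  # 0x1000 (next unused entry)
```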
The sequential address is provided by instruction cache 16. The sequential address identifies the next group of contiguous instruction bytes within main memory relative to the group of instruction bytes indicated by the fetch address upon fetch address bus 86. It is noted that, according to one embodiment, a way prediction is supplied for the sequential address when the sequential address is selected. The way prediction may be selected to be the same as the way selected for the fetch address. Alternatively, a way prediction for the sequential address may be stored within branch prediction storage 70.
As mentioned above, update logic block 82 is configured to update a prediction block upon detection of a branch misprediction or upon detection of a branch instruction while predecoding the corresponding group of contiguous instruction bytes in prefetch/predecode unit 12. The prediction block corresponding to each branch prediction is stored in update logic block 82 as the prediction is performed. A branch tag is conveyed along with the instructions being fetched (via a branch tag bus 102), such that if a misprediction is detected or a branch instruction is detected during predecoding, the corresponding prediction block can be identified via the branch tag. In one embodiment, the prediction block as shown in Fig. 5 is stored, as well as the index of the fetch address which caused the prediction block to be fetched and the way in which the prediction block is stored.
When a branch misprediction is detected, the corresponding branch tag is provided upon mispredict bus 98 from either the functional unit 24 which executes the branch instruction or from decode units 20. If decode units 20 provide the branch tag, then the misprediction is of the previously undetected type (e.g. there are more branch instructions in the group than can be predicted using the corresponding branch predictions). Decode units 20 detect mispredictions of unconditional branch instructions (i.e. branch instructions which always select the target address). Functional units 24 may detect a misprediction due to a previously undetected conditional branch instruction or due to an incorrect taken/not-taken prediction. Update logic 82 selects the corresponding prediction block out of the aforementioned storage. In the case of a previously undetected branch instruction, one of the branch predictions within the prediction block is assigned to the previously undetected branch instruction. According to one embodiment, the algorithm for selecting one of the branch predictions to store the branch prediction for the previously undetected branch instruction is as follows: If the branch instruction is a subroutine return instruction, the branch selector for the instruction is selected to be the value indicating the return stack. Otherwise, a branch prediction which is currently predicted not-taken is selected. If each branch prediction is currently predicted-taken, then a branch prediction is randomly selected. The branch selector for the new prediction is set to indicate the selected branch prediction. Additionally, the branch selectors corresponding to bytes between the first branch instruction prior to the newly detected branch instruction and the newly detected branch instruction are set to the branch selector corresponding to the new prediction. Fig. 7 below describes one method for updating the branch selectors. For a mispredicted taken prediction which causes the prediction to become predicted not-taken, the branch selectors corresponding to the mispredicted prediction are set to the branch selector corresponding to the byte subsequent to the mispredicted branch instruction. In this manner, a prediction for a subsequent branch instruction will be used if the instructions are fetched again at a later clock cycle.

When prefetch/predecode unit 12 detects a branch instruction while predecoding a group of contiguous instruction bytes, prefetch/predecode unit 12 provides the branch tag for the group of contiguous instruction bytes if the predecoding is performed because invalid predecode information is stored in the instruction cache for the cache line (case (i)). Alternatively, if the predecoding is being performed upon a cache line being fetched from the main memory subsystem, prefetch/predecode unit 12 provides the address of the group of contiguous instruction bytes being predecoded, the offset of the end byte of the branch instruction within the group, and the way of the instruction cache selected to store the group (case (ii)). In case (i), the update is performed similar to the branch misprediction case above. In case (ii), there is not yet a valid prediction block stored in branch prediction storage 70 for the group of instructions. For this case, update logic block 82 initializes the branch selectors prior to the detected branch to the branch selector selected for the detected branch. Furthermore, the branch selectors subsequent to the detected branch are initialized to the sequential value. Alternatively, each of the branch selectors may be initialized to sequential when the corresponding cache line in instruction cache 16 is allocated, and subsequently updated via detection of branch instructions during predecode in a manner similar to case (i).
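The selector update for a mispredicted taken prediction that becomes predicted not-taken can be sketched as follows: every byte whose selector pointed at the mispredicted prediction is remapped to the selector of the byte following the branch. The group size and example values are illustrative assumptions.

```python
# Sketch of the branch selector update described above for a mispredicted
# taken prediction that becomes predicted not-taken. Bytes mapped to the
# mispredicted prediction take on the selector of the byte subsequent to
# the branch. Group size and example values are illustrative assumptions.

def retire_not_taken(selectors, branch_end):
    """Remap the mispredicted branch's selectors to the following byte's."""
    if branch_end + 1 < len(selectors):
        new_value = selectors[branch_end + 1]
    else:
        new_value = 0  # branch ends the group: fall back to sequential
    old_value = selectors[branch_end]
    return [new_value if s == old_value else s for s in selectors]

# Example: the first branch's region (selector 3) is absorbed into the
# second branch's region (selector 2) once the first becomes not-taken.
selectors = [3] * 6 + [2] * 6 + [0] * 4
print(retire_not_taken(selectors, branch_end=5))
```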
Upon generation of an update, update logic block 82 conveys the updated prediction block, along with the fetch address index and corresponding way, upon update bus 96 for storage in branch prediction storage 70. It is noted that, in order to maintain branch prediction storage 70 as a single ported storage, branch prediction storage 70 may employ a branch holding register. The updated prediction information is stored into the branch holding register and updated into the branch prediction storage upon an idle cycle on fetch address bus 86. An exemplary cache holding register structure is described in the commonly assigned, co-pending patent application entitled "Delayed Update Register for an Array", Serial No. 08/481,914, filed June 7, 1995 by Tran et al., incorporated herein by reference in its entirety.
It is noted that a correctly predicted branch instruction may result in an update to the corresponding branch prediction as well. A counter indicative of previous executions of the branch instruction (used to form the taken/not-taken prediction of the branch instruction) may need to be incremented or decremented, for example. Such updates are performed upon retirement of the corresponding branch prediction. Retirement is indicated via a branch tag upon retire tag bus 104 from reorder buffer 32.
It is noted that the structure of Fig. 4 may be further accelerated through the use of a predicted branch selector. The predicted branch selector is stored with each prediction block and is set to the branch selector selected in a previous fetch of the corresponding group of contiguous instruction bytes. The predicted branch selector is used to select the branch prediction, removing branch selector multiplexor 74 from the path of branch prediction generation. Branch selector multiplexor 74 is still employed, however, to verify that the selected branch selector is equal to the predicted branch selector. If the selected branch selector and the predicted branch selector are not equal, then the selected branch selector is used to provide the correct branch prediction during the succeeding clock cycle and the fetch of the incorrect branch prediction is cancelled.
Turning now to Fig. 5, an exemplary prediction block 110 employed by one embodiment of the branch prediction unit 14 as shown in Fig. 4 is shown. Prediction block 110 includes a set of branch selectors 112, a first branch prediction (BP1) 114, and a second branch prediction (BP2) 116. Set of branch selectors 112 includes a branch selector for each byte of the group of contiguous instruction bytes corresponding to prediction block 110.
First branch prediction 114 is shown in an exploded view in Fig. 5. Second branch prediction 116 is configured similarly. First branch prediction 114 includes an index 118 for the cache line containing the target address, and a way selection 120 for the cache line as well. According to one embodiment, index 118 includes the offset portion of the target address, as well as the index. Index 118 is concatenated with the tag of the way indicated by way selection 120 to form the branch prediction address. Additionally, a prediction counter 122 is stored for each branch prediction. The prediction counter is incremented each time the corresponding branch instruction is executed and is taken, and is decremented each time the corresponding branch instruction is executed and is not-taken. The most significant bit of the prediction counter is used as the taken/not-taken prediction. If the most significant bit is set, the branch instruction is predicted taken. Conversely, the branch instruction is predicted not-taken if the most significant bit is clear. In one embodiment, the prediction counter is a two bit saturating counter. The counter saturates when incremented at binary '11' and saturates when decremented at binary '01'. In another embodiment, the prediction counter is a single bit which indicates a strong (a binary one) or a weak (a binary zero) taken prediction. If a strong taken prediction is mispredicted, it
becomes a weak taken prediction. If a weak taken prediction is mispredicted, the branch becomes predicted not-taken and the branch selector is updated (i.e. the case of a mispredicted branch that becomes not-taken). Finally, a call bit 124 is included in first branch prediction 114. Call bit 124 is indicative, when set, that the corresponding branch instruction is a subroutine call instruction. If call bit 124 is set, the current fetch address and way are stored into the return stack structure mentioned above.
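The two bit saturating counter behavior described above can be sketched in a few lines (a minimal software illustration only; the function names are hypothetical, and the hardware would implement this as combinational logic rather than code):

```python
# Two-bit saturating prediction counter as described above:
# incremented on a taken execution (saturating at binary '11'),
# decremented on a not-taken execution (saturating at binary '01').
# The most significant bit supplies the taken/not-taken prediction.

def update_counter(counter, taken):
    """Update a two-bit saturating counter (values 0b01..0b11)."""
    if taken:
        return min(counter + 1, 0b11)   # saturates at '11' when incremented
    return max(counter - 1, 0b01)       # saturates at '01' when decremented

def predicted_taken(counter):
    """Most significant bit set -> the branch is predicted taken."""
    return (counter & 0b10) != 0
```

Note that, per the description, the counter saturates at '01' rather than '00' when decremented, so its effective range is '01' through '11'.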
Turning next to Fig. 6, a table 130 illustrating an exemplary branch selector encoding is shown. A binary encoding is listed (most significant bit first), followed by the branch prediction which is selected when the branch selector is encoded with the corresponding value. As table 130 illustrates, the least significant bit of the branch selector can be used as a selection control for branch prediction multiplexor 76 and sequential/return multiplexor 78. If the least significant bit is clear, then the first branch prediction is selected by branch prediction multiplexor 76 and the sequential address is selected by sequential/return multiplexor 78. On the other hand, the second branch prediction is selected by branch prediction multiplexor 76 and the return address is selected by sequential/return multiplexor 78 if the least significant bit is set. Furthermore, the most significant bit of the branch selector can be used as a selection control for final prediction multiplexor 80. If the most significant bit is set, the output of branch prediction multiplexor 76 is selected. If the most significant bit is clear, the output of sequential/return multiplexor 78 is selected.
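Under the encoding just described (most significant bit first: '00' selects the sequential address, '01' the return address, '10' the first branch prediction, and '11' the second branch prediction), the selection can be sketched as follows; the function name is hypothetical:

```python
# Branch selector decode per the bit usage described for table 130:
# the MSB chooses between the sequential/return multiplexor output and
# the branch prediction multiplexor output; the LSB controls each multiplexor.

def decode_selector(selector):
    """Map a 2-bit branch selector to the prediction it selects."""
    msb = (selector >> 1) & 1
    lsb = selector & 1
    if msb == 0:
        # Sequential/return multiplexor output is chosen.
        return "return" if lsb else "sequential"
    # Branch prediction multiplexor output is chosen.
    return "second prediction" if lsb else "first prediction"
```

For example, `decode_selector(0b11)` selects the second branch prediction, consistent with the selector values used in the examples of Figs. 8 through 10.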
Turning now to Fig. 7, a flow chart depicting the steps employed to update the branch selectors of a group of contiguous instruction bytes in response to a mispredicted branch instruction is shown. Updating due to a branch instruction discovered during predecoding may be performed similarly. The misprediction may be the result of detecting a branch instruction for which prediction information is not stored in branch prediction storage 70, or may be the result of an incorrect taken/not-taken prediction which causes the corresponding prediction counter to indicate not-taken.
Upon detection of the misprediction, branch prediction unit 14 uses an "end pointer": the offset of the end byte of the mispredicted branch instruction within the corresponding group of contiguous instruction bytes. Additionally, the prediction block is selected for update using the branch tag received in response to the misprediction. Branch prediction unit 14 decodes the end pointer into an update mask (step 140). The update mask comprises a binary digit for each byte within the group of contiguous instruction bytes. Digits corresponding to bytes prior to and including the end byte of the branch instruction within the cache line are set, and the remaining digits are clear. Branch prediction unit 14 identifies the current branch selector. For mispredicted taken/not-taken predictions, the current branch selector is the branch selector corresponding to the mispredicted branch instruction. For a misprediction due to an undetected branch, the current branch selector is the branch selector corresponding to the end byte of the undetected branch instruction. The current branch selector is XNOR'd with each of the branch selectors to create a branch mask (step 142). The branch mask includes binary digits which are set for each byte having a branch selector which matches the current branch selector and binary digits which are clear for each byte having a branch selector which does not match the current branch selector.
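Steps 140 and 142 can be sketched with per-byte lists standing in for the bit vectors (a software illustration of the masks, not the hardware form; the function names are hypothetical):

```python
def make_update_mask(end_pointer, group_size=16):
    # Step 140: set a digit for every byte prior to and including the
    # end byte of the branch instruction; remaining digits are clear.
    return [1 if i <= end_pointer else 0 for i in range(group_size)]

def make_branch_mask(selectors, current_selector):
    # Step 142: XNOR of the current branch selector with each stored selector.
    # A set digit marks a byte whose selector matches the current selector.
    return [1 if s == current_selector else 0 for s in selectors]
```

With the initial selectors of the Fig. 8 example and an end pointer of byte position 6, `make_update_mask` sets byte positions 0 through 6 and `make_branch_mask` sets the positions carrying selector "3".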
The update mask created in step 140 and the branch mask created in step 142 are subsequently ANDed, producing a final update mask (step 144). The final update mask includes binary digits which are set for each byte of the group of contiguous instruction bytes which is to be updated to the new branch selector.
For a mispredicted taken branch, the new branch selector is the branch selector of the byte subsequent to the end byte of the mispredicted taken branch instruction. For an undetected branch, the new branch selector is the branch selector indicating the branch prediction storage location assigned to the previously undetected branch by update logic block 82. An extended mask is also generated (steps 146 and 148). The extended mask indicates which branch selectors are to be erased, because the branch prediction corresponding to the branch selector has been reallocated to the newly discovered branch instruction or because the branch prediction now indicates not-taken. The extended mask is generated by first creating a second branch mask similar to the branch mask, except using the new branch selector instead of the current branch selector (i.e. the mask is created by XNORing the branch selectors corresponding to the cache line with the new branch selector (step 146)). The resulting mask is then ANDed with the inversion of the final update mask to create the extended mask (step 148). Branch selectors corresponding to bits in the extended mask which are set are updated to indicate the branch selector of the byte immediately subsequent to the last byte for which a bit in the extended mask is set. In this manner, the branch prediction formerly indicated by the branch selector is erased and replaced with the following branch selector within the cache line. During a step 150, the branch selectors are updated in response to the final update mask and the extended mask.
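The complete procedure of steps 140 through 150 can be sketched as a single function (a software model under the assumptions above; the name and the use of Python lists are illustrative only, and a prediction that merely becomes not-taken is signalled here by passing `end_pointer=None`, matching the all-zeros update mask described for that case below):

```python
def update_branch_selectors(selectors, end_pointer, current_sel, new_sel):
    """Sketch of steps 140-150; returns the updated selector list."""
    n = len(selectors)
    # Step 140: update mask (all zeros when a prediction becomes not-taken).
    if end_pointer is None:
        update_mask = [0] * n
    else:
        update_mask = [int(i <= end_pointer) for i in range(n)]
    # Step 142: branch mask via XNOR with the current branch selector.
    branch_mask = [int(s == current_sel) for s in selectors]
    # Step 144: final update mask.
    final_mask = [u & b for u, b in zip(update_mask, branch_mask)]
    # Step 146: second branch mask via XNOR with the new branch selector.
    second_mask = [int(s == new_sel) for s in selectors]
    # Step 148: extended mask = second branch mask AND NOT final update mask.
    extended_mask = [s2 & (1 - f) for s2, f in zip(second_mask, final_mask)]
    # Step 150: apply both masks. Extended-mask bytes take the selector of the
    # byte immediately subsequent to the last byte whose extended bit is set.
    out = list(selectors)
    for i, f in enumerate(final_mask):
        if f:
            out[i] = new_sel
    if any(extended_mask):
        last = max(i for i, e in enumerate(extended_mask) if e)
        follow = out[last + 1] if last + 1 < n else 0  # sequential past the group
        for i, e in enumerate(extended_mask):
            if e:
                out[i] = follow
    return out
```

With the initial selectors of Fig. 8 and the parameters of each example, this sketch reproduces the updated selector sets described for Figs. 8, 9, and 10.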
Turning now to Fig. 8, an example of the update of the branch selectors using the steps shown in the flowchart of Fig. 7 is shown. Each byte position is listed (reference number 160), followed by a set of branch selectors prior to update (reference number 162). In the initial set of branch selectors 162, a subroutine return instruction ends at byte position 1, as well as a first branch instruction ending at byte position 8 (indicated by branch selector number 3) and a second branch instruction ending at byte position 11 (indicated by branch selector number 2).
For the example of Fig. 8, a previously undetected branch instruction is detected ending at byte position 6. The second branch prediction is selected to represent the branch prediction for the previously undetected branch instruction. The update mask is generated as shown at reference number 164, given that the end pointer of the previously undetected branch instruction is byte position 6. Since the example is a case of misprediction due to a previously undetected branch instruction and the branch selector at byte position 6 is "3", the current branch selector is "3". The XNORing of the current branch selector with the initial branch selectors 162 yields the branch mask depicted at reference number 166. The subsequent ANDing of the update mask and the branch mask yields the final update mask shown at reference number 168. As indicated by final update mask 168, byte positions 2 through 6 are updated to the new branch selector.
The second branch mask is generated by XNORing the new branch selector with initial branch selectors 162 (reference number 170). The new branch selector is "3", so second branch mask 170 is equal to branch mask 166 in this example. ANDing second branch mask 170 with the logical inversion of the final update mask 168 produces the extended mask shown at reference number 172. As extended mask 172 indicates, byte positions 7 and 8 are to be updated to indicate the first branch prediction, since the second branch prediction has been assigned to the branch instruction ending at byte position 6 and the branch instruction represented by the first branch prediction ends at byte 11. An updated set of branch selectors is shown at reference number 174. The updated set of branch selectors at reference number 174 reflects choosing the branch prediction corresponding to branch selector "3" for storing branch prediction information corresponding to the previously undetected branch instruction.
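The intermediate mask values of this example can be reproduced with simple list arithmetic (a sketch for byte positions 0 through 15; the initial selector values are inferred from the description, with "1" marking the return selector, "3" the second branch prediction, "2" the first, and "0" sequential):

```python
# Initial branch selectors 162: return ends at byte 1, the branch using
# selector "3" ends at byte 8, the branch using selector "2" ends at byte 11.
selectors_162 = [1, 1, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 0, 0, 0, 0]

end_pointer = 6     # previously undetected branch ends at byte position 6
current_sel = 3     # branch selector stored at byte position 6
new_sel     = 3     # second branch prediction -> branch selector "3"

update_164   = [int(i <= end_pointer) for i in range(16)]
branch_166   = [int(s == current_sel) for s in selectors_162]
final_168    = [u & b for u, b in zip(update_164, branch_166)]
second_170   = [int(s == new_sel) for s in selectors_162]
extended_172 = [s & (1 - f) for s, f in zip(second_170, final_168)]
```

As in the figure, the resulting final update mask covers byte positions 2 through 6 and the extended mask covers byte positions 7 and 8.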
Turning next to Fig. 9, a second example of the update of the branch selectors using the steps shown in the flowchart of Fig. 7 is shown. Similar to the example of Fig. 8, each byte position is listed (reference number 160), followed by a set of branch selectors prior to update (reference number 162). In the initial set of branch selectors 162, a subroutine return instruction ends at byte position 1, as well as a first branch instruction ending at byte position 8 (indicated by branch selector number 3) and a second branch instruction ending at byte position 11 (indicated by branch selector number 2).
For the example of Fig. 9, a previously undetected branch instruction is again detected ending at byte position 6. However, the first branch prediction is selected to represent the branch prediction for the previously undetected branch instruction (as opposed to the second branch prediction as shown in Fig. 8). Since the misprediction is at the same byte position as Fig. 8, the same update mask, branch mask, and final update mask are generated as in Fig. 8 (reference numbers 164, 166, and 168).
The second branch mask is generated by XNORing the new branch selector with initial branch selectors 162 (reference number 180). The new branch selector is "2" in this example, so second branch mask 180 indicates byte positions 9 through 11. ANDing second branch mask 180 with the logical inversion of the final update mask 168 produces the extended mask shown at reference number 182. As extended mask 182 indicates, byte positions 9 through 11 are to be updated to indicate the branch prediction following byte position 11 (i.e. the sequential branch prediction), since the first branch prediction has been assigned to the branch instruction ending at byte position 6 and the branch instruction represented by the second branch prediction ends at byte 8. An updated set of branch selectors is shown at reference number 184. The updated set of branch selectors at reference number 184 reflects choosing the branch prediction corresponding to branch selector "2" for storing branch prediction information corresponding to the previously undetected branch instruction.

Turning now to Fig. 10, a third example of the update of the branch selectors using the steps shown in the flowchart of Fig. 7 is shown. Similar to the example of Fig. 8, each byte position is listed (reference number 160), followed by a set of branch selectors prior to update (reference number 162). In the initial set of branch selectors 162, a subroutine return instruction ends at byte position 1, as well as a first branch instruction ending at byte position 8 (indicated by branch selector number 3) and a second branch instruction ending at byte position 11 (indicated by branch selector number 2).
For the example of Fig. 10, the branch instruction ending at byte position 8 is mispredicted as taken, and the ensuing update of the second branch prediction causes the prediction counter to indicate not-taken. Since the branch prediction is not-taken, the branch selectors indicating the branch prediction should be updated to indicate the subsequent branch instruction (or updated to indicate sequential, if there is no subsequent branch instruction within the group of contiguous instruction bytes). In cases in which a branch prediction becomes not-taken, the end pointer of the "new" branch instruction is invalid, since there is no newly detected branch instruction. Therefore, the update mask is generated as all zeros (reference number 190). Since the current branch selector is "3", the branch mask is generated as shown at reference number 191. Therefore, the final update mask (reference number 192) is all zeros.
The second branch mask is generated by XNORing the new branch selector with initial branch selectors 162 (reference number 194). The new branch selector is set to "3" in this example, such that each of the branch selectors coded to "3" is indicated by second branch mask 194. ANDing second branch mask 194 with the logical inversion of the final update mask 192 produces the extended mask shown at reference number 196. As extended mask 196 indicates, byte positions 2 through 8 are to be updated to indicate the branch prediction following byte position 8 (i.e. the first branch prediction), since the first branch prediction is assigned to the branch instruction ending at byte position 11. An updated set of branch selectors is shown at reference number 198. The updated set of branch selectors at reference number 198 reflects deleting branch selector "3" from the set of branch selectors corresponding to the group of contiguous instruction bytes, since the second branch prediction is no longer storing a predicted-taken branch prediction.
As Fig. 10 illustrates, the procedure for removing a branch selector when a prediction indicates not-taken is similar to the procedure for reassigning a branch prediction. The differences in the two procedures are that the update mask for removing a branch selector is always generated as zeros, and the current branch selector is provided as the "new" branch selector in order to generate the extended mask. It is noted that although the preceding discussion has focused on an embodiment in which a variable byte-length instruction set is employed (e.g. the x86 instruction set), branch selectors may be employed in branch prediction mechanisms for fixed byte-length instruction sets as well. An embodiment for fixed byte-length instruction sets may store a branch selector for each instruction, since the instructions are stored at constant offsets within cache lines or groups of contiguous instruction bytes. It is further noted that although the embodiment above shows multiple branch predictions per group of contiguous instruction bytes, branch selectors may be employed even when only one branch prediction is stored for each group. The branch selectors in this case may be a single bit. If the bit is set, then the branch prediction is selected. If the bit is clear, then the sequential prediction is selected.
It is noted that, as referred to above, a previously undetected branch instruction is a branch instruction represented by none of the branch predictions within the corresponding prediction block. The previously undetected branch instruction may be previously undetected (i.e. not executed since the corresponding cache line was stored into instruction cache 16). Alternatively, the branch prediction corresponding to the previously undetected branch instruction may have been reassigned to a different branch instruction within the corresponding group of contiguous instruction bytes.

Turning now to Fig. 11, a computer system 200 including microprocessor 10 is shown. Computer system 200 further includes a bus bridge 202, a main memory 204, and a plurality of input/output (I/O) devices 206A-206N. Plurality of I/O devices 206A-206N will be collectively referred to as I/O devices 206. Microprocessor 10, bus bridge 202, and main memory 204 are coupled to a system bus 208. I/O devices 206 are coupled to an I/O bus 210 for communication with bus bridge 202. Bus bridge 202 is provided to assist in communications between I/O devices 206 and devices coupled to system bus 208. I/O devices 206 typically require longer bus clock cycles than microprocessor 10 and other devices coupled to system bus 208. Therefore, bus bridge 202 provides a buffer between system bus 208 and input/output bus 210. Additionally, bus bridge 202 translates transactions from one bus protocol to another. In one embodiment, input/output bus 210 is an Enhanced Industry Standard Architecture (EISA) bus
and bus bridge 202 translates from the system bus protocol to the EISA bus protocol. In another embodiment, input/output bus 210 is a Peripheral Component Interconnect (PCI) bus and bus bridge 202 translates from the system bus protocol to the PCI bus protocol. It is noted that many variations of system bus protocols exist. Microprocessor 10 may employ any suitable system bus protocol. I/O devices 206 provide an interface between computer system 200 and other devices external to the computer system. Exemplary I/O devices include a modem, a serial or parallel port, a sound card, etc. I/O devices 206 may also be referred to as peripheral devices. Main memory 204 stores data and instructions for use by microprocessor 10. In one embodiment, main memory 204 includes at least one Dynamic Random Access Memory (DRAM) and a DRAM memory controller. It is noted that although computer system 200 as shown in Fig. 11 includes one bus bridge 202, other embodiments of computer system 200 may include multiple bus bridges 202 for translating to multiple dissimilar or similar I/O bus protocols. Still further, a cache memory for enhancing the performance of computer system 200 by storing instructions and data referenced by microprocessor 10 in a faster memory storage may be included. The cache memory may be inserted between microprocessor 10 and system bus 208, or may reside on system bus 208 in a "lookaside" configuration.
Although various components above have been described as multiplexors, it is noted that multiple multiplexors, in series or in parallel, may be employed to perform the selection represented by the multiplexors shown.
It is still further noted that the present discussion may refer to the assertion of various signals. As used herein, a signal is "asserted" if it conveys a value indicative of a particular condition. Conversely, a signal is "deasserted" if it conveys a value indicative of a lack of a particular condition. A signal may be defined to be asserted when it conveys a logical zero value or, conversely, when it conveys a logical one value. Additionally, various values have been described as being discarded in the above discussion. A value may be discarded in a number of manners, but generally involves modifying the value such that it is ignored by logic circuitry which receives the value. For example, if the value comprises a bit, the logic state of the value may be inverted to discard the value. If the value is an n-bit value, one of the n-bit encodings may indicate that the value is invalid. Setting the value to the invalid encoding causes the value to be discarded. Additionally, an n-bit value may include a valid bit indicative, when set, that the n-bit value is valid. Resetting the valid bit may comprise discarding the value. Other methods of discarding a value may be used as well. Table 1 below indicates fast path, double dispatch, and MROM instructions for one embodiment of microprocessor 10 employing the x86 instruction set.
Table 1: x86 Fast Path, Double Dispatch, and MROM Instructions
X86 Instruction Instruction Category
AAA MROM
AAD MROM
AAM MROM
AAS MROM
BOUND MROM
BSF fast path
BSR fast path
BSWAP MROM
BT fast path
BTC fast path
BTR fast path
BTS fast path
CALL fast path/double dispatch
CBW fast path
CWDE fast path
CLC fast path
CLD fast path
CLI MROM
CLTS MROM
CMC fast path
CMP fast path
CMPS MROM
CMPSB MROM
CMPSW MROM
CMPSD MROM
CMPXCHG MROM
CMPXCHG8B MROM
CPUID MROM
CWD MROM
CDQ MROM
DAA MROM
DAS MROM
DEC fast path
DIV MROM
ENTER MROM
HLT MROM
IDIV MROM
IMUL double dispatch
INSB MROM
INSW MROM
INSD MROM
INT MROM
INTO MROM
INVD MROM
INVLPG MROM
IRET MROM
IRETD MROM
Jcc fast path
JCXZ double dispatch
JECXZ double dispatch
LDS MROM
LES MROM
LFS MROM
LGS MROM
LSS MROM
LEA fast path
LEAVE double dispatch
LGDT MROM
LIDT MROM
LLDT MROM
LMSW MROM
LODS MROM
LODSB MROM
LODSW MROM
LODSD MROM
LOOP double dispatch
LOOPcond MROM
LSL MROM
LTR MROM
MOV fast path
MOVCC fast path
MOV.CR MROM
MOV.DR MROM
MOVS MROM
MOVSB MROM
MOVSW MROM
MOVSD MROM
MOVSX fast path
MOVZX fast path
MUL double dispatch
NEG fast path
NOP fast path
NOT fast path
OR fast path
OUT MROM
OUTS MROM
OUTSB MROM
OUTSW MROM
OUTSD MROM
POP double dispatch
POPA MROM
POPAD MROM
POPF MROM
POPFD MROM
PUSH fast path/double dispatch
PUSHA MROM
PUSHAD MROM
PUSHF fast path
PUSHFD fast path
RCL MROM
RCR MROM
REP MROM
REPE MROM
REPZ MROM
REPNE MROM
REPNZ MROM
RET double dispatch
RSM MROM
SCASW MROM
SCASD MROM
SETcc fast path
SGDT MROM
SIDT MROM
SHLD MROM
SHRD MROM
SLDT MROM
SMSW MROM
STC fast path
STD fast path
STI MROM
STOS MROM
STOSB MROM
STOSW MROM
STOSD MROM
STR MROM
VERW MROM
WBINVD MROM
WRMSR MROM
XADD MROM
XLATB fast path
XOR fast path
Note: Instructions including an SIB byte are also considered double dispatch instructions.
It is noted that a superscalar microprocessor in accordance with the foregoing may further employ the latching structures as disclosed within the co-pending, commonly assigned patent application entitled "Conditional Latching Mechanism and Pipelined Microprocessor Employing the Same", Serial No. 08/400,608, filed March 8, 1995, by Pflum et al. The disclosure of this patent application is incorporated herein by reference in its entirety.
It is further noted that aspects regarding array circuitry may be found in the co-pending, commonly assigned patent application entitled "High Performance Ram Array Circuit Employing Self-Time Clock Generator for Enabling Array Access", Serial No. 08/473,103, filed June 7, 1995, by Tran. The disclosure of this patent application is incorporated herein by reference in its entirety.
It is additionally noted that other aspects regarding superscalar microprocessors may be found in the following co-pending, commonly assigned patent applications: "Linearly Addressable Microprocessor Cache", Serial No. 08/146,381, filed Oct. 29, 1993 by Witt; "Superscalar Microprocessor Including a High Performance Instruction Alignment Unit", Serial No. 08/377,843, filed January 25, 1995 by Witt, et al.; "A Way Prediction Structure", Serial No. 08/522,181, filed August 31, 1995 by Roberts, et al.; "A Data Cache Capable of Performing Store Accesses in a Single Clock Cycle", Serial No. 08/521,627, filed August 31, 1995 by Witt, et al.; "A Parallel and Scalable Instruction Scanning Unit", Serial No. 08/475,400, filed June 7, 1995 by Narayan; and "An Apparatus and Method for Aligning Variable-Byte Length Instructions to a Plurality of Issue Positions", Serial No. 08/582,473, filed January 2, 1996 by Tran, et al. The disclosures of these patent applications are incorporated herein by reference in their entirety.
In accordance with the above disclosure, a branch prediction mechanism using branch selectors is described. The branch prediction mechanism quickly locates the branch prediction corresponding to a given fetch address by selecting the branch selector corresponding to the byte indicated by the given fetch address and selecting the branch prediction indicated by that branch selector. The branch prediction mechanism may be capable of operating at a higher frequency than previous branch prediction mechanisms.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.