WO2003032187A2 - Multiply-accumulate (MAC) unit for single-instruction/multiple-data (SIMD) instructions - Google Patents


Info

Publication number: WO2003032187A2
Authority: WO (WIPO PCT)
Prior art keywords: vectors, multiply, bits, vector, machine
Application number: PCT/US2002/031412
Other languages: French (fr)
Other versions: WO2003032187A3 (en)
Inventors: Stephen Strazdus, Yuyun Liao, Anthony Jebson, Nigel Paver, Deli Deng
Original assignee: Intel Corporation
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Intel Corporation
Priority to DE60222163T, patent DE60222163T2 (en)
Priority to JP2003535084A, patent JP4584580B2 (en)
Priority to EP02800879A, patent EP1446728B1 (en)
Priority to AU2002334792A, patent AU2002334792A1 (en)
Priority to KR1020047005030A, patent KR100834178B1 (en)
Publication of WO2003032187A2 (en)
Publication of WO2003032187A3 (en)
Priority to HK04106791A, patent HK1065127A1 (en)

Classifications

    • G06F7/5443 — Sum of products
    • G06F7/5318 — Multiplying in parallel-parallel fashion with column-wise addition of partial products, e.g. using Wallace tree, Dadda counters
    • G06F7/5324 — Multiplying in parallel-parallel fashion, partitioned, i.e. using repetitively a smaller parallel-parallel multiplier or an array of such multipliers
    • G06F7/5338 — Booth-type multiple-bit scanning with overlapped bit groups, each group having two new bits, e.g. 2nd-order modified Booth algorithm
    • G06F2207/382 — Reconfigurable for different fixed word lengths
    • G06F2207/3828 — Multigauge devices, i.e. capable of handling packed numbers without unpacking them
    • G06F2207/3884 — Pipelining

Definitions

  • a tightly coupled dual 16-bit MAC unit such as that shown in Figure 1, may be used for 32-bit X 32-bit instructions as well as 16-bit SIMD instructions according to an embodiment.
  • a 32-bit X 32-bit operation may be divided into four 16-bit X 16-bit operations, as shown in the following equation:
  • A[31:0] X B[31:0] = (A[31:16] X B[15:0] X 2^16 + A[15:0] X B[15:0]) + (A[31:16] X B[31:16] X 2^16 + A[15:0] X B[31:16]) X 2^16
  • Figures 6A and 6B are a flow chart describing a 32-bit X 32-bit MAC operation 600 according to an embodiment.
  • the partial product vectors of A[15:0] X B[15:0] are generated by the MUX & Booth encoder unit 102 (block 602).
  • the Wallace Tree unit 106 compresses the partial product vectors into two vectors (block 604) .
  • the two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively.
  • the partial product vectors of A[31:16] X B[15:0] are generated by the MUX & Booth encoder unit 112 (block 606) .
  • the Wallace Tree unit 116 compresses the partial product vectors into two vectors (block 608).
  • the two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.
  • the partial product vectors of A[15:0] X B[31:16] and the feedback vector Vs0 are then compressed into two vectors by the Wallace Tree unit 106 (block 616).
  • the two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively.
  • the partial product vectors of A[31:16] X B[31:16] and the feedback vector Vc0 are then compressed into two vectors by the Wallace Tree unit 116 (block 618).
  • the two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.
  • the MAC unit 100 may be implemented in a variety of systems including general purpose computing systems, digital processing systems, laptop computers, personal digital assistants (PDAs), and cellular phones. In such a system, the MAC unit may be included in a processor coupled to a memory device, such as a Flash memory device or a static random access memory (SRAM), which stores an operating system or other software applications.
  • Such a processor may be used in video camcorders, teleconferencing, PC video cards, and High-Definition Television (HDTV) .
  • the processor may be used in connection with other technologies utilizing digital signal processing such as voice processing used in mobile telephony, speech recognition, and other applications.
  • Figure 7 illustrates a mobile video device 700 including a processor 701 with a MAC unit 100 according to an embodiment.
  • the mobile video device 700 may be a hand-held device which displays video images produced from an encoded video signal received from an antenna 702 or a digital video storage medium 704, e.g., a digital video disc (DVD) or a memory card.
  • the processor 701 may communicate with a cache memory 706, which may store instructions and data for the processor operations, and other devices, for example, an SRAM 708.
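The 32-bit X 32-bit decomposition given above can be checked numerically. The sketch below is an illustration with assumed conventions (signed 32-bit operands, unsigned low halves, signed high halves); it does not model the hardware sequencing of blocks 602-618:

```c
#include <assert.h>
#include <stdint.h>

/* Check of A[31:0] X B[31:0] built from four 16 X 16 products, following
 * the decomposition A = A[31:16]*2^16 + A[15:0] with an unsigned low half
 * and a signed high half (assumed conventions). */
static int64_t mul32_via_16(int32_t a, int32_t b) {
    uint32_t alo = (uint16_t)a, blo = (uint16_t)b;  /* A[15:0], B[15:0]   */
    int32_t  ahi = a >> 16,     bhi = b >> 16;      /* A[31:16], B[31:16];
                                                       arithmetic shift assumed */
    int64_t lo_group = (int64_t)ahi * blo * 65536 + (int64_t)alo * blo;
    int64_t hi_group = (int64_t)ahi * bhi * 65536 + (int64_t)alo * bhi;
    return lo_group + hi_group * 65536;   /* X 2^16, per the equation above */
}
```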

Abstract

A tightly coupled dual 16-bit multiply-accumulate (MAC) unit for performing single-instruction/multiple-data (SIMD) operations may forward an intermediate result to another operation in a pipeline to resolve an accumulating dependency penalty. The MAC unit may also be used to perform 32-bit X 32-bit operations.

Description

MULTIPLY-ACCUMULATE (MAC) UNIT FOR SINGLE-INSTRUCTION/MULTIPLE-DATA (SIMD)
INSTRUCTIONS
BACKGROUND
[0001] Digital signal processors (DSPs) may operate as SIMD (Single-Instruction/Multiple-Data), or data parallel, processors. In SIMD operations, a single instruction is sent to a number of processing elements, which perform the same operation on different data. SIMD instructions provide for several types of standard operations including addition, subtraction, multiplication, multiply-accumulate (MAC), and a number of special instructions for performing, for example, clipping and bilinear interpolation operations. [0002] Many DSP applications, including many speech codecs, require high-performance 16-bit multiply-accumulate (MAC) operations. To achieve high performance for these 16-bit DSP applications, 64-bit SIMD instructions may be introduced. The 64-bit SIMD instructions may be used to handle media streams more efficiently and to reduce register pressure and memory traffic, since four 16-bit data items may be loaded into a 64-bit register at one time. [0003] While high throughput is an important factor for achieving high performance, power consumption may also be an important consideration in designing DSPs for wireless/handheld products. Accordingly, MAC architectures which are capable of high performance with low power demands may be desirable for use in DSPs.
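The packing described in paragraph [0002] can be sketched in a few lines of C. This is an illustrative model only (the function and lane names are invented for this example, not taken from the patent): it shows how four 16-bit data items occupy one 64-bit register image.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: four 16-bit items packed into one 64-bit register
 * image, as a single 64-bit SIMD load would provide them. */
static uint64_t pack4x16(uint16_t d0, uint16_t d1, uint16_t d2, uint16_t d3) {
    return (uint64_t)d0 | ((uint64_t)d1 << 16) |
           ((uint64_t)d2 << 32) | ((uint64_t)d3 << 48);
}

static uint16_t lane16(uint64_t r, int i) {   /* extract lane i, 0 <= i <= 3 */
    return (uint16_t)(r >> (16 * i));
}
```

One 64-bit load thus replaces four 16-bit loads, which is the source of the reduced register pressure and memory traffic noted above.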
BRIEF DESCRIPTION OF THE DRAWINGS [0004] Figure 1 is a block diagram of a dual multiply-accumulate (MAC) unit according to an embodiment. [0005] Figure 2 is a block diagram illustrating a MAC SIMD (Single-Instruction/Multiple-Data) operation according to an embodiment. [0006] Figures 3A to 3C are flowcharts describing a MAC SIMD operation according to an embodiment.
[0007] Figures 4A to 4C are block diagrams illustrating pipelined instruction sequences utilizing data forwarding according to an embodiment. [0008] Figures 5A to 5C are block diagrams illustrating pipelined instruction sequences utilizing intermediate data forwarding according to an embodiment.
[0009] Figures 6A and 6B are flowcharts describing a 32-bit X 32-bit MAC operation performed on a tightly coupled dual 16-bit MAC unit according to an embodiment.
[0010] Figure 7 is a block diagram of a mobile video unit including a MAC unit according to an embodiment.
DETAILED DESCRIPTION [0011] Figure 1 illustrates a Multiply-Accumulate (MAC) unit 100 according to an embodiment. The MAC unit 100 may be used to perform a number of different SIMD (Single-Instruction/Multiple-Data) operations. [0012] The MAC unit 100 may have a tightly coupled dual 16-bit MAC architecture. A 16-bit MAC SIMD operation 200 which may be performed by such a MAC unit is shown conceptually in Figure 2. The contents of two 64-bit registers, 202 (wRn) and 204 (wRm), may be treated as four pairs of 16-bit values, A0-A3 (wRn) and B0-B3 (wRm). The first 16 bits to fourth 16 bits of wRn are multiplied by the first 16 bits to fourth 16 bits of wRm, respectively. The four multiplied results P0-P3 are then added to the value in 64-bit register 206 (wRd), and the result is sent back to register 206.
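The behavior of operation 200 can be stated as a short reference model. The sketch below is an illustration under stated assumptions (signed 16-bit lanes; the names wrn, wrm, and wrd merely echo the register names), not the patent's hardware:

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of the 16-bit MAC SIMD operation of Figure 2:
 * wRd += P0 + P1 + P2 + P3, where Pi = Ai * Bi for the four 16-bit
 * lane pairs of wRn and wRm. Signed lanes are an assumption. */
static int64_t simd_mac16(uint64_t wrn, uint64_t wrm, int64_t wrd) {
    for (int i = 0; i < 4; i++) {
        int16_t a = (int16_t)(wrn >> (16 * i));   /* lane Ai */
        int16_t b = (int16_t)(wrm >> (16 * i));   /* lane Bi */
        wrd += (int64_t)a * b;                    /* accumulate Pi */
    }
    return wrd;
}
```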
[0013] The MAC operation 200 may be implemented in four execution stages: (1) Booth encoding and Wallace Tree compression of B1 and B0; (2) Booth encoding and Wallace Tree compression of B3 and B2; (3) 4-to-2 compression, and addition of the low 32 bits of the result; and (4) addition of the upper 32 bits of the result. These four stages may be referred to as the CSA0, CSA1, CLA0, and CLA1 stages, respectively. [0014] Figures 3A to 3C illustrate a flow chart describing an implementation 300 of the MAC operation 200 according to an embodiment. In the CSA0 stage, a MUX & Booth encoder unit 102 selects B0 (16 bits) and encodes those bits (block 302). Control signals are generated, each of which selects a partial product vector from the set {0, -A0, -2A0, A0, 2A0}. Nine partial product vectors, Pa0 to Pa8, are generated and passed to a MUX array 104 (block 304). All nine partial product vectors and the low 32 bits of the value in register 206 (wRd) are compressed into two vectors by a Wallace Tree unit 106 (block 306). The two vectors include a sum vector and a carry vector, which are stored in a sum vector flip-flop (FF) 108 and a carry vector FF 110, respectively.
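The radix-4 (modified) Booth step of block 302 can be illustrated functionally. The sketch below recodes a 16-bit multiplier into nine signed digits, each selecting a partial product from {0, -A, -2A, A, 2A}, and sums the partial products; the control-signal encoding of the actual MUX & Booth encoder unit is not specified in the text, so this is an assumed software model:

```c
#include <assert.h>
#include <stdint.h>

/* Radix-4 (modified) Booth recoding sketch: each overlapping 3-bit group
 * of the sign-extended 16-bit multiplier b yields a digit in {-2..2},
 * i.e. a partial product from {0, -A, -2A, A, 2A}. Nine groups (Pa0..Pa8)
 * cover the multiplier, matching the nine partial product vectors above. */
static int64_t booth_radix4_mul(int16_t a, int16_t b) {
    int64_t bs = (int64_t)b;                      /* sign-extended multiplier */
    int64_t acc = 0;
    for (int i = 0; i < 9; i++) {
        int b_hi  = (int)((bs >> (2 * i + 1)) & 1);
        int b_mid = (int)((bs >> (2 * i)) & 1);
        int b_lo  = (i == 0) ? 0 : (int)((bs >> (2 * i - 1)) & 1);
        int digit = b_mid + b_lo - 2 * b_hi;      /* Booth digit */
        acc += (int64_t)digit * a * (1LL << (2 * i));  /* partial product */
    }
    return acc;
}
```

In hardware the nine partial products are not summed sequentially as here; they go to the Wallace Tree compressors described in the flow.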
[0015] A MUX & Booth encoder unit 112 selects B1 (16 bits) and encodes those bits (block 308). Control signals are generated, each of which selects a partial product vector from the set {0, -A1, -2A1, A1, 2A1}. Nine partial product vectors, Pb0 to Pb8, are generated and passed to a MUX array 114 (block 310). All nine partial product vectors and a zero vector are compressed into two vectors by a Wallace Tree unit 116 (block 312). The two vectors include a sum vector and a carry vector, which are stored in a sum vector
FF 118 and a carry vector FF 120, respectively.
[0016] In the CSA1 stage, the four sum and carry vectors from FFs 108, 110, 118, and 120 from the CSA0 stage are compressed into vectors Vs0 and Vc0 by a MUX & 4-to-2 compressor unit 122 (block 314). The MUX & Booth encoder unit 102 selects B2 (16 bits) and encodes those bits (block 316). Control signals are generated, each of which selects a partial product vector from the set {0, -A2, -2A2, A2, 2A2}. Nine partial product vectors are generated (block 318). All nine partial product vectors and vector Vs0 are then compressed into two vectors by the Wallace Tree unit 106 (block 320). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively.
[0017] The MUX & Booth encoder 112 selects B3 (16 bits) and then encodes those bits (block 322). Control signals are generated, each of which selects a partial product vector from the set {0, -A3, -2A3, A3, 2A3}. Nine partial product vectors are generated (block 324). All nine partial product vectors and vector Vc0 are then compressed into two vectors by the Wallace Tree unit 116 (block 326). The two vectors include a sum vector and a carry vector, which are stored in a sum vector FF 118 and a carry vector FF 120, respectively. [0018] In the CLA0 stage, four vectors from FFs 108,
110, 118, and 120 from the CSA1 stage are sent to the 4-to-2 compressor unit 122 to generate vector Vs1 and vector Vc1 (block 327). The lower 32 bits of Vs1 and Vc1 are added by the carry look-ahead (CLA) unit 124 to generate the low 32 bits of the final result (block 328). [0019] In the CLA1 stage, the upper bits of Vs1 and Vc1 are sign-extended to two 32-bit vectors (block 330). The extended vectors and the upper 32 bits of wRd are then compressed into two vectors by a 3-to-2 compressor unit 126 (block 332). The two compressed vectors and the carry-in bit from the CLA0 unit 124 are added together by CLA unit 128 to generate the upper 32 bits of the final result (block 334). [0020] As described above, the Booth encoding and vector compression take two cycles to finish. In the first cycle, the results from both Wallace Tree units are sent back for further processing in the second cycle. Conventionally, all four vectors from FFs 108, 110, 118, and 120 would be sent back to the Wallace Trees for further processing in the second cycle. However, it has been observed that the MUX & 4-to-2 compressor unit 122 may perform the 4-to-2 compression of the vectors faster than the MUX & Booth encoder units and the MUX arrays. Thus, only two vectors (Vs0 and Vc0) from the MUX & 4-to-2 compressor unit 122 are sent back to the Wallace Tree units 106 and 116. With this architecture, the feedback routings may be reduced and the Wallace Tree units 106, 116 made relatively smaller. Fewer feedback routings make the layout easier, which is desirable since routing limitations are an issue in MAC design. [0021] Some conventional MAC implementations perform the 64-bit addition in one cycle. However, such MACs may not be suitable for a very high frequency 64-bit datapath, and their results may not have enough time to return through the bypass logic, which is commonly used for solving data dependencies in pipelining.
Compared with conventional architectures, the dual MAC architecture shown in Figure 1 may be more readily implemented in very high frequency and low power applications. The CLA1 stage may have fewer logic gates than the CLA0 stage, which gives the final results enough time to return through the bypass logic, making this dual MAC architecture suitable for a high-speed, low-power 64-bit datapath.
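The carry-save idea that underlies the Wallace Tree and 4-to-2 compression stages above can be shown in a few lines. This is a generic sketch of 3-to-2 and 4-to-2 compression, not the patent's gate-level cells: operands are kept as a (sum, carry) pair whose true value is sum + carry, so no carry chain has to ripple until the final carry look-ahead additions:

```c
#include <assert.h>
#include <stdint.h>

/* Generic carry-save sketch. A 3-to-2 (full-adder) step reduces three
 * addends to a (sum, carry) pair; a 4-to-2 compressor is modeled here as
 * two 3-to-2 levels. Invariant: s + c equals the sum of the inputs
 * (modulo 2^64). */
typedef struct { uint64_t s, c; } sc_pair;

static sc_pair csa3to2(uint64_t a, uint64_t b, uint64_t c) {
    sc_pair r;
    r.s = a ^ b ^ c;                              /* bitwise sum */
    r.c = ((a & b) | (a & c) | (b & c)) << 1;     /* majority bits, shifted */
    return r;
}

static sc_pair compress4to2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    sc_pair t = csa3to2(a, b, c);     /* first reduction level */
    return csa3to2(t.s, t.c, d);      /* second reduction level */
}
```

Because each level is carry-free, the delay of such a tree grows with the number of levels rather than the operand width, which is why only the final CLA stages need full additions.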
[0022] The MAC unit may be used in a pipelined DSP. Pipelining, which changes the relative timing of instructions by overlapping their execution, may increase the throughput of a DSP compared to a non-pipelined DSP. However, pipelining may introduce data dependencies, or hazards, which may occur whenever the result of a previous instruction is not available and is needed by the current instruction. The current operation may be stalled in the pipeline until the data dependency is solved.
[0023] Typically, data forwarding is based on the final result of an operation. For many DSP algorithms, the result of the previous MAC operation needs to be added to the current MAC operation. However, a MAC operation may take four cycles to complete, and the result of the previous MAC operation may not be available for the current MAC operation. In this case, a data dependency called an accumulating dependency is introduced. [0024] Figures 4A-4C show possible accumulating dependency penalties for a standard data forwarding scheme. The standard forwarding scheme is used to reduce the accumulating dependency penalty, where EX 402 is the execution stage for other non-MAC instructions. Even if standard data forwarding is employed, the accumulating dependency penalty is still two cycles in the worst case, which is shown in Figure 4A (note that, although there are three stalls 404 before the final result is available after the CLA1 stage, the first stall 404 in Figure 4A is due to a resource conflict in the Wallace Tree unit, which is not counted as data dependency penalty). Two-cycle penalties may be too severe for some DSP applications, and hence it is desirable to eliminate the accumulating dependency penalty. [0025] The MAC unit 100 may be used to implement a new data forwarding scheme, referred to as intermediate data forwarding, which may eliminate the accumulating dependency penalty. Instead of waiting for a final result from a previous operation, the intermediate data forwarding scheme forwards an intermediate result to solve data dependencies. Figures 5A-5C illustrate the sequences shown in Figures 4A-4C, but implemented using an intermediate data forwarding technique.
[0026] As shown in Figures 5A-5C, the CSA0 stage 500 is segmented into two sub-stages 502 (BE0) and 504 (WT0) for Booth encoding and Wallace tree compressing, respectively, operands B0 and B1. The CSA1 stage 506 is segmented into two sub-stages 508 (BE1) and 510 (WT1) for Booth encoding and Wallace tree compressing, respectively, operands B2 and B3. The CLA0 stage 512 is segmented into two sub-stages 514 (4T2) and 516 (ADD0) for 4-to-2 compressing of vectors and low 32-bit addition of the final result. The CLA1 stage 518 includes the upper 32-bit addition of the final result 520 (ADD1).

[0027] In the cases shown in Figures 5A and 5B, the low 32 bits of intermediate vectors Vs, Vc of the first MAC instruction may be forwarded to the Wallace Tree units 106 and 116 for the second MAC instruction to resolve the accumulating dependency. The upper 32-bit result of the first MAC instruction from the CLA1 unit 128 is forwarded to the MUX & 3-to-2 compressor unit 126. The stall 404 in Figure 5A is due to the Wallace Tree resource conflict, which is not counted as a data dependency penalty.

[0028] In the case shown in Figure 5C, the final result of the first MAC instruction is not available when it is needed by the second MAC instruction, but the low 32-bit result of the first MAC instruction is available. Instead of waiting for the final result, the low 32-bit result of the first MAC instruction is forwarded to the Wallace Tree unit 106 to resolve the accumulating dependency. The upper 32-bit result of the first MAC instruction from the CLA1 unit 128 is forwarded to the MUX & 3-to-2 compressor unit 126.

[0029] The accumulating data dependency penalties of the standard data forwarding technique shown in Figures 4A-4C and the intermediate data forwarding technique shown in Figures 5A-5C are compared in Table 1.
As shown in Table 1, intermediate data forwarding may eliminate accumulating dependencies, which may enable relatively high throughput for many DSP applications.
TABLE 1 (presented as an image in the original publication)
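The intuition behind forwarding intermediate vectors can be sketched in software. The model below, including its function name, is an illustration rather than the patented circuitry: a running value kept as a redundant sum/carry pair can absorb a new addend through a 3-to-2 compressor using only bitwise logic, deferring the slow carry-propagate addition that the CLA stages perform:

```python
# Illustrative sketch: keep the accumulator in redundant carry-save form,
# a sum vector Vs and a carry vector Vc with Vs + Vc == value. A 3-to-2
# compressor folds in a new addend with pure bitwise logic, so no
# carry-propagate addition is needed between successive MACs.

def compress_3to2(vs, vc, x, width=64):
    """Carry-save addition: returns (vs2, vc2) with vs2 + vc2 == vs + vc + x."""
    mask = (1 << width) - 1
    s = (vs ^ vc ^ x) & mask                              # bitwise sums
    c = (((vs & vc) | (vs & x) | (vc & x)) << 1) & mask   # carries, shifted left
    return s, c

vs, vc = 0, 0
for a, b in [(3, 5), (7, 9), (11, 13)]:
    vs, vc = compress_3to2(vs, vc, a * b)   # accumulate without a full adder

assert vs + vc == 3*5 + 7*9 + 11*13         # one final carry-propagate add
print(vs + vc)                              # 221
```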
[0030] A tightly coupled dual 16-bit MAC unit, such as that shown in Figure 1, may be used for 32-bit X 32-bit instructions as well as 16-bit SIMD instructions according to an embodiment. A 32-bit X 32-bit operation may be divided into four 16-bit X 16-bit operations, as shown in the following equation:
A[31:0] X B[31:0] = (A[31:16] X B[15:0] X 2^16 + A[15:0] X B[15:0]) + (A[31:16] X B[31:16] X 2^16 + A[15:0] X B[31:16]) X 2^16.
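A quick numerical check of this decomposition (the helper name is assumed, not from the patent):

```python
# Sanity check (illustrative, not the hardware): a 32x32-bit multiply
# assembled from the four 16x16-bit partial products in the equation above.

def mul32_from_16(a, b):
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low_pair = a_hi * b_lo * 2**16 + a_lo * b_lo             # first parenthesis
    high_pair = (a_hi * b_hi * 2**16 + a_lo * b_hi) * 2**16  # second, weighted
    return low_pair + high_pair

assert mul32_from_16(0x89ABCDEF, 0x12345678) == 0x89ABCDEF * 0x12345678
assert mul32_from_16(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFF * 0xFFFFFFFF
```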
[0031] Figure 6 is a flow chart describing a 32-bit X 32-bit MAC operation 600 according to an embodiment. In the CSA0 stage, the partial product vectors of A[15:0] X B[15:0] are generated by the MUX & Booth encoder unit 102 (block 602). The Wallace Tree unit 106 compresses the partial product vectors into two vectors (block 604). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively. The partial product vectors of A[31:16] X B[15:0] are generated by the MUX & Booth encoder unit 112 (block 606). The Wallace Tree unit 116 compresses the partial product vectors into two vectors (block 608). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.
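The recoding performed by the MUX & Booth encoder units can be sketched as follows; the function names are assumed, and this is a behavioral illustration of radix-4 (modified) Booth encoding rather than the circuit itself. Each overlapping 3-bit group of the multiplier selects a digit in {-2, -1, 0, +1, +2}, roughly halving the number of partial products the Wallace Tree units must then compress:

```python
# Hypothetical sketch of radix-4 modified Booth recoding of an unsigned
# 16-bit multiplier. Digit k carries weight 4**k; one extra digit position
# covers the unsigned range.

def booth_radix4_digits(b, width=16):
    """Recode unsigned multiplier b (0 <= b < 2**width) into radix-4 digits."""
    table = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    shifted = b << 1   # implicit 0 appended below bit 0
    return [table[(shifted >> i) & 0b111] for i in range(0, width + 2, 2)]

def booth_multiply(a, b, width=16):
    # Negative digits correspond to subtracted (two's-complemented) partial products.
    return sum(d * a * 4**k for k, d in enumerate(booth_radix4_digits(b, width)))

assert booth_multiply(0x1234, 0x0567) == 0x1234 * 0x0567
assert booth_multiply(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF
```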
[0032] In the CSA1 stage, two vectors from the sum vector FF 118 and carry vector FF 120 are shifted left 16 bits (block 610). The MUX & 4-to-2 compressor unit 122 compresses the shifted vectors and the other two vectors from the sum vector FF 108 and carry vector FF 110 into vector Vs0 and vector Vc0 (block 612). The low 16 bits of Vs0 and Vc0 are sent to the CLA0 unit 124. The remaining bits are sent back to the Wallace Tree units 106 and 116. The final results from bit 0 to bit 15 are then generated by the CLA0 unit 124 (block 614). The partial product vectors of A[15:0] X B[31:16] and the feedback vector from Vs0 are then compressed into two vectors by the Wallace Tree unit 106 (block 616). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 108 and the carry vector FF 110, respectively. The partial product vectors of A[31:16] X B[31:16] and the feedback vector from Vs0 are then compressed into two vectors by the Wallace Tree unit 116 (block 618). The two vectors include a sum vector and a carry vector, which are stored in the sum vector FF 118 and the carry vector FF 120, respectively.
[0033] In the CLA0 stage, two vectors from the sum vector FF 118 and the carry vector FF 120 are shifted left 16 bits (block 620). The MUX & 4-to-2 compressor unit 122 compresses the shifted vectors and the other two vectors from the sum vector FF 108 and the carry vector FF 110 into vector Vs1 and vector Vc1 (block 622). The low 16 bits of vectors Vs1 and Vc1 are added by the CLA0 unit 124. The final results from bit 16 to bit 31 are then generated (block 624).

[0034] In the CLA1 stage, the upper bits (from bit 16 to bit 47) of vectors Vs1 and Vc1 are added by the CLA1 unit 128 to generate the upper 32-bit final results (from bit 32 to bit 63) (block 626).

[0035] The MAC unit 100 may be implemented in a variety of systems, including general purpose computing systems, digital processing systems, laptop computers, personal digital assistants (PDAs), and cellular phones. In such a system, the MAC unit may be included in a processor coupled to a memory device, such as a Flash memory device or a static random access memory (SRAM), which stores an operating system or other software applications.

[0036] Such a processor may be used in video camcorders, teleconferencing, PC video cards, and High-Definition Television (HDTV). In addition, the processor may be used in connection with other technologies utilizing digital signal processing, such as the voice processing used in mobile telephony, speech recognition, and other applications.

[0037] For example, Figure 7 illustrates a mobile video device 700 including a processor 701 including a MAC unit
100 according to an embodiment. The mobile video device 700 may be a hand-held device which displays video images produced from an encoded video signal received from an antenna 702 or a digital video storage medium 704, e.g., a digital video disc (DVD) or a memory card. The processor 701 may communicate with a cache memory 706, which may store instructions and data for the processor operations, and with other devices, for example, an SRAM 708.

[0038] A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, blocks in the flowchart may be skipped or performed out of order and still produce desirable results. Furthermore, the size of the operands and the number of operands operated on per SIMD instruction may vary. Accordingly, other embodiments are within the scope of the following claims.
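The staged 32-bit X 32-bit flow described for Figure 6 can be mirrored arithmetically. The sketch below is an illustrative model with assumed function and variable names, not the disclosed circuitry; it shows the 64-bit result emerging 16 bits at a time, with each stage's carry folded into the next, in the spirit of CLA0 producing bits [15:0] and [31:16] and CLA1 producing bits [63:32]:

```python
# Hypothetical model of the staged 32x32-bit MAC flow. The low pair of
# 16x16 partial products yields result bits [15:0]; the high pair, folded
# in at weight 2**16, yields bits [31:16]; the remaining carry supplies
# bits [63:32].

def staged_mac_32x32(a, b, acc=0):
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16

    v0 = acc + a_lo * b_lo + (a_hi * b_lo << 16)         # first pair of products
    bits_15_0 = v0 & 0xFFFF                              # first 16 result bits

    v1 = (v0 >> 16) + a_lo * b_hi + (a_hi * b_hi << 16)  # second pair, plus carry
    bits_31_16 = v1 & 0xFFFF                             # next 16 result bits

    bits_63_32 = v1 >> 16                                # upper 32 result bits
    return (bits_63_32 << 32) | (bits_31_16 << 16) | bits_15_0

assert staged_mac_32x32(0xDEADBEEF, 0xCAFEBABE, acc=7) == 0xDEADBEEF * 0xCAFEBABE + 7
```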

Claims

1. A method comprising:
performing a first compression operation in a first multiply-accumulate operation in a pipeline;
generating two or more intermediate vectors in the first compression operation in the first multiply-accumulate operation; and
forwarding at least a portion of each of the two or more intermediate vectors to a second multiply-accumulate operation in the pipeline.
2. The method of claim 1, wherein said forwarding at least a portion of each of the two or more intermediate vectors comprises forwarding lower portions of each of the two or more intermediate vectors.
3. The method of claim 1, wherein said performing the first compression operation comprises compressing a first plurality of partial products into a first sum vector and a first carry vector and compressing a second plurality of partial products into a second sum vector and a second carry vector.
4. The method of claim 1, wherein said generating two or more intermediate vectors comprises compressing the first and second sum vectors and the first and second carry vectors into an intermediate sum vector and an intermediate carry vector.
5. The method of claim 1, wherein said forwarding comprises forwarding at least a portion of each of the two or more intermediate vectors to a Wallace tree compression unit.
6. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to:
perform a first compression operation in a first multiply-accumulate operation in a pipeline;
generate two or more intermediate vectors in the first compression operation in the first multiply-accumulate operation; and
forward at least a portion of each of the two or more intermediate vectors to a second multiply-accumulate operation in the pipeline.
7. The article of claim 6, wherein the instructions causing the machine to forward at least a portion of each of the two or more intermediate vectors include instructions causing the machine to forward a lower number of bits of each of the two or more intermediate vectors.
8. The article of claim 6, wherein the instructions causing the machine to perform the first compression operation include instructions causing the machine to compress a first plurality of partial products into a first sum vector and a first carry vector and compress a second plurality of partial products into a second sum vector and a second carry vector.
9. The article of claim 6, wherein the instructions causing the machine to generate two or more intermediate vectors include instructions causing the machine to compress the first and second sum vectors and the first and second carry vectors into an intermediate sum vector and an intermediate carry vector.
10. The article of claim 6, wherein the instructions causing the machine to forward include instructions causing the machine to forward at least a portion of each of the two or more intermediate vectors to a Wallace tree compression unit.
11. A method comprising:
compressing a first plurality of partial products into a first sum vector and a first carry vector and compressing a second plurality of partial products into a second sum vector and a second carry vector in a first Wallace tree compression stage of a multiply-accumulate operation;
compressing the first and second sum vectors and the first and second carry vectors into a first intermediate sum vector and a first intermediate carry vector; and
compressing the intermediate sum vector and a third plurality of partial products and compressing the intermediate carry vector and a fourth plurality of partial products in a second stage of the multiply-accumulate operation.
12. The method of claim 11, wherein the multiply- accumulate operation comprises a single instruction/multiple data (SIMD) operation.
13. The method of claim 11, further comprising:
generating the first plurality of partial products from a first pair of operands;
generating the second plurality of partial products from a second pair of operands;
generating the third plurality of partial products from a third pair of operands; and
generating the fourth plurality of partial products from a fourth pair of operands.
14. The method of claim 11, further comprising forwarding the intermediate sum and carry vectors to a second multiply-accumulate operation in a pipeline.
15. The method of claim 14, wherein said forwarding comprises eliminating an accumulate data dependency in the second multiply-accumulate operation.
16. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to:
compress a first plurality of partial products into a first sum vector and a first carry vector and compress a second plurality of partial products into a second sum vector and a second carry vector in a first Wallace tree compression stage of a multiply-accumulate operation;
compress the first and second sum vectors and the first and second carry vectors into a first intermediate sum vector and a first intermediate carry vector; and
compress the intermediate sum vector and a third plurality of partial products and compress the intermediate carry vector and a fourth plurality of partial products in a second stage of the multiply-accumulate operation.
17. The article of claim 16, wherein the multiply- accumulate operation comprises a single instruction/multiple data (SIMD) operation.
18. The article of claim 16, further comprising instructions causing the machine to:
generate the first plurality of partial products from a first pair of operands;
generate the second plurality of partial products from a second pair of operands;
generate the third plurality of partial products from a third pair of operands; and
generate the fourth plurality of partial products from a fourth pair of operands.
19. The article of claim 16, further comprising instructions causing the machine to forward the intermediate sum and carry vectors to a second multiply-accumulate operation in a pipeline.
20. The article of claim 16, wherein the instructions causing the machine to forward include instructions causing the machine to eliminate an accumulate data dependency in the second multiply-accumulate operation.
21. An apparatus comprising:
first and second Wallace tree compression units operative to compress vectors in first and second stages of a multiply-accumulate operation;
a compressor operative to compress a plurality of vectors output from the first and second Wallace tree units in the first stage of the multiply-accumulate operation into two intermediate vectors; and
a data path from an output of the compressor to an input of a multiplexer, said multiplexer operative to selectively input one of said intermediate vectors to one of said first and second Wallace tree compression units in the second stage of the multiply-accumulate operation.
22. The apparatus of claim 21, further comprising a dual multiply-accumulate unit.
23. The apparatus of claim 21, wherein the plurality of vectors comprise first and second sum vectors and first and second carry vectors.
24. The apparatus of claim 21, wherein the compressor comprises a four-to-two vector compressor.
25. The apparatus of claim 21, wherein the multiplexer comprises a first multiplexer having an output coupled to the first Wallace tree compression unit and a second multiplexer having an output coupled to the second Wallace tree compression unit.
26. A system comprising:
a static random access memory; and
a processor coupled to the static random access memory, said processor comprising a dual multiply-accumulate unit, said unit including first and second Wallace tree compression units operative to compress vectors in first and second stages of a multiply-accumulate operation, a compressor operative to compress a plurality of vectors output from the first and second Wallace tree units in the first stage of the multiply-accumulate operation into two intermediate vectors, and a data path from an output of the compressor to an input of a multiplexer, said multiplexer operative to selectively input one of said intermediate vectors to one of said first and second Wallace tree compression units in the second stage of the multiply-accumulate operation.
27. The system of claim 26, wherein the multiplexer comprises a first multiplexer having an output coupled to the first Wallace tree compression unit and a second multiplexer having an output coupled to the second Wallace tree compression unit.
28. A method comprising: performing a multiply-accumulate operation on first and second 2n-bit operands as four n-bit operations.
29. The method of claim 28, wherein said performing comprises:
generating partial product vectors from the lower n bits of the first operand and the lower n bits of the second operand;
generating partial product vectors from the upper n bits of the first operand and the lower n bits of the second operand;
generating partial product vectors from the upper n bits of the first operand and the upper n bits of the second operand; and
generating partial product vectors from the lower n bits of the first operand and the upper n bits of the second operand.
30. The method of claim 28, further comprising:
compressing the partial products generated from the upper n bits of the first operand and the lower n bits of the second operand into two intermediate vectors; and
shifting the intermediate vectors left by n bits.
31. The method of claim 28, wherein said performing comprises performing the multiply-accumulate operation on a tightly coupled dual n-bit multiply-accumulate unit.
32. The method of claim 28, wherein n equals sixteen.
33. An article comprising a machine-readable medium which stores machine-executable instructions, the instructions causing a machine to: perform a multiply-accumulate operation on first and second 2n-bit operands as four n-bit operations.
34. The article of claim 33, wherein the instructions causing the machine to perform include instructions causing the machine to:
generate partial product vectors from the lower n bits of the first operand and the lower n bits of the second operand;
generate partial product vectors from the upper n bits of the first operand and the lower n bits of the second operand;
generate partial product vectors from the upper n bits of the first operand and the upper n bits of the second operand; and
generate partial product vectors from the lower n bits of the first operand and the upper n bits of the second operand.
35. The article of claim 33, further comprising instructions causing the machine to: compress the partial products generated from the upper n bits of the first operand and the lower n bits of the second operand into two intermediate vectors; and shift the intermediate vectors left by n bits.
36. The article of claim 33, wherein the instructions causing the machine to perform include instructions causing the machine to perform the multiply-accumulate operation on a tightly coupled dual n-bit multiply-accumulate unit.
37. The article of claim 33, wherein n equals sixteen.
PCT/US2002/031412 2001-10-05 2002-10-03 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions WO2003032187A2 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
DE60222163T DE60222163T2 (en) 2001-10-05 2002-10-03 ACCUMULATION (MAC) UNIT FOR SINGLE INSTRUCTION / MULTI-DATA (SIMD) INSTRUCTIONS
JP2003535084A JP4584580B2 (en) 2001-10-05 2002-10-03 Multiply-and-accumulate (MAC) unit for single instruction multiple data (SIMD) instructions
EP02800879A EP1446728B1 (en) 2001-10-05 2002-10-03 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
AU2002334792A AU2002334792A1 (en) 2001-10-05 2002-10-03 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
KR1020047005030A KR100834178B1 (en) 2001-10-05 2002-10-03 Multiply-accumulate mac unit for single-instruction/multiple-data simd instructions
HK04106791A HK1065127A1 (en) 2001-10-05 2004-09-07 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/972,720 2001-10-05
US09/972,720 US7107305B2 (en) 2001-10-05 2001-10-05 Multiply-accumulate (MAC) unit for single-instruction/multiple-data (SIMD) instructions

Publications (2)

Publication Number Publication Date
WO2003032187A2 true WO2003032187A2 (en) 2003-04-17
WO2003032187A3 WO2003032187A3 (en) 2004-06-10

Family

ID=25520040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/031412 WO2003032187A2 (en) 2001-10-05 2002-10-03 Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions

Country Status (11)

Country Link
US (1) US7107305B2 (en)
EP (1) EP1446728B1 (en)
JP (2) JP4584580B2 (en)
KR (1) KR100834178B1 (en)
CN (1) CN100474235C (en)
AT (1) ATE371893T1 (en)
AU (1) AU2002334792A1 (en)
DE (1) DE60222163T2 (en)
HK (1) HK1065127A1 (en)
TW (1) TWI242742B (en)
WO (1) WO2003032187A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005122141A (en) * 2003-10-15 2005-05-12 Microsoft Corp Utilizing simd instruction within montgomery multiplication
JP2005235004A (en) * 2004-02-20 2005-09-02 Altera Corp Multiplier-accumulator block mode splitting
WO2011028723A2 (en) * 2009-09-03 2011-03-10 Azuray Technologies, Inc. Digital signal processing systems
JP2011054012A (en) * 2009-09-03 2011-03-17 Nec Computertechno Ltd Product-sum computation device and control method of the same

Families Citing this family (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6581003B1 (en) * 2001-12-20 2003-06-17 Garmin Ltd. Systems and methods for a navigational device with forced layer switching based on memory constraints
US7353244B2 (en) * 2004-04-16 2008-04-01 Marvell International Ltd. Dual-multiply-accumulator operation optimized for even and odd multisample calculations
US20060004903A1 (en) * 2004-06-30 2006-01-05 Itay Admon CSA tree constellation
US8856201B1 (en) 2004-11-10 2014-10-07 Altera Corporation Mixed-mode multiplier using hard and soft logic circuitry
EP1710691A1 (en) * 2005-04-07 2006-10-11 STMicroelectronics (Research & Development) Limited MAC/MUL unit
US8620980B1 (en) 2005-09-27 2013-12-31 Altera Corporation Programmable device with specialized multiplier blocks
TWI361379B (en) * 2006-02-06 2012-04-01 Via Tech Inc Dual mode floating point multiply accumulate unit
US8266199B2 (en) * 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US8301681B1 (en) 2006-02-09 2012-10-30 Altera Corporation Specialized processing block for programmable logic device
US8266198B2 (en) 2006-02-09 2012-09-11 Altera Corporation Specialized processing block for programmable logic device
US8041759B1 (en) * 2006-02-09 2011-10-18 Altera Corporation Specialized processing block for programmable logic device
US7836117B1 (en) 2006-04-07 2010-11-16 Altera Corporation Specialized processing block for programmable logic device
US7822799B1 (en) 2006-06-26 2010-10-26 Altera Corporation Adder-rounder circuitry for specialized processing block in programmable logic device
US7783862B2 (en) * 2006-08-07 2010-08-24 International Characters, Inc. Method and apparatus for an inductive doubling architecture
US8386550B1 (en) 2006-09-20 2013-02-26 Altera Corporation Method for configuring a finite impulse response filter in a programmable logic device
US8122078B2 (en) * 2006-10-06 2012-02-21 Calos Fund, LLC Processor with enhanced combined-arithmetic capability
US8386553B1 (en) 2006-12-05 2013-02-26 Altera Corporation Large multiplier for programmable logic device
US7930336B2 (en) 2006-12-05 2011-04-19 Altera Corporation Large multiplier for programmable logic device
US20080140753A1 (en) * 2006-12-08 2008-06-12 Vinodh Gopal Multiplier
US7814137B1 (en) 2007-01-09 2010-10-12 Altera Corporation Combined interpolation and decimation filter for programmable logic device
US7865541B1 (en) 2007-01-22 2011-01-04 Altera Corporation Configuring floating point operations in a programmable logic device
US8650231B1 (en) 2007-01-22 2014-02-11 Altera Corporation Configuring floating point operations in a programmable device
US8645450B1 (en) 2007-03-02 2014-02-04 Altera Corporation Multiplier-accumulator circuitry and methods
US7949699B1 (en) 2007-08-30 2011-05-24 Altera Corporation Implementation of decimation filter in integrated circuit device using ram-based data storage
US8959137B1 (en) 2008-02-20 2015-02-17 Altera Corporation Implementing large multipliers in a programmable integrated circuit device
US8307023B1 (en) 2008-10-10 2012-11-06 Altera Corporation DSP block for implementing large multiplier on a programmable integrated circuit device
US8468192B1 (en) 2009-03-03 2013-06-18 Altera Corporation Implementing multipliers in a programmable integrated circuit device
US8706790B1 (en) 2009-03-03 2014-04-22 Altera Corporation Implementing mixed-precision floating-point operations in a programmable integrated circuit device
US8645449B1 (en) 2009-03-03 2014-02-04 Altera Corporation Combined floating point adder and subtractor
US8650236B1 (en) 2009-08-04 2014-02-11 Altera Corporation High-rate interpolation or decimation filter in integrated circuit device
US8412756B1 (en) 2009-09-11 2013-04-02 Altera Corporation Multi-operand floating point operations in a programmable integrated circuit device
US8396914B1 (en) 2009-09-11 2013-03-12 Altera Corporation Matrix decomposition in an integrated circuit device
US8996845B2 (en) * 2009-12-22 2015-03-31 Intel Corporation Vector compare-and-exchange operation
US7948267B1 (en) 2010-02-09 2011-05-24 Altera Corporation Efficient rounding circuits and methods in configurable integrated circuit devices
US8539016B1 (en) 2010-02-09 2013-09-17 Altera Corporation QR decomposition in an integrated circuit device
US8601044B2 (en) 2010-03-02 2013-12-03 Altera Corporation Discrete Fourier Transform in an integrated circuit device
US8484265B1 (en) 2010-03-04 2013-07-09 Altera Corporation Angular range reduction in an integrated circuit device
US8510354B1 (en) 2010-03-12 2013-08-13 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8539014B2 (en) 2010-03-25 2013-09-17 Altera Corporation Solving linear matrices in an integrated circuit device
US8589463B2 (en) 2010-06-25 2013-11-19 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8862650B2 (en) 2010-06-25 2014-10-14 Altera Corporation Calculation of trigonometric functions in an integrated circuit device
US8577951B1 (en) 2010-08-19 2013-11-05 Altera Corporation Matrix operations in an integrated circuit device
US8478969B2 (en) 2010-09-24 2013-07-02 Intel Corporation Performing a multiply-multiply-accumulate instruction
US8645451B2 (en) 2011-03-10 2014-02-04 Altera Corporation Double-clocked specialized processing block in an integrated circuit device
US9600278B1 (en) 2011-05-09 2017-03-21 Altera Corporation Programmable device using fixed and configurable logic to implement recursive trees
US8812576B1 (en) 2011-09-12 2014-08-19 Altera Corporation QR decomposition in an integrated circuit device
US9053045B1 (en) 2011-09-16 2015-06-09 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US8949298B1 (en) 2011-09-16 2015-02-03 Altera Corporation Computing floating-point polynomials in an integrated circuit device
US8762443B1 (en) 2011-11-15 2014-06-24 Altera Corporation Matrix operations in an integrated circuit device
US8868634B2 (en) * 2011-12-02 2014-10-21 Advanced Micro Devices, Inc. Method and apparatus for performing multiplication in a processor
CN102520906A (en) * 2011-12-13 2012-06-27 中国科学院自动化研究所 Vector dot product accumulating network supporting reconfigurable fixed floating point and configurable vector length
CN107368286B (en) * 2011-12-19 2020-11-06 英特尔公司 SIMD integer multiply-accumulate instruction for multi-precision arithmetic
US8543634B1 (en) 2012-03-30 2013-09-24 Altera Corporation Specialized processing block for programmable integrated circuit device
US9098332B1 (en) 2012-06-01 2015-08-04 Altera Corporation Specialized processing block with fixed- and floating-point structures
US8996600B1 (en) 2012-08-03 2015-03-31 Altera Corporation Specialized processing block for implementing floating-point multiplier with subnormal operation support
US9207909B1 (en) 2012-11-26 2015-12-08 Altera Corporation Polynomial calculations optimized for programmable integrated circuit device structures
US9275014B2 (en) * 2013-03-13 2016-03-01 Qualcomm Incorporated Vector processing engines having programmable data path configurations for providing multi-mode radix-2x butterfly vector processing circuits, and related vector processors, systems, and methods
US9189200B1 (en) 2013-03-14 2015-11-17 Altera Corporation Multiple-precision processing block in a programmable integrated circuit device
US9348795B1 (en) 2013-07-03 2016-05-24 Altera Corporation Programmable device using fixed and configurable logic to implement floating-point rounding
US9684488B2 (en) 2015-03-26 2017-06-20 Altera Corporation Combined adder and pre-adder for high-radix multiplier circuit
US10489155B2 (en) 2015-07-21 2019-11-26 Qualcomm Incorporated Mixed-width SIMD operations using even/odd register pairs for wide data elements
CN107977192A (en) * 2016-10-21 2018-05-01 超威半导体公司 For performing the method and system of low-power and the more accuracy computations of low delay
US10942706B2 (en) 2017-05-05 2021-03-09 Intel Corporation Implementation of floating-point trigonometric functions in an integrated circuit device
KR102408858B1 (en) * 2017-12-19 2022-06-14 삼성전자주식회사 A nonvolatile memory device, a memory system including the same and a method of operating a nonvolatile memory device
US11409525B2 (en) * 2018-01-24 2022-08-09 Intel Corporation Apparatus and method for vector multiply and accumulate of packed words
WO2020046642A1 (en) 2018-08-31 2020-03-05 Flex Logix Technologies, Inc. Multiplier-accumulator circuit, logic tile architecture for multiply-accumulate and ic including logic tile array
US11194585B2 (en) 2019-03-25 2021-12-07 Flex Logix Technologies, Inc. Multiplier-accumulator circuitry having processing pipelines and methods of operating same
US11314504B2 (en) 2019-04-09 2022-04-26 Flex Logix Technologies, Inc. Multiplier-accumulator processing pipelines and processing component, and methods of operating same
US11288076B2 (en) 2019-09-13 2022-03-29 Flex Logix Technologies, Inc. IC including logic tile, having reconfigurable MAC pipeline, and reconfigurable memory
US11455368B2 (en) 2019-10-02 2022-09-27 Flex Logix Technologies, Inc. MAC processing pipeline having conversion circuitry, and methods of operating same
US11693625B2 (en) 2019-12-04 2023-07-04 Flex Logix Technologies, Inc. Logarithmic addition-accumulator circuitry, processing pipeline including same, and methods of operation
US11442881B2 (en) 2020-04-18 2022-09-13 Flex Logix Technologies, Inc. MAC processing pipelines, circuitry to control and configure same, and methods of operating same
US11604645B2 (en) 2020-07-22 2023-03-14 Flex Logix Technologies, Inc. MAC processing pipelines having programmable granularity, and methods of operating same

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5847981A (en) * 1997-09-04 1998-12-08 Motorola, Inc. Multiply and accumulate circuit
WO2001048595A1 (en) * 1999-12-23 2001-07-05 Intel Corporation Processing multiply-accumulate operations in a single cycle

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3435744B2 (en) * 1993-09-09 2003-08-11 富士通株式会社 Multiplication circuit
US7395298B2 (en) * 1995-08-31 2008-07-01 Intel Corporation Method and apparatus for performing multiply-add operations on packed data
US6385634B1 (en) * 1995-08-31 2002-05-07 Intel Corporation Method for performing multiply-add operations on packed data
US5777679A (en) * 1996-03-15 1998-07-07 International Business Machines Corporation Video decoder including polyphase fir horizontal filter
JPH10207863A (en) * 1997-01-21 1998-08-07 Toshiba Corp Arithmetic processor
CN1109990C (en) * 1998-01-21 2003-05-28 松下电器产业株式会社 Method and apparatus for arithmetic operation
JP2000081966A (en) * 1998-07-09 2000-03-21 Matsushita Electric Ind Co Ltd Arithmetic unit
US6571268B1 (en) * 1998-10-06 2003-05-27 Texas Instruments Incorporated Multiplier accumulator circuits
US6542915B1 (en) * 1999-06-17 2003-04-01 International Business Machines Corporation Floating point pipeline with a leading zeros anticipator circuit
US6532485B1 (en) * 1999-09-08 2003-03-11 Sun Microsystems, Inc. Method and apparatus for performing multiplication/addition operations
US6574651B1 (en) * 1999-10-01 2003-06-03 Hitachi, Ltd. Method and apparatus for arithmetic operation on vectored data
US6922716B2 (en) * 2001-07-13 2005-07-26 Motorola, Inc. Method and apparatus for vector processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5847981A (en) * 1997-09-04 1998-12-08 Motorola, Inc. Multiply and accumulate circuit
WO2001048595A1 (en) * 1999-12-23 2001-07-05 Intel Corporation Processing multiply-accumulate operations in a single cycle

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ACKLAND B ET AL: "A SINGLE-CHIP, 1.6-BILLION, 16-B MAC/S MULTIPROCESSOR DSP" IEEE JOURNAL OF SOLID-STATE CIRCUITS, IEEE INC. NEW YORK, US, vol. 35, no. 3, March 2000 (2000-03), pages 412-422, XP000956951 ISSN: 0018-9200 *
ALIDINA M ET AL: "DSP16000: a high performance, low-power dual-MAC DSP core for communications applications" CUSTOM INTEGRATED CIRCUITS CONFERENCE, 1998. PROCEEDINGS OF THE IEEE 1998 SANTA CLARA, CA, USA 11-14 MAY 1998, NEW YORK, NY, USA, IEEE, US, 11 May 1998 (1998-05-11), pages 119-122, XP010293968 ISBN: 0-7803-4292-5 *
See also references of EP1446728A2 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005122141A (en) * 2003-10-15 2005-05-12 Microsoft Corp Utilizing simd instruction within montgomery multiplication
JP2005235004A (en) * 2004-02-20 2005-09-02 Altera Corp Multiplier-accumulator block mode splitting
WO2011028723A2 (en) * 2009-09-03 2011-03-10 Azuray Technologies, Inc. Digital signal processing systems
JP2011054012A (en) * 2009-09-03 2011-03-17 Nec Computertechno Ltd Product-sum computation device and control method of the same
WO2011028723A3 (en) * 2009-09-03 2011-09-29 Azuray Technologies, Inc. Digital signal processing systems

Also Published As

Publication number Publication date
KR100834178B1 (en) 2008-05-30
JP2008217805A (en) 2008-09-18
DE60222163D1 (en) 2007-10-11
KR20040048937A (en) 2004-06-10
DE60222163T2 (en) 2008-06-12
JP4584580B2 (en) 2010-11-24
TWI242742B (en) 2005-11-01
US20030069913A1 (en) 2003-04-10
US7107305B2 (en) 2006-09-12
EP1446728B1 (en) 2007-08-29
ATE371893T1 (en) 2007-09-15
CN100474235C (en) 2009-04-01
JP4555356B2 (en) 2010-09-29
EP1446728A2 (en) 2004-08-18
WO2003032187A3 (en) 2004-06-10
CN1633637A (en) 2005-06-29
JP2005532601A (en) 2005-10-27
AU2002334792A1 (en) 2003-04-22
HK1065127A1 (en) 2005-02-08

Similar Documents

Publication Publication Date Title
EP1446728B1 (en) Multiply-accumulate (mac) unit for single-instruction/multiple-data (simd) instructions
US6611856B1 (en) Processing multiply-accumulate operations in a single cycle
US6353843B1 (en) High performance universal multiplier circuit
JP5273866B2 (en) Multiplier / accumulator unit
EP1576493B1 (en) Method, device and system for performing calculation operations
US8074058B2 (en) Providing extended precision in SIMD vector arithmetic operations
JP4064989B2 (en) Device for performing multiplication and addition of packed data
US6282556B1 (en) High performance pipelined data path for a media processor
US6609143B1 (en) Method and apparatus for arithmetic operation
US6324638B1 (en) Processor having vector processing capability and method for executing a vector instruction in a processor
US7519646B2 (en) Reconfigurable SIMD vector processing system
US6446193B1 (en) Method and apparatus for single cycle processing of data associated with separate accumulators in a dual multiply-accumulate architecture
US10929101B2 (en) Processor with efficient arithmetic units
Kumar et al. VLSI architecture of pipelined Booth Wallace MAC unit
Quan et al. A novel vector/SIMD multiply-accumulate unit based on reconfigurable Booth array
US20090031117A1 (en) Same instruction different operation (sido) computer with short instruction and provision of sending instruction code through data
Brunelli et al. A flexible multiplier for media processing
Sangireddy et al. On-chip adaptive circuits for fast media processing
Farooqui et al. RECONFIGURABLE MULTIMEDIA DATAPATH FOR LOW COST MEDIA PROCESSORS

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG UZ VN YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2003535084

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 20028196473

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 1020047005030

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2002800879

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2002800879

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 2002800879

Country of ref document: EP