Publication number | US20020002573 A1 |

Publication type | Application |

Application number | US 09/796,415 |

Publication date | Jan 3, 2002 |

Filing date | Mar 1, 2001 |

Priority date | Jan 22, 1996 |

Also published as | US6247036, WO1998032071A2, WO1998032071A3 |


Inventors | George Landers, Earle Jennings, Tim Smith, Glen Haas |

Original Assignee | Infinite Technology Corporation |


Abstract

A reconfigurable processor includes at least three (3) MacroSequencers (**10**)-(**16**) which are configured in an array. Each of the MacroSequencers is operable to receive on a separate one of four buses (**18**) an input from the other three MacroSequencers and from itself in a feedback manner. In addition, a control bus (**20**) is operable to provide control signals to all of the MacroSequencers for the purpose of controlling the instruction sequence associated therewith and also for inputting instructions thereto. Each of the MacroSequencers includes a plurality of execution units having inputs and outputs, each providing an associated execution algorithm. The outputs of the execution units are input to an output selector which selects among them for output on at least one external output and on at least one feedback path. An input selector (**66**) is provided having an input for receiving at least one external input and at least the feedback path. These are selected between for input to selected ones of the execution units. An instruction memory (**48**) contains an instruction word that is operable to control the configuration of the datapath through the execution units for a given instruction cycle. This instruction word can be retrieved from the instruction memory (**48**), and the stored instructions therein sequenced through to change the configuration of the datapath for subsequent instruction cycles.

Claims (6)

generating partial product signals from a plurality of arithmetic data signals representing mantissas of numbers to be multiplied; adding the partial product signals using a multiple-level adder tree to generate a product signal representing the product of the arithmetic data signals at an output level of the adder tree; accumulating in first pipeline registers intermediate level signals output from one level of the adder tree for input to a subsequent level of the adder tree; wherein a first pipeline operation comprising generating said partial product signals and accumulating said intermediate level signals in said first pipeline registers is carried out in one clock cycle;

accumulating in second pipeline registers output signals from a further adder comprising local carry propagate adder cells; selectively feeding back to an input of said further adder signals representing a constant or the contents of at least some of said second pipeline registers; and supplying said product signal as another input to said further adder; wherein said inputs to said further adder are aligned with the precision components of an output signal from said further adder stored by said second pipeline registers; and wherein the signal alignment, storage of said output signal from said further adder in said second pipeline registers, and said selective feedback are effected during a single clock cycle subsequent to said one clock cycle.

Description

- [0001]This application claims priority in Provisional Application Serial No. 60/010317, filed Jan. 22, 1996.
- [0002]The present invention pertains in general to digital processors and, more particularly, to a digital processor that has a plurality of execution units that are reconfigurable and which utilizes a multiplier-accumulator that is synchronous.
- [0003]Digital signal processors have seen increased use in recent years. This is due to the fact that the processing technology has advanced to an extent that large fast processors can be manufactured. The speed of these processors allows a large number of computations to be made, such that very complex algorithms can be executed in very short periods of time. One use for these digital signal processors is in real-time applications wherein data is received on an input, the algorithm of the transfer function computed and an output generated in what is virtually real-time.
- [0004]When digital signal processors are fabricated, they are typically manufactured to provide a specific computational algorithm and its associated data path. For example, in digital filters, a Finite Impulse Response (FIR) filter is typically utilized and realized with a Digital Signal Processor (DSP). Typically, a set of coefficients is stored in a RAM and then a multiplier/accumulator circuit is provided that is operable to process the various coefficients and data in a multi-tap configuration. However, the disadvantage to this type of application is that the DSP is “customized” for each particular application. The reason for this is that a particular algorithm requires a particular sequence of computations. For example, in digital filters, there is typically a multiplication followed by an accumulation operation. Other algorithms may require additional multiplications or additional operations and even some shift operations in order to realize the entire function. This therefore requires a different data path configuration. At present, reconfigurable DSPs have not become a reality, and existing DSPs have not provided the versatility needed to cover a wide range of applications.
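As a concrete illustration of the multiply-accumulate pattern described above (a sketch only; the function and names here are hypothetical and not part of the disclosure), an FIR filter reduces to one multiply followed by one accumulation per tap:

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0  # accumulator cleared for each output sample
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]  # one multiply-accumulate per tap
        out.append(acc)
    return out
```

An algorithm that instead needed an extra shift or a second multiply per step would require a different datapath, which is the motivation for a reconfigurable design.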
- [0005]The present invention disclosed and claimed herein comprises a reconfigurable processing unit. The reconfigurable unit includes a plurality of execution units, each having at least one input and at least one output. The execution units operate in parallel with each other, with each having a predetermined executable algorithm associated therewith. An output selector is provided for selecting one or more of the at least one outputs of the plurality of execution units, and providing at least one output to an external location and at least one feedback path. An input selector is provided for receiving at least one external input and the feedback path. It is operable to interface to at least one of the at least one inputs of each of the execution units, and is further operable to selectively connect one or both of the at least one external input and the feedback path to select ones of the at least one inputs of the execution units. A reconfiguration register is provided for storing a reconfiguration instruction. This is utilized by a configuration controller for configuring the output selector and the input selector in accordance with the reconfiguration instruction to define a data path configuration through the execution units in a given instruction cycle.
- [0006]In another embodiment of the present invention, an input device is provided for inputting a new reconfiguration instruction into the reconfiguration register for a subsequent instruction cycle. The configuration controller is operable to reconfigure the data path through the configured execution units for the subsequent instruction cycle. An instruction memory is provided for storing a plurality of reconfiguration instructions, and a sequencer is provided for outputting the stored reconfiguration instructions to the reconfiguration register in subsequent instruction cycles in accordance with a predetermined execution sequence.
- [0007]In yet another aspect of the present invention, at least one of the execution units has multiple configurable data paths therethrough with the execution algorithm of the one execution unit being reconfigurable in accordance with the contents of the instruction register to select between one of said multiple data paths therein. This allows the operation of each of said execution units to be programmable in accordance with the contents of the reconfiguration register such that the configuration controller will configure both the data path through and the executable algorithm associated with the one execution unit.
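The scheme of paragraphs [0005]-[0007] can be pictured with a toy software model (all names and the instruction encoding here are invented for illustration, not taken from the disclosure): each cycle, a reconfiguration instruction selects an execution unit and tells the input selector whether to take the external input or the feedback path:

```python
# Hypothetical execution units, each a fixed algorithm on 16-bit data.
UNITS = {
    "add": lambda a, b: (a + b) & 0xFFFF,
    "and": lambda a, b: a & b,
    "shl": lambda a, b: (a << (b & 0xF)) & 0xFFFF,
}

def run_sequence(instructions, external_in):
    """Step through reconfiguration instructions, one per simulated cycle."""
    feedback = 0
    for unit, use_feedback in instructions:
        a = feedback if use_feedback else external_in  # input selector
        feedback = UNITS[unit](a, external_in)         # output -> feedback path
    return feedback
```

Sequencing a different list of instructions re-wires the same units into a different datapath, which is the essence of the claimed reconfigurability.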
- [0008]For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying Drawings in which:
- [0009]FIG. 1 illustrates a data flow diagram of a reconfigurable arithmetic data path processor in accordance with the present invention;
- [0010]FIG. 2 illustrates a top level block diagram of the MacroSequencer;
- [0011]FIG. 3 illustrates a more detailed block diagram of the MacroSequencer;
- [0012]FIG. 4 illustrates a logic diagram of the input register;
- [0013]FIG. 5 illustrates a logic diagram of the input selector;
- [0014]FIG. 6 illustrates a block diagram of the multiplier-accumulator;
- [0015]FIG. 7 illustrates a logic diagram of the adder;
- [0016]FIG. 8 illustrates a block diagram of the shifter;
- [0017]FIG. 9 illustrates a block diagram of the logic unit;
- [0018]FIG. 10 illustrates a block diagram of the one port memory;
- [0019]FIG. 11 illustrates a block diagram of the three port memory;
- [0020]FIG. 12 illustrates a diagram of the 3-port index pointers;
- [0021]FIG. 13 illustrates a logic diagram of the output selector;
- [0022]FIG. 14 illustrates a logic diagram of the I/O interface;
- [0023]FIG. 15 illustrates a block diagram of the MacroSequencer data path controller;
- [0024]FIG. 16 illustrates a block diagram of the dual PLA;
- [0025]FIG. 17 illustrates a block diagram of a basic multiplier;
- [0026]FIG. 18 illustrates an alternate embodiment of the MAC;
- [0027]FIG. 19 illustrates an embodiment of the MAC which is optimized for polynomial calculations;
- [0028]FIG. 20 illustrates an embodiment of the MAC in which an additional four numbers are generated in the multiplier block;
- [0029]FIG. 21 illustrates a basic multiplier-accumulator;
- [0030]FIG. 22 illustrates an extended circuit which supports optimal polynomial calculation steps;
- [0031]FIG. 23 illustrates a block diagram of a multiplier block with minimal support circuitry;
- [0032]FIG. 24 illustrates a block diagram of a multiplier-accumulator with Basic Core of Adder, one-port and three-port Memories; and
- [0033]FIG. 25 illustrates a block diagram of a Multiplier-Accumulator with Multiplicity of Adders, and one-port and three-port Memories.
- [0034]Referring now to FIG. 1, there is illustrated a block diagram of the Reconfigurable Arithmetic Datapath Processor (RADP) of the present invention. The RADP is comprised of four (4) MacroSequencers,
**10**,**12**,**14**and**16**, respectively. MacroSequencers**10**and**12**comprise one (1) pair and MacroSequencers**14**and**16**comprise a second pair. Each of the MacroSequencers has associated therewith one of four Buses**18**, labeled Bus**0**, Bus**1**, Bus**2**and Bus**3**, respectively. Bus**0**is associated with MacroSequencer**10**, Bus**1**with MacroSequencer**12**, Bus**2**with MacroSequencer**14**and Bus**3**with MacroSequencer**16**. These are global 16-bit buses. There is also provided a control bus**20**, which is a 32-bit bus with 8-bits each associated with the MacroSequencers**10**-**16**. Each MacroSequencer also has associated therewith an I/O bus**22**, each Bus**22**comprising 16 I/O lines to allow each of the MacroSequencers**10**-**16**to interface with 64 I/O pins. Additionally, there is provided a 16-bit input bus**24**which interfaces with each of the MacroSequencers**10**-**16**to allow input of information thereto. A dual PLA**26**is provided which has associated therewith built-in periphery logic to control information to the bi-directional control bus**20**. The PLA**26**interfaces with a control bus**20**through a 12-bit bus**28**, with an external 20-bit control bus**30**interfacing with the control bus**20**and also with PLA**26**through an 8-bit control bus**32**. - [0035]Each of the MacroSequencers
**10**-**16**is a 16-bit fixed-point processor that can be individually initiated either by utilizing the dual PLA**26**or directly from the control bus**20**. The bus**18**allows data to be shared between the MacroSequencers**10**-**16**according to various design needs. By providing the buses**18**, a 16-bit data path is provided, thus increasing data throughput between MacroSequencers. Additionally, each pair of MacroSequencers**10**and**12**or**14**and**16**is interconnected by two (2) private 16-bit buses**34**, 16-bits in each direction. These private buses**34**allow each pair of MacroSequencers to be paired together for additional data sharing. - [0036]Each MacroSequencer is designed with a Long Instruction Word (LIW) architecture enabling multiple operations per clock cycle. Independent operation fields in the LIW control the MacroSequencer's data memories, 16-bit adder, multiplier-accumulator, logic unit, shifter, and I/O registers so they may be used simultaneously with branch control. The pipe-lined architecture allows up to seven operations of the execution units during each cycle.
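One way to picture the LIW's independent operation fields (the field names below are invented for this sketch; the actual instruction format is not reproduced here) is as a record whose enabled fields all take effect in the same cycle:

```python
from dataclasses import dataclass

@dataclass
class LIW:
    add_op: str = "nop"    # adder operation field
    logic_op: str = "nop"  # logic-unit operation field
    shift: int = 0         # shifter field; 0 means no shift

def execute(liw, a, b):
    """Every enabled field takes effect in the same simulated clock cycle."""
    results = {}
    if liw.add_op == "add":
        results["adder"] = (a + b) & 0xFFFF   # 16-bit result
    if liw.logic_op == "and":
        results["logic"] = a & b
    if liw.shift:
        results["shifter"] = (a << liw.shift) & 0xFFFF
    return results
```

A single word thus keeps several units busy at once, which is what allows multiple operations per clock cycle.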
- [0037]The LIW architecture optimizes performance allowing algorithms to be implemented with a small number of long instruction words. Each Macro-Sequencer may be configured to operate independently, or can be paired for some 32-bit arithmetic operations.
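The 32-bit pairing can be understood as chaining two 16-bit additions, with the low half's carry-out feeding the high half; a simplified sketch (not the actual interconnect):

```python
def add32_paired(a, b):
    """32-bit addition built from two 16-bit adds, as a paired unit might do."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    carry = lo >> 16                               # carry passed to the partner
    hi = ((a >> 16) + (b >> 16) + carry) & 0xFFFF  # high half plus carry-in
    return (hi << 16) | (lo & 0xFFFF)
```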
- [0038]The Dual PLA
**26**may be used for initiating stream processes, output enable signal generation, and interface glue logic. The eight I/O pins**36**can be configured individually as input only or output only pins. These can be used for external interface control. Process initiation and response may be provided externally via input pins**38**directly to the MacroSequencers or it may be provided by the programmable PLA via the control bus**20**. The RADP operates in either a configuration operating mode or a normal mode. The configuration mode is used for initializing or reconfiguring the RADP and the normal mode is used for executing algorithms. - [0039]The MacroSequencers may be used individually for 16-bit operations or in pairs for standard 32-bit addition, subtraction, and logic operations. When pairing, the MacroSequencers are not interchangeable. MacroSequencers
**10**and**12**form one pair, and MacroSequencers**14**and**16**form the other pair. The least significant sixteen bits are processed by MacroSequencers**10**and**12**. The two buses**34**are available to the MacroSequencer pairs for direct interchange of data. - [0040]The five global data buses consisting of data buses
**18**and input data bus**24**can be simultaneously accessed by all of the MacroSequencers. Four of the buses**18**, bus**0**, bus**1**, bus**2**, and bus**3**, are associated with MacroSequencers**10**,**12**,**14**, and**16**, respectively. These four buses receive data from either the MacroSequencer I/O pins**22**or an output register (not shown) in the MacroSequencer. The fifth bus, bus**4**, always receives data from BUS**4**IN[**15**:**0**] pins. - [0041]The Control Bus
**20**is used to communicate control, status, and output enable information between the MacroSequencer and the PLA**26**or external MacroSequencer pins. There are six signals associated with each MacroSequencer. Two control signals sent to the MacroSequencer are described hereinbelow with reference to a MacroSequencer Datapath Controller and are used to: - [0042]Initiate one of two available LIW sequences,
- [0043]Continue execution of the LIW sequence, or
- [0044]Acknowledge the MacroSequencer status flags by resetting the send and await state bits.
- [0045]Two status signals, Await and Send, are sent from the MacroSequencer which are described in more detail with respect to the MacroSequencer Datapath Controller hereinbelow and indicate:
- [0046]the Program Counter is sequencing;
- [0047]the MacroSequencer is in the send state and it has executed a specific LIW;
- [0048]the Program Counter is continuing to sequence;
- [0049]the MacroSequencer is in the await state and it has executed a specific LIW; and
- [0050]the Program Counter is not continuing to sequence, and it is awaiting further commands before resuming.
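The Send/Await behavior above can be summarized as two status bits that are set by specific LIWs and cleared by an acknowledge control signal; a minimal sketch (logic greatly simplified, state names taken from the text):

```python
class SequencerStatus:
    """Models the Send and Await status bits of one MacroSequencer."""
    def __init__(self):
        self.send = False
        self.awaiting = False  # 'await' is a reserved word in Python

    def execute_send_liw(self):
        self.send = True       # Program Counter continues to sequence

    def execute_await_liw(self):
        self.awaiting = True   # Program Counter stops until acknowledged

    def acknowledge(self):
        self.send = self.awaiting = False  # control signal resets both bits
```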
- [0051]Two output enable signals for each MacroSequencer are described with reference to an Output Selection operation described hereinbelow and allow for output enable to be:
- [0052]from the Dual PLA
**26**oepla outputs or from MacroSequencer(n) output enable MSnOE pins; - [0053]always output;
- [0054]Always input (the power up condition); or
- [0055]Optionally inverted.
- [0056]Five input clocks are provided to allow the RADP to process multiple data streams at different transmission speeds. There is one clock for each Macro-Sequencer, and a separate clock for the PLA
**26**. Each MacroSequencer can operate on separate data paths at different rates. The clock signals can be connected for synchronization between the four MacroSequencers**10**-**16**and the Dual PLA**26**. - [0057]Referring now to FIG. 2, there is illustrated an overall block diagram of each of the MacroSequencers
**10**-**16**. The MacroSequencer generally is comprised of two (2) functional blocks, an arithmetic datapath block**40**and a datapath controller block**42**. The arithmetic datapath block**40**includes a three (3) port memory**43**and one port memory**44**, in addition to various execution blocks contained therein (not shown). The execution blocks are defined as the arithmetic datapath, represented by block**46**. The three port memory**43**and a one port memory**44**are accessed by the arithmetic datapath**46**. The datapath controller**42**includes an instruction memory**48**. The three port memory**43**, the one port memory**44**and the instruction memory**48**are all loaded during an Active Configuration Mode. The arithmetic datapath**40**receives input from the data-in bus**24**and provides an interface through the interface buses**18**and also through the dedicated pair of interface buses**34**. Control signals are received on 6-bits of the control bus**20**through control signal bus**50**with status signals provided by 2-bits of the control bus**20**through status signal lines**52**. - [0058]The control signals may initiate one of two programmed LIW sequences in instruction memory
**48**in normal operating mode. Once a sequence begins, it will run or loop indefinitely until stopped by the control signals. An await state programmed into the LIW sequence will stop the Program Counter from continuing to increment. The LIW sequences are a combination of data steering, data processing, and branching operations. Each MacroSequencer may execute a combination of branch, memory access, logic, shift, add, subtract, multiply-accumulate, and input/output operations on each clock cycle. The instruction memory can be reloaded dynamically at any time by transitioning to Active Configuration Mode, which will also initialize all registers in the entire device. - [0059]Referring now to FIG. 3, there is illustrated a block diagram of the MacroSequencer datapath for MacroSequencers
**10**-**16**. The databus**18**and databus**24**are input to input register**60**, which also receives a constant as a value. There are two (2) registers in the input registers**60**, an input register A and input register B. The output of the input register A is output on the line**62**and the output of the input register B is output on the line**64**. The contents of input registers A and B on lines**62**and**64**are input to an input selector block**66**. As will be described hereinbelow, the input selector is operable to provide a central portion of a pipeline structure where data is processed through six stages. - [0060]There are nine (9) basic elements in the MacroSequencer Arithmetic Datapath. Six (6) of these are data processing elements and three (3) are data steering functions, of which the input selector
**66**is one of the data steering functions. The data processing elements include a multiplier-accumulator (MAC)**68**, an adder**70**, a logic unit**72**and a shifter**74**. The three port memory**43**and the one port memory**44**also comprise the data processing elements. The data steering functions, in addition to the input selector**66**, also include the input register block**60**and an output register block**76**. - [0061]The input register block
**60**, as noted above, can capture any two (2) inputs thereto. Input selector**66**is operable to receive the two lines**62**and**64**, as noted above, and also receive two (2) outputs on two (2) lines**78**from the output of the three port memory**43**and one (1) output line**80**from the one port memory**44**. It also receives on a line**82**an output from the output register block**76**which is from a register A. The output of the register B, also output from the output register block**76**, is output on a line**84**to the input selector. In addition, a value of “0” is input to the input selector block**66**. The input selector block**66**is operable to select any three operands for data processing elements. These are provided on three buses, a bus**86**, a bus**88**, and a bus**90**. A bus**86**is input to the MAC**68**, the adder**70**and the logic unit**72**, with bus**88**input to the MAC**68**, adder**70**and logic unit**72**. The Bus**90**is input only to a shifter**74**. The MAC**68**also receives as an input the output of the register B on a line**92**and the output of the one port memory**44**. The output of MAC**68**comprises another input of the adder**70**, the output of the adder**70**input to the output selector block**76**. The logic unit**72**has an output that is connected to the output selector**76**, as well as a shifter**74**having an output to the output selector block**76**. The output selector block**76**also receives as an input the output from register B in the input register block**60**. The output of register A is connected to one of the MacroSequencer pair buses**34**, whereas the output of register B is output to the input of an interface block**96**which is connected to one of the four data buses**18**and the I/O bus**22**. The I/O bus**22**also comprises an input to the output selector**76**.
Therefore, the output selector/register block**76**is operable to select which two of the data processing element outputs are stored, as will be described in more detail hereinbelow. - [0062]Each of the four (4) parallel data processing units, the MAC
**68**, Adder**70**, logic unit**72**and shifter**74**, runs in parallel with the others allowing the execution of multiple operations per cycle. Each of the data processing functions in the MacroSequencer datapath will be discussed hereinbelow in detail. However, they are controlled by the operation fields in the MacroSequencer's LIW register. It is noted that, as described herein, the terms “external” and “internal” do not refer to signals external and internal to the RADP; rather, they refer only to signals external and internal to an individual MacroSequencer. - [0063]The 16-bit input registers in register block
**60**comprise InRegA and InRegB. There are six external inputs and one internal input available to the Input Registers. The input registers are comprised of an 8-to-1 multiplexer**100**with the output thereof connected to a register**102**, the output of register**102**comprising the InRegA output. Also, an 8-to-1 multiplexer**104**is provided having the output thereof connected to a register**106**, which provides the output InRegB. Seven of the inputs of both multiplexers**100**and**104**are connected to common inputs, one being the 16-bit input of bus**24**, one being a 16-bit constant input bus**108**, four being the 16-bit data buses**18**and one being the pair bus**34**, which is also a 16-bit bus. The constant is a value that varies from “0” to “65535”, which is generated from the LIW register bits. The eighth input of the multiplexer**100**is connected to the output of register**102**, whereas the eighth input of the multiplexer**104**is connected to the output of register**106**. - [0064]The Constant introduces 16-bit constants into any calculation. The constant of the MacroSequencer shares internal signals with the MacroSequencer Controller as well as the MAC
**68**, the Shifter**74**, and the Logic Unit**72**. Since the Constant field of the LIW is shared, care must be taken to ensure that overlap of these signals does not occur. The RADP Assembler detects and reports any overlap problems. - [0065]Referring now to FIG. 5, there is illustrated a block diagram of the input selector block
**66**. The input selector block**66**is comprised of a four-to-one multiplexer**110**, a six-to-one multiplexer**112**and a two-to-one multiplexer**114**. The multiplexer**112**is connected to one input of an Exclusive OR gate**116**. The output of multiplexer**110**is connected to a bus**118**to provide the InBusA signals, the output of Exclusive OR gate**116**is connected to a bus**120**to provide the InBusB signals and the output of multiplexer**114**is connected to a bus**122**to provide the InBusC signals. Inputs to the Input Selector**66**include: - [0066]InRegA and InRegB from the Input Register
**60**, - [0067]OutRegA and OutRegB from the Output Register
**76**, - [0068]mem
**1**and mem**2**from the Three-Port Memory read ports**1**and**2**, respectively, on lines**78**, - [0069]mem
**0**from the One-Port Memory read port on line**80**, and - [0070]Constant ‘0’ which is generated in the Input Selector
**66**. - [0071]Control signals from the MacroSequencer Controller (not shown) determine which three of the eight possible inputs are used and whether InBusB is inverted or not. The Input Selector
**66**is automatically controlled by assembly language operations for the MAC**68**, Adder**70**, Shifter**74**, and Logic Unit**72**and does not require separate programming. The input selections are controlled by the same assembly operations used by the MAC**68**, Adder**70**, Logic Unit**72**and Shifter**74**. - [0072]Referring now to FIG. 6, there is illustrated a block diagram of the MAC
**68**. The Multiplier-Accumulator (MAC)**68**is a three-stage, 16 by 8 multiplier capable of producing a full 32-bit product of a 16 by 16 multiply every two cycles. The architecture allows the next multiply to begin in the first stages before the result is output from the last stage so that once the pipe-line is loaded, a 16 by 8 result (24-bit product) is generated every clock cycle. - [0073]The input to the MAC
**68**is comprised of an Operand A and an Operand B. The Operand A is comprised of the output of the One-Port memory**44**on the bus**80**and the InBusA**86**. These are input to a three-to-one multiplexer**126**, the output thereof input to a register**130**, the output of the register**130**connected to a 16-bit bus**132**. The output of the register**130**is also input back as a third input of the multiplexer**126**. The Operand B is comprised of the OutRegB bus**84**and the InBusB bus**88**. These buses are input to a three-to-one multiplexer**134**, the output thereof connected to the register**136**. They are also input to a 2-input multiplexer**138**, the output thereof input to a register**140**, the output of register**140**input as a third input to the multiplexer**134**. The outputs of registers**130**and**136**are input to a 16×8-bit multiplier**142**which is operable to multiply the two Operands on the inputs to provide a 24-bit output on a bus**144**. This is input to a register**146**, the output thereof input to a 48-bit accumulator**148**. The output of the accumulator**148**is stored in a register**150**, the output thereof fed back to the input of the accumulator**148**and also to the input of a four-to-two multiplexer**152**, the output of the register**150**being connected to all four inputs of multiplexer**152**. The multiplexer**152**then provides two outputs for input to the Adder**70**on buses**154**and**156**. The operation of the MAC**68**will be described in more detail hereinbelow. Either or both operands may be signed or unsigned. The multiplier input multiplexers**126**,**134**and**138**serve two purposes:
- [0075]2) They allow each operand to be selected from three different sources:
- [0076]Operand A is selected from the One-Port Memory
**44**, InBusA**86**, or Operand A from the previous cycle. - [0077]Operand B is selected from the high byte of OutRegB
**84**, InBusB**88**, or the least significant byte of the previous Operand B. - [0078]The Multiplier Stage
**142**produces a 24-bit product from the registered 16-bit Operand A and either the most significant byte (8-bits) or the least significant byte of Operand B. The Accumulator Stage**148**aligns and accumulates the product. Controls in the accumulator allow the product to be multiplied by: 1 when <weight> is low, or 2^8 when <weight> is high. The result is then: added to the result in the accumulator**148**when <enable> is acc, placed in the accumulator replacing any previous value when <enable> is clr, or held in the accumulator in lieu of a mult**3**operation.
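The weighting just described is what lets two 16-by-8 partial products form a full 16-by-16 product: the low byte of Operand B enters at weight 1 and the high byte at 2^8. In outline (an arithmetic illustration, not the hardware):

```python
def mul16x16_via_16x8(a, b):
    """Compose a 16x16 product from two 16x8 partial products."""
    p_low = a * (b & 0xFF)        # 24-bit partial product, weight 1
    p_high = a * (b >> 8)         # 24-bit partial product, weight 2**8
    return p_low + (p_high << 8)  # accumulated 32-bit product
```

This is why a full 16 by 16 multiply takes two cycles through the multiplier stage while a 16 by 8 multiply takes one.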
TABLE 1 Cycles Between New Multiplies Multiply Accuracy Cycles 16 by 8 16 bits 1 24 bits 2 16 by 16 16 bits 2 16 by 816 by 832 3 bits - [0080][0080]
TABLE 2 Cycles Between New Multiply - Accumulates of n Products Multiply Accuracy Cycles 16 by 8 16 bits n 32 bits n + 1 48 bits n + 2 16 by 16 16 bits 2n 32 bits 2n + 1 48 bits 2n + 2 - [0081]The MAC internal format is converted to standard integer format by the Adder
**70**. For this reason, all multiply and multiply-accumulate outputs must go through the Adder**70**. - [0082]If a 16- by 8-bit MAC
**68**is desired, new operands are loaded every cycle. The Multiplier**142**results in a 24-bit product which is then accumulated in the third stage to a 48-bit result. This allows at least 2^24 multiply-accumulate operations before overflow. If only the upper 16-bits of a 24-bit result are required, the lower eight bits may be discarded. If more than one 16-bit word is extracted, the accumulated result must be extracted in a specific order. First the lower 16-bit word is moved to the Adder**70**, followed in order by the middle 16 bits and then the upper 16 bits. This allows at least 2^16 of these 16- by 16-bit multiply-accumulate operations before overflow will occur. - [0083]Referring now to FIG. 7, there is illustrated a block diagram of the Adder
**70**. The Adder**70**produces a 16-bit result of a 16- by 16-bit addition, subtraction, or 16-bit data conversion to two's complement every cycle. The Adder**70**is also used for equality, less-than and greater-than comparisons. The Adder**70**is comprised of two Adder pipes, an Adder pipe**160**and Adder pipe**162**. There are provided two multiplexers**164**and**166**on the input, with multiplexer**164**receiving the multiplier output signal on bus**154**and the multiplexer**166**receiving the multiplier output on bus**156**. Additionally, multiplexer**164**receives the signal on the InBusA**86**with multiplexer**166**receiving as an input the signals on InBusB**88**. The outputs of multiplexers**164**and**166**are input to the Adder pipe**160**, the output thereof being input to a register**168**. The output of register**168**is input to the Adder pipe**162**, which also receives an external carry-in bit, a signal indicating whether the operation is a 32-bit or 16-bit operation and a signed/unsigned bit. The Adder pipe**162**provides a 4-bit output to a register**170**which combines the Adder status flags for equality, overflow, sign and carry, and also a 16-bit output to the Output Selector on a bus**172**. The architecture allows the next adder operation to begin in the first stage before the result is output from the last stage. - [0084]The input multiplexers
**164**and**166**select one of two sources of data for operation by the Adder**70**. The operands are selected from either InBusA**86**and InBusB**88**, or from the Multiplier**68**. InBusA**86**and InBusB**88**are selected for simple addition or subtraction and setting the Adder Status flags. The Multiplier**68**outputs, MultOutA**154**and MultOutB**156**, are selected for conversion. The first adder stage**160**receives the operands and begins the operation. The second adder stage**162**completes the operation and specifies the output registers in the Output Selector where the result will be stored. The two adder stages**160**and**162**may be controlled separately for addition and subtraction operations. - [0085]The Adders
**70**from a pair of MacroSequencers may be used together to produce 32-bit sums or differences. There is no increase in the pipeline latency for these 32-bit operations. The Adder**70**may be placed in the signed or unsigned mode. - [0086]Adder Status Bits—The Equal, Sign, Overflow, and Carry flags are set two cycles after an addition operation (add
**1**or sub**1**) occurs and remain in effect for one clock cycle: - [0087]The Equal flag is set two cycles later when the two operands are equal during an addition operation;
- [0088]The Overflow flag is set when the result of an addition or subtraction results in a 16-bit out-of-range value;
- [0089]When the adder
**70**is configured for unsigned integer arithmetic, Overflow=Carry. Range=0 to 65535; - [0090]When the adder is configured for signed integer arithmetic, Overflow=Carry XOR Sign. Range=−32768 to +32767;
- [0091]The Sign flag is set when the result of an addition or subtraction is a negative value;
- [0092]The Carry flag indicates whether a carry value exists.
- [0093]The Adder
**70**may be used to convert the data in the Accumulator**148**of the Multiplier**142**to standard integer formats when inputs are selected from the output of the MAC**68**. Since the Accumulator**148**is 48 bits, the multiplier's accumulated result must be converted in a specific order: lower-middle for 32-bit conversion, and lower-middle-upper for 48-bit conversion. Once the conversion process is started, it must continue every cycle until completed. Signed number conversion uses bits**30**:**15**. - [0094]Shift Mode signals control which Shifter functions are performed:
- [0095]Logical Shift Left by n bits (shift low order bits to high order bits). The data shifted out of the Shifter is lost, and a logical ‘0’ is used to fill the bits shifted in.
- [0096]Logical Shift Right by n bits (shift high order bits to low order bits). The data shifted out of the Shifter is lost, and a logical ‘0’ is used to fill the bits shifted in.
- [0097]Arithmetic Shift Right by n bits. This is the same as logical shift right with the exception that the bits shifted in are filled with Bit[
**15**], the sign bit. - [0098]This is equivalent to dividing the number by 2
^{n}. - [0099]Rotate Shift Left by n bits. The bits shifted out from the highest ordered bit are shifted into the lowest ordered bit.
- [0100]Normalized Shift Right by 1 bit. All bits are shifted one lower in order. The lowest bit is lost and the highest bit is replaced by the Overflow Register bit of the Adder. This is used to scale the number when two 16-bit words are added to produce a 17-bit result.
- [0101]Logical, Arithmetic and Rotate shifts may shift zero to fifteen bits as determined by the Shift Length control signal.
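The shift modes listed above can be sketched in software as follows. This is a minimal model for illustration only; the mode names and the function signature are hypothetical, not the actual LIW field encoding:

```python
MASK16 = 0xFFFF

def shift16(value, mode, n=1, overflow_bit=0):
    """Model of the 16-bit Shifter modes described above (mode names are
    illustrative only; the real LIW encoding is not shown here)."""
    v = value & MASK16
    if mode == "lsl":        # Logical Shift Left: '0' fills the bits shifted in
        return (v << n) & MASK16
    if mode == "lsr":        # Logical Shift Right: '0' fills the bits shifted in
        return v >> n
    if mode == "asr":        # Arithmetic Shift Right: bit 15 (sign) fills in
        sign_ext = -(v >> 15) << 16
        return ((sign_ext | v) >> n) & MASK16
    if mode == "rol":        # Rotate Shift Left: high bits wrap to low bits
        n %= 16
        return ((v << n) | (v >> (16 - n))) & MASK16
    if mode == "nsr":        # Normalized Shift Right by 1: Adder Overflow bit enters at bit 15
        return (overflow_bit << 15) | (v >> 1)
    raise ValueError(mode)
```

For example, an arithmetic shift right of 0x8000 by one bit yields 0xC000 (the sign bit is replicated), while a rotate left of 0x8001 by one bit yields 0x0003.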
- [0102]Referring now to FIG. 9, there is illustrated a block diagram of the Logic Unit
**72**. The Logic Unit**72**is able to perform a bit-by-bit logical function of two 16-bit vectors for a 16-bit result. All bit positions will have the same function applied. All sixteen logical functions of 2 bits are supported. The Logic Function controls determine the function performed. The Logic Unit**72**is described in U.S. Pat. No. 5,394,030, which is incorporated herein by reference. - [0103]Referring now to FIG. 10, there is illustrated a block diagram of the One-Port Memory
**44**. The One-Port Memory**44**is comprised of a random access memory (RAM) which is a 32×16 RAM. The RAM**44**receives on the input thereof the data from the OutRegA bus**82**. The output of the RAM**44**is input to a multiplexer**180**, the output thereof input to a register**182**, the output of the register**182**connected to the bus**80**. Also, the bus**80**is input back to the other input of the multiplexer**180**. A 5-bit address for the RAM**178**is received on a 5-bit address bus**184**. The One-Port Memory**44**supports single-cycle read and single-cycle write operations, but not both at the same time. There are 32 addressable 16-bit memory locations in the One-Port Memory**44**. The register**182**is a separate register provided to store and maintain the result of a read operation until a new read is executed. The Read and Write controls determine whether a read or a write is requested. No operation is performed when both the Read and Write controls are inactive. Only one operation, read or write, can occur per cycle. An index register provides the read and write address to the One-Port Memory. The index register may be incremented, decremented, or held with each operation. Both the index operation and the read or write operation are controlled by the MacroSequencer LIW. - [0104]Referring now to FIG. 11, there is illustrated a block diagram of a Three-Port Memory
**43**. The Three-Port Memory**43**is comprised of a 16×16 RAM**186**, which receives the OutRegB contents as an input on the bus**84**and provides two outputs, one output providing an input to a multiplexer**188**and one output providing an input to a multiplexer**190**. The output of multiplexer**188**is input to a register**192**and the output of the multiplexer**190**is input to a register**194**. The output of register**192**provides the mem**1**output on one of the buses**78**and the output of register**194**provides the mem**2**output on the other of the buses**78**, each of the buses**78**comprising a 16-bit bus. Additionally, the output of register**192**is fed back to the other input of multiplexer**188**and the output of register**194**is fed back to the input of the multiplexer**190**. There are two read operations that are provided by the RAM**186**and they are provided by two read addresses, a Read**1**address on a 4-bit bus**196**and a 4-bit read address on a bus**198**, labeled Read**2**. The write address is provided on a 4-bit bus**200**. The Three-Port Memory**43**supports two read and one write operation on each clock cycle. The two read ports may be used independently; however, data may not be written to the same address as either read in the same clock cycle. Four index registers are associated with the Three-Port Memory. Two separate registers are provided for write indexing: Write Offset and Write Index. These two registers may be loaded or reset simultaneously or independently. Write Offset provides a mechanism to offset the read index registers from the Write Index by a fixed distance. Increment and Decrement apply to both write registers so that the offset is maintained. The two Read Index registers may be independently reset or aligned to the Write Offset. - [0105]Referring now to FIG. 12, there is illustrated a block diagram of the Three-Port Memory Index Pointers. Smart Indexing allows multiple memory addresses to be accessed. 
This is particularly useful when the data is symmetrical. Symmetrical coefficients are accessed by providing the Write Offset from the center of the data and aligning both Read Indices to the Write Offset. The Read Indices may be separated by a dummy read. Additional simultaneous reads, with one index incrementing and the other decrementing, allow for addition or subtraction of data that uses the same or inverted coefficients. Each index has separate direction controls and may increment, decrement, or change direction with each operation. The change in each index register's address takes place after a read or write operation on the associated port. Smart Indexing is ideal for filter and DCT applications where pieces of data are taken from equal distances away from the center of symmetrical data. The Smart Indexing method used in the Data Memory allows symmetrical data to be multiplied in half the number of cycles that would normally be required. Data from both sides can be added together and then multiplied with the common coefficient. For example, a 6-tap filter, which would normally take 6 multiplies and 7 cycles, can be implemented with a single MacroSequencer and only requires 3 cycles to complete the calculation. An 8-point DCT which normally requires 64 multiplies and 65 cycles can be implemented with a single Macro-Sequencer and only requires 32 clock cycles to complete the calculation.
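The cycle savings from pairing symmetrical data can be sketched as follows. This is a simplified software model of the pairing idea, not the MacroSequencer datapath itself:

```python
def symmetric_fir(samples, half_coeffs):
    """6-tap FIR with symmetric coefficients [c0, c1, c2, c2, c1, c0]:
    samples equidistant from the center are added first, so each pair
    shares one multiply -- 3 multiplies instead of 6."""
    assert len(samples) == 2 * len(half_coeffs)
    acc = 0
    for i, c in enumerate(half_coeffs):
        acc += c * (samples[i] + samples[-1 - i])  # one multiply per pair
    return acc

# Matches the direct 6-multiply evaluation:
taps = [7, 8, 9, 9, 8, 7]
data = [1, 2, 3, 4, 5, 6]
direct = sum(c * s for c, s in zip(taps, data))
assert symmetric_fir(data, [7, 8, 9]) == direct
```

The add-before-multiply ordering is exactly why the symmetric 6-tap case needs only half the multiplies.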
- [0106]Referring now to FIG. 13, there is illustrated a block diagram of the output selector
**76**. The output selector**76**is comprised of two multiplexers, a 4-input multiplexer**202**and a 6-input multiplexer**204**. Both multiplexers**202**and**204**receive the outputs from the Adder**70**, Logic Unit**72**and Shifter**74**on the respective 16-bit buses. The output of multiplexer**202**is input to a register**206**, the output thereof providing the 16-bit signal for the OutRegA output on bus**82**. This bus**82**is fed back to the remaining input of the multiplexer**202**and also back to the input selector**66**. The multiplexer**204**also receives as an input the InRegB contents on bus**64**and the MacroSequencer shared data on the bus**34**. The output of the multiplexer**204**is input to a register**208**, the output thereof comprising the OutRegB contents on the bus**84**, which is also input back to an input of the multiplexer**204**and to the input selector**66**. The Output Selector**76**controls the state of output registers OutRegA**206**and OutRegB**208**and controls the state of the MSnI/O[**15**:**0**] bus pins. The Output Selector**76**multiplexes five 16-bit buses and places the results on the two 16-bit output registers**206**and**208**which drive the two on-chip buses**82**and**84**and the MacroSequencer I/O pins**22**. The Output registers may be held for multiple cycles. - [0107]Referring now to FIG. 14, there is illustrated a block diagram of the MacroSequencer I/O interface. The contents of the output register
**206**on the bus**82**are input to a 2-input multiplexer**210**, the other input connected to bus**203**to provide the MacroSequencer I/O data. The output of multiplexer**210**provides the data to the associated one of the four buses**18**, each being a 16-bit bus. Additionally, the 16-bit bus**82**is input to a driver**212**which is enabled with an output enable signal OE. The output of driver**212**drives the I/O bus**22**for an output operation and, when it is disabled, this is provided back as an input to the multiplexer**204**. The output enable circuitry for the driver**212**is driven by an output enable signal MSnOE and a signal OEPLA which is an internal signal from the PLA**26**. These two signals are input to a 2-input multiplexer**214**, which is controlled by configuration bit**5**. The output of multiplexer**214**is input to a multiplexer**216**, the other input of which is connected to a "1" value; this multiplexer**216**is controlled by configuration bit**6**. The output of multiplexer**216**drives one input of the 2-input multiplexer**218**directly and the other input thereof through an inverter**220**. The multiplexer**218**is controlled by configuration bit**7**and provides the OE signal to the driver**212**. Configuration bit**4**determines the state of the multiplexer**210**. The I/O Interface selection for each MacroSequencer determines the input source for data busn and the output enable configuration. - [0108]The input data on the buses
**18**, busn, is selected from the MSNI/O[**15**:**0**] pins**22**or the OutRegA**206**output of MacroSequencer(n) by configuration bit**4**. When the MacroSequencer(n)'s associated busn is connected to the OutRegA**206**signal, the MacroSequencer still has input access to the MSnI/O pins**22**via the Output Selector. - [0109]Output Enable to the MSnI/O pins is controlled by configuration bit selections. Inputs to the output enable control circuitry include the MSnOE pin for MacroSequencer(n) and the oepla[n] signal from the PLA
**26**. The Output Selector diagram for the output enable circuitry represents the equivalent of the output enable selection for configuration bits**5**,**6**, and**7**in the normal operating mode. - [0110]Referring now to FIG. 15, there is illustrated a block diagram of the MacroSequencer Datapath Controller
**42**. The MacroSequencer Datapath Controller**42**contains and executes one of two sequences of Long Instruction Words (LIWs) that may be configured into the instruction memory**48**. The Datapath Controller**42**generates LIW bits which control the MacroSequencer Arithmetic Datapath. It also generates the values for the One-Port and Three-Port index registers. The Datapath Controller**42**operation for each MacroSequencer is determined by the contents of its LIW register and the two control signals. - [0111]The Datapath Controller
**42**has associated therewith a sequence controller**220**which is operable to control the overall sequence of the instructions for that particular MacroSequencer. The sequence controller**220**receives adder status bits from the Adder**70**which were stored in the register**170**and also control signals from either an internal MacroSequencer control bus**222**or from the PLA**26**which are stored in a register**224**. The contents of the register**224**or the contents of the bus**222**are selected by a multiplexer**226**which is controlled by the configuration bit**8**. There are provided two counters, a counter**0****228**and a counter**1****230**which are associated with the sequence controller**220**. The instruction memory**48**is controlled by a program counter**232**which is interfaced with a stack**234**. The program counter**232**is controlled by the sequence controller**220**as well as the stack**234**. The instruction memory**48**, as noted above, is preloaded with the instructions. These instructions are output under the control of sequence controller**220**to an LIW register**236**to provide the LIW control bits which basically configure the entire system. In addition, there are provided read addresses, with an index register**238**storing the address for the One-Port address on bus**84**, an index register**240**for storing the read address for the Three-Port read address on bus**196**, an index register**242**for storing a read address for the Three-Port read address bus**198**, an index register**244**for storing the write address for the Three-Port write address bus**200**. These are all controlled by the sequence controller**220**. The status bits are also provided for storage in a register**248**to provide status signals. - [0112]The LIW register
**236**, as noted above, contains the currently executing LIW which is received from the instruction memory**48**, which is a 32×48 reprogrammable memory. The program counter**232**is controlled by the stack**234**which is a return stack for “calls”, and is operable to hold four return addresses. - [0113]The controller
**48**accepts control signals from the PLA CtrlReg signals or external MSnCTRL pins which initiates one of two possible LIW sequences. It outputs Send and Await status signals to the PLA**26**and to external MSnSEND and MSnAWAIT pins. - [0114]The Datapath Controller
**42**is a synchronous pipelined structure. A 48-bit instruction is fetched from instruction memory**48**at the address generated by the program counter**232**and registered into the LIW register**236**in one clock cycle. The actions occurring during the next clock cycle are determined by the contents of the LIW register**236**from the previous clock cycle. Meanwhile, the next instruction is being read from memory and the contents of the LIW register**236**are changed for the next clock cycle so that instructions are executed every clock cycle. Due to the synchronous pipe-lined structure, the Datapath Controller**42**will always execute the next instruction before branch operations are executed. The program counter**232**may be initiated by control signals. It increments or branches to the address of the LIW to be executed next. - [0115]The Adder status signals, Stack
**234**and the two Counters**228**and**230**in the Datapath Controller support the program counter**232**. Their support roles are: - [0116]the Adder status bits report the value of the Equal, Overflow, and Sign, for use in branch operations;
- [0117]the Stack
**234**contains return addresses; and - [0118]counter
**0****228**and Counter**1****230**hold down loop-counter values for branch operations. - [0119]The five index registers
**238**-**246**hold write, read, and write offset address values for the One-Port and Three-Port memories. The write offset index register**246**is used for alignment of the two read index registers, and it holds the value of an offset distance from the Three-Port Memory**63**write index for the two read indices. - [0120]The MSn Direct Control and Status pins illustrated in FIG. 2 are the control and status interface signals which connect directly between the pins and each MacroSequencer. The direct control signals are MSnCTRL[
**1**:**0**] and MSnOE. The direct status signals are MSnAWAIT and MSnSEND. Alternatively, the MacroSequencers**10**-**16**may use control signals from the Dual PLA**26**. The Dual PLA also receives the MacroSequencer status signals. Two Control signals for each MacroSequencer specify one of four control commands. They are selected from either the MSnCTRL[**1**:**0**] pins or from the two PLA Controln signals. The control state of the MacroSequencer on the next clock cycle is determined by the state of the above components and the value of these Controln[**1**:**0**] signals. - [0121]The four control commands include:
- [0122]SetSequence
**0** - [0123]SetSequence
**0**sets and holds the Program Counter**232**to ‘0’ and resets the Send and Await state registers to ‘0’ without initializing any other registers in the MacroSequencer. Two clock cycles after the SetSequence**0**is received, the Datapath Controller**42**will execute the contents of the LIW register**236**(which is the contents of the LIW memory at address ‘0’) every clock cycle until a Run or Continue control command is received. - [0124]SetSequence
**2** - [0125]SetSequence
**2**sets and holds the Program Counter**232**to ‘2’ and resets the Send and Await state registers to ‘0’ without initializing any other registers in the MacroSequencer. Two clock cycles after the SetSequence**0**is received, the Datapath Controller**2**will execute the contents of the LIW register**236**(which is the contents of the LIW memory at address ‘2’) every clock cycle until a Run or Continue control command is received. - [0126]Run
- [0127]Run permits normal operation of the Datapath Controller
**42**. This control command should be asserted every cycle during normal operation except when resetting the Send and/or Await flags, or initiating an LIW sequence with SetSequence**0**or SetSequence**2**. - [0128]Continue
- [0129]Continue resets both the Send and Await status signals and permits normal operation. If the Await State was asserted, the Program Counter
**232**will resume normal operation on the next cycle. - [0130]If an await operation is encountered while the Continue control command is in effect, the Continue control command will apply, and the await operation will not halt the program counter
**232**, nor will the Await status register be set to a ‘1’. Therefore, the Continue control command should be changed to a Run control command after two clock cycles. If a send operation is encountered while the Continue control command is in effect, the Continue control command will apply, and the Send status register will not be set to a ‘1’. - [0131]The following table summarizes the four control command options for Controln[
**1**:**0**] which may be from CtrlPLAn or from MSnCTRL pins:TABLE 3 Control n [1:0] Command Description 0 0 Run Normal Operating Condition 0 1 Continue Reset Send and Await registers. 1 0 SetSequence0 The program counter is set to ‘0’. Resets the Send and Await registers. This must be asserted for at least two cycles. 1 1 SetSequence2 The program counter is set to ‘2’. Resets the Send and Await registers. This must be asserted for at least two cycles. - [0132]By allowing two sequence starting points, each MacroSequencer can be programmed to perform two algorithms without reloading the sequences. The two PLA Controln signals are synchronized within the MacroSequencer. The two MSnCTRL pin signals are not synchronized within the Macro-Sequencer; therefore, consideration for timing requirements is necessary.
- [0133]There are two single-bit registered status signals that notify the external pins and the PLA
**26**when the MacroSequencer has reached a predetermined point in its sequence of operations. They are the Await and Send status signals. Both of the Status signals and their registers are reset to ‘0’ in any of these conditions: during Power On Reset, active configuration of any part of the RADP, or during Control States: - [0134]SetSequence
**0**, SetSequence**2**, or Continue. - [0135]When an await operation is asserted from the LIW register, the MacroSequencer executes the next instruction, and repeats execution of that next instruction until a Continue or SetSequence control command is received. The await operation stops the program counter from continuing to change and sets the Await status signal and register to ‘1’. A Continue control command resets the Await status signal and register to ‘0’ allowing the program counter
**232**to resume. When send operation is asserted, the Send status signal and register is set to ‘1’ and execution of the sequence continues. The program counter**232**is not stopped. A Continue control command resets the Send status signal and register to ‘0’. Status signals are resynchronized by the Dual PLA**26**with the PLACLK. - [0136]The Adder status bits, Equal, Overflow, and Sign are provided for conditional jumps.
- [0137]The purpose of the 48-bit LIW Register
**236**is to hold the contents of the current LIW to be executed. Its bits are connected to the elements in the datapath. The LIW register**236**is loaded with the contents of the instruction pointed to by the Program Counter**232**one cycle after the Program Counter**232**has been updated. The effect of that instruction is calculated on the next clock cycle. Each of the MacroSequencers**10**-**16**is composed of elements that are controlled by Long Instruction Word (LIW) bits. LIWs are programmed into Macro-Sequencer Instruction memory**48**during device configuration. The Datapath Controller executes the LIWs which control the arithmetic datapath. Some of these fields are available in every cycle. Some are shared between more than one operational unit. The following operational fields are available on every cycle: - [0138]One-Port Memory access
- [0139]Three-Port Memory access
- [0140]Input Register multiplexers
- [0141]Input Mux A, B, C
- [0142]Output multiplexers
- [0143]Adder
**1** - [0144]Adder
**2** - [0145]These operational fields are available on every cycle except when a Constant is required by an in operation:
- [0146]Multiplier
- [0147]Multiplier-Accumulator
- [0148]These operational fields conflict with each other. Only one is allowed in each LIW:
- [0149]Shifter
- [0150]Logic Unit
- [0151]Datapath Controller (if parameters are required)
- [0152]The Program Counter
**232**is a 5-bit register which changes state based upon a number of conditions. The program counter may be incremented, loaded directly, or set to ‘0’ or ‘2’. The three kinds of LIW operations which affect the MacroSequencer Program Counter explicitly are: - [0153]Branch Operations,
- [0154]SetSequence
**0**and SetSequence**2**operations, and - [0155]Await status operations.
- [0156]The Program Counter
**232**is set to zero ‘0’: - [0157]During power-on Reset,
- [0158]During Active configuration of any part of the RADP,
- [0159]During the SetSequence
**0**control command, - [0160]When the Program Counter
**232**reaches the value ‘31’, and the previous LIW did not contain a branch to another address, or - [0161]Upon the execution of a branch operation to address ‘0’.
- [0162]The Controln[
**1**:**0**] signals are used to reset the program counter to either ‘0’ or ‘2’ at any time with either SetSequence**0**or SetSequence**2**respectively. A Run control command begins and maintains execution by the program counter according to the LIW. A Continue control state resumes the program counter operation after an Await state and resets the Send and Await registers to ‘0’ on the next rising clock signal. A Continue control command after a Send status state resets the Send register to ‘0’ on the next rising clock signal. - [0163]The Await status register is set to ‘1’ and the Program Counter
**232**stops on the next clock cycle after an await operation is encountered. A Continue control state resets the Send and Await registers and permits the Program Counter**232**to resume. The Send status register is set to ‘1’ on the next clock cycle after a send operation. In the Send status, the Program. Counter continues to function according to the LIW. A Continue control state is required to reset the Send register. - [0164]The LIW register may contain one Branch Operation at a time. Conditional Branches should not be performed during the SetSequence control commands to insure predictable conditions.
TABLE 4 Result in the Program Branch Operation Assembly Instruction Counter Unconditional branch jump <address> Program Counter is set to <address>. Branch on loop jumpcounter0 Program Counter is set to Counter0 or loop <address> <address> if the Counter1 not equal to jumpcounter1 respective branch loop ‘0’ <address> counter has a non-zero value. The respective loop counter will then be decremented in the next clock cycle. Branch on an Adder jumpequal <address> Program Counter is set status condition: jumpoverflow <address> if the Adder Equal, Overflow, <address> status bits agree with the Sign jumpsign <address> branch condition. Call subroutine call <address> The current address plus ‘1’ in the Program Counter is pushed onto the Stack. The contents of the Program Counter on the next clock cycle will be set to the address in the LIW. Return from return The address from the top subroutine operation of the Stack is popped into the Program Counter. - [0165]The Instruction memory
**48**consists of thirty-two words of 48-bit RAM configured according to the MacroSequencer assembly language program. The Instruction memory**48**is not initialized during Power On Reset. For reliability, the LIW RAM must be configured before MacroSequencer execution begins. Bit fields in the LIW Registers control datapath operations and program flow. - [0166]The counters
**228**and**230**are 5-bit loop counters. Both loop counters are filled with ‘0’ s during Power On Reset and active configuration of any component in the RADP. Counter**0**and Counter**1**may be loaded by the setcounter**0**and setcounter**1**operations respectively. The jumpcounter**0**and jumpcounter**1**operations will decrement the respective counter on the next clock cycle until the Counter value reaches ‘0’. The SetSequence**0**and SetSequence**2**control signals do not alter or reset the loop counters. Therefore, the counters should be initialized with setcounter**0**and setcounter**1**operations before they are referenced in the program. - [0167]The Stack
**234**holds return addresses. It contains four 5-bit registers and a 2-bit stack pointer. After Power On Reset or the active configuration of any component in the RADP, the stack pointer and all of the 5-bit registers are initialized to ‘0’s. A call performs an unconditional jump after executing the next instruction, and pushes the return address of the second instruction following the call into the Stack**234**. A return operation pops the return address from the Stack**234**and into the Program Counter**232**. The call and return operations will repeat and corrupt the Stack**234**if these operations are in the next LIW after an await operation because the program counter**232**is held on that address, and the MacroSequencer repeats execution of the LIW in that address. - [0168]The LIW Register
**236**controls the five index registers which are used for data memory address generation. The index register**238**holds the One-Port Memory address. The other four index registers**240**-**246**hold Three-Port Memory address information. During Power On Reset or the active configuration of any component in the RADP, all index register bits are reset to ‘0’s. The control states, Run, Continue, SetSequence**0**or SetSequence**2**do not effect or reset the index registers. Each clock cycle that a relevant memory access is performed, the memory address can be loaded, incremented, decremented or held depending upon the control bit settings in each index register. - [0169]In each MacroSequencer there are nine programmable configuration bits. They are listed in the table below. The three signed/unsigned related bits are set with directives when programming the MacroSequencer. The others are set by the software design tools when the configuration options are selected.
TABLE 5 MacroSequencer Configuration Bits Functional Bit Block Function If Bit = 0 If Bit = 1 0 Multiplier Must A is unsigned. A is signed. operand A sign 1 Multiplier Must B is unsigned. B is signed. operand B sign 2 Adder Signed/ Unsigned Add Signed Add Unsigned Bit 3 Adder 32/16 Bit 16 bit Datapath 32 bit Datapath mode mode 4 Data Bus Select Busn inputs are Busn inputs are Connec- OutRegA or from OutRegA of from MSnI/O pins tions MSnI/O pins MacroSequencer(n) for Macro- Sequencer busn inputs 5 I/O Output OE from MSnOE OE from PLA Interface Enable pin Select 6 I/O Select OE OE = OE OE = ‘1’ Interface signal or ‘1’ 7 I/O OE Polarity OE = OE OE = OE Interface Select 8 Datapath Control[1:0] Control[1:0] from Control[1:0] from Controller source select MSnCTRL[1:0] PLA0 pins CtrlPLAn[1:0] - [0170]The configuration bits are configured with the instruction memory
**48**, where bits**0**through**8**of the 16-bit program data word are the nine configuration bits listed above. - [0171]Referring now to FIG. 16, there is illustrated a block diagram of the dual PLA
**26**. There are provided two PLAs, a PLA**0****260**and a PLA**1****261**. Each of the PLAs is comprised of an input selector**264**for receiving seven inputs. Each receives the 16-bit BUS**4**IN bus**24**which is a 16-bit bus, the send states bits on a bus**266**, the await status bits on a bus**268**, the PLA input signal on the bus**38**, the PLA I/O signal on the bus**40**, the output of each of the PLAs**260**and**261**. Each of the input selectors provides an A and a B output on 16-bit buses to a minimum term generator**268**which provides a 64-bit output. This is input to a 34×32 AND array**270**for each of the PLAs**260**and**261**, the output thereof being a 32-bit output that is input to a fixed OR gate**272**. The AND array**270**also provides output enable signals, two for the PLA**260**and two for the PLA**261**. For PLA**260**, the fixed OR output**272**is an 8-bit output that is input to a control OR gate**274**, whereas the output of the fixed OR gate**272**and PLA**261**is a 14-bit output that is input to an output OR gate**276**and also is input to the control OR gate**274**and PLA**260**. The output of the control OR gate**274**and PLA**260**is input to an 8-bit control register**278**, the output thereof providing the PLA control signals, there being four 2-bit control signals output therefrom. This control register**278**also provides the output back to the input selectors**264**for both PLAs**260**and**261**. The output of the output OR gate**276**and the PLA**261**is input to an output register**280**, the output thereof providing an 8-bit output that is input back to the input selectors**264**for both PLAs**260**and**261**and also to an I/O buffer**282**. The output of the I/O buffer is connected to the I/O bus**40**that is input to the input selector**264**and comprising 8-bit output. The I/O buffer**282**also receives the output of the output OR**276**. The general operation of the PLA is described in U.S. Pat. No. 5,357,152, issued Oct. 
18, 1994 to E. W. Jennings and G. H Landers, which is incorporated herein by reference. - [0172]The Dual PLA
**26**provides the two in-circuit programmable,**32**input by**34**product term PLAs**260**and**261**. PLA**0****260**may serve as a state machine to coordinate the Macro-Sequencer array operation with external devices. PLA**1****261**may be used for random interface logic. The Dual PLA**26**may perform peripheral logic or control functions based upon the state of BUS**4**IN, PLAIN and PLAI/O bus states and the Control bus**20**. The Dual PLA control functions which may be used by any or all of the MacroSequencers include: - [0173]Registered control outputs, CtrlReg[
**7**:**0**], for: - [0174]Initiation of LIW sequences; and
- [0175]Control response to Send and Await status signals.
- [0176]Combinatorial outputs, oepla[
**3**:**0**], used to generate Output Enable signals for the MacroSequencers. The oepla[**3**:**0**] signals are generated from individual product terms. - [0177]The PLA
**0****260**produces eight CtrlReg outputs that can be used as MacroSequencer control signals where two signals are available for each of the MacroSequencers**10**-**14**to use as Control signals. They are also available as feedbacks to both PLA**0****260**and PLA**1****261**. The CtrlReg[**7**:**0**] signals are useful in multi-chip array processor applications where system control signals are transmitted to each RADP. PLA**1****261**produces combinatorial or registered I/O outputs for the PLAI/O [**7**:**0**] pins**40**. The fourteen Fixed OR outputs(FO**1**) from OR gate**272**from PLA**1****261**are also available to the Control OR array**274**in the PLA**0****260**. The PLAI/O signals are useful for single chip applications requiring a few interface/handshake signals, and they are useful in multi-chip array processor applications where system control signals are transmitted to each device. - [0178]The RADP is configured by loading the configuration file into the device.
- [0179]There are three memories in each of the four MacroSequencers and a Dual PLA configuration memory. Within each of the MacroSequencers, there is an:
- [0180]LIW memory with the nine configuration bits,
- [0181]One-Port data memory, and
- [0182]Three-Port data memory.
- [0183]The nine programmable configuration bits within each MacroSequencer are configured as additional configuration data words in the LIW configuration data packet. The LIW memory, configuration bits, and Dual PLA memory may only be loaded during Active Configuration Mode. The One-Port and Three-Port data memories for each MacroSequencer may be loaded during Active Configuration and accessed during normal operating mode as directed by each MacroSequencer's LIW Register.
- [0184] The configuration is to be loaded into the RADP during Active Configuration Mode. The RADP may be in one of three operating modes depending on the logic states of PGM**0** and PGM**1**:
- [0186]The RADP is configured during the Active Configuration mode which allows each MacroSequencer's instruction memory and Data Memories and the Dual PLA to be programmed.
- [0187]Passive Configuration mode disables the device I/O pins from operating normally or being configured which allows other RADPs in the same circuit to be configured.
- [0188]Four configuration pins, named PGM
**0**, PGM**1**, PRDY, and PACK, are used to control the operating mode and configuration process. BUS**4**IN[**15**:**0**] pins are used to input the configuration data words. - [0189]The Multiplier-Accumulator (MAC)
**68** is described hereinabove with reference to FIG. 3 and FIG. 6. In general, this is a synchronous multiplier-accumulator circuit composed of two pipe stages.
- [0190] The first pipe stage is composed of a network of a multiplicity of small bit multipliers, a multiplicity of local carry propagate adders forming a multiplicity of trees, and a pipeline register circuit for holding the results of the roots of each adder tree. The leaves of these adder trees are the multiple-digit outputs of the small bit multiplier circuits. The second pipe stage is composed of a multiplicity of local carry propagate adders, all but one of which comprise a tree taking the synchronized results of the multiplicity of adder trees of the first pipe stage and forming a single sum of all adder tree results from the first pipe stage. An interface circuit operates on this resulting sum and on a possibly selected component of the accumulator register(s) contents of this pipe stage. The interface circuit may either zero the feedback from the accumulator register(s) **14** in accumulator **148** and pass the resultant sum from the above-mentioned adder tree in this pipe stage through, or it may align the resultant sum and the (possibly) selected accumulator result for processing by the last local carry propagate adder. The output of this adder is again submitted to a second interface circuit which can modify the adder's output by alignment, or by zeroing the result. The output of this interface circuit is then stored in one of the (possibly) multiple accumulator registers which comprise the pipeline register bank of this pipe stage. Extensions of this multiplier-accumulator embodying input pipe registers potentially containing portions of the small bit multiplier circuitry, and variations to the tree structure of the local carry propagate adder trees in both pipe stages, are claimed. Implementations of this basic circuit and extensions embodying standard integer, fixed point and floating point arithmetic, as well as scalar and matrix modular decomposition, p-adic fixed and p-adic floating point, and extended scientific precision standard and p-adic floating point arithmetic are included. Extensions embedding implementations of the multiplier-accumulator including one or more carry propagate adders, multiple data memory circuitry minimally comprising one-port RAM and three-port (2 read port and 1 write port) RAM with synchronization registers, shift and alignment circuitry, plus content addressable memory(ies) as well as bit-level pack and unpack circuitry are also included. Extensions embedding multiple instances of implementations of any of the above claimed circuitry within a single integrated circuit are also included. - [0191] For the purpose of describing the MAC
**68**, some definitions may be useful. They will be set forth as follows: - [0192]Wire
- [0193] A wire is a means of connecting a plurality of communicating devices to each other through interface circuits which will be identified as transmitting, receiving or bi-directional interfaces. A bi-directional interface will consist of a transmitter and a receiver interface. Each transmitter may be implemented so that it may be disabled from transmitting; this allows more than one transmitter to be interfaced to a wire. Each receiver may be implemented so that it may be disabled from receiving the state of the wire it is interfaced to. A wire will be assumed to distribute a signal from one or more transmitters to the receivers interfaced to that wire in some minimal unit of time. This signal can be called the state of the wire. A signal is a member of a finite set of symbols which form an alphabet. Often this alphabet consists of a 2-element set, although the use of multi-level alphabets with more than 2 symbols has practical applications. The most common wire is a thin strip of metal whose states are two disjoint ranges of voltages, often denoted as '0' and '1'. This alphabet has proven extremely useful throughout the development of digital systems from telegraphy to modern digital computers. Other metal strip systems involving more voltage ranges, currents and frequency modulation have also been employed. The key similarity is the finite, well-defined alphabet of wire states. An example of this is multiple-valued current-mode encoded wires in VLSI circuits, such as described in "High-Speed Area-Efficient Multiplier Design Using Multiple-Valued Current-Mode Circuits" by Kawahito, et al. Wires have also been built from optical transmission lines and fluidic transmission systems. The exact embodiment of the wires of a specific implementation can be composed of any of these mechanisms, but is not limited to the above. Note that in some high speed applications, the state of a wire in its minimal unit of time may be a function of location within the wire.
This phenomenon is commonly observed in fluidic, microwave and optical networks due to propagation delay effects. This may be a purposeful component of certain designs and is encompassed by this approach.
- [0194]Signal Bundle and Signal Bus
- [0195] A signal bundle and a signal bus are both composed of a plurality of wires. Each wire of a signal bundle is connected to a plurality of communicating devices through interface circuitry which is either a transmitter or a receiver. The direction of communication within a signal bundle is constant with time: the communicating devices which are transmitting are always transmitting, and those which are receiving are always receiving. Similarly, each wire of a signal bus is also connected to a plurality of communicating devices. The communicating devices interfaced to a signal bus are uniformly attached to each wire, so that whichever device is transmitting transmits on all wires and whichever device(s) are receiving receive on all wires. Further, each communicating device may have both transmitters and receivers, which may be active at different time intervals. This allows the flow of information to change in direction through a succession of intervals of time, i.e., the source and destination(s) for signals may change over a succession of time intervals.
- [0196]Pipeline Register and Stage
- [0197] The circuitry being claimed herein is based upon a sequential control structure known as a pipeline stage. A pipeline stage will be defined to consist of a pipeline register and possibly a combinatorial logic stage. The normal operational state of the pipeline stage will be the contents of the memory components within the pipeline register. Additional state information may also be available to meet testability requirements or additional systems requirements outside the intent of this patent. Typical implementations of pipeline stage circuits are found in synchronous Digital Logic Systems. Such systems use a small number of control signals known as clocks to synchronize the state transition events within various pipeline stages. One, two and four phase clocking schemes have been widely used in such approaches. See the references listed in the section entitled Typical Clocking Schemes for a discussion of these approaches applied to VLSI Design. These typical approaches face severe limitations when clocks must traverse large distances and/or large varying capacitive loads across different paths within the network to be controlled. These limitations are common in sub-micron CMOS VLSI fabrication technologies. The use of more resilient timing schemes has been discussed in the Alternative Clocking Scheme references. It will be assumed that a pipeline stage will contain a pipeline register component governed by control signals of either a traditional synchronous scheme or a scheme such as those mentioned in the Alternative Clocking Scheme references.
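The register-plus-combinational-logic structure just defined can be sketched as a minimal simulation. This is an illustrative toy only — the class and function names are ours, not the patent's — assuming a single shared clock:

```python
# Toy model of a two pipe-stage synchronous machine: each stage is a
# pipeline register preceded by a combinational logic function, and all
# registers update together on a clock event.

class PipelineStage:
    def __init__(self, logic, initial=0):
        self.logic = logic       # combinational logic feeding this register
        self.register = initial  # the stage's state between clock events

def clock_tick(stages, external_input):
    # Sample every stage's combinational output first, then latch all
    # registers at once, as a shared clock edge would.
    inputs = [external_input] + [s.register for s in stages[:-1]]
    latched = [s.logic(v) for s, v in zip(stages, inputs)]
    for s, v in zip(stages, latched):
        s.register = v

# Stage one doubles its input; stage two adds one to stage one's output.
stages = [PipelineStage(lambda x: 2 * x), PipelineStage(lambda x: x + 1)]
clock_tick(stages, 5)   # stage 1 latches 10; stage 2 latches 0 + 1 = 1
clock_tick(stages, 7)   # stage 1 latches 14; stage 2 latches 10 + 1 = 11
print(stages[1].register)   # 11
```

Note that a value entering the pipe takes one clock per stage to reach the output, which is the latency/throughput trade the two pipe-stage MAC described later exploits.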
- [0198]K-ary Trees, K-ary and Uniform Trees with Feedback
- [0199] For the purposes of this document, a directed graph G(V,E) is a pair of objects consisting of a finite, non-empty set of vertices V={v[**1**], . . . , v[n]} and a finite set of edges E={e[**1**], . . . , e[k]}, where each edge e is an ordered pair of vertices belonging to V. Denote the first component of e[j] by e[j][**1**] and the second component by e[j][**2**]. Vertices will also be known as nodes in what follows. A directed graph is connected if each vertex is a component in at least one edge. A directed graph G(V,E) possesses a path if there exists a finite sequence of edges (ek[**1**], ek[**2**], . . . , ek[h]), where h>=2, which is a subset of E such that the first component of ek[j+1] is also the second component of ek[j] for j=1, . . . , h−1. A directed graph G(V,E) possesses a cycle if there exists a path (ek[**1**], ek[**2**], . . . , ek[h]) where h>=2 such that the second component of ek[h] is also the first component of ek[**1**]. A connected directed graph which possesses no cycles is a tree. Note that typically this would be called a directed tree, but since directed graphs are the only kind of graphs considered here, the name has been simplified to tree. A k-ary tree is a tree where k is a positive integer and each vertex (node) of the tree is either the first component in k edges or is the first component in exactly one edge. A k-ary tree with feedback is a directed graph G(V,E) such that there exists an edge ew such that the directed graph G**1**(V,E**1**) is a k-ary tree, where E**1** contains all elements of E except ew. Note that G(V,E) contains one cycle. A uniform tree is a tree such that the vertices form sets called layers L[**1**], . . . , L[m], such that the height of the tree is m and the root of the tree belongs to L[**1**], all vertices feeding this root vertex belong to L[**2**], . . . , all vertices feeding vertices of L[k] belong to L[k+1], etc. It is required that the vertices in each layer all have the same number of edges which target each vertex in that layer. The notation (k**1**, k**2**, . . . , kn), where k**1**, . . . , kn are positive integers, will denote the k**1** edges feeding the vertex in L[**1**], k**2** edges feeding each vertex in L[**2**], . . . , kn edges feeding each vertex in L[n]. A uniform tree with feedback differs from a uniform tree in that one edge forms a circuit within the graph. - [0200] p-adic Number Systems
- [0201] A p-adic number system is based upon a given prime number p. A p-adic representation of an unsigned integer k is a polynomial k = a_{n}p^{n} + a_{n−1}p^{n−1} + . . . + a_{1}p + a_{0}, where a_{n}, a_{n−1}, . . . , a_{0} are integers between 0 and p−1. A fixed length word implementation of signed p-adic numbers is also represented as a polynomial, with the one difference being that the most significant p-digit, a_{n}, now ranges between −(p−1)/2 and (p−1)/2. - [0202] Two's Complement Number System
- [0203] Two's complement numbers form a signed **2**-adic number system implemented in a fixed word length or multiples of a fixed word length. This is the most commonly used integer number system in contemporary digital computers. - [0204] Redundant Number Systems and Local Carry Propagation Adders
- [0205] A redundant number system is a number system which has multiple distinct representations for the same number. A common redundant number system employs an entity consisting of two components. Each component possesses the same bit length. The number represented by such an entity is a function (often the difference) of the two components. A local carry propagation adder will be defined as any embodiment of an addition and/or subtraction function which performs its operation within a constant time for any operand length implementation. This is typically done by propagating the carry signals for any digit position only to a small fixed number of digits of higher precision. This phenomenon is called local carry propagation. A primary application of redundant number systems is to provide a notation for a local carry propagation form of addition and subtraction. Such number systems are widely used in the design of computer circuitry to perform multiplication. In the discussion that follows, Redundant Binary Adder Cells are typically used to build implementations such as those which follow. The local carry propagate adder circuits discussed herein may also be built with Carry-Save Adder schemes. There are other local or limited carry propagation adder circuits which might be used to implement the following circuitry. However, for the sake of brevity and clarity, only redundant adder schemes will be used in the descriptions that follow. Many of the references hereinbelow with respect to the High Speed Arithmetic Circuitry discuss or use redundant number systems.
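As a concrete sketch of local carry propagation, here is a carry-save adder — one of the Carry-Save Adder schemes mentioned above — which reduces three operands to a redundant sum/carry pair in a time independent of operand length; only the final conversion back to ordinary binary needs a full carry-propagate addition. Function names are illustrative:

```python
# Carry-save addition: three operands reduce to a redundant (sum, carry)
# pair in constant time, because each bit position is computed
# independently -- no carry chain ripples across the word.

def carry_save_add(a, b, c):
    sum_bits = a ^ b ^ c                      # per-bit sum, carries ignored
    carry_bits = (a & b) | (a & c) | (b & c)  # per-bit majority = carry out
    return sum_bits, carry_bits << 1          # carries weigh one bit higher

# The redundant pair represents a + b + c; one ordinary (carry-propagate)
# addition converts it back to standard binary.
s, c = carry_save_add(13, 7, 5)
print(s + c)   # 25
```

Chaining such reducers lets many partial products be summed with only one slow carry-propagate addition at the very end, which is why multiplier trees favor these cells.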
- [0206]Modular Decomposition Number Systems
- [0207] Modular Decomposition Number Systems are based upon the Chinese Remainder Theorem. This theorem was first discovered and documented for integers twenty centuries ago in China. The Chinese Remainder Theorem states that: Let m[**1**], m[**2**], . . . , m[n] be positive integers such that m[i] and m[j] are relatively prime for i not equal to j. If b[**1**], b[**2**], . . . , b[n] are any integers, then the system of congruences x=b[i] (mod m[i]) for i=1, . . . , n has an integral solution that is uniquely determined modulo m=m[**1**]*m[**2**]* . . . *m[n]. The Chinese Remainder Theorem has been extended in the last hundred and fifty years to a more general result which is true in any nontrivial algebraic ring. Note that square matrices form algebraic rings, and that both modular decomposition matrix and p-adic number systems can be built which have performance and/or accuracy advantages over typical fixed or floating point methods for a number of crucial operations, including matrix inversion. Modular Decomposition Number Systems have found extensive application in cryptographic systems. An important class of cryptographic systems is based upon performing multiplications upon very large numbers. These numbers often involve 1000 bits. Arithmetic operations have been decomposed into modular multiplications of far smaller numbers. These decompositions allow for efficient hardware implementations in integrated circuits. The modular multiplications of these smaller numbers could well be implemented with the multiplier architectures described hereinbelow. Such multiplier implementations would have the same class of advantages as in traditional numerical implementations. - [0208] Standard Floating Point Notations
- [0209] Standard Floating Point Notation is specified in a document published by ANSI. Floating point arithmetic operations usually require one of four rounding modes to be invoked to complete the generation of the result. The rounding modes are used whenever the exact result of the operation requires more precision in the mantissa than the format permits. The purpose of rounding modes is to provide an algorithmic way to limit the result to a value which can be supported by the format in use. The default mode used by compiled programs written in C, PASCAL, BASIC, FORTRAN and most other computer languages is round to nearest. Calculation of many range-limited algorithms, in particular the standard transcendental functions available in FORTRAN, C, PASCAL and BASIC, requires all of the other three modes: round to positive infinity, round to negative infinity and round to zero. Round to nearest examines the bits of the exact result below the least significant bit supported by the format. The other three rounding modes, round to zero, round to negative infinity and round to positive infinity, are well documented in the IEEE-ANSI specification for standard floating point arithmetic.
- [0210]Extended Precision Floating Point Notations
- [0211] Extended Precision Floating Point Notations are a proposed notational and semantic extension of Standard Floating Point intended to solve some of its inherent limitations. Extended Precision Floating Point requires the use of accumulator mantissa fields twice as long as the mantissa format itself. This provides for much more accurate multiply-accumulate operation sequences. It also minimally requires that two accumulators be available, one for the lower bound and one for the upper bound of each operation. The use of interval arithmetic with double length accumulation leads to significantly more reliable and verifiable scientific arithmetic processing. Long Precision Floating Point Notations involve the use of longer formats. For example, this could take the form of a mantissa which is 240 bits (including sign) and an exponent of 16 bits. Extended Long Precision Floating Point Notations would again possess accumulators supporting mantissas of twice the length of the operands. These extensions to standard floating point have great utility in calculations where great precision is required, such as interplanetary orbital calculations, solving non-linear differential equations, and performing multiplicative inverse calculations upon nearly singular matrices.
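The benefit of a double-length accumulator can be shown numerically. The sketch below is a toy of our own construction (8-bit fixed-point fractions, exact integers standing in for mantissa hardware), not the notation proposed here:

```python
# Contrast a short accumulator (results rounded back to operand precision
# after every step) with a double-length accumulator, using 8-bit
# fixed-point fractions and exact integer arithmetic as a stand-in.

FRAC = 8  # fractional bits in the operand format

def accumulate(pairs, double_length):
    acc = 0
    for a, b in pairs:
        product = a * b                       # exact, 2*FRAC fractional bits
        if double_length:
            acc += product                    # keep every low-order bit
        else:
            acc += (product >> FRAC) << FRAC  # a short accumulator drops them
    return acc / float(1 << (2 * FRAC))

# A thousand tiny products, each 9/65536 -- individually below the
# operand precision, collectively significant:
pairs = [(3, 3)] * 1000
print(accumulate(pairs, double_length=True))    # about 0.1373
print(accumulate(pairs, double_length=False))   # 0.0 -- every term was lost
```

The short accumulator loses the entire sum because each product rounds to zero before accumulation, which is exactly the failure mode double-length accumulation prevents in multiply-accumulate sequences.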
- [0212]p-adic Floating Point Systems
- [0213] P-adic arithmetic can be used as the mantissa component of a floating point number. Current floating point implementations use p=2. When p>2, rounding to the nearest neighbor has the effect of converging to the correct answer, rather than often diverging from it in the course of executing a sequence of operations. The major limitation of this scheme is that a smaller subset of the real numbers can be represented compared with the base **2** arithmetic notation. Note that the larger p is and the closer it is to a power of two, the more numbers can be represented in such a notation for a fixed word length. One approach to p-adic floating point arithmetic would be based upon specific values of p with standard word lengths. The next two tables assume the following format requirements:
- [0214] The mantissa field size must be a multiple of the number of bits it takes to store p.
- [0215]The mantissa field size must be at least as big as the standard floating point notation.
- [0216] The exponent field will be treated as a signed **2**'s complement integer.
- [0217] The mantissa sign bit is an explicit bit in the format.
- [0218] The following Table 6 summarizes results based upon these assumptions for Word Length **32**:

TABLE 6
| p | Exponent Field Size | Mantissa Field Size | Numerical Expression | Mantissa Digits (base p) | Dynamic Range (in base 10) |
| --- | --- | --- | --- | --- | --- |
| 3 | 7 | 24 | Mantissa*3^{Exponent} | 12 digits | 3^{63} to 3^{−64} (10^{30} to 10^{−31}) |
| 7 | 7 | 24 | Mantissa*7^{Exponent} | 8 digits | 7^{63} to 7^{−64} (10^{53} to 10^{−54}) |
| 15 | 7 | 24 | Mantissa*15^{Exponent} | 6 digits | 15^{63} to 15^{−64} (10^{74} to 10^{−75}) |
| 31 | 6 | 25 | Mantissa*31^{Exponent} | 5 digits | 31^{31} to 31^{−32} (10^{46} to 10^{−47}) |

- [0219] The following Table 7 summarizes results based upon these assumptions for Word Length **64**:

TABLE 7
| p | Exponent Field Size | Mantissa Field Size | Numerical Expression | Mantissa Digits (base p) | Dynamic Range (in base 10) |
| --- | --- | --- | --- | --- | --- |
| 3 | 9 | 54 | Mantissa*3^{Exponent} | 27 digits | 3^{255} to 3^{−256} (10^{121} to 10^{−122}) |
| 7 | 9 | 54 | Mantissa*7^{Exponent} | 18 digits | 7^{255} to 7^{−256} (10^{215} to 10^{−216}) |
| 15 | 7 | 56 | Mantissa*15^{Exponent} | 14 digits | 15^{63} to 15^{−64} (10^{74} to 10^{−75}) |
| 31 | 8 | 55 | Mantissa*31^{Exponent} | 11 digits | 31^{127} to 31^{−128} (10^{189} to 10^{−191}) |

- [0220] One may conclude from the above two tables that p-adic floating point formats based upon p=7 and p=31 offer advantages in dynamic range with at least as good mantissa accuracy for both single and double precision (32 and 64 bit) formats. It seems reasonable that p=7 has distinct advantages over p=31 in terms of inherent implementation complexity. The mantissa component of a floating point number system can also be composed of two components, known here as MSC and LSC, for Most Significant Component and Least Significant Component, respectively. The MSC can be constructed as a binary or 2-adic system, and the LSC can be constructed from a p-adic system where p>2. Such an arrangement would also converge to the correct answer in round-to-nearest-neighbor mode and would have the advantage of making full use of the bits comprising the MSC. If the LSC occupies the "guard bits" of the floating point arithmetic circuitry, then the visible effect upon the subset of floating point numbers which can be represented is the consistent convergence of resulting operations. This would aid standard Floating Point notation implementation. If p is near a power of two, then p-adic number based mantissa calculations would be efficiently stored in memory. Particularly for p=3 and 7, the modular arithmetic multiplier architecture could amount to specializing the redundant binary adder chain in each adder strip and slightly changing the Booth encoding algorithms discussed in the following implementation discussions. If the MSC represented all but 2, 3 or 5 bits of the mantissa, then p=3, 7 or 31 versions of p-adic arithmetic could respectively be used with minimal impact on how many numbers could be represented by such notations. Note that for this kind of application, p need not be restricted to being prime. As long as p is odd, the desired rounding convergence would result. It will be generally assumed throughout this document that p=3, 7, 15 and 31 are the most optimal choices for p-adic floating point extensions, which are "mostly" prime. Both of the number systems discussed in the previous paragraphs will be designated as p-adic floating point systems, with the second version involving the MSC and LSC components being designated the mixed p-adic floating point system when relevant in what follows. Both of these notations can be applied to Extended Precision Floating Point Arithmetic.
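The digit counts and dynamic ranges in Tables 6 and 7 follow mechanically from the stated format rules; the short script below reproduces the p=7 row of Table 6. The helper names are ours, and the bits-per-digit rule assumes p+1 is a power of two (as it is for p=3, 7, 15 and 31):

```python
import math

# Check the p = 7 row of Table 6 (word length 32: 7-bit exponent field,
# 24-bit mantissa field) from the stated format rules.
p = 7
exp_bits, mant_bits = 7, 24

bits_per_digit = (p + 1).bit_length() - 1    # 3 bits hold one base-7 digit
digits = mant_bits // bits_per_digit         # mantissa digits in base p

max_exp = 2 ** (exp_bits - 1) - 1            # signed 2's complement exponent
decimal_digits = max_exp * math.log10(p)     # upper range expressed in base 10

print(digits)                 # 8   -- Table 6 says "8 digits"
print(max_exp)                # 63  -- Table 6's range is 7^63 to 7^-64
print(round(decimal_digits))  # 53  -- Table 6 gives 10^53
```

Substituting the other (p, field-size) combinations reproduces the remaining rows of both tables the same way.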
- [0221] The basic operation of a multiplier **142** is to generate, from two numbers A and B, a resulting number C which represents something like standard integer multiplication. The accumulation of such results, combined with the multiplication, is the overall function of a multiplier/accumulator. It is noted that the accumulation may be additive, subtractive or capable of both.
- [0222] This description starts with a basic block diagram of a multiplier-accumulator and one basic extension of that multiplier/accumulator which provides significant cost and performance advantages over other approaches achieving similar results. These circuit blocks will be shown to be advantageous in both standard fixed and floating point applications, as well as long precision floating point, extended precision floating point, standard p-adic fixed and floating point, and modular decomposition multiplier applications.
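The basic behavior described above — generate C = A×B, then accumulate additively or subtractively — can be written down directly as a behavioral model (the class name is ours, for illustration only):

```python
# Behavioral model of a multiplier-accumulator: each step forms the
# product C = A * B and adds it to, or subtracts it from, a running
# accumulator.

class MultiplierAccumulator:
    def __init__(self):
        self.accumulator = 0

    def mac(self, a, b, subtract=False):
        product = a * b
        self.accumulator += -product if subtract else product
        return self.accumulator

m = MultiplierAccumulator()
m.mac(3, 4)                        # accumulator = 12
m.mac(2, 5)                        # accumulator = 22
print(m.mac(1, 2, subtract=True))  # 20
```

The hardware described below realizes this same contract, but splits the work across two pipe stages for throughput.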
- [0223] Optimal performance of any of these multiplier-accumulator circuits in a broad class of applications requires that the multiplier-accumulator circuit receive a continuous stream of data operands. The next layer of the claimed devices entails a multiplier-accumulator circuit plus at least one adder and a local data storage system composed of two or more memories combined in a network. The minimum circuitry for these memories consists of two memories, the one-port memory **44** and the 3-port memory **43**. The circuitry described to this point provides for numerous practical, efficient fixed point algorithmic engines for processing linear transformations, FFTs, DCTs, and digital filters.
- [0224] Extension to support various floating point schemes requires the ability to align one mantissa resulting from an arithmetic operation with a second mantissa. This alignment operation is best performed by a specialized circuit capable of efficient shifting, Shifter **74**. Support of the various floating point formats also requires efficient logical merging of exponent, sign and mantissa components. The shift circuitry mentioned in this paragraph (assuming it also supports rotate operations), combined with the logical merge circuitry, provides the bit-packing capabilities necessary for image compression applications, such as the Huffman coding schemes used in JPEG and MPEG. Once aligned, these two mantissas must be able to be added to or subtracted from each other. The long and extended precision formats basically require at least one adder to be capable of performing multiple word length "chained" addition-type operations, so the carry out results must be available efficiently to support this.
- [0225] Support for p-adic arithmetic systems requires that the multiplier-accumulator implementation support p-adic arithmetic. Similar requirements must be made of at least one adder in an implementation. The p-adic mantissa alignment circuitry also makes similar requirements upon the shifter. Modular arithmetic applications are typically very long integer systems. The primary requirement becomes being able to perform high speed modular arithmetic where the modular decomposition may change during the execution of an algorithm. The focus of such requirements is upon the multiplier-accumulator and adder circuitry.
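The mantissa-alignment step described above — shifting one mantissa so both operands share an exponent before adding — can be sketched with integer mantissas. This toy ignores rounding, normalization and sign handling, and all names are ours:

```python
# Mantissa alignment for floating point addition: shift the mantissa with
# the smaller exponent right until both operands share an exponent, then
# add the aligned mantissas.

def float_add(m1, e1, m2, e2):
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1   # ensure e1 is the larger exponent
    m2 >>= (e1 - e2)                      # align: low-order bits are discarded
    return m1 + m2, e1                    # sum carries the shared exponent

# 1.5 * 2^4 (= 24) plus 1.0 * 2^2 (= 4), with 8 fractional mantissa bits:
m, e = float_add(0b110000000, 4 - 8, 0b100000000, 2 - 8)
print(m * 2.0 ** e)   # 28.0
```

The bits discarded by the right shift are what the rounding modes and guard bits discussed earlier exist to handle; a hardware shifter performs this alignment in a single step rather than bit by bit.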
- [0226] Referring now to FIG. 17, there is illustrated a block diagram of the basic multiplier. A very fast way to sum 2^{P} numbers (where P is assumed to be a positive integer) is called a Binary Adder Tree. Adders D**1**-D**7** form a Binary Adder Tree summing 8=2^{3} numbers, C**1** to C**8**, in a small bit multiplier **300**. The numbers C**1** to C**8** are the partial products of operand A and portions of operand B input to the multiplier **300**, which are then sent to the adder tree D**1**-D**7**. These partial products are generated within the multiplier **300** by a network of small bit multipliers. The Adder D**8** and the logic in block G**1** align the resulting product from Adder D**7** and the selected contents of the block H**1**, representing the second stage of pipeline registers, for alignment. The accumulated results are held in memory circuitry in block H**1**. This provides for the storage of accumulated products, completing the basic functions required of a multiplier-accumulator. - [0227] The circuitry in the stage-one pipeline registers E
**1** acts as pipeline registers, making the basic circuit into a two pipe-stage machine. The time it takes for signals to propagate from entry into the multiplier **300** to the pipeline registers of E**1** is about the same as the propagation time from entry into Adder D**7** to the pipeline registers in H**1**. Thus the pipeline cycle time is about half of what it would be without the registers of E**1**.
- [0228] Transform circuitry J**1** is provided on the output of H**1** that performs several functions. It selects which collection of memory contents is to be sent outside the multiplier/accumulator, it transforms that signal bundle, if necessary, to a potentially different format, it selects which collection of memory contents is to be sent to Adder D**8** for accumulation, and it transforms that signal bundle to be sent to Adder D**8**, if necessary, to a potentially different format. The circuitry in J**1** permits the reduction of propagation delay in the second pipeline stage of this multiplier-accumulator, since the final logic circuitry required to generate the results can occur in J**1** after the pipeline registers of H**1**. It also permits the use of non-standard arithmetic notations, such as redundant binary notations, in the adder cells of D**1** to D**9**, since the notation used internally to the multiplier-accumulator can be converted for use with a standard 2's complement adder for final conversion. - [0229] An example of the above can be seen in implementing a redundant binary notation as follows:
TABLE 8
| Represented number | Standard Notation as used in Takagi's Research, St[1:0] | Non-standard Signed Magnitude Notation, Sn[1:0] |
| --- | --- | --- |
| 0 | 00 | 10 |
| 1 | 01 | 11 |
| −1 | 10 | 01 |

- [0230] This notation turns out to be optimal for certain CMOS logic implementations of an 8 by 16-bit multiplier based upon FIG. 17. Conversion by a standard two's complement adder required conversion from the Non-standard Signed Magnitude notation to the Standard Notation. This was done by implementing the logic transformation:
- [0231] St[**1**] = not Sn[**1**]
- [0232] St[**0**] = Sn[**0**]
- [0233] Optimal implementations of redundant p-adic notations to carry propagate p-adic notation conversion may also require this.
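Returning to the Binary Adder Tree of FIG. 17: the pairwise reduction performed by adders D1 through D7 can be sketched as follows. The partial-product generation here is plain integer arithmetic standing in for the small bit multiplier network, and all names are illustrative:

```python
# The Binary Adder Tree of FIG. 17, sketched in software: 2**P values are
# reduced pairwise, log2(n) levels deep (adders D1-D7 for n = 8).

def adder_tree_sum(values):
    assert len(values) & (len(values) - 1) == 0, "length must be 2**P"
    while len(values) > 1:   # one tree level per iteration
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

# Eight partial products C1..C8: operand A times the 1-bit slices of B.
A, B = 0x5A, 0xB4
partials = [((B >> i) & 1) * A << i for i in range(8)]
print(adder_tree_sum(partials) == A * B)   # True
```

The tree's depth grows only logarithmically in the number of partial products, which is why the hardware uses it in place of a sequential chain of additions.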
- [0234]With the above noted structure, the following operations can be realized:
- [0235] Signed and Unsigned 8 by 16 bit multiplication and multiply-accumulate; Signed and Unsigned 16 by 16 bit multiplication and multiply-accumulate
- [0236]Signed and Unsigned 24 by 16 multiplication and multiply-accumulate
- [0237]Signed and Unsigned 24 by 24 bit multiplication and multiply-accumulate
- [0238]Signed and Unsigned 24 by 32 bit multiplication and multiply-accumulate
- [0239]Signed and Unsigned 32 by 32 bit multiplication and multiply-accumulate
- [0240]Optimal polynomial calculation step
- [0241]Fixed point versions of the above:
- [0242]Standard Floating Point Single Precision Mantissa Multiplication
- [0243]Extended Precision Floating Point Single Precision Mantissa Multiplication
- [0244]P-Adic Floating Point Single Precision Mantissa Multiplication
- [0245]P-Adic Fixed Point Multiplication and Multiplication/accumulation.
- [0246]These operations can be used in various applications, some of which are as follows:
- [0247]1. 8 by 16 multiplication/accumulation is used to convert between 24 bit RGB and YUV color encoding. YUV is the standard broadcast NTSC color coding format. The standard consumer version of this requires 8 bit digital components for the RGB and/or YUV implementation.
- [0248]2. 16 bit arithmetic is a very common form of arithmetic used in embedded control computers.
- [0249]3. 16 by 24 bit multiplication/accumulation with greater than 48 bits of accumulation is capable of performing 1024 point complex FFTs on audio data streams for Compact Disk Applications, such as data compression algorithms. The reason for this is that the FFT coefficients include numbers on the order of PI/512, which has an approximate magnitude of 1/256. Thus a fixed point implementation requires accumulation of 16 by 24 bit multiplications to preserve the accuracy of the input data.
- [0250]4. 24 by 24 bit multiplication/accumulation is also commonly used in audio signal processing requirements. Note that by a similar argument to the last paragraph, 24 by 32 bit multiplications are necessary to preserve the accuracy of the data for a 1024 point complex FFT.
- [0251]5. 32 bit arithmetic is considered by many to be the next most commonly used form of integer arithmetic after 16 bit. It should be noted that this arithmetic is required for implementations of the long integer type in C and C++ computer language execution environments.
- [0252]6. Polynomial calculation step operations, particularly fixed point versions, are commonly used for low degree polynomial interpolation. These operations are a common mechanism for implementing standard transcendental functions, such as sin, cos, tan, log, etc.
- [0253]7. Standard Floating Point Arithmetic is the most widely used dynamic range arithmetic at this time.
- [0254]8. Extended Precision Floating Point arithmetic is applicable wherever Standard Floating Point is currently employed and resolves some serious problems with rounding errors or slow convergence results. The major drawback to extended precision has traditionally been that it runs more slowly than comparable Standard Floating Point Arithmetic. It is important to note that with the present approach, there is no performance penalty and very limited additional circuit complexity involved in supporting this significant increase in quality.
- [0255]9. P-Adic Floating Point and Fixed Point arithmetic are applicable where Standard Floating point or fixed point arithmetic are used, respectively. The advantage of these arithmetics is that they will tend to converge to the correct answer rather than randomly diverging in round to nearest mode and can take about the same amount of time and circuitry as standard arithmetic when implemented in this approach. It should be noted that in the same number of bits as Standard Floating Point, implementations of p=7 p-adic floating point have greater dynamic range and at least the same mantissa precision, making these numeric formats better than standard floating point.
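As an illustrative sketch of application 1 above (Python; the ITU-R BT.601 luma coefficients and the 2^16 fixed-point scaling are assumptions for illustration, not values from the patent), the Y component of YUV can be formed from three 8 by 16 bit multiply-accumulates:

```python
# Hypothetical fixed-point luma computation: each step multiplies an 8-bit
# color sample by a 16-bit scaled coefficient and accumulates the products.
COEFF_Y = [round(c * 65536) for c in (0.299, 0.587, 0.114)]  # assumed BT.601 weights

def rgb_to_luma(r, g, b):
    acc = 0
    for coeff, sample in zip(COEFF_Y, (r, g, b)):
        acc += coeff * sample        # one 8 by 16 bit multiply-accumulate
    return acc >> 16                 # discard the 16 fractional bits
```

The accumulator here sums three 24-bit products before a single truncation, which mirrors why a multiply-accumulate unit, rather than three separate multiplies with intermediate rounding, is attractive for color conversion.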
- [0256]Referring further to FIG. 17, the operation of the various components will be described in more detail. The multipliers in a small bit multiplier block
**300**perform small bit multiplications on A and B and transform signal bundles A and B into a collection of signal bundles C**1**to C**8**which are then sent to the Adder circuits D**1**-D**4**. Signal bundles A and B each represent numbers in some number system, which does not have to be the same for both of them. For instance, A might be in a redundant binary notation, whereas B might be a two's complement number. This would allow A to contain feedback from an accumulator in the second pipe stage. This would support optimal polynomial calculation step operations. Number systems which may be applicable include, but are not limited to, signed and unsigned 2's complement, p-adic, redundant binary arithmetic, or modular decomposition systems based on some variant of the Chinese Remainder Theorem. - [0257]The signal bundles C
**1**to C**8**are partial products based upon the value of a small subset of one of the operands (A or B) and all of the other operand. In the discussion that follows, it will be assumed that the A signal bundle is used in its entirety for generating each C signal bundle and a subset of the B signal bundle is used in generating each C signal bundle. The logic circuitry generating signal bundles C**1**-C**8**will vary, depending upon the number systems being used for A and B, the number systems being employed for the D**1**-D**4**adders, the size of the signal bundles A and B, plus the exact nature of the multiplication algorithm being implemented. In the discussion of the following embodiments, certain specific examples will be developed. These will by no means detail all practical implementations which could be based upon this patent, but rather demonstrate certain applications of high practical value that are most readily discussed. - [0258]Referring now to FIG. 18, there is illustrated an alternate embodiment of the MAC
**68**. In this embodiment, a 16 bit by 16 bit multiplier/accumulator based upon a**4-3**modified Booth coding scheme is illustrated, wherein only C**1**-**6**are needed for the basic operation. C**7**=Y would be available for adding an offset. This leads to implementations capable of supporting polynomial step calculations starting every cycle, assuming that the implementation possessed two accumulators in the second pipe stage. The polynomial step entails calculating X*Z+Y, where X and Y are input numbers and Z is the state of an accumulator register in H**1**. Implementation of 4-3 Modified Booth Coding schemes and other similar mechanisms will entail multipliers**300**containing the equivalent of an adder similar to those discussed hereinbelow. - [0259]Referring now to FIG. 19, there is illustrated an embodiment of the MAC
**68**which is optimized for polynomial calculations. In this case, all eight small bit multiplications (C**1**to C**8**) are used. In such situations, the J**1**component can provide Z for the calculation through a multiplexer**302**. G**1**performs alignment of the accumulator(s) being used for potential input to both multipliers**300**and Adder D**7**. Adder D**9**now requires controls to support alignment of the product with the target accumulator. This is done by transmitting through the local carry propagation chain in D**9**signals which act to mask carry propagation to successive digit cells and control transmission of top-most digit(s) carry propagation signals to the bottom most cell(s). This makes the Adder D**9**into a loop of adder cells which can be broken at one of several places. J**1**already had a requirement of aligning and potentially operating on the stored state of its accumulator(s) before feedback, this circuit implementation just adds slightly to that requirement. - [0260]Note that in the circuits represented by FIGS. 18 and 19, the presence of at least two accumulators is highly desirable, such that two polynomial calculations can then be performed in approximately the same time as one is performed. This is due to the 2 pipe stage latency in the multiplier.
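The polynomial step X*Z+Y described above is exactly one iteration of Horner's rule. As an illustrative sketch (Python; the example function and coefficients are assumptions for illustration, not values from the patent), a low degree polynomial approximation of sin can be evaluated by repeated X*Z+Y steps:

```python
import math

def poly_step(x, z, y):
    # One polynomial calculation step: X*Z + Y, where Z plays the role of
    # the accumulator register state in H1.
    return x * z + y

def sin_approx(x):
    # Degree-5 Taylor polynomial for sin, evaluated in u = x**2 by Horner steps.
    u = x * x
    acc = 1.0 / 120.0                     # load the highest-order coefficient
    acc = poly_step(u, acc, -1.0 / 6.0)   # acc = u/120 - 1/6
    acc = poly_step(u, acc, 1.0)          # acc = u*(u/120 - 1/6) + 1
    return x * acc
```

With two accumulators, two such evaluations could be interleaved, one step issuing per cycle, which is the motivation for the dual-accumulator arrangement described above.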
- [0261]Adders D
**1**to D**4**perform local carry propagation addition, typically based upon some redundant binary notation or implementation of carry-save adders. They serve to sum the partial products C**1**to C**8**into four numbers. The partial products C**1**to C**8**are digit-aligned through how they are connected to the adders in a fashion discussed in greater detail later. These adders and those subsequently discussed herein can be viewed as a column or chain of adder cells, except where explicitly mentioned. Such circuits will be referred to hereafter as adder chains. It is noted that all adders described herein can be implemented to support p-adic and modular arithmetic in a redundant form similar to the more typical**2**-adic or redundant binary form explicitly used hereafter. - [0262]Adders D
**5**and D**6**perform local carry propagation addition upon the results of Adders D**1**, D**2**and D**3**, D**4**respectively. - [0263]The circuitry in E
**1**acts as pipeline registers making the basic circuit into a two pipe-stage machine. The memory circuits of E**1**hold the results of adders D**5**and D**6**. It may also hold Y in FIG. 19, which may either be sent from a bus directly to E**1**, or may have been transformed by the multiplier block**300**to a different notation than its form upon input. In certain embodiments, the last layers of the logic in Adders D**5**and D**6**may be “moved” to be part of the output circuitry of the pipeline registers of E**1**. This would be done to balance the combinatorial propagation delay between the first and second pipeline stages. The time it takes for signals to propagate from entry into multiplier block**300**to the pipeline registers of E**1**is then about the same as the propagation time from output of the E**1**registers into Adder D**7**to the pipeline registers in H**1**. Thus the pipeline cycle time is about half of what it would be without the registers of E**1**. In certain applications, this register block E**1**may be read and written by external circuitry with additional mechanisms. This could include, but is not limited to, signal bus interfaces and scan path related circuitry. - [0264]Adders D
**7**and D**8**receive the contents of the memory circuits of E**1**, which contain the results of the Adders D**5**and D**6**from the previous clock cycle. D**7**and D**8**perform local carry propagation addition on these signal bundles. The result of Adder D**7**is the completed multiplication of A and B. This is typically expressed in some redundant binary notation. - [0265]G
**1**aligns the product which has been generated as the result of Adder D**7**to the accumulator H**1**'s selected contents. G**1**selects for each digit of the selected contents of H**1**either a digit of the result from Adder D**7**or a ‘0’ in the digit notation to be added in the Adder D**8**. G**1**can also support negating the product resulting from D**7**for use in accumulation with the contents of a register of H**1**. Assume that the contents of H**1**are organized as P digits, that the multiplication result of Adder D**7**is Q digits, and that the length of A is R digits and B is S digits. It is reasonable to assume that in most numeric systems, Q>=R+S and P>=Q. If P>=Q+S, then G**1**can be used to align the result of Adder D**7**to digits S to Q+Max(R,S), thus allowing for double (or multiple) precision multiplications to be performed within this unit efficiently. This provides a significant advantage, allowing multiple precision integer arithmetic operations to be performed with a circuit possessing far fewer logic components than would typically be required for the entire operation to be performed. Combined with the two pipe stage architecture, this makes double precision multiplications take place about as fast as a single pipestage version with somewhat more than half the number of logic gates. - [0266]In FIGS. 17 and 18, Adder D
**9**is composed of local carry propagation adder cells as in Adders D**1**to D**7**. It adds the aligned results of the Adder D**7**to the selected contents of H**1**to provide the signal bundle to H**1**for storage as the new contents of one memory component in H**1**. In FIG. 19, Adder D**9**is composed of a loop of local carry propagate adder cells which may be broken at one of several places to perform the alignment of the product with the accumulator. - [0267]H
**1**contains one or more clocked memory components (known hereafter as registers) which act as temporary storage accumulators for accumulating multiplications coming from Adder D**9**. Given the exact nature of multiplier block**300**, G**1**and the number of digits in each of H**1**'s registers, and the performance requirements for a particular implementation of this circuit, the optimal number of registers contained in H**1**will vary. In certain applications, this register block H**1**may be read and written by external circuitry using additional mechanisms. This could include, but is not limited to, signal bus interfaces and scan path related circuitry. - [0268]If H
**1**has more than one register, J**1**selects which of these registers will be output to external circuitry. J**1**also selects which of these registers is to be used for feedback to Adder D**9**in FIGS. 17 and 18 and Adder D**8**in FIG. 19. J**1**selects which portion of H**1**'s selected register(s) will be transmitted in cases where the register is longer than either the receiving bus or carry propagate adder it will enter. If the internal notation of an implementation of this circuit is not a standard notation, then the signal bundle to be transmitted to external circuitry is transformed by J**1**into a standard notation which can then be converted by a carry propagate adder into the relevant standard arithmetic notation. In embodiments where extended precision arithmetic is a requirement, J**1**can be used to “move the more significant bits down” and insert 0's in the vacated most significant bits. In embodiments requiring the accumulator contents be subtracted from the generated product from Adder D**7**, J**1**would also perform negating the selected register's contents for delivery to the input of Adder D**9**in FIGS. 17 and 18 and Adder D**8**in FIG. 19. - [0269]Embodiments of this architecture support high-speed multiple-precision operations, which are not possible in typical integer or fixed-point arithmetic circuits. Performing multiple-precision operations lowers throughput but preserves the exactness of the result; such operations are not otherwise possible at anything approaching the throughput and circuit size of designs based upon this block diagram. Embodiments of this architecture can support standard single-precision floating point mantissa multiplications with significantly less logic circuitry than previous approaches. Embodiments of this architecture appear to be the only known circuits to support small p-adic mantissa multiplications. The authors believe that this is the first disclosure of such a floating point representation. 
Embodiments of this architecture provide a primary mechanism for implementing Extended Precision Floating Point Arithmetic in a minimum of logic circuitry. Embodiments of this architecture also provide implementations of efficient high speed modular arithmetic calculators.
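The multiple-precision idea described above can be sketched numerically: a wider multiplication is composed from narrow hardware multiplies whose products are shifted, G**1**-style, by whole 8-bit alignment slots before accumulation. The decomposition below is an illustrative model (unsigned operands, assumed slot width of 8 bits), not the circuit itself:

```python
def mul_24x16_via_8x16(a24, b16):
    # Model of multiple-precision use of an 8x16 multiplier-accumulator:
    # three 8-bit slices of the 24-bit operand are multiplied separately and
    # the products accumulated at alignments 1, 2**8 and 2**16.
    acc = 0
    for slot in range(3):
        a_slice = (a24 >> (8 * slot)) & 0xFF     # 8-bit operand slice
        acc += (a_slice * b16) << (8 * slot)     # alignment-slot weighting
    return acc
```

Three passes through a small multiplier thus reproduce the full 24 by 16 bit product, which is the trade of throughput for circuit size discussed above.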
- [0270]In this discussion, A
**0**represents the least significant digit of the number A. The digits of A are represented in descending order of significance as AfAeAdAc, AbAaA**9**A**8**, A**7**A**6**A**5**A**4**, A**3**A**2**A**1**A**0**. B is represented as an 8 digit number represented by B**7**B**6**B**5**B**4**, B**3**B**2**B**1**B**0**. - [0271]Multipliers
**300**are controlled by a signal bundle. One control signal, to be referred to as U**1**.Asign, determines whether the A operand is treated as a signed or an unsigned integer. A second control signal, referred to as U**1**.Bsign, determines whether the B operand is treated as a signed or unsigned integer. Four distinct one digit by one digit multiplications are performed in the generation of the C**1**to C**8**digit components for the adders D**1**to D**4**. Let Ax represent a digit of A and By represent a digit of B. The operation AxuBy is an always unsigned multiplication of digit Ax with digit By. The operation AxsBy is an unsigned multiplication of Ax and By when U**1**.Asign indicates the A operand is unsigned. The operation AxsBy is a signed multiplication when U**1**.Asign indicates that the A operand is a signed integer. The operation BysAx is an unsigned multiplication of Ax and By when U**1**.Bsign indicates the B operand is unsigned. The operation BysAx is a signed multiplication when U**1**.Bsign indicates that the B operand is a signed integer. The operation AxSBy is an unsigned multiplication when both U**1**.Asign and U**1**.Bsign indicate unsigned integer operands. The operation AxSBy is related to the multiplication of the most significant bits of A and B. This operation is determined by controls which specify whether the individual operands are signed or unsigned. - [0272]The following Table
**9**illustrates C**1**-C**8**for digits 0 to 23:

TABLE 9

| C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | Digit k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | AfSB7 | 22 |
| 0 | 0 | 0 | 0 | 0 | 0 | AfsB6 | AeuB7 | 21 |
| 0 | 0 | 0 | 0 | 0 | AfsB5 | AeuB6 | AduB7 | 20 |
| 0 | 0 | 0 | 0 | AfsB4 | AeuB5 | AduB6 | AcuB7 | 19 |
| 0 | 0 | 0 | AfsB3 | AeuB4 | AduB5 | AcuB6 | AbuB7 | 18 |
| 0 | 0 | AfsB2 | AeuB3 | AduB4 | AcuB5 | AbuB6 | AauB7 | 17 |
| 0 | AfsB1 | AeuB2 | AduB3 | AcuB4 | AbuB5 | AauB6 | A9uB7 | 16 |
| AfsB0 | AeuB1 | AduB2 | AcuB3 | AbuB4 | AauB5 | A9uB6 | A8uB7 | 15 |
| AeuB0 | AduB1 | AcuB2 | AbuB3 | AauB4 | A9uB5 | A8uB6 | A7uB7 | 14 |
| AduB0 | AcuB1 | AbuB2 | AauB3 | A9uB4 | A8uB5 | A7uB6 | A6uB7 | 13 |
| AcuB0 | AbuB1 | AauB2 | A9uB3 | A8uB4 | A7uB5 | A6uB6 | A5uB7 | 12 |
| AbuB0 | AauB1 | A9uB2 | A8uB3 | A7uB4 | A6uB5 | A5uB6 | A4uB7 | 11 |
| AauB0 | A9uB1 | A8uB2 | A7uB3 | A6uB4 | A5uB5 | A4uB6 | A3uB7 | 10 |
| A9uB0 | A8uB1 | A7uB2 | A6uB3 | A5uB4 | A4uB5 | A3uB6 | A2uB7 | 9 |
| A8uB0 | A7uB1 | A6uB2 | A5uB3 | A4uB4 | A3uB5 | A2uB6 | A1uB7 | 8 |
| A7uB0 | A6uB1 | A5uB2 | A4uB3 | A3uB4 | A2uB5 | A1uB6 | A0uB7 | 7 |
| A6uB0 | A5uB1 | A4uB2 | A3uB3 | A2uB4 | A1uB5 | A0uB6 | 0 | 6 |
| A5uB0 | A4uB1 | A3uB2 | A2uB3 | A1uB4 | A0uB5 | 0 | 0 | 5 |
| A4uB0 | A3uB1 | A2uB2 | A1uB3 | A0uB4 | 0 | 0 | 0 | 4 |
| A3uB0 | A2uB1 | A1uB2 | A0uB3 | 0 | 0 | 0 | 0 | 3 |
| A2uB0 | A1uB1 | A0uB2 | 0 | 0 | 0 | 0 | 0 | 2 |
| A1uB0 | A0uB1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| A0uB0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

- [0273]Adders D
**1**to D**4**contain**18**digit cells for addition. Adders D**5**and D**6**contain 21 digit cells for addition. Adder D**7**contains 25 digit cells for addition. Each of these adders contains one more cell than the number of digits for which it has inputs. Implementations of D**8**, G**1**, H**1**and J**1**vary to achieve various arithmetic requirements. - [0274]Table
**10**illustrates a Capability Versus Size Comparison with N=16 based upon FIG. 17.

TABLE 10

| Operation | Acc Bits | Alignment Slots | Adder Cells | E1 + H1 Bits | Cyc Start to End | Cyc to start next | Typical Adder Cell Count | Typical Register Bit Count | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mul 8*16 | 40 | 2 | 172 | 120 | 2 | 1 | 128 | 80 | Allows 2^{16} accumulations (Note 1) |
| Mul 16*16 | | | | | 3 | 2 | 256 | 80 | Allows 2^{8} accumulations |
| Mul 8*16 | 48 | 3 | 180 | 128 | 2 | 1 | 128 | 96 | Allows 2^{24} accumulations (Note 2) |
| Mul 16*16 | | | | | 3 | 2 | 256 | 96 | Allows 2^{16} accumulations |
| Mul 16*24 | | | | | 4 | 3 | 384 | 96 | Allows 2^{8} accumulations |
| Mul 8*16 | 56 | 4 | 188 | 136 | 2 | 1 | 128 | 112 | Allows 2^{32} accumulations (Note 3) |
| Mul 16*16 | | | | | 3 | 2 | 256 | 112 | Allows 2^{24} accumulations |
| Mul 24*16 | | | | | 4 | 3 | 384 | 112 | Allows 2^{16} accumulations |
| Mul 32*16 | | | | | 5 | 4 | 576 | 112 | Allows 2^{8} accumulations |

The implementation will be discussed in the note regarding each circuit referenced in the “Remarks” column. “Adder Cells” refers to the number of adder cells needed to implement the adders involved in implementing the noted circuit based upon this patent's relevant block diagram. Unless otherwise noted, the adder cells will be two input cells, i.e. they perform the sum of two numbers. In cases where not only 2-input but also 3-input adder cells are involved, the notation used will be “a, b” where a represents the number of 2-input adder cells and b represents the number of 3-input adder cells. “E1 + H1 Bits” refers to the number of bits of memory storage required to build the circuit assuming a radix-2 redundant binary arithmetic notation. “Cyc Start to End” refers to the number of clock cycles from the start of the operation until all activity is completed. “Cyc to start next” refers to the number of clock cycles from the start of the operation until the next operation may be started. “Typical Adder Cell Count” represents a circuit directly implementing the operation with an accumulating final adder chain with no middle pipe register or alignment circuitry. Larger multiplications will require bigger adder trees. The columnar figure is based upon using a similar small bit multiplier cell as described in the appropriate discussion of multipliers 300. “Typical Register Bit Count” refers to the number of bits of memory that a typical design would require to hold a radix-2 redundant binary representation of the accumulator alone in a typical application. “Remarks” contains a statement regarding the minimum number of operations the circuit could perform before there was a possibility of overflow. The Remarks entry may also contain a reference to a “Note”, which describes the implementation details of the multiplier-accumulator circuit being examined. The row of the table the Note resides in describes the basic multiplication operation performed, the size of the accumulator and the number of alignment slots. The Note fills in details such as the weighting factor between the alignment slot entries and any other pertinent details, comparisons and any other specific comments.

Note 1: …equivalent device and would have the same throughput as the standard implementation.

Note 2: Alignment in this new circuit is the same as multiplying the product by 1, 2^{8} = 256 and 2^{16} = 256^{2}. It is functionally equivalent to a 16 by 24 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 384 adder cells and 96 bits of accumulator memory compared to 180 adder cells and 128 bits of memory. The new circuit would require about half the logic of the standard functional equivalent circuit. Its clock cycle time is approximately half that of the standard equivalent device: throughput of the standard implementation would be once every one of its clock cycles (or two of this new circuit), whereas a 16 by 24 bit multiply could be performed every three cycles in the new circuit. However, the new circuit would be twice as fast at multiplying 8 by 16 bits and would have identical performance for 16 by 16 bit multiplications.

Note 3: Alignment in this new circuit is the same as multiplying the product by 1, 2^{8} = 256, 2^{16} = 256^{2} and 2^{24} = 256^{3}. It is functionally equivalent to a 16 by 32 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 576 adder cells and 112 bits of accumulator memory compared to 188 adder cells and 136 bits of memory. The new circuit would require about a third the logic of the standard functional equivalent circuit. Its clock cycle time is approximately half that of the standard equivalent device. Throughput for a 16 by 32 bit multiplication with the standard implementation would be once every one of its clock cycles (or two of this new circuit), whereas a 16 by 32 bit multiply could be performed every four cycles in the new circuit. However, the new circuit would be twice as fast at multiplying 8 by 16 bits, would have identical performance for 16 by 16 bit multiplications, as well as being able to perform a 16 by 24 bit multiplication every 3 clock cycles.

- [0275]Table 11 illustrates a Capability Versus Size Comparison with N=24 based upon FIG.
**17**:

TABLE 11

| Operation | Acc Bits | Alignment Slots | Adder Cells | E1 + H1 Bits | Cyc Start to End | Cyc to start next | Typical Adder Cell Count | Typical Register Bit Count | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mul 8*24 | 48 | 3 | 236 | 160 | 3 | 1 | 192 | 80 | Allows 2^{16} accumulations (Note 1) |
| Mul 16*24 | | | | | 4 | 2 | 384 | 96 | Allows 2^{8} accumulations |
| Mul 24*24 | | | | | 6 | 3 | 576 | 96 | Allows 1 operation |
| Mul 8*24 | 64 | 4 | 244 | 184 | 3 | 1 | 192 | 128 | Allows 2^{32} accumulations (Note 2) |
| Mul 16*24 | | | | | 4 | 2 | 128 | 128 | Allows 2^{24} accumulations |
| Mul 24*24 | | | | | 5 | 3 | 576 | 128 | Allows 2^{16} accumulations |
| Mul 32*24 | | | | | 6 | 4 | 1098 | 128 | Allows 2^{8} accumulations |
| Mul 8*24 | 64 | 64 | 244 | 312 | 3 | 1 | 192 | 256 | Allows 2^{32} accumulations (Note 3) |
| Mul 16*24 | | | | | 4 | 2 | 128 | 256 | Allows 2^{24} accumulations |
| Mul 24*24 | | | | | 5 | 3 | 576 | 256 | Allows 2^{16} accumulations |
| Mul 32*24 | | | | | 6 | 4 | 1098 | 256 | Allows 2^{8} accumulations |
| Fmul 24*24 | | | | | 5 | 3 | 576 | 256 | Allows an indefinite number of accumulations |

Note 1: …standard functional equivalent circuit. Its clock cycle time is approximately half that of the standard equivalent device. Throughput of the standard implementation would be once every one of its clock cycles (or two of this new circuit), whereas a 24 by 24 bit multiply could be performed every three cycles in the new circuit. However, the new circuit would be twice as fast at multiplying 8 by 24 bits and would have identical performance for 16 by 24 bit multiplications.

Note 2: Alignment in this multiplier-accumulator is the same as multiplying the product by 1, 2^{8} = 256, 2^{16} = 256^{2} and 2^{24} = 256^{3}. It is functionally equivalent to a 24 by 32 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 1098 adder cells and 128 bits of accumulator memory compared to 244 adder cells and 184 bits of memory. The multiplier-accumulator would require about a quarter the logic of the standard functional equivalent circuit. Its clock cycle time would be less than half that of the standard equivalent device. Throughput for a 24 by 32 bit multiplication with the standard implementation would be once every one of its clock cycles (or two of this multiplier-accumulator), whereas a 32 by 24 bit multiply could be performed every four cycles in the multiplier-accumulator. However, the multiplier-accumulator would be twice as fast at multiplying 8 by 24 bits, would have identical performance for 16 by 24 bit multiplications, as well as being able to perform a 24 by 24 bit multiplication every 3 clock cycles.

Note 3: This is the first of the multiplier-accumulators capable of performing single precision mantissa multiplication. It is specified as supporting an Extended Scientific Notation, which forces the implementation of dual accumulators. Alignment of a product is to any bit boundary, so that weights of every power of two must be supported. Truncation of “dropped bits” in either the accumulator or partial product circuitry requires G1 to be able to mask digits. Integer performance regarding 8*24, 16*24, 24*24 and 32*24 arithmetic is the same as that described in the previous note. This circuit can also perform 40*24 arithmetic every 5 clock cycles, which has utility in FFTs with greater than 1K complex points.

- [0276]The Modified 3-2 bit Booth Multiplication Coding Scheme in multiplier block
**300** - [0277]The primary distinction between the 8 by N implementation and this implementation is in the multiplier block
**300**. In this implementation a version of Booth's Algorithm is used to minimize the number of add operations needed. The Booth Algorithm is based upon the arithmetic identity 2^{n−1}+2^{n−2}+ . . . +2+1=2^{n}−1. The effect of this identity is that multiplication of a number by a string of 1's can be performed by one shift operation, an addition and a subtraction.
- [0279]Table 12 of 3-2 bit Booth Multiplication Coding Scheme:
TABLE 12 B[i + 1] B[i] B[i − 1] Operation Remarks 0 0 0 +0 String of 0's 0 0 1 +A String of 1's terminating at B[i] 0 1 0 +A Solitary 1 at B[i] 0 1 1 +2A String of 1's terminating at B[i + 1] 1 0 0 −2A String of 1's starting at B[i + 1] 1 0 1 −A String of 1's terminating at B[i] plus String of 1's starting at B[i + 1] 1 1 0 −A String of 1's starting at B[i] 1 1 1 −0 String of 1's traversing all examined bits of B - [0280]Table 13 of C
**1**-C**8**for digits 0 to 30:TABLE 13 C1 C2 C3 C4 C5 C6 C7 C8 Digit k 0 0 0 0 0 0 0 ABe 30 0 0 0 0 0 0 0 AfsBe 29 0 0 0 0 0 0 ABc AeuBe 28 0 0 0 0 0 0 AfsBc AduBe 27 0 0 0 0 0 ABa AeuBc AcuBe 26 0 0 0 0 0 AfsBa AduBc AbuBe 25 0 0 0 0 AB8 AeuBa AcuBC AauBe 24 0 0 0 0 AfsB8 AduBa AbuBc A9uBe 23 0 0 0 AB6 AeuB8 AcuBa AauBc A8uBe 22 0 0 0 AfsB6 AduB8 AbuBa A9uBc A7uBe 21 0 0 AB4 AeuB6 AcuB8 AauBa ABuBc A6uBe 20 0 0 AfsB4 AduB6 AbuB8 A9uBa A7uBc A5uBe 19 0 AB2 AeuB4 AcuB6 AauB8 A8uBa A6uBc A4uBe 18 0 AfsB2 AduB4 AbuB6 A9uB8 A7uBa A5uBC A3uBe 17 AB0 AeuB2 AcuB4 AauB5 A8uB8 A6uBa A4uBc A2uBe 16 AfsB AduB2 AbuB4 A9uB6 A7uB8 A5uBa A3uBc A1uBe 15 0 AeuB AcuB2 AauB4 A8uB6 A6uB8 A4uBa A2uBc A0uBe 14 0 AduB AbuB2 A9uB4 A7uB6 A5uB8 A3uBa A1uBc 0 13 0 AcuB AauB2 A8uB4 A6uB6 A4uB8 A2uBa A0uBc 0 12 0 AbuB A9uB2 A7uB4 A5uB6 A3uB8 A1uBa 0 0 11 0 AauB A8uB2 A6uB4 A4uB6 A2uB8 A0uBa 0 0 10 0 A9uB A7uB2 A5uB4 A3uB6 A1uB8 0 0 0 9 0 A8uB A6uB2 A4uB4 A2uB6 A0uB8 0 0 0 8 0 A7uB A5uB2 A3uB4 A1uB6 0 0 0 0 7 0 A6uB A4uB2 A2uB4 A0uB6 0 0 0 0 6 0 A5uB A3uB2 A1uB4 0 0 0 0 0 5 0 A4uB A2uB2 A0uB4 0 0 0 0 0 4 0 A3uB A1uB2 0 0 0 0 0 0 3 0 A2uB A0uB2 0 0 0 0 0 0 2 0 A1uB 0 0 0 0 0 0 0 1 0 A0uB 0 0 0 0 0 0 0 0 0 - [0281]Implementation Parameters to achieve various requirements are summarized in the following table 14 that illustrates performance evaluation with (3,2) Booth Encoder Small Bit Multipliers Cells is shown in the following table of Capability versus size comparison (N=16) based upon FIG. 1. The typical adder cell count in this table is based upon using a 3-2 bit Modified Booth Coding scheme similar in Table 12.
TABLE 14 Cyc Cyc Typical Typical Align- E1 + Start to Adder Register Acc ment Adder H1 to start Cell Bit Operation Bits Slots Cells Bits End next Count Count Remarks Mul 16*16 56 2 205 148 2 1 128 112 Allows 2 ^{24 }accumulationsNote 1 Mul 16*32 3 2 256 128 Allows 2 ^{8 }accumulationsMul 16*16 64 3 213 156 2 1 128 128 Allows 2 ^{32 }accumulationsNote 2 Mul 16*32 3 2 256 128 Allows 2 ^{16 }accumulationsMul 32*32 6 4 512 128 Allows 1 operation Mul 16*16 72 4 221 164 3 1 128 144 Allows 2 ^{24 }accumulationsNote 3 Mul 16*32 4 2 256 144 Allows 2 ^{24 }accumulationsMul 32*32 6 4 512 144 Allows 2 ^{8 }accumulationsMul 32*48 8 6 768 144 Allows 2 ^{8 }accumulations#bits of memory. It would have about the same amount of logic circuitry. Its clock cycle time is approximately half that of the standard equivalent device and would have the same throughput as the standard implementation. Alignment in this multiplier-accumulator is the same as multiplying the product by 1, 2 ^{16 }= 65536 and (2^{16})^{2}. It is functionally equivalent to a 32#by 32 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 512 adder cells and 128 bits of accumulator memory compared to 213 adder cells and 156 bits of memory. It would be about half the logic circuitry. Its clock cycle time is approximately half that of the standard equivalent device. It would take twice as long to perform a 32 by 32 bit multiply. The #multiplier-accumulator would be twice as fast the standard circuit for 16 by 16 multiplication. It would perform a 16 by 32 bit multiplication at the same rate as the standard multiplier-accumulator would perform. Alignment is the same as multiplying the product by 1, 2 ^{16 }= 65536, (2^{16})^{2 }and (2^{16})^{3}. It is functionally equivalent to a 32 by 48 bit multiplier with follow-on local carry propagate adder for accumulation. 
The#equivalent circuit would require 768 adder cells and 144 bits of accumulator memory compared to 221 adder cells and 164 bits of memory. It would be about a third the logic circuitry. Its clock cycle time is approximately half that of the standard equivalent device. It would take three times as long to perform a 32 by 48 bit multiply. The present multiplier-accumulator would be twice as fast the standard circuit for 16 by 16 multiplication. It would perform a 16 by 32 bit #multiplication at the same rate as the standard circuit would perform. It would perform a 32 by 32 bit multiplication in about twice as long as the standard circuit. - [0282]The following table 15 illustrates a Capability versus size comparison (N=24) based upon FIG. 17. The typical adder cell count in this table is based upon using a 3-2 bit Modified Booth Coding scheme similar in Table 12.
TABLE 15

Operation | Acc Bits | Align. Slots | Typical Adder Cells | Typical Register Bits | Cyc Start to End | Cyc E1+H1 to start next | Adder Cell Count | Bit Count | Remarks
Mul 16*24 | 64 | 2 | 283 | 196 | 3 | 1 | 256 | 128 | Allows 2^16 accumulations (Note 1)
Mul 32*24 | | | | | 4 | 2 | 448 | 128 | Allows 2^8 accumulations
Mul 16*24 | 88 | 4 | 303 | 212 | 3 | 1 | 280 | 176 | Allows 2^48 accumulations (Note 2)
Mul 32*24 | | | | | 4 | 2 | 472 | 176 | Allows 2^32 accumulations
Mul 16*48 | | | | | 5 | 2 | 465 | 176 | Allows 2^24 accumulations
Mul 32*48 | | | | | 6 | 4 | 768 | 176 | Allows 2^8 accumulations

Note 1: …about the same amount of logic circuitry. Its clock cycle time is approximately half that of the standard equivalent device, and it would have the same throughput as the standard implementation. Alignment is the same as multiplying the product by 1, 2^24, 2^16 and 2^40 = 2^(16+24).

Note 2: It is functionally equivalent to a 32 by 48 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 768 adder cells and 176 bits of accumulator memory, compared to 303 adder cells and 212 bits of memory. It would have about half as much logic circuitry. Its clock cycle time would be somewhat less than half that of the standard implementation. A 32 by 48 bit multiply would take 4 new circuit clock cycles, versus 1 standard clock cycle (about 2 new circuit clock cycles). However, in one clock cycle a 16 by 24 bit multiplication could occur, and in two clock cycles either a 16 by 48 or a 32 by 24 bit multiplication could occur. This circuit is half the size and, for a number of important DSP arithmetic operations, either as fast as or significantly faster than a standard circuit with the same capability.
- [0283]Use of a Modified 4-3 bit Booth Multiplication Coding Scheme
- [0284]This embodiment primarily differs from its predecessors in the multiplier block
**300**. As before, a version of Booth's Algorithm is used to minimize the number of add operations needed. The following algorithm is based upon examining four successive bits, determining whether to perform an add or subtract, then advancing three bit positions and repeating the process. This is what has led to the term 4-3 bit coding scheme. There is a 1-bit overlap: the least significant bit of one examination is the most significant bit of its successor examination. - [0285]Table 16 illustrates a Modified 4-3 Bit Booth Multiplication Coding Scheme:
TABLE 16

B[i+2] | B[i+1] | B[i] | B[i−1] | Operation | Remark
0 | 0 | 0 | 0 | +0 | String of 0's
0 | 0 | 0 | 1 | +A | String of 1's terminating at B[i]
0 | 0 | 1 | 0 | +A | Solitary 1 at B[i]
0 | 0 | 1 | 1 | +2A | String of 1's terminating at B[i+1]
0 | 1 | 0 | 0 | +2A | Solitary 1 at B[i+1]
0 | 1 | 0 | 1 | +3A | String of 1's terminating at B[i] plus solitary 1 at B[i+1]
0 | 1 | 1 | 0 | +3A | Short string (=3) at B[i+1] and B[i]
0 | 1 | 1 | 1 | +4A | String of 1's terminating at B[i+2]
1 | 0 | 0 | 0 | −4A | String of 1's starting at B[i+2]
1 | 0 | 0 | 1 | −3A | String of 1's starting at B[i+2] plus string of 1's terminating at B[i]
1 | 0 | 1 | 0 | −3A | String of 1's starting at B[i+2] plus solitary 1 at B[i]
1 | 0 | 1 | 1 | −2A | String of 1's starting at B[i+2] plus string of 1's terminating at B[i+1]
1 | 1 | 0 | 0 | −2A | String of 1's starting at B[i+1]
1 | 1 | 0 | 1 | −A | String of 1's starting at B[i+1] plus string of 1's terminating at B[i]
1 | 1 | 1 | 0 | −A | String of 1's starting at B[i]
1 | 1 | 1 | 1 | −0 | String of 1's traversing all bits
- [0286]Optimal Double Precision Floating Point Mantissa Multiplication
- [0287]An implementation based upon 24- by 32-bit multiplication would be capable of performing a standard 56-bit precision floating point mantissa multiplication every two cycles. The 56-bit length comes from the inherent requirement of IEEE Standard Double Precision numbers, which require a 54-bit mantissa, plus two guard bits for intermediate rounding accuracy. Such an implementation would require only two alignment slots. An implementation of 16- by 24-bit multiplication would be capable of supporting the 56-bit floating point mantissa calculation, but with the liability of taking more clock cycles to complete, and more alignment slots would be required. Such an implementation would, however, require much less logic circuitry than the application-dedicated multiplier. An implementation of a p-adic mantissa for either p=3 or p=7 would be readily optimized in such implementations.
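The 4-3 bit examination described above can be sketched in software. The sketch below is an illustration only, not part of the claimed circuit; the function names are mine, and the digit rule −4·B[i+2] + 2·B[i+1] + B[i] + B[i−1] is inferred from the rows of Table 16, with one extra top window so unsigned values reconstruct exactly:

```python
def booth43_digits(b: int, n_bits: int) -> list[int]:
    """Recode unsigned b into signed radix-8 Booth digits.

    Each step examines the four bits B[i+2] B[i+1] B[i] B[i-1]
    (with B[-1] = 0), emits a digit in -4..+4, then advances three
    bit positions; successive windows overlap by one bit.
    """
    b2 = b << 1                        # append the implicit B[-1] = 0
    digits = []
    for i in range(0, n_bits + 1, 3):  # extra window covers the top bit
        w = (b2 >> i) & 0b1111         # B[i+2] B[i+1] B[i] B[i-1]
        d = -4 * ((w >> 3) & 1) + 2 * ((w >> 2) & 1) \
            + ((w >> 1) & 1) + (w & 1)
        digits.append(d)               # digit k carries weight 8**k
    return digits

def booth43_multiply(a: int, b: int, n_bits: int) -> int:
    """Multiply by adding/subtracting a small multiple of a per digit."""
    return sum(d * a << (3 * k)
               for k, d in enumerate(booth43_digits(b, n_bits)))
```

For a 16-bit multiplier this yields six signed digits, i.e. six add/subtract terms in place of sixteen, which is the reduction in add operations the coding scheme is after.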
- [0288]Table 17 illustrates C**1**-C**8**for digits 0 to 47:

TABLE 17

C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | Digit k
0 | 0 | 0 | 0 | 0 | 0 | 0 | AB15 | 47
0 | 0 | 0 | 0 | 0 | 0 | 0 | A19uB15 | 46
0 | 0 | 0 | 0 | 0 | 0 | 0 | A18uB15 | 45
0 | 0 | 0 | 0 | 0 | 0 | AB12 | A17uB15 | 44
0 | 0 | 0 | 0 | 0 | 0 | A19uB12 | A16uB15 | 43
0 | 0 | 0 | 0 | 0 | 0 | A18uB12 | A15uB15 | 42
0 | 0 | 0 | 0 | 0 | ABf | A17uB12 | A14uB15 | 41
0 | 0 | 0 | 0 | 0 | A19uBf | A16uB12 | A13uB15 | 40
0 | 0 | 0 | 0 | 0 | A18uBf | A15uB12 | A12uB15 | 39
0 | 0 | 0 | 0 | ABc | A17uBf | A14uB12 | A11uB15 | 38
0 | 0 | 0 | 0 | A19uBc | A16uBf | A13uB12 | A10uB15 | 37
0 | 0 | 0 | 0 | A18uBc | A15uBf | A12uB12 | AfsB15 | 36
0 | 0 | 0 | AB9 | A17uBc | A14uBf | A11uB12 | AeuB15 | 35
0 | 0 | 0 | A19uB9 | A16uBc | A13uBf | A10uB12 | AduB15 | 34
0 | 0 | 0 | A18uB9 | A15uBc | A12uBf | AfsB12 | AcuB15 | 33
0 | 0 | AB6 | A17uB9 | A14uBc | A11uBf | AeuB12 | AbuB15 | 32
0 | 0 | A19uB6 | A16uB9 | A13uBc | A10uBf | AduB12 | AauB15 | 31
0 | 0 | A18uB6 | A15uB9 | A12uBc | AfsBf | AcuB12 | A9uB15 | 30
0 | AB3 | A17uB6 | A14uB9 | A11uBc | AeuBf | AbuB12 | A8uB15 | 29
0 | A19uB3 | A16uB6 | A13uB9 | A10uBc | AduBf | AauB12 | A7uB15 | 28
0 | A18uB3 | A15uB6 | A12uB9 | AfsBc | AcuBf | A9uB12 | A6uB15 | 27
AB0 | A17uB3 | A14uB6 | A11uB9 | AeuBc | AbuBf | A8uB12 | A5uB15 | 26
A19sB0 | A16uB3 | A13uB6 | A10uB9 | AduBc | AauBf | A7uB12 | A4uB15 | 25
A18sB0 | A15uB3 | A12uB6 | AfsB9 | AcuBc | A9uBf | A6uB12 | A3uB15 | 24
A17sB0 | A14uB3 | A11uB6 | AeuB9 | AbuBc | A8uBf | A5uB12 | A2uB15 | 23
A16sB0 | A13uB3 | A10uB6 | AduB9 | AauBc | A7uBf | A4uB12 | A1uB15 | 22
A15sB0 | A12uB3 | AfsB6 | AcuB9 | A9uBc | A6uBf | A3uB12 | A0uB15 | 21
A14sB0 | A11uB3 | AeuB6 | AbuB9 | A8uBc | A5uBf | A2uB12 | 0 | 20
A13sB0 | A10uB3 | AduB6 | AauB9 | A7uBc | A4uBf | A1uB12 | 0 | 19
A12sB0 | AfsB3 | AcuB6 | A9uB9 | A6uBc | A3uBf | A0uB12 | 0 | 18
A11sB0 | AeuB3 | AbuB6 | A8uB9 | A5uBc | A2uBf | 0 | 0 | 17
A10sB0 | AduB3 | AauB6 | A7uB9 | A4uBc | A1uBf | 0 | 0 | 16
AfsB0 | AcuB3 | A9uB5 | A6uB9 | A3uBc | A0uBf | 0 | 0 | 15
AeuB0 | AbuB3 | A8uB5 | A5uB9 | A2uBc | 0 | 0 | 0 | 14
AduB0 | AauB3 | A7uB5 | A4uB9 | A1uBc | 0 | 0 | 0 | 13
AcuB0 | A9uB3 | A6uB6 | A3uB9 | A0uBc | 0 | 0 | 0 | 12
AbuB0 | A8uB3 | A5uB5 | A2uB9 | 0 | 0 | 0 | 0 | 11
AauB0 | A7uB3 | A4uB6 | A1uB9 | 0 | 0 | 0 | 0 | 10
A9uB0 | A6uB3 | A3uB5 | A0uB9 | 0 | 0 | 0 | 0 | 9
A8uB0 | A5uB3 | A2uB5 | 0 | 0 | 0 | 0 | 0 | 8
A7uB0 | A4uB3 | A1uB6 | 0 | 0 | 0 | 0 | 0 | 7
A6uB0 | A3uB3 | A0uB6 | 0 | 0 | 0 | 0 | 0 | 6
A5uB0 | A2uB3 | 0 | 0 | 0 | 0 | 0 | 0 | 5
A4uB0 | A1uB3 | 0 | 0 | 0 | 0 | 0 | 0 | 4
A3uB0 | A0uB3 | 0 | 0 | 0 | 0 | 0 | 0 | 3
A2uB0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2
A1uB0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
A0uB0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
- [0289]The following table 18 illustrates
the performance evaluation of capability versus size comparison (N=24) based upon FIG. 17. The typical adder cell counts in this table are based upon a multiplier design using a 4-3 bit Modified Booth Encoding Algorithm.
TABLE 18

Operation | Acc Bits | Align. Slots | Typical Adder Cells | Typical Register Bits | Cyc Start to End | Cyc E1+H1 to start next | Adder Cell Count | Bit Count | Remarks
Mul 24*24 | 56 | 1 | 272 | 244 | 3 | 1 | 272 | 112 | Allows 2^8 accumulations (Note 1)
Mul 24*24 | 80 | 2 | 296 | 292 | 3 | 1 | 296 | 160 | Allows 2^32 accumulations (Note 2)
Mul 24*48 | | | | | 4 | 2 | 512 | 160 | Allows 2^8 accumulations
Mul 24*24 | 64 | 64 | 280 | 260 | 3 | 1 | 576 | 256 | Allows 2^16 accumulations (Note 3)
FMul 24*24 | | | | | 3 | 3 | 12 | 256 | Allows indefinite number of accumulations; allows 2^8 accumulations
Mul 24*24 | 48 | 16 | 264 | 260 | 3 | 1 | 576 | 192 | Allows 1 operation; p-adic (Note 4)
P-adic FMul 24*24 | | | | | 3 | 1 | | 192 | Allows indefinite number of accumulations

Note 2: Alignment in this new circuit is the same as multiplying the product by 1 and 2^24 = (2^8)^3. It is functionally equivalent to a 24 by 48 bit multiplier with follow-on local carry propagate adder for accumulation. The equivalent circuit would require 512 adder cells and 160 bits of accumulator memory, compared to 296 adder cells and 292 bits of memory. It would have about 60% as much logic circuitry. Its clock cycle time is approximately half that of the standard equivalent device. The new circuit would have the same throughput as the standard implementation for 24 by 48 bit multiplications, but would perform 24 by 24 bit multiplications twice as fast.

Note 3: This circuit is capable of performing single precision mantissa multiplication. It is specified as supporting an Extended Scientific Notation, which forces the implementation of dual accumulators. Alignment of a product is to any bit boundary, so that weights of every power of two must be supported. Truncation of "dropped bits" in either the accumulator or partial product circuitry requires G1 to be able to mask digits. Integer performance is the same as that described in the previous note. Note that the present multiplier-accumulator can support a new single precision floating point multiplication-accumulation every clock cycle.
Note 4: This is the first circuit discussed in this patent capable of p-adic floating point support, p = 7. Since alignment is at p-digit boundaries, a 48 bit (16 p-digit) accumulator requires only 16 alignment slots, making its implementation of the alignment mechanism much less demanding. The adder cells used here are p-adic adder cells, which are assumed to work on each of the three bits of a redundant p-digit notation. These adder cells may well be different for each bit within a digit, but will be counted as having the same overall complexity in this discussion. The primary advantage of this circuit is that its performance is twice that of the standard implementation. - [0290]Table 19 illustrates coefficient generation for multipliers
**300**:

TABLE 19

C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | Digit k
0 | 0 | 0 | 0 | 0 | ABf | Z1f | 0 | 31
0 | 0 | 0 | 0 | 0 | AfsBf | Z1e | 0 | 30
0 | 0 | 0 | 0 | 0 | AeuBf | Z1d | 0 | 29
0 | 0 | 0 | 0 | ABc | AduBf | Z1c | 0 | 28
0 | 0 | 0 | 0 | AfsBc | AcuBf | Z1b | 0 | 27
0 | 0 | 0 | 0 | AeuBc | AbuBf | Z1a | 0 | 26
0 | 0 | 0 | AB9 | AduBc | AauBf | Z19 | 0 | 25
0 | 0 | 0 | AfsB9 | AcuBc | A9uBf | Z18 | 0 | 24
0 | 0 | 0 | AeuB9 | AbuBc | A8uBf | Z17 | 0 | 23
0 | 0 | AB6 | AduB9 | AauBc | A7uBf | Z16 | 0 | 22
0 | 0 | AfsB6 | AcuB9 | A9uBc | A6uBf | Z15 | 0 | 21
0 | 0 | AeuB6 | AbuB9 | A8uBc | A5uBf | Z14 | 0 | 20
0 | AB3 | AduB6 | AauB9 | A7uBc | A4uBf | Z13 | 0 | 19
0 | AfsB3 | AcuB6 | A9uB9 | A6uBc | A3uBf | Z12 | 0 | 18
0 | AeuB3 | AbuB6 | A8uB9 | A5uBa | A2uBf | Z11 | 0 | 17
AB0 | AduB3 | AauB6 | A7uB9 | A4uBc | A1uBf | Z10 | 0 | 16
AfsB | AcuB3 | A9uB6 | A6uB9 | A3uBc | A0uBf | Zf | 0 | 15
0 | AeuB | AbuB3 | A2uB6 | A5uB9 | A2uBc | 0 | Ze | 0 | 14
0 | AduB | AauB3 | A7uB6 | A4uB9 | A1uBc | 0 | Zd | 0 | 13
0 | AcuB | A9uB3 | A6uB6 | A3uB9 | A0uBc | 0 | Zc | 0 | 12
0 | AbuB | ABuB3 | A5uB6 | A2uB9 | 0 | 0 | Zb | 0 | 11
0 | AauB | A7uB3 | A4uB6 | A1uB9 | 0 | 0 | Za | 0 | 10
0 | A9uB | A6uB3 | A3uB6 | A0uB9 | 0 | 0 | Z9 | 0 | 9
0 | A8uB | A5uB3 | A2uB6 | 0 | 0 | 0 | Z8 | 0 | 8
0 | A7uB | A4uB3 | A1uB6 | 0 | 0 | 0 | Z7 | 0 | 7
0 | A6uB | A3uB3 | A0uB6 | 0 | 0 | 0 | Z6 | 0 | 6
0 | A5uB | A2uB3 | 0 | 0 | 0 | 0 | Z5 | 0 | 5
0 | A4uB | A1uB3 | 0 | 0 | 0 | 0 | Z4 | 0 | 4
0 | A3uB | A0uB3 | 0 | 0 | 0 | 0 | Z3 | 0 | 3
0 | A2uB | 0 | 0 | 0 | 0 | 0 | Z2 | 0 | 2
0 | A1uB | 0 | 0 | 0 | 0 | 0 | Z1 | 0 | 1
0 | A0uB | 0 | 0 | 0 | 0 | 0 | Z0 | 0 | 0
- [0291]Examination of Table 19 shows that Adder D
**4**is not needed to achieve a fixed point polynomial step implementation. Adders D**4**and D**6**would be unnecessary for implementations which did not support single cycle polynomial step operations. - [0292]Fixed point arithmetic polynomial step calculations would not need Adder D
**4**. The assumption would be that the computation's precision would match or be less than N bits, so that the Z input in this case would be 16 bits, which would be aligned to the most significant bits of the product. Integer arithmetic polynomial step calculations would also not need Adder D**4**. The major difference would be that the offset in such a situation would be assumed to be of the same precision as the result of the multiplication, so that Z would be assumed to be 32 bits. - [0293]Table 20 illustrates Performance versus Size for N=16.
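The polynomial step operation X*Y+Z discussed throughout this section is the kernel of Horner-style polynomial evaluation. The brief sketch below is illustrative only; the software framing and function names are mine, not part of the claimed circuit:

```python
def poly_step(x, y_acc, z):
    """One X*Y+Z polynomial step: X and Z are inputs, Y is the
    running value held in an accumulator register (H1 in the text)."""
    return x * y_acc + z

def eval_poly(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[n]*x**n by repeated X*Y+Z
    steps, feeding each coefficient in as the aligned offset Z."""
    acc = 0
    for c in reversed(coeffs):
        acc = poly_step(x, acc, c)
    return acc
```

With one step completing per clock cycle, an n-th degree polynomial costs n+1 cycles, which is why single-cycle support for this operation is worth the extra adder.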
TABLE 20

Operation | Acc Bits | Align. Slots | Typical Adder Cells | Typical Register Bits | Cyc Start to End | Cyc E1+H1 to start next | Adder Cell Count | Bit Count | Remarks
Mul 16*16 | 40 | 1 | 148 | 132 | 2 | 1 | 196 | 80 | Allows 2^8 accumulations (Note 1)
Mul 16*16 | 56 | 2 | 196 | 148 | 2 | 1 | 196 | 112 | Allows 2^24 accumulations (Note 2)
Mul 16*32 | | | | | 3 | 2 | 300 | 112 | Allows 2^8 accumulations
Mul 16*16 | 64 | 3 | 220 | 156 | 2 | 1 | 220 | 128 | Allows 2^32 accumulations (Note 3)
Mul 16*32 | | | | | 3 | 2 | 316 | 128 | Allows 2^16 accumulations
Mul 32*32 | | | | | 5 | 4 | 600 | 144 | Allows 2^8 accumulations
Mul 16*16 | 88 | 4 | 270 | 196 | 2 | 1 | 270 | 176 | Allows 2^56 accumulations (Note 4)
Mul 16*32 | | | | | 3 | 2 | 374 | 176 | Allows 2^56 accumulations
Mul 32*32 | | | | | 5 | 4 | 648 | 176 | Allows 2^16 accumulations
Mul 32*48 | | | | | 8 | 6 | 900 | 176 | Allows 2^8 accumulations

Note 2: …multiplies as the standard circuit and the same performance for 16 by 32 bit multiplies.

Note 3: This new circuit has alignment weights of 1, 2^16 and 2^32 = (2^16)^2. It possesses about half of the logic of a standard implementation. It performs one 32 by 32 bit multiply in 4 of its clock cycles, compared to the standard implementation taking about 2 new circuit clock cycles. However, it performs a 16 by 16 bit multiply every clock cycle, which is twice as fast as the standard implementation.

Note 4: This new circuit has alignment weights of 1, 2^16, 2^32 = (2^16)^2 and 2^48 = (2^16)^3. It possesses about a third of the logic of a standard implementation. It performs one 32 by 48 bit multiply in 6 of its clock cycles, compared to the standard implementation taking about 2 new circuit clock cycles. However, it performs a 16 by 16 bit multiply every clock cycle, which is twice as fast as the standard implementation.
- [0294]The basic difference between the MAC of FIG. 20 and the above MAC of FIG. 19 is that there are an additional four numbers generated in multiplier block
**300**, C**9**-C**12**. This requires six Adders D**1**-D**6**on the output. The Adders D**5**and D**6**extend by 50% the precision of multiplication which can be accomplished, beyond that which can be achieved by a comparable circuit of the basic Multiplier described above. A 32 bit by N bit single cycle multiplication could be achieved without the necessity of D**6**. In such an implementation, D**6**would provide the capability to implement a polynomial step operation of the form X*Y+Z, where X and Z are input numbers and Y is the state of an accumulator register contained in H**1**. This would be achieved in a manner similar to that discussed regarding FIGS. 18 and 19. Such an implementation would require at least two accumulator registers in H**1**for optimal performance. If N>=32, then with the appropriate alignment slots in G**1**and G**2**, these operations could support multiple precision integer calculations. Such operations are used in commercial symbolic computation packages, including Mathematica, Macsyma, and MAPLE V, among others. - [0295]An implementation of 28 by N bit multiplication would be sufficient with the use of D
**6**to provide offset additions supporting two-cycle X*Y+Z polynomial step calculations for Standard Double Precision Floating Point mantissas. - [0296]Implementations of either of the last two designs containing four accumulation registers in H
**1**would be capable of supporting Extended Precision Floating Point Mantissa Multiplication/Accumulations acting upon two complex numbers, which is a requirement for FORTRAN runtime environments. Any of the above-discussed implementations could be built with the capability of supporting p-adic floating point operations of either Standard or Extended Precision Floating Point, given the above discussion. Adder chains D**7**, D**8**and D**9**are provided on the output of Adders D**1**-D**6**in a tree configuration. These Adder chains D**7**, D**8**and D**9**take as inputs the results of D**1**, D**2**, D**3**, D**4**, D**5**and D**6**, respectively. The primary Multiplier does not contain D**9**. It is specific to the embodiment discussed herein. - [0297]As in the initial Multiplier/Accumulator architecture of FIG. 17, the inputs of Adder D
**10**are the results of Adders D**7**and D**8**, which have been registered in Block E**1**. In this embodiment of the Basic Multiplier/Accumulator Architecture, Adder D**11**takes as inputs the aligned results of Adder D**9**and the aligned results of selected memory contents of H**1**. The alignment mentioned in the last sentence is performed by G**1**. The aligned results of Adder D**9**have traversed E**1**, where they are synchronously captured. - [0298]Adder D
**12**receives the aligned results of Adder D**10**and the results of Adder D**11**. G**2**aligns the results of Adder D**10**prior to input of this aligned signal bundle to Adder D**12**. The results of its operation are sent to Block H**1**, where one or more of the register(s) internal to Block H**1**may store the result. The primary performance improvement comes from being able to handle more bits in parallel in one clock cycle. The secondary performance improvement comes from being able to start a second operation while the first operation has traversed only about half the adder tree, as in the primary circuitry discussion. The third performance improvement comes from the ability to perform multiple-precision calculations without significantly affecting the size of the circuit. An implementation based upon this diagram with a trimmed adder tree can support 32 by N bit multiply-accumulates. - [0299]Table 21 illustrates a Trimmed adder tree supporting 32 by 32 Multiplication (Performance versus Size for N=32).
TABLE 21

Operation | Acc Bits | Align. Slots | Typical Adder Cells | Typical Register Bits | Cyc Start to End | Cyc E1+H1 to start next | Adder Cell Count | Bit Count | Remarks
Mul 32*32 | 80 | 1 | 508 | 400 | 2 | 1 | 508 | 160 | Allows 2^16 accumulations (Note 1)
Mul 32*32 | 112 | 2 | 572 | 464 | 2 | 1 | 572 | 224 | Allows 2^56 accumulations (Note 2)
Mul 32*64 | | | | | 3 | 2 | 860 | 224 | Allows 2^16 accumulations
Mul 32*32 | 144 | 3 | 636 | 528 | 2 | 1 | 636 | 288 | Allows 2^80 accumulations (Note 3)
Mul 32*64 | | | | | 3 | 2 | 924 | 288 | Allows 2^48 accumulations
Mul 64*64 | | | | | 5 | 4 | 1664 | 288 | Allows 2^16 accumulations
Mul 32*32 | 160 | 4 | 672 | 560 | 2 | 1 | 668 | 320 | Allows 2^56 accumulations (Note 4)
Mul 32*64 | | | | | 3 | 2 | 960 | 320 | Allows 2^40 accumulations
Mul 64*64 | | | | | 5 | 4 | 1694 | 320 | Allows 2^16 accumulations
Mul 64*96 | | | | | 8 | 6 | 2176 | 320 | Allows 2^8 accumulations

Note 2: …32 bit multiplies as the standard circuit and the same performance for 32 by 64 bit multiplies.

Note 3: This circuit has alignment weights of 1, 2^32 and 2^64 = (2^32)^2. It possesses less than half of the logic of a standard implementation. It performs one 64 by 64 bit multiply in 4 of its clock cycles, compared to the standard implementation taking about two circuit clock cycles. However, it performs a 32 by 32 bit multiply every clock cycle, which is twice as fast as the standard implementation.

Note 4: This circuit has alignment weights of 1, 2^32, 2^64 = (2^32)^2 and 2^96 = (2^32)^3. It possesses about a third of the logic of a standard implementation. It performs one 64 by 96 bit multiply in 6 of its clock cycles, compared to the standard implementation taking about two circuit clock cycles. However, it performs a 32 by 32 bit multiply every clock cycle, which is twice as fast as the standard implementation.
- [0300]Referring now to FIGS. 21 and 22, there are illustrated two additional embodiments of the MAC
**68**. Both of these embodiments (FIGS. 21 and 22) support single-cycle double precision floating point mantissa multiplications. They may be implemented to support Extended Scientific Floating Point Notations, as well as p-adic floating point and extended floating point, with the same level of performance. FIG. 21 represents a basic multiplier-accumulator. FIG. 22 represents an extended circuit which supports optimal polynomial calculation steps. - [0301]Use of 4-3 Modified Booth Multiplication Encoding will be assumed for multiplier block
**300**. The support of small p-adic floating point mantissa or Modular Arithmetic multiplication would require a modification of this scheme. The 18 partial products which are generated support the 54 bit mantissa fields of both standard double precision and p=7 p-adic double precision. FIGS. 21 and 22 thus represent circuitry capable of 54 by 54 bit standard mantissa multiplication as well as 18 by 18 digit (54 bit) p-adic mantissa calculation. - [0302]Starting from the left, the first layer of adders (D
**1**-D**6**) on the output of multiplier block**300**and the third layer of adders (D**10**) on the output of pipeline registers E**1**are three-number adder chains, each summing three numbers. The second and fourth layers of adders (D**7**-**9**and D**11**) are two-number adders. The alignment circuitry G**1**and the use of an adder ring in D**11**provide the alignment capabilities needed for the specific floating point notations required. Circuitry in H**1**may be implemented to support Extended Scientific Notations as well as to optimize performance for Complex Number processing for FORTRAN. The functions performed by J**1**are not substantially different from the above-noted embodiments. - [0303]With further reference to FIG. 21, the major item to note is that there are an additional six numbers generated in multiplier block
**300**beyond what FIG. 20 could generate. The Adders D**1**to D**6**each add three numbers represented by the signal bundles C**1**to C**18**. Standard, as well as p=7 p-adic, floating point double precision mantissa multiplications require 54 bit (18 p=7 p-adic digit) mantissas. This multiplier block**300**would be able to perform all the small bit multiplications in parallel. The results of these small bit multiplications would then be sent to Adders D**1**to D**6**to create larger partial products. - [0304]The adder chains D
**7**, D**8**and D**9**take as inputs the results of D**1**, D**2**, D**3**, D**4**, D**5**and D**6**, respectively. The primary Multiplier claimed does not contain D**9**. It is specific to the embodiment being discussed here. Adder D**10**also sums three numbers. The inputs of Adder D**10**are the results of Adders D**7**, D**8**and D**9**, which have been registered in Block E**1**. Adder D**11**receives the aligned results of Adder D**10**and the selected contents of H**1**. G**1**aligns the results of Adder D**10**. The results of its operation are sent to Block H**1**, where one or more of the register(s) internal to Block H**1**may store the result. - [0305]Register Block H
**1**and Interface J**1**have an additional function in FIG. 22: The ability to be loaded with an additional number “Y” which may then be used to compute B*Z+Y. The primary performance improvement comes from being able to handle a double precision mantissa multiplication every clock cycle with the necessary accumulators to support Extended Scientific Precision Floating Point for either standard or p=7 p-adic arithmetic. The secondary performance improvement comes from being able to start a second operation while the first operation has traversed only about half the adder tree as in the primary circuitry discussion. - [0306]The following Table 22 describes the performance analysis of Multipliers with two accumulators capable of supporting Extended Scientific Double Precision Standard and p=7 p-adic multiplication-accumulation on every cycle.
TABLE 22

Operation | Acc Bits | Align. Slots | Typical Adder Cells | Typical Register Bits | Cyc Start to End | Cyc E1+H1 to start next | Adder Cell Count | Bit Count | Remarks
FMul 54*54 | 256 | 128 | 475(3) 338(2) | 932 | 2 | 1 | 475(3) 338(2) | 512 | Note 1
PFMul 18*18 | 216 | 36 | 475(3) 298(2) | 812 | 2 | 1 | 475(3) 298(2) | 432 | Note 2

Note 1: …holding 128 bits in the redundant binary notation. Note that complex number support would double the number of accumulators required. Such support is needed for FORTRAN and optimal for Digital Signal Processing applications based upon complex number arithmetic. The number of adder cells is decomposed into two types: those which sum 3 numbers (3) and those which sum two numbers (2). These adder cell numbers treat the cells in the respective adders D1-D11 as all being of the same type, which is a simplification. The primary difference between this and a standard approach is performance: the new circuit performs twice as many multiplies in the same amount of time. Use of FIG. 22-based circuitry enhances performance by permitting polynomial calculation step optimization. This represents a speedup of a factor of two in these calculations.

Note 2: This design implements p = 7 p-adic double precision mantissa multiplication-accumulation targeting extended scientific notation accumulators. Double length accumulators require 36 digit storage, which poses a problem: if the approach taken in new circuit 1 (simplicity of the alignment slots) were used here, it would require 64 alignment slots, resulting in 64 digit accumulators. This is a lot more accuracy than would seem warranted. The assumptions made here are that there are 36 alignment slots, with 36 redundant p-adic digits required of each of the two accumulators. Each redundant p-adic digit will be assumed to require 6 bits of memory. Note that complex number support would double the number of accumulators required. Such support is needed for FORTRAN and optimal for Digital Signal Processing applications based upon complex number arithmetic. It will be further assumed that each digit of the redundant p-adic adder cell is roughly equivalent to 3 of the redundant binary adder cells. The number of adder cells is decomposed into two types: those which sum 3 numbers (3) and those which sum two numbers (2). These adder cell numbers treat the cells in the respective adders D1-D11 as all being of the same type, which is a simplification. Since there is no known equivalent circuit, comparison is more hypothetical: this circuit's throughput is twice that of a circuit lacking the E1 pipe registers. Use of FIG. 22-based circuitry enhances performance by permitting polynomial calculation step optimization. This represents a speedup of a factor of two in these calculations.
- [0307]Referring now to FIG. 23, there is illustrated a block diagram of a Multiplier Block with minimal support Circuitry. A Multiplier-Accumulator Block
**310**contains a multiplier-accumulator comprised of a multiplier**312**and an accumulator**314**, as described hereinabove, plus an input register block**316**labeled 'L**2**:MulInReg'. Signal bundles whose sources are external to this circuit are selected by a plurality of multiplexors**318**labeled 'K**2**:IN Mux(s)'. The selected signal bundles are synchronously stored in the memory of a block**320**labeled 'L**1**:IN Reg(s)'. The inputs to the Multiplier-Accumulator block**310**are selected by a multiplexor circuit**322**labeled 'K**3**:Mult Mux(s)'. A plurality of signal bundles from block**322**would then be sent to the multiplier**312**and to a block**324**labeled 'K**4**:Add Mux(s)'. - [0308]The K
**4**block selects between synchronized externally sourced signal bundles coming from the block**320**and the contents (or partial contents) of selected memory contents of the accumulator block**314**labeled ‘L**4**:MulAcReg(s)’. These signal bundles are then synchronously stored in the memory contents of a block**326**, labeled ‘L**5**:AddInReg’ in an Adder block**328**. The Adder is considered to optionally possess a mid-pipe register block labeled ‘L**6**:AddMidReg(s)’. The synchronous results of the Adder are stored in the memory component(s) of the block labeled ‘L**7**:AddAccReg(s)’. In the simplest implementations, the following components would not be populated: K**2**, L**1**, K**3**, K**4**and L**6**. - [0309]Referring now to FIG. 24, there is illustrated a block diagram of a Multiplier-Accumulator with Basic Core of Adder, one-port and three-port Memories. This circuit incorporates all the functional blocks of FIG. 23
plus a one-port memory**330**, similar to one-port memory**44**, a three-port memory**332**, similar to three-port memory**43**, output register multiplexors**334**and output registers**336**. The Multiplier's input selector**322**now selects between signal bundles from the input register block**320**(L**1**(ir**0**-irn)), the memory read port synchronized signal bundles (mr**0**-mr**2**) and the synchronized results of the output register block**336**(L**7**(or**0**-orn)). The Adder's accumulators L**7**now serve as the output registers, with the block**334**'K**5**:OutRegMux(s)' selecting between adder result signal bundle(s), input register signal bundles (ir**0**-irn) and memory read port signal bundles (mr**0**-mr**2**). The Adder**328**may also possess status signals, such as equality, zero-detect, overflow, carry out, etc., which may also be registered. They are left silent in this diagram to simplify the discussion. - [0310]The one-port memory block
**330**contains a write data multiplexor block**340**, labeled ‘K**6**:**1**-port Write Mux’ which selects between the input register signal bundles ‘ir**0**-irn’ and the output register signal bundles ‘or**0**-orn’. The selected signal bundle is sent to the write port of the memory. The read port sends its signal bundle to a read register**342**, labeled ‘L**8**:**1**-port Read Reg’, which synchronizes these signals for use elsewhere. This memory can only perform one access in a clock cycle, either reading or writing. The contents of block**342**are assumed to change only when the memory circuit performs a read. Note that address generation and read/write control signal bundles are left silent in this diagram to simplify the discussion. - [0311]The three-port memory block
**332**contains a write data multiplexor block**344**, labeled 'K**7**:**3**-port Write Mux' which selects between the input register signal bundles 'ir**0**-irn' and the output register signal bundles 'or**0**-orn'. The selected signal bundle is sent to the write port of the memory. The read ports send their signal bundles to a read register block**346**, labeled 'L**9**:**3**-port Rd**1**Reg' and a read register block**348**, labeled 'L**10**:**3**-port Rd**2**Reg', which synchronize these signals for use elsewhere. This memory**332**can perform two read accesses and one write access in a clock cycle. The contents of**346**and**348**are assumed to change only when the memory circuit performs a read. Note that address generation and read/write control signal bundles are left silent in this diagram to simplify the discussion. - [0312]Referring now to FIG. 25, there is illustrated a block diagram of a Multiplier-Accumulator with Multiplicity of Adders, and one-port and three-port Memories. This circuit incorporates all the functional blocks of FIG. 24 plus one or more additional Adder blocks, each containing a multiplicity of Accumulators
**350**, labeled 'L**7**:AddAcc(s)'. Adder input multiplexing may be independently controlled for each Adder Block. Multiple signal bundles (ac[**1**,**0**] to ac[p,k]) are assumed to be generated from these Adder Blocks. Any adder status signals, such as overflow, equality, zero detect, etc., are assumed to be synchronously stored and made available to the appropriate control signal generation circuitry. These status signal bundles, synchronizing circuitry and control signal generation circuitry are left silent in this figure for reasons of simplicity. The Multiplier Multiplexor**322**is extended to select any of the generated adder signal bundles (ac[**1**,**0**] to ac[p,k]). The Output Register Multiplexor**334**is extended to select any of the generated adder signal bundles (ac[**1**,**0**] to ac[p,k]). - [0313]The basic Advantages of Circuit represented by FIGS.
**23**to**25**will now be described. Circuitry based upon FIG. 23 incorporates the advantages of the implemented multiplier-accumulators based upon the embodiments described hereinabove. The major systems limitation regarding multipliers is efficiently providing operands to the circuitry. The embodiment of FIG. 23 does not address this problem. Circuitry based upon FIGS. 24 and 25 solves the systems limitation of FIG. 23 for a broad class of useful algorithms which act upon a stream of data. A stream of data is characterized by a sequential transmission of data values. It possesses significant advantages in the ability to perform linear transformations (which include Fast Fourier Transforms (FFTs), Finite Impulse Response (FIR) filters and Discrete Cosine Transforms (DCTs)), convolutions and polynomial calculations upon data streams. Linear transformations are characterized as a square M by M matrix a times a vector v generating a resultant vector. In the general case, each result to be output requires M multiplications of a[i,j] with v[j] for j=0, . . . , M. The result may then be sent to one or more output registers, where it may be written into either of the memories. If the matrix is symmetric about the center, so that a[i,j]=a[i,n−j] or a[i,j]=−a[i,n−j], then an optimal sequencing involves adding or subtracting v[j] and v[n−j], followed by multiplying the result by a[i,j], which is accumulated in the multiplier's accumulator(s). This dataflow reduces the execution time by a factor of two. Note that, assuming the matrix a can be stored in the one-port memory and the vector v can be stored in the three-port memory, the multiplier is essentially always busy. This system data flow does not stall the multiplier. In fact, when the matrix is symmetric around the center, the throughput is twice as fast. - [0314]Convolutions are characterized by acting upon a stream of data. Let x[−n], . . . , x[
**0**], . . . , x[n] denote a stream centered at x[**0**]. A convolution is the sum c[**0**]*x[−n]+ . . . +c[n]*x[**0**]+ . . . +c[2n]*x[n]. After each convolution result is calculated, the data element x[−n] is removed, the remaining data is “moved down” one element and a new data element becomes x[n]. Assuming that the x vector can be stored in the three-port memory, acquiring a new data element does not slow down the multiplier; the multiplier is essentially busy all the time. Polynomial calculations are optimized architecturally inside the multiplier-accumulator. Assuming sufficient memory to hold the coefficients, these multiplier-accumulator calculations can be performed on every clock cycle. Large-word integer multiplications are also efficiently implemented with the circuitry of FIGS. 7 and 8. Let A[**0**] to A[n] be one large integer and B[**0**] to B[m] be a second large integer. The product is a number C[**0**] to C[n+m] which can be represented as: - [0315]C[
**0**]=Least Significant Word of A[**0**]*B[**0**], - [0316]C[
**1**]=A[**1**]*B[**0**]+A[**0**]*B[**1**]+Second word of C[**0**] - [0317]. . .
- [0318]C[n+m]=A[n]*B[m]+Most Significant Word of C[n+m−1]
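As a software illustration of the word-level recurrence above (the patent performs these steps in the multiplier-accumulator hardware), the product words C[**0**] to C[n+m] can be sketched as follows; the 16-bit word size and little-endian word lists are assumptions made for the example:

```python
# Hedged software sketch of the large-word product described above:
# each partial product A[i]*B[j] contributes its least significant
# word to C[i+j] and carries its most significant word upward.
# The 16-bit word size is an illustrative assumption.

BASE = 1 << 16  # one "word" = 16 bits


def multiword_mul(A, B):
    """A and B are little-endian lists of words; returns the n+m product words."""
    n, m = len(A), len(B)
    C = [0] * (n + m)
    for i in range(n):
        carry = 0
        for j in range(m):
            t = A[i] * B[j] + C[i + j] + carry
            C[i + j] = t % BASE      # least significant word stays in place
            carry = t // BASE        # most significant word carries onward
        C[i + m] = carry             # top word produced by this row
    return C
```

Each inner-loop step corresponds to one multiply-accumulate, in the spirit of C[**1**]=A[**1**]*B[**0**]+A[**0**]*B[**1**]+second word of C[**0**] above, which is why very few multiplier cycles are lost.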
- [0319]These calculations can also be performed with very few lost cycles for the multiplier. Circuitry built around FIG. 25 has the advantage that bounds checking (which requires at least two adders) can be done in a single cycle, and that symmetric matrix linear transformations can be adding or subtracting vector elements with one adder while another adder is converting the multiplier's accumulator(s).
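The center-symmetric sequencing discussed above (pre-add or pre-subtract v[j] and v[n−j], then one multiply into the accumulator) can be sketched behaviorally; the 0-based indexing and function names are illustrative assumptions, not the patent's hardware dataflow:

```python
# Behavioral sketch (assumption: 0-based indices, symmetry a[i][j] == a[i][M-1-j])
# of the factor-of-two optimization for center-symmetric linear transforms.

def transform(a, v):
    """Plain M x M linear transform: M multiplications per output element."""
    M = len(v)
    return [sum(a[i][j] * v[j] for j in range(M)) for i in range(M)]


def transform_symmetric(a, v):
    """Symmetric variant: pre-add v[j] + v[M-1-j], then one multiply,
    accumulating in a single accumulator -- about M/2 multiplies per output."""
    M = len(v)
    out = []
    for i in range(M):
        acc = 0
        for j in range(M // 2):
            acc += a[i][j] * (v[j] + v[M - 1 - j])  # one multiply per pair
        if M % 2:                                    # unpaired middle element
            acc += a[i][M // 2] * v[M // 2]
        out.append(acc)
    return out
```

The pairing halves the number of multiplications per output element, which is the factor-of-two throughput gain noted above when the multiplier is the busy resource.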
- [0320]Although the preferred embodiment has been described in detail, it should be understood that various changes, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Referenced by

Citing Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|
US7062635 | Aug 20, 2002 | Jun 13, 2006 | Texas Instruments Incorporated | Processor system and method providing data to selected sub-units in a processor functional unit |
US7352205 * | Feb 7, 2005 | Apr 1, 2008 | Siemens Aktiengesellschaft | Reconfigurable switching device parallel calculation of any particular algorithms |
US7437401 * | Feb 20, 2004 | Oct 14, 2008 | Altera Corporation | Multiplier-accumulator block mode splitting |
US7466763 * | Nov 8, 2005 | Dec 16, 2008 | Fujitsu Limited | Address generating device and method of generating address |
US7543013 * | Aug 18, 2006 | Jun 2, 2009 | Qualcomm Incorporated | Multi-stage floating-point accumulator |
US7673257 * | Mar 5, 2007 | Mar 2, 2010 | Calypto Design Systems, Inc. | System, method and computer program product for word-level operator-to-cell mapping |
US7685405 * | Aug 24, 2006 | Mar 23, 2010 | Marvell International Ltd. | Programmable architecture for digital communication systems that support vector processing and the associated methodology |
US7694111 * | Feb 19, 2008 | Apr 6, 2010 | University Of Washington | Processor employing loadable configuration parameters to reduce or eliminate setup and pipeline delays in a pipeline system |
US7814137 | Jan 9, 2007 | Oct 12, 2010 | Altera Corporation | Combined interpolation and decimation filter for programmable logic device |
US7822799 | Jun 26, 2006 | Oct 26, 2010 | Altera Corporation | Adder-rounder circuitry for specialized processing block in programmable logic device |
US7836117 | Jul 18, 2006 | Nov 16, 2010 | Altera Corporation | Specialized processing block for programmable logic device |
US7865541 | Jan 22, 2007 | Jan 4, 2011 | Altera Corporation | Configuring floating point operations in a programmable logic device |
US7904905 * | Nov 14, 2003 | Mar 8, 2011 | Stmicroelectronics, Inc. | System and method for efficiently executing single program multiple data (SPMD) programs |
US7930336 | Dec 5, 2006 | Apr 19, 2011 | Altera Corporation | Large multiplier for programmable logic device |
US7948267 | Feb 9, 2010 | May 24, 2011 | Altera Corporation | Efficient rounding circuits and methods in configurable integrated circuit devices |
US7949699 | Aug 30, 2007 | May 24, 2011 | Altera Corporation | Implementation of decimation filter in integrated circuit device using ram-based data storage |
US8005885 * | Oct 14, 2005 | Aug 23, 2011 | Nvidia Corporation | Encoded rounding control to emulate directed rounding during arithmetic operations |
US8041759 | Jun 5, 2006 | Oct 18, 2011 | Altera Corporation | Specialized processing block for programmable logic device |
US8266198 | Jun 5, 2006 | Sep 11, 2012 | Altera Corporation | Specialized processing block for programmable logic device |
US8266199 | Jun 5, 2006 | Sep 11, 2012 | Altera Corporation | Specialized processing block for programmable logic device |
US8301681 * | Jun 5, 2006 | Oct 30, 2012 | Altera Corporation | Specialized processing block for programmable logic device |
US8307023 | Oct 10, 2008 | Nov 6, 2012 | Altera Corporation | DSP block for implementing large multiplier on a programmable integrated circuit device |
US8386550 | Sep 20, 2006 | Feb 26, 2013 | Altera Corporation | Method for configuring a finite impulse response filter in a programmable logic device |
US8386553 | Mar 6, 2007 | Feb 26, 2013 | Altera Corporation | Large multiplier for programmable logic device |
US8396914 | Sep 11, 2009 | Mar 12, 2013 | Altera Corporation | Matrix decomposition in an integrated circuit device |
US8412756 | Sep 11, 2009 | Apr 2, 2013 | Altera Corporation | Multi-operand floating point operations in a programmable integrated circuit device |
US8468192 | Mar 3, 2009 | Jun 18, 2013 | Altera Corporation | Implementing multipliers in a programmable integrated circuit device |
US8484265 | Mar 4, 2010 | Jul 9, 2013 | Altera Corporation | Angular range reduction in an integrated circuit device |
US8510354 | Mar 12, 2010 | Aug 13, 2013 | Altera Corporation | Calculation of trigonometric functions in an integrated circuit device |
US8539014 | Mar 25, 2010 | Sep 17, 2013 | Altera Corporation | Solving linear matrices in an integrated circuit device |
US8539016 | Feb 9, 2010 | Sep 17, 2013 | Altera Corporation | QR decomposition in an integrated circuit device |
US8543634 | Mar 30, 2012 | Sep 24, 2013 | Altera Corporation | Specialized processing block for programmable integrated circuit device |
US8577951 | Aug 19, 2010 | Nov 5, 2013 | Altera Corporation | Matrix operations in an integrated circuit device |
US8589463 | Jun 25, 2010 | Nov 19, 2013 | Altera Corporation | Calculation of trigonometric functions in an integrated circuit device |
US8601044 | Mar 2, 2010 | Dec 3, 2013 | Altera Corporation | Discrete Fourier Transform in an integrated circuit device |
US8620980 | Jan 26, 2010 | Dec 31, 2013 | Altera Corporation | Programmable device with specialized multiplier blocks |
US8645449 | Mar 3, 2009 | Feb 4, 2014 | Altera Corporation | Combined floating point adder and subtractor |
US8645450 | Mar 2, 2007 | Feb 4, 2014 | Altera Corporation | Multiplier-accumulator circuitry and methods |
US8645451 | Mar 10, 2011 | Feb 4, 2014 | Altera Corporation | Double-clocked specialized processing block in an integrated circuit device |
US8650231 | Nov 25, 2009 | Feb 11, 2014 | Altera Corporation | Configuring floating point operations in a programmable device |
US8650236 | Aug 4, 2009 | Feb 11, 2014 | Altera Corporation | High-rate interpolation or decimation filter in integrated circuit device |
US8706790 | Mar 3, 2009 | Apr 22, 2014 | Altera Corporation | Implementing mixed-precision floating-point operations in a programmable integrated circuit device |
US8762443 | Nov 15, 2011 | Jun 24, 2014 | Altera Corporation | Matrix operations in an integrated circuit device |
US8788562 | Mar 8, 2011 | Jul 22, 2014 | Altera Corporation | Large multiplier for programmable logic device |
US8812573 | Jun 14, 2011 | Aug 19, 2014 | Altera Corporation | Calculation of trigonometric functions in an integrated circuit device |
US8812576 | Sep 12, 2011 | Aug 19, 2014 | Altera Corporation | QR decomposition in an integrated circuit device |
US8862650 | Nov 3, 2011 | Oct 14, 2014 | Altera Corporation | Calculation of trigonometric functions in an integrated circuit device |
US8949298 | Sep 16, 2011 | Feb 3, 2015 | Altera Corporation | Computing floating-point polynomials in an integrated circuit device |
US8954803 * | Feb 18, 2011 | Feb 10, 2015 | Mosys, Inc. | Programmable test engine (PCDTE) for emerging memory technologies |
US8959137 | Nov 15, 2012 | Feb 17, 2015 | Altera Corporation | Implementing large multipliers in a programmable integrated circuit device |
US8988956 | Mar 15, 2013 | Mar 24, 2015 | Mosys, Inc. | Programmable memory built in self repair circuit |
US8996600 | Aug 3, 2012 | Mar 31, 2015 | Altera Corporation | Specialized processing block for implementing floating-point multiplier with subnormal operation support |
US9007382 | Jul 20, 2009 | Apr 14, 2015 | Samsung Electronics Co., Ltd. | System and method of rendering 3D graphics |
US9053045 | Mar 8, 2013 | Jun 9, 2015 | Altera Corporation | Computing floating-point polynomials in an integrated circuit device |
US9063870 | Jan 17, 2013 | Jun 23, 2015 | Altera Corporation | Large multiplier for programmable logic device |
US9098332 | Jun 1, 2012 | Aug 4, 2015 | Altera Corporation | Specialized processing block with fixed- and floating-point structures |
US9189200 | Mar 14, 2013 | Nov 17, 2015 | Altera Corporation | Multiple-precision processing block in a programmable integrated circuit device |
US9207909 | Mar 8, 2013 | Dec 8, 2015 | Altera Corporation | Polynomial calculations optimized for programmable integrated circuit device structures |
US9348558 * | Aug 23, 2013 | May 24, 2016 | Texas Instruments Deutschland Gmbh | Processor with efficient arithmetic units |
US9348795 | Jul 3, 2013 | May 24, 2016 | Altera Corporation | Programmable device using fixed and configurable logic to implement floating-point rounding |
US9395953 | Jun 10, 2014 | Jul 19, 2016 | Altera Corporation | Large multiplier for programmable logic device |
US9519460 * | Sep 25, 2014 | Dec 13, 2016 | Cadence Design Systems, Inc. | Universal single instruction multiple data multiplier and wide accumulator unit |
US9575725 * | Mar 18, 2014 | Feb 21, 2017 | Altera Corporation | Specialized processing block with embedded pipelined accumulator circuitry |
US9600278 * | Jul 15, 2013 | Mar 21, 2017 | Altera Corporation | Programmable device using fixed and configurable logic to implement recursive trees |
US9684488 | Mar 26, 2015 | Jun 20, 2017 | Altera Corporation | Combined adder and pre-adder for high-radix multiplier circuit |
US20030236808 * | Jun 19, 2002 | Dec 25, 2003 | Hou Hsieh S. | Merge and split discrete cosine block transform method |
US20040039898 * | Aug 20, 2002 | Feb 26, 2004 | Texas Instruments Incorporated | Processor system and method providing data to selected sub-units in a processor functional unit |
US20050108720 * | Nov 14, 2003 | May 19, 2005 | Stmicroelectronics, Inc. | System and method for efficiently executing single program multiple data (SPMD) programs |
US20050187998 * | Feb 20, 2004 | Aug 25, 2005 | Altera Corporation | Multiplier-accumulator block mode splitting |
US20060149921 * | Dec 20, 2004 | Jul 6, 2006 | Lim Soon C | Method and apparatus for sharing control components across multiple processing elements |
US20070030920 * | Nov 8, 2005 | Feb 8, 2007 | Fujitsu Limited | Address generating device and method of generating address |
US20070171101 * | Feb 7, 2005 | Jul 26, 2007 | Siemens Aktiengesellschaft | Reconfigurable switching device parallel calculation of any particular algorithms |
US20070185951 * | Jun 5, 2006 | Aug 9, 2007 | Altera Corporation | Specialized processing block for programmable logic device |
US20070185952 * | Jun 5, 2006 | Aug 9, 2007 | Altera Corporation | Specialized processing block for programmable logic device |
US20080046495 * | Aug 18, 2006 | Feb 21, 2008 | Yun Du | Multi-stage floating-point accumulator |
US20080046497 * | Aug 17, 2007 | Feb 21, 2008 | Conexant Systems, Inc. | Systems and Methods for Implementing a Double Precision Arithmetic Memory Architecture |
US20080141001 * | Feb 19, 2008 | Jun 12, 2008 | University Of Washington | Processor employing loadable configuration parameters to reduce or eliminate setup and pipeline delays in a pipeline system |
US20100164949 * | Jul 20, 2009 | Jul 1, 2010 | Samsung Electronics Co., Ltd. | System and method of rendering 3D graphics |
US20110137969 * | Dec 9, 2009 | Jun 9, 2011 | Mangesh Sadafale | Apparatus and circuits for shared flow graph based discrete cosine transform |
US20110161389 * | Mar 8, 2011 | Jun 30, 2011 | Altera Corporation | Large multiplier for programmable logic device |
US20110209002 * | Feb 18, 2011 | Aug 25, 2011 | Mosys, Inc. | Programmable Test Engine (PCDTE) For Emerging Memory Technologies |
US20110219052 * | Mar 2, 2010 | Sep 8, 2011 | Altera Corporation | Discrete fourier transform in an integrated circuit device |
US20110238720 * | Mar 25, 2010 | Sep 29, 2011 | Altera Corporation | Solving linear matrices in an integrated circuit device |
US20150058391 * | Aug 23, 2013 | Feb 26, 2015 | Texas Instruments Deutschland Gmbh | Processor with efficient arithmetic units |
CN102652314A * | Dec 8, 2010 | Aug 29, 2012 | 德克萨斯仪器股份有限公司 | Circuits for shared flow graph based discrete cosine transform |
EP1391813A1 * | Aug 13, 2003 | Feb 25, 2004 | Texas Instruments Incorporated | Processor system and method providing data to selected sub-units in a processor functional unit |

Classifications

U.S. Classification | 708/501, 712/E09.071, 708/503, 712/E09.032 |
International Classification | G06F15/78, G06F9/30, G06F7/544, G06F7/49, G06F9/38 |
Cooperative Classification | G06F7/49, G06F7/5443, G06F9/30054, G06F9/30003, G06F7/483, G06F7/49936, G06F2207/3884, G06F15/7867, G06F9/3897, G06F9/3885 |
European Classification | G06F9/38T8C2, G06F9/30A, G06F9/30A3B, G06F9/38T, G06F7/544A, G06F15/78R, G06F7/49 |
