WO2003054691A1 - Programmable delay indexed data path register file for array processing - Google Patents

Programmable delay indexed data path register file for array processing

Info

Publication number
WO2003054691A1
WO2003054691A1 PCT/IB2002/005126 IB0205126W WO03054691A1 WO 2003054691 A1 WO2003054691 A1 WO 2003054691A1 IB 0205126 W IB0205126 W IB 0205126W WO 03054691 A1 WO03054691 A1 WO 03054691A1
Authority
WO
WIPO (PCT)
Prior art keywords
register
registers
data
sample
register bank
Prior art date
Application number
PCT/IB2002/005126
Other languages
French (fr)
Inventor
Krishnamurthy Vaidyanathan
Geoffrey F. Burns
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.
Priority to KR10-2004-7009643A (patent KR20040069335A)
Priority to AU2002351109A (patent AU2002351109A1)
Priority to JP2003555339A (patent JP2005513643A)
Priority to EP02785822A (patent EP1459168A1)
Publication of WO2003054691A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134Register stacks; shift registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/355Indexed addressing

Definitions

  • x_n are the filter states and c_n are the filter coefficients.
  • the filter coefficients are updated according to the formula:
  • the filter update latency is the difference, measured in input data sampling periods, between the time the newly calculated error arrives at the cell and the time at which the filter tap output was calculated in the cell.
  • the cell needs a delay buffer.
  • the processor is programmed so as to automatically interpret operands in instructions of the type RI_X as RI[RD_X].
  • RI[RD_X] the type of operands in instructions of the type RI_X as RI[RD_X].
  • RI_X operands, it being understood that the processor is programmed to automatically convert those to RI[RD_X] operands.
  • An interpolation filter is a multi-rate filter where the output data rate is a multiple of the input data rate. A frequently used case is when this multiple is an integer.
  • Such an interpolation filter implements equation 1, but the input sequence x is the actual input data with zeros stuffed in between. For example, if the interpolation multiple is 3, then the input data stream 601 is modified by inserting 2 zeros between every pair of data samples before applying the filter 602. Since two in three data values are zeros, at any point in time only one third of the filter taps produce a non-zero output.
  • a poly-phase filter utilizes this fact to avoid implementing the zero output taps.
  • Figure 6 shows the working of a polyphase filter used as the interpolation filter for an interpolation multiple of 3. Equation 1 is then implemented as three filters that take a common input and whose outputs are multiplexed in time. The mapping of the filter taps to the cells is also shown in the figure.
  • the delay limit register, rlimit, is programmed to be 2, for example by means of a dedicated instruction. Coefficients 0, 1, and 2 are stored in RI_0, RI_1 and RI_2 respectively. The coefficients are thus stored in consecutive registers which are delay addressed.
  • the controller program executes three loops for every data sample period. Let the input data in a cell be stored in RI_3.
  • RI_4 = RI_3*RI_2
  • RI_2 has coefficient C0
  • RI_2 has C1
  • RI_2 has C2.
  • the filter output in each program cycle corresponds to the interpolation filter output, thereby inherently implementing the output multiplexer. Note that the state is shared between the filters; for a 9-tap filter and an interpolation factor of 3 there are only 3 states needed.
  • the decimation filter is just the dual of the interpolation filter. Such a decimation filter is depicted in Figure 7. For a decimation factor of 3 710, two out of three output samples after filtering are discarded. This means that the discarded filter outputs need not be calculated in the first place.
  • This structure can be derived by simply reversing the flow graph of the interpolator depicted in Figure 6, which results in the structure shown in Figure 7. However, unlike the interpolation structure of Figure 6, the states are not shared. The two output delays inherent in the system are shown at 720 and 730 in Figure 7.
  • a second delay addressed register buffer is required, addressed by the same pointer register RD_X
  • An example implementation of just such a system is shown in Fig. 8.
  • the two delay addressed register buffers are addressed in lock-step, fetching the corresponding pairs of coefficients and states.
  • RI0_X 810 and RI1_X 820 Let the coefficients be stored in RI0_X 810; specifically for the example of decimation by 3, let RI0_0 be C0, RI0_1 be C1 and RI0_2 be C2, as above. Let the incoming data be stored in RI1_X 820. Specifically, let the new data sample be stored in RI1_0, so that RI1_0 is Xn, RI1_1 is Xn-1 and RI1_2 is Xn-2. Let the parameter rlimit be 2 (modulo 3) as in the case of the interpolator example discussed above, setting up a delay line with three consecutive elements.
  • (rlimit+1) is the number of FIR taps being computed in one cell.
  • one or more data register banks RI_X can be indexed by the same RD_X pointer register bank, each data register bank being addressed in lock step.
  • the data register bank and the pointer register bank can each be incremented at a rate different than the data sample rate.

Abstract

A delay addressed data path register file is designed for use in a programmable processor making up a cell in a multi-processor or array signal processing system. The delay addressable register file is particularly useful in, inter alia, adaptive filters where the filter update latency is variable, interpolation filters where the interpolation factor needs to be programmable, and decimation filters where the decimation factor needs to be programmable. The programmability is achieved in an efficient manner, reducing the number of cycles required to perform this task. A single parameter, the 'delay limit' value, is programmed at start-up, setting up an internal delay-line within the register file of the processor. Thus, any of the delayed registers can be addressed by specifying the delay index during run-time. The delay line advances one location, modulo 'delay-limit', when the processing loop starts a new iteration.

Description

Programmable delay indexed data path register file for array processing
CROSS REFERENCE TO RELATED APPLICATIONS
This application is a continuation-in-part of United States Patent Application Serial No. 09/968,119, filed on October 1, 2001, for "Programmable Array for Efficient Computation of Convolutions in Digital Signal Processing", applicants Krishnamurthy Vaidyanathan and Geoffrey Burns, the specification of which is hereby incorporated herein by this reference.
TECHNICAL FIELD
This invention relates to digital signal processing, and more particularly, to optimizing data access in array processing and other multiprocessor systems.
BACKGROUND OF THE INVENTION:
Circular buffers are commonly found in digital signal processors, such as, for example, the Analog Devices ADSP 2181 or the Philips REAL DSP, where a memory segment can be addressed after modifying the address by a modulo operation. In such cases, the data is fetched in one cycle, stored in a register, and used as an operand in the next cycle. In such examples, the circular buffer is maintained in memory and in order to process the data stored therein, or properly write new data thereto, memory read/write instructions must be used. Such instructions increase the computing overhead and the complexity of the instruction set, and add the time taken by the memory handling.
Besides such conventional uses of circular buffers, there are no designs known to exist that allow modulo addressing of a register file directly, or the use of modulo addressing in an array processor. Modulo addressing facilitates a sequentially linked series of data elements in which, when the end of the series is reached, the sequence wraps around to the beginning. As an example, in a circular buffer of N data storage positions, numbered, say, from 0 to N-1, where the system is set up such that the next storage position from a given position X is defined as X+1, modulo addressing gives (N-1)+1 = 0 (mod N), thus achieving the wrapping effect. Alternatively, a circular memory could be set up such that the next memory position from a given position X is defined as X-1, and then 0-1 = (N-1) (mod N), again achieving the wrapping effect.
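The wrapping arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the function names next_position and previous_position and the buffer size of 8 are assumptions made for the example and do not appear in the specification.

```python
def next_position(x: int, n: int) -> int:
    """Successor of position x in a circular buffer of n positions (next = X+1)."""
    return (x + 1) % n


def previous_position(x: int, n: int) -> int:
    """Alternative convention (next = X-1); Python's % keeps the result in 0..n-1."""
    return (x - 1) % n


N = 8  # example buffer size, chosen arbitrarily for illustration
wrap_up = next_position(N - 1, N)    # (N-1)+1 = 0 (mod N)
wrap_down = previous_position(0, N)  # 0-1 = N-1 (mod N)
```

Either convention yields a closed sequence of N positions; only the direction of travel differs.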
In the context of a multi-processor, or an array processor designed for high-throughput repetitive signal processing, such as that disclosed in copending United States Patent Application Serial No. 09/968,119, the individual cell has limited or no memory addressing capability. In such a case, maintaining a circular buffer in memory is more than an added complexity to deal with; it is simply impossible.
Thus, what would facilitate a delay line or the like in the cell of such an array processor, i.e., the equivalent to the implementation of a circular buffer in memory, is the facility to modulo address the actual registers where data is stored while under processing. There are no known designs which allow modulo addressing in a datapath instruction.
What is needed to solve these lacunae in the conventional art, is a method and apparatus for modulo addressing of registers in a datapath instruction. Such a method would allow a processor to maintain a sequential series of data, such as a delay line, in the actual registers themselves, thus obviating the need for memory handling capability.
SUMMARY OF THE INVENTION
A delay addressed data path register file is designed for use in a programmable processor making up a cell in a multi-processor or array signal processing system. The delay addressable register file is particularly useful in, inter alia, adaptive filters where the filter update latency is variable, interpolation filters where the interpolation factor needs to be programmable, and decimation filters where the decimation factor needs to be programmable. The programmability is achieved in an efficient manner, reducing the number of cycles required to perform this task. A single parameter, the "delay limit" value, is programmed at start-up, setting up an internal delay-line within the register file of the processor. Thus, any of the delayed registers can be addressed by specifying the delay index during run-time. The delay line advances one location, modulo "delay-limit", when the processing loop starts a new iteration.
BRIEF DESCRIPTION OF THE DRAWINGS:
Figures 1-2 illustrate pointer modified register addressing;
Figures 1A and 2A are Figures 1 and 2, respectively, with exemplary contents of the data registers;
Figure 3 depicts an example delay-indexed register set according to the present invention;
Figure 4 depicts the register of Figure 3, shifted by one;
Figure 5 depicts a typical configuration of an adaptive filter as an equalizer;
Figure 6 depicts a polyphase implementation of an interpolation filter;
Figure 7 depicts a decimation filter; and
Figure 8 depicts a dual register file for a decimation filter.
Before one or more embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangements of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as in any way limiting.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS:
Convolution is a basic signal processing operation found in many applications, especially in digital filters. Digital filters can be elegantly implemented using array processing techniques, such as the reconfigurable adaptive filter array processor used in the Multi-Standard Channel Decoder (MSCD) described in copending United States Patent Application Serial No. 09/968,119 (the "Parent Application"), discussed above. The reconfigurable processor array is composed of identical processor cells, each capable of communicating with its nearest neighbors and capable of being programmed individually to perform a single task. Because of the high data rates that need to be supported and the constraints on cost, the cells are constrained to be simple and efficient. The efficiency of the cell is determined in part by the design of an efficient instruction set and the supporting architecture that implements the instruction. The present invention describes the design of a delay addressed register file and the corresponding instructions. Such an instruction can be put to good use in a variety of filtering applications including, for example, adaptive filtering and multi-rate filtering in the context of array processing. The delay addressed data path register file design can be applied to any array based design of filters and is not limited to the two-dimensional array described in the Parent Application.
To illustrate the present invention concretely, some preliminary discussion on register addressing modes is in order. Let a given processor have a register file set labeled RI_x, where x is a value from 0 through N-1, and N is the total number of datapath registers. Let the processor also have a typical RISC-like instruction set and a sequential controller that executes a specified loop. For example, an add instruction is of the form ADD SRC1 SRC2 DST, where SRC1 is source operand 1, SRC2 is source operand 2 and DST is the destination register. All three operands are drawn from the register file. Normally, an instruction like ADD RI_0 RI_1 RI_2 would simply add the contents of the register in location 0 of the register file to the contents of the register in location 1 and store the result in location 2. In a C language notation this would be written as RI[2] = RI[0] + RI[1]. In these examples all addressing is implicit and static (fixed in time).
Pointer modified addressing works slightly differently. Pointer modified addressing is a form of indirect addressing. An additional register set, the pointer register set, is maintained to map the address of a datapath register to the contents of the corresponding pointer register. Thus, let the pointer register set be called RD_x. An instruction like ADD RI_0 RI_1 RI_2 is actually translated to mean RI[RD[2]] = RI[RD[0]] + RI[RD[1]]. Thus, the operands of the instruction are the data registers whose addresses are contained in the RD_x register set. If the contents of the pointer register set were such that RD_x = x, then the behavior under the pointer modified addressing would be exactly the same as that of the implicit addressing described in the previous paragraph.
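The difference between the two addressing modes can be sketched as follows. This is a minimal Python model under stated assumptions: the register files are modeled as plain lists, and the helper names add_direct and add_pointer_modified, as well as the register contents, are hypothetical and chosen only for illustration.

```python
def add_direct(ri, src1, src2, dst):
    """ADD SRC1 SRC2 DST with implicit, static addressing: RI[dst] = RI[src1] + RI[src2]."""
    ri[dst] = ri[src1] + ri[src2]


def add_pointer_modified(ri, rd, src1, src2, dst):
    """Same instruction under pointer modified addressing:
    RI[RD[dst]] = RI[RD[src1]] + RI[RD[src2]]."""
    ri[rd[dst]] = ri[rd[src1]] + ri[rd[src2]]


# Implicit addressing: ADD RI_0 RI_1 RI_2
direct = [5, 7, 0, 0]
add_direct(direct, 0, 1, 2)

# Pointer modified addressing with RD_x = x behaves identically.
identity = [5, 7, 0, 0]
add_pointer_modified(identity, [0, 1, 2, 3], 0, 1, 2)

# With other pointer contents the same instruction touches other registers:
# RD = [3, 0, 1, 2] gives RI[1] = RI[3] + RI[0].
remapped = [5, 7, 0, 9]
add_pointer_modified(remapped, [3, 0, 1, 2], 0, 1, 2)
```

The instruction text is the same in all three cases; only the contents of the pointer set decide which data registers take part.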
The present invention utilizes delay indexed addressing. Delay indexed addressing is a modification of pointer modified addressing. It is, essentially, pointer modified addressing of the register file with certain initial conditions on the contents of the RD (pointer) register file, and a mechanism for automatically shifting the pointers every data cycle. At start up, the contents of RD are sequentially increasing, which means that RD_0 = 0, RD_1 = 1, ..., RD_(N-1) = N-1. Then, whenever the processing loop starts over, which means whenever the program counter becomes 0, the contents of each register in the pointer register set are shifted to the next register therein, which means (for "next" defined as subsequent) RD_x (current) = RD_(x-1) (prior), and the contents of the last register fold into the first. (If "next" is defined as precedent, the equivalent shifting can occur, with RD_x (current) = RD_(x+1) (prior), and the contents of the first register fold into the last.) This can be illustrated with reference to Figures 1 and 2. In each of these figures depictions of the RD_x 110, 210 and RI_x 120, 220 register sets are shown. Each register bank contains, for the purposes of this example, 4 registers, with addresses 0-3. These addresses of the registers 150, 250 are shown on the (outer) sides of each register. "Next" here is defined as subsequent, so at each shift the contents of a given RD_x register are shifted to the subsequent register, and the contents of the last register fold into the first. The arrows indicate where the RD_x registers' contents point to in the RI_x register set. In Figure 1 the t=n 101, or startup, condition is illustrated on the left, where RD_x = x. At t=n+1 102, illustrated on the right side of Figure 1, the contents of the pointer registers RD_x are shifted such that RD_x (current) = RD_(x-1) (prior) as described above. The index subtraction is carried out modulo 4, such that 0-1 = 3 (mod 4); RD_0 therefore takes the prior contents of RD_3, and thus the address contained in RD_0 is 3 at t=n+1.
Figure 2 completes the temporal sequence, and depicts the register sets for t=n+2 201 and t=n+3 202, respectively. As is seen, for a four register set t=n+4 is identical to t=n. This addressing system creates a circular buffer, as will be described below. The contents of the RD_x registers are the addresses of the RI_x registers. The contents of the RI_x registers are the data being processed by the processor. In general the data will change with time, as data enters and exits the processor. It is easily seen that if, each time the program counter resets, a new datum enters the RI_x register set 120, 220, then a delay line of depth equal to one less than the number of registers in the RD_x set is set up. In the example of Figures 1-2, a delay line of depth 3 can thus be set up, the processor having access to the current datum (usually a sample of some analog value procured at a given sampling rate), and the previous three data, or samples. I.e., the processor has access to data samples Xn, Xn-1, Xn-2, and Xn-3.
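The pointer sequence over t=n through t=n+4 can be traced with a short sketch. It is written in Python purely for illustration and assumes "next" is defined as subsequent, with the pointer set modeled as a list; the name shift_pointers is hypothetical.

```python
def shift_pointers(rd):
    """One pointer shift: RD_x(current) = RD_(x-1)(prior),
    with the contents of the last register folding into the first."""
    return [rd[-1]] + rd[:-1]


rd = [0, 1, 2, 3]   # startup condition at t=n: RD_x = x
trace = [rd]        # trace[k] holds the pointer contents at t=n+k
for _ in range(4):
    rd = shift_pointers(rd)
    trace.append(rd)

# trace[1] is [3, 0, 1, 2]: RD_0 now holds address 3, since 0-1 = 3 (mod 4).
# trace[4] equals trace[0]: a four register set repeats after four shifts.
```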
Figures 1A and 2A correspond to Figures 1 and 2, respectively, to which they are identical except for the addition of example contents of the data register set RI_x. The asterisk at any given time shows where the next incoming sample (i.e., sample Xn+1 at time t=n; in general sample Xk+1 at time t=k, etc.) will be written. As can be seen, the new sample is always written over the oldest, or most delayed, sample stored in the register set. For the depicted exemplary delay of three, the new sample always overwrites the sample three sample periods behind the current sample; for t=n, the Xn+1 sample overwrites the Xn-3 data sample. Thus the new sample is always written, in this example, to the RI register one behind the register with the current sample, that is, to the RI_x register pointed to by the RD_(0-1) register, RI[RD_3]. As one steps forward through all the data registers one at a time from the RI[RD_0] register, modulo 4 (so RI[RD_(3+1)] = RI[RD_0]), one finds samples of increasing delay. The RD_x registers thus create a circular buffer whose elements are indexed (addressed) by the delay. Figure 3 illustrates such delay-indexed addressing for a delay buffer of depth 3. In Figure 3 only a portion of the available RD_x registers is shown, so a greater depth is possible, limited by the actual number of registers in the RD_x set. Because only four registers in the RD_x set are utilized for the delay line, only registers 0-3 of the RI_x set are involved in storing the delay line data. An operand of RD_0 in an instruction points to the register with the most recent value in the delay buffer, while an operand of RD_3 points to the value of delay 3, or Xn-3. Thus the addresses for the RD_x register set are actually interpreted as delays. Where these RD_x registers point in the RI_x set changes with time.
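The behavior of the delay line with actual data can be sketched as below. This is a simplified Python model, not the hardware: the class name DelayLine, its methods push and read, and the sample values 10 through 14 are all assumptions introduced for the example.

```python
class DelayLine:
    """Software model of a delay-indexed register set (illustrative only)."""

    def __init__(self, depth):
        size = depth + 1              # depth 3 needs four registers
        self.rd = list(range(size))   # pointer registers, startup RD_x = x
        self.ri = [0] * size          # data registers

    def push(self, sample):
        # The new sample overwrites the most delayed one, RI[RD_last] ...
        self.ri[self.rd[-1]] = sample
        # ... and the pointers shift, so that RD_0 now addresses the new sample.
        self.rd = [self.rd[-1]] + self.rd[:-1]

    def read(self, delay):
        """Operand RD_delay: the sample that is 'delay' periods old."""
        return self.ri[self.rd[delay]]


line = DelayLine(depth=3)
for x in [10, 11, 12, 13, 14]:
    line.push(x)

by_delay = [line.read(d) for d in range(4)]   # [14, 13, 12, 11]
```

The data never moves between registers; only the four small pointer values change at each sample period.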
Figure 4 shows the advancement of the register pointers upon arrival of the new state. To implement a circular buffer on a partial set of registers from the datapath register file, a delay limit, called rlimit in Figures 3-4, is introduced and the pointer register shift is done modulo (rlimit+1); thus the contents of each RD_x register are changed by the subtraction of 1 (modulo (rlimit+1)). The modulus is (rlimit+1) because rlimit is the maximum delay stored in the RI_x registers, but the actual number of registers in the delay line is (rlimit+1), to include the zero delay, or current, sample Xn. In Figures 3 and 4, the value of rlimit is 3, thus there are four registers utilized in the delay line.
To preserve the three most recent samples in the circular buffer, the new sample, with a delay of zero, is written into the ever-changing (modulo rlimit+1) RI_x register which is pointed to by the RD_0 register. For the system of Figures 3 and 4, the contents of the RD_x registers will cycle in time as depicted in Figures 1-2; Figure 3 corresponds to t=n+2 201 in Figure 2, and Figure 4 to t=n+3 202 in Figure 2.
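The pointer shift and the write of the new sample can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the disclosed hardware: the class name `DelayLine`, its methods `push` and `read`, the eight-register data bank, and the initial pointer values are hypothetical choices made for illustration.

```python
class DelayLine:
    """Toy model of a pointer bank (RD_x) addressing a data bank (RI_x)."""

    def __init__(self, rlimit, num_data_regs=8):
        self.rlimit = rlimit                  # maximum delay kept in the line
        self.ri = [0] * num_data_regs         # data registers RI_x
        self.rd = list(range(rlimit + 1))     # pointer registers RD_0..RD_rlimit

    def push(self, sample):
        """Shift every pointer by -1 modulo (rlimit+1), then write at RI[RD_0]."""
        m = self.rlimit + 1
        self.rd = [(p - 1) % m for p in self.rd]
        self.ri[self.rd[0]] = sample          # overwrites the most delayed sample

    def read(self, delay):
        """Return the sample with the given delay, i.e. RI[RD_delay]."""
        return self.ri[self.rd[delay]]


line = DelayLine(rlimit=3)
for x in [10, 11, 12, 13, 14]:
    line.push(x)

delays = [line.read(d) for d in range(4)]     # [14, 13, 12, 11]
```

After five samples the first one (10) has been overwritten, and the operand index read back is the delay itself: index 0 returns the newest sample and index 3 the oldest, although no data was moved between registers.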
In general, a delay indexed pointer register allows a processor to implement any filter or other data processing operation whose inputs are a current datum and a number of data preceding the current datum in some sense. If the data vary relative to each other in time, then a temporal delay line can be maintained, allowing access to a current sample and a number of prior samples, such as is commonly required in FIR filters. The number of samples stored in the delay line will correspond in such a case to the number of delays in the filtering equation plus one, or in terms of the system depicted in Figures 3-4, (rlimit+1). The processor knows how many data samples are in the delay line by means of a preprogrammed variable rlimit, which gives the maximum delay stored in the data registers. The index registers are automatically incremented using modular arithmetic so as to preserve the delay relationships between the ever-changing data. Alternatively, a "delay line" could be implemented where the samples vary not in time, but in space, such as in image processing operations, where "prior" corresponds to the prior in space, as defined by some direction within an image.
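As an illustration of a temporal delay line feeding an FIR filter, the following Python sketch computes each output from the current sample and the prior samples, addressing them by delay. It is a sketch under assumptions: the function name `fir_with_delay_line` and the choice of rlimit as the number of coefficients minus one are illustrative, and samples before the start of the input are taken to be zero.

```python
def fir_with_delay_line(coeffs, samples):
    """FIR filter whose state lives in a delay-indexed circular buffer."""
    rlimit = len(coeffs) - 1      # maximum delay = number of delays in the equation
    m = rlimit + 1                # number of samples held in the delay line
    ri = [0] * m                  # data registers holding the samples
    rd = list(range(m))           # rd[d] = data register holding the delay-d sample
    out = []
    for x in samples:
        rd = [(p - 1) % m for p in rd]    # pointer shift, modulo (rlimit+1)
        ri[rd[0]] = x                     # new sample overwrites the oldest one
        out.append(sum(coeffs[d] * ri[rd[d]] for d in range(m)))
    return out


impulse_response = fir_with_delay_line([1, 2, 3], [1, 0, 0, 2])   # [1, 2, 3, 2]
```

The inner sum never needs to know where a sample physically sits; the delay d is the only index it uses, which is the property the delay-indexed pointer register is meant to provide.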
The usefulness of such a delay indexed pointer register will be next illustrated by the following examples.
Application 1: Compensation of error latency in an adaptive filter

The delay-indexed datapath register (RD_x) can be used to simplify programming of the tap delay line for adaptive FIR filters. Consider the least mean squares (LMS) algorithm in particular. The filtering equation is given by

$$y_n = \sum_{i=0}^{N-1} c_i \, x_{n-i} \qquad (1)$$

where y_n is the filter output, N is the number of taps, xn are the filter states and cn are the filter coefficients. The filter coefficients are updated according to the formula:
$$c_i \leftarrow c_i + \mu \, E \, x_{n-i}$$

where μ is a constant, and E is the error in the filtered output, calculated from a previous filter calculation. Figure 5 shows the use of such a filter in a channel equalizer. In practice there is a finite latency, measurable in terms of number of input sample periods, between the time a given sample "Xn" appears at the input of the adaptive filter 510 and the time the error "E" is calculated and made available to the adaptation unit 520. If this filter update latency is greater than or equal to one sample period, then the update equation has to be modified to use an equally delayed state value x, such as Xn-d, where d is the appropriate delay.
If the adaptive filter is implemented on an array processor, and a single tap of the FIR filter is mapped to one cell of the array, the filter update latency is the difference, measured in input data sampling periods, between the time the newly calculated error arrives at the cell and the time at which the filter tap output was calculated in the cell. In order to fetch the delayed state, the cell needs a delay buffer. This delay buffer is constituted from a subset of the existing internal registers, as described above, with each element addressed by its delay relative to the most recently arrived local state (d=0), stored at RI[RD_0]. For example, let the latency be 3, let the coefficient Cn be stored in register RI_5, the error in RI_4, and the current state Xn in RI[RD_0]. To implement the filter update equation, the cell is programmed with a delay limit, rlimit = 3, and the update equation becomes RI_5 = RI_5 + RI_4*RI[RD_3]. Since the register contents of the delay line are automatically shifted every data sample period, no additional data movements are required.
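The latency-3 coefficient update can be sketched for a single cell as follows. This is an illustrative model only: the helper names `make_cell`, `shift_in` and `delayed_lms_update`, the dictionary layout, and the eight-register bank are hypothetical, and the error held in RI_4 is assumed to be already scaled by the update constant.

```python
def make_cell(rlimit, num_regs=8):
    """Create a cell: data registers RI_x plus pointer registers RD_0..RD_rlimit."""
    return {"ri": [0.0] * num_regs, "rd": list(range(rlimit + 1))}


def shift_in(cell, sample):
    """Each sample period: shift pointers by -1 modulo (rlimit+1), store at RI[RD_0]."""
    m = len(cell["rd"])
    cell["rd"] = [(p - 1) % m for p in cell["rd"]]
    cell["ri"][cell["rd"][0]] = sample


def delayed_lms_update(cell, coeff_reg=5, error_reg=4, latency=3):
    """Models the instruction RI_5 = RI_5 + RI_4 * RI[RD_3]."""
    ri, rd = cell["ri"], cell["rd"]
    ri[coeff_reg] += ri[error_reg] * ri[rd[latency]]


cell = make_cell(rlimit=3)
cell["ri"][5] = 1.0     # coefficient, outside the delay line
cell["ri"][4] = 0.5     # error, assumed pre-scaled by the update constant
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    shift_in(cell, x)
delayed_lms_update(cell)    # the delay-3 state is 2.0, so RI_5 becomes 2.0
```

Only registers 0 to 3 take part in the circular buffer, so the coefficient and error registers keep fixed locations while the state they are combined with is selected purely by its delay.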
The processor is programmed so as to automatically interpret operands in instructions of the type RI_X as RI[RD_X]. Thus, the user need not be at all concerned with the mapping of the pointer registers to the data registers. Accordingly, in the examples that follow, instructions will be illustrated in terms of RI_X operands, it being understood that the processor is programmed to automatically convert those to RI[RD_X] operands.
Application 2: Efficient Implementation of a Programmable Interpolation Filter
An interpolation filter is a multi-rate filter where the output data rate is a multiple of the input data rate. A frequently used case is when this multiple is an integer. Such an interpolation filter implements equation 1, but the input sequence x is the actual input data with zeros stuffed in between. For example, if the interpolation multiple is 3, then the input data stream 601 is modified by inserting 2 zeros between every pair of data samples before applying the filter 602. Since two in three data values are zeros, at any point in time only one third of the filter taps produce a non-zero output. A polyphase filter utilizes this fact to avoid implementing the zero output taps. For a full description of this see Proakis and Manolakis, Introduction to Digital Signal Processing (MacMillan Publishing Company, New York, 1988), ISBN: 0-02-396810-9, pp. 662-670, and pages 667 and 668 respectively.
Figure 6 shows the working of a polyphase filter used as the interpolation filter for an interpolation multiple of 3. Equation 1 is then implemented as three filters that take a common input and whose outputs are multiplexed in time. The mapping of the filter taps to the cells is also shown in the figure. The delay limit register, rlimit, is programmed to be 2, for example by means of a dedicated instruction. Coefficients 0, 1, and 2 are stored in RI_0, RI_1 and RI_2 respectively. The coefficients are thus stored in consecutive registers which are delay addressed. The controller program executes three loops for every data sample period. Let the input data in a cell be stored in RI_3. Then an FIR tap can be modeled by the instruction RI_4 = RI_3*RI_2; since delay addressing is in effect, during the first program cycle RI_2 has coefficient C0, during the second cycle RI_2 has C1, and in the third RI_2 has C2. This is equivalent to the entire array being reconfigured to implement H1 605 in the first cycle, H2 606 in the second and H3 607 in the third. The filter output in each program cycle corresponds to the interpolation filter output, thereby inherently implementing the output multiplexer. Note that the state is shared between the filters; for a 9-tap filter and an interpolation factor of 3 there are only 3 states needed.
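The equivalence between zero stuffing followed by a full filter and the three time-multiplexed sub-filters can be checked numerically. The Python sketch below works at the level of the whole filter rather than the register file; the function names, the 9-tap example coefficients and the input values are hypothetical, and the list `cells[k]` merely stands in for the three coefficients that one delay-addressed operand would present to cell k over the three program cycles.

```python
def interpolate_zero_stuff(h, x, factor):
    """Reference: insert factor-1 zeros after each sample, then apply the full filter."""
    up = []
    for s in x:
        up.append(s)
        up.extend([0] * (factor - 1))
    return [sum(h[i] * up[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(up))]


def interpolate_polyphase(h, x, factor):
    """Polyphase form: 'factor' program cycles per input sample, shared state."""
    num_cells = len(h) // factor
    cells = [h[k * factor:(k + 1) * factor] for k in range(num_cells)]
    state = [0] * num_cells                   # x[n], x[n-1], ... one state per cell
    out = []
    for s in x:
        state = [s] + state[:-1]              # new input sample arrives
        for phase in range(factor):           # one sub-filter per program cycle
            out.append(sum(cells[k][phase] * state[k] for k in range(num_cells)))
    return out


h = [1, 2, 3, 4, 5, 6, 7, 8, 9]               # 9-tap example filter
x = [1, 2, -1, 3]
same = interpolate_polyphase(h, x, 3) == interpolate_zero_stuff(h, x, 3)   # True
```

The polyphase version keeps only three states for the nine taps and never multiplies by a stuffed zero, while producing the same output sequence as the reference.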
Application 3: Efficient Implementation of a Programmable Decimation Filter
The decimation filter is just the dual of the interpolation filter. Such a decimation filter is depicted in Figure 7. For a decimation factor of 3 710, two out of three output samples after filtering are discarded. This means that the discarded filter outputs need not be calculated in the first place. This structure can be derived by simply reversing the flow graph of the interpolator depicted in Figure 6, which results in the structure shown in Figure 7. However, unlike the interpolation structure of Figure 6, the states are not shared. The two output delays inherent in the system are shown at 720 and 730 in Figure 7. In order to maintain independent state registers, a second delay addressed register buffer is required, addressed by the same pointer register RD_X. An example implementation of just such a system is shown in Fig. 8. The two delay addressed register buffers are addressed in lock-step, fetching the corresponding pairs of coefficients and states.
To illustrate this, let the two delay addressed register buffers be labeled RI0_X 810 and RI1_X 820. Let the coefficients be stored in RI0_X 810; specifically, for the example of decimation by 3, let RI0_0 be C0, RI0_1 be C1 and RI0_2 be C2, as above. Let the incoming data be stored in RI1_X 820. Specifically, let the new data sample be stored in RI1_0, so that RI1_0 is Xn, RI1_1 is Xn-1 and RI1_2 is Xn-2. Let the parameter rlimit be 2 (modulo 3) as in the case of the interpolator example discussed above, setting up a delay line with three consecutive elements. In general, (rlimit+1) is the number of FIR taps being computed in one cell. An instruction like RI1_4 = RI0_0*RI1_0 models the FIR tap calculation. This actually implements C2*Xn-2, C1*Xn-1, C0*Xn in three consecutive cycles, generating time multiplexed outputs, which are synchronized using delays 720 and 730 and added outside of the cell. This is equivalent to the entire array being configured to perform filter H3 770 (with respect to Figure 7) in the first cycle, H2 760 in the second and H1 750 in the third cycle. The oldest data Xn-3, which is located in RI1_0 prior to being overwritten by the newest data Xn, is passed on to the next cell in the array.
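The lock-step addressing of a coefficient bank and a data bank through one pointer bank can be sketched for a single cell with three taps and decimation by 3. The Python below is a simplified model under stated assumptions: the function names are hypothetical, physical register p of the coefficient bank is assumed to hold C_p, and the accumulator stands in for the synchronizing delays and the addition that the description places outside the cell.

```python
def filter_then_discard(coeffs, samples):
    """Reference: compute every filter output, then keep one out of len(coeffs)."""
    m = len(coeffs)
    full = [sum(coeffs[i] * samples[n - i] for i in range(m) if n - i >= 0)
            for n in range(len(samples))]
    return full[m - 1::m]


def decimate_lockstep(coeffs, samples):
    """Two banks addressed by the same pointers; one product per sample period."""
    m = len(coeffs)                 # rlimit + 1
    ri0 = list(coeffs)              # coefficient bank: register p holds C_p
    ri1 = [0] * m                   # data bank
    rd = list(range(m))             # pointer bank shared by both data banks
    out, acc = [], 0
    for x in samples:
        rd = [(p - 1) % m for p in rd]
        ri1[rd[0]] = x
        acc += ri0[rd[0]] * ri1[rd[0]]     # models the product RI0_0 * RI1_0
        if rd[0] == 0:                     # C0 was just used: one output is complete
            out.append(acc)
            acc = 0
    return out


decimated = decimate_lockstep([1, 2, 3], [1, 2, 3, 4, 5, 6])   # [10, 28]
```

Because the shared pointer steps through registers 2, 1 and 0 in turn, the same instruction multiplies by C2, C1 and C0 in three consecutive cycles, and only the retained outputs are ever computed.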
While the invention has been described in detail with reference to various embodiments, it shall be appreciated that various changes and modifications are possible to those skilled in the art without departing from the gist of the invention. For example, one or more data register banks RI_X can be indexed by the same RD_X pointer register bank, each data register bank being addressed in lock step. As well, in other embodiments the data register bank and the pointer register bank can each be incremented at a rate different than the data sample rate. Thus, the scope of the invention is intended to be defined solely by the following claims.

Claims

1. A modulo addressable data path register file for a processor, comprising: a first set of registers (RD_X) (110, 210, 310, 410) and a second set of registers (RI_X) (120, 220, 320, 420); where the first set of registers (110, 210, 310, 410) stores addresses (150, 250, 350, 450) of the second set of registers (120, 220, 320, 420) and where the second set of registers (120, 220) stores data; and where two or more of the first set of registers (110, 210, 310, 410) are ordered sequentially in a circular structure such that the first register falls next in sequence after the last.
2. The register file of claim 1, where the registers in the circular structure (110, 210, 310, 410) change their contents according to the equation RD_X = RD_(X+k) (modulo M), where k is an integer, each time a processor loop begins a new iteration; and where the modulus M is equal to the number of registers from the first set (110, 210, 310, 410) used in the circular structure multiplied by |k|, for nonzero k, and by 1 for k=0.
3. The register file of claim 2, where k is one of 0, +/- 1, +/- 2, +/-3 or +/- 4.
4. The register file of claim 3, where the N registers in the first set (110, 210, 310, 410) are numbered from 0 to N-1.
5. The register file of claim 4, where the circular structure is used to store a sequence of N data samples, each being delayed one sample period from the prior sample in the sequence.
6. The register file of claim 5, where the parameter N is programmable at a startup of processor operation, and is equal to one greater than the maximum supported delay, as expressed in units of sample periods.
7. The register file of claim 6, where the samples are stored in the RI_X register set (120, 220, 320, 420).
8. The register file of claim 7, where the RI_X (120, 220, 320, 420) registers are pointed to by the RD_X registers in the circular structure.
9. The register file of claim 8, where the samples are stored in sequential locations in the RI_X register set (120, 220, 320, 420).
10. A multi processor system, comprising: a plurality of cells, each with an individual processor; where each cell has the register file of claim 6, and where the processor can execute instructions whose operands are the RD_X registers (110, 210, 310, 410).
11. The system of claim 10, where each cell has a programmable parameter which sets the value of N for that cell.
12. A method of optimizing digital signal processing, comprising: implementing modulo addressing in a first register bank (RD_X) (110, 210, 310, 410); enabling the processor to operate on data in a second register bank (120, 220, 320, 420) by operating on a register that points to the data.
13. The method of claim 12, where the registers in the first register bank (110, 210, 310, 410) change their contents according to either the equation RD_X = RD_(X+k) (modulo M), where k is an integer, each time a processor loop begins a new iteration; and where the modulus M is equal to a number equal to the registers in the first register bank (110, 210, 310, 410) used to point to data in the second register bank (120, 220, 320, 420) multiplied by |k| for nonzero k, and by 1 for k=0.
14. The method of claim 13, where an unused register in either the first (110, 210, 310, 410) or the second (120, 220, 320, 420) register banks stores the value of M.
15. The method of claim 14, where a dedicated register in the first register bank (110, 210, 310, 410) stores the value of M.
16. A method of implementing digital filtering, comprising: storing a current data sample and a number of prior data samples in a first register bank (120, 220, 320, 420); indexing said current sample and prior data samples by the relative delay to the current sample; and automatically updating the contents of the first register bank (120, 220, 320, 420) each sample period to write a new data sample over the most delayed sample stored in the register bank (120, 220, 320, 420).
17. The method of claim 16, where the indexing of the data samples in the first register bank (120, 220, 320, 420) is maintained by a second register bank (RD_X) (110, 210, 310, 410) which stores the addresses of the registers in the first register bank (120, 220, 320, 420).
18. The method of claim 17, where the second register bank (110, 210, 310, 410) is automatically incremented each sample period according to the equation RD_X = RD_(X+k) (modulo M), where k is an integer, each time a processor loop begins a new iteration; and where the modulus M is equal to a number equal to the registers in the first register bank (110, 210, 310, 410) used to point to data in the second register bank (120, 220, 320, 420) multiplied by |k|, for nonzero k, and by 1 for k=0.
19. A method of implementing digital filtering, comprising: storing a first data set comprising a current data sample and a number of prior data samples in a first register bank (810); storing one or more additional data sets, each comprising a current data sample and a number of prior data samples in an additional register bank (820); indexing each said data set (810, 820) by the relative delay of a sample to the current sample; and automatically updating the contents of each of the first register bank (810) and the additional register banks (820) each sample period to write a new data sample over the most delayed sample stored in each register bank.
20. The method of claim 19, where the indexing of the data samples in the first register bank (810) and each of the additional register banks (820) is maintained by a pointer register bank (RD_X) (800) which stores the addresses of the registers in the first (810) and each of the additional register banks (820).
21. The method of claim 20, where the pointer register bank (800) is automatically incremented each sample period according to the equation RD_X = RD_(X+k) (modulo M), where k is an integer, each time a processor loop begins a new iteration; and where the modulus M is equal to a number equal to the registers in the pointer register bank (800) used to point to data in the first (810) and each of the additional (820) register banks, multiplied by |k|, for nonzero k, and by 1 for k=0.
PCT/IB2002/005126 2001-12-21 2002-12-03 Programmable delay indexed data path register file for array processing WO2003054691A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR10-2004-7009643A KR20040069335A (en) 2001-12-21 2002-12-03 Programmable delay indexed data path register file for array processing
AU2002351109A AU2002351109A1 (en) 2001-12-21 2002-12-03 Programmable delay indexed data path register file for array processing
JP2003555339A JP2005513643A (en) 2001-12-21 2002-12-03 Programmable delay index type data path register file for array processing
EP02785822A EP1459168A1 (en) 2001-12-21 2002-12-03 Programmable delay indexed data path register file for array processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/026,258 US6970895B2 (en) 2001-10-01 2001-12-21 Programmable delay indexed data path register file for array processing
US10/026,258 2001-12-21

Publications (1)

Publication Number Publication Date
WO2003054691A1 true WO2003054691A1 (en) 2003-07-03

Family

ID=21830768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2002/005126 WO2003054691A1 (en) 2001-12-21 2002-12-03 Programmable delay indexed data path register file for array processing

Country Status (7)

Country Link
US (1) US6970895B2 (en)
EP (1) EP1459168A1 (en)
JP (1) JP2005513643A (en)
KR (1) KR20040069335A (en)
CN (1) CN1286003C (en)
AU (1) AU2002351109A1 (en)
WO (1) WO2003054691A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310133C (en) * 2004-08-04 2007-04-11 联合信源数字音视频技术(北京)有限公司 Video image pixel interpolation device

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7107401B1 (en) * 2003-12-19 2006-09-12 Creative Technology Ltd Method and circuit to combine cache and delay line memory
US7937557B2 (en) 2004-03-16 2011-05-03 Vns Portfolio Llc System and method for intercommunication between computers in an array
US7480689B2 (en) * 2004-11-19 2009-01-20 Massachusetts Institute Of Technology Systolic de-multiplexed finite impulse response filter array architecture for linear and non-linear implementations
US7904695B2 (en) 2006-02-16 2011-03-08 Vns Portfolio Llc Asynchronous power saving computer
KR100781358B1 (en) * 2005-10-21 2007-11-30 삼성전자주식회사 System and method for data process
US7966481B2 (en) 2006-02-16 2011-06-21 Vns Portfolio Llc Computer system and method for executing port communications without interrupting the receiving computer
US7904615B2 (en) 2006-02-16 2011-03-08 Vns Portfolio Llc Asynchronous computer communication
KR101241892B1 (en) * 2006-06-22 2013-03-11 엘지전자 주식회사 A receiving apparatus and a receiving method for broadcasting
WO2008101045A1 (en) * 2007-02-15 2008-08-21 Massachusetts Institute Of Technology Architecture for systolic nonlinear filter processors
EP1978449A2 (en) * 2007-04-06 2008-10-08 Technology Properties Limited Signal processing
JP2011090592A (en) * 2009-10-26 2011-05-06 Sony Corp Information processing apparatus and instruction decoder for the same
US8954714B2 (en) * 2010-02-01 2015-02-10 Altera Corporation Processor with cycle offsets and delay lines to allow scheduling of instructions through time

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5996063A (en) * 1997-03-03 1999-11-30 International Business Machines Corporation Management of both renamed and architected registers in a superscalar computer system
EP1150202A2 (en) * 2000-04-27 2001-10-31 Institute for the Development of Emerging Architectures, L.L.C. Method and apparatus for optimizing execution of load and store instructions

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5644677A (en) 1993-09-13 1997-07-01 Motorola, Inc. Signal processing system for performing real-time pitch shifting and method therefor
US5659700A (en) 1995-02-14 1997-08-19 Winbond Electronis Corporation Apparatus and method for generating a modulo address
KR100236536B1 (en) 1997-01-10 1999-12-15 윤종용 Modulo address generator
US6000834A (en) * 1997-08-06 1999-12-14 Ati Technologies Audio sampling rate conversion filter
US6366938B1 (en) * 1997-11-11 2002-04-02 Ericsson, Inc. Reduced power matched filter using precomputation
US6665695B1 (en) * 2000-01-14 2003-12-16 Texas Instruments Incorporated Delayed adaptive least-mean-square digital filter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BENNOUR I E ET AL: "REGISTER ALLOCATION USING CIRCULAR FIFOS", 1996 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). CIRCUITS AND SYSTEMS CONNECTING THE WORLD. ATLANTA, MAY 12 - 15, 1996, IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), NEW YORK, IEEE, US, vol. 4, 12 May 1996 (1996-05-12), pages 560 - 563, XP000704661, ISBN: 0-7803-3074-9 *
WONG K-L ET AL: "Fast address generation for the computation of prime factor algorithms", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH & SIGNAL PROCESSING, vol. 2, no. 14, 23 May 1989 (1989-05-23) - 26 May 1989 (1989-05-26), New York, US, pages 1091 - 1094, XP010083124 *

Also Published As

Publication number Publication date
CN1605061A (en) 2005-04-06
KR20040069335A (en) 2004-08-05
US20030062927A1 (en) 2003-04-03
JP2005513643A (en) 2005-05-12
AU2002351109A1 (en) 2003-07-09
CN1286003C (en) 2006-11-22
EP1459168A1 (en) 2004-09-22
US6970895B2 (en) 2005-11-29

Similar Documents

Publication Publication Date Title
US6970895B2 (en) Programmable delay indexed data path register file for array processing
US6665790B1 (en) Vector register file with arbitrary vector addressing
JPH11296493A (en) Reconstitutable coprocessor for data processing system
JPH11272631A (en) Data processing system and method therefor
KR20080053327A (en) Shared memory and shared multiplier programmable digital-filter implementation
JP2020109605A (en) Register files in multi-threaded processor
US7308559B2 (en) Digital signal processor with cascaded SIMD organization
US6088782A (en) Method and apparatus for moving data in a parallel processor using source and destination vector registers
JPH11163680A (en) Filter structure and method
US8166087B2 (en) Microprocessor performing IIR filter operation with registers
US4809208A (en) Programmable multistage digital filter
US4939684A (en) Simplified processor for digital filter applications
JP4955149B2 (en) Digital signal processor with bit FIFO
EP3559803B1 (en) Vector generating instruction
US6658440B1 (en) Multi channel filtering device and method
US7260709B2 (en) Processing method and apparatus for implementing systolic arrays
KR100188013B1 (en) Fir filter embodying method used for limited impulse response by element movement
JPH047910A (en) Integrated circuit device for signal processing
JP3701033B2 (en) Data-driven information processing device
Ferry Implementation of FIR filters for fast multi-channel processing
JP2005353094A (en) Product-sum computing unit
WO1994001827A1 (en) Adaptive canceller filter module
JP2006352724A (en) Digital filter

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2002785822

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 20028253388

Country of ref document: CN

Ref document number: 1020047009643

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2003555339

Country of ref document: JP

WWP Wipo information: published in national office

Ref document number: 2002785822

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2002785822

Country of ref document: EP