WO2010033298A1 - Address generation

Info

Publication number: WO2010033298A1
Application number: PCT/US2009/051224
Authority: WO
Inventors: Colin Stirling; David I. Lawrie; David Andrews
Original assignee: Xilinx, Inc.
Priority date: 2008-09-18
Filing date: 2009-07-21
Publication date: 2010-03-25
Also published as: CN102160032B; KR101263152B1; US20100070737A1; JP5242796B2; EP2329362B1; US8219782B2; JP2012503248A; CN102160032A; KR20110069108A; EP2329362A1

Abstract

Address generation by an integrated circuit (100) is described. An aspect relates generally to an address generator (220) which has first and second processing units (310, 320). The second processing unit (320) is coupled to receive a stage output from the first processing unit (310) and configured to provide an address output. The stage output is in a first range, and the address output is in a second range. The first range is from -K to -1 for a block size of K, and the second range is from 0 to K-1.

Description

ADDRESS GENERATION

FIELD OF THE INVENTION The invention relates to integrated circuit devices ("ICs"). More particularly, the invention relates to address generation by an IC.

BACKGROUND OF THE INVENTION

Programmable logic devices ("PLDs") are a well-known type of integrated circuit that can be programmed to perform specified logic functions. One type of PLD, the field programmable gate array ("FPGA"), typically includes an array of programmable tiles. These programmable tiles can include, for example, input/output blocks ("IOBs"), configurable logic blocks ("CLBs"), dedicated random access memory blocks ("BRAMs"), multipliers, digital signal processing blocks ("DSPs"), processors, clock managers, delay lock loops ("DLLs"), and so forth. As used herein, "include" and "including" mean including without limitation.

Each programmable tile typically includes both programmable interconnect and programmable logic. The programmable interconnect typically includes a large number of interconnect lines of varying lengths interconnected by programmable interconnect points ("PIPs"). The programmable logic implements the logic of a user design using programmable elements that can include, for example, function generators, registers, arithmetic logic, and so forth.

The programmable interconnect and programmable logic are typically programmed by loading a stream of configuration data into internal configuration memory cells that define how the programmable elements are configured. The configuration data can be read from memory (e.g., from an external PROM) or written into the FPGA by an external device. The collective states of the individual memory cells then determine the function of the FPGA.

Another type of PLD is the Complex Programmable Logic Device, or CPLD. A CPLD includes two or more "function blocks" connected together and to input/output ("I/O") resources by an interconnect switch matrix. Each function block of the CPLD includes a two-level AND/OR structure similar to those used in Programmable Logic Arrays ("PLAs") and Programmable Array Logic ("PAL") devices. In CPLDs, configuration data is typically stored on-chip in non-volatile memory. In some CPLDs, configuration data is stored on-chip in non-volatile memory, then downloaded to volatile memory as part of an initial configuration (programming) sequence.

For all of these programmable logic devices ("PLDs"), the functionality of the device is controlled by data bits provided to the device for that purpose. The data bits can be stored in volatile memory (e.g., static memory cells, as in FPGAs and some CPLDs), in non-volatile memory (e.g., FLASH memory, as in some CPLDs), or in any other type of memory cell.

Other PLDs are programmed by applying a processing layer, such as a metal layer, that programmably interconnects the various elements on the device. These PLDs are known as mask programmable devices. PLDs can also be implemented in other ways, e.g., using fuse or antifuse technology. The terms "PLD" and "programmable logic device" include but are not limited to these exemplary devices, as well as encompassing devices that are only partially programmable. For example, one type of PLD includes a combination of hard-coded transistor logic and a programmable switch fabric that programmably interconnects the hard-coded transistor logic.

Turbo-channel codes conventionally are used to code data. Turbo codes use data in the order in which it is received and in an interleaved order. Original data is therefore used twice. By turbo-channel codes, it is meant convolutional codes. The data is shuffled using an interleaver, and such interleaver may be part of an encoder, a decoder, or an encoder/decoder ("codec").

Data may be interleaved prior to encoding and then deinterleaved for decoding. Some coding systems, including either or both encoding and decoding systems, achieve high throughput through parallel processing. Data is generally interleaved by an encoder and deinterleaved by a decoder. Because decoding is more computationally intensive than encoding, and in order to achieve high overall system throughput, deinterleaving should be capable of being implemented in parallel in the decoder.

In the 3rd Generation Partnership Project ("3GPP"), a quadratic permutation polynomial ("QPP") interleaver is called out in the proposed Long Term Evolution ("LTE") 3GPP specification to facilitate contention-free addressing. Additional details regarding 3GPP LTE may be found at http://www.3gpp.org. In particular, the 3GPP TS 36.212 version 8.3.0 Technical Specification dated May 2008 discloses channel coding, multiplexing, and interleaving in section 5 thereof, particularly sub-sections 5.1.3, 5.1.4.1.1, and 5.2.2.8 describing a channel interleaver.

Using a QPP interleaver allows individual blocks of data to be split into multiple threads and processed in parallel. If multiple independent blocks of data each have their threads processed, then processing such threads of all such data blocks in parallel involves replicating the QPP interleaver. Accordingly, it should be appreciated that the size and performance of an interleaver circuit used to implement a QPP interleaver affects both efficiency of encoding and decoding turbo-channel codes.

SUMMARY OF THE INVENTION

An embodiment of an address generator comprises a first processing unit, and a second processing unit coupled to receive a stage output from the first processing unit and configured to provide an address output. The stage output is in a first range from -K to -1 for a block size of K, and the address output is in a second range from 0 to K-1.

In this embodiment, the address generator can be part of a coding device selected from a group consisting of an encoder, a decoder, and a codec, where the address generator provides the address output for quadratic permutation polynomial interleaving. The address output can include multiple address sequences. The first processing unit and the second processing unit respectively can be initialized with a first initialization value or a second initialization value. The first initialization value can be for a first sequence of the multiple address sequences, and the second initialization value can be for a second sequence of the multiple address sequences. The address output can be for at least part of an address sequence from 0 to K-1; the first processing unit can be initialized with a first initialization value and a second initialization value; and the second processing unit can be initialized with a third initialization value and a fourth initialization value.

In this embodiment, the first processing unit can comprise: a first adder; a first register, coupled to the first adder; a first multiplexer, coupled to the first register; a first subtractor, coupled to the first multiplexer and the first register; and a second register, coupled to the first subtractor, to output the stage output, where the stage output is fed back to the first adder. The first register can process a first sequence and the second register can simultaneously process a second sequence.

The second processing unit can comprise: a second adder to receive the stage output; a third register, coupled to the second adder; a second multiplexer, coupled to the third register; a third adder, coupled to the second multiplexer and the third register; and a fourth register, coupled to the third adder, to output the address output, where the address output can be fed back to an input of the second adder.

An embodiment of a method to generate addresses comprises: obtaining a step size and a block size; obtaining a first initialization value and a second initialization value; adding the step size to a difference to provide a first sum; subtracting either a null value or the block size from the first sum responsive to a sign bit of the first sum to provide another difference, where the other difference is in a range of -K to -1 for block size of K; registering the first sum or the other difference; and feeding back the other difference in order to add the other difference to the step size.

In this embodiment, the method can further comprise: generating a second sum by adding the other difference to a third sum; adding either the null value or the block size to the second sum in response to a sign bit of the second sum to provide another third sum, where the other third sum is in a range of 0 to K-1; registering the second sum or the other third sum; and feeding back the other third sum for another iteration of the step for adding to provide the second sum. The registering the first sum or the other difference can include registering the other difference within respective feedback loops for pipelined operation, and where registering the second sum or the other third sum can include registering the other third sum within respective feedback loops for pipelined operation. The registering the first sum or the other difference can include registering the first sum within respective feedback loops for pipelined operation, and where the registering the second sum or the other third sum can include registering the second sum within respective feedback loops for pipelined operation. The step of adding the step size to the difference to provide the first sum can be performed simultaneously with the step of adding to provide the second sum by addition of the other difference to the third sum. The method can further comprise providing the other third sum for quadratic permutation polynomial interleaving.

BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more aspects of the invention; however, the accompanying drawing(s) should not be taken to limit the invention to the embodiment(s) shown, but are for explanation and understanding only.

FIG. 1 is a simplified block diagram depicting an exemplary embodiment of a columnar Field Programmable Gate Array ("FPGA") architecture in which one or more aspects of the invention may be implemented.

FIG. 2 is a block diagram depicting an exemplary embodiment of an interleaver.

FIG. 3 is a circuit diagram depicting an exemplary embodiment of an address generator of the interleaver of FIG. 2.

FIG. 4 is a flow diagram depicting an exemplary embodiment of an address generation flow of the address generator of FIG. 3.

FIG. 5 is a pseudo-code listing depicting an exemplary embodiment of an address generation flow.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough description of the specific embodiments of the invention. It should be apparent, however, to one skilled in the art, that the invention may be practiced without all the specific details given below. In other instances, well known features have not been described in detail so as not to obscure the invention. For ease of illustration, the same number labels are used in different diagrams to refer to the same items; however, in alternative embodiments the items may be different.

As noted above, advanced FPGAs can include several different types of programmable logic blocks in the array. For example, FIG. 1 illustrates an FPGA architecture 100 that includes a large number of different programmable tiles including multi-gigabit transceivers ("MGTs") 101, configurable logic blocks ("CLBs") 102, random access memory blocks ("BRAMs") 103, input/output blocks ("IOBs") 104, configuration and clocking logic ("CONFIG/CLOCKS") 105, digital signal processing blocks ("DSPs") 106, specialized input/output blocks ("I/O") 107 (e.g., configuration ports and clock ports), and other programmable logic 108 such as digital clock managers, analog-to-digital converters, system monitoring logic, and so forth. Some FPGAs also include dedicated processor blocks ("PROC") 110.

In some FPGAs, each programmable tile includes a programmable interconnect element ("INT") 111 having standardized connections to and from a corresponding interconnect element in each adjacent tile. Therefore, the programmable interconnect elements taken together implement the programmable interconnect structure for the illustrated FPGA. The programmable interconnect element 111 also includes the connections to and from the programmable logic element within the same tile, as shown by the examples included at the top of FIG. 1.

For example, a CLB 102 can include a configurable logic element ("CLE") 112 that can be programmed to implement user logic plus a single programmable interconnect element ("INT") 111. A BRAM 103 can include a BRAM logic element ("BRL") 113 in addition to one or more programmable interconnect elements. Typically, the number of interconnect elements included in a tile depends on the height of the tile. In the pictured embodiment, a BRAM tile has the same height as five CLBs, but other numbers (e.g., four) can also be used. A DSP tile 106 can include a DSP logic element ("DSPL") 114 in addition to an appropriate number of programmable interconnect elements. An IOB 104 can include, for example, two instances of an input/output logic element ("IOL") 115 in addition to one instance of the programmable interconnect element 111. As will be clear to those of skill in the art, the actual I/O pads connected, for example, to the I/O logic element 115 typically are not confined to the area of the input/output logic element 115.

In the pictured embodiment, a columnar area near the center of the die (shown in FIG. 1) is used for configuration, clock, and other control logic. Horizontal areas 109 extending from this column are used to distribute the clocks and configuration signals across the breadth of the FPGA.

Some FPGAs utilizing the architecture illustrated in FIG. 1 include additional logic blocks that disrupt the regular columnar structure making up a large part of the FPGA. The additional logic blocks can be programmable blocks and/or dedicated logic. For example, processor block 110 spans several columns of CLBs and BRAMs.

Note that FIG. 1 is intended to illustrate only an exemplary FPGA architecture. For example, the numbers of logic blocks in a column, the relative width of the columns, the number and order of columns, the types of logic blocks included in the columns, the relative sizes of the logic blocks, and the interconnect/logic implementations included at the top of FIG. 1 are purely exemplary. For example, in an actual FPGA more than one adjacent column of CLBs is typically included wherever the CLBs appear, to facilitate the efficient implementation of user logic, but the number of adjacent CLB columns varies with the overall size of the FPGA.

As previously described, a QPP interleaver is specified in an LTE 3GPP specification, and such QPP interleaver may be formulated as a quadratic equation modulo the block size, K. A direct implementation of the specified QPP interleaving process would involve complex multiplication and complex modulo operations, which are extremely inefficient for implementation in hardware. A more efficient hardware implementation is described in co-pending U.S. Patent Application entitled "Address Generation for Quadratic Permutation Polynomial Interleaving" by Ben J. Jones et al., assigned application number 12/059,731, filed March 31, 2008 (Attorney Docket No. X-2726 US) [hereinafter "Jones"]. Jones shows and describes how the quadratic formula may be reduced to produce a circuit which may be implemented using adders, subtractors, and selection circuits, such as multiplexers. As described below in additional detail, an even further simplified circuit for address generation for interleaving may be obtained by removing selection operations associated with Jones and reducing the number of adders and subtractors of Jones. Furthermore, such reduction of circuitry in turn reduces register count in comparison to Jones, but as shall be appreciated from the following description such simplified address generator has the same or comparable performance to that of Jones. Another reduction in comparison to Jones is elimination of registers between first and second stages, allowing control logic to be further simplified as initialization values may be applied simultaneously as described below in additional detail.

Even though the following description is in terms of an LTE 3GPP QPP interleaver and address sequence therefor, it should be appreciated that other address sequences may be used. An LTE 3GPP QPP interleaver has an address sequence as defined by:

$$\Pi(x) = (f_1 x + f_2 x^2) \bmod K, \quad \text{where } 0 \le x, f_1, f_2 < K, \qquad (1)$$

where f₁ and f₂ are coefficients of the polynomial, x is an increment in a linear sequence from 0 to K-1, and K is block size. An x-th interleaved address may be obtained by using Equation (1), where f₁ and f₂ are fixed coefficients for any integer block size, K. Accordingly, the sequence of addresses for increments of x are from 0 to K-1 in a permutated order for x. It should be understood that even though a sequence is described as going from 0 to K-1, a sequence need not start at 0 and need not go all the way to K-1, namely it need not step through each linear increment of the sequence for all K increments. Furthermore, there may be a skip value for skipping linear increments for generating a sequence. Again, it should be appreciated that a block of data may be broken out into multiple threads or streams for processing in parallel as described below in additional detail.
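As a point of reference for the simplified circuit described later, the following minimal Python sketch evaluates Equation (1) directly, with the multiplications and modulo operation that a direct implementation would require. The function names are hypothetical, and the parameters K = 40, f₁ = 3, f₂ = 10 are assumed here purely for illustration; they are not stated in this description.

```python
def qpp_address(x, f1, f2, K):
    """Evaluate Equation (1) directly: (f1*x + f2*x^2) mod K."""
    return (f1 * x + f2 * x * x) % K


def qpp_sequence(f1, f2, K):
    """Interleaved addresses for x = 0, 1, ..., K-1."""
    return [qpp_address(x, f1, f2, K) for x in range(K)]


# Illustrative parameters (assumed, not taken from this description).
addresses = qpp_sequence(f1=3, f2=10, K=40)
print(addresses[:5])

# For a valid coefficient pair, every address 0..K-1 appears exactly once.
print(sorted(addresses) == list(range(40)))
```

Each call uses two multiplications and a modulo reduction, which is the cost the difference-based approach is intended to avoid.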

As indicated in Jones, a first derivation of Equation (1) is:

$$\Pi'(x) = [f_2(2nx + n^2) + f_1 n] \bmod K, \qquad (2)$$

and a second derivation of Equation (1) is:

$$\Pi''(x) = [2n^2 f_2] \bmod K. \qquad (3)$$

In Equations (2) and (3), n is a skip value which may be any integer value greater than 0. Thus, for example, if n is equal to 1, there is no skipping and each linear increment of a sequence, 0, 1, 2,..., to some number which may be as large as K-1, is processed in order to provide at most K interleaved addresses for such sequence. Thus, the skip value, n, may be used to determine the stride or jump in an interleaved address sequence generated.
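One way to see where Equations (2) and (3) come from, writing Π(x) for the address of Equation (1), Π'(x) for the quantity of Equation (2), and Π''(x) for the quantity of Equation (3), is to take differences between terms that are n increments apart. This is a sketch of the algebra, not a reproduction of the derivation in Jones:

```latex
\begin{align*}
\Pi(x+n) - \Pi(x)
  &\equiv f_1 (x+n) + f_2 (x+n)^2 - f_1 x - f_2 x^2 \\
  &\equiv f_2 (2nx + n^2) + f_1 n \pmod{K}, \\[4pt]
\Pi'(x+n) - \Pi'(x)
  &\equiv f_2 \bigl(2n(x+n) + n^2\bigr) - f_2 (2nx + n^2) \\
  &\equiv 2 n^2 f_2 \pmod{K}.
\end{align*}
```

The first difference is linear in x and the second difference does not depend on x, so each new address can be reached from the previous one by additions followed by a range correction.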

Again, when n is set to 1, a complete sequence of K addresses may be generated; however, if n is set to an integer value larger than 1 then a subset of addresses of a sequence may be generated. For example, if n is set equal to 2, then every other address in a sequence may be generated starting from 0, namely 0, 2, 4,..., K-2. Because the difference between successive terms in Equations (2) and (3) is a linear function and a constant, respectively, the circuit may be implemented using only add, subtract, and select operations, as described below in additional detail, for generating addresses of a sequence. Additionally, for purposes of pipelining multiple sequences, namely multiple threads or streams, where multiple streams are processed with one another, temporary storing operations, such as registering operations, may be added. Thus, as should be appreciated from the following description, multiple phases or sequences may be pipelined in a circuit implementation of an address generator to enhance throughput for generating interleaved addresses. Alternatively, depending on the parallel nature of turbo-code processing blocks, pipelining may be used to generate interleaved address sequences for different threads of a single or multiple blocks of data in an alternating manner. Thus it should be appreciated that many different sequence start points, namely many different starting points for x, and/or skip values, n, may be supported for a variety of data blocks. Initialization values may be predetermined and stored in memory for initialization of address generation for a sequence.
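The add, subtract, and select idea can be sketched in Python as follows. This generic version keeps both running values in the range 0 to K-1 and compares against K, so it is not the range-shifted arrangement of FIG. 3; it only illustrates that no multiplication or modulo operation is needed inside the loop. The function name, the `count` parameter, and the use of `%` to precompute the three starting constants are assumptions made for this sketch.

```python
def qpp_by_differences(f1, f2, K, n=1, x0=0, count=None):
    """Generate Pi(x0), Pi(x0+n), ... using only add/subtract/select per step."""
    if count is None:
        count = (K - x0 + n - 1) // n  # number of x values x0, x0+n, ... below K

    # Starting constants; in hardware these would be precomputed and stored.
    addr = (f1 * x0 + f2 * x0 * x0) % K                # Equation (1) at x0
    diff = (f2 * (2 * n * x0 + n * n) + f1 * n) % K    # Equation (2) at x0
    step = (2 * n * n * f2) % K                        # Equation (3)

    out = []
    for _ in range(count):
        out.append(addr)
        addr += diff          # add the first difference
        if addr >= K:         # select: subtract K or nothing
            addr -= K
        diff += step          # add the constant second difference
        if diff >= K:
            diff -= K
    return out


# Illustrative parameters (assumed): K = 40, f1 = 3, f2 = 10, every address.
print(qpp_by_differences(3, 10, 40)[:5])
```

Because both running values stay below K, each sum is below 2K and a single conditional subtraction is enough to bring it back into range.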

FIG. 2 is a block diagram depicting an exemplary embodiment of an interleaver 200. Interleaver 200 may be part of a decoder, an encoder, or a codec. More particularly, interleaver 200 may be associated with convolutional codes, such as turbo-channel codes for QPP interleaving. Block size 201 may be input to storage 210, which may be part of or separate from interleaver 200. Storage 210 may be a look-up table, a random access memory, or other form of storage. Additionally, block size 201 may be input to address generator 220. Another input to storage 210 may be skip value 202. With block size 201 and skip value 202, initialization values 203 and step size 204 may be obtained from storage 210 for providing to address generator 220. Address generator 220 produces addresses 221 to provide one or more sequences of addresses.

FIG. 3 is a circuit diagram depicting an embodiment of address generator 220 of FIG. 2. Address generator 220 includes a first stage address engine 310 and a second stage address engine 320. First stage address engine 310 is an initial stage for address generation and generates a stage output 302. Stage output 302 is provided to second stage address engine 320 for generating at least one sequence of addresses 221.
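The role of storage 210 as a look-up table can be sketched as a dictionary keyed by block size and skip value. Everything in this sketch is an assumption made for illustration: the table layout, the names `COEFFICIENTS` and `build_storage`, the coefficient pairs, and the choice of one stream per starting point x = 0 to n-1.

```python
# Hypothetical coefficient table: block size K -> (f1, f2). Assumed values.
COEFFICIENTS = {40: (3, 10), 6144: (263, 480)}


def build_storage(block_sizes, skip_values):
    """Precompute step size and per-stream initialization values."""
    storage = {}
    for K in block_sizes:
        f1, f2 = COEFFICIENTS[K]
        for n in skip_values:
            step = (2 * n * n * f2) % K                      # Equation (3)
            inits = []
            for x0 in range(n):  # one stream per start point (an assumption)
                a_init = (f1 * x0 + f2 * x0 * x0) % K        # Equation (1)
                i_init = (f2 * (2 * n * x0 + n * n) + f1 * n) % K  # Equation (2)
                inits.append((i_init, a_init))
            storage[(K, n)] = {"step": step, "inits": inits}
    return storage


storage_210 = build_storage([40, 6144], [1, 2])
entry = storage_210[(40, 2)]   # look up by block size 201 and skip value 202
print(entry["step"], entry["inits"])
```

Precomputing these constants offline keeps multiplication and modulo operations out of the address generator itself.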

First stage address engine 310 includes adder 311, subtractor 312, and a select circuit, such as multiplexer 313. For this exemplary embodiment, first stage address engine 310 includes registers 314 and 315. For a single stream/sequence, only one register, namely either register 314 or 315, may be implemented within the feedback loop of first stage address engine 310. The setup of registers in first stage engine 310 mirrors that of second stage engine 320 to ensure that the values for a particular stream/sequence are coincident at the input to adder 321 from stage output 302 at the same point in time for iterations. However, pipelining may be used to enhance throughput. Additionally, by having at least one each of registers 314 and 315, two sequences of addresses, namely two threads or streams, may be generated together. Furthermore, even though only one of each of registers 314 and 315 is illustratively shown, it should be appreciated that more than one of each of registers 314 and 315 may be implemented. For example, if there were two of each of registers 314 and 315, then as many as four threads or streams of sequences may be generated with pipelined concurrency. It should be understood that streams are generated on alternate clock cycles. Furthermore, edge triggered flip-flops may be used to generate streams on alternate edges. For purposes of clarity by way of example and not limitation, it shall be assumed that there is only one each of registers 314 and 315.

As previously described, initialization values 203 may be obtained from storage 210. These initialization values are indicated as initialization value I(x) 203-1 and initialization value A(x) 203-2.

Second stage address engine 320 includes adder 321, adder 322, and select circuitry, such as multiplexer 323. Additionally, if pipelining is used, second stage address engine 320 may include at least one register 324 and at least one register 325. Again, there may be at least one of registers 324 and 325 or multiples of each of registers 324 and 325 as previously described with reference to registers 314 and 315. Again, however, for purposes of clarity by way of example and not limitation, it shall be assumed that there is one each of registers 324 and 325. At this point, it should be understood that address engines 310 and 320 may be implemented with three adders, one subtractor, and two select circuits.

Initialization value I(x) 203-1 is provided as a loadable input to loadable adder 311. On an initial clock cycle of clock signal 301, which is provided to a clock portion of each of registers 314, 315, 324, and 325, output of adder 311 uses initialization value I(x) 203-1 as its initial valid output for a sequence. Likewise, for an initial cycle of a sequence, initialization value A(x) 203-2, which is provided as a loadable input to loadable adder 321, is used for an initial valid output therefrom. A step size 204 is provided as a data input to adder 311. Another data input to adder 311 is stage output 302, which is provided as a feedback input. Accordingly, step size 204 may be added with initial stage output 302 for output after an initialization value I(x) 203-1 is output from such adder. More particularly, for the exemplary embodiment of FIG. 3, because registers 314 and 315 are present, a first initialization value applied at adder 311 is not fed back as feedback 302 to adder 311 until two clock cycles later when it should have step value 204 added to it (i.e., in the third cycle). On the second cycle, an additional initialization value may be applied for a second sequence/stream as supported when there are two registers/pipe-stages within the feedback loop.

Output of adder 311 is provided to a data input port of register 314. Output of register 314 is provided to a plus port of subtractor 312. Additionally, a sign bit, such as a most significant bit ("MSB") 316 is obtained from the output of register 314 as a control select signal of multiplexer 313. It should be appreciated that the MSB output from register 314 is also provided to the plus port of subtractor 312.

A logic 0 port of multiplexer 313 is coupled to receive block size 201, and a logic 1 port of multiplexer 313 is coupled to receive logic 0s 330. If MSB bit 316 is a logic 1 indicating a negative value, then multiplexer 313 outputs logic 0s 330, namely a null value. If, however, MSB bit 316 is a logic 0 indicating output of register 314 is a positive value, then multiplexer 313 outputs block size 201.

Output of multiplexer 313 is provided to a minus port of subtractor 312 for subtracting from the data input to a plus port thereof. Alternatively, multiplexer 313 and subtractor 312 in combination may be considered a loadable adder, where the value to be loaded is the candidate to be subtracted from (i.e., connected to plus input port) and the load control bit is the MSB of this value. Accordingly, it should be appreciated that if output from register 314 is positive, subtraction of block size 201, namely adding -K, forces output of subtractor 312 to be negative, namely in a range of -K to -1. If output of register 314 is already negative, subtracting logic 0s 330 from such output has no effect, and thus output of subtractor 312 is the negative output of register 314. Accordingly, output of subtractor 312 is in a range of -K to -1 for input to a data port of register 315. Output of register 315 is stage output 302. Thus, stage output 302 will be in a range of -K to -1, for K being block size 201. Thus, first stage address engine 310 shifts the range to negative values, namely a move of -K.

Stage output 302 from first stage address engine 310 is provided to a data port of adder 321 for addition with an address 221. Address 221 is an address output from register 325 and provided as a feedback address. It should be appreciated that a sequence of addresses 221 is produced from multiple clock cycles during operation. On clock cycles where valid data is output from address generator 220, address 221 constitutes an address output forming part of an address sequence.
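The range shift performed by first stage address engine 310 can be modeled behaviorally for a single sequence. This sketch ignores pipeline registers 314 and 315 and treats the sign bit as a Python comparison; the function name and the example values (K = 40, initialization value 13, step size 20) are assumptions for illustration.

```python
def first_stage(i_init, step, K, count):
    """Return `count` stage outputs 302 for one sequence (registers ignored)."""
    outputs = []
    adder_311 = i_init                       # loaded on the initial cycle
    for _ in range(count):
        msb_316 = adder_311 < 0              # sign bit: True means negative
        mux_313 = 0 if msb_316 else K        # logic 0s 330 or block size 201
        stage_output = adder_311 - mux_313   # subtractor 312
        outputs.append(stage_output)
        adder_311 = stage_output + step      # adder 311 with feedback 302
    return outputs


# Illustrative values (assumed): K = 40, I(x) = 13, step size = 20.
print(first_stage(13, 20, 40, 4))
```

The only decision inside the loop depends on a sign bit, so no magnitude comparison against K is needed; each output is congruent modulo K to the corresponding value of Equation (2).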

After outputting an initial initialization value A(x) 203-2, loadable adder 321 may output the sum of a feedback address 221 and a stage output 302. On a next cycle, another initialization value for another sequence may be applied, as previously described with reference to loadable adder 311 and not repeated here for purposes of clarity. Output from loadable adder 321 is provided to a data port of register 324. Output of register 324 is provided to a data port of adder 322, and a sign bit, such as an MSB bit 326, output from register 324 is provided as a control select signal to multiplexer 323 as well as being provided to a data port of adder 322.

A logic 0 port of multiplexer 323 is coupled to receive logic 0s 330, and a logic 1 port of multiplexer 323 is coupled to receive block size 201. For MSB bit 326 being a logic 0, namely indicating that output of register 324 is positive, multiplexer 323 selects logic 0s 330 for output. If, however, MSB bit 326 is a logic 1 indicating that output of register 324 is a negative value, then multiplexer 323 selects block size 201 for output.

Output of multiplexer 323 is provided to a data input port of adder 322. Adder 322 adds the output from register 324 with the output from multiplexer 323. Accordingly, it should be appreciated that output of adder 322 is in a positive range, namely from 0 to K-1. In other words, by adding K back in address engine 320, the shift or move of values by -K in address engine 310 is effectively neutralized, namely has no net effect on the calculation.

Output of adder 322, which is in a range of 0 to K-1, is provided to data input port of register 325. Output of register 325 is an address 221, which is fed back to adder 321 and which is used as part of an address sequence.
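Putting the two engines together, a behavioral model for one sequence shows how the shift to the range -K to -1 in engine 310 lets engine 320 correct its sum using only the sign bit. As before, registers are ignored, and the function name and example values are assumptions for illustration rather than part of the described embodiment.

```python
def generate_addresses(a_init, i_init, step, K, count):
    """Return `count` addresses 221 for one sequence (registers ignored)."""
    addresses = []
    adder_311, adder_321 = i_init, a_init    # both loaded on the initial cycle
    for _ in range(count):
        # Engine 310: subtract K unless already negative -> range -K..-1.
        stage_output = adder_311 - (0 if adder_311 < 0 else K)
        # Engine 320: add K only if negative -> range 0..K-1.
        address = adder_321 + (K if adder_321 < 0 else 0)
        addresses.append(address)
        # Both sums for the next iteration are formed concurrently.
        adder_311 = stage_output + step       # adder 311
        adder_321 = address + stage_output    # adder 321
    return addresses


# Illustrative values (assumed): K = 40, f1 = 3, f2 = 10, n = 1, x starts at 0,
# giving A(x) = 0, I(x) = 13, and step size 20.
print(generate_addresses(0, 13, 20, 40, 5))
```

Since the stage output is always negative and the fed-back address is never negative, their sum lies between -K and K-2, so one sign-controlled addition of K is sufficient.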

First stage address engine 310 and second stage address engine 320 may be implemented with respective DSPs 106 and CLBs 102 of FPGA 100 of FIG. 1. Alternatively, only CLBs 102 may be used for implementing engines 310 and 320. By implementing an address engine stage with one each of a CLB and a DSP, operating multiple address engines in parallel is facilitated, as so few resources are consumed by each address engine stage. In other words, because so few circuit components may be used to provide address generator 220, there are more opportunities for implementing multiple address generators within an FPGA.

In the exemplary embodiment of FIG. 3, address engines 310 and 320 are coupled in series and thus have a sequential operation. However, it should be understood that address engines 310 and 320 are operated concurrently for processing a sequence. Thus, for the exemplary embodiment having registers 314, 315, 324, and 325, rather than having a four cycle latency before a valid address 221 is output as part of an address sequence, there is only a two cycle latency. This is described in additional detail with reference to FIG. 4, where there is shown a flow diagram depicting an exemplary embodiment of an address generation flow 400 of address generator 220 of FIG. 3. Flow 400 is further described with simultaneous reference to FIGS. 3 and 4.
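The two cycle latency and the interleaving of two streams can be illustrated with a cycle-level model in which registers 314, 315, 324, and 325 are updated together on each clock edge. This is a simplified sketch under stated assumptions: exactly two streams are loaded on the first two cycles, the load control is modeled by the cycle count, and the helper names and parameters (K = 40, f₁ = 3, f₂ = 10, n = 2) are hypothetical.

```python
def init_values(f1, f2, K, n, x0):
    """Return (I(x0), A(x0), step) from Equations (2), (1), and (3)."""
    a_init = (f1 * x0 + f2 * x0 * x0) % K
    i_init = (f2 * (2 * n * x0 + n * n) + f1 * n) % K
    return i_init, a_init, (2 * n * n * f2) % K


def run_pipeline(K, step, inits, cycles):
    """Return the content of register 325 after each clock edge."""
    assert len(inits) == 2, "model assumes two registers per feedback loop"
    r314 = r315 = r324 = r325 = None          # None marks not-yet-valid data
    trace = []
    for cycle in range(cycles):
        # Combinational logic evaluated from the current register contents.
        if cycle < len(inits):
            add311, add321 = inits[cycle]     # load I(x) and A(x) together
        else:
            add311 = r315 + step              # adder 311
            add321 = r325 + r315              # adder 321
        sub312 = None if r314 is None else r314 - (0 if r314 < 0 else K)
        add322 = None if r324 is None else r324 + (K if r324 < 0 else 0)
        # Clock edge: all four registers update simultaneously.
        r314, r315, r324, r325 = add311, sub312, add321, add322
        trace.append(r325)
    return trace


K, f1, f2, n = 40, 3, 10, 2                   # illustrative, assumed values
i_even, a_even, step = init_values(f1, f2, K, n, x0=0)
i_odd, a_odd, _ = init_values(f1, f2, K, n, x0=1)
trace = run_pipeline(K, step, [(i_even, a_even), (i_odd, a_odd)], cycles=10)
print(trace)
```

In this model the first valid address appears in register 325 after the second clock edge, and the even and odd streams then alternate on successive cycles.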

At 401, block and skip sizes, such as block size 201 and skip size 202, are obtained. At 402, initialization values, such as initialization values I(x) 203-1 and A(x) 203-2, and a step size, such as step size 204, are obtained from storage responsive to values obtained at 401. At 403, a sum is generated, such as by adder 311, as previously described. At 404, a sum is generated by adder 321, as previously described. It should be appreciated that sums generated at 403 and 404 are generated concurrently, namely in parallel.

At 405, the sum generated at 403 is used in generating a difference, such as by subtractor 312. Again, this difference is in a range of -K to -1. The difference generated at 405 is provided for generating another sum at 404 on a next cycle.

At 406, a sum is generated, such as by adder 322, using the sum generated at 404. Again, generating of a difference at 405 and generating of a sum at 406 were previously described with reference to FIG. 3, and are not repeated here for purposes of clarity. Again, the range of the sum generated at 406 is from 0 to K-1. Furthermore, an address may be output at 406, such as address 221.
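The two ranges stated for 405 and 406 can be checked with a short worked bound. This is a sketch under one assumption that is not stated in this passage, namely that step size s has already been reduced modulo K so that 0 ≤ s ≤ K-1. The symbols d (stage output), a (address output), and the primed next-cycle values are notation introduced here for illustration only.

```latex
\begin{aligned}
&\text{First stage: } -K \le d \le -1,\quad 0 \le s \le K-1
  \;\Rightarrow\; -K \le d+s \le K-2,\\
&d' = \begin{cases} d+s-K, & d+s \ge 0\\ d+s, & d+s < 0 \end{cases}
  \;\Rightarrow\; -K \le d' \le -1,\\[4pt]
&\text{Second stage: } 0 \le a \le K-1,\quad -K \le d \le -1
  \;\Rightarrow\; -K \le a+d \le K-2,\\
&a' = \begin{cases} a+d+K, & a+d < 0\\ a+d, & a+d \ge 0 \end{cases}
  \;\Rightarrow\; 0 \le a' \le K-1.
\end{aligned}
```

In each stage, a single conditional correction by K, selected by the sign of the intermediate sum, is therefore enough to stay in range.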

The address output at 406 is fed back to generate another sum at 404, in case the sequence is not completed. Moreover, the difference generated at 405 is fed back to generate another sum at 403, in case the sequence is not completed.

From output at 406, it may be determined whether the sequence is to be incremented at 407. For a hardware implementation, a counter (not shown) coupled to receive clock signal 301 may be preset for a linear sequence responsive to a step size 204 and/or a block size 201. However, for an implementation in software, including firmware, a decision may be made. If the sequence is to be incremented, then at 408 the sequence is incremented, namely x, or i as described below, is incremented, for generating other sums at 403 and 404 on a next clock cycle. Accordingly, the sequence of operations may be in hardware, software, or a combination thereof.

If at 407, it is determined that the sequence is not to be incremented, then at 409, it may be determined whether there is another sequence to be processed. If at 409 it is determined that another sequence is to be processed, then flow 400 returns to 401 for obtaining block and skip sizes for such other sequence. If there is no additional sequence to be processed, then flow 400 ends at 499.
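The per-cycle behavior of flow 400 can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the circuit of FIG. 3: the function names are hypothetical, the pipeline registers and their latency are not modeled, a zero sum is treated as non-negative (as a sign bit would), and the initial increment is assumed to be supplied in the range 0 to K-1 and is mapped into -K to -1 before the first cycle.

```python
def first_stage(stage_out, step, K):
    """Model of 403 and 405: add the step size, then subtract K or a null value."""
    total = stage_out + step                      # sum at 403 (adder 311)
    return total - K if total >= 0 else total     # difference at 405, in -K..-1


def second_stage(address, stage_out, K):
    """Model of 404 and 406: add the stage output, then add K or a null value."""
    total = address + stage_out                   # sum at 404 (adder 321)
    return total + K if total < 0 else total      # sum at 406, in 0..K-1


def generate(a0, i0, step, K, count):
    """Yield `count` addresses from initial address a0 and initial increment i0."""
    address = a0 % K
    stage_out = (i0 % K) - K                      # assumed mapping into -K..-1
    for _ in range(count):
        yield address
        # Both engines consume the values held before this cycle,
        # which models their concurrent operation.
        address, stage_out = (
            second_stage(address, stage_out, K),
            first_stage(stage_out, step, K),
        )
```

For example, `list(generate(0, 47, 64, 256, 256))` runs one full sequence for a block size of 256; the initial values 47 and 64 are illustrative and are not taken from the description.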

FIG. 5 is a pseudo-code listing depicting an exemplary embodiment of an address generation flow 500. Values are set and initialized as generally indicated at 501 for loop 502.

For FIG. 5, it is assumed that block size K is equal to 256 for a turbo code and that skip value n is equal to two, namely two phases or two sequences being processed simultaneously, for setting block and skip sizes at 503. For this exemplary embodiment, the sequences are an odd sequence and an even sequence. For the even sequence, x starts at 0, and for the odd sequence x starts at 1. Accordingly, for the even sequence, initialization value ("A_cand[x]") 203-2(even) is Equation (1) with x equal to 0. Furthermore, for the even sequence, initialization value ("I_cand[x]") 203-1(even) is Equation (2) with x set equal to 0. It should be appreciated that both initialization values 203-1 and 203-2 for an even sequence reduce to respective constants, as coefficients f1 and f2 are constants.

For an odd sequence, x starts at 1, and thus substituting x equal to 1 in Equation (1) yields an initialization value 203-2(odd), and substituting x equal to 1 in Equation (2) yields initialization value 203-1(odd). Likewise, it should be appreciated that initialization values 203-1 and 203-2 for an odd sequence each reduce to constants.

Step size 204 is not dependent on x as indicated in Equation (3), and thus step size ("s") 204 is a constant value. By constant values with respect to initialization values 203-1 and 203-2 for odd and even sequences, as well as step size 204, it should be understood that these are constants for one or more sequences of a data block. In this example, there are two threads or streams, but more than two threads may be implemented. As x is incremented as part of a linear sequence, initialization address candidate ("A_cand[x]") and increment candidate ("I_cand[x]") progress for each increase in x. Thus for a first phase, namely an even sequence in this example, x is of the sequence 0, 2, 4,...,K-2, and for a second phase, x has a progression of 1, 3, 5,...,K-1, for this exemplary embodiment.
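The initialization constants for the two phases can be illustrated with a short sketch. Equations (1) through (3) are not reproduced in this passage, so the forms below are assumptions based on the usual quadratic permutation polynomial, A(x) = (f1·x + f2·x²) mod K, with the increment and step size derived from it. The coefficient values f1 = 15 and f2 = 32 and all function names are likewise illustrative assumptions.

```python
K = 256            # block size from the example
N = 2              # skip value n from the example
F1, F2 = 15, 32    # assumed, illustrative QPP coefficients


def a_init(x):
    """Assumed form of Equation (1): address candidate at position x."""
    return (F1 * x + F2 * x * x) % K


def i_init(x):
    """Assumed form of Equation (2): increment from position x to x + N."""
    return (F1 * N + F2 * N * N + 2 * F2 * N * x) % K


def step_size():
    """Assumed form of Equation (3): change in the increment, independent of x."""
    return (2 * F2 * N * N) % K


# One (address, increment) pair per phase: even starts at x = 0, odd at x = 1.
init = {name: (a_init(x), i_init(x)) for name, x in (("even", 0), ("odd", 1))}
```

Under these assumed values, the even phase starts from the pair (0, 158), the odd phase from (47, 30), and the step size evaluates to 0 because 2·f2·n² is a multiple of 256. Other coefficient choices would give a nonzero step.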

An address candidate is positive on a first iteration for a sequence, so it may be output directly. Furthermore, an increment candidate is positive on a first iteration for a sequence, so has a block size subtracted therefrom. Thus, for x equal to 0, the first address value output for the even sequence is initialization value 203-2(even), namely 0, and the initial stage output for such first iteration is initialization value 203-1(even) minus K. By first iteration, it should be understood that there may be some cycle latency as previously described, and thus the first iteration means the first valid output. For the second iteration, namely the second valid output but the first for the odd sequence, the address candidate is positive and thus it may be output directly, namely without addition of K, and the increment candidate is positive on the second iteration, so it has the block size subtracted from it. Thus, on a second iteration, initialization value 203-2(odd) is output as address 221 of FIG. 3, and initialization value 203-1(odd) minus K is output as stage output 302.

Again, step size 204 is a constant which may be initialized as it depends only on skip value n for both odd and even phases. In other words, both odd and even phases have the same step sizes. It is not necessary that skip value be set for n equal to 2. In other words, larger skip values may be used or skip value n may be set equal to 1. Furthermore, even though a block size of K equal to 256 is described for purposes of clarity by way of example and not limitation, it should be understood that block sizes greater than or less than 256 may be used. Furthermore, even though a fixed block size is used for this example for purposes of clarity, it should be appreciated that a variable block size may be used. Thus, it is not necessary to use an odd and even sequence or even to alternate among multiple sequences using skip value. For example, skip value may be set to some fraction of the block size.

It is not necessary for the linear sequence to progress all the way from 0 through to K-1, but some fraction of a sequence may be processed. However, for purposes of clarity by way of example and not limitation, it shall be assumed that the entire sequence from 0 to K-1 is processed in loop 502.

It is not necessary that x have initialization values corresponding to skip value. For example, x may be reinitialized at a fraction of the block size.

Continuing the above example for K equal to 256, if x were to be initialized again at one half of K, then x equal to 128 would be substituted into Equations (1) and (2) for generating initialization values 203-2 and 203-1, respectively, for such processing. However, the first value, namely x equal to 0 in this sequence, would be as previously described.
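Re-initialization at one half of the block can be sketched as two runs of the same recurrence, one started at x equal to 0 and one at x equal to 128. The sketch assumes the usual quadratic permutation polynomial forms for Equations (1) through (3), which are not reproduced in this passage, a skip value of 1, and illustrative coefficients f1 = 15 and f2 = 32. The function names are hypothetical.

```python
def qpp(x, K, f1, f2):
    """Assumed form of Equation (1), evaluated directly."""
    return (f1 * x + f2 * x * x) % K


def run(x0, count, K, f1, f2, n=1):
    """Generate `count` addresses recursively, starting at position x0."""
    a = qpp(x0, K, f1, f2)                                   # address candidate
    i = (f1 * n + f2 * n * n + 2 * f2 * n * x0) % K - K      # increment, in -K..-1
    s = (2 * f2 * n * n) % K                                 # step size
    out = []
    for _ in range(count):
        out.append(a)
        a, i = a + i, i + s       # next candidates, computed from the old values
        if a < 0:
            a += K                # bring the address back into 0..K-1
        if i >= 0:
            i -= K                # bring the increment back into -K..-1
    return out


K, F1, F2 = 256, 15, 32           # F1 and F2 are assumed values
first_half = run(0, K // 2, K, F1, F2)
second_half = run(K // 2, K // 2, K, F1, F2)
```

Because each run depends only on its own initialization values, the two halves could be produced by independent streams and still concatenate to the full sequence.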

At 511, an increment i is set as going from 0 to K-1 for loop 502. If the address candidate is negative, then the block size K is added to the address candidate as indicated at 512. If the increment candidate is positive, then block size K is subtracted as indicated at 513. At 514, the next address candidate for a then current phase is calculated.

At 515, the next increment candidate for a then current phase is calculated. At 516, an address for the current phase is output. Loop 502 in this example is for i from 0 to K-1 in increments of one, and when i is equal to K-1 after 516, then loop 502 ends at 517. Even though address generation flow 500 has been described for multiple threads or sequences, it should be understood that such flow may be reduced down for a single sequence, in which case only one set of address and increment candidates would be obtained. Furthermore, it should be understood that more than two sets of address and increment candidates may be incremented for more than two threads or phases.
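The steps of loop 502 can be sketched as runnable code. This is one reading of the description of FIG. 5 rather than a transcription of the listing itself: the initialization formulas assume the usual quadratic permutation polynomial forms for Equations (1) through (3), the phase for iteration i is assumed to be i modulo n, a zero increment candidate is treated like a positive one, and the function name `flow_500` is hypothetical.

```python
def flow_500(K, n, f1, f2):
    """Return K addresses, alternating among n phases (e.g. even and odd)."""
    # 501/503: per-phase candidates and the shared step size (assumed formulas).
    a_cand = [(f1 * x + f2 * x * x) % K for x in range(n)]
    i_cand = [(f1 * n + f2 * n * n + 2 * f2 * n * x) % K for x in range(n)]
    s = (2 * f2 * n * n) % K

    addresses = []
    for i in range(K):                    # 511: i runs from 0 to K-1
        p = i % n                         # current phase (assumption)
        if a_cand[p] < 0:                 # 512: negative address candidate
            a_cand[p] += K
        if i_cand[p] >= 0:                # 513: non-negative increment candidate
            i_cand[p] -= K
        address = a_cand[p]               # value output for this iteration
        a_cand[p] = address + i_cand[p]   # 514: next address candidate
        i_cand[p] = i_cand[p] + s         # 515: next increment candidate
        addresses.append(address)         # 516: output address for this phase
    return addresses
```

Calling `flow_500(256, 2, 15, 32)` exercises the two-phase example with illustrative coefficients; setting n to 1 reduces the loop to a single sequence, as the description notes.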

While the foregoing describes exemplary embodiment(s) in accordance with one or more aspects of the invention, other and further embodiment(s) in accordance with the one or more aspects of the invention may be devised without departing from the scope thereof, which is determined by the claim(s) that follow and equivalents thereof. For example, initialization may take place before any register in each engine whereas the above description assumes initialization using the logic located in front of or just before an initial register of each engine. In other words, the exemplary embodiments just happen to show initialization in loadable adders 311 and 321 before registers 314 and 324, respectively, of FIG. 3. Initialization was assumed to be in adders 311 and 321 because these adders are less complex as they do not involve respective multiplexers. However, initialization may take place at a loadable subtractor 312 and a loadable adder 322. Or both streams may be initialized at once rather than sequentially. So the difference from subtractor 312 and the sum from adder 322 may be initialized for a first sequence at the same time as the sum from adder 311 and the sum from adder 321 are initialized for a second sequence. Also, when extra registers are inserted to allow for one or more extra streams, there may be no logic in front of such registers, and thus such registers may be used for initialization. Furthermore, if a first stream/sequence used first and third initialization values and a second stream/sequence used second and fourth initialization values, it should be understood that such first and second streams/sequences may be completely independent of one another and each may be started at any point in a block though both may not have a same starting point. However, the first stream/sequence does not necessarily have to be initialized before or after the second stream/sequence.
Furthermore, where the third initialization value corresponds to the same stream/sequence as the first initialization value, and where the third initialization value initializes the second processing engine, the first initialization value may be used to initialize the first processing engine for the same stream/sequence with a specific start location between 0 and K-1 (inclusive). Similarly, the second initialization value and the fourth initialization value may correspond to the same stream/sequence.

Although the invention has been described with reference to particular embodiments thereof, it will be apparent to one of ordinary skill in the art that modifications to the described embodiment may be made without departing from the spirit of the invention. Accordingly, the scope of the invention will be defined by the attached claims and not by the above detailed description. It is noted that claims listing steps do not imply any order of the steps and that trademarks are the property of their respective owners.

Claims

What is claimed is: 1. An address generator, comprising: a first processing unit; and a second processing unit coupled to receive a stage output from the first processing unit and configured to provide an address output, wherein the stage output is in a first range from -K to -1 for a block size of K, and the address output is in a second range from 0 to K-1.

2. The address generator according to claim 1, wherein the address generator is part of a coding device selected from a group consisting of an encoder, a decoder, and a codec, wherein the address generator provides the address output for quadratic permutation polynomial interleaving.

3. The address generator according to claim 2, wherein the address output includes multiple address sequences.

4. The address generator according to claim 3, wherein the first processing unit and the second processing unit are respectively initialized with a first initialization value or a second initialization value.

5. The address generator according to claim 4, wherein the first initialization value is for a first sequence of the multiple address sequences; and wherein the second initialization value is for a second sequence of the multiple address sequences.

6. The address generator according to claim 2, wherein: the address output is for at least part of an address sequence from 0 to K-1; the first processing unit is initialized with a first initialization value and a second initialization value; and the second processing unit is initialized with a third initialization value and a fourth initialization value.

7. The address generator according to claim 1, wherein the first processing unit comprises: a first adder; a first register, coupled to the first adder; a first multiplexer, coupled to the first register; a first subtractor, coupled to the first multiplexer and the first register; and a second register, coupled to the subtractor, to output the stage output; wherein the stage output is fed-back to the first adder.

8. The address generator according to claim 7, wherein the first register processes a first sequence, and the second register simultaneously processes a second sequence.

9. The address generator according to claim 8, wherein the second processing unit comprises: a second adder to receive the stage output; a third register, coupled to the second adder; a second multiplexer, coupled to the third register; a third adder, coupled to the second multiplexer and the third register; and a fourth register, coupled to the third adder, to output the address output; wherein the address output is fed-back to an input of the second adder.

10. A method for generating addresses, comprising: obtaining a step size and a block size; obtaining a first initialization value and a second initialization value; adding the step size to a difference to provide a first sum; subtracting either a null value or the block size from the first sum responsive to a sign bit of the first sum to provide another difference, wherein the other difference is in a range of -K to -1 for block size of K; registering the first sum or the other difference; and feeding back the other difference in order to add the other difference to the step size.

11. The method according to claim 10, further comprising: generating a second sum by adding the other difference to a third sum; adding either the null value or the block size to the second sum in response to a sign bit of the second sum to provide another third sum, wherein the other third sum is in a range of 0 to K-1; registering the second sum or the other third sum; and feeding back the other third sum for another iteration of the step for adding to provide the second sum.

12. The method according to claim 11, wherein the registering the first sum or the other difference includes registering the other difference within respective feedback loops for pipelined operation, and wherein the registering the second sum or the other third sum includes registering the other third sum within respective feedback loops for pipelined operation.

13. The method according to claim 11, wherein the registering the first sum or the other difference includes registering the first sum within respective feedback loops for pipelined operation, and wherein the registering the second sum or the other third sum includes registering the second sum within respective feedback loops for pipelined operation.

14. The method according to claim 11, wherein the step of adding the step size to the difference to provide the first sum is performed simultaneously with the step of adding to provide the second sum by addition of the other difference to the third sum.

15. The method according to claim 11, further comprising providing the other third sum for quadratic permutation polynomial interleaving.