Publication number | US7218624 B2 |

Publication type | Grant |

Application number | US 10/172,113 |

Publication date | May 15, 2007 |

Filing date | Jun 14, 2002 |

Priority date | Nov 14, 2001 |

Fee status | Lapsed |

Also published as | CA2466684A1, CN1582544A, CN2579091Y, CN2686248Y, DE20217636U1, DE20217637U1, EP1444798A1, EP1444798A4, US7606207, US20030091007, US20070206543, WO2003043236A1 |

Publication number | 10172113, 172113, US 7218624 B2, US 7218624B2, US-B2-7218624, US7218624 B2, US7218624B2 |

Inventors | Peter E. Becker, Shane S. Supplee |

Original Assignee | Interdigital Technology Corporation |


US 7218624 B2

Abstract

A user equipment or base station recovers data from a plurality of data signals received as a received vector. The user equipment determines data of the received vector by determining a Cholesky factor of an N by N matrix and using the determined Cholesky factor in forward and backward substitution to determine data of the received data signals. The user equipment or base station comprises an array of at most N scalar processing elements. The array has input for receiving elements from the N by N matrix and the received vector. Each scalar processing element is used in determining the Cholesky factor and performs forward and backward substitution. The array outputs data of the received vector.

Claims(20)

1. A user equipment for recovering data from a plurality of data signals received as a received vector, the user equipment configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and to use the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the user equipment comprising an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector wherein the N×N matrix has a bandwidth P and a number of the at most N scalar processing elements is P and P is less than N.

2. The user equipment of claim 1 wherein each scalar processing element is configured to process a diagonal of a matrix being processed by the array in determining the Cholesky factor and to perform forward and backward substitution.

3. The user equipment of claim 2 wherein each processing element is configured to perform processing for a plurality of diagonals of the N by N matrix.

4. The user equipment of claim 2 wherein a delay element is operatively coupled between each scalar processing element and the array is capable of processing two N by N matrices concurrently.

5. The user equipment of claim 2 wherein all the scalar processing elements have a common reconfigurable implementation.

6. A user equipment for recovering data from a plurality of data signals received as a received vector, the user equipment configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and to use the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the user equipment comprising:

an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector; and

a square root and reciprocal device wherein the square root and reciprocal device is coupled only to a single scalar processing element of the array and no scalar processing elements of the array can perform a square root and reciprocal function.

7. The user equipment of claim 6 wherein the square root and reciprocal device is configured to use a look up table.

8. A user equipment for recovering data from a plurality of data signals received as a received vector, the user equipment configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and to use the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the user equipment comprising an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector wherein each scalar processing element is configured to process a diagonal of a matrix being processed by the array in determining the Cholesky factor and to perform forward and backward substitution and for each of a plurality of folds, a scalar processing element is configured to process elements from a single diagonal of the N by N matrix.

9. The user equipment of claim 8 wherein a number of folds minimizes a number of the scalar processing elements and allows a processing time for the N by N matrix to be less than a maximum permitted.

10. The user equipment of claim 8 wherein the scalar processing elements are functionally arranged linearly with data flowing two directions through the array.

11. A base station for recovering data from a plurality of data signals received as a received vector, the base station configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and using the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the base station comprising an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector wherein the N×N matrix has a bandwidth P and a number of the at most N scalar processing elements is P and P is less than N.

12. The base station of claim 11 wherein each scalar processing element is configured to process a diagonal of a matrix being processed by the array in determining the Cholesky factor and to perform forward and backward substitution.

13. A base station for recovering data from a plurality of data signals received as a received vector, the base station configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and using the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the base station comprising:

an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector; and

a square root and reciprocal device wherein the square root and reciprocal device is coupled only to a single scalar processing element of the array and no scalar processing elements of the array can perform a square root and reciprocal function.

14. The base station of claim 13 wherein the square root and reciprocal device is configured to use a look up table.

15. A base station for recovering data from a plurality of data signals received as a received vector, the base station configured to determine data of the received vector by determining a Cholesky factor of an N by N matrix and using the determined Cholesky factor in forward and backward substitution to determine the data of the received data signals, the base station comprising an array of at most N scalar processing elements, the array having an input configured to receive elements from the N by N matrix and the received vector, each scalar processing element used in determining the Cholesky factor and performing forward and backward substitution, the array configured to output data of the received vector wherein each processing element performs processing for a plurality of diagonals of the N by N matrix.

16. The base station of claim 15 wherein for each of a plurality of folds, each scalar processing element processes elements from a single diagonal of the N by N matrix.

17. The base station of claim 16 wherein a number of folds minimizes a number of the scalar processing elements and allows a processing time for the N by N matrix to be less than a maximum permitted.

18. The base station of claim 17 wherein the scalar processing elements are functionally arranged linearly with data flowing two directions through the array.

19. The base station of claim 18 wherein a delay element is operatively coupled between each scalar processing element and the array is capable of processing two N by N matrices concurrently.

20. The base station of claim 19 wherein all the scalar processing elements have a common reconfigurable implementation.

Description

This application is a continuation-in-part of patent application Ser. No. 10/083,189, filed on Feb. 26, 2002, now abandoned, which claims priority from U.S. Provisional Patent Application No. 60/332,950, filed on Nov. 14, 2001.

This invention generally relates to solving linear systems. In particular, the invention relates to using array processing to solve linear systems.

Linear system solutions are used to solve many engineering problems. One such problem is joint user detection of multiple user signals in a time division duplex (TDD) communication system using code division multiple access (CDMA). In such a system, multiple users send multiple communication bursts simultaneously in the same fixed-duration time interval (timeslot). The multiple bursts are transmitted using different spreading codes. During transmission, each burst experiences a channel response. One approach to recover data from the transmitted bursts is joint detection, where all users' data is received simultaneously. Such a system is shown in

The multiple bursts **90**, after experiencing their channel response, are received as a combined received signal at an antenna **92** or antenna array. The received signal is reduced to baseband, such as by a demodulator **94**, and sampled at a chip rate of the codes or a multiple of a chip rate of the codes, such as by an analog to digital converter (ADC) **96** or multiple ADCs, to produce a received vector, __r__. A channel estimation device **98** uses a training sequence portion of the communication bursts to estimate the channel response of the bursts **90**. A joint detection device **100** uses the estimated or known spreading codes of the users' bursts and the estimated or known channel responses to estimate the originally transmitted data for all the users as a data vector, __d__.

The joint detection problem is typically modeled by Equation 1.

A__d__+__n__=__r__ Equation 1

Two approaches to solving Equation 1 are a zero forcing (ZF) approach and a minimum mean square error (MMSE) approach. A ZF solution, where __n__ is approximated to zero, is per Equation 2.

__d__=(A^{H}A)^{−1}A^{H}__r__ Equation 2

An MMSE approach is per Equations 3 and 4.

__d__=R^{−1}A^{H}__r__ Equation 3

R=A^{H}A+σ^{2}I Equation 4

σ^{2} is the variance of the noise, __n__, and I is the identity matrix.

Since the spreading codes, channel responses and average of the noise variance are estimated or known and the received vector is known, the only unknown variable is the data vector, __d__. A brute force type solution, such as a direct matrix inversion, to either approach is extremely complex. One technique to reduce the complexity is Cholesky decomposition. The Cholesky algorithm factors a symmetric positive definite matrix, such as Ã or R, into a lower triangular matrix G and an upper triangular matrix G^{H} by Equation 5.

Ã or R=G G^{H} Equation 5

A symmetric positive definite matrix, Ã, can be created from A by multiplying A by its conjugate transpose (Hermitian), A^{H}, per Equation 6.

Ã=A^{H}A Equation 6

For shorthand, r̃ is defined per Equation 7.

r̃=A^{H}__r__ Equation 7

As a result, Equation 1 is rewritten as Equation 8 for ZF or Equation 9 for MMSE.

Ã__d__=r̃ Equation 8

R__d__=r̃ Equation 9
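As an illustration of Equations 4 and 6 through 9, the following is a minimal pure-Python sketch that forms the left-hand matrix and r̃ from A and __r__. The function names (`conj_transpose`, `matmul`, `normal_equations`) and the `sigma2` parameter are hypothetical and are not taken from the patent; matrices are assumed to be stored as lists of rows.

```python
def conj_transpose(A):
    """Return A^H for a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]


def matmul(X, Y):
    """Plain matrix product X Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def normal_equations(A, r, sigma2=0.0):
    """Return (M, r_tilde) with M = A^H A + sigma2 * I and r_tilde = A^H r.

    sigma2 == 0 gives the ZF system of Equation 8 (M is A-tilde);
    sigma2 > 0 gives the MMSE system of Equation 9 (M is R).
    """
    AH = conj_transpose(A)
    M = matmul(AH, A)
    for i in range(len(M)):
        M[i][i] += sigma2  # Equation 4: add the noise variance on the diagonal
    r_tilde = [sum(AH[i][k] * r[k] for k in range(len(r)))
               for i in range(len(AH))]  # Equation 7
    return M, r_tilde
```

Either returned system has the form solved by the Cholesky factor in the steps that follow.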

To solve either Equation 8 or 9, the Cholesky factor is used per Equation 10.

G G^{H}__d__=r̃ Equation 10

A variable y is defined as per Equation 11.

G^{H}__d__=y Equation 11

Using variable y, Equation 10 is rewritten as Equation 12.

Gy=r̃ Equation 12

The bulk of complexity for obtaining the data vector is performed in three steps. In the first step, G is created from the derived symmetric positive definite matrix, such as Ã or R, as illustrated by Equation 13.

G=CHOLESKY(Ã or R) Equation 13

Using G, y is solved using forward substitution of G in Equation 12, as illustrated by Equation 14.

y=FORWARD SUB(G, r̃) Equation 14

Using the conjugate transpose of G, G^{H}, __d__ is solved using backward substitution in Equation 11, as illustrated by Equation 15.

__d__=BACKWARD SUB(G^{H}, y) Equation 15
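The second and third steps (Equations 14 and 15) can be sketched in pure Python as follows. This is only an illustrative sequential version, not the array implementation described later; the names `forward_sub` and `backward_sub` are hypothetical, and G is assumed to be a full list-of-rows lower triangular matrix.

```python
def forward_sub(G, r_tilde):
    """Solve G y = r_tilde for lower triangular G (Equations 12 and 14)."""
    n = len(G)
    y = [0] * n
    for i in range(n):
        acc = sum(G[i][k] * y[k] for k in range(i))
        y[i] = (r_tilde[i] - acc) / G[i][i]
    return y


def backward_sub(G, y):
    """Solve G^H d = y (Equations 11 and 15) without forming G^H.

    Row i of G^H holds conj(G[k][i]) for k >= i, so the upper triangular
    system is read directly from the columns of G.
    """
    n = len(G)
    d = [0] * n
    for i in reversed(range(n)):
        acc = sum(G[k][i].conjugate() * d[k] for k in range(i + 1, n))
        d[i] = (y[i] - acc) / G[i][i].conjugate()
    return d
```

For example, with G = [[2, 0], [1j, 1]] and r̃ = [4−4j, 4+2j], `forward_sub` gives y = [2−2j, 2] and `backward_sub` then gives d = [1, 2].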

An approach to determine the Cholesky factor, G, per Equation 13 is the following algorithm, as shown for Ã, although an analogous approach is used for R.

    for i = 1 : N
        for j = max(1, i − P) : i − 1
            λ = min(j + P, N)
            a_{i:λ, i} = a_{i:λ, i} − a*_{i, j} · a_{i:λ, j};
        end for;
        λ = min(i + P, N)
        a_{i:λ, i} = a_{i:λ, i} / √(a_{ii});
    end for;
    G = Ã or R;

a
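The banded algorithm above can be written as a short runnable sketch. The function name `banded_cholesky` is hypothetical, indices are shifted to 0-based, and P is used exactly as in the loop bounds above, meaning that entries more than P rows below the main diagonal are assumed to be zero. The sketch assumes a Hermitian positive definite input stored as a full list of rows.

```python
import math


def banded_cholesky(A, P):
    """Return lower triangular G with A = G G^H, touching only the band."""
    N = len(A)
    a = [row[:] for row in A]  # work on a copy; the factor overwrites it
    for i in range(N):
        # Subtract the contribution of each earlier column j inside the band.
        for j in range(max(0, i - P), i):
            lam = min(j + P, N - 1)
            c = a[i][j].conjugate()
            for k in range(i, lam + 1):
                a[k][i] -= c * a[k][j]
        # Normalise column i by the square root of its diagonal element.
        lam = min(i + P, N - 1)
        root = math.sqrt(a[i][i].real)
        for k in range(i, lam + 1):
            a[k][i] /= root
    # Keep only the lower triangle; the upper part is not part of G.
    return [[a[r][c] if c <= r else 0 for c in range(N)] for r in range(N)]
```

For the tridiagonal matrix [[4, 2, 0], [2, 5, 2], [0, 2, 5]] with P = 1, the result is G = [[2, 0, 0], [1, 2, 0], [0, 1, 2]].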

Another approach to solve for the Cholesky factor uses N parallel vector-based processors. Each processor is mapped to a column of the Ã or R matrix. Each processor's column is defined by a variable μ, where μ=1:N. The parallel processor based subroutine can be viewed as the following subroutine for μ=1:N.

    j = 1
    while j < μ
        recv(g_{j:N}, left)
        if μ < N
            send(g_{j:N}, right)
        end
        a_{μ:N, μ} = a_{μ:N, μ} − g*_{μ} g_{μ:N}
        j = j + 1
    end
    a_{μ:N, μ} = a_{μ:N, μ} / √(a_{μμ})
    if μ < N
        send(a_{μ:N, μ}, right)
    end

recv(·, left) is a receive from the left processor operator; send(·, right) is a send to the right processor operator; and g
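The column-per-processor subroutine can be mimicked sequentially, which may help in following the send and receive pattern. In this hedged sketch the name `column_processor_cholesky` is hypothetical, each "processor" is simply a list holding its column from the diagonal down, and message passing is replaced by an ordinary loop over the processors to the right.

```python
import math


def column_processor_cholesky(A):
    """Simulate N column processors; returns the columns of G (diagonal down)."""
    N = len(A)
    # cols[mu] plays the role of a_{mu:N, mu}, owned by processor mu.
    cols = [[A[k][mu] for k in range(mu, N)] for mu in range(N)]
    for stage in range(N):
        # Processor `stage` has received every earlier column, so it
        # normalises its own column by the square root of its diagonal.
        root = math.sqrt(cols[stage][0].real)
        g = [x / root for x in cols[stage]]
        cols[stage] = g
        # The finished column is "sent right"; each later processor mu
        # uses the part g_{mu:N} to update the column it owns.
        for mu in range(stage + 1, N):
            off = mu - stage
            g_mu = g[off].conjugate()
            cols[mu] = [a - g_mu * g[off + k] for k, a in enumerate(cols[mu])]
    return cols
```

After stage k, processor k no longer takes part, which is the idle-resource effect noted later in the description.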

This subroutine is illustrated using FIGS. 2a–2h. FIG. 2a is a block diagram of the vector processors and associated memory cells of the joint detection device. Each processor **50**_{1} to **50**_{N} (**50**) operates on a column of the matrix. Since the G matrix is lower triangular and Ã or R is completely defined by its lower triangular portion, only the lower triangular elements, a_{k,l}, are used.

FIGS. 2b and 2c show two possible functions performed by the processors on the cells below them. In FIG. 2b, the pointed down triangle function **52** performs Equations 16 and 17 on the cells (a_{μμ} to a_{Nμ}) below that μ processor **50**.

v←a_{μ:N,μ}/√(a_{μμ}) Equation 16

a_{μ:N,μ}:=v Equation 17

“←” indicates a concurrent assignment; “:=” indicates a sequential assignment; and v is a value sent to the right processor.

In FIG. 2c, the pointed right triangle function **52** performs Equations 18 and 19 on the cells below that μ processor **50**.

v←u Equation 18

a_{μ:N,μ}:=a_{μ:N,μ}−v_{μ}v_{μ:N} Equation 19

v_{k} indicates a value associated with a right value of the k^{th} processor **50**.

FIGS. 2d–2g illustrate the data flow and functions performed for a 4×4 G matrix. As shown in FIGS. 2d–2g, for each stage **1** through **4** of processing, the left most processor **50** drops out and the pointed down triangular function **52** moves left to right. To implement FIGS. 2d–2g, the pointed down triangle can physically replace the processor to the right or virtually replace the processor to the right by taking on the function of the pointed down triangle.

These elements are extendable to an N×N matrix and N processors **50** by adding processors **50** (N−4 in number) to the right of the fourth processor **50**_{4} and by adding cells of the bottom matrix diagonal (N−4 in number) to each of the processors **50** as shown in FIG. 2h for stage **1**. The processing in such an arrangement occurs over N stages.

The implementation of such a Cholesky decomposition using either vector processors or a direct decomposition into scalar processors is inefficient, because large amounts of processing resources go idle after each stage of processing.

Accordingly, it is desirable to have alternate approaches to solve linear systems.

A user equipment or base station recovers data from a plurality of data signals received as a received vector. The user equipment determines data of the received vector by determining a Cholesky factor of an N by N matrix and using the determined Cholesky factor in forward and backward substitution to determine data of the received data signals. The user equipment or base station comprises an array of at most N scalar processing elements. The array has input for receiving elements from the N by N matrix and the received vector. Each scalar processing element is used in determining the Cholesky factor and performs forward and backward substitution. The array outputs data of the received vector.

FIGS. 2a–2h are diagrams illustrating determining a Cholesky factor using vector processors.

FIGS. 3a and 3b are preferred embodiments of N scalar processors performing Cholesky decomposition.

FIGS. 4a–4e are diagrams illustrating an example of using a three dimensional graph for Cholesky decomposition.

FIGS. 5a–5e are diagrams illustrating an example of mapping vector processors performing Cholesky decomposition onto scalar processors.

FIGS. 6a–6d for a non-banded and FIGS. 6e–6j for a banded matrix are diagrams illustrating the processing flow of the scalar array.

*a *along the k axis to an N×N matrix.

FIGS. 8a–8d are diagrams illustrating the processing flow using delays between the scalar processors in the 2D scalar array.

FIG. 8e is a diagram of a delay element and its associated equation.

FIG. 9a illustrates projecting the scalar processor array of FIGS. 8a–8d onto a 1D array of four scalar processors.

FIG. 9b illustrates projecting a scalar processing array having delays between every other processor onto a 1D array of four scalar processors.

FIGS. 9c–9n are diagrams illustrating the processing flow for Cholesky decomposition of a banded matrix having delays between every other processor.

FIGS. 9o–9z illustrate the memory access for a linear array processing a banded matrix.

FIGS. 10a and 10b are the projected arrays of FIGS. 9a and 9b extended to N scalar processors.

FIGS. 11a and 11b illustrate separating a divide/square root function from the arrays of FIGS. 10a and 10b.

FIG. 12a is an illustration of projecting a forward substitution array having delays between each processor onto four scalar processors.

FIG. 12b is an illustration of projecting a forward substitution array having delays between every other processor onto four scalar processors.

FIGS. 12c and 12d are diagrams showing the equations performed by a star and diamond function for forward substitution.

FIG. 12e is a diagram illustrating the processing flow for a forward substitution of a banded matrix having concurrent assignments between every other processor.

FIGS. 12f–12j are diagrams illustrating the processing flow for forward substitution of a banded matrix having delays between every other processor.

FIGS. 12k–12p are diagrams illustrating the memory access for a forward substitution linear array processing a banded matrix.

FIGS. 13a and 13b are the projected arrays of FIGS. 12a and 12b extended to N scalar processors.

FIGS. 14a–14d are diagrams illustrating the processing flow of the projected array of *b*.

FIG. 15a is an illustration of projecting a backward substitution array having delays between each processor onto four scalar processors.

FIG. 15b is an illustration of projecting a backward substitution array having delays between every other processor onto four scalar processors.

FIGS. 15c and 15d are diagrams showing the equations performed by a star and diamond function for backward substitution.

FIG. 15e is a diagram illustrating the processing flow for backward substitution of a banded matrix having concurrent assignments between every other processor.

FIGS. 15f–15j are diagrams illustrating the processing flow for backward substitution of a banded matrix having delays between every other processor.

FIGS. 15k–15p are diagrams illustrating the memory access for a backward substitution linear array processing a banded matrix.

FIGS. 16a and 16b are the projected arrays of FIGS. 15a and 15b extended to N scalar processors.

FIGS. 17a–17d are diagrams illustrating the processing flow of the projected array of *b*.

FIGS. 18a and 18b are the arrays of FIGS. 13a, 13b, 16a and 16b with the division function separated.

FIGS. 19a and 19b are diagrams of a reconfigurable array for determining G, forward and backward substitution.

FIGS. 20a and 20b are illustrations of breaking out the divide and square root function from the reconfigurable array.

*a *illustrates bi-directional folding.

*b *illustrates one directional folding.

*a *is an implementation of bi-directional folding using N processors.

*b *is an implementation of one direction folding using N processors.

FIGS. 3a and 3b are preferred embodiments of N scalar processors **54**_{1} to **54**_{N} (**54**) performing Cholesky decomposition to obtain G. For simplicity, the approach is described for a 4×4 G matrix, although it is extendable to any N×N G matrix as shown in FIGS. 3a and 3b.

FIG. 4a illustrates a three-dimensional computational dependency graph for performing the previous algorithms. For simplicity, FIG. 4a illustrates processing a 5 by 5 matrix with a bandwidth of 3. The functions performed by each node are shown in FIGS. 4b–4e. The pentagon function of FIG. 4b performs Equations 20 and 21.

y←√(a_{in}) Equation 20

a_{out}←y Equation 21

← indicates a concurrent assignment. a_{in} is input to the node from a lower level and a_{out} is output to a higher level. FIG. 4c is a square function performing Equations 22 and 23.

y←z* Equation 22

a_{out}←a_{in}−|z|^{2} Equation 23

FIG. 4d is an octagon function performing Equations 24, 25 and 26.

y←w Equation 24

x←a_{in}/w Equation 25

a_{out}←x Equation 26

FIG. 4e is a circle function performing Equations 27, 28 and 29.

y←w Equation 27

x←z Equation 28

a_{out}←a_{in}−w*z Equation 29

FIG. 5a is a diagram showing the mapping of the first stage of a vector based Cholesky decomposition for a 4×4 G matrix to the first stage of a two dimensional scalar based approach. Each vector processor **52**, **54** is mapped onto at least one scalar processor **56**, **58**, **60**, **62** as shown in FIG. 5a. Each scalar processor **56**, **58**, **60**, **62** is associated with a memory cell, a_{ij}. The function to be performed by each processor **56**, **58**, **60**, **62** is shown in FIGS. 5b–5e. FIG. 5b illustrates a pentagon function **56**, which performs Equations 30 and 31.

y←√(a_{ij}) Equation 30

a_{ij}:=y Equation 31

:= indicates a sequential assignment. y indicates a value sent to a lower processor. FIG. 5c illustrates an octagonal function **58**, which performs Equations 32, 33 and 34.

y←w Equation 32

x←a_{ij}/w Equation 33

a_{ij}:=x Equation 34

w indicates a value sent from an upper processor. FIG. 5d illustrates a square function **60**, which performs Equations 35 and 36.

y←z* Equation 35

a_{ij}:=a_{ij}−|z|^{2} Equation 36

x indicates a value sent to a right processor. FIG. 5e illustrates a circular function **62**, which performs Equations 37, 38 and 39.

y←w Equation 37

x←z Equation 38

a_{ij}:=a_{ij}−w*z Equation 39
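The four node functions of Equations 30 through 39 can be exercised with a small stage-by-stage simulation. This is a sketch under stated assumptions: the function names are hypothetical, "w*z" in Equation 39 is read as the product of w and z (w already being the conjugated value sent down by the square function), and the concurrent data flow of each stage is replaced by ordinary loops.

```python
import math


def pentagon(a):
    """Equations 30-31: returns (new cell value, y sent down)."""
    y = math.sqrt(a.real)
    return y, y


def octagon(a, w):
    """Equations 32-34: returns (new cell value, y passed down, x sent right)."""
    x = a / w
    return x, w, x


def square(a, z):
    """Equations 35-36: returns (new cell value, y sent down)."""
    return a - abs(z) ** 2, z.conjugate()


def circle(a, w, z):
    """Equations 37-39: returns (new cell value, y passed down, x passed right)."""
    return a - w * z, w, z


def scalar_array_cholesky(A):
    """Run N stages over the lower triangular cells a[i][j], j <= i."""
    N = len(A)
    a = [[A[i][j] for j in range(i + 1)] for i in range(N)]
    for s in range(N):
        a[s][s], w = pentagon(a[s][s])       # top of the active column
        x = {}
        for i in range(s + 1, N):            # octagons down the column
            a[i][s], w, x[i] = octagon(a[i][s], w)
        for j in range(s + 1, N):            # remaining columns to the right
            a[j][j], y = square(a[j][j], x[j])
            for i in range(j + 1, N):
                a[i][j], y, _ = circle(a[i][j], y, x[i])
    return a
```

After the last stage the cells hold the lower triangle of G, one row per list.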

FIGS. 6a–6d illustrate the data flow through the scalar processors **56**, **58**, **60**, **62** in four sequential stages (stages **1** to **4**). As shown in FIGS. 6a–6d, a column of processors **56**, **58** drops off after each stage. The process requires four processing cycles, or N in general, one processing cycle for each stage. As shown in FIG. 6a, ten (10) scalar processors are required to determine a 4×4 G matrix. For an N×N matrix, the number of processors required is per Equation 40.
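Equation 40 itself is not reproduced in this text. As a hedged illustration only, the count of one scalar processor per lower triangular cell can be computed as below, which gives ten for N = 4 as stated above. The banded variant assumes the bandwidth P counts the main diagonal plus the non-zero diagonals below it, as in the 5 by 5, bandwidth 3 example; both function names are hypothetical.

```python
def scalar_processors_full(N):
    """One processor per lower triangular cell of an N x N matrix."""
    return N * (N + 1) // 2


def scalar_processors_banded(N, P):
    """Cells on the main diagonal and the P - 1 diagonals below it.

    Diagonal d (d = 0 is the main diagonal) holds N - d cells, so the
    total is N*P - P*(P - 1)/2.
    """
    return N * P - P * (P - 1) // 2
```

With N = 5 and P = 3 this gives 12 cells, three fewer than the 15 of the non-banded case, matching the three zero entries a_{41}, a_{51} and a_{52} of the banded example.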

FIGS. 6e–6j illustrate the processing flow for a banded 5 by 5 matrix. Active processors are unhatched. The banded matrix has the lower left three entries (a_{41}, a_{51}, a_{52}, not shown in FIGS. 6e–6j) as zeros. As shown in FIG. 6e, in a first stage, the upper six processors are operating. As shown in FIG. 6f, the six active processors of stage **1** have determined g_{11}, g_{21} and g_{31} and three intermediate results, α_{22}, α_{32} and α_{33}, for use in stage **2**.

In stage **2**, six processors (α_{22}, α_{32}, α_{33}, ã_{42}, ã_{43}, ã_{44}) are operating. As shown in FIG. 6g (stage **3**), values for g_{22}, g_{32} and g_{42} and intermediate values for β_{33}, β_{43}, β_{44} have been determined in stage **2**. In FIG. 6h (stage **4**), values for g_{33}, g_{43} and g_{53} and intermediate values for γ_{44}, γ_{54} and γ_{55} have been determined. In FIG. 6i (stage **5**), g_{44} and g_{54} and intermediate value δ_{55} have been determined. In FIG. 6j (final stage), the remaining value g_{55} is available. As shown in the figures, due to the banded nature of the matrix, the lower left processors of an unbanded matrix are unnecessary and not shown.

The simplified illustrations of FIGS. 6a–6d are expandable to an N×N matrix as shown in FIG. 7. Processor **56** performs a pentagon function. Octagon function processors **58** extend down the first column and dual purpose square/pentagon processors **64** extend along the main diagonal, as shown by the two combined shapes. The rest of the processors are dual purpose octagonal/circle processors **66**, as shown by the two combined shapes. This configuration determines an N×N G matrix in N processing cycles using only scalar processors.

If the bandwidth of the matrix has a limited width, such as P, the number of processing elements can be reduced. To illustrate, if P equals N−1, the lower left processor for a_{N,1} drops off. If P equals N−2, two more processors (a_{N−1,1} and a_{N,2}) drop off.

Reducing the number of scalar processing elements further is explained in conjunction with FIGS. 8a–8e, 9a and 9b. FIGS. 8a–8e describe one dimensional execution planes of a four (4) scalar processor implementation of FIGS. 6a–6d. A delay element **68** of FIG. 8e is inserted between each concurrent connection as shown in FIG. 8a. The delay element **68** of FIG. 8e delays the input y to be a sequential output x, per Equation 41.

y:=x Equation 41

For each processing cycle starting at t_{1}, the processors sequentially process as shown by the diagonal lines showing the planes of execution. To illustrate, at time t_{1}, only processor **56** of a_{11} operates. At t_{2}, only processor **58** of a_{21} operates and at t_{3}, processors **58**, **60** of a_{31} and a_{22} operate, and so on until stage **4**, t_{16}, where only processor **56** of a_{44} operates. As a result, the overall processing requires N^{2} clock cycles across N stages.
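The planes of execution can be tabulated with a short sketch. The timing rule used here is an assumption inferred from the examples in the paragraph above (a_{11} at t_{1}, a_{21} at t_{2}, a_{31} and a_{22} at t_{3}, a_{44} of stage 4 at t_{16}): within a stage, a cell becomes active one clock cycle later for every step down or to the right from the stage's top left cell. The function name `execution_planes` is hypothetical.

```python
def execution_planes(N):
    """Map each clock cycle t (1-based) to the active cells (row, col), 1-based."""
    schedule = {}
    t0 = 0  # clock cycles consumed by earlier stages
    for stage in range(1, N + 1):
        for i in range(stage, N + 1):
            for j in range(stage, i + 1):
                # One delay per hop away from the stage's top left cell.
                t = t0 + (i - stage) + (j - stage) + 1
                schedule.setdefault(t, []).append((i, j))
        # A stage of size m = N - stage + 1 spans 2*m - 1 planes.
        t0 += 2 * (N - stage) + 1
    return schedule
```

For N = 4 the stages span 7, 5, 3 and 1 planes, so the last active cycle is 16 = N^{2}, in agreement with the cycle count stated above.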

Multiple matrices can be pipelined through the two dimensional scalar processing array. As shown in FIGS. 8a–8d, only the processors on a particular plane of execution, t_{1} to t_{16}, are active at any one time. For a given stage, up to a number of matrices equal to the number of planes of execution can be processed at the same time. To illustrate for stage **1**, a first matrix is processed along diagonal t_{1}. For a next clock cycle, the first matrix passes to plane t_{2} and plane t_{1} is used for a second matrix. The pipelining can continue for any number of matrices. One drawback to pipelining is that the data for all the matrices must be stored, unless the schedule of the availability of the matrix data is such that the pipeline does not stall.

After a group of matrices have been pipelined through stage **1**, the group is pipelined through stage **2** and so forth until stage N. Using pipelining, the throughput of the array can be dramatically increased as well as processor utilization.

Since not all the processors **56**, **58**, **60**, **62** are used during each clock cycle when processing only one matrix, the number of processing elements **56**, **58**, **60**, **62** can be reduced by sharing them across the planes of execution. FIGS. 9a and 9b illustrate two preferred implementations to reduce processing elements. As shown in FIG. 9a, a line perpendicular to the planes of execution (along the matrix diagonals) is shown for each processing element **56**, **58** of the first column. Since all of the processors **56**, **58**, **60**, **62** along each perpendicular operate in different processing cycles, their functions **56**, **58**, **60**, **62** can be performed by a single processor **66**, **64** as projected below. Processing functions **56** and **60** are performed by a new combined function **64**. Processing functions **58** and **62** are performed by a new combined function **66**. The delay elements **68** and connections between the processors are also projected. Although the left most processing element is shown as using a dual function element **66**, that element can be simplified to only perform the octagonal function **58**, if convenient for a non-banded matrix.

FIG. 10a is an expansion of FIG. 9a to accommodate an N×N G matrix. As shown in FIG. 10a, N processors **66**, **64** are used to process the N×N G matrix. As shown in FIG. 3a, the processing functions of FIG. 10a can be performed by N scalar processors **54**. The same number of scalar processors as the bandwidth, P, can be used to process the G matrix in the banded case.

In the implementation of FIG. 3a, each processor is used in every other clock cycle. The even processors operate in one cycle and the odd in the next. To illustrate, processor **2** (second from the right) of FIG. 9a processes at times t_{2}, t_{4} and t_{6} and processor **3** at t_{3} and t_{5}. As a result, two G matrices can be determined by the processing array at the same time by interlacing them as inputs to the array. This approach greatly increases the processor utilization over the implementation of

To reduce the processing time of a single array, the implementation of FIG. 3b is used. The delay elements between every other processor connection are removed, as shown in FIG. 9b. At time t_{1}, only processor **56** of a_{11} operates. However, at t_{2}, processors **58**, **60** at a_{21}, a_{22} and a_{31} are all operating. Projecting this array along the perpendicular (along the diagonals of the original matrix) is also shown in FIG. 9b. As shown, the number of delay elements **68** is cut in half. Using this array, the processing time for an N×N G matrix is ceil(NP−(P^{2}−P)/2). Accordingly, the processing time for a single G matrix is greatly reduced.
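To make the saving concrete, the two cycle counts quoted in the text can be compared directly: N^{2} clock cycles with a delay on every connection, and ceil(NP−(P^{2}−P)/2) with delays on every other connection. The function names below are hypothetical; the formulas are the ones stated above.

```python
import math


def cycles_with_all_delays(N):
    """Clock cycles when a delay sits between every connection."""
    return N * N


def cycles_with_alternate_delays(N, P):
    """Clock cycles when every other delay element is removed."""
    return math.ceil(N * P - (P * P - P) / 2)
```

For the banded 5 by 5 example with P = 3 this gives 12 cycles against 25, and for a non-banded 4×4 matrix (P = N = 4) it gives 10 cycles against 16.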

Another advantage to the implementations of **3** *a *and **3** *b *is that each processing array is scalable to the matrix bandwidth. For matrices having lower bandwidths (lower diagonal elements being zero), those elements' processors **58**, **66** in *a *and **3** *b*, since the lower diagonal elements correspond to the left most perpendicular lines of *a *and **9** *b*, the processors projected by those perpendicular lines drop out. To illustrate using *a*, the bandwidth of the matrix has the processing elements **58**, **62** of a_{41}, a_{31 }and a_{42 }as zeros. As a result, the projection to processors **66** (left most two) are unnecessary for the processing. As a result, these implementations are scalable to the matrix bandwidth.

*c*–**9** *n *illustrate the timing diagrams for each processing cycle of a banded 5 by 5 matrix having a bandwidth of 3 with delays between every other connection. At each time period, the value associated with each processor is shown. Active processors are unhatched. As shown in the figures, the processing propagates through the array from the upper left processor in *c*, stage **1**, time **0** (ã_{11}) to the lower right processor in *n*, stage **5** (δ_{55 }). As shown in the figures, due to the banded nature of the matrix, the lower left processors of an unbanded matrix processing are unnecessary and not shown.

*o*–**9** *z *illustrate the timing diagrams and memory access for each processing cycle of a linear array, such as per *b*, processing a banded 5 by 5 matrix. As shown, due to the 5 by 5 matrix having a bandwidth of 3, only three processors are needed. As also shown, each stage has a relatively high processor utilization efficiency, which increases as N/P increases.

To reduce the complexity of the processing elements, the divide and square root functions are pulled out of those elements and performed separately. Divides and square roots are more complex to implement on an ASIC than adders, subtractors and multipliers.

The only two functions that perform a divide or a square root are the pentagon and octagon functions **56**, **58**. For a given stage, as shown in *a*–**6** *d*, the pentagon and octagon functions **56**, **58** are all performed on a single column. In particular, each of these columns has a pentagon **56** on top and octagons **58** underneath. Since each octagon **58** concurrently assigns its w input to its y output, the output of the pentagon **56** flows down the entire column, without the value for w being directly stored for any a_{ij}. The octagon **58** also uses the w input to produce the x output, which is also fed back to a_{ij}. The x output is used by the square and circle functions **60**, **62** in their a_{ij }calculations. As a result, only the value for each octagon's x output needs to be determined. The x output of the octagon is the a_{ij }for that octagon **58** divided by the value of the w input, which is the same for each octagon **58** and is the y output of the pentagon **56**. Accordingly, the only division/square root function that is required to be performed is calculating x for the octagon **58**.

Using Equations 34 and 30, each octagon's x output is that octagon's a_{ij }divided by the square root of the pentagon's a_{ij}. Using a multiplier instead of a divider within each octagon processor, for a given stage, only the reciprocal of the square root of the pentagon's a_{ij }needs to be determined instead of the square root, isolating the divide function to just the pentagon processor and simplifying the overall complexity of the array. The reciprocal of the square root would then be stored as the a_{ij }of the matrix element associated with the pentagon instead of the square root. This will also be convenient later during forward and backward substitution because the divide functions in those algorithms become multiplications by this reciprocal value, further eliminating the need for dividers in other processing elements, i.e. the x outputs of *d *and **15** *d*. Since the pentagon function **56** as shown in *a *and **9** *b *is performed by the same processor **64**, the processors **66**, **64** can be implemented using a single reciprocal/square root circuit **70** having an input from the pentagon/square processor **64** and an output to that processor **64**, as shown in *a *and **10** *b*. The result of the reciprocal of the square root is passed through the processors **66**. *a *and **11** *b *correspond to *a *and **10** *b*. Separating the reciprocal/square root function **70** simplifies the complexity of the other processors **66**, **64**. Although the reciprocal/square root circuit **70** can be implemented by using a reciprocal and a square root circuit, it is preferably implemented using a look up table, especially for a field programmable gate array (FPGA) implementation, where memory is cost efficient.
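The idea of replacing every per-column divide by one reciprocal square root followed by multiplications can be sketched in software. The code below is a sequential, real-valued illustration only: it does not model the systolic timing, it assumes a symmetric positive definite input, and the function name and the separate `inv_diag` list are choices made for this sketch, not part of the patent.

```python
import math


def cholesky_with_reciprocal_diagonal(a):
    """Factor a real symmetric positive definite matrix as G * G^T.

    Returns (g, inv_diag) where g is lower triangular and
    inv_diag[j] == 1 / g[j][j]. Only one reciprocal square root is
    evaluated per column; every other operation is a multiply or subtract.
    """
    n = len(a)
    work = [list(row) for row in a]          # do not modify the caller's data
    g = [[0.0] * n for _ in range(n)]
    inv_diag = [0.0] * n
    for j in range(n):
        r = 1.0 / math.sqrt(work[j][j])      # the single divide/sqrt of the stage
        inv_diag[j] = r
        g[j][j] = work[j][j] * r             # sqrt(a_jj) = a_jj * (1 / sqrt(a_jj))
        for i in range(j + 1, n):
            g[i][j] = work[i][j] * r         # multiply instead of divide
        for k in range(j + 1, n):            # update the remaining lower triangle
            for i in range(k, n):
                work[i][k] -= g[i][j] * g[k][j]
    return g, inv_diag


if __name__ == "__main__":
    g, inv_diag = cholesky_with_reciprocal_diagonal([[4.0, 2.0], [2.0, 5.0]])
    print(g, inv_diag)
```

The stored reciprocals can then be reused by forward and backward substitution, so those steps also need multipliers only.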

After the Cholesky factor, G, is determined, __y__ is determined using forward substitution as shown in *a *and **12** *b*. The algorithm for forward substitution is as follows.

for j=1:N

y_{j} = (r_{j} − Σ_{i=1}^{j−1} g_{ji} y_{i}) / g_{jj};

end

For a banded matrix, the algorithm is as follows.

for j = 1:N

r_{j} = r_{j}/G_{jj};

for i = j + 1:min(j + P, N)

r_{i} = r_{i} − G_{ij} r_{j};

end for;

end for;

y = r;

g
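A minimal software sketch of banded forward substitution follows. It solves G y = r for a lower triangular G, includes the division by the diagonal element that the diamond function performs (Equation 44), and uses the loop bound min(j + P, N) from the text translated to 0-based indexing. The function name is illustrative, and the code is sequential rather than a model of the processor array.

```python
def banded_forward_substitution(g, r, p):
    """Solve G y = r where G is lower triangular with bandwidth p.

    g is a list of rows, r the right-hand side. Entries of G more than
    p rows below the diagonal are never read.
    """
    n = len(r)
    y = list(r)
    for j in range(n):
        y[j] = y[j] / g[j][j]                       # diamond: x <- z / g_jj
        last_row = min(j + p, n - 1)                # min(j + P, N) in 0-based form
        for i in range(j + 1, last_row + 1):
            y[i] = y[i] - g[i][j] * y[j]            # star: x <- z - w * g_ij
    return y


if __name__ == "__main__":
    g = [[2.0, 0.0, 0.0],
         [1.0, 2.0, 0.0],
         [0.0, 1.0, 2.0]]
    print(banded_forward_substitution(g, [2.0, 5.0, 8.0], 2))
```

With the 3 by 3 example above the result is [1.0, 2.0, 3.0].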

*a *and **12** *b *are two implementations of forward substitution for a 4×4 G matrix using scalar processors. Two functions are performed by the processors **72**, **74**, the star function **72** of *c *and the diamond function **74** of *d*. The star **72** performs Equations 42 and 43.

y←w Equation 42

x←z−w·g_{ij} Equation 43

The diamond function **74** performs Equations 44 and 45.

x←z/g_{ij} Equation 44

y←x Equation 45
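The star and diamond functions can be written as two small routines and chained to carry out forward substitution. This is a sketch under stated assumptions: the routines are evaluated one after another instead of concurrently, the tuple order (x, y) and the driver function are conventions invented for this illustration.

```python
def star(z, w, g_ij):
    """Equations 42 and 43: pass w through as y, update x = z - w * g_ij."""
    x = z - w * g_ij
    y = w
    return x, y


def diamond(z, g_jj):
    """Equations 44 and 45: x = z / g (diagonal element), y = x."""
    x = z / g_jj
    y = x
    return x, y


def forward_substitution_with_pes(g, r):
    """Solve G y = r by applying one diamond and a column of stars per stage."""
    n = len(r)
    z = list(r)
    result = []
    for j in range(n):
        x, w = diamond(z[j], g[j][j])       # diagonal element produces y_j
        result.append(x)
        for i in range(j + 1, n):
            z[i], w = star(z[i], w, g[i][j])  # w flows down the column unchanged
    return result


if __name__ == "__main__":
    g = [[2.0, 0.0], [1.0, 4.0]]
    print(forward_substitution_with_pes(g, [4.0, 10.0]))
```

Here the first stage yields y_{1} = 2, the star updates the second input to 10 − 2·1 = 8, and the second stage yields y_{2} = 2.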

Inserting delay elements between the concurrent connections of the processing elements as in *a *and projecting the array perpendicular to its planes of execution (t_{1 }to t_{7}) allows the array to be projected onto a linear array. The received vector values from {tilde over (r)}, r_{1}–r_{4}, are loaded into the array and y_{1}–y_{4 }are output from the array. Since the diamond function **74** is only along the main diagonal, the four (4) processing element array can be expanded to process an N×N matrix using the N processing elements per *a*. The processing time for this array is 2N cycles.

Since each processing element is used in only every other processing cycle, half of the delay elements can be removed as shown in *b*. This projected linear array can be expanded to any N×N matrix as shown in *b*. The processing time for this array is N cycles.

The operation per cycle of the processing elements of the projected array of *b *is illustrated in *a *–**14** *d*. In the first cycle, t_{1}, of *a*, r_{1 }is loaded into the left processor **1** (**74**) and y_{1 }is determined using r_{1 }and g_{11}. In the second cycle, t_{2}, of *b *, r_{2 }and r_{3 }are loaded, g_{31}, g_{21 }and g_{22 }are processed and y_{2 }is determined. In the third cycle, t_{3}, of *c*, r_{4 }is loaded, g_{41}, g_{42}, g_{32 }and g_{33 }are processed, and y_{3 }is determined. In the fourth cycle, t_{4}, of *d*, g_{43 }and g_{44 }are processed and y_{4 }is determined.

*e*–**12** *j *illustrate the timing diagrams for each processing cycle of a banded 5 by 5 matrix. *e *shows the banded nature of the matrix having three zero entries in the lower left corner (a bandwidth of 3).

To show that the same processing elements can be utilized for forward substitution as well as Cholesky decomposition, *f *begins in stage **6**. Stage **6** is the stage after the last stage of *c*–**9** *n. *

Similarly, *k*–**12** *p *illustrate the extension of the processors of *o*–**9** *z *to also performing forward substitution. These figures begin in stage **6**, after the 5 stages of Cholesky decomposition. The processing is performed for each processing cycle from stage **6**, time **0** (*k*) to the final results (*p*), after stage **6**, time **4** (*o*).

After the y variable is determined by forward substitution, the data vector can be determined by backward substitution. Backward substitution is performed by the following subroutine.

for j=N:−1:1

d_{j} = (y_{j} − Σ_{i=j+1}^{N} g*_{ij} d_{i}) / g*_{jj};

end

For a banded matrix, the following subroutine is used.

for j = N:−1:1

y_{j} = y_{j}/G^{H}_{jj};

for i = max(1, j − P):j − 1

y_{i} = y_{i} − G^{H}_{ij} y_{j};

end for;

end for;

d = y;

(·)* indicates a complex conjugate function. g*
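Banded backward substitution can be sketched the same way. The code below solves G^{H} d = y, where element (i, j) of G^{H} is the complex conjugate of element (j, i) of G. It assumes the lower index bound is max(1, j − P), since a lower bound of min(1, j − P) would run past the first row. The function name is illustrative, and the code is sequential rather than a model of the array.

```python
def banded_backward_substitution(g, y, p):
    """Solve G^H d = y where G is lower triangular with bandwidth p.

    Only G itself is stored; conjugates are taken on the fly.
    """
    n = len(y)
    d = [complex(v) for v in y]
    for j in range(n - 1, -1, -1):
        d[j] = d[j] / complex(g[j][j]).conjugate()          # divide by g*_jj
        for i in range(max(0, j - p), j):                   # max(1, j - P) in 0-based form
            d[i] = d[i] - complex(g[j][i]).conjugate() * d[j]
    return d


if __name__ == "__main__":
    g = [[2.0, 0.0],
         [1j, 1.0]]
    # With d = [1, 2], G^H d = [2 - 2j, 2].
    print(banded_backward_substitution(g, [2 - 2j, 2.0], 2))
```

For the 2 by 2 example the routine recovers d = [1, 2].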

Backward substitution is also implemented using scalar processors using the star and diamond functions **76**, **78** as shown in *a *and **15** *b *for a 4×4 processing array. However, these functions, as shown in *c *and **15** *d*, are performed using the complex conjugate of the G matrix values. Accordingly, Equations 42–45 become 46–49, respectively.

y←w Equation 46

x←z−w·g*_{ij} Equation 47

x←z/g*_{jj} Equation 48

y←x Equation 49

With the delays **68** at the concurrent assignments between processors **76**, **78**, the array of *a *is projected across the planes of execution to a linear array. This array is expandable to process an N×N matrix, as shown in *a*. The __y__ vector values are loaded into the array of *a *and the data vector, __d__, is output. This array takes 2N clock cycles to determine __d__. Since each processor operates only in every other clock cycle, two __d__s can be determined at the same time.

Since each processor **76**, **78** in **16** *a *operates in every other clock cycle, every other delay can be removed as shown in *b*. The projected array of *b *is expandable to process an N×N matrix as shown in *b*. This array takes N clock cycles to determine __d__.

The operations per cycle of the processing elements **76**, **78** of the projected array of *b *are illustrated in *a*–**17** *d*. In the first cycle, t_{1}, of *a*, y_{4 }is loaded, g*_{44 }is processed and d_{4 }is determined. In the second cycle, t_{2}, of *b*, y_{2 }and y_{3 }are loaded, g*_{43 }and g*_{33 }are processed and d_{3 }is determined. In the third cycle, t_{3}, of *c*, y_{1 }is loaded, g*_{41}, g*_{42}, g*_{32 }and g*_{22 }are processed and d_{2 }is determined. In the fourth cycle, t_{4}, of *d*, g*_{21 }and g*_{11 }are processed and d_{1 }is determined.

*e*–**15** *j *illustrate the extension of the processors of *e*–**12** *j *to performing backward substitution on a banded matrix. *e *shows the banded nature of the matrix having three zero entries in the lower left corner.

The timing diagrams begin in stage **7**, which is after stage **6** of forward substitution. The processing begins in stage **7**, time **0** (*f*) and is completed at stage **7**, time **4** (*j*). After stage **7**, time **4** (*j*), all of the data, d_{1 }to d_{5}, is determined.

Similarly, *k*–**15** *p *illustrate the extension of the processors of *k*–**12** *p *to also performing backward substitution. These figures begin in stage **7**, after stage **6** of forward substitution. The processing is performed for each processing cycle from stage **7**, time **0** (*k*) to the final results (*p*). As shown in *c*–**9** *n*, **12** *e*–**12** *j *and **15** *e*–**15** *j, *the number of processors in a two dimensional array can be reduced for performing Cholesky decomposition, forward and backward substitution for banded matrices. As shown by *o*–**9** *z*, **12** *k*–**12** *p*, the number of processors in a linear array is reduced from the dimension of the matrix to the bandwidth of banded matrices.

To simplify the complexity of the individual processing elements **72**, **74**, **76**, **78** for both forward and backward substitution, the divide function **80** can be separated from the elements **72**, **74**, **76**, **78**, as shown in *a *and **18** *b*. *a *and **18** *b *correspond to *a *and **16** *b*, respectively. Although the data associated with the processing elements **72**, **74**, **76**, **78** for forward and backward substitution differ, the function performed by the elements **72**, **74**, **76**, **78** is the same. The divider **80** is used by the right most processor **74**, **78** to perform the division function. The divider **80** can be implemented as a look up table to determine a reciprocal value, which is used by the right most processor **74**, **78** in a multiplication. Since during forward and backward substitution the reciprocal from Cholesky execution already exists in memory, the multiplication of the reciprocal for forward and backward substitution can utilize the reciprocal already stored in memory.

Since the computational data flow for all three processes (determining G, forward and backward substitution) is the same, using N or the bandwidth P processing elements, all three functions can be performed on the same reconfigurable array. Each processing element **84**, **82** of the reconfigurable array is capable of performing the functions to determine G and to perform forward and backward substitution, as shown in *a *and **19** *b*. The right most processor **82** is capable of performing a pentagon/square and diamond function, **64**, **74**, **78**. The other processors **84** are capable of performing a circle/octagon and star function **66**, **72**, **76**. When performing Cholesky decomposition, the right most processor **82** operates using the pentagon/square function **64** and the other processors **84** operate using the circle/octagon function **66**. When performing forward and backward substitution, the right most processor **82** operates using the diamond function **74**, **78** and the other processors **84** operate using the star function **72**, **76**. The processors **82**, **84** are, preferably, configurable to perform the requisite functions. Using the reconfigurable array, each processing element **82**, **84** performs the two arithmetic functions of forward and backward substitution and the four functions for Cholesky decomposition, totaling six arithmetic functions per processing element **82**, **84**. These functions may be performed by an arithmetic logic unit (ALU) and proper control logic or other means.

To simplify the complexity of the individual processing elements **82**, **84** in the reconfigurable array, the divide and square root functionality is preferably broken out from the array into a reciprocal and square root device **86**. The reciprocal and square root device **86**, preferably, determines the reciprocal to be used in a multiplication, as shown in *a *and **20** *b*, by the right most processor **82** in forward and backward substitution, and the reciprocal of the square root to be used in a multiplication using the right most processor data and passed through the processors **84**. The determination of the reciprocal and reciprocal/square root is, preferably, performed using a look up table. Alternately, the divide and square root function block **86** may be a division circuit and a square root circuit.

To reduce the number of processors **82**, **84** further, folding is used. *a *and **21** *b *illustrate folding. In folding, instead of using P processing elements **82**, **84** for a linear system solution, a smaller number of processing elements, F, are used for Q folds. To illustrate, if P is nine (9) processors **82**, **84**, three (3) processors **82**, **84** perform the function of the nine (9) processors over three (3) folds. One drawback with folding is that the processing time of the reduced array is increased by a multiple Q. One advantage is that the efficiency of the processor utilization is typically increased. For three folds, the processing time is tripled. Accordingly, the selection of the number of folds is based on a trade off between minimizing the number of processors and the maximum processing time permitted to process the data.
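The trade-off described above can be put into numbers with a short sketch. It assumes that the number of folds Q is the smallest integer with F·Q ≥ P and that processing time scales by exactly Q, as the text states. The function names and the base cycle count used in the example are hypothetical.

```python
import math


def folds_needed(p, f):
    """Smallest number of folds Q such that F physical processors cover P."""
    return math.ceil(p / f)


def folded_cycles(base_cycles, p, f):
    """Processing time of the folded array: the unfolded time multiplied by Q."""
    return folds_needed(p, f) * base_cycles


if __name__ == "__main__":
    # Nine logical processors on three physical ones, as in the text.
    print(folds_needed(9, 3))
    # Hypothetical unfolded cycle count of 12, tripled by three folds.
    print(folded_cycles(12, 9, 3))
```

A designer would pick the smallest F for which the folded cycle count still fits the time allowed to process the data.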

*a *illustrates bi-directional folding for four processing elements **76** _{1}, **76** _{2}, **76** _{3}, **76** _{4}/**78** performing the function of twelve elements over three folds of the array of **11** *b*. Instead of delay elements being between the processing elements **76** _{1}, **76** _{2}, **76** _{3}, **76** _{4}/**78**, dual port memories **86** _{1}, **86** _{2}, **86** _{3}, **86** _{4 }(**86**) are used to store the data of each fold. Although delay elements (dual port memories **86**) may be present for each processing element connection, such as for the implementation of *a*, it is illustrated for every other connection, such as for the implementation of *b*. Instead of dual port memories, two sets of single port memories may be used.

During the first fold, each processor's data is stored in its associated dual port memory **86** in an address for fold **1**. Data from the matrix is also input to the processors **76** _{1}–**76** _{3}, **76** _{4}/**78** from memory cells **88** _{1}–**88** _{4 }(**88**). Since there is no wrap-around of data between fold **1** processor **76** _{4}/**78** and fold **3** processor **76** _{1}, a dual port memory **86** is not used between these processors. However, since a single address is required between the fold **1** and fold **2** processor **76** _{1}, and between fold **2** and fold **3** processor **76** _{4}/**78**, a dual port memory **86** is shown as a dashed line. During the second fold, each processor's data is stored in a memory address for fold **2**. Data from the matrix is also input to the processors **76** _{1}–**76** _{3}, **76** _{4}/**78** for fold **2**. Data for fold **2** processor **76** _{1 }comes from fold **1** processor **76** _{1}, which is the same physical processor **76** _{1}, so (although shown) this connection is not necessary. During the third fold, each processor's data is stored in its fold **3** memory address. Data from the matrix is also input to the processors **76** _{1}–**76** _{3}, **76** _{4}/**78** for fold **3**. Data for fold **3** processor **76** _{4}/**78** comes from fold **2** processor **76** _{4}/**78**, so this connection is not necessary. For the next processing stage, the procedure is repeated for fold **1**.

*a *is an implementation of bidirectional folding of *a *extended to N processors **76** _{1}–**76** _{N−1}, **76** _{N}/**78**. The processors **76** _{1}–**76** _{N−1}, **76** _{N}/**78** are functionally arranged in a linear array, accessing the dual port memory **86** or two sets of single port memories.

*b *illustrates a one directional folding version of the array of **11** *b*. During the first fold, each processor's data is stored in its associated dual port memory address for fold **1**. Although fold **1** processor **76** _{4}/**78** and fold **3** processor **76** _{1 }are physically connected, in operation no data is transferred directly between these processors. Accordingly, the memory port **86** _{4 }between them has storage for one less address. Fold **2** processor **76** _{4}/**78** is effectively coupled to fold **1** processor **76** _{1 }by the ring-like connection between the processors. Similarly, fold **3** processor **76** _{4}/**78** is effectively coupled to fold **2** processor **76** _{1}.
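The two folding schemes differ only in how a position in the unfolded chain maps onto a physical processor. The sketch below is one reading of the description, not a statement of the patented layout: it assumes the chain starts at the fold **1** processor **76** _{4}/**78**, that bi-directional folding reverses direction on every fold, and that one directional folding restarts each fold at the same end through the ring connection. The function name and the 0-based chain index k are illustrative.

```python
def folded_position(k, f, bidirectional):
    """Map chain position k (0-based) to (fold number, physical processor label).

    Labels run from 1 to f; label f stands for the combined end processor.
    """
    fold, pos = divmod(k, f)
    if bidirectional and fold % 2 == 1:
        label = pos + 1          # odd folds walk back from processor 1 up to f
    else:
        label = f - pos          # other folds walk from processor f down to 1
    return fold + 1, label


if __name__ == "__main__":
    for mode in (True, False):
        print([folded_position(k, 4, mode) for k in range(12)])
```

Under these assumptions, bi-directional folding places the fold **1** to fold **2** hand-over on the same physical processor (label 1) and the fold **2** to fold **3** hand-over on label 4, while one directional folding hands over from label 1 to label 4 across the ring.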

*b *is an implementation of one directional folding of *b *extended to N processors. The processors **76** _{1}–**76** _{N−1}, **76** _{N}/**78** are functionally arranged in a ring around the dual memory.

To implement Cholesky decomposition, forward and backward substitution onto folded processors, a processor in the array, such as the **76** _{4}/**78** processor, must be capable of performing the functions not only for Cholesky decomposition, forward and backward substitution, but also for each fold, as shown in *a *and **20** *b *for processor **76** _{4}/**78**. Depending on the implementation, the added processor's required capabilities may increase the complexity of that implementation. To implement folding using ALUs, one processor (such as the **76** _{4}/**78** processor) performs twelve arithmetic functions (four for forward and backward substitution and eight for Cholesky) and the other processors only perform six functions.

The signals w, x, y, and z are the same as those previously defined in the PE function definitions. The signals a^{q }and a^{d }represent the current state and next state, respectively, of a PE's memory location being read and/or written in a particular cycle of the processing. The names in parentheses indicate the signals to be used for the second slice.

This preferred processing element can be used for any of the PEs, though it is desirable to optimize PE**1**, which performs the divide function, independently from the other PEs. Each input to the multiplexers **94** _{1 }to **94** _{8 }is labeled with a ‘−’ to indicate that it is used for PE**1** only, a ‘+’ to indicate that it is used for every PE except PE**1**, or a ‘+’ to indicate that it is used for all of the PEs. The isqr input is connected to zero except for the real slice of PE1, where it is connected to the output of a function that generates the reciprocal of the square root of the a^{q} _{r }input. Such a function could be implemented as a LUT with a ROM for a reasonable fixed-point word size.

As shown, the outputs of multiplexers **94** _{1 }and **94** _{2 }are multiplied by a multiplier **96** _{1}. The outputs of multiplexers **94** _{3 }and **94** _{4 }are multiplied by a multiplier **96** _{2}. The outputs of multipliers **96** _{1 }and **96** _{2 }are combined by an add/subtract circuit **98**. The output of the add/subtract circuit **98** is combined with the output of multiplexer **94** _{5 }by a subtractor **99**. The output of subtractor **99** is an input to multiplexer **94** _{8}.
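One way to see why two multipliers, an add/subtract circuit and a subtractor suffice is to evaluate the star update x←z−w·g on complex values with one slice for the real part and one for the imaginary part. The sketch below is an interpretation: the operand order of the subtractor (fifth input minus the combined products) and the assignment of signals to inputs are assumptions made for illustration, and the function names do not come from the patent.

```python
def slice_datapath(m1, m2, m3, m4, m5, subtract_products):
    """Two multipliers, an add/subtract stage, then a subtractor.

    Assumed operand order: m5 minus the combined products.
    """
    product_a = m1 * m2
    product_b = m3 * m4
    combined = product_a - product_b if subtract_products else product_a + product_b
    return m5 - combined


def star_x_complex(z, w, g):
    """Compute x = z - w * g using one real slice and one imaginary slice."""
    real = slice_datapath(w.real, g.real, w.imag, g.imag, z.real, True)
    imag = slice_datapath(w.real, g.imag, w.imag, g.real, z.imag, False)
    return complex(real, imag)


if __name__ == "__main__":
    print(star_x_complex(3 + 4j, 1 + 2j, 2 - 1j))
```

For z = 3+4j, w = 1+2j and g = 2−1j the product w·g is 4+3j, so both slices together give −1+1j.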

Patent Citations

Cited Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|

US4964126 | Sep 30, 1988 | Oct 16, 1990 | Massachusetts Institute Of Technology | Fault tolerant signal processing machine and method |

US5630154 | Oct 11, 1994 | May 13, 1997 | Hughes Aircraft Company | Programmable systolic array system arranged in a found arrangement for passing data through programmable number of cells in a time interleaved manner |

US6064689 * | Jun 4, 1999 | May 16, 2000 | Siemens Aktiengesellschaft | Radio communications receiver and method of receiving radio signals |

US6707864 * | Dec 27, 2001 | Mar 16, 2004 | Interdigital Technology Corporation | Simplified block linear equalizer with block space time transmit diversity |

US6714527 * | Jun 18, 2002 | Mar 30, 2004 | Interdigital Techology Corporation | Multiuser detector for variable spreading factors |

US6870882 * | Sep 28, 2000 | Mar 22, 2005 | At&T Corp. | Finite-length equalization over multi-input multi-output channels |

US6937644 * | Jan 8, 2004 | Aug 30, 2005 | Interdigital Technology Corporation | Generalized two-stage data estimation |

US6985513 * | Jan 5, 2001 | Jan 10, 2006 | Interdigital Technology Corporation | Channel estimation for time division duplex communication systems |

US20030026325 * | Feb 20, 2002 | Feb 6, 2003 | Interdigital Technology Corporation | Fast joint detection base station |

JPH103468A | Title not available |

Non-Patent Citations

Reference | ||
---|---|---|

1 | Golub et al., "Matrix Computations (Johns Hopkins Series in the Mathematical Sciences)," 3rd Edition, Chapter 4, "Special Linear Systems," pp. 133-205. |

Referenced by

Citing Patent | Filing date | Publication date | Applicant | Title |
---|---|---|---|---|

US7633913 * | Nov 5, 2004 | Dec 15, 2009 | Nextel Communications Inc. | Wireless communication system using joint detection to compensate for poor RF condition based on user priority |

US7924778 | Aug 12, 2005 | Apr 12, 2011 | Nextel Communications Inc. | System and method of increasing the data throughput of the PDCH channel in a wireless communication system |

US20060099953 * | Nov 5, 2004 | May 11, 2006 | Nextel Communications, Inc. | Wireless communication system using joint detection to compensate for poor RF condition based on user priority |

US20070036070 * | Aug 12, 2005 | Feb 15, 2007 | Mansour Nagi A | System and method of increasing the data throughput of the PDCH channel in a wireless communication system |

US20080065928 * | Aug 24, 2007 | Mar 13, 2008 | International Business Machines Corporation | Technique for supporting finding of location of cause of failure occurrence |

Classifications

U.S. Classification | 370/335, 375/E01.025, 370/342, 375/149, 375/229 |

International Classification | H04K1/00, H04B15/00, H04B7/155, G06F17/16, H04L27/30, H04B1/707, G06F17/12, H04B7/216 |

Cooperative Classification | G06F17/16, H04B1/7105, G06F17/12, H04B1/71055 |

European Classification | H04B1/7105, G06F17/16, G06F17/12 |

Legal Events

Date | Code | Event | Description |
---|---|---|---|

Nov 13, 2007 | CC | Certificate of correction | |

Oct 14, 2010 | FPAY | Fee payment | Year of fee payment: 4 |

Dec 24, 2014 | REMI | Maintenance fee reminder mailed | |

May 15, 2015 | LAPS | Lapse for failure to pay maintenance fees | |

Jul 7, 2015 | FP | Expired due to failure to pay maintenance fee | Effective date: 20150515 |
