US 20020154727 A1 Abstract Cone-beam reconstruction is performed within a practically acceptable time on a computer having one or more microprocessors. The calculations involved in the reconstruction are divided into calculations to be performed on the MMX, ALU and SSE units of each of the microprocessors. For pure floating-point data, it is preferred to use the MMX unit to adjust the data address and map data and to use the SSE unit to perform the backprojection. The data are partitioned by z-line so that the data to be processed in each stage of the backprojection fit within the L
1 cache. Claims(31) 1. A system for generating a three-dimensional image representative of an interior portion of an object, the system comprising:
a radiation cone-beam scanner which generates cone-beam projection signals by passing a radiation cone beam through the object onto a detector; and a computer, receiving the cone-beam projection signals, for generating the three-dimensional image by performing a plurality of calculations on the cone-beam projection signals, the computer comprising at least one fixed-point processing unit and at least one floating-point processing unit, the at least one fixed-point processing unit operating in parallel with the at least one floating-point processing unit, the computer dividing the plurality of calculations into a first plurality of calculations to be performed in the at least one fixed-point processing unit and a second plurality of calculations to be performed in the at least one floating-point processing unit. 2. The system of the first plurality of calculations comprise generation of a projection map; and the second plurality of calculations comprise backprojection of the cone-beam projection signals in accordance with the projection map to produce the image. 3. The system of the generation of the projection map comprises mapping a world coordinate system (x, y, z) to a coordinate system (u, t, s) of the detector, where u is independent of z; the computer organizes the cone-beam projection signals into z-lines for which z varies but x and y are constant; and the computer performs the backprojection for each of the z-lines. 4. The system of 5. The system of 6. The system of a boundary of the object is known a priori; and the three-dimensional image is generated only within the boundary. 7. The system of 8. The system of 9. The system of 10. The system of 11. The system of 12. A method of generating a three-dimensional image representative of an interior portion of an object, the method comprising:
(a) passing a cone beam through the object onto a detector to generate cone-beam projection signals; and (b) receiving the cone-beam projection signals and generating the three-dimensional image by performing a plurality of calculations on the cone-beam projection signals; wherein step (b) is performed on a computer comprising at least one fixed-point processing unit and at least one floating-point processing unit, the at least one fixed-point processing unit operating in parallel with the at least one floating-point processing unit, the computer dividing the plurality of calculations into a first plurality of calculations to be performed in the at least one fixed-point processing unit and a second plurality of calculations to be performed in the at least one floating-point processing unit. 13. The method of the first plurality of calculations comprise generation of a projection map; and the second plurality of calculations comprise backprojection of the cone-beam projection signals in accordance with the projection map to produce the reconstructed image. 14. The method of the generation of the projection map comprises mapping a world coordinate system (x, y, z) to a coordinate system (u, t, s) of the detector, where u is independent of z; the computer organizes the cone-beam projection signals into z-lines for which z varies but x and y are constant; and the computer performs the backprojection for each of the z-lines. 15. The method of 16. The method of 17. The method of a boundary of the object is known a priori; and the three-dimensional image is generated only within the boundary. 18. The method of 19. The method of 20. The method of 21. The method of 22. The method of 23. The method of 24. The method of 25. The method of 26. The method of 27. The method of 28. The method of 29. The method of (i) performing hybrid computing utilizing single instruction multiple data (SIMD) over a plurality of execution units; (ii) using multi-thread and fiber support in an operating system so as to automatically enable reconstruction parallelism in a multi-processor environment effective datat I/O; and (iii) optimizing memory and cache access through data partitioning. 30. The method of 31. The method of Description [0001] The present invention is directed to a system and method for cone-beam reconstruction in medical imaging or the like and more particularly to such a system and method implemented on one or more microprocessors. The present invention is also useful for nondestructive testing, single photon emission tomography and CT-based explosive detection, micro CT or micro cone beam volume CT, etc. [0002] Cone-beam reconstruction has attracted much attention in the medical imaging community. Examples of cone-beam reconstruction are found in the commonly assigned U.S. Pat. Nos. 5,999,587 and 6,075,836 and U.S. patent application Ser. Nos. 09/589,115 and 09/640,713, whose disclosures are hereby incorporated by reference in their entireties into the present disclosure. [0003] CT (computed tomography) image reconstruction algorithm can be classified into two major classes: filtered backprojection (FBP) and iterative reconstruction (IR). The filtered backprojection is more often discussed because it is accurate and amenable to fast implementation. The filtered backprojection can be implemented as an exact reconstruction method or as an approximate reconstruction method, both based on the Radon transform and/or the Fourier transform. [0004] The cone beam reconstruction process is time-consuming and needs a lot of computing operation. Currently, the cone beam reconstruction process is prohibitively long for clinical and other practical applications. Considering a set of data with projection size N=512, since the time and computation for FBP is O(N [0005] Existing fast algorithms for reconstruction are based on either the Fourier Slice Theorem or a multi-resolution re-sampling of the backprojection. Algorithms based on the Fourier Slice Theorem use interpolations to transform the Fourier projection data from the polar to the Cartesian grid, from which the reconstruction can be obtained by an inverse FFT. Many works have been done to bring down the FBP time, and most of them are focused on fan-beam data. These include the linogram method and the “links” method as well as related fast methods for re-projection. An approximate method has been proposed based on the sinogram and “link”; such a method works for 2D FBP and can achieve O(N [0006] A customized backprojection hardware engine having parallelism and pipelining of various kinds can push the execution speed to the very limit. The hardware can be an FPGA based module or an ASIC module, a customized mask-programmable gate array, a cell-based IC and field programmable logic device or an add-in board with high speed RISC or DSP processors. Those boards usually use high-speed multi-port buffer memory or a DMA controller to increase data exchanging speed between boards. Some techniques, like vector computing and pre-interpolating projection data, are used with the customized engine to decrease reconstruction operation. Most of the customized hardware is built for 2D FBP reconstruction applications. No reconstruction engine-based a single or multiple microprocessors that is specially designed for fast cone beam reconstruction is commercially available. [0007] A multi-processor computer or a multi-computer system can be used to accelerate the cone beam reconstruction algorithm. Many large-scale parallel computers have tightly coupled processors interconnected by high-speed data paths. The multi-processor computer can be a shared memory computer or a distributed memory computer. Much work has been done on the large-scale and extremely expensive parallel computer. Most of that work uses an algorithm based on the 3D Radon transform. As an example, the Feldkamp algorithm and two iterative algorithms, 3D ART and SIRT, have been implemented on large-scale computers such as Cray-3D, Paragon and SP1. In such implementations, the local data partition is used for the Feldkamp algorithm and the SIRT algorithm, while the global data partition is used for the ART algorithm. The implementation is voxel driven. The communication speed between processors is important to the reconstruction time, and the Feldkamp implementation can gain best performance in Multiple Instruction Multiple Data (MIMD) computers. Parallel 2D FBP has been implemented on Intel Paragon and CM5 computers. Using customized accelerating hardware or a large-scale parallel computer is not a cost-effective fast reconstruction solution, and it is not convenient to modify or add a new algorithm for research work. [0008] In a distributed computing environment, many computers can be connected together to work as a multi-computer system. The computing tasks are distributed to each computer. Usually the parallel program running on a multi-computer system uses some standard library such as MPI (message passing interface) or PVM (parallel virtual machine). Parallel reconstruction has been tested on a group of Sun Sparc2 computers connected with an Ethernet network, and the implementation is based on the PVM library. The Feldkamp algorithm has been implemented on heterogeneous workstation clusters based on the MPI library. The implementation runs on six computer clusters, and the result shows that the implementation in load balancing resulted in processor utilization of 81.8%, and use of asynchrous communication has improved processor utilization to 91.9%. The biggest disadvantage of multi-computer clusters is that communication speed decreases reconstruction speed. Since cone beam reconstruction involves a large data memory, the data is usually distributed into each computer. The computers need to exchange data in the backprojection phase. The memory communication is a big trade-off for reconstruction speed. Another disadvantage is the inability to get a small size reconstruction engine with multi-computer clusters. There are also some attempts to implement cone beam reconstruction on distributed computing technology such as COBRA (common object request broker architecture and specification). Usually the distributed computing library costs more communication time trade-off than directly using the MPI library, thus resulting in lower reconstruction speed. [0009] Besides parallelism between processors, a single processor can gain data and operation parallelism with some micro-architecture techniques. Instruction-level Parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Modem processors can divide instruction executing into several stages; some techniques such as pipeline and branch prediction permit the execution of multiple instructions simultaneously. To enable data processing parallelism, some processors add single instruction multiple data (SIMD) instructions, which can process several data in one instruction. Such processors include Intel's IA-32 architecture with MMX™ and SSE/SSE2, Motorola's PowerPC™ with AltVec™ and AMD Athlon with 3Dnow™. However, to date, such parallelism has not been exploited in cone-beam reconstruction. [0010] In light of the above, it will be readily apparent that a need exists in the art to perform cone-beam reconstruction at a practically acceptable speed without the need for customized hardware or a large-scale computer. It is therefore an object of the invention to provide a system and method for cone-beam reconstruction which can be performed quickly on inexpensive, widely available equipment. [0011] To achieve the above and other objects, the present invention is directed to a practical implementation for high-speed CBR on a commercially available PC based on hybrid computing (HC). Feldkamp CBR is implemented with multi-level acceleration, performing HC utilizing single instruction multiple data (SIMD) and making execution units (EU) in the processor work effectively. The multi-thread and fiber support in the operating system can be exploited, which automatically enable the reconstruction parallelism in a multi-processor environment and also make data I/O to the hard disk more effective. Memory and cache access are optimized by proper data partitioning. Tested on an Intel Pentium III 500 Mhz computer and compared to the traditional implementation, the present invention can decrease filtering time by more than 75% for 288 projections each having 512 [0012] In the present invention, the Feldkamp algorithm cone beam reconstruction (FACBR) can achieve high speed with good precision. The test environment is an Intel Pentium III 500 Mhz with 640 MB 100 Mhz memory. The result shows that the reconstruction for a 512 [0013] A higher speed SSE-2 enabled Pentium IV and a 2- or 4-processor PC are expected to permit 512 [0014] A high-speed implementation will be disclosed for FACBR on a PC. Techniques for hybrid execution (HE) and hybrid data (HD) will also be disclosed. With these hybrid computing features, good memory organization and instruction optimization, a high speed Feldkamp implementation can be implemented on a general purpose PC with a high performance to price ratio. The HD and HE can also be applied to implementation on other hardware platforms to improve the FACBR performance. With higher clock frequency processors and an inexpensive market available SMP PC, it is possible to gain good performance as done by expensive, inconvenient customized hardware. As a commercial market available PC is used to achieve high performance, it is convenient to design new algorithms and a new system for cone beam reconstruction, and it is useful to integrate an image grab system and 3D rendering system, in a single system which is easy to configure and upgrade. [0015] As Intel x86 CPU frequency has increased to the GHz level, it is practical and economically feasible to build a Multi-Processor x86-based high-speed cone beam reconstruction computing engine. Although the Feldkamp algorithm is an approximate cone beam reconstruction algorithm, it is a practical and efficient 3D reconstruction algorithm and is a basic component in a few exact cone-beam reconstruction algorithms including the present invention. [0016] The present invention implements parallel processing on a single microprocessor or multiple processors. The use of hybrid computing (both fixed and floating point calculation) accelerates the cone-beam reconstruction without reducing the accuracy of the reconstruction and without increasing image noise. Those characteristics are particularly important for the reconstruction of soft tissue, e.g., cancer detection. [0017] A preferred embodiment of the present invention will be set forth in detail with reference to the drawings, in which: [0018]FIG. 1 shows a cone-beam coordinate system used in reconstruction in the preferred embodiment; [0019]FIG. 2 shows the architecture of an Intel x86 processor; [0020]FIG. 3 shows an UML diagram of a known FACBR implementation; [0021]FIG. 4 shows a UML diagram of hybrid execution according to the preferred embodiment; [0022]FIG. 5 shows a data partition scheme used in the preferred embodiment; and [0023]FIG. 6 shows a block diagram of a system on which the preferred embodiment of the present invention can be implemented. [0024] A preferred embodiment of the present invention will now be set forth in detail with reference to the drawings. [0025] The preferred embodiment will be disclosed in terms of the coordinate system shown in FIG. 1. The O-XYZ is the world coordinate system. The X-Y-Z axis gives the physical coordinates for the reconstructed voxels. The Z-axis is the rotation axis. The t-s axis is the rotated gantry X-Y coordinate system. The s-axis always passes through the x-ray source and is perpendicular to the detector plane. [0026] Several groups have investigated the reconstruction for the cone beam geometry. The most efficient algorithm in use is the one developed by Feldkamp, L. A., Davis, L. C., and Kress, J. W., “Practical Cone-Beam Algorithm,” [0027] The coordinates used in the discussion are shown in FIG. 1 and are defined relative to an x-ray source [0028] a) Apply weight and ramp filter to the projections data; this is done by applying a weight to each P(θ) value and applying convolution to data in rows or in columns with filter data.
[0029] b) Backproject the data P [0030] As noted above, (t,s) is the coordinate in gantry system, which is the rotation transform of the X-Y axis with angle θ. [0031] Let the reconstruction volume be N [0032] The Intel x86 architecture diagram is shown in FIG. 2. The Intel processor 200 has multiple execution units (EU's) to do integer and float operations S simultaneously. In an Intel Pentium III, there are two integer unit (ALU [0033] It is possible to construct a multi-processor computer with several processors; most such of computers now on the market are SMP with two or four processors. Usually the operating system running on the computer has some techniques such as multi-thread and multi-process to utilize the multi-processor's hardware resource. For example, Microsoft Windows has Win32 threads, and Unix/Linux has pthread support. It is contemplated that the preferred embodiment will most often be implemented with Windows NT/2000, which has multi-thread and fiber support, which automatically reconstructs parallelism in a multi-processor environment. [0034] Several tricks can be used to minimize the actual number of operations in the backprojection loop, e.g. by changing the order in which x, y and z are incremented; applying a special boundary to reconstruction voxels; pre-interpolating projection data to allow for the simplest possible interpolation in the actual inner loop. The basic performance for a single-processor computer system can be expressed in terms of [0035] Where T is the total time to execute, n is the number of instructions executed, t is the time per instruction, and CPI is the number of cycles per instruction. Decreasing the clock time t is a matter of engineering. Generally, smaller, faster circuits lead to better clock speed. [0036] Decreasing the other two factors involves some version of parallelism. There are several levels of parallelism: First, a single program can be broken into constituent parts, and different processors compute each part; this is called Program-Level Parallelism. Second, some techniques such as pipelining allow more throughputs by the execution of overlapping instructions; this is called Instruction-Level Parallelism. Finally, Low-level parallelism is primarily of interest to designers of the arithmetic logic units and relatively invisible to user; this is called Arithmetic and Bit-Level Parallelism. The preferred embodiment relies primarily on Program-Level Parallelism and Instruction-Level Parallelism. The program level parallelism is manifested in independent sections of a program or in individual iterations of a loop. Such parallelism may be exploited by employing multiple processors. The Instruction-Level Parallelism has two basic kinds: Individual instructions are overlapped (executed at the same time) in the processor, a given instruction is decomposed into sub operations and the sub operations are overlapped. [0037] As described in Feldkamp algorithm, a set of M projections is used, each projection having a size N×N pixels, to reconstruct an N [0038] If the algorithm is not changed, the O (N [0039] The normal FACBR implementation can explained with a unified modeling language (UML) sequence diagram as FIG. 3. When the FACBR process starts, the M projection data are loaded one by one, each being filtered according to equation (1); then the filtered data are used to backproject to each voxel ƒ(x,y,z). After the backprojection is finished for all data, the reconstruction is finished, and the data are saved or rendered in a 3D display. Usually the filtering time is about {fraction (1/15)} to {fraction (1/30)} of the backprojection time or less. In the implementation with C/C++ code according to FIG. 3, only the FPU and to a lesser extent the ALU are used during the FACBR process, and the most powerful EU's are wasted. [0040] It is known in the art that there are four possible forms of parallelism in Feldkamp algorithm implementation: pixel parallelism, projection parallelism, rays parallelism and operation parallelism. In the reconstruction process, all voxels and projections are independent of one another, and rays can be backprojected independently. The operations for filtering and backprojection of each projection are independent; the low level multiplication and addition operation can even be divided independently. To implement fast cone beam reconstruction, the following methods are applied in the FACBR implementation: [0041] a) Split the FDK backprojection procedure into two phases: projection map generation to calculate (u,t,s) and data backprojection to calculate ƒ(x,y,z). Equation (2) shows that (u,t,s) depends only on (x,y), so that the projection map needs only O(N [0042] b) Use some a priori knowledge to generate some boundary as a sphere or cylinder; the computation can be skipped for some voxels which are outside the boundary and unable to be reconstructed, thereby providing a smaller k. If the reconstructed voxels are visualized as a cube with N length, then the full number of voxels is on the order of N [0043] c) The FACBR process involves a large memory requirement; for N=512, the data of one projection or one slice will be 1 Mbyte. This is even bigger than the L [0044] d) To get shorter t [0045] e) The reconstruction data and projection data processing can be split into several parts and run on different processors when SMP is available. With n-processor SMP, when the fraction of the task which cannot be converted to concurrent work is f, the k value is decreased with a theoretic speed-up as n/(1+(n−1)f) according to Amdahl's law. A multi-processor computer works by multithread implementation and carefully allocates the tasks among the processors. Operating systems capable of controlling a multi-processor computer in such a manner are known in the art, as noted above. For a single processor, the context switching will sacrifice the CPU time and so may actually decrease the performance, so it is contemplated that the multi-thread method will be used only when SMP is available. [0046] The method described above has been implemented and tested on an ordinary PC having the specifications set forth in Table 1 below:
[0047] Microsoft Visual C++ 6.0 and Intel C++ 4.5 were used as developing tools. In the traditional implementation, a pure float point (PF) calculation was performed and consequently only the FPU was fully used because of the compiler. Intrinsic and fine-tuned assembler code provide a hybrid computing method. Namely, both floating-point and fixed-point computing are used during the reconstruction process with HD, so as to fully utilize the EUs in Pentium III processors. [0048] Parallelism considerations will now be described. The first is the use of hybrid execution (HE) and hybrid data (HD). [0049] HD is used to decrease and make ALU units work in parallel. As the SSE unit is independent from the FPU unit and the MMX unit, the SSE unit can work with the ALU unit, the MMX unit and even the FPU unit simultaneously, thus allowing a hybrid execute mode for either PF data or HD. There are different methods to use the EUs to accelerate the FACBR process. The map data and some intermediate results can be processed by the ALU in fixed point data format, and the reconstruction data and finally output results can be processed in floating point format. That hybrid data format for different data and stages can improve the EU's efficiency. For PF data, the best method is to use the MMX unit to adjust the data address and map data, and to use the SSE to do the backprojection calculation. The MMX can process data address and map data for two or more points, while the ALU can deal with only one point. The HE method for PF can be shown as a UML activity diagram in FIG. 4. [0050] Since the MMX unit in a Pentium III processor can only process 8-bit and 16-bit integer multiplication, it is not so effective to do HE for HD data as to do HE for PF data. However, with new processor techniques such as SSE2 in the Pentium IV processor, the hybrid execution with HD will gain more improvements on speed. [0051] The second parallelism consideration is the data partition schema. For reconstruction for different axes, the reconstruction data are partitioned into different sub-units. A data partition scheme is shown in FIG. 5. Data are stored in memory as a one-dimensional array, in which the index of each data point increases for z, then for x and last for y. Data are processed in z-lines because the same projection data u value can be used for one z-line In fact, the projection data used to do backprojection for voxels in one z-line are actually in two adjacent u-lines, since four adjacent points (two in each of the adjacent u-lines) are used to interpolate the data for one voxel in the z-line. It is thus easy to prefetch the projection line data and a reconstruction z-line into cache, as one line occupies only 4N bytes. For N=512, the whole data line only need 6 Kbytes, which is suitable for both L [0052] To ensure precision when improving on cone beam reconstruction speed, first the reconstruction accuracy of the implementation will be determined using computer-simulated phantom. Second, the reconstruction error noise level and uniformity of the reconstructed images are quantified using both pure float point implementation and HD computing implementation, and the reconstruction results from the two implementations are compared with both simulated phantom and experimental phantom data. After it is determined that the HC implementation does not introduce artifacts and unacceptable reconstruction errors, the speedup of MC implementation is evaluated compared to normal pure float-point computing reconstruction. Experimental phantom data are also used to evaluate the effectiveness of the implementation in the real world. [0053] Two simulated phantoms shown in Table 2 below are used to evaluate the precision. The Shepp Logan phantom is used as a general precision error compare reference. The cylinder phantom is used to compare the precision error at different z positions. Normally, the Feldkamp Algorithm has the best result at center slice, and the precision error increases for the slices at two ends. The cylinder phantom is used to check whether the HD and PF precision error varies with z-distance to center.
[0054] The total FACBR time contains the filtering time and backprojection time; the filtering time takes only a small part in the total time. After optimization of the filtering processing with SSE in PF data format, the timing result for 288 512
[0055] The backprojection process is the most time-consuming part of FACBR. The acceleration of five implementations has been tested; the results are shown in Table 4 below. The tests used 288 512
[0056] Table 5 below shows the effective t
[0057] Precision will now be compared. The relative error between two images P and Q is calculated as
[0058] First, the error ratio will be calculated for each pixel. Then, the average error ratio will be calculated for the whole comparing Region of Interest (ROI). [0059] Since HE-PF works with floating-point data, it will not introduce an extra precision error compared to tradition PF method. The greatest concern is whether HD computing will bring more precision error or not. If the relative precision error between a PF reconstructed image and a phantom image is E [0060] The precision error has been determined for a HD reconstructed image relative to a simulated phantom image and a PF reconstructed image. For the Shepp Logan phantom the precision error between the HD image and the PF image is less 0.03%; for the cylinder phantom, the E [0061] An apparatus on which the invention can be implemented is shown in FIG. 6, which is reproduced from FIG. 9 of the above-referenced U.S. Pat. No. 5,999,587. In a standard CT, a 3-D reconstruction is obtained by stacking a series of slices. In a volume CT, a direct reconstruction of an object can be obtained. Referring now to FIG. 6, it is shown how the cone-beam tomography system [0062] A cone beam volume CT scanning apparatus examines a body P using a cone shaped radiation beam [0063] A first motor [0064] In order to obtain the arc measurements as previously discussed, a tilt control [0065] In addition to the method above to acquire circle and arc projections, alternatively, the circle-plus-arc geometry can be implemented in one of the following two ways. In the first and preferred of the three methods, the gantry [0066] The second alternative method is to mechanically modify a standard CT gantry such that two short arc orbits are added to the gantry, and the x-ray tube [0067] Mounted on the gantry frame [0068] The cone-shaped beam of radiation [0069] Alternatively, a continuous series of two-dimensional detectors (not shown) can be fixedly mounted proximate to the gantry frame [0070] A 2-D projection acquisition control and A/D conversion unit [0071] The analog-to-digital conversion unit [0072] In accordance with the principles of the invention discussed previously, a display processor [0073] While a preferred embodiment of the present invention has been set forth above, those skilled in the art who have reviewed the present disclosure will readily appreciate that other embodiments can be realized within the scope of the present invention. For example, specific numerical values are illustrative rather than limiting, as are mentions of specific commercial products. The present invention is not specific to the Feldkamp algorithm, but can be used to implement any filtered backprojection cone-beam algorithms efficiently and optimally. Nor is the present invention specific to x86 processors; instead, the invention can be used with any processor capable of implementing the algorithms described above and has particular utility with any processor that has a floating-point unit that can process more than one single-precision 32-bit datum within one instruction set and a fixed-point unit that can process more than one 16- or 32-bit data within one instruction set. Therefore, the present invention should be construed as limited only by the appended claims. Referenced by
Classifications
Legal Events
Rotate |