|Publication number||US7765386 B2|
|Application number||US 11/237,548|
|Publication date||Jul 27, 2010|
|Filing date||Sep 28, 2005|
|Priority date||Sep 28, 2005|
|Also published as||CN1983164A, CN1983164B, US20070074009, WO2007038576A1|
|Publication number||11237548, 237548, US 7765386 B2, US 7765386B2, US-B2-7765386, US7765386 B2, US7765386B2|
|Inventors||David D. Donofrio, Michael Dwyer|
|Original Assignee||Intel Corporation|
1. Field of the Invention
Embodiments of the invention relate to the field of microprocessors, and more specifically, to floating-point units.
2. Description of Related Art
Use of floating-point (FP) operations is becoming increasingly prevalent in many areas of computation such as three-dimensional (3-D) computer graphics, image processing, digital signal processing, weather prediction, space exploration, seismic processing, and numerical analysis. Specially designed floating-point units have been developed to enhance FP computational power in computer systems. Many FP applications involve vector processing. Floating-point processors (FPP's) designed for vector processing typically employ a pipeline architecture.
Existing techniques for pipelined FPP's typically use a single deep pipeline for vector processing. While this approach may be simple and sufficient for some applications, it has a number of drawbacks for highly intensive vector processing. It is difficult to modify the architecture when the problem size changes, either increasing or decreasing. There may also be deadlocks, leading to low pipeline utilization. Simple instructions may have the same latency as complex instructions, leading to inefficient use of pipelines. Other disadvantages may include low throughput, throughput dependency on vector population, etc.
Embodiments of the invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
An embodiment of the present invention is a technique to perform floating-point operations for vector processing. An input queue captures a plurality of vector inputs. A scheduler dispatches the vector inputs. A plurality of floating-point (FP) pipelines generates FP results from operating on scalar components of the vector inputs dispatched from the scheduler. An arbiter and assembly unit arbitrates use of an output section and assembles the FP results to write to the output section.
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown to avoid obscuring the understanding of this description.
One embodiment of the invention may be described as a process which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a program, a procedure, a method of manufacturing or fabrication, etc.
One embodiment of the invention is a technique to perform FP operations efficiently for vector processing. The technique employs a FP unit (FPU) architecture that uses multiple shallow pipelines instead of a single deep pipeline. This provides a high degree of scalability in computing power and buffer depths. By varying the number of pipelines, the architecture may accommodate a wide range of computational requirements. An incoming vector input is decomposed into a set of independent scalar components that are forwarded to the multiple FP pipelines to be processed in parallel. A simple arbitration scheme assigns the FP results to an output buffer and re-assembles the entire vector result in an asynchronous manner. Such out-of-order completion may allow short-latency, simple instructions to complete before long-latency, complex instructions, leading to high throughput. The arbitration scheme also provides improved deadlock prevention, leading to high pipeline utilization, by assigning the output buffer space as commands complete rather than at dispatch time.
The processor unit 15 represents a central processing unit of any type of architecture, such as processors using hyper threading, security, network, digital media technologies, single-core processors, multi-core processors, embedded processors, mobile processors, micro-controllers, digital signal processors, superscalar computers, vector processors, single instruction multiple data (SIMD) computers, complex instruction set computers (CISC), reduced instruction set computers (RISC), very long instruction word (VLIW), or hybrid architecture.
The FPU 20 is a co-processor that performs floating-point operations for vector processing. It may have a direct interface to the processor unit 15 and may share system resources, such as memory space, with the processor unit 15. The processor unit 15 and the FPU 20 may exchange instructions and data including vector data and FP instructions. The FPU 20 may also be viewed as an input/output (I/O) processor that occupies an address space of the processor unit 15. It may also be interfaced to the MCH 25 instead of directly to the processor unit 15. It uses a highly scalable architecture with multiple FP pipelines for efficient vector processing.
The MCH 25 provides control and configuration of memory and input/output devices such as the main memory 30 and the ICH 40. The MCH 25 may be integrated into a chipset that integrates multiple functionalities such as graphics, media, isolated execution mode, host-to-peripheral bus interface, memory control, power management, etc. The MCH 25 or the memory controller functionality in the MCH 25 may be integrated in the processor unit 15. In some embodiments, the memory controller, either internal or external to the processor unit 15, may work for all cores or processors in the processor unit 15. In other embodiments, it may include different portions that may work separately for different cores or processors in the processor unit 15.
The main memory 30 stores system code and data. The main memory 30 is typically implemented with dynamic random access memory (DRAM), static random access memory (SRAM), or any other types of memories including those that do not need to be refreshed. The main memory 30 may be accessible to the processor unit 15 or both of the processor unit 15 and the FPU 20.
The ICH 40 has a number of functionalities that are designed to support I/O functions. The ICH 40 may also be integrated into a chipset together with or separate from the MCH 25 to perform I/O functions. The ICH 40 may include a number of interface and I/O functions such as peripheral component interconnect (PCI) bus interface, processor interface, interrupt controller, direct memory access (DMA) controller, power management logic, timer, system management bus (SMBus), universal serial bus (USB) interface, mass storage interface, low pin count (LPC) interface, etc.
The interconnect 45 provides interface to peripheral devices. The interconnect 45 may be point-to-point or connected to multiple devices. For clarity, not all the interconnects are shown. It is contemplated that the interconnect 45 may include any interconnect or bus such as Peripheral Component Interconnect (PCI), PCI Express, Universal Serial Bus (USB), and Direct Media Interface (DMI), etc.
The mass storage device 50 stores archive information such as code, programs, files, data, and applications. The mass storage device 50 may include compact disk (CD) read-only memory (ROM) 52, digital video/versatile disc (DVD) 53, floppy drive 54, and hard drive 56, and any other magnetic or optic storage devices. The mass storage device 50 provides a mechanism to read machine-accessible media. The I/O devices 47 1 to 47 K may include any I/O devices to perform I/O functions. Examples of I/O devices 47 1 to 47 K include controller for input devices (e.g., keyboard, mouse, trackball, pointing device), media card (e.g., audio, video, graphic), network card, and any other peripheral controllers.
The graphics controller 65 is any processor that has graphic capabilities to perform graphics operations such as fast line drawing, two-dimensional (2-D) and three-dimensional (3-D) graphic rendering functions, shading, anti-aliasing, polygon rendering, transparency effect, color space conversion, alpha-blending, chroma-keying, etc. The FPU 70 is essentially similar to the FPU 20 shown in
The pixel processor 85 is a specialized graphic engine that can perform specific and complex graphic functions such as geometry calculations, affine conversions, model view projections, 3-D clipping, etc. The pixel processor 85 is also interfaced to the memory controller 70 to access the memory 80 and/or the graphic controller 65. The display processor 90 processes the display of the graphic data and performs display-related functions such as palette table look-up, synchronization, backlight control, video processing, etc. The DAC 95 converts digital display data to an analog video signal for the display monitor 97. The display monitor 97 is any display monitor that displays the graphic information on the screen for viewing. The display monitor 97 may be a Cathode Ray Tube (CRT) monitor, a television (TV) set, a Liquid Crystal Display (LCD), a Flat Panel, or a Digital CRT.
The input queue 210 captures or stores the vector inputs to be processed from the processor unit 15 (
The vector input selector 220 selects the vector inputs from the input queue 210 to send to the scheduler 230. It may include K multiplexers 225 1 to 225 K. Each multiplexer has a number of inputs connected to one or more outputs of the input queue 210. The selection of the vector inputs may be based on suitable criteria.
The scheduler 230 receives the vector inputs selected by the vector input selector 220 and dispatches floating-point operations contained within the vector inputs to the FP section 240 for processing depending on the FP instructions, the availability of the FP section 240, and optionally other criteria specific to the implementation. The scheduler 230 assigns a unique identification (ID) or serial number to each of the vector inputs, forwards the unique IDs of the vector inputs to the arbiter and assembly unit 250, and includes this ID with each operation dispatched to the FP section 240.
The FP section 240 performs the FP operations on the dispatched vector inputs according to the FP instructions. It may include P FP pipelines 245 1 to 245 P operating in parallel and independently from one another. The number P may be any number suitable for the application. The P FP pipelines 245 1 to 245 P may be the same or different. They may include individual FP pipelines designed to perform specific FP operations, such as FP adder, FP subtractor, FP divider, FP multiplier, FP complex mathematical functions (e.g., trigonometric functions), etc. The P FP pipelines 245 1 to 245 P may have the same or different depths resulting in the same or different latencies. The scheduler 230 dispatches an operation to one of the P FP pipelines 245 1 to 245 P according to the FP instruction associated with the vector input and whether the FP pipeline is available or free. Each of the P FP pipelines 245 1 to 245 P generates a status signal to the scheduler 230 to indicate whether it is capable of accepting an additional FP operation or is unavailable. If a FP pipeline is unavailable and the FP instruction calls for the use of this FP pipeline, the scheduler 230 keeps the corresponding vector input in a waiting queue until the FP pipeline or another suitable FP pipeline is available. Each of the P FP pipelines 245 1 to 245 P generates a FP result at the output.
The arbiter and assembly unit 250 writes the FP results from the FP section 240 to the output section 260. The arbiter and assembly unit 250 arbitrates the use of the output section when there are multiple FP results available at the output of the FP section 240. When there is a winner among the FP results, it writes the FP result to an assigned output buffer in the output section at an appropriate location. In essence, the arbiter and assembly unit 250 assembles the FP results for each result vector after all the scalar components of a vector input are processed.
The output section 260 stores the FP results written by the arbiter and assembly unit 250. The FP results may be emptied, read, or consumed by another unit such as the processor unit 15, the graphics controller 65, or any other consuming entity for further processing. The output section 260 includes Q output buffers 265 1 to 265 Q. Each output buffer has sufficient storage capacity to store a result vector. The output buffers may be implemented as registers or memory. Each of the Q output buffers 265 1 to 265 Q has a status signal to indicate whether it has obtained all the scalar components of a vector result and whether its contents have been emptied, read, or consumed by another consuming entity in the system. The arbiter and assembly unit 250 monitors these status signals to determine if an output buffer is available or free for writing.
The ID generator 310 generates a unique ID for a vector input scheduled to be dispatched. The unique ID may be generated sequentially or incrementally according to the order the vector inputs arrive in the input queue 210 or selected by the vector input selector 220. The ID may be within zero to k where k is greater than the number of unique vector instructions that can be contained in the P FP pipelines 245 1 to 245 P shown in
The vector decomposer 320 decomposes a vector input into a plurality of scalar components. Each of the scalar components has a scalar position tag indicating its position in the associated input vector. The vector decomposer 320 forwards the decomposed scalar components and their position tags to the dispatcher 330.
The dispatcher 330 receives the status signals of the FP pipelines 245 1 to 245 P in the FP section 240 to determine their availability and sends the scalar components to the available FP pipelines in the FP section 240. It also forwards the IDs and the position tags of the vector inputs and their scalar components to the arbiter and assembly unit 250.
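The ID generation and vector decomposition described above may be illustrated in software. The following Python sketch is a behavioral model only, not the circuit of the embodiment; the names `ScalarOp`, `Scheduler`, `id_space`, and `decompose` are hypothetical, and the wrap-around of the ID counter is an assumption consistent with IDs being drawn from a bounded range of zero to k.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Sequence


@dataclass(frozen=True)
class ScalarOp:
    """One scalar component, tagged so it can be reassembled out of order."""
    vector_id: int   # ID shared by every component of one vector input
    position: int    # scalar position tag within the vector input
    value: float     # the scalar operand


class Scheduler:
    """Hypothetical model of the ID generator and the vector decomposer."""

    def __init__(self, id_space: int) -> None:
        # Assumption: IDs are issued sequentially and wrap modulo id_space
        # (id_space corresponds to k + 1 for IDs in the range zero to k).
        self._counter = count()
        self._id_space = id_space

    def decompose(self, vector: Sequence[float]) -> List[ScalarOp]:
        """Assign one ID to the vector and tag each component with its position."""
        vector_id = next(self._counter) % self._id_space
        return [ScalarOp(vector_id, pos, val) for pos, val in enumerate(vector)]
```

In this model, every component of a vector carries the same ID and a distinct position tag, which is the information the arbiter and assembly unit needs to rebuild the result vector regardless of completion order.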
The selector 410 selects between a scalar component sent from the scheduler 230 and a FP result at the output of the PL unit 420. The FP result is re-circulated when its ID is not matched with the ID assigned by the arbiter and assembly unit 250. The re-circulated FP result is tagged with a null operation so that it may be propagated through the PL unit unmodified, preserving its value. The selector 410 may be implemented as a multiplexer at the input of the PL unit 420.
The PL unit 420 performs a floating-point operation on the scalar component. It has M PL stages 425 1 to 425 M. The depth M may be any suitable depth depending on the scalability and/or throughput requirements. It generates a FP PL result at the output to the arbiter and assembly unit 250. In addition, the ID of the vector input that the scalar component belongs to, and the position tag within that vector, are also propagated along the M PL stages 425 1 through 425 M. This FP PL ID is matched to the ID assigned by the arbiter and assembly unit 250 to determine if the associated FP PL result may be written to an output buffer or re-circulated back to the selector 410. If the operation is incomplete, such as when an additional pass is required for computational purposes, or the operation is complete and there is no match, the FP PL result is re-circulated while waiting for an output buffer to become available. If the operation is complete and there is a match, the FP PL result is written to an appropriate location of the assigned output buffer using the corresponding position tag. The PL unit 420 also generates a status signal to the scheduler 230 to indicate whether it is busy or available for processing a scalar component.
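The re-circulation behavior may be sketched with a cycle-by-cycle software model. The Python below is a simplified illustration under stated assumptions: the class `ShallowPipeline` and its method `step` are hypothetical names, the FP operation is applied when the component enters the pipeline so that a re-circulated entry acts as a null operation, and multi-pass operations are not modeled.

```python
from collections import deque
from typing import Callable, Deque, Optional, Set, Tuple

# (vector ID, position tag, value)
Entry = Tuple[int, int, float]


class ShallowPipeline:
    """Hypothetical model of one FP pipeline with a selector at its input."""

    def __init__(self, depth: int, op: Callable[[float], float]) -> None:
        self.op = op
        self.stages: Deque[Optional[Entry]] = deque([None] * depth)

    def step(
        self, incoming: Optional[Entry], assigned_ids: Set[int]
    ) -> Tuple[Optional[Entry], bool]:
        """Advance one cycle; return (result written out, incoming accepted)."""
        out = self.stages.pop()              # entry leaving the last stage
        written: Optional[Entry] = None
        feed: Optional[Entry] = None
        accepted = False
        if out is not None and out[0] in assigned_ids:
            written = out                    # ID matched: result may be written
        elif out is not None:
            feed = out                       # no match: re-circulate unmodified
        if feed is None and incoming is not None:
            vid, pos, val = incoming
            feed = (vid, pos, self.op(val))  # new component enters the pipeline
            accepted = True
        self.stages.appendleft(feed)         # selector output feeds stage one
        return written, accepted
```

In this model a re-circulating result takes priority over a new scalar component at the selector, so the second return value plays the role of the status signal reported back to the scheduler.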
The ID queue 510 stores the ID's sent from the scheduler 230. It may be implemented by a FIFO or any other suitable storage mechanism. Its depth depends on the number P of the parallel FP pipelines, scalability and/or throughput requirements. In addition, multiple ID's may be made available to the arbiter 520.
The arbiter 520 receives the ID's from the ID queue 510 and the status signals from the output section 260. The status signals are provided by the output buffers to indicate whether they are available. The arbiter 520 assigns the ID to an output buffer in the output section 260 if that output buffer is available. The arbiter 520 may assign the ID's to the output buffers using a round robin, a first in first out, or any other suitable assignment scheme. Once an output buffer is assigned an ID, it uses that assigned ID until all the scalar components are written into it. The arbiter 520 then forwards all the assigned ID's to the matching circuit 530. By deferring the buffer assignment at the arbiter 520, deadlock prevention may be improved, leading to high pipeline utilization.
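The deferred assignment of output buffers may be illustrated with a short behavioral model. The Python below is a sketch that assumes a first-in-first-out assignment scheme, one of the options named above; the class `Arbiter` and its methods `enqueue`, `assign`, and `release` are hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Deque, Dict


class Arbiter:
    """Hypothetical model: IDs wait in a queue until an output buffer is free."""

    def __init__(self, num_buffers: int) -> None:
        self.num_buffers = num_buffers
        self.id_queue: Deque[int] = deque()   # IDs forwarded by the scheduler
        self.assigned: Dict[int, int] = {}    # buffer index -> assigned vector ID

    def enqueue(self, vector_id: int) -> None:
        """Record an ID at dispatch time without reserving any buffer."""
        self.id_queue.append(vector_id)

    def assign(self) -> None:
        """Give the oldest waiting IDs to whichever buffers are currently free."""
        for buf in range(self.num_buffers):
            if buf not in self.assigned and self.id_queue:
                self.assigned[buf] = self.id_queue.popleft()

    def release(self, buf: int) -> None:
        """Mark a buffer free once its vector has been consumed."""
        self.assigned.pop(buf, None)
```

Because `enqueue` does not reserve a buffer, a vector waiting in the pipelines never holds output storage that another, already completed vector could use.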
The matching circuit 530 matches the ID's associated with the FP PL results at the outputs of the FP pipelines 245 1 to 245 P in the FP section 240 with the assigned ID's. If there is any match, it sends the matching result to the assembler 540. For any FP PL ID that is not matched, the corresponding FP pipeline re-circulates its FP PL result back to its input. The matching circuit 530 may match the FP PL ID's with a single assigned ID or multiple assigned ID's. Matching multiple assigned ID's may provide higher throughput because it allows an instruction associated with a short latency pipeline to pass an instruction associated with a long latency pipeline. The matching circuit 530 may be implemented by a number of comparators performing comparisons in parallel. The comparator may be formed by L exclusive OR gates to perform bit-wise comparisons followed by an L-input OR gate where L is the word size of the ID, or an equivalent logic circuit.
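The comparator logic may be modeled with integer bit operations. In the Python sketch below, the function names `ids_match` and `match_any` and the parameter `width` (standing for the word size L) are hypothetical. Note that the output of the OR gate is asserted when any bit differs, so a match corresponds to that output being low.

```python
from typing import Iterable


def ids_match(pipeline_id: int, assigned_id: int, width: int) -> bool:
    """Compare two IDs of `width` bits the way an XOR/OR comparator would."""
    mask = (1 << width) - 1
    diff_bits = (pipeline_id ^ assigned_id) & mask  # bit-wise exclusive OR
    mismatch = diff_bits != 0                       # wide OR over the XOR outputs
    return not mismatch


def match_any(pipeline_id: int, assigned_ids: Iterable[int], width: int) -> bool:
    """Model several comparators checking one FP PL ID against multiple assigned IDs."""
    return any(ids_match(pipeline_id, a, width) for a in assigned_ids)
```

For example, `match_any(3, [1, 2, 3], 4)` evaluates to `True`, while `ids_match(5, 4, 4)` evaluates to `False` because the two IDs differ in the least significant bit.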
The assembler 540 receives the result of the matching circuit 530, the position tags from the scheduler 230 and the FP PL results from the FP section 240. It writes the FP result that has the ID matched with the assigned ID to the output buffer at a location indicated by the corresponding scalar position tag.
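Assembly by position tag may be illustrated as follows. The Python sketch below is a minimal model that assumes each output buffer holds one result vector of a known length; the class `OutputBuffer` and its methods `write` and `is_complete` are hypothetical names.

```python
from typing import List, Optional


class OutputBuffer:
    """Hypothetical model of one output buffer holding a single result vector."""

    def __init__(self, num_components: int) -> None:
        # None marks a scalar slot that has not been written yet.
        self.slots: List[Optional[float]] = [None] * num_components

    def write(self, position: int, value: float) -> None:
        """Place a FP result at the location given by its scalar position tag."""
        self.slots[position] = value

    def is_complete(self) -> bool:
        """Status signal: all scalar components of the vector have arrived."""
        return all(slot is not None for slot in self.slots)
```

Because each write is addressed by the position tag, results may arrive in any order; for instance, writing positions 2, 0, and then 1 still yields a correctly ordered result vector.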
Upon START, the process 600 captures the vector inputs (Block 610). Each of the vector inputs has N scalar components. Then, the process 600 dispatches the vector inputs to the FP pipelines according to the FP instructions and the availability of the FP pipelines (Block 620). Next, the process 600 generates the FP results by performing FP operations on the vector inputs (Block 630). Then, the process 600 arbitrates the use of the output section and assembles the FP results to the output section (Block 640) and is then terminated.
Upon START, the process 620 generates an ID for the vector input (Block 710). The ID is unique for each vector input. Then, the process 620 forwards the ID to the arbiter and assembly unit (Block 720). Next, the process 620 decomposes the vector input into scalar components and associates each scalar component with a scalar position tag to indicate its position in the vector input (Block 730). Next, the process 620 determines if a FP pipeline for a scalar component is available (Block 740). If not, the process 620 returns to Block 740 to wait for an available FP pipeline. Otherwise, the process 620 sends the scalar component, the ID, and the position tag to the available FP pipeline (Block 750).
Next, the process 620 determines if there are any scalar components remaining for this vector (Block 760). If so, the process 620 returns to Block 740 to continue waiting for an available FP pipeline. Otherwise, the process 620 is terminated.
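The availability check of Block 740 may be sketched for a single dispatch cycle. The Python below is a simplified model that assumes each free FP pipeline accepts at most one scalar component per cycle and that any pipeline can execute the operation; the function name `dispatch_cycle` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Component = Tuple[int, float]        # (position tag, value)
Sent = Tuple[int, int, float]        # (pipeline index, position tag, value)


def dispatch_cycle(
    components: Sequence[Component], pipeline_free: Sequence[bool]
) -> Tuple[List[Sent], List[Component]]:
    """Send components to free pipelines; return (sent, still waiting)."""
    free = list(pipeline_free)       # copy of the pipeline status signals
    sent: List[Sent] = []
    waiting: List[Component] = []
    for position, value in components:
        if True in free:
            pipe = free.index(True)  # first available pipeline
            free[pipe] = False       # it is now busy for this cycle
            sent.append((pipe, position, value))
        else:
            waiting.append((position, value))  # wait for an available pipeline
    return sent, waiting
```

Components returned in the waiting list would be offered again on a later cycle, which corresponds to returning to Block 740 until no scalar components remain.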
Upon START, the process 640 determines if an output buffer is available (Block 810). If not, the process 640 is terminated. Otherwise, the process 640 assigns an ID from the ID queue to the available output buffer (Block 820). The assignment may be done using a round robin, a first in first out, or any other suitable assignment scheme. Next, the process 640 determines if a FP ID of a completed operation at the output of a FP pipeline matches the assigned ID (Block 830). If not, the process 640 returns to Block 830 to continue checking. Since the FP pipelines operate in parallel and independently of each other, eventually there will be a FP ID that matches the assigned ID. If there is a match, the process 640 writes the FP result with the matched ID to the output buffer at the location indicated by the position tag associated with the FP result (Block 840).
Next, the process 640 determines if the vector in the output buffer is completed (Block 850). A vector in the output buffer is completed when all the scalar components of the vector are written into the output buffer. If not complete, the process 640 returns to Block 830 to check for subsequent matches. Otherwise, the process 640 marks the output buffer complete and notifies a consuming entity that the result data is ready (Block 860). The consuming entity may be the processor 15 or the graphics controller 65 shown in
While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US4763294 *||Dec 19, 1985||Aug 9, 1988||Wang Laboratories, Inc.||Method and apparatus for floating point operations|
|US5511210 *||Jun 18, 1993||Apr 23, 1996||Nec Corporation||Vector processing device using address data and mask information to generate signal that indicates which addresses are to be accessed from the main memory|
|US5887183 *||Jan 4, 1995||Mar 23, 1999||International Business Machines Corporation||Method and system in a data processing system for loading and storing vectors in a plurality of modes|
|US6141673 *||May 25, 1999||Oct 31, 2000||Advanced Micro Devices, Inc.||Microprocessor modified to perform inverse discrete cosine transform operations on a one-dimensional matrix of numbers within a minimal number of instructions|
|US6292886 *||Oct 12, 1998||Sep 18, 2001||Intel Corporation||Scalar hardware for performing SIMD operations|
|US6530011 *||Oct 20, 1999||Mar 4, 2003||Sandcraft, Inc.||Method and apparatus for vector register with scalar values|
|US6839828 *||Aug 14, 2001||Jan 4, 2005||International Business Machines Corporation||SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode|
|US20030018676||Mar 15, 2001||Jan 23, 2003||Steven Shaw||Multi-function floating point arithmetic pipeline|
|EP0682317A1 *||Apr 26, 1995||Nov 15, 1995||AT&T Corp.||Vector processor for use in pattern recognition|
|1||Chiueh, Tzi-cker, "Multi-Threaded Vectorization", Proceedings of the Annual Int. Symp. On Comp. Arch., Toronto, 1991, New York, IEEE, US, vol. Symp. 18, pp. 352-361.|
|2||Kechadi M-T et al., "Analysis and Simulation of an Out-of-Order Execution Model in Vector Multiprocessor Systems", Parallel Computing, vol. 23, pp. 1863-1986, 1997.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US20160188532 *||Dec 27, 2014||Jun 30, 2016||Intel Corporation||Method and apparatus for performing a vector bit shuffle|
|U.S. Classification||712/222, 712/221, 712/2|
|Cooperative Classification||G06F9/3885, G06F9/3851, G06F15/8061|
|European Classification||G06F15/80V2, G06F9/38T, G06F9/38E4|
|Sep 28, 2005||AS||Assignment|
Owner name: INTEL CORPORATION, CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DONOFRIO, DAVID D.;DWYER, MICHAEL;REEL/FRAME:017054/0965
Effective date: 20050928
|Jan 2, 2014||FPAY||Fee payment|
Year of fee payment: 4