WO2000004484A2 - Wide instruction word graphics processor - Google Patents

Wide instruction word graphics processor

Info

Publication number
WO2000004484A2
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
processor
input
graphics
instructions
Prior art date
Application number
PCT/US1999/016193
Other languages
French (fr)
Other versions
WO2000004484A3 (en
Inventor
Vernon Brethour
Dale Kirkland
William Lazenby
Gary Shelton
Original Assignee
Intergraph Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intergraph Corporation
Publication of WO2000004484A2
Publication of WO2000004484A3

Links

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators
    • G09G5/36 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators characterised by the display of a graphic pattern, e.g. using an all-points-addressable [APA] memory
    • G09G5/363 Graphics controllers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3804 Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3806 Instruction prefetching for branches, e.g. hedging, branch folding using address prediction, e.g. return stack, branch history buffer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09G ARRANGEMENTS OR CIRCUITS FOR CONTROL OF INDICATING DEVICES USING STATIC MEANS TO PRESENT VARIABLE INFORMATION
    • G09G5/00 Control arrangements or circuits for visual indicators common to cathode-ray tube indicators and other visual indicators

Definitions

  • the present invention relates to computers, and more particularly to computers using very large instruction words for various purposes, including for graphics processing.
  • Data presented to a computer graphics subsystem are often expressed as strips of polygons (often triangles) in accordance with a graphics processing standard, such as the well known OpenGL graphics library.
  • Rendering a scene involves transforming the coordinates of all of the polygons in all of the strips and determining the pixel values in the display that are associated with each portion of each of the polygons that appears in the display.
  • the large amount of data involved in these calculations in relation to the conflicting goals of achieving rendering both quickly and in detail, places heavy demands on computational resources.
  • THE REALITY ENGINE distributed by Silicon Graphics, Inc.
  • VLIW very large instruction words
  • high level programming languages are devised that employ instructions utilizing a register-to-register type of instruction set.
  • the effect of a successful VLIW machine is to launch and complete a great many instructions on each clock cycle, so the register-to-register instruction set requires a register file with many read ports and many write ports.
  • U.S. patent 5,644,780 assigned to International Business
  • VLIW register file for VLIW with 8 write ports and 12 read ports.
  • the result is a VLIW computation engine capable of high levels of parallelism, but which can be built only at great cost that requires many registers.
  • the present invention achieves high levels of parallelism in a graphics processor by providing in a first embodiment an apparatus for processing computer graphics requests utilizing a wide word instruction.
  • the apparatus of this embodiment has 1. a graphics request input;
  • each instruction is a wide word.
  • each instruction is a very wide word.
  • each instruction is a super wide word.
  • each instruction is an ultra wide word.
  • the processor has functional units producing n results per clock cycle and registers for storing not more than n/2 of such results.
  • the functional units are connected by a cross bar.
  • a “wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 64 bits of control to the processor.
  • a “very wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 99 bits of control to the processor.
  • a “super wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 128 bits of control to the processor.
  • An “ultra wide word” is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 255 bits of control to the processor.
  • a “register” is a storage element associated with a processor permitting reading of data on the processor clock cycle that immediately follows the clock cycle in which storage has been accomplished.
  • a computer having data stores with multiple addressing modes.
  • the computer has 1. a data input;
  • each data store having a plurality of addressing modes, wherein a single instruction individually selects an addressing mode for each of the data stores.
  • the plurality of addressing modes may include indirect and absolute addressing modes.
  • the indirect mode may further include a double level of indirect addressing.
  • each instruction is a wide word.
  • a computer for processing computer graphics requests wherein the data input is a graphics request input.
  • a multiple processor apparatus for processing computer graphics requests in which the control store is accessed at an increased clock rate in relation to the clock rate of the processors.
  • the apparatus has: 1. a plurality n of processors, n>1, each processor running at a processor clock rate R; and
  • a single control store supplying instructions for the processors, running at a store clock rate nR.
  • the processors are responsive to instructions, and each instruction is a wide word or (in a yet further embodiment) a super wide word.
  • Another related embodiment also has a control store sequencer, for evaluating branch instructions, at a clock rate nR, so that each processor may be caused to branch without processor clock delay for evaluation of branch instructions.
  • Another embodiment of the invention provides an apparatus, for processing computer graphics requests, that uses a stack for storing instruction addresses arranged so as not to produce an overflow condition. This embodiment has
  • a processor coupled to the graphics request input, and having an output and responsive to a set of instructions, the set including a call and a return;
  • a program counter for addressing instructions and having a value for a current instruction, wherein i. each time a call is invoked, a number equal to one more than the value of the program counter is pushed onto the top of the stack and ii. each time a return is invoked, the top entry of the stack is popped off the stack and placed in the program counter;
  • a graphics accelerator includes a vertex input for receiving vertex data, an output for forwarding processed data, and a processor coupled with the vertex input and output.
  • the graphics accelerator also includes an instruction input that receives instructions for processing the vertex data received from the vertex input.
  • the processor is responsive to wide word instructions.
  • a graphics accelerator includes a vertex input for receiving vertex data, a processor coupled with the input, and a set of registers for storing results produced by the processor.
  • the processor includes a plurality of functional units that execute based upon a clock cycle. The plurality of functional units produce n results per each clock cycle.
  • the set of registers includes no more than n/2 registers. In preferred embodiments, n>1.
  • the graphics accelerator is responsive to a set of instructions that may include one of a plurality of wide words, super wide words, and ultra wide words.
  • Fig. 1 is a block diagram of a graphics geometry accelerator in accordance with a preferred embodiment of the invention
  • Fig. 2 is a more detailed diagram of the embodiment of Fig. 1;
  • Fig. 3 is a diagram showing yet further detail of the embodiment of Fig. 1.
  • ALU Arithmetic Logic Unit BDU Breaker-Distributor Unit.
  • Fig. 1 shows a block diagram of a graphics geometry accelerator arranged in accordance with a preferred embodiment of the invention.
  • the accelerator of this embodiment has been developed under the code name "Tomcat" and is sometimes referred to in this description by that name.
  • the accelerator takes graphics request data defining strips of polygons from a host computer in accordance with a standard such as OpenGL and transforms the data from world coordinates to display coordinates.
  • the Tomcat accelerator may be used to furnish data to a graphics processor, which performs rasterization and texture processing — that is, translating the transformed polygon data into pixel values associated with a display and overlaying any textures associated with objects in the displayed image.
  • the Tomcat accelerator is designed to operate in conjunction with a graphics processor developed by the assignee of the present invention under the code name "Cougar", and that graphics processor is sometimes referred to in this description by that name. Details of the Cougar graphics processor are disclosed in pending provisional U.S. patent application number 60/093,247, and copending U.S. patent application entitled "MULTI-PROCESSOR GRAPHICS ACCELERATOR", filed on even date herewith, assigned attorney docket number 1247/A22, and naming Steven Heinrich, Mark Mosley, Clifford Whitmore, James Deming, Stewart Carlton, Matt Buckelew, and Dale Kirkland as inventors. Both such applications are incorporated, in their entireties, by reference.
  • Fig. 1 shows a Breaker-Distributor Module (BDU) 11, which receives graphics request data from the host computer
  • BDU Breaker-Distributor Module
  • the BDU takes the incoming strips of polygons, and, to the extent necessary, breaks the strips into substrips and distributes the substrips for processing.
  • the substrips are passed to a pair of Programmable Graphics Units (PGUs) 13 and 14, which are identified as PGU 0 and PGU 1.
  • PGUs Programmable Graphics Units
  • Each PGU performs the required data calculations for the substrips distributed to it, operating pursuant to instructions stored in the Writeable Control Store (WCS) 12
  • WCS Writeable Control Store
  • the outputs of the respective PGUs 13 and 14 are then appropriately interleaved by sequencer 15 to permit processing by a Cougar graphics processor
  • It is possible to employ a plurality of Tomcat processors simultaneously, in which case the BDU associated with each processor is coupled to the host computer and thus each BDU receives all graphics requests from the host.
  • Each BDU processes graphics requests to the extent required to maintain the current global rendering context. However, a given BDU forwards data to its associated PGUs when it is the turn of the given BDU to do so.
  • Fig. 2 is a more detailed diagram of the embodiment of Fig. 1 showing the BDU module 11, the WCS 12, the PGUs 13 and 14, and the sequencer 15 (identified here as the
  • PGU 1 has the same configuration.
  • a sequencer module 121 controls the flow of the instruction stream to the PGU from the WCS 12.
  • the PGUs utilize "wide word” instructions as defined above.
  • each instruction is 256 bits wide, qualifying for the status of "ultra wide word."
  • the architecture described herein permits the use of ultra wide word instructions without the need for a higher level language to produce machine level instructions for operation of the PGUs. In other words,
  • ultra wide word instructions can be manually prepared at the machine level to guide operation of the PGUs to perform the limited number of calculations used for graphics geometry acceleration.
  • the limited number of calculations required, therefore, has made it possible to utilize ultra wide word instructions without the need for a compiler written for a higher level language, and therefore without the development expense and complexity associated with a machine for which a compiler would be written
  • Fig. 2 and associated with the PGU are shown a collection of items that make up the calculation engine and Input/Output handler of the PGU, including a frame-based FIFO 211, Vertex Assembly Module 212, Arithmetic Logic Unit (ALU) Module 22, Scratch File Module 23, Reciprocal Module 24, Multiplier Accumulator Module 25, Transcendental Module 26, Cougar Vertex Buffer Module 281 (feeding Cougar Vertex Handler Module
  • each of the items can pass data to and from each other via the cross bar 29, to which they are all connected.
  • each of the items can be handling data in parallel in accordance with the current instruction from the WCS that is accessed by the Sequencer Module 121
  • the PGU clocks are run 180 degrees out of phase with respect to each other. Similarly, the Sequencer Module 121 for each PGU is run at the same clock rate as the WCS, so that it may evaluate branch instructions in a manner that each PGU can be caused to branch without processor clock delay attributable to evaluation of branch instructions. More generally, a number n of PGUs may be utilized in a graphics processor, and each PGU may run at a clock rate R. In such a case, the n PGUs may employ a single control store supplying instructions at a clock rate nR. In a similar manner, each PGU may employ a control store sequencer, for evaluating branch instructions, running at a clock rate nR, so that each processor can be caused to branch without processor clock delay for evaluation of branch instructions.
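The sharing of one control store among n PGUs can be illustrated with a short Python sketch. It assumes a simple round-robin order in which store tick t serves PGU t mod n, and straight-line code with no branches; both are illustrative assumptions, and the function name `interleave_fetches` is hypothetical rather than taken from the patent.

```python
def interleave_fetches(program_counters, store_ticks):
    """Model a single control store clocked at n*R feeding n PGUs clocked at R.

    Assumption (not from the source): the store serves the PGUs in
    round-robin order, one fetch per store tick, and every PGU runs
    straight-line code so its program counter simply increments.
    Returns a list of (store_tick, pgu_index, address_fetched) tuples.
    """
    n = len(program_counters)
    pcs = list(program_counters)  # copy so the caller's list is untouched
    schedule = []
    for tick in range(store_ticks):
        pgu = tick % n                 # which PGU owns this store tick
        schedule.append((tick, pgu, pcs[pgu]))
        pcs[pgu] += 1                  # next instruction for that PGU
    return schedule


# Two PGUs, 180 degrees out of phase: each gets every other store tick.
schedule = interleave_fetches([0, 100], 4)
for tick, pgu, address in schedule:
    print(f"store tick {tick}: PGU {pgu} fetches address {address}")
```

With two PGUs each processor receives one instruction per processor clock, because the store completes two fetches in that time.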
  • the Sequencer 121 utilizes a stack for storing instruction addresses. Normally, it is possible in using such a stack to produce an overflow condition, but we have found that one can beneficially provide a stack that does not produce an overflow condition. For example, we have found it convenient to use the control store 12 to store about 4000 lines of ultra wide words
  • the Sequencer Module employs a stack of eight 12-bit addresses
  • a program counter (PC) addresses the instructions in the WCS 12 and has a value for the current instruction
  • PC program counter
  • Each time a return is invoked the top entry of the stack is popped off the stack and placed in the program counter
  • the excess of calls over returns is greater than eight (the number of entries in the stack in this example), program execution is maintained
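The call and return behavior of the sequencer stack can be sketched in Python as follows. The eight-entry depth and 12-bit address width come from the text. The wrap-around behavior, in which a ninth nested call overwrites the oldest entry, is an assumption made here to model a stack that never signals overflow. The class name `ReturnStack` is hypothetical.

```python
class ReturnStack:
    """Eight-entry return-address stack that never reports overflow.

    Assumption (not confirmed by the source): the stack is circular, so a
    push beyond eight entries silently overwrites the oldest entry.
    """
    DEPTH = 8          # eight entries, per the description
    ADDR_MASK = 0xFFF  # 12-bit instruction addresses

    def __init__(self):
        self.entries = [0] * self.DEPTH
        self.top = 0   # index of the next free slot, modulo DEPTH

    def call(self, pc, target):
        """Push pc + 1 and return the new program counter value."""
        self.entries[self.top] = (pc + 1) & self.ADDR_MASK
        self.top = (self.top + 1) % self.DEPTH
        return target & self.ADDR_MASK

    def ret(self):
        """Pop the top entry and return it as the new program counter."""
        self.top = (self.top - 1) % self.DEPTH
        return self.entries[self.top]


stack = ReturnStack()
pc = stack.call(pc=10, target=200)  # pc becomes 200, and 11 is pushed
pc = stack.ret()                    # pc becomes 11 again
```

Under this assumption, code that nests more than eight calls keeps running but can no longer return to the oldest callers, which is consistent with abandoning stale entries by starting an instruction stream that does not depend on them.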
  • current entries in the stack may be abandoned by invoking an instruction stream that is independent of return addresses in
  • the Frame-Based FIFO is the source for all data from the BDU Module 11
  • the Vertex Assembly Module 212 is coupled to the Frame-Based FIFO 211 and performs in accordance with a current request header
  • the Vertex Assembly Module 212 builds the next vertex into one of four Vertex Files, which are part of the Vertex Assembly Module
  • the Vertex Files render the vertices therein available for processing
  • the request header can also specify two other major tasks: processing of non-vertex headers from the BDU and (in a surprising usage) loading instructions into the WCS 12. See also copending U.S. patent application entitled "System for Processing Vertices From a Graphics Request Stream".
  • Writeable Control Store loading occurs when a Load WCS Request is being processed
  • the BDU Module 11 sends the request header to all PGUs, but only the master PGU receives the Control Store data
  • the slave PGU processes the Load WCS Request header by waiting for the software to execute a REQ sequencer instruction and then notifying the master PGU that it is ready for the WCS to be loaded
  • the master PGU waits for the software to execute a REQ sequencer instruction and for the slave PGU to notify that it is ready
  • the master PGU updates the WCS and then acknowledges that the slave can resume
  • the request is unloaded from the Frame Based FIFO by the Vertex Assembly Control Module 212
  • the Vertex Assembly Control Module 212 is loading instructions into the Writeable Control Store (WCS) or processing vertex headers
  • the Frame Based FIFO 211 is
  • a Port Register File Contains sixteen 40-bit register values
  • Constants Contains thirty-two 40-bit constant values
  • a Port Register Contains the data selected by the ALU A Select on the previous clock
  • Utility Registers Accesses one of the utility registers used for controlling other modules
  • a Port Register Contains the data selected by the ALU A Select on the previous clock
  • the ALU Module 22 is capable of performing operations typical of ALUs, including bitwise inversion, bitwise logical AND, logical OR, exclusive OR, logical shifts, minimum and maximum of two numbers, addition, subtraction, etc.
  • the output register that is driven onto the cross bar is always updated.
  • the output of the ALU can also be written into either the register file or one of the utility registers as specified by the target C address. This target register is updated immediately, and may be used as a source operand of an ALU operation on the following clock.
  • the Multiplier Accumulator Module 25 performs multiplication and addition on eight 40-bit floating point inputs A-H provided on each cycle. The following seven results are produced on the specified ports.
  • the MAC GH CTRL signal selects whether the G*H result (if Ctrl is clear) or zero (if set) is used as input to the right side accumulator. This permits the G*H multiplier to be used separately while the MAC still accumulates the results of three multiplies on MAC Port M. With the MAC GH CTRL signal set, MAC Ports O and M therefore produce the following:
  • the Reciprocal Module 24 computes the reciprocal of the floating point number provided as an input.
  • the Transcendental Module 26 performs one of the following functions on the floating point number specified as input:
  • the Render Context (RC) Module 27 is a general purpose memory unit which contains 1024 32-bit words of data. It has two pairs of read ports, and one pair of write ports. A single address (and associated RC Pointer Register) drives each pair, with one port accessing the contents of the specified address addr, and the second port accessing addr XOR 1. Absolute addresses may be specified in the WCS control word. Indirect addressing is done by adding the offset specified in the least significant 5 bits of the corresponding RC Address field to the RC Pointer register for a particular port. The resulting address is latched and may be used to drive the address into the Render Context RAM on the next cycle.
  • the three RC Pointer registers may also be loaded directly by the ALU unit, which takes precedence over any updates attempted by the Pointer increment logic above.
  • When reading from the RC, the data is available on the crossbar on the cycle after the instruction which specified the address and/or addressing mode. Sneak paths are provided from read ports A, B, C, and D to MAC Input Ports A, C, E, and G respectively that permit the data to be used as input to the MAC on the same cycle as the address is specified.
  • the data is latched on the cycle that it and its address are presented, but is not written to the render context RAM until the next cycle. The new data may not be accessed until the 2nd cycle after it is written.
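The Render Context address arithmetic can be sketched in Python. The 1024-word size, the addr / addr XOR 1 port pairing, and the 5-bit offset come from the text. Treating the offset as unsigned and wrapping the sum modulo 1024 are assumptions, and the function names are hypothetical.

```python
RC_WORDS = 1024  # the Render Context holds 1024 32-bit words


def paired_addresses(addr):
    """Return the two addresses seen by a port pair: addr and addr XOR 1."""
    addr %= RC_WORDS
    return addr, addr ^ 1


def indirect_address(pointer, address_field):
    """Form an indirect address from an RC Pointer register and a control field.

    Only the least significant 5 bits of the address field are used as the
    offset. Assumptions (not stated in the source): the offset is unsigned
    and the sum wraps modulo the memory size.
    """
    offset = address_field & 0x1F
    return (pointer + offset) % RC_WORDS


even, odd = paired_addresses(6)            # one port reads word 6, its partner word 7
address = indirect_address(100, 0b00011)   # pointer 100 plus offset 3
```

Because the second port always accesses addr XOR 1, a single address fetches an aligned even/odd pair of words regardless of which member of the pair is named.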
  • the Scratch File Module 23 is a fast, general-purpose memory unit containing 256 40-bit words of data. It has two independent read ports and two independent write ports which use 8 bit addresses. The various addressing modes are described below. Data read from the scratch file is available for use on the same clock cycle as the read instruction is issued. Any register used to form an address or to select an entry out of the Frame Index
  • Each Scratch File port has control word fields which determine the addressing mode used to access the Scratch File on a given cycle.
  • the following addressing modes are provided:
  • the contents of the Scratch File register is used as the upper 4 bits of the address, and the lower 4 bits of the SF_ADDR control word field is used as the lower 4 bits, forming a complete 8 bit address.
  • the register is in effect selecting one of sixteen frames within the Scratch File.
  • the offset selects one of the sixteen words within that frame.
  • Offset Register The reverse of Register: Offset mode.
  • the Offset provides the upper 4 bits, and the register the lower 4 bits of the address.
  • Frame Register The selected output from the Frame Index array is used as the upper 4 bits of the address, and the specified register is used as the lower 4 bits.
  • Register Register Only available on ports B and C. Bits 7:4 of Scratch File Pointer B (or C) are used as the upper four bits of the address, and bits 3:0 of Scratch File Pointer A (or D) are used as the lower four bits of the address.
  • the Scratch File Module 23 includes a Frame Index, a special purpose array of 18 4-bit numbers that can be used to generate the upper four bits of the Scratch File addresses.
  • the 256 word Scratch File can be thought of as 16 frames of 16 words each. This requires 4 bits to address a particular frame, and then 4 more bits to address a single word within that frame.
  • the Frame Index Array is manipulated by the ALU Frame Index operations and the SF B Index and SF C Index utility registers. The least significant 5 bits of the B and C SF Pointer registers control the output of the Frame Index B and C ports respectively. These Frame Index outputs can then be selected as one of the inputs to a high MUX (i.e., the MUX selecting the upper four bits of the address) of each Scratch File Address.
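The Scratch File addressing modes described above all build an 8-bit address from a 4-bit frame number and a 4-bit word number. The Python sketch below illustrates that composition. The function names are hypothetical, and the choice to take the low 4 bits of any register value is an assumption where the text does not say which bits are used.

```python
def make_address(upper4, lower4):
    """Combine a 4-bit frame number and a 4-bit word number into 8 bits."""
    return ((upper4 & 0xF) << 4) | (lower4 & 0xF)


def register_offset(register, offset_field):
    """Register:Offset mode: register picks the frame, offset picks the word."""
    return make_address(register, offset_field)


def offset_register(offset_field, register):
    """Offset:Register mode: the reverse of Register:Offset."""
    return make_address(offset_field, register)


def frame_register(frame_index, pointer, register):
    """Frame:Register mode: a Frame Index entry supplies the upper 4 bits.

    The low 5 bits of the pointer select the Frame Index entry, per the
    description; frame_index is assumed to be a list of 18 4-bit values.
    """
    selected = frame_index[pointer & 0x1F]
    return make_address(selected, register)


def register_register(pointer_upper, pointer_lower):
    """Register:Register mode: bits 7:4 of one pointer, bits 3:0 of the other."""
    return make_address(pointer_upper >> 4, pointer_lower)


# Frame 3, word 10 in Register:Offset mode gives address 0x3A.
address = register_offset(0x3, 0xA)
```

Each mode differs only in where the two 4-bit halves come from, which matches the view of the 256-word Scratch File as 16 frames of 16 words.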
  • Each Tomcat PGU contains a Cougar Vertex Buffer Module 281 having six independent Cougar Vertex Buffers (CVBs) .
  • Each CVB contains data that the Cougar Vertex Handler Module 282 extracts and sends to the Cougar FIFO 283, forming a valid Cougar vertex request. Only one CVB may be addressed at a time by the crossbar.
  • CVSEL utility register contains a three bit value that is used to address one of the six CVBs. There are two independent write ports, and two independent read ports to the currently selected CVB. Each port is controlled by a 5-bit address which selects one of the 32 words in the CVB.
  • When writing to the CVB, the data is latched on the cycle that it and its address are presented, but is not written to the specified buffer until the next cycle. The new data may not be accessed until the 2nd cycle after it is written.
  • the data is available on the crossbar on the cycle after the instruction which specified the address and/or addressing mode.
  • the Cougar Vertex Handler Module 282 is responsible for copying the relevant pieces of data from a specified CVB to the Cougar FIFO 283. It is controlled by requests that it pulls from its request FIFO, discussed below. If there are no requests for it to process, then it remains idle, waiting for a new request to be written into the request FIFO.
  • the request FIFO contains requests for the Cougar Vertex Handler. It is 3 words deep, with each word being 24 bits wide. The Request FIFO is written to one word at a time by the ALU via a utility register.
  • the Cougar FIFO 283 of each PGU feeds the Cougar Request Sequencer 15, which is responsible for transferring Cougar bound data from each PGU to the Cougar Graphics Accelerator in the same order as the requests originally came into the Tomcat BDU.
  • each PGU executes clipping operations to ensure that an object is within a viewing volume. Details of preferred clipping operations are discussed in U.S. provisional patent application serial number 60/093,184, the disclosure of which is incorporated, in its entirety, by reference.
  • the ALU 22 preferably includes one or more hardware clipping modules that cooperate to store a single list of vertices in the scratch file module 23.
  • This list is identified above as the file index array. Specifically, each vertex in a given triangle is checked to determine whether it is within a given view volume. To that end, the clipping modules initially store the initial three vertices in the index list array. Each initial vertex, and each newly determined vertex, then is clip checked to determine if it is within the given view volume. One or more new vertices, if any, are added to the list as they are calculated, while initial vertices are deleted from the list if determined to be outside the given view volume. As a result of these clipping operations, a single list is produced identifying the successive vertices of a resultant polygon formed within the given view volume. Examples of this process are disclosed in the immediately preceding above referenced provisional patent application.
  • the index array includes two pointers to vertices. This size of the array may be equal to one more than the maximum number of vertices that can be active at any one point during the clip check process. This is dependent upon the number of vertices in the original triangle, and the maximum number of active clipping planes.
  • the above noted two pointers are provided to address the array, along with read capability and
  • INSERT is given a new value to insert, and which pointer to use as the insertion point. All entries at the location addressed by the specified pointer and beyond are shifted up by one location, and the new value is inserted at the location addressed by the specified pointer. DELETE is given a pointer to use as the deletion point. The entry at the addressed location is deleted, and all subsequent entries are shifted down by one location.
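The INSERT and DELETE operations on the index array can be sketched in Python with explicit shifting, mirroring the description. The class name `IndexArray`, the fixed capacity argument, and the error handling are illustrative assumptions. How the two hardware pointers are updated after each operation is not stated in the text and is not modeled here.

```python
class IndexArray:
    """Fixed-size list of vertex indices supporting INSERT and DELETE."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.count = 0  # number of active entries

    def insert(self, pointer, value):
        """Shift entries at pointer and beyond up by one, then store value."""
        if self.count == len(self.slots):
            raise OverflowError("index array is full")
        if not 0 <= pointer <= self.count:
            raise IndexError("pointer outside the active entries")
        for i in range(self.count, pointer, -1):
            self.slots[i] = self.slots[i - 1]
        self.slots[pointer] = value
        self.count += 1

    def delete(self, pointer):
        """Remove the entry at pointer and shift later entries down by one."""
        if not 0 <= pointer < self.count:
            raise IndexError("pointer outside the active entries")
        for i in range(pointer, self.count - 1):
            self.slots[i] = self.slots[i + 1]
        self.count -= 1
        self.slots[self.count] = None

    def active(self):
        """Return the active entries in order."""
        return self.slots[:self.count]


# Start with a triangle, add a clipped vertex, then drop an outside vertex.
array = IndexArray(capacity=5)
for position, vertex in enumerate(["v0", "v1", "v2"]):
    array.insert(position, vertex)
array.insert(1, "new")   # entries from position 1 onward shift up
array.delete(2)          # the former "v1" is removed
```

After these steps the active list identifies the successive vertices of the resulting polygon in order, which is the single list the clipping description refers to.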
  • Fig. 3 is a diagram showing yet further detail of the embodiment of Fig. 1.
  • control sequencer 321 schematically represents the sequencer module 121 of each PGU of Fig. 2.
  • the BDU module 11 of Fig. 2 is here shown to include both system FIFO 311 and BDU 312
  • Bit 62 Signal bit visible to the other PGU
  • Bits 209 - 218 Render Context ports A&B address (A is even & B is odd)
  • Bits 219 - 220 Render Context ports A&B address advance control
  • a Crossbar connects all of the functional units to each other.
  • the mux on each input port controls which of several possible sources is latched into that port on a given cycle.
  • Table provides an overview of all the sources and to which ports they are available.
  • Each row of the table represents one source, and each column represents one of the muxes located on a particular functional unit's input port. There are 24 ports driving outputs onto the crossbar, but the input muxes may also select hardwired values, sneak paths, or immediate values, so there are a total of 42 possible inputs.

Abstract

A graphics accelerator includes a vertex input for receiving vertex data, an output for forwarding processed data, and a processor coupled with the vertex input and output. The graphics accelerator also includes an instruction input that receives instructions for processing the vertex data received from the vertex input. The processor is responsive to wide word instructions.

Description

WIDE INSTRUCTION WORD GRAPHICS PROCESSOR
RELATED APPLICATIONS
In addition to those discussed above and below, this application is related to the following copending United States patent applications, each of which is incorporated herein, in their entireties, by reference and filed on even date herewith:
Attorney docket number 1247/A30 entitled, "GRAPHICS PROCESSING WITH TRANSCENDENTAL FUNCTION GENERATOR" naming Vernon Brethour and Stacy Moore as inventors; and
Attorney docket number 1247/A34 entitled "GRAPHICS PROCESSING FOR EFFICIENT POLYGON HANDLING" naming Dale Kirkland and William Lazenby as inventors.
FIELD OF THE INVENTION
The present invention relates to computers, and more particularly to computers using very large instruction words for various purposes, including for graphics processing.
BACKGROUND OF THE INVENTION
In the implementation of graphics display systems for digital computers, it is sometimes desirable to have dedicated hardware support for geometry calculations in addition to the more common support for triangle setup and rasterization. Because graphics display systems often involve the display of objects based on three-dimensional data describing the objects, the geometry calculations involve, among other things, transforming locations of objects expressed in three-dimensional world coordinates into locations expressed in two-dimensional coordinates as the objects appear on the display. For some applications and configurations of graphics systems, the processing capability of the geometry accelerator becomes critically important. In the simplest case, geometry computations are accomplished one coordinate at a time, one vertex at a time, one triangle at a time, one triangle strip at a time.
Data presented to a computer graphics subsystem are often expressed as strips of polygons (often triangles) in accordance with a graphics processing standard, such as the well known OpenGL graphics library. Rendering a scene involves transforming the coordinates of all of the polygons in all of the strips and determining the pixel values in the display that are associated with each portion of each of the polygons that appears in the display. The large amount of data involved in these calculations, in relation to the conflicting goals of achieving rendering both quickly and in detail, places heavy demands on computational resources. Substantial opportunities exist for parallel computation by breaking up the triangle strips and presenting the resulting sub-strips to different computation engines in parallel. THE REALITY ENGINE, distributed by Silicon Graphics, Inc. of Mountain View, California, and the GLZ family of graphics accelerators, distributed by INTENSE 3D of Huntsville, Alabama, are examples of systems that employ this technique extensively. In these systems, once the strips are broken up, the sub-strips are passed to standard processor elements, where the rest of the computation takes place basically one coordinate at a time, one vertex at a time. In the Reality Engine, these computations are done with an i860 processor from Intel. In the GLZ family of graphics accelerators, these computations are done with DSP chips from Analog Devices of Norwood, Massachusetts. In systems like these, some limited parallelism takes place in the coordinate transformations because the computation engines employed are pipelined math units with separate engines for integer and floating point calculations. In U.S. patent 5,745,125, assigned to Sun Microsystems, separate specialized computation engines are arranged in series to form a deeper pipeline than would normally occur.
It is a known goal in computer design to employ very large instruction words (VLIW) for achieving increased parallelism in computation. To make it practical to program such computers, high level programming languages are devised that employ instructions utilizing a register-to-register type of instruction set. The effect of a successful VLIW machine is to launch and complete a great many instructions on each clock cycle, so the register-to-register instruction set requires a register file with many read ports and many write ports. For example, U.S. patent 5,644,780, assigned to International Business Machines, describes a register file for VLIW with 8 write ports and 12 read ports. The result is a VLIW computation engine capable of high levels of parallelism, but which can be built only at great cost and which requires many registers.
SUMMARY OF THE INVENTION
The present invention achieves high levels of parallelism in a graphics processor by providing in a first embodiment an apparatus for processing computer graphics requests utilizing a wide word instruction. The apparatus of this embodiment has
1. a graphics request input;
2. a processor, coupled to the graphics data input, having an output, and responsive to instructions, wherein each instruction is a wide word.
In a further related embodiment, each instruction is a very wide word. In a further embodiment, each instruction is a super wide word. In a still further embodiment, each instruction is an ultra wide word. In a related embodiment, which may, but need not, employ an instruction that is a wide word, a very wide word, a super wide word or an ultra wide word, the processor has functional units producing n results per clock cycle and registers for storing not more than n/2 of such results. In a further related embodiment, the functional units are connected by a cross bar.
As used in this description and the accompanying claims, unless the context otherwise requires, the following definitions are employed. A "wide word" is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 64 bits of control to the processor. A "very wide word" is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 99 bits of control to the processor. A "super wide word" is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 128 bits of control to the processor. An "ultra wide word" is an instruction, for a processor, that is issued in a single clock cycle of the processor, and providing greater than 255 bits of control to the processor. A "register" is a storage element associated with a processor permitting reading of data on the processor clock cycle that immediately follows the clock cycle in which storage has been accomplished.
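By way of illustration only, the width definitions given above can be expressed as a small classification routine. The function name and the returned strings are hypothetical and are not part of the described apparatus; only the four thresholds (greater than 64, 99, 128 and 255 bits of control) are taken from the definitions.

```python
def classify_instruction_width(control_bits):
    """Return the widest category, per the definitions above, that applies.

    control_bits is the number of bits of control that one instruction
    provides to the processor in a single clock cycle.
    """
    if control_bits > 255:
        return "ultra wide word"
    if control_bits > 128:
        return "super wide word"
    if control_bits > 99:
        return "very wide word"
    if control_bits > 64:
        return "wide word"
    return "not a wide word"


if __name__ == "__main__":
    for bits in (64, 65, 100, 129, 256):
        print(bits, classify_instruction_width(bits))
```

Because each threshold is stricter than the one before it, an instruction in a wider category also satisfies every narrower definition; the sketch reports only the widest one.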
In another embodiment of the invention, there is provided a computer having data stores with multiple addressing modes. In this embodiment, the computer has
1. a data input;
2. a processor, coupled to the data input, and having an output and responsive to instructions; and
3. a plurality of data stores, coupled to the processor, each data store having a plurality of addressing modes, wherein a single instruction individually selects an addressing mode for each of the data stores.
The plurality of addressing modes may include indirect and absolute addressing modes. The indirect mode may further include a double level of indirect addressing. In a further embodiment, each instruction is a wide word. In a still further embodiment, there is provided a computer for processing computer graphics requests, wherein the data input is a graphics request input.
In another embodiment, there is provided a multiple processor apparatus for processing computer graphics requests in which the control store is accessed at an increased clock rate in relation to the clock rate of the processors. In this embodiment, the apparatus has:
1. a plurality n of processors, n>1, each processor running at a processor clock rate R; and
2. a single control store supplying instructions for the processors, running at a store clock rate nR. In a related embodiment, the processors are responsive to instructions, and each instruction is a wide word or (in a yet further embodiment) a super wide word. Another related embodiment also has a control store sequencer, for evaluating branch instructions, at a clock rate nR, so that each processor may be caused to branch without processor clock delay for evaluation of branch instructions. Another embodiment of the invention provides an apparatus, for processing computer graphics requests, that uses a stack for storing instruction addresses arranged so as not to produce an overflow condition. This embodiment has
1. a graphics request input;
2. a processor, coupled to the graphics request input, and having an output and responsive to a set of instructions, the set including a call and a return;
3. a stack of n entries for storing instruction addresses, the stack having a top entry;
4. a program counter for addressing instructions and having a value for a current instruction, wherein i. each time a call is invoked, a number equal to one more than the value of the program counter is pushed onto the top of the stack and ii. each time a return is invoked, the top entry of the stack is popped off the stack and placed in the program counter;
5. wherein program execution is maintained even when the excess of calls over returns is greater than n, so that current entries in the stack may be abandoned by invoking an instruction stream that is independent of return addresses in the stack. In a further related embodiment, entries in the stack are addressed by a LIFO system.
In accord with another aspect of the invention, a graphics accelerator includes a vertex input for receiving vertex data, an output for forwarding processed data, and a processor coupled with the vertex input and output. The graphics accelerator also includes an instruction input that receives instructions for processing the vertex data received from the vertex input. The processor is responsive to wide word instructions. In accordance with yet another aspect of the invention, a graphics accelerator includes a vertex input for receiving vertex data, a processor coupled with the input, and a set of registers for storing results produced by the processor. To that end, the processor includes a plurality of functional units that execute based upon a clock cycle. The plurality of functional units produce n results per each clock cycle. The set of registers includes no more than n/2 registers. In preferred embodiments, n>1. Moreover, the graphics accelerator is responsive to a set of instructions that may include one of a plurality of wide words, super wide words, and ultra wide words.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing features of the invention will be more readily understood by reference to the following detailed description, taken with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a graphics geometry accelerator in accordance with a preferred embodiment of the invention;
Fig. 2 is a more detailed diagram of the embodiment of Fig. 1;
Fig. 3 is a diagram showing yet further detail of the embodiment of Fig. 1.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The following acronyms may be used in this description:
ALU Arithmetic Logic Unit.
BDU Breaker-Distributor Unit.
BIC Bus Interface Chip (Python).
CVB Cougar Vertex Buffer.
CVF Current Vertex File.
CVH Cougar Vertex Handler.
CRS Cougar Request Sequencer.
MAC Multiplier-Accumulator.
PC Program Counter.
PGU Programmable Graphics Unit.
RC Render Context.
SF Scratch File.
VA Vertex Assembly.
VRF Vertex Register File.
VF Vertex File.
WCS Writeable Control Store.
Fig. 1 shows a block diagram of a graphics geometry accelerator arranged in accordance with a preferred embodiment of the invention. The accelerator of this embodiment has been developed under the code name "Tomcat" and is sometimes referred to in this description by that name. The accelerator takes graphics request data defining strips of polygons from a host computer in accordance with a standard such as OpenGL and transforms the data from world coordinates to display coordinates. The Tomcat accelerator may be used to furnish data to a graphics processor, which performs rasterization and texture processing, that is, translating the transformed polygon data into pixel values associated with a display and overlaying any textures associated with objects in the displayed image. The Tomcat accelerator is designed to operate in conjunction with a graphics processor developed by the assignee of the present invention under the code name "Cougar" and that graphics processor is sometimes referred to in this description by that name. Details of the Cougar graphics processor are disclosed in pending provisional U.S. patent application number 60/093,247, and copending U.S. patent application entitled "MULTI-PROCESSOR GRAPHICS ACCELERATOR", filed on even date herewith, assigned attorney docket number 1247/A22, and naming Steven Heinrich, Mark Mosley, Clifford Whitmore, James Deming, Stewart Carlton, Matt Buckelew, and Dale Kirkland as inventors. Both such applications are incorporated, in their entireties, by reference.
Fig. 1 shows a Breaker-Distributor Module (BDU) 11, which receives graphics request data from the host computer. The BDU takes the incoming strips of polygons, and, to the extent necessary, breaks the strips into substrips and distributes the substrips for processing. The substrips are passed to a pair of Programmable Graphics Units (PGUs) 13 and 14, which are identified as PGU 0 and PGU 1. Each PGU performs the required data calculations for the substrips distributed to it, operating pursuant to instructions stored in the Writeable Control Store (WCS) 12. The outputs of the respective PGUs 13 and 14 are then appropriately interleaved by sequencer 15 to permit processing by a Cougar graphics processor. It is possible to employ a plurality of Tomcat processors simultaneously, in which case the BDU associated with each processor is coupled to the host computer and thus each BDU receives all graphics requests from the host. Each BDU processes graphics requests to the extent required to maintain the current global rendering context. However, a given BDU forwards data to its associated PGUs when it is the turn of the given BDU to do so. In this manner the BDUs collectively forward data on a peer-to-peer round robin basis. See also copending U.S. patent application entitled "Graphics Processing System with Multiple Strip Breaks", filed on even date herewith, and naming William S. Pesto, Russell Schroter, and David Young as inventors, the disclosure of which is incorporated herein, in its entirety, by reference.
Fig. 2 is a more detailed diagram of the embodiment of Fig. 1 showing the BDU module 11, the WCS 12, the PGUs 13 and 14, and the sequencer 15 (identified here as the Cougar Request Sequencer). In addition, further detail of the PGU is shown. Although the detail is shown for PGU 0, PGU 1 has the same configuration. A sequencer module 121 controls the flow of the instruction stream to the PGU from the WCS 12. In order to achieve high data throughput, the PGUs utilize "wide word" instructions as defined above. In fact, in this embodiment, each instruction is 256 bits wide, qualifying for the status of "ultra wide word". Surprisingly, we have found that the architecture described herein permits the use of ultra wide word instructions without the need for a higher level language to produce machine level instructions for operation of the PGUs. In other words, we have found that ultra wide word instructions can be manually prepared at the machine level to guide operation of the PGUs to perform the limited number of calculations used for graphics geometry acceleration. The limited number of calculations required, therefore, has made it possible to utilize ultra wide word instructions without the need for a compiler written for a higher level language, and therefore without the development expense and complexity associated with a machine for which a compiler would be written.
In Fig. 2 and associated with the PGU are shown a collection of items that make up the calculation engine and Input/Output handler of the PGU, including a frame-based FIFO 211, Vertex Assembly Module 212, Arithmetic Logic Unit (ALU) Module 22, Scratch File Module 23, Reciprocal Module 24, Multiplier Accumulator Module 25, Transcendental Module 26, Cougar Vertex Buffer Module 281 (feeding Cougar Vertex Handler Module 282), and Cougar FIFO 283. These items can pass data to and from each other via the cross bar 29, to which they are all connected. As a result, during each clock cycle, each of the items can be handling data in parallel in accordance with the current instruction from the WCS that is accessed by the Sequencer Module 121. We have found that we can efficiently utilize a single control store, namely WCS 12, to serve both PGUs (i.e., PGU 0 and PGU 1), by running the clock of the control store WCS at twice the frequency of the PGU clocks. Then, to prevent conflicting accesses of the WCS by the PGUs, the PGU clocks are run 180 degrees out of phase with respect to each other. Similarly, the Sequencer module 121 for each PGU is run at the same clock rate as the WCS, so that it may evaluate branch instructions in a manner that each PGU can be caused to branch without processor clock delay attributable to evaluation of branch instructions. More generally, a number n of PGUs may be utilized in a graphics processor, and each PGU may run at a clock rate R. In such a case, the n PGUs may employ a single control store supplying instructions at a clock rate nR. In a similar manner, each PGU may employ a control store sequencer, for evaluating branch instructions, running at a clock rate nR, so that each processor can be caused to branch without processor clock delay for evaluation of branch instructions.
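By way of illustration only, the sharing of a single control store running at a clock rate nR among n PGUs each running at a clock rate R can be sketched as a round-robin schedule over control store ticks. The function names below are hypothetical, and the sketch models only which PGU is served on each tick; it does not model the instruction contents or the sequencer.

```python
from collections import Counter


def control_store_schedule(num_pgus, store_ticks):
    """For each control store tick, return the index of the PGU served.

    The store runs num_pgus times faster than each PGU, and the PGU clocks
    are evenly staggered in phase, so PGU i is served on every tick where
    tick % num_pgus == i.
    """
    return [tick % num_pgus for tick in range(store_ticks)]


def fetches_per_pgu(num_pgus, store_ticks):
    """Count how many instructions each PGU receives over store_ticks ticks."""
    return Counter(control_store_schedule(num_pgus, store_ticks))


if __name__ == "__main__":
    # Two PGUs, as in the embodiment of Fig. 2: the store alternates between them.
    print(control_store_schedule(2, 8))
    print(dict(fetches_per_pgu(2, 8)))
```

With two PGUs the schedule alternates, which corresponds to the PGU clocks being 180 degrees out of phase: no two PGUs ever request the control store on the same store tick, and each PGU still receives one instruction per PGU clock cycle.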
The Sequencer 121 utilizes a stack for storing instruction addresses. Normally, it is possible in using such a stack to produce an overflow condition, but we have found that one can beneficially provide a stack that does not produce an overflow condition. For example, we have found it convenient to use the control store 12 to store about 4000 lines of ultra wide words. The Sequencer Module employs a stack of eight 12-bit addresses. A program counter (PC) addresses the instructions in the WCS 12 and has a value for the current instruction. Each time a call is invoked, a number equal to one more than the value of the program counter is pushed onto the top of the stack. Each time a return is invoked, the top entry of the stack is popped off the stack and placed in the program counter. When in the course of execution of the program, the excess of calls over returns is greater than eight (the number of entries in the stack in this example), program execution is maintained. When this condition occurs, current entries in the stack may be abandoned by invoking an instruction stream that is independent of return addresses in the stack. Entries are addressed by a LIFO system.
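By way of illustration only, one way to obtain a return-address stack that never raises an overflow condition is to let the stack pointer wrap around, so that a ninth call silently overwrites the oldest entry. The wrap-around behavior is an assumption made for this sketch; the description above states only that execution is maintained and that stale entries may be abandoned. The class and method names are hypothetical.

```python
class ReturnStack:
    """Eight-entry LIFO of 12-bit return addresses that wraps instead of overflowing."""

    def __init__(self, depth=8, addr_bits=12):
        self.depth = depth
        self.mask = (1 << addr_bits) - 1
        self.entries = [0] * depth
        self.top = 0  # index of the next free slot, taken modulo depth

    def call(self, pc, target):
        """Push pc + 1 as the return address and return the new program counter."""
        self.entries[self.top] = (pc + 1) & self.mask
        self.top = (self.top + 1) % self.depth  # assumed wrap-around, never an error
        return target & self.mask

    def ret(self):
        """Pop the top entry and return it as the new program counter."""
        self.top = (self.top - 1) % self.depth
        return self.entries[self.top]


if __name__ == "__main__":
    stack = ReturnStack()
    pc = stack.call(pc=10, target=100)
    print(pc, stack.ret())
```

Under this assumption, a program that nests more than eight calls deep keeps running; the overwritten return addresses are simply lost, which is acceptable when the instruction stream that follows does not depend on them.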
Turning now to the other items in the PGU of Fig. 2, the Frame-Based FIFO is the source for all data from the BDU Module 11. The Vertex Assembly Module 212 is coupled to the Frame-Based FIFO 211 and performs in accordance with a current request header. In the case of a vertex header, the Vertex Assembly Module 212 builds the next vertex into one of four Vertex Files, which are part of the Vertex Assembly Module. The Vertex Files render the vertices therein available for processing. The request header can also specify two other major tasks: processing of non-vertex headers from the BDU and (in a surprising usage) loading instructions into the WCS 12. See also copending U.S. patent application, entitled "System for Processing Vertices From a Graphics Request Stream", filed on even date herewith and naming Vernon Brethour and William Lazenby as inventors, the disclosure of which is incorporated herein, in its entirety, by reference. Writeable Control Store loading occurs when a Load WCS Request is being processed. The BDU Module 11 sends the request header to all PGUs, but only the master PGU receives the Control Store data. The slave PGU processes the Load WCS Request header by waiting for the software to execute a REQ sequencer instruction and then notifying the master PGU that it is ready for the WCS to be loaded. The master PGU waits for the software to execute a REQ sequencer instruction and for the slave PGU to notify that it is ready. The master PGU updates the WCS and then acknowledges that the slave can resume. The request is unloaded from the Frame Based FIFO by the Vertex Assembly Control Module 212. When the Vertex Assembly Control Module 212 is loading instructions into the Writeable Control Store (WCS) or processing vertex headers, the Frame Based FIFO 211 is controlled by the Vertex Assembly Control Module.
There are 4 sources for each of two ports (A and B) into the ALU as listed below.
Sources for ALU port A:
A Port Register File: Contains sixteen 40-bit register values.
Constants: Contains thirty-two 40-bit constant values.
A Port Register: Contains the data selected by the ALU A Select on the previous clock.
B Port Register: Contains the data selected by the ALU B Select on the previous clock.
Sources for ALU port B:
B Port Register File: Contains sixteen 40-bit register values.
Utility Registers: Accesses one of the utility registers used for controlling other modules.
A Port Register: Contains the data selected by the ALU A Select on the previous clock.
B Port Register: Contains the data selected by the ALU B Select on the previous clock.
The ALU Module 22 is capable of performing operations typical of ALUs, including bitwise inversion, bitwise logical AND, logical OR, exclusive OR, logical shifts, minimum and maximum of two numbers, addition, subtraction, etc.
The output register that is driven onto the cross bar is always updated. The output of the ALU can also be written into either the register file or one of the utility registers as specified by the target C address. This target register is updated immediately, and may be used as a source operand of an ALU operation on the following clock.
The Multiplier Accumulator Module (MAC) 25 performs multiplication and addition on eight 40-bit floating point inputs A-H provided on each cycle. The following seven results are produced on the specified ports.
A*B MAC Port J
C*D MAC Port L
E*F MAC Port N
G*H MAC Port P
A*B+C*D MAC Port K
E*F+G*H MAC Port O
A*B+C*D+E*F+G*H MAC Port M
The MAC GH CTRL signal selects whether the G*H result (if Ctrl is clear) or zero (if set) is used as input to the right side accumulator. This permits the G*H multiplier to be used separately while the MAC still accumulates the results of three multiplies on MAC Port M. With the MAC GH CTRL signal set, MAC Ports O and M therefore produce the following:
E*F MAC Port O
A*B+C*D+E*F MAC Port M
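By way of illustration only, the seven MAC outputs and the effect of the MAC GH CTRL signal can be modeled as follows. The function name and the dictionary return format are hypothetical, and ordinary Python floats stand in for the 40-bit floating point format, which is not modeled here.

```python
def mac(a, b, c, d, e, f, g, h, gh_ctrl=False):
    """Model the seven MAC result ports for one cycle.

    When gh_ctrl is set, zero replaces G*H as the input to the right side
    accumulator, so Port O carries E*F and Port M carries A*B+C*D+E*F,
    while Port P still carries the separate G*H product.
    """
    ab, cd, ef, gh = a * b, c * d, e * f, g * h
    left = ab + cd                           # left side accumulator
    right = ef + (0.0 if gh_ctrl else gh)    # right side accumulator
    return {
        "J": ab,
        "L": cd,
        "N": ef,
        "P": gh,
        "K": left,
        "O": right,
        "M": left + right,
    }


if __name__ == "__main__":
    print(mac(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0))
    print(mac(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, gh_ctrl=True))
```

With the control signal set, a three-term sum such as part of a dot product can be accumulated on Port M while the fourth multiplier is used for an unrelated product on Port P in the same cycle.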
The Reciprocal Module 24 computes the reciprocal of the floating point number provided as an input. The Transcendental Module 26 performs one of the following functions on the floating point number specified as input:
EXP Exponential (Base 2)
LOG Logarithm (Base 2)
ISQ Inverse Square Root
The Render Context (RC) Module 27 is a general purpose memory unit which contains 1024 32-bit words of data. It has two pairs of read ports, and one pair of write ports. A single address (and associated RC Pointer Register) drives each pair, with one port accessing the contents of the specified address addr, and the second port accessing addr XOR 1. Absolute addresses may be specified in the WCS control word. Indirect addressing is done by adding the offset specified in the least significant 5 bits of the corresponding RC Address field to the RC Pointer register for a particular port. The resulting address is latched and may be used to drive the address into the Render Context RAM on the next cycle. It may also be written back into the RC Pointer register if the Ptr Inc bit is set on the cycle that the offset is specified. The three RC Pointer registers may also be loaded directly by the ALU unit, which takes precedence over any updates attempted by the Pointer increment logic above. When reading from the RC, the data is available on the crossbar on the cycle after the instruction which specified the address and/or addressing mode. Sneak paths are provided from read ports A, B, C, and D to MAC Input Ports A, C, E, and G respectively that permit the data to be used as input to the MAC on the same cycle as the address is specified. When writing to the RC, the data is latched on the cycle that it and its address are presented, but is not written to the render context RAM until the next cycle. The new data may not be accessed until the 2nd cycle after it is written.
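By way of illustration only, the paired-port addressing and the indirect addressing of the Render Context can be sketched as follows. The function names are hypothetical. Two details are assumptions made for the sketch and are not stated above: the 5-bit offset is treated as unsigned, and the sum wraps within the 1024-word memory.

```python
RC_WORDS = 1024  # the Render Context holds 1024 words


def rc_port_pair(addr):
    """Return the addresses accessed by the two ports of a pair: addr and addr XOR 1."""
    return addr, addr ^ 1


def rc_indirect(pointer, addr_field, ptr_inc):
    """Form an indirect address from an RC Pointer register and an RC Address field.

    Only the least significant 5 bits of addr_field are used as the offset.
    Returns (address, new_pointer); the pointer is updated only when the
    Ptr Inc bit is set on the cycle that the offset is specified.
    """
    offset = addr_field & 0x1F                # assumed unsigned 5-bit offset
    address = (pointer + offset) % RC_WORDS   # assumed wrap within the memory
    new_pointer = address if ptr_inc else pointer
    return address, new_pointer


if __name__ == "__main__":
    print(rc_port_pair(6), rc_port_pair(7))
    print(rc_indirect(pointer=100, addr_field=3, ptr_inc=True))
```

The XOR pairing means an even address and the odd address that follows it are always delivered together regardless of which of the two is specified, which suits data that naturally comes in adjacent pairs.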
The Scratch File Module 23 is a fast, general-purpose memory unit containing 256 40-bit words of data. It has two independent read ports and two independent write ports which use 8 bit addresses. The various addressing modes are described below. Data read from the scratch file is available for use on the same clock cycle as the read instruction is issued. Any register used to form an address or to select an entry out of the Frame Index (described below) must have been written at least one clock prior to its being used to access the Scratch File. Data written to the Scratch File is latched on the cycle that the write is specified, but is not written to the Scratch File Memory itself until the end of the following cycle. Special latch registers and logic are built in that detect a read from a location that is in the process of being updated. This logic watches for a read address on port C (D) being the same as the previous cycle's write address on port A (B), and returns the latched data instead of the contents of memory. If the read address on C matches the previous write address on B, then memory is read, and the old value is returned as the result of the read. The logic only compares the A/C and B/D port pairs.
Addressing Modes
Each Scratch File port has control word fields which determine the addressing mode used to access the Scratch File on a given cycle. The following addressing modes are provided:
Absolute The complete 8 bit address is specified in the WCS control word SF_ADDR field.
Indirect The contents of the Scratch File register for that port is used as the 8 bit address.
Register: Offset The contents of the Scratch File register is used as the upper 4 bits of the address, and the lower 4 bits of the SF_ADDR control word field is used as the lower 4 bits, forming a complete 8 bit address. The register is in effect selecting one of sixteen frames within the Scratch File. The offset selects one of the sixteen words within that frame.
Offset: Register The reverse of Register: Offset mode. The Offset provides the upper 4 bits, and the register the lower 4 bits of the address.
Frame: Offset The selected output from the Frame Index array is used as the upper 4 bits of the address, and the offset specified in the instruction is used as the lower 4 bits.
Frame: Register The selected output from the Frame Index array is used as the upper 4 bits of the address, and the specified register is used as the lower 4 bits.
Register: Register Only available on ports B and C. Bits 7:4 of Scratch File Pointer B (or C) are used as the upper four bits of the address, and bits 3:0 of Scratch File Pointer A (or D) are used as the lower four bits of the address.
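By way of illustration only, the formation of the 8 bit Scratch File address in each of the addressing modes listed above can be sketched as follows. The function and parameter names are hypothetical. Where the text does not say which bits of a register are used, the sketch assumes that a register supplying the upper 4 bits does so from its bits 7:4 and a register supplying the lower 4 bits does so from its bits 3:0, by analogy with the Register: Register mode.

```python
def _hi(value):
    """Bits 7:4 of an 8 bit value."""
    return (value >> 4) & 0xF


def _lo(value):
    """Bits 3:0 of an 8 bit value."""
    return value & 0xF


def _join(upper, lower):
    """Combine two 4 bit halves into an 8 bit address."""
    return ((upper & 0xF) << 4) | (lower & 0xF)


def sf_address(mode, sf_addr=0, reg=0, frame=0, other_reg=0):
    """Return the 8 bit Scratch File address for one port.

    sf_addr   -- SF_ADDR control word field for the port
    reg       -- Scratch File pointer register for the port
    frame     -- 4 bit value selected from the Frame Index array
    other_reg -- the second pointer register used by Register: Register mode
    """
    if mode == "absolute":
        return sf_addr & 0xFF
    if mode == "indirect":
        return reg & 0xFF
    if mode == "register:offset":
        return _join(_hi(reg), _lo(sf_addr))      # register bits 7:4 assumed
    if mode == "offset:register":
        return _join(_lo(sf_addr), _lo(reg))      # register bits 3:0 assumed
    if mode == "frame:offset":
        return _join(frame, _lo(sf_addr))
    if mode == "frame:register":
        return _join(frame, _lo(reg))             # register bits 3:0 assumed
    if mode == "register:register":
        return _join(_hi(reg), _lo(other_reg))
    raise ValueError("unknown addressing mode: " + mode)


if __name__ == "__main__":
    print(hex(sf_address("register:offset", sf_addr=0x05, reg=0x3C)))
    print(hex(sf_address("frame:offset", sf_addr=0x02, frame=9)))
```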
The Scratch File Module 23 includes a Frame Index, a special purpose array of 18 4-bit numbers that can be used to generate the upper four bits of the Scratch File addresses.
The 256 word Scratch File can be thought of as 16 frames of 16 words each. This requires 4 bits to address a particular frame, and then 4 more bits to address a single word within that frame. The Frame Index Array is manipulated by the ALU Frame Index operations and the SF B Index and SF C Index utility registers. The least significant 5 bits of the B and C SF Pointer registers control the output of the Frame Index B and C ports respectively. These Frame Index outputs can then be selected as one of the inputs to a high MUX (i.e., the MUX selecting the upper four bits of the address) of each Scratch File Address. When either SF B Index or SF C Index is specified as the target of an ALU operation, regardless of that operation, the least significant 4 bits of the B operand are written into the Frame Index Array at the location addressed by SF Pointer B or C register respectively. The previous contents of that Frame Index location are overwritten.
Each Tomcat PGU contains a Cougar Vertex Buffer Module 281 having six independent Cougar Vertex Buffers (CVBs). Each CVB contains data that the Cougar Vertex Handler Module 282 extracts and sends to the Cougar FIFO 283, forming a valid Cougar vertex request. Only one CVB may be addressed at a time by the crossbar. The CVSEL utility register contains a three bit value that is used to address one of the six CVBs. There are two independent write ports, and two independent read ports to the currently selected CVB. Each port is controlled by a 5-bit address which selects one of the 32 words in the CVB. When writing to the CVB, the data is latched on the cycle that it and its address are presented, but is not written to the specified buffer until the next cycle. The new data may not be accessed until the 2nd cycle after it is written. When reading from the CVB, the data is available on the crossbar on the cycle after the instruction which specified the address and/or addressing mode.
The Cougar Vertex Handler Module 282 is responsible for copying the relevant pieces of data from a specified CVB to the Cougar FIFO 283. It is controlled by requests that it pulls from its request FIFO, discussed below. If there are no requests for it to process, then it remains idle, waiting for a new request to be written into the request FIFO. The request FIFO contains requests for the Cougar Vertex Handler. It is 3 words deep, with each word being 24 bits wide. The Request FIFO is written to one word at a time by the ALU via a utility register.
The Cougar FIFO 283 of each PGU feeds the Cougar Request Sequencer 15, which is responsible for transferring Cougar bound data from each PGU to the Cougar Graphics Accelerator in the same order as the requests originally came into the Tomcat BDU.
In preferred embodiments, each PGU executes clipping operations to ensure that an object is within a viewing volume. Details of preferred clipping operations are discussed in U.S. provisional patent application serial number 60/093,184, the disclosure of which is incorporated, in its entirety, by reference.
By way of example, the ALU 22 preferably includes one or more hardware clipping modules that cooperate to store a single list of vertices in the scratch file module 23. This list is identified above as the file index array. Specifically, each vertex in a given triangle is checked to determine whether it is within a given view volume. To that end, the clipping modules initially store the initial three vertices in the index list array. Each initial vertex, and each newly determined vertex, then is clip checked to determine if it is within the given view volume. One or more new vertices, if any, are added to the list as they are calculated, while initial vertices are deleted from the list if determined to be outside the given view volume. As a result of these clipping operations, a single list is produced identifying the successive vertices of a resultant polygon formed within the given view volume. Examples of this process are disclosed in the immediately preceding above referenced provisional patent application.
In preferred embodiments, the index array includes two pointers to vertices. The size of the array may be equal to one more than the maximum number of vertices that can be active at any one point during the clip check process. This is dependent upon the number of vertices in the original triangle, and the maximum number of active clipping planes. The above noted two pointers are provided to address the array, along with read capability and INSERT and DELETE instructions.
INSERT is given a new value to insert, and which pointer to use as the insertion point. All entries at the location addressed by the specified pointer and beyond are shifted up by one location, and the new value is inserted at the location addressed by the specified pointer. DELETE is given a pointer to use as the deletion point. The entry at the addressed location is deleted, and all subsequent entries are shifted down by one location.
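By way of illustration only, the INSERT and DELETE operations on the index array can be sketched with a fixed-size list. The class name is hypothetical. Two details are assumptions made for the sketch: an entry shifted past the end of the array on INSERT is dropped, and the slot vacated at the end on DELETE is filled with zero.

```python
class IndexArray:
    """Fixed-size array of vertex indices with shifting INSERT and DELETE."""

    def __init__(self, capacity):
        self.entries = [0] * capacity

    def insert(self, pointer, value):
        """Shift entries at pointer and beyond up by one, then store value at pointer."""
        # The last entry is shifted off the end (assumed to be unused).
        self.entries[pointer + 1:] = self.entries[pointer:-1]
        self.entries[pointer] = value

    def delete(self, pointer):
        """Remove the entry at pointer and shift all subsequent entries down by one."""
        self.entries[pointer:-1] = self.entries[pointer + 1:]
        self.entries[-1] = 0  # assumed fill value for the vacated slot


if __name__ == "__main__":
    array = IndexArray(6)
    array.entries[:3] = [10, 11, 12]   # the three initial triangle vertices
    array.insert(1, 99)                # a newly calculated vertex
    array.delete(0)                    # an initial vertex found outside the view volume
    print(array.entries)
```

Because both operations keep the surviving entries in order, the array always holds the successive vertices of the resultant polygon as a single list.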
Fig. 3 is a diagram showing yet further detail of the embodiment of Fig. 1. In this diagram, data flow through the accelerator is emphasized. In this figure, control sequencer 321 schematically represents the sequencer module 121 of each PGU of Fig. 2. The BDU module 11 of Fig. 2 is here shown to include both system FIFO 311 and BDU 312.
Of course, it should be noted that although various exemplary embodiments of the invention have been disclosed, it should be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the true scope of the invention. These and other obvious modifications are intended to be covered by the appended claims.
As discussed previously the highly parallel arrangement of the modules in Figs 2 and 3 permits use of a 256 bit ultra wide instruction word. Each word includes bits as described on the following five pages:
A possible bit assignment for a geometry accelerator control word:
Bits 0 - 2: Sequencer operation.
Bit - ft: Condition Code Select
Bit 7: Cundition Code Invert.
Bit S: ALU Condition Code write.
Bit 9: FPU Condition Code write.
Bits 10 - 11: Sequencer Loop Control.
Bit 12: Push PC
Bit 13 : Headstart Qualify .
Bits 14 - 25 Branch Field (also used as Immediate Field bits 0 through 11)
Bits 26 - 30 Cougar Vertex Buffer port A address (also used as Immediate Field bits 12 through 16)
Bits 3 1 - 34 Cougar Vertex Buffer port A select (also used a.s Immediate Field bits 17 through 20)
Bits 35 - 39 Cougar Vertex Buffer port B address (also used as Tminediεαe Field bits 21 thruugh 25)
Bits 40 - 43 Cougar Vertex Buffer port B select (also used as Immediate Field bits 26 through 29) - Bits 44 - 48 Cougar Vertex Buffer part C address (also used as Tmmediate Field bits 30 through 34)
Bits 49 - 53 Cougar Vertex Buffer port D address (also used as Immediate Field bits 35 through 39)
Bit 54: Cougar Vertex Buffer port A write enable
Bit 55: Cougar Vertex Buffer port B write enable
Bits 5 fi X 1 ALU Opcodf.
Bit 62: Signal bit visible to the other PGU
Bits G3 - 66 ALU X-Bar input Λ select
Bits 67 - 72 ALU argument A select
Bits 73 - 76 ALU X-Bur input B select
Bits 77 - 82 ALU argument B select
Bits S3 - £8 ALU argument C (destination) select Bitϋ 89 - % Scratch File port A address
Bit 97: Scratch File port A write enable nils 9S - 101 Scratch File port A select
Bits 102 - 109 Scratch File port B address
Bit 110: Scratch File port B write enable
Bits 111 - 114 Scratch File port B select
Bits 1 15 - 1 18 Scratch File port A address increment control
Bits 1 19 - 123 Scratch File port B address increment control
Bits 124 - 1 Scratch File port C address
Bits 132 - 136 Scratch File port C address increment control
Bits 137 - 144 Scratch File port address
Bits 145 - 148 Scratch File port D address increment control
Bits 149 - 152 Reciprocal Unit input select.
Bits 153 - 154 Transcendental Unit input select.
Bits 155 - 156 Transcendental Unit function (lookup table) select.
Bits 157 - 160 Multiply Accumulate Unit input A select
Bits 161 - 164 Multiply Accumulate Unit input B select
Bits 165 - 168 Multiply Accumulate Unit input C select
Bits 169 - 172 Multiply Accumulate Unit input D select
Bits 173 - 176 Multiply Accumulate Unit input E select
Bits 177 - 180 Multiply Accumulate Unit input F select
Bits 181 - 184 Multiply Accumulate Unit input G select
Bits 185 - 188 Multiply Accumulate Unit input H select
Bit 189 - Multiply Accumulate mode control
Bits 190 - 197 Input Buffer port A offset
Bit 198 Input Buffer port A advance control
Bits 199 - 206 Input Buffer port B offset
Bit 207 - Input Buffer port B advance control
Bit 208 - Input Buffer global advance
Bits 209 - 218 Render Context ports A&B address (A is even & B is odd)
Bits 219 - 220 Render Context ports A&B address advance control
Bits 221 - 230 Render Context ports C&D address (C is even & D is odd)
Bits 231 - 232 Render Context ports C&D address advance control
Bits 233 - 242 Render Context ports E&F address (E is even & F is odd)
Bits 243 - 244 Render Context ports E&F address advance control
Bits 245 - 248 Render Context port E input select
Bit 249 Render Context port E write enable
Bits 250 - 252 Render Context port F input select
Bit 253 Render Context port F write enable
Bit 254 Diagnostic trigger to interface
Bit 255 ALU X-Bar output power down
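The field layout above can be illustrated with a minimal sketch that reads and writes fields of a 256-bit instruction word. The field names, the helper functions, and the choice of bit 0 as the least significant bit are assumptions made for illustration only; the text gives bit numbers but does not state a bit ordering. The sketch also shows that the 40-bit Immediate Field overlays bits 14 through 53, so that Immediate bit i corresponds to instruction bit 14 + i.

```python
# Hypothetical field table: name -> (low_bit, high_bit), inclusive.
# Bit numbers come from the list above. Bit 0 is ASSUMED to be the least
# significant bit of the word; the text does not specify the ordering.
FIELDS = {
    "fpu_cc_write": (9, 9),
    "branch": (14, 25),
    "cvb_port_a_addr": (26, 30),
    "immediate": (14, 53),          # overlays the branch and vertex buffer fields
    "alu_arg_a_select": (67, 72),
    "scratch_port_b_addr": (102, 109),
    "alu_xbar_power_down": (255, 255),
}


def get_field(word, name):
    """Return the unsigned value of the named field in a 256-bit word."""
    lo, hi = FIELDS[name]
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def set_field(word, name, value):
    """Return a copy of word with the named field replaced by value."""
    lo, hi = FIELDS[name]
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width}-bit field {name!r}")
    mask = ((1 << width) - 1) << lo
    return (word & ~mask) | (value << lo)


# Writing the branch field and the port A address also changes what the
# overlapping Immediate Field reads back.
word = set_field(0, "branch", 0xABC)
word = set_field(word, "cvb_port_a_addr", 0x1F)
immediate = get_field(word, "immediate")
```

Because the Immediate Field shares bits with the branch and vertex buffer fields, an instruction in this sketch cannot carry an immediate value and independent values in those fields at the same time.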
A Crossbar connects all of the functional units to each other. The mux on each input port controls which of several possible sources is latched into that port on a given cycle. Table 0-1 provides an overview of all the sources and the ports to which they are available.
Each row of the table represents one source, and each column represents one of the muxes located on a particular functional unit's input port. There are 24 ports driving outputs onto the crossbar, but the input muxes may also select hardwired values, sneak paths, or immediate values, so there are a total of 42 possible inputs.
"X" indicates possible input sources, and blanks indicate no possible connection.
Table 0-1 Crossbar/Input Sources
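The per-port mux behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the class name, the source names, and the connectivity are hypothetical, since the actual source-to-port table is not reproduced in the text. Each input port has its own list of permitted sources (the "X" entries in one column of the table), and a select value from the instruction word chooses which one is latched on a cycle.

```python
class InputPortMux:
    """Model of one functional-unit input port and its source mux."""

    def __init__(self, allowed_sources):
        # Sources marked "X" for this port; order defines the select encoding.
        self.allowed_sources = list(allowed_sources)
        self.latched = None

    def clock(self, select, source_values):
        """Latch the selected source for this cycle and return its value."""
        if not 0 <= select < len(self.allowed_sources):
            raise IndexError(f"select {select} has no connection on this port")
        name = self.allowed_sources[select]
        self.latched = source_values[name]
        return self.latched


# Hypothetical values driven this cycle by crossbar ports, a hardwired
# constant, and the instruction's immediate field.
source_values = {
    "alu_out": 7,
    "mac_out": 42,
    "const_zero": 0,
    "immediate": 0x1FABC,
}

# A port that can see three of the four sources (illustrative connectivity).
mac_input_a = InputPortMux(["alu_out", "mac_out", "immediate"])
value = mac_input_a.clock(1, source_values)
```

Keeping a separate source list per port mirrors the blanks in the table: a source with no connection to a port has no select encoding on that port.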

Claims

We claim:
1. Apparatus for processing computer graphics requests comprising: a graphics request input; a processor, coupled to the graphics data input, having an output, and responsive to instructions, wherein each instruction is a wide word.
2. Apparatus according to claim 1, wherein each instruction is a very wide word.
3. Apparatus according to claim 1, wherein each instruction is a super wide word.
4. Apparatus according to claim 1, wherein each instruction is an ultra wide word.
5. Apparatus according to claim 1, wherein the processor has functional units producing n results per clock cycle and registers for storing not more than n/2 of such results.
6. A computer for processing data, the computer comprising: a data input; a processor, coupled to the data input, having an output and responsive to instructions, the processor having functional units producing n results per clock cycle and registers for storing not more than n/2 of such results.
7. A computer according to claim 6, wherein each instruction is a wide word.
8. A computer according to claim 6, wherein each instruction is a super wide word.
9. A computer according to claim 6, wherein the functional units are connected by a cross bar.
10. A computer for processing data, the computer comprising: a data input; a processor, coupled to the data input, and having an output and responsive to instructions; a plurality of data stores, coupled to the processor, each data store having a plurality of addressing modes, wherein a single instruction individually selects an addressing mode for each of the data stores.
11. A computer according to claim 10, wherein the plurality of addressing modes includes indirect and absolute addressing modes.
12. A computer according to claim 10, wherein each instruction is a wide word.
13. A computer according to claim 10, for processing computer graphics requests, wherein the data input is a graphics request input.
14. Apparatus for processing computer graphics requests, the apparatus comprising: a plurality n of processors wherein n > 1, each processor running at a processor clock rate R; and a single control store supplying instructions for the processors, the single control store running at a store clock rate nR.
15. Apparatus according to claim 14, wherein the processors are responsive to instructions, and each instruction is a wide word.
16. Apparatus according to claim 14, wherein the processors are responsive to instructions, and each instruction is a super wide word.
17. Apparatus according to claim 14, further comprising: a control store sequencer, for evaluating branch instructions, at a clock rate nR, so that each processor may be caused to branch without processor clock delay for evaluation of branch instructions.
18. Apparatus for processing computer graphics requests comprising: a graphics request input; a processor, coupled to the graphics request input, and having an output and responsive to a set of instructions, the set including a call and a return; a stack of n entries for storing instruction addresses, the stack having a top entry; a program counter for addressing instructions and having a value for a current instruction, wherein i. each time a call is invoked, a number equal to one more than the value of the program counter is pushed onto the top of the stack and ii. each time a return is invoked, the top value of the stack is popped off the stack and placed in the program counter; wherein program execution is maintained even when the excess of calls over returns is greater than n, so that current entries in the stack may be abandoned by invoking an instruction stream that is independent of return addresses in the stack.
19. Apparatus according to claim 17, wherein entries in the stack are addressed by a LIFO system.
20. A graphics accelerator comprising: a vertex input for receiving vertex data; an output for forwarding processed vertex data; an instruction input for receiving instructions for processing the vertex data received from the vertex input; and a processor coupled with the vertex input and the output, the processor being responsive to instructions from the instruction input, the instructions including a wide word.
21. The graphics accelerator as defined by claim 20 wherein each instruction is a very wide word.
22. The graphics accelerator as defined by claim 20 wherein each instruction is a super wide word.
23. The graphics accelerator as defined by claim 20 wherein each instruction is an ultra wide word.
24. The graphics accelerator as defined by claim 1 further comprising an instruction memory for storing the instructions, the instruction memory being coupled with the instruction input.
25. The graphics accelerator as defined by claim 24 wherein the instruction memory comprises a writeable control store.
26. The graphics accelerator as defined by claim 20 further comprising a plurality of functional units executing based upon a clock cycle, the plurality of functional units producing n results per each clock cycle, and a plurality of registers, the total number of registers not exceeding n/2.
27. The graphics accelerator as defined by claim 26 wherein the functional units are coupled via a crossbar, the crossbar comprising a bus and a multiplexor.
28. A graphics accelerator comprising: a vertex input for receiving vertex data; a processor coupled with the input, the processor having a plurality of functional units, the plurality of functional units executing based upon a clock cycle, the plurality of functional units producing n results per clock cycle; and a set of registers for storing a set of the results produced by the plurality of functional units, the set of registers including no more than n/2 registers.
29. The graphics accelerator as defined by claim 28 further including an instruction input that receives instructions, the functional units executing in response to their receipt of the instructions.
30. The graphics accelerator as defined by claim 29 wherein the instructions each comprise wide words.
31. The graphics accelerator as defined by claim 29 wherein the instructions each comprise super wide words.
32. The graphics accelerator as defined by claim 29 wherein the instructions each comprise ultra wide words.
33. The graphics accelerator as defined by claim 29 further comprising an instruction memory for storing the instructions, the instruction memory being coupled with the instruction input.
34. The graphics accelerator as defined by claim 28 wherein the plurality of functional units are coupled with a crossbar having a bus and a multiplexer.
35. The graphics accelerator as defined by claim 28 wherein n > 1.
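As an illustration of the call and return behavior recited in claim 18, the following sketch models a return-address stack of n entries that keeps operating when calls outnumber returns by more than n. It assumes a wrap-around (circular) top pointer, which is one possible way to obtain that behavior and is not stated in the claim; the class and method names are hypothetical.

```python
class ReturnStack:
    """Return-address stack of n entries with an ASSUMED wrap-around pointer."""

    def __init__(self, n):
        self.entries = [0] * n
        self.top = 0  # index of the next free slot, modulo n

    def call(self, pc, target):
        """Push pc + 1 and return the new program counter (the call target)."""
        self.entries[self.top] = pc + 1
        self.top = (self.top + 1) % len(self.entries)
        return target

    def ret(self):
        """Pop the top entry and return it as the new program counter."""
        self.top = (self.top - 1) % len(self.entries)
        return self.entries[self.top]


# With n = 2, a third nested call overwrites the oldest return address
# instead of faulting, so execution continues.
stack = ReturnStack(2)
pc = stack.call(10, 100)   # pushes 11
pc = stack.call(pc, 200)   # pushes 101
pc = stack.call(pc, 300)   # pushes 201, overwriting 11
first_return = stack.ret()
second_return = stack.ret()
```

In this model the overwritten entry (11) is simply lost, which corresponds to abandoning stack entries by running an instruction stream that never returns through them.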
PCT/US1999/016193 1998-07-17 1999-07-15 Wide instruction word graphics processor WO2000004484A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US9316598P 1998-07-17 1998-07-17
US60/093,165 1998-07-17

Publications (2)

Publication Number Publication Date
WO2000004484A2 (en) 2000-01-27
WO2000004484A3 (en) 2000-07-06

Family

ID=22237520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/016193 WO2000004484A2 (en) 1998-07-17 1999-07-15 Wide instruction word graphics processor

Country Status (2)

Country Link
US (2) US6577316B2 (en)
WO (1) WO2000004484A2 (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6646639B1 (en) 1998-07-22 2003-11-11 Nvidia Corporation Modified method and apparatus for improved occlusion culling in graphics systems
US6480205B1 (en) 1998-07-22 2002-11-12 Nvidia Corporation Method and apparatus for occlusion culling in graphics systems
US7127701B2 (en) * 1998-09-18 2006-10-24 Wylci Fables Computer processing and programming method using autonomous data handlers
US7209140B1 (en) 1999-12-06 2007-04-24 Nvidia Corporation System, method and article of manufacture for a programmable vertex processing model with instruction set
US6870540B1 (en) * 1999-12-06 2005-03-22 Nvidia Corporation System, method and computer program product for a programmable pixel processing model with instruction set
US6844880B1 (en) * 1999-12-06 2005-01-18 Nvidia Corporation System, method and computer program product for an improved programmable vertex processing model with instruction set
US7006101B1 (en) 2001-06-08 2006-02-28 Nvidia Corporation Graphics API with branching capabilities
US7456838B1 (en) 2001-06-08 2008-11-25 Nvidia Corporation System and method for converting a vertex program to a binary format capable of being executed by a hardware graphics pipeline
US7039397B2 (en) * 2003-07-30 2006-05-02 Lear Corporation User-assisted programmable appliance control
US7249252B2 (en) * 2004-06-16 2007-07-24 Intel Corporation Method of replacing initialization code in a control store with main code after execution of the initialization code has completed
US7350055B2 (en) 2004-10-20 2008-03-25 Arm Limited Tightly coupled accelerator
US7318143B2 (en) * 2004-10-20 2008-01-08 Arm Limited Reuseable configuration data
US7343482B2 (en) * 2004-10-20 2008-03-11 Arm Limited Program subgraph identification
US7352372B2 (en) * 2004-10-22 2008-04-01 Seiko Epson Corporation Indirect addressing mode for display controller
US20070186210A1 (en) * 2006-02-06 2007-08-09 Via Technologies, Inc. Instruction set encoding in a dual-mode computer processing environment
GB2514397B (en) 2013-05-23 2017-10-11 Linear Algebra Tech Ltd Corner detection
US9146747B2 (en) 2013-08-08 2015-09-29 Linear Algebra Technologies Limited Apparatus, systems, and methods for providing configurable computational imaging pipeline
US9727113B2 (en) 2013-08-08 2017-08-08 Linear Algebra Technologies Limited Low power computational imaging
US9910675B2 (en) 2013-08-08 2018-03-06 Linear Algebra Technologies Limited Apparatus, systems, and methods for low power computational imaging
US11768689B2 (en) 2013-08-08 2023-09-26 Movidius Limited Apparatus, systems, and methods for low power computational imaging
US10001993B2 (en) 2013-08-08 2018-06-19 Linear Algebra Technologies Limited Variable-length instruction buffer management
US9196017B2 (en) 2013-11-15 2015-11-24 Linear Algebra Technologies Limited Apparatus, systems, and methods for removing noise from an image
US9270872B2 (en) 2013-11-26 2016-02-23 Linear Algebra Technologies Limited Apparatus, systems, and methods for removing shading effect from image
US9916326B2 (en) 2015-01-27 2018-03-13 Splunk, Inc. Efficient point-in-polygon indexing technique for facilitating geofencing operations
US9607414B2 (en) 2015-01-27 2017-03-28 Splunk Inc. Three-dimensional point-in-polygon operation to facilitate displaying three-dimensional structures
US9836874B2 (en) 2015-01-27 2017-12-05 Splunk Inc. Efficient polygon-clipping technique to reduce data transfer requirements for a viewport
US10026204B2 (en) 2015-01-27 2018-07-17 Splunk Inc. Efficient point-in-polygon indexing technique for processing queries over geographic data sets
US10460704B2 (en) 2016-04-01 2019-10-29 Movidius Limited Systems and methods for head-mounted display adapted to human visual mechanism
US10949947B2 (en) 2017-12-29 2021-03-16 Intel Corporation Foveated image rendering for head-mounted display devices

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3689895A (en) * 1969-11-24 1972-09-05 Nippon Electric Co Micro-program control system
US3980992A (en) * 1974-11-26 1976-09-14 Burroughs Corporation Multi-microprocessing unit on a single semiconductor chip
US4604695A (en) * 1983-09-30 1986-08-05 Honeywell Information Systems Inc. Nibble and word addressable memory arrangement
GB2216307A (en) * 1988-03-01 1989-10-04 Ardent Computer Corp Vector register file
US5313551A (en) * 1988-12-28 1994-05-17 North American Philips Corporation Multiport memory bypass under software control
WO1994015280A2 (en) * 1992-12-18 1994-07-07 European Institute Of Technology Computer architecture for parallel data transfer in declarative computer languages
US5329615A (en) * 1990-09-14 1994-07-12 Hughes Aircraft Company Concurrent general purpose and DMA processing in a graphics rendering processor
EP0649083A2 (en) * 1993-10-18 1995-04-19 Cyrix Corporation A microcontrol unit for a superpipelined, superscalar microprocessor
US5446859A (en) * 1991-12-31 1995-08-29 Hyundai Electronics Industries Co., Ltd. Register addressing control circuit including a decoder and an index register
EP0735463A2 (en) * 1995-03-31 1996-10-02 Sun Microsystems, Inc. Computer processor having a register file with reduced read and/or write port bandwidth
US5666510A (en) * 1991-05-08 1997-09-09 Hitachi, Ltd. Data processing device having an expandable address space
US5717908A (en) * 1993-02-25 1998-02-10 Intel Corporation Pattern recognition system using a four address arithmetic logic unit
WO1998020422A1 (en) * 1996-11-07 1998-05-14 Atmel Corporation Eight-bit microcontroller having a risc architecture

Family Cites Families (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4366540A (en) * 1978-10-23 1982-12-28 International Business Machines Corporation Cycle control for a microprocessor with multi-speed control stores
US4434437A (en) 1981-01-26 1984-02-28 Rca Corporation Generating angular coordinate of raster scan of polar-coordinate addressed memory
GB2095067B (en) 1981-03-12 1984-10-03 Standard Telephones Cables Ltd Digital filter arrangement
US4615013A (en) 1983-08-02 1986-09-30 The Singer Company Method and apparatus for texture generation
US4646232A (en) 1984-01-03 1987-02-24 Texas Instruments Incorporated Microprocessor with integrated CPU, RAM, timer, bus arbiter data for communication system
US4897806A (en) 1985-06-19 1990-01-30 Pixar Pseudo-random point sampling techniques in computer graphics
US5146592A (en) * 1987-09-14 1992-09-08 Visual Information Technologies, Inc. High speed image processing computer with overlapping windows-div
JPH0782423B2 (en) 1987-09-16 1995-09-06 三洋電機株式会社 Data input / output circuit
US4991122A (en) 1987-10-07 1991-02-05 General Parametrics Corporation Weighted mapping of color value information onto a display screen
US4918626A (en) 1987-12-09 1990-04-17 Evans & Sutherland Computer Corp. Computer graphics priority system with antialiasing
US4908780A (en) 1988-10-14 1990-03-13 Sun Microsystems, Inc. Anti-aliasing raster operations utilizing sub-pixel crossing information to control pixel shading
JP2633331B2 (en) 1988-10-24 1997-07-23 三菱電機株式会社 Microprocessor
GB8828342D0 (en) 1988-12-05 1989-01-05 Rediffusion Simulation Ltd Image generator
US5446479A (en) 1989-02-27 1995-08-29 Texas Instruments Incorporated Multi-dimensional array video processor system
CA2016348C (en) 1989-05-10 2002-02-05 Kenichi Asano Multiprocessor type time varying image encoding system and image processor
EP0430501B1 (en) 1989-11-17 1999-02-03 Digital Equipment Corporation System and method for drawing antialiased polygons
US5239654A (en) 1989-11-17 1993-08-24 Texas Instruments Incorporated Dual mode SIMD/MIMD processor providing reuse of MIMD instruction memories as data memories when operating in SIMD mode
GB2240016A (en) 1990-01-15 1991-07-17 Philips Electronic Associated Texture memories store data at alternating levels of resolution
US5251296A (en) 1990-03-16 1993-10-05 Hewlett-Packard Company Methods and apparatus for generating arbitrarily addressed, arbitrarily shaped tiles in computer graphics systems
US5123085A (en) 1990-03-19 1992-06-16 Sun Microsystems, Inc. Method and apparatus for rendering anti-aliased polygons
US5371840A (en) 1990-04-26 1994-12-06 Honeywell Inc. Polygon tiling engine
JP2770598B2 (en) 1990-06-13 1998-07-02 株式会社日立製作所 Graphic display method and apparatus
WO1992000570A1 (en) 1990-06-26 1992-01-09 Du Pont Pixel Systems Limited Graphics rendering systems
DE69127516T2 (en) 1990-06-29 1998-02-26 Philips Electronics Nv Process and apparatus for imaging
US5293480A (en) 1990-08-06 1994-03-08 At&T Bell Laboratories High resolution graphics system architecture
DE69124437T2 (en) 1990-08-09 1997-07-03 Silicon Graphics Inc Method and device for reversing byte order in a computer
US5309561A (en) * 1990-09-28 1994-05-03 Tandem Computers Incorporated Synchronous processor unit with interconnected, separately clocked processor sections which are automatically synchronized for data transfer operations
US5519823A (en) 1991-03-15 1996-05-21 Hewlett-Packard Company Apparatus for rendering antialiased vectors
CA2069711C (en) 1991-09-18 1999-11-30 Donald Edward Carmon Multi-media signal processor computer system
US5257103A (en) 1992-02-05 1993-10-26 Nview Corporation Method and apparatus for deinterlacing video inputs
US5394524A (en) 1992-08-07 1995-02-28 International Business Machines Corporation Method and apparatus for processing two graphics data streams in parallel
US5511165A (en) 1992-10-23 1996-04-23 International Business Machines Corporation Method and apparatus for communicating data across a bus bridge upon request
US5666520A (en) 1993-03-29 1997-09-09 Hitachi, Ltd. Graphics display system including graphics processor having a register storing a series of vertex data relating to a polygonal line
DE69418646T2 (en) 1993-06-04 2000-06-29 Sun Microsystems Inc Floating point processor for a high-performance three-dimensional graphics accelerator
US5392393A (en) 1993-06-04 1995-02-21 Sun Microsystems, Inc. Architecture for a high performance three dimensional graphics accelerator
EP0631252B1 (en) 1993-06-23 2002-06-26 Sun Microsystems, Inc. Draw processor for a high performance three dimensional graphics accelerator
JPH0713757A (en) 1993-06-28 1995-01-17 Mitsubishi Electric Corp Data processor
US5684939A (en) 1993-07-09 1997-11-04 Silicon Graphics, Inc. Antialiased imaging with improved pixel supersampling
US5631693A (en) 1993-10-25 1997-05-20 Antec Corporation Method and apparatus for providing on demand services in a subscriber system
KR100200818B1 (en) 1993-11-30 1999-06-15 윤종용 Look-up table antialiasing method
KR100243174B1 (en) 1993-12-28 2000-02-01 윤종용 Apparatus and method of generating sub-pixel mask
US5548709A (en) 1994-03-07 1996-08-20 Silicon Graphics, Inc. Apparatus and method for integrating texture memory and interpolation logic in a computer system
US5568631A (en) * 1994-05-05 1996-10-22 International Business Machines Corporation Multiprocessor system with a shared control store accessed with predicted addresses
US5557734A (en) 1994-06-17 1996-09-17 Applied Intelligent Systems, Inc. Cache burst architecture for parallel processing, such as for image processing
EP0693737A3 (en) 1994-07-21 1997-01-08 Ibm Method and apparatus for managing multiprocessor graphical workload distribution
JP2637920B2 (en) 1994-08-11 1997-08-06 インターナショナル・ビジネス・マシーンズ・コーポレイション Computer graphic system and method of using frame buffer
TW278162B (en) 1994-10-07 1996-06-11 Yamaha Corp
US5561749A (en) 1994-12-02 1996-10-01 General Electric Company Modeling of surfaces employing polygon strips
US5737455A (en) 1994-12-12 1998-04-07 Xerox Corporation Antialiasing with grey masking techniques
US5696534A (en) 1995-03-21 1997-12-09 Sun Microsystems Inc. Time multiplexing pixel frame buffer video output
JPH08267827A (en) 1995-03-28 1996-10-15 Canon Inc Character processing method and apparatus and printer
US5664114A (en) 1995-05-16 1997-09-02 Hewlett-Packard Company Asynchronous FIFO queuing system operating with minimal queue status
US5720019A (en) * 1995-06-08 1998-02-17 Hewlett-Packard Company Computer graphics system having high performance primitive clipping preprocessing
EP0867016A1 (en) 1995-12-06 1998-09-30 Intergraph Corporation Peer-to-peer parallel processing graphics accelerator
WO1997027537A2 (en) * 1996-01-24 1997-07-31 Sun Microsystems, Inc. A processor for executing instruction sets received from a network or from a local memory
KR100269106B1 (en) 1996-03-21 2000-11-01 윤종용 Multiprocessor graphics system
US5821950A (en) 1996-04-18 1998-10-13 Hewlett-Packard Company Computer graphics system utilizing parallel processing for enhanced performance
US5914711A (en) 1996-04-29 1999-06-22 Gateway 2000, Inc. Method and apparatus for buffering full-motion video for display on a video monitor
US5886705A (en) 1996-05-17 1999-03-23 Seiko Epson Corporation Texture memory organization based on data locality
US5701365A (en) 1996-06-21 1997-12-23 Xerox Corporation Subpixel character positioning with antialiasing with grey masking techniques
US5821949A (en) 1996-07-01 1998-10-13 Sun Microsystems, Inc. Three-dimensional graphics accelerator with direct data channels for improved performance
EP0825550A3 (en) 1996-07-31 1999-11-10 Texas Instruments Incorporated Printing system and method using multiple processors
EP0840279A3 (en) 1996-11-05 1998-07-22 Compaq Computer Corporation Method and apparatus for presenting video on a display monitor associated with a computer
US5870567A (en) 1996-12-31 1999-02-09 Compaq Computer Corporation Delayed transaction protocol for computer system bus
US5883641A (en) 1997-04-29 1999-03-16 Hewlett-Packard Company System and method for speculative execution in a geometry accelerator
US5956047A (en) * 1997-04-30 1999-09-21 Hewlett-Packard Co. ROM-based control units in a geometry accelerator for a computer graphics system
US5949423A (en) * 1997-09-30 1999-09-07 Hewlett Packard Company Z buffer with degree of visibility test

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MITHANI AND MOLLER: "Microprogram sequencer handles a system's interrupts in real time" ELECTRONIC DESIGN, vol. 33, no. 1, January 1985 (1985-01), pages 319-324,326,328, XP002130596 Hasbrouck Heights, New Jersey, US *
RATHNAM AND SLAVENBURG: "Processing the new world of interactive media" IEEE SIGNAL PROCESSING MAGAZINE, vol. 15, no. 2, March 1998 (1998-03), pages 108-117, XP002121705 US *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7717405B2 (en) 2002-03-19 2010-05-18 Entegris, Inc. Hollow fiber membrane contact apparatus and process
WO2007058883A1 (en) * 2005-11-10 2007-05-24 Intel Corporation Apparatus and method for an interface architecture for flexible and extensible media processing
GB2444472A (en) * 2005-11-10 2008-06-04 Intel Corp Apparatus and method for an interface architecture for flexible and extensible media processing
GB2444472B (en) * 2005-11-10 2011-02-16 Intel Corp Apparatus and method for an interface architecture for flexible and extensible media processing
US8462164B2 (en) 2005-11-10 2013-06-11 Intel Corporation Apparatus and method for an interface architecture for flexible and extensible media processing

Also Published As

Publication number Publication date
WO2000004484A3 (en) 2000-07-06
US6577316B2 (en) 2003-06-10
US20020030685A1 (en) 2002-03-14
US6948087B2 (en) 2005-09-20
US20030221137A1 (en) 2003-11-27

Legal Events

Date Code Title Description
AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

122 Ep: pct application non-entry in european phase