CA2290649A1 - Method for compiling high level programming languages - Google Patents
- Publication number
- CA2290649A1
- Authority
- CA
- Canada
- Prior art keywords
- group
- task
- time
- area
- code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/30—Circuit design
- G06F30/34—Circuit design for reconfigurable circuits, e.g. field programmable gate arrays [FPGA] or programmable logic devices [PLD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/30—Circuit design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
- G06F8/4441—Reducing the execution time required by the program code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/447—Target code generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/451—Code distribution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30181—Instruction operation extension or modification
Abstract
A computer program (item 101), written in a high level programming language, is compiled (item 103) into an intermediate data structure (105) which represents its control and data flow. This data structure is analyzed (item 111) to identify critical blocks of logic which can be implemented as an application specific integrated circuit (item 117) to improve the overall performance. The critical blocks of logic are first transformed into new equivalent logic with maximum data parallelism. The new parallelized logic is then translated into a Boolean gate representation which is suitable for implementation on an application specific integrated circuit (item 117). The application specific integrated circuit (item 117) is coupled with a generic microprocessor via custom instructions for the microprocessor (item 107). The original computer program is then compiled into object code (item 109) with the new expanded target instruction set.
Description
METHOD FOR COMPILING HIGH LEVEL PROGRAMMING LANGUAGES
BACKGROUND OF THE INVENTION
1. Field of the Invention The present invention relates to reconfigurable computing.
2. State of the Art Traditionally, an integrated circuit must be designed by describing its structure with circuit primitives such as Boolean gates and registers. The circuit designer must begin with a specific application in mind, e.g. a video compression algorithm, and the resulting integrated circuit can only be used for the targeted application.
Alternatively, an integrated circuit may be designed as a general purpose microprocessor with a fixed instruction set, e.g. the Intel x86 processors.
This allows flexibility in writing computer programs which can invoke arbitrary sequences of the microprocessor instructions. While this approach increases the flexibility, it decreases the performance since the circuitry cannot be optimized for any specific application.
It would be desirable for high level programmers to be able to write arbitrary computer programs and have them automatically translated into fast application specific integrated circuits. However, currently there is no bridge between the computer programmers, who have expertise in programming languages for microprocessors, and the application specific integrated circuits, which require expertise in circuit design.
Research and development in integrated circuit design is attempting to push the level of circuit description to increasingly higher levels of abstraction. The current state of the art is the "behavioral synthesizer" whose input is a behavioral language description of the circuit's register/transfer behavior and whose output is a structural description of the circuit elements required to implement that behavior. The input description must have targeted a specific application and must describe its behavior in high level circuit primitives, but the behavioral compiler will automatically determine how many low level circuit primitives are required, how these primitives will be shared between different blocks of logic, and how the use of these primitives will be scheduled. The output description of these circuit primitives is then passed down to a "logic synthesizer" which maps the circuit primitives onto a library of available "cells", where each cell is the complete implementation of a circuit primitive on an integrated circuit. The output of the logic synthesizer is a description of all the required cells and their interconnections. This description is then passed down to a "placer and router"
which determines the detailed layout of all the cells and interconnections on the integrated circuit.
On the other hand, research and development in computer programming is also attempting to push down a level of abstraction by matching the specific application programs with custom targeted hardware. One such attempt is the Intel MMX instruction set. This instruction set was designed specifically to accelerate applications with digital signal processing algorithms. Such applications may be written generically and an MMX aware compiler will automatically accelerate the compiled code by using the special instructions. Another attempt to match the application with appropriate hardware is the work on parallelizing compilers. These compilers will take a computer program written in a sequential programming language and automatically extract the implicit parallelism which can then be targeted for execution on a variable number of processors. Thus different applications may execute on a different number of processors, depending on their particular needs.
Despite the above efforts by both the hardware and software communities, the gap has not yet been bridged between high level programming languages and integrated circuit behavioral descriptions.
SUMMARY OF THE INVENTION
A computer program, written in a high level programming language, is compiled into an intermediate data structure which represents its control and data flow.
This data structure is analyzed to identify critical blocks of logic which can be implemented as an application specific integrated circuit to improve the overall performance. The critical blocks of logic are first transformed into new equivalent logic with maximal data parallelism. The new parallelized logic is then translated into a Boolean gate representation which is suitable for implementation on an application specific integrated circuit. The application specific integrated circuit is coupled with a generic microprocessor via custom instructions for the microprocessor. The original computer program is then compiled into object code with the new expanded target instruction set.
In accordance with one embodiment of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into a program for execution by one or more application specific integrated circuits coupled with a microprocessor. Code blocks the functions of which are to be performed by circuitry within the one or more application specific integrated circuits are selected, and the code blocks are grouped into groups based on at least one of an area constraint and an execution timing constraint. Loading and activation of the functions are scheduled; and code is produced for execution by the microprocessor, including instructions for loading and activating the functions.
In accordance with another aspect of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into one or more application specific integrated circuits. In accordance with yet another aspect of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into one or more application specific integrated circuits coupled with a standard microprocessor.
In accordance with still another aspect of the invention, a reconfigurable logic block is locked by compiled instructions, wherein an activate configuration instruction locks the block from any subsequent activation and a release configuration instruction unlocks the block. In accordance with a further aspect of the invention, a high level programming language compiler automatically determines a set of one or more special instructions to extend the standard instruction set of a microprocessor which will result in a relative performance improvement for a given input computer program. In accordance with yet a further aspect of the invention, a method is provided for transforming the execution of more than one microprocessor standard instruction into the execution of a single special instruction. In accordance with still a further aspect of the invention, a high level programming language compiler is coupled with a behavioral synthesizer via a data flow graph intermediate representation.
BRIEF DESCRIPTION OF THE DRAWING
The present invention may be further understood from the following description in conjunction with the appended drawing. In the drawing:
Figure 1 shows the design methodology flow diagram of the preferred embodiment of a compiler.
Figure 2 shows the control flow for the operation of the preferred embodiment of an application specific integrated circuit.
Figure 3 shows a fragment of a high level source code example which can be input into the compiler.
Figure 4 shows the microprocessor object code for the code example of Figure 3 which would be output by a standard compiler.
Figure 5 shows an example of the application specific circuitry which is output by the compiler for the code example of Figure 3.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In accordance with the preferred embodiment of the present invention, a method is presented for automatically compiling high level programming languages into application specific integrated circuits (ASIC).
Referring to Figure 1, the computer program source code 101 is parsed with standard compiler technology 103 into a language independent intermediate format 105.
The intermediate format 105 is a standard control and data flow graph, but with the addition of constructs to capture loops, conditional statements, and array accesses. The format's operators are language independent simple RISC-like instructions, but with additional operators for array accesses and procedure calls. These constructs capture all the high level information necessary for parallelization of the code. For further description of a compiled intermediate format see for example S. P. Amarasinghe, J. M. Anderson, C. S. Wilson, S.-W. Liao, B. M. Murphy, R. S. French, M. S. Lam and M. W. Hall; Multiprocessors from a Software Perspective; IEEE Micro, June 1996; pages 52-61.
Because standard compiler technology is used, the input computer program can be any legal source code for a supported high level programming language. The methodology does not require a special language with constructs specifically for describing hardware implementation elements. Front end parsers currently exist for ANSI C and FORTRAN 77 and other languages can be supported simply by adding new front end parsers. For further information on front end parsers see for example C. W. Fraser and D. R. Hanson; A Retargetable Compiler for ANSI C; SIGPLAN Notices, 26(10); October 1991.
From the intermediate format 105, the present methodology uniquely supports code generation for two different types of target hardware: standard microprocessor and ASIC. Both targets are needed because while the ASIC is much faster than the microprocessor, it is also much larger and more expensive and therefore needs to be treated as a scarce resource. The compiler will estimate the performance versus area tradeoffs and automatically determine which code blocks should be targeted for a given available ASIC area.
Code generation for the microprocessor is handled by standard compiler technology 107. A code generator for the MIPS microprocessor currently exists and other microprocessors can be supported by simply adding new back end generators. In the generated object code 109, custom instructions are inserted which invoke the ASIC-implemented logic as special instructions.
The special instructions are in four general categories: load configuration, activate configuration, invoke configuration, release configuration. The load configuration instruction identifies the address of a fixed bit stream which can configure the logic and interconnect for a single block of reconfigurable logic on the ASIC. Referring to Figure 2, the ASIC 200 may have one or more such blocks 201a, 201b on a single chip, possibly together with an embedded microprocessor 205 and control logic 207 for the reconfigurable logic. The identified bit stream may reside in, for example, random access memory (RAM) or read-only memory (PROM or EEPROM) 203. The bit stream is downloaded to a cache of possible block configurations on the ASIC. The activate configuration instruction identifies a previously downloaded configuration, restructures the reconfigurable logic on the ASIC block according to that configuration, and locks the block from any subsequent activate instructions. The invoke configuration instruction loads the input operand registers, locks the output registers, and invokes the configured logic on the ASIC. After the ASIC loads the results into the instruction's output registers, it unlocks the registers and the microprocessor can take the results and continue execution. The release configuration instruction unlocks the ASIC block and makes it available for subsequent activate configuration instructions. For further description of an embedded microprocessor with reconfigurable logic see U.S. Patent Application 08/884,380 of L. Cooke, C. Phillips, and D. Wong for An Integrated Processor and Programmable Data Path Chip for Reconfigurable Computing, incorporated herein by reference.
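The four instruction categories can be read as a small state machine for each reconfigurable block. The following Python sketch is only an illustration of that reading: the class `ReconfigurableBlock`, its method names, the exception type, and the use of a Python callable in place of a bit stream are all assumptions and do not appear in the specification.

```python
class BlockLockedError(RuntimeError):
    """Raised when an activate is issued against a locked block (hypothetical)."""


class ReconfigurableBlock:
    """Toy software model of one reconfigurable block and its configuration cache."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.cache = {}      # config id -> callable standing in for a bit stream
        self.active = None   # id of the currently activated configuration
        self.locked = False

    def load(self, config_id, logic):
        # load configuration: download a bit stream into the block's cache
        if config_id not in self.cache and len(self.cache) >= self.cache_capacity:
            raise MemoryError("configuration cache is full")
        self.cache[config_id] = logic

    def activate(self, config_id):
        # activate configuration: restructure the block, then lock it
        if self.locked:
            raise BlockLockedError("block is locked; release it first")
        if config_id not in self.cache:
            raise KeyError(f"configuration {config_id!r} was never loaded")
        self.active = config_id
        self.locked = True

    def invoke(self, *operands):
        # invoke configuration: run the configured logic on the input operands
        if self.active is None:
            raise RuntimeError("no active configuration")
        return self.cache[self.active](*operands)

    def release(self):
        # release configuration: unlock the block for later activations
        self.locked = False
        self.active = None


block = ReconfigurableBlock(cache_capacity=2)
block.load("mac", lambda a, b, acc: acc + a * b)  # multiply-accumulate stand-in
block.activate("mac")
result = block.invoke(3, 4, 10)
block.release()
```

In this model a second `activate` before `release` raises an error, whereas the hardware described above would presumably stall or reject the instruction; the sketch only captures the locking rule.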
Code generation for the ASIC logic can be implemented by several methods.
One implementation passes the intermediate control and data flow graphs to a behavioral synthesis program. This interface could be accomplished either by passing the data structures directly or by generating an intermediate behavioral language description. For further discussion of behavioral synthesis see for example D. Knapp; Behavioral Synthesis; Prentice Hall PTR; 1996. An alternative implementation generates one-to-one mappings of the intermediate format primitives onto a library of circuit implementations. For example: scalar variables and arrays are implemented as registers and register files with appropriate bit widths; arithmetic and Boolean operators such as add, multiply, accumulate, and compare are implemented as single cells with appropriate bit widths; conditional branch implementations and loops are implemented as state machines. In general, as illustrated in Figure 1, a silicon compiler 113 receives as inputs compiled code in the intermediate format 105 and circuit primitives from a circuit primitive library 115 and produces layout or configuration information for an ASIC 117. For further discussion of techniques for state machine synthesis see for example G. De Micheli, A. Sangiovanni-Vincentelli, and P. Antognetti; Design Systems for VLSI Circuits; Martinus Nijhoff Publishers; 1987; pp. 327-364.
After the synthesis or mapping step is completed, an equivalent list of cells and their interconnections is generated. This list is commonly referred to as a netlist. This netlist is then passed to a placer and router which determines the actual layout of the cells and their interconnections on an ASIC. The complete layout is then encoded and compressed in a bit stream format which can be stored and loaded as a single unit to configure the ASIC. A step-by-step example of the foregoing process is illustrated in Figure 3, Figure 4, and Figure 5. For a general discussion of place and route algorithms see T. Ohtsuki; Layout Design and Verification; North-Holland; 1986; pp. 55-198.
The basic unit of code that would be targeted for an ASIC is a loop. A single loop in the input source code may be transformed in the intermediate format into multiple constructs for runtime optimization and parallelization by optimizer and parallelizer 111 in Figure 1. The degree of loop transformation for parallel execution is a key factor in improving the performance of the ASIC versus a microprocessor.
These transformations are handled by standard parallelizing compiler technology which includes constant propagation, forward propagation, induction variable detection, constant folding, scalar privatization analysis, loop interchange, skewing, and reversal.
For a general discussion of parallel compiler loop transformations see Michael Wolfe; High Performance Compilers for Parallel Computing; Addison-Wesley Publishing Company; 1996; pp. 307-363.
To determine which source code loops will yield the most relative performance improvement, the results of a standard source code profiler are input to the compiler.
The profiler analysis indicates the percentage of runtime spent in each block of code.
By combining these percentages with the amount of possible parallelization for each loop, a figure of merit can be estimated for the possible gain of each loop.
For example:
    Gain = profilePercent * (1 - 1 / parallelPaths)
where
    profilePercent = percent of runtime spent in this loop
    parallelPaths = number of paths which can be executed in parallel
The amount of ASIC area required to implement a source code loop is determined by summing the individual areas of all its mapped cells and estimating the additional area required to interconnect the cells. The size of the cells and their interconnect depends on the number of bits needed to implement the required data precision. The ASIC area can serve as a figure of merit for the cost of each loop. For example:
    Cost = cellArea + MAX(0, interconnectArea - overTheCellArea)
where
    cellArea = sum of all component cell areas
    overTheCellArea = cellArea * (per cell area available for interconnects)
    interconnectArea = (number of interconnects) * (interconnectLength) * (interconnect width)
    interconnectLength = (square root of the number of cells) / 3
For further information on estimating interconnect area see B. Preas, M. Lorenzetti; Physical Design Automation of VLSI Systems; Benjamin/Cummings Publishing Company; 1988; pp. 31-64.
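The two figures of merit above can be written directly as functions. This is a minimal sketch; the function and parameter names are illustrative, and the argument `over_cell_fraction` assumes that the "per cell area available for interconnects" is expressed as a fraction of the cell area.

```python
import math


def loop_gain(profile_percent, parallel_paths):
    """Gain = profilePercent * (1 - 1 / parallelPaths)."""
    return profile_percent * (1.0 - 1.0 / parallel_paths)


def loop_cost(cell_areas, over_cell_fraction, num_interconnects, interconnect_width):
    """Cost = cellArea + MAX(0, interconnectArea - overTheCellArea)."""
    cell_area = sum(cell_areas)
    # Assumption: a fixed fraction of each cell's area can carry interconnect.
    over_the_cell_area = cell_area * over_cell_fraction
    # Average interconnect length: square root of the cell count, divided by 3.
    interconnect_length = math.sqrt(len(cell_areas)) / 3.0
    interconnect_area = num_interconnects * interconnect_length * interconnect_width
    return cell_area + max(0.0, interconnect_area - over_the_cell_area)


gain = loop_gain(profile_percent=40.0, parallel_paths=4)
cost = loop_cost([10.0] * 9, over_cell_fraction=0.1,
                 num_interconnects=20, interconnect_width=0.5)
```

With these made-up inputs, a loop taking 40 percent of the runtime with four parallel paths has a gain of 30, and nine cells of area 10 with twenty interconnects of width 0.5 cost 91 area units, of which only 1 unit is interconnect that does not fit over the cells.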
The method does not actually calculate the figures of merit for all the loops in the source code. The compiler is given two runtime parameters: the maximum area for a single ASIC block, and the maximum total ASIC area available, depending on the targeted runtime system. It first sorts the loops in descending order of their percentage of runtime, and then estimates the figures of merit for each loop until it reaches a predetermined limit in the total amount of area estimated. The predetermined limit is a constant times the maximum total ASIC area available. Loops that require an area larger than a single ASIC block may be skipped for a simpler implementation.
Finally, with all the loops for which figures of merit have been calculated, a knapsack algorithm is applied to select the loops. This procedure can be trivially extended to handle the case of targeting multiple ASICs if there is no gain or cost associated with being in different ASICs. For a general discussion of knapsack algorithms see Syslo, Deo, Kowalik; Discrete Optimization Algorithms; Prentice-Hall; 1983; pp. 118-176.
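The specification does not say which knapsack algorithm is used. As one possible reading, the sketch below applies the classic 0/1 knapsack dynamic program, assuming areas are given in integer units. The function name, the tuple layout, and the example loops are hypothetical.

```python
def select_loops(loops, max_block_area, max_total_area):
    """Pick loops maximizing total gain within the total ASIC area.

    loops: list of (name, gain, area) tuples with integer areas.
    Returns (total_gain, [chosen names]).
    """
    # Loops larger than a single ASIC block are skipped, as the text allows.
    candidates = [loop for loop in loops if loop[2] <= max_block_area]
    # best[a] holds the best (gain, names) using at most `a` units of area.
    best = [(0.0, [])] * (max_total_area + 1)
    for name, gain, area in candidates:
        # Walk areas downward so each loop is used at most once.
        for a in range(max_total_area, area - 1, -1):
            candidate_gain = best[a - area][0] + gain
            if candidate_gain > best[a][0]:
                best[a] = (candidate_gain, best[a - area][1] + [name])
    return best[max_total_area]


loops = [("fir", 30.0, 60), ("dct", 25.0, 50), ("crc", 20.0, 40), ("huge", 90.0, 500)]
total_gain, chosen = select_loops(loops, max_block_area=100, max_total_area=100)
```

Here `huge` is dropped for exceeding the block size, and `fir` plus `crc` (area 100, gain 50) beats `dct` plus `crc` (area 90, gain 45).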
The various source code loops which are packed onto a single ASIC are generally independent of each other. With certain types of ASICs, namely a field programmable gate array (FPGA), it is possible to change at runtime some or all of the functions on the FPGA. The FPGA has one or more independent blocks of reconfigurable logic. Each block may be reconfigured without affecting any other block. Changing which functions are currently implemented may be desirable as the computer program executes different areas of code, or when an entirely different computer program is loaded, or when the amount of available FPGA logic changes.
A reconfigurable FPGA environment presents the following problems for the compiler to solve: selecting the total set of functions to be implemented, partitioning the functions across multiple FPGA blocks, and scheduling the loading and activation of FPGA blocks during the program execution. These problems cannot be solved optimally in polynomial time. The following paragraphs describe some heuristics which can be successfully applied to these problems.
The set of configurations simultaneously coexisting on an FPGA at a single instant of time will be referred to as a snapshot. The various functions comprising a snapshot are partitioned into the separate blocks by the compiler in order to minimize the block's stall time and therefore minimize the overall execution schedule.
A block will be stalled if the microprocessor has issued a new activate configuration instruction, but all the functions of the previous configuration have not yet completed.
The partitioning will group together functions that finish at close to the same time. All the functions which have been selected by the knapsack algorithm are sorted according to their ideal scheduled finish times (the ideal finish times assume that the blocks have been downloaded and activated without delay so that the functions can be invoked at their scheduled start times). Traversing the list by increasing finish times, each function is assigned to the same FPGA block until the FPGA block's area capacity is reached. When an FPGA block is filled, the next FPGA block is opened. After all functions have been assigned to FPGA blocks, the difference between the earliest and the latest finish times is calculated for each FPGA block. Then each function is revisited in reverse (decreasing) order. If reassigning the function to the next FPGA block does not exceed its area capacity and reduces the maximum of the two differences for the two FPGA blocks, then the function is reassigned to the next FPGA block.
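The partitioning heuristic can be sketched in two passes: a greedy fill by increasing finish time, then a reverse pass that moves a function to the next block when that lowers the larger of the two finish-time spreads. The code below is one interpretation, assuming unique function names and a single area capacity shared by all blocks; the names `partition` and `spread` are illustrative.

```python
def spread(funcs):
    """Difference between the latest and earliest finish time in one block."""
    if not funcs:
        return 0
    times = [f[1] for f in funcs]
    return max(times) - min(times)


def partition(functions, block_capacity):
    """functions: list of (name, finish_time, area). Returns a list of blocks."""
    ordered = sorted(functions, key=lambda f: f[1])

    # Pass 1: fill the current block until its area capacity is reached.
    blocks, used = [[]], 0
    for f in ordered:
        if blocks[-1] and used + f[2] > block_capacity:
            blocks.append([])
            used = 0
        blocks[-1].append(f)
        used += f[2]

    # Pass 2: revisit functions by decreasing finish time.
    for f in reversed(ordered):
        i = next(k for k, b in enumerate(blocks) if f in b)
        if i + 1 >= len(blocks):
            continue  # no next block to move into
        cur, nxt = blocks[i], blocks[i + 1]
        if sum(g[2] for g in nxt) + f[2] > block_capacity:
            continue  # the move would exceed the next block's area
        before = max(spread(cur), spread(nxt))
        after = max(spread([g for g in cur if g != f]), spread(nxt + [f]))
        if after < before:
            cur.remove(f)
            nxt.append(f)
    return blocks


functions = [("a", 1, 4), ("b", 2, 4), ("c", 9, 2), ("d", 10, 4), ("e", 11, 4)]
blocks = partition(functions, block_capacity=10)
names = [[f[0] for f in b] for b in blocks]
```

In this example the greedy pass places `c` with `a` and `b`, giving the first block a spread of 8. The reverse pass moves `c` to the second block, which leaves spreads of 1 and 2.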
After the functions are partitioned, each configuration of an FPGA block may be viewed as a single task. Its data and control dependencies are the union of its assigned functions' dependencies, and its required time is the difference between the latest finish time and the earliest start time of its assigned functions. The set of all such configuration tasks across all snapshots may be scheduled with standard multiprocessor scheduling algorithms, treating each physical FPGA block as a processor. This will schedule all the activate configuration instructions.
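As a small illustration of how a configuration collapses into one task, the sketch below derives the task's dependencies and required time from its assigned functions. The dictionary keys and the function name are assumptions made for this example.

```python
def configuration_task(functions):
    """Collapse the functions assigned to one configuration into a single task.

    functions: non-empty list of dicts with 'start', 'finish', and 'deps' (a set).
    """
    # Dependencies: the union of the assigned functions' dependencies.
    deps = set().union(*(f["deps"] for f in functions))
    # Required time: latest finish time minus earliest start time.
    required_time = (max(f["finish"] for f in functions)
                     - min(f["start"] for f in functions))
    return {"deps": deps, "required_time": required_time}


task = configuration_task([
    {"start": 2, "finish": 6, "deps": {"x"}},
    {"start": 4, "finish": 9, "deps": {"x", "y"}},
])
```

The resulting task depends on both `x` and `y` and needs 7 time units, spanning from the earliest start (2) to the latest finish (9).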
A common scheduling algorithm is called list scheduling. In list scheduling, the following steps are a typical implementation:
1. Each node in the task graph is assigned a priority. The priority is defined as the length of the longest path from the starting point of the task graph to the node. A priority queue is initialized for ready tasks by inserting every task that has no immediate predecessors. Tasks are sorted in decreasing order of task priorities.
2. As long as the priority queue is not empty do the following:
a. A task is obtained from the front of the queue.
b. An idle processor is selected to run the task.
c. When all the immediate predecessors of a particular task are executed, that successor is now ready and can be inserted into the priority queue.
For further information on multiprocessor scheduling algorithms see A. Zomaya; Parallel and Distributed Computing Handbook; McGraw-Hill; 1996; pp. 239-273.
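The list scheduling steps above can be sketched as follows. The priority follows the definition given in the text (longest path from the start of the task graph to the node, here including the node's own duration); many textbook variants use the longest path to the exit instead. As a simplification, a task enters the queue once all its predecessors have been placed in the schedule, and its start time is delayed until they finish. All names are illustrative.

```python
import heapq


def list_schedule(durations, preds, num_processors):
    """durations: {task: time}; preds: {task: set of predecessor tasks}.

    Returns {task: (processor, start, finish)}.
    """
    succs = {t: [] for t in durations}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)

    # Step 1: priority = longest path from the graph start to each node.
    priority = {}

    def longest_path(t):
        if t not in priority:
            priority[t] = durations[t] + max(
                (longest_path(p) for p in preds.get(t, ())), default=0)
        return priority[t]

    for t in durations:
        longest_path(t)

    # Ready queue holds tasks with no unscheduled predecessors,
    # highest priority first (heapq is a min-heap, hence the negation).
    remaining = {t: len(preds.get(t, ())) for t in durations}
    ready = [(-priority[t], t) for t in durations if remaining[t] == 0]
    heapq.heapify(ready)

    proc_free = [0] * num_processors  # time at which each processor becomes idle
    schedule = {}
    # Step 2: take tasks from the front of the queue until it is empty.
    while ready:
        _, task = heapq.heappop(ready)
        proc = min(range(num_processors), key=lambda p: proc_free[p])
        earliest = max((schedule[p][2] for p in preds.get(task, ())), default=0)
        start = max(proc_free[proc], earliest)
        finish = start + durations[task]
        schedule[task] = (proc, start, finish)
        proc_free[proc] = finish
        for s in succs[task]:
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return schedule


durations = {"A": 2, "B": 3, "C": 1, "D": 2}
preds = {"C": {"A"}, "D": {"A", "B"}}
schedule = list_schedule(durations, preds, num_processors=2)
```

In the FPGA setting described above, each "processor" would be a physical FPGA block and each task a configuration; the start times then give the points at which activate configuration instructions are issued.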
All the load configuration instructions may be issued at the beginning of the program if the total number of configurations for any FPGA block does not exceed the capacity of the FPGA block's configuration cache. Similarly, the program may be divided into more than one section, where the total number of configurations for any FPGA block does not exceed the capacity of the FPGA block's configuration cache.
Alternatively, the load configuration instructions may be scheduled at the lowest preceding branch point in the program's control flow graph which covers all the block's activate configuration instructions. This will be referred to as a covering load instruction. This is a preliminary schedule for the load instructions, but will lead to stalls if the actual load time exceeds the time the microprocessor requires to go from the load configuration instruction to the first activate configuration instruction. In addition, the number of configurations for an FPGA block may still exceed the capacity of its configuration cache. This will again lead to stalls in the schedule. In such a case, the compiler will compare the length of the stall versus the estimated gains for each of the configurations in contention. The gain of a configuration is estimated as the sum of the gains of its assigned functions. Among all the configurations in contention, the one with the minimum estimated gain is found. If the stall is greater than the minimum gain, the configuration with the minimum gain will not be used at that point in the schedule.
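The contention rule in this paragraph reduces to a single comparison. The sketch below assumes the stall length and the gains are expressed in the same unit, which the text implies but does not state; the function name and data layout are hypothetical.

```python
def resolve_contention(stall_time, configurations):
    """Decide whether a contending configuration should be dropped.

    configurations: {config name: [gains of its assigned functions]}.
    Returns the name of the configuration to de-schedule, or None if the
    stall is tolerated.
    """
    # The gain of a configuration is the sum of the gains of its functions.
    gains = {name: sum(fs) for name, fs in configurations.items()}
    weakest = min(gains, key=gains.get)
    # Drop the weakest configuration only if the stall costs more than it gains.
    return weakest if stall_time > gains[weakest] else None


contending = {"cfgA": [4.0, 3.0], "cfgB": [2.0, 1.5]}
dropped = resolve_contention(5.0, contending)
kept = resolve_contention(2.0, contending)
```

With a stall of 5.0 the weakest configuration `cfgB` (gain 3.5) is de-scheduled; with a stall of 2.0 nothing is dropped and the stall is accepted.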
When a covering load instruction is de-scheduled as above, tentative load configuration tasks will be created just before each activate configuration instruction. These will be created at the lowest branch point immediately preceding the activate instruction. These will be referred to as single load instructions. A new attempt will be made to schedule the single load command without exceeding the FPGA block's configuration cache capacity at that point in the schedule. Similarly to the previous scheduling attempt, if the number of configurations again exceeds the configuration cache capacity, the length of the stall will be compared to the estimated gains. In this case, however, the estimated gain of the configuration is just the gain of the single function which will be invoked down this branch. Again, if the stall is greater than the minimum gain, the configuration with the minimum gain will not be used at that point in the schedule.
If a de-scheduled load instruction is a covering load instruction, the process will recurse; otherwise if it is a single load instruction, the process terminates.
This process can be generalized to shifting the load instructions down the control flow graph one step at a time and decreasing the number of invocations it must support. For a single step, partition each of the contending configurations into two new tasks. For the configurations which have already been scheduled, split the assigned functions into those which finish by the current time and those that don't. For the configuration which has not been scheduled yet, split the assigned functions into those which start after the stall time and those that don't.
Branch prediction may be used to predict the likely outcome of a branch and to load in advance of the branch a configuration likely to be needed as a result of the branch. Inevitably, branch prediction will sometimes be unsuccessful, with the result that a configuration will have been loaded that is not actually needed. To provide for these instances, instructions may be inserted after the branch instruction to clear the configuration loaded prior to the branch and to load a different configuration needed following the branch, provided that a net execution-time savings results.
It will be appreciated by those of ordinary skill in the art that the invention can be embodied in other specific forms without departing from the spirit or essential character thereof. The presently disclosed embodiments are therefore considered in all respects to be illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than the foregoing description, and all changes which come within the meaning and range of equivalents thereof are intended to be embraced therein.
Alternatively, an integrated circuit may be designed as a general purpose microprocessor with a fixed instruction set, e.g. the Intel x86 processors.
This allows flexibility in writing computer programs which can invoke arbitrary sequences of the microprocessor instructions. While this approach increases the flexibility, it decreases the performance since the circuitry cannot be optimized for any specific application.
It would be desirable for high level programmers to be able to write arbitrary computer programs and have them automatically translated into fast application specific integrated circuits. However, currently there is no bridge between the computer programmers, who have expertise in programming languages for microprocessors, and the application specific integrated circuits, which require expertise in circuit design.
Research and development in integrated circuit design is attempting to push the level of circuit description to increasingly higher levels of abstraction. The current state of the art is the "behavioral synthesizer" whose input is a behavioral language description of the circuit's register/transfer behavior and whose output is a structural description of the circuit elements required to implement that behavior. The input description must have targeted a specific application and must describe its behavior in high level circuit primitives, but the behavioral compiler will automatically determine how many low level circuit primitives are required, how these primitives will be shared between different blocks of logic, and how the use of these primitives will be scheduled. The output description of these circuit primitives is then passed down to a "logic synthesizer" which maps the circuit primitives onto a library of available "cells", where each cell is the complete implementation of a circuit primitive on an integrated circuit. The output of the logic synthesizer is a description of all the required cells and their interconnections. This description is then passed down to a "placer and router"
which determines the detailed layout of all the cells and interconnections on the integrated circuit.
On the other hand, research and development in computer programming is also attempting to push down a level of abstraction by matching the specific application programs with custom targeted hardware. One such attempt is the Intel MMX instruction set. This instruction set was designed specifically to accelerate applications with digital signal processing algorithms. Such applications may be written generically and an MMX aware compiler will automatically accelerate the compiled code by using the special instructions. Another attempt to match the application with appropriate hardware is the work on parallelizing compilers. These compilers will take a computer program written in a sequential programming language and automatically extract the implicit parallelism which can then be targeted for execution on a variable number of processors. Thus different applications may execute on a different number of processors, depending on their particular needs.
Despite the above efforts by both the hardware and software communities, the gap has not yet been bridged between high level programming languages and integrated circuit behavioral descriptions.
SUMMARY OF THE INVENTION
A computer program, written in a high level programming language, is compiled into an intermediate data structure which represents its control and data flow.
This data structure is analyzed to identify critical blocks of logic which can be implemented as an application specific integrated circuit to improve the overall performance. The critical blocks of logic are first transformed into new equivalent logic with maximal data parallelism. The new parallelized logic is then translated into a Boolean gate representation which is suitable for implementation on an application specific integrated circuit. The application specific integrated circuit is coupled with a generic microprocessor via custom instructions for the microprocessor. The original computer program is then compiled into object code with the new expanded target instruction set.
In accordance with one embodiment of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into a program for execution by one or more application specific integrated circuits coupled with a microprocessor. Code blocks the functions of which are to be performed by circuitry within the one or more application specific integrated circuits are selected, and the code blocks are grouped into groups based on at least one of an area constraint and an execution timing constraint. Loading and activation of the functions are scheduled; and code is produced for execution by the microprocessor, including instructions for loading and activating the functions.
In accordance with another aspect of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into one or more application specific integrated circuits. In accordance with yet another aspect of the invention, a computer implemented method automatically compiles a computer program written in a high level programming language into one or more application specific integrated circuits coupled with a standard microprocessor.
In accordance with still another aspect of the invention, a reconfigurable logic block is locked by compiled instructions, wherein an activate configuration instruction locks the block from any subsequent activation and a release configuration instruction unlocks the block. In accordance with a further aspect of the invention, a high level programming language compiler automatically determines a set of one or more special instructions to extend the standard instruction set of a microprocessor which will result in a relative performance improvement for a given input computer program. In accordance with yet a further aspect of the invention, a method is provided for transforming the execution of more than one microprocessor standard instruction into the execution of a single special instruction. In accordance with still a further aspect of the invention, a high level programming language compiler is coupled with a behavioral synthesizer via a data flow graph intermediate representation.
BRIEF DESCRIPTION OF THE DRAWING
The present invention may be further understood from the following description in conjunction with the appended drawing. In the drawing:
Figure 1 shows the design methodology flow diagram of the preferred embodiment of a compiler.
Figure 2 shows the control flow for the operation of the preferred embodiment of an application specific integrated circuit.
Figure 3 shows a fragment of a high level source code example which can be input into the compiler.
Figure 4 shows the microprocessor object code for the code example of Figure 3 which would be output by a standard compiler.
Figure 5 shows an example of the application specific circuitry which is output by the compiler for the code example of Figure 3.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In accordance with the preferred embodiment of the present invention, a method is presented for automatically compiling high level programming languages into application specific integrated circuits (ASIC).
Referring to Figure 1, the computer program source code 101 is parsed with standard compiler technology 103 into a language independent intermediate format 105.
The intermediate format 105 is a standard control and data flow graph, but with the addition of constructs to capture loops, conditional statements, and array accesses. The format's operators are language independent simple RISC-like instructions, but with additional operators for array accesses and procedure calls. These constructs capture all the high level information necessary for parallelization of the code. For further description of a compiled intermediate format see for example S. P. Amarasinghe, J. M. Anderson, C. S. Wilson, S.-W. Liao, B. M. Murphy, R. S. French, M. S. Lam and M. W. Hall; Multiprocessors from a Software Perspective; IEEE Micro, June 1996; pages 52-61.
Because standard compiler technology is used, the input computer program can be any legal source code for a supported high level programming language. The methodology does not require a special language with constructs specifically for describing hardware implementation elements. Front end parsers currently exist for ANSI C and FORTRAN 77 and other languages can be supported simply by adding new front end parsers. For further information on front end parsers see for example C. W. Fraser and D. R. Hanson; A Retargetable Compiler for ANSI C; SIGPLAN Notices, 26(10); October 1991.
From the intermediate format 105, the present methodology uniquely supports code generation for two different types of target hardware: standard microprocessor and ASIC. Both targets are needed because while the ASIC is much faster than the microprocessor, it is also much larger and more expensive and therefore needs to be treated as a scarce resource. The compiler will estimate the performance versus area tradeoffs and automatically determine which code blocks should be targeted for a given available ASIC area.
Code generation for the microprocessor is handled by standard compiler technology 107. A code generator for the MIPS microprocessor currently exists and other microprocessors can be supported by simply adding new back end generators. In the generated object code 109, custom instructions are inserted which invoke the ASIC-implemented logic as special instructions.
The special instructions are in four general categories: load configuration, activate configuration, invoke configuration, release configuration. The load configuration instruction identifies the address of a fixed bit stream which can configure the logic and interconnect for a single block of reconfigurable logic on the ASIC. Referring to Figure 2, the ASIC 200 may have one or more such blocks 201a, 201b on a single chip, possibly together with an embedded microprocessor 205 and control logic 207 for the reconfigurable logic. The identified bit stream may reside in, for example, random access memory (RAM) or read-only-memory (PROM or EEPROM) 203. The bit stream is downloaded to a cache of possible block configurations on the ASIC. The activate configuration instruction identifies a previously downloaded configuration, restructures the reconfigurable logic on the ASIC block according to that configuration, and locks the block from any subsequent activate instructions. The invoke configuration instruction loads the input operand registers, locks the output registers, and invokes the configured logic on the ASIC. After the ASIC loads the results into the instruction's output registers, it unlocks the registers and the microprocessor can take the results and continue execution. The release configuration instruction unlocks the ASIC block and makes it available for subsequent activate configuration instructions. For further description of an embedded microprocessor with reconfigurable logic see U.S. Patent Application 08/884,380 of L. Cooke, C. Phillips, and D. Wong for An Integrated Processor and Programmable Data Path Chip for Reconfigurable Computing, incorporated herein by reference.
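The locking behavior of the four special instructions can be illustrated with a small software model. The following Python sketch is illustrative only and is not part of the disclosed embodiment: the class name ReconfigurableBlock, its method names, and the use of a Python callable to stand in for a downloaded bit stream are all assumptions made for the example. Where real hardware would stall the microprocessor, the sketch raises an error instead.

```python
class ReconfigurableBlock:
    """Toy model of one reconfigurable block with a configuration cache."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.cache = {}      # config id -> callable standing in for a bit stream
        self.active = None   # id of the currently active configuration
        self.locked = False  # set by activate, cleared by release

    def load(self, config_id, logic):
        # load configuration: download a bit stream into the block's cache
        if config_id not in self.cache and len(self.cache) >= self.cache_capacity:
            raise RuntimeError("configuration cache is full")
        self.cache[config_id] = logic

    def activate(self, config_id):
        # activate configuration: restructure the block, then lock it
        if self.locked:
            raise RuntimeError("block is locked by an earlier activate")
        if config_id not in self.cache:
            raise KeyError(config_id)
        self.active = config_id
        self.locked = True

    def invoke(self, *operands):
        # invoke configuration: run the configured logic on the input operands
        if not self.locked:
            raise RuntimeError("no configuration is active")
        return self.cache[self.active](*operands)

    def release(self):
        # release configuration: unlock the block for a later activate
        self.locked = False
        self.active = None
```

In this model a second activate issued before a release is refused, which corresponds to the block being locked from any subsequent activate instructions.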
Code generation for the ASIC logic can be implemented by several methods.
One implementation passes the intermediate control and data flow graphs to a behavioral synthesis program. This interface could be accomplished either by passing the data structures directly or by generating an intermediate behavioral language description. For further discussion of behavioral synthesis see for example D. Knapp; Behavioral Synthesis; Prentice Hall PTR; 1996. An alternative implementation generates one-to-one mappings of the intermediate format primitives onto a library of circuit implementations. For example: scalar variables and arrays are implemented as registers and register files with appropriate bit widths; arithmetic and Boolean operators such as add, multiply, accumulate, and compare are implemented as single cells with appropriate bit widths; conditional branch implementations and loops are implemented as state machines. In general, as illustrated in Figure 1, a silicon compiler 113 receives as inputs compiled code in the intermediate format 105 and circuit primitives from a circuit primitive library 115 and produces layout or configuration information for an ASIC 117. For further discussion of techniques for state machine synthesis see for example G. De Micheli, A. Sangiovanni-Vincentelli, and P. Antognetti; Design Systems for VLSI Circuits; Martinus Nijhoff Publishers; 1987; pp. 327-364.
After the synthesis or mapping step is completed, an equivalent list of cells and their interconnections is generated. This list is commonly referred to as a netlist. This netlist is then passed to a placer and router which determines the actual layout of the cells and their interconnections on an ASIC. The complete layout is then encoded and compressed in a bit stream format which can be stored and loaded as a single unit to configure the ASIC. A step-by-step example of the foregoing process is illustrated in Figure 3, Figure 4, and Figure 5. For a general discussion of place and route algorithms see T. Ohtsuki; Layout Design and Verification; North-Holland; 1986; pp. 55-198.
The basic unit of code that would be targeted for an ASIC is a loop. A single loop in the input source code may be transformed in the intermediate format into multiple constructs for runtime optimization and parallelization by optimizer and parallelizer 111 in Figure 1. The degree of loop transformation for parallel execution is a key factor in improving the performance of the ASIC versus a microprocessor.
These transformations are handled by standard parallelizing compiler technology which includes constant propagation, forward propagation, induction variable detection, constant folding, scalar privatization analysis, loop interchange, skewing, and reversal.
For a general discussion of parallel compiler loop transformations see Michael Wolfe; High Performance Compilers for Parallel Computing; Addison-Wesley Publishing Company; 1996; pp. 307-363.
To determine which source code loops will yield the most relative performance improvement, the results of a standard source code profiler are input to the compiler.
The profiler analysis indicates the percentage of runtime spent in each block of code.
By combining these percentages with the amount of possible parallelization for each loop, a figure of merit can be estimated for the possible gain of each loop.
For example:
Gain = (profilePercent) * (1 - 1 / parallelPaths)
where
profilePercent = percent of runtime spent in this loop
parallelPaths = number of paths which can be executed in parallel
The amount of ASIC area required to implement a source code loop is determined by summing the individual areas of all its mapped cells and estimating the additional area required to interconnect the cells. The size of the cells and their interconnect depends on the number of bits needed to implement the required data precision. The ASIC area can serve as a figure of merit for the cost of each loop. For example:
Cost = cellArea + MAX(0, (interconnectArea - overTheCellArea))
where
cellArea = sum of all component cell areas
overTheCellArea = cellArea * (per cell area available for interconnects)
interconnectArea = (number of interconnects) * (interconnectLength) * (interconnect width)
interconnectLength = (square root of the number of cells) / 3
For further information on estimating interconnect area see B. Preas, M. Lorenzetti; Physical Design Automation of VLSI Systems; Benjamin/Cummings Publishing Company; 1988; pp. 31-64.
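The two figures of merit can be written directly as functions. The following Python sketch is a minimal transcription of the Gain and Cost formulas above; the function names, the parameter names, and the representation of a loop as a list of cell areas are assumptions made for illustration.

```python
import math


def loop_gain(profile_percent, parallel_paths):
    """Gain = profilePercent * (1 - 1 / parallelPaths)."""
    return profile_percent * (1 - 1 / parallel_paths)


def loop_cost(cell_areas, num_interconnects, interconnect_width,
              over_cell_fraction):
    """Cost = cellArea + MAX(0, interconnectArea - overTheCellArea)."""
    cell_area = sum(cell_areas)
    # Area above the cells that is available for routing interconnects.
    over_the_cell_area = cell_area * over_cell_fraction
    # Estimated length of one interconnect: sqrt(number of cells) / 3.
    interconnect_length = math.sqrt(len(cell_areas)) / 3
    interconnect_area = (num_interconnects * interconnect_length
                         * interconnect_width)
    return cell_area + max(0.0, interconnect_area - over_the_cell_area)
```

For instance, a loop that accounts for 40 percent of the runtime and has four parallel paths has a gain of 30, since 40 * (1 - 1/4) = 30. A loop with a single path has a gain of zero, reflecting that nothing is gained from parallel execution under this estimate.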
The method does not actually calculate the figures of merit for all the loops in the source code. The compiler is given two runtime parameters: the maximum area for a single ASIC block, and the maximum total ASIC area available, depending on the targeted runtime system. It first sorts the loops in descending order of their percentage of runtime, and then estimates the figures of merit for each loop until it reaches a predetermined limit in the total amount of area estimated. The predetermined limit is a constant times the maximum total ASIC area available. Loops that require an area larger than a single ASIC block may be skipped for a simpler implementation.
Finally, with all the loops for which figures of merit have been calculated, a knapsack algorithm is applied to select the loops. This procedure can be trivially extended to handle the case of targeting multiple ASICs if there is no gain or cost associated with being in different ASICs. For a general discussion of knapsack algorithms see Syslo, Deo, Kowalik; Discrete Optimization Algorithms; Prentice-Hall; 1983; pp. 118-176.
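The selection procedure can be sketched as a pre-filter followed by a knapsack step. The following Python sketch is illustrative only: the function name select_loops, the slack parameter (standing for the constant that multiplies the maximum total area), the use of integer areas, and the choice of a dynamic-programming 0/1 knapsack are assumptions, since the text does not name a particular knapsack algorithm. For simplicity the sketch takes each loop's gain and area as inputs and sorts by gain, whereas the method described above sorts by runtime percentage and estimates the figures of merit only as each loop is visited.

```python
def select_loops(loops, max_block_area, max_total_area, slack=2.0):
    """loops: list of (name, gain, area) tuples with integer areas.

    Returns (total_gain, names) for the selected loops.
    """
    # Pre-filter: visit loops in decreasing order of gain until the
    # accumulated area reaches slack * max_total_area.
    candidates, total = [], 0
    for name, gain, area in sorted(loops, key=lambda l: l[1], reverse=True):
        if gain <= 0:
            break                      # remaining loops offer no gain
        if area > max_block_area:
            continue                   # too large for a single block: skip
        if total + area > slack * max_total_area:
            break
        candidates.append((name, gain, area))
        total += area

    # 0/1 knapsack over area: best[c] is the best (gain, names) using
    # at most c units of area.
    best = [(0.0, [])] * (max_total_area + 1)
    for name, gain, area in candidates:
        for cap in range(max_total_area, area - 1, -1):
            prev_gain, prev_names = best[cap - area]
            if prev_gain + gain > best[cap][0]:
                best[cap] = (prev_gain + gain, prev_names + [name])
    return best[max_total_area]
```

The pre-filter admits more area than is actually available so that the knapsack step has a choice: two smaller loops may together yield more gain than the single loop with the highest individual gain.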
The various source code loops which are packed onto a single ASIC are generally independent of each other. With certain types of ASICs, namely a field programmable gate array (FPGA), it is possible to change at runtime some or all of the functions on the FPGA. The FPGA has one or more independent blocks of reconfigurable logic. Each block may be reconfigured without affecting any other block. Changing which functions are currently implemented may be desirable as the computer program executes different areas of code, or when an entirely different computer program is loaded, or when the amount of available FPGA logic changes.
A reconfigurable FPGA environment presents the following problems for the compiler to solve: selecting the total set of functions to be implemented, partitioning the functions across multiple FPGA blocks, and scheduling the loading and activation of FPGA blocks during the program execution. These problems cannot be solved optimally in polynomial time. The following paragraphs describe some heuristics which can be successfully applied to these problems.
The set of configurations simultaneously coexisting on an FPGA at a single instant of time will be referred to as a snapshot. The various functions comprising a snapshot are partitioned into the separate blocks by the compiler in order to minimize the block's stall time and therefore minimize the overall execution schedule.
A block will be stalled if the microprocessor has issued a new activate configuration instruction, but all the functions of the previous configuration have not yet completed.
The partitioning will group together functions that finish at close to the same time. All the functions which have been selected by the knapsack algorithm are sorted according to their ideal scheduled finish times (the ideal finish times assume that the blocks have been downloaded and activated without delay so that the functions can be invoked at their scheduled start times). Traversing the list by increasing finish times, each function is assigned to the same FPGA block until the FPGA block's area capacity is reached. When an FPGA block is filled, the next FPGA block is opened. After all functions have been assigned to FPGA blocks, the difference between the earliest and the latest finish times is calculated for each FPGA block. Then each function is revisited in reverse (decreasing) order. If reassigning the function to the next FPGA block does not exceed its area capacity and reduces the maximum of the two differences for the two FPGA blocks, then the function is reassigned to the next FPGA block.
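The two passes of the partitioning heuristic can be sketched as follows. This Python sketch is illustrative only: the function names, the representation of a function as a (name, finish_time, area) tuple, and the assumptions that names are unique and that every function fits within a single block are not taken from the text.

```python
def spread(block):
    """Difference between the latest and earliest finish times in a block."""
    if not block:
        return 0
    finishes = [finish for _, finish, _ in block]
    return max(finishes) - min(finishes)


def partition(functions, capacity):
    """functions: list of (name, finish_time, area). Returns a list of blocks."""
    ordered = sorted(functions, key=lambda fn: fn[1])

    # First pass: fill blocks in increasing order of finish time.
    blocks = [[]]
    for fn in ordered:
        if sum(area for _, _, area in blocks[-1]) + fn[2] > capacity:
            blocks.append([])          # current block is full: open the next
        blocks[-1].append(fn)

    # Second pass: revisit functions in decreasing order of finish time and
    # move one to the next block when that lowers the larger of the two spreads.
    for fn in reversed(ordered):
        i = next(k for k, block in enumerate(blocks) if fn in block)
        if i + 1 == len(blocks):
            continue                   # no next block to move into
        here, nxt = blocks[i], blocks[i + 1]
        if sum(area for _, _, area in nxt) + fn[2] > capacity:
            continue
        before = max(spread(here), spread(nxt))
        after = max(spread([f for f in here if f != fn]), spread(nxt + [fn]))
        if after < before:
            here.remove(fn)
            nxt.append(fn)
    return [block for block in blocks if block]
```

The second pass corrects cases where the greedy fill leaves a late-finishing function stranded in a block of early finishers, which would otherwise keep that block locked long after most of its functions have completed.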
After the functions are partitioned, each configuration of an FPGA block may be viewed as a single task. Its data and control dependencies are the union of its assigned functions' dependencies, and its required time is the difference between the latest finish time and the earliest start time of its assigned functions. The set of all such configuration tasks across all snapshots may be scheduled with standard multiprocessor scheduling algorithms, treating each physical FPGA block as a processor. This will schedule all the activate configuration instructions.
A common scheduling algorithm is called list scheduling. In list scheduling, the following steps are a typical implementation:
1. Each node in the task graph is assigned a priority. The priority is defined as the length of the longest path from the starting point of the task graph to the node. A priority queue is initialized for ready tasks by inserting every task that has no immediate predecessors. Tasks are sorted in decreasing order of task priorities.
2. As long as the priority queue is not empty do the following:
a. A task is obtained from the front of the queue.
b. An idle processor is selected to run the task.
c. When all the immediate predecessors of a particular task are executed, that successor is now ready and can be inserted into the priority queue.
For further information on multiprocessor scheduling algorithms see A. Zomaya; Parallel and Distributed Computing Handbook; McGraw-Hill; 1996; pp. 239-273.
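The list scheduling steps above can be sketched in Python as follows. The sketch is illustrative only: the function name list_schedule, the dictionary-based task graph, and the rule of picking the processor that becomes free earliest are assumptions. The priority follows the definition given in step 1 (longest path from the start of the task graph to the node, here measured in task durations and including the node itself). A successor enters the ready queue once all of its immediate predecessors have been placed in the schedule.

```python
import heapq


def list_schedule(durations, preds, num_processors):
    """durations: task -> required time; preds: task -> immediate predecessors.

    Returns task -> (processor, start, finish).
    """
    succs = {task: [] for task in durations}
    for task in durations:
        for p in preds.get(task, ()):
            succs[p].append(task)

    priority = {}

    def longest_path(task):
        # Length of the longest path from the start of the graph to the task.
        if task not in priority:
            priority[task] = durations[task] + max(
                (longest_path(p) for p in preds.get(task, ())), default=0)
        return priority[task]

    # Step 1: the ready queue starts with every task without predecessors.
    waiting = {task: len(preds.get(task, ())) for task in durations}
    ready = [(-longest_path(t), t) for t in durations if waiting[t] == 0]
    heapq.heapify(ready)               # highest priority comes out first

    free_at = [0] * num_processors     # time at which each processor is idle
    schedule = {}
    while ready:                       # Step 2
        _, task = heapq.heappop(ready)                           # 2a
        proc = min(range(num_processors), key=free_at.__getitem__)  # 2b
        start = max([free_at[proc]] +
                    [schedule[p][2] for p in preds.get(task, ())])
        finish = start + durations[task]
        schedule[task] = (proc, start, finish)
        free_at[proc] = finish
        for s in succs[task]:                                    # 2c
            waiting[s] -= 1
            if waiting[s] == 0:
                heapq.heappush(ready, (-longest_path(s), s))
    return schedule
```

In the present context each task would be a configuration of an FPGA block and each processor a physical FPGA block, so the start times in the returned schedule indicate where the activate configuration instructions belong.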
All the load configuration instructions may be issued at the beginning of the program if the total number of configurations for any FPGA block does not exceed the capacity of the FPGA block's configuration cache. Similarly, the program may be divided into more than one section, where the total number of configurations for any FPGA block does not exceed the capacity of the FPGA block's configuration cache.
Alternatively, the load configuration instructions may be scheduled at the lowest preceding branch point in the program's control flow graph which covers all the block's activate configuration instructions. This will be referred to as a covering load instruction. This is a preliminary schedule for the load instructions, but will lead to stalls if the actual load time exceeds the time the microprocessor requires to go from the load configuration instruction to the first activate configuration instruction. In addition, the number of configurations for an FPGA block may still exceed the capacity of its configuration cache. This will again lead to stalls in the schedule. In such a case, the compiler will compare the length of the stall versus the estimated gains for each of the configurations in contention. The gain of a configuration is estimated as the sum of the gains of its assigned functions. Among all the configurations in contention, the one with the minimum estimated gain is found. If the stall is greater than the minimum gain, the configuration with the minimum gain will not be used at that point in the schedule.
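The comparison between the stall and the estimated gains can be sketched as a single decision function. This Python sketch is illustrative only: the function name resolve_contention and the dictionary layout are assumptions, and the stall and the gains are assumed to be expressed in the same units so that they can be compared directly.

```python
def resolve_contention(configs, stall):
    """configs: configuration name -> list of gains of its assigned functions.

    Returns the name of the configuration to drop at this point in the
    schedule, or None if the stall is not worse than the smallest gain.
    """
    # The gain of a configuration is the sum of the gains of its functions.
    gains = {name: sum(function_gains)
             for name, function_gains in configs.items()}
    weakest = min(gains, key=gains.get)
    return weakest if stall > gains[weakest] else None
```

With three contending configurations whose gains sum to 8, 3.5, and 10, a stall of 4 causes the second configuration to be dropped, while a stall of 3 leaves all three in place.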
When a covering load instruction is de-scheduled as above, tentative load configuration tasks will be created just before each activate configuration instruction. These will be created at the lowest branch point immediately preceding the activate instruction. These will be referred to as single load instructions. A new attempt will be made to schedule the single load command without exceeding the FPGA block's configuration cache capacity at that point in the schedule. Similarly to the previous scheduling attempt, if the number of configurations again exceeds the configuration cache capacity, the length of the stall will be compared to the estimated gains. In this case, however, the estimated gain of the configuration is just the gain of the single function which will be invoked down this branch. Again, if the stall is greater than the minimum gain, the configuration with the minimum gain will not be used at that point in the schedule.
If a de-scheduled load instruction is a covering load instruction, the process will recurse; otherwise if it is a single load instruction, the process terminates.
This process can be generalized to shifting the load instructions down the control flow graph one step at a time and decreasing the number of invocations it must support. For a single step, partition each of the contending configurations into two new tasks. For the configurations which have already been scheduled, split the assigned functions into those which finish by the current time and those that don't. For the configuration which has not been scheduled yet, split the assigned functions into those which start after the stall time and those that don't.
Branch prediction may be used to predict the likely outcome of a branch and to load in advance of the branch a configuration likely to be needed as a result of the branch. Inevitably, branch prediction will sometimes be unsuccessful, with the result that a configuration will have been loaded that is not actually needed. To provide for these instances, instructions may be inserted after the branch instruction to clear the configuration loaded prior to the branch and to load a different configuration needed following the branch, provided that a net execution-time savings results.
It will be appreciated by those of ordinary skill in the art that the invention can be embodied in other specific forms without departing from the spirit or essential character thereof. The presently disclosed embodiments are therefore considered in all respects to be illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than the foregoing description, and all changes which come within the meaning and range of equivalents thereof are intended to be embraced therein.
Claims (52)
1. A computer implemented method for the automatic compilation of a computer program written in a high level programming language into a program for execution by one or more application specific integrated circuits coupled with a microprocessor, the method comprising the steps of:
automatically determining a set of one or more special instructions, to be executed by said one or more application specific integrated circuits, that will result in a relative performance improvement for a given input computer program written for execution by the microprocessor; and generating code including said one or more special instructions.
2. The method of Claim 1, wherein generating code comprises producing code for execution by the microprocessor, including instructions for loading and activating said functions.
3. The method of Claim 2, comprising the further steps of:
selecting code blocks the functions of which are to be performed by circuitry within the one or more application specific integrated circuits;
grouping the code blocks into groups based on at least one of an area constraint and an execution timing constraint;
scheduling loading of said functions; and scheduling activation of said functions.
4. The method of Claim 2, comprising the further step of producing detailed integrated circuit layouts of said circuitry.
5. The method of Claim 4, comprising the further step of producing configuration data for said functions.
6. The method of Claim 2, wherein said instructions include special instructions to load, activate, invoke and/or release functions implemented on an application specific integrated circuit.
7. The method of Claim 2, wherein grouping comprises calculating start and finish times for the selected blocks of code.
8. The method of Claim 7, wherein the start and finish times are calculated assuming that the selected code blocks are implemented in parallel with a fixed overhead for each parallel operation.
9. The method of Claim 8, wherein the fixed overhead is calculated as OV = I + A + L, where I is an average time required to invoke the application specific integrated circuit as a coprocessor instruction; A is an average time required to issue an activate configuration instruction plus an average stall time for activation; and L is an average time required to issue a load configuration instruction plus an average stall time for loading.
10. The method of Claim 7, wherein grouping is performed such that a difference between the latest and earliest finish times within a group is minimized.
11. The method of Claim 7, wherein grouping is performed such that for each group, circuitry for performing the functions of that group does not exceed a specified capacity of a block of an application integrated circuit.
12. The method of Claim 7, wherein grouping further comprises:
opening a new group with a total assigned area of zero;
sorting and traversing the code blocks in a predetermined order;
for each code block, if the area of the block plus the group's assigned area does not exceed a specified maximum area for a single group, adding the code block to the group and adding the area of the code block to the group's assigned area;
otherwise, opening a new group, adding the code block to the new group and adding the area of the code block to the new group's assigned area.
13. The method of Claim 12, wherein said predetermined order is in increasing order of finish times as a primary key, and increasing order of start times as a secondary key.
14. The method of Claim 13, wherein grouping comprises the further steps of:
traversing the code blocks in decreasing order of finish times;
for each code block, determining a start spread and finish spread of a group to which the code block belongs, wherein the start spread is the difference between the latest and earliest start times of all of the code blocks belonging to the same group, and the finish spread is the difference between the latest and earliest finish times of all of the code blocks belonging to the same group; and reassigning the code block to a different group if the code block's area plus the different group's assigned area does not exceed the specified maximum area for a single group, and if reassigning the code block results in a net improvement in at least one of start spread and finish spread for the group to which the code block belongs and the different group.
15. The method of Claim 2, wherein selecting comprises sampling the percentage of time spent in each block of code when the computer program is executed on a single microprocessor.
16. The method of Claim 15, wherein selecting further comprises:
parsing the high level programming language into an intermediate data structure representing control and data dependencies of the computer program; and analyzing the amount of implicit parallelism in the intermediate data structure.
17. The method of Claim 16, wherein selecting further comprises, for at least some of the code blocks of the computer program, estimating the cost and benefit of implementing a code block using circuitry within an application specific integrated circuit.
18. The method of Claim 17, wherein estimating the cost and benefit of implementing a code block comprises:
estimating a reduction in execution time if the code block is implemented as an application specific integrated circuit; and estimating a layout area required if the code block is implemented as an application specific integrated circuit.
19. The method of Claim 18, wherein selecting further comprises:
accepting a first runtime parameter representing a maximum area of a single block of an application specific integrated circuit and a second runtime parameter representing a maximum total area for all blocks to be considered for implementation as application specific integrated circuits; and selecting a set of code blocks which satisfies the first and second runtime parameters and which maximizes a total estimated reduction in execution time.
20. The method of Claim 19, wherein selecting a set of code blocks which satisfies the first and second runtime parameters and which maximizes a total estimated reduction in execution time comprises:
sorting and traversing the code blocks in decreasing order of reduction in execution time; and for each code block:
if the reduction equals zero, terminate;
estimate the required layout area;
if the area exceeds the specified maximum area for a single block of an application specific integrated circuit, skip this code block;
multiplying the specified maximum total area for all blocks by a constant greater than one;
if a total area of previously selected code blocks plus an estimated required layout area for a current code block exceeds the specified maximum total multiplied by the constant, terminate;
otherwise, select the code block; and using a knapsack algorithm and the maximum total area to perform a further selection on the selected code blocks.
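The two-stage selection recited in Claim 20 can be sketched as follows. This is only an illustrative reading of the claim, not the patented implementation: the `Block` type, the function names, the `slack` default of 1.5 (standing in for the "constant greater than one"), and the use of integer area units are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    reduction: int  # estimated reduction in execution time (hypothetical units)
    area: int       # estimated layout area (hypothetical integer units)


def preselect(blocks, max_block_area, max_total_area, slack=1.5):
    """Greedy pass over blocks in decreasing order of reduction.

    The total-area budget is deliberately inflated by `slack` (> 1) so that
    the later knapsack pass has more candidates than will finally fit.
    """
    budget = max_total_area * slack
    chosen, used = [], 0
    for b in sorted(blocks, key=lambda b: b.reduction, reverse=True):
        if b.reduction == 0:
            break                      # nothing left worth considering
        if b.area > max_block_area:
            continue                   # too large for a single block: skip
        if used + b.area > budget:
            break                      # inflated budget exhausted: terminate
        chosen.append(b)
        used += b.area
    return chosen


def knapsack(blocks, max_total_area):
    """0/1 knapsack over integer areas, maximizing the total reduction."""
    # best[a] holds (total reduction, chosen blocks) for an area limit of a.
    best = [(0, [])] * (max_total_area + 1)
    for b in blocks:
        for a in range(max_total_area, b.area - 1, -1):
            candidate = best[a - b.area][0] + b.reduction
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - b.area][1] + [b])
    return best[max_total_area][1]
```

In this sketch the greedy pass can keep more area than the true limit allows; the knapsack pass then trims the candidate set back to the real maximum total area.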
21. The method of Claim 18, wherein the reduction in execution time is estimated in accordance with the formula R = T(1 - 1/P) where T is a percentage of execution time spent in the code block and P is a number of paths which can be executed in parallel in the code block.
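As a small worked illustration of the formula in Claim 21, the sketch below evaluates R = T(1 - 1/P). The function name and the validation of P are assumptions added for the example.

```python
def estimated_reduction(t_percent: float, parallel_paths: int) -> float:
    """Evaluate R = T * (1 - 1/P) for a code block.

    t_percent is T, the percentage of execution time spent in the block;
    parallel_paths is P, the number of paths that can run in parallel.
    """
    if parallel_paths < 1:
        raise ValueError("P must be at least 1")
    return t_percent * (1 - 1 / parallel_paths)


# A block taking 40% of run time with 4 parallel paths.
print(estimated_reduction(40.0, 4))  # 30.0
```

With P = 1 the formula yields zero, which matches the intuition that a block with no parallelism gains nothing under this estimate.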
22. The method of Claim 18, wherein the intermediate data structure is a tree structure containing nodes, and estimating the layout area comprises:
performing bottom-up traversal of the tree structure;
mapping each node in the tree to a cell from a library of circuit primitives;
calculating a total area of the mapped cells; and calculating an additional area required for cell interconnections.
23. The method of Claim 22, wherein mapping is performed in accordance with multiple predetermined mappings including at least one of the following:
scalar variables map to registers; arrays map to register files; addition and subtraction operators map to adders; increment and decrement operators map to adders;
multiplications and division operators map to multipliers; equality and inequality operators map to comparators; +=, -= operators map to accumulators; *=, /= operators map to multiply-accumulators; <<, >> operators map to shift registers;
&, ~, , operators map to Boolean gates; branches map to a state machine; and loops map to a state machine.
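The bottom-up mapping of Claims 22 and 23 can be pictured with a minimal sketch. The tuple-based tree encoding, the cell names, and every area figure below are hypothetical values chosen for illustration; only the node-to-primitive pairings are taken from the claim, and only a subset of them is shown.

```python
# Hypothetical cell library: primitive name -> area in arbitrary units.
CELL_AREA = {"register": 4, "adder": 10, "multiplier": 40, "comparator": 6}

# Subset of the mappings listed in Claim 23 (node kind -> primitive).
NODE_TO_CELL = {
    "var": "register",
    "+": "adder", "-": "adder",
    "*": "multiplier", "/": "multiplier",
    "==": "comparator", "!=": "comparator",
}


def mapped_area(node):
    """Post-order (bottom-up) traversal: children first, then the node itself.

    A node is a tuple (kind, child, child, ...). Each variable occurrence is
    counted as its own register here, which is a simplification.
    """
    kind, *children = node
    children_area = sum(mapped_area(child) for child in children)
    return children_area + CELL_AREA[NODE_TO_CELL[kind]]


# Tree for the expression (a + b) * c.
tree = ("*", ("+", ("var",), ("var",)), ("var",))
print(mapped_area(tree))  # 62
```

The result covers only the total area of the mapped cells; the additional interconnection area is a separate term.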
24. The method of Claim 22, wherein mapping includes determining a number of significant bits required to support a data precision expected by the computer program.
25. The method of Claim 22, wherein calculating an additional area required for interconnections is performed in accordance with the following formula:
area = max(0, A - B), where A is an estimate of total area required for interconnections and B is an estimate of area available within the mapped cells for use by interconnections.
26. The method of Claim 25, wherein A is calculated as the product of a runtime parameter for the width of an interconnection, an average length of an interconnection calculated as a fraction times the square root of the number of mapped cells, and the total number of interconnections.
27. The method of Claim 25, wherein B is calculated as the product of a runtime parameter for the fraction of cell area for interconnections and the total area of all of the mapped cells.
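Claims 25 to 27 together define the interconnection estimate, which the following sketch evaluates. The function and parameter names are assumptions; the arithmetic follows the claims: A is wire width times average length times number of interconnections, B is a fraction of the mapped cell area, and the extra area is max(0, A - B).

```python
import math


def interconnect_area(num_cells, total_cell_area, num_wires,
                      wire_width, length_fraction, routing_fraction):
    """Additional area needed for interconnections, per Claims 25 to 27.

    wire_width       runtime parameter: width of one interconnection
    length_fraction  fraction multiplied by sqrt(number of mapped cells)
    routing_fraction runtime parameter: share of cell area usable for wiring
    """
    avg_length = length_fraction * math.sqrt(num_cells)
    a = wire_width * avg_length * num_wires      # total area wires require
    b = routing_fraction * total_cell_area       # area already available in cells
    return max(0.0, a - b)


# 16 cells, 20 wires: A = 0.5 * (0.5 * 4) * 20 = 20, B = 0.25 * 40 = 10.
print(interconnect_area(16, 40.0, 20, 0.5, 0.5, 0.25))  # 10.0
```

When the cells already leave enough room for routing (B at least as large as A), the estimate clamps to zero rather than going negative.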
28. The method of Claim 16, comprising the further step of estimating a reduction in execution time for each group.
29. The method of Claim 28, wherein scheduling activation is performed such that overall execution time is minimized subject to at least one of an area constraint and an execution time constraint.
30. The method of Claim 29, wherein scheduling activation is performed such that data and control dependencies of all code blocks within a group are not violated.
31. The method of Claim 29, wherein scheduling activation is performed such that a specified number of simultaneous blocks of an application specific circuit is not exceeded.
32. The method of Claim 29, wherein scheduling further comprises:
modeling each group as a separate task;
modeling as a processor each available block of reconfigurable logic on an application specific integrated circuit; and running a modified multiprocessor scheduling algorithm.
33. The method of Claim 32, wherein the intermediate data structure is a graph in which arcs represent dependencies, and wherein modeling each group as a separate task comprises:
for each group, adding a node to the graph;
for each code block assigned to a group, modifying the graph such that arcs that previously pointed to the code block point instead to a node representing the group;
determining a difference between a latest finish time and an earliest start time of code blocks assigned to the group; and setting a required time of the group equal to said difference.
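A minimal sketch of the graph rewrite in Claim 33 follows. The data layout (a set of `(source, destination)` arcs and a dictionary of per-block times) and the function name are assumptions for the example; the claim itself does not prescribe a representation.

```python
def add_group_node(arcs, times, group_name, members):
    """Model a group of code blocks as one task node.

    arcs    set of (src, dst) dependency arcs
    times   block name -> (earliest_start, latest_finish)
    Returns the rewritten arc set and the group's required time.
    """
    members = set(members)
    new_arcs = set()
    for src, dst in arcs:
        if dst in members:
            dst = group_name           # arcs into a member now enter the group
        new_arcs.add((src, dst))
    earliest_start = min(times[m][0] for m in members)
    latest_finish = max(times[m][1] for m in members)
    return new_arcs, latest_finish - earliest_start


arcs = {("a", "b"), ("a", "c"), ("c", "d")}
times = {"b": (2, 5), "c": (3, 9)}
new_arcs, required = add_group_node(arcs, times, "G", ["b", "c"])
print(sorted(new_arcs), required)  # [('a', 'G'), ('c', 'd')] 7
```

Following the claim's wording, only arcs pointing to a member are redirected; outgoing arcs such as `("c", "d")` are left untouched in this sketch.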
34. The method of Claim 32, wherein running a modified multiprocessor scheduling algorithm comprises:
running a standard list scheduling multiprocessor scheduling algorithm;
during running of the algorithm, in the event no processor is available when a newly-ready task becomes ready:
calculating a stall time until a processor would become available;
create a list of contending tasks including the newly-ready task and tasks scheduled to be executing at a time the newly-ready task becomes ready; and finding a contending task with a minimum estimated reduction in execution time.
35. The method of Claim 34, wherein running the modified multiprocessor scheduling algorithm further comprises:
if the stall time is less than or equal to the minimum reduction, scheduling the newly-ready task to execute when a processor becomes available and continuing to run the multiprocessor scheduling algorithm.
36. The method of Claim 35, wherein running the modified multiprocessor scheduling algorithm further comprises, if the stall time is greater than the minimum reduction, discarding the task with the minimum reduction and continuing to run the multiprocessor scheduling algorithm.
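The contention rule of Claims 34 to 36 can be sketched as a single decision function. The names and dictionaries are hypothetical, and the sketch assumes that the stall time is the gap between the moment the task becomes ready and the earliest finish time among the tasks then executing.

```python
def resolve_contention(ready_time, newly_ready, running, reduction, finish_time):
    """Decide what to do when no processor is free for a newly-ready task.

    running      tasks executing at ready_time
    reduction    task -> estimated reduction in execution time
    finish_time  task -> scheduled finish time
    Returns ("wait", start_time) or ("discard", task).
    """
    next_free = min(finish_time[t] for t in running)
    stall = next_free - ready_time
    contenders = [newly_ready] + list(running)
    weakest = min(contenders, key=lambda t: reduction[t])
    if stall <= reduction[weakest]:
        return ("wait", next_free)     # Claim 35: stalling is cheap enough
    return ("discard", weakest)        # Claim 36: drop the least valuable task


reduction = {"t1": 8, "t2": 5, "t3": 6}
print(resolve_contention(10, "t3", ["t1", "t2"], reduction, {"t1": 12, "t2": 15}))
# ('wait', 12)
```

The comparison weighs the cost of waiting against the smallest benefit at stake: if stalling costs more than the weakest contender would save, that contender is dropped instead.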
37. The method of Claim 35, wherein running the modified multiprocessor scheduling algorithm further comprises, if the stall time is greater than the minimum reduction:
replacing the newly-ready task with two new tasks, a first new task containing code blocks of the newly-ready task having start times later than when a processor would become available, and a second new task containing other code blocks of the newly-ready task;
replacing respective tasks scheduled to be executing at a time the newly-ready task becomes ready with two new respective tasks, a first new task containing code blocks of the newly-ready task having start times later than when a processor would become available, and a second new task containing other code blocks of the newly-ready task.
38. The method of Claim 37, wherein running the modified multiprocessor scheduling algorithm further comprises:
of the new tasks, finding a task with a minimum reduction in execution time;
and discarding the task with the minimum reduction.
39. The method of Claim 28, wherein scheduling loading is performed such that overall execution time is minimized subject to at least one of an area constraint and an execution time constraint.
40. The method of Claim 39, wherein scheduling loading is performed such that each function activation is preceded by loading.
41. The method of Claim 39, wherein scheduling loading is performed such that a specified capacity for coexisting groups loaded for a block of an application specific circuit is not exceeded.
42. The method of Claim 39, wherein the data structure includes a control flow graph, and wherein scheduling loading comprises:
modeling each group as a task and each available block of an application specific integrated circuit as a processor with a specified maximum number of simultaneous tasks;
for each group activation of which has been successfully scheduled, creating a new load-group task having a finish time equal to a finish time of a task representing the group and having a start time equal to a start time of the task representing the group minus a runtime parameter specifying a time required to load a group.
43. The method of Claim 42, wherein scheduling loading further comprises, for each new load-group task, inserting a node into the control flow graph.
44. The method of Claim 42, wherein scheduling loading further comprises:
finding a branching node in the control flow graph immediately preceding activation;
calculating a stall time of a load-group task as a finish time of the branching node minus the load-group task start time;
if the stall time is less than or equal to the estimated reduction in execution time for the group, creating a control flow arc from the branching node to the load-group task;
otherwise, discarding the load-group task and discarding the group.
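The timing arithmetic in Claims 42 and 44 is illustrated below. The `Task` type and function names are assumptions for the example; the start and finish rules and the stall comparison are taken from the claim text.

```python
from typing import NamedTuple


class Task(NamedTuple):
    start: int
    finish: int


def make_load_task(group_task: Task, load_time: int) -> Task:
    """Claim 42: the load-group task starts load_time before the group task
    starts and finishes when the group task finishes."""
    return Task(start=group_task.start - load_time, finish=group_task.finish)


def keep_group(load_task: Task, branch_finish: int, group_reduction: int) -> bool:
    """Claim 44: stall = branching-node finish time minus load-task start time.
    The group is kept only if the stall does not exceed its estimated reduction."""
    stall = branch_finish - load_task.start
    return stall <= group_reduction


load = make_load_task(Task(start=100, finish=140), load_time=30)
print(load)                      # Task(start=70, finish=140)
print(keep_group(load, 80, 25))  # True: a stall of 10 is within the reduction
```

When `keep_group` returns true, the claim adds a control flow arc from the branching node to the load-group task; otherwise both the load-group task and the group are discarded.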
45. The method of Claim 43 or Claim 44 wherein scheduling loading further comprises running a modified list processing multiprocessor scheduling algorithm.
46. The method of Claim 45, wherein running a modified list processing multiprocessor scheduling algorithm comprises:
running a list scheduling multiprocessor scheduling algorithm with a specified maximum number of simultaneous tasks per processor;
during running of the algorithm, in the event no processor is available when a newly-ready task becomes ready:
calculating a stall time until a processor would become available;
create a list of contending tasks including the newly-ready task and tasks scheduled to be executing at a time the newly-ready task becomes ready; and finding a contending task with a minimum estimated reduction in execution time.
47. The method of Claim 46, wherein running the modified multiprocessor scheduling algorithm further comprises:
if the stall time is less than or equal to the minimum reduction, scheduling the newly-ready task to execute when a processor becomes available, adjusting the schedule for a corresponding group task and continuing to run the multiprocessor scheduling algorithm.
48. The method of Claim 46, wherein running the modified multiprocessor scheduling algorithm further comprises, if the stall time is greater than the minimum reduction, discarding the task with the minimum reduction and its corresponding group and continuing to run the multiprocessor scheduling algorithm.
49. The method of Claim 48, wherein running the modified multiprocessor scheduling algorithm further comprises, if in the control flow graph a branching node intervenes between a node representing a discarded load-group task and a node representing activation of the corresponding group:
finding a branching node in the control flow graph immediately preceding activation;
calculating a stall time of a load-group task as a finish time of the branching node minus the load-group task start time;
if the stall time is less than or equal to the estimated reduction in execution time for the group, creating a control flow arc from the branching node to the load-group task;
otherwise, discarding the load-group task and discarding the group.
50. The method of Claim 48, wherein running the modified multiprocessor scheduling algorithm further comprises, if the stall time is greater than the minimum reduction:
replacing the newly-ready task with two new tasks and corresponding groups, a first new group containing code blocks of the newly-ready task having start times later than when a processor would become available, and a second new group containing other code blocks of the newly-ready task;
replacing respective tasks scheduled to be executing at a time the newly-ready task becomes ready with two new respective tasks and corresponding groups, a first new group containing code blocks of the newly-ready task having start times later than when a processor would become available, and a second new group containing other code blocks of the newly-ready task.
51. The method of Claim 50, wherein running the modified multiprocessor scheduling algorithm further comprises:
of the new tasks, finding a task with corresponding group having a minimum reduction in execution time; and discarding the task with the corresponding partition having the minimum reduction.
52. The method of Claim 51, wherein running the modified multiprocessor scheduling algorithm further comprises, if in the control flow graph a branching node intervenes between a node representing a discarded load-group task and a node representing activation of the corresponding group:
finding a branching node in the control flow graph immediately preceding activation;
calculating a stall time of a load-group task as a finish time of the branching node minus the load-group task start time;
if the stall time is less than or equal to the estimated reduction in execution time for the group, creating a control flow arc from the branching node to the load-group task;
otherwise, discarding the load-group task and discarding the group.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/884,377 US5966534A (en) | 1997-06-27 | 1997-06-27 | Method for compiling high level programming languages into an integrated processor with reconfigurable logic |
US08/884,377 | 1997-06-27 | ||
PCT/US1998/013563 WO1999000731A1 (en) | 1997-06-27 | 1998-06-29 | Method for compiling high level programming languages |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2290649A1 (en) | 1999-01-07 |
Family
ID=25384489
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002290649A Abandoned CA2290649A1 (en) | 1997-06-27 | 1998-06-29 | Method for compiling high level programming languages |
Country Status (7)
Country | Link |
---|---|
US (2) | US5966534A (en) |
EP (1) | EP0991997A4 (en) |
JP (1) | JP2002508102A (en) |
KR (1) | KR100614491B1 (en) |
AU (1) | AU8275498A (en) |
CA (1) | CA2290649A1 (en) |
WO (1) | WO1999000731A1 (en) |
Families Citing this family (195)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6077315A (en) * | 1995-04-17 | 2000-06-20 | Ricoh Company Ltd. | Compiling system and method for partially reconfigurable computing |
US7266725B2 (en) * | 2001-09-03 | 2007-09-04 | Pact Xpp Technologies Ag | Method for debugging reconfigurable architectures |
DE19651075A1 (en) * | 1996-12-09 | 1998-06-10 | Pact Inf Tech Gmbh | Unit for processing numerical and logical operations, for use in processors (CPU's), multi-computer systems, data flow processors (DFP's), digital signal processors (DSP's) or the like |
DE19654595A1 (en) | 1996-12-20 | 1998-07-02 | Pact Inf Tech Gmbh | I0 and memory bus system for DFPs as well as building blocks with two- or multi-dimensional programmable cell structures |
EP1329816B1 (en) | 1996-12-27 | 2011-06-22 | Richter, Thomas | Method for automatic dynamic unloading of data flow processors (dfp) as well as modules with bidimensional or multidimensional programmable cell structures (fpgas, dpgas or the like) |
DE19654846A1 (en) * | 1996-12-27 | 1998-07-09 | Pact Inf Tech Gmbh | Process for the independent dynamic reloading of data flow processors (DFPs) as well as modules with two- or multi-dimensional programmable cell structures (FPGAs, DPGAs, etc.) |
US6542998B1 (en) | 1997-02-08 | 2003-04-01 | Pact Gmbh | Method of self-synchronization of configurable elements of a programmable module |
DE19704728A1 (en) * | 1997-02-08 | 1998-08-13 | Pact Inf Tech Gmbh | Method for self-synchronization of configurable elements of a programmable module |
DE19704742A1 (en) * | 1997-02-11 | 1998-09-24 | Pact Inf Tech Gmbh | Internal bus system for DFPs, as well as modules with two- or multi-dimensional programmable cell structures, for coping with large amounts of data with high networking effort |
US6330659B1 (en) | 1997-11-06 | 2001-12-11 | Iready Corporation | Hardware accelerator for an object-oriented programming language |
US8686549B2 (en) | 2001-09-03 | 2014-04-01 | Martin Vorbach | Reconfigurable elements |
JP3539613B2 (en) * | 1997-12-03 | 2004-07-07 | 株式会社日立製作所 | Array summary analysis method for loops containing loop jump statements |
US7373440B2 (en) | 1997-12-17 | 2008-05-13 | Src Computers, Inc. | Switch/network adapter port for clustered computers employing a chain of multi-adaptive processors in a dual in-line memory module format |
US6076152A (en) * | 1997-12-17 | 2000-06-13 | Src Computers, Inc. | Multiprocessor computer architecture incorporating a plurality of memory algorithm processors in the memory subsystem |
US7565461B2 (en) | 1997-12-17 | 2009-07-21 | Src Computers, Inc. | Switch/network adapter port coupling a reconfigurable processing element to one or more microprocessors for use with interleaved memory controllers |
DE19861088A1 (en) * | 1997-12-22 | 2000-02-10 | Pact Inf Tech Gmbh | Repairing integrated circuits by replacing subassemblies with substitutes |
JP2002530780A (en) | 1998-11-20 | 2002-09-17 | アルテラ・コーポレーション | Reconfigurable programmable logic device computer system |
US6286138B1 (en) * | 1998-12-31 | 2001-09-04 | International Business Machines Corporation | Technique for creating remotely updatable programs for use in a client/server environment |
US6477683B1 (en) | 1999-02-05 | 2002-11-05 | Tensilica, Inc. | Automated processor generation system for designing a configurable processor and method for the same |
US6453407B1 (en) * | 1999-02-10 | 2002-09-17 | Infineon Technologies Ag | Configurable long instruction word architecture and instruction set |
DE19910863A1 (en) * | 1999-03-11 | 2000-09-21 | Siemens Ag | Device and method for processing orders |
WO2000077652A2 (en) | 1999-06-10 | 2000-12-21 | Pact Informationstechnologie Gmbh | Sequence partitioning in cell structures |
WO2001013583A2 (en) | 1999-08-16 | 2001-02-22 | Iready Corporation | Internet jack |
EA004196B1 (en) * | 1999-08-30 | 2004-02-26 | Ай Пи ФЛЕКС ИНК. | Control program product and data processing system |
US6714978B1 (en) * | 1999-12-04 | 2004-03-30 | Worldcom, Inc. | Method and system for processing records in a communications network |
US6986128B2 (en) * | 2000-01-07 | 2006-01-10 | Sony Computer Entertainment Inc. | Multiple stage program recompiler and method |
US6625797B1 (en) * | 2000-02-10 | 2003-09-23 | Xilinx, Inc. | Means and method for compiling high level software languages into algorithmically equivalent hardware representations |
US7334216B2 (en) * | 2000-04-04 | 2008-02-19 | Sosy, Inc. | Method and apparatus for automatic generation of information system user interfaces |
US6681383B1 (en) * | 2000-04-04 | 2004-01-20 | Sosy, Inc. | Automatic software production system |
WO2001090887A1 (en) * | 2000-05-25 | 2001-11-29 | Fujitsu Limited | Method fir processing program for high-speed processing by using dynamically reconfigurable hardware and program for executing the processing method |
US7340596B1 (en) * | 2000-06-12 | 2008-03-04 | Altera Corporation | Embedded processor with watchdog timer for programmable logic |
EP2226732A3 (en) | 2000-06-13 | 2016-04-06 | PACT XPP Technologies AG | Cache hierarchy for a multicore processor |
US7168069B1 (en) * | 2000-07-12 | 2007-01-23 | Stmicroelectronics, Inc. | Dynamic generation of multimedia code for image processing |
JP2002049652A (en) * | 2000-08-03 | 2002-02-15 | Hiroshi Yasuda | Digital circuit design method, its compiler and simulator |
US7343594B1 (en) | 2000-08-07 | 2008-03-11 | Altera Corporation | Software-to-hardware compiler with symbol set inference analysis |
EP1356401A2 (en) | 2000-08-07 | 2003-10-29 | Altera Corporation | Software-to-hardware compiler |
JP2004517386A (en) * | 2000-10-06 | 2004-06-10 | ペーアーツェーテー イクスペーペー テクノロジーズ アクチエンゲゼルシャフト | Method and apparatus |
US8058899B2 (en) | 2000-10-06 | 2011-11-15 | Martin Vorbach | Logic cell array and bus system |
US20040015899A1 (en) * | 2000-10-06 | 2004-01-22 | Frank May | Method for processing data |
JP2002123563A (en) * | 2000-10-13 | 2002-04-26 | Nec Corp | Compiling method, composing device, and recording medium |
US6904105B1 (en) * | 2000-10-27 | 2005-06-07 | Intel Corporation | Method and implemention of a traceback-free parallel viterbi decoder |
US6834291B1 (en) | 2000-10-27 | 2004-12-21 | Intel Corporation | Gold code generator design |
US7039717B2 (en) | 2000-11-10 | 2006-05-02 | Nvidia Corporation | Internet modem streaming socket method |
US7379475B2 (en) | 2002-01-25 | 2008-05-27 | Nvidia Corporation | Communications processor |
US7444531B2 (en) | 2001-03-05 | 2008-10-28 | Pact Xpp Technologies Ag | Methods and devices for treating and processing data |
US20070299993A1 (en) * | 2001-03-05 | 2007-12-27 | Pact Xpp Technologies Ag | Method and Device for Treating and Processing Data |
US9250908B2 (en) | 2001-03-05 | 2016-02-02 | Pact Xpp Technologies Ag | Multi-processor bus and cache interconnection system |
US9552047B2 (en) | 2001-03-05 | 2017-01-24 | Pact Xpp Technologies Ag | Multiprocessor having runtime adjustable clock and clock dependent power supply |
US9037807B2 (en) * | 2001-03-05 | 2015-05-19 | Pact Xpp Technologies Ag | Processor arrangement on a chip including data processing, memory, and interface elements |
US20090210653A1 (en) * | 2001-03-05 | 2009-08-20 | Pact Xpp Technologies Ag | Method and device for treating and processing data |
US9411532B2 (en) | 2001-09-07 | 2016-08-09 | Pact Xpp Technologies Ag | Methods and systems for transferring data between a processing device and external devices |
US7210129B2 (en) * | 2001-08-16 | 2007-04-24 | Pact Xpp Technologies Ag | Method for translating programs for reconfigurable architectures |
US7844796B2 (en) | 2001-03-05 | 2010-11-30 | Martin Vorbach | Data processing device and method |
US20090300262A1 (en) * | 2001-03-05 | 2009-12-03 | Martin Vorbach | Methods and devices for treating and/or processing data |
US9141390B2 (en) | 2001-03-05 | 2015-09-22 | Pact Xpp Technologies Ag | Method of processing data with an array of data processors according to application ID |
US9436631B2 (en) | 2001-03-05 | 2016-09-06 | Pact Xpp Technologies Ag | Chip including memory element storing higher level memory data on a page by page basis |
WO2005045692A2 (en) | 2003-08-28 | 2005-05-19 | Pact Xpp Technologies Ag | Data processing device and method |
US7962716B2 (en) | 2001-03-22 | 2011-06-14 | Qst Holdings, Inc. | Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements |
US7433909B2 (en) | 2002-06-25 | 2008-10-07 | Nvidia Corporation | Processing architecture for a reconfigurable arithmetic node |
US7624204B2 (en) * | 2001-03-22 | 2009-11-24 | Nvidia Corporation | Input/output controller node in an adaptable computing environment |
US7489779B2 (en) * | 2001-03-22 | 2009-02-10 | Qstholdings, Llc | Hardware implementation of the secure hash standard |
US7325123B2 (en) | 2001-03-22 | 2008-01-29 | Qst Holdings, Llc | Hierarchical interconnect for configuring separate interconnects for each group of fixed and diverse computational elements |
US20040133745A1 (en) | 2002-10-28 | 2004-07-08 | Quicksilver Technology, Inc. | Adaptable datapath for a digital processing system |
US6836839B2 (en) | 2001-03-22 | 2004-12-28 | Quicksilver Technology, Inc. | Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements |
US7752419B1 (en) | 2001-03-22 | 2010-07-06 | Qst Holdings, Llc | Method and system for managing hardware resources to implement system functions using an adaptive computing architecture |
US7653710B2 (en) | 2002-06-25 | 2010-01-26 | Qst Holdings, Llc. | Hardware task manager |
US6577678B2 (en) | 2001-05-08 | 2003-06-10 | Quicksilver Technology | Method and system for reconfigurable channel coding |
TWI234737B (en) * | 2001-05-24 | 2005-06-21 | Ip Flex Inc | Integrated circuit device |
US6618434B2 (en) * | 2001-05-31 | 2003-09-09 | Quicksilver Technology, Inc. | Adaptive, multimode rake receiver for dynamic search and multipath reception |
US7657877B2 (en) * | 2001-06-20 | 2010-02-02 | Pact Xpp Technologies Ag | Method for processing data |
US10031733B2 (en) * | 2001-06-20 | 2018-07-24 | Scientia Sol Mentis Ag | Method for processing data |
JP2005508029A (en) * | 2001-08-16 | 2005-03-24 | ペーアーツェーテー イクスペーペー テクノロジーズ アクチエンゲゼルシャフト | Program conversion method for reconfigurable architecture |
US7996827B2 (en) | 2001-08-16 | 2011-08-09 | Martin Vorbach | Method for the translation of programs for reconfigurable architectures |
US20030037319A1 (en) * | 2001-08-20 | 2003-02-20 | Ankur Narang | Method and apparatus for partitioning and placement for a cycle-based simulation system |
US7434191B2 (en) | 2001-09-03 | 2008-10-07 | Pact Xpp Technologies Ag | Router |
US20030056091A1 (en) * | 2001-09-14 | 2003-03-20 | Greenberg Craig B. | Method of scheduling in a reconfigurable hardware architecture with multiple hardware configurations |
US8686475B2 (en) | 2001-09-19 | 2014-04-01 | Pact Xpp Technologies Ag | Reconfigurable elements |
US20030149962A1 (en) * | 2001-11-21 | 2003-08-07 | Willis John Christopher | Simulation of designs using programmable processors and electronically re-configurable logic arrays |
US7046635B2 (en) | 2001-11-28 | 2006-05-16 | Quicksilver Technology, Inc. | System for authorizing functionality in adaptable hardware devices |
US6986021B2 (en) | 2001-11-30 | 2006-01-10 | Quick Silver Technology, Inc. | Apparatus, method, system and executable module for configuration and operation of adaptive integrated circuitry having fixed, application specific computational elements |
US8412915B2 (en) | 2001-11-30 | 2013-04-02 | Altera Corporation | Apparatus, system and method for configuration of adaptive integrated circuitry having heterogeneous computational elements |
US7215701B2 (en) | 2001-12-12 | 2007-05-08 | Sharad Sambhwani | Low I/O bandwidth method and system for implementing detection and identification of scrambling codes |
US7577822B2 (en) * | 2001-12-14 | 2009-08-18 | Pact Xpp Technologies Ag | Parallel task operation in processor and reconfigurable coprocessor configured based on information in link list including termination information for synchronization |
US20030120460A1 (en) * | 2001-12-21 | 2003-06-26 | Celoxica Ltd. | System, method, and article of manufacture for enhanced hardware model profiling |
US7403981B2 (en) | 2002-01-04 | 2008-07-22 | Quicksilver Technology, Inc. | Apparatus and method for adaptive multimedia reception and transmission in communication environments |
AU2003214046A1 (en) * | 2002-01-18 | 2003-09-09 | Pact Xpp Technologies Ag | Method and device for partitioning large computer programs |
AU2003208266A1 (en) | 2002-01-19 | 2003-07-30 | Pact Xpp Technologies Ag | Reconfigurable processor |
AU2003214003A1 (en) | 2002-02-18 | 2003-09-09 | Pact Xpp Technologies Ag | Bus systems and method for reconfiguration |
WO2003081454A2 (en) * | 2002-03-21 | 2003-10-02 | Pact Xpp Technologies Ag | Method and device for data processing |
US9170812B2 (en) | 2002-03-21 | 2015-10-27 | Pact Xpp Technologies Ag | Data processing system having integrated pipelined array data processor |
US8914590B2 (en) | 2002-08-07 | 2014-12-16 | Pact Xpp Technologies Ag | Data processing method and device |
WO2004088502A2 (en) * | 2003-04-04 | 2004-10-14 | Pact Xpp Technologies Ag | Method and device for data processing |
US6732354B2 (en) * | 2002-04-23 | 2004-05-04 | Quicksilver Technology, Inc. | Method, system and software for programming reconfigurable hardware |
CN1650258A (en) * | 2002-04-25 | 2005-08-03 | 皇家飞利浦电子股份有限公司 | Automatic task distribution in scalable processors |
USRE43393E1 (en) | 2002-05-13 | 2012-05-15 | Qst Holdings, Llc | Method and system for creating and programming an adaptive computing engine |
US7660984B1 (en) | 2003-05-13 | 2010-02-09 | Quicksilver Technology | Method and system for achieving individualized protected space in an operating system |
US7328414B1 (en) | 2003-05-13 | 2008-02-05 | Qst Holdings, Llc | Method and system for creating and programming an adaptive computing engine |
US6931612B1 (en) * | 2002-05-15 | 2005-08-16 | Lsi Logic Corporation | Design and optimization methods for integrated circuits |
CN1656486A (en) * | 2002-05-23 | 2005-08-17 | 皇家飞利浦电子股份有限公司 | Integrated circuit design method |
US7024654B2 (en) * | 2002-06-11 | 2006-04-04 | Anadigm, Inc. | System and method for configuring analog elements in a configurable hardware device |
US20030233639A1 (en) * | 2002-06-11 | 2003-12-18 | Tariq Afzal | Programming interface for a reconfigurable processing system |
US7802108B1 (en) | 2002-07-18 | 2010-09-21 | Nvidia Corporation | Secure storage of program code for an embedded system |
US20110238948A1 (en) * | 2002-08-07 | 2011-09-29 | Martin Vorbach | Method and device for coupling a data processing unit and a data processing array |
US7657861B2 (en) | 2002-08-07 | 2010-02-02 | Pact Xpp Technologies Ag | Method and device for processing data |
WO2004021176A2 (en) | 2002-08-07 | 2004-03-11 | Pact Xpp Technologies Ag | Method and device for processing data |
US8108656B2 (en) | 2002-08-29 | 2012-01-31 | Qst Holdings, Llc | Task definition for specifying resource requirements |
US7394284B2 (en) * | 2002-09-06 | 2008-07-01 | Pact Xpp Technologies Ag | Reconfigurable sequencer structure |
US7502915B2 (en) * | 2002-09-30 | 2009-03-10 | Nvidia Corporation | System and method using embedded microprocessor as a node in an adaptable computing machine |
US7222218B2 (en) * | 2002-10-22 | 2007-05-22 | Sun Microsystems, Inc. | System and method for goal-based scheduling of blocks of code for concurrent execution |
US7603664B2 (en) | 2002-10-22 | 2009-10-13 | Sun Microsystems, Inc. | System and method for marking software code |
US7346902B2 (en) * | 2002-10-22 | 2008-03-18 | Sun Microsystems, Inc. | System and method for block-based concurrentization of software code |
US7937591B1 (en) | 2002-10-25 | 2011-05-03 | Qst Holdings, Llc | Method and system for providing a device which can be adapted on an ongoing basis |
US7225324B2 (en) | 2002-10-31 | 2007-05-29 | Src Computers, Inc. | Multi-adaptive processing systems and techniques for enhancing parallelism and performance of computational functions |
US6983456B2 (en) * | 2002-10-31 | 2006-01-03 | Src Computers, Inc. | Process for converting programs in high-level programming languages to a unified executable for hybrid computing platforms |
US8949576B2 (en) * | 2002-11-01 | 2015-02-03 | Nvidia Corporation | Arithmetic node including general digital signal processing functions for an adaptive computing machine |
US8276135B2 (en) | 2002-11-07 | 2012-09-25 | Qst Holdings Llc | Profiling of software and circuit designs utilizing data operation analyses |
US7225301B2 (en) | 2002-11-22 | 2007-05-29 | Quicksilver Technologies | External memory controller node |
SE0300742D0 (en) * | 2003-03-17 | 2003-03-17 | Flow Computing Ab | Data Flow Machine |
US7373640B1 (en) * | 2003-07-31 | 2008-05-13 | Network Appliance, Inc. | Technique for dynamically restricting thread concurrency without rewriting thread code |
US20050039189A1 (en) * | 2003-08-14 | 2005-02-17 | Todd Anderson | Methods and apparatus to preemptively compile an application |
US8296764B2 (en) | 2003-08-14 | 2012-10-23 | Nvidia Corporation | Internal synchronization control for adaptive integrated circuitry |
US7174432B2 (en) | 2003-08-19 | 2007-02-06 | Nvidia Corporation | Asynchronous, independent and multiple process shared memory system in an adaptive computing architecture |
DE10349349A1 (en) * | 2003-10-23 | 2005-05-25 | Kuka Roboter Gmbh | Method for determining and providing runtime information for robot control programs |
US7685587B2 (en) * | 2003-11-19 | 2010-03-23 | Ecole Polytechnique Federal De Lausanne | Automated instruction-set extension |
US7689958B1 (en) | 2003-11-24 | 2010-03-30 | Sun Microsystems, Inc. | Partitioning for a massively parallel simulation system |
US8549170B2 (en) | 2003-12-19 | 2013-10-01 | Nvidia Corporation | Retransmission system and method for a transport offload engine |
US7899913B2 (en) | 2003-12-19 | 2011-03-01 | Nvidia Corporation | Connection management system and method for a transport offload engine |
US7624198B1 (en) | 2003-12-19 | 2009-11-24 | Nvidia Corporation | Sequence tagging system and method for transport offload engine data lists |
US8065439B1 (en) | 2003-12-19 | 2011-11-22 | Nvidia Corporation | System and method for using metadata in the context of a transport offload engine |
US8176545B1 (en) | 2003-12-19 | 2012-05-08 | Nvidia Corporation | Integrated policy checking system and method |
US7260631B1 (en) | 2003-12-19 | 2007-08-21 | Nvidia Corporation | System and method for receiving iSCSI protocol data units |
KR100552675B1 (en) * | 2003-12-26 | 2006-02-20 | 한국전자통신연구원 | Apparatus for Selecting Extended Instruction and Method Thereof |
US7206872B2 (en) | 2004-02-20 | 2007-04-17 | Nvidia Corporation | System and method for insertion of markers into a data stream |
US7249306B2 (en) | 2004-02-20 | 2007-07-24 | Nvidia Corporation | System and method for generating 128-bit cyclic redundancy check values with 32-bit granularity |
US7343378B2 (en) * | 2004-03-29 | 2008-03-11 | Microsoft Corporation | Generation of meaningful names in flattened hierarchical structures |
US7698413B1 (en) | 2004-04-12 | 2010-04-13 | Nvidia Corporation | Method and apparatus for accessing and maintaining socket control information for high speed network connections |
US7765539B1 (en) | 2004-05-19 | 2010-07-27 | Nintendo Co., Ltd. | System and method for trans-compiling video games |
US7353488B1 (en) * | 2004-05-27 | 2008-04-01 | Magma Design Automation, Inc. | Flow definition language for designing integrated circuit implementation flows |
US7487497B2 (en) * | 2004-08-26 | 2009-02-03 | International Business Machines Corporation | Method and system for auto parallelization of zero-trip loops through induction variable substitution |
US8984496B2 (en) * | 2004-09-20 | 2015-03-17 | The Mathworks, Inc. | Extensible internal representation of systems with parallel and sequential implementations |
US7957379B2 (en) | 2004-10-19 | 2011-06-07 | Nvidia Corporation | System and method for processing RX packets in high speed network applications using an RX FIFO buffer |
US7318143B2 (en) * | 2004-10-20 | 2008-01-08 | Arm Limited | Reuseable configuration data |
US7350055B2 (en) * | 2004-10-20 | 2008-03-25 | Arm Limited | Tightly coupled accelerator |
US7343482B2 (en) * | 2004-10-20 | 2008-03-11 | Arm Limited | Program subgraph identification |
KR20070097051A (en) * | 2004-11-30 | 2007-10-02 | 동경 엘렉트론 주식회사 | Dynamically reconfigurable processor |
US7426708B2 (en) | 2005-01-31 | 2008-09-16 | Nanotech Corporation | ASICs having programmable bypass of design faults |
EP1849095B1 (en) * | 2005-02-07 | 2013-01-02 | Richter, Thomas | Low latency massive parallel data processing device |
US20060225049A1 (en) * | 2005-03-17 | 2006-10-05 | Zhiyuan Lv | Trace based signal scheduling and compensation code generation |
TWI306215B (en) | 2005-04-29 | 2009-02-11 | Ind Tech Res Inst | Method and corresponding apparatus for compiling high-level languages into specific processor architectures |
US7401314B1 (en) * | 2005-06-09 | 2008-07-15 | Altera Corporation | Method and apparatus for performing compound duplication of components on field programmable gate arrays |
KR100731976B1 (en) * | 2005-06-30 | 2007-06-25 | 전자부품연구원 | Efficient reconfiguring method of a reconfigurable processor |
US9774699B2 (en) * | 2005-09-20 | 2017-09-26 | The Mathworks, Inc. | System and method for transforming graphical models |
GB0519981D0 (en) * | 2005-09-30 | 2005-11-09 | Ignios Ltd | Scheduling in a multicore architecture |
US7890686B2 (en) * | 2005-10-17 | 2011-02-15 | Src Computers, Inc. | Dynamic priority conflict resolution in a multi-processor computer system having shared resources |
US7281942B2 (en) * | 2005-11-18 | 2007-10-16 | Ideal Industries, Inc. | Releasable wire connector |
US7716100B2 (en) * | 2005-12-02 | 2010-05-11 | Kuberre Systems, Inc. | Methods and systems for computing platform |
JP2007156926A (en) * | 2005-12-06 | 2007-06-21 | Matsushita Electric Ind Co Ltd | Interruption control unit |
WO2007082730A1 (en) | 2006-01-18 | 2007-07-26 | Pact Xpp Technologies Ag | Hardware definition method |
JP4528728B2 (en) | 2006-01-31 | 2010-08-18 | 株式会社東芝 | Digital circuit automatic design apparatus, automatic design method, and automatic design program |
US20070220235A1 (en) * | 2006-03-15 | 2007-09-20 | Arm Limited | Instruction subgraph identification for a configurable accelerator |
JP2007286671A (en) * | 2006-04-12 | 2007-11-01 | Fujitsu Ltd | Software/hardware division program and division method |
WO2007137266A2 (en) | 2006-05-22 | 2007-11-29 | Coherent Logix Incorporated | Designing an asic based on execution of a software program on a processing system |
US7693257B2 (en) * | 2006-06-29 | 2010-04-06 | Accuray Incorporated | Treatment delivery optimization |
US9058203B2 (en) * | 2006-11-20 | 2015-06-16 | Freescale Semiconductor, Inc. | System, apparatus and method for translating data |
KR100893527B1 (en) * | 2007-02-02 | 2009-04-17 | 삼성전자주식회사 | Method of mapping and scheduling of reconfigurable multi-processor system |
US8291400B1 (en) * | 2007-02-07 | 2012-10-16 | Tilera Corporation | Communication scheduling for parallel processing architectures |
EP1975791A3 (en) * | 2007-03-26 | 2009-01-07 | Interuniversitair Microelektronica Centrum (IMEC) | A method for automated code conversion |
US7987065B1 (en) | 2007-04-17 | 2011-07-26 | Nvidia Corporation | Automatic quality testing of multimedia rendering by software drivers |
US7996798B2 (en) * | 2007-05-24 | 2011-08-09 | Microsoft Corporation | Representing binary code as a circuit |
KR100940362B1 (en) | 2007-09-28 | 2010-02-04 | 고려대학교 산학협력단 | Method for mode set optimization in instruction processor using mode sets |
JP5175524B2 (en) * | 2007-11-13 | 2013-04-03 | 株式会社日立製作所 | compiler |
JP5109764B2 (en) * | 2008-03-31 | 2012-12-26 | 日本電気株式会社 | Description processing apparatus, description processing method, and program |
JP5576605B2 (en) * | 2008-12-25 | 2014-08-20 | パナソニック株式会社 | Program conversion apparatus and program conversion method |
KR101511273B1 (en) * | 2008-12-29 | 2015-04-10 | 삼성전자주식회사 | System and method for 3d graphic rendering based on multi-core processor |
US8667474B2 (en) * | 2009-06-19 | 2014-03-04 | Microsoft Corporation | Generation of parallel code representations |
KR101814221B1 (en) | 2010-01-21 | 2018-01-02 | 스비랄 인크 | A method and apparatus for a general-purpose, multiple-core system for implementing stream-based computations |
US9176845B2 (en) * | 2010-03-19 | 2015-11-03 | Red Hat, Inc. | Use of compiler-introduced identifiers to improve debug information pertaining to user variables |
US8601013B2 (en) | 2010-06-10 | 2013-12-03 | Micron Technology, Inc. | Analyzing data using a hierarchical structure |
US8661424B2 (en) * | 2010-09-02 | 2014-02-25 | Honeywell International Inc. | Auto-generation of concurrent code for multi-core applications |
CN106227693B (en) | 2010-10-15 | 2019-06-04 | 相干逻辑公司 | Communication disabling in multicomputer system |
WO2012062595A1 (en) * | 2010-11-11 | 2012-05-18 | Siemens Aktiengesellschaft | Method and apparatus for assessing software parallelization |
JP5848778B2 (en) | 2011-01-25 | 2016-01-27 | マイクロン テクノロジー, インク. | Use of dedicated elements to implement FSM |
EP2668575B1 (en) | 2011-01-25 | 2021-10-20 | Micron Technology, INC. | Method and apparatus for compiling regular expressions |
US8788991B2 (en) | 2011-01-25 | 2014-07-22 | Micron Technology, Inc. | State grouping for element utilization |
WO2012103148A2 (en) | 2011-01-25 | 2012-08-02 | Micron Technology, Inc. | Unrolling quantifications to control in-degree and/or out degree of automaton |
US8533698B2 (en) * | 2011-06-13 | 2013-09-10 | Microsoft Corporation | Optimizing execution of kernels |
US8997065B2 (en) * | 2011-12-06 | 2015-03-31 | The Mathworks, Inc. | Automatic modularization of source code |
US8959469B2 (en) | 2012-02-09 | 2015-02-17 | Altera Corporation | Configuring a programmable device using high-level language |
JP2013242700A (en) * | 2012-05-21 | 2013-12-05 | Internatl Business Mach Corp <Ibm> | Method, program, and system for code optimization |
US8819618B2 (en) | 2012-09-26 | 2014-08-26 | The Mathworks, Inc. | Behavior invariant optimization of maximum execution times for model simulation |
JP6849371B2 (en) * | 2015-10-08 | 2021-03-24 | Samsung Electronics Co., Ltd. | Side emission laser light source and 3D image acquisition device including it |
CA3012781C (en) * | 2016-01-26 | 2022-08-30 | Icat Llc | Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler |
US11477302B2 (en) | 2016-07-06 | 2022-10-18 | Palo Alto Research Center Incorporated | Computer-implemented system and method for distributed activity detection |
US10891132B2 (en) | 2019-05-23 | 2021-01-12 | Xilinx, Inc. | Flow convergence during hardware-software design for heterogeneous and programmable devices |
US20220321403A1 (en) * | 2021-04-02 | 2022-10-06 | Nokia Solutions And Networks Oy | Programmable network segmentation for multi-tenant fpgas in cloud infrastructures |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5870308A (en) * | 1990-04-06 | 1999-02-09 | Lsi Logic Corporation | Method and system for creating and validating low-level description of electronic design |
US5625797A (en) * | 1990-08-10 | 1997-04-29 | Vlsi Technology, Inc. | Automatic optimization of a compiled memory structure based on user selected criteria |
US5513124A (en) * | 1991-10-30 | 1996-04-30 | Xilinx, Inc. | Logic placement using positionally asymmetrical partitioning method |
US5485455A (en) * | 1994-01-28 | 1996-01-16 | Cabletron Systems, Inc. | Network having secure fast packet switching and guaranteed quality of service |
US5511067A (en) * | 1994-06-17 | 1996-04-23 | Qualcomm Incorporated | Layered channel element in a base station modem for a CDMA cellular communication system |
US5603063A (en) * | 1994-06-27 | 1997-02-11 | Quantum Corporation | Disk drive command queuing method using two memory devices for storing two types of commands separately first before queuing commands in the second memory device |
US5548587A (en) * | 1994-09-12 | 1996-08-20 | Efficient Networks, Inc. | Asynchronous transfer mode adapter for desktop applications |
US5752035A (en) * | 1995-04-05 | 1998-05-12 | Xilinx, Inc. | Method for compiling and executing programs for reprogrammable instruction set accelerator |
US5794062A (en) * | 1995-04-17 | 1998-08-11 | Ricoh Company Ltd. | System and method for dynamically reconfigurable computing using a processing unit having changeable internal hardware organization |
US5729705A (en) * | 1995-07-24 | 1998-03-17 | Symbios Logic Inc. | Method and apparatus for enhancing throughput of disk array data transfers in a controller |
US5794044A (en) * | 1995-12-08 | 1998-08-11 | Sun Microsystems, Inc. | System and method for runtime optimization of private variable function calls in a secure interpreter |
GB2317245A (en) * | 1996-09-12 | 1998-03-18 | Sharp Kk | Re-timing compiler integrated circuit design |
US5864535A (en) * | 1996-09-18 | 1999-01-26 | International Business Machines Corporation | Network server having dynamic load balancing of messages in both inbound and outbound directions |
US5898860A (en) * | 1996-10-01 | 1999-04-27 | Leibold; William Steven | System and method for automatically generating a control drawing for a real-time process control system |
US6078736A (en) * | 1997-08-28 | 2000-06-20 | Xilinx, Inc. | Method of designing FPGAs for dynamically reconfigurable computing |
US6212650B1 (en) * | 1997-11-24 | 2001-04-03 | Xilinx, Inc. | Interactive dubug tool for programmable circuits |
US6075935A (en) * | 1997-12-01 | 2000-06-13 | Improv Systems, Inc. | Method of generating application specific integrated circuits using a programmable hardware architecture |
US6510546B1 (en) * | 2000-07-13 | 2003-01-21 | Xilinx, Inc. | Method and apparatus for pre-routing dynamic run-time reconfigurable logic cores |
1997
- 1997-06-27: US application US08/884,377, published as US5966534A (not active: Expired - Lifetime)
1998
- 1998-06-29: CA application CA002290649A, published as CA2290649A1 (not active: Abandoned)
- 1998-06-29: JP application JP50587699A, published as JP2002508102A (not active: Ceased)
- 1998-06-29: WO application PCT/US1998/013563, published as WO1999000731A1 (active: Application Discontinuation)
- 1998-06-29: US application US09/446,758, published as US6708325B2 (not active: Expired - Fee Related)
- 1998-06-29: AU application AU82754/98A, published as AU8275498A (not active: Abandoned)
- 1998-06-29: KR application KR1019997012384A, published as KR100614491B1 (not active: IP Right Cessation)
- 1998-06-29: EP application EP98932983A, published as EP0991997A4 (not active: Withdrawn)
Also Published As
Publication number | Publication date |
---|---|
WO1999000731A1 (en) | 1999-01-07 |
JP2002508102A (en) | 2002-03-12 |
US6708325B2 (en) | 2004-03-16 |
KR20010020544A (en) | 2001-03-15 |
US20030014743A1 (en) | 2003-01-16 |
US5966534A (en) | 1999-10-12 |
EP0991997A4 (en) | 2004-12-29 |
AU8275498A (en) | 1999-01-19 |
KR100614491B1 (en) | 2006-08-22 |
EP0991997A1 (en) | 2000-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6708325B2 (en) | Method for compiling high level programming languages into embedded microprocessor with multiple reconfigurable logic | |
Grandpierre et al. | From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations | |
Gupta et al. | SPARK: A high-level synthesis framework for applying parallelizing compiler transformations | |
Gupta et al. | Coordinated parallelizing compiler optimizations and high-level synthesis | |
US5491823A (en) | Loop scheduler | |
Leupers | Instruction scheduling for clustered VLIW DSPs | |
EP0843257B1 (en) | Improved code optimiser for pipelined computers | |
Venkataramani et al. | C to asynchronous dataflow circuits: An end-to-end toolflow | |
Wolfe | Multiprocessor synchronization for concurrent loops | |
Eles et al. | VHDL system-level specification and partitioning in a hardware/software co-synthesis environment | |
Yang et al. | Detecting program components with equivalent behaviors | |
Rim et al. | Global scheduling with code-motions for high-level synthesis applications | |
Reshadi et al. | Utilizing horizontal and vertical parallelism with a no-instruction-set compiler for custom datapaths | |
Puschner | Transforming execution-time boundable code into temporally predictable code | |
JP3311381B2 (en) | Instruction scheduling method in compiler | |
Wall | Experience with a software-defined machine architecture | |
Haldar et al. | Automated synthesis of pipelined designs on FPGAs for signal and image processing applications described in MATLAB | |
Scheichenzuber et al. | Global hardware synthesis from behavioral dataflow descriptions | |
Binh et al. | A hardware/software partitioning algorithm for pipelined instruction set processor | |
Bergamaschi et al. | Scheduling under resource constraints and module assignment | |
Ping Seng et al. | Flexible instruction processors | |
Dos Santos et al. | A code-motion pruning technique for global scheduling | |
Radivojevic et al. | Symbolic techniques for optimal scheduling | |
Kirchhoff et al. | Increasing efficiency in data flow oriented model driven software development for softcore processors | |
Ferrandi et al. | Automatic parallelization of sequential specifications for symmetric mpsocs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request | ||
FZDE | Discontinued |