WO2003091875A1 - Method, system and language structure for programming reconfigurable hardware - Google Patents

Method, system and language structure for programming reconfigurable hardware

Info

Publication number
WO2003091875A1
Authority
WO
WIPO (PCT)
Prior art keywords
program construct
program
variable
construct
loop
Prior art date
Application number
PCT/US2003/010946
Other languages
French (fr)
Inventor
W.H. Carl Ebeling
Eugene B. Hogenauer
Original Assignee
Quicksilver Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Quicksilver Technology, Inc.
Priority to AU2003221714A1
Publication of WO2003091875A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/30 - Creation or generation of source code
    • G06F 8/31 - Programming languages or programming paradigms

Definitions

  • the present invention relates, in general, to software and code languages used in programming hardware circuits, and more specifically, to a method, system, and language command or statement structure for defining adaptive computational units in reconfigurable integrated circuitry.
  • the first related application discloses a new form or type of integrated circuitry which effectively and efficiently combines and maximizes the various advantages of processors, application specific integrated circuits ("ASICs"), and field programmable gate arrays ("FPGAs”), while minimizing potential disadvantages.
  • the first related application illustrates a new form or type of integrated circuit (“IC"), referred to as an adaptive computing engine (“ACE”), which provides the programming flexibility of a processor, the post-fabrication flexibility of FPGAs, and the high speed and high utilization factors of an ASIC.
  • ACE integrated circuitry is readily reconfigurable, is capable of having corresponding, multiple modes of operation, and further minimizes power consumption while increasing performance, with particular suitability for low power applications, such as for use in hand-held and other battery-powered devices.
  • Configuration information (or, equivalently, adaptation information) is required to generate, in advance or in real-time (or potentially at a slower rate), the adaptations (configurations and reconfigurations) which provide and create one or more operating modes for the ACE circuit, such as wireless communication, radio reception, personal digital assistance ("PDA”), MP3 music playing, or any other desired functions.
  • the second related application discloses a preferred system embodiment that includes an ACE integrated circuit coupled with one or more sets of configuration information. This configuration (adaptation) information is required to generate, in advance or in real-time (or potentially at a slower rate), the configurations and reconfigurations which provide and create one or more operating modes for the ACE circuit, such as wireless communication, radio reception, personal digital assistance ("PDA"), MP3 or MP4 music playing, or any other desired functions.
  • Various methods, apparatuses and systems are also illustrated in the second related application for generating and providing configuration information for an ACE integrated circuit, for determining ACE reconfiguration capacity or capability, for providing secure and authorized configurations, and for providing appropriate monitoring of configuration and content usage.
  • the adaptive computing engine (“ACE") circuit of the present invention for adaptive or reconfigurable computing, includes a plurality of differing, heterogeneous computational elements coupled to an interconnection network (rather than the same, homogeneous repeating and arrayed units of FPGAs).
  • the plurality of heterogeneous computational elements include corresponding computational elements having fixed and differing architectures, such as fixed architectures for different functions such as memory, addition, multiplication, complex multiplication, subtraction, synchronization, queuing, sampling, configuration, reconfiguration, control, input, output, routing, and field programmability.
  • the interconnection network is operative, in advance, in real time, or potentially at a slower rate, to configure and reconfigure the plurality of heterogeneous computational elements for a plurality of different functional modes, including linear algorithmic operations, non-linear algorithmic operations, finite state machine operations, memory operations, and bit-level manipulations.
  • this configuration and reconfiguration of heterogeneous computational elements, forming various computational units and adaptive matrices, generates the selected, higher-level operating mode of the ACE integrated circuit, for the performance of a wide variety of tasks.
  • This adaptability or reconfigurability (with adaptation and configuration used interchangeably and equivalently herein) of the ACE circuitry is based upon, among other things, determining the optimal type, number, and sequence of computational elements required to perform a given task.
  • adaptation or configuration refers to changing or modifying ACE functionality, from one functional mode to another, in general, for performing a task within a specific operating mode, or for changing operating modes.
  • the algorithm of the task is preferably expressed through "data flow graphs" ("DFGs"), which schematically depict inputs, outputs and the computational elements needed for a given operation.
  • Software engineers frequently use data flow graphs to guide the programming of the algorithms, particularly for digital signal processing (“DSP”) applications.
  • DFGs typically have one of two forms, either of which is applicable to the present invention: (1) representing the flow of data through a system, where data streams from one module (e.g., a filter) to another module; and (2) representing a computation as a combinational flow of data through a set of operators from inputs to outputs.
  • Assembly languages, at the other extreme, tightly control data flow through hardware elements such as the logic gates, registers and random access memory (RAM) of a specific processor, and efficiently direct resource usage.
  • assembly languages are extremely verbose and detailed, requiring the programmer to specify exactly when and where every operation is to be performed. Consequently, programming in an assembly language is extraordinarily labor-intensive, expensive, and difficult to learn.
  • As languages designed specifically for programming a processor (i.e., a fixed processor architecture), assembly languages have limited, if any, applicability to or utility for adaptive computing applications.
  • Hardware description languages ("HDLs") provide another means of specifying hardware behavior.
  • These languages may allow explicit parallelism, but require the designer to manage such parallelism in great detail.
  • HDLs require the programmer to specify exactly when and where every operation is to be performed.
  • a need remains for a method and system of providing programmability of adaptive computing architectures.
  • a need also remains for a comparatively high-level language that is syntactically similar to widely used and well known languages like C++, for ready acceptance within the engineering and computing fields, but that also contains specialized constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
  • the present invention is a programming language, system and methodology that facilitate programming of integrated circuits having adaptive and reconfigurable computing architectures.
  • the method, system and programming language of the present invention provide for program constructs, such as commands, declarations, variables, and statements, which have been developed to describe computations for an adaptive computing architecture, rather than provide instructions to a sequential microprocessor or DSP architecture.
  • the invention includes program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states and historical values in a straightforward manner.
  • the preferred method, system, and programming language also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units forming the adaptive computing architecture.
  • the preferred programming language includes dataflow statements, channel objects, stream variables, state variables, unroll statements, iterators, and loop statements.
  • Figure 1 is a block diagram illustrating a preferred apparatus embodiment in accordance with the invention disclosed in the first related application.
  • Figure 2 is a block diagram illustrating a reconfigurable matrix, a plurality of computation units, and a plurality of computational elements of the ACE architecture, in accordance with the invention disclosed in the first related application.
  • Figure 3 is a block diagram depicting the role of Q language in programming instructions for configuring computational units, in accordance with the present invention.
  • Figure 4 is a schematic diagram illustrating an exemplary data flow graph, utilized in accordance with the present invention.
  • Figure 5 is a block diagram illustrating the communication between Q language programming blocks, in accordance with the present invention.
  • Figures 6 A, 6B and 6C are diagrams providing a useful summary of the Q programming language of the present invention.
  • Figure 7 provides a FIR filter, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
  • Figure 8 provides a FIR filter with registered coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
  • Figures 9A and 9B provide a FIR filter for a comparatively large number of coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
  • Such a method and system are provided, in accordance with the present invention, for enabling ready programmability of adaptive computing architectures, such as the ACE architecture.
  • the present invention also provides for a comparatively high-level language, referred to as the Q programming language (or Q language), that is designed to be backward compatible with and syntactically similar to widely used and well known languages like C++, for acceptance within the engineering and computing fields.
  • the method, system, and Q language of the present invention provides new and specialized program constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
  • the Q language methodology of the present invention including commands, declarations, variables, and statements (which are individually and collectively referred to herein as "constructs", “program constructs” or “program structures”) have been developed to describe computations for an adaptive computing architecture, and preferably the ACE architecture. It includes program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states in a straightforward manner.
  • the Q language also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units.
  • Each of these new features of the Q language provide for effective programming in a reconfigurable computing environment, facilitating a compiler to implement the programmed algorithms efficiently in adaptive hardware. While the Q language was developed as part of a design system for the ACE architecture, its feature set is not limited to that application, and has broad applicability for adaptive computing and other potential adaptive or reconfigurable architectures.
  • the program constructs of the language, method and system of the present invention include: (1) "dataflow” statements, which declare that the operations within the dataflow statement may be executed in parallel; (2) "channel” objects, which are objects with a buffer for data items, having an input stream and an output stream, and which connect together computational "blocks"; (3) “stream” variables, used to reference channel buffers, using an index which is automatically incremented whenever it is read or written, providing automatic array indexing; (4) “state” variables, which are register variables which provide convenient access to previous values of the variable; (5) “unroll” statements, which provide a mechanism for a loop-type statement to have a determinate number of iterations when compiled, for execution in the minimum number of cycles allowed by any data dependencies; (6) “iterators”, which are special indexing variables which provide for automatic accessing of arrays in a predetermined address pattern; and (7) “loop” statements, which provide for loop or repeating calculations which execute a fixed number of times.
  • Figure 1 is a block diagram illustrating a preferred apparatus 100 embodiment of the adaptive computing engine (ACE) architecture, in accordance with the invention disclosed in the first related application.
  • the ACE 100 is preferably embodied as an integrated circuit, or as a portion of an integrated circuit having other, additional components.
  • the ACE 100 includes one or more reconfigurable matrices (or nodes) 150, such as matrices 150A through 150N as illustrated, and a matrix interconnection network (MIN) 110.
  • certain matrices 150, such as matrices 150A and 150B, are configured for functionality as a controller 120, while other matrices, such as matrices 150C and 150D, are configured for functionality as a memory 140. While illustrated as separate matrices 150A through 150D, it should be noted that these control and memory functionalities may be, and preferably are, distributed across a plurality of matrices 150 having additional functions to, for example, avoid any processing or memory bottlenecks.
  • the various matrices 150 and matrix interconnection network 110 may also be implemented together as fractal subunits, which may be scaled from a few nodes to thousands of nodes.
  • the ACE 100 does not utilize traditional (and typically separate) data, DMA, random access, configuration and instruction busses for signaling and other transmission between and among the reconfigurable matrices 150, the controller 120, and the memory 140, or for other input/output (“I/O") functionality.
  • rather, the matrix interconnection network 110 may be configured and reconfigured to provide any given connection between and among the reconfigurable matrices 150, including those matrices 150 configured as the controller 120 and the memory 140, as discussed in greater detail below.
  • the MIN 110 also functions as a memory, directly providing the interconnections for particular functions, until and unless it is reconfigured.
  • configuration and reconfiguration may occur in advance of the use of a particular function or operation, and/or may occur in real-time or at a slower rate, namely, in advance of, during or concurrently with the use of the particular function or operation.
  • Such configuration and reconfiguration may be occurring in a distributed fashion without disruption of function or operation, with computational elements in one location being configured while other computational elements (having been previously configured) are concurrently performing their designated function.
  • This configuration flexibility of the ACE 100 contrasts starkly with FPGA reconfiguration, which generally occurs comparatively slowly, not in real time or concurrently with use, and which must be completed in its entirety prior to any operation or other use.
  • the matrices 150 configured to function as memory 140 may be implemented in any desired or preferred way, utilizing computational elements (discussed below) or fixed memory elements, and may be included within the ACE 100 or incorporated within another IC or portion of an IC.
  • the memory 140 is included within the ACE 100, and preferably is comprised of computational elements which are low power consumption random access memory (RAM), but also may be comprised of computational elements of any other form of memory, such as flash, DRAM, SRAM, MRAM, ROM, EPROM or E²PROM.
  • the memory 140 preferably includes direct memory access (DMA) engines, not separately illustrated.
  • the controller 120 is preferably implemented, using matrices 150A and 150B configured as adaptive finite state machines, as a reduced instruction set (“RISC”) processor, controller or other device or IC capable of performing the two types of functionality discussed below. (Alternatively, these functions may be implemented utilizing a conventional RISC or other processor.)
  • This control functionality may also be distributed throughout one or more matrices 150 which perform other, additional functions as well. In addition, this control functionality may be included within and directly embodied as configuration information, without separate hardware controller functionality.
  • the first control functionality, referred to as "kernel" control, is illustrated as the kernel controller ("KARC") of matrix 150A, while the second control functionality, referred to as "matrix" control, is illustrated as the matrix controller ("MARC") of matrix 150B.
  • the kernel and matrix control functions of the controller 120 are explained in greater detail below, with reference to the configurability and reconfigurability of the various matrices 150, and with reference to the preferred form of combined data, configuration and control information referred to herein as a "silverware" module.
  • the matrix interconnection network 110 of Figure 1, and its subset interconnection networks illustrated in Figure 2 (Boolean interconnection network 210, data interconnection network 240, and interconnect 220), collectively and generally referred to herein as "interconnect”, “interconnection(s)” or “interconnection network(s)”, may be implemented generally as known in the art, such as utilizing FPGA interconnection networks or switching fabrics, albeit in a considerably more varied fashion.
  • the various interconnection networks are implemented as described, for example, in U.S. Patent No. 5,218,240, U.S. Patent No. 5,336,950, U.S. Patent No. 5,245,227, and U.S. Patent No. 5,144,166.
  • These various interconnection networks provide selectable (or switchable) connections between and among the controller 120, the memory 140, the various matrices 150, and the computational units 200 and computational elements 250, providing the physical basis for the configuration and reconfiguration referred to herein, in response to and under the control of configuration signaling generally referred to herein as "configuration information".
  • the various interconnection networks (110, 210, 240 and 220) provide selectable or switchable data, input, output, control and configuration paths, between and among the controller 120, the memory 140, the various matrices 150, and the computational units 200 and computational elements 250, in lieu of any form of traditional or separate input/output busses, data busses, DMA, RAM, configuration and instruction busses.
  • any given switching or selecting operation of or within the various interconnection networks (110, 210, 240 and 220) may be implemented as known in the art
  • the design and layout of the various interconnection networks (110, 210, 240 and 220), in accordance with the ACE architecture are new and novel.
  • varying levels of interconnection are provided to correspond to the varying levels of the matrices 150, the computational units 200, and the computational elements 250.
  • the matrix interconnection network 110 is considerably more limited and less "rich", with lesser connection capability in a given area, to reduce capacitance and increase speed of operation.
  • the interconnection network (210, 220 and 240) may be considerably more dense and rich, to provide greater adaptation and reconfiguration capability within a narrow or close locality of reference.
  • the various matrices or nodes 150 are reconfigurable and heterogeneous, namely, in general, and depending upon the desired configuration: reconfigurable matrix 150A is generally different from reconfigurable matrices 150B through 150N; reconfigurable matrix 150B is generally different from reconfigurable matrices 150A and 150C through 150N; reconfigurable matrix 150C is generally different from reconfigurable matrices 150A, 150B and 150D through 150N, and so on.
  • the various reconfigurable matrices 150 each generally contain a different or varied mix of adaptive and reconfigurable computational (or computation) units (200); the computational units 200, in turn, generally contain a different or varied mix of fixed, application specific computational elements (250), which may be adaptively connected, configured and reconfigured in various ways to perform varied functions, through the various interconnection networks.
  • the various matrices 150 may be connected, configured and reconfigured at a higher level, with respect to each of the other matrices 150, through the matrix interconnection network 110, also as discussed in greater detail in the first related application.
  • a first novel concept of the ACE 100 architecture concerns the adaptive and reconfigurable use of application specific, dedicated or fixed hardware units (computational elements 250), and the selection of particular functions for acceleration, to be included within these application specific, dedicated or fixed hardware units (computational elements 250) within the computational units 200 (Figure 2) of the matrices 150, such as pluralities of multipliers, complex multipliers, and adders, each of which is designed for optimal execution of corresponding multiplication, complex multiplication, and addition functions.
  • these differing, heterogeneous computational elements (250) may then be adaptively configured, in advance, in real-time or at a slower rate, to perform the selected algorithm, such as the performance of discrete cosine transformations often utilized in mobile communications.
  • different (“heterogeneous") computational elements (250) are configured and reconfigured, at any given time, through various levels of interconnect, to optimally perform a given algorithm or other function.
  • a given instantiation or configuration of computational elements may also remain in place over time, i.e., unchanged, throughout the course of such repetitive calculations.
  • the temporal nature of the ACE 100 architecture should also be noted.
  • a particular configuration may exist within the ACE 100 which has been optimized to perform a given function or implement a particular algorithm, such as to implement channel acquisition and control processing in a GSM operating mode in a mobile station.
  • the configuration may be changed, to interconnect other computational elements (250) or connect the same computational elements 250 differently, for the performance of another function or algorithm, such as for data and voice reception for a GSM operating mode.
  • Other important features arise from this temporal reconfigurability.
  • as algorithms may change over time to, for example, implement a new technology standard, the ACE 100 may co-evolve and be reconfigured to implement the new algorithm.
  • This temporal reconfigurability of computational elements 250 also illustrates a conceptual distinction utilized herein between configuration and reconfiguration, on the one hand, and programming or reprogrammability, on the other hand.
  • Typical programmability utilizes a pre-existing group or set of functions, which may be called in various orders, over time, to implement a particular algorithm.
  • configurability and reconfigurability, in contrast, include the additional capability of adding or creating new functions which were previously unavailable or non-existent.
  • the present invention also utilizes a tight coupling (or interdigitation) of data and configuration (or other control) information, within one, effectively continuous stream of information.
  • This coupling or commingling of data and configuration information, referred to as "silverware" or as a "silverware" module, is the subject of another related patent application.
  • this coupling of data and configuration information into one information (or bit) stream which may be continuous or divided into packets, helps to enable real-time reconfigurability of the ACE 100, without a need for the (often unused) multiple, overlaying networks of hardware interconnections of the prior art.
  • a particular, first configuration of computational elements at a particular, first period of time as the hardware to execute a corresponding algorithm during or after that first period of time, may be viewed or conceptualized as a hardware analog of "calling" a subroutine in software which may perform the same algorithm.
  • once the configuration of the computational elements has occurred (i.e., is in place), as directed by (a first subset of) the configuration information, the data for use in the algorithm is immediately available as part of the silverware module.
  • the same computational elements may then be reconfigured for a second period of time, as directed by second configuration information (i.e., a second subset of configuration information), for execution of a second, different algorithm, also utilizing immediately available data.
  • the immediacy of the data, for use in the configured computational elements provides a one or two clock cycle hardware analog to the multiple and separate software steps of determining a memory address and fetching stored data from the addressed registers. This has the further result of additional efficiency, as the configured computational elements may execute, in comparatively few clock cycles, an algorithm which may require orders of magnitude more clock cycles for execution if called as a subroutine in a conventional microprocessor or digital signal processor ("DSP").
  • This use of silverware modules as a commingling of data and configuration information, in conjunction with the reconfigurability of a plurality of heterogeneous and fixed computational elements 250 to form adaptive, different and heterogeneous computation units 200 and matrices 150, enables the ACE 100 architecture to have multiple and different modes of operation.
  • the ACE 100 may have various and different operating modes as a cellular or other mobile telephone, a music player, a pager, a personal digital assistant, and other new or existing functionalities. In addition, these operating modes may change based upon the physical location of the device.
  • while the ACE 100 is configured for a first operating mode, using a first set of configuration information, as a CDMA mobile telephone for use in the United States, it may be reconfigured using a second set of configuration information for an operating mode as a GSM mobile telephone for use in Europe.
  • the functions of the controller 120 may be explained with reference to a silverware module, namely, the tight coupling of data and configuration information within a single stream of information, with reference to multiple potential modes of operation, with reference to the reconfigurable matrices 150, and with reference to the reconfigurable computation units 200 and the computational elements 250 illustrated in Figure 2.
  • the ACE 100 may be configured or reconfigured to perform a new or additional function, such as an upgrade to a new technology standard or the addition of an entirely new function, such as the addition of a music function to a mobile communication device.
  • Such a silverware module may be stored in the matrices 150 of memory 140, or may be input from an external (wired or wireless) source through, for example, matrix interconnection network 110.
  • one of the plurality of matrices 150 is configured to decrypt such a module and verify its validity, for security purposes.
  • the controller 120, through the matrix (KARC) 150A, checks and verifies that the configuration or reconfiguration may occur without adversely affecting any pre-existing functionality, such as whether the addition of music functionality would adversely affect pre-existing mobile communications functionality.
  • the system requirements for such configuration or reconfiguration are included within the silverware module, for use by the matrix (KARC) 150A in performing this evaluative function. If the configuration or reconfiguration may occur without such adverse effects, the silverware module is allowed to load into the matrices 150 of memory 140, with the matrix (KARC) 150A setting up the DMA engines within the matrices 150C and 150D of the memory 140 (or other stand-alone DMA engines of a conventional memory). If the configuration or reconfiguration would or may have such adverse effects, the matrix (KARC) 150A does not allow the new module to be incorporated within the ACE 100.
  • the matrix (MARC) 150B manages the scheduling of matrix 150 resources and the timing of any corresponding data, to synchronize any configuration or reconfiguration of the various computational elements 250 and computation units 200 with any corresponding input data and output data.
  • timing information is also included within a silverware module, to allow the matrix (MARC) 150B through the various interconnection networks to direct a reconfiguration of the various matrices 150 in time, and preferably just in time, for the reconfiguration to occur before corresponding data has appeared at any inputs of the various reconfigured computation units 200.
  • the matrix (MARC) 150B may also perform any residual processing which has not been accelerated within any of the various matrices 150.
  • the matrix (MARC) 150B may be viewed as a control unit which "calls" the configurations and reconfigurations of the matrices 150, computation units 200 and computational elements 250, in real-time, in synchronization with any corresponding data to be utilized by these various reconfigurable hardware units, and which performs any residual or other control processing.
  • Other matrices 150 may also include this control functionality, with any given matrix 150 capable of calling and controlling a configuration and reconfiguration of other matrices 150.
  • Figure 2 is a block diagram illustrating, in greater detail, a reconfigurable matrix 150 with a plurality of computation units 200 (illustrated as computation units 200A through 200N), and a plurality of computational elements 250 (illustrated as computational elements 250A through 250Z), and provides additional illustration of the preferred types of computational elements 250.
  • any matrix 150 generally includes a matrix controller 230, a plurality of computation (or computational) units 200, and as logical or conceptual subsets or portions of the matrix interconnect network 110, a data interconnect network 240 and a Boolean interconnect network 210.
  • the interconnect networks become increasingly rich, for greater levels of adaptability and reconfiguration.
  • the Boolean interconnect network 210, also as mentioned above, provides the reconfiguration and data interconnection capability between and among the various computation units 200, and is preferably small (i.e., only a few bits wide), while the data interconnect network 240 provides the reconfiguration and data interconnection capability for data input and output between and among the various computation units 200, and is preferably comparatively large (i.e., many bits wide). It should be noted, however, that while conceptually divided into reconfiguration and data capabilities, any given physical portion of the matrix interconnection network 110, at any given time, may be operating as either the Boolean interconnect network 210, the data interconnect network 240, the lowest level interconnect 220 (between and among the various computational elements 250), or other input, output, configuration, or connection functionality.
  • included within a computation unit 200 are a plurality of computational elements 250, illustrated as computational elements 250A through 250Z (individually and collectively referred to as computational elements 250), and additional interconnect 220.
  • the interconnect 220 provides the reconfigurable interconnection capability and input/output paths between and among the various computational elements 250.
  • each of the various computational elements 250 consists of dedicated, application specific hardware designed to perform a given task or range of tasks, resulting in a plurality of different, fixed computational elements 250.
  • the fixed computational elements 250 may be reconfigurably connected together into adaptive and varied computational units 200, which also may be further reconfigured and interconnected, to execute an algorithm or other function, at any given time, utilizing the interconnect 220, the Boolean network 210, and the matrix interconnection network 110. While illustrated with effectively two levels of interconnect (for configuring computational elements 250 into computational units 200, and in turn, into matrices 150), for ease of explanation, it should be understood that the interconnect, and corresponding configuration, may extend to many additional levels within the ACE 100. For example, utilizing a tree concept, with the fixed computational elements analogous to leaves, a plurality of levels of interconnection and adaptation are available, analogous to twigs, branches, boughs, limbs, trunks, and so on, without limitation.
  • the various computational elements 250 are designed and grouped together, into the various adaptive and reconfigurable computation units 200.
  • in addition to computational elements 250 which are designed to execute a particular algorithm or function, such as multiplication, correlation, clocking, synchronization, queuing, sampling, or addition, other types of computational elements 250 are also utilized in the preferred embodiment.
  • computational elements 250A and 250B implement memory, to provide local memory elements for any given calculation or processing function (compared to the more "remote" memory 140).
  • computational elements 250I, 250J, 250K and 250L are configured to implement finite state machines, to provide local processing capability (compared to the more "remote" matrix (MARC) 150B), especially suitable for complicated control processing.
  • a first category of computation units 200 includes computational elements 250 performing linear operations, such as multiplication, addition, finite impulse response filtering, clocking, synchronization, and so on.
  • a second category of computation units 200 includes computational elements 250 performing non-linear operations, such as discrete cosine transformation, trigonometric calculations, and complex multiplications.
  • a third type of computation unit 200 implements a finite state machine, such as computation unit 200C as illustrated in Figure 2, particularly useful for complicated control sequences, dynamic scheduling, and input/output management, while a fourth type may implement memory and memory management, such as computation unit 200A as illustrated in Fig. 2.
  • a fifth type of computation unit 200 may be included to perform bit-level manipulation, such as for encryption, decryption, channel coding, Viterbi decoding, and packet and protocol processing (such as Internet Protocol processing).
  • another (sixth) type of computation unit 200 may be utilized to extend or continue any of these concepts, such as bit-level manipulation or finite state machine manipulations, to increasingly lower levels within the ACE 100 architecture.
  • a matrix controller 230 may also be included or distributed within any given matrix 150, also to provide greater locality of reference and control of any reconfiguration processes and any corresponding data manipulations. For example, once a reconfiguration of computational elements 250 has occurred within any given computation unit 200, the matrix controller 230 may direct that that particular instantiation (or configuration) remain intact for a certain period of time to, for example, continue repetitive data processing for a given application.
  • the Q language of the present invention provides program constructs in a high-level language that allow detailed description of concurrent computation, without requiring the complexity of a hardware description language.
  • One of the goals of the Q language is to incorporate language features which allow a compiler to make efficient use of the adaptive hardware to create concurrent computations at the operator level and the task level.
  • Figure 3 illustrates the role of the Q language in the context of the ACE architecture, and beginning with the exemplary data flow graph of Figure 4, the new and novel features of the present invention are discussed in detail.
  • Figure 3 is a block diagram depicting the role of Q language in providing for configuration of computational units, in accordance with the present invention.
  • Figure 3 depicts the progress of an algorithm (function or operation) 300, coded in the high-level Q language 305, through a plurality of system design tools 310, such as a scheduler and Q compiler 320, to its final inclusion as part of an adaptive computing IC (ACE) configuration bit file 335, which contains the configuration information for adaptation of an adaptive computing circuit, such as the ACE 100.
  • the system design tools 310, which include a hardware object "creator", a computing operations "scheduler" and an operation "emulator", are the subject of other patent applications. Relevant to the present invention is the scheduler and Q compiler 320 component.
  • Components of an adaptive computing circuit are initially defined as hardware "objects", and in this instance, specifically as adaptive computing objects 325.
  • the scheduler portion of scheduler and Q compiler 320 arranges (or schedules) the programmed operations with or across the adaptive computing objects 325, in a sequence across time and across space, in an iterative manner, producing one or more versions of adaptive computing architectures 330, and eventually selecting an adaptive computing architecture as optimal, in light of various design goals, such as speed of operation and comparatively low power consumption.
  • the Q compiler portion of scheduler and Q compiler 320 converts the scheduled Q program into a bit-level information stream (configuration information) 335.
  • any reference to a "compiler” should be understood to mean this Q compiler portion of scheduler and Q compiler 320, or an equivalent compiler.
  • the resulting adaptive computing integrated circuit may be configured using the configuration information 335 generated for that adaptive computing architecture.
  • one of the novel features of the Q language is that it can specify parallel execution of particular functions or operations, rather than being limited to sequential execution.
  • the scheduler selects computational elements and matches the desired parallel functions to available computational elements, or creates the availability of computational elements, for the function to be executed at a scheduled time, in parallel, across these elements.
  • Figure 4 is a schematic diagram illustrating an exemplary data flow graph, utilized in accordance with the present invention.
  • Algorithms or other functions selected for acceleration are converted into data flow graphs (DFGs), which describe the flow of inputs through computational elements to produce outputs.
  • the data flow graph of Figure 4 shows various inputs passing through multipliers and then iterating through adders to produce outputs. Equipped with data flow graphs, the high-level Q code may be refined to improve the computing performance of the algorithm.
  • the data flow graph describes a comparatively fine-grained computation, i.e., a computation composed of relatively simple, primitive operators like add and multiply.
  • data flow graphs may also be used at a higher level of abstraction to describe more coarse-grained computations, such as those composed of complex operators like filters. These operators typically correspond to tasks that may comprise many instances of the more fine-grained data flow graphs.
  • a digital signal processing (“DSP”) system involves a plurality of operations that can be depicted by data flow graphs.
  • Q supports the construction of DSP systems by utilizing computational "blocks” consisting of a plurality of programmed DFGs that communicate with each other via data "streams". Data are passed from one block to another by connecting the output streams of blocks to the input streams of other blocks.
  • a DSP system operates efficiently by running the individual blocks when input data are available, which then produces output data used by other blocks. Blocks may be executed concurrently, as determined by a Q scheduler. (It should be noted that this Q scheduler is different from the system tool scheduler (of 320) discussed above, which schedules the compiled Q code to available computational elements, in space and time.)
  • a block implements a computation that consumes some number of inputs and processes them to produce some number of outputs.
  • a block in the Q language is an object, that is, an instance of a class. It can be loaded into a matrix; it has persistent data, such as stream variables, coefficients, and state, and methods such as init() and run().
  • invoking the init() method initializes connections and performs any other system-specific initialization, while the run() method, which has no parameters, executes the block.
  • a finite impulse response filter (“FIR”), commonly used in digital signal processing, could be implemented as a Q block.
  • the filter coefficients, the input and output streams and a variable used for the input state are part of the filter state.
  • the run() method processes some number of inputs from an input stream, computes, and writes the outputs to an output stream.
  • the run() method could be called many times for successive streams of input data, with the state of the execution saved between invocations. Treating a matrix computation as an object allows it to be run in short bursts instead of all at once. Because its state is persistent, execution of a computation object can be stopped and continued at a later time. This is vital for real-time DSP applications where data become available incrementally.
  • the filter can be initialized, and run on input data as it becomes available without any overhead to reinitialize or load data into the matrix. This also allows many matrix computations to concurrently share the hardware because each maintains its own data.
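  • As an illustration of this block-as-object model, a minimal sketch of such a FIR block follows; the class name, member names, array sizes and exact declaration syntax are assumptions for illustration only, since the present description specifies only the block, channel, stream, state, init() and run() constructs themselves.
        // Illustrative sketch only: skeleton of a FIR filter block as a Q object.
        class FirBlock {
            fract16         coef[8];     // filter coefficients (persistent data)
            stream<fract16> in, out;     // input and output streams
            state<fract16>  sample(8);   // history of recent input samples

            void init(channel<fract16> &cin, channel<fract16> &cout) {
                in.init(cin);            // bind the input stream to its channel
                out.init(cout);          // bind the output stream to its channel
            }

            void run() {
                // consume some inputs, compute, and produce outputs; because the
                // state is persistent, run() may be called repeatedly in short
                // bursts (the filter computation itself is sketched later, in the
                // stream/state/unroll example).
            }
        };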
  • Q contains constructs that allow the programmer to expose the parallelism of the computation to the compiler in a block, and to compose a digital signal processing system as a collection of blocks, supporting both types of data flow mentioned above.
  • the overall goal of the Q language is to support systems that are implemented partly in hardware using either the adaptive computing architecture or parameterized hardwired components, and which may also be implemented partly in software on a conventional processor.
  • Q primarily supports the construction of DSP systems via the composition of computational blocks that communicate via data streams. These blocks are compiled to run either on the host processor or in the adaptive computing architecture. This flexibility of implementation supports code reuse and flexible system implementation as well as rapid system prototyping using a software only solution.
  • the compiler attempts to produce an efficient parallel version that minimizes memory accesses. How well the compiler can do this generally depends on how the block is written: as mentioned above, Q contains constructs that allow the programmer to expose the parallelism of the computation to the compiler.
  • the blocks of the present invention follow a reactive dataflow model, removing data from input streams and processing it to produce data on output streams.
  • Data is passed from one block to another by connecting the output streams of blocks to the input streams of other blocks.
  • the entire system operates by running the individual blocks when their input data are available, which then produces output data used by other blocks.
  • the scheduling of blocks can either be done statically at compile time in the case of well-behaved data flow systems such as synchronous data flow, or dynamically in the more general case.
  • the scheduler can be supplied either by the system software, which uses information supplied by the blocks about its I/O characteristics, or it can be left to the user program. In order for a system to be scheduled automatically, the blocks should publish their I/O characteristics.
  • a stream carrying data between two blocks is implemented as a channel, which contains a buffer to store data items in transit between the blocks as well as information about the size of the buffer and the number of items in it.
  • Blocks producing data use an output stream to send data through a channel to the input stream of another block.
  • the data is stored in the channel where it becomes available to the input stream.
  • when a block reads data from an input stream, it is removed from the channel.
  • the channel buffer is typically implemented using shared buffers so that no data copying is necessary: the writing block writes data directly into the buffer and the reading block reads it directly from the buffer.
  • Streams are declared to carry a specific data type which may be a built-in type or user-defined such as a class object or an array. Reads and writes are done on items of the data type and the channel buffer is sized in terms of how many data items it contains.
  • a stream data item may be as simple as a number or as complex as an array of data. Reading an input stream normally consumes a data item and writing an output stream produces a data item to the stream. However, for complex data items where the item may be processed incrementally, an open can be done to get a handle to the next item of the stream without consuming or producing it. After the item has been processed, a close is used to complete the read or write. More complex operations may also be supported, such as reading ahead or behind the current location in the stream. However, such operations make assumptions about the streams that are difficult for a scheduler to check.
  • a block In order for the scheduler to be able to construct a schedule, a block should publish its I/O characteristics and its computation timing. This information can be used by a scheduler at compile time to construct a static schedule, or at run time for dynamic scheduling. Such information can be used as preconditions that must be met before a block is executed. For example, the precondition might be that there are eight data items available on the input stream and space for eight data items on the output stream.
  • Streams may be declared to be non-blocking (the default) or blocking.
  • Non-blocking is the default for dataflow systems where scheduling is done to ensure that no blocking can occur. In this case reading an empty stream or writing a full stream is an error. Blocking only makes sense where blocks can run in parallel or where block execution can be suspended to allow other blocks to supply the needed data. Blocking is implemented in hardware for hardware blocks. Note that streaming I/O can be used to implement double-buffering, either blocking or non-blocking. In this case, the channel buffer contains space for two items (which can be arrays) where the output stream can be writing one array while the input stream reads the other.
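  • A minimal sketch of the double-buffering arrangement mentioned above, assuming (for illustration only) that a channel may be declared over an array item type and that a two-item buffer is used:
        // Illustrative sketch only: a channel whose items are whole arrays and
        // whose buffer holds exactly two of them, so the output stream can be
        // writing one array while the input stream reads the other.
        typedef fract16 frame[64];   // the array length 64 is an assumption
        channel<frame> dbl(2);       // space for two frames in transit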
  • the stream buffer sizes depend on the relative rates at which blocks produce and consume data. Normally dataflow blocks are written in terms of the computation corresponding to one time step, sample or frame. For example, a filter would consume the next input sample, producing the corresponding output sample. Implementing a system at such a fine-grained level might be very inefficient, however. The programmer may decide for efficiency reasons that every invocation of a block will compute many data samples; however, larger buffers are needed to store the increased amount of I/O data.
  • An application will generally comprise both signal processing components constructed as data flow graphs as described above, and control-oriented supervisor code that interacts with other applications and the operating system, and controls the overall processing required by the application. This control-oriented part of the application would be written in the usual procedural style, as known in the art. This supervisor code may execute the nodes of a dataflow graph directly, particularly when the computation produces information that changes how the computation is performed.
  • Q computation objects describe computations that use the adaptive computing architecture to apply operations to input data to produce output data.
  • the set of operations is depicted in data flow graphs and is accomplished in programming code by a plurality of assignment statements. Although some operations may be executed in parallel, the execution semantics are defined by the sequential ordering of assignments as they appear in a program.
  • a compiler may perform analysis to find parallelism, or may not detect opportunities for parallelism that may be obvious to an experienced programmer.
  • the Q "dataflow" statement informs the compiler that the code within braces following the dataflow statement describes a computation corresponding to a static, acyclic data flow graph that can be executed in parallel.
  • conditional branching is performed using the known method of predicated execution (which moves branches into a data flow graph).
  • the scheduler may schedule the data flow graphs of adjacent iterations so that they overlap and thus achieve even greater parallelism.
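  • A minimal sketch of a dataflow statement, with the variable names assumed for illustration:
        // Illustrative sketch only: the assignments within the braces describe a
        // static, acyclic data flow graph and may be executed in parallel.
        dataflow {
            p0 = a * c0;
            p1 = b * c1;
            y  = p0 + p1;   // depends on p0 and p1, so it is scheduled after them
        }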
  • Q blocks are connected together using Q "channels", each channel being an object with a buffer in memory for data, an input stream and an output stream.
  • Channels are conceptually related to "named pipes” in the Unix operating system environment, but unlike named pipes, when channel data are accessed they need not be copied from the buffer to another location.
  • a channel is allocated to a first block for use as its output stream; the channel is then defined as the input stream of a second block, connecting the two blocks.
  • a channel is declared with the type of data communicated through the channel and the size of the buffer.
  • the following code fragment illustrates how two blocks are connected using a channel:
        // Channel with buffer for 16 items of datatype fract16
        channel<fract16> chan(16);
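  • The fragment above declares only the channel itself; the connection, assuming (for illustration) that each block's init() method accepts the channel to bind to one of its streams, might then read:
        // Illustrative sketch only: the producing block's output stream and the
        // consuming block's input stream are both bound to the same channel.
        producer.init(chan);   // producer writes its output stream into chan
        consumer.init(chan);   // consumer reads its input stream from chan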
  • the channel also has a method that allows supervisor code to find out the size of the buffer and how full it is.
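  • The names of these methods are not given here; a sketch using hypothetical names size() and count() illustrates how supervisor code might use them:
        // Illustrative sketch only; size() and count() are assumed names.
        int capacity = chan.size();    // how many items the buffer can hold
        int pending  = chan.count();   // how many items are currently buffered
        if (pending >= 8) {
            filter.run();              // enough input has accumulated; run the block
        }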
  • a "stream” variable supports the streaming I/O abstraction where by each "read” of the input stream variable retrieves the next available value of the stream and each "write” to an output stream sends a value to the stream.
  • a stream variable references a channel buffer and is implemented using an index that is automatically incremented whenever it is read or written. This automatic array indexing is accomplished by using an address generator in the adaptive computing architecture or other hardware.
  • Block 400A uses a stream variable 401A to write to channel 402.
  • Channel 402 stores the data until the scheduler determines that enough data have accumulated to justify a read by block 400B, which uses a stream variable 401B as input.
  • channels have methods that allow supervisor code to learn the size of the channel's buffer, and how full it is.
  • the scheduler can then optimize I/O operations of the streams from/to the various blocks.
  • channel variables can be shared among blocks, multiple blocks can access channel data simultaneously, increasing parallel execution.
  • the stream variable and sample Q programs are discussed in greater detail below.
  • a stream variable supports the streaming I/O abstraction whereby each read of the input stream variable retrieves the next available value of the stream and each write to an output stream sends a value to the stream.
  • a stream variable references a channel buffer and is implemented using an index that is automatically incremented whenever it is read or written. This automatic array indexing is implemented directly using an address generator.
  • the following example program snippet computes a FIR filter using stream and state variables. Each loop iteration reads a sample from the input stream, computes the resulting output, and writes it to the output stream.
  • the sample state variable is used to keep a history of the values assigned to sample. Note that sample[1] refers to the current value of the sample state variable because of the assignment to sample before the unroll statement (discussed in greater detail below).
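  • The snippet itself is not reproduced here; a reconstruction consistent with this description (names, loop bounds and loop syntax are assumptions) might read:
        // Illustrative sketch only: FIR filter using stream and state variables
        // and the unroll statement, as described in the surrounding text.
        loop (int n = 0; n < NSAMPLES; n++) {     // one iteration per input sample
            sample = in;                          // read the next input-stream value
            fract16 acc = 0;
            unroll (int k = 0; k < NTAPS; k++)
                acc += coef[k] * sample[k + 1];   // sample[1] is the current value,
                                                  // sample[2] the previous, and so on
            out = acc;                            // write the result to the output stream
        }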
  • a stream variable is usually initialized by the init() method to reference a channel provided by the calling procedure. Note that channels are implemented using a circular buffer, that is, the stream index wraps around to the beginning of the channel buffer when it reaches the end.
  • the read and write stream methods read and write individual data items in streams.
  • the open method can be used to get a pointer to the next item in the stream. This pointer can then be used, for example, to access data items that are complex data types or arrays.
  • the close method is then used to complete the open, which moves the stream index to the next data item in the stream.
  • the open and close methods can also be used with output streams. By default, the stream is advanced by one data item by each read, write or close. In cases where the stream data is treated as an array, the stream must be informed via the init() method how many data items to advance. It is important, when using open() to process blocks of data, that the channel buffer be sized in units of the block size.
  • the relevant declarations are:
        // The inSwath array is one swath from the input stream
        fract14 *inSwath;
        // We will access the input swath using the 3D iterator below:
        //   foreach (window in the row of windows)
        //     foreach (row in the window)
        //       foreach (pixel in the row)
        Qiterator<fract14> inSwathI;   // 3D iterator
  • the block processes all the 8x8 windows on an 8-row swath, producing a corresponding swath in the output image. Pixels in the input image are accessed in row major order within each 8x8 window, while pixels in the output image are written in column major order. Clearly, the pixels cannot be accessed in stream order, so an open() is used to access an entire swath.
  • the stream init() method is used to indicate how many pixels are read and written by each open()/close() pair for the input and output images. The pointer returned by the open() is handed to the iterator, which also indicates how the iteration is done.
  • a 3-dimensional iterator is used to define the windowed data structure on the image swath. Note that the iterator must be reinitialized for each new swath. Also note that we write the program to process single windows because the window data is not contiguous in the stream, while swaths are.
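  • A non-limiting sketch of this open()/close() pattern follows; the stream names and the init() argument lists are schematic only, since the exact signatures are not reproduced in this summary:
      inImage.init(inChannel, SWATH_PIXELS);    // advance SWATH_PIXELS items per open()/close() (schematic)
      outImage.init(outChannel, SWATH_PIXELS);
      loop (int s = 0; s < nSwaths; s++) {
          fract14 *inSwath  = inImage.open();   // pointer to the next whole input swath
          fract14 *outSwath = outImage.open();  // pointer to the corresponding output swath
          inSwathI.init(inSwath);               // reinitialize the 3D iterator for this swath
          // ... process every 8x8 window of the swath via the iterator ...
          inImage.close();                      // advance the input stream past the swath
          outImage.close();                     // complete the corresponding output swath
      }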
  • processing may require the program to read ahead on a stream, and then back up and read some of the data again.
  • the rewind() method is provided to allow a program to back up a stream.
  • the argument to rewind indicates how many data items to back up. If the argument is negative, the stream is moved forward. Caution must be used with rewind because if blocks are running in parallel, then the producing block may have already written into the buffer space vacated by the reads, leaving no space for the rewind().
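  • For example (illustrative only), a block that has read ahead might back up as follows:
      fract16 a = in.read();        // read ahead on the input stream
      fract16 b = in.read();
      in.rewind(2);                 // back up two data items so that they may be read again
      // in.rewind(-1) would instead move the stream forward by one data item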
  • State variables allow convenient access to previous values of a variable in a computation occurring over time. For example, a FIR filter may refer to the previous N values of the input variable.
  • State variables avoid having to keep track of the history of a variable explicitly, thus streamlining programming code.
  • State variables are declared as follows: state<type> name(N); where "type" is the data type, "name" is the name of the state variable, and "N" is a constant which declares how far into the past a variable value can be referenced.
  • Arrays of state variables are allowed, for example: state<fract16> X[8](2); which declares an array of 8 state variables of data type fract16, each of which keeps two history values.
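  • An illustrative usage sketch follows; the history-indexing convention shown (index 1 denoting the value most recently assigned) is inferred from the sample[1] remark above:
      state<fract16> x(4);          // x retains its 4 most recent values
      x = in.read();                // assign a new current value
      fract16 newest   = x[1];      // the value just assigned
      fract16 previous = x[2];      // the value assigned one step earlier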
  • Unroll statements in the Q language are utilized to provide for parallel execution of computations and other functions, which may otherwise be problematic due to the sequential nature of typical "loop" statements of the prior art. More specifically, the "unroll" statement provides for control over how a compiler handles a loop: on the one hand, it can be used to direct the compiler (320) to unroll the code before scheduling it; on the other hand, where a compiler might aggressively unroll a loop, the unroll statement of the invention may constrain precisely how it should be unrolled. "Unroll" statements in the Q language follow the syntax and semantics of C "for" loops, but are compiled very differently, with very different results.
  • Unroll in the Q language is converted at compile time into straight-line code, each command of which implicitly could be executed in parallel.
  • Unroll parameters must be known at compile time and any reference to the iteration variable in the unroll body evaluates to a constant. For example, the code fragment below assigns the value of the index of an array to the indexed element of the array:
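  • A minimal sketch of such a fragment (the array name and bound are exemplary) is:
      int A[8];
      unroll (int i = 0; i < 8; i++) {
          A[i] = i;                 // i evaluates to a constant in each elaborated copy
      }
      // the compiler elaborates this into straight-line code:
      //   A[0] = 0;  A[1] = 1;  ...  A[7] = 7;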
  • Unroll statements are allowed in dataflow blocks, because the entire unroll statement can in principle be executed in a single cycle if the data dependencies allow it. It should be noted that loop and unroll are quite different; although both run a fixed number of iterations, a loop executes a number of iterations determined at run time, while unroll statements are elaborated into a dataflow graph at compile time. This means that loops cannot be part of a dataflow block because it is not known until runtime how many iterations a loop will execute (i.e., the different iterations of a loop statement must be executed sequentially, in contrast to the parallel execution of an unroll statement).
  • Q program code of the present invention computes a FIR filter using stream and state variables, and the unroll command.
  • Each iteration reads a sample from the input stream, computes, and writes the result to the output stream.
  • the sample state variable is used to keep a history of the values assigned to sample.
  • read()   // Perform parallel reads (a fragment of the example program listing)
  • Data for Q programs is input and output via matrices of the adaptive computing architecture adapted for memory functionality (or random access memories (RAMs) that are shared with the host processor).
  • the only concern is that values in a memory are transferred to some form of register, and then transferred back.
  • Data are often stored in the form of arrays that are addressed using some addressing pattern, for example, linear order for a one-dimensional array or row-major order for a two-dimensional array.
  • Q "Iterators" are special indexing variables used to access arrays in a fixed address pattern, and make efficient use of any available address generators. For example, a two-dimensional array can be accessed in row-major order using an iterator instead of the usual control structure that uses nested "for" loops.
  • the argument list for an iterator declaration contains first the array to be accessed, and then groups of four parameters for each dimension over which the array is to be iterated;
  • level - referring to the iteration level, in which the 0 level is the innermost loop and iterates the fastest;
  • Xi is an iterator used to reference X as a 128 x 64 two-dimensional array.
  • while the compiler can often implement array indexing with an address generator, iterators expose the deterministic address pattern directly to the compiler for situations that are too complex. This reduces the work, i.e., clock cycles, expended to reference an array.
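  • By way of illustration only, such a declaration and the conventional nested-loop access it replaces might be sketched as follows; the per-dimension parameters other than the level are omitted because they are not reproduced in this summary:
      fract16 X[128 * 64];          // X stored in row-major order
      // conventional access that the iterator replaces:
      //   for (int r = 0; r < 128; r++)
      //       for (int c = 0; c < 64; c++)
      //           process(X[r * 64 + c]);
      Qiterator<fract16> Xi;        // declared with X first, then (level, ...) parameters per dimension
      // Xi then references X as a 128 x 64 array in the declared fixed address pattern,
      // which the compiler can map directly onto an address generator.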
  • the iteration variable i, the loop limit n, and the increment value c cannot be modified in the loop body. Moreover, in the preferred embodiment, there is no mechanism to break out of the loop before the predetermined number of iterations has executed. Without a means to branch from a loop statement, computing overhead, and thus processing time, is reduced. Other efficient control mechanisms, however, may be implemented in the adaptive computing architecture.
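  • A schematic example of the loop statement just described (the body and the values of n, c and offset are exemplary) is:
      loop (int i = 0; i < n; i += c) {
          out.write(in.read() + offset);    // runs the predetermined number of iterations;
      }                                     // i, n and c may not be modified and there is no break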
  • Figures 6A, 6B and 6C are diagrams providing a useful summary of the Q programming language of the present invention.
  • Figures 7 through 9 provide exemplary Q programs.
  • Figure 7 provides a FIR filter, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention
  • Figure 8 provides a FIR filter with registered coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention
  • Figures 9A and 9B provide a FIR filter for a comparatively large number of coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
  • the preferred method for programming an adaptive computing integrated circuit includes: (1) using a first program construct to provide for execution of a computational block in parallel, the first program construct defined as a dataflow command for informing a compiler that included commands are for concurrent performance in parallel;
  • the third program construct defined as a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values (for example, maintaining the "N" most recent values assigned to the variable);
  • (4) using a fourth program construct to provide for iteration having a predetermined number of iterations at compile time, the fourth program construct defined as an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations; (5) using a fifth program construct to provide for array accessing, the fifth program construct defined as an iterator variable for accessing the array in a predetermined, fixed address pattern; and
  • the sixth program construct defined as a loop command for informing a compiler that the included commands contain no branching to locations outside of the loop and that a plurality of loop conditions cannot be changed.
  • the first program construct may be viewed as having a semantics including a first program construct identifier, such as the "dataflow" identifier; a commencement designation and a termination designation following the first program construct identifier, such as " ⁇ " and " ⁇ ", respectively, or another equivalent demarcation; and a plurality of included program statements contained within the commencement designation and the termination designation.
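  • As a non-limiting illustration, a dataflow statement of this form might appear as follows (the variable and stream names are exemplary):
      dataflow {
          fract16 a = in1.read();       // the included statements may be performed concurrently,
          fract16 b = in2.read();       // subject only to their data dependencies
          out.write(a * b);
      }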
  • the system of the present invention may be embodied, for example, in a computer, a workstation, or any other form of computing device, whether having a processor-based architecture, an ASIC-based architecture, an FPGA-based architecture, or an adaptive computing-based architecture.
  • the system may further include compilers and schedulers, as discussed above.
  • the present invention provides for a comparatively high-level programming language, for enabling ready programmability of adaptive computing architectures, such as the ACE architecture.
  • the Q programming language is designed to be backward compatible with and syntactically similar to widely used and well known languages like C++, for acceptance within the engineering and computing fields.
  • the method, system, and Q language of the present invention provide new and specialized program constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
  • the language, system and methodology of the present invention include program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states in a straightforward manner.
  • the invention also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units.
  • Each of these new features of the invention provides for effective programming in a reconfigurable computing environment, enabling a compiler to implement the programmed algorithms efficiently in adaptive hardware.

Abstract

The method, system and programming language of the present invention provide for program constructs, such as commands, declarations, variables, and statements, which have been developed to describe computations for an adaptive computing architecture, rather than provide instructions to a sequential microprocessor or DSP architecture. The invention includes program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states and historical values in a straightforward manner. The preferred method, system, and programming language also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units forming the adaptive computing architecture. The preferred programming language includes dataflow statements, channel objects, stream variables, state variables, unroll statements, iterators, and loop statements.

Description

METHOD, SYSTEM AND LANGUAGE STRUCTURE FOR PROGRAMMING RECONFIGURABLE HARDWARE
Field of the Invention
The present invention relates, in general, to software and code languages used in programming hardware circuits, and more specifically, to a method, system, and language command or statement structure for defining adaptive computational units in reconfigurable integrated circuitry.
Cross-Reference to Related Applications
This application is related to Paul L. Master et al., U. S. Patent Application Serial No. 09/815,122, entitled "Adaptive Integrated Circuitry With Heterogeneous And Reconfigurable Matrices Of Diverse And Adaptive Computational Units Having Fixed, Application Specific Computational Elements", filed March 22, 2001, commonly assigned to Quicksilver Technology, Inc., and incorporated by reference herein, with priority claimed for all commonly disclosed subject matter (the "first related application").
This application is related to Paul L. Master et al., U. S. Patent Application Serial No. 09/997,530, entitled "Apparatus, System and Method For Configuration Of Adaptive Integrated Circuitry Having Fixed, Application Specific Computational Elements", filed November 30, 2001, commonly assigned to Quicksilver Technology, Inc., and incorporated by reference herein, with priority claimed for all commonly disclosed subject matter (the "second related application").
Background of the Invention
The first related application discloses a new form or type of integrated circuitry which effectively and efficiently combines and maximizes the various advantages of processors, application specific integrated circuits ("ASICs"), and field programmable gate arrays ("FPGAs"), while minimizing potential disadvantages. The first related application illustrates a new form or type of integrated circuit ("IC"), referred to as an adaptive computing engine ("ACE"), which provides the programming flexibility of a processor, the post-fabrication flexibility of FPGAs, and the high speed and high utilization factors of an ASIC. This ACE integrated circuitry is readily reconfigurable, is capable of having corresponding, multiple modes of operation, and further minimizes power consumption while increasing performance, with particular suitability for low power applications, such as for use in hand-held and other battery-powered devices. Configuration information (or, equivalently, adaptation information) is required to generate, in advance or in real-time (or potentially at a slower rate), the adaptations (configurations and reconfigurations) which provide and create one or more operating modes for the ACE circuit, such as wireless communication, radio reception, personal digital assistance ("PDA"), MP3 music playing, or any other desired functions. The second related application discloses a preferred system embodiment that includes an ACE integrated circuit coupled with one or more sets of configuration information. This configuration (adaptation) information is required to generate, in advance or in real-time (or potentially at a slower rate), the configurations and reconfigurations which provide and create one or more operating modes for the ACE circuit, such as wireless communication, radio reception, personal digital assistance
("PDA"), MP3 or MP4 music playing, or any other desired functions. Various methods, apparatuses and systems are also illustrated in the second related application for generating and providing configuration information for an ACE integrated circuit, for determining ACE reconfiguration capacity or capability, for providing secure and authorized configurations, and for providing appropriate monitoring of configuration and content usage.
As disclosed in the first and second related applications, the adaptive computing engine ("ACE") circuit of the present invention, for adaptive or reconfigurable computing, includes a plurality of differing, heterogeneous computational elements coupled to an interconnection network (rather than the same, homogeneous repeating and arrayed units of FPGAs). The plurality of heterogeneous computational elements include corresponding computational elements having fixed and differing architectures, such as fixed architectures for different functions such as memory, addition, multiplication, complex multiplication, subtraction, synchronization, queuing, sampling, configuration, reconfiguration, control, input, output, routing, and field programmability. In response to configuration information, the interconnection network is operative, in advance, in real-time or potentially slower, to configure and reconfigure the plurality of heterogeneous computational elements for a plurality of different functional modes, including linear algorithmic operations, non-linear algorithmic operations, finite state machine operations, memory operations, and bit-level manipulations. In turn, this configuration and reconfiguration of heterogeneous computational elements, forming various computational units and adaptive matrices, generates the selected, higher-level operating mode of the ACE integrated circuit, for the performance of a wide variety of tasks.
This adaptability or reconfigurability (with adaptation and configuration used interchangeably and equivalently herein) of the ACE circuitry is based upon, among other things, determining the optimal type, number, and sequence of computational elements required to perform a given task. As indicated above, such adaptation or configuration, as used herein, refers to changing or modifying ACE functionality, from one functional mode to another, in general, for performing a task within a specific operating mode, or for changing operating modes.
The algorithm of the task, preferably, is expressed through "data flow graphs" ("DFGs"), which schematically depict inputs, outputs and the computational elements needed for a given operation. Software engineers frequently use data flow graphs to guide the programming of the algorithms, particularly for digital signal processing ("DSP") applications. Such DFGs typically have one of two forms, either of which are applicable to the present invention: (1) representing the flow of data through a system where data streams from one module (e.g., a filter) to another module; and (2) representing a computation as a combinational flow of data through a set of operators from inputs to outputs.
A dilemma arises when developing programs for adaptive or reconfigurable computing applications, as currently there are not any adequate or sufficient methodologies or programming languages expressly designed for such adaptive computing, other than the present invention. High-level programming languages, such as C++ or Java, are widely used, well known, and easily maintainable. The languages were developed to accommodate a variety of applications, many of which are platform-independent, but all of which are fundamentally based upon compiling a sequence of instructions ultimately fed into a processor, microprocessor, or DSP. The program code is designed to run sequentially, generally in response to a user-initiated event. However, these languages have limited capabilities for expressing the concurrency of computing operations and other features which may be significant in adaptive computing applications.
Assembly languages, at the other extreme, tightly control data flow through hardware elements such as the logic gates, registers and random access memory (RAM) of a specific processor, and efficiently direct resource usage. By their very nature, however, assembly languages are extremely verbose and detailed, requiring the programmer to specify exactly when and where every operation is to be performed. Consequently, programming in an assembly language is extraordinarily labor-intensive, expensive, and difficult to learn. In addition, as languages designed specifically for programming a processor (i. e., fixed processor architecture), assembly languages have limited, if any, applicability to or utility for adaptive computing applications.
In between these extremes, and also very different than a high-level language, are hardware description languages (HDLs), that allow a designer to specify the behavior of a hardware system as a collection of components described at the structural or behavioral level. These languages may allow explicit parallelism, but require the designer to manage such parallelism in great detail. In addition, like assembly languages, HDLs require the programmer to specify exactly when and where every operation is to be performed.
As a consequence, a need remains for a method and system of providing programmability of adaptive computing architectures. A need also remains for a comparatively high-level language that is syntactically similar to widely used and well known languages like C++, for ready acceptance within the engineering and computing fields, but that also contains specialized constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
Summary of the Invention
The present invention is a programming language, system and methodology that facilitate programming of integrated circuits having adaptive and reconfigurable computing architectures. The method, system and programming language of the present invention provide for program constructs, such as commands, declarations, variables, and statements, which have been developed to describe computations for an adaptive computing architecture, rather than provide instructions to a sequential microprocessor or DSP architecture. The invention includes program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states and historical values in a straightforward manner. The preferred method, system, and programming language also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units forming the adaptive computing architecture. The preferred programming language includes dataflow statements, channel objects, stream variables, state variables, unroll statements, iterators, and loop statements.
Numerous other advantages and features of the present invention will become readily apparent from the following detailed description of the invention and the embodiments thereof, from the claims and from the accompanying drawings.
Brief Description of the Drawings
Figure 1 is a block diagram illustrating a preferred apparatus embodiment in accordance with the invention disclosed in the first related application.
Figure 2 is a block diagram illustrating a reconfigurable matrix, a plurality of computation units, and a plurality of computational elements of the ACE architecture, in accordance with the invention disclosed in the first related application.
Figure 3 is a block diagram depicting the role of Q language in programming instructions for configuring computational units, in accordance with the present invention.
Figure 4 is a schematic diagram illustrating an exemplary data flow graph, utilized in accordance with the present invention.
Figure 5 is a block diagram illustrating the communication between Q language programming blocks, in accordance with the present invention.
Figures 6A, 6B and 6C are diagrams providing a useful summary of the Q programming language of the present invention.
Figure 7 provides a FIR filter, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
Figure 8 provides a FIR filter with registered coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
Figures 9A and 9B provide a FIR filter for a comparatively large number of coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
Detailed Description of the Invention
While the present invention is susceptible of embodiment in many different forms, there are shown in the drawings and will be described herein in detail specific embodiments thereof, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments or generalized examples illustrated.
As mentioned above, a need remains for a method and system of providing programmability of adaptive computing architectures. Such a method and system are provided, in accordance with the present invention, for enabling ready programmability of adaptive computing architectures, such as the ACE architecture. The present invention also provides for a comparatively high-level language, referred to as the Q programming language (or Q language), that is designed to be backward compatible with and syntactically similar to widely used and well known languages like C++, for acceptance within the engineering and computing fields. More importantly, the method, system, and Q language of the present invention provides new and specialized program constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
The Q language methodology of the present invention, including commands, declarations, variables, and statements (which are individually and collectively referred to herein as "constructs", "program constructs" or "program structures") has been developed to describe computations for an adaptive computing architecture, and preferably the ACE architecture. It includes program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states in a straightforward manner. The Q language also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units. Each of these new features of the Q language provides for effective programming in a reconfigurable computing environment, enabling a compiler to implement the programmed algorithms efficiently in adaptive hardware. While the Q language was developed as part of a design system for the ACE architecture, its feature set is not limited to that application, and has broad applicability for adaptive computing and other potential adaptive or reconfigurable architectures.
As discussed in greater detail below, with reference to Figures 3 through 9, the program constructs of the language, method and system of the present invention include: (1) "dataflow" statements, which declare that the operations within the dataflow statement may be executed in parallel; (2) "channel" objects, which are objects with a buffer for data items, having an input stream and an output stream, and which connect together computational "blocks"; (3) "stream" variables, used to reference channel buffers, using an index which is automatically incremented whenever it is read or written, providing automatic array indexing; (4) "state" variables, which are register variables which provide convenient access to previous values of the variable; (5) "unroll" statements, which provide a mechanism for a loop-type statement to have a determinate number of iterations when compiled, for execution in the minimum number of cycles allowed by any data dependencies; (6) "iterators", which are special indexing variables which provide for automatic accessing of arrays in a predetermined address pattern; and (7) "loop" statements, which provide for loop or repeating calculations which execute a fixed number of times.
These program constructs of the present invention have particular relevance for programming of the preferred adaptive computing architecture. When the program constructs are compiled and converted into configuration information and executed in the ACE, various computational units of the ACE architecture are configured or "called" into existence, executing the program across both space and time, such as for parallel execution of a dataflow statement. As a consequence, the ACE architecture is explained in detail below with reference to Figures 1 and 2, followed by the description of the method, system and language of the present invention.
Figure 1 is a block diagram illustrating a preferred apparatus 100 embodiment of the adaptive computing engine (ACE) architecture, in accordance with the invention disclosed in the first related application. The ACE 100 is preferably embodied as an integrated circuit, or as a portion of an integrated circuit having other, additional components. In the preferred embodiment, the ACE 100 includes one or more reconfigurable matrices (or nodes) 150, such as matrices 150A through 150N as illustrated, and a matrix interconnection network (MIN) 110. Also in the preferred embodiment, one or more of the matrices 150, such as matrices 150A and 150B, are configured for functionality as a controller 120, while other matrices, such as matrices 150C and 150D, are configured for functionality as a memory 140. While illustrated as separate matrices 150A through 150D, it should be noted that these control and memory functionalities may be, and preferably are, distributed across a plurality of matrices 150 having additional functions to, for example, avoid any processing or memory
"bottlenecks" or other limitations. Such distributed functionality, for example, is illustrated in Figure 2. The various matrices 150 and matrix interconnection network 110 may also be implemented together as fractal subunits, which may be scaled from a few nodes to thousands of nodes. A significant departure from the prior art, the ACE 100 does not utilize traditional (and typically separate) data, DMA, random access, configuration and instruction busses for signaling and other transmission between and among the reconfigurable matrices 150, the controller 120, and the memory 140, or for other input/output ("I/O") functionality. Rather, data, control and configuration information are transmitted between and among these matrix 150 elements, utilizing the matrix interconnection network 110, which may be configured and reconfigured, to provide any given connection between and among the reconfigurable matrices 150, including those matrices 150 configured as the controller 120 and the memory 140, as discussed in greater detail below. It should also be noted that once configured, the MIN 110 also functions as a memory, directly providing the interconnections for particular functions, until and unless it is reconfigured. In addition, such configuration and reconfiguration may occur in advance of the use of a particular function or operation, and/or may occur in real-time or at a slower rate, namely, in advance of, during or concurrently with the use of the particular function or operation. Such configuration and f econfiguration, moreover, may be occurring in a distributed fashion without disruption of function or operation, with computational elements in one location being configured while other computational elements (having been previously configured) are concurrently performing their designated function. This configuration flexibility of the ACE 100 contrasts starkly with FPGA reconfiguration, both which generally occurs comparatively slowly, not in realtime or concurrently with use, and which must be completed in its entirety prior to any operation or other use.
The matrices 150 configured to function as memory 140 may be implemented in any desired or preferred way, utilizing computational elements (discussed below) or fixed memory elements, and may be included within the ACE 100 or incorporated within another IC or portion of an IC. In the preferred embodiment, the memory 140 is included within the ACE 100, and preferably is comprised of computational elements which are low power consumption random access memory (RAM), but also may be comprised of computational elements of any other form of memory, such as flash, DRAM, SRAM, MRAM, ROM, EPROM or E2PROM. In the preferred embodiment, the memory 140 preferably includes direct memory access (DMA) engines, not separately illustrated.
The controller 120 is preferably implemented, using matrices 150A and 150B configured as adaptive finite state machines, as a reduced instruction set ("RISC") processor, controller or other device or IC capable of performing the two types of functionality discussed below. (Alternatively, these functions may be implemented utilizing a conventional RISC or other processor.) This control functionality may also be distributed throughout one or more matrices 150 which perform other, additional functions as well. In addition, this control functionality may be included within and directly embodied as configuration information, without separate hardware controller functionality. The first control functionality, referred to as "kernel" control, is illustrated as kernel controller ("KLARC") of matrix 150A, and the second control functionality, referred to as "matrix" control, is illustrated as matrix controller ("MARC") of matrix 150B. The kernel and matrix control functions of the controller 120 are explained in greater detail below, with reference to the configurability and reconfigurability of the various matrices 150, and with reference to the preferred form of combined data, configuration and control information referred to herein as a "silverware" module. The matrix interconnection network 110 of Figure 1, and its subset interconnection networks illustrated in Figure 2 (Boolean interconnection network 210, data interconnection network 240, and interconnect 220), collectively and generally referred to herein as "interconnect", "interconnection(s)" or "interconnection network(s)", may be implemented generally as known in the art, such as utilizing FPGA interconnection networks or switching fabrics, albeit in a considerably more varied fashion. In the preferred embodiment, the various interconnection networks are implemented as described, for example, in U.S. Patent No. 5,218,240, U.S. Patent No. 5,336,950, U.S. Patent No. 5,245,227, and U.S. Patent No. 5,144,166. These various interconnection networks provide selectable (or switchable) connections between and among the controller 120, the memory 140, the various matrices 150, and the computational units 200 and computational elements 250, providing the physical basis for the configuration and reconfiguration referred to herein, in response to and under the control of configuration signaling generally referred to herein as "configuration information". In addition, the various interconnection networks (110, 210, 240 and 220) provide selectable or switchable data, input, output, control and configuration paths, between and among the controller 120, the memory 140, the various matrices 150, and the computational units 200 and computational elements 250, in lieu of any form of traditional or separate input/output busses, data busses, DMA, RAM, configuration and instruction busses.
It should be pointed out, however, that while any given switching or selecting operation of or within the various interconnection networks (110, 210, 240 and 220) may be implemented as known in the art, the design and layout of the various interconnection networks (110, 210, 240 and 220), in accordance with the ACE architecture are new and novel. For example, varying levels of interconnection are provided to correspond to the varying levels of the matrices 150, the computational units 200, and the computational elements 250. At the matrix 150 level, in comparison with the prior art FPGA interconnect, the matrix interconnection network 110 is considerably more limited and less "rich", with lesser connection capability in a given area, to reduce capacitance and increase speed of operation. Within a particular matrix 150 or computational unit 200, however, the interconnection network (210, 220 and 240) may be considerably more dense and rich, to provide greater adaptation and reconfiguration capability within a narrow or close locality of reference. The various matrices or nodes 150 are reconfigurable and heterogeneous, namely, in general, and depending upon the desired configuration: reconfigurable matrix 150A is generally different from reconfigurable matrices 150B through 150N; reconfigurable matrix 150B is generally different from reconfigurable matrices 150A and 150C through 150N; reconfigurable matrix 150C is generally different from reconfigurable matrices 150A, 150B and 150D through 150N, and so on. The various reconfigurable matrices 150 each generally contain a different or varied mix of adaptive and reconfigurable computational (or computation) units (200); the computational units 200, in turn, generally contain a different or varied mix of fixed, application specific computational elements (250), which may be adaptively connected, configured and reconfigured in various ways to perform varied functions, through the various interconnection networks. In addition to varied internal configurations and reconfigurations, the various matrices 150 maybe connected, configured and reconfigured at a higher level, with respect to each of the other matrices 150, through the matrix interconnection network 110, also as discussed in greater detail in the first related application.
Several different, insightful and novel concepts are incorporated within the ACE 100 architecture, provide a useful explanatory basis for the real-time operation of the ACE 100 and its inherent advantages, and provide a useful foundation for understanding the present invention. The first novel concepts of ACE 100 architecture concern the adaptive and reconfigurable use of application specific, dedicated or fixed hardware units (computational elements 250), and the selection of particular functions for acceleration, to be included within these application specific, dedicated or fixed hardware units (computational elements 250) within the computational units 200 (Fig. 4) of the matrices 150, such as pluralities of multipliers, complex multipliers, and adders, each of which are designed for optimal execution of corresponding multiplication, complex multiplication, and addition functions. Through the varying levels of interconnect, corresponding algorithms are then implemented, at any given time, through the configuration and reconfiguration of fixed computational elements (250), namely, implemented within hardware which has been optimized and configured for efficiency, i.e., a "machine" is configured in real-time which is optimized to perform the particular algorithm. The next and perhaps most significant concept of the present invention, is the concept of reconfigurable "heterogeneity" utilized to implement the various selected algorithms mentioned above. In accordance with the present invention, within computation units 200, different computational elements (250) are implemented directly as correspondingly different fixed (or dedicated) application specific hardware, such as dedicated multipliers, complex multipliers, and adders. Utilizing interconnect (210 and 220), these differing, heterogeneous computational elements (250) may then be adaptively configured, in advance, in real-time or at a slower rate, to perform the selected algorithm, such as the performance of discrete cosine transformations often utilized in mobile communications. As a consequence, in accordance with the present invention, different ("heterogeneous") computational elements (250) are configured and reconfigured, at any given time, through various levels of interconnect, to optimally perform a given algorithm or other function. In addition, for repetitive functions, a given instantiation or configuration of computational elements may also remain in place over time, i.e., unchanged, throughout the course of such repetitive calculations. The temporal nature of the ACE 100 architecture should also be noted. At any given instant of time, utilizing different levels of interconnect (110, 210, 240 and 220), a particular configuration may exist within the ACE 100 which has been optimized to perform a given function or implement a particular algorithm, such as to implement channel acquisition and control processing in a GSM operating mode in a mobile station. At another instant in time, the configuration may be changed, to interconnect other computational elements (250) or connect the same computational elements 250 differently, for the performance of another function or algorithm, such as for data and voice reception for a GSM operating mode. Two important features arise from this temporal reconfigurability. First, as algorithms may change over time to, for example, implement a new technology standard, the ACE 100 may co-evolve and be reconfigured to implement the new algorithm. 
Second, because computational elements are interconnected at one instant in time, as an instantiation of a given algorithm, and then reconfigured at another instant in time for performance of another, different algorithm, gate (or transistor) utilization is maximized, providing significantly better performance than the most efficient ASICs relative to their activity factors. This temporal reconfigurability also illustrates the memory functionality inherent in the MIN 110, as mentioned above.
This temporal reconfigurability of computational elements 250, for the performance of various different algorithms, also illustrates a conceptual distinction utilized herein between configuration and reconfiguration, on the one hand, and programming or reprogrammability, on the other hand. Typical programmability utilizes a pre-existing group or set of functions, which may be called in various orders, over time, to implement a particular algorithm. In contrast, configurability and reconfigurability, as used herein, includes the additional capability of adding or creating new functions which were previously unavailable or non-existent.
Next, the present invention also utilizes a tight coupling (or interdigitation) of data and configuration (or other control) information, within one, effectively continuous stream of information. This coupling or commingling of data and configuration information, referred to as "silverware" or as a "silverware" module, is the subject of another related patent application. For purposes of the present invention, however, it is sufficient to note that this coupling of data and configuration information into one information (or bit) stream, which may be continuous or divided into packets, helps to enable real-time reconfigurability of the ACE 100, without a need for the (often unused) multiple, overlaying networks of hardware interconnections of the prior art. For example, as an analogy, a particular, first configuration of computational elements at a particular, first period of time, as the hardware to execute a corresponding algorithm during or after that first period of time, may be viewed or conceptualized as a hardware analog of "calling" a subroutine in software which may perform the same algorithm. As a consequence, once the configuration of the computational elements has occurred (i.e., is in place), as directed by (a first subset of) the configuration information, the data for use in the algorithm is immediately available as part of the silverware module. The same computational elements may then be reconfigured for a second period of time, as directed by second configuration information (i.e., a second subset of configuration information), for execution of a second, different algorithm, also utilizing immediately available data. The immediacy of the data, for use in the configured computational elements, provides a one or two clock cycle hardware analog to the multiple and separate software steps of determining a memory address and fetching stored data from the addressed registers. This has the further result of additional efficiency, as the configured computational elements may execute, in comparatively few clock cycles, an algorithm which may require orders of magnitude more clock cycles for execution if called as a subroutine in a conventional microprocessor or digital signal processor ("DSP").
This use of silverware modules, as a commingling of data and configuration information, in conjunction with the reconfigurability of a plurality of heterogeneous and fixed computational elements 250 to form adaptive, different and heterogeneous computation units 200 and matrices 150, enables the ACE 100 architecture to have multiple and different modes of operation. For example, when included within a hand-held device, given a corresponding silverware module, the ACE 100 may have various and different operating modes as a cellular or other mobile telephone, a music player, a pager, a personal digital assistant, and other new or existing functionalities. In addition, these operating modes may change based upon the physical location of the device. For example, in accordance with the present invention, while configured for a first operating mode, using a first set of configuration information, as a CDMA mobile telephone for use in the United States, the ACE 100 may be reconfigured using a second set of configuration information for an operating mode as a GSM mobile telephone for use in Europe.
Referring again to Figure 1, the functions of the controller 120 (preferably matrix (KARC) 150A and matrix (MARC) 150B, configured as finite state machines) may be explained with reference to a silverware module, namely, the tight coupling of data and configuration information within a single stream of information, with reference to multiple potential modes of operation, with reference to the reconfigurable matrices 150, and with reference to the reconfigurable computation units 200 and the computational elements 250 illustrated in Figure 2. As indicated above, through a silverware module, the ACE 100 may be configured or reconfigured to perform a new or additional function, such as an upgrade to a new technology standard or the addition of an entirely new function, such as the addition of a music function to a mobile communication device. Such a silverware module may be stored in the matrices 150 of memory 140, or may be input from an external (wired or wireless) source through, for example, matrix interconnection network 110. In the preferred embodiment, one of the plurality of matrices 150 is configured to decrypt such a module and verify its validity, for security purposes. Next, prior to any configuration or reconfiguration of existing ACE 100 resources, the controller 120, through the matrix (KARC) 150A, checks and verifies that the configuration or reconfiguration may occur without adversely affecting any pre-existing functionality, such as whether the addition of music functionality would adversely affect pre-existing mobile communications functionality. In the preferred embodiment, the system requirements for such configuration or reconfiguration are included within the silverware module, for use by the matrix (KARC) 150A in performing this evaluative function. If the configuration or reconfiguration may occur without such adverse effects, the silverware module is allowed to load into the matrices 150 of memory 140, with the matrix (KARC) 150A setting up the DMA engines within the matrices 150C and 150D of the memory 140 (or other stand-alone DMA engines of a conventional memory). If the configuration or reconfiguration would or may have such adverse effects, the matrix (KARC) 150A does not allow the new module to be incorporated within the ACE 100.
Continuing to refer to Figure 1, the matrix (MARC) 150B manages the scheduling of matrix 150 resources and the timing of any corresponding data, to synchronize any configuration or reconfiguration of the various computational elements 250 and computation units 200 with any corresponding input data and output data. In the preferred embodiment, timing information is also included within a silverware module, to allow the matrix (MARC) 150B through the various interconnection networks to direct a reconfiguration of the various matrices 150 in time, and preferably just in time, for the reconfiguration to occur before corresponding data has appeared at any inputs of the various reconfigured computation units 200. In addition, the matrix (MARC) 150B may also perform any residual processing which has not been accelerated within any of the various matrices 150. As a consequence, the matrix (MARC) 150B may be viewed as a control unit which "calls" the configurations and reconfigurations of the matrices 150, computation units 200 and computational elements 250, in real-time, in synchronization with any corresponding data to be utilized by these various reconfigurable hardware units, and which performs any residual or other control processing. Other matrices 150 may also include this control functionality, with any given matrix 150 capable of calling and controlling a configuration and reconfiguration of other matrices 150.
Figure 2 is a block diagram illustrating, in greater detail, a reconfigurable matrix 150 with a plurality of computation units 200 (illustrated as computation units 200A through 200N), and a plurality of computational elements 250 (illustrated as computational elements 250A through 250Z), and provides additional illustration of the preferred types of computational elements 250. As illustrated in Figure 2, any matrix 150 generally includes a matrix controller 230, a plurality of computation (or computational) units 200, and as logical or conceptual subsets or portions of the matrix interconnect network 110, a data interconnect network 240 and a Boolean interconnect network 210. As mentioned above, in the preferred embodiment, at increasing "depths" within the ACE 100 architecture, the interconnect networks become increasingly rich, for greater levels of adaptability and reconfiguration. The Boolean interconnect network 210, also as mentioned above, provides the reconfiguration and data interconnection capability between and among the various computation units 200, and is preferably small (i.e., only a few bits wide), while the data interconnect network 240 provides the reconfiguration and data interconnection capability for data input and output between and among the various computation units 200, and is preferably comparatively large (i.e., many bits wide). It should be noted, however, that while conceptually divided into reconfiguration and data capabilities, any given physical portion of the matrix interconnection network 110, at any given time, may be operating as either the Boolean interconnect network 210, the data interconnect network 240, the lowest level interconnect 220 (between and among the various computational elements 250), or other input, output, configuration, or connection functionality. Continuing to refer to Figure 2, included within a computation unit 200 are a plurality of computational elements 250, illustrated as computational elements 250A through 250Z (individually and collectively referred to as computational elements 250), and additional interconnect 220. The interconnect 220 provides the reconfigurable interconnection capability and input/output paths between and among the various computational elements 250. As indicated above, each of the various computational elements 250 consists of dedicated, application specific hardware designed to perform a given task or range of tasks, resulting in a plurality of different, fixed computational elements 250. Utilizing the interconnect 220, the fixed computational elements 250 may be reconfigurably connected together into adaptive and varied computational units 200, which also may be further reconfigured and interconnected, to execute an algorithm or other function, at any given time, utilizing the interconnect 220, the Boolean network 210, and the matrix interconnection network 110. While illustrated with effectively two levels of interconnect (for configuring computational elements 250 into computational units 200, and in turn, into matrices 150), for ease of explanation, it should be understood that the interconnect, and corresponding configuration, may extend to many additional levels within the ACE 100. For example, utilizing a tree concept, with the fixed computational elements analogous to leaves, a plurality of levels of interconnection and adaptation are available, analogous to twigs, branches, boughs, limbs, trunks, and so on, without limitation.
In the preferred ACE 100 embodiment, the various computational elements 250 are designed and grouped together, into the various adaptive and reconfigurable computation units 200. In addition to computational elements 250 which are designed to execute a particular algorithm or function, such as multiplication, correlation, clocking, synchronization, queuing, sampling, or addition, other types of computational elements 250 are also utilized in the preferred embodiment. As illustrated in Fig. 2, computational elements 250A and 250B implement memory, to provide local memory elements for any given calculation or processing function (compared to the more "remote" memory 140). In addition, computational elements 250I, 250J, 250K and 250L are configured to implement finite state machines, to provide local processing capability (compared to the more "remote" matrix (MARC) 150B), especially suitable for complicated control processing. With the various types of different computational elements 250 which may be available, depending upon the desired functionality of the ACE 100, the computation units 200 may be loosely categorized. A first category of computation units 200 includes computational elements 250 performing linear operations, such as multiplication, addition, finite impulse response filtering, clocking, synchronization, and so on. A second category of computation units 200 includes computational elements 250 performing non-linear operations, such as discrete cosine transformation, trigonometric calculations, and complex multiplications. A third type of computation unit 200 implements a finite state machine, such as computation unit 200C as illustrated in Figure 2, particularly useful for complicated control sequences, dynamic scheduling, and input/output management, while a fourth type may implement memory and memory management, such as computation unit 200A as illustrated in Fig. 2. Lastly, a fifth type of computation unit 200 may be included to perform bit-level manipulation, such as for encryption, decryption, channel coding, Viterbi decoding, and packet and protocol processing (such as Internet Protocol processing). In addition, another (sixth) type of computation unit 200 may be utilized to extend or continue any of these concepts, such as bit-level manipulation or finite state machine manipulations, to increasingly lower levels within the ACE 100 architecture.
In the preferred embodiment, in addition to control from other matrices or nodes 150, a matrix controller 230 may also be included or distributed within any given matrix 150, also to provide greater locality of reference and control of any reconfiguration processes and any corresponding data manipulations. For example, once a reconfiguration of computational elements 250 has occurred within any given computation unit 200, the matrix controller 230 may direct that that particular instantiation (or configuration) remain intact for a certain period of time to, for example, continue repetitive data processing for a given application.
With this foundation of the preferred adaptive computing architecture (ACE), the need for the present invention is readily apparent, as there are no adequate or sufficient high-level programming languages which are available to fully exploit such adaptive hardware. The Q language of the present invention, for example, provides program constructs in a high-level language that allow detailed description of concurrent computation, without requiring the complexity of a hardware description language. One of the goals of the Q language is to incorporate language features which allow a compiler to make efficient use of the adaptive hardware to create concurrent computations at the operator level and the task level. Figure 3 illustrates the role of the Q language in the context of the ACE architecture, and beginning with the exemplary data flow graph of Figure 4, the new and novel features of the present invention are discussed in detail. It should be noted that in the following discussion, and with regard to the present invention in general, the important features are the mechanisms and the semantics of the mechanisms, such as for the dataflow statements, channels, stream variables, state variables, unroll statements, and iterators, rather than the particular syntax involved.
Figure 3 is a block diagram depicting the role of Q language in providing for configuration of computational units, in accordance with the present invention. Figure 3 depicts the progress of an algorithm (function or operation) 300, coded in the high-level Q language 305, through a plurality of system design tools 310, such as a scheduler and Q compiler 320, to its final inclusion as part of an adaptive computing IC (ACE) configuration bit file 335, which contains the configuration information for adaptation of an adaptive computing circuit, such as the ACE 100. The system design tools 310, which include a hardware object "creator", a computing operations "scheduler" and an operation "emulator", are the subject of other patent applications. Relevant to the present invention is the scheduler and Q compiler 320 component. Components of an adaptive computing circuit are initially defined as hardware "objects", and in this instance, specifically as adaptive computing objects 325. Once the algorithm, function or operation (300) has been expressed in the Q language (305), the scheduler portion of scheduler and Q compiler 320 arranges (or schedules) the programmed operations with or across the adaptive computing objects 325, in a sequence across time and across space, in an iterative manner, producing one or more versions of adaptive computing architectures 330, and eventually selecting an adaptive computing architecture as optimal, in light of various design goals, such as speed of operation and comparatively low power consumption.
When the programmed operations have been scheduled across the selected adaptive computing architecture, the Q compiler portion of scheduler and Q compiler 320 then converts the scheduled Q program into a bit-level information stream (configuration information) 335. (It should be noted that, as used throughout the remainder of this discussion, any reference to a "compiler" should be understood to mean this Q compiler portion of scheduler and Q compiler 320, or an equivalent compiler). Following conversion of the selected adaptive computing architecture into a hardware description 340 (using any preferred hardware description language such as Verilog or VHDL) and fabrication 345, the resulting adaptive computing integrated circuit 335 may be configured, using the configuration information 335 generated for that adaptive computing architecture. For example, one of the novel features of the Q language is that it can specify parallel execution of particular functions or operations, rather than being limited to sequential execution. Using defined adaptive computing objects 325, such as ACE computational elements, the scheduler selects computational elements and matches the desired parallel functions to available computational elements, or creates the availability of computational elements, for the function to be executed at a scheduled time, in parallel, across these elements.
Figure 4 is a schematic diagram illustrating an exemplary data flow graph, utilized in accordance with the present invention. Algorithms or other functions selected for acceleration are converted into data flow graphs (DFGs), which describe the flow of inputs through computational elements to produce outputs. The data flow graph of Figure 4 shows various inputs passing through multipliers and then iterating through adders to produce outputs. Equipped with data flow graphs, the high-level Q code may be refined to improve the computing performance of the algorithm. As illustrated, the data flow graph describes a comparatively fine-grained computation, i.e., a computation composed of relatively simple, primitive operators like add and multiply. As discussed below, data flow graphs may also be used at a higher level of abstraction to describe more coarse-grained computations, such as those composed of complex operators like filters. These operators typically correspond to tasks that may comprise many instances of the more fine-grained data flow graphs.
For example, a digital signal processing ("DSP") system involves a plurality of operations that can be depicted by data flow graphs. Q supports the construction of DSP systems by utilizing computational "blocks" consisting of a plurality of programmed DFGs that communicate with each other via data "streams". Data are passed from one block to another by connecting the output streams of blocks to the input streams of other blocks. A DSP system operates efficiently by running the individual blocks when input data are available, which then produces output data used by other blocks. Blocks may be executed concurrently, as determined by a Q scheduler. (It should be noted that this Q scheduler is different from the system tool scheduler (of 320) discussed above, which schedules the compiled Q code to available computational elements, in space and time). At its simplest, a block implements a computation that consumes some number of inputs and processes them to produce some number of outputs. A block in the Q language is an object, that is, an instance of a class. It can be loaded into a matrix, and it has persistent data (such as stream variables and coefficients), state, and methods such as init() and run(). As exemplary methods, invoking the init() method initializes connections and performs any other system-specific initialization, while the run() method, which has no parameters, executes the block.
As an example, a finite impulse response filter ("FIR"), commonly used in digital signal processing, could be implemented as a Q block. The filter coefficients, the input and output streams and a variable used for the input state are part of the filter state. The run() method processes some number of inputs from an input stream, computes, and writes the outputs to an output stream. The run() method could be called many times for successive streams of input data, with the state of the execution saved between invocations. Treating a matrix computation as an object allows it to be run in short bursts instead of all at once. Because its state is persistent, execution of a computation object can be stopped and continued at a later time. This is vital for real-time DSP applications where data become available incrementally. In the example FIR filter, the filter can be initialized and run on input data as it becomes available, without any overhead to reinitialize or load data into the matrix. This also allows many matrix computations to concurrently share the hardware because each maintains its own data.
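By way of illustration only (this is not the Q language or its runtime; the class and member names such as FirBlock are hypothetical), the following C++ sketch models this object view of a block: the coefficients and the input history persist between calls to run(), so the filter can be executed in short bursts as data arrives.

    #include <cstddef>
    #include <vector>

    // Minimal software model of a block with persistent state and
    // init()/run() methods. FirBlock and its members are hypothetical names.
    class FirBlock {
    public:
        void init(std::vector<double> coef) {
            coef_ = std::move(coef);
            history_.assign(coef_.size(), 0.0);    // persistent input state
        }
        // Process whatever samples are currently available; one output per input.
        void run(const std::vector<double>& in, std::vector<double>& out) {
            if (coef_.empty()) return;
            for (double sample : in) {
                for (std::size_t k = history_.size() - 1; k > 0; --k)
                    history_[k] = history_[k - 1];  // advance the input history
                history_[0] = sample;
                double sum = 0.0;
                for (std::size_t k = 0; k < coef_.size(); ++k)
                    sum += coef_[k] * history_[k];
                out.push_back(sum);
            }
        }
    private:
        std::vector<double> coef_;     // persistent filter coefficients
        std::vector<double> history_;  // persistent input history ("state")
    };

Because coef_ and history_ persist across invocations of run(), the computation can be suspended after any burst of input and resumed later without reloading data, which mirrors the behavior described above for blocks that share the hardware.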
The efficiency of a block's execution as measured in power usage and clock cycles depends upon how well the compiler can optimize the programming code to produce a configuration bit file that directs parallel execution of operations while minimizing memory accesses. Q contains constructs that allow the programmer to expose the parallelism of the computation to the compiler in a block, and to compose a digital signal processing system as a collection of blocks, supporting both types of data flow mentioned above.
The overall goal of the Q language is to support systems that are implemented partly in hardware using either the adaptive computing architecture or parameterized hardwired components, and which may also be implemented partly in software on a conventional processor. Q primarily supports the construction of DSP systems via the composition of computational blocks that communicate via data streams. These blocks are compiled to run either on the host processor or in the adaptive computing architecture. This flexibility of implementation supports code reuse and flexible system implementation as well as rapid system prototyping using a software only solution. When a block is compiled to the adaptive computing architecture, the compiler attempts to produce an efficient parallel version that minimizes memory accesses. How well the compiler can do this generally depends on how the block is written: as mentioned above, Q contains constructs that allow the programmer to expose the parallelism of the computation to the compiler.
The blocks of the present invention follow a reactive dataflow model, removing data from input streams and processing it to produce data on output streams. Data is passed from one block to another by connecting the output streams of blocks to the input streams of other blocks. The entire system operates by running the individual blocks when their input data are available, which then produces output data used by other blocks. The scheduling of blocks can either be done statically at compile time in the case of well-behaved data flow systems such as synchronous data flow, or dynamically in the more general case. The scheduler can be supplied either by the system software, which uses information supplied by the blocks about their I/O characteristics, or it can be left to the user program. In order for a system to be scheduled automatically, the blocks should publish their I/O characteristics.
A stream carrying data between two blocks is implemented as a channel, which contains a buffer to store data items in transit between the blocks as well as information about the size of the buffer and the number of items in it. Blocks producing data use an output stream to send data through a channel to the input stream of another block. When a block writes data to an output stream, the data is stored in the channel where it becomes available to the input stream. When a block reads data from an input stream, it is removed from the channel. Thus the channel implements the FIFO implicit in dataflow graph arcs. The channel buffer is typically implemented using shared buffers so that no data copying is necessary: the writing block writes data directly into the buffer and the reading block reads it directly from the buffer. Streams are declared to carry a specific data type which may be a built-in type or user-defined such as a class object or an array. Reads and writes are done on items of the data type and the channel buffer is sized in terms of how many data items it contains. A stream data item may be as simple as a number or as complex as an array of data. Reading an input stream normally consumes a data item and writing an output stream produces a data item to the stream. However, for complex data items where the item may be processed incrementally, an open can be done to get a handle to the next item of the stream without consuming or producing it. After the item has been processed, a close is used to complete the read or write. More complex operations may also be supported, such as reading ahead or behind the current location in the stream. However, such operations make assumptions about the streams that are difficult for a scheduler to check.
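Such a channel can be modeled in software as a fixed-size circular buffer with a read index, a write index and an item count. The C++ sketch below is only an illustrative model under that assumption; the names Channel, items(), space(), read() and write() are not the actual Q runtime API.

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    // Illustrative model of a channel: a circular buffer shared by the writing
    // block (via its output stream) and the reading block (via its input
    // stream), so no data copying is needed.
    template <typename T>
    class Channel {
    public:
        explicit Channel(std::size_t size) : buf_(size) {}
        std::size_t items() const { return count_; }                 // items in transit
        std::size_t space() const { return buf_.size() - count_; }   // free slots
        void write(const T& v) {                  // used by an output stream
            if (space() == 0) throw std::runtime_error("channel full");
            buf_[wr_] = v;
            wr_ = (wr_ + 1) % buf_.size();
            ++count_;
        }
        T read() {                                // used by an input stream
            if (count_ == 0) throw std::runtime_error("channel empty");
            T v = buf_[rd_];
            rd_ = (rd_ + 1) % buf_.size();
            --count_;
            return v;
        }
    private:
        std::vector<T> buf_;
        std::size_t rd_ = 0, wr_ = 0, count_ = 0;
    };

Treating an empty read or a full write as an error matches the non-blocking stream behavior described below; a blocking variant would instead suspend the calling block until data or space becomes available.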
In order for the scheduler to be able to construct a schedule, a block should publish its I/O characteristics and its computation timing. This information can be used by a scheduler at compile time to construct a static schedule, or at run time for dynamic scheduling. Such information can be used as preconditions that must be met before a block is executed. For example, the precondition might be that there are eight data items available on the input stream and space for eight data items on the output stream.
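For instance, a simple dynamic scheduler might evaluate such a published precondition before dispatching a block. The fragment below is a hypothetical sketch that reuses the illustrative Channel model from the previous example; the names maybeRun and need are assumptions, not part of the Q system.

    // Hypothetical dynamic-scheduling check: run the block only when its
    // published precondition holds (here, 8 inputs available and room for
    // 8 outputs), as in the example precondition described above.
    template <typename Block, typename T>
    void maybeRun(Block& block, Channel<T>& in, Channel<T>& out) {
        const std::size_t need = 8;
        if (in.items() >= need && out.space() >= need)
            block.run(in, out);   // block consumes 8 items, produces 8 items
    }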
Streams may be declared to be non-blocking (the default) or blocking. Non-blocking is the default for dataflow systems where scheduling is done to ensure that no blocking can occur. In this case reading an empty stream or writing a full stream is an error. Blocking only makes sense where blocks can run in parallel or where block execution can be suspended to allow other blocks to supply the needed data. Blocking is implemented in hardware for hardware blocks. Note that streaming I/O can be used to implement double-buffering, either blocking or non-blocking. In this case, the channel buffer contains space for two items (which can be arrays) where the output stream can be writing one array while the input stream reads the other.
The stream buffer sizes depend on the relative rates at which blocks produce and consume data. Normally dataflow blocks are written in terms of the computation corresponding to one time step, sample or frame. For example, a filter would consume the next input sample, producing the corresponding output sample. Implementing a system at such a fine-grained level might be very inefficient, however. The programmer may decide for efficiency reasons that every invocation of a block will compute many data samples; however, larger buffers are needed to store the increased amount of I/O data.
An application will generally comprise both signal processing components constructed as data flow graphs as described above, as well as control-oriented
"supervisor code" that interacts with other applications and the operating system, and controls the overall processing required by the application. This control-oriented part of the application would be written in the usual procedural style, as known in the art. This supervisor code may execute the nodes of a dataflow graph directly, particularly when the computation produces information that changes how the computation is performed.
The key concepts, mechanisms, constructs and syntax of the Q language are described in detail below.
1. DATAFLOW STATEMENTS in the Q language

Q computation objects describe computations that use the adaptive computing architecture to apply operations to input data to produce output data. The set of operations are depicted in data flow graphs and are accomplished in programming code by a plurality of assignment statements. Although some operations may be executed in parallel, the execution semantics are defined by the sequential ordering of assignments as they appear in a program. A compiler may perform analysis to find parallelism, but may not detect opportunities for parallelism that are obvious to an experienced programmer. As a consequence, in accordance with the present invention, the Q "dataflow" statement informs the compiler that the code within braces following the dataflow statement describes a computation corresponding to a static, acyclic data flow graph that can be executed in parallel. Other than conditional branching performed using the known method of predicated execution (which moves branches into a data flow graph), there is no branching in the dataflow section, and no non-obvious side effects or aliasing that would cause data dependencies a compiler cannot detect. If the data flow graph is invoked as a loop body, the scheduler may schedule the data flow graphs of adjacent iterations so that they overlap and thus achieve even greater parallelism. For a comparatively straightforward example:

    int sumY1;
    int sumY2;
    int sumXY1;
    int sumXY2;
    dataflow {
        sumY2 = sumY2 + sumY1;
        sumXY2 = sumXY2 + sumXY1;
    }

The example above shows four variables of data type (or datatype) integer, two of which are assigned new values within a dataflow section. Because the values of sumY2 and sumXY2 are independent, the dataflow statement directs that the two operations be done in parallel. (While useful for explanatory purposes, this example is relatively trivial, as a compiler may recognize such an easy example; in actual practice, the dataflow statement is especially useful for directing a compiler or scheduler in how to divide large data flow graphs into units which may be scheduled in parallel).
2. CHANNELS and BLOCKS in the Q language
Q blocks are connected together using Q "channels", each channel an object with a buffer in memory for data, an input stream and an output stream. Channels are conceptually related to "named pipes" in the Unix operating system environment, but unlike named pipes, when channel data are accessed they need not be copied from the buffer to another location.
In the method of the present invention, a channel is allocated to a first block to use as its output stream; the channel is then defined as the input stream of a second block, thereby connecting the two blocks. A channel is declared with the type of data communicated through the channel and the size of the buffer. The following code fragment illustrates how two blocks are connected using a channel:

    // Channel with buffer for 16 items of datatype fraction
    channel<fract16> chan(16);
    // Connect blockA output to channel
    blockA.init(streamOut<fract16>(chan));
    // Connect blockB input to channel
    blockB.init(streamIn<fract16>(chan));
    // Are there more than 4 items in the buffer?
    if (chan.items() > 4)
        blockB.run();
The channel also has a method that allows supervisor code to find out the size of the buffer and how full it is.
3. STREAM variables
Blocks access channels via streams. A "stream" variable supports the streaming I/O abstraction whereby each "read" of the input stream variable retrieves the next available value of the stream and each "write" to an output stream sends a value to the stream. A stream variable references a channel buffer and is implemented using an index that is automatically incremented whenever it is read or written. This automatic array indexing is accomplished by using an address generator in the adaptive computing architecture or other hardware.
    // Declare an input stream variable and an output stream
    // variable, each with a buffer of N items of datatype fraction.
    streamIn<fract16> svar(N);
    streamOut<fract16> svar(N);
    // Reference an input stream:
    // returns current value, advances stream.
    var = svar.read();
    // Write to output stream:
    // sends next value, advances stream.
    svar.write(var);
    // Open a stream data item for read/write without advancing the stream
    var = svar.open();
    // Close an open stream data item: advances the stream
    svar.close();
    // Debug method: print the stream buffer, showing current location
    svar.display();
The relationships between blocks, channels and streams are illustrated in Figure 5. Block 400A uses a stream variable 401A to write to channel 402. Channel 402 stores the data until the scheduler determines that enough data have accumulated to justify a read by block 400B, which uses a stream variable 401B as input.
As described above, channels have methods that allow supervisor code to learn the size of the channel's buffer, and how full it is. The scheduler can then optimize I/O operations of the streams from/to the various blocks. Furthermore, because channel variables can be shared among blocks, multiple blocks can access channel data simultaneously, increasing parallel execution. The stream variable and sample Q programs are discussed in greater detail below.
A stream variable supports the streaming I/O abstraction whereby each read of the input stream variable retrieves the next available value of the stream and each write to an output stream sends a value to the stream. A stream variable references a channel buffer and is implemented using an index that is automatically incremented whenever it is read or written. This automatic array indexing is implemented directly using an address generator. The following example program snippet computes a FIR filter using stream and state variables. Each loop iteration reads a sample from the input stream, computes the resulting output, and writes it to the output stream. The sample state variable is used to keep a history of the values assigned to sample. Note that sample[1] refers to the current value of the sample state variable because of the assignment to sample before the unroll statement (discussed in greater detail below).
    streamIn<fract16> input;    // Input stream of samples
    streamOut<fract16> output;  // Output stream for results

    loop (int l=0; l<nOut; l++) dataflow {
        // Read the next sample from input stream
        sample = input.read();
        sum = 0.0;
        unroll (int i=0; i<nCoef; i++) {
            sum = sum + coefReg[i] * sample[nCoef-i];
        }
        output.write(sum);   // Write result to output stream
    }
A stream variable is usually initialized by the init() method to reference a channel provided by the calling procedure. Note that channels are implemented using a circular buffer, that is, the stream index wraps around to the beginning of the channel buffer when it reaches the end.
The read and write stream methods read and write individual data items in streams. For more complicated stream processing, the open method can be used to get a pointer to the next item in the stream. This pointer can then be used, for example, to access data items that are complex data types or arrays. The close method is then used to complete the open, which moves the stream index to the next data item in the stream. The open and close methods can also be used with output streams. By default, the stream is advanced by one data item by each read, write or close. In cases where the stream data is treated as an array, the stream must be informed via the init() method how many data items to advance. When using open() to process blocks of data, it is important that the channel buffer be sized in units of the block size. In other words, the block of data processed by an open() must not run past the end of the buffer. Thus, if a stream contains image data which is processed via an open() in blocks of 8 rows (as in the example below), then the channel buffer must be sized in units of 8-row blocks.
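As an illustration of this open()/close() discipline for block-structured data (again a hypothetical C++ model, not the Q stream implementation), a stream that advances by a fixed block of items per open()/close() pair might be modeled as follows; BlockStream and advance are assumed names.

    #include <cstddef>
    #include <vector>

    // Illustrative model of a stream that is opened and closed in blocks of
    // 'advance' items. The channel buffer size is assumed to be a whole
    // multiple of 'advance', so an opened block never runs past the buffer end.
    template <typename T>
    class BlockStream {
    public:
        BlockStream(std::vector<T>& channelBuf, std::size_t advance)
            : buf_(channelBuf), advance_(advance) {}
        T* open() { return &buf_[index_]; }   // handle to next block, no consume
        void close() {                        // consume the block; wrap the index
            index_ = (index_ + advance_) % buf_.size();
        }
    private:
        std::vector<T>& buf_;
        std::size_t advance_;
        std::size_t index_ = 0;
    };

If advance_ did not divide the buffer size evenly, an opened block could straddle the wrap-around point, which is exactly why the channel buffer is sized in units of the block processed per open().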
Sometimes data needs to be accessed in more complex ways than simple streams allow. The following complicated example uses a combination of streams and iterators (discussed below) to process an image.
    streamIn<fract14> inputStr;            // Input stream

    // The inSwath array is one swath from the input stream
    fract14 *inSwath;
    // We will access the input swath using the 3D iterator below:
    //   foreach (window in the row of windows)
    //     foreach (row in the window)
    //       foreach (pixel in the row)
    Qiterator<fract14> inSwathI;           // 3D iterator

    // Output access pattern is the same as for the input image
    streamOut<fract14> outputStr;          // Output stream for result swaths

    // The outSwath array is one swath written to the output stream
    fract14 *outSwath;
    // We will access the output swath using the 3D iterator below:
    //   foreach (window in the row of windows)
    //     foreach (column in the window)
    //       foreach (pixel in the column)
    Qiterator<fract14> outSwathI;          // 3D iterator

    inputStr.init(8*imageWidth);           // init() initializes the stream
    outputStr.init(8*imageWidth);

    fract14 dataIn[8];
    fract14 dataOut[8];

    // Get next swath from input stream and initialize iterator
    inSwath = inputStr.open();
    // Treat the input swath as a 3D array [row, window, col]
    inSwathI.init(inSwath,
        1, 0, 1, 8,             // rows in window
        2, 0, 1, imageWidth/8,  // windows on row
        0, 0, 1, 8);            // columns in window

    // Get access to next swath in output stream
    outSwath = outputStr.open();
    // Treat the output swath as a 3D array [row, window, col]
    outSwathI.init(outSwath,
        0, 0, 1, 8,             // rows in window
        2, 0, 1, imageWidth/8,  // windows on row
        1, 0, 1, 8);            // columns in window

    // Loop over all windows in a row of the image
    loop (int w=0; w<imageWidth/8; w++) {
        loop (int row=0; row<8; row++) dataflow {
            unroll (i=0; i<8; i++) {
                dataIn[i] = inSwathI.next();
            }
            // The row DCTs are done here...
        }
        loop (int col=0; col<8; col++) dataflow {
            // The column DCTs are done here...
            // Write the results to the output array
            unroll (i=0; i<8; i++) {
                outSwathI.next() = dataOut[i];
            }
        }
    }
    inputStr.close();    // We are done with the input and output
    outputStr.close();
Only the details of accessing the input and output images are shown here; the computation has been omitted for clarity. It should also be noted that the particular syntax used was designed for backward compatibility with C++ as a prototype implementation; a myriad of other syntaxes are available, may even be clearer, and are within the scope of the present invention. For example, the Q code:

    inSwathI.init(inSwath,
        1, 0, 1, 8,             // rows in window
        2, 0, 1, imageWidth/8,  // windows on row
        0, 0, 1, 8);            // columns in window

may be equivalently replaced with:

    inSwathI = {| for( int i = 0; i < imageWidth/8; i++ )
                    for( int j = 0; j < 8; j++ )
                      for( int k = 0; k < 8; k++ )
                        inSwath[j][i*8+k]
               |}

The block processes all the 8x8 windows in an 8-row swath, producing a corresponding swath in the output image. Pixels in the input image are accessed in row-major order within each 8x8 window, while pixels in the output image are written in column-major order. Clearly, the pixels cannot be accessed in stream order, so an open() is used to access an entire swath. The stream init() method is used to indicate how many pixels are read and written by each open()/close() pair for the input and output images. The pointer returned by the open() is handed to the iterator, which also indicates how the iteration is done. In this case, a three-dimensional iterator is used to define the windowed data structure on the image swath. Note that the iterator must be reinitialized for each new swath. Also note that the program is written to process single windows because the window data is not contiguous in the stream, while swaths are.
In some cases, processing may require the program to read ahead on a stream, and then back up and read some of the data again. The rewind() method is provided to allow a program to back up a stream. The argument to rewind indicates how many data items to back up. If the argument is negative, the stream is moved forward. Caution must be used with rewind because if blocks are running in parallel, then the producing block may have already written into the buffer space vacated by the reads, leaving no space for the rewind().
4. STATE variables
Q language "state" variables allow convenient access to previous values of a variable in a computation occurring over time. For example, a FIR filter may refer to the previous N values of the input variable. State variables avoid having to keep track of the history of a variable explicitly, thus streamlining programming code. State variables are declared as follows: state<type> name (N) ; where "type" is the data type and "name" is the name of the state variable, and "N" is a constant which declares how far into the past a variable value can be referenced. Arrays of state variables are allowed, for example: state<fractl6> X [8] (2) ; which declares an array of 8 state variables of data type fraction, each of which keeps two history values.
The value of a state variable i time units in the past (i.e. time = t-i) is referenced using the [] operator: sum = sum + in[i]; refers to the value of in, i time steps in the past.
A state variable is assigned using a normal assignment statement to the state variable without the time operator []. For example, the assignment:

    state<fract16> S(4);
    S = X;

assigns a new value X to S. Each assignment to a state variable causes time to advance for that state variable. Time is defined for a state variable by the assignments made to it. When a state variable is assigned a value, time advances and the value becomes the previous value of the variable, i.e., S[1]. After the statement S = X; above, the value of S[1] is X, the previous value of S[1] becomes available as S[2], the previous value of S[2] is available as S[3], etc. State variables can be initialized by specifying their values for specific times in the past. This is done by assigning a value to X[i] to initialize the value of X at t-i. Assignments to a state variable using the [] notation do not advance time.
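A software model of these semantics may make them concrete. The following C++ sketch is only an illustration (State is a hypothetical class, not the Q implementation); assignment advances time and the [] operator reads values from the past, assuming the declared history depth N is at least 1.

    #include <cstddef>
    #include <vector>

    // Illustrative model of a state variable keeping N history values (N >= 1).
    template <typename T>
    class State {
    public:
        explicit State(std::size_t n) : hist_(n + 1, T{}) {}  // hist_[1..n] = the past
        // Models "S = X;": time advances, X becomes S[1], the old S[1] becomes S[2], ...
        State& operator=(const T& v) {
            for (std::size_t k = hist_.size() - 1; k > 1; --k)
                hist_[k] = hist_[k - 1];
            hist_[1] = v;
            return *this;
        }
        // Models S[i]: the value assigned i time steps in the past.
        const T& operator[](std::size_t i) const { return hist_[i]; }
    private:
        std::vector<T> hist_;
    };

    // Usage mirroring the text:
    //   State<double> S(4);
    //   S = 3.0;    // now S[1] == 3.0
    //   S = 7.0;    // now S[1] == 7.0 and S[2] == 3.0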
5. UNROLL statement
"Unroll" statements in the Q language, in general, are utilized to provide for parallel execution of computations and other functions, which may otherwise be problematic due to the sequential nature of typical "loop" statements of the prior art. More specifically, the "unroll" statement provides for control over how a compiler handles a loop: on the one hand, it can be used to direct the compiler (320) to unroll the code before scheduling it; on the other hand, where a compiler might aggressively unroll a loop, the unroll statement of the invention may constrain precisely how it should be unrolled. "Unroll" statements in the Q language utilize the syntax and semantics utilized in C for loops, but are compiled very differently, with very different results. An unroll in the Q language is converted at compile time into straight-line code, each command of which implicitly could be executed in parallel. Unroll parameters must be known at compile time and any reference to the iteration variable in the unroll body evaluates to a constant. For example, the code fragment below assigns the value of the index of an array to the indexed element of the array:
    int16 j;
    int16 a[4];
    unroll (j=0; j<4; j++) {
        a[j] = j;
    }

is equivalent to the code

    int16 a[4];
    a[0] = 0;
    a[1] = 1;
    a[2] = 2;
    a[3] = 3;
Unroll statements are allowed in dataflow blocks, because the entire unroll statement can in principle be executed in a single cycle if the data dependencies allow it. It should be noted that loop and unroll are quite different; although both run a fixed number of iterations, loops execute a number of iterations determined at run time, while unroll statements are elaborated into a dataflow graph at compile time. This means that loops cannot be part of a dataflow block because it is not known until runtime how many iterations a loop will execute (i.e., the different iterations of a loop statement must be executed sequentially, in contrast to the parallel execution of an unroll statement).
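The distinction can be mimicked in ordinary C++ (purely as an analogy to the Q compiler's behavior, not a description of it): a compile-time trip count can be elaborated into straight-line statements in which the index is a constant, while a run-time trip count must be iterated sequentially. The function names below are illustrative.

    #include <cstddef>
    #include <utility>

    // Analogy for "unroll": the trip count is known at compile time, so the body
    // expands into straight-line assignments, each seeing its index as a constant.
    template <std::size_t... I>
    void unrolledAssign(int (&a)[sizeof...(I)], std::index_sequence<I...>) {
        ((a[I] = static_cast<int>(I)), ...);   // expands to a[0]=0; a[1]=1; ...
    }

    // Analogy for "loop": the trip count is fixed only at run time, so the
    // iterations execute sequentially.
    void loopAssign(int* a, std::size_t n) {
        for (std::size_t j = 0; j < n; ++j)
            a[j] = static_cast<int>(j);
    }

    int main() {
        int a[4];
        unrolledAssign(a, std::make_index_sequence<4>{});  // elaborated at compile time
        int b[4];
        loopAssign(b, 4);                                  // iterated at run time
        return 0;
    }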
In the following example, Q program code of the present invention computes a FIR filter using stream and state variables, and the unroll command. Each iteration reads a sample from the input stream, computes, and writes the result to the output stream. The sample state variable is used to keep a history of the values assigned to sample.

    streamIn<fract16> input;    // Input stream of samples
    streamOut<fract16> output;  // Output stream for results

    loop (int l=0; l<nOut; l++) dataflow {
        sample = input.read();  // Perform parallel reads
                                // from the input stream
        sum = 0.0;
        unroll (int i=0; i<nCoef; i++) {
            sum = sum + coefReg[i] * sample[nCoef-i];
        }
        output.write(sum);      // Write result to output stream
    }
6. ITERATORS
Data for Q programs is input and output via matrices of the adaptive computing architecture adapted for memory functionality (or random access memories (RAMs) that are shared with the host processor). For purposes of the present invention, the only concern is that values in a memory are transferred to some form of register, and then transferred back. Data are often stored in the form of arrays that are addressed using some addressing pattern, for example, linear order for a one-dimensional array or row-major order for a two-dimensional array. Q "Iterators" are special indexing variables used to access arrays in a fixed address pattern, and make efficient use of any available address generators. For example, a two-dimensional array can be accessed in row-major order using an iterator instead of the usual control structure that uses nested "for" loops.
    ram fract16 X[];    // Two dimensional array in RAM
    iterator Xi(X, 0, 0, 1, 128,
                   1, 0, 1, 64);
    sum = sum + Xi;     // Retrieve the next value in the array
In the preferred embodiment, the argument list for an iterator declaration contains first the array to be accessed, and then groups of four parameters for each dimension over which the array is to be iterated:
(1) level - referring to the iteration level, in which the 0 level is the innermost loop and iterates the fastest;
(2) init - referring to the initial value of the index;
(3) inc - referring to the amount added to the index in each iteration; and
(4) limit - referring to the index limit for this index.
It should be noted, however, that as mentioned above, the particular syntax employed may be highly variable, and many equivalent syntaxes are within the scope of the present invention.
Each time the iterator is referenced, the next value in the array is accessed according to the iterator pattern. In the above example, Xi is an iterator used to reference X as a 128 x 64 two-dimensional array. The address pattern generated is equivalent to that generated by the following nested "for" loops:

    for (j=0; j<64; j=j+1)
        for (i=0; i<128; i=i+1)
            X[i][j]
It should be noted that the inner "for" statement iterates over the first dimension because level=0 for the first dimension. Although the compiler can often implement array indexing with an address generator, iterators expose the deterministic address pattern directly to the compiler for situations that are too complex. This action reduces the work, i.e., clock cycles, expended to reference an array.
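The address pattern itself can be modeled in software. The following C++ sketch is an illustrative model of an iterator's address generator, not the adaptive hardware's implementation; Dim, AddressIterator and the explicit stride argument are assumed names and conventions for a row-major layout.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Each dimension is described by (level, init, inc, limit); level 0 is the
    // innermost (fastest) index. next() returns successive flat offsets.
    struct Dim { int level, init, inc, limit; };

    class AddressIterator {
    public:
        // 'dims' are given in declaration order; 'stride' is the flat stride of
        // each declared dimension (e.g. {64, 1} for a 128 x 64 row-major array).
        AddressIterator(std::vector<Dim> dims, std::vector<int> stride)
            : dims_(std::move(dims)), stride_(std::move(stride)), idx_(dims_.size()) {
            for (std::size_t d = 0; d < dims_.size(); ++d) idx_[d] = dims_[d].init;
        }
        int next() {
            int offset = 0;
            for (std::size_t d = 0; d < dims_.size(); ++d)
                offset += idx_[d] * stride_[d];
            advance();
            return offset;
        }
    private:
        void advance() {
            // Increment the level-0 index first, carrying into higher levels.
            for (int lvl = 0; lvl < static_cast<int>(dims_.size()); ++lvl) {
                for (std::size_t d = 0; d < dims_.size(); ++d) {
                    if (dims_[d].level != lvl) continue;
                    idx_[d] += dims_[d].inc;
                    if (idx_[d] < dims_[d].limit) return;  // no carry needed
                    idx_[d] = dims_[d].init;               // wrap and carry upward
                }
            }
        }
        std::vector<Dim> dims_;
        std::vector<int> stride_;
        std::vector<int> idx_;
    };

    // Usage matching the example above: X declared as X[128][64], accessed as
    //   for (j=0; j<64; j++) for (i=0; i<128; i++) X[i][j]
    //   AddressIterator Xi({{0, 0, 1, 128}, {1, 0, 1, 64}}, {64, 1});
    //   int off = Xi.next();   // 0, 64, 128, ... (i advances fastest, j slowest)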
7. LOOP statement

The Q "loop" statement is defined to have the same syntax as the C "for" statement. However, Q loops are restricted to execute a fixed number of times, determined at run time. More precisely, in the statement:
    loop (int i=0; i<n; i=i+c) {
        s = s + data1;
    }
the iteration variable i and the loop limit n and increment value c cannot be modified in the loop body. Moreover, in the preferred embodiment, there is no mechanism to break out of the loop before the predetermined number of iterations have executed. Without a means to branch from a loop statement, computing overhead, and thus processing time, is reduced. Other efficient control mechanisms, however, may be implemented in the adaptive computing architecture.
Figures 6 A, 6B and 6C are diagrams providing a useful summary of the Q programming language of the present invention. Figures 7 through 9 provide exemplary Q programs. In particular, Figure 7 provides a FIR filter, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention; Figure 8 provides a FIR filter with registered coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention; and Figures 9A and 9B provide a FIR filter for a comparatively large number of coefficients, expressed in the Q language for implementation in adaptive computing architecture, in accordance with the present invention.
The method and system embodiments of the present invention are readily apparent. For example, the preferred method for programming an adaptive computing integrated circuit includes: (1) using a first program construct to provide for execution of a computational block in parallel, the first program construct defined as a dataflow command for informing a compiler that included commands are for concurrent performance in parallel;
(2) using a second program construct to provide for automatic indexing of reference to a channel object, the channel object for providing a buffer for storing data, the second program construct defined as a stream variable for referencing the channel object;
(3) using a third program construct for maintaining a previous value of a variable between process invocations, the third program construct defined as a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values (for example, maintaining the "N" most recent values assigned to the variable);
(4) using a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time, the fourth program construct defined as an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations; (5) using a fifth program construct to provide array accessing, the fifth program construct defined as an iterator variable for accessing the array in a predetermined, fixed address pattern; and
(6) using a sixth program construct to provide for a fixed number of loop iterations at run time, the sixth program construct defined as a loop command for informing a compiler that the included commands contain no branching to locations outside of the loop and that a plurality of loop conditions cannot be changed.
Also for example, the first program construct may be viewed as having a semantics including a first program construct identifier, such as the "dataflow" identifier; a commencement designation and a termination designation following the first program construct identifier, such as "{" and "}", respectively, or another equivalent demarcation; and a plurality of included program statements contained within the commencement designation and the termination designation.
The system of the present invention, while not separately illustrated, may be embodied, for example, in a computer, a workstation, or any other form of computing device, whether having a processor-based architecture, an ASIC-based architecture, an FPGA-based architecture, or an adaptive computing-based architecture. The system may further include compilers and schedulers, as discussed above.
Numerous advantages of the present invention are readily apparent. The present invention provides for a comparatively high-level programming language, for enabling ready programmability of adaptive computing architectures, such as the ACE architecture. The Q programming language is designed to be backward compatible with and syntactically similar to widely used and well known languages like C++, for acceptance within the engineering and computing fields. More importantly, the method, system, and Q language of the present invention provide new and specialized program constructs for an adaptive computing environment and for maximizing the performance of an ACE integrated circuit or other adaptive computing architecture.
The language, system and methodology of the present invention include program constructs that permit a programmer to define data flow graphs in software, to provide for operations to be executed in parallel, and to reference variable states in a straightforward manner. The invention also includes mechanisms for efficiently referencing array variables, and enables the programmer to succinctly describe the direct data flow among matrices, nodes, and other configurations of computational elements and computational units. Each of these new features of the invention provides for effective programming in a reconfigurable computing environment, facilitating a compiler to implement the programmed algorithms efficiently in adaptive hardware. From the foregoing, it will be observed that numerous variations and modifications may be effected without departing from the spirit and scope of the novel concept of the invention. It is to be understood that no limitation with respect to the specific methods and apparatus illustrated herein is intended or should be inferred. It is, of course, intended to cover by the appended claims all such modifications as fall within the scope of the claims.

Claims

We claim:
1. A method for programming an integrated circuit, the method comprising: (a) using a first program construct to provide for execution of a computational block in parallel; (b) using a second program construct to provide for automatic indexing of reference to a buffer object;
(c) using a third program construct for maintaining a previous value of a variable between process invocations; and
(d) using a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time.
2. The method of claim 1, wherein step (a) further comprises: using a dataflow command for informing a compiler that included commands are for concurrent performance in parallel.
3. The method of claim 1, wherein step (b) further comprises: using a channel object for providing a buffer for storing data; and using a stream variable for referencing the channel object.
4. The method of claim 3, wherein the channel object is a buffer instantiated with a declared data type and a size, and wherein the stream variable is declared with a buffer of a plurality of data items of a specified data type.
5. The method of claim 1, wherein step (c) further comprises: using a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values.
6. The method of claim 1, wherein step (d) further comprises: using an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations.
7. The method of claim 1, further comprising:
(e) using a fifth program construct to provide array accessing with a predetermined address pattern.
8. The method of claim 7, wherein step (e) further comprises: using an iterator variable for accessing the array in a predetermined, fixed address pattern.
9. The method of claim 7, wherein the fifth program construct is a declaration which includes a plurality of arguments, the plurality of arguments including an iteration level, an initial value of an index, an increment added to the index for a repeated iteration, and an index limit.
10. The method of claim 1, further comprising: (f) using a sixth program construct to provide for a fixed number of loop iterations at run time.
11. The method of claim 10, wherein step (f) further comprises: using a loop command for informing a compiler that a plurality of included commands contain no branching to locations outside of the loop and that a plurality of loop conditions are fixed.
12. The method of claim 1 , wherein the first program construct has a semantics comprising: a first program construct identifier, followed by a plurality of included program statements.
13. The method of claim 12, wherein the first program construct has a syntax comprising: a dataflow designation; a commencement designation and a termination designation following the dataflow designation; and the plurality of included program statements contained within the commencement designation and the termination designation.
14. The method of claim 1, wherein the fourth program construct has a semantics comprising: a fourth program construct identifier having a plurality of arguments, followed by program statements for expansion into a plurality of individual commands according to the plurality of arguments.
15. A system for programming an integrated circuit, the system comprising: means for using a first program construct to provide for execution of a computational block in parallel; means for using a second program construct to provide for automatic indexing of reference to a buffer object; means for using a third program construct for maintaining a previous value of a variable between process invocations; and means for using a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time.
16. The system of claim 15, wherein the means for using the first program construct further comprises: means for using a dataflow command for informing a compiler that included commands are for concurrent performance in parallel.
17. The system of claim 15, wherein the means for using the second program construct further comprises: means for using a channel object for providing a buffer for storing data; and means for using a stream variable for referencing the channel object.
18. The system of claim 17, wherein the channel object is a buffer instantiated with a declared data type and a size, and wherein the stream variable is declared with a buffer of a plurality of data items of a specified data type.
19. The system of claim 15, wherein the means for using the third program construct further comprises: means for using a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values.
20. The system of claim 15, wherein the means for using the fourth program construct further comprises: means for using an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations.
21. The system of claim 15, further comprising: means for using a fifth program construct to provide array accessing with a predetermined address pattern.
22. The system of claim 21, wherein the means for using the fifth program construct further comprises: means for using an iterator variable for accessing the array in a predetermined, fixed address pattern.
23. The system of claim 21, wherein the fifth program construct is a declaration which includes a plurality of arguments, the plurality of arguments including an iteration level, an initial value of an index, an increment added to the index for a repeated iteration, and an index limit.
24. The system of claim 15, further comprising: means for using a sixth program construct to provide for a fixed number of loop iterations at run time.
25. The system of claim 24, wherein the means for using the sixth program construct further comprises: means for using a loop command for informing a compiler that a plurality of included commands contain no branching to locations outside of the loop and that a plurality of loop conditions are fixed.
26. The system of claim 15, wherein the first program construct has a semantics comprising: a first program construct identifier, followed by a plurality of included program statements.
27. The system of claim 26, wherein the first program construct has a syntax comprising: a dataflow designation; a commencement designation and a termination designation following the dataflow designation; and the plurality of included program statements contained within the commencement designation and the termination designation.
28. The system of claim 15, wherein the fourth program construct has a semantics comprising: a fourth program construct identifier having a plurality of arguments, followed by program statements for expansion into a plurality of individual commands according to the plurality of arguments.
29. A programming language for programming an integrated circuit, the programming language comprising: a first program construct to provide for execution of a computational block in parallel; a second program construct to provide for automatic indexing of reference to a buffer object; a third program construct for maintaining a previous value of a variable between process invocations; and a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time.
30. The programming language of claim 29, wherein the first program construct further comprises: a dataflow command for informing a compiler that included commands are for concurrent performance in parallel.
31. The programming language of claim 29, wherein the second program construct further comprises: a channel object for providing a buffer for storing data; and a stream variable for referencing the channel object.
32. The programming language of claim 31, wherein the channel object is a buffer instantiated with a declared data type and a size, and wherein the stream variable is declared with a buffer of a plurality of data items of a specified data type.
33. The programming language of claim 29, wherein the third program construct further comprises: a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values.
34. The programming language of claim 29, wherein the fourth program construct further comprises: an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations.
35. The programming language of claim 29, further comprising: a fifth program construct to provide array accessing with a predetermined address pattern.
36. The programming language of claim 35, wherein the fifth program construct further comprises: an iterator variable for accessing the array in a predetermined, fixed address pattern.
37. The programming language of claim 35, wherein the fifth program construct is a declaration which includes a plurality of arguments, the plurality of arguments including an iteration level, an initial value of an index, an increment added to the index for a repeated iteration, and an index limit.
38. The programming language of claim 29, further comprising: a sixth program construct to provide for a fixed number of loop iterations at run time.
39. The programming language of claim 38, wherein the sixth program construct further comprises: a loop command for informing a compiler that a plurality of included commands contain no branching to locations outside of the loop and that a plurality of loop conditions are fixed.
40. The programming language of claim 29, wherein the first program construct has a semantics comprising: a first program construct identifier, followed by a plurality of included program statements.
41. The programming language of claim 40, wherein the first program construct has a syntax comprising: a dataflow designation; a commencement designation and a termination designation following the dataflow designation; and the plurality of included program statements contained within the commencement designation and the termination designation.
42. The programming language of claim 29, wherein the fourth program construct has a semantics comprising: a fourth program construct identifier having a plurality of arguments, followed by program statements for expansion into a plurality of individual commands according to the plurality of arguments.
43. A method for programming an adaptive computing integrated circuit, the method comprising: using a first program construct to provide for execution of a computational block in parallel, the first program construct defined as a dataflow command for informing a compiler that included commands are for concurrent performance in parallel; using a second program construct to provide for automatic indexing of reference to a channel object, the channel object for providing a buffer for storing data, the second program construct defined as a stream variable for referencing the channel object; using a third program construct for maintaining a previous value of a variable between process invocations, the third program construct defined as a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values; using a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time, the fourth program construct defined as an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations; using a fifth program construct to provide array accessing, the fifth program construct defined as an iterator variable for accessing the array in a predetermined, fixed address pattern; and using a sixth program construct to provide for a fixed number of loop iterations at run time, the sixth program construct defined as a loop command for informing a compiler that a plurality of included commands contain no branching to locations outside of the loop and that a plurality of loop conditions are fixed.
44. The method of claim 43, wherein the channel object is a buffer instantiated with a declared data type and a size, and wherein the stream variable is declared with a buffer of a plurality of data items of a specified data type.
45. The method of claim 43, wherein the fifth program construct is a declaration which includes a plurality of arguments, the plurality of arguments including an iteration level, an initial value of an index, an increment added to the index for a repeated iteration, and an index limit.
46. The method of claim 43, wherein the first program construct has a semantics comprising: a first program construct identifier; a commencement designation and a termination designation following the first program construct identifier; and a plurality of included program statements contained within the commencement designation and the termination designation.
47. The method of claim 43, wherein the fourth program construct has a semantics comprising: a fourth program construct identifier having a plurality of arguments, followed by program statements for expansion into a plurality of individual commands according to the plurality of arguments.
48. A programming language for programming an adaptive computing integrated circuit, the programming language comprising: a first program construct to provide for execution of a computational block in parallel, the first program construct defined as a dataflow command for informing a compiler that included commands are for concurrent performance in parallel; a second program construct to provide for automatic indexing of reference to a channel object, the channel object for providing a buffer for storing data, the second program construct defined as a stream variable for referencing the channel object, wherein the channel object is a buffer instantiated with a declared data type and a size, and wherein the stream variable is declared with a buffer of a plurality of data items of a specified data type; a third program construct for maintaining a previous value of a variable between process invocations, the third program construct defined as a state variable for maintaining a plurality of previous values of a variable after the variable has been assigned a plurality of current values; a fourth program construct to provide for iterations having a predetermined number of iterations at a compile time, the fourth program construct defined as an unroll command for transforming a loop operation into a predetermined plurality of individual executable operations; a fifth program construct to provide array accessing, the fifth program construct defined as an iterator variable for accessing the array in a predetermined, fixed address pattern; and a sixth program construct to provide for a fixed number of loop iterations at run time, the sixth program construct defined as a loop command for informing a compiler that a plurality of included commands contain no branching to locations outside of the loop and that a plurality of loop conditions are fixed.
49. The programming language of claim 48, wherein the fifth program construct is a declaration which includes a plurality of arguments, the plurality of arguments including an iteration level, an initial value of an index, an increment added to the index for a repeated iteration, and an index limit.
50. The programming language of claim 48, wherein the first program construct has a semantics comprising: a first program construct identifier; a commencement designation and a termination designation following the first program construct identifier; and a plurality of included program statements contained within the commencement designation and the termination designation; and wherein the fourth program construct has a semantics comprising: a fourth program construct identifier having a plurality of arguments, followed by program statements for expansion into a plurality of individual commands according to the plurality of arguments.
PCT/US2003/010946 2002-04-23 2003-04-09 Method, system and language structure for programming reconfigurable hardware WO2003091875A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003221714A AU2003221714A1 (en) 2002-04-23 2003-04-09 Method, system and language structure for programming reconfigurable hardware

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/127,882 2002-04-23
US10/127,882 US6732354B2 (en) 2002-04-23 2002-04-23 Method, system and software for programming reconfigurable hardware

Publications (1)

Publication Number Publication Date
WO2003091875A1 true WO2003091875A1 (en) 2003-11-06

Family

ID=29215349

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/010946 WO2003091875A1 (en) 2002-04-23 2003-04-09 Method, system and language structure for programming reconfigurable hardware

Country Status (3)

Country Link
US (1) US6732354B2 (en)
AU (1) AU2003221714A1 (en)
WO (1) WO2003091875A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7899962B2 (en) 1996-12-20 2011-03-01 Martin Vorbach I/O and memory bus system for DFPs and units with two- or multi-dimensional programmable cell architectures
US8312301B2 (en) 2001-03-05 2012-11-13 Martin Vorbach Methods and devices for treating and processing data
US8312200B2 (en) 1999-06-10 2012-11-13 Martin Vorbach Processor chip including a plurality of cache elements connected to a plurality of processor cores
US8468329B2 (en) 1999-02-25 2013-06-18 Martin Vorbach Pipeline configuration protocol and configuration unit communication
USRE44365E1 (en) 1997-02-08 2013-07-09 Martin Vorbach Method of self-synchronization of configurable elements of a programmable module
US8686549B2 (en) 2001-09-03 2014-04-01 Martin Vorbach Reconfigurable elements
US8819505B2 (en) 1997-12-22 2014-08-26 Pact Xpp Technologies Ag Data processor having disabled cores
US8869121B2 (en) 2001-08-16 2014-10-21 Pact Xpp Technologies Ag Method for the translation of programs for reconfigurable architectures
US8914590B2 (en) 2002-08-07 2014-12-16 Pact Xpp Technologies Ag Data processing method and device
US9037807B2 (en) 2001-03-05 2015-05-19 Pact Xpp Technologies Ag Processor arrangement on a chip including data processing, memory, and interface elements
US9047440B2 (en) 2000-10-06 2015-06-02 Pact Xpp Technologies Ag Logical cell array and bus system

Families Citing this family (76)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7752419B1 (en) 2001-03-22 2010-07-06 Qst Holdings, Llc Method and system for managing hardware resources to implement system functions using an adaptive computing architecture
US7653710B2 (en) 2002-06-25 2010-01-26 Qst Holdings, Llc. Hardware task manager
US7962716B2 (en) 2001-03-22 2011-06-14 Qst Holdings, Inc. Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements
US7249242B2 (en) 2002-10-28 2007-07-24 Nvidia Corporation Input pipeline registers for a node in an adaptive computing engine
US6836839B2 (en) 2001-03-22 2004-12-28 Quicksilver Technology, Inc. Adaptive integrated circuitry with heterogeneous and reconfigurable matrices of diverse and adaptive computational units having fixed, application specific computational elements
US6577678B2 (en) 2001-05-08 2003-06-10 Quicksilver Technology Method and system for reconfigurable channel coding
US7046635B2 (en) 2001-11-28 2006-05-16 Quicksilver Technology, Inc. System for authorizing functionality in adaptable hardware devices
US6986021B2 (en) 2001-11-30 2006-01-10 Quick Silver Technology, Inc. Apparatus, method, system and executable module for configuration and operation of adaptive integrated circuitry having fixed, application specific computational elements
US8412915B2 (en) 2001-11-30 2013-04-02 Altera Corporation Apparatus, system and method for configuration of adaptive integrated circuitry having heterogeneous computational elements
US7215701B2 (en) 2001-12-12 2007-05-08 Sharad Sambhwani Low I/O bandwidth method and system for implementing detection and identification of scrambling codes
US7403981B2 (en) * 2002-01-04 2008-07-22 Quicksilver Technology, Inc. Apparatus and method for adaptive multimedia reception and transmission in communication environments
US7493375B2 (en) * 2002-04-29 2009-02-17 Qst Holding, Llc Storage and delivery of device features
US7328414B1 (en) 2003-05-13 2008-02-05 Qst Holdings, Llc Method and system for creating and programming an adaptive computing engine
US7660984B1 (en) 2003-05-13 2010-02-09 Quicksilver Technology Method and system for achieving individualized protected space in an operating system
USRE43393E1 (en) * 2002-05-13 2012-05-15 Qst Holdings, Llc Method and system for creating and programming an adaptive computing engine
US20030233639A1 (en) * 2002-06-11 2003-12-18 Tariq Afzal Programming interface for a reconfigurable processing system
US6934938B2 (en) * 2002-06-28 2005-08-23 Motorola, Inc. Method of programming linear graphs for streaming vector computation
US7140019B2 (en) * 2002-06-28 2006-11-21 Motorola, Inc. Scheduler of program instructions for streaming vector processor having interconnected functional units
US7159099B2 (en) * 2002-06-28 2007-01-02 Motorola, Inc. Streaming vector processor with reconfigurable interconnection switch
US7415601B2 (en) * 2002-06-28 2008-08-19 Motorola, Inc. Method and apparatus for elimination of prolog and epilog instructions in a vector processor using data validity tags and sink counters
US8108656B2 (en) 2002-08-29 2012-01-31 Qst Holdings, Llc Task definition for specifying resource requirements
US7937591B1 (en) * 2002-10-25 2011-05-03 Qst Holdings, Llc Method and system for providing a device which can be adapted on an ongoing basis
US8276135B2 (en) 2002-11-07 2012-09-25 Qst Holdings Llc Profiling of software and circuit designs utilizing data operation analyses
US7478031B2 (en) * 2002-11-07 2009-01-13 Qst Holdings, Llc Method, system and program for developing and scheduling adaptive integrated circuity and corresponding control or configuration information
JP4487479B2 (en) * 2002-11-12 2010-06-23 日本電気株式会社 SIMD instruction sequence generation method and apparatus, and SIMD instruction sequence generation program
US7225301B2 (en) 2002-11-22 2007-05-29 Quicksilver Technologies External memory controller node
CA2457971A1 (en) * 2003-02-20 2004-08-20 Nortel Networks Limited Circulating switch
US7581081B2 (en) 2003-03-31 2009-08-25 Stretch, Inc. Systems and methods for software extensible multi-processing
US7613900B2 (en) 2003-03-31 2009-11-03 Stretch, Inc. Systems and methods for selecting input/output configuration in an integrated circuit
US7590829B2 (en) * 2003-03-31 2009-09-15 Stretch, Inc. Extension adapter
US8001266B1 (en) 2003-03-31 2011-08-16 Stretch, Inc. Configuring a multi-processor system
US7373642B2 (en) * 2003-07-29 2008-05-13 Stretch, Inc. Defining instruction extensions in a standard programming language
US7418575B2 (en) * 2003-07-29 2008-08-26 Stretch, Inc. Long instruction word processing with instruction extensions
US7290122B2 (en) * 2003-08-29 2007-10-30 Motorola, Inc. Dataflow graph compression for power reduction in a vector processor
US7395527B2 (en) 2003-09-30 2008-07-01 International Business Machines Corporation Method and apparatus for counting instruction execution and data accesses
US8381037B2 (en) * 2003-10-09 2013-02-19 International Business Machines Corporation Method and system for autonomic execution path selection in an application
US20050086042A1 (en) * 2003-10-15 2005-04-21 Gupta Shiv K. Parallel instances of a plurality of systems on chip in hardware emulator verification
US7979384B2 (en) * 2003-11-06 2011-07-12 Oracle International Corporation Analytic enhancements to model clause in structured query language (SQL)
US7895382B2 (en) 2004-01-14 2011-02-22 International Business Machines Corporation Method and apparatus for qualifying collection of performance monitoring events by types of interrupt when interrupt occurs
US7415705B2 (en) 2004-01-14 2008-08-19 International Business Machines Corporation Autonomic method and apparatus for hardware assist for patching code
US7602771B1 (en) 2004-12-30 2009-10-13 Nortel Networks Limited Two-dimensional circulating switch
US20050235290A1 (en) * 2004-04-20 2005-10-20 Jefferson Stanley T Computing system and method for transparent, distributed communication between computing devices
US7218137B2 (en) * 2004-04-30 2007-05-15 Xilinx, Inc. Reconfiguration port for dynamic reconfiguration
US7126372B2 (en) * 2004-04-30 2006-10-24 Xilinx, Inc. Reconfiguration port for dynamic reconfiguration—sub-frame access for reconfiguration
US7599299B2 (en) * 2004-04-30 2009-10-06 Xilinx, Inc. Dynamic reconfiguration of a system monitor (DRPORT)
US7102555B2 (en) * 2004-04-30 2006-09-05 Xilinx, Inc. Boundary-scan circuit used for analog and digital testing of an integrated circuit
US7138820B2 (en) * 2004-04-30 2006-11-21 Xilinx, Inc. System monitor in a programmable logic device
US7109750B2 (en) * 2004-04-30 2006-09-19 Xilinx, Inc. Reconfiguration port for dynamic reconfiguration-controller
US7233532B2 (en) * 2004-04-30 2007-06-19 Xilinx, Inc. Reconfiguration port for dynamic reconfiguration-system monitor interface
US7392489B1 (en) * 2005-01-20 2008-06-24 Altera Corporation Methods and apparatus for implementing application specific processors
US7613858B1 (en) * 2005-01-24 2009-11-03 Altera Corporation Implementing signal processing cores as application specific processors
US7571153B2 (en) * 2005-03-28 2009-08-04 Microsoft Corporation Systems and methods for performing streaming checks on data format for UDTs
US7451426B2 (en) * 2005-07-07 2008-11-11 Lsi Corporation Application specific configurable logic IP
KR100707781B1 (en) * 2006-01-19 2007-04-17 삼성전자주식회사 Method for efficiently processing array operation in computer system
US20070174318A1 (en) 2006-01-26 2007-07-26 International Business Machines Corporation Methods and apparatus for constructing declarative componentized applications
US7774189B2 (en) * 2006-12-01 2010-08-10 International Business Machines Corporation System and method for simulating data flow using dataflow computing system
US20080209405A1 (en) * 2007-02-28 2008-08-28 Microsoft Corporation Distributed debugging for a visual programming language
US20090064092A1 (en) * 2007-08-29 2009-03-05 Microsoft Corporation Visual programming language optimization
US7941460B2 (en) * 2007-09-05 2011-05-10 International Business Machines Corporation Compilation model for processing hierarchical data in stream systems
US7860863B2 (en) * 2007-09-05 2010-12-28 International Business Machines Corporation Optimization model for processing hierarchical data in stream systems
US8087010B2 (en) * 2007-09-26 2011-12-27 International Business Machines Corporation Selective code generation optimization for an advanced dual-representation polyhedral loop transformation framework
US8087011B2 (en) * 2007-09-26 2011-12-27 International Business Machines Corporation Domain stretching for an advanced dual-representation polyhedral loop transformation framework
US8069190B2 (en) * 2007-12-27 2011-11-29 Cloudscale, Inc. System and methodology for parallel stream processing
US7945768B2 (en) 2008-06-05 2011-05-17 Motorola Mobility, Inc. Method and apparatus for nested instruction looping using implicit predicates
US8161380B2 (en) * 2008-06-26 2012-04-17 International Business Machines Corporation Pipeline optimization based on polymorphic schema knowledge
US8755515B1 (en) 2008-09-29 2014-06-17 Wai Wu Parallel signal processing system and method
US8429395B2 (en) * 2009-06-12 2013-04-23 Microsoft Corporation Controlling access to software component state
JP5990466B2 (en) * 2010-01-21 2016-09-14 スビラル・インコーポレーテッド Method and apparatus for a general purpose multi-core system for implementing stream-based operations
US8381195B2 (en) * 2010-06-17 2013-02-19 Microsoft Corporation Implementing parallel loops with serial semantics
WO2012019111A2 (en) 2010-08-06 2012-02-09 Frederick Furtek A method and apparatus for a compiler and related components for stream-based computations for a general-purpose, multiple-core system
US10559123B2 (en) 2012-04-04 2020-02-11 Qualcomm Incorporated Patched shading in graphics processing
US8887138B2 (en) * 2012-05-25 2014-11-11 Telefonaktiebolaget L M Ericsson (Publ) Debugging in a dataflow programming environment
US9727606B2 (en) * 2012-08-20 2017-08-08 Oracle International Corporation Hardware implementation of the filter/project operations
US9690894B1 (en) * 2015-11-02 2017-06-27 Altera Corporation Safety features for high level design
CN108363615B (en) * 2017-09-18 2019-05-14 清华大学 Method for allocating tasks and system for reconfigurable processing system
US11886377B2 (en) 2019-09-10 2024-01-30 Cornami, Inc. Reconfigurable arithmetic engine circuit

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5021947A (en) * 1986-03-31 1991-06-04 Hughes Aircraft Company Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
US5278986A (en) * 1991-12-13 1994-01-11 Thinking Machines Corporation System and method for compiling a source code supporting data parallel variables
US5892961A (en) * 1995-02-17 1999-04-06 Xilinx, Inc. Field programmable gate array having programming instructions in the configuration bitstream
US5737631A (en) * 1995-04-05 1998-04-07 Xilinx Inc Reprogrammable instruction set accelerator
US5933642A (en) * 1995-04-17 1999-08-03 Ricoh Corporation Compiling system and method for reconfigurable computing
US5646544A (en) * 1995-06-05 1997-07-08 International Business Machines Corporation System and method for dynamically reconfiguring a programmable gate array
US6237029B1 (en) * 1996-02-26 2001-05-22 Argosystems, Inc. Method and apparatus for adaptable digital protocol processing
US5907580A (en) * 1996-06-10 1999-05-25 Morphics Technology, Inc Method and apparatus for communicating information
US6023742A (en) * 1996-07-18 2000-02-08 University Of Washington Reconfigurable computing architecture for providing pipelined data paths
US5828858A (en) * 1996-09-16 1998-10-27 Virginia Tech Intellectual Properties, Inc. Worm-hole run-time reconfigurable processor field programmable gate array (FPGA)
US5825202A (en) * 1996-09-26 1998-10-20 Xilinx, Inc. Integrated circuit with field programmable and application specific logic areas
US5966534A (en) * 1997-06-27 1999-10-12 Cooke; Laurence H. Method for compiling high level programming languages into an integrated processor with reconfigurable logic
US5970254A (en) * 1997-06-27 1999-10-19 Cooke; Laurence H. Integrated processor and programmable data path chip for reconfigurable computing
US6078736A (en) * 1997-08-28 2000-06-20 Xilinx, Inc. Method of designing FPGAs for dynamically reconfigurable computing
US6120551A (en) * 1997-09-29 2000-09-19 Xilinx, Inc. Hardwire logic device emulating an FPGA
US5999734A (en) * 1997-10-21 1999-12-07 Ftl Systems, Inc. Compiler-oriented apparatus for parallel compilation, simulation and execution of computer programs and hardware models
US5959811A (en) * 1998-01-13 1999-09-28 Read-Rite Corporation Magnetoresistive transducer with four-lead contact
US6230307B1 (en) * 1998-01-26 2001-05-08 Xilinx, Inc. System and method for programming the hardware of field programmable gate arrays (FPGAs) and related reconfiguration resources as if they were software by creating hardware objects
US6088043A (en) * 1998-04-30 2000-07-11 3D Labs, Inc. Scalable graphics processor architecture
US6150838A (en) * 1999-02-25 2000-11-21 Xilinx, Inc. FPGA configurable logic block with multi-purpose logic/memory circuit

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5465368A (en) * 1988-07-22 1995-11-07 The United States Of America As Represented By The United States Department Of Energy Data flow machine for data driven computing
US5380550A (en) * 1992-08-26 1995-01-10 Ricoh Company, Ltd. Method for producing a reversible thermosensitive recording material
US6016395A (en) * 1996-10-18 2000-01-18 Samsung Electronics Co., Ltd. Programming a vector processor and parallel programming of an asymmetric dual multiprocessor comprised of a vector processor and a risc processor
US6507947B1 (en) * 1999-08-20 2003-01-14 Hewlett-Packard Company Programmatic synthesis of processor element arrays
US20020042907A1 (en) * 2000-10-05 2002-04-11 Yutaka Yamanaka Compiler for parallel computer

Non-Patent Citations (19)

* Cited by examiner, † Cited by third party
Title
"Cray T3E Fortran Optimization Guide", January 1999, CRAY RESEARCH INC., XP002968916 *
"Fortran 3.0.1 User's Guide", August 1994, SUN MICROSYSTEMS, pages: 57 - 68 *
"IEEE Transactions on Acoustics, Speech and Signal Processing", September 1987, article LEE E. AND MESSERSCHMITT D.: "Pipeline interleaved programmable DSP's: synchronous data flow programming", XP000001892 *
"OpenMP C and C++ application program interface", OPENMP ARCHITECTURE REVIEW BOARD, October 1998 (1998-10-01), pages 8 - 16, XP002968918 *
"Oracle8i JDBC Developer's Guide and Reference", July 2000, ORACLE CORPORATION, pages: 10-8 - 10-10, XP002968917 *
BACON D., GRAHAM S., SHARP. O.: "Compiler transformations for high-performance computing", ACM COMPUTING SURVEYS, vol. 26, no. 4, December 1994 (1994-12-01), pages 368 - 373, XP002246513 *
BUCK J. ET AL.: "Ptolemy: A framework for simulating and prototyping heterogeneous systems", INTERNATIONAL JOURNAL OF COMPUTER SIMULATION, vol. 4, April 1994 (1994-04-01), pages 155 - 182, XP000614664 *
CHAPMAN B. AND MEHROTRA P.: "Proceedings of the 4th International Euro-Par Conference (Euro-Par'98)", vol. 1470, SPRINGER-VERLAG HEIDELBERG, LECTURE NOTES IN COMPUTER SCIENCE, article "OpenMP and HPF: Integrating Two Paradigms", pages: 650 - 658, XP002968923 *
GOKHALE M. AND SCHLESINGER J.: "A data parallel C and its platforms", PROCEEDINGS OF THE FIFTH SYMPOSIUM ON THE FRONTIERS OF MASSIVELY PARALLEL COMPUTATION (FRONTIERS'95), February 1995 (1995-02-01), pages 194 - 202, XP010130218 *
HALBWACHS N. ET AL.: "The synchronous data flow programming language LUSTRE", PROCEEDINGS OF THE IEEE, vol. 79, no. 9, September 1991 (1991-09-01), XP002968920 *
HEINZ E.: "An efficiently compilable extension of (M)odula-3 for problem-oriented explicitly parallel programming", PROCEEDINGS OF THE JOINT SYMPOSIUM ON PARALLEL PROCESSING, May 1993 (1993-05-01), pages 269 - 276 *
HORTON I.: "Beginning Java 2: JDK 1.3 Edition", February 2001, WROX PRESS, pages: 313 - 316, XP002968919 *
JUNG H., LEE K., HA S.: "Efficient hardware controller synthesis for synchronous dataflow graph in system level design", PROCEEDINGS OF THE 13TH INTERNATIONAL SYMPOSIUM ON SYSTEM SYNTHESIS (ISSS'00), September 2000 (2000-09-01), pages 79 - 84, XP010514329 *
LEE E. AND MESSERSCHMITT D.: "Synchronous data flow", PROCEEDINGS OF THE IEEE, vol. 75, no. 9, September 1987 (1987-09-01), XP000048475 *
LEE E. AND PARKS T.: "Dataflow process networks", PROCEEDINGS OF THE IEEE, vol. 83, no. 5, May 1995 (1995-05-01), XP000517102 *
MCGRAW J.: "Parallel functional programming in sisal: fictions, facts and future", LAWRENCE LIVERMORE NATIONAL LABORATORY, July 1993 (1993-07-01), XP002968922 *
NICHOLS M., SIEGEL H., DIETZ H.: "Data management and control-flow constructs in a SIMD/SPMD parallel language/compiler", PROCEEDINGS OF THE 3RD SYMPOSIUM ON THE FRONTIERS OF MASSIVELY PARALLEL COMPUTATION, October 1990 (1990-10-01), pages 397 - 406, XP010019670 *
WHITING P. AND PASCOE R.: "A history of data flow languages", IEEE ANNALS OF THE HISTORY OF COMPUTING, vol. 16, no. 4, 1994, XP002968921 *
WILLIAMSON M. AND LEE E.: "Synthesis of parallel hardware implementations from synchronous dataflow graph specifications", CONFERENCE RECORD OF THE THIRTIETH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS, vol. 2, November 1996 (1996-11-01), pages 1340 - 1343, XP010231365 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195856B2 (en) 1996-12-20 2012-06-05 Martin Vorbach I/O and memory bus system for DFPS and units with two- or multi-dimensional programmable cell architectures
US7899962B2 (en) 1996-12-20 2011-03-01 Martin Vorbach I/O and memory bus system for DFPs and units with two- or multi-dimensional programmable cell architectures
USRE45223E1 (en) 1997-02-08 2014-10-28 Pact Xpp Technologies Ag Method of self-synchronization of configurable elements of a programmable module
USRE45109E1 (en) 1997-02-08 2014-09-02 Pact Xpp Technologies Ag Method of self-synchronization of configurable elements of a programmable module
USRE44365E1 (en) 1997-02-08 2013-07-09 Martin Vorbach Method of self-synchronization of configurable elements of a programmable module
USRE44383E1 (en) 1997-02-08 2013-07-16 Martin Vorbach Method of self-synchronization of configurable elements of a programmable module
US8819505B2 (en) 1997-12-22 2014-08-26 Pact Xpp Technologies Ag Data processor having disabled cores
US8468329B2 (en) 1999-02-25 2013-06-18 Martin Vorbach Pipeline configuration protocol and configuration unit communication
US8312200B2 (en) 1999-06-10 2012-11-13 Martin Vorbach Processor chip including a plurality of cache elements connected to a plurality of processor cores
US9047440B2 (en) 2000-10-06 2015-06-02 Pact Xpp Technologies Ag Logical cell array and bus system
US8312301B2 (en) 2001-03-05 2012-11-13 Martin Vorbach Methods and devices for treating and processing data
US9037807B2 (en) 2001-03-05 2015-05-19 Pact Xpp Technologies Ag Processor arrangement on a chip including data processing, memory, and interface elements
US9075605B2 (en) 2001-03-05 2015-07-07 Pact Xpp Technologies Ag Methods and devices for treating and processing data
US8869121B2 (en) 2001-08-16 2014-10-21 Pact Xpp Technologies Ag Method for the translation of programs for reconfigurable architectures
US8686549B2 (en) 2001-09-03 2014-04-01 Martin Vorbach Reconfigurable elements
US8914590B2 (en) 2002-08-07 2014-12-16 Pact Xpp Technologies Ag Data processing method and device

Also Published As

Publication number Publication date
AU2003221714A1 (en) 2003-11-10
US6732354B2 (en) 2004-05-04
US20030200538A1 (en) 2003-10-23

Similar Documents

Publication Publication Date Title
US6732354B2 (en) Method, system and software for programming reconfigurable hardware
US7979263B2 (en) Method, system and program for developing and scheduling adaptive integrated circuitry and corresponding control or configuration information
Catthoor et al. Data access and storage management for embedded programmable processors
US7200837B2 (en) System, method and software for static and dynamic programming and configuration of an adaptive computing architecture
Weng et al. A hybrid systolic-dataflow architecture for inductive matrix algorithms
Vahid et al. Warp processing: Dynamic translation of binaries to FPGA circuits
WO2004014065A2 (en) System of finite state machines
Abnous et al. The pleiades architecture
Mei et al. Design and optimization of dynamically reconfigurable embedded systems
Chaudhuri et al. SAT-based compilation to a non-von Neumann processor
Bhattacharyya Compiling dataflow programs for digital signal processing
Gokhale et al. Co-synthesis to a hybrid RISC/FPGA architecture
Cardoso et al. From C programs to the configure-execute model
Chen et al. An integrated system for rapid prototyping of high performance algorithm specific data paths
Resano et al. A hybrid design-time/run-time scheduling flow to minimise the reconfiguration overhead of FPGAs
David et al. Energy-Efficient Reconfigurable Processors
Silva et al. A dynamic dataflow architecture using partial reconfigurable hardware as an option for multiple cores
Ottimo et al. FSP: a framework for data stream processing applications targeting FPGAs
Misra et al. Efficient HW and SW Interface Design for Convolutional Neural Networks Using High-Level Synthesis and TensorFlow
Barnwell et al. The Georgia Tech digital signal multiprocessor
Reinders et al. Programming for FPGAs
David et al. A compilation framework for a dynamically reconfigurable architecture
EP1470478A2 (en) Method and device for partitioning large computer programs
Dimitroulakos et al. Performance improvements using coarse-grain reconfigurable logic in embedded SOCs
Sunny et al. Standalone Nested Loop Acceleration on CGRAs for Signal Processing Applications

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP