|Publication number||US7509647 B2|
|Application number||US 10/152,634|
|Publication date||Mar 24, 2009|
|Filing date||May 23, 2002|
|Priority date||May 25, 2001|
|Also published as||EP1402379A2, EP1402379A4, EP2472407A2, EP2472407A3, US20020178285, WO2002097565A2, WO2002097565A3|
|Publication number||10152634, 152634, US 7509647 B2, US 7509647B2, US-B2-7509647, US7509647 B2, US7509647B2|
|Inventors||Robert L. Donaldson, Rhett D. Hudson, Lawrence M. Marshall, Jr., Michael N. Gray, James J. Sullivan, James B. Peterson, Teresa G. Smith, Michael P. Klewin, Dennis M. Hawver|
|Original Assignee||Annapolis Micro Systems, Inc.|
The invention relates to apparatus and methods for modeling dataflow within systems and circuits and to realizing such systems and circuits in hardware.
Conventional circuit capture and simulation schemes are limited in their abilities to scale designs and are typically restricted by their fixed port maps, with a specific number of ports having specific names and types. Such approaches require that all input ports be connected in some functionally correct manner and do not provide the option of dynamic ports wherein the ports and their characteristics can adapt to system requirements. Moreover, conventional schematic capture schemes provide libraries of components with fixed parameters. As a result, designs of products, such as field programmable gate arrays and other custom circuits are constrained to a narrow range of implementations and cannot be optimized based on the dataflow in the system. In addition, such design systems typically comprise low-level components whose interaction with peer components must be manually negotiated by the user.
A system according to the invention seeks to avoid such limitations by providing a designer the ability to create designs based on dataflow, as represented in components responsive to well defined intercommunication protocols. A system according to the invention conceptually behaves in a manner analogous to that of a Petri Net. According to the invention, protocols are derivative of several canonical forms. According to the invention, these canonical forms also directly imply a second tier of protocols that convert a data stream compliant to one canonical protocol to any of the other canonical protocols. A system according to the invention allows the development of multiple dataflow streams by interconnecting components compliant with one or more of the canonical forms either directly, for components compliant to the same intercommunication protocol, or through the canonical conversions, for components whose intercommunication protocols differ.
Further, the invention provides a mapping between a modeled system and various hardware implementations of the modeled system. This mapping includes low-level hardware for implementing the canonical intercommunication protocols and the canonical conversions between protocols. Target hardware platforms include but are not limited to microprocessor-based systems, programmable logic devices, field programmable gate arrays and application specific integrated circuits.
In another aspect of the invention, consistent with a dataflow approach, a design can be modeled with components having non-static port maps, so that components can derive their parameters, such as bit width, based on their interconnections with other components. Glue logic, such as that required for the convergence of ready signals, can be automatically implied. Moreover, a design targeted to a specific hardware platform can be re-targeted to a different hardware platform by changing the binding of data sources and sinks, making designs re-useable among a plurality of platforms.
A method and apparatus according to the invention are disclosed herein with reference to the drawings in which:
A method of modeling a dataflow architecture according to the invention includes storing in a memory indicia which represent a dataflow architecture having components responsive to intercommunication protocols and processing conversions among the intercommunication protocols to determine control structures required to implement the dataflow architecture. The indicia stored in memory and the control structures are translated into a description of an implementation of the dataflow architecture. The description of an implementation may be in a standard netlist format such as the electronic design interchange format (EDIF) netlisting language, or in a hardware description language such as structural very high speed integrated circuit hardware description language (VHDL) or structural Verilog. The description of an implementation is further translated into a physical implementation of the dataflow architecture. The physical implementation may be configuration data for a field programmable gate array (FPGA), a layout for an application specific integrated circuit (ASIC), or object code for a general-purpose processor.
In addition to data signals, virtually all dataflow paths also require at least one forward-flowing handshake bit and one backward-flowing handshake bit. Protocols requiring less than this typically have the missing information implied as a property of the design. For instance, a pipelined computational component may be always ready to accept inputs, thus obviating the need for a “ready” signal indicating that it is able to accept inputs. These two handshake signals will always accompany data paths in order to provide full control, and appear herein as shown in
The forward-flowing handshake and data signals, as well as the backward-flowing handshake signal may be thought of as a single bus, much the same way that a multiple-bit data value is considered to be a single bus, with the exception that the directionality of the backward-flowing handshake signal is always contrary to the nominal flow of the bus. This contrary flow becomes a source of complexity when attempting seemingly trivial operations such as fanning out a signal, as will be discussed later.
The DFC protocol shown in use in
Due to timing restrictions, there may be occasions where the forward-flowing “valid” and “data” signals might be temporally skewed with respect to the backward-flowing “ready” signal. For more complex circuits, this becomes a highly useful aspect of the DFC protocols, and is referred to herein as the skew associated with the DFC bus. The skew is defined as the number of clock cycles of delay between when a component asserts its “ready” line (or whatever the backward-flowing handshake signal may be called) and when the data to be consumed is present on the “data” signal. This skew is almost always non-negative, since accurately presenting valid data while having to predict whether or not it will be consumed is typically prohibitively complex, if not impossible, and is usually not of much use. The skew is presented herein as a sub-scripted number to the left of the “VDR” name. The intuitive protocol described in the previous paragraph and shown in
Skew may be incurred through the use of registers on all inputs and outputs of a component when strict timing constraints are required, such as with high-performance computations in a field programmable gate array (FPGA). Registering all three signals in a VDRn bus will add 2 to the skew, transforming it into a VDRn+2 bus protocol, as shown in
A manipulation of the data stream which requires pipelining to meet timing constraints can also add skew to a VDR bus.
Because data is only considered to be transferred from a VDR output to a VDR input when both the “valid” and the “ready” flow-control signals are asserted, one of the first things that a component accepting a VDR input would typically do is perform a logical AND operation on these two signals to generate a signal indicating that the currently incoming data is to be consumed. Since this signal would most often be used to push values into a FIFO, it is called the “push” signal. Creating a “push” signal from a “valid” and “ready” signal is known herein as a “push-cast” operation. Due to the potential for skew between the “ready” and “data” signals (and hence, between “ready” and “valid” signals), a “push-cast” operation is not only an AND gate, but also a delay chain used to synchronize the “ready” and “valid” signals.
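The “push-cast” operation can be illustrated with a small cycle-by-cycle software model. The following Python sketch is illustrative only and is not part of the disclosed hardware: the function name push_cast and the list-of-booleans representation of signals are assumptions, and the delay chain is assumed to start out deasserted.

```python
from collections import deque

def push_cast(valid, ready, skew):
    """Model a "push-cast" on a VDRn bus (hypothetical helper).

    valid, ready: sequences of booleans, one entry per clock cycle.
    skew: n, the number of cycles by which "valid"/"data" lag "ready".
    Returns the per-cycle "push" signal.
    """
    # Delay chain: hold "ready" for `skew` cycles so that it lines up
    # with the "valid" it refers to. The chain is assumed to reset to 0.
    chain = deque([False] * skew)
    push = []
    for v, r in zip(valid, ready):
        chain.append(r)
        aligned_ready = chain.popleft()   # equals ready[t - skew]
        push.append(v and aligned_ready)  # the AND gate
    return push
```

With a skew of 0 the delay chain is empty and the model reduces to a plain AND of “valid” and “ready” in each clock cycle.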
This “push-cast” operation converts a VDRn signal to another DFC protocol referred to herein as “PDR”. The name of this protocol is derived from the first letters of its three signals: “push”, “data”, and “ready”. The PDR protocol also has an associated skew, and the “push-cast” operation maintains this skew, converting a VDRn signal into a PDRn signal. The major rules of the PDR protocol are that data presented with an asserted “push” must be consumed, and that “push” may not be asserted when the associated “ready” signal is not (note that, due to potential skew, the “push” signal and its associated “ready” signal may not necessarily occur during the same clock cycle).
Similarly, some components may only wish to assert their intent to consume incoming data during certain times when they are guaranteed that the incoming data will be consumable. These components AND the incoming “valid” signal with their internally-computed “ready” signal, and present the resulting signal as their backward-flowing handshake output. Since this signal may be used to pull data from a FIFO, it is referred to as the “pull” signal. Creating a “pull” signal from a “valid” and “ready” signal is known as a “pull-cast” operation, and since the result of the AND operation is temporally aligned with the “valid” signal, the resulting protocol has a skew of 0, and the “pull-cast” operation is usually only performed on VDR0 busses.
The “pull-cast” operation implies a third DFC protocol referenced herein as “VDL”. The third letter of “pull” is used to name this protocol in order to avoid the confusion of using the same letter to denote both “push” and “pull”. The major rules of the VDL protocol are that “pull” can only be asserted when “valid” is asserted, and the data associated with an assertion of “pull” will be consumed.
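As an illustration of the “pull-cast” operation and the first VDL rule, the following Python sketch models the signals as per-cycle booleans. The names pull_cast and obeys_vdl are hypothetical and chosen only for this example.

```python
def pull_cast(valid, internal_ready):
    """Model a "pull-cast": AND the incoming "valid" with the
    component's internally computed "ready", cycle by cycle."""
    return [v and r for v, r in zip(valid, internal_ready)]

def obeys_vdl(valid, pull):
    """Check the VDL rule that "pull" is only asserted while
    "valid" is asserted in the same clock cycle."""
    return all(v or not p for v, p in zip(valid, pull))
```

Because the result is computed in the same cycle as “valid”, any stream produced by pull_cast satisfies obeys_vdl by construction, which corresponds to the zero skew described above.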
During manipulations of DFC busses, skew tends to accumulate. Many simple components can add to the skew of a DFC bus but few can subtract from this skew. However, at times, it is required that the skew on a particular bus be 0. Reducing the skew to 0 can be achieved using a “clear-skew” operation. To clear a skew of “n”, the operation must drop its “ready” output when it can only accept “n” more data values. These data values are queued up in a First in First out device (FIFO) and presented to the downstream port as required. The “clear-skew” operator is thus a data FIFO, with an “almost full” control signal which is set to be asserted when there are “n” or fewer spaces left in the FIFO. Incoming data, if it uses the VDR protocol, may be cast to PDR using a “push-cast”, and the resulting VDL0 output may be converted to VDR0.
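The sizing argument behind the “clear-skew” operator can be checked with a simple simulation. The sketch below is a behavioral model under stated assumptions, not the disclosed circuit: the upstream source is assumed to push on every cycle in which the PDRn rules permit it (the worst case), the function name simulate_clear_skew is hypothetical, and the downstream side is modeled as a per-cycle request pattern that only pulls when the FIFO holds data.

```python
from collections import deque

def simulate_clear_skew(depth, skew, pull_pattern):
    """Return (peak FIFO occupancy, values delivered downstream)."""
    fifo = deque()
    ready_history = []
    next_value = 0
    peak = 0
    delivered = []
    for t, want_pull in enumerate(pull_pattern):
        # "almost full" is asserted when `skew` or fewer spaces remain.
        almost_full = (depth - len(fifo)) <= skew
        ready_history.append(not almost_full)
        # A push in cycle t is allowed only if "ready" was asserted
        # `skew` cycles earlier; the greedy source always uses it.
        push = t >= skew and ready_history[t - skew]
        # Downstream VDL0 side: pull only when data is available.
        if want_pull and fifo:
            delivered.append(fifo.popleft())
        if push:
            fifo.append(next_value)
            next_value += 1
        peak = max(peak, len(fifo))
    return peak, delivered
```

In this model the occupancy never exceeds the FIFO depth even when the downstream side stalls for a long time, because at most “n” pushes can still be in flight once “ready” has been dropped.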
DFC Protocol Rules
The “valid”, “push”, “data”, “ready”, and “pull” signal definitions, along with the associated definition of the skew parameter, are the elements which make up the VDR, PDR, and VDL protocols. It is within the scope of the invention to employ a dual set of protocols which use “empty” as an inverted-sense replacement for “valid”, and “full” as the inverted-sense replacement for “ready”. The resulting protocols (EDF, PDF, and EDL) would then exactly match the correspondingly-named signals on a typical FIFO element. Since there is no benefit inherent in choosing one set of protocols over the other, we acknowledge the existence of the dual set here as within the scope of the invention, and recognize that the “valid” and “ready” signals may be generated from the “not empty” and “not full” control signals of a FIFO. The formal definitions of the various DFC signals are as follows:
The rules associated with these signals ensure that no data is lost or fabricated within a computational structure. The rules are as follows:
It seems at first glance that a PDRn output could be made to drive a VDRn input simply by hooking the “push” output to the “valid” input. After all, the additional restriction that the “push”/“valid” line will not be asserted during a time when the “ready” line is not asserted does not appear to violate the looser VDR protocol, which allows “valid” to be high at any time. Similarly, it appears that a VDRn output could easily drive a VDLn input. To avoid deadlocking data streams, however, it is useful to realize precisely which protocol is being used. If a component with a VDR output will only assert “valid” when it sees an asserted “ready” signal, that component is better described as having a PDR output. And again, a VDR input which is only ready when valid data is available is really a VDL input.
This distinction becomes important when considering a fourth DFC protocol called “PDL”, which has “push”, “data”, and “pull” signals. This protocol is subject to the complication of deadlock. Extrapolating the rules of the PDR and VDL protocols, one major rule of PDL is that the “push” and associated “pull” signals must either be both asserted or both deasserted. Because the upstream component may not have valid data to present, the downstream component is not able to assume that it can assert its “pull” output. Similarly, since the downstream component may not be able to accept valid data, the upstream component could not presume to be able to assert “push”. The only functional form of a bus using the PDL protocol is the degenerate case where a component is known to either always be ready or always have valid data to present. This is really more like cheating the protocol, though, since that known property of the component is a form of a “virtual” handshake signal, whose value is always known and is factored into the design of the interfacing component.
The importance of this PDL protocol becomes more obvious when it is used to help describe what really happens when a DFC stream is misclassified, and subsequently hooked up to a component which implements the protocol with which the stream is labeled.
Therefore, in general, it is useful to precisely describe the protocol supported by a designed component. It should be noted, though, that the same collection of logic and registers may just happen to implement two different DFC protocols.
Multiple DFC Streams
Until now, all examples herein have used only a single input stream and a single output stream. In useful computations, it is rarely the case that the implementation consists entirely of such components. Handling multiple DFC streams is more complicated than the analog of handling multiple “once-per-clock-cycle” raw data streams. Even the simple act of fanning out a DFC component's output is more complex than just routing it to two destinations, due in part to the fact that the backward-flowing “ready” or “pull” signal would then have multiple drivers. Two of the most common manipulations on DFC streams that require attention, besides those already described, are the “fan-out” operation and the “synchronize” operation.
A “fan-out” operator can not route the “valid” signal directly to both destinations, since one destination may consume the data before the other is ready to do so. This may lead to dropping or replicating data if not compensated for with additional hardware. A simple form of a “fan-out” operator is a single multiple-input AND gate, and takes a PDR input and provides PDR outputs. This operator is shown in the left diagram of
A “synchronize” operator has restrictions similar to the version of the “fan-out” operator which takes a VDL0 input. The only difference is that, since it has multiple inputs, multiple “valid” signals are fed into the AND gate to generate the “push”/“pull” signal for all inputs and outputs. In fact, both the VDL0 version of the “fan-out” operator and the “synchronize” operator can be viewed as special variations on a more general operator which routes its input data to its output ports in whatever form is desired, performs a logical AND on all of the “valid” and “ready” signals, and uses the result for all of the “push” and “pull” signals.
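The general operator described above can be modeled in a few lines. The following Python sketch is an illustration under assumptions: the names fire and run_adder are hypothetical, input queues stand in for upstream VDL0 sources (an input is “valid” when its queue is not empty), and a two-input adder is used as the example computation.

```python
from collections import deque

def fire(valids, readys):
    """Common "push"/"pull" decision: the logical AND of every
    "valid" input and every "ready" input in the current cycle."""
    return all(valids) and all(readys)

def run_adder(a_items, b_items, out_ready):
    """Synchronize two input streams and add them pairwise.

    out_ready: per-cycle "ready" from the single downstream consumer.
    """
    a, b = deque(a_items), deque(b_items)
    results = []
    for ready in out_ready:
        valids = [bool(a), bool(b)]  # non-empty queue means "valid"
        if fire(valids, [ready]):
            # One datum is consumed from every input in the same cycle.
            results.append(a.popleft() + b.popleft())
    return results
```

No datum is consumed from one input unless a matching datum is consumed from the other, so data is neither dropped nor replicated in this model.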
The “synchronize” operator is particularly useful in the construction of computational units which require multiple inputs.
In order to cascade two such computational components, the PDRn output of the first must be converted to a VDL0 signal to attach to the second. This is achieved using a variant of the “clear-skew” operator shown in
The DX Protocol
The “synchronize” and “fan-out” operations are common, considering that any multiple-input operator needs to synchronize its inputs. In addition, significant logic simplifications may be achieved by recognizing that all “push” signals are identical, and that data consumption at the outputs of these operations is synchronized. Therefore, it is useful to capture the concept of a single, common “push” signal with a different protocol name. The name “DX” is given to this protocol, since the commonly passed signals would consist of the “data” and a more restricted “ready” signal.
The DX protocol does not explicitly route a “push” signal to components which support it. Most components do not need the “push” signal to function properly, and diligently tracking the skew of the DX busses provides enough information to create an appropriate “push” signal from the original, zero-skew “push” coming from the “synchronize” operator when such a signal is required. Therefore, in addition to an associated skew parameter, a DX bus is also considered to belong to a particular “push domain”. The push domain is directly related to the “synchronize” or “fan-out” operator which generated the original DX signals. This push domain may be denoted as a super-scripted token immediately to the left of the “DX” specifier, such as: pDXn. The push domain is synonymous with the original zero-skew “push” signal generated by the “synchronize” or “fan-out” operator.
In addition to these specializations with respect to the “push” signal, components which manipulate DX signals must also take care when manipulating the “ready” signal. In particular, the “ready” outputs of a component can only be the logical AND of all of its “ready” inputs. The component can not mask off the “ready” signal in response to some data value or the “push” signal, and it can not delay the “ready” signal by any number of clock cycles. This restriction guarantees that a downstream conversion from DX to another DFC protocol will function properly. A DX component must always be ready to accept new data for processing (exceptions to this rule discussed further herein involve advanced pipeline design). A DX stream may eventually need to be converted to a less restrictive DFC protocol. One technique for performing this conversion involves reconstructing a valid “push” signal and reinterpreting the signals as PDR signals. This conversion is shown in
For the purposes of simplification in an automated system, the delayed “push” signal generated in
In order to perform an operation on two DX inputs, the inputs must belong to the same domain and have the same skew. If one input has a lesser skew, this mismatch can be fixed by adding delay components. If the domains do not match, however, this difference requires that a completely new domain be created, and that the data be brought into this domain before the operation can proceed. While the concept of creating a new domain sounds overly complicated, it can be as uncomplicated as implementing a “synchronize” operation. In order to synchronize the inputs, however, the two incoming DX streams must be converted to PDR as shown above, and then into VDL0 with the aid of a “clear-skew” operator.
Another advantage of the DX protocol is seen when cascading two multiple-input computational elements and comparing the resulting hardware to that of the previous example that used VDL inputs and PDR outputs.
The PDX Protocol
In order to perform protocol transitions locally, an evolution of the DX protocol that relaxes skew constraints between “push” and “data” signals, as well as tracks the individual skew of each of the “push”, “data”, and “ready” signals may be used. This “PDX” protocol appears as a cross between the PDR protocol and the DX protocol. Any PDX signal thus has four associated parameters: skews for “push”, “data”, and “ready” signals, and the push domain. As an advantage over the PDR protocol, PDX signals originating from the same push domain can be synchronized with simple delay elements, rather than conversion to VDL0. As an advantage over the DX protocol, any component, given PDX inputs, can create synchronization circuitry locally, without the need for operations such as “get-push”, which requires inter-component coordination with the domain controller. Notation for a PDX signal consists of a subscript in front representing the push domain, and a subscript after every letter representing the relative skew of each of the three signals, such as: pP0D8X−1. In truth, the PDX protocol is a superset of the PDR and DX protocols.
Two disadvantages of the PDX protocol are the redundant application of delay lines to the “push” signal in components connected in parallel, and the wide (and potentially redundant) AND operation typically applied to “ready” signals. To alleviate the effects of these disadvantages, care may be taken in delaying “push” and “ready” signals. In particular, at each “fan-out” operation, the “push” signal can be delayed up to at least the skew of the “data” signal, with the assumption that this delay will be required just this one time. Delaying a “push” signal to a skew higher than the “data” signal may cause the “data” signal to have to be delayed further on, so “push” is typically not delayed for as long as possible. Furthermore, the “ready” signal, which is AND-ed at each “fan-out” operation, can be registered after the AND. This actually decreases the skew of the “ready” signal on the component's output. The individual skew of a “ready” signal is typically negative when not zero. This registering will alleviate timing concerns by breaking up the wide AND operation, but may aggravate redundancy concerns by making it more difficult for synthesis tools to recognize and optimize redundancies.
The PDX protocol has no restrictions on the “ready” signal (but the “X” is kept as part of its name to imply that PDX is similar to DX). Such restrictions were required for the DX protocol because the individual skews were not tracked. As with the DX protocol, in order to perform a pipelined computation on two PDX data streams, the streams must belong to the same push domain and have the same skew on their “data” signals. If the PDX signals do not belong to the same push domain, a new one can be created by transitioning through the PDR and VDL protocols to return to a PDX protocol with a new push domain. A pPaDbXc signal can be converted into a PDRn signal, where n=max(a,b)−c, by delaying the “push” or “data” signal up to a skew of max(a,b). The resulting “push”, “data”, and “ready” then use signaling identical to that of a PDRn signal and can be reinterpreted as such. Conversely, a PDRn signal may be reinterpreted as a qPxDxXx−n signal for any value of x, so long as the push domain, q, is unique within the system. The acts of converting from PDX to PDR and back, “resync” and “new-domain”, are illustrated in
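The skew arithmetic of the “resync” and “new-domain” conversions can be worked through in code. The Python sketch below only computes skews and delays; the function names and the dictionary and tuple return formats are assumptions made for this example.

```python
def resync(push_skew, data_skew, ready_skew):
    """PDX (skews a, b, c) to PDRn: delay "push" or "data" up to
    max(a, b), giving n = max(a, b) - c."""
    target = max(push_skew, data_skew)
    return {
        "push_delay": target - push_skew,  # extra cycles added to "push"
        "data_delay": target - data_skew,  # extra cycles added to "data"
        "pdr_skew": target - ready_skew,   # n of the resulting PDRn bus
    }

def new_domain(pdr_skew, x=0):
    """PDRn to PDX in a fresh push domain: skews (x, x, x - n)
    for "push", "data", and "ready" respectively."""
    return (x, x, x - pdr_skew)
```

For the example signal pP0D8X−1 given earlier, resync(0, 8, -1) delays “push” by 8 cycles, leaves “data” unchanged, and yields a PDR9 bus; new_domain(9) then gives skews of 0, 0, and −9.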
Converting between PDX and DX protocols is also possible. A pPaDbXc signal can be converted into a qDXn signal, where n=b−c, by relabelling the “data” and “ready” signals and keeping track of the “push” signal in a fabricated domain controller. If c=0, the two push domains, p and q, may even be identical. If c≠0, the two push domains must be different, but they are related in that the “push” signal in domain q is thought to be decided upon c clock cycles earlier than that in domain p. Even though this results in identical “push” signals, changing the frame of reference has the effect of zeroing the skew of the “ready” signal (a prerequisite for the DX protocol). Because of this, the conversion is referred to as the “zero-ready” operation.
One caveat to this conversion is that it may be impossible to generate a zero-skew “push”, so the “get-push” operator to convert from DX to PDR must take this into account. In fact, if a>b, it may be impossible to synchronize the “push” and “data” signals without having to delay the data. If the “get-push” operator is handled as a black box, with communication to the domain controller about the minimum available push skew, these restrictions can be handled seamlessly. It would just no longer be the case that a pDXn signal is converted to a PDRn signal, but rather a PDRm signal, where m=max(a,n).
A pDXn signal can be converted to a pP0DnX0 signal with a simple relabelling of signals, although if a zero-skew “push” is not available, it may only be convertible to a pPmDnX0 signal, where m is the minimum skew “push” available. In this conversion, the two signals remain within the same push domain. Because the zero-skew “push” is typically available from a DX signal, this operation is called the “zero-push” operator.
A push domain is a reference point that refers to the decision to consume data. It is synonymous with a boolean function of discrete time which, at any given point in time (typically referred to as a “clock cycle”), describes whether or not a collection of data is about to be consumed. Usually, this decision is manifested in hardware as the generation of a zero-skew “push” signal used to pull data from FIFOs and push it into a logic structure which digests the data to produce a result. Thus, the push domain, the value of the function, and its electrically equivalent “push” signal all refer to the decision to consume data.
For PDX signals, the skew of each individual signal refers to its latency relative to the consumption decision. For simpler signal protocols, like DX and PDR, some of this information is lost. The DX protocol keeps the push domain reference, but loses the “push” signal and assumes the “ready” signal has zero skew. The PDR protocol loses the push domain, assumes the “push” and “data” signals are skewed by the same amount, and only remembers the difference between the common “push”/“data” skew and the “ready” skew (as this is the only number required to construct a “clear-skew” operator).
One issue that arises is: “what is making this consumption decision, and how is it making it?”. Typically, the answer is that the “synchronize” operator decides upon the consumption of data by ensuring that all required data is available and there is room reserved downstream for the digested result (the product of all pertinent “valid” and “ready” signals). In general, a component which consumes one datum from each of its inputs to produce one datum on each of its outputs may make its own consumption decision identically to that of the “synchronize” operator, and may even incorporate a “synchronize” operator to make the decision for it. In such a case, all inputs to the component are VDL0 and all outputs are at pDXn, and the component is the controller of its own push domain—its own consumption rate.
Other components may be slaves within a foreign push domain, where data consumption is decided by some external logic. The simplest of such cases is pipelined components where all inputs and outputs use the PDX or DX protocols within the same push domain. Other possibilities include components where one input is PDR0, pP0DnX0, or pDX0 and the others are VDL0, or where one output is VDL0, and the others are PDR, PDX, or DX. These components are instructed on when to consume data by the incoming handshake signal of the odd input or output. The restriction on some skews being zero is due to the fact that the slave component must, on a clock-by-clock basis, recompute its outbound “ready” or “valid” signal.
Casting between Protocols
Given the five protocols discussed (VDR, PDR, VDL, DX, and PDX), it is informative to consider the different options available for casting from one protocol to the next. In particular, any sort of protocol on the inputs to a pipelined computation must be cast to the DX protocol in order for the computation to proceed. Furthermore, it may be necessary to not produce an output of type DX where the domain controller is nested within the component, since retrieving the “push” signal from the domain controller may then need to alter the port map of the component. In such a case, the DX signal will likely be cast to a PDR signal before leaving the component.
One cast shown in
Transitioning out of the DX and PDX protocols and back in must create a new push domain. This is due to the fact that transitioning to one of the other protocols loses the information about the push domain and signal skews relative to that domain. This information cannot be regained. Therefore, this information must be regenerated within the context of a new push domain.
Advanced Push-Domain Implementation Techniques
Because working with the DX protocol is similar to working on a simple pipelined structure, many of the techniques applicable to pipelining can be used to implement efficient push-domain hardware. Sharing pipelined hardware between two similar operations is possible, provided the required throughput is 50% (one result every two clock cycles) or less, and the two operations occur within the same push domain. Sharing takes the form of a multiplexor on the inputs to a unit and the fanning out of the outputs, with the realization that the outputs are only valid some of the time. Furthermore, the “ready” signal fed to the domain controller must be modulated by the duty cycle of the shared unit's availability.
The simplest form of such sharing occurs when a pipelined component which can accept one input per clock cycle is shared to perform N operations within the desired computation. A “modulo-N” phase counter can be used to modulate the “ready” signal fed to the domain controller, as well as control the multiplexor selecting inputs to the pipelined component. The use of this counter to control the multiplexor implies that the skew, ‘s’, associated with an input which will be selected when the count value is ‘x’ must satisfy the following equation:
s mod N=x
The skew, ‘r’, of the resulting output of the unit is then just ‘s+n’, where ‘n’ is the latency of the component. If the above relationship can not be satisfied using the inputs directly, one or more of the inputs must be delayed enough clock cycles until it can.
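The delay needed to satisfy the relationship s mod N=x can be computed directly. The following Python sketch is a worked example under assumptions: the function name align_to_slot is hypothetical, and the smallest non-negative delay is chosen, although any delay that differs from it by a multiple of N would also satisfy the relationship.

```python
def align_to_slot(skew, slot, n_slots, latency):
    """Return (extra_delay, output_skew) for one shared-unit input.

    skew:    s, the skew of the input as it arrives
    slot:    x, the phase-counter value at which this input is selected
    n_slots: N, the modulus of the phase counter
    latency: n, the latency of the shared pipelined component
    """
    # Smallest d >= 0 such that (skew + d) mod N == slot.
    extra_delay = (slot - skew) % n_slots
    aligned_skew = skew + extra_delay
    # Output skew r = s + n, using the skew after alignment.
    return extra_delay, aligned_skew + latency
```

For example, an input with a skew of 5 that must be selected at count 1 of a modulo-3 counter needs 2 extra cycles of delay, since (5+2) mod 3 = 1.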
Up to now, the only types of computational components considered herein were pipelined units which could accept an input on each clock cycle, and produced results after a fixed number of clock cycles from the input. This covers a majority of the useful high-performance components, but there may be instances where a computation is known to only need to be performed once in a while. If a computational unit's availability is more complex than “once every clock cycle”, such availability may be expressed by modulating the “ready” input to the domain controller, as seen in the previous example. If, however, the unit's availability and latency are dependent on run-time data, they can not be covered with a simple modification of the “ready” signal.
Such components may require full-flow-control protocols such as VDL, VDR, and PDR. Converting from the DX protocol to a full-flow-control protocol and then back can incur a lot of hardware overhead. If it can be guaranteed that a component can sustain a particular worst-case throughput, a modulation that expresses no more than that throughput can be applied to the “ready” input of the domain controller, and FIFOs can be used to wrap the component and cushion the effect of it not being ready all the time. The “push” signal is then derived with a “get-push” operator to temporarily convert the unit's input from DXn to PDR. This “push” signal can then be delayed by the maximum expected latency, ‘m’ (which includes the latencies of the FIFOs), and used as the “pull” signal to convert back from VDL to DXn+m.
Since the “full” output from the front FIFO and the “empty” output from the back FIFO are ignored, there are complex statistical requirements which must be met by the computational unit in order for this structure to function properly.
Low throughput requirements may also simplify the conversion to VDL0 at the end of a push-domain structure. Typically, such a conversion is implemented as a “get-push” operator (to convert from DXn to PDRn) followed by a “clear-skew” operator (to convert from PDRn to VDL0). In cases where the skew, ‘n’, is high and the required throughput of the structure is low, a straightforward implementation of the “clear-skew” operator can require an unnecessarily large FIFO. One way to side-step this issue is to modulate the “ready” signal of the “clear-skew” operator to restrict its maximum throughput to some percentage ‘p’. The FIFO then need only be able to store a similarly reduced amount of data, with the “almost-full” threshold being set to when the FIFO has ‘p·n’ spaces left available.
Excessive, simplistic modulation of the “ready” signal in a push domain may cause undesirable effects. For instance, one unit that modulates it by allowing data every 3 clock cycles, factored in with another unit that modulates it by allowing data every 5 clock cycles will result in a throughput of one result every 15 clock cycles. For this reason, it is best for a programmatic process generating push-domain structures to react more intelligently to components with special throughput requirements, and to generate all control signals needed to constrict throughput in the push domain.
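The compounding effect described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each unit gates the “ready” signal with a free-running counter starting at cycle zero and that the gates are combined with a logical AND; the function names are hypothetical.

```python
def ready_every(period):
    """Return a gate that is ready once every `period` clock cycles."""
    return lambda cycle: cycle % period == 0


def accepted_inputs(periods, cycles):
    """Count the cycles in which every modulating gate is ready at once."""
    gates = [ready_every(p) for p in periods]
    return sum(1 for c in range(cycles) if all(g(c) for g in gates))


if __name__ == "__main__":
    # Gates with periods 3 and 5 coincide only once every 15 cycles.
    print(accepted_inputs([3, 5], 150))
```

Under these assumptions, 150 cycles admit only 10 inputs rather than the 30 that the slower of the two units could have sustained by itself.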
An alternative method of reducing the sizes of “clear-skew” operators takes advantage of the fact that one and only one result is produced for every one set of inputs. Because of this fact, the computation can be “paced”, by adding a gate at the front of the push domain which ensures that a FIFO used to clear the skew at the back of the push domain is sufficiently large to contain the outputs.
The “pacer” component is used to reduce the size of a FIFO in a “clear-skew” operator by placing two matching FIFOs at each of its inputs. These FIFOs produce VDL0 outputs, the first of which is fed through the “pacer” component's input to its output, as the VDL0 result of the computation performed in the push domain. The second is a regular “clear-skew” implementation, which ensures that the first FIFO will never overflow by disallowing any inputs into the push domain which would eventually produce too many outputs. The AND gate within the “pacer” may not be necessary in such a configuration, as it may be assumed that the pacer FIFO is always valid whenever the data FIFO has valid data.
Because the data for the second input to the “pacer” is unused, it need not be routed to the FIFO. In fact, all that truly needs to be implemented for the second FIFO is the controller which determines “empty” and “almost-full” signals. This implies the need for a “void” data type, for which the value of the data being passed is not important, but the control signals used to pass the data are important. The “void” data type is thus used to implement explicit control structures using what appear, at first glance, to be computational elements.
Higher-Level DFC Components
The DFC protocols provide an explicit specification of the presence of valid data through either the “valid” or “push” handshake signal. They also denote explicit consumption of data through the “ready” and “pull” handshake signals. This explicit presence and consumption gives data passing through the system an almost physical analog, in which the designer must ensure that data is neither created nor destroyed inappropriately. These restrictions imply a collection of structures unique to explicitly present data.
The simplest such structure is one with a single input and no outputs which consumes all data passed in to it. This component is always ready to accept input, so its implementation, shown in
Related to the “null” component is a single-input, single-output component which accepts all incoming data, and repeats it on its output until a new input arrives. This “echo” component, as shown in
Another component is the “constant” component, whose data is always valid, regardless of the state of the “pull” input. This component is shown in
A slight variation of the “constant” component is a “controlled-constant” component. An implementation of such a component is shown in
The most basic structure which justifies a need for full flow control is that of a “switch” component. This component is similar to a multiplexor, except that data on a deselected input is not consumed, and remains at the input until it is selected. The selection control is unspecified, for now, and comes in through a raw data connection, “sel”.
A slight variation on the “switch” component is the “merge” component, in which the “sel” input uses a DFC protocol instead of just raw input data. The values on “sel” are then pulled whenever the main output is pulled, and the main output is valid only when the “sel” input is valid. Thus, an implementation of the “merge” component is shown in
A second variant of the “switch” component is one that automatically determines which input to select. This determination can be any arbitrary logic function, but it is typically a function of the “valid” lines on each of the inputs. The arbitration logic (arbiter) is thus fed each of the “valid” signals, and produces a selection. In order to track which input corresponds to each output, an output similar to “sel” is produced for each data value that passes to the main output. In this way, the multiplexed output stream may then be demultiplexed further downstream using the information on the “sel” port. An example of this “auto-merge” component is shown in
Both variants of the “switch” component consume only one of their main inputs for each output produced. An alternative type of selected-input component is one in which all main inputs are consumed for each output. This alternative, known as the “sift” operator, is similar to a common multiplexor, and can be implemented using the DX protocol as just a multiplexor.
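The difference in consumption behavior between the “switch” family and the “sift” operator may be sketched in software as follows. This is a behavioral model only: each input stream is represented by a Python deque, a return value of None stands for “output not valid”, and the function names are assumptions for illustration.

```python
from collections import deque


def switch_pull(inputs, sel):
    """Consume one item from the selected input only; others stay queued."""
    queue = inputs[sel]
    return queue.popleft() if queue else None


def sift_pull(inputs, sel):
    """Consume one item from every input and return the selected one."""
    if not all(inputs):  # an empty deque is falsy: some input is not valid
        return None
    heads = [q.popleft() for q in inputs]
    return heads[sel]


if __name__ == "__main__":
    a = [deque([1, 2]), deque([10, 20])]
    print(switch_pull(a, 1), list(a[0]))  # deselected input is untouched
    b = [deque([1, 2]), deque([10, 20])]
    print(sift_pull(b, 1), list(b[0]))    # deselected input is consumed
```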
Conditional constructs can typically be implemented using a “fan-out” operator to feed both branches of the condition, followed by a “sift” operator to select the appropriate result given the value of the condition.
The inverse of the “switch” component is the “deswitch” component, which takes a single input and connects it to one of multiple outputs. As with the “switch” component, the selection control comes in through a raw data connection, “sel”.
As with the “switch” component, the “deswitch” component can be made so that the “sel” input uses a DFC protocol.
As with the “switch” component, there is a second variant of the “deswitch” component in which the “sel” signal is automatically determined by an arbitrary logic function. In this variant, PDR0 signals are used in place of VDL0 signals, and the arbiter is fed the “ready” signals of all the outputs.
A third variant of the “deswitch” component is the “route-mask” component. This component takes a bit mask as the “sel” input and routes the incoming data to all outputs for which the corresponding bit in the mask is a one. With such a component, a “sel” input of all zeros will discard the incoming data, whereas a “sel” input of all ones will broadcast the incoming data to all outputs. This component is implemented by performing a bit-wise logical OR of the “ready” inputs with the inverse of the “sel” mask and then generating the vector AND of the resulting bit vector concatenated with the “sel” and input's “valid” signals to determine the “pull” and “push” outputs.
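One possible reading of the “route-mask” logic is sketched below in Python. The assignment of the combined result to a single “pull” signal, and the gating of each “push” output by its mask bit, are assumptions made for illustration; the names are hypothetical.

```python
def route_mask(sel, ready, in_valid, sel_valid=True):
    """Return (pull, push) for one clock cycle of a behavioral route-mask model.

    sel and ready are equal-length lists of booleans, one entry per output.
    """
    # Bit-wise OR of each "ready" with the inverse of the mask bit: an
    # unselected output can never block the transfer.
    unblocked = [r or not s for r, s in zip(ready, sel)]
    # Vector AND of that result together with the two "valid" signals.
    pull = all(unblocked) and in_valid and sel_valid
    # Assumption: each output is pushed only if selected and the transfer fires.
    push = [pull and s for s in sel]
    return pull, push


if __name__ == "__main__":
    print(route_mask([True, False, True], [True, False, True], in_valid=True))
```

With a mask of all zeros the model pulls the input while pushing nothing, which corresponds to discarding the data; with a mask of all ones every output must be ready before the data is broadcast.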
With the existence of an “auto-merge” component and a “route” component, it is possible to manually share a general computational unit between multiple users. Each user's input to the shared unit is fed to a unique input of the “auto-merge” component. The output of the “auto-merge” is then fed to the computational unit, whose output is fed to the “route” component. The “sel” line of the “route” component is driven by the “sel” output from the “auto-merge” (after the appropriate conversion from PDR0 to VDL0 using a “clear-skew” operator) and the outputs of the “route” component are used as the results for the users tied to the corresponding inputs on the “auto-merge”.
This simplistic form of sharing an operator may lead to deadlock situations, where data attempting to leave the computational unit is supposed to be routed to a particular user who will not be ready for more data until it receives a second input from another user sharing this same computational unit. To avoid this situation, it is more appropriate to fan out each user's input and route it to the pacing input of a “pacer” component. The “pacer” is then used to pace the user's output stream. Because fanning out a DFC stream will typically generate a PDR stream, there will be “clear-skew” operators just before the inputs to the “auto-merge” and “pacer” component, which will allow the first computation to be performed. The size of the FIFOs in these “clear-skew” operators will directly affect how many outstanding requests can be propagating through the shared computational unit, and may adversely affect the utilization or throughput of the structure if the size is too small compared to the latency of the computational unit.
In a manner similar to the “share” operator, if the user wishes to create multiple identical computational units to improve the throughput in the case where each individual computational unit's throughput is lower than one result per clock cycle, a “serve” operator can be manually constructed using an “auto-route” and “merge” component. The main input goes into the input of the “auto-route” and each output of the “auto-route” is fed to a unique computational unit. Each computational unit's output is then fed to the corresponding input of the “merge” component, and the output is the final result. The “sel” input of the “merge” is driven using the “sel” output of the “auto-route” after the appropriate conversion from PDR0 to VDL0 using a “clear-skew” operator.
The example “share” and “serve” operations shown above only describe how to use single-input and single-output computations. When dealing, for example, with sharing a multiple-input computation, it is inappropriate to attempt to replicate “auto-merge” components, since there is no guarantee that each component will select the same user's input. Thus, the computation would effectively be performed on a random collection of inputs. Instead, it is convenient to introduce a “group” operator, which assembles individual streams into a single input, and an “ungroup” operator, which breaks a group into individual streams. This grouping and ungrouping can then be performed on either side of the various “merge” and “route” components to make the computation being “share”d or “serve”d effectively appear as a single-input single-output operation. A “group” operator is implemented by performing a “synchronize” operation on the individual inputs and then concatenating the resulting data outputs. An “ungroup” operator is a simple “fan-out” operator followed by each output taking a slice of the original data.
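The data path of the “group” and “ungroup” operators may be sketched as bit concatenation and slicing. The Python model below covers only the data portion; the “synchronize” and “fan-out” handshaking is omitted, and the function names and the choice to place the first stream in the most significant bits are assumptions for illustration.

```python
def group(values, widths):
    """Concatenate unsigned values into one word, first value most significant."""
    word = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its declared width")
        word = (word << width) | value
    return word


def ungroup(word, widths):
    """Slice a grouped word back into its individual values."""
    values = []
    for width in reversed(widths):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return values[::-1]


if __name__ == "__main__":
    packed = group([0xA, 0x3], [4, 2])
    print(packed, ungroup(packed, [4, 2]))
```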
A basic high-level construct is that of the loop. While loops could be manually crafted with simple components such as “increment” and “merge”, it is typically more efficient to produce a single component, which accepts “start” and “end” values for the count, as well as an “up” input to be reproduced ‘N’ times as an output (where ‘N=end−start+1’).
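The counting behavior of such a loop component may be sketched as a Python generator. The pairing of each count value with a copy of the replicated data is an assumption for illustration, as is the function name.

```python
def iterate(start, end, data):
    """Yield (count, data) once for each count from start to end inclusive."""
    for count in range(start, end + 1):
        yield count, data


if __name__ == "__main__":
    outputs = list(iterate(2, 5, "x"))
    # The number of outputs equals end - start + 1.
    print(len(outputs), outputs)
```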
A variant of the “iterate” component is one in which the “start” and “end” values are constant for all time. This is a common occurrence, used either to generate a repeating “count” stream or to replicate data more precisely than with an “echo” component, and can greatly simplify the implementation of the component.
Loops having non-constant exit conditions or data dependencies between separate iterations of the loop must be manually constructed. To achieve this, a two-input “auto-merge” component is used at the entrance of the loop, where one input to the “auto-merge” represents initial entry into the loop and the other input represents re-execution of the loop body. Two-output “route” components may be used for exit conditions as many times as desired around the loop body, where one output of the route represents continuing execution of the loop and the other output represents an exit from the loop.
There are some intricacies involved in implementing ‘for’ loops as shown in
Locking a portion of a dataflow graph so that only one element of data can reside within it at any given time is achievable using a “mutex” component. The “mutex” component has two data throughways which must be traveled alternately: one piece of data must go through the “lock” throughway before data is allowed through the “unlock” throughway. This locking and unlocking is then performed on either side of the restricted structure, ensuring that only one data item is operating within the structure at any time.
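The alternating behavior of the two throughways may be modeled in software as a single bit of state. The Python class below is a behavioral sketch only; the class and method names are assumptions, and each method returns whether the throughway accepted a data item on that attempt.

```python
class Mutex:
    """Behavioral model of a component with alternating lock/unlock throughways."""

    def __init__(self):
        self.locked = False

    def lock(self):
        """Try to pass one item through the "lock" throughway."""
        if self.locked:
            return False  # an item is already inside the protected structure
        self.locked = True
        return True

    def unlock(self):
        """Try to pass one item through the "unlock" throughway."""
        if not self.locked:
            return False  # nothing is inside, so nothing can leave
        self.locked = False
        return True


if __name__ == "__main__":
    m = Mutex()
    print(m.lock(), m.lock(), m.unlock(), m.lock())
```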
Given the “mutex” component above, any section of a dataflow graph may be locked to a single data value as shown in
The issues involved in implementing loops highlight a more general concern when implementing traditionally sequential processes in a parallel execution environment: ensuring that certain operations are performed in the appropriate order. For example, when a memory read is supposed to be performed after a memory write, how is this dependency expressed? For this reason, “memory-write” operators typically have an output stream of “void” data type, to indicate the completion of a write. Furthermore, there are purely control-oriented components available, such as the “pacer” and “mutex” components, to ensure that the proper sequence is maintained.
The almost physical presence of data when using DFC protocols, along with the ability of the “void” data type to explicitly describe control structures, allows any general state machine to be constructed from DFC components. Such a state machine may start with “void” data present in its initial state, at the input to a particular component. Various “auto-merge” and “route” components may then be used to control the movement of the state “token” to other positions within the structure. While this technique is overkill for simple state machines, it becomes extremely useful in the context of controlling high-level computational structures.
As an example, consider the implementation of such a state machine where the individual states are:
Once the “init”, “compute”, and “read” operations are constructed to be triggered via “void” data tokens and to report completion tokens, they may be controlled by a state machine whose state token is fanned out to trigger processes and paced and routed by the completion signals of these processes.
While the state machine implementation shown in the right half of
To take full advantage of such a modification, a control unit is required that will keep track of the availability of the various sections of memory. The unit may feed available memory offsets to the control input whenever it is able, and then must retrieve these offsets from the control output to feed them back in. For small numbers of available memory sections, this control unit may be as simple as the expanded “token” component described earlier, where the unit has, buffered within it at start-up, the memory spaces available for computations. It feeds these out and then passes on whatever comes back to its input. The resulting “multi-threaded” structure then looks exactly as shown in the left half of, except with the understanding that the start “token” component actually contains many tokens which have actual values to be passed to the operational units as memory offsets. Because of the origins of such a component, it is referred to as a “token controller”.
When the number of start tokens is large, the above technique may not be efficient, as it would require a large FIFO that was initializable upon reset. This problem may be surmountable through the use of explicit initialization circuitry and external memory used for storage. Simplifications may be had if the values of the tokens are easily derivable via a one-to-one mapping to a contiguous range of integers, where the integers themselves could refer to bits in a large bit mask stored in a memory. A bit from the mask would then be cleared and its corresponding integer converted into the token value to feed to the state machine, and returning values from the state machine could be converted back to the corresponding integer, which in turn is used to set a bit in the bit mask. Additional circuitry would have to be employed to search the bit mask for available tokens.
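The bit-mask bookkeeping described above may be sketched in Python as follows. The linear mapping from integer to token value (a base plus a stride), the lowest-set-bit search, and all names are assumptions made for illustration; a set bit denotes an available token.

```python
class BitmaskTokenPool:
    """Tracks available tokens as set bits in an integer bit mask."""

    def __init__(self, count, base=0, stride=1):
        self.base = base
        self.stride = stride
        self.mask = (1 << count) - 1  # all tokens available at start-up

    def acquire(self):
        """Clear the lowest set bit and return its token value, or None."""
        if self.mask == 0:
            return None
        index = (self.mask & -self.mask).bit_length() - 1
        self.mask &= ~(1 << index)
        return self.base + index * self.stride

    def release(self, token):
        """Convert a returning token back to its integer and set its bit."""
        index = (token - self.base) // self.stride
        self.mask |= 1 << index


if __name__ == "__main__":
    pool = BitmaskTokenPool(3, base=0x1000, stride=0x100)
    print([pool.acquire() for _ in range(4)])
```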
An even greater simplification may be had if the state machine structure is a linear process, always returning tokens in the same order in which they were received. In such a case, the control unit is virtually identical to that used for memory-based FIFOs. What used to be the “head” pointer of the FIFO becomes the integer to be converted to the input token of the state machine. This integer is incremented, such as in a FIFO “push” operation, whenever one is available (the FIFO is not full) and the result can be fed to the state machine. Values returning from the state machine are required to correspond to the “tail” pointer, which they will, so long as the process is linear, and are returned to the control unit in a manner similar to a FIFO “pull” operation.
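For the linear case, the control unit may be sketched with two wrapping pointers, in the manner of a memory-based FIFO controller. The Python class below is illustrative; its names, and the choice to raise an error on an out-of-order return, are assumptions.

```python
class LinearTokenController:
    """Issues integers 0..count-1 in order and expects them back in order."""

    def __init__(self, count):
        self.count = count
        self.head = 0          # next integer to issue
        self.tail = 0          # integer expected to return next
        self.outstanding = 0   # tokens currently inside the state machine

    def issue(self):
        """Analogous to a FIFO push: issue the head if the pool is not exhausted."""
        if self.outstanding == self.count:
            return None
        token = self.head
        self.head = (self.head + 1) % self.count
        self.outstanding += 1
        return token

    def retire(self, token):
        """Analogous to a FIFO pull: the returning token must match the tail."""
        if self.outstanding == 0 or token != self.tail:
            raise ValueError("token returned out of order")
        self.tail = (self.tail + 1) % self.count
        self.outstanding -= 1


if __name__ == "__main__":
    ctl = LinearTokenController(2)
    print(ctl.issue(), ctl.issue(), ctl.issue())
```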
The state machine described above implies a technique useful in dataflow computations and similar to the “pass-by-reference” technique of programming languages. Just as passing arrays to functions in the C programming language is really implemented by passing a pointer to that array, passing arrays in a dataflow structure can be implemented by simply passing a pointer to the beginning of the array. One issue is the ability to accurately track the lifetime of the array in a manner that allows memory space to be recycled for use in future computations. Doing so typically requires explicit user declarations for when a memory space should be allocated or deallocated. Depending on the flexibility required, the recycling of memory spaces may either be as simple as that described in the previous section, or it may be extremely complex, requiring the hardware equivalent of C's “malloc” and “free” functions.
As an example, consider a simple implementation of pointers, as applied to a “histogram” operator. The result of a “histogram” operation on an input image could be a sequence of values representing the accumulated number of pixels at each possible pixel value, but this may be cumbersome if the downstream component wishes to access the results in a random pattern. Therefore, it is more appropriate for the result to be an address within a particular memory, in which the results are stored in an understood fashion. Thus, to begin a “histogram” operation, an appropriate amount of memory space needs to be allocated and initialized to zero. The incoming pixel values are then used to increment the appropriate accumulation in memory. Once the agreed upon number of pixels has been accumulated, the memory image of the histogram is complete, and the pointer to the beginning of the results can be passed out as the result.
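The sequence of steps in this example may be sketched in Python, with a list standing in for the memory and an integer offset standing in for the pointer. The function name and its parameters are assumptions for illustration.

```python
def histogram(memory, base, pixels, levels):
    """Accumulate a histogram in memory[base : base + levels] and return base."""
    # Initialize the allocated region to zero.
    for level in range(levels):
        memory[base + level] = 0
    # Each incoming pixel value increments its accumulation in memory.
    for pixel in pixels:
        memory[base + pixel] += 1
    # The result passed downstream is the pointer, not the data itself.
    return base


if __name__ == "__main__":
    memory = [99] * 8
    pointer = histogram(memory, base=4, pixels=[0, 1, 1, 3], levels=4)
    print(pointer, memory[pointer:pointer + 4])
```

A downstream consumer holding the returned offset can then read any bin in any order.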
An “alloc” block, as shown in
Such an implementation of memory allocation still has the complications associated with its equivalent in programming languages: memory leaks when an offset isn't passed to “free”, and controller corruption when the same offset is inappropriately passed to “free” more than once. When a simple memory controller, such as a token controller, is used to track memory allocation, memory leaks will eventually discard all valid tokens leaving none available for allocation and the process will halt. Inappropriately passing a token to “free” more than once will allow it to be simultaneously allocated more than once, and could eventually clog the loop from the “alloc” to the “free” blocks.
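Both failure modes can be reproduced with a small software model of a token controller. The Python class below is a sketch under the assumption that the controller is a simple queue preloaded with the available offsets; the class and method names are hypothetical.

```python
from collections import deque


class TokenController:
    """Queue of available memory offsets, preloaded at start-up."""

    def __init__(self, offsets):
        self.available = deque(offsets)

    def alloc(self):
        """Hand out the next available offset, or None if none remain."""
        return self.available.popleft() if self.available else None

    def free(self, offset):
        """Return an offset to the pool; no check for duplicates is made."""
        self.available.append(offset)


if __name__ == "__main__":
    leaky = TokenController([0, 64])
    leaky.alloc()
    leaky.alloc()                  # neither offset is ever freed
    print(leaky.alloc())           # allocation can no longer proceed

    doubled = TokenController([0])
    offset = doubled.alloc()
    doubled.free(offset)
    doubled.free(offset)           # freed twice
    print(doubled.alloc(), doubled.alloc())  # same offset allocated twice
```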
1. Circuit Methodology
Methodology for Assembling Computational Operators
A series of parameterized transformations can describe the proper method for connecting computational elements using the flow-control protocols of dataflow components (DFC) as disclosed herein. By maintaining parameters that describe the flow-control protocols used at a particular point in the computation, the data stream may be translated into any other desired protocol using the proper set of simple transformations. This provides accommodations for hardware and other environmental restrictions such as a set number of memory connections or an upper bound on interconnection throughput.
Control Correct by Assembly
Because the DFC transformations assure that data is not lost or fabricated, a given computation is subsequently guaranteed to function exactly as described. The function may be designed at a level of abstraction above the actual hardware implementation, allowing the programmer to focus solely on the desired computation while ensuring that the underlying result will behave properly. Since the control protocols are programmatically created, the programmer need not be concerned about the details and has no opportunity to implement the control incorrectly.
The transformations available to DFC components allow the data flow to be paused without the need for a high-level controller, which could become overly-complicated as the specified computation evolved into the desired result. In contrast, the transformations rely on simple, localized hardware elements, each of which need only manipulate a portion of data flowing through the complete computational graph.
Synchronized Data Protocol
By properly maintaining parameters which describe the data flow through a series of pipelined components (e.g., skew), several pipelined computational elements may be connected together. Reconvergent fanout of data may be accommodated so as to automatically generate a functionally correct design. Furthermore, translations between the “synchronized data” protocol and the DFC protocols may be achieved in a manner which is guaranteed to function properly. While having no flow control itself, the “synchronized data” protocol has the advantage of requiring near-zero hardware overhead to implement simple computations and offers the ability of being translated back into a suitable flow-control protocol when necessary.
Memory FIFO Protocol
Some computations operate on an extreme amount of input data which may need to be accessed randomly and repeatedly, rather than sequentially. With such computations, it is inappropriate to stream data from one computational element to the next, as each unit would need to buffer up the entire data set before performing its calculations. The memory-FIFO protocol is layered on top of the standard dataflow protocols so as to alleviate these inefficiencies. Data sets stay in one place in memory while a pointer indicating the offset into that memory is passed from one computation to the next using standard flow control. Recycling of memory space is achieved with a memory-FIFO controller.
Components in typical schematic capture schemes have fixed port maps. They have a specific number of ports, with specific names and specific types. According to the invention, components have non-static dynamic port maps that can add and remove ports in response to user actions. Such components are part of a system according to the invention designated COREFIRE™.
Ports in typical, conventional schematic capture schemes have fixed types. Components in the system according to the invention support classes of port interfaces. For instance, according to the invention using the protocols discussed herein, an adder can be defined that has two inputs called A and B and one output called SUM. A, B and SUM can adapt to become any bit width of any supported type. For instance, A could be an 8-bit unsigned integer, B could be a 32-bit IEEE float, and SUM could be a 24-bit signed integer.
Conventional examples of programmatic structural assembly of digital circuits, for example JHDL, create structural models of hardware at object creation time. The system according to the invention separates object creation from rendering to hardware. This allows the parameters of an existing circuit to be modified after it is created. One example of such variation is adjustment of bit width.
According to the Invention, Components can be Generated Using the Protocols Disclosed Herein to Adapt Based on Whether Their Ports are Used or Not.
Statically port mapped circuit models require that all of their input ports be connected in some functionally correct manner. According to the invention, dynamic port maps allow certain ports to be optional. The ENABLE and RESET ports of a flip-flop, for instance, may simply be removed at rendering time if they are unconnected.
Typical conventional schematic capture schemes provide libraries of components with fixed parameterization. For instance, if a circuit contains an 8-bit adder and a 16-bit adder is desired, the 8-bit adder must be removed and a 16-bit adder put in its place. According to the invention, using protocols disclosed herein, components can adapt based on their interconnection with other components. For example, according to the invention, components may derive parameters like their bit width based on their interconnection with other components. Indeed, it is possible to reconfigure the bit width of a design by simply changing it in one place and allowing the circuit to adapt to the new situation. Moreover, in addition to deriving their parameterizations from their interconnection with other components, some components must be set to fixed bit widths or other parameters. According to the invention, the components can be configured to override the adaptive propagation mechanism and mandate certain parameterizations.
According to the invention, components can also be configured with the protocols to adjust port maps as needed. An adder, for example, has several inherent constraints on its port map. A typical adder has A, B and SUM ports. All three start off with an unknown type. If the user connects an 8-bit unsigned integer to the A port, the component can make certain inferences about its remaining ports. It knows, for instance, that its output port must be at least 9 bits. If a 9-bit unsigned integer is connected to the SUM port, the component knows further that it can only accept unsigned inputs of 8 bits or less on the B port.
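The inferences in the adder example follow from the fact that the sum of two unsigned integers needs one more bit than the wider operand. A minimal Python sketch of that arithmetic is given below; the function names are assumptions, and only unsigned integer ports are modeled.

```python
def min_sum_width(a_bits, b_bits=None):
    """Smallest SUM width that cannot overflow, given the known input widths."""
    widest = a_bits if b_bits is None else max(a_bits, b_bits)
    return widest + 1


def max_input_width(sum_bits):
    """Widest unsigned input an adder can accept for a fixed SUM width."""
    return sum_bits - 1


if __name__ == "__main__":
    print(min_sum_width(8))      # A is 8 bits, B still unknown
    print(max_input_width(9))    # SUM fixed at 9 bits
```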
Components according to the invention have LAD interfaces to the host system. Such components provide information to the host at runtime reporting the component's version of its well-known interface and its location in the LAD address space. This allows a runtime host application to determine the interfaces available in a programmed processing element.
Components according to the invention can also report their resource requirements at construction time. These resource requirements can be used to monitor circuit size dynamically at design time. This provides the user with immediate feedback to their design decisions.
Typical conventional schematic capture systems have flat port structures. Ports of components according to the invention can be configured to contain arbitrary hierarchy. In this way, ports can be created that support multi-signal connections with arbitrary directionality.
According to the invention, data flow components (DFC) and computational components can be implemented. An exemplary list of DFC related components includes:
ClearSkew is a circuit structure implementing a DFC transform that converts a PDR or VDR stream that may or may not have skew to a VDR stream with no skew.
PushCast is a circuit structure implementing a DFC transform from VDR to PDR.
AddSkew is a circuit structure implementing a DFC transform that adds skew to a stream. Typically, it is used to alleviate timing problems during place and route.
Sync is a circuit structure that combines two DFC streams into one synchronized two-element DFC stream.
The Valve component is a DFC component with an LAD host interface. It is primarily a debugging tool. It allows a certain amount of data to pass through, then pauses for interaction with a host computer. Valve can be used to provide host-observability into the computational process.
FPDP to DFC Protocol Converter
Provides functional conversion from FPDP protocol to PDR protocol.
WSDP to DFC Protocol Converter
Provides functional conversion from WSDP protocol to PDR protocol.
MyriNet to DFC Protocol Converter
Provides functional conversion from MyriNet protocol to PDR protocol.
A/D to DFC Protocol Converter
Provides functional conversion from analog signaling to PDR protocol.
Unidirectional PE-to-PE DFC Connections Implemented with Scarce Interconnect
PDR-based component that allows data to move between processing elements in one direction.
Bidirectional PE-to-PE DFC Connections Implemented with Scarce Interconnect
PDR-based component that allows data to move between processing elements in both directions.
PDR to Memory Protocol Converter
Components that source or sink PDR streams to or from memories.
DMA to DFC Protocol Converter
Components that source or sink PDR streams to or from the host using DMA.
Computational Components Include:
FFT cores can be adapted to appear as standard dataflow components using a modified form of the memory-FIFO protocol. Incoming data is retired to FFT memory and a ‘pointer’ to that memory is then passed to the FFT control circuitry. When the FFT operation is complete, that same ‘pointer’ is then passed to a memory-fetch unit which reads out the results and passes them on to the next computation. Modifications to the memory-FIFO protocol involve an additional level of control to guarantee that the memory crossbar is not switched at inappropriate times.
Image Processing Window
The Window operator takes a PDR stream containing a rasterized image and provides an output PDR stream containing a parallel sliding-window view of the image. This allows operations that work on windows, like edge detection, to take a raster image as input and thus minimize their memory overhead.
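A software analog of the Window operator is sketched below in Python. It buffers only as many pixels as a k-by-k window spans in raster order, rather than the whole image. The function name, the square window, and the row-major pixel order are assumptions for illustration.

```python
from collections import deque


def sliding_windows(pixels, width, k=3):
    """Yield k-by-k windows (as lists of rows) from a row-major pixel stream."""
    # A window spans k - 1 full image rows plus k pixels of the current row.
    buffer = deque(maxlen=(k - 1) * width + k)
    for index, pixel in enumerate(pixels):
        buffer.append(pixel)
        row, col = divmod(index, width)
        # Emit once the newest pixel can serve as the bottom-right corner.
        if row >= k - 1 and col >= k - 1:
            held = list(buffer)
            yield [held[r * width:r * width + k] for r in range(k)]


if __name__ == "__main__":
    for window in sliding_windows(range(12), width=4, k=3):
        print(window)
```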
Variable Priority Encoder Using Carry Logic
The primitive carry-logic elements found within FPGAs may be used to implement priority encoding. By remembering previous encoder results and feeding them back into the carry logic, the priority encoder may be modified to implement a variable priority scheme. This variable priority can be far more fair in terms of its acknowledgment of many outstanding requests, while still maintaining the maximum theoretical throughput of the fixed priority encoder.
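The arbitration behavior, as distinct from its carry-logic realization, may be sketched in Python. The sketch assumes one particular variable-priority scheme, in which the requester just after the previous grant receives the highest priority; that rotating policy and the function name are illustrative assumptions rather than the disclosed circuit.

```python
def rotating_priority(requests, last_grant):
    """Return the index of the granted request, or None if none is pending.

    requests is a list of booleans; last_grant is the previously granted index.
    """
    count = len(requests)
    # Search starts just after the previous grant and wraps around.
    for offset in range(1, count + 1):
        index = (last_grant + offset) % count
        if requests[index]:
            return index
    return None


if __name__ == "__main__":
    grant = 0
    for _ in range(4):
        grant = rotating_priority([True, True, True], grant)
        print(grant)
```

When every requester is pending, the grants in this sketch cycle through all of them in turn, whereas a fixed priority encoder would always grant the same one.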
The above lists of DFC and computational components according to the invention are illustrative. Other useful components can be formed using the protocols disclosed herein.
Typical schematic capture systems model connections between the ports of components as simple wires. According to the invention connections between components allow components to discover the types of ports they are connected to and adapt accordingly.
According to the invention, connections can imply complex glue logic. DFC connections are more complex than simple wires. Fanout of a DFC connection, for instance, requires some glue logic to accommodate reconvergence of READY signals. This logic need not be explicitly represented by the user, but may be inferred at build time based on the connection for the protocol.
According to the invention, designs are portable, and are not dependent on the resources of a particular hardware platform or architecture. Designs are created as a hierarchy of “Diagrams”—a Diagram is a collection of primitive intelligent components implementing the protocols disclosed herein, data sources and sinks, and other Diagrams. A Diagram is a reusable design element that is targeted to a specific hardware platform during a separate resource binding stage that is independent of design construction.
The Diagrams which comprise a design contain no hardware platform specific information. Therefore, each Diagram may be reused by including it in other designs.
A design is targeted to a specific hardware platform by binding the data sources and sinks contained in the design Diagrams to data sources and sinks provided by a selected hardware platform. A design may be retargeted to a different hardware platform by changing the bindings of its data sources and sinks.
According to the invention, components may be parameterized explicitly by a user via a design editor, or automatically via a type propagation mechanism. The type propagation mechanism and the automatic adaptation of components within a design are useful for maintaining the reusability of such designs. When a connection is made between a port with a specified data type and a port with an unspecified data type, the port with the unspecified data type may determine its type from the type of the port it is connected to. For example, if a component has a 32 bit signed integer output, and that output is connected to an unspecified input of another component, the type propagation mechanism can set the input port to 32 bit signed integer type.
The type propagation mechanism allows components to adapt when they are inserted into a circuit. In addition, according to the invention, components may use type changes that are propagated onto one of their ports to set the types of their other ports. This causes further waves of type propagation through the circuit.
According to the invention, design components can be automatically re-used to minimize the amount of logic needed to instantiate a design. Reuse may be applied to both platform resources and computational operations. For example, when a memory resource is used by multiple design components, multiplexers may be automatically inserted to effect multiple use of the memory. Likewise, computational resources (a multiplier, for example) may be reused by automatic insertion of multiplexing hardware. Such automatic time multiplexing of computational operations may be implemented using software.
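By way of illustration only, type propagation of this kind can be modeled as a worklist traversal. The Python sketch below is an assumption-laden simplification: port names are plain strings, a type is a string such as "int32", and a component's internal rule is reduced to "these ports share one type" (the hypothetical `same_type_groups` argument). Real components could apply richer rules, such as widening an output.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def propagate_types(
    types: Dict[str, Optional[str]],
    connections: List[Tuple[str, str]],
    same_type_groups: List[List[str]],
) -> Dict[str, Optional[str]]:
    """Fill in unspecified port types (None) from connected ports."""
    # Build an undirected graph: wires between components plus the
    # intra-component links that make one port follow another.
    neighbours = {port: set() for port in types}
    for a, b in connections:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for group in same_type_groups:
        for port in group:
            neighbours[port].update(p for p in group if p != port)

    resolved = dict(types)
    queue = deque(port for port, t in resolved.items() if t is not None)
    while queue:
        port = queue.popleft()
        for other in neighbours[port]:
            if resolved[other] is None:
                # An unspecified port adopts its neighbour's type, which
                # starts a further wave from that port.
                resolved[other] = resolved[port]
                queue.append(other)
            elif resolved[other] != resolved[port]:
                raise TypeError(f"type conflict between {port} and {other}")
    return resolved


# A 32 bit signed output feeds an untyped register, which feeds a sink.
result = propagate_types(
    {"src.out": "int32", "reg.in": None, "reg.out": None, "sink.in": None},
    connections=[("src.out", "reg.in"), ("reg.out", "sink.in")],
    same_type_groups=[["reg.in", "reg.out"]],
)
```

In this example the type set on `reg.in` is carried to `reg.out` by the register's own rule and then across the second connection to `sink.in`.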
Processing element designs produced using techniques, components and connections according to the invention can contain a LAD interface at a well-known location that provides information about the other LAD interfaces present in the system. This interface may consist of a memory element containing the interface name, version and LAD offset of each interface present in the design. Host software could interface with a design and provide an automatically generated graphical user interface to the processing element.
The presence of PLLs, DLLs and pipelined memories in high-performance FPGA architectures makes it very difficult to perform traditional single-step debugging of hardware designs. Using specialized DFC components, a form of single-step debugging at full clock rates is possible. According to the invention, Valve components can be placed throughout a circuit. These Valve components can be programmed by a host computer to allow a certain number of pushes to propagate through them. The Valves can store the data values that passed through them. When the predefined number of pushes has passed, the Valve asserts that it is not ready and waits for interaction from the host. The host can query the Valve to see the data that passed through and prompt it to allow more data through. Valves can also be used to bypass portions of the design by allowing the host to directly insert data in the stream. This would, for instance, allow testing of the computational portion of a design without having to use external test fixtures to source data through the I/O portion of a design.
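By way of illustration only, the Valve behavior described above can be modeled in software. The Python class below is a hypothetical sketch, not the hardware component: the method names `allow`, `push` and `inject`, the `history` list, and the callback used as the downstream consumer are all assumptions made for the example.

```python
from typing import Callable, List


class Valve:
    """Software model of a host-controlled Valve in a push-based stream."""

    def __init__(self, downstream: Callable[[int], None]) -> None:
        self._downstream = downstream
        self._remaining = 0            # pushes the host has authorized
        self.history: List[int] = []   # values that passed through

    @property
    def ready(self) -> bool:
        """The Valve reports not ready once its allowance is used up."""
        return self._remaining > 0

    def allow(self, pushes: int) -> None:
        """Host side: let a further number of pushes propagate."""
        self._remaining += pushes

    def push(self, value: int) -> bool:
        """Upstream side: returns False when the Valve is not ready."""
        if not self.ready:
            return False
        self._remaining -= 1
        self.history.append(value)
        self._downstream(value)
        return True

    def inject(self, value: int) -> None:
        """Host side: insert data directly, bypassing the upstream logic."""
        self._downstream(value)


received: List[int] = []
valve = Valve(received.append)
valve.allow(2)                                    # host permits two pushes
accepted = [valve.push(v) for v in (10, 20, 30)]  # the third push is refused
valve.inject(99)                                  # host sources test data itself
```

After the two permitted pushes the host can inspect `valve.history`, then either call `allow` again to step further or use `inject` to exercise the downstream logic without the upstream I/O.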
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US4418383 *||Jun 30, 1980||Nov 29, 1983||International Business Machines Corporation||Data flow component for processor and microprocessor systems|
|US5021947 *||Jan 30, 1990||Jun 4, 1991||Hughes Aircraft Company||Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing|
|US5455928||Jun 14, 1993||Oct 3, 1995||Cadence Design Systems, Inc.||Method for modeling bidirectional or multiplicatively driven signal paths in a system to achieve a general purpose statically scheduled simulator|
|US5842045 *||Aug 9, 1996||Nov 24, 1998||Nec Corporation||Terminal unit having a universal multi-protocol modular jack automatically sets its protocol to match with protocol of a modular plug connecting to the jack|
|US5852449||Jan 27, 1992||Dec 22, 1998||Scientific And Engineering Software||Apparatus for and method of displaying running of modeled system designs|
|US5870588 *||Oct 23, 1996||Feb 9, 1999||Interuniversitair Micro-Elektronica Centrum(Imec Vzw)||Design environment and a design method for hardware/software co-design|
|US5872810||Jan 26, 1996||Feb 16, 1999||Imec Co.||Programmable modem apparatus for transmitting and receiving digital data, design method and use method for said modem|
|US5911776 *||Dec 18, 1996||Jun 15, 1999||Unisys Corporation||Automatic format conversion system and publishing methodology for multi-user network|
|US5978574||Nov 5, 1997||Nov 2, 1999||Hewlett-Packard Company||Formal verification of queue flow-control through model-checking|
|US6044211 *||Mar 14, 1994||Mar 28, 2000||C.A.E. Plus, Inc.||Method for graphically representing a digital device as a behavioral description with data and control flow elements, and for converting the behavioral description to a structural description|
|US6182024||Oct 14, 1997||Jan 30, 2001||International Business Machines Corporation||Modeling behaviors of objects associated with finite state machines and expressing a sequence without introducing an intermediate state with the arc language|
|US6233540||Mar 13, 1998||May 15, 2001||Interuniversitair Micro-Elektronica Centrum||Design environment and a method for generating an implementable description of a digital system|
|US6272451 *||Jul 16, 1999||Aug 7, 2001||Atmel Corporation||Software tool to allow field programmable system level devices|
|US6356796||Dec 17, 1998||Mar 12, 2002||Antrim Design Systems, Inc.||Language controlled design flow for electronic circuits|
|US6393515 *||Jan 6, 2000||May 21, 2002||Nms Communications Corporation||Multi-stream associative memory architecture for computer telephony|
|US6421808 *||Apr 22, 1999||Jul 16, 2002||Cadance Design Systems, Inc.||Hardware design language for the design of integrated circuits|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US8397213 *||Sep 19, 2006||Mar 12, 2013||Silicon Software Gmbh||Hardware programming and layout design|
|US8434045 *||Mar 14, 2011||Apr 30, 2013||Tabula, Inc.||System and method of providing a memory hierarchy|
|US8797062||Jul 16, 2012||Aug 5, 2014||Tabula, Inc.||Configurable IC's with large carry chains|
|US8943448 *||May 23, 2013||Jan 27, 2015||Nvidia Corporation||System, method, and computer program product for providing a debugger using a common hardware database|
|US9195786 *||Mar 9, 2015||Nov 24, 2015||Mentor Graphics Corp.||Hardware simulation controller, system and method for functional verification|
|US9244810||May 23, 2013||Jan 26, 2016||Nvidia Corporation||Debugger graphical user interface system, method, and computer program product|
|US20080256511 *||Sep 19, 2006||Oct 16, 2008||Silicon Software Gmbh||Hardware Programming and Layout Design|
|US20140351775 *||May 23, 2013||Nov 27, 2014||Nvidia Corporation||System, method, and computer program product for providing a debugger using a common hardware database|
|US20150178426 *||Mar 9, 2015||Jun 25, 2015||Mentor Graphics Corporation||Hardware simulation controller, system and method for functional verification|
|U.S. Classification||718/106, 709/237, 703/14|
|International Classification||G06F15/16, G06F17/50, G06F9/46|
|Cooperative Classification||G06F17/5022, G06F17/5045|
|European Classification||G06F17/50C3, G06F17/50D|
|May 23, 2002||AS||Assignment|
Owner name: ANNAPOLIS MICRO SYSTEMS, INC., MARYLAND
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DONALDSON, ROBERT L.;HUDSON, RHETT D.;MARSHALL, LAWRENCE M., JR.;AND OTHERS;REEL/FRAME:012921/0720
Effective date: 20020522
|Sep 7, 2010||CC||Certificate of correction|
|Sep 28, 2012||FPAY||Fee payment|
Year of fee payment: 4
|Sep 28, 2012||SULP||Surcharge for late payment|