US 20020100031 A1

Abstract

One aspect of the invention includes a method of address expression optimization of source-level code. The source-level code describes the functionality of an application to be executed on a digital device. The method comprises first inputting a first source-level code that describes the functionality of the application into an optimization system. The optimization system then transforms the first source-level code into a second source-level code that has fewer nonlinear operations than the first source-level code.
Claims (37)

1. A method of optimizing address expressions within source-level code, wherein the source-level code describes the functionality of an application to be executed on a digital device, the method comprising:
inputting a first source-level code that describes the functionality of the application, the first source-level code comprising address computation code and a plurality of arrays with address expressions, wherein the address computation code or one of the address expressions has nonlinear operations; and transforming the first source-level code into a second source-level code that describes substantially the same functionality as the first source-level code, wherein the second source-level code has fewer nonlinear operations than the first source-level code, and wherein the digital device comprises at least one programmable instruction set processor.

2.-34. The method of …

35. A system for address expression optimization of source-level code, wherein the source-level code describes the functionality of an application to be executed on a digital device, wherein the digital device comprises at least one programmable instruction set processor, the system comprising:
means for inputting a first source-level code that describes the functionality of the application, the first source-level code comprising address computation code and a plurality of arrays with address expressions, wherein at least the address computation code or one of the address expressions has nonlinear operations; and means for transforming the first source-level code into a second source-level code that describes substantially the same functionality as the first source-level code, wherein the second source-level code has fewer nonlinear operations than the first source-level code.

36. A system for address expression optimization of source-level code, wherein the source-level code describes the functionality of an application to be executed on a digital device, the system comprising:
an optimizing system for receiving a first source-level code that describes the functionality of the application, wherein the first source-level code has nonlinear operations, and wherein the optimizing system transforms the first source-level code into a second source-level code that has fewer nonlinear operations than the first source-level code, and wherein the digital device comprises at least one programmable instruction set processor.

37. Executable code that has been optimized by the method comprising:
transforming first source-level code that defines the operation of the executable code into a second source-level code that describes substantially the same functionality as the first source-level code, wherein the second source-level code has fewer nonlinear operations than the first source-level code, and wherein the second source-level code has been compiled for subsequent execution on a programmable instruction set processor.

Description

[0001] The present application claims priority to and incorporates by reference, in its entirety, U.S. Provisional Application No. 60/176,248, filed Jan. 14, 2000, titled “An Apparatus and Method For Address Expression Optimisation.” [0002] 1. Field of the Invention [0003] The field of the invention relates to digital devices. More particularly, the field of the invention relates to optimizing high-level source code for digital devices comprising programmable processors. [0004] 2. Description of the Related Technology [0005] As is known in the art, a software application can be described by a high-level programming language or code, e.g., Pascal, Fortran, C++. Although there has been some research in the field of high-level programming language optimization, there has been little work done in the field of optimizing code that is targeted for programmable processors. For example, the article, Miranda M., Catthoor F., Janssen M. and De Man H., “High-level address optimization and synthesis techniques for data-transfer intensive applications”, IEEE Transactions on VLSI systems, Vol. 6, No. 4, pp. 677-686, December 1998, describes a system for optimizing source code that is targeted for custom processor design or synthesis. However, this article fails to describe how address optimization can be applied in a context wherein resources are fixed or are predetermined.
[0006] Furthermore, the article, Liem C., Paulin P., Jerraya A., “Address calculation for retargetable compilation and exploration of instruction-set architectures”, Proceedings 33 [0007] For example, in Liem, address expression optimization is performed on pointer types, which masks the actual address expression. Since instruction-level code is typically longer than source code, pointer-type optimization is typically limited to smaller blocks of the code. In instruction-level optimization, a block of code typically relates to a branch of the code that is induced by a conditional statement. In Liem, since only parts of the code are examined in the optimization and due to the masking of the address expression, global optimization opportunities are overlooked. Furthermore, this approach does not describe how to provide optimization methods for partly programmable processors wherein the optimization method is processor-independent. [0008] Thus, there is a need for examining address expression optimization for source code as opposed to instruction code. There is also a need for a system and method of optimizing high-level source code wherein the optimization is machine-independent.
[0009] One aspect of the invention includes a method of address expression optimization of source-level code, the source code describing the functionality of an application to be executed on a digital device, the method comprising: inputting a first source-level code that describes the functionality of the application, the first source-level code comprising address computation code and a plurality of arrays with address expressions, wherein at least the address computation code or one of the address expressions has nonlinear operations; and transforming the first source-level code into a second source-level code describing the same functionality as the first source-level code, wherein the second source-level code has fewer nonlinear operations than the first source-level code, and wherein the digital device comprises at least one programmable instruction set processor. [0010] Another aspect of the invention includes a method of address expression optimization of source-level code, wherein the source-level code describes the functionality of an application to be executed on a digital device, the method comprising: inputting a first source-level code describing the functionality of the application, the first source-level code comprising arithmetic computation code and a plurality of arithmetic expressions in one or more algorithm statements, wherein the arithmetic computation code or one of the arithmetic expressions has nonlinear operations; and transforming the first source-level code into a second source-level code describing the same functionality as the first source-level code, wherein the second source-level code has fewer nonlinear operations than the first source-level code, and wherein the digital device comprises at least one programmable instruction set processor. [0011] FIG. 1 is a dataflow diagram showing an optimization system for optimizing source-level code. [0012] FIG.
2 is a flowchart illustrating an optimization process that is performed by the optimization system of FIG. 1. [0013] FIG. 3 is a diagram showing exemplary source-level code that may be processed by the optimization system of FIG. 1. [0014] FIG. 4 is a diagram showing the exemplary source code of FIG. 3 subsequent to an algebraic transformation and common subexpression elimination process of the optimization system of FIG. 1. [0015] FIG. 5 is a diagram illustrating exemplary algebraic transformations that may be performed by the optimization system of FIG. 1. [0016] FIGS. [0017] FIG. 9 is a diagram showing the source code of FIG. 8 subsequent to performing a second iteration of algebraic transformations and common subexpression elimination. [0018] FIG. 10 is a diagram showing the source code of FIG. 9 subsequent to applying a second iteration of code hoisting. [0019] FIG. 11 is a diagram illustrating another piece of exemplary source code. [0020] FIG. 12 is a diagram illustrating the source code of FIG. 11 subsequent to the optimization system of FIG. 1 reducing the operator strength of the source code. [0021] FIG. 13 shows a decision tree that may be used to decide, for given expressions in two different conditional scopes, whether it is beneficial to perform code hoisting. [0022] The following detailed description is directed to certain specific embodiments of the invention. However, the invention can be embodied in a multitude of different ways as defined and covered by the claims. [0023] FIG. 1 is a dataflow diagram illustrating a process of optimizing a first high-level source code [0024] The high-level source code [0025] Source-level descriptions are to be distinguished from instruction-level descriptions. Instruction-level descriptions include single-operation instructions, each instruction having one or two operands and at most one operation per line of code. In source-level code, a line of code can have multiple operands and multiple operations.
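As an illustration of this distinction (the variable names and the row width `W` are hypothetical, not taken from the patent), the same address computation can be written at source level in one line, or lowered to a single-operation, instruction-level form:

```c
#include <assert.h>

/* Source level: one line, multiple operands and multiple operations. */
int address_source_level(int x, int y, int W) {
    return (x % 3) * 3 + y * W;
}

/* Instruction level: the same computation lowered to single-operation
 * steps, each line holding at most one operation. */
int address_instruction_level(int x, int y, int W) {
    int t1 = x % 3;   /* t1 <- x mod 3 */
    int t2 = t1 * 3;  /* t2 <- t1 * 3  */
    int t3 = y * W;   /* t3 <- y * W   */
    int t4 = t2 + t3; /* t4 <- t2 + t3 */
    return t4;
}
```

Both forms compute the same address; the source-level form is the representation on which the optimization system operates.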
Before source code can be executed, it needs to be converted into compiled code. [0026] By rewriting source code under the constraint that the functionality of the application is maintained, better execution of the application can be achieved. Better execution can include a reduction in power consumption or faster execution in terms of run cycles. [0027] One aspect of the present invention relates to code rewriting for relatively fixed hardware resources, such as is found in general-purpose programmable processors, e.g., general-purpose RISC machines or more domain-specific digital signal processors or ASIPs. As part of the optimization process, the first source-level code [0028] In one embodiment of the invention, the optimization that is performed by the optimization system [0029] As an example, the optimization system [0030] FIG. 2 is a flowchart illustrating in overview one embodiment of an optimization process of the optimization system [0031] Next, the optimization system [0032] Proceeding to a state [0033] Continuing to a state [0034] Next, at the state [0035] In one embodiment of the invention, a range analysis step is used before applying code hoisting. In one embodiment of the invention, code hoisting is never applied to expressions which do not have overlapping ranges. Furthermore, code hoisting is only applied to expressions having at least one common factor. [0036] The factorization (state [0037] In one embodiment, the first iteration of the code hoisting step (state [0038] Proceeding to a state [0039] The cost minimization problem can use any kind of optimization approach. As the actual cost associated with a source-level code is the cost in terms of power consumption, speed, and other measures of performance of a digital device executing the source-level code, or even measures of performance of a digital device still to be designed, it is to be understood that only cost estimates are used within the cost minimization approach.
[0040] The efficiency of a cost minimization problem can be increased by selecting basic transformation steps which have a large impact. It is an aspect of the invention to use composite transformations, which are each finite sequences of elementary transformations. The composite transformations have look-ahead, local-minima-avoiding capabilities (described below). An elementary transformation is applied when a single set of preconditions is satisfied. The preconditions can be formulated at the level of the source code on which the elementary transformation will be applied, but preferably an alternative representation of the source code is used, as is explained further below. A composite transformation is applied when one of a plurality of sets of preconditions is satisfied. [0041] Local-minima-avoiding capabilities allow the optimization engine [0042] If the source code has not been modified sufficiently, the process proceeds to a state [0043] Referring again to the decision state [0044] One particularly advantageous feature of the optimization system [0045] Data Model [0046] Because source-level code is not a suitable format for automatic handling by a computing device, an alternative efficient representation of source-level code is used. In one embodiment of the invention, the optimization system [0047] The use of a shared data-flow graph enables the joining of intrinsically disjunct graphs within a common graph, hence enabling larger optimization opportunities. A (sub)graph is disjunct when the related expressions have no iterators in common. When the related expressions have only a few iterators in common, sharing can be beneficial. The concept of a shared data-flow graph allows for time-multiplexing of address calculations [0048] The transformations can be subdivided into two categories: (i) elementary transformations, and (ii) composite transformations.
[0049] Elementary transformations are simple shared data-flow graph transformations having strict preconditions that check the validity of a transformation. Elementary transformations are the lowest level of transformations that are executed on a shared data-flow graph. In one embodiment of the invention, the elementary transformations are the only transformations that change the structure of a shared data-flow graph. Composite transformations compose sequences of elementary transformations. Composite transformations do not change the structure of a shared data-flow graph directly. For example, certain elementary transformations are shown in FIG. 5. It is noted that there are a number of other commonly known elementary transformations that may also be performed, e.g., replacing a multiply operation that includes a power-of-two constant with a shift operation with a constant. [0050] Each elementary transformation is characterized by: (i) a source and target subgraph, (ii) handles, (iii) preconditions, (iv) postconditions, and (v) relationships. The source subgraph is the subgraph wherein a transformation is to be applied. The target subgraph is the subgraph that results from a transformation. The handles specify the starting point for finding transformation candidates. The preconditions of a transformation define the conditions that have to be satisfied before the transformation can be executed. The postconditions of a transformation define the conditions that hold after the transformation has been executed. The relationships of a transformation define a mapping of characteristics between the subgraph defined by the transformation before being executed and the subgraph defined by the transformation after being executed. [0051] The source and target subgraphs of an elementary transformation are parameterized, i.e., several aspects are allowed to vary, such as the type of an operation.
The amount of variation is defined by the preconditions of an elementary transformation. Preconditions can be seen as patterns, which can be matched against parts of a shared data-flow graph to verify whether the transformation can be applied there. The handles define a unique starting point for pattern matching. If pattern matching is successful, the result is a set of valid values for the parameters of the source subgraph. To be able to create the appropriate target subgraph instance, these values are used in the relationships of the transformation to compute the set of valid values for the parameters of the target subgraph. The postconditions of an elementary transformation describe the situation in the target subgraph instance. This information is used when sequences of elementary transformations are composed. [0052] Due to the strict preconditions of the elementary transformations, their applicability to many instruction sequences is very low. Thus, certain composite transformations are executed to improve their applicability. Composite transformations help determine which transformations should be performed and in which order. Composite transformations perform a limited look-ahead to determine when transformations should be performed. [0053] Composite transformations work as follows. The strict preconditions of an elementary transformation are made more flexible by detecting “failures.” A failure is detected when a precondition is not satisfied. For each non-fatal failure there exists a short sequence of transformations which effectively repairs the failure such that the associated precondition is satisfied. Transformations can have several preconditions, and hence many combinations of failures can occur. [0054] These combinations result in a dynamic composition of sequences of transformations. This approach provides a flexible and powerful approach to ordering transformations.
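As a concrete sketch of one commonly known elementary transformation mentioned above — replacing a multiplication by a power-of-two constant with a shift — the precondition check and the rewrite might look as follows in C (the function names are illustrative, not part of the patent):

```c
#include <assert.h>

/* Precondition: the constant operand must be a nonzero power of two. */
static int is_power_of_two(unsigned c) {
    return c != 0 && (c & (c - 1)) == 0;
}

/* Compute the shift amount for an exact power of two. */
static int log2_exact(unsigned c) {
    int s = 0;
    while (c > 1) { c >>= 1; s++; }
    return s;
}

/* Returns x * c. The transformed (shift) form is used only when the
 * precondition is satisfied; otherwise the source form is kept, just
 * as an elementary transformation fires only when its pattern matches. */
int mul_by_const(int x, unsigned c) {
    if (is_power_of_two(c))        /* precondition satisfied */
        return x << log2_exact(c); /* target subgraph: shift */
    return x * (int)c;             /* source subgraph unchanged */
}
```

The precondition plays the role of the pattern described above: it is matched first, and the rewrite is applied only on success.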
[0055] Factorization Exploration [0056] The following text describes in further detail the factorization exploration steps that occur in state [0057] The factorization exploration step is applied to those operations containing modulo arithmetic. FIG. 5 shows two examples of transformations that may be made with respect to expressions that are affected by a modulo operation. [0058] As can be seen in FIGS. 5A and 5B, after this step, the total number of modulo operations in [0059] Furthermore, for example, the following code segment
[0060] can be transformed into the following code segment
[0061] [0062] The expression A=(x+k [0063] Common Subexpression Elimination [0064] The following text describes in further detail the common subexpression elimination steps that occur in state [0065] FIG. 4 is a diagram showing the source code of FIG. 3 subsequent to common subexpression elimination. As part of CSE, a variable has been defined for each common subexpression, and each subexpression has been replaced with the respective variable. The variable is assigned the value of the subexpression prior to the first operation that uses the variable. For example, in FIG. 4, the expression “(y-4)%3” is assigned to the variable “cseymin4mod3”, and this variable is used later by the operations instead of the expression. It is noted that the common subexpression elimination only occurs with respect to expressions having the same scope. However, subsequent to code hoisting and in later iterations, there may be opportunities to apply the factorization/CSE step again on a larger scope. [0066] Code Hoisting [0067] The following text describes in further detail the process for code hoisting that occurs in state [0068] Code hoisting across conditionals can lead to either a gain or a loss in performance. The result depends on the way conditional ranges are overlapping and on the amount of common factors between the considered expressions. [0069] FIG. 13 shows a decision tree that may be used to decide for given expressions in two different conditional scopes whether it is beneficial to do code hoisting. It is to be appreciated by the skilled technologist that other criteria may be used. [0070] As a first decision (branch [0071] If the sum of the ranges of each of the conditionals is less than the whole range, no code hoisting is performed. However, if the sum of the ranges of each of the conditionals is greater than the whole range, code hoisting is performed with respect to those common expressions.
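A minimal C sketch of code hoisting across conditionals, assuming overlapping conditional ranges and a shared common factor as described above (the variable names are illustrative, not the patent's):

```c
#include <assert.h>

/* Before hoisting: the common expression (x % 3) * 3 appears in both
 * conditional scopes, so it is evaluated on every path. */
int before_hoist(int x, int flag) {
    int r;
    if (flag)
        r = (x % 3) * 3 + 1;
    else
        r = (x % 3) * 3 + 2;
    return r;
}

/* After hoisting: because the two conditional scopes together cover
 * the whole range and share a common factor, the expression is moved
 * above the conditional and computed once. */
int after_hoist(int x, int flag) {
    int common = (x % 3) * 3; /* hoisted out of both branches */
    int r;
    if (flag)
        r = common + 1;
    else
        r = common + 2;
    return r;
}
```

Hoisting pays off here because the expression executes once per call regardless of which branch is taken; when the branch ranges do not overlap, hoisting would instead add computations on paths that never needed them.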
[0072] Referring again to branch [0073] wherein [0074] S is a similarity coefficient (between 0 and 1); [0075] c [0076] c [0077] k [0078] k [0079] A discussion of how to estimate the cost of the conditionals, e.g., c [0080] For example, in FIG. 4, certain expressions in both scopes have common factors in the sense that they all have the form (x-constant)%3*3 or (y-constant)%3. In this example, as part of the optimization process, the decision tree of FIG. 13 is traversed to decide to which expressions code hoisting is to be applied. Since the conditional ranges surrounding these expressions can, for one embodiment of the invention, be said to be overlapping (branch ( [0081] FIGS. [0082] One advantage of code hoisting across loops is that it reduces the number of times a certain operation is executed. As part of this process, the optimization system [0083] Therefore, these operations that do not change can be moved outside the scope of the x loop, thereby removing unnecessary computations. FIG. 8 shows the code of FIG. 7 subsequent to code hoisting across loops. [0084] Subsequent to code hoisting, the arithmetic in address expressions is exposed for further factorization exploration and common subexpression elimination on a more global scope. [0085] Linear Induction Analysis [0086] The following text describes in further detail the steps that occur in state [0087] At this point, the number of executed operations is significantly reduced, in particular the number of modulo operations. But often there are still nonlinear operations present in the code. One goal of this step is to replace the remaining piece-wise linear operations (like modulo or integer divisions) with less costly arithmetic. Furthermore, it is noted that this step is at the boundary between the processor-independent and processor-specific stages. In a first step, an analysis of all expressions containing modulo or division operations is performed.
This analysis is part of the processor-independent stage. At this first step, the optimization system [0088] If “value” is not a power of two, a more efficient solution for executing modulo/division operations can still be applied. If the right-hand side of the operator is a constant value and if the expression is a linear or a piece-wise linear function of a loop iterator in a loop construct, the expression can be transformed. To perform this transformation, several code instructions are introduced in the code to replace the selected operation. For each operator, one pointer (“variable”) increment per loop and per operation is inserted into the source code. Each pointer created is incremented at the beginning of the loop where the operation takes place, and it is initialized in the previous loop of the nest. A test per pointer and per loop is also introduced to reset the pointer for the modulo, or to increment the pointer in the case of the division operation. This test fires when the pointer that is associated with the modulo operation reaches the modulo value. The initialization value for the pointer is the one taken by the considered expression for the first value of the iterator. The amount of the increment is determined by the slope of the affine expression. Therefore, each expression affected by a modulo operation is replaced by an auto-incremented variable and a conditional. An expression affected by an integer division can also be transformed by using a pointer. Note that because of the relation between modulo and division (both get reset or incremented at the same time as shown), only one conditional is required to follow their evolution, hence saving one conditional. When essentially all modulo and division operations have been transformed, the conditionals and increments related to each expression remain in the scope where they were placed during the code hoisting step.
In most cases, expressions affected by modulo or division operations are affine functions of the loop iterators, and a straightforward application of the technique can be done. This approach is applicable to nested operations. [0089] For example, using the above approach, the code segment:

[0090] for (i = 0; i < 20; i++)
[0091] B[i % 3];

[0092] can be converted into the following code segment:
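A C sketch of the described pointer-based conversion, under the stated technique (illustrative variable names, not the patent's literal code):

```c
#include <assert.h>

#define N 20

/* Original form: a modulo operation executes on every loop iteration. */
void access_with_modulo(const int B[3], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = B[i % 3];
}

/* Transformed form: the modulo is replaced by an auto-incremented
 * variable plus a reset conditional, so no modulo executes inside the
 * loop. A related division such as i / 3 could share this same
 * conditional, since both change at the same time. */
void access_with_induction(const int B[3], int out[N]) {
    int ptr = 0;              /* tracks i % 3; initialized for i == 0 */
    for (int i = 0; i < N; i++) {
        out[i] = B[ptr];
        ptr++;                /* increment by the slope (here 1) */
        if (ptr == 3)         /* reset when the modulo value is reached */
            ptr = 0;
    }
}
```

Both functions fill `out` identically; the transformed form trades each in-loop modulo for an increment and a compare, which are typically cheaper on a programmable instruction set processor.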
[0093] Furthermore, for example, FIG. 11 illustrates certain exemplary source code. FIG. 12 illustrates the source code of FIG. 11 subsequent to performing linear induction analysis. [0094] Cost Calculation [0095] Address expression optimization is formulated as a cost minimization problem. During the optimization step, transformations are selected and executed based on the resulting reduction in cost. In one embodiment of the invention, the cost is defined by the cost of a shared data-flow graph. For instruction-set based ACUs, the cost measures used by the cost functions of the different node types are based on an estimate of the number of cycles for a given instruction-set processor “family” (i.e., RISC, VLIW, etc.), weighted by the execution rate of the corresponding application code scope. The cost of a shared data-flow graph is an estimate of the area of the cheapest possible implementation on a non-multiplexed or lowly-multiplexed custom data-path, and is defined as the sum of the cost of each node (v) in the shared data-flow graph. Signal type information can be used for calculating the cost of a shared data-flow graph. For a given node in the graph, the cost is calculated based on the signal types of its input and output signals. [0096] Using a small set of abstract operations to model both a larger set of arithmetic operations and operation sharing means that a direct mapping of operations in a shared data-flow graph to operators in a custom data-path is not always possible for cost calculation. In most cases, the cost of an operation depends on its context. This is also the case for constants. [0097] A distinction can be made between signal type context and node context. The signal type context of a node v includes the input and output signal types of v. The signal type context does not change when transformations are applied in the neighbourhood of v. The node context of a node v consists of the nodes in the neighbourhood of v.
The node context does change when transformations are applied in the neighbourhood of v. As used below, a context-dependent cost (function) is defined as a cost (function) that depends on the node context. Context-dependent or not, each cost (function) depends on the signal type context: [0098] The cost of a constant is context-dependent. In those situations where a constant is used as a ‘control input’, the cost of the constant is zero. An example is a constant multiplication, which is expanded into an add/sub/shift network depending on the value of the constant. Another example is a delay operation, whose size (i.e., the number of registers in a register file) is given by the value of the constant. In all other situations, the cost of a constant is the cost of storing the constant in an (index) register file. [0099] The cost of an input and output is zero. The cost of an addition is not context-dependent. If an addition has more than two input ports, the best possible (i.e., minimum cost) expansion into binary additions among all modes is chosen. For a binary addition, the maximal cost among all modes is chosen. For one mode, the cost is calculated as follows: (i) the cost of an addition with two constants (c [0100] The cost of a multiplication is context-dependent. If a multiplication has more than two input ports, the best possible (i.e., minimum cost) expansion into binary multiplications among all modes is chosen. For a binary multiplication, the maximal cost among all modes is chosen. [0101] For one mode, the cost is calculated as follows: (i) the cost of a multiplication with two constants (c [0102] The cost of a delay is not context-dependent. A delay can occur only in a shared data-flow graph with one mode, because a delay cannot be shared. For one mode, the cost is calculated as follows: (i) the cost of a delay with a constant at the first input port (c [0103] The cost of a select operation is context-dependent.
In one embodiment of the invention, it is calculated as follows. The worst-case wordlength at each input port is determined among all signal types whose mode is in the demand set of that input port. Next, all wordlengths except the smallest are added up. The number obtained is the total number of multiplexing bits that are required for all select operations. [0104] The motivation behind this calculation is that the cheapest expansion of an n-ary select operation is obtained by iteratively splitting off the two input ports with the smallest wordlengths. The cost in bits of the select operation split off is the larger of the two wordlengths. This wordlength is also the one to use for the new input port of the (n−1)-ary select operation that remains. In other words, the smallest wordlength is removed from the set of wordlengths of the select operation, and the cost is increased by the second smallest wordlength, which becomes the new smallest wordlength. [0105] Execution of Compiled Code [0106] The compiled code is optimized for execution on an essentially digital device: either an existing programmable processor or a plurality of programmable processors, possibly combined with certain custom hardware. [0107] The compiled code can be optimally executed on such digital devices when certain features within the calculation units, such as auto-increment, are used. The above optimizations reduce the amount of overhead for controlling the calculation units, also called local control overhead, and minimize the number of accesses to the registers. The optimization process of the optimizing system [0108] While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the spirit of the invention.
The scope of the invention is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.