WO1992003781A1 - Vectorized LR parsing of computer programs - Google Patents
- Publication number
- WO1992003781A1 PCT/US1991/004065
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- transition
- parsing
- state
- machine
- parser
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
- G06F8/427—Parsing
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B44—DECORATIVE ARTS
- B44C—PRODUCING DECORATIVE EFFECTS; MOSAICS; TARSIA WORK; PAPERHANGING
- B44C5/00—Processes for producing special ornamental bodies
- B44C5/02—Mountings for pictures; Mountings of horns on plates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/12—Replacement control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
- G06F8/433—Dependency analysis; Data or control flow analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/445—Exploiting fine grain parallelism, i.e. parallelism at instruction level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99942—Manipulating data structure, e.g. compression, compaction, compilation
Definitions
- In the pseudo-code listing of the parser loop, the term (i) represents the offset in the parse tables 72, 73, 102 and 103. Note that the searching is conducted in tables 73 and 102 while access to tables 72 and 103 is by explicit offset address (i or j).
- The term FINAL_STATE represents a constant or value which represents the final or last reduction to <system goal symbol>. This value indicates the termination of the current parsing operation. Completion of a parse always occurs with a last production and a transition to the FINAL_STATE.
- entrance_symbol represents a vector of integers representing the appropriate entrance symbols which must be read to enter state S.
- first_transition represents a vector of pointers into the entrance_symbol vector. This vector has a length of NO_STATES+l where NO_STATES is a constant indicating the number of states in the parse tables 72 and 73 and in parse tables 102 and 103.
- state S is the current parse state.
- transition_state(first_transition(S)) through transition_state(first_transition(S+1)-1) are the collection of potential target states, with entrance_symbol(first_transition(S)) through entrance_symbol(first_transition(S+1)-1) being the entrance symbols to transition to those states.
- Such a range could be empty.
- first_lookahead represents a vector NO_STATES+1 long of pointers into the vector lookahead_set and the vector production.
- The entries between first_lookahead(S) and first_lookahead(S+1)-1 in the lookahead_set and production vectors form the lookahead set/production pairs needed for reductions in state S. This range may also be empty.
- production represents a vector of production numbers used as one-half of the lookahead set/production pairs. This vector contains as many entries as there are reductions in the parser tables and is used with first_lookahead described above.
- left_symbol is a vector having a length of NO_PRODUCTIONS.
- The entry in position p-1 is the non-terminal on the left side of production p (left of the ":" as listed in the Background of the Invention).
- length_reduction represents a vector having a length of NO_PRODUCTIONS.
- Entry p-1 is the length of the right side of production p and is used to POP the stack when reducing using production p.
- lookahead_set represents a vector of lookahead sets. There are at least as many entries as there are terminals in each lookahead set. Each lookahead set is a bit vector over the terminal symbols. Terminal S is in set
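The first_transition and first_lookahead vectors described above act as per-state start offsets into parallel vectors, with state S owning the entries from first(S) through first(S+1)-1. The following is a minimal Python sketch of how such vectors could be built from per-state lists of pairs. The function name `linearize` and the sample data are hypothetical, and the source does not describe how its LR analyzer actually generates the tables.

```python
def linearize(pairs_by_state):
    """Flatten per-state (key, value) pairs into a start-offset vector
    plus two parallel vectors (hypothetical helper, not from the source).

    pairs_by_state[s] is the list of pairs belonging to state s.
    Returns (first, keys, values) where len(first) == number of states + 1
    and state s owns offsets first[s] .. first[s + 1] - 1 in both vectors.
    """
    first, keys, values = [0], [], []
    for pairs in pairs_by_state:
        for key, value in pairs:
            keys.append(key)      # e.g. an entrance symbol or a lookahead set
            values.append(value)  # e.g. a target state or a production number
        first.append(len(keys))   # start offset of the next state
    return first, keys, values


# Example: three states, the middle one with an empty range.
first_transition, entrance_symbol, transition_state = linearize(
    [[("n", 1), ("E", 2)], [], [("+", 3)]]
)
```

Because a key and its value are appended together, corresponding entries always share the same offset, and a state with no entries yields two equal consecutive offsets.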
Abstract
A parser (11) for parsing computer programs in a compiler has parsing tables arranged as linear vectors. In a reduction portion (71) of the parser, a production table (72) and a lookahead set table (73) have paired entries (20, 22) at identical address offsets such that a one-to-one relationship exists between each lookahead set (90, 91, 93) in the lookahead set table (73) and the representation of the lookahead set in the lookahead set table. In a read transition portion of the parser, an entrance symbol table (102) has entries paired with transition state representations, each pair being at an identical address offset in the respective tables. For a reduction or read transition operation, the lookahead set table (73) or the entrance symbol table (102) is scanned to find the appropriate entry. Once the appropriate entry is found, the production table (72) or the transition state table (103) is addressed using the offset of the appropriate entry found during the scanning process.
Description
VECTORIZED LR PARSING OF COMPUTER PROGRAMS
RELATED APPLICATIONS
This application is a continuation-in-part of commonly assigned U.S. patent application Serial No. 07/537,466, filed June 11, 1990, for INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM by George A. Spix et al.
FIELD OF THE INVENTION
The present invention relates to parsing computer programs for use in a compiling system. More particularly, the invention relates to vectorized parsing tables in an LR automatic parser that enable a compiler to take advantage of highly pipelined computer systems.
BACKGROUND OF THE INVENTION
Each programming language uses its own syntax and semantics; syntax used in the Fortran language is different from the C language syntax, etc. Programs written in any programming language have to be compiled, and during that process their syntax and semantics are verified. Syntax is the structure and specification of each language according to rules established for each language, i.e., grammar. Semantics of each language is the meaning conveyed by and associated with the syntax of such language. In compiling computer programs, parsing is an analysis of a stream of program expressions (sentences) for determining whether or not the program expressions are syntactically correct. Once it is determined that a stream of program expressions is syntactically correct, that stream of program expressions can be compiled into executable modules. Parsing is automatically performed in a computer using a computer program.
In parsing a computer program input stream written in Fortran, for example, a scanner using a set of rules groups predetermined characters in the input stream into tokens.
Scanners are programs constructed to recognize different types of tokens, such as identifiers, decimal constants, floating point constants, and the like. In recognizing or identifying a token, a parser may look ahead in the input stream for additional predetermined characters for finding additional tokens.
The parser imposes a structure on the sequence of tokens using a set of rules appropriate for the language. Such rules are referred to as a context-free grammar; such rules are often specified in the well-known Backus-Naur form. Such a grammar specification for a program expression consisting of decimal digits and the operations "+" and "*" may be represented as follows:
F : decimal_digits
Each of the five grammar rules above, one on each line, is referred to as a "production". In the above program specification, the tokens detected by the scanner are "+", "*" and decimal_digits. Such tokens are passed to the parser program. Each string in the input stream that is parsed as having correct syntax is said to be "accepted". For example, the string 2+3*5 is "accepted" while the string 2++5 will be rejected as syntactically incorrect.
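The scanning and acceptance behavior described above can be illustrated with a short Python sketch. It is not the patent's scanner or parser: the names `scan` and `is_accepted` are hypothetical, and because only the production for decimal_digits is legible in the grammar listing, the acceptance check simply assumes that a correct expression alternates operands with single "+" or "*" operators.

```python
import re

# Token kinds named in the text: "+", "*" and decimal_digits.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([+*]))")


def scan(text):
    """Group characters into (kind, lexeme) tokens; raise on anything else."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character near offset {pos}")
        if m.group(1) is not None:
            tokens.append(("decimal_digits", m.group(1)))
        else:
            tokens.append((m.group(2), m.group(2)))
        pos = m.end()
    return tokens


def is_accepted(text):
    """Assumed language: operands separated by single '+' or '*' operators.

    This is a simplified stand-in for a grammar-driven parser, not an LR parse.
    """
    kinds = [kind for kind, _ in scan(text)]
    if len(kinds) % 2 == 0:  # also rejects the empty string
        return False
    operands_ok = all(k == "decimal_digits" for k in kinds[0::2])
    operators_ok = all(k in ("+", "*") for k in kinds[1::2])
    return operands_ok and operators_ok
```

Under that assumption, `is_accepted("2+3*5")` is true and `is_accepted("2++5")` is false, matching the two strings discussed in the text.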
A left-to-right, right-most derivation (LR) parser accepts a subset of a context-free grammar. Each LR parser has an input, an output, a push-down stack, a driver program and a parsing table. The parsing table is created from the grammar of the language to be parsed and is unique to such language and its grammar. The driver program serially reads tokens one at a time from the input stream. The input stream is typically stored in a computer storage and is scanned by the driver program to fetch the tokens. Based upon the information in the parsing table that corresponds to the token being analyzed (input token) and the current program state, the driver program may shift the input token into the stack, reduce it by one of the productions, accept a string of such tokens, or reject the string of such tokens as being syntactically wrong. Reduction means that the right-hand side of a production is replaced by the left-hand side. An LR parser may also fetch a next token from the input stream for determining whether to shift or to reduce the token. Such a token is termed a "lookahead" token and is referred to herein as a lookahead portion of the input stream. The lookahead portion may include more than one token. When an LR parser performs reduction, additional semantic checks (also termed semantic actions) are performed.
Parsers use tables in the parsing process. It is desired to enhance the parsing process, particularly in an LR(k) parser, wherein k is the lookahead limit in the parsing. As indicated above, such parsers are well known as taking a tokenized sentence from a computer language to produce an output which is a canonical parse of the sentence. While the actual parsing procedure is performed by a known parser interpreter, the parser table itself is in a form of data structures or tables. The tables are generated or established by a so-called LR analyzer as a series of data-loaded variable declarations from a context-free grammar for each language being parsed; each language is then parsed by interpreting the parsing tables established for that language.
As mentioned above, each LR parser consists of a known modified finite automaton with an attached push-down stack. At each discrete instant during a parsing operation, parser control resides in one of the parser's machine states, the stack being filled with the most recent past parser states. The parser is looking ahead in the input stream (the computer program to be parsed and compiled) for a next token. Each parser state offers an automatic choice between two types of actions: reductions and read transitions. Each parser state may contain any number of defined reductions or read transitions to be utilized in parsing.
Reductions, as mentioned above, consist of a production number P and a collection of terminal symbols R, taken as a pair, and are always considered first in each state of the parser. If lookahead symbol L is in set R for production P, then the reduction is to be performed (there can never be more than one candidate pair). As output of the production, the number P is given to a semantic synthesizer. Then, as many states as there are symbols on the right-hand side of production P are popped off the stack; the non-terminal on the left-hand side of the production P is put in place for the next lookahead (the original lookahead L is pushed back into the input stream) and the state exposed at the top of the push-down stack takes control of the parser action.
Read transitions consist of pairs of parser states S and vocabulary symbols X. When the lookahead symbol L matches the read symbol X (by construction there can be at most one such match), lookahead symbol L is stripped from the input stream, state S is pushed onto the stack and state S controls the parsing operation. LR parsers always begin in a state 0 (zero) with the push-down stack being empty and finish with production 0 which is the production:
<system goal symbol> ::= _|_
The term <sentence> represents the programming language goal symbol and the symbol _|_ is a terminal symbol reserved for this production.
The parser's basic program structure is a parser loop over the discrete time steps defined by parser state changes. Each cycle searches for and performs one reduction or one transition. While a parser need only maintain a state stack, the current parser state and the lookahead symbol, more information is maintained for tracing, semantic and error correction purposes. The push-down stack can have several fields: one field holding the token just read from the input stream when the state was stacked, another field holding the actual character string read from the input stream, another field holding a serial number for the token, and extra fields that can be used for maintenance by a semantic synthesizer.
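The loop just described (try a reduction first, otherwise a read transition, once per cycle) can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the toy grammar, the symbol and state numbering, the dictionary-based per-state tables, keeping state 0 on the stack, and treating the end marker END as the lookahead once input is exhausted. The tracing and semantic fields of the stack are omitted.

```python
# Hypothetical toy grammar (not from the source):
#   production 0: G ::= E END
#   production 1: E ::= E "+" n
#   production 2: E ::= n
N, PLUS, END, E, G = range(5)      # vocabulary symbols
LEFT_SYMBOL = [G, E, E]            # left-hand side of each production
RHS_LENGTH = [2, 3, 1]             # number of right-hand-side symbols
FINAL_STATE = 6

# Per state: reductions as (lookahead set R, production P) pairs and
# read transitions as (symbol X, target state S) pairs.
REDUCTIONS = {1: [({PLUS, END}, 2)], 4: [({END}, 0)], 5: [({PLUS, END}, 1)]}
TRANSITIONS = {
    0: [(N, 1), (E, 2), (G, FINAL_STATE)],
    2: [(PLUS, 3), (END, 4)],
    3: [(N, 5)],
}


def goto(state, symbol):
    """Return the target state for a transition on symbol, or None."""
    for x, target in TRANSITIONS.get(state, []):
        if x == symbol:
            return target
    return None


def parse(tokens):
    """Return True if the token list is accepted, False on a syntax error."""
    stream = iter(tokens)
    stack = [0]                    # assumption: state 0 stays on the stack
    look = next(stream, END)       # END is returned once input is exhausted
    while True:
        state = stack[-1]
        # Reductions are considered first in each state.
        for lookahead_set, p in REDUCTIONS.get(state, []):
            if look in lookahead_set:
                del stack[-RHS_LENGTH[p]:]            # pop one state per symbol
                target = goto(stack[-1], LEFT_SYMBOL[p])
                if target == FINAL_STATE:
                    return True
                if target is None:
                    return False
                stack.append(target)
                break
        else:
            # No reduction applied: try a read transition on the lookahead.
            target = goto(state, look)
            if target is None:
                return False
            stack.append(target)
            look = next(stream, END)
```

With these tables, `parse([N, PLUS, N])` returns `True` and `parse([N, PLUS, PLUS, N])` returns `False`.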
DISCUSSION OF THE PRIOR ART
Prior art parse tables for finding reductions (Figure 2) are constructed as four tables: the lookahead set table 18 representing sets of terminal symbols, the lookahead set numbers table 17 representing the terminal symbol sets that a particular reduction can be made on, the first lookahead table 15 representing the beginning of a collection of lookahead set numbers for a particular parser state, and production table 16 representing which production to reduce by for a corresponding lookahead set number. To find a possible reduction from state S with lookahead symbol X, the prior art parser uses the first lookahead table 15 at positions S 20 and S+1 22 to determine the collection of lookahead set numbers 30 to 33 for state S. Then, the prior art parser checks whether or not the lookahead symbol X is a member of any corresponding lookahead set 34, 35 or 36. If symbol X is found to be a member of any corresponding lookahead set, then a reduction is performed using the corresponding production P 25 as indicated by the production table 16.
Prior art parse tables for finding read transitions (Figure 3) include three parse tables: the entrance symbol table 42 representing the symbol which must be read to enter a parse state, the transition state table 41 representing the target parse states of all possible read transitions, and a first transition table 40 representing a beginning of a collection of possible target states from a particular parser state. To find a possible read transition from parser state S with lookahead symbol X, a prior art parser uses a first transition table 40 at positions S 46 and S+1 47 to determine the collection of possible target states 50 to 52. The prior art parser then determines whether or not lookahead symbol X appears as the entrance symbol 60, 61 or 62 for each possible target parser state. When the lookahead symbol X is found, as at symbol position 61, then a next read transition is made to target parser state S' 50 on lookahead symbol X.
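The indirect lookup that the prior art reduction tables imply can be sketched in Python. The table contents are invented, and the layout is an assumption read from the prose, since Figure 2 itself is not reproduced here: in particular, the sketch assumes production table 16 is indexed at the same position as lookahead set numbers table 17.

```python
# Hypothetical contents; the table numerals follow the text.
first_lookahead = [0, 2, 3]                        # table 15: per-state start offsets
lookahead_set_number = [1, 0, 1]                   # table 17: indices into table 18
production = [4, 7, 2]                             # table 16: assumed parallel to table 17
lookahead_sets = [{"+"}, {"*", "decimal_digits"}]  # table 18: the distinct sets


def find_reduction_prior_art(state, x):
    """Return the production to reduce by in `state` on lookahead x, or None."""
    for i in range(first_lookahead[state], first_lookahead[state + 1]):
        set_no = lookahead_set_number[i]    # first access: fetch the set number
        if x in lookahead_sets[set_no]:     # second, dependent access: the set itself
            return production[i]
    return None
```

Each candidate costs two dependent table accesses, because the set cannot be fetched until its number has been read. The invention's layout is described as avoiding indirection of this kind.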
SUMMARY OF THE INVENTION
It is an object of this invention to provide an enhanced, more time-efficient parsing system and method than found in the prior art.
In accordance with the invention, first table means are established and used for indicating a plurality of reductions and read transitions. A linearized set of vectors forms an output table having input table means having a given plurality of input entries and an output table means having a given plurality of output entries, the input and output entries that correspond being at a same offset (logically) within the table. Searching is conducted in the single linearized table without indirect references to any table. The input and output entries, being parallel vectors, provide a one-for-one base plus offset addressing in the input and output table means. This arrangement is provided both for reduction and transition processing.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
DESCRIPTION OF THE DRAWINGS
Figure 1 is a simplified diagram of a parsing system.
Figures 2 and 3 illustrate the prior art respectively for executing reductions and read transitions in a parser.
Figures 4 and 5, respectively, illustrate the construction of parsing tables using the present invention for reductions and read transitions.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring now more particularly to the appended drawing, like numerals indicate like parts and structural features in the various figures. Figure 4 shows a preferred embodiment of the invention for performing reductions in a given parser state. The first lookahead table 15 is still used. Reduction table means 71 is a linearized table of parallel vectors of productions and lookahead sets, respectively, in separate data structures 72 and 73. Data structures 72 and 73 have an identical number of entries and utilize base plus offset addressing using a different base but the same offsets. The parser 11, upon checking for possible reductions in state S 20, determines the collection of productions 80, 81 and 83 (83 denotes a plurality of productions). State S+1 22 has a collection of productions that begin at numeral 82. Horizontal dashed lines collectively denominated by numeral 85 symbolize the parallel vector relationship between the production table 72 entries and the lookahead set table 73. For example, production 80 is at the same offset as lookahead set 90, production P 81 is at the same offset as lookahead set 91, etc. The parser 11 for state S scans lookahead sets 90 through 93 for lookahead symbol X by directly accessing the table 73. When symbol X is found in lookahead set 91, the parser reduces using the corresponding production P 81.
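A minimal Python sketch of the reduction lookup of Figure 4 follows, with invented table contents and a hypothetical terminal numbering. Lookahead sets are stored as integer bit masks over the terminals. Only table 73 is scanned; table 72 is then read at the same offset.

```python
PLUS, STAR, DIGITS = 0, 1, 2           # hypothetical terminal numbering


def bits(*terminals):
    """Build a bit-vector lookahead set from terminal numbers."""
    mask = 0
    for t in terminals:
        mask |= 1 << t
    return mask


first_lookahead = [0, 2, 3]            # table 15: per-state start offsets
production = [4, 7, 2]                 # table 72
lookahead_set = [                      # table 73, parallel to table 72
    bits(STAR, DIGITS),
    bits(PLUS),
    bits(STAR, DIGITS),
]


def find_reduction(state, x):
    """Scan table 73 for terminal x; return the paired production or None."""
    for i in range(first_lookahead[state], first_lookahead[state + 1]):
        if (lookahead_set[i] >> x) & 1:    # direct access, no set-number hop
            return production[i]           # same offset i in the parallel vector
    return None
```

One consequence visible in this sketch is that a set shared by several reductions is stored once per reduction. Whether the patent's tables duplicate sets in exactly this way is an inference from the parallel-vector description, not something the text states.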
Figure 5 illustrates the parser tables in accordance with the invention for finding read transitions, the alternative operation in each state to performing a reduction via finding the productions. First state transition table 40 is used as in the prior art. The transition finding table 101 is vectorizable in the same way as reduction table 71, i.e., table 101 includes two data structures, the entrance symbol table 102 and the transition state table 103. Tables 102 and 103 have an identical number of entries, each entry having a one-for-one correspondence with one and only one entry in the other table within table 101. A collection of entrance symbols 110, 111 and 112 is for state S 46. A new read transition from state S 46 on symbol X 111 is found by searching for symbol X in the collection of entrance symbols 110-112 of table 102. The first transition of state S+1 47 at area 113 of table 102 determines the last of the entrance symbols 112 for state S 46. Since each entry in the entrance symbol table 102 has a corresponding entry in the transition state table 103, this searching is completed without searching the transition state table 103. That is, entrance symbols 110-112 in table 102 have corresponding target states S' 120-122 in table 103. Each such pair of corresponding entries is deemed to be in a parallel vector. Once the entrance symbol X 111 is found in entrance symbol table 102, transition state table 103 is explicitly addressed to obtain the target state S' 121, i.e., the offsets in the two tables are identical for corresponding entries. This search is conducted in but one table rather than through two tables as in the prior art of Figure 3.
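The single-table search described for Figure 5 might look as follows in Python. This is a sketch under assumptions: the symbol codes and target states are invented, `find_transition` is a hypothetical name, and the built-in `list.index` with start and end bounds stands in for the scan of table 102.

```python
# Hypothetical contents; symbols are small integers, as the description suggests.
first_transition = [0, 3, 3, 5]               # NO_STATES + 1 entries (3 states)
entrance_symbol = [10, 11, 12, 10, 13]        # plays the role of table 102
transition_state = [5, 7, 2, 1, 0]            # plays the role of table 103


def find_transition(state, x):
    """Return the target state for reading `x` in `state`, or None."""
    lo = first_transition[state]
    hi = first_transition[state + 1]          # start of the next state's entries
    try:
        j = entrance_symbol.index(x, lo, hi)  # scan only the entrance symbols
    except ValueError:
        return None                           # no read transition on x
    return transition_state[j]                # direct access at the same offset


print(find_transition(0, 11))   # found at offset 1, so transition_state[1]
print(find_transition(0, 13))   # 13 belongs to state 2 only
```

The transition state vector is never searched; it is indexed once, after the offset is known.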
A parser loop 12 for practicing the present invention is set forth below in pseudo-code form.
PARSE_LOOP:
  FOR i=first_lookahead(S) TO first_lookahead(S+1)-1 DO
    IF (X IN lookahead_set(i)) THEN
      S=POP(length_reduction(production(i)))
      Y=left_symbol(production(i))
      FOR j=first_transition(S) TO first_transition(S+1)-1 DO
        IF (Y=entrance_symbol(j)) THEN
          S=transition_state(j)
          IF (S=FINAL_STATE) STOP
          PUSH(S, Y)
          GOTO PARSE_LOOP
        ENDIF
      ENDFOR
    ENDIF
  ENDFOR
  FOR i=first_transition(S) TO first_transition(S+1)-1 DO
    IF (X=entrance_symbol(i)) THEN
      S=transition_state(i)
      PUSH(S, X)
      READ(X)
      GOTO PARSE_LOOP
    ENDIF
  ENDFOR
  INPUT_ERROR()
In the above pseudo-code listing, the index (i or j) represents the offset in the parse tables 72, 73, 102 and 103. Note that the searching is conducted in tables 73 and 102 while access to tables 72 and 103 is by explicit offset address (i or j). The term FINAL_STATE represents a constant or value which represents the final or last reduction to <system goal symbol>. This value indicates the termination of the current parsing operation. Completion of a parse always occurs with a last production and a transition to the FINAL_STATE. The term entrance_symbol represents a vector of integers representing the appropriate entrance symbols which must be read to enter state S. The term first_transition represents a vector of pointers into the entrance_symbol vector. This vector has a length of NO_STATES+1, where NO_STATES is a constant indicating the number of states in the parse tables 72 and 73 and in parse tables 102 and 103. For identifying a read transition, the first_transition vector and transition_state vector are used. Assume that state S is the current parse state. Then, transition_state(first_transition(S)) through transition_state(first_transition(S+1)-1) are the collection of potential target states, with entrance_symbol(first_transition(S)) through entrance_symbol(first_transition(S+1)-1) being the entrance symbols to transition to those states. Such a range could be empty.
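The description does not say how the linearized vectors are produced. One assumed way, sketched below in Python, is to flatten a per-state list of (entrance symbol, target state) pairs; the function name `linearize` and the sample data are hypothetical. The sketch also shows why first_transition needs NO_STATES+1 entries: the extra entry closes the range of the last state.

```python
def linearize(per_state):
    """Flatten per-state (symbol, target) pairs into three parallel vectors."""
    first_transition = [0]
    entrance_symbol = []
    transition_state = []
    for pairs in per_state:
        for symbol, target in pairs:
            entrance_symbol.append(symbol)
            transition_state.append(target)
        # The next state's entries start where this state's entries end.
        first_transition.append(len(entrance_symbol))
    return first_transition, entrance_symbol, transition_state


# Three hypothetical states; state 1 has no read transitions at all.
per_state = [[(1, 2), (3, 1)], [], [(2, 0)]]
first_transition, entrance_symbol, transition_state = linearize(per_state)

print(first_transition)    # [0, 2, 2, 3]
print(entrance_symbol)     # [1, 3, 2]
print(transition_state)    # [2, 1, 0]
```

State 1 yields the range 2 through 1, which is the empty range mentioned above, so no special marker for a state without transitions is required.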
The term first_lookahead represents a vector NO_STATES+1 long of pointers into the vector lookahead_set and the vector production. The entries between first_lookahead(S) and first_lookahead(S+1)-1 in the lookahead_set and production vectors form the lookahead set/production pairs needed for reductions in state S. This range may also be empty. The term production represents a vector of production numbers used as one-half of the lookahead set/production pairs. This vector contains as many entries as there are reductions in the parser tables and is used with first_lookahead described above. The term left_symbol is a vector having a length of NO_PRODUCTIONS. The entry in position p-1 is the non-terminal on the left side of production p (to the left of the ":" as listed in the Background of the Invention). Such a non-terminal is pushed back on the stack when a reduction by p is performed. The term length_reduction represents a vector having a length of NO_PRODUCTIONS. Entry p-1 is the length of the right side of production p and is used to POP the stack when reducing using production p. The term lookahead_set represents a vector of lookahead sets. Each lookahead set is a bit vector over the terminal symbols, with at least as many entries as there are terminals. Terminal S is in set L if and only if bit S of set L is one. The number of lookahead sets is determined by the number of possible reductions in all possible states. Other aspects of the pseudo-code listing are apparent from inspection of the listing.
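To make the vectors concrete, the following Python sketch runs the parse loop on a toy grammar. Everything specific here is an assumption made for illustration: the grammar (1: G : S, 2: S : ( S ), 3: S : x), the hand-built tables, the symbol codes, the use of Python integers as bit vectors, and the convention that the stack holds states only. GOTO is replaced by a `while` loop with `continue`.

```python
LP, RP, TX, END = 0, 1, 2, 3       # terminal codes, also bit positions
G, S_ = 4, 5                       # non-terminal codes
FINAL_STATE = 4

# Reduction portion: parallel vectors indexed by the same offset.
first_lookahead = [0, 0, 1, 1, 2, 2, 2, 3]                 # NO_STATES + 1 = 8
production = [1, 3, 2]
lookahead_set = [1 << END, 1 << RP | 1 << END, 1 << RP | 1 << END]
left_symbol = [G, S_, S_]          # entry p-1 belongs to production p
length_reduction = [1, 3, 1]

# Read transition portion: parallel vectors indexed by the same offset.
first_transition = [0, 4, 4, 7, 7, 7, 8, 8]
entrance_symbol = [S_, LP, TX, G, S_, LP, TX, RP]
transition_state = [1, 2, 3, 4, 5, 2, 3, 6]


def scan(first, keys, state, matches):
    """Search one vector over the range of `state`; return the offset or -1."""
    for i in range(first[state], first[state + 1]):
        if matches(keys[i]):
            return i
    return -1


def parse(tokens):
    """Return True if the terminal codes in `tokens` (ending in END) parse."""
    stream = iter(tokens)
    x = next(stream)
    stack = [0]                                            # stack of states
    while True:
        s = stack[-1]
        i = scan(first_lookahead, lookahead_set, s, lambda l: (l >> x) & 1)
        if i >= 0:                                         # reduce
            p = production[i]
            del stack[len(stack) - length_reduction[p - 1]:]
            y = left_symbol[p - 1]
            j = scan(first_transition, entrance_symbol, stack[-1],
                     lambda e: e == y)
            if j < 0:
                return False                               # inconsistent tables
            if transition_state[j] == FINAL_STATE:
                return True
            stack.append(transition_state[j])
            continue
        j = scan(first_transition, entrance_symbol, s, lambda e: e == x)
        if j < 0:
            return False                                   # INPUT_ERROR
        stack.append(transition_state[j])                  # read transition
        x = next(stream, END)


print(parse([LP, TX, RP, END]))   # True
print(parse([LP, TX, END]))       # False
```

Membership of terminal `x` in a lookahead set is the single bit test `(l >> x) & 1`, which matches the rule that a terminal is in the set if and only if its bit is one.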
While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
What is claimed is:
Claims
1. In a machine-effected method of operating a parser portion of a compiler for parsing a computer program in an input stream, including the machine-executed steps of:
automatically establishing a parser state table means having a plurality of state indicating entries, one entry means for indicating each possible state of the parser;
automatically establishing a linearized vector table means having an input table including a given plurality of input entries, respectively, related to the parser state table means entries such that one or more of the input entries relate to a one of the state indicating entries, respectively, and an output table having said given plurality of output entries and arranged to be identically addressable within the output table as the entries are respectively addressed within the input table; and
parsing the computer program in the input stream using the established parser state and linearized table means.
2. In the machine-effected method set forth in claim further including the machine-executed steps of:
when establishing the linearized table means, establishing a given plurality of independent portions, each portion having a separate input table and an output table arranged to be identically addressable, respectively;
during said parsing step, identifying a program sentence in the computer program; and
processing the identified program sentence using a one of said plurality of independent portions.
3. In the machine-effected method set forth in claim 2, further including the machine-executed steps of:
in said linearized table means, making one of the portions a reduction portion such that in the processing step, the identified program sentence is reduced; and
in said linearized table means, making a second one of the portions a read transition portion such that in the processing step, read transitions are processed.
4. In the machine-effected method set forth in claim 3 further including the machine-executed steps of:
in said reduction portion, making the reduction portion input table a table of productions which define reductions and making the reduction portion output table a lookahead set table; and
in said read transition portion, making the transition portion input table a table of entrance symbols respectively representing entrances to parser states and the transition portion output table a transition state table.
5. In the machine-effected method set forth in claim 4 further including the machine-executed steps of:
performing said parsing step as a parse loop including automatically selecting a reduction loop or a transition loop;
in said reduction loop, using only the reduction portion tables including detecting a FINAL STATE for ending the parsing; and
in said transition loop, using only the transition portion tables and always performing a reduction loop before ending the parsing.
6. In the machine-effected method of parsing a computer program in an input stream, including the machine-executed steps of:
establishing first look tables having entries identifying the parser states when parsing the computer program in the input stream;
establishing linear vector parsing means having an input means with input indicia respectively addressable from the first look tables and an output means with output indicia respectively in the same vectors as the input indicia; and
parsing the computer program of the input stream including detecting a current parser state in the first look tables, then scanning the input indicia identifiable with the current parser state and then taking the output indicia in the vectors of the input indicia for continuing the parsing.
7. In the machine-effected method set forth in claim 6 further including the machine-executed steps of:
in said linear vector parsing means output means, establishing a given plurality of independent portions, each portion having a separate input table and an output table arranged to be identically addressable, respectively;
during said parsing step, identifying a program sentence in the computer program; and
parsing the identified program sentence using only one of said plurality of independent portions.
8. In the machine-effected method set forth in claim 7 further including the machine-executed steps of:
in said output table means, making one of the portions a reduction portion such that in the parsing step, the identified program sentence is reduced; and
in said output table means, making a second one of the portions a read transition portion such that in the parsing step, read transitions are processed.
9. In the machine-effected method set forth in claim 8 further including the machine-executed steps of:
in said reduction portion, including a table of productions which define reductions and a lookahead set table, each table in the reduction portion having entries which have a same offset and which are pairs of a given respective reduction and a given respective lookahead set; and
in said read transition portion, including a table of entrance symbols respectively representing entrances to parser states and a transition state table having a like number of transition states as there are entrance symbols, placing the entrance symbols and the transition states as pairs so that a scan of the entrance symbols identifies the offset in the transition table of its identified transition state whereby a single scan of the entrance symbol table identifies an offset in the transition state table of the respective identified transition state.
10. In the machine-effected method set forth in claim 9 further including the machine-executed steps of:
performing said parsing step as a parse loop including automatically selecting a reduction loop or a transition loop;
in said reduction loop, using only the reduction portion tables including detecting a FINAL STATE for ending the parsing; and
in said transition loop, using only the transition portion tables and always performing a reduction loop before ending the parsing.
11. In apparatus for parsing a computer program, including, in combination:
state means for indicating a current parser state;
reduction means for reducing program expressions and including a production table having a first number of productions and a lookahead set table having the first number of lookahead set entries, each of the productions being paired with a respective one of the lookahead set entries and each respective paired lookahead set and production having an identical offset address in the respective tables in the reduction means;
read transition means for identifying read transitions in the computer program parsing and including an entrance symbol table having a second number of entrance symbols for entering respective ones of the parser states and a transition state table having a second number of representations of transition states, each of the entrance symbols being paired with a respective one of the transition states and being at an address offset in the entrance symbol table identical to the address offset of the respective paired representation of transition states; and
parse loop means connected to the state means, to the reduction means and to the read transition means for activating either the reduction means or the read transition means for each state indicated in the state means and including changing the state indicated in the state means only when the transition means is used.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US571,502 | 1990-08-23 | ||
US07/571,502 US5193192A (en) | 1989-12-29 | 1990-08-23 | Vectorized LR parsing of computer programs |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1992003781A1 true WO1992003781A1 (en) | 1992-03-05 |
Family
ID=24283962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1991/004065 WO1992003781A1 (en) | 1990-08-23 | 1991-06-10 | Vectorized lr parsing of computer programs |
Country Status (2)
Country | Link |
---|---|
US (1) | US5193192A (en) |
WO (1) | WO1992003781A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0840211A2 (en) * | 1996-10-11 | 1998-05-06 | Sun Microsystems, Inc. | Variable lookahead parser generator |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5105353A (en) * | 1987-10-30 | 1992-04-14 | International Business Machines Corporation | Compressed LR parsing table and method of compressing LR parsing tables |
- 1990-08-23 US US07/571,502 patent/US5193192A/en not_active Expired - Lifetime
- 1991-06-10 WO PCT/US1991/004065 patent/WO1992003781A1/en unknown
Non-Patent Citations (2)
Title |
---|
ARTHUR PYSTER, "Compiler Design and Construction", 1980, by VAN NOSTRAND REINHOLD COMPANY INC., pp. 54-72. * |
ROBIN HUNTER, "The Design and Construction of Compilers", 1981, by JOHN WILEY & SONS LTD., pp. 97-120. * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0840211A2 (en) * | 1996-10-11 | 1998-05-06 | Sun Microsystems, Inc. | Variable lookahead parser generator |
EP0840211A3 (en) * | 1996-10-11 | 2003-10-22 | Sun Microsystems, Inc. | Variable lookahead parser generator |
Also Published As
Publication number | Publication date |
---|---|
US5193192A (en) | 1993-03-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP KR |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE |