The grammar specification itself is not sufficient to produce a
parser. There also needs to be output-language-specific information to
allow the parser to interface with the program it is to be part of. In
the case of the C output routines, sid needs to know the following
information:
- What code should precede and succeed the automatically generated code.
- How to map the sid identifiers into C identifiers.
- How to do assignments for each type.
- How to get the current terminal number.
- How to get the result of the current terminal.
- How to advance the lexical analyser, to get the next terminal.
- What the actions are defined as, and how to pass parameters to them.
- How to save and restore the current terminal when an error occurs.
Eventually almost all of this should be user-suppliable. At the
moment, some of the information is supplied by the user in the C
information file, some through macros, and some is built in. sid
currently gets the information as follows:
The C information file has a header and a trailer section,
which define code that precedes and succeeds the code that
sid generates.
The C information file has a section that allows the user to
specify mappings from sid identifiers into C identifiers. These are
only valid for the following types of identifiers: types, functions
(implementations of rules) and terminals. For other identifier types
(or when no mapping is supplied), sid uses some default rules:
Firstly, sid applies a transform to the sid identifier name, to make
it a legal C identifier. At present this maps `_` to `__`, `-` to `_H`
and `:` (this occurs in scoped names) to `_C`. All other characters are
left unmodified. This transform cannot be changed.
sid also puts a prefix before all identifiers, to
try to prevent clashes (and also to make automatically generated,
i.e. numeric, identifiers legal). These prefixes can be
redefined for each class of identifier, in the C information file.
They should be chosen so as not to clash with any other
identifiers (i.e. no other identifiers should begin with that
prefix).
By default, the following prefixes are used:
Table 6.1. Identifier prefixes
Prefix | Meaning |
---|---|
ZT | This prefix is used before type identifiers, for the type name itself. |
ZR | This prefix is used before rule identifiers, for the rule's implementation function. |
ZL | This prefix is used before rule identifiers, for the rule's label when tail recursion is being eliminated. In this case, a number is added to the prefix, before the identifier name, to prevent clashes when a rule is inlined twice in the same function. It is also used before other labels that are automatically generated and are just numbered. |
ZI | This prefix is used before name identifiers used as parameters to functions, or in normal usage. It is also used by non-local names (which doesn't cause a problem as they always occur scoped, and local names never do). |
ZO | This prefix is used before name identifiers used as results of functions. Results are passed as reference parameters, and this prefix is used then. Another identifier with the ZI prefix is also used within the function, and the type's result assignment operator is used at the end of the function to assign the results to the reference parameters. |
ZB | This prefix is used before the terminal symbol names in the generated header file. |
Normally, sid will do assignments using the C
assignment operator. Sometimes, this will not do the right thing,
so the user can define a set of assignment operations for any type
in the C information file.
sid expects the `CURRENT_TERMINAL`
macro to be defined, and its definition should return an integer
that is the current terminal. The macro should be an expression,
not a statement.
It is necessary to define how to extract the results of all terminals in the C information file (if a terminal doesn't return anything, then it is not necessary to define how to get the result).
sid expects the `ADVANCE_LEXER` macro
to be defined, and its definition should cause the lexical
analyser to read a new token. The new terminal number should be
accessible through the `CURRENT_TERMINAL` macro. On
entry into the parser `CURRENT_TERMINAL` should give
the first terminal number.
All actions, and their parameter and result names, are defined in the C information file.
sid expects the `SAVE_LEXER` and `RESTORE_LEXER`
macros to be defined. The first is
called with an argument which is the error terminal value. The
macro should save the current terminal's value, and set the
current terminal to be the error terminal value. The second macro
is called without arguments, and should restore the saved value of
the current terminal. `SAVE_LEXER` will never be
called more than once without a call to `RESTORE_LEXER`,
so the save stack only needs one element.
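
The contracts of the four lexer macros can be sketched in plain C. This is a minimal illustration rather than code taken from sid: the `Lexer` structure, `lexer_next`, `lexer_init`, the canned token stream and the terminal numbers are all invented for the example. Only the macro names and their described behaviour come from the text. `ADVANCE_LEXER` and `RESTORE_LEXER` are written as object-like macros on the assumption that "called without arguments" means they appear without parentheses.

```c
#include <assert.h>

/* Hypothetical terminal numbers; a real parser would take these from the
 * header file that sid generates. */
enum { TOK_NUMBER = 0, TOK_PLUS = 1, TOK_EOF = 2, TOK_ERROR = 3 };

/* Hypothetical lexer state: a canned token stream plus the one-element
 * save slot that SAVE_LEXER and RESTORE_LEXER need. */
typedef struct {
    const int *stream;   /* next token to be read */
    int        current;  /* current terminal number */
    int        saved;    /* terminal saved by SAVE_LEXER */
} Lexer;

static Lexer lexer;

/* Read the next token; stays on TOK_EOF once the stream is exhausted. */
static void lexer_next(void)
{
    lexer.current = *lexer.stream;
    if (lexer.current != TOK_EOF) {
        lexer.stream++;
    }
}

/* An expression (not a statement) yielding the current terminal. */
#define CURRENT_TERMINAL  (lexer.current)
/* Make the lexical analyser read a new token. */
#define ADVANCE_LEXER     lexer_next()
/* Save the current terminal, then make the error terminal current. */
#define SAVE_LEXER(t)     (lexer.saved = lexer.current, lexer.current = (t))
/* Restore the terminal saved by the matching SAVE_LEXER. */
#define RESTORE_LEXER     (lexer.current = lexer.saved)

/* Prime the lexer so CURRENT_TERMINAL is valid on entry to the parser. */
static void lexer_init(const int *tokens)
{
    lexer.stream = tokens;
    lexer_next();
}
```

A single `saved` field is enough here because, as stated above, `SAVE_LEXER` is never called twice without an intervening `RESTORE_LEXER`.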
The remainder of this section describes the layout of the C
information file. The lexical conventions are described first, followed
by a description of the sections in the order in which they should
appear. Unlike the sid grammar file, not all sections are
mandatory.
The lexical conventions of the C information file are very similar
to those of the sid grammar file. There is a second class
of identifier: the C identifier, which is a subset of the valid
sid identifiers; there is also the C code block.
A C code block begins with `@{` and is terminated by
`@}`. The code block consists of all of the characters
between the start and end of the code block, subject to substitutions.
All substitutions begin with the `@` character. The
following substitutions are recognised:
- `@@`: This substitutes the `@` character itself.
- `@:label`: This form marks a label, which will be substituted for in
  the output code. This is necessary because an action may be
  inlined into the same function more than once. If this happens,
  then without label substitution there would be two
  identical labels in the same scope. With label substitution,
  this problem is avoided. In general, all references to a label
  within an action should be prefixed with `@:`. This
  substitution may not be used in header and trailer code
  blocks.
- `@identifier`: This form marks a parameter or result identifier
  substitution. If parameter and result identifiers are not
  prefixed with an `@` character, then they will not be
  substituted. It is an error if the identifier is not a parameter
  or a result. Header and trailer code blocks have no parameters
  or results, so it is always an error to use identifier
  substitution in them. It is an error if any of the result
  identifiers are not substituted at least once.

  Result identifiers may be assigned to using this form of
  identifier substitution, but parameter identifiers may not be
  (nor may their address be taken; they are immutable). To try
  to prevent this, parameters that are substituted may be cast
  to their own type, which makes them unmodifiable in ISO C (see
  the notes on the casts language specific option).
- `@&identifier`: This form marks a parameter identifier whose address is to be substituted, but whose contents will not be modified. The effects of modifying the identifier are undefined. It is an error to use this in parameter assignment operator definitions.
- `@=identifier`: This form marks a parameter identifier that will be modified. For this to be useful, the parameter should be a call by reference parameter, so that the effect of the modification will be propagated. This substitution is only valid in actions (assignment operators are not allowed to modify their parameters).
- `@!`: This form marks an exception raise. In the generated code, a jump to the current exception handler will be substituted. This substitution is only valid in actions.
- `@.`: This form marks an attempt to access the current terminal. This substitution is only valid in actions.
- `@>`: This form marks an attempt to advance the lexical analyser. This substitution is only valid in actions.
All other forms are illegal. Note that in the case of labels and
identifiers, no white space is allowed between the `@:`,
`@`, `@&` or `@=` and the
identifier name. An example of a code block is:

    @{
        /* A code block */
        {
            int i ;
            if ( @param ) {
                @! ;
            }
            @result = 0 ;
            for ( i = 0 ; i < 100 ; i++ ) {
                printf ( "{%d}\n", i ) ;
                @result += i ;
            }
            @=param += @result ;
            if ( @. == TOKEN_SEMI ) {
                @> ;
            }
        }
    @}

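
The substitution forms listed above can be summarised as a small classifier. The sketch below is illustrative only and is not how sid itself scans code blocks: `SubKind`, `starts_identifier` and `classify_substitution` are hypothetical names, and it assumes that labels and identifiers start with a letter or underscore, as C identifiers do. The `@}` terminator is deliberately not handled, since it ends the block rather than being a substitution.

```c
#include <assert.h>
#include <ctype.h>

/* The kinds of substitution described in the text, plus "illegal". */
typedef enum {
    SUB_AT,        /* @@          literal '@'                   */
    SUB_LABEL,     /* @:label     label substitution            */
    SUB_IDENT,     /* @identifier parameter or result           */
    SUB_ADDRESS,   /* @&identifier address of a parameter       */
    SUB_MODIFY,    /* @=identifier parameter that is modified   */
    SUB_RAISE,     /* @!          exception raise               */
    SUB_TERMINAL,  /* @.          current terminal              */
    SUB_ADVANCE,   /* @>          advance the lexical analyser  */
    SUB_ILLEGAL
} SubKind;

/* Assumption: names begin like C identifiers. */
static int starts_identifier(char c)
{
    return c == '_' || isalpha((unsigned char) c);
}

/* Classify the substitution that begins at s, where s[0] must be '@'.
 * No white space is accepted between the marker and the name. */
static SubKind classify_substitution(const char *s)
{
    if (s[0] != '@') {
        return SUB_ILLEGAL;
    }
    switch (s[1]) {
    case '@': return SUB_AT;
    case '!': return SUB_RAISE;
    case '.': return SUB_TERMINAL;
    case '>': return SUB_ADVANCE;
    case ':': return starts_identifier(s[2]) ? SUB_LABEL   : SUB_ILLEGAL;
    case '&': return starts_identifier(s[2]) ? SUB_ADDRESS : SUB_ILLEGAL;
    case '=': return starts_identifier(s[2]) ? SUB_MODIFY  : SUB_ILLEGAL;
    default:  return starts_identifier(s[1]) ? SUB_IDENT   : SUB_ILLEGAL;
    }
}
```

Whether a given form is permitted in a particular section (for example, `@!` only in actions) is a separate check that this sketch does not attempt.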
The first section in the C information file is the prefix
definition section. This section is optional. It begins with the
section header, followed by a list of prefix definitions. A prefix
definition begins with the prefix name, followed by a `=`
symbol, followed by a C identifier that is the new prefix, and
terminated by a semicolon. The following example shows all of the
prefix names, and their default values:

    %prefixes%
        type     = ZT ;
        function = ZR ;
        label    = ZL ;
        input    = ZI ;
        output   = ZO ;
        terminal = ZB ;

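
To make the default naming rules concrete, the following sketch combines a class prefix with the fixed character transform described earlier (`_` to `__`, `-` to `_H`, `:` to `_C`). It is a hypothetical helper, not part of sid: the function name `sid_c_name` is invented, and it assumes the prefix is concatenated directly with the transformed name and that no number is inserted (as would happen for labels).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build a C name from a class prefix and a sid identifier, applying
 * '_' -> "__", '-' -> "_H" and ':' -> "_C" to the identifier only.
 * Returns 0 on success, or -1 if the output buffer is too small.
 */
static int sid_c_name(const char *prefix, const char *name,
                      char *out, size_t out_size)
{
    size_t n = strlen(prefix);

    if (n >= out_size) {
        return -1;
    }
    memcpy(out, prefix, n);

    for (; *name != '\0'; name++) {
        char        single[2];
        const char *piece;
        size_t      len;

        single[0] = *name;
        single[1] = '\0';

        switch (*name) {
        case '_': piece = "__";   break;
        case '-': piece = "_H";   break;
        case ':': piece = "_C";   break;
        default:  piece = single; break;  /* left unmodified */
        }

        len = strlen(piece);
        if (n + len >= out_size) {
            return -1;                    /* no room for piece and NUL */
        }
        memcpy(out + n, piece, len);
        n += len;
    }
    out[n] = '\0';
    return 0;
}
```

Under these assumptions a rule called `primary-expr` would become `ZRprimary_Hexpr` with the default `function` prefix. Doubling the underscore is what keeps a name that already contains `_H` or `_C` distinct from one produced by the transform.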
The section that follows the prefixes section is the maps section.
This section is also optional. It begins with its section header,
followed by a list of identifier mappings. An identifier mapping
begins with a sid identifier (either a type, a rule or a
terminal), followed by the `->` symbol, followed by the
C identifier it is to be mapped to, and terminated by a semicolon. An
example follows:

    %maps%
        NumberT    -> unsigned ;
        calculator -> calculator ;

Note that it is not possible to map type identifiers to be
arbitrary C types. It will be necessary to `typedef` or
macro define the type name in the C file.
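
As a brief illustration of that restriction: a map target has to be a single C identifier, so a type spelled with several tokens needs a one-word name first. The type names below (`count_t`, `node_ptr`, `struct list_node`) and the map entries mentioned in the comment are invented for the example.

```c
#include <assert.h>

/* A hypothetical structure type used by the parser's actions. */
struct list_node {
    int               value;
    struct list_node *next;
};

/* Neither "unsigned long" nor "struct list_node *" is a single C
 * identifier, so each is given a one-word name.  A %maps% section could
 * then contain entries such as
 *     CountT -> count_t ;
 *     NodeT  -> node_ptr ;
 * (both sid type names are made up here). */
typedef unsigned long     count_t;
typedef struct list_node *node_ptr;
```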
It is recommended that all types, terminals and entry point rules have their names mapped in this section, although this is not necessary. If the names are not mapped, the rest of the program will have to refer to them by their generated names (the default prefix followed by the transformed identifier).
After the maps section comes the header section. This begins with the section header, followed by a code block, followed by a comma, followed by a second code block, and terminated with a semicolon. The first code block will be inserted at the beginning of the generated parser file; the second code block will be inserted at the start of the generated header file. An example is:

    %header% @{
        #include "lexer.h"

        LexerT token ;

        #define CURRENT_TERMINAL token.t
        #define ADVANCE_LEXER next_token ()

        extern void terminal_error () ;
        extern void syntax_error () ;
    @}, @{
    @} ;

The assignments section follows the header section. This section is optional. Normally, assignment between two identifiers will be done using the C assignment operator. In some cases this will not do the correct thing, and it is necessary to do the assignment differently. All types for which this applies should have an entry in the assignments section. The section begins with its header, followed by definitions for each type that needs its own assignment operator. Each definition should have one parameter, and one result. The action's name should be the name of the type. An example follows:

    %assignments%
        ListT : ( l1 ) -> ( l2 ) = @{
            if ( @l2.head = @l1.head ) {
                @l2.tail = @l1.tail ;
            } else {
                @l2.tail = &( @l2.head ) ;
            }
        @} ;

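
The `ListT` example hints at why plain C assignment can be wrong. The sketch below assumes one plausible definition of `ListT` (the source only shows that it has `head` and `tail` members): `tail` points at the link where the next node will be stored, which is the list's own `head` member while the list is empty. `Node`, `list_init`, `list_append` and `list_assign` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int          value;
} Node;

/* Assumed layout: tail points at the link field to fill in next, which
 * is &head while the list is empty. */
typedef struct {
    Node  *head;
    Node **tail;
} ListT;

static void list_init(ListT *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

static void list_append(ListT *l, Node *n)
{
    n->next = NULL;
    *l->tail = n;
    l->tail = &n->next;
}

/* Same logic as the %assignments% entry above: copy head, and aim tail
 * at the destination's own head member when the list is empty. */
static void list_assign(ListT *dst, const ListT *src)
{
    if ((dst->head = src->head) != NULL) {
        dst->tail = src->tail;
    } else {
        dst->tail = &dst->head;
    }
}
```

With this layout, `b = a` on an empty list leaves `b.tail` pointing into `a`, so a later append through `b` would silently modify `a` instead. The custom operator avoids that by re-aiming `tail`.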
If a type has an assignment operator defined, it must also have a parameter assignment operator and a result assignment operator defined (more precisely, it must have either no assignment operations defined, or all three assignment operations defined).
The parameter assignments section is very similar to the assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must have a parameter assignment entry as well.
The parameter assignment operator is used in function calls to ensure that the object is copied correctly. If no parameter assignment operator is provided for a type, the standard C call by copy mechanism is used. If one is provided, the calling function passes the address of the object, and the called function declares a local of the same type and uses the parameter assignment operator to copy the object into it. This should be remembered when passing parameters to entry points that have arguments of a type with a parameter assignment operator defined: such arguments must be passed by address.
The difference between the parameter assignment operator and the assignment operator is that the parameter identifier to the parameter assignment operator is a pointer to the object being manipulated, rather than the object itself. An example parameter assignment section is:

    %parameter-assignments%
        ListT : ( l1 ) -> ( l2 ) = @{
            if ( @l2.head = @l1->head ) {
                @l2.tail = @l1->tail ;
            } else {
                @l2.tail = &( @l2.head ) ;
            }
        @} ;

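
The calling convention this implies can be imitated by hand. In the sketch below, `count_nodes` stands in for a generated function that takes a `ListT` parameter: it receives the address of the caller's object, declares a local of the same type, and copies into it with the body of the parameter assignment operator. The `Node` and `ListT` definitions and the function itself are assumptions made for illustration and are not sid output.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
} Node;

/* Assumed layout; the source only shows head and tail members. */
typedef struct {
    Node  *head;
    Node **tail;
} ListT;

/* Stand-in for a function whose parameter type has a parameter
 * assignment operator: the caller passes &list, not list. */
static int count_nodes(ListT *arg)
{
    ListT local;        /* local copy declared by the called function */
    Node *n;
    int   count = 0;

    /* Parameter assignment: the source is a pointer (arg->...), the
     * destination is an object (local. ...). */
    if ((local.head = arg->head) != NULL) {
        local.tail = arg->tail;
    } else {
        local.tail = &local.head;
    }

    for (n = local.head; n != NULL; n = n->next) {
        count++;
    }
    return count;
}
```

A caller would write `count_nodes(&list)`, which is the point to remember when calling entry points with such arguments.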
The result assignments section is very similar to the assignments section and the parameter assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must also have a result assignment entry. The only difference from the assignment operator is that the result identifier of the result assignment operation is a pointer to the object being manipulated, rather than the object itself. Result assignments are only used when the results of a rule are assigned back through the reference parameters passed into the function. An example result assignment section is:

    %result-assignments%
        ListT : ( l1 ) -> ( l2 ) = @{
            if ( @l2->head = @l1.head ) {
                @l2->tail = @l1.tail ;
            } else {
                @l2->tail = &( @l2->head ) ;
            }
        @} ;

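
A hand-written imitation of a rule function shows where the result assignment fits. The result is built in a local (named with the default `ZI` prefix) and copied out through a reference parameter (named with the default `ZO` prefix) at the end of the function. The rule name `empty`, the `Node` and `ListT` definitions, and the function body are assumptions for illustration, not generated code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
} Node;

/* Assumed layout; the source only shows head and tail members. */
typedef struct {
    Node  *head;
    Node **tail;
} ListT;

/* Imitation of a function for a rule "empty" with no parameters and
 * one ListT result, which is passed as a reference parameter. */
static void ZRempty(ListT *ZOl)
{
    ListT ZIl;                  /* the result is built in a local first */

    ZIl.head = NULL;
    ZIl.tail = &ZIl.head;

    /* Result assignment: the destination is a pointer (ZOl->...), the
     * source is an object (ZIl. ...). */
    if ((ZOl->head = ZIl.head) != NULL) {
        ZOl->tail = ZIl.tail;
    } else {
        ZOl->tail = &ZOl->head;
    }
}
```

Here a plain `*ZOl = ZIl` would leave the caller's `tail` pointing at a local that no longer exists after the function returns, which is the kind of case the custom operator exists for.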
The terminal result extraction section follows the result assignments section. It defines how to extract the results from terminals. The section begins with its section header, followed by the terminal extraction definitions.
There must be a definition for every terminal in the grammar that returns a result. It is an error to include a definition for a terminal that doesn't return a result. The result of the definition should be the same as the result of the terminal. An example of the terminal result extraction section follows:

    %terminals%
        number : () -> ( n ) = @{
            @n = token.u.number ;
        @} ;
        identifier : () -> ( i ) = @{
            @i = token.u.identifier ;
        @} ;
        string : () -> ( s ) = @{
            @s = token.u.string ;
        @} ;

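
The extraction code above relies on a `token` variable with `t` and `u` members. The source does not show `lexer.h`, so the following is only one possible shape for `LexerT`: the member types, the terminal numbers and the `extract_number` helper are all assumptions.

```c
#include <assert.h>

/* Hypothetical terminal numbers. */
enum { TOKEN_NUMBER, TOKEN_IDENTIFIER, TOKEN_STRING, TOKEN_EOF };

/* One possible LexerT: the terminal number in t, and the value carried
 * by the terminal (if any) in the union u. */
typedef struct {
    int t;
    union {
        unsigned    number;
        const char *identifier;
        const char *string;
    } u;
} LexerT;

static LexerT token;

/* Roughly what "@n = token.u.number ;" amounts to once @n has been
 * replaced by a result variable of the mapped type. */
static unsigned extract_number(void)
{
    unsigned n;

    n = token.u.number;
    return n;
}
```

Using a union keeps the token small, but it means each extraction is only meaningful while `token.t` identifies the matching terminal.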
The action definition section follows the terminal result extractor definition section. The format is similar to the previous sections: the section header followed by definitions for all of the actions. An action definition has the following form:
<action-name> : ( parameters ) -> ( results ) = code-block ;
This is similar to the form of all previous definitions, except that the name is surrounded in angle brackets. What follows is also true of the other definitions (unless they state otherwise).
The `action-name` is a sid identifier that
is the name of the action being defined; `parameters` is a
comma separated list of C identifiers that will be the names of the
parameters passed to the action, and `results` is a comma
separated list of C identifiers that will be the names of the result
parameters passed to the action. The `code-block` is the C
code that defines the action. It is expected that this will assign a
valid result to each of the result identifier names.
The parameter and result tuples have the same form as in the language independent file, except that the types are optional. Like the language independent file, if the type of an action is zero-tuple to zero-tuple, then the type can be omitted, e.g.:
<action> = @{ /* .... */ @} ;
An example action definition section is:

    %actions%
        <add> : ( v1, v2 ) -> ( v3 ) = @{
            @v3 = @v1 + @v2 ;
        @} ;
        <subtract> : ( v1 : NumberT, v2 : NumberT ) -> ( v3 : NumberT ) = @{
            @v3 = @v1 - @v2 ;
        @} ;
        <multiply> : ( v1 : NumberT, v2 ) -> ( v3 ) = @{
            @v3 = @v1 * @v2 ;
        @} ;
        <divide> : ( v1, v2 ) -> ( v3 : NumberT ) = @{
            @v3 = @v1 / @v2 ;
        @} ;
        <print> : ( v ) -> () = @{
            printf ( "%u\n", @v ) ;
        @} ;
        <error> = @{
            fprintf ( stderr, "ERROR\n" ) ;
            exit ( EXIT_FAILURE ) ;
        @} ;

Do not define static variables in action definitions; if you do, you will get unexpected results. If you wish to use static variables in action definitions, then define them in the header block.
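
The text does not say why static variables misbehave. One plausible reason, given that an action may be inlined into the same function more than once, is that each inlined copy would declare its own static variable. The sketch below imitates two such copies by hand. It is not generated code, and `parse_twice`, `count` and `shared_count` are invented names.

```c
#include <assert.h>

/* File-scope counter, as it would be if declared in the header block. */
static int shared_count = 0;

/* Two hand-expanded copies of the same action body.  Each copy has its
 * own block-scope static, so the copies do not share that state. */
static int parse_twice(void)
{
    int seen_by_second;

    {   /* first inlined copy of the action */
        static int count = 0;
        count++;
        shared_count++;
    }
    {   /* second inlined copy of the action */
        static int count = 0;
        count++;
        shared_count++;
        seen_by_second = count;
    }
    return seen_by_second;
}
```

After one call, the second copy has only seen its own `count` reach 1, while the file-scope `shared_count` has correctly reached 2.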
After the action definition section comes the trailer section. This has the same form as the header section. An example is:

    %trailer% @{
        int main ()
        {
            next_token () ;
            calculator ( NULL ) ;
            return 0 ;
        }
    @}, @{
    @} ;

The code blocks will be appended to the generated parser, and the generated header file respectively.