Chapter 10: Compilation

The State

ANS Forth specifies the system variable STATE to be zero (false) in interpretation state, and non-zero (true) in compilation state. In StrongForth, STATE is a variable of data type FLAG:

FALSE VARIABLE STATE

Messing around with STATE is discouraged by the ANS Forth specification. Only few words are supposed to alter the value of STATE. Among these words are [ and ], which are typically used during compilation to temporarily turn to interpretation state. Their definition is straight-forward:

: [ ( -- )
  FALSE STATE ! ;

: ] ( -- )
  TRUE STATE ! ;

The usage of several words is bound to a specific state in the sense that they may only be used in either interpretation or compilation state. For example, using IF or BEGIN only makes sense in compilation state. On the other hand, : and :NOMANE may only be executed while in interpretation state, because compilation must not be nested. To ensure that these restrictions are not violated, it is recommended to verify the correct state before executing such words. For this purpose, StrongForth provides the two words ?COMPILE and ?EXECUTE. ?COMPILE throws an exception if not in compilation state, while ?EXECUTE throws an exception if not in interpretation state.

: ?COMPILE ( -- )
  STATE @ INVERT IF -14 THROW THEN ;
  
: ?EXECUTE ( -- )
  STATE @ IF -29 THROW THEN ;

You already saw a typical application of ?EXECUTE in the definition of (CREATE) as explained in chapter 7. ?COMPILE is used by immediate words that are supposed to be executed during compilation.

Colon-Definitions

New colon definitions are initiated by either : or :NONAME, and terminated by ;. To understand what's going on during the definition of a new word, let's again have a look at a definition's data structure in the various memory spaces.

The name space contains all information that is required to find a word in the dictionary:


name field

link field
attribute field
token field

input parameter field


output parameter field

Via the pointer in the token field, you can get to the code and data fields, which are located in the constant data space:

code field

data field

The code field contains another pointer, which points to the machine code of the definition:


machine code

All colon definitions share the same machine code. This code is the so-called inner interpreter of StrongForth. It executes the virtual machine code that is stored as a list of tokens in the colon definition's data field. Therefore, it's not necessary to create additional machine code for a new colon definition.

A typical colon definition consists of four parts:

This is actually very similar to the definition of a machine code word, which consists of three parts (see chapter 7):

Both : and CODE use (CREATE) to generate the name field, the link field, the attribute field, and the token field, and the code field. The input and output parameter fields are being filled by the optional stack diagram. But while machine code definitions leave the data field empty and compile machine code in the code space, do colon definitions compile virtual machine code into the data field and use the inner interpreter to execute it.

Now, here's the definition of : and, as a reference, the definition of CODE:

: : ( -- COLON-DEFINITION )
  ?EXECUTE CONST-HERE (CREATE) ['CODE] CR CONST,
  ] DTP! LOCALS! LATEST CAST COLON-DEFINITION ;
    
: CODE ( -- )
  ?EXECUTE CONST-HERE (CREATE) CODE-HERE CONST, ;

: always compiles the machine code address of the inner interpreter into the code field. This address can easily be obtained from any existing colon definition, like for example CR. ] switches from interpretation state to compilation state. Next, DTP! unlocks and initializes the compiler data type heap, which is empty by default. By providing a stack diagram before the first word of virtual code is compiled, the compiler data type heap will be re-initialized with the input parameters. As you already learned in chapter 7, ) actually pushes the input parameters of the stack diagram onto the compiler data type heap.

LOCALS! empties the local name space and initializes the system variable #LOCALS with zero. At the beginning of a colon definition, no locals are defined yet. Therefore, the local name space is empty.

LOCALS! ( -- )

#LOCALS contains the number of cells reserved for locals, which StrongForth keeps on the return stack until the end of the definition. For some reason explained later, #LOCALS is a signed number:

-0 VARIABLE #LOCALS

: returns the definition it created as an item of data type COLON-DEFINITION, which is used by ) and consumed by ;.

:NONAME is a variant of : that creates colon definitions without name and link fields. Instead of (CREATE), it executes (CREATE-NONAME). :NONAME returns an additional copy of the definition itself for further processing:

: :NONAME ( -- DEFINITION COLON-DEFINITION )
  ?EXECUTE CONST-HERE (CREATE-NONAME) ['CODE] CR CONST,
  ] DTP! LOCALS! LATEST DUP CAST COLON-DEFINITION ;

Colon definitions initiated by : and :NONAME have to be terminated by ;. Because ; is executed while in compilation state, it has to be an immediate word. It uses [ to turn back to interpretation state. Executing LOCALS! ensures that local variables do not remain visible outside of the definition:

: ; ( COLON-DEFINITION -- )
  ?COMPILE DUP >BODY CONST-HERE =
  IF >CODE ['CODE] NOOP SWAP ! ?PARAMS
  ELSE DROP DTP@ IF POSTPONE EXIT THEN
  THEN POSTPONE [ LOCALS! END-CODE ; IMMEDIATE

: has already compiled the address of the inner interpreter into the code field of the new definition. However, in StrongForth it is sometimes useful to define words with no semantics at all. These words have no virtual code at all, although they might have a stack effect. In those cases where the data field is empty, ; replaces the content of the code field with a pointer to the machine code of NOOP. NOOP is a word that performs no operation. It doesn't start the inner interpreter and does not have a data field:

NOOP ( -- )

Here's an example of a definition with no virtual code, where ; compiles the code field of NOOP:

: TOKEN>ADDR ( TOKEN -- CONST -> CODE )
  CAST CONST -> CODE ;

The final action of ; is linking the new definition to the current compilation word list, because this makes the word visible for SEARCH and SEARCH-ALL.

Is that all ; does? Well, almost. In StrongForth, ; additionally has to compare the output parameters of the new definition with the contents of the compiler data type heap at the end of the definition. If the parameters do not exactly match the data types, the new definition is rejected. ?PARAMS (and also EXIT, which contains ?PARAMS), may abort ; before END-CODE links the definition to the dictionary:

: ?PARAMS ( -- )
  DTP@ IF LATEST DTP@ DEPTH - ?CONGRUENT THEN ;

The IF clause in the definition of ?PARAMS takes care of the case when the compiler data type heap is locked. This happens, for example, when an endless loop is being compiled:

: ... BEGIN ... AGAIN ;

AGAIN locks the compiler data type heap, because any virtual code after AGAIN will never be executed. In this case, comparing the definition's output parameters with the contents of compiler data type heap does not make sense.

Note than ; contains a similar IF clause around POSTPONE EXIT. If a definition ends with the compiler data type heap locked, as in the endless loop example above, it is not necessary and not even possible to compile (EXIT) (or (+EXIT)).

But let's get back to ?PARAMS.. The data type comparison is actually performed by ?CONGRUENT. ?CONGRUENT expects the definition whose output parameters shall be checked, and a pointer to the first data type on the data type heap:

?CONGRUENT ( DEFINITION DATA -> DATA-TYPE -- )

The output parameters of the definition are one by one compared to the data types starting at the pointer up to the top of the data type heap. Data type references in the output parameter list are resolved. For example, given the definition

: EXAMPLE ( CDATA -> CHARACTER UNSIGNED -- 2ND 3RD 1ST FLAG )

the data type heap must contain the following data types in order for ?CONGRUENT not to throw an exception:

CHARACTER UNSIGNED CDATA -> CHARACTER FLAG

FLAG is the topmost data type on the data type heap. Note that ?CONGRUENT works both in interpretation and in compilation state, because it automatically selects the appropriate data type heap. In the context of ;, ?CONGRUENT always uses the compiler data type heap.

If the new definition does not contain virtual code, nothing else is to be done by ;. Otherwise, ; additionally has to compile code for the definition to exit, i. e., to make the inner interpreter return to the calling definition. This is done by EXIT:

: EXIT ( -- )
  ?COMPILE ?PARAMS #LOCALS @
  IF POSTPONE (+EXIT) #LOCALS @ CELLS CONST,
  ELSE POSTPONE (EXIT)
  THEN DTP| ; IMMEDIATE

EXIT queries #LOCALS to check whether the new definition contains any locals that need to be removed from the return stack before exiting the definition. Remember that #LOCALS contains the number of cells reserved for locals. If there are no locals, EXIT just compiles (EXIT), which causes the new definition to terminate. If, on the other hand, one or more locals have been defined, EXIT compiles (+EXIT) and the number of address units occupied by the locals on the return stack. The number of address units is actually a parameter for (+EXIT). When a definition executes (+EXIT), it removes the locals from the return stack and then performs the semantics of (EXIT) in order to terminate the definition.

(EXIT) ( -- )
(EXIT)
(+EXIT) ( -- )
(+EXIT) address units

As in ANS Forth, EXIT may be used independently from ; within a definition. That's why it is defined as an immediate word, and that's why it executes ?COMPILE at the beginning. An interesting detail is the fact that EXIT uses DTP| to lock the compiler data type heap. With the compiler data type heap locked, no further data processing is allowed. As a consequence, you cannot compile any words with input and/or output parameters immediately after compiling EXIT. This makes no sense anyway, because these words would never be executed. Compilation can only continue after the compiler data type heap has been unlocked again. This is typically done by an immediate word like THEN, which joins two different control flows while keeping track of the data flow:

... IF ... EXIT THEN ...

You'll see later how this works.

Compiling Literals

A literal is a constant with a specific value and a specific data type, which has no other semantics than being a constant. For example, -6172 is a literal of data type SIGNED. When the interpreter executes a literal, it pushes an item with the literal's value onto the data stack, and adds the literal's data type to the interpreter data type heap. When the compiler compiles a literal, it generates virtual machine code that pushes the literal's value onto the data stack at runtime, and adds the literal's data type to the compiler data type heap. How does this virtual machine code look like? It consists of a token, followed by the binary value of the literal. StrongForth has two separate tokens for single-cell and double-cell literals:

LIT ( -- SINGLE )
LIT single-cell value
DLIT ( -- DOUBLE )
DLIT double-cell value

When the token of LIT is executed by the inner interpreter, a single-cell value is pushed onto the data stack. DLIT pushes a double-cell value onto the stack. To compile the complete sequences of virtual code, the overloaded word LITERAL, can be used:

: LITERAL, ( SINGLE -- )
  ['TOKEN] LIT CONST, CONST, ;

: LITERAL, ( DOUBLE -- )
  ['TOKEN] DLIT CONST, CONST, ;

['TOKEN] is an immediate word that parses the input source for the name of a word, finds a word with this name, and finally compiles the token of this word as a literal. Further details on ['TOKEN] can be found later in this section.

Unfortunately, compiling the virtual code is only half of the job to be done by the compiler. In StrongForth, the compiler data type heap has to be updated as well. Since the runtime behaviour of a literal is simply to push its value onto the stack, the data type of the literal has to be appended to the compiler data type heap at compile time.

The ANS Forth word LITERAL is an immediate word that compiles a literal whose value is calculated at compile time. StrongForth provides overloaded versions for single-cell and double-cell literals:

: LITERAL ( SINGLE -- )
  ?COMPILE POSTPONE [ DTP@ ] @>DT LITERAL, ; IMMEDIATE

: LITERAL ( DOUBLE -- )
  ?COMPILE POSTPONE [ DTP@ ] @>DT LITERAL, ; IMMEDIATE

LITERAL is typically used during compilation in the following way:

[ ... ] LITERAL

The code enclosed in brackets is immediately executed in order to calculate the value of a literal at compile time. LITERAL compiles the virtual machine code of the literal, and additionally adds the data type of the literal to the compiler data type heap. But where can we get the data type from? Obviously, the data type is stored at the top of the interpreter data type heap. At the point where LITERAL is executed, LITERAL has already been parsed and the interpreter data type heap has been updated. This means, the interpreter data type heap pointer now points directly to the data type of the literal. To get the interpreter data type heap pointer, LITERAL temporarily switches to interpretation state. Otherwise, DTP@ would return the compiler data type heap pointer, because LITERAL is always executed in compilation state. @>DT eventually adds this data type to the compiler data type heap:

: @>DT ( DATA -> DATA-TYPE -- )
  BEGIN DUP @ DUP DT-PREFIX AND >DT DT-PREFIX ATTRIBUTE?
  WHILE 1+
  REPEAT DROP ;

Generally, @>DT copies a compound data type from a given address to the interpreter or compiler data type heap, depending on the value of STATE. It consists of nothing else but a loop that executes >DT while the prefix attribute is set. All other attributes are removed before copying a data type to the data type heap.

Other interesting applications of LITERAL, are the ANS Forth words ['] and [CHAR], which both calculate a value at compile time and then compile it as a literal. As a first approach, they could be defined as follows:

: ['] ( -- )
   ?COMPILE ' LITERAL, [ DT DEFINITION ] LITERAL >DT ; IMMEDIATE
   
: [CHAR] ( -- )
   ?COMPILE CHAR LITERAL, [ DT CHARACTER ] LITERAL >DT ; IMMEDIATE

The values are simply calculated by ' and CHAR, respectively. Then, the literal is added to the virtual machine code using LITERAL, and the data type is added to the compiler data type heap with >DT. This solution works fine. However, it leaves the bad taste of low-level programming, because the data type of the literal has to be specified manually instead of just using the data type of the calculated value, which is already present on the compiler data type heap during compilation of ['] and [CHAR]. The new solution is [LITERAL]. [LITERAL] is an immediate word that compiles the phrase LITERAL, [ DT datatype ] LITERAL >DT, where datatype represents the data type of the value compiled by LITERAL,:

: [LITERAL] ( -- )
  ?COMPILE POSTPONE LITERAL, DTP@
  BEGIN DUP @ DUP [ DT DATA-TYPE ] LITERAL >DT LITERAL,
     POSTPONE >DT DT-PREFIX ATTRIBUTE?
  WHILE 1+
  REPEAT DROP ; IMMEDIATE

Compiling LITERAL, consumes the topmost compound data type of the compiler data type heap, leaving the compiler data type heap pointer pointing to exactly this data type. Note that [LITERAL] itself has no input parameter. Next, a loop compiles a data type literal plus >DT for each basic data type the compound data type consists of.

Although the resulting virtual machine code is identical to the first solution, [LITERAL] simplifies the definitions of ['] and [CHAR]:

: ['] ( -- )
  ?COMPILE ' [LITERAL] ; IMMEDIATE

: [CHAR] ( -- )
  ?COMPILE CHAR [LITERAL] ; IMMEDIATE

Three more words, which are unique to StrongForth, also take advantage of [LITERAL]. [DT] compiles a data type literal. It can be used in the same way as DT. In the above definition on [LITERAL], the phrase [ DT DATA-TYPE ] LITERAL could have been replaced by [DT] DATA-TYPE, if this wouldn't result in a forward reference, because [DT] itself needs [LITERAL] in its definition. ['CODE] and ['TOKEN] evaluate the content of the code field and the token of a word found in the dictionary and compile it as a literal, respectively. Examples of using ['TOKEN] are the definitions of LITERAL, as explained at the beginning of this section. ['CODE] is, among others, used by ?DATA-TYPE, DT and ;.

: [DT] ( -- )
  ?COMPILE DT [LITERAL] ; IMMEDIATE
  
: ['CODE] ( -- )
  ?COMPILE 'CODE [LITERAL] ; IMMEDIATE

: ['TOKEN] ( -- )
  ?COMPILE 'TOKEN [LITERAL] ; IMMEDIATE

Other Forth system that compile and interpret virtual machine code as well, often use a simple trick to compile a token as a literal. Instead of

... ['TOKEN] name ...

you could write

... LIT name ...

during compilation. The compiler will compile the tokens of LIT and name just as required for the virtual machine. Do you see why this does not work in StrongForth? Yes, it's StrongForth's type system that prevents such things. When compiling LIT and name, the compiler applies the stack effects of these two words to the compiler data type heap, although the intended stack effect of compiling LIT name should always be ( -- token ). Let's have a look at an example:

: TEST1
  ['TOKEN] SPACES . ;
 OK
TEST1
552  OK
: TEST2
  LIT SPACES . ;
  
  LIT SPACES ? undefined word
SINGLE

TEST1 works fine. It prints the numerical value of the token of SPACES. But compiling TEST2 fails, because compiling LIT adds data type SINGLE to the compiler data type heap, while compiling SPACES expects the top of the compiler data type heap to contain either data type SIGNED or UNSIGNED. However, even if this programming trick might work in some special cases, where the data types are accidentally correct, source code like this is not portable.

Compiling String Literals

In the previous section, you've seen how literals for single-cell and double-cell items are being compiled. Now, what about string literals? A string literal is compiled as the token of SLIT, followed by a number of bytes that contain the length of the string and its characters. To prevent the constant data space from being unaligned after compiling a string literal, one or more padding bytes may be compiled after the last character.

SLIT ( -- CCONST -> CHARACTER UNSIGNED )
SLIT n char 1 char 2 char 3 ... char n

When executed, SLIT returns the address of the first character and the length of the string. Note that the string is stored in the CONST memory area, together with the virtual machine code.

Now, here's the ANS Forth word that actually compiles a string literal. SLITERAL is an immediate word, that can be applied in the same way as LITERAL for single-cell and double-cell items:

: SLITERAL ( CDATA -> CHARACTER UNSIGNED -- )
  ?COMPILE SPACE@ CONST-SPACE ROT ROT
  POSTPONE SLIT ", ALIGN SPACE! ; IMMEDIATE

Other than LITERAL, SLITERAL does not need to take care of the actual data type of the literal it compiles. The stack diagrams of LIT and DLIT contain SINGLE and DOUBLE only as representatives for the actual data type of the literal, while SLIT already has the final stack diagram CCONST -> CHARACTER UNSIGNED. Therefore, POSTPONE SLIT is all that is needed in order to compile the virtual machine code of SLIT and to update the compiler data type heap.

", compiles a counted string into the current memory space. It is enclosed in SPACE@ CONST-SPACE ... SPACE!, because the string has to be compiled into the constant data space. ALIGN re-aligns the constant memory space if necessary.

: ", ( CDATA -> CHARACTER UNSIGNED -- )
  DUP 255 >
  IF -18 THROW
  ELSE DUP C,
     BEGIN DUP
     WHILE OVER @ C, /STRING
     REPEAT
  THEN DROP DROP ;

Since the length of the string has to be stored in a single byte, the maximum string length is 255. Of course, strings of length zero are also allowed. After checking the maximum string length, ", compiles the length byte and then executes a loop to compile the characters one by one. Note that ", expects the string to be stored in the DATA memory area.

Although being an immediate word, SLITERAL is not commonly used at compile time. In order to compile a string literal, " is the preferred word. " parses a string delimited by the character " before executing SLITERAL:

: " ( -- )
  [CHAR] " PARSE POSTPONE SLITERAL ; IMMEDIATE

" has the same semantics as S" in ANS Forth. Since StrongForth does not support counted strings, C" is obsolete, and it is not necessary to distungiush two different words for compiling string listerals. If S" is required for compatibility reasons, you may define it as a synonym for ":

' " ALIAS S" IMMEDIATE

Compiling Other Virtual Code

So far, you've seen how literals are compiled into virtual machine code. But what about compiling normal words? Again, the compilation consists of two tasks:

Both tasks are performed by a word called COMPILE,. This is an ANS Forth word, but in ANS Forth, only the first of the two tasks is performed. Here's the definition of COMPILE, in StrongForth:

: COMPILE, ( DEFINITION -- )
  TRUE DT>DT DUP CAST CONST -> CODE @
  ['CODE] NOOP = IF DROP ELSE CONST, THEN ;

Since the order in which the two tasks are executed is irrelevant, COMPILE, may as well perform the second task first. TRUE DT>DT updates the compiler data type heap and returns the token of the definition. Remember that the token is a pointer to the code field of the definition. If this code field is identical with the code field of NOOP. COMPILE, does not compile anything. It simply drops the token. Otherwise, it compiles the token into the virtual machine code of the current definition.

DT>DT expects the definition on the data stack, plus a flag indicating which data type heap it should update. If FLAG is FALSE, DT>DT updates the interpreter data type heap, otherwise it updates the compiler data type heap. The output parameter of DT>DT is the token of the definition. An exception is thrown if the compiler data type heap is locked:

: DT>DT ( DEFINITION FLAG -- TOKEN )
  DUP IF DTP@ 0= IF -257 THROW THEN THEN 
  OVER DUP PARAMS SWAP #PARAMS
  ROT (CAST) >CODE CAST TOKEN ;

Okay, it seems that most of the semantics of DT>DT is hidden within another mysterious word called (CAST). Here's its stack diagram:

(CAST) ( FAR-ADDRESS -> DATA-TYPE UNSIGNED FLAG -- )

Instead of a definition, (CAST) expects the address of a stack diagram, plus the number of basic data types the stack diagram consists of. This means that (CAST) can be applied to stack diagrams that are not necessarily associated with a definition. As DT>DT demonstrates, the two parameters can easily be calculated from an item of data type DEFINITION.

Both (CAST) and DT>DT update the data type heap by incorporating a given stack effect. What does this mean? When a literal is compiled or interpreted, we just had to add the data type of the literal to the respective data type heap, because the stack diagram of a literal is something like this:

( -- ADDRESS )
( -- SIGNED-DOUBLE )
( -- CCONST -> CHARACTER UNSIGNED ) \ string literal

Literals do not have any input parameters. When compiling or interpreting words without input parameters, DT>DT simply adds the data types of the output parameters to the data type heap. This means, words with no input parameters are compiled or interpreted like literals. Here are a few examples:

HERE ( -- ADDRESS )
BASE ( -- DATA -> UNSIGNED )
DT> ( -- _DATA-TYPE FLAG )

That's easy. If, on the other hand, the word being compiled or interpreted has one or more input parameters, the corresponding data types have to be replaced by the data types of the output parameters. For example, when compiling or interpreting the word 0<, data type SIGNED or SIGNED-DOUBLE is replaced by data type FLAG:

WORDS 0<
0< ( SIGNED-DOUBLE -- FLAG )
0< ( SIGNED -- FLAG )
 OK
+721 .S
SIGNED  OK
0< .S
FLAG  OK

You already know that. But what if the output parameters of the word contain references to the input parameters? Again, you probably know the answer. In this case, the basic or compound data type on the data type heap that corresponds to the referenced input parameter is added to the data type heap:

BASE .S
DATA -> UNSIGNED  OK
@ .S
UNSIGNED  OK

The overloaded version of @ that actually matches in this case is

@ ( DATA -> SINGLE -- 2ND )

By interpreting @, the compound data type on the interpreter data type heap, DATA -> UNSIGNED, is replaced by the output parameter of @, 2ND. 2ND is a reference to the second basic data type of the input parameter list of @, which is SINGLE. And SINGLE corresponds to UNSIGNED on the data type heap.

Finally, let's have a look at two ANS Forth words, that are actually applications of COMPILE,: [COMPILE] and RECURSE.

[COMPILE] has actually the same semantics as COMPILE, but it additionally parses the input source for the name of the word, and looks it up in the dictionary. It is an immediate word that is used during compilation to compile an immediate word, which would otherwise be interpreted. Here's the StrongForth definition:

: [COMPILE] ( -- )
  ?COMPILE PARSE-WORD TRUE MATCH SEARCH-ALL
  IF COMPILE,
  ELSE DROP -13 THROW
  THEN ; IMMEDIATE

Note that SEARCH-ALL uses the additional search criterion MATCH in combination with the parameter TRUE (= -1). As explained in chapter 8, this means that SEARCH-ALL always performs an input parameter match against the compiler data type heap. This behaviour differs from the standard compilation behaviour, because immediate words are treated the same as non-immediate words. During normal compilation, SEARCH-ALL is executed with FALSE (= 0) as the parameter of MATCH, in order to perform an input parameters match against the interpreter data type heap for immediate words, because these words are interpreted.

Another interesting application of COMPILE, is RECURSE, which allows a word to execute itself recursively. Without RECURSE, this wouldn't be possible, because a word cannot be found in the dictionary before its compilation is completed. Since CREATE stored the current definition in the system variable LATEST, it is available to RECURSE.

: RECURSE ( -- )
  ?COMPILE LATEST COMPILE, ; IMMEDIATE

The standard example for RECURSE is the calculation of the faculty of an integer number, although the faculty can be calculated by a simple iteration as well:

: FAK ( SIGNED -- 1ST )
  DUP +1 > IF DUP 1- FAK * ELSE DROP +1 THEN ;
  
  DUP 1 > IF DUP 1- FAK ? undefined word
UNSIGNED UNSIGNED
: FAK ( SIGNED -- 1ST )
  DUP +1 > IF DUP 1- RECURSE * ELSE DROP +1 THEN ;
 OK
+6 FAK .
720  OK
-3 FAK .
1  OK

Dr. Stephan Becher - June 12th, 2007