Chapter 11: Parsing The Input Source

The Input Source

As in ANS Forth, the input source is the source of the character stream that is being parsed. During a Forth session, the input source is usually changed quite frequently. In StrongForth, the following input sources exist:

Since blocks are described separately in chapter 17, they shall not be considered yet. To get access to the current input source, you can simply use the word SOURCE. It returns a string within the DATA memory area that contains the characters from the input source:

:NONAME ( -- CDATA -> CHARACTER UNSIGNED )
  SOURCE-ID
  IF SOURCE-SPEC SPLIT
     ( SINGLE SINGLE -- CDATA -> CHARACTER UNSIGNED )CAST
  ELSE TIB #TIB @
  THEN ; IS SOURCE

SOURCE is a deferred definition, because it has to be replaced later with a version that allows blocks and files as input sources. SOURCE-ID is actually a file handle, but as long as the File-Access word set is not loaded, it can only assume the binary equivalents of the signed numbers 0 and -1. -1 indicates that the current input source is a character string, and 0 indicates that it is the user input device. STRING-ID is defined as a constant with value -1 and data type FILE.

Let's first assume the input source is a string. In this case, SOURCE returns the address and the length of the string. Both have previously been merged and stored by EVALUATE into the double-cell value SOURCE-SPEC. If, on the other hand, SOURCE-ID is 0, SOURCE returns the address of the terminal input buffer plus the number of characters that have been entered. The terminal input buffer is a simple character array in the DATA memory area. Its size determines the maximum number of characters you may type in one line. TIB is the constant address of the terminal input buffer, while #TIB is a variable that contains the number of characters in the terminal input buffer. Here are the definitions of all those values, variables and constants:

NULL FILE VALUE SOURCE-ID
NULL DOUBLE VALUE SOURCE-SPEC
-1 CAST FILE CONSTANT STRING-ID
DATA-SPACE HERE CAST CDATA -> CHARACTER CONSTANT TIB
80 CHARS ALLOT
0 VARIABLE #TIB

Just like ANS Forth, StrongForth uses an index variable to track the current position within the input source during parsing:

0 VARIABLE >IN

>IN starts with 0 and is incremented for each character that is being parsed. But what if >IN exceeds the string length of the input source? If the input source is a string or a block, parsing simply stops. If the current input source is the user input device, it is possible to refill the input source by requesting another line from the user. This is what the ANS Forth word REFILL does. It returns TRUE if the refilling succeeded, otherwise it returns FALSE. Just like SOURCE, REFILL is a deferred definition, because the introduction of blocks will require some changes:

:NONAME ( -- FLAG )
  SOURCE-ID 0= DUP
  IF TIB 80 ACCEPT #TIB ! 0 >IN !
  THEN ; IS REFILL

The ANS Forth words SAVE-INPUT and RESTORE-INPUT cause a problem for StrongForth. SAVE-INPUT returns a variable number of items, depending on the requirements of the current input source. RESTORE-INPUT consumes these items and returns a flag that indicates whether the operation succeeded or failed:

SAVE-INPUT ( -- xn ... x1 n ) \ ANS Forth
RESTORE-INPUT ( xn ... x1 n -- flag )

Stack diagrams like these are not allowed in StrongForth, because the type system requires that stack diagrams are deterministic. The data types of all input and output parameters must be known at compile time. The solution to this problem is quite tricky. StrongForth provides a special data type called TUPLE, which represents a structure consisting of a variable number of cells plus a count as one item. Before continuing with SAVE-INPUT and RESTORE-INPUT, let's have a closer look at this new data type.

Excursion: Tuples

Data types SINGLE and DOUBLE and their direct and indirect subtypes represent data items with a fixed size of one or two cells, respectively. In contrast to this concept, data type TUPLE represents data structures of variable sizes. The size of the structure is not known at compile time. However, the size can easily be determined at runtime, because it is included in the structure itself, just as in the parameters of the ANS Forth words SAVE-INPUT and RESTORE-INPUT.

NULL DATA-TYPE PROCREATES TUPLE 0 CONST, \ invalid size

Data type TUPLE has no parent. Applying SIZE to it yields zero in order to indicate an invalid value:

DT TUPLE SIZE .
0  OK
Only very few words are available for processing items of data type TUPLE:

NEW-TUPLE ( -- TUPLE )
SIZE ( TUPLE -- 1ST UNSIGNED )
>T ( TUPLE -> SINGLE 2ND -- 1ST )
>T ( TUPLE -> DOUBLE 2ND -- 1ST )
T> ( TUPLE -> SINGLE -- 1ST 2ND )
T> ( TUPLE -> DOUBLE -- 1ST 2ND )
DROP ( TUPLE -- )

NEW-TUPLE creates an empty tuple, i. e., a tuple with size zero. SIZE returns the size of a tuple in cells. Please don't confuse this version of SIZE with the one that expects an item of data type DATA-TYPE as an input parameter. Single-cell and double-cell items can be added to a tuple with the two versions of >T. T> extracts and removes those items in the order in which they have been added, until the tuple is empty. An attempt to extract an item from an empty tuple will cause an exception to be thrown. Finally, an overloaded version of DROP removes an empty or non-empty tuple from the stack. Note that SIZE, >T and T> do not remove the tuple itself from the stack.

In order to add or extract items from a tuple, the data type of the items have to be specified. This is done in the same way as for addresses by constructing compound data types. For example, items of data type SIGNED can only be added to tuples of data type TUPLE -> SIGNED, and if you extract an item from a tuple of data type TUPLE -> SIGNED, you'll get an item with data type SIGNED. If you want to store items of different data types in one tuple, you have to use type casts between accessing the different items. But now let's view an example about how these words are applied:

NEW-TUPLE -> SIGNED .S
TUPLE -> SIGNED  OK
SIZE .
0  OK
+23 >T -17 >T .S
TUPLE -> SIGNED  OK
SIZE .
2  OK
T> .S .
TUPLE -> SIGNED SIGNED -17  OK
SIZE .
1  OK
CAST TUPLE -> CHARACTER CHAR A >T SIZE .
2  OK
CAST TUPLE -> UNSIGNED-DOUBLE 1000000. >T SIZE .
4  OK
T> . CAST TUPLE -> CHARACTER T> .
1000000 A OK
SIZE .
1  OK
DROP .S
 OK

Naturally, the usage of tuples is bound to some restrictions. With the small number of words dealing with tuples, it is not possible do any stack juggling, although it would actually be possible to define words like SWAP ( SINGLE TUPLE -- 2ND 1ST ). But that's not necessary as long as tuples are only used in a limited set of applications, like saving and restoring the input source specification. Tuples cannot be stored in memory, and there's no version of . for tuples. There are no tuple variables and tuple constants, and no means to push and pop tuples from and to the return stack, respectively. And so on. But all these missing words can be defined if desired. One thing you can definitely not do with tuples is determining the size of their memory image on the stack at compile time. The size of a tuple can only be determined at runtime. One of the consequences of this restriction is that words whose stack diagram contain tuples may not be passed to CATCH.

Although there's no overloaded version of . for tuples, it's not difficult to view a tuple's memory image on the stack:

NEW-TUPLE -> UNSIGNED
 OK
6 >T 7 >T 8 >T SIZE .
3  OK
SP@ -> UNSIGNED 4 DUMP
0088: 0003 0008 0007 0006
 OK

The value 3, which is the size of the tuple, is on top of the data stack. Next comes 8, the value that added most recently, and then 7 and 6. This is exactly the structure produced by SAVE-INPUT according to the ANS Forth specification. With tuples being available, the implementation of SAVE-INPUT and RESTORE-INPUT is straight-forward.

SAVE-INPUT And RESTORE-INPUT

First of all, we define a new data type called INPUT-SOURCE, which is a subtype of data type TUPLE:

DT TUPLE PROCREATES INPUT-SOURCE

Since INPUT-SOURCE is a data type that is only used by SAVE-INPUT and RESTORE-INPUT, these two words always have to be used in pairs. StrongForth's type system efficiently prevents other words from messing around with the input source specification. The only way to get around this restriction is by using type casts. You've already seen this technique being used for pictured numeric output, where the unique data type NUMBER-DOUBLE enforces the correct syntax of <# ... # ... #S ... #>. Defining dedicated data types is a common programming technique in StrongForth, whenever syntactic rules must be obeyed.

The contents of a tuple of data type INPUT-SOURCE depends on the input source. At this point, we're just considering either the user input device or a string as input source. The input source specification of the user input device only consists of the value of >IN, because the terminal input buffer is always the same. If the input source is a string, we have to save the source specification of the string as well in order to allow RESTORE-INPUT to find out whether the input source was changed. Remember that EVALUATE merges the start address and the length of the string being evaluated into SOURCE-SPEC.

User input device:
     >IN     

String:

 SOURCE-SPEC 

>IN

SAVE-INPUT creates a tuple with size 1 if the input source is the user input device, and a tuple of size 3 if the input source is a string. RESTORE-INPUT can thus easily determine the input source. If the size of the tuple is neither 1 nor 3, RESTORE-INPUT returns TRUE as FLAG to indicate that the input source specification cannot be restored. A true flag is also returned if the current value of SOURCE-ID does not fit to the input source indicated by the size of the tuple, or if a string serving as the input source was changed between SAVE-INPUT and RESTORE-INPUT:

: SAVE-INPUT ( -- INPUT-SOURCE )
  NEW-TUPLE SOURCE-ID
  IF -> DOUBLE SOURCE-SPEC >T CAST TUPLE
  THEN -> UNSIGNED >IN @ >T CAST INPUT-SOURCE ;

: RESTORE-INPUT ( INPUT-SOURCE -- FLAG )
  CAST TUPLE -> UNSIGNED SIZE
  CASE 1 OF T> >IN ! DROP SOURCE-ID 0<> ENDOF
       3 OF T> >IN ! CAST TUPLE -> DOUBLE T> SOURCE-SPEC <> 
            SOURCE-ID 0= OR >R DROP R> ENDOF
       >R DROP TRUE R> 
  ENDCASE ;

Note that the return stack has to be used twice as temporary storage within RESTORE-INPUT. This is necessary because a tuple can only be dropped if it is on top of the stack. Words like NIP and SWAP are not available for tuples.

In chapter 17, where blocks are being introduced, SAVE-INPUT and RESTORE-INPUT are extended in order to consider blocks as input source as well.

Parsing

The input source is where the characters come from. But before StrongForth can start interpreting or compiling, the long chain of characters has to be cut into suitable pieces, for example into words delimited by spaces. The actual cutting is performed by a word called ENCLOSE. ENCLOSE is not an ANS Forth word, but it provides most of the low-level semantics for the ANS Forth word PARSE:

ENCLOSE ( CHARACTER CDATA -> 1ST UNSIGNED 4 TH -- 2ND 4 TH 4 TH 4 TH )

So, what does ENCLOSE do with all these input and output parameters? The most important input parameters is CDATA -> 1ST, which is the starting address of the character string to be parsed. This address is returned unchanged as the first output parameter 2ND. CHARACTER is the delimiter, i. e. the character which indicates the position where to cut the string.

All other parameters are indexes. UNSIGNED contains the position within the string where ENCLOSE shall start parsing. It is returned unchanged as the second output parameter. The last input parameter, 4 TH, is the length of the character string. The last two output parameters deliver the index positions of the character after the last non-delimiter character in sequence, and the first character not included in parsing. What does this mean? Let's first view an example in which the character string contains at least one delimiter between the starting position and the end of the string:

PAD 20 ACCEPT .
This is a string.
17  OK
CHAR i PAD 8 17 ENCLOSE . . . PAD = .
14 13 8 TRUE  OK

At the beginning of this example, PAD is filled with the string This is a string., which is 17 characters long. Next, ENCLOSE searches for the character i within the string, starting at position 8. Therefore, the two characters i in the first two words are being skipped.The first delimiter character i is found at position 13, and the first character not included in parsing is the character n at position 14.

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

T

h

i

s

 

i

s

 

a

 

s

t

r

i

n

g

.

With these index values, it is easy to calculate the length of the ENCLOSEd string: 13 - 8 = 5. If parsing the string shall be continued, the next ENCLOSE starts at 14, at the first character that was not parsed by the previous ENCLOSE.

To see what happens if the string does not contain any delimiter, let's parse the string from the first example once again:

PAD 17 TYPE
This is a string. OK
CHAR x PAD 8 17 ENCLOSE . . . PAD = .
17 17 8 TRUE  OK

Because the string does not contain an x, ENCLOSE parses until the end of the string. The last non-delimiter character in sequence is the period at position 16, so the next index position after this character is 17. 17 is also the first position not included in parsing. You can repeat the calculation from the first example: 17 - 8 = 9, which is indeed the length of the string a string..

Using ENCLOSE, parsing a string becomes pretty simple. Here's the definition of PARSE:

: PARSE ( CHARACTER -- CDATA -> CHARACTER UNSIGNED )
  SOURCE >IN @ SWAP ENCLOSE >IN ! OVER - ROT ROT + SWAP ;

PARSE expects the delimiter CHARACTER on the stack and forwards it together with the input source and the value of >IN to ENCLOSE. >IN contains the current parsing position within the input source. After ENCLOSE returns, PARSE updates >IN with the the first character that was not parsed by ENCLOSE, so parsing can continue seamlessly when PARSE is executed next time. From the remaining output parameters of ENCLOSE, PARSE calculates the starting address and the length of the parsed string.

A simple application of PARSE is \. The backslash is StrongForth's only means to provide comments within the source code, because parentheses are already used up for stack diagrams. In ANS Forth, parentheses enclose small comments within a source line, while comments that extend to the end of the line start with the backslash:

... ( ANS Forth comment ) ...
... \ another ANS Forth comment
...

In this example, ... indicates executable source code. To compensate for the missing parenthesis, StrongForth allows using another backslash to terminate the comment, just like the right parenthesis in ANS Forth:

... \ StrongForth comment \ ...
... \ another StrongForth comment
...

The definition of \ is quite simple. It just parses until the next occurrence of a backslash and then discards the parsed string. Because \ works in both interpretation and compilation state, it has to be an immediate word:

: \ ( -- )
  [CHAR] \ PARSE DROP DROP ; IMMEDIATE

In order to parse the input source for words delimited by spaces, StrongForth provides a variant of PARSE, which is called PARSE-WORD. It is suggested by the ANS Forth specification, although it is not a part of the standard. PARSE-WORD has no input parameter, because it always assumes space characters as delimiters. In addition to PARSE, it skips leading spaces and checks the length of the parsed word, which must not be greater than 31, the maximum length of a word in the dictionary. The definition of PARSE-WORD looks quite similar to the definition of PARSE:

: PARSE-WORD ( -- CDATA -> CHARACTER UNSIGNED )
  SOURCE >IN @ SWAP ENCLOSE-WORD >IN ! OVER - ROT ROT + SWAP
  DUP 31 > IF -19 THROW THEN ;

PARSE-WORD uses ENCLOSE-WORD, which is in turn similar to ENCLOSE:

ENCLOSE-WORD ( CDATA -> CHARACTER UNSIGNED 3RD -- 1ST 3RD 3RD 3RD )

Since ENCLOSE-WORD always assumes space characters as delimiters, it does not expect the delimiter character as input parameter. Other than ENCLOSE, the value of the second output parameter is not always identical to the value of the input parameter UNSIGNED. This is because ENCLOSE-WORD skips leading spaces. The second output parameter is actually the index position of the first non-space character after the position indicated by UNSIGNED. However, for both ENCLOSE and ENCLOSE-WORD the second output parameter contains the starting position of the enclosed string.

A very simple application of PARSE-WORD is CHAR. CHAR parses the input source for a word delimited by spaces and returns the first character of the word. Any additional characters are discarded. If the parsed word is empty, which happens at the end of the input source, CHAR simply returns a space character. Here's the definition of CHAR:

: CHAR ( -- CHARACTER )
  PARSE-WORD IF @ ELSE DROP BL THEN ;

Parsing Numbers

StrongForth does not yet include the ANS Forth Floating-Point word set. But it is able to clearly distinguish between unsigned and signed, single-precision and double-precision numbers when parsing the input source. Generally, the input format of a number consists of a sequence of digits, which is optionally preceded by a sign character (either + or -) and is optionally succeeded by a decimal point:

<integer number> := [<sign>]<digits>[.]
<sign>           := { + | - }
<digits>         := <digit><digit>*
<digit>          := { 0 ... 9 } | { A ... Z }

The range of digits is limited by the current number-conversion radix BASE. For example, if BASE is 16, valid digits are 0 to 9 and A to F. G to Z are invalid digits for a hexadecimal number. If BASE is 8, only digits 0 to 7 are allowed. The optional sign and the optional decimal point determine the data type of the number:

Sign Decimal Point Example Data Type
no no 7144 UNSIGNED
yes no +812 SIGNED
no yes 3000511. UNSIGNED-DOUBLE
yes yes -713306492. SIGNED-DOUBLE

The StrongForth word NUMBER converts a sequence of characters from a string into a number:

NUMBER ( CDATA -> CHARACTER UNSIGNED -- INTEGER-DOUBLE DATA-TYPE )

The result of the conversion is delivered as an item of data type INTEGER-DOUBLE, because this data type is capable of holding all kinds of single-precision and double-precision numbers. But remember that stack diagrams in StrongForth are always static. Therefore, it is not possible for NUMBER to return different data types depending on the contents of the string. Only in Forth systems without strong static typing, something like

NUMBER (  c-addr u1 – n | u2 | d | ud )

is actually allowed. Of course, StrongForth provides an acceptable solution for this problem. NUMBER returns the recommended data type of the conversion result explicitly as an item of data type DATA-TYPE. Here are some examples:

DECIMAL
 OK
PARSE-WORD 7144 NUMBER . .
UNSIGNED 7144  OK
PARSE-WORD +812 NUMBER . .
SIGNED 812  OK
PARSE-WORD 3000511. NUMBER . .
UNSIGNED-DOUBLE 3000511  OK
PARSE-WORD -713306492. NUMBER . .
SIGNED-DOUBLE 3581660804  OK
HEX
 OK
PARSE-WORD 6DF4 NUMBER . .
UNSIGNED 6DF4  OK
PARSE-WORD 28H5 NUMBER . .
28  OK

In the last example, NUMBER does not return a valid data type. In fact, the value of the data type is zero, because the string does not contain a valid hexadecimal representation of a single-precision or double-precision number.

Before continuing with the actual definition of NUMBER, we need to have a look at a few words that support NUMBER in parsing digits, sequences of digits and sign characters. The first one is DIGIT?, which is defined as follows:

: DIGIT? ( CHARACTER -- UNSIGNED FLAG )
  [CHAR] 0 - CAST UNSIGNED
  DUP [ CHAR A CHAR 0 - CAST UNSIGNED ] LITERAL <
  IF DUP 9 > IF DROP BASE @ THEN
  ELSE [ CHAR A CHAR 0 10 + - CAST UNSIGNED ] LITERAL -
  THEN DUP BASE @ < ;

DIGIT? expects a character as input parameter, which might be a valid digit or not, given the current number-conversion radix BASE. If the character is a valid digit, DIGIT? returns its numerical value UNSIGNED and a TRUE flag. If is it not, UNSIGNED is undefined and FLAG is FALSE:

CHAR 6 DIGIT? . .
TRUE 6  OK
CHAR C DIGIT? . .
FALSE 12  OK
CHAR C HEX DIGIT? DECIMAL . .
TRUE 12  OK

DIGIT? divides the ASCII character set into 5 segments:

invalid 0 to 9 invalid A to Z invalid

By subtracting [CHAR] 0 and casting the result to data type UNSIGNED, characters 0 to 9 are being converted to their respective numeric values. Simultaneously, the first and the last invalid segments are merged. Next, the invalid segment between characters 9 and A is excluded. By replacing the numerical value with the contents of BASE, characters belonging to this segment will be recognized as being invalid digits at the end. Finally, after adjusting the numerical value of all characters with ASCII values above or equal to character A, the result is compared to the number-conversion radix.

DIGIT? converts only one character to its numerical value. >NUMBER is an ANS Forth word that converts a whole sequence of digits into it's numerical value. It expects on the data stack a double-precision number INTEGER-DOUBLE and the character string CDATA -> CHARACTER UNSIGNED containing the digits, and returns these parameters in the same order. The converted digits are accumulated into INTEGER-DOUBLE by multiplying INTEGER-DOUBLE by the contents of BASE and then adding the numerical value of the digit.

: >NUMBER ( INTEGER-DOUBLE CDATA -> CHARACTER UNSIGNED -- 1ST 2ND 4 TH )
  ROT LOCALS| D |
  BEGIN DUP
  WHILE OVER @ DIGIT?
  WHILE D BASE @ * SWAP + TO D 1 /STRING
  REPEAT DROP
  THEN D ROT ROT ;

The implementation of >NUMBER is straight forward. In order to reduce the number of stack movements, the double-precision number is kept in a local variable. The unusual structure of the loop, which is built with the words BEGIN ... WHILE ... WHILE ... REPEAT ... THEN, is fully compliant to the ANS Forth standard. If the condition of the first WHILE is not met, a branch to the code after THEN is executed, while the second WHILE exits the loop with a branch to the code following REPEAT. DROP removes the (unused) first output parameter of DIGIT?. Note that >NUMBER uses mixed-mode operations, that are overloaded versions of * and +.

Finally, NUMBER takes advantage of a word that recognizes sign characters. >SIGN converts an item of data type CHARACTER into a signed number. >SIGN simply returns +1 if CHARACTER is +, -1 if CHARACTER is -, and 0 for all other characters. The StrongForth version of CASE ... OF .... ENDOF ... ENDCASE works exactly as in ANS Forth.

: >SIGN ( CHARACTER -- SIGNED )
  CASE [CHAR] + OF +1 ENDOF
       [CHAR] - OF -1 ENDOF
  +0 SWAP ENDCASE ;

Now it's time to study the definition of NUMBER itself. It looks quite complex, but it is not too difficult to understand. First, a double-precision number of data type INTEGER-DOUBLE is created and rotated down to the third stack position. The resulting parameter configuration matches the input parameter list of >NUMBER. Next, local S of data type SIGNED is defined. S keeps the information about the optional sign character, which is evaluated in the following IF clause.

: NUMBER ( CDATA -> CHARACTER UNSIGNED --
     INTEGER-DOUBLE DATA-TYPE )
  NULL INTEGER-DOUBLE ROT ROT +0 LOCALS| S | DUP
  IF OVER @ >SIGN DUP TO S
     IF /STRING
     THEN
  THEN DUP
  IF OVER @ DIGIT? NIP
     IF >NUMBER FALSE
     ELSE TRUE
     THEN
  ELSE TRUE
  THEN 
  IF DROP DROP NULL DATA-TYPE EXIT 
  THEN DUP
  IF " ." COMPARE
     IF NULL DATA-TYPE EXIT
     THEN S
     IF [DT] SIGNED-DOUBLE
     ELSE [DT] UNSIGNED-DOUBLE
     THEN
  ELSE DROP DROP S
     IF [DT] SIGNED
     ELSE [DT] UNSIGNED
     THEN
  THEN S 0<
  IF SWAP NEGATE SWAP
  THEN ;

Of course, a number should contain at least one digit. If the string is empty, or contains just a sign character, or the first character is not a valid numerical digit, NUMBER just returns zero and a null data type. Otherwise, it uses >NUMBER to parse the sequence of digits and convert it into a number.

If the string does not contain any characters after the last digit, NUMBER has successfully parsed a single-precision number of data type SIGNED or UNSIGNED, depending on the presence of a leading sign character. Otherwise, the only character that may follow the last digit is a decimal point, which indicates a double-precision number of data type SIGNED-DOUBLE or UNSIGNED-DOUBLE. Any other character or a sequence of more than one character is invalid.

At the end of the definition of NUMBER, S is queried again to find out whether the number is negative. If it is, the numerical value needs to be negated. We're done.

It is certainly worth to consider shortening the definition of NUMBER by doing some more factoring. For example, some of the first-level IF clauses are candidates for being factored out into separate words. However, these words would all have very specialized semantics which are unlikely to be usable for other purposes. After all, it's just a matter of programming style. The definition of NUMBER doesn't look like good Forth style, but it works correctly and it doesn't contain spaghetti code or programming tricks.

There are actually two unsigned numbers that are being used much more often that any other number: 0 and 1. Since all numbers are compiled as literals with the tokens LIT or DLIT followed by one or two cells containing the numeric value, it is desirable to have special definitions for 0 and 1, which compile as only one cell of virtual code. Therefore, the following two definitions are included in the StrongForth dictionary:

0 CONSTANT 0
1 CONSTANT 1

Note that the compiler recognizes these two constants only if they are contained in the input source exactly as 0 or 1. 01, for example, will be compiled as the numerical literal 1 with two cells of virtual code, and not as the constant 1 with only one cell of virtual code. Of course, the runtime semantics is exactly the same.

The Interpreter

Now we're getting into the very heart of StrongForth. Parsing the input source, looking up words in the dictionary and interpreting or compiling these words are actually the core functionalities of each Forth system. StrongForth's type system doesn't change this statement, but by considering data types, it adds useful features like type checking and operator overloading.

All of these tasks are performed by a single word: INTERPRET. Since Forth doesn't make a big difference between interpreting and compiling, INTERPRET serves as both the interpreter and the compiler. Whether a word is to be interpreted or to be compiled depends on whether the system is in interpretation or compilation state. Immediate words are always interpreted, as long as they are not forced to be compiled by [COMPILE] or POSTPONE.

So here's the definition of INTERPRET. Remember that INTERPRET is a deferred definition:

:NONAME ( -- )
  BEGIN PARSE-WORD DUP
  WHILE OVER OVER SERACH-LOCAL DUP
     IF ROT DROP ROT DROP ABS LOCAL,
     ELSE DROP DROP OVER OVER FALSE MATCH SEARCH-ALL DUP
        IF ROT DROP ROT DROP 0< STATE @ AND
           IF COMPILE,
           ELSE FALSE DT>DT (EXECUTE)
           THEN
        ELSE DROP DROP NUMBER DUP 0=
           IF DROP DROP -13 THROW
           ELSE DUP >DT SIZE 1- STATE @
              IF IF LITERAL,
                 ELSE D>S LITERAL,
                 THEN
              ELSE
                 IF ( DOUBLE -- )CAST
                 ELSE D>S ( SINGLE -- )CAST
                 THEN
              THEN
           THEN
        THEN
     THEN
  REPEAT DROP DROP ; IS INTERPRET

INTERPRET is a loop that parses the input source as long as there are words to parse. For example, if the input source is the user input device, INTERPRET processes one complete line of text.

By passing the character string returned by PARSE-WORD to SEARCH-LOCAL, INTERPRET first checks whether the word is a local. Locals only exist in compilation state. If SEARCH-LOCAL finds a match in the local dictionary, the local is directly compiled by LOCAL,. SEARCH-LOCAL, LOCAL, and the local dictionary will be described in chapter 12.

If SEARCH-LOCAL does not succeed, SEARCH-ALL gets a chance to search the complete dictionary. With MATCH and its parameter FALSE (= 0) as the additional search criterion, SEARCH-ALL performs an input parameter match on the interpreter or compiler data type heap, depending on whether the word is to be interpreted or compiled. Let's assume SEARCH-ALL really finds a suitable word. If the system is in compilation state and it is not an immediate word, the word is being compiled by COMPILE,. Otherwise, i. e., either the system is in interpretation state or the word is immediate, the word is to be interpreted. In StrongForth, interpreting a word means applying its stack effect to the interpreter data type heap and then executing the word. The interpreter data type heap is updated by FALSE DT>DT. (EXECUTE) calls the inner interpreter in order to execute the word. It expects the word's execution token on the data stack, which is delivered by DT>DT:

(EXECUTE) ( TOKEN -- )

Note that (EXECUTE) is a low-level word, that should be used with care. Since (EXECUTE) does not consider the stack effects of the words it executes, it can easily corrupt StrongForth's type system. Its usage in INTERPRET is correct, because DT>DT applies a word's stack effect to the interpreter data type heap before the word is executed.

Finally, let's see what happens if neither SEARCH-LOCAL nor SEARCH-ALL can find the word whose name was parsed in the input source. Then the last hope is on NUMBER. Only if NUMBER fails, does INTERPRET throw an exception. But let's assume that NUMBER has accepted the word. Then it returns the data type of the number, which is one of the following:

Depending on the contents of STATE, the number is either interpreted or compiled. In both cases, >DT pushes the data type to the corresponding data type heap. LITERAL, compiles the number as virtual machine code, whereas ( SINGLE -- )CAST and ( DOUBLE -- )CAST interpret it. If the data type of the number is either SIGNED or UNSIGNED, the value delivered by NUMBER needs to be converted from a double-precision number to a single-precision number. Note that LITERAL, is overloaded to provide versions for single-precision and double-precision literals.

)CAST is actually a very interesting immediate word. It compiles nothing! At least, it has no execution semantics. All it does is applying a given stack diagram to the compiler or interpreter data type heap, just as if a word with this stack diagram has been compiled. During the compilation of INTERPRET, it simply removes a data type (SINGLE or DOUBLE) from the compiler data type heap. But why does removing a data type from the compiler data type heap actually interpret a number? In StrongForth, interpreting a number requires two things to be done:

  1. Push the data type of the number to the interpreter data type heap.
  2. Push the numerical value of the number to the data stack.

Both tasks have already been completed before )CAST is executed. >DT has taken care of the interpreter data type heap, and the number, either single-precision or double-precision, is already on the data stack. So, all that has to be done is a little clean-up during the compilation of INTERPRET. )CAST simply removes data types INTEGER-DOUBLE or SINGLE from INTERPRET's compiler data type heap. The number is now invisible for INTERPRET. However, an important precondition for this to work correctly is that INTERPRET does not have anything left on the data stack below the number. Otherwise, next time INTERPRET tried to access these data, it would find the number instead. Needless to say, )CAST should be used very carefully, because it might easily corrupt StrongForth's type system.

: )CAST ( MEMORY-SPACE FLAG STACK-DIAGRAM -- )
  <DIAGRAM DUP OFFSET FAR-HERE -> DATA-TYPE OVER - SWAP
  STATE @ (CAST) DIAGRAM> ; IMMEDIATE

)CAST marks the end of a stack diagram. The length of the stack diagram is calculated by OFFSET. Since the stack diagram is stored at the end of the local data space, its start address can be calculated by subtracting its length from the local data space pointer. The application of the stack diagram on the interpreter or compiler data type heap is then performed by (CAST).

Note that )CAST and CAST are different in many ways. It's not only the syntax that distinguishes these two words. Because CAST always converts the data type of only one item, it is save in the sense that it cannot corrupt StrongForth's data type system. Whenever a single-cell item is converted into a double-cell item or vice versa, it compiles appropriate conversion words in order to keep the data stack and the data type heap aligned. In contrast to this behaviour, )CAST is potentially unsave. It can change the contents of the data type heap arbitrarily, without taking regard of the data stack. Its use should be limited to rare occasions.

Evaluating and Postponing

StrongForth implements the ANS Forth word EVALUATE as an application of INTERPRET. The ANS Forth glossary on EVALUATE says:

    ( i*x c-addr u -- j*x )

    Save the current input source specification. Store minus-one (-1) in SOURCE-ID if it is present. Make the string described by c-addr and u both the input source and input buffer, set >IN to zero, and interpret. When the parse area is empty, restore the prior input source specification. Other stack effects are due to the words EVALUATEd.

And that's exactly what EVALUATE does in StrongForth:

: EVALUATE ( CDATA -> CHARACTER UNSIGNED -- )
  SOURCE-SPEC SOURCE-ID >IN @ LOCALS| I S X |
  0 >IN ! STRING-ID TO SOURCE-ID MERGE TO SOURCE-SPEC
  INTERPRET
  I >IN ! S TO SOURCE-ID X TO SOURCE-SPEC ?REFILL ;

The input source specification, consisting of SOURCE-SPEC, SOURCE-ID and >ID, is stored in locals. These values cannot be kept on the data stack, because the data stack is usually affected by the evaluated words. While evaluating, SOURCE-SPEC contains the merged address and the length of the string. In order to allow for EVALUATE to be used recursively, i. e., using EVALUATE in turn within an evaluated string, SOURCE-SPEC needs to be saved as part of the input source specification. ?REFILL is a dummy word with no semantics so far. It can be used to perform additional tasks that are required to restore the input source specification. Since it's a deferred definition, it's semantics can easily be changed:

' NOOP IS ?REFILL

This version of EVALUATE works with strings that are located in the DATA memory area; because that's where SOURCE expects the input sources to be. But in many cases, it is necessary to evaluate strings located in the CONST memory area. For examples, strings compiled into words are always kept in the constant data space. The phrase

... " ..." EVALUATE ...

can only be compiled if an overloaded version of EVALUATE for strings in the CONST memory area exists.

: EVALUATE ( CCONST -> CHARACTER UNSIGNED -- )
  SPACE@ SWAP LOCAL-SPACE HERE CAST CDATA -> CHARACTER
  OVER CHARS ALLOT ALIGN LOCALS| ADDR COUNT | SPACE!
  ADDR COUNT MOVE ADDR COUNT EVALUATE
  SPACE@ LOCAL-SPACE ADDR CAST ADDRESS HERE - ALLOT SPACE! ;

You might be surprised about the complexity of this definition. But cince SOURCE does not support input sources in the CONST memory area, the string has to be copied to the DATA memory area before it can be evaluated. EVALUATE allocates space for a copy of the string in the local data space, copies the string from the CONST memory area to this location, evaluates it and then deallocates the allocated memory. Again, locals are used to clean up the data stack before actually evaluating the string.

Using the local data space for storing a copy of the evaluated string has the advantage that EVALUATE can be used recursively, because strings on different levels of evaluation do not interfere with each other. But this technique also has a drawback. The evaluated string must not itself change the local data space. Anything like

" ... LOCAL-SPACE ... ALLOT ..." EVALUATE \ Don't do that!

is strictly forbidden, unless the sum of allocated and deallocated cells is zero:

" ... LOCAL-SPACE ... 8 ALLOT ... -8 ALLOT ... " EVALUATE \ Okay

One of the most prominent applications of EVALUATE is the ANS Forth word POSTPONE. According to the ANS Forth specification, it compiles the compilation semantics of a word to the current definition. What does this mean? The compilation semantics of an immediate word is to execute the word immediately. POSTPONE compiles this compilation semantics by adding the token of the immediate word into the current definition using COMPILE,. The immediate word is then executed at runtime instead of at compile time. It's execution is POSTPONEd.

The compilation semantics of a non-immediate word is to compile the token of the word into the current definition. Therefore, POSTPONE compiles code that compiles the word at runtime. It actually compiles the name of the word as a string and then compiles the token of EVALUATE. Here's the definition of POSTPONE:

: POSTPONE ( -- )
  ?COMPILE PARSE-WORD OVER OVER TRUE MATCH SEARCH-ALL +1 =
  IF COMPILE, DROP DROP
  ELSE DROP [COMPILE] SLITERAL " EVALUATE" EVALUATE
  THEN ; IMMEDIATE

But why does POSTPONE use EVALUATE instead of just calculating the execution token of the word and then compiling something like

LIT <token> CONST,

into the current definition? This is what an ANS Forth system would do. In StrongForth, the integrity of the type system requires that a word is compiled within the context of the current definition. When POSTPONE is executed, it only knows the context of the word that compiles the postponed word, but not the context of the postponed word itself. Selecting the correct token is not possible if the word is overloaded. Therefore, it is not sufficient to postpone the creation of virtual code for the word. Searching the word in the dictionary and updating the data type heap has to be postponed as well. EVALUATE does exactly this. For POSTPONE, it does not even matter whether a postponed non-immediate word is actually found in the dictionary or not, because the context is invalid anyway. If SEARCH-ALL either finds a non-immediate word or just fails, it simply compiles the name of the word as a string, plus EVALUATE to have the word compiled at runtime.

Would you have guessed that POSTPONE itself is a typical application of POSTPONE? If not, have a look at the following alternative definition of POSTPONE:

: POSTPONE ( -- )
  ?COMPILE PARSE-WORD OVER OVER TRUE MATCH SEARCH-ALL +1 =
  IF COMPILE, DROP DROP
  ELSE DROP POSTPONE SLITERAL POSTPONE EVALUATE
  THEN ; IMMEDIATE

This produces exactly the same virtual code as the first definition of POSTPONE. But let's have a look at another example:

: SLITERAL ( CDATA -> CHARACTER UNSIGNED -- )
  ?COMPILE SPACE@ CONST-SPACE ROT ROT
  POSTPONE SLIT ", ALIGN SPACE! ; IMMEDIATE
 OK
SEE SLITERAL
: SLITERAL ( CDATA -> CHARACTER UNSIGNED -- )
  ?COMPILE SPACE@ CONST-SPACE ROT ROT " SLIT" EVALUATE ", ALIGN SPACE! ; IMMEDIATE  OK

During compilation of SLITERAL, the context SLIT will be compiled into is not yet known. However, when SLITERAL is finally being executed, the context is known, and EVALUATE is able to compile SLIT into this context.


Dr. Stephan Becher - May 16th, 2007