Chapter 6: Character Strings

What's A String?

A string is a sequence of characters with a well-defined length. The characters are stored in consecutive character-size memory locations. According to the ANS Forth specification, a character string is specified by a cell pair ( c-addr u ) representing its starting address and its length in characters. Since StrongForth can store strings in different memory areas, the representation of a string can be one of the following:

CDATA -> CHARACTER UNSIGNED
CCONST -> CHARACTER UNSIGNED
CFAR-ADDRESS -> CHARACTER UNSIGNED

The CODE memory area is usually not used for storing strings. Among the predefined storage locations for strings in the DATA memory area is the scratchpad PAD:

PAD .S DROP
CDATA -> CHARACTER  OK

PAD is defined as follows:

DATA-SPACE HERE CAST CDATA -> CHARACTER CONSTANT PAD
84 CHARS ALLOT

But ANS Forth actually specifies a second kind of string storage. A so-called counted string is a sequence of characters in memory, which is preceded by a length character. The length character is a character size memory location that contains the length of the string as an unsigned number. Here's an example of the memory image of a counted string:

| 11 | s | t | r | o | n | g | F | o | r | t | h |

A counted string in memory is identified by the address of its length character. ANS Forth even specifies a word that converts a counted string into the cell pair ( c-addr u ) representation of a character string:

: COUNT ( c-addr1 -- c-addr2 u ) DUP CHAR+ SWAP C@ ;
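For readers who want to experiment with the layout outside of Forth, the following Python sketch models a counted string and the effect of COUNT. It is only an illustration under stated assumptions: memory is modelled as a bytearray, an address is an index into it, and the names memory, store_counted and count are hypothetical helpers that are not part of ANS Forth or StrongForth.

```python
# Model memory as a bytearray; an "address" is simply an index into it.
memory = bytearray(32)

def store_counted(addr, text):
    """Store text as a counted string: length character first, then the characters."""
    data = text.encode("ascii")
    memory[addr] = len(data)  # length character (one byte in this model)
    memory[addr + 1:addr + 1 + len(data)] = data

def count(addr):
    """Model of COUNT ( c-addr1 -- c-addr2 u ): skip the length character."""
    return addr + 1, memory[addr]

store_counted(4, "StrongForth")
start, length = count(4)
print(start, length)                                 # 5 11
print(memory[start:start + length].decode("ascii"))  # StrongForth
```

The pair returned by count corresponds to the ( c-addr u ) representation: the address of the first character and the number of characters.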

Since ANS Forth explicitly discourages the use of counted strings, StrongForth abandons them. The ANS Forth words C", COUNT and WORD do not exist in StrongForth. FIND has been replaced by SEARCH-ALL, which (among other differences) expects a string in the CDATA -> CHARACTER UNSIGNED representation (see chapter 8). This is not considered a deficiency, because strings in the CDATA -> CHARACTER UNSIGNED representation can easily replace counted strings, and having only one kind of representation is an obvious advantage.

String Processing

A small group of string processing words was already presented in chapter 2:

FILL ( CDATA -> SINGLE UNSIGNED 2ND -- )
ERASE ( CDATA -> SINGLE UNSIGNED -- )
MOVE ( CDATA -> SINGLE CDATA -> 2ND UNSIGNED -- )
MOVE ( CCONST -> SINGLE CDATA -> 2ND UNSIGNED -- )

Since character strings are stored in memory blocks, these four words can be applied to character strings as well. FILL initializes a string with any character:

PAD 5 CHAR A FILL
 OK
PAD 5 TYPE
AAAAA OK

ERASE is a specialized version of FILL that initializes a memory block with zeros. Strings, however, are far more commonly initialized with space characters. This is what BLANK does. BLANK can be applied only to strings, not to memory blocks in general:

: BLANK ( CDATA -> CHARACTER UNSIGNED -- )
  BL FILL ;
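The relationship between FILL, ERASE and BLANK can be illustrated with a small Python model. This is a sketch, not StrongForth code: memory is assumed to be a bytearray indexed by address, and the function names fill, erase and blank merely mirror the Forth words.

```python
# A 16-character memory block, pre-filled with '?' so that changes are visible.
memory = bytearray(b"?" * 16)

def fill(addr, count, char):
    """Model of FILL: store char in count consecutive locations."""
    memory[addr:addr + count] = bytes([char]) * count

def erase(addr, count):
    """Model of ERASE: FILL with zero."""
    fill(addr, count, 0)

def blank(addr, count):
    """Model of BLANK: FILL with the space character (BL)."""
    fill(addr, count, ord(" "))

fill(0, 5, ord("A"))
blank(5, 3)
erase(8, 2)
print(bytes(memory))  # b'AAAAA   \x00\x00??????'
```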

The two overloaded versions of MOVE replace the ANS Forth words CMOVE and CMOVE> for copying strings between different memory locations.

Now, let's continue with some more string processing words. /STRING adjusts a string by the number of characters given as the last input parameter. This input parameter is of data type INTEGER in order to allow signed positive, signed negative and unsigned numbers. But StrongForth provides an additional, overloaded version of /STRING without this parameter:

WORDS /STRING
/STRING ( CDATA -> CHARACTER UNSIGNED -- 1ST 3RD )
/STRING ( CDATA -> CHARACTER UNSIGNED INTEGER -- 1ST 3RD )
 OK

This second version of /STRING (which appears first in the above list) adjusts a string by always removing the first character, i.e., it assumes a default adjustment value of 1. ANS Forth, on the other hand, specifies only the version with an explicit adjustment value. As with LSHIFT and RSHIFT, StrongForth takes advantage of its overloading capability by providing a special version for the most common usage of a word. Here's a simple example:

PAD 16 BLANK
 OK
PAD 16 TYPE
                 OK
PARSE-WORD StrongForth PAD SWAP MOVE
 OK
PAD 16 TYPE
StrongForth      OK
PAD 16 5 /STRING OVER OVER TYPE
gForth      OK
/STRING OVER OVER TYPE
Forth      OK
-TRAILING TYPE
Forth OK

This example leads to the next string processing word:

: -TRAILING ( CDATA -> CHARACTER UNSIGNED -- 1ST 3RD )
  BEGIN DUP
  WHILE OVER OVER + 1- @ BL =
  WHILE 1-
  REPEAT THEN ;

The semantics is as specified by ANS Forth. -TRAILING removes trailing spaces from a string. The implementation contains a loop with two exit conditions, one for encountering a non-space character and one for the string being empty.
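The PAD session above can be replayed in a short Python model of /STRING and -TRAILING. This is a sketch under assumptions: the 16-character buffer pad stands in for PAD, an address is an index into it, and slash_string, minus_trailing and text are hypothetical names. The default argument n=1 plays the role of the overloaded /STRING without an adjustment value.

```python
SPACE = ord(" ")

def slash_string(addr, length, n=1):
    """Model of /STRING: advance the start by n characters, shorten the length by n."""
    return addr + n, length - n

def minus_trailing(memory, addr, length):
    """Model of -TRAILING: stop when the string is empty or its last character is not a space."""
    while length and memory[addr + length - 1] == SPACE:
        length -= 1
    return addr, length

pad = bytearray(b"StrongForth     ")  # 16 characters, as in the PAD example

def text(addr, length):
    return pad[addr:addr + length].decode("ascii")

addr, length = slash_string(0, 16, 5)
print(repr(text(addr, length)))  # 'gForth     '
addr, length = slash_string(addr, length)
print(repr(text(addr, length)))  # 'Forth     '
addr, length = minus_trailing(pad, addr, length)
print(repr(text(addr, length)))  # 'Forth'
```

The while condition combines the two exit conditions of the Forth loop: the length reaching zero and a non-space character at the end of the string.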

StrongForth provides three overloaded versions of the ANS Forth word COMPARE for different memory areas. At least one of the two strings to be compared has to be located in the DATA memory area:

COMPARE ( CDATA -> CHARACTER UNSIGNED CFAR-ADDRESS -> 2ND 3RD -- SIGNED )
COMPARE ( CDATA -> CHARACTER UNSIGNED CCONST -> 2ND 3RD -- SIGNED )
COMPARE ( CDATA -> CHARACTER UNSIGNED 1ST 3RD -- SIGNED )
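The result of COMPARE follows the ANS Forth rules, which the following Python sketch models on byte strings. It assumes that the StrongForth versions keep the ANS Forth semantics; the function name compare is only a stand-in and the memory areas are not modelled.

```python
def compare(a, b):
    """Model of ANS Forth COMPARE on two byte strings; returns -1, 0 or 1."""
    # Compare character by character up to the length of the shorter string.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # Identical up to the shorter length: the shorter string is the lesser one.
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1

print(compare(b"Forth", b"Forth"))        # 0
print(compare(b"Forth", b"Fourth"))       # -1, because 'r' < 'u'
print(compare(b"StrongForth", b"Strong")) # 1, first string is longer
```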

SEARCH is an application of /STRING and COMPARE:

: SEARCH ( CDATA -> CHARACTER UNSIGNED 1ST 3RD -- 1ST 3RD FLAG )
  LOCALS| N2 ADDR2 N1 ADDR1 | ADDR1 N1
  BEGIN DUP N2 < INVERT
  WHILE OVER N2 ADDR2 N2 COMPARE
  WHILE /STRING
  REPEAT TRUE
  ELSE DROP DROP ADDR1 N1 FALSE
  THEN ;

Note that only strings located in the DATA memory area can be searched for substrings. The substring has to be located in the DATA memory area as well. However, it is very easy to define an overloaded version for substrings that are located in other memory areas, for example the CONST memory area:

: SEARCH ( CDATA -> CHARACTER UNSIGNED CCONST -> 2ND 3RD -- 1ST 3RD FLAG )
  LOCALS| N2 ADDR2 N1 ADDR1 | ADDR1 N1
  BEGIN DUP N2 < INVERT
  WHILE OVER N2 ADDR2 N2 COMPARE
  WHILE /STRING
  REPEAT TRUE
  ELSE DROP DROP ADDR1 N1 FALSE
  THEN ;

Except for the stack diagram, this definition is absolutely identical to the version for both strings in the DATA memory area. The compiler automatically chooses the right version of COMPARE, because it is aware of the data types.
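The loop shared by both SEARCH definitions can be traced in a Python model. This is a sketch under assumptions: memory is a bytearray, an address is an index into it, the substring is passed as a bytes object instead of an address and length, and the name search is hypothetical. Python has no overloading by memory area, so only the control flow is shown.

```python
def search(memory, addr1, n1, pattern):
    """Model of SEARCH: returns (addr, length, flag) like ( -- 1ST 3RD FLAG )."""
    addr, n = addr1, n1
    n2 = len(pattern)
    while n >= n2:                             # first WHILE: enough characters left
        if memory[addr:addr + n2] == pattern:  # COMPARE would return 0: match found
            return addr, n, True
        addr, n = addr + 1, n - 1              # /STRING without adjustment value
    return addr1, n1, False                    # no match: restore the original string

memory = bytearray(b"StrongForth")
print(search(memory, 0, 11, b"Forth"))  # (6, 5, True)
print(search(memory, 0, 11, b"Weak"))   # (0, 11, False)
```

On success, the returned address and length describe the remainder of the string starting at the match. On failure, the original string is returned unchanged together with a false flag.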


Dr. Stephan Becher - January 4th, 2008