Part XII: Miscellany - 5 June 1989
Introduction
This installment is another one of those excursions into side alleys that don’t seem to fit into the mainstream of this tutorial series. As I mentioned last time, it was while I was writing this installment that I realized some changes had to be made to the compiler structure. So I had to digress from this digression long enough to develop the new structure and show it to you.
Now that that’s behind us, I can tell you what I set out to in the first place. This shouldn’t take long, and then we can get back into the mainstream.
Several people have asked me about things that other languages provide, but so far I haven’t addressed in this series. The two biggies are semicolons and comments. Perhaps you’ve wondered about them, too, and wondered how things would change if we had to deal with them. Just so you can proceed with what’s to come, without being bothered by that nagging feeling that something is missing, we’ll address such issues here.
Semicolons
Ever since the introduction of Algol, semicolons have been a part of almost every modern language. We’ve all used them to the point that they are taken for granted. Yet I suspect that more compilation errors have occurred due to misplaced or missing semicolons than any other single cause. And if we had a penny for every extra keystroke programmers have used to type the little rascals, we could pay off the national debt.
Having been brought up with FORTRAN, it took me a long time to get used to using semicolons, and to tell the truth I’ve never quite understood why they were necessary. Since I program in Pascal, and since the use of semicolons in Pascal is particularly tricky, that one little character is still by far my biggest source of errors.
When I began developing KISS, I resolved to question every construct in other languages, and to try to avoid the most common problems that occur with them. That puts the semicolon very high on my hit list.
To understand the role of the semicolon, you have to look at a little history.
Early programming languages were line-oriented. In FORTRAN, for example, various parts of the statement had specific columns or fields that they had to appear in. Since some statements were too long for one line, the “continuation card” mechanism was provided to let the compiler know that a given card was still part of the previous line. The mechanism survives to this day, even though punched cards are now things of the distant past.
When other languages came along, they also adopted various mechanisms for dealing with multiple-line statements. BASIC is a good example. It’s important to recognize, though, that the FORTRAN mechanism was not so much required by the line orientation of that language, as by the column-orientation. In those versions of FORTRAN where free-form input is permitted, it’s no longer needed.
When the fathers of Algol introduced that language, they wanted
to get away from line-oriented programs like FORTRAN and BASIC,
and allow for free-form input. This included the possibility of
stringing multiple statements on a single line, as in
a=b; c=d; e=e+1;
.
In cases like this, the semicolon is almost required. The same
line, without the semicolons, just looks “funny”:
a=b c= d e=e+1
.
I suspect that this is the major … perhaps only … reason for
semicolons: to keep programs from looking funny.
But the idea of stringing multiple statements together on a single line is a dubious one at best. It’s not very good programming style, and harks back to the days when it was considered important to conserve cards. In these days of CRT’s and indented code, the clarity of programs is far better served by keeping statements separate. It’s still nice to have the option of multiple statements, but it seems a shame to keep programmers in slavery to the semicolon, just to keep that one rare case from “looking funny.”
When I started in with KISS, I tried to keep an open mind. I decided that I would use semicolons when it became necessary for the parser, but not until then. I figured this would happen just about the time I added the ability to spread statements over multiple lines. But, as you can see, that never happened. The TINY compiler is perfectly happy to parse the most complicated statement, spread over any number of lines, without semicolons.
Still, there are people who have used semicolons for so long, they feel naked without them. I’m one of them. Once I had KISS defined sufficiently well, I began to write a few sample programs in the language. I discovered, somewhat to my horror, that I kept putting semicolons in anyway. So now I’m facing the prospect of a new rash of compiler errors, caused by unwanted semicolons. Phooey!
Perhaps more to the point, there are readers out there who are designing their own languages, which may include semicolons, or who want to use the techniques of these tutorials to compile conventional languages like C. In either case, we need to be able to deal with semicolons.
Syntactic Sugar
This whole discussion brings up the issue of “syntactic sugar” … constructs that are added to a language, not because they are needed, but because they help make the programs look right to the programmer. After all, it’s nice to have a small, simple compiler, but it would be of little use if the resulting language were cryptic and hard to program. The language FORTH comes to mind (a premature OUCH! for the barrage I know that one’s going to fetch me). If we can add features to the language that make the programs easier to read and understand, and if those features help keep the programmer from making errors, then we should do so. Particularly if the constructs don’t add much to the complexity of the language or its compiler.
The semicolon could be considered an example, but there are
plenty of others, such as the THEN
in a IF
-statement, the DO
in a WHILE
-statement, and even the PROGRAM
statement, which I
came within a gnat’s eyelash of leaving out of TINY. None of
these tokens add much to the syntax of the language … the
compiler can figure out what’s going on without them. But some
folks feel that they DO
add to the readability of programs, and
that can be very important.
There are two schools of thought on this subject, which are well represented by two of our most popular languages, C and Pascal.
To the minimalists, all such sugar should be left out. They argue that it clutters up the language and adds to the number of keystrokes programmers must type. Perhaps more importantly, every extra token or keyword represents a trap laying in wait for the inattentive programmer. If you leave out a token, misplace it, or misspell it, the compiler will get you. So these people argue that the best approach is to get rid of such things. These folks tend to like C, which has a minimum of unnecessary keywords and punctuation.
Those from the other school tend to like Pascal. They argue that having to type a few extra characters is a small price to pay for legibility. After all, humans have to read the programs, too. Their best argument is that each such construct is an opportunity to tell the compiler that you really mean for it to do what you said to. The sugary tokens serve as useful landmarks to help you find your way.
The differences are well represented by the two languages. The most oft-heard complaint about C is that it is too forgiving. When you make a mistake in C, the erroneous code is too often another legal C construct. So the compiler just happily continues to compile, and leaves you to find the error during debug. I guess that’s why debuggers are so popular with C programmers.
On the other hand, if a Pascal program compiles, you can be pretty sure that the program will do what you told it. If there is an error at run time, it’s probably a design error.
The best example of useful sugar is the semicolon itself.
Consider the code fragment a=1+(2*b+c) b...
.
Since there is no operator connecting the token b
with the rest
of the statement, the compiler will conclude that the expression
ends with the )
, and the b
is the beginning of a new
statement. But suppose I have simply left out the intended
operator, and I really want to say a=1+(2*b+c)*b...
.
In this case the compiler will get an error, all right, but it
won’t be very meaningful since it will be expecting an =
sign
after the b
that really shouldn’t be there.
If, on the other hand, I include a semicolon after the b
, then
there can be no doubt where I intend the statement to end.
Syntactic sugar, then, can serve a very useful purpose by
providing some additional insurance that we remain on track.
I find myself somewhere in the middle of all this. I tend to favor the Pascal-ers’ view … I’d much rather find my bugs at compile time rather than run time. But I also hate to just throw verbosity in for no apparent reason, as in COBOL. So far I’ve consistently left most of the Pascal sugar out of KISS/TINY. But I certainly have no strong feelings either way, and I also can see the value of sprinkling a little sugar around just for the extra insurance that it brings. If you like this latter approach, things like that are easy to add. Just remember that, like the semicolon, each item of sugar is something that can potentially cause a compile error by its omission.
Dealing with Semicolons
There are two distinct ways in which semicolons are used in popular languages. In Pascal, the semicolon is regarded as an statement SEPARATOR. No semicolon is required after the last statement in a block. The syntax is:
<block> ::= <statement> ( ';' <statement>)*
<statement> ::= <assignment> | <if> | <while> ... | null
(The null statement is important!)
Pascal also defines some semicolons in other places, such as after the PROGRAM statement.
In C and Ada, on the other hand, the semicolon is considered a statement terminator, and follows all statements (with some embarrassing and confusing exceptions). The syntax for this is simply:
<block> ::= ( <statement> ';')*
Of the two syntaxes, the Pascal one seems on the face of it more
rational, but experience has shown that it leads to some strange
difficulties. People get so used to typing a semicolon after
every statement that they tend to type one after the last
statement in a block, also. That usually doesn’t cause any harm
… it just gets treated as a null statement. Many Pascal
programmers, including yours truly, do just that. But there is
one place you absolutely cannot type a semicolon, and that’s
right before an ELSE
. This little gotcha has cost me many an
extra compilation, particularly when the ELSE
is added to
existing code. So the C/Ada choice turns out to be better.
Apparently Nicklaus Wirth thinks so, too: In his Modula 2, he
abandoned the Pascal approach.
Given either of these two syntaxes, it’s an easy matter (now that we’ve reorganized the parser!) to add these features to our parser. Let’s take the last case first, since it’s simpler.
To begin, I’ve made things easy by introducing a new recognizer:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
MatchString(';');
end;
{--------------------------------------------------------------}
This procedure works very much like our old Match
. It insists on
finding a semicolon as the next token. Having found it, it skips
to the next one.
Since a semicolon follows a statement, procedure Block
is almost
the only one we need to change:
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
'x': Assignment;
end;
Semi;
Scan;
end;
end;
{--------------------------------------------------------------}
Note carefully the subtle change in the case statement. The call
to Assignment
is now guarded by a test on Token
. This is to
avoid calling Assignment
when the token is a semicolon (which
could happen if the statement is null).
Since declarations are also statements, we also need to add a
call to Semi
within procedure TopDecls
:
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
Scan;
while Token = 'v' do begin
Alloc;
while Token = ',' do
Alloc;
Semi;
end;
end;
{--------------------------------------------------------------}
Finally, we need one for the PROGRAM
statement:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
MatchString('PROGRAM');
Semi;
Header;
TopDecls;
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end.
{--------------------------------------------------------------}
It’s as easy as that. Try it with a copy of TINY and see how you like it.
The Pascal version is a little trickier, but it still only
requires minor changes, and those only to procedure Block
. To
keep things as simple as possible, let’s split the procedure into
two parts. The following procedure handles just one statement:
{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }
procedure Statement;
begin
Scan;
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
'x': Assignment;
end;
end;
{--------------------------------------------------------------}
Using this procedure, we can now rewrite Block
like this:
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Statement;
while Token = ';' do begin
Next;
Statement;
end;
end;
{--------------------------------------------------------------}
That sure didn’t hurt, did it? We can now parse semicolons in Pascal-like fashion.
A Compromise
Now that we know how to deal with semicolons, does that mean that I’m going to put them in KISS/TINY? Well, yes and no. I like the extra sugar and the security that comes with knowing for sure where the ends of statements are. But I haven’t changed my dislike for the compilation errors associated with semicolons.
So I have what I think is a nice compromise: Make them optional!
Consider the following version of Semi:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
if Token = ';' then Next;
end;
{--------------------------------------------------------------}
This procedure will accept a semicolon whenever it is called, but it won’t insist on one. That means that when you choose to use semicolons, the compiler will use the extra information to help keep itself on track. But if you omit one (or omit them all) the compiler won’t complain. The best of both worlds.
Put this procedure in place in the first version of your program (the one for C/Ada syntax), and you have the makings of TINY Version 1.2.
Comments
Up until now I have carefully avoided the subject of comments. You would think that this would be an easy subject … after all, the compiler doesn’t have to deal with comments at all; it should just ignore them. Well, sometimes that’s true.
Comments can be just about as easy or as difficult as you choose to make them. At one extreme, we can arrange things so that comments are intercepted almost the instant they enter the compiler. At the other, we can treat them as lexical elements. Things tend to get interesting when you consider things like comment delimiters contained in quoted strings.
Single-Character Delimiters
Here’s an example. Suppose we assume the Turbo Pascal standard and use curly braces for comments. In this case we have single-character delimiters, so our parsing is a little easier.
One approach is to strip the comments out the instant we
encounter them in the input stream; that is, right in procedure
GetChar
. To do this, first change the name of GetChar
to
something else, say GetCharX
. (For the record, this is going to
be a temporary change, so best not do this with your only copy of
TINY. I assume you understand that you should always do these
experiments with a working copy.)
Now, we’re going to need a procedure to skip over comments. So key in the following one:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
while Look <> '}' do
GetCharX;
GetCharX;
end;
{--------------------------------------------------------------}
Clearly, what this procedure is going to do is to simply read and discard characters from the input stream, until it finds a right curly brace. Then it reads one more character and returns it in Look.
Now we can write a new version of GetChar
that SkipComment
to
strip out comments:
{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }
procedure GetChar;
begin
GetCharX;
if Look = '{' then SkipComment;
end;
{--------------------------------------------------------------}
Code this up and give it a try. You’ll find that you can,
indeed, bury comments anywhere you like. The comments never even
get into the parser proper … every call to GetChar
just returns
any character that’s not part of a comment.
As a matter of fact, while this approach gets the job done, and may even be perfectly satisfactory for you, it does its job a little too well. First of all, most programming languages specify that a comment should be treated like a space, so that comments aren’t allowed to be embedded in, say, variable names. This current version doesn’t care where you put comments.
Second, since the rest of the parser can’t even receive a {
character, you will not be allowed to put one in a quoted string.
Before you turn up your nose at this simplistic solution, though,
I should point out that as respected a compiler as Turbo Pascal
also won’t allow a {
in a quoted string. Try it. And as for
embedding a comment in an identifier, I can’t imagine why anyone
would want to do such a thing, anyway, so the question is moot.
For 99% of all applications, what I’ve just shown you will work
just fine.
But, if you want to be picky about it and stick to the conventional treatment, then we need to move the interception point downstream a little further.
To do this, first change GetChar
back to the way it was and
change the name called in SkipComment
. Then, let’s add the left
brace as a possible whitespace character:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB, CR, LF, '{'];
end;
{--------------------------------------------------------------}
Now, we can deal with comments in procedure SkipWhite
:
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do begin
if Look = '{' then
SkipComment
else
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that SkipWhite
is written so that we will skip over any
combination of whitespace characters and comments, in one call.
OK, give this one a try, too. You’ll find that it will let a comment serve to delimit tokens. It’s worth mentioning that this approach also gives us the ability to handle curly braces within quoted strings, since within such strings we will not be testing for or skipping over whitespace.
There’s one last item to deal with: Nested
comments. Some
programmers like the idea of nesting comments, since it allows
you to comment out code during debugging. The code I’ve given
here won’t allow that and, again, neither will Turbo Pascal.
But the fix is incredibly easy. All we need to do is to make
SkipComment
recursive:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
while Look <> '}' do begin
GetChar;
if Look = '{' then SkipComment;
end;
GetChar;
end;
{--------------------------------------------------------------}
That does it. As sophisticated a comment-handler as you’ll ever need.
Multi-Character Delimiters
That’s all well and good for cases where a comment is delimited by single characters, but what about the cases such as C or standard Pascal, where two characters are required? Well, the principles are still the same, but we have to change our approach quite a bit. I’m sure it won’t surprise you to learn that things get harder in this case.
For the multi-character situation, the easiest thing to do is to
intercept the left delimiter back at the GetChar
stage. We can
“tokenize” it right there, replacing it by a single character.
Let’s assume we’re using the C delimiters /*
and */
. First,
we need to go back to the GetCharX
approach. In yet another
copy of your compiler, rename GetChar
to GetCharX
and then enter
the following new procedure GetChar
:
{--------------------------------------------------------------}
{ Read New Character. Intercept '/*' }
procedure GetChar;
begin
if TempChar <> ' ' then begin
Look := TempChar;
TempChar := ' ';
end
else begin
GetCharX;
if Look = '/' then begin
Read(TempChar);
if TempChar = '*' then begin
Look := '{';
TempChar := ' ';
end;
end;
end;
end;
{--------------------------------------------------------------}
As you can see, what this procedure does is to intercept every
occurrence of /
. It then examines the next character in the
stream. If the character is a *
, then we have found the
beginning of a comment, and GetChar
will return a single
character replacement for it. (For simplicity, I’m using the
same {
character as I did for Pascal. If you were writing a C
compiler, you’d no doubt want to pick some other character that’s
not used elsewhere in C. Pick anything you like … even $FF
,
anything that’s unique.)
If the character following the /
is not a *
, then GetChar
tucks it away in the new global TempChar, and returns the /
.
Note that you need to declare this new variable and initialize it
to
. I like to do things like that using the Turbo “typed
constant” construct:
const TempChar: char = ' ';
Now we need a new version of SkipComment
:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
repeat
repeat
GetCharX;
until Look = '*';
GetCharX;
until Look = '/';
GetChar;
end;
{--------------------------------------------------------------}
A few things to note: first of all, function IsWhite
and
procedure SkipWhite
don’t need to be changed, since GetChar
returns the {
token. If you change that token character, then
of course you also need to change the character in those two
routines.
Second, note that SkipComment
doesn’t call GetChar
in its loop,
but GetCharX
. That means that the trailing /
is not
intercepted and is seen by SkipComment
. Third, although GetChar
is the procedure doing the work, we can still deal with the
comment characters embedded in a quoted string, by calling
GetCharX
instead of GetChar
while we’re within the string.
Finally, note that we can again provide for nested comments by
adding a single statement to SkipComment
, just as we did before.
One-Sided Comments
So far I’ve shown you how to deal with any kind of comment
delimited on the left and the right. That only leaves the
one-sided comments like those in assembler language or in Ada, that
are terminated by the end of the line. In a way, that case is
easier. The only procedure that would need to be changed is
SkipComment
, which must now terminate at the newline characters:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
repeat
GetCharX;
until Look = CR;
GetChar;
end;
{--------------------------------------------------------------}
If the leading character is a single one, as in the ;
of
assembly language, then we’re essentially done. If it’s a
two-character token, as in the --
of Ada, we need only modify the
tests within GetChar
. Either way, it’s an easier problem than
the balanced case.
Conclusion
At this point we now have the ability to deal with both comments and semicolons, as well as other kinds of syntactic sugar. I’ve shown you several ways to deal with each, depending upon the convention desired. The only issue left is: which of these conventions should we use in KISS/TINY?
For the reasons that I’ve given as we went along, I’m choosing the following:
- Semicolons are terminators, not separators
- Semicolons are optional
- Comments are delimited by curly braces
- Comments may be nested
Put the code corresponding to these cases into your copy of TINY. You now have TINY Version 1.2.
Now that we have disposed of these sideline issues, we can finally get back into the mainstream. In the next installment, we’ll talk about procedures and parameter passing, and we’ll add these important features to TINY. See you then.