Part XVI: Unit Construction - 29 May 1995
Introduction
This series of tutorials promises to be perhaps one of the longest-running mini-series in history, rivalled only by the delay in Volume IV of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990 when the “cares of this world,” changes in priorities and interests, and the need to make a living seemed to stall it out after Installment 14. Those of you with loads of patience were finally rewarded, in the spring of last year, with the long-awaited Installment 15.

In that installment, I began to try to steer the series back on track, and in the process, to make it easier to continue on to the goal, which is to provide you with not only enough understanding of the difficult subject of compiler theory, but also enough tools, in the form of canned subroutines and concepts, so that you would be able to continue on your own and become proficient enough to build your own parsers and translators.

Because of that long hiatus, I thought it appropriate to go back and review the concepts we have covered so far, and to redo some of the software, as well. In the past, we’ve never concerned ourselves much with the development of production-quality software tools … after all, I was trying to teach (and learn) concepts, not production practice. To do that, I tended to give you, not complete compilers or parsers, but only those snippets of code that illustrated the particular point we were considering at the moment.
I still believe that’s a good way to learn any subject; no one wants to have to make changes to 100,000 line programs just to try out a new idea. But the idea of just dealing with code snippets, rather than complete programs, also has its drawbacks in that we often seemed to be writing the same code fragments over and over. Although repetition has been thoroughly proven to be a good way to learn new ideas, it’s also true that one can have too much of a good thing. By the time I had completed Installment 14 I seemed to have reached the limits of my abilities to juggle multiple files and multiple versions of the same software functions. Who knows, perhaps that’s one reason I seemed to have run out of gas at that point.
Fortunately, the later versions of Borland’s Turbo Pascal allow us to have our cake and eat it too. By using their concept of separately compilable units, we can still write small subroutines and functions, and keep our main programs and test programs small and simple. But, once written, the code in the Pascal units will always be there for us to use, and linking them in is totally painless and transparent.
Since, by now, most of you are programming in either C or C++, I know what you’re thinking: Borland, with their Turbo Pascal (TP), certainly didn’t invent the concept of separately compilable modules. And of course you’re right. But if you’ve not used TP lately, or ever, you may not realize just how painless the whole process is. Even in C or C++, you still have to build a make file, either manually or by telling the compiler how to do so. You must also list, using extern statements or header files, the functions you want to import. In TP, you don’t even have to do that. You need only name the units you wish to use, and all of their procedures automatically become available.
It’s not my intention to get into a language-war debate here, so I won’t pursue the subject any further. Even I no longer use Pascal on my job … I use C at work and C++ for my articles in Embedded Systems Programming and other magazines. Believe me, when I set out to resurrect this series, I thought long and hard about switching both languages and target systems to the ones that we’re all using these days, C/C++ and PC architecture, and possibly object-oriented methods as well. In the end, I felt it would cause more confusion than the hiatus itself has. And after all, Pascal still remains one of the best possible languages for teaching, not to mention production programming. Finally, TP still compiles at the speed of light, much faster than competing C/C++ compilers. And Borland’s smart linker, used in TP but not in their C++ products, is second to none. Aside from being much faster than Microsoft-compatible linkers, the Borland smart linker will cull unused procedures and data items, even to the extent of trimming them out of defined objects if they’re not needed. For one of the few times in our lives, we don’t have to compromise between completeness and efficiency. When we’re writing a TP unit, we can make it as complete as we like, including any member functions and data items we may think we will ever need, confident that doing so will not create unwanted bloat in the compiled and linked executable.
The point, really, is simply this: By using TP’s unit mechanism, we can have all the advantages and convenience of writing small, seemingly stand-alone test programs, without having to constantly rewrite the support functions that we need. Once written, the TP units sit there, quietly waiting to do their duty and give us the support we need, when we need it.
Using this principle, in Installment 15 I set out to minimize our tendency to re-invent the wheel by organizing our code into separate Turbo Pascal units, each containing different parts of the compiler. We ended up with the following units:
Input
Output
Errors
Scanner
Parser
CodeGen
Each of these units serves a different function, and encapsulates specific areas of functionality. The Input and Output units, as their names imply, provide character stream I/O and the all-important lookahead character upon which our predictive parser is based. The Errors unit, of course, provides standard error handling. The Scanner unit contains all of our boolean functions such as IsAlpha, and the routines GetName and GetNumber, which process multi-character tokens.

The two units we’ll be working with the most, and the ones that most represent the personality of our compiler, are Parser and CodeGen. Theoretically, the Parser unit should encapsulate all aspects of the compiler that depend on the syntax of the compiled language (though, as we saw last time, a small amount of this syntax spills over into Scanner). Similarly, the code generator unit, CodeGen, contains all of the code dependent upon the target machine. In this installment, we’ll be continuing with the development of the functions in these two all-important units.
Just Like Classical?
Before we proceed, however, I think I should clarify the relationship between, and the functionality of these units. Those of you who are familiar with compiler theory as taught in universities will, of course, recognize the names, Scanner, Parser, and CodeGen, all of which are components of a classical compiler implementation. You may be thinking that I’ve abandoned my commitment to the KISS philosophy, and drifted towards a more conventional architecture than we once had. A closer look, however, should convince you that, while the names are similar, the functionalities are quite different.
Together, the scanner and parser of a classical implementation comprise the so-called “front end,” and the code generator, the back end. The front end routines process the language-dependent, syntax-related aspects of the source language, while the code generator, or back end, deals with the target machine-dependent parts of the problem. In classical compilers, the two ends communicate via a file of instructions written in an intermediate language (IL).
Typically, a classical scanner is a single procedure, operating as a co-procedure with the parser. It “tokenizes” the source file, reading it character by character, recognizing language elements, translating them into tokens, and passing them along to the parser. You can think of the parser as an abstract machine, executing “op codes,” which are the tokens. Similarly, the parser generates op codes of a second abstract machine, which mechanizes the IL. Typically, the IL file is written to disk by the parser, and read back again by the code generator.
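The classical coupling just described can be sketched in miniature. This is a toy in Python rather than the tutorial’s Pascal, and the token names are my own invention: the scanner is one routine that tokenizes the whole character stream, and the parser would consume the resulting tokens like op codes.

```python
# Toy sketch of a classical scanner: a single generator that tokenizes
# the entire source stream, yielding tokens for a parser to consume.
def scanner(source):
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():                       # skip whitespace
            i += 1
        elif c.isdigit():                     # collect a multi-digit number
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            yield ('NUMBER', source[i:j])
            i = j
        else:                                 # everything else: one-char operator
            yield ('OP', c)
            i += 1

tokens = list(scanner('12 + 34'))
```

The point of the sketch is the shape, not the details: the scanner owns the character stream, and the parser never sees raw characters — exactly the opposite of what our compiler does.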
Our organization is quite different. We have no lexical scanner, in the classical sense; our unit Scanner, though it has a similar name, is not a single procedure or co-procedure, but merely a set of separate subroutines which are called by the parser as needed.

Similarly, the classical code generator, the back end, is a translator in its own right, reading an IL “source” file, and emitting an object file. Our code generator doesn’t work that way. In our compiler, there IS no intermediate language; every construct in the source language syntax is converted into assembly language as it is recognized by the parser. Like Scanner, the unit CodeGen consists of individual procedures which are called by the parser as needed.
This “code ’em as you find ’em” philosophy may not produce the world’s most efficient code – for example, we haven’t provided (yet!) a convenient place for an optimizer to work its magic – but it sure does simplify the compiler, doesn’t it?
And that observation prompts me to reflect, once again, on how we have managed to reduce a compiler’s functions to such comparatively simple terms. I’ve waxed eloquent on this subject in past installments, so I won’t belabor the point too much here. However, because of the time that’s elapsed since those last soliloquies, I hope you’ll grant me just a little time to remind myself, as well as you, how we got here. We got here by applying several principles that writers of commercial compilers seldom have the luxury of using. These are:
- The KISS philosophy: Never do things the hard way without a reason
- Lazy coding: Never put off until tomorrow what you can put off forever (with credits to P.J. Plauger)
- Skepticism: Stubborn refusal to do something just because that’s the way it’s always been done.
- Acceptance of inefficient code
- Rejection of arbitrary constraints
As I’ve reviewed the history of compiler construction, I’ve learned that virtually every production compiler in history has suffered from pre-imposed conditions that strongly influenced its design. The original FORTRAN compiler of John Backus, et al., had to compete with assembly language, and therefore was constrained to produce extremely efficient code. The IBM compilers for the minicomputers of the ’70s had to run in the very small RAM memories then available – as small as 4k. The early Ada compiler had to compile itself. Per Brinch Hansen decreed that his Pascal compiler developed for the IBM PC must execute in a 64k machine. Compilers developed in Computer Science courses had to compile the widest variety of languages, and therefore required LALR parsers.
In each of these cases, these preconceived constraints literally dominated the design of the compiler.
A good example is Brinch Hansen’s compiler, described in his excellent book, “Brinch Hansen on Pascal Compilers” (highly recommended). Though his compiler is one of the most clear and un-obscure compiler implementations I’ve seen, that one decision, to compile large files in a small RAM, totally drives the design, and he ends up with not just one, but many intermediate files, together with the drivers to write and read them.
In time, the architectures resulting from such decisions have found their way into computer science lore as articles of faith. In this one man’s opinion, it’s time that they were re-examined critically. The conditions, environments, and requirements that led to classical architectures are not the same as the ones we have today. There’s no reason to believe the solutions should be the same, either.
In this tutorial, we’ve followed the leads of such pioneers in the world of small compilers for PCs as Leor Zolman, Ron Cain, and James Hendrix, who didn’t know enough compiler theory to know that they “couldn’t do it that way.” We have resolutely refused to accept arbitrary constraints, but rather have done whatever was easy. As a result, we have evolved an architecture that, while quite different from the classical one, gets the job done in very simple and straightforward fashion.
I’ll end this philosophizing with an observation re the notion of an intermediate language. While I’ve noted before that we don’t have one in our compiler, that’s not exactly true; we do have one, or at least are evolving one, in the sense that we are defining code generation functions for the parser to call. In essence, every call to a code generation procedure can be thought of as an instruction in an intermediate language. Should we ever find it necessary to formalize an intermediate language, this is the way we would do it: emit codes from the parser, each representing a call to one of the code generator procedures, and then process each code by calling those procedures in a separate pass, implemented in a back end. Frankly, I don’t see that we’ll ever find a need for this approach, but there is the connection, if you choose to follow it, between the classical and the current approaches.
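To make that connection concrete, here is a hypothetical sketch (in Python, with names of my own choosing, not anything from the tutorial’s code) of how such a formalized intermediate language could work: the parser records each would-be call to a code-generator procedure as an IL instruction, and a separate back-end pass replays them.

```python
# The "front end" records IL instructions instead of calling the code
# generator directly; the "back end" replays them as 68000 assembly.
il = []

def emit_il(op, *args):
    """Front end: record one IL instruction."""
    il.append((op, *args))

def back_end(program):
    """Back end: translate each IL instruction into 68000 code."""
    asm = []
    for op, *args in program:
        if op == 'LOAD_CONST':
            asm.append(f'MOVE #{args[0]},D0')
        elif op == 'NEGATE':
            asm.append('NEG D0')
    return asm

# The parser, on seeing "-3", would make these two recordings:
emit_il('LOAD_CONST', 3)
emit_il('NEGATE')
```

Running back_end(il) then produces the same two instructions that the direct-call organization emits immediately.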
Fleshing out the Parser
Though I promised you, somewhere along about Installment 14, that we’d never again write every single function from scratch, I ended up starting to do just that in Installment 15. One reason: that long hiatus between the two installments made a review seem eminently justified … even imperative, both for you and for me. More importantly, the decision to collect the procedures into modules (units), forced us to look at each one yet again, whether we wanted to or not. And, finally and frankly, I’ve had some new ideas in the last four years that warranted a fresh look at some old friends. When I first began this series, I was frankly amazed, and pleased, to learn just how simple parsing routines can be made. But this last time around, I’ve surprised myself yet again, and been able to make them just that last little bit simpler, yet.
Still, because of this total rewrite of the parsing modules, I was only able to include so much in the last installment. Because of this, our hero, the parser, when last seen, was a shadow of its former self, consisting of only enough code to parse and process a factor consisting of either a variable or a constant. The main effort of this current installment will be to help flesh out the parser to its former glory. In the process, I hope you’ll bear with me if we sometimes cover ground we’ve long since been over and dealt with.
First, let’s take care of a problem that we’ve addressed before: Our current version of procedure Factor, as we left it in Installment 15, can’t handle negative arguments. To fix that, we’ll introduce the procedure SignedFactor:

{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional Sign }
procedure SignedFactor;
var Sign: char;
begin
   Sign := Look;
   if IsAddop(Look) then
      GetChar;
   Factor;
   if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
Note that this procedure calls a new code generation routine, Negate:

{--------------------------------------------------------------}
{ Negate Primary }
procedure Negate;
begin
   EmitLn('NEG D0');
end;
{--------------------------------------------------------------}
(Here, and elsewhere in this series, I’m only going to show you the new routines. I’m counting on you to put them into the proper unit, which you should normally have no trouble identifying. Don’t forget to add the procedure’s prototype to the interface section of the unit.)
In the main program, simply change the procedure called from Factor to SignedFactor, and give the code a test. Isn’t it neat how the Turbo linker and make facility handle all the details?

Yes, I know, the code isn’t very efficient. If we input a number, -3, the generated code is:

   MOVE #3,D0
   NEG D0

which is really, really dumb. We can do better, of course, by simply prepending a minus sign to the string passed to LoadConstant, but it adds a few lines of code to SignedFactor, and I’m applying the KISS philosophy very aggressively here. What’s more, to tell the truth, I think I’m subconsciously enjoying generating “really, really dumb” code, so I can have the pleasure of watching it get dramatically better when we get into optimization methods.
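For the curious, the sign-folding alternative mentioned above can be sketched like this — a Python stand-in for the Pascal, purely illustrative:

```python
# Sketch of folding a leading minus sign into the constant itself,
# versus the naive MOVE-then-NEG pair the text shows.
def emit_signed_constant(sign, literal, fold):
    if fold and sign == '-':
        # folded form: the sign rides along in the literal, one instruction
        return [f'MOVE #-{literal},D0']
    lines = [f'MOVE #{literal},D0']
    if sign == '-':
        lines.append('NEG D0')        # naive form: negate after the load
    return lines
```

For the input -3, the naive path yields the two “dumb” instructions above, while the folded path yields a single MOVE #-3,D0.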
Most of you have never heard of John Spray, so allow me to introduce him to you here. John’s from New Zealand, and used to teach computer science at one of its universities. John wrote a compiler for the Motorola 6809, based on a delightful, Pascal-like language of his own design called “Whimsical.” He later ported the compiler to the 68000, and for a while it was the only compiler I had for my homebrewed 68000 system.
For the record, one of my standard tests for any new compiler is to see how the compiler deals with a null program like:
program main;
begin
end.
My test is to measure the time required to compile and link, and the size of the object file generated. The undisputed loser in the test is the DEC C compiler for the VAX, which took 60 seconds to compile, on a VAX 11/780, and generated a 50k object file. John’s compiler is the undisputed, once, future, and forever king in the code size department. Given the null program, Whimsical generates precisely two bytes of code, implementing the one instruction, RET.
By setting a compiler option to generate an include file rather than a standalone program, John can even cut this size, from two bytes to zero! Sort of hard to beat a null object file, wouldn’t you say?
Needless to say, I consider John to be something of an expert on code optimization, and I like what he has to say: “The best way to optimize is not to have to optimize at all, but to produce good code in the first place.” Words to live by. When we get started on optimization, we’ll follow John’s advice, and our first step will not be to add a peephole optimizer or other after-the-fact device, but to improve the quality of the code emitted before optimization. So make a note of SignedFactor as a good first candidate for attention, and for now we’ll leave it be.
Terms and Expressions
I’m sure you know what’s coming next: We must, yet again, create the rest of the procedures that implement the recursive-descent parsing of an expression. We all know that the hierarchy of procedures for arithmetic expressions is:
expression
term
factor
However, for now let’s continue to do things one step at a time, and consider only expressions with additive terms in them. The code to implement expressions, including a possibly signed first term, is shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   SignedFactor;
   while IsAddop(Look) do
      case Look of
       '+': Add;
       '-': Subtract;
      end;
end;
{--------------------------------------------------------------}
This procedure calls two other procedures to process the operations:
{--------------------------------------------------------------}
{ Parse and Translate an Addition Operation }
procedure Add;
begin
   Match('+');
   Push;
   Factor;
   PopAdd;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Subtraction Operation }
procedure Subtract;
begin
   Match('-');
   Push;
   Factor;
   PopSub;
end;
{--------------------------------------------------------------}
The three procedures Push, PopAdd, and PopSub are new code generation routines. As the name implies, procedure Push generates code to push the primary register (D0, in our 68000 implementation) to the stack. PopAdd and PopSub pop the top of the stack again, and add it to, or subtract it from, the primary register. The code is shown next:
{--------------------------------------------------------------}
{ Push Primary to Stack }
procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}
{ Add TOS to Primary }
procedure PopAdd;
begin
   EmitLn('ADD (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Subtract TOS from Primary }
procedure PopSub;
begin
   EmitLn('SUB (SP)+,D0');
   Negate;
end;
{--------------------------------------------------------------}
Add these routines to Parser and CodeGen, and change the main program to call Expression. Voila!
The next step, of course, is to add the capability for dealing with multiplicative terms. To that end, we’ll add a procedure Term, and code generation procedures PopMul and PopDiv. These code generation procedures are shown next:
{--------------------------------------------------------------}
{ Multiply TOS by Primary }
procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Divide Primary by TOS }
procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EXT.L D7');
   EmitLn('DIVS D0,D7');
   EmitLn('MOVE D7,D0');
end;
{--------------------------------------------------------------}
I admit, the division routine is a little busy, but there’s no help for it. Unfortunately, while the 68000 CPU allows a division using the top of stack (TOS), it wants the arguments in the wrong order, just as it does for subtraction. So our only recourse is to pop the stack to a scratch register (D7), perform the division there, and then move the result back to our primary register, D0. Note the use of signed multiply and divide operations. This follows an implied, but unstated, assumption, that all our variables will be signed 16-bit integers. This decision will come back to haunt us later, when we start looking at multiple data types, type conversions, etc.
Our procedure Term is virtually a clone of Expression, and looks like this:

{--------------------------------------------------------------}
{ Parse and Translate a Term }
procedure Term;
begin
   Factor;
   while IsMulop(Look) do
      case Look of
       '*': Multiply;
       '/': Divide;
      end;
end;
{--------------------------------------------------------------}
Our next step is to change some names. SignedFactor now becomes SignedTerm, and the calls to Factor in Expression, Add, Subtract and SignedTerm get changed to call Term:
{--------------------------------------------------------------}
{ Parse and Translate a Term with Optional Leading Sign }
procedure SignedTerm;
var Sign: char;
begin
   Sign := Look;
   if IsAddop(Look) then
      GetChar;
   Term;
   if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
...
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
   SignedTerm;
   while IsAddop(Look) do
      case Look of
       '+': Add;
       '-': Subtract;
      end;
end;
{--------------------------------------------------------------}
If memory serves me correctly, we once had both a procedure SignedFactor and a procedure SignedTerm. I had reasons for doing that at the time … they had to do with the handling of Boolean algebra and, in particular, the Boolean “not” function. But certainly, for arithmetic operations, that duplication isn’t necessary. In an expression like -x*y, it’s very apparent that the sign goes with the whole term, x*y, and not just the factor x, and that’s the way Expression is coded.
Test this new code by executing Main. It still calls Expression, so you should now be able to deal with expressions containing any of the four arithmetic operators.
Our last bit of business, as far as expressions go, is to modify procedure Factor to allow for parenthetical expressions. By using a recursive call to Expression, we can reduce the needed code to virtually nothing. Five lines added to Factor do the job:
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
   if Look = '(' then begin
      Match('(');
      Expression;
      Match(')');
      end
   else if IsDigit(Look) then
      LoadConstant(GetNumber)
   else if IsAlpha(Look) then
      LoadVariable(GetName)
   else
      Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}
At this point, your “compiler” should be able to handle any legal expression you can throw at it. Better yet, it should reject all illegal ones!
Assignments
As long as we’re this close, we might as well create the code to deal with an assignment statement. This code needs only to remember the name of the target variable where we are to store the result of an expression, call Expression, then store the number. The procedure is shown next:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
   Name := GetName;
   Match('=');
   Expression;
   StoreVariable(Name);
end;
{--------------------------------------------------------------}
The assignment calls for yet another code generation routine:
{--------------------------------------------------------------}
{ Store the Primary Register to a Variable }
procedure StoreVariable(Name: string);
begin
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{--------------------------------------------------------------}
Now, change the call in Main to call Assignment, and you should see a full assignment statement being processed correctly. Pretty neat, eh? And painless, too.
In the past, we’ve always tried to show BNF relations to define the syntax we’re developing. I haven’t done that here, and it’s high time I did. Here’s the BNF:
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <factor> (<mulop> <factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
Booleans
The next step, as we’ve learned several times before, is to add Boolean algebra. In the past, this step has at least doubled the amount of code we’ve had to write. As I’ve gone over this step in my mind, I’ve found myself diverging more and more from what we did in previous installments. To refresh your memory, I noted that Pascal treats the Boolean operators pretty much identically to the way it treats arithmetic ones. A Boolean “and” has the same precedence level as multiplication, and the “or” as addition. C, on the other hand, sets them at different precedence levels, and all told has a whopping 17 levels. In our earlier work, I chose something in between, with seven levels. As a result, we ended up with things called Boolean expressions, paralleling in most details the arithmetic expressions, but at a different precedence level. All of this, as it turned out, came about because I didn’t like having to put parentheses around the Boolean expressions in statements like

   IF (c >= 'A') and (c <= 'Z') then ...
In retrospect, that seems a pretty petty reason to add many layers of complexity to the parser. Perhaps more to the point, I’m not sure I was even able to avoid the parens.
For kicks, let’s start anew, taking a more Pascal-ish approach, and just treat the Boolean operators at the same precedence level as the arithmetic ones. We’ll see where it leads us. If it seems to be down the garden path, we can always backtrack to the earlier approach.
For starters, we’ll add the “addition-level” operators to Expression. That’s easily done; first, modify the function IsAddop in unit Scanner to include two extra operators: | for “or,” and ~ for “exclusive or”:

{--------------------------------------------------------------}
function IsAddop(c: char): boolean;
begin
   IsAddop := c in ['+', '-', '|', '~'];
end;
{--------------------------------------------------------------}
Next, we must include the parsing of the operators in procedure Expression:

{--------------------------------------------------------------}
procedure Expression;
begin
   SignedTerm;
   while IsAddop(Look) do
      case Look of
       '+': Add;
       '-': Subtract;
       '|': _Or;
       '~': _Xor;
      end;
end;
{--------------------------------------------------------------}
(The underscores are needed, of course, because or and xor are reserved words in Turbo Pascal.)
Next, the procedures _Or and _Xor:

{--------------------------------------------------------------}
{ Parse and Translate an Or Operation }
procedure _Or;
begin
   Match('|');
   Push;
   Term;
   PopOr;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Exclusive-Or Operation }
procedure _Xor;
begin
   Match('~');
   Push;
   Term;
   PopXor;
end;
{--------------------------------------------------------------}
And, finally, the new code generator procedures:

{--------------------------------------------------------------}
{ Or TOS with Primary }
procedure PopOr;
begin
   EmitLn('OR (SP)+,D0');
end;
{--------------------------------------------------------------}
{ Exclusive-Or TOS with Primary }
procedure PopXor;
begin
   { EOR requires a data-register source on the 68000, so pop to a scratch register first }
   EmitLn('MOVE (SP)+,D7');
   EmitLn('EOR D7,D0');
end;
{--------------------------------------------------------------}
Now, let’s test the translator (you might want to change the call in Main back to a call to Expression, just to avoid having to type "x=" for an assignment every time).

So far, so good. The parser nicely handles expressions of the form x|y~z. Unfortunately, it also does nothing to protect us from mixing Boolean and arithmetic algebra. It will merrily generate code for (a+b)*(c~d).
We’ve talked about this a bit, in the past. In general the rules for what operations are legal or not cannot be enforced by the parser itself, because they are not part of the syntax of the language, but rather its semantics. A compiler that doesn’t allow mixed-mode expressions of this sort must recognize that c and d are Boolean variables, rather than numeric ones, and balk at multiplying them in the next step. But this “policing” can’t be done by the parser; it must be handled somewhere between the parser and the code generator. We aren’t in a position to enforce such rules yet, because we haven’t got either a way of declaring types, or a symbol table to store the types in. So, for what we’ve got to work with at the moment, the parser is doing precisely what it’s supposed to do.
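Once we do have declarations and a symbol table, that policing could live in a thin layer between the parser and the code generator. Here is a hypothetical sketch in Python — the table, names, and check are invented for illustration and are not part of the tutorial’s code:

```python
# Hypothetical symbol table mapping variable names to declared types.
symbols = {'a': 'number', 'b': 'number', 'c': 'boolean', 'd': 'boolean'}

def check_mulop(left, right):
    """Run just before the code generator is asked to multiply:
    reject the operation if either operand is Boolean."""
    if 'boolean' in (symbols[left], symbols[right]):
        raise TypeError(f'cannot multiply Boolean operand: {left}*{right}')
```

With such a layer, a*b would pass through to the code generator unmolested, while c*d would be rejected before any code is emitted — exactly the semantic check the parser alone cannot make.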
Anyway, are we sure that we don’t want to allow mixed-type operations? We made the decision some time ago (or, at least, I did) to adopt the value 0000 as a Boolean false, and -1, or FFFFh, as a Boolean true. The nice part about this choice is that bitwise operations work exactly the same way as logical ones. In other words, when we do an operation on one bit of a logical variable, we do it on all of them. This means that we don’t need to distinguish between logical and bitwise operations, as is done in C with the operators & and &&, and | and ||. Reducing the number of operators by half certainly doesn’t seem all bad.
From the point of view of the data in storage, of course, the computer and compiler couldn’t care less whether the number FFFFh represents the logical TRUE, or the numeric -1. Should we? I sort of think not. I can think of many examples (though they might be frowned upon as “tricky” code) where the ability to mix the types might come in handy. For example, the Dirac delta function could be coded in one simple line, -(x=0), and the absolute value function (definitely tricky code!) as x*(1+2*(x<0)).
Please note, I’m not advocating coding like this as a way of life. I’d almost certainly write these functions in more readable form, using IFs, just to keep from confusing later maintainers. Still, a moral question arises: Do we have the right to enforce our ideas of good coding practice on the programmer, by writing the language so they can’t do anything else? That’s what Niklaus Wirth did, in many places in Pascal, and Pascal has been criticized for it – for not being as “forgiving” as C.
An interesting parallel presents itself in the example of the Motorola 68000 design. Though Motorola brags loudly about the orthogonality of their instruction set, the fact is that it’s far from orthogonal. For example, you can read a variable from its address:
MOVE X,D0 (where X is the name of a variable)
but you can’t write in the same way. To write, you must load an address register with the address of X. The same is true for PC-relative addressing:

   MOVE X(PC),D0 (legal)
   MOVE D0,X(PC) (illegal)
When you begin asking how such non-orthogonal behavior came about, you find that someone in Motorola had some theories about how software should be written. Specifically, in this case, they decided that self-modifying code, which you can implement using PC-relative writes, is a Bad Thing. Therefore, they designed the processor to prohibit it. Unfortunately, in the process they also prohibited all writes of the forms shown above, however benign. Note that this was not something done by default. Extra design work had to be done, and extra gates added, to destroy the natural orthogonality of the instruction set.
One of the lessons I’ve learned from life: If you have two choices, and can’t decide which one to take, sometimes the best thing to do is nothing. Why add extra gates to a processor to enforce some stranger’s idea of good programming practice? Leave the instructions in, and let the programmers debate what good programming practice is. Similarly, why should we add extra code to our parser, to test for and prevent conditions that the user might prefer to do, anyway? I’d rather leave the compiler simple, and let the software experts debate whether the practices should be used or not.
All of which serves as rationalization for my decision as to how to prevent mixed-type arithmetic: I won’t. For a language intended for systems programming, the fewer rules, the better. If you don’t agree, and want to test for such conditions, we can do it once we have a symbol table.
Boolean “and”
With that bit of philosophy out of the way, we can press on to the “and” operator, which goes into procedure Term. By now, you can probably do this without me, but here’s the code, anyway:

In Scanner,
{--------------------------------------------------------------}
function IsMulop(c: char): boolean;
begin
   IsMulop := c in ['*', '/', '&'];
end;
{--------------------------------------------------------------}
In Parser
,
{--------------------------------------------------------------}
procedure Term;
begin
   Factor;
   while IsMulop(Look) do
      case Look of
       '*': Multiply;
       '/': Divide;
       '&': _And;
      end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Boolean And Operation }
procedure _And;
begin
   Match('&');
   Push;
   Factor;
   PopAnd;
end;
{--------------------------------------------------------------}
and in CodeGen
,
{--------------------------------------------------------------}
{ And Primary with TOS }
procedure PopAnd;
begin
   EmitLn('AND (SP)+,D0');
end;
{--------------------------------------------------------------}
Your parser should now be able to process almost any sort of logical expression, and (should you be so inclined), mixed-mode expressions as well.
Why not “all sorts of logical expressions”? Because, so far, we haven’t dealt with the logical “not” operator, and this is where it gets tricky. The logical “not” operator seems, at first glance, to be identical in its behavior to the unary minus, so my first thought was to let the exclusive or operator, ~, double as the unary “not.” That didn’t work. In my first attempt, procedure SignedTerm simply ate my ~, because the character passed the test for an addop, but SignedTerm ignores all addops except -. It would have been easy enough to add another line to SignedTerm, but that would still not solve the problem, because note that Expression only accepts a signed term for the first argument.
Mathematically, an expression like -a * -b makes little or no sense, and the parser should flag it as an error. But the same expression, using a logical “not,” makes perfect sense: not a and not b.
In the case of these unary operators, choosing to make them act the same way seems an artificial force fit, sacrificing reasonable behavior on the altar of implementational ease. While I’m all for keeping the implementation as simple as possible, I don’t think we should do so at the expense of reasonableness. Patching like this would be missing the main point, which is that the logical “not” is simply NOT the same kind of animal as the unary minus. Consider the exclusive or, which is most naturally written as

a~b ::= (a and not b) or (not a and b)

If we allowed the “not” to modify the whole term, the last term in parentheses would be interpreted as not(a and b), which is not the same thing at all. So it’s clear that the logical “not” must be thought of as connected to the FACTOR, not the term.
The idea of overloading the ~ operator also makes no sense from a mathematical point of view. The implication of the unary minus is that it’s equivalent to a subtraction from zero: -x <=> 0-x. In fact, in one of my more simple-minded versions of Expression, I reacted to a leading addop by simply preloading a zero, then processing the operator as though it were a binary operator. But a “not” is not equivalent to an exclusive or with zero … that would just give back the original number. Instead, it’s an exclusive or with FFFFh, or -1.
In short, the seeming parallel between the unary “not” and the unary minus falls apart under closer scrutiny. “not” modifies the factor, not the term, and it is not related to either the unary minus or the exclusive or. Therefore, it deserves a symbol to call its own. What better symbol than the obvious one, also used by C: the ! character? Using the rules about the way we think the “not” should behave, we should be able to code the exclusive or (assuming we’d ever need to) in the very natural form: a & !b | !a & b.
Note that no parentheses are required – the precedence levels we’ve chosen automatically take care of things.
If you’re keeping score on the precedence levels, this definition puts the ! at the top of the heap. The levels become:

   !
   - (unary)
   *, /, &
   +, -, |, ~

Looking at this list, it’s certainly not hard to see why we had trouble using ~ as the “not” symbol!
So how do we mechanize the rules? In the same way as we did with SignedTerm, but at the factor level. We’ll define a procedure NotFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional "Not" }
procedure NotFactor;
begin
   if Look = '!' then begin
      Match('!');
      Factor;
      NotIt;
      end
   else
      Factor;
end;
{--------------------------------------------------------------}
and call it from all the places where we formerly called Factor, i.e., from Term, Multiply, Divide, and _And. Note the new code generation procedure:
{--------------------------------------------------------------}
{ Bitwise Not Primary }
procedure NotIt;
begin
   EmitLn('EOR #-1,D0');
end;
{--------------------------------------------------------------}
Try this now, with a few simple cases. In fact, try that exclusive or example, a&!b|!a&b. You should get the code (without the comments, of course):
MOVE A(PC),D0 ; load a
MOVE D0,-(SP) ; push it
MOVE B(PC),D0 ; load b
EOR #-1,D0 ; not it
AND (SP)+,D0 ; and with a
MOVE D0,-(SP) ; push result
MOVE A(PC),D0 ; load a
EOR #-1,D0 ; not it
MOVE D0,-(SP) ; push it
MOVE B(PC),D0 ; load b
AND (SP)+,D0 ; and with !a
OR (SP)+,D0 ; or with first term
That’s precisely what we’d like to get. So, at least for both arithmetic and logical operators, our new precedence and new, slimmer syntax hang together. Even the peculiar, but legal, expression with leading addop, ~x, makes sense. SignedTerm ignores the leading ~, as it should, since the expression is equivalent to 0~x, which is equal to x.
When we look at the BNF we’ve created, we find that our boolean algebra now adds only one extra line:
<not_factor> ::= [!] <factor>
<factor> ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term> ::= <not_factor> (<mulop> <not_factor>)*
<expression> ::= <signed_term> (<addop> <term>)*
<assignment> ::= <variable> '=' <expression>
That’s a big improvement over earlier efforts. Will our luck continue to hold when we get to relational operators? We’ll find out soon, but it will have to wait for the next installment. We’re at a good stopping place, and I’m anxious to get this installment into your hands. It’s already been a year since the release of Installment 15. I blush to admit that all of this current installment has been ready for almost as long, with the exception of relational operators. But the information does you no good at all, sitting on my hard disk, and by holding it back until the relational operations were done, I’ve kept it out of your hands for that long. It’s time for me to let go of it and get it out where you can get value from it. Besides, there are quite a number of serious philosophical questions associated with the relational operators, as well, and I’d rather save them for a separate installment where I can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and I’ll see you soon with relationals.