I've been working on a noteworthy amount of language and compiler design for the last couple of years, both professionally and at home. In compiler theory, some of the most useful tools are intermediate languages, intermediate representations (IR), and syntax-directed translation (SDT).
This article is an introduction to IRs — I plan to work on a series to cover the intricacies of what I call the 'middleware' of a compiler.
Simple case
Consider the expression:
2 * (1 + 1)
Of course, there are a few ways we could represent and store the meaning of this expression. The first that comes to mind is probably a syntax tree:
  *
 / \
2   +
   / \
  1   1
We can reconstruct the expression (and even compute its result) by walking the tree. We could build the tree with any algebraic data type, or in any language with pointers.
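As a quick sketch of that idea, here is one way to model the tree in Python with a small algebraic-style datatype and a recursive evaluator (the names Num, BinOp, and evaluate are mine, just for illustration):

from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: int                 # a leaf holding a constant

@dataclass
class BinOp:
    op: str                    # '+' or '*'
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def evaluate(node: Expr) -> int:
    # Walk the tree bottom-up, combining children at each operator.
    if isinstance(node, Num):
        return node.value
    lhs, rhs = evaluate(node.left), evaluate(node.right)
    return lhs + rhs if node.op == "+" else lhs * rhs

# 2 * (1 + 1)
tree = BinOp("*", Num(2), BinOp("+", Num(1), Num(1)))
assert evaluate(tree) == 4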
What other representations do you think would be useful?
Another is Polish notation:
* 2 + 1 1
The usefulness of Polish notation may not be so apparent at first, but it in fact has many uses! First, wrap the operators in parentheses and we have (* 2 (+ 1 1)), what's called an s-expression: a symbolic expression of a tree-like data construct that comes from Lisp. But there's more: that means the expression can be evaluated as-is in a Lisp program! Your calculator in grade school evaluated its calculations in either Polish or reverse Polish notation. This form supports smaller or embedded machines with limited resources.
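To make the calculator point concrete, here is a minimal sketch in Python that evaluates the prefix token stream as-is, left to right (eval_prefix is a name of my own choosing):

def eval_prefix(tokens):
    # Consume one token; operators recursively consume their two operands.
    token = tokens.pop(0)
    if token in ("+", "*"):
        lhs = eval_prefix(tokens)
        rhs = eval_prefix(tokens)
        return lhs + rhs if token == "+" else lhs * rhs
    return int(token)

assert eval_prefix("* 2 + 1 1".split()) == 4

Notice there is no explicit tree and no parentheses: the notation itself carries the structure, which is part of why this style suits machines with limited resources.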
I like to call s-expressions a 'compression' of foldable data structures and algorithms. It's easier to store and re-use the expression in this representation, with far fewer characters, and we get the same result.
I've used s-expressions to compress, store, and construct immutable data structures. They are also useful in automata theory and regular languages.
On paper, Polish notation can make optimizations easier to identify than the AST representation does:
\[\begin{aligned} * \ 2 \ + \ 1 \ 1 \ &= \ * \ 2 \ 2 \end{aligned}\]
We can merge the trailing + 1 1 into simply a 2. Congratulations, you've learned the gist of intermediate representations! The outcome is the same, and we simplify the expression.
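A minimal sketch of that simplification as a pass over the prefix tokens (fold_constants is a hypothetical name of mine; a real compiler would call this constant folding):

def fold_constants(tokens):
    # Rewrite the first (operator, number, number) triple as its result.
    for i in range(len(tokens) - 2):
        op, a, b = tokens[i], tokens[i + 1], tokens[i + 2]
        if op in ("+", "*") and a.isdigit() and b.isdigit():
            value = int(a) + int(b) if op == "+" else int(a) * int(b)
            return tokens[:i] + [str(value)] + tokens[i + 3:]
    return tokens  # nothing left to fold

assert fold_constants("* 2 + 1 1".split()) == ["*", "2", "2"]

Repeat the pass until nothing changes and the whole expression folds down to 4.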
The intermediate representation
The generalization of choosing a representation, or a representation language, and the difference that choice makes in facilitating efficiency and optimization before the final result, is the heart of the theory of IRs. In our first example, we represented simple arithmetic.
It is this simple concept that underlies how a compiler represents a programming language or formal system to a machine (or an abstract machine, such as a Turing machine).
The cleverness a compiler can apply in reasoning a piece of code down to platform-dependent machine code is founded on the informed assumptions it can make between its abstract representation and the capabilities of the abstract target machine.
Next, consider the program:
x = 10
y = 12
if x == 5:
    x = x + 2
else:
    x = x - 1
The same scientific process we used in our first example, adequately representing the expression before evaluating it, applies here. Except instead of going from arithmetic to a calculation, we go from a program to an executable or machine code.
Parser and language designers tend to think only of abstract syntax trees and parse trees when they think of program or language representation, but an AST is only one form of IR a compiler may use. How much potential and help a compiler or interpreter has towards a target platform depends on the flexibility of its IR. Support more than one IR, or a composable IR, and you get better potential.
Stacks and Registers
If you're reading this, you've probably heard of a stack. You probably also know that a 'program' may have what's called a stack and a heap. The abstract stack machine is to a program what a Polish or reverse Polish representation is to a math equation.
An abstract stack machine is a simple representation with minimal capabilities, but it is fantastic for basic compilers and arithmetic expressions.
As Western University puts it:
The instructions are quite limited and fall into three classes:
- arithmetic operations
- stack manipulation
- control flow.
It similarly simulates the evaluation of a postfix representation of its arguments on the stack, with a couple of control-flow and stack-manipulation instructions like push, pop, lvalue (pushes a data address), rvalue (pushes the contents of a data location), goto, gotrue, and gofalse.
The stack machine representation of our program above may look like:
lvalue y          -- push the address of y
push 12           -- push the constant 12
:=                -- pop both; store 12 at the address of y
lvalue x          -- push the address of x
push 10           -- push the constant 10
:=                -- pop both; store 10 at the address of x
rvalue x          -- push the value of x
push 5            -- push the constant 5
eq                -- equal? pop the last 2 values and push 1 or 0
gofalse notequal  -- pop; goto if the value at the top of the stack is 0
lvalue x          -- fall through: the equal branch
rvalue x
push 2
+                 -- pop the last 2 values and run the '+' operator
:=                -- sets the r-value on top with the l-value below it
goto end          -- skip the else branch
label notequal
lvalue x          -- the not-equal branch
rvalue x
push 1
-                 -- pop the last 2 values and run the '-' operator
:=                -- sets the r-value on top with the l-value below it
label end
For a little more information on abstract stack machines, check this out. A couple of things to notice here: one, how limited the instructions we need to reach our result are; two, how the representation looks like a generalization of an assembly language for an ISA. However, because an abstract stack machine is so simple to represent, a compiler that targets this IR is tremendously easier to build.
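To pin down the semantics used above, here is a minimal sketch of an interpreter for this abstract stack machine in Python, with the program encoded as (opcode, operand) pairs (the encoding and the run function are my own, assuming the instruction behavior described in the comments above):

def run(program):
    labels = {arg: i for i, (op, arg) in enumerate(program) if op == "label"}
    binops = {"+": lambda a, b: a + b,
              "-": lambda a, b: a - b,
              "eq": lambda a, b: 1 if a == b else 0}
    memory, stack, pc = {}, [], 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":             # push a constant
            stack.append(arg)
        elif op == "lvalue":         # push an address (here, just a name)
            stack.append(arg)
        elif op == "rvalue":         # push the contents of a location
            stack.append(memory[arg])
        elif op in binops:           # pop two values, push the result
            b, a = stack.pop(), stack.pop()
            stack.append(binops[op](a, b))
        elif op == ":=":             # pop an r-value and the l-value below it
            value, addr = stack.pop(), stack.pop()
            memory[addr] = value
        elif op == "goto":           # unconditional jump to a label
            pc = labels[arg]
        elif op == "gofalse":        # pop; jump if the popped value is 0
            if stack.pop() == 0:
                pc = labels[arg]
    return memory

program = [
    ("lvalue", "y"), ("push", 12), (":=", None),
    ("lvalue", "x"), ("push", 10), (":=", None),
    ("rvalue", "x"), ("push", 5), ("eq", None),
    ("gofalse", "notequal"),
    ("lvalue", "x"), ("rvalue", "x"), ("push", 2), ("+", None), (":=", None),
    ("goto", "end"),
    ("label", "notequal"),
    ("lvalue", "x"), ("rvalue", "x"), ("push", 1), ("-", None), (":=", None),
    ("label", "end"),
]
assert run(program) == {"y": 12, "x": 9}

Since x is 10 and not 5, the else branch runs and memory ends as {'y': 12, 'x': 9}, matching the source program.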
I'm sure you've noticed (if not been irritated, at this point) that we have a variable in our program that is never used. If we were to visualize the abstract stack as a table over time, notice that the instructions assigning y come first, woven into the same linear instruction stream as everything else. How effective is this representation at providing the things we care about in a compiler: dead-code elimination, common subexpressions, redundancy, or semantics-preserving simplifications?
It's just like the punched cards of the early Fortran programming era. Since the goal is a stack machine, if we were to build a compiler with this representation, removing the dead y assignment to simplify the code before the final code generation would require rebuilding the entire program!
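For contrast, here is a sketch of the same dead-store elimination on a tree-shaped IR (a toy encoding of my own): dropping y is a single filter over the top-level statements, with no re-threading of the rest of the program.

# The example program as a toy tree IR: a list of statement nodes.
program = [
    ("assign", "x", ("const", 10)),
    ("assign", "y", ("const", 12)),   # dead: y is never read
    ("if", ("eq", ("var", "x"), ("const", 5)),
        [("assign", "x", ("+", ("var", "x"), ("const", 2)))],
        [("assign", "x", ("-", ("var", "x"), ("const", 1)))]),
]

def reads(node):
    # Collect every variable read anywhere under a node.
    if isinstance(node, tuple) and node[0] == "var":
        return {node[1]}
    if isinstance(node, (tuple, list)):
        return set().union(*map(reads, node))
    return set()

live = reads(program)
optimized = [s for s in program if s[0] != "assign" or s[1] in live]
assert ("assign", "y", ("const", 12)) not in optimized

A real dead-code pass is flow-sensitive rather than this simple, but the point stands: on a tree, the edit is local; on the linear stack program, every instruction after the deleted ones shifts.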
Words of thought
Indeed, this representation is for programs as simple as arithmetic parsers, or for the same reasons you'd use reverse Polish notation (such as building state machines); compilers that use it are very easy to build and could fit on computers from the mid-1900s. However, it isn't effective for a program with general control flow, and it does little to facilitate efficiency; the lack of support for register machines or memory-to-memory operations makes the case even worse.
In my first example, Polish notation turned out to be a great representation: suddenly we could evaluate expressions as Lisp and support embedded calculators. It was easy to spot and make optimizations to the expression while keeping correctness intact.
In my next article on the topic, we will explore better representations for general-purpose languages and programs, ones that promote modern cross-platform support and modern computers, such as using C as an IR, or LLVM. We will also look at the insight gained by choosing one of these IRs over another for more effective targeted code generation and optimization.