Intermediate Representations, Part 2

As I mentioned in my Hello, 2025 post, I've picked up work on my B language compiler again, and lately I haven't been able to put it down.

My first post on IRs was quite some time ago, and since then, I've designed compilers and bundlers at my workplace. Nonetheless, there's a reason a second post has taken so long.

The difference between modern compiler theory (in particular, backend compiler design) and the Dragon Book is so vast that it's a bit unnerving. Frankly, the Dragon Book today is obsolete. It doesn't cover Static single-assignment forms, def-use chains, LLVM IR, or CFGs; nothing that is day-to-day in industry compiler design.

For at least six years, I've been working on a B compiler on and off. I finished my first draft of the grammar in 2019. Last Christmas holiday, I finished the Lexer and Parser. I've been slow and rather timid to start work on the backend.

Until now, a light bulb went off while I was studying linear IRs. And the Engineering a Compiler (2022) book helped get me off the ground.

To avoid wasting too much time on less important AST data structures, the frontend compiles the AST as JSON using Python. This also felt useful as an AST because, as long as your schema matches the frontend, you can write and use your own as well.

The backend is in C++. How do I read the AST without recreating it as C++ STL data structures that perform the same thing?

A bit of background on the B language: a program consists of definitions, statements, and expressions. Expressions are rvalue data types. What I care about translating the most are the rvalues.

I broke down rvalue expressions into the following types:

struct RValue;
namespace detail {
using RValue_Pointer = std::shared_ptr<RValue>;
} // namespace detail

struct RValue
{
    // ... 
    
    using RValue_Pointer = detail::RValue_Pointer;

    // ...
    
    using Type = std::variant<std::monostate,
                              RValue_Pointer,
                              Symbol,
                              Unary,
                              Relation,
                              Function,
                              LValue,
                              Value>;


    Type value;
};

Note that RValue_Pointer may itself be an RValue, and this is a mutually recursive data type. I built an LL(1) top-down Parser that translates ASTs of expressions into a map of symbols and their algebraic type. This was substantial because it now makes it easier to pick a linear or non-linear IR that's closer to the target platform.

For now, the first IR that will translate to platform code will be linear, a version of Three-Address code I'm calling a "Quadruple", or 4-tuple. But there's another problem: how can we, in a linear form, express complex, recursive expressions and operands?

_____
___ |[]|_n__n_I_c
|___||__|###|____}
O-O--O-O+++--O-O

Simple. I construct a queue from an operator stack and the rvalues. Then I can read from the queue in order by operator precedence and partition operands as I please. Let's look at an example:

main() {
  auto x;
  x = (5 + 6) * (5 + 6); 
}

Translates to:

x (5:int:4) (6:int:4) + (5:int:4) (6:int:4) + * =

By operator precedence. Notice something familiar? That's right, that's reverse-polish notation!

Now look how easy translating into the quadruple linear IR becomes:

__main:
 BeginFunc ;
_t1 = (5:int:4) + (6:int:4);
_t2 = (5:int:4) + (6:int:4);
_t3 = _t1 * _t2;
x = _t3;
 EndFunc ;

Nice! Moreover, it is now clearer where and when we can apply dead code elimination as well. If an expression is not assigned a temporary, then it can be removed.

Translating statements is much easier — let's take a look at the return statement grammar:

return_statement   : "return" [ ( "(" rvalue ")" ) ] SEMI_COLON

Oh, it's just an rvalue wrapped in parentheses and the "return" lexeme. Easy.

Once I feel confident about translating this linear IR to the first platform, which should be arm64, I'll revisit and examine single-form assignment (SSAs).

Catch you then! 🚂