Intermediate Representations, Part 2

As I mentioned in my Hello, 2025 post, I've picked up work on my B language compiler again, and lately I haven't been able to put it down.

My first post on IRs was quite some time ago, and since then I've designed compilers and bundlers at my workplace. Nonetheless, there's a reason a second post has taken so long.


Five to seven years ago, the dragon book was in my opinion still considered the compiler theory bible. LLVM, WebAssembly, etc and their toolchain was beginning to make noise, but not quite bleeding edge. Fast forward today, and most likely if you're working on compilers, your using LLVM or part of their stack - and that's for good reason. The difference between modern compiler theory (in particular, backend compiler design) and the dragon book is so wide it's almost overwhelming.

Frankly, the dragon book today is obsolete. It doesn't cover Static single-assignment form , def-use chains, LLVM IR, CFGs; nothing that is day-to-day in industry compiler design.

For at least 6 years, I've been on-and-off working on a B compiler. I finished my first draft of the grammar in 2019. Last Christmas holiday, I finished the Lexer and Parser. I've been slow and rather timid to start work on the backend.

Until now. A light bulb went off while I was studying linear IRs. And the Engineering a Compiler (2022) book helped get me off the ground.

To not waste too much time on less-important AST data structures, the frontend compiles the AST as JSON with python. This also felt useful as an AST because as long as your schema matches the frontend, you can write and use your own, too.

The backend is in C++, so I had a problem. How do I read the AST without recreating internal data structures that do the same thing?

The answer is quite interesting a B program consists of definitions, statements, expressions. Expressions are lvalue or rvalue data types. What I care about translating the most are the rvalues.

I broke down rvalue expressions into the following types:

struct RValue;
namespace detail {
using RValue_Pointer = std::shared_ptr<RValue>;
} // namespace detail

struct RValue
{
    // ... 
    
    using RValue_Pointer = detail::RValue_Pointer;

    // ...
    
    using Type = std::variant<std::monostate,
                              RValue_Pointer,
                              Symbol,
                              Unary,
                              Relation,
                              Function,
                              LValue,
                              Value>;


    Type value;
};

Note that RValue_Pointer may itself be an RValue, and this is a mutual-recursive data type. I built a table that translates an AST into a map of symbols and these expression types. This was a big deal because now picking a linear or non-linear IR that's closer to the target platform becomes trivial.


For now, the first IR that will translate to platform code will be linear, a version of Three-Address code I'm calling a "Quadruple", or 4-tuple. But there's another problem, how can we in a linear form express complex, recursive expressions and operands?

_____
___ |[]|_n__n_I_c
|___||__|###|____}
O-O--O-O+++--O-O


Simple. I construct a queue and operator stack from the rvalues. Then I can read from this queue in-order by operator precedence and partition operands as I please. Let's look at an example:

main() {
  auto x;
  x = (5 + 6) * (5 + 6); 
}

Translates to:

x (5:int:4) (6:int:4) + (5:int:4) (6:int:4) + * = 

By operator precedence. Notice something familiar? Thats right, that's reverse-polish notation!

Now look how easy translating into the quadruple linear IR becomes:

__main:
 BeginFunc ;
_t1 = (5:int:4) + (6:int:4);
_t2 = (5:int:4) + (6:int:4);
_t3 = _t1 * _t2;
x = _t3;
 EndFunc ;

Woah! Further, It now is clearer where and when we can apply dead code elimination, too. If an expression is not in the rvalue table, then it can be removed.

Translating statements is much easier. Hopefully you understand now why it has become so simple, too. Let's take a look at the return statement grammar:

return_statement   : "return" [ ( "(" rvalue ")" ) ] SEMI_COLON

Oh, it's just an rvalue wrapped in parenthesis and the "return" lexeme. Easy.

Once I feel good about translating this linear IR to the first platform, which should be arm64, I'll come back and take a look at single-form assignment (SSAs).

Catch you then! 🚂