One of the reasons I chose B as the language for the Credence compiler is that it is simple, and I wouldn't have to spend much time on the frontend—lexical analysis and parsing. I wanted to focus most of my time on optimization, code generation, and intermediate representation—and it's been great.
In one of my previous posts in this series, I talked a little bit about the groundwork in the IR design I've chosen. It is nearing completion, and I'd like to talk about how it enabled type checking, compile-time pointer arithmetic, and boundary checking.
I have named it ITA — Instruction Tuple Abstraction — a set of 4-tuples that constitute a collection of platform-independent, generic instructions that approximate the structure and semantics of a target machine language.
Here is the list of instructions:
    enum class Instruction
    {
        FUNC_START,
        FUNC_END,
        LABEL,
        GOTO,
        LOCL,
        GLOBL,
        IF,
        JMP_E,
        PUSH,
        POP,
        CALL,
        CMP,
        MOV,
        RETURN,
        LEAVE,
        NOOP
    };
Most of these instructions should be pretty self-explanatory;  the IR is roughly an extension of Three-Address-Code, where instead of 3-tuples I support four and provide type information, local variable state, and a function frame.
Let's look at two examples:
main(argc,argv) {
  auto k, m, z;
  extrn array;
  k = array[2];
  m = "hello world";
  z = m;
}
array [3] 10, 20, 30;Translated, looks like this:
__main(argc,argv):
 BeginFunc ;
    LOCL k;
    LOCL m;
    LOCL z;
    GLOBL array;
    k = array[2];
    m = ("hello world":string:11);
    z = m;
_L1:
    LEAVE;
 EndFunc ;Another example that uses branching and jumps:
main() {
  auto x,y,z;
  x = 1;
  y = 2;
  if (y < x) {
    z = add(x, sub(x, y)) - 2;
  }
}
add(x,y) {
  return(x + y);
}
sub(x,y) {
  return(x - y);
}
__main():
 BeginFunc ;
    LOCL x;
    LOCL y;
    LOCL z;
    x = (1:int:4);
    y = (2:int:4);
_L2:
    _t5 = y < x;
    IF _t5 GOTO _L4;
_L3:
_L1:
    LEAVE;
_L4:
    _p1 = x;
    _p3 = x;
    _p4 = y;
    PUSH _p4;
    PUSH _p3;
    CALL sub;
    POP 16;
    _t6 = RET;
    _p2 = _t6;
    PUSH _p2;
    PUSH _p1;
    CALL add;
    POP 16;
    _t7 = RET;
    _t8 = _t7;
    z = (2:int:4) - _t8;
    GOTO _L3;
 EndFunc ;
__add(x,y):
 BeginFunc ;
    _t2 = x + y;
    RET _t2;
_L1:
    LEAVE;
 EndFunc ;
__sub(x,y):
 BeginFunc ;
    _t2 = x - y;
    RET _t2;
_L1:
    LEAVE;
 EndFunc ;
Let's break the first example down — we see that BeginFunc and EndFunc mark the frame of a function, and it has two labels: a local label L1 and the symbolic main function label _main(argc,argv). Note that the function label includes parameters, so they are pushed onto the local stack frame.
The first real instruction we see is LOCL - this not only tells the stack frame about the local symbol, it initializes it to an empty null value in the table.
Next, we see that the vector array from the global scope is pushed onto the frame via GLOBL. I've made vectors non-homogeneous but strictly typed by value order - i.e. array [3] "hello" 3 10 is type (string, int, int). We set our local symbol k  to the zeroth-numbered 2nd item in array - the last item, 30. The type of k is now int and cannot be changed. Each rvalue in the stack frame is constructed in the same way as the hello world string is on the next line. The symbol m is of type string with a size of 11 bytes. Lvalues set to an rvalue may be reassigned with rvalues of the same type; uninitialized symbols may be set to any data value.
When I designed my rvalue data structures, I knew I'd need type and size data readily available if I wanted to support translation to multiple platforms. What I hadn't realized was that it trivialized type checking and even pointer arithmetic!
Before we continue, I need to discuss the symbol table and data storage:
I designed a symbol table template class with a default template like this:
template<typename T = type::Value_Type, typename Pointer = type::Value_Pointer>
class Symbol_Table {
/// ...
};T is the storage type of rvalues, the second template parameter is the designated storage type of pointers. Throughout the compiler codebase, I use different template instantiations based on the current source-code pass.
The default, Value_Type , looks like this:
using Byte = unsigned char;
using Type_Size = std::pair<std::string, std::size_t>;
using Value = std::variant<
    std::monostate,
    int,
    long,
    Byte,
    float,
    double,
    bool,
    std::string,
    char>;
using Value_Type = std::pair<Value, Type_Size>;We can see that this is a nice storage type of the translated ITA code we saw earlier, ("hello world":string:11) .
The default Pointer type:
using Value_Pointer = std::vector<Value_Type>;This nicely represents a symbolic contiguous address in memory where data may reside.
The symbol table provides not only storage but a means of type validation. Next, let's look at the Table:
The table constructs the ITA and keeps roughly the storage of the following:
class Table {
  // ...
    using RValue_Data_Type = std::tuple<RValue, Type, Size>;
    using Locals = Symbol_Table<RValue_Data_Type, LValue>;
  // ...
    struct Function {
     // ...
        Label symbol{};
        Labels labels{};
        Parameters parameters{};
        Address_Table label_address{};
        static constexpr int max_depth{ 1000 };
        std::array<Address, 2> address_location{};
        unsigned int allocation{ 0 };
        std::map<LValue, RValue> temporary{};
    };
    struct Vector; // ...
   // ...
    private:
      Globals globals{};
      Stack_Frame stack_frame{ std::nullopt };
      ITA::Node hoisted_symbols;
  
    public:
      Address_Table address_table{};
      Symbolic_Stack stack{};
      Functions functions{};
      Vectors vectors{};
      Labels labels{};
}; See more here. Note that the table is essentially our pre-selection pass of the source code before finally starting instruction and register selection. I want as much information available as possible, as the goal is to support at least three different platforms and ISAs.
The last thing I'd like to touch on is pointers and pointer arithmetic. Since pointers and vectors work roughly the same (as arrays and pointer decay do in C), they share some of the same algorithms.
Let's take a closer look at the symbol table template I used for function stack frames:
using Locals = Symbol_Table<RValue_Data_Type, LValue>;The pointer storage here is called LValue, which is really just an alias of std::string : for pointers, I store the lvalue they point to and ensure the lvalue exists at initialization.
Here is the structure of the Vector object:
    struct Vector
    {
        using Storage = std::map<std::string, RValue_Data_Type>;
        Vector() = delete;
        explicit Vector(Address size_of)
            : size(size_of)
        {
        }
        Storage data{};
        int decay_index{ 0 };
        unsigned long size{ 0 };
        static constexpr int max_size{ 1000 };
    };We see that, although arrays are usually contiguous in C/C++, for this point in code translation, I store the data as an ordered sparse map. In the future, I may need to use a better data structure, but for now, this enables simple bounds and type checking.
RValue_Data_Type is the literal tuple of the value type in ITA, e.g. (10,"int", sizeof int) . To strictly type pointers vs non-pointers and dereferenced symbols, the algorithm mimics a state machine based on the storage and local symbol type of the lvalue and rvalue. The complete algorithm is here.
The list of valid lvalue and rvalue assignments:
/*
 *     auto *k, m, *z
 *     k = &m;             // allowed
 *     k = z;              // allowed
 *     array[2] = m;       // allowed
 *     array[1] = array[2] // allowed
 *     m = array[2];       // allowed
 *     k = &array[2];      // allowed
 *
 *  All other combinations are not allowed between lvalues and vectors
 */Most of the nuances are ensuring the correct handling of unary operations, such as indirection k = *m and address-of z = &k . The test suite for the type-checking algorithm is as rigorous as possible.
Another case is ++ and -- :
main() {
  auto k, *z;
  k = "hello, there!";
  k++; // fails
  z = &k;
}Since I have the value data and storage type of locals in the stack frame, providing a framework for unary operators based on their rvalue type was not too difficult. k++ shouldn't work under any circumstances, but if the storage is an LValue  (i.e, a pointer to another address) or numeric type, it may be allowed.
This work has made clear to me that during the days when the B and C languages were designed, there were very few resources available. Behavior may have been undefined simply because they didn't have the space to fit the additional type storage. I did not intend to have type checking, but by simply being cautious, I ended up achieving it by accident—and I think using a bit of extra storage (and even performance) is totally worth it.