Two months after my last post, I now have a 2-meter-long beard and haven't seen the outdoors in many weeks - just kidding π.
A few weeks after the first post, I was satisfied with the table and type checker. So I started engineering the backend - I chose x86-64 as the first platform for code generation.
I've made a lot of headway, but one thing has stuck out to me. Recall in my last post, I spoke about a type system I got "for free" by being prudent in my data structure decisions - in reality, it would have been harder to build this compiler without a type system.
The B language specification contains a lot of undefined behavior simply because, at the time, resources were not available as they are today. Take a look at this code:
main() {
auto x;
x = test("hello world");
print(test(x));
}
test(y) {
return(y);
}
There are some curious questions without a type system: what should be stored at x ? What is the return type of the test function? What is its size? For a clue, here's the x86-64 assembly of x = "hello world"
.data
._L_str1__:
.asciz "hello world"
.text
.global _start
_start:
push rbp
mov rbp, rsp
lea rax, [rip + ._L_str1__]
mov qword ptr [rbp - 8], rax
The x local variable, which is at the stack offset rbp - 8 is loaded with the address at rip + .L_str1__ - that is, x is now a pointer - and in C terms a "pointer to const char." From the B snippet, that is very unclear, and indeed difficult to implement because x is not typed, so its storage is undefined until initialization.
A solution could be multi-pass analysis, constructing and reallocating by reading the source code over and over, but I had a better idea. Here's a reminder of what my IR looks like:
__main():
BeginFunc ;
LOCL x;
x = ("hello world":string:11);
_L1:
LEAVE;
EndFunc ;Still, even with this much information, you'd need to backtrack (or worse, reallocate the %rbp stack) without knowing what x is at construction time.
I never, under any circumstance, want to reallocate the storage on the %rbp stack; I'd consider that catastrophic. Instead, I've enforced pointer to auto declarations at compiletime for ALL pointer rvalue types, including strings, which will always be an offset of %rip in the data section. Further, function parameters must be defined with the pointer syntax if the rvalue is meant to be a pointer - the first snippet does not compile. Here's the valid B version in Credence:
main() {
auto *x;
x = test("hello world");
}
test(*y) {
return(y);
}
This is great because I don't have to analyze the code in another pass to know the storage size of test, x, and the parameter y at compiletime. And, a non-pointer rvalue is not allowed to test , so test(5) for example is a compiletime error.
Moreover, I said that x is basically a const char * - but there's not much else it could be! The other option is all pointers are void* and you sort of "cast" to more specified types for accurate pointer decay, but that sounds terribly error-prone.
My type system works by inference at initialization, so auto *k and *k = 5 is now a "pointer to int", The term "pointer to int" and "pointer to K" in general becomes a little clearer if you look at this B example:
main() {
auto *x;
extrn strings;
x = strings[1];
}
strings [3] "string one", "string two", "string 3";In x86-64 assembly, it becomes:
.data
._L_str1__:
.asciz "string 3"
._L_str2__:
.asciz "string one"
._L_str3__:
.asciz "string two"
strings:
.quad ._L_str2__
.quad ._L_str3__
.quad ._L_str1__
.text
.global _start
.extern print
_start:
push rbp
mov rbp, rsp
mov rax, qword ptr [rip + strings+8]
mov qword ptr [rbp - 8], rax
mov rax, 33554433
mov rdi, 0
syscall
Look at the strings label, do you see that? The stacked .quad directives? .quad is a directive to store a "quad word" or address storage that fits in an x64 register - a pointer.
That means strings is a pointer to strings const char ** (or char[][]). Take another look at how we assign an index to x at rpb - 8 :
mov rax, qword ptr [rip + strings+8]
mov qword ptr [rbp - 8], rax%rip is the relative address register to access the data section, strings is the label of the address at the start of our strings data. Each "element" is a quad word or 8 bytes, so to get strings[1] we need [rip + strings+8]
For another example, look at a pointer to ints or int[]:
.data
ints:
.long 1
.long 2
.long 3
This is how arrays are defined in the data section, and I've demonstrated pointer decay: The address of x[1] for strings or ints will always be at the rip address + sizeof(k) * index. However, to keep the compiler dynamic as in the original B language, I allow arrays of any non-homogeneous type - and then type them like tuples at compile-time.
Much earlier in my career, I always read that "C and assembly are intrinsic." It's more than that. C is the way it is because of the architecture, not the other way aroundβ very fascinating stuff.