April 2026
Code: boot2.
Computers execute programs. Programs are specified as machine code, binary blobs loaded into memory that the processor interprets on the fly.
Humans write programs (well, some humans do, and now many non-humans as well). But humans specify programs in programming languages, not machine code. It's the job of a compiler to translate them to machine code.
The compiler is itself a program. Where did it come from? Well, it was compiled, of course, by another compiler!
Uh oh. Infinite regress? Not quite. It bottoms out in something directly specified in machine code, a "binary seed" from which the rest sprouts.
For decades, though, we had lost the seed. We were just going around using whatever compiler was on hand, figuring that there always would be one on hand, and trusting that it would do what it said on its tin. People noticed it was a bit odd that we had lost the chain, that we couldn't compile the compiler if we lost the compiler, but not much was done about it.
In 1983, Ken Thompson told a little horror story that spooked folks, but still, it didn't really change much. Like many good horror stories, this one was about a little invisible demon. Someone had put a virus into the compiler that replicated itself into all future compilers but would scrub itself out of existence if you ever tried compiling a program that went looking for it. All of our compilers were infected and none of us knew. Shivers. (Search for "Trusting Trust" if you want to read the full story).
Nobody did anything about it until someone did. Our heroes are Jeremiah Orians and Jan Nieuwenhuizen; in 2023, they delivered "THE FULL BOOTSTRAP" (tm). A few hundred bytes of binary seed that defined a little language called "hex0", and then source files from there on until they could compile Fabrice Bellard's Tiny C Compiler (TCC). Once you have TCC, you compile an old version of GCC that was still written in C (GCC has long since moved to C++, woe be unto us all), then compile a more recent GCC, then you use that to compile everything else, including Linux. Truly amazing work.
Jeremiah's early chain is quite beautiful:
hex0-seed = given
hex0 = hex0-seed(hex0.hex0) # hex0 compiles itself from source
hex1 = hex0(hex1.hex0)
hex2 = hex1(hex2.hex1)
catm = hex2(catm.hex2)
M0 = hex2(catm(ELF.hex2, M0.hex2))
hex0: the minimal seed hex compiler; it turns pairs of hexadecimal digits
into raw bytes. Supports # and ; line comments.
hex1: a small extension of hex0 that adds the first notion of symbolic control flow. Adds single-character labels and 32-bit relative offsets to a label.
hex2: the final "hex language" and first real linker-like stage in the chain. Extends hex1 with labels of arbitrary length, 8/16/32-bit relative references (!, @, %), and 16/32-bit absolute references ($, &).
catm: a minimal file concatenation utility, takes N files and outputs 1.
M0: a basic macro preprocessor built on top of hex2, used as a mini assembler. Features DEFINE substitution of names to hex bytes, hex and ascii strings, 8/16/32-bit decimal and hex immediates (!, @, %).
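The hex0 language is small enough to capture in a few lines. Here's a minimal Python sketch of its semantics as described above (pairs of hex digits become raw bytes; # and ; comment to end of line) — an illustration, not the stage0 implementation:

```python
def hex0_compile(src: str) -> bytes:
    """Pair up hex digits into raw bytes; '#' and ';' comment to end of line."""
    out = bytearray()
    pending = None  # a first hex digit waiting for its partner
    for line in src.splitlines():
        for c in line:
            if c in "#;":        # rest of the line is a comment
                break
            if c in "0123456789abcdefABCDEF":
                if pending is None:
                    pending = int(c, 16)
                else:
                    out.append(pending * 16 + int(c, 16))
                    pending = None
    return bytes(out)

print(hex0_compile("7f 45 4c 46  # ELF magic\n02 ; 64-bit\n").hex())  # 7f454c4602
```

That's the whole language: enough to hand-write an ELF header and the machine code for hex1.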
From this base, they build a "subset C" compiler and a mini shell called "kaem" in M0 (per architecture), then use that to build up to a little userspace called "M2-Planet" and the Mes Scheme interpreter and ecosystem. They then define another C compiler in Scheme (MesCC), and use that to compile TCC. And then you're off to the races.
There's something elegant and simple and beautiful about this bottle universe they constructed: an assembler and linker, a Scheme, a C. Timeless. What more do you need really?
Up through M0, it's about 2KLOC per architecture. But from M0 to TCC you climb a ladder of three C-like compilers, each more capable than the last: a subset C in M1, M2-Planet in that subset C, and finally MesCC (written in Mes Scheme). The ladder exists because Mes Scheme is written in C and uses enough of the language that you can't host it on the first rung — you have to pump your way up the C power ladder before MesCC will compile. ~30KLOC of climb to MesCC, plus ~20KLOC for TCC itself, and the first rung — the M1-level subset C compiler — is arch-specific (the rest inherit portability from being written in C).
Two complaints, two moves.
First, portability. M1 source is arch-specific: not only do the bytes you emit
depend on whether you're targeting aarch64 or amd64 or riscv, but the source
code you write is dictated by arch-specific defines. Replace that with a
portable pseudo-ISA called P1. The interface is entirely portable, implemented
by arch-specific defines. Each target ships a standard backend —
P1A-aarch64.M1, P1A-amd64.M1, and so on — that DEFINEs each P1 mnemonic as
the right raw bytes for that arch. Catm the backend in at build time and the
same P1 source assembles anywhere. Think of it as portable assembly, well
before you get to the portable assembly that we call C, but targetable from
a simple textual expander instead of a full-blown compiler.
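Mechanically, a backend is just a table of DEFINEs. A toy Python model of the idea (my illustration, not M0's actual syntax; the two byte values are real amd64 encodings):

```python
# A hypothetical P1 backend as a substitution table: mnemonic -> hex bytes.
# Swap the table, keep the source, and you target a different arch.
DEFINES_AMD64 = {"NOP": "90", "RET": "c3"}  # real amd64 one-byte encodings

def expand(defines: dict, src: str) -> str:
    # replace each known mnemonic token with its hex bytes; pass the rest through
    return " ".join(defines.get(tok, tok) for tok in src.split())

print(expand(DEFINES_AMD64, "NOP NOP RET"))  # 90 90 c3
```

The real M0 layer is richer than a token swap, but the portability story is exactly this: the same P1 source, a different table catm'd in front.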
Second, power. The C ladder exists because there's no language at the bottom expressive enough to host Scheme directly — so you bootstrap C three times to get there. Give M1 a real macro layer — named-parameter macros, compile-time integer expressions, conditional expansion, local labels, a few structural shorthands — and the floor rises high enough to leap straight to a Lisp in one hop. Call the result M1pp (ht Bjarne); call P1 rewritten in it P1pp. From P1pp, host Scheme; from Scheme, write one C compiler. The three-rung ladder collapses to two.
Here's what I'm aiming for, in Lispy pseudocode:
;; Bootstrap from seed, same as before
(define hex0 (hex0-seed hex0.hex0))
(define hex1 (hex0 hex1.hex0))
(define hex2 (hex1 hex2.hex1))
(define catm (hex2 catm.hex2))
(define M0 (hex2 (catm ELF.hex2 M0.hex2)))
;; Compile+Link for arch-specific M1 source
(defn exe (M1-src) (hex2 (catm ELF.hex2 (M0 M1-src))))
;; P1 — portable pseudo-ISA at the M1 level.
;; P1A.M1 is the arch-specific backend
;; m1pp is itself a P1 program.
(define m1pp (exe (catm P1A.M1 m1pp.P1)))
;; P1pp — P1 rewritten with m1pp macros. Assemble any P1pp source via m1pp.
;; P1A.M1pp is the arch-specific backend, now rewritten to use M1pp
;; P1.M1pp is the arch-agnostic interface
;; P1pp.P1pp is "libp1pp", P1pp's standard library, niceties and utilities for
;; programming in P1pp
(defn ppexe (src) (exe (m1pp (catm P1A.M1pp P1.M1pp P1pp.P1pp src))))
;; To Scheme!
(define scheme (ppexe scheme1.P1pp))
;; And finally C
(defn scc (C-src) (ppexe (scheme cc.scm C-src)))
(define tcc0 (scc tcc.c)) ;; compiler: scheme cc.scm
(define tcc1 (tcc0 tcc.c)) ;; compiler: scheme-compiled tcc
(define tcc (tcc1 tcc.c)) ;; compiler: tcc-compiled tcc
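The three tcc generations are the usual self-hosting dance: keep recompiling the compiler with itself until the output stabilizes; once the self-compiled binary reproduces itself byte for byte, the chain is at a fixpoint. A toy Python sketch of that loop (compile_with is a hypothetical stand-in for "run this compiler binary on this source"):

```python
def fixpoint_compile(compile_with, src):
    """Recompile a compiler with itself until the binary stabilizes.

    compile_with(binary, src) returns a new compiler binary; binary=None
    means "use the host/bootstrap compiler" (hypothetical interface).
    """
    cur = compile_with(None, src)       # stage 0: built by the bootstrap chain
    while True:
        nxt = compile_with(cur, src)    # rebuild the compiler with itself
        if nxt == cur:                  # byte-identical: self-hosting fixpoint
            return cur
        cur = nxt

# Toy compiler: the bootstrap build differs from the self-hosted build,
# which then reproduces itself.
toy = lambda prev, src: "gen1" if prev is None else "final"
print(fixpoint_compile(toy, "tcc.c"))  # final
```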
So the new bits are:
- P1A.M1
- m1pp.P1
- P1A.M1pp, P1.M1pp, P1pp.P1pp
- scheme1.P1pp
- cc.scm

Today, I present P1, M1pp, and P1pp — the portability move and the power move. Next will come a Scheme in P1pp, and finally enough of a C compiler in Scheme to compile the tcc source.
P1 is a portable pseudo-ISA. Source looks a lot like assembly; each mnemonic is a macro, and each arch ships a backend that realizes those mnemonics as raw bytes for that target. Flip the backend file you catm in, and the same P1 source assembles for a different arch.
Twelve visible registers — a deliberately modest budget that keeps backends small and source uniform across architectures:
- a0–a3 — argument and caller-saved general registers.
- t0–t2 — caller-saved temporaries.
- s0–s3 — callee-saved general registers.
- sp — stack pointer.

Anything beyond this that a backend needs for encoding (scratch regs, link register, zero register) is backend-private and exposed only through the portable ops.
A word is register-sized — 32 bits on 32-bit targets, 64 bits on 64-bit
targets. LD/ST move words; LB/SB move bytes. No 16/32-bit slots in
between; code that needs them can OR/shift.
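For example, a little-endian 16-bit load synthesized from byte loads; in Python terms (in P1 this would be two LBs, a SHLI, and an OR):

```python
def load_u16_le(mem: bytes, addr: int) -> int:
    # two byte loads, shift the high byte into place, OR them together
    lo = mem[addr]          # LB
    hi = mem[addr + 1]      # LB
    return (hi << 8) | lo   # SHLI + OR

print(hex(load_u16_le(b"\x34\x12", 0)))  # 0x1234
```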
Ops:

- LI (load immediate), LA (load label address), LA_BR (load label into a hidden branch-target register).
- MOV.
- ADD, SUB, AND, OR, XOR, SHL, SHR, SAR, MUL, DIV, REM.
- ADDI, ANDI, ORI, SHLI, SHRI, SARI. Immediate widths are backend-dependent; the backend errors if the constant doesn't fit.
- LD, ST, LB, SB.
- B, BR, BEQ, BNE, BLT, BLTU, BEQZ, BNEZ, BLTZ. Direct branches don't take a label operand; LA_BR first loads the target into the hidden branch-target register, which the immediately following branch consumes. The register-indirect form BR takes its target in an ordinary register. Signed and unsigned less-than; >=, >, <= are synthesized by swapping operands or inverting branch sense.
- CALL, CALLR, RET, ERET, TAIL, TAILR. Direct forms (CALL, TAIL) consume LA_BR like the branches; indirect forms (CALLR, TAILR) take an ordinary register.
- ENTER.
- LDARG — reads stack-passed incoming args without hard-coding the frame layout.
- SYSCALL.

Calling convention:

- Arguments pass in a0-a3; extras in an incoming stack-argument area read with LDARG.
- One-word returns come back in a0; two-word in a0/a1; wider returns use the hidden-pointer trick — caller passes a writable buffer in a0, real args slide over by one, callee returns the buffer in a0.
- ENTER builds the standard frame; ERET tears it down and returns (TAIL/TAILR likewise combine teardown with a jump). Stack-passed outgoing args are staged in a dedicated frame-local area before CALL, so the callee finds them at a known offset from sp.

Program entry: portable source writes :p1_main as an ordinary P1 function with argc in a0 and argv in a1, and returns an exit status in a0. The backend emits a small per-arch :_start stub that captures the native entry state, calls p1_main, and hands its return value to sys_exit. The portable side never sees the raw entry stack.
This shape is aimed at portability and simplicity of targeting from pure macro expansion, sacrificing some performance, which seems a reasonable tradeoff for the bootstrap.
P1 at the M1 level — P1A.M1 per arch, hand-coded DEFINEs for every fused
mnemonic — gets us portability, but M1 itself is austere: DEFINE substitution
and raw bytes, fine for hand-encoding a handful of instructions, painful for
anything of size. So layer a small macro expander on top: m1pp. The M1pp-level
flavor of P1 — call it P1pp — is the same portable interface (P1.M1pp) and
per-arch backend (P1A.M1pp) realized through macros rich enough to make
low-level programming bearable enough to linger down here near the bottom of
the sea.
m1pp adds three buckets of capability M0 doesn't have.
Mechanics:

- %macro NAME(a, b) ... %endm, called with %NAME(x, y). Macro bodies can themselves contain macro calls, so expansion is recursive.
- a ## b pastes two tokens into one identifier.

Compile-time computation:

- !(expr) for 8 bits, @(expr) for 16, %(expr) for 32, $(expr) for 64. The punctuation mirrors M0's decimal/hex immediate widths (!@%) and adds $ for the 64-bit case.
- %select(cond, then, else) evaluates cond as an integer and emits exactly one of the two branches. Non-zero is true.
- %str(foo) turns a word token into the string literal "foo".

Structural shorthands:

- %struct NAME { f1 f2 } synthesizes %NAME.f1 → 0, %NAME.f2 → 8, %NAME.SIZE → 16. %enum does the same with stride 1 and a COUNT terminator.
- :@loop and &@loop pick up a fresh per-expansion suffix, so a macro can define its own labels without colliding with itself at a second call site.
- %scope NAME ... %endscope pushes a lexical scope. Inside it, a label written ::foo emits as :NAME__foo (stack joined by __ when nested), so several callers can reuse short names without collision. Resolution follows the caller's scope, not the macro's, so a generic %break or %continue inside a scoped loop lands on the right label.

The expression syntax is, of course, Lisp-shaped:
atoms: decimal or 0x-prefixed integer literals
calls: (+ a b) (- a b) (* a b) (/ a b) (% a b) (<< a b) (>> a b)
(= a b) (!= a b) (< a b) (<= a b) (> a b) (>= a b)
(& a b) (| a b) (^ a b) (~ a)
(strlen "astr")
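To make the shape concrete, here's a small Python sketch of an evaluator for this expression language. It's my own illustration, not m1pp's code: the tokenizer assumes no spaces inside string literals, and division and shift semantics follow Python's rather than whatever m1pp actually fixes.

```python
from functools import reduce
import operator

OPS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": operator.floordiv, "%": operator.mod,
    "<<": operator.lshift, ">>": operator.rshift,
    "&": operator.and_, "|": operator.or_, "^": operator.xor,
    "=": lambda a, b: int(a == b), "!=": lambda a, b: int(a != b),
    "<": lambda a, b: int(a < b), "<=": lambda a, b: int(a <= b),
    ">": lambda a, b: int(a > b), ">=": lambda a, b: int(a >= b),
}

def parse(tokens):
    t = tokens.pop(0)
    if t != "(":
        return t                      # atom
    form = []
    while tokens[0] != ")":
        form.append(parse(tokens))    # recurse into nested calls
    tokens.pop(0)
    return form

def evaluate(form):
    if isinstance(form, str):
        return int(form, 0)           # decimal or 0x-prefixed integer literal
    op, *args = form
    if op == "~":
        return ~evaluate(args[0])
    if op == "strlen":
        return len(args[0].strip('"'))
    return reduce(OPS[op], (evaluate(a) for a in args))  # variadic, like (| a b c d)

def ev(src):
    return evaluate(parse(src.replace("(", " ( ").replace(")", " ) ").split()))

print(hex(ev("(| 0x91000000 (<< (& 8 0xFFF) 10) (<< 0 5) 0)")))  # 0x91002000
```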
Here's a real macro from the aarch64 P1pp backend — the "add register, immediate" instruction:
%macro aa64_add_imm(rd, ra, imm12)
%((| 0x91000000 (<< (& imm12 0xFFF) 10) (<< %aa64_reg(ra) 5) %aa64_reg(rd)))
%endm
A call like %aa64_add_imm(a0, a0, 8) expands to a single Lisp expression
that ORs together the opcode, the 12-bit immediate shifted into place, the
source register field, and the destination register field. The outer %(...)
evaluates that expression and emits it as four little-endian hex bytes, which
M0 and hex2 then hand to the linker. The nested %aa64_reg(...) calls in the
body show that macros can call other macros during expansion — aa64_reg is a
tiny macro per register name that produces the aarch64-native register number
(e.g. a0 → 0, s0 → 19).
That one pattern — Lisp expression in, hex bytes out — covers all per-arch instruction encoding.
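As a cross-check, here is the same computation in Python (only a0 → 0 and s0 → 19 are stated above; the rest of the register table is omitted rather than guessed):

```python
import struct

AA64_REG = {"a0": 0, "s0": 19}  # numbers the post states; other regs omitted

def aa64_add_imm(rd: str, ra: str, imm12: int) -> bytes:
    # OR together opcode, immediate field, source field, destination field,
    # exactly as the macro's Lisp expression does
    word = (0x91000000
            | ((imm12 & 0xFFF) << 10)
            | (AA64_REG[ra] << 5)
            | AA64_REG[rd])
    return struct.pack("<I", word)   # four little-endian bytes, as hex2 expects

print(aa64_add_imm("a0", "a0", 8).hex())  # 00200091
```

So %aa64_add_imm(a0, a0, 8) bottoms out in the word 0x91002000, emitted as the bytes 00 20 00 91.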
The scope mechanism is what lets libp1pp's standard library define a
generic function macro %fn:
%fn(parse_number, 16, {
...
LA_BR &::done
BEQZ t0
...
::done
})
The 16 is the frame-local byte count passed through to %enter. The
::done labels mangle to parse_number__done, so every function gets
its own short namespace with no hand-prefixing, and scoped loops inside
the body (%loop_scoped, %while_scoped_nez, etc.) stack further — a
%break anywhere inside the innermost loop finds its _end.
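The mangling itself is simple enough to sketch in a few lines of Python (an illustration of the rule as described, not m1pp's code):

```python
class Scopes:
    """Model of %scope label mangling: ::foo emits as :OUTER__INNER__foo."""
    def __init__(self):
        self.stack = []                 # %scope pushes, %endscope pops
    def push(self, name):
        self.stack.append(name)
    def pop(self):
        self.stack.pop()
    def mangle(self, label):
        # '::foo' -> ':' + scope names + 'foo', all joined by '__'
        assert label.startswith("::")
        return ":" + "__".join(self.stack + [label[2:]])

s = Scopes()
s.push("parse_number")                  # entering %fn(parse_number, ...)
print(s.mangle("::done"))               # :parse_number__done
s.push("loop")                          # a scoped loop nested inside the body
print(s.mangle("::end"))                # :parse_number__loop__end
```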
Hello world. p1_main is a leaf function — it issues no CALL — so it needs
no frame. It writes 14 bytes to stdout via sys_write and returns 0:
:p1_main
%li(a0, %sys_write)
%li(a1, 1)
%la(a2, &msg)
%li(a3, 14)
%syscall
%li(a0, 0)
%ret
:msg
"Hello, World!
"
:ELF_end
A few things to notice. %li(rd, imm) and %la(rd, target) each take their
payload as a macro argument; the backend emits a load-from-literal-pool
prefix and the literal-pool word itself, so the caller never has to count
bytes by hand. %sys_write is a zero-arg macro that returns the Linux
syscall number for write as an integer atom (the aarch64 backend's choice),
so it composes through %li's expression evaluator. &msg is a 4-byte
label reference (addresses fit in 32 bits in the stage0 image layout); %la
knows the right width because it lives in the backend.
Echo argv, one element per line, skipping the program name. This one shows
what real-feeling P1pp code reads like — a %fn definition with a
structured %for_lt loop, calling into libp1pp's print_cstr and print
helpers. s0-s2 are callee-saved, explicitly spilled to the frame
because they have to survive each CALL:
%fn(p1_main, 24, {
%st(s0, sp, 0)
%st(s1, sp, 8)
%st(s2, sp, 16)
%mov(s1, a1) # s1 = argv
%mov(s2, a0) # s2 = argc
%li(t0, 2)
%if_lt(s2, t0, { %b(&::done) }) # nothing to print
%for_lt(s0, s2, {
%if_eqz(s0, { %b(&::skip) }) # skip argv[0]
%shli(t0, s0, 3) # t0 = i * 8
%add(t0, s1, t0) # t0 = &argv[i]
%ld(a0, t0, 0) # a0 = argv[i]
%call(&print_cstr)
%la(a0, &nl)
%li(a1, 1)
%call(&print)
::skip
})
::done
%ld(s0, sp, 0)
%ld(s1, sp, 8)
%ld(s2, sp, 16)
%li(a0, 0)
})
:nl
"
"
:ELF_end
%fn(p1_main, 24, { ... }) puts the body in its own scope, opens a 24-byte
frame with %enter(24), and closes with %eret. %for_lt(s0, s2, { ... })
counts s0 from 0 up to s2, with the body free to read s0 as the index.
The two ::name labels — ::skip and ::done — pick up the function's
scope at emit time and mangle to :p1_main__skip and :p1_main__done, so
another %fn is free to write its own ::done without collision. Inside
the body, the calls to print_cstr and print are libp1pp doing the lift
— neither exists at the P1 level, both are just P1pp functions themselves.
P1 and M1pp echo C and its preprocessor, but a level lower.
Like C, P1 standardizes a target machine: write to a portable ISA and a backend handles the real-machine encoding. The calling convention is part of the contract — argument registers, return registers, who saves what, frame layout. But P1's machine stays firmly register-based and finite. Twelve visible registers, no infinite locals, no automatic spill. Run out and you save to the stack by hand. Every byte moved is a byte you wrote.
M1pp sits where the C preprocessor sits — a textual macro layer below the
language proper — but with a bit more structure than the C one gives you. cpp
gives you token substitution and arithmetic-by-accident; M1pp gives you
hygienic labels (:@loop is per-expansion), real compile-time integer
expressions, block arguments, and lexical scopes for naming. It borrows shapes
from structured C — %struct, %enum — but all they're providing is fixed
names and sizes. No type checker, no register allocator, no codegen pass. Every
expansion is predictable by eye. It's still assembly underneath, with just
enough decoration that you can stand to read it.
Concrete sizes for what's landed today:
- m1pp.P1: ~4KLOC — the P1 source of the macro expander itself.
- P1.M1pp: ~150 LOC — the portable P1 interface.
- P1A-{aarch64,riscv64,amd64}.M1pp: ~{480,430,650} LOC — arch backends.
- P1pp.P1pp: ~1KLOC — libp1pp, the standard library of control-flow macros and utilities.

(These are non-comment, non-blank LOC.)
Both programs above assemble unchanged on aarch64, riscv64, and amd64 — that's the portability move at work. Adding a fourth arch is a few hundred lines of macro definitions, not another subset C compiler.
We can't really see the full payoff until we have scheme1.P1pp and cc.scm of course, but I think it's a meaningful improvement over per-arch subset C compilers, and it's a neat little macro language to boot.
I'll be working on a tiny Scheme implementation in P1pp, and will likely edit P1 and M1pp as I do. One thought is to have M1pp replace M0 entirely and directly target hex2; it grows the code a bit, but it may allow for a more uniform syntax and interop between DEFINEs and macros.
The dream is that we go from hex0 to P1pp to Scheme to C via nice direct hops and no external dependencies until TCC. Maybe we can even get to a tiny single-binary cross-compiling TCC-like C compiler. A micro Clang/LLVM, maybe 80% of the bang for 1% of the buck. Projects like TCC, QBE, and Tilde all make me hopeful that this is possible, and as ever, people like Jeremiah, Jan, and Fabrice are inspirations. Here's to the crazy ones.