May 2026
Code: boot2 (scheme1.P1pp, prelude.scm).
We continue our jaunt up the bootstrap chain from hex0-seed to TCC.
Last time brought a portable pseudo-ISA
(P1) and a real macro layer on top of M0 (M1++); the combination — P1++ — I
claimed was ergonomic enough to host a Scheme directly, skipping the usual
ladder of subset-C compilers.
Today I prove the claim by delivering the next step, a Scheme interpreter in
~5KLOC of P1++. It's a subset of the R7RS standard with booleans, word-sized
integers, strings, symbols, pairs, closures, and records. It supports
define, set!, lambda, let bindings (including multi-value bindings),
if, cond, when, case, and, or, begin, and do. It has a basic
system layer for file IO (open, close, read, write) and processes (clone,
exec, wait), so it can be used as a simple shell.
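A hedged sketch of what shell-style use could look like, assuming the process primitives take the usual Unix fork/exec shape. The post names clone, exec, and wait but not their signatures, so the argument conventions here are illustrative, not scheme1's documented API:

```scheme
; Illustrative only: run a program and wait for it. The exact
; signatures of clone/exec/wait are assumptions.
(define (run path args)
  (let ((pid (clone)))
    (if (= pid 0)
        (exec path args)   ; child: replace this process image
        (wait pid))))      ; parent: block until the child exits

(run "/bin/echo" '("echo" "hi"))
```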
One revision to the toolchain from the last post before we get into the
Scheme. The original chain ran M1++ output straight into stage0's hex2,
with M1++ owning lexical scope for labels via %scope / %endscope and
mangling ::label to :scope__label. That arrangement is gone. There's
a new layer between M1++ and the final bytes — hex2++, a small
byte-oriented assembler/linker written in P1 — and lexical scoping has
moved into it. M1++ now keeps only per-expansion mangling (:@name →
name__N) for macro hygiene; anything that wants a real scope —
function-local labels, the %break / %continue of a scoped loop — goes
through hex2++'s .scope / .endscope directives and dotted :.name /
&.name references, resolved innermost-out by a two-pass scoped symbol
table. So where the last post wrote ::done, read :.done; where it
described %fn opening an M1++ scope, read hex2++ scope. M1++ got
smaller, hex2++ took on the symbol-resolution work that doesn't really
belong in a macro expander, and the boundary between "macro hygiene" and
"lexical scope" landed where you'd expect it. hex2++ also picked up a
handful of small niceties on top of hex2: a .ptrsize directive for
setting & / % reference width to 4 or 8, .align N [PATTERN] and
.fill N B for padding, and > as a synonym for - in the
LABEL-OTHER arithmetic form. So now the architecture-specific stage0 binaries
are entirely replaced by the portable and more powerful M1++ and hex2++.
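To make the revised division of labor concrete, here is a hedged sketch of hex2++'s surface. The directive and reference spellings are the ones described above; the surrounding instructions are illustrative:

```
.ptrsize 8            # & / % references now emit 8-byte pointers
.align 16             # pad to a 16-byte boundary
.scope                # open a lexical scope (e.g. at %fn entry)
:.loop                # function-local label, invisible outside the scope
    # ... loop body ...
    %bieq(t0, 0, &.loop, t1)   # &.loop resolves innermost-out
.endscope             # past this point :.loop no longer resolves
```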
The beauty of a Lisp is that you get an ergonomic base from which to work
without much implementation complexity. The P1++ Scheme is a standard
recursive eval-apply loop, and thanks to M1++'s ergonomics it reads well for
an assembly-level implementation. Here's
apply — given an evaluated head and an evaluated args list, dispatch on the
head's runtime type and either invoke a primitive or evaluate a closure body
in a freshly bound environment.
A short reference for some of the P1++ macros used in the snippet:
| Macro | Meaning |
|---|---|
| %fn2(name, {slots}, { body }) | function declaration with named word-sized frame slots |
| %stl(reg, slot) / %ldl(reg, slot) | spill / reload a frame local by slot name |
| :.label / &.label | function-scope label, lexically scoped via hex2++ |
| %bieq(reg, imm, target, scratch) | branch if reg equals immediate (%li + %beq) |
| %tail / %tailr | direct / register-indirect tail call |
| %heap_ld(rd, ptr, off) | load a word from a heap-tagged object |
```
%fn2(apply, {args body}, {        # frame: two word slots, args and body
    %stl(a1, args)                # store-local: spill a1 -> frame[args]

    # Dispatch on the heap object's header type byte.
    %hdr_type(t0, a0)             # read the object's header type byte
    %bieq(t0, %HDR.PRIM, &.prim, t1)
    %bieq(t0, %HDR.CLOSURE, &.closure, t1)

:.prim
    # Read the native entrypoint out of the PRIM object and tail-jump
    # to it with the args list in a0 and the PRIM itself in a1.
    %mov(a1, a0)
    %heap_ld(t0, a0, %PRIM.entry_w)   # load word at object offset
    %ldl(a0, args)                    # load-local: frame[args] -> a0
    %tailr(t0)                        # tail-call register-indirect

:.closure
    # Closure layout: [hdr][params][body][env]. Pull the three slots,
    # extend env with the bound params, then tail-call eval_body.
    %heap_ld(t0, a0, %CLOSURE.params)
    %heap_ld(t1, a0, %CLOSURE.body)
    %heap_ld(t2, a0, %CLOSURE.env)
    %stl(t1, body)                # persist body across bind_params
    %mov(a0, t0)
    %ldl(a1, args)
    %mov(a2, t2)
    %call(&bind_params)
    %mov(a1, a0)
    %ldl(a0, body)
    %tail(&eval_body)             # direct tail call
})
```
The two arms are the two kinds of callable. The prim arm pulls a native
function pointer out of the heap object and tail-jumps into it. The closure
arm pulls params/body/env out of the heap object, builds a new environment
via bind_params, and tail-calls eval_body to evaluate the body in that
env. Tags, headers, heap loads, tail calls — but the shape is recognizable as
the textbook eval-apply, just at the assembly level.
Two omissions are worth calling out: no macros and no garbage collector. Each got a replacement that's smaller and more explicit than the thing it's standing in for.
In place of macros, scheme1 ships a builtin pmatch for destructuring lists
and records: literal forms match shape, ,x binds a hole, (guard p) filters.
That covers most of what a compiler's form-walker actually wants from macros,
and skipping macro expansion machinery — hygiene, environments, phasing — saved
complexity at this rung.
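A small illustration of that surface, sketched in scheme1's dialect (the function name is illustrative): literal symbols in a pattern match themselves, ,x binds a hole, and a guard filters a match that is otherwise too loose.

```scheme
; Accept a (name value) binding pair only when name is a symbol.
(define (classify-binding b)
  (pmatch b
    ((,name ,value) (guard (symbol? name)) (cons name value))
    (,x 'malformed)))
```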
In place of GC, scheme1 provides simple arenas with two complementary patterns,
each wrapped behind a thunk-taking combinator. When the shape of the output is
known up front, the caller pre-allocates it on the main heap and uses
call-with-heap-rewind: take a mark, run the thunk (which mutates into the
pre-allocated slots, allocating freely as it goes), and rewind to the mark on
return — all the intermediate junk evaporates, the output stays. When the
output's size is dynamic or just annoying to build by mutation, the caller
uses call-with-scratch-deep-copy: switch to the scratch heap, run the thunk
freely allocating as it goes, deep-copy the final value into the main heap,
and reset scratch in one shot.
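The first pattern might look like this sketch. Only call-with-heap-rewind itself is from the post; the function and its pre-allocated output cell are illustrative. The caller allocates the pair out on the main heap up front; the thunk conses a temporary list, folds it into the slot, and the rewind reclaims every intermediate pair on return.

```scheme
(define (sum-squares! out n)
  (call-with-heap-rewind
    (lambda ()
      ; temporaries: a list of squares, consed inside the marked region
      (let build ((i 1) (sqs '()))
        (if (> i n)
            (let sum ((l sqs) (acc 0))
              (if (null? l)
                  (set-car! out acc)   ; result lands in the pre-allocated pair
                  (sum (cdr l) (+ acc (car l)))))
            (build (+ i 1) (cons (* i i) sqs)))))))
```

After (sum-squares! out 10), (car out) holds 385, while every cons built inside the thunk has evaporated with the rewind.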
Mark/rewind for known shapes, scratch+promote for unknown ones. The lifetime is named at the call site instead of inferred by a tracer at runtime — no runtime, no pauses, predictable. Here's the second pattern in use:
```scheme
(define (tokenize-line line)
  (call-with-scratch-deep-copy
    (lambda ()
      (let loop ((i 0) (acc '()))
        (cond ((>= i (string-length line)) (reverse acc))
              ((char-whitespace? (string-ref line i))
               (loop (+ i 1) acc))
              (else
               (let-values (((tok j) (read-token line i)))
                 (loop j (cons tok acc)))))))))
```
read-token and the cons cells of acc allocate freely on scratch; only the
final reversed list is deep-copied to the main heap, and scratch resets in one
go. Between the two combinators, you cover most of what a GC was doing for
you. It's not perfect, but I'm guessing it'll be enough to compile the TCC
source without blowing up memory usage.
And here's the kind of code Scheme1 actually runs. This is a form classifier —
exactly the shape every compiler's form-walker takes, which is precisely why
pmatch earns its place as a builtin even without a macro system to express
it:
```scheme
(define (kind e)
  (pmatch e
    ((quote ,_)       'lit)
    ((if ,_ ,_ ,_)    'if)
    ((lambda ,_ . ,_) 'lambda)
    ((set! ,_ ,_)     'set!)
    (,x (guard (symbol? x))  'var)
    (,x (guard (integer? x)) 'int)
    ((,_ . ,_)        'call)
    (else 'unknown)))
```
Destructure-and-dispatch like this is exactly the spine of the C compiler that's coming next.
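A few illustrative calls, under the usual pmatch reading in which bare symbols in a pattern match themselves:

```scheme
(kind '(quote x))    ; => lit
(kind '(if a b c))   ; => if
(kind '(f x y))      ; => call
(kind 'x)            ; => var
(kind 42)            ; => int
```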
Concrete sizes for what's landed (non-comment, non-blank LOC):

- scheme1.P1pp: ~4.3KLOC. The eval/apply core itself comes in under 200 lines — the bulk of the interpreter is the special forms, primitives, and the reader/writer pair, which is where any honest Lisp ends up spending its LOC.
- prelude.scm: ~500 LOC of Scheme prepended to user code (extra control-flow forms, list helpers, the pmatch surface).
It assembles unchanged on aarch64, amd64, and riscv64 — the P1 move from last
time still doing its job. Statically-linked binaries land at ~51 KiB
(aarch64), ~41 KiB (amd64), ~57 KiB (riscv64). The test suite is 113 .scm
fixtures, all running end-to-end on all three arches via the hex0-seed → ...
→ scheme1 chain.
It's slow — recursive tree-walking eval-apply, no bytecode, no hashmaps, no JIT — but I'm betting it's fast enough to compile TCC in tolerable time, which is all the bootstrap needs.
Stepping back: part of my goal in this series is to make the bootstrap worth
reading on its own. Each rung should be edifying. The hex series and M0 form
the most minimal assembler you'll find — you learn a real ISA (whichever
arch you choose to read), instruction encoding, ELF, and how relative and
absolute references resolve to bytes. P1 introduces standardized calling
conventions and function abstraction at the assembly level. M1++ adds the
power of macros — named parameters, recursive expansion, compile-time
integer expressions, per-expansion label hygiene. hex2++ is the smallest
interesting linker layer you'll see: a two-pass scoped symbol table,
dotted local labels resolved innermost-out, label arithmetic at emit
time, and the small handful of directives (.scope, .align, .fill,
.ptrsize) that let an upstream macro layer stay target-neutral.
Scheme1, being a Lisp, is a great introduction to the lambda calculus,
functional programming, and dynamic languages. And finally, C is the
lingua franca of the software world and builds naturally on the preceding
stack, a sort of ergonomic portable assembly with the lexical scoping of
a Lisp.
So, now that we have a Scheme, the last remaining piece is a C compiler written in Scheme that can compile TCC. I'm hoping that too comes in under 5KLOC.
Cheers to Church, McCarthy, Steele, and Sussman for the Lambda Calculus, Lisp, and Scheme. The ideas are timeless.