May 2026
Code: boot2 (scheme1.P1pp, prelude.scm).
We continue our jaunt up the bootstrap chain from hex0-seed to TCC.
Last time brought a portable pseudo-ISA
(P1) and a real macro layer on top of M0 (M1++); the combination — P1++ — I
claimed was ergonomic enough to host a Scheme directly, skipping the usual
ladder of subset-C compilers.
Today I prove the claim by delivering the next step, a Scheme interpreter in
~5KLOC of P1++. It's a subset of the R7RS standard with booleans, word-sized
integers, strings, symbols, pairs, closures, and records. It supports
define, set!, lambda, let bindings (including multi-value bindings),
if, cond, when, case, and, or, begin, and do. It has a basic
system layer for file IO (open, close, read, write) and processes (clone,
exec, wait), so it can be used as a simple shell.
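A hedged sketch of what shell-style use could look like, assuming the process primitives take the usual Unix fork/exec shape. The post names clone, exec, and wait but not their signatures, so the argument conventions here are illustrative, not scheme1's documented API:

```scheme
; Illustrative only: run a program and wait for it. The exact
; signatures of clone/exec/wait are assumptions.
(define (run path args)
  (let ((pid (clone)))
    (if (= pid 0)
        (exec path args)   ; child: replace this process image
        (wait pid))))      ; parent: block until the child exits

(run "/bin/echo" '("echo" "hi"))
```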
One revision to the toolchain from the last post before we get into the
Scheme. The original chain ran M1++ output straight into stage0's hex2,
with M1++ owning lexical scope for labels via %scope / %endscope and
mangling ::label to :scope__label. That arrangement is gone. There's
a new layer between M1++ and the final bytes — hex2++, a small
byte-oriented assembler/linker written in P1 — and lexical scoping has
moved into it. M1++ now keeps only per-expansion mangling (:@name →
name__N) for macro hygiene; anything that wants a real scope —
function-local labels, the %break / %continue of a scoped loop — goes
through hex2++'s .scope / .endscope directives and dotted :.name /
&.name references, resolved innermost-out by a two-pass scoped symbol
table. So where the last post wrote ::done, read :.done; where it
described %fn opening an M1++ scope, read hex2++ scope. M1++ got
smaller, hex2++ took on the symbol-resolution work that doesn't really
belong in a macro expander, and the boundary between "macro hygiene" and
"lexical scope" landed where you'd expect it. hex2++ also picked up a
handful of small niceties on top of hex2: a .ptrsize directive for
setting & / % reference width to 4 or 8, .align N [PATTERN] and
.fill N B for padding, and > as a synonym for - in the
LABEL-OTHER arithmetic form. So now the architecture-specific stage0 binaries
are entirely replaced by the portable and more powerful M1++ and hex2++.
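To make the revised division of labor concrete, here is a hedged sketch of hex2++'s surface. The directive and reference spellings are the ones described above; the surrounding instructions are illustrative:

```
.ptrsize 8            # & / % references now emit 8-byte pointers
.align 16             # pad to a 16-byte boundary
.scope                # open a lexical scope (e.g. at %fn entry)
:.loop                # function-local label, invisible outside the scope
    # ... loop body ...
    %bieq(t0, 0, &.loop, t1)   # &.loop resolves innermost-out
.endscope             # past this point :.loop no longer resolves
```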
The beauty of a Lisp is that you get an ergonomic base from which to work
without much implementation complexity. The P1++ Scheme is a standard
recursive eval-apply loop, and thanks to M1++'s ergonomics it reads well for
an assembly-level implementation. Here's
apply — given an evaluated head and an evaluated args list, dispatch on the
head's runtime type and either invoke a primitive or evaluate a closure body
in a freshly bound environment.
A short reference for some of the P1++ macros used in the snippet:
| Macro | Meaning |
|---|---|
| %fn2(name, {slots}, { body }) | function declaration with named word-sized frame slots |
| %stl(reg, slot) / %ldl(reg, slot) | spill / reload a frame local by slot name |
| :.label / &.label | function-scope label, lexically scoped via hex2++ |
| %bieq(reg, imm, target, scratch) | branch if reg equals immediate (%li + %beq) |
| %tail / %tailr | direct / register-indirect tail call |
| %heap_ld(rd, ptr, off) | load a word from a heap-tagged object |
```
%fn2(apply, {args body}, {        # frame: two word slots, args and body
    %stl(a1, args)                # store-local: spill a1 -> frame[args]

    # Dispatch on the heap object's header type byte.
    %hdr_type(t0, a0)             # read the object's header type byte
    %bieq(t0, %HDR.PRIM, &.prim, t1)
    %bieq(t0, %HDR.CLOSURE, &.closure, t1)

:.prim
    # Read the native entrypoint out of the PRIM object and tail-jump
    # to it with the args list in a0 and the PRIM itself in a1.
    %mov(a1, a0)
    %heap_ld(t0, a0, %PRIM.entry_w)   # load word at object offset
    %ldl(a0, args)                    # load-local: frame[args] -> a0
    %tailr(t0)                        # tail-call register-indirect

:.closure
    # Closure layout: [hdr][params][body][env]. Pull the three slots,
    # extend env with the bound params, then tail-call eval_body.
    %heap_ld(t0, a0, %CLOSURE.params)
    %heap_ld(t1, a0, %CLOSURE.body)
    %heap_ld(t2, a0, %CLOSURE.env)
    %stl(t1, body)                # persist body across bind_params
    %mov(a0, t0)
    %ldl(a1, args)
    %mov(a2, t2)
    %call(&bind_params)
    %mov(a1, a0)
    %ldl(a0, body)
    %tail(&eval_body)             # direct tail call
})
```
The two arms are the two kinds of callable. The prim arm pulls a native
function pointer out of the heap object and tail-jumps into it. The closure
arm pulls params/body/env out of the heap object, builds a new environment
via bind_params, and tail-calls eval_body to evaluate the body in that
env. Tags, headers, heap loads, tail calls — but the shape is recognizable as
the textbook eval-apply, just at the assembly level.
Two omissions are worth calling out: no macros and no garbage collector. Each got a replacement that's smaller and more explicit than the thing it's standing in for.
In place of macros, scheme1 ships a builtin pmatch for destructuring lists
and records: literal forms match shape, ,x binds a hole, (guard p) filters.
That covers most of what a compiler's form-walker actually wants from macros,
and skipping macro expansion machinery — hygiene, environments, phasing — saved
complexity at this rung.
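A small illustration of that surface, sketched in scheme1's dialect (the function name is illustrative): literal symbols in a pattern match themselves, ,x binds a hole, and a guard filters a match that is otherwise too loose.

```scheme
; Accept a (name value) binding pair only when name is a symbol.
(define (classify-binding b)
  (pmatch b
    ((,name ,value) (guard (symbol? name)) (cons name value))
    (,x 'malformed)))
```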
In place of GC, scheme1 provides simple arenas with two complementary patterns,
each wrapped behind a thunk-taking combinator. When the shape of the output is
known up front, the caller pre-allocates it on the main heap and uses
call-with-heap-rewind: take a mark, run the thunk (which mutates into the
pre-allocated slots, allocating freely as it goes), and rewind to the mark on
return — all the intermediate junk evaporates, the output stays. When the
output's size is dynamic or just annoying to build by mutation, the caller
uses call-with-scratch-deep-copy: switch to the scratch heap, run the thunk
freely allocating as it goes, deep-copy the final value into the main heap,
and reset scratch in one shot.
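The first pattern might look like this sketch. Only call-with-heap-rewind itself is from the post; the function and its pre-allocated output cell are illustrative. The caller allocates the pair out on the main heap up front; the thunk conses a temporary list, folds it into the slot, and the rewind reclaims every intermediate pair on return.

```scheme
(define (sum-squares! out n)
  (call-with-heap-rewind
    (lambda ()
      ; temporaries: a list of squares, consed inside the marked region
      (let build ((i 1) (sqs '()))
        (if (> i n)
            (let sum ((l sqs) (acc 0))
              (if (null? l)
                  (set-car! out acc)   ; result lands in the pre-allocated pair
                  (sum (cdr l) (+ acc (car l)))))
            (build (+ i 1) (cons (* i i) sqs)))))))
```

After (sum-squares! out 10), (car out) holds 385, while every cons built inside the thunk has evaporated with the rewind.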
Mark/rewind for known shapes, scratch+promote for unknown ones. The lifetime is named at the call site instead of inferred by a tracer at runtime — no runtime, no pauses, predictable. Here's the second pattern in use:
```scheme
(define (tokenize-line line)
  (call-with-scratch-deep-copy
    (lambda ()
      (let loop ((i 0) (acc '()))
        (cond ((>= i (string-length line)) (reverse acc))
              ((char-whitespace? (string-ref line i))
               (loop (+ i 1) acc))
              (else
               (let-values (((tok j) (read-token line i)))
                 (loop j (cons tok acc)))))))))
```
read-token and the cons cells of acc allocate freely on scratch; only the
final reversed list is deep-copied to the main heap, and scratch resets in one
go. Between the two combinators, you cover most of what a GC was doing for
you. It's not perfect, but I'm guessing it'll be enough to compile the TCC
source without blowing up memory usage.
And here's the kind of code Scheme1 actually runs. This is a form classifier —
exactly the shape every compiler's form-walker takes, which is precisely why
pmatch earns its place as a builtin even without a macro system to express
it:
```scheme
(define (kind e)
  (pmatch e
    ((quote ,_)       'lit)
    ((if ,_ ,_ ,_)    'if)
    ((lambda ,_ . ,_) 'lambda)
    ((set! ,_ ,_)     'set!)
    (,x (guard (symbol? x))  'var)
    (,x (guard (integer? x)) 'int)
    ((,_ . ,_)        'call)
    (else 'unknown)))
```
Destructure-and-dispatch like this is exactly the spine of the C compiler that's coming next.
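A few illustrative calls, under the usual pmatch reading in which bare symbols in a pattern match themselves:

```scheme
(kind '(quote x))    ; => lit
(kind '(if a b c))   ; => if
(kind '(f x y))      ; => call
(kind 'x)            ; => var
(kind 42)            ; => int
```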
Concrete sizes for what's landed (non-comment, non-blank LOC):

- scheme1.P1pp: ~4.3KLOC. The eval/apply core itself comes in under 200 lines — the bulk of the interpreter is the special forms, primitives, and the reader/writer pair, which is where any honest Lisp ends up spending its LOC.
- prelude.scm: ~500 LOC of Scheme prepended to user code (extra control-flow forms, list helpers, the pmatch surface).
It assembles unchanged on aarch64, amd64, and riscv64 — the P1 move from last
time still doing its job. Statically-linked binaries land at ~51 KiB
(aarch64), ~41 KiB (amd64), ~57 KiB (riscv64). The test suite is 113 .scm
fixtures, all running end-to-end on all three arches via the hex0-seed → ...
→ scheme1 chain.
It's slow — recursive tree-walking eval-apply, no bytecode, no hashmaps, no JIT — but I'm betting it's fast enough to compile TCC in tolerable time, which is all the bootstrap needs.
Stepping back: part of my goal in this series is to make the bootstrap worth
reading on its own. Each rung should be edifying. The hex series and M0 form
the most minimal assembler you'll find — you learn a real ISA (whichever
arch you choose to read), instruction encoding, ELF, and how relative and
absolute references resolve to bytes. P1 introduces standardized calling
conventions and function abstraction at the assembly level. M1++ adds the
power of macros — named parameters, recursive expansion, compile-time
integer expressions, per-expansion label hygiene. hex2++ is the smallest
interesting linker layer you'll see: a two-pass scoped symbol table,
dotted local labels resolved innermost-out, label arithmetic at emit
time, and the small handful of directives (.scope, .align, .fill,
.ptrsize) that let an upstream macro layer stay target-neutral.
Scheme1, being a Lisp, is a great introduction to the lambda calculus,
functional programming, and dynamic languages. And finally, C is the
lingua franca of the software world and builds naturally on the preceding
stack, a sort of ergonomic portable assembly with the lexical scoping of
a Lisp.
So, now that we have a Scheme, the last remaining piece is a C compiler written in Scheme that can compile TCC. I'm hoping that too comes in under 5KLOC.
Cheers to Church, McCarthy, Steele, and Sussman for the Lambda Calculus, Lisp, and Scheme. The ideas are timeless.