Ryan Sepassi

boot2 C

May 2026

We continue our jaunt up the bootstrap chain from hex0-seed to TCC. The first post brought a portable pseudo-ISA (P1) and a real macro layer on top of M0 (M1++); the second delivered a Scheme interpreter (scheme1) in ~5KLOC of P1++. The bet was that with a Scheme on the floor, one C compiler would be enough — collapsing live-bootstrap's ladder of three C-likes (subset C, M2-Planet, MesCC) into a single hop.

Today the bet pays out: cc.scm — a C compiler in scheme1, ~5KLOC — compiles TCC 0.9.26 on aarch64, amd64, and riscv64. TCC then compiles itself to a byte-identical fixed point. Five upstream projects in live-bootstrap's audit path collapse to one (the vendored seed bytes). The per-arch swap set drops from ~5KLOC to ~1KLOC.

The chain

Before going further, here is the whole chain, one row per stage. Each stage is a shell script in boot/; each one's outputs feed the next.

Stage	Inputs	Outputs
boot0	seed bytes (hex0-seed + 6 hex files)	`hex2`, `catm`, `M0`
boot1	boot0 + `M1pp.P1`, `hex2pp.P1`	`M1pp`, `hex2pp`
boot2	boot1 + `catm.P1pp`, `scheme1.P1pp`	`catm`, `scheme1`
boot3	boot1 + boot2 + `cc.scm` + `tcc.flat.c` + `libc.flat.c`	`tcc0`
boot4	tcc0 + `tcc.flat.c` + mes `libc.flat.c`	`tcc1`, `tcc2`, `tcc3` (assert `tcc2 == tcc3`), mes `libc.a`, `libtcc1.a`
boot5	tcc3 + musl-1.2.5 source	musl `libc.a`, `crt{1,i,n}.o`
boot6	tcc3 + `seed-kernel/`	`kernel.elf`

Posts 1 and 2 covered boot0 → boot2 (everything up through scheme1). This post is boot3 → boot5 (the C compiler, the self-hosted TCC, then musl). The next post will be boot6 plus the DRIVER=seed loop that re-runs the whole chain inside the kernel boot6 produces.

Shape

cc.scm is a streaming pipeline: lexer → preprocessor → parser → codegen, each pulling one token at a time from the previous stage. No materialized token list, no AST. The parser emits P1++ assembly directly into a fixed-capacity buffer as it walks the source. C99 names eight translation phases; we collapse them into three:

Lex. Bytestream → token stream. Trigraphs and backslash-newline splice get handled in the byte reader. Comments fall away, but newlines survive as NL tokens because the preprocessor needs them to delimit directives.
Preprocess. Token stream → expanded token stream. #define, #if, #include-was-already-flattened (more on that below), macro expansion with hide-sets, #if evaluated through the same constant-expression machinery the parser uses.
Parse + emit. Token stream → P1++ text. Recursive descent for declarations and statements, Pratt for expressions, a value stack for intermediates. No AST: the parser calls codegen directly as it walks.

Three pieces of state thread through the pipeline. world holds the scope and tag bindings, the string-literal pool, and the tentative-definition list. pstate owns the token iterator and the current function context. cg owns everything to do with emission — the per-section output buffers, the value stack, the frame allocator, the label counter.

The no-AST, streaming, value-stack shape is borrowed straight from TCC itself, which compiles in a single pass with a vtop-driven value stack and emits machine code as the parser walks. The chain owes TCC more than its source — its design is the model for cc.scm's codegen too. Every expression operand becomes an opnd record (kind, type, lvalue?) pushed on cc.scm's vstack; operators pop their inputs and push their results. Most operands are register-resident at first; binary operators spill the left operand to a fresh frame slot before evaluating the right, so register pressure stays bounded by tree depth, not subexpression count. It's the simplest codegen shape that compiles tcc.c without blowing up. No optimization, no register allocator, no SSA. Predictable by eye, in keeping with the rest of the chain.

Two things from the scheme1 post show up in production here. First, the scratch heap: every top-level declaration runs inside call-with-scratch-deep-copy, allocating tokens, expanded macros, and per-decl parser state freely on the scratch heap and deep-copying just the persistent symbol-table entry back to the main heap before reset. Working memory stays bounded in the size of one declaration's worth of junk. Second, pmatch: it ended up earning its keep in the parser, which spends most of its time destructuring tokens by shape. The form classifier from the prior post was not a coincidence; it was the spine of what was coming.

Here is parse-stmt — exactly the destructure-and-dispatch from the prior post, now driving statement parsing for real:

(define (parse-stmt ps)
  (pmatch (peek ps)
    (($ tok? (kind PUNCT) (value lbrace)) (parse-cstmt ps))
    (($ tok? (kind KW) (value if))        (parse-if-stmt ps))
    (($ tok? (kind KW) (value while))     (parse-while-stmt ps))
    (($ tok? (kind KW) (value do))        (parse-do-stmt ps))
    (($ tok? (kind KW) (value for))       (parse-for-stmt ps))
    (($ tok? (kind KW) (value switch))    (parse-switch-stmt ps))
    (($ tok? (kind KW) (value return))    (parse-return-stmt ps))
    (($ tok? (kind KW) (value goto))      (parse-goto-stmt ps))
    (($ tok? (kind KW) (value break))
     (advance ps) (expect-punct ps 'semi) (do-break ps))
    (($ tok? (kind KW) (value continue))
     (advance ps) (expect-punct ps 'semi) (do-continue ps))
    (($ tok? (kind KW) (value case))      (parse-case-stmt ps))
    (($ tok? (kind KW) (value default))   (parse-default-stmt ps))
    (($ tok? (kind IDENT))
     (guard (and (eq? (tok-kind (peek2 ps)) 'PUNCT)
                 (eq? (tok-value (peek2 ps)) 'colon)))
     (parse-labelled-stmt ps))
    (else
     (cond ((stmt-starts-decl? ps) (parse-local-decl ps))
           (else (parse-expr-stmt ps))))))

The parse-if-stmt arm is the recursive heart in miniature — parse the condition expression to leave its value on the vstack, then ask the codegen for an if/else skeleton with the two arms supplied as thunks the codegen calls between emitting the branches:

(define (parse-if-stmt ps)
  (expect-kw ps 'if)
  (expect-punct ps 'lparen)
  (parse-expr ps) (rval! ps)
  (expect-punct ps 'rparen)
  (cg-ifelse (ps-cg ps)
             (lambda () (parse-stmt ps))
             (lambda ()
               (cond ((at-kw? ps 'else)
                      (advance ps) (parse-stmt ps))
                     (else #t)))))

Parser supplies the what; codegen supplies the how (label allocation, branch emission, frame-slot reuse). They share state through cg but neither walks the other's data structures. The whole compiler is shaped like this: small functions, value-stack discipline, recursion mirroring the grammar.

How much C

The target is just enough C to compile

tcc-0.9.26-1147-gee75a10c/tcc.c

with the same defines that live-bootstrap's tcc-mes step uses: BOOTSTRAP=1, HAVE_LONG_LONG=1, ONE_SOURCE=1, TCC_TARGET_X86_64=1, inline=, and the usual configuration paths. Notably not defined: HAVE_FLOAT, HAVE_BITFIELD, HAVE_SETJMP. Each of those gates off entire code paths in tcc.c; we don't have to compile any of it.

That sets the lower bound. The upper bound is what MesCC accepts, since we're its replacement. Anything MesCC strips silently — const, inline, __attribute__ — we strip silently too.

What's actually in: integer types and arithmetic, pointers and arrays, struct/union with anonymous members, enums, typedefs, function pointers, all the control-flow forms, designated initializers, sizeof, offsetof, variadic functions, sret for struct returns >16 bytes, the C11 ternary common-type rule for ?:. Single translation unit — the parser sees one big bytestream, with #include resolved by a pre-flatten pass upstream. P1-64 only (LP64) for now. Floats parse as types for ABI sizing but every operation emits integer bitpattern arithmetic. "Sufficient for tcc.c at this point in time" is the design.

A worked moment

The end-to-end flow, on one C function:

int sum(int n) {
    int acc = 0;
    for (int i = 1; i <= n; i++) acc += i;
    return acc;
}

cc.scm emits a function body wrapped in libp1pp's %fn macro, which opens a hex2++ scope, allocates the frame, and emits %enter / %eret. The body is vstack-style spills and reloads — every operand goes through a frame slot:

%fn(sum, 32, {
    %st(a0, sp, 0)              # spill incoming n
    %li(t0, 0)                  # acc = 0
    %st(t0, sp, 8)
    %li(t0, 1)                  # i = 1
    %st(t0, sp, 16)
.scope
    :.top
    %ld(t0, sp, 16)             # i
    %ld(t1, sp, 0)              # n
    %ble(t0, t1, &.body)        # if i <= n goto body
    %b(&.end)
    :.body
    %ld(t0, sp, 8)              # acc
    %ld(t1, sp, 16)             # i
    %add(t0, t0, t1)
    %st(t0, sp, 8)              # acc += i
    %ld(t0, sp, 16)             # i++
    %addi(t0, t0, 1)
    %st(t0, sp, 16)
    %b(&.top)
    :.end
.endscope
    %ld(a0, sp, 8)              # return acc
})

Every named bit of P1++ here came from earlier posts. %fn, %enter, %eret, %ld, %st, %add, %addi, %ble, %b are libp1pp macros expanding through the per-arch backend; :.top / :.end are dotted local labels resolved by hex2++'s innermost-out scope walk; the scope itself is opened by %fn and the inner .scope for the loop. cc.scm's codegen has no idea what aarch64 or amd64 looks like — it generates the same P1++ on every arch, and the backend lowers it.

Building TCC

With cc.scm in hand, the chain reaches its first C binary in boot3:

scheme1 cc.scm tcc.flat.c   →  tcc.P1pp           # cc.scm reads C, emits P1++
m1pp + hex2pp + ELF prelude →  tcc0               # boot3 ends here

tcc.flat.c is a single bytestream: upstream tcc-0.9.26 plus a small set of patches, with #includes recursively resolved by a pre-flatten pass (written in scheme1 itself). The compiler proper sees one TU.

Then comes the self-host bounce in boot4:

tcc0 tcc.flat.c   →  tcc1
tcc1 tcc.flat.c   →  tcc2
tcc2 tcc.flat.c   →  tcc3
assert tcc2 == tcc3 byte-for-byte

Why three bounces and not one? cc.scm and tcc's own codegen produce different — but both correct — code for the same source, so tcc0 and tcc1 differ in their bytes (same behavior, different codegen choices). tcc1 and tcc2 are both produced by tcc, but tcc1 itself was a cc.scm-shaped binary, so its codegen choices when emitting tcc2 need one more bounce to reach the tcc-shaped fixed point. tcc2 == tcc3 is the audit point: tcc compiling itself with no help from cc.scm reaches a stable image, and any future build through this stage will produce the same bytes. The loop has closed.

The whole sequence — a few hundred bytes of hex0-seed, no host C compiler at any rung — assembles unchanged on aarch64, amd64, and riscv64. The P1 portability move from the first post is still doing its job.

A real libc

tcc3 is a real C compiler but mes-libc is a sketch — enough to host tcc itself and not much more. boot5 trades it in for the real thing: musl 1.2.5, built by tcc3 with about 1300 source files driven through the scheme1 driver one tcc -c at a time, archived into libc.a with crt{1,i,n}.o alongside.

musl is written for GCC and Clang, not TCC, and it leans on a handful of GCC extensions TCC doesn't accept: register-asm variables for syscall stubs, __attribute__((alias)) for weak references, _Complex for the math library, x86_64 SSE/x87 inline asm. We patch around each of these during source prep — the patch list lives in bootprep/musl-vendor.sh — replacing the syscall stubs with TCC-shaped inline asm, alias attributes with explicit thunk functions, _Complex math with separate-real-imaginary alternatives, and the x86_64 SIMD bits with their generic C equivalents. Most are a few lines each; the patch set is small enough to read.

From boot5 onward, every binary the chain builds can link against musl and use the real printf / malloc / pthread_create (the last only formally — we still don't run threads). mes-libc retires.

What it cost

Lines of code, comments and blanks stripped, seed → tcc 0.9.26 window:

Scope	boot2	live-bootstrap	reduction
Single arch, hand-written	21,468	30,813	30%
Single arch, incl. generated tables	22,047	34,624	36%
3 arches, hand-written	22,695	41,053	45%
3 arches, incl. generated tables	24,434	44,864	46%
Per-arch swap set (3-arch sum)	3,649	15,595	77%

The single-arch number is real but modest. The per-arch number is the load-bearing one: ~1.2KLOC per arch in boot2 vs ~5.2KLOC per arch in live-bootstrap. The reason is structural. live-bootstrap's first C compiler above the M0 macro stage is cc_<arch>.M1 — a hand-written, arch-specific implementation of M2-Planet (a C-subset compiler) in M1 macro assembly, 3,800–5,800 lines per architecture. Each new arch is a from-scratch port of that compiler.

boot2 has no equivalent because P1, the portable pseudo-ISA introduced in the first post, sits immediately above M0. Everything above the per-arch P1 backend (P1-<arch>.M1pp, ~500–700 lines) — M1++, hex2++, scheme1, the prelude, cc.scm itself — is single-source, compiled per arch by swapping the backend file. There is no per-arch C compiler because there's no C compiler that stops at the bottom of the chain; the only one that exists is the portable Scheme one, all the way up.

Distinct upstream projects in the audit path also collapse: live-bootstrap walks five (stage0-posix + per-arch cc_<arch>.M1, mescc-tools, M2-Planet, mes, nyacc); boot2 walks one (stage0-posix's vendored seed bytes only). Both projects ship the same opaque hex0-seed bytes, the same hex/M0 ladder up to M0, the same tcc 0.9.26 source. Everything between is ours.

The comparison is not entirely apples-to-apples because live-bootstrap uses some of the artifacts from its intermediate Mes stages downstream, whereas boot2's code is used exclusively for the bootstrap (though scheme1 could be used as an early shell and associated tools). Vendored seed bytes (identical between projects) and the tcc 0.9.26 source itself are excluded from all totals, on both sides. Full breakdown in docs/COMPARE-LIVEBOOTSTRAP.md.

The work that live-bootstrap did on patching tcc and providing a minimal libc for the pre-musl build was invaluable, and we use both with some cleanup.

On C

Reaching C is the moment the bootstrap exits the museum. Below this rung we've been doing curatorial work — preserving hex/M0 because they're the seed, making assembly portable with P1 and ergonomic with M1++, and building a a Lisp to gain purchase on building complex programs. Above it, we step into the language modern software is actually written in. Linux, qemu, sqlite, git, the kernel that runs your phone, the C library every other language calls into eventually — all C. You can't bootstrap a working software stack without it, and there is no other language with the same combination of small spec, total hardware reach, and ecosystem stability. The chain has to terminate in C because the world terminates in C.

C earns its place not because it's elegant but because it's the contract everything else honors. A couple folks at Bell Labs in the early seventies sat down to write an OS in a high-level language portable enough to move Unix from a PDP-7 to a PDP-11; what came out was a language whose abstract machine is so close to the physical one that fifty years of CPUs have stayed close to it rather than the other way around. Pointers, structs, calling conventions, the linker model — all of these are now load-bearing for languages that don't even resemble C on the surface. To bring a system up from nothing without ever writing C is to bring up something that can't talk to the system that already exists. So we write C, and we keep its compiler in the chain.

Cheers to Dennis Ritchie for C, and Fabrice Bellard for TCC.

Up to here, the chain has been hosted on Linux. Every binary along the way — hex0 through tcc through musl — has been a Linux user-mode process trusting the host kernel and userland to load it, run it, and feed it files. The seed is small; the runtime envelope is not.

The next post takes that envelope away. We use tcc3 to compile a tiny kernel — ~1300 lines of portable C plus a few hundred per arch — small enough to satisfy the bootstrap's own minimal OS contract: roughly a dozen syscalls to build and drive. Boot it under qemu with the chain's inputs on a virtio-mmio disk, let the chain re-run inside its own kernel, and pull the outputs off a second disk on shutdown. Bit-equivalent to the host-driven build. The loop closes.