An optimized fork of CCC — Claude's C Compiler — with a two-pass linear-scan register allocator, tail-call elimination, phi-copy stack coalescing, loop unrolling, and FP intrinsic lowering, targeting x86-64, AArch64, RISC-V 64, and i686.
| Benchmark | LCCC | GCC -O2 | LCCC vs GCC |
|---|---|---|---|
arith_loop | 0.139s | 0.088s | 1.6× |
sieve | 0.073s | 0.047s | 1.6× |
qsort | 0.137s | 0.105s | 1.3× |
fib(40) | 0.000s | 0.146s | 478× faster |
matmul | 0.010s | 0.006s | 1.8× |
tce_sum | 0.008s | 0.008s | ≈equal |
┌─────────────────────────────────────────────────────────────────────┐ │ C source │ │ │ frontend: lex → parse → sema → IR lowering │ │ ▼ │ │ SSA IR │ │ │ optimizer: GVN · LICM · IPCP · DCE · const-fold · inline │ │ ▼ │ │ Optimized IR │ │ │ regalloc (LCCC): two-pass linear scan over live intervals │ │ │ pass 1: callee-saved ↔ all eligible values │ │ │ pass 2: caller-saved ↔ non-call-spanning unallocated values │ │ ▼ │ │ Machine code x86-64 · AArch64 · RISC-V 64 · i686 │ │ │ standalone assembler + linker (no external toolchain) │ │ ▼ │ │ ELF executable │ └─────────────────────────────────────────────────────────────────────┘
CCC compiles real projects — SQLite, PostgreSQL, Redis, the Linux kernel — from a zero-dependency Rust codebase with its own assembler and linker. LCCC focuses on closing the performance gap with GCC by improving where CCC leaves the most on the table.
Two-pass allocation: callee-saved for all eligible values, then caller-saved for non-call-spanning remainder. Priority-weighted by loop depth.
Same flags, same output ABI. Point CC=lccc at any Makefile. No build system changes required.
x86-64, AArch64, RISC-V 64, i686. The same allocator improvements apply everywhere — architecture-agnostic PhysReg abstraction.
518 unit tests pass. Correctness-first — all four phases (linear scan, TCE, phi-copy coalescing, loop unrolling) maintain byte-identical output to GCC.
The old CCC allocator uses three greedy phases. LCCC replaces the allocation core with a proper linear scan (Phase 2), adds tail-call-to-loop conversion (Phase 3a), eliminates redundant phi-copy stack slots in loop backedges (Phase 3b), and Phase 4 adds loop unrolling and FP intrinsic lowering.
// All eligible IR values compete for callee-saved registers. // Safe across calls; better coverage than old "call-spanning only" Phase 1. let phase1_intervals = filter_eligible_intervals(&liveness, &eligible); let phase1_ranges = build_live_ranges(&phase1_intervals, &liveness.block_loop_depth, func); let mut allocator = LinearScanAllocator::new(phase1_ranges, config.available_regs.clone()); allocator.run(); // expire → find_free → evict-or-spill
// Unallocated, non-call-spanning values get caller-saved registers. // Caller-saved regs are destroyed by calls, so we must not assign them // to values that cross call boundaries. let phase2_intervals = liveness.intervals.iter() .filter(|iv| eligible.contains(&iv.value_id)) .filter(|iv| !assignments.contains_key(&iv.value_id)) .filter(|iv| !spans_any_call(iv, call_points)) .collect(); let mut caller_alloc = LinearScanAllocator::new(phase2_ranges, config.caller_saved_regs.clone()); caller_alloc.run();
Best-of-5 wall-clock time. All outputs are identical. Run with
python3 lccc-improvements/benchmarks/bench.py --reps 5.
| Benchmark | Description | LCCC | CCC | GCC -O2 | LCCC vs CCC | LCCC vs GCC |
|---|---|---|---|---|---|---|
arith_loop |
32-var arithmetic, 10M iters | 0.103s | 0.146s | 0.068s | +42% faster | 1.50× |
sieve |
Primes to 10M | 0.036s | 0.045s | 0.024s | +25% faster | 1.50× |
qsort |
Sort 1M integers | 0.096s | 0.095s | 0.087s | ≈equal | 1.10× |
fib(40) |
Recursive Fibonacci | 0.352s | 0.354s | 0.096s | ≈equal | 3.68× |
matmul |
256×256 double matrix multiply | 0.020s | 0.029s | 0.004s | +45% faster | 4.86× |
tce_sum |
Tail-recursive sum(10M) | 0.008s | 1.09s | 0.008s | 139× faster | ≈equal |
The largest gains are on register-pressure code (linear scan + phi-copy coalescing), tail-recursive code (TCE), and matrix multiply (FP intrinsic lowering, Phase 4). The remaining matmul gap is GCC's AVX2 auto-vectorization — a Phase 5 target. See the roadmap for what's next.
Phase 12 brings LCCC within 1.3–1.6× of GCC -O2 across all benchmarks, and 478× faster than GCC on recursive Fibonacci — with constant-immediate stores, SIB address folding, accumulator ALU+store folding, and a register allocator loop-depth priority fix.
Detects f(n) = f(n-1) + f(n-2) and converts exponential O(2^n) recursion to O(n) iterative loop. fib(40): 0.001 s vs GCC's 0.194 s.
MatMul inner loop uses vfmadd231pd — fused multiply-add in a single instruction. Correct innermost-loop detection, SIB addressing, remainder loops.
F64 values allocated to xmm2–xmm7, freeing GPR pressure. Combined with frame pointer omission, this gives the allocator up to 17 usable registers.
Callee-saved register mappings preserved across loop headers. Copy propagation flows through conditional jumps. Eliminates redundant stack traffic.
git clone --recurse-submodules https://github.com/levkropp/lccc.git cd lccc
cargo build --release # Binary at target/release/ccc (x86-64) # Also: ccc-arm, ccc-riscv, ccc-i686
GCC_INC="-I/usr/lib/gcc/x86_64-linux-gnu/$(gcc -dumpversion)/include" ./target/release/ccc $GCC_INC -O2 -o hello hello.c ./hello
python3 lccc-improvements/benchmarks/bench.py --reps 5 --md results.md