Experiments

Longer notes on systems experiments, evaluation design, and engineering tradeoffs behind selected projects.

TinyCodeTest Evaluation Stack

Deterministic verifiers, difficulty buckets, and multi-model pass@k benchmarking

Notes on how the TinyCodeTest evaluation stack turns model runs into reproducible, inspectable benchmark results.

Code

F1 Strategy Lab Notes

Scenario design, deterministic grading, ablations, and stress tests for strategic reasoning

Experiment notes from building and evaluating a reinforcement-learning-style environment for race-strategy reasoning.

Code

Compression Benchmark Notes

Repeatable measurement, SIMD-aware implementation work, and fair algorithm comparisons

A closer look at the benchmarking discipline behind Compression Bench.

Code