TinyCodeTest Evaluation Stack
Deterministic verifiers, difficulty buckets, and multi-model pass@k benchmarking
Notes on how the TinyCodeTest evaluation stack turns model runs into reproducible, inspectable benchmark results.
Scenario design, deterministic grading, ablations, and stress tests for strategic reasoning
Experiment notes from building and evaluating an RL-style environment for race-strategy reasoning.
Repeatable measurement, SIMD-aware implementation work, and fair algorithm comparisons
A closer look at the benchmarking discipline behind Compression Bench.