TinyCodeTest Evaluation Stack

This writeup captures the evaluation design decisions behind TinyCodeTest.

Core experiment

The main question was how to benchmark code-generation systems without hiding the execution conditions. The environment therefore treats every attempt as a constrained run with explicit verifier logic, timeout boundaries, and scoring rules.
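
As a rough illustration of what a constrained run could look like, the sketch below executes one candidate program in a fresh subprocess with a hard timeout and checks its output against an expected string. The names `AttemptResult` and `run_attempt`, the exact-match verifier, and the 5-second default are assumptions made for this example, not details taken from TinyCodeTest itself.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class AttemptResult:
    """Outcome of one attempt (hypothetical record, for illustration only)."""
    passed: bool
    timed_out: bool
    stdout: str


def run_attempt(code: str, expected_stdout: str, timeout_s: float = 5.0) -> AttemptResult:
    """Run `code` in a separate interpreter and verify its standard output.

    Assumed verifier rule: exit code 0 and stdout equal to the expected
    text after stripping surrounding whitespace.
    """
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode, so user site-packages and
            # PYTHON* environment variables do not leak into the run.
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The timeout boundary is reported explicitly rather than folded
        # into a generic failure.
        return AttemptResult(passed=False, timed_out=True, stdout="")

    passed = proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
    return AttemptResult(passed=passed, timed_out=False, stdout=proc.stdout)


if __name__ == "__main__":
    print(run_attempt("print(1 + 1)", "2"))
    print(run_attempt("while True: pass", "", timeout_s=0.5))
```

Keeping the timeout and the verifier rule as explicit arguments and fields is one way to make the execution conditions part of the recorded result instead of an implicit property of the test machine.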

Design choices

  • Deterministic subprocess execution instead of ad hoc notebook-style testing
  • Difficulty buckets and adversarial splits so that a single aggregate score cannot mask weak slices
  • Bootstrap confidence intervals so pass@k is interpreted with uncertainty, not as a single magic number
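
The last two choices can be sketched together: a per-bucket pass@k with a bootstrap interval around it. This is a minimal sketch under stated assumptions. The source does not say which pass@k estimator or which bootstrap variant is used, so the example assumes the commonly used unbiased combinatorial estimator, 1 - C(n-c, k) / C(n, k), and a percentile bootstrap that resamples problems. The function names and the input layout are hypothetical.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-problem scores.

    Problems are resampled with replacement; a fixed seed keeps the
    interval reproducible from run to run.
    """
    rng = random.Random(seed)
    m = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=m)) / m for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def bucket_report(attempts_by_bucket, k):
    """Map each bucket name to (mean pass@k, (ci_low, ci_high)).

    `attempts_by_bucket` maps a bucket name to a list of (n, c) pairs,
    one pair per problem (assumed layout, for illustration).
    """
    report = {}
    for bucket, counts in attempts_by_bucket.items():
        scores = [pass_at_k(n, c, k) for n, c in counts]
        report[bucket] = (sum(scores) / len(scores), bootstrap_ci(scores))
    return report


if __name__ == "__main__":
    # Made-up counts, only to show the call shape.
    demo = {
        "easy": [(10, 9), (10, 7), (10, 10)],
        "hard": [(10, 1), (10, 0), (10, 3)],
    }
    for name, (mean, (low, high)) in bucket_report(demo, k=1).items():
        print(f"{name}: pass@1={mean:.3f} CI=({low:.3f}, {high:.3f})")
```

Reporting the interval per bucket rather than once for the whole benchmark makes it easier to see whether a gap between two buckets, or between two models on one bucket, is larger than the resampling noise.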

Outcome

The result is an evaluation stack whose results are easier to trust, compare, and debug, particularly when a model improves in one slice and regresses in another.