TinyCodeTest Evaluation Stack

This writeup captures the evaluation design decisions behind TinyCodeTest.

Core experiment

The main question was how to benchmark code-generation systems without hiding the execution conditions. The environment therefore treats every attempt as a constrained run with explicit verifier logic, timeout boundaries, and scoring rules.
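As a rough illustration of what a constrained run could look like, the sketch below executes one attempt in a fresh interpreter with a hard timeout and treats a zero exit code from the verifier as a pass. The names `run_attempt` and `RunResult`, the 5-second default, and the exit-code convention are assumptions made for this example, not details taken from TinyCodeTest itself.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RunResult:
    passed: bool      # verifier exited with code 0
    timed_out: bool   # run exceeded the timeout boundary
    stdout: str
    stderr: str


def run_attempt(candidate_code: str, verifier_code: str,
                timeout_s: float = 5.0) -> RunResult:
    """Run candidate plus verifier in a separate Python process.

    Hypothetical helper: the verifier is assumed to be plain asserts
    appended after the candidate, so any failure gives a nonzero exit.
    """
    program = candidate_code + "\n\n" + verifier_code
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return RunResult(False, True, "", "")
    return RunResult(proc.returncode == 0, False, proc.stdout, proc.stderr)
```

Keeping the timeout and the pass criterion in the function signature and return type, rather than in surrounding notebook state, is one way to make the execution conditions visible in every recorded result.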

Design choices

Deterministic subprocess execution instead of ad hoc notebook-style testing
Difficulty buckets and adversarial splits to reduce misleading aggregate scores
Bootstrap confidence intervals so pass@k is interpreted with uncertainty, not as a single magic number
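To make the last point concrete, here is a minimal sketch of pass@k with a percentile bootstrap interval. It assumes the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c correct, and it resamples problems with replacement. The writeup does not say which estimator or bootstrap variant TinyCodeTest uses, so the function names, the 2000 resamples, and the 95% level are all illustrative choices.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb returns 0 when n - c < k, which gives an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_ci(per_problem, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-problem scores.

    Assumption: problems are the resampling unit. A fixed seed keeps
    the interval reproducible across runs.
    """
    rng = random.Random(seed)
    m = len(per_problem)
    means = sorted(
        sum(rng.choices(per_problem, k=m)) / m for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical (n, c) counts for four problems in one difficulty bucket.
counts = [(10, 0), (10, 3), (10, 10), (10, 1)]
scores = [pass_at_k(n, c, k=1) for n, c in counts]
mean_score = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
```

Running the same two functions separately on each difficulty bucket or adversarial split gives one interval per slice, which makes it easier to see whether a change in the aggregate is driven by a single slice.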

Outcome

The result is an evaluation stack that is easier to trust, compare, and debug when a model improves in one slice and regresses in another.


Farouq Oguntoye
