TinyCodeTest
Python, RL evaluation, sandboxed execution, benchmarking, Vercel
TinyCodeTest is a deterministic evaluation environment for code-generation models. It combines verifier-driven task execution with reproducible scoring so model outputs can be compared under the same constraints every time.
What it does
- Runs generated code in sandboxed subprocesses with per-attempt timeouts and memory caps
- Organizes tasks into difficulty buckets with adversarial dataset splits
- Computes pass@k metrics with bootstrap confidence intervals, so differences between models can be weighed against sampling noise
- Supports multiple model providers and prompt strategies from one evaluation registry
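As a rough illustration of the scoring step, the sketch below computes pass@k per task and a bootstrap interval over tasks. It assumes the unbiased estimator commonly used with HumanEval-style benchmarks and a percentile bootstrap. The project may use a different estimator or resampling scheme. The function names and the sample counts are hypothetical.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 - C(n - c, k) / C(n, k); math.comb returns 0 when k > n - c
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_ci(per_task, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    size = len(per_task)
    means = sorted(
        sum(rng.choices(per_task, k=size)) / size for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Hypothetical counts, not real results: (samples drawn, samples that passed)
results = [(10, 0), (10, 3), (10, 10), (10, 1)]
scores = [pass_at_k(n, c, k=5) for n, c in results]
mean_score = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
```

Resampling at the task level treats each task as one observation, which is a design choice in this sketch. Resampling individual attempts within a task would answer a slightly different question.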
Why it matters
The project is built around trustworthy evaluation. Instead of reducing model performance to a single benchmark number, it exposes the execution conditions, verification trace, and scoring pipeline so results can be reproduced and inspected.
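One way execution conditions could travel with each result is sketched below. It runs a snippet in a child interpreter with a timeout and an address-space cap, then returns a record that includes the limits applied. This is an assumption about the approach, not the project's actual runner: `run_attempt` and the record fields are hypothetical, the memory cap relies on the POSIX-only `resource` module, and a subprocess with resource limits is not a full sandbox on its own.

```python
import subprocess
import sys
import time

try:
    import resource  # POSIX only; unavailable on Windows
except ImportError:
    resource = None


def run_attempt(code: str, timeout_s: float = 5.0, memory_mb: int = 512) -> dict:
    """Run one generated program in a child interpreter and return a trace record."""

    def limit_memory():
        # Runs in the child just before exec: cap its address space.
        limit = memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    started = time.monotonic()
    try:
        proc = subprocess.run(
            # -I: isolated mode, ignores PYTHON* env vars and the user site dir
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            preexec_fn=limit_memory if resource is not None else None,
        )
        status = "ok" if proc.returncode == 0 else "error"
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        status, stdout, stderr = "timeout", "", ""
    return {
        "status": status,
        "stdout": stdout,
        "stderr": stderr,
        "duration_s": time.monotonic() - started,
        # Record the conditions so the run can be reproduced and inspected
        "limits": {"timeout_s": timeout_s, "memory_mb": memory_mb},
    }


record = run_attempt("print(1 + 1)")
```

Here `status` only reflects the exit code or a timeout. A verifier would still need to compare `stdout` against the task's expected output.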
Delivery
- Multi-model registry for OpenAI, Anthropic, and Gemini APIs
- Export paths for JSON, Markdown, and HTML leaderboard output
- Vercel-based UI for dataset upload, model selection, API key management, and live evaluation runs
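The export step can be pictured as rendering one ranked list of rows into several formats. The sketch below covers JSON and Markdown only. The row schema, the function names, and the model names and scores are placeholders chosen for illustration, not the project's real format or results.

```python
import json


def rank(rows):
    """Sort leaderboard rows by score, best first."""
    return sorted(rows, key=lambda r: r["pass_at_1"], reverse=True)


def to_json(rows) -> str:
    """Serialize the ranked rows as indented JSON."""
    return json.dumps(rank(rows), indent=2)


def to_markdown(rows) -> str:
    """Render the ranked rows as a Markdown table."""
    lines = ["| Rank | Model | pass@1 |", "| --- | --- | --- |"]
    for position, row in enumerate(rank(rows), start=1):
        lines.append(f"| {position} | {row['model']} | {row['pass_at_1']:.3f} |")
    return "\n".join(lines)


# Placeholder rows, not real results
rows = [
    {"model": "model-a", "pass_at_1": 0.5},
    {"model": "model-b", "pass_at_1": 0.75},
]
markdown_table = to_markdown(rows)
json_text = to_json(rows)
```

Keeping ranking in one shared function means every output format shows the same order, which matters when exported files are compared against each other.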
