Experiments

Longer notes on systems experiments, evaluation design, and engineering tradeoffs behind selected projects.

TinyCodeTest Evaluation Stack

Deterministic verifiers, difficulty buckets, and multi-model pass@k benchmarking

Notes on how the TinyCodeTest evaluation stack turns model runs into reproducible, inspectable benchmark results.

Scenario design, deterministic grading, ablations, and stress tests for strategic reasoning

Experiment notes from building and evaluating an RL-style environment for race-strategy reasoning.

Repeatable measurement, SIMD-aware implementation work, and fair algorithm comparisons

A closer look at the benchmarking discipline behind Compression Bench.