Market Data Pipeline

Python, Parquet, Arrow, SQL, data validation

Market Data Pipeline is a reproducible data-quality framework for building versioned datasets with auditability built in from the start. The goal is to make every output easy to trace back to its inputs, configuration, and execution context.

What it does

Produces versioned Parquet datasets with manifests, checksums, and run provenance
Tracks git commit, config hash, and dependency versions for each generated artifact
Adds deterministic leakage audits for forward-looking joins, survivorship bias, and label misalignment
Gates releases on drift and validation checks instead of pushing questionable data downstream

Evaluation layer

The pipeline also includes a baseline evaluation harness that models cost and latency assumptions, then measures how those assumptions affect the apparent value of a signal. That makes it easier to spot inflated results caused by hidden lookahead.

Code

Share on

Bluesky Facebook LinkedIn Mastodon X (formerly Twitter)

Farouq Oguntoye

What it does

Evaluation layer

Share on