samuel.brooks — builds data pipelines

Deep Dive

Premier League Predictive Analytics

A closer look at the modelling workflow, evaluation discipline, and outputs behind the Premier League project. This page is drawn directly from the project repository rather than from a polished summary.

What the project is doing

The project predicts three-way Premier League match outcomes across 15 seasons, benchmarks those probabilities against bookmaker markets, then extends the workflow into league table simulation. It combines feature engineering, walk-forward validation, calibration, and probability blending rather than stopping at a raw classification score.
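Benchmarking against bookmaker markets first requires turning quoted odds into probabilities. The sketch below shows one common way to do that, proportional normalisation of decimal odds to strip out the bookmaker margin. The function name and the choice of normalisation are assumptions for illustration; the repository may remove the margin differently.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into home/draw/away probabilities that sum to 1.

    Hypothetical helper: proportional normalisation is assumed here,
    not taken from the project repository.
    """
    odds = [home_odds, draw_odds, away_odds]
    if any(o <= 1.0 for o in odds):
        raise ValueError("decimal odds must be greater than 1.0")
    # Raw inverse odds sum to more than 1; the excess is the bookmaker margin.
    raw = [1.0 / o for o in odds]
    overround = sum(raw)
    # Divide the margin out proportionally across the three outcomes.
    return [p / overround for p in raw]


probs = implied_probabilities(2.0, 3.5, 4.0)
```

For odds of 2.0, 3.5, and 4.0 the raw inverse odds sum to about 1.036, and the normalised probabilities are roughly 0.483, 0.276, and 0.241.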

Why it matters

The interesting part is not just building a predictive model. It is building an evaluation pipeline that respects time, compares against realistic baselines, and shows where the model actually adds value beyond bookmaker odds.

Feature pipeline diagram

Feature pipeline

The project combines bookmaker odds, Elo ratings, expected goals, transfer spending, and manager tenure into model-ready seasonal features.
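As an illustration of one of these inputs, here is a minimal sketch of a standard Elo update for a single match. The K-factor, the home-advantage offset, and the function names are assumed values chosen for the example, not parameters read from the project.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation for side A on a 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0, home_advantage=60.0):
    """Return updated (home, away) ratings after one match.

    result: 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    k and home_advantage are illustrative assumptions.
    """
    # The home side is treated as if it were rated home_advantage points higher.
    exp_home = expected_score(home + home_advantage, away)
    delta = k * (result - exp_home)
    # Zero-sum update: what one side gains, the other loses.
    return home + delta, away - delta
```

With equal ratings, no home advantage, and a home win, `update_elo(1500, 1500, 1.0, k=20, home_advantage=0)` moves the ratings to 1510 and 1490. Ratings computed this way use only matches already played, which keeps the feature safe for time-ordered validation.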

Metrics comparison

Model comparison

The benchmark that matters is the bookmaker market, not a weak baseline. This figure summarises walk-forward performance across model variants.
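Comparing probability forecasts against the market needs a proper scoring rule rather than plain accuracy. The sketch below implements two that are commonly used for three-way football forecasts, log loss and the ranked probability score (RPS). Which metrics the repository actually reports is not stated here, so treat the choice and the function names as assumptions.

```python
import math


def log_loss(probs, outcome):
    """Negative log of the probability given to the observed outcome index."""
    return -math.log(max(probs[outcome], 1e-15))


def ranked_probability_score(probs, outcome):
    """RPS for ordered outcomes (0 = home win, 1 = draw, 2 = away win).

    Compares cumulative forecast and cumulative outcome distributions,
    so a near miss (draw instead of home win) costs less than a far miss.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (len(probs) - 1)


def mean_score(score_fn, forecasts, outcomes):
    """Average a per-match score over a set of matches."""
    return sum(score_fn(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)
```

Lower is better for both scores. Running `mean_score` once with model probabilities and once with bookmaker-implied probabilities on the same held-out matches gives a like-for-like comparison.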

Calibration plot

Calibration

In a probability project, calibration matters as much as accuracy: outcomes forecast at 60% should occur about 60% of the time. The pipeline includes post-processing and blending so that probability quality is part of the evaluation, not an afterthought.
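One simple form of blending is a weighted average of model and market probabilities, with the weight chosen on earlier seasons. The sketch below assumes a linear blend and a grid search over the weight; the repository's actual blending and calibration steps may differ, and all names are hypothetical.

```python
import math


def blend(model_probs, market_probs, weight):
    """Linear blend of two three-way forecasts; weight is the model's share."""
    mixed = [weight * m + (1.0 - weight) * b
             for m, b in zip(model_probs, market_probs)]
    total = sum(mixed)
    # Renormalise to guard against rounding drift in the inputs.
    return [p / total for p in mixed]


def best_weight(model_forecasts, market_forecasts, outcomes, steps=20):
    """Grid-search the blend weight that minimises mean log loss.

    Should be fitted on validation matches that precede the test season.
    """
    def mean_log_loss(weight):
        losses = []
        for m, b, o in zip(model_forecasts, market_forecasts, outcomes):
            p = blend(m, b, weight)[o]
            losses.append(-math.log(max(p, 1e-15)))
        return sum(losses) / len(losses)

    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=mean_log_loss)
```

A fitted weight near 0 means the market already contains most of the model's information; a weight clearly above 0 suggests the model adds something the odds do not.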

Feature importance chart

Feature importance

The feature layer carries much of the modelling work: Elo, form, odds-derived probabilities, and recency-aware signals are the main drivers.

Predicted 2025-26 league table

Season forecast

The project extends beyond single-match prediction into full-season simulation using a Dixon-Coles approach and Monte Carlo forecasts.
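To make the Monte Carlo step concrete, here is a simplified sketch that simulates a double round-robin season from per-team attack and defence strengths. It samples goals from independent Poisson distributions and so omits the low-score dependence correction that distinguishes the Dixon-Coles model; the strength values, the home-advantage multiplier, and all names are illustrative assumptions.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's method; adequate for small rates such as goals per match."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_season(strengths, home_advantage=1.3, n_sims=1000, seed=0):
    """strengths maps team -> (attack, defence); a higher defence value concedes more.

    Returns (title_probability, mean_points) dictionaries.
    Simplification: points ties for first place go to the earlier-listed team.
    """
    rng = random.Random(seed)
    teams = list(strengths)
    titles = {t: 0 for t in teams}
    total_points = {t: 0 for t in teams}
    for _ in range(n_sims):
        points = {t: 0 for t in teams}
        for home in teams:
            for away in teams:
                if home == away:
                    continue
                # Expected goals: own attack times opponent defence.
                home_rate = strengths[home][0] * strengths[away][1] * home_advantage
                away_rate = strengths[away][0] * strengths[home][1]
                home_goals = sample_poisson(rng, home_rate)
                away_goals = sample_poisson(rng, away_rate)
                if home_goals > away_goals:
                    points[home] += 3
                elif home_goals < away_goals:
                    points[away] += 3
                else:
                    points[home] += 1
                    points[away] += 1
        titles[max(teams, key=lambda t: points[t])] += 1
        for t in teams:
            total_points[t] += points[t]
    title_prob = {t: titles[t] / n_sims for t in teams}
    mean_points = {t: total_points[t] / n_sims for t in teams}
    return title_prob, mean_points
```

Repeating the season many times turns match-level goal rates into distributions over final positions, which is what a predicted league table summarises.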

Walk-forward schematic

Walk-forward structure

Evaluation uses season-by-season expanding-window validation: each season is predicted by a model trained only on the seasons before it. Random splits would leak future information into training.
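An expanding-window split can be sketched as a small generator that yields, for each test season, every season that precedes it as training data. The function name and the `min_train` parameter are hypothetical, and season labels are assumed to sort chronologically as strings.

```python
def walk_forward_splits(seasons, min_train=3):
    """Yield (train_seasons, test_season) pairs in chronological order.

    The training window grows by one season at each step, and the test
    season is always strictly later than anything in the training set.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


for train, test in walk_forward_splits(
    ["2019-20", "2020-21", "2021-22", "2022-23"], min_train=2
):
    print(f"train on {train[0]} to {train[-1]}, test on {test}")
```

Any preprocessing fitted to data, such as scaling or calibration, should be refitted inside each training window so that nothing from the test season reaches the model.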