samuel.brooks — builds data pipelines
~/portfolio — zsh — 122×33

samuel@portfolio:~/portfolio % cat intro.md

London  ·  Data Engineer  ·  MSc Business Analytics, UCL

I build data pipelines and analytics systems.

Most of the work here is about taking messy source data and making it usable. I am currently completing an MSc in Business Analytics at UCL in London.

Interested in data engineering, analytics engineering, and applied machine learning infrastructure.

flagship system metrics

43 RSS outlets ingested across 17 languages
5,500 article embeddings in the vector store
8 containers in the polyglot Docker stack
pipeline.py — python3 — 96×25

samuel@portfolio:~/work % python pipeline.py --info

featured work

Geopolitical Crisis Narrative Tracker

Built to a UCL Data Engineering brief requiring RAG, graph databases, LLM agents, data lineage, and API deployment — all against live, real-world news pipelines.

PostgreSQL · MongoDB · Neo4j · ChromaDB · 5,500+ article embeddings · Neo4j: 2,500+ actors · 10k+ relationships · LangChain agent with 7 tools · Airflow orchestration · OpenLineage / Marquez lineage · FastAPI + MCP (12 endpoints) · Docker Compose (8 services) · spaCy NER · sentence-transformers · LibreTranslate · 216 tests across ingestion, storage, API, agent

>open project
projects.sh — zsh — 80×24

samuel@portfolio:~/projects % ls -la

# name

Built around 1,216 indie games and eight data sources, this project moves from collection and scraping through Spark processing into a dataset that is actually usable for analysis.

8 integrated data sources · PostgreSQL + MongoDB storage · Apache Spark processing · Parquet + DuckDB analytics · Success tier classification

>open project-02
03 Premier League Predictive Analytics

The project focuses on disciplined evaluation rather than inflated forecasting claims, using walk-forward validation, calibration, and model-odds blending to test whether the pipeline adds signal beyond the market.

Bet365 and Pinnacle benchmarks · Best single model: 0.9678 log-loss · Best blend beats Bet365, p=0.007 · Dixon-Coles season simulation · 6.7/20 exact positions
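
For readers unfamiliar with model-odds blending, here is a minimal sketch of the general idea, not the project's actual code. It assumes a simple linear blend and proportional removal of the bookmaker margin. The function names, the blend weight, and the three example matches are all hypothetical and chosen only for illustration.

```python
import math


def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by proportional
    normalisation, which is only one of several possible methods.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


def blend(model_probs, market_probs, weight):
    """Linear blend: `weight` on the model, (1 - weight) on the market."""
    mixed = [weight * m + (1.0 - weight) * k
             for m, k in zip(model_probs, market_probs)]
    total = sum(mixed)
    return [p / total for p in mixed]


def log_loss(prob_rows, outcomes, eps=1e-15):
    """Mean negative log-probability given to the observed outcome index."""
    losses = [-math.log(max(row[y], eps))
              for row, y in zip(prob_rows, outcomes)]
    return sum(losses) / len(losses)


# Hypothetical matches: (home, draw, away) odds, model probabilities,
# and the index of the observed result. Not real project data.
matches = [
    {"odds": (2.10, 3.40, 3.60), "model": (0.50, 0.27, 0.23), "result": 0},
    {"odds": (1.60, 4.00, 5.50), "model": (0.58, 0.24, 0.18), "result": 1},
    {"odds": (3.20, 3.30, 2.30), "model": (0.28, 0.29, 0.43), "result": 2},
]

market = [implied_probabilities(m["odds"]) for m in matches]
model = [list(m["model"]) for m in matches]
results = [m["result"] for m in matches]

# Compare the market alone, the model alone, and an even blend.
for label, weight in (("market", 0.0), ("model", 1.0), ("blend", 0.5)):
    rows = [blend(mo, ma, weight) for mo, ma in zip(model, market)]
    print(f"{label}: log-loss = {log_loss(rows, results):.4f}")
```

In a walk-forward setup, the blend weight would be chosen on earlier matches only and then scored on later ones, so the comparison against the market is never made on data used for tuning.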

>open project-03
04 Marketing Analytics Segmentation

A supporting analytics project focused on the less glamorous but more useful work of cleaning structure, shaping features, and producing interpretable segments.

Analytics workflow · Feature framing · Decision-ready outputs

>open project-04
notes.md — vim — 72×30

samuel@portfolio:~/notes % vim notes.md

-- NORMAL -- notes.md 2L
1

Lessons from building a Spark pipeline

What broke, what became clearer, and what changed once the work had to run repeatedly rather than just succeed once in a notebook.

2

Designing reproducible data pipelines

Some thoughts on traceable movement, storage boundaries, and why quiet documentation habits matter more than people admit.

contact.sh — zsh — 60×18

samuel@portfolio:~ % ./contact.sh

checking endpoints...

LinkedIn
Email
Phone (US)
Phone (UK)
CV/Resume
GitHub
UCL Email

done.