I build data pipelines and analytics systems.

Most of the work here is about taking messy source data and making it usable. I am currently completing an MSc in Business Analytics at UCL in London.

Interested in data engineering, analytics engineering, and applied machine learning infrastructure.

flagship system metrics

43	RSS outlets ingested across 17 languages
5,500	article embeddings in the vector store
8	containers in the polyglot Docker stack


pipeline.py — python3 — 96×25

samuel@portfolio:~/work % python pipeline.py --info

featured work

★ Geopolitical Crisis Narrative Tracker · 43 RSS outlets · 17 languages

Built to a UCL Data Engineering brief requiring RAG, graph databases, LLM agents, data lineage, and API deployment — all against live, real-world news pipelines.

PostgreSQL · MongoDB · Neo4j · ChromaDB · 5,500+ article embeddings · Neo4j: 2,500+ actors · 10k+ relationships · LangChain agent with 7 tools · Airflow orchestration · OpenLineage / Marquez lineage · FastAPI + MCP (12 endpoints) · Docker Compose (8 services) · spaCy NER · sentence-transformers · LibreTranslate · 216 tests across ingestion, storage, API, agent

>open project

projects.sh — zsh — 80×24

samuel@portfolio:~/projects % ls -la

# · name · type

02 · Indie Game Success Data Pipeline · 1,216 indie games

Built around 1,216 indie games and eight data sources, this project moves from collection and scraping through Spark processing into a dataset that is actually usable for analysis.

8 integrated data sources · PostgreSQL + MongoDB storage · Apache Spark processing · Parquet + DuckDB analytics · Success tier classification

>open project-02

03 · Premier League Predictive Analytics · 15 seasons, 5,700 matches

The project focuses on disciplined evaluation rather than inflated forecasting claims, using walk-forward validation, calibration, and model-odds blending to test whether the pipeline adds signal beyond the market.

Bet365 and Pinnacle benchmarks · Best single model: 0.9678 log-loss · Best blend beats Bet365, p=0.007 · Dixon-Coles season simulation · 6.7/20 exact positions

>open project-03
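
A minimal sketch of the model-odds blending idea described above, assuming a simple linear blend and overround removal by normalisation. The function names, the 0.5 weight, and the toy probabilities and odds are all hypothetical illustrations, not the project's code or results, and walk-forward validation and calibration are not shown.

```python
import math


def implied_probs(odds):
    """Convert decimal odds (home, draw, away) to probabilities.

    Assumption: the bookmaker margin is removed by simple normalisation.
    """
    raw = [1.0 / o for o in odds]
    total = sum(raw)
    return [r / total for r in raw]


def blend(model_probs, market_probs, weight):
    """Linear blend: `weight` on the model, (1 - weight) on the market."""
    return [weight * m + (1.0 - weight) * k
            for m, k in zip(model_probs, market_probs)]


def log_loss(prob_rows, outcomes):
    """Mean negative log-probability given to the observed outcome index."""
    eps = 1e-15  # guard against log(0)
    total = sum(math.log(max(row[y], eps))
                for row, y in zip(prob_rows, outcomes))
    return -total / len(outcomes)


# Toy inputs (hypothetical): two matches, outcome index 0=home, 1=draw, 2=away.
model = [[0.50, 0.27, 0.23], [0.30, 0.30, 0.40]]
odds = [[1.90, 3.60, 4.20], [2.80, 3.30, 2.60]]
outcomes = [0, 2]

market = [implied_probs(o) for o in odds]
blended = [blend(m, k, 0.5) for m, k in zip(model, market)]

print("model  :", log_loss(model, outcomes))
print("market :", log_loss(market, outcomes))
print("blend  :", log_loss(blended, outcomes))
```

Lower log-loss is better. A blend only adds signal beyond the market if its log-loss on held-out matches is lower than the market's, which is the comparison the significance test above refers to.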

04 · Marketing Analytics Segmentation · Customer segmentation

A supporting analytics project focused on the less glamorous but more useful work of cleaning structure, shaping features, and producing interpretable segments.

Analytics workflow · Feature framing · Decision-ready outputs

>open project-04

notes.md — vim — 72×30

samuel@portfolio:~/notes % vim notes.md

-- NORMAL -- notes.md 2L

Lessons from building a Spark pipeline

What broke, what became clearer, and what changed once the work had to run repeatedly rather than just succeed once in a notebook.

Designing reproducible data pipelines

Some thoughts on traceable movement, storage boundaries, and why quiet documentation habits matter more than people admit.

contact.sh — zsh — 60×18

samuel@portfolio:~ % ./contact.sh

checking endpoints...

→ LinkedIn

→ Email

→ Phone (US)

→ Phone (UK)

→ CV/Resume

→ GitHub

→ UCL Email

done.