samuel.brooks — builds data pipelines
indie-game-success-data-pipeline.py — python3 — 118×36

samuel@portfolio:~/projects % cat indie-game-success-data-pipeline.md

flagship project


Indie Game Success Data Pipeline

An end-to-end data engineering project analysing the commercial success of indie games on Steam through a reproducible multi-source pipeline.

Problem

The project set out to analyse what commercial success looks like for indie games on Steam using a pipeline that could handle multiple external signals and keep the final dataset analytically useful.

Data sources

The pipeline combines eight sources, including Steam data, Steam Charts scraping, Google Trends, Reddit sentiment, and additional metadata sources so the final dataset reflects both platform and external demand signals.

Architecture

The system follows a practical engineering structure: collect and scrape source data, standardise and process it in Spark, persist intermediate and final data in storage layers, then expose clean analytical outputs for downstream use.
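The collect → standardise → persist → expose flow can be sketched as a minimal pipeline skeleton. This is plain Python with hypothetical stage functions standing in for the real Spark jobs and storage writers; field names like `appid` are illustrative, not the project's actual schema:

```python
# Minimal sketch of the collect -> standardise -> persist -> expose flow.
# Stage names and record shapes are illustrative; the real project runs
# these steps in Spark against eight sources.

def collect(sources):
    """Gather raw records from each configured source callable."""
    return [record for source in sources for record in source()]

def standardise(records):
    """Normalise field names and drop records missing an app id."""
    cleaned = []
    for r in records:
        if r.get("appid") is None:
            continue
        cleaned.append({"appid": r["appid"], "name": r.get("name", "").strip()})
    return cleaned

def persist(records, store):
    """Append standardised records to an intermediate store."""
    store.extend(records)
    return store

def run_pipeline(sources, store):
    return persist(standardise(collect(sources)), store)

# Tiny usage example with fake source callables
steam = lambda: [{"appid": 1, "name": " Hades "}]
charts = lambda: [{"appid": None, "name": "bad row"}]
print(run_pipeline([steam, charts], []))
```

The point of the shape is that each stage takes and returns plain collections, so stages can be tested in isolation before being ported to Spark transformations.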

Pipeline / processing

Apache Spark handled the main processing workflow. Samuel focused on building and debugging the Spark pipeline so the data could move from raw source collection to a consistent, feature-ready dataset.

Storage

PostgreSQL and MongoDB were used as storage layers, with Parquet and DuckDB supporting the analytics layer. This gave the project a clear split between operational storage and analysis-friendly formats.

Analytics outputs

The result was a master dataset with roughly 62 features and a classification approach that grouped games into three success tiers using estimated owner-count thresholds of 20k, 200k, and 1M.
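A tiering rule like the one described can be sketched as a simple threshold function. The 20k / 200k / 1M cut points come from the project; the tier labels below are hypothetical placeholders:

```python
def success_tier(estimated_owners: int) -> str:
    """Map an estimated owner count onto a success tier.

    The 20k / 200k / 1M thresholds come from the project; the tier
    labels are illustrative placeholders.
    """
    if estimated_owners >= 1_000_000:
        return "tier_3"  # breakout success: 1M+ estimated owners
    if estimated_owners >= 200_000:
        return "tier_2"  # solid success: 200k to 1M
    if estimated_owners >= 20_000:
        return "tier_1"  # modest success: 20k to 200k
    return "below_threshold"

print(success_tier(350_000))  # tier_2
```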

Samuel's contribution

Samuel was responsible for Google Trends data collection, Steam Charts scraping, and substantial Spark pipeline debugging and build work; these were his direct, individually owned contributions within the wider group project.

Key takeaway

The strongest signal from this project is not a headline metric. It is the ability to build and stabilise a real multi-source pipeline that turns messy external data into a decision-ready analytical dataset.

schema summary

PostgreSQL: ~150,000 rows

Structured data, normalised dimensions, time series

11 data tables + 2 Spark output tables + 1 view

MongoDB: 2,432 docs / ~113,000 nested records

Nested reviews and posts with sub-documents

steam_reviews, reddit_posts

Parquet: 1,216 rows each

Columnar output for DuckDB queries

master_dataset, sentiment_scores, pre_launch_signals

premier-league-predictive-analytics.ipynb — jupyter — 96×30

samuel@portfolio:~/projects % cat premier-league-predictive-analytics.md

project 02


Premier League Predictive Analytics

A predictive analytics project built on 15 seasons of Premier League data, combining match prediction, bookmaker benchmarking, and season simulation.

Problem

The aim was to predict three-way match outcomes in the Premier League while testing the models against the strongest baseline in the space: bookmaker-implied probabilities.

Data sources

The data stack combines 15 seasons of match data with ClubElo ratings, transfer spending, expected goals data, and manager tenure information.

Architecture

The workflow runs from source ingestion and feature engineering into walk-forward training, probability calibration, ensemble blending, and season-level simulation.
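The walk-forward training scheme can be sketched as a generator that always trains on past seasons and tests on the next one, which is how the workflow avoids look-ahead leakage. Season labels here are illustrative:

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs in temporal order.

    Each fold trains on everything strictly before the test season,
    so no future information leaks into training.
    """
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

# Illustrative season labels
seasons = ["2019-20", "2020-21", "2021-22", "2022-23"]
for train, test in walk_forward_splits(seasons):
    print(train, "->", test)
```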

Pipeline / processing

The processing stage builds rolling form, goal difference, days rest, Elo momentum, shots-on-target, and implied probability features before training Logistic Regression, LightGBM, and MLP models with temporal decay weighting.
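Two of the listed features can be sketched in plain Python: rolling form over the previous n matches, and exponential temporal decay weights for training samples. The half-life value is a hypothetical placeholder, not the project's actual setting:

```python
def rolling_form(points, n=5):
    """Mean points from the previous n matches (excluding the current one)."""
    form = []
    for i in range(len(points)):
        window = points[max(0, i - n):i]
        form.append(sum(window) / len(window) if window else None)
    return form

def decay_weights(days_ago, half_life=365.0):
    """Exponential sample weights: a match half_life days old counts half."""
    return [0.5 ** (d / half_life) for d in days_ago]

# Illustrative match results: 3 points for a win, 1 for a draw, 0 for a loss
points = [3, 1, 0, 3, 3, 1]
print(rolling_form(points, n=3))
print(decay_weights([0, 365, 730]))
```

The `None` for the first match is deliberate: there is no prior form to compute, and a real pipeline would impute or drop those rows explicitly rather than silently fabricate a value.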

Storage

The project deliberately has no heavy storage layer: it is a reproducible analytics workflow, and its value lies in pipeline structure, feature design, and honest evaluation rather than infrastructure.

Analytics outputs

Outputs include calibration plots, feature importance, model comparison metrics, league table forecasts, and Dixon-Coles Monte Carlo simulations for full-season outcomes.
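The season-simulation layer can be gestured at with a minimal independent-Poisson match simulator. This omits the Dixon-Coles low-score correction and uses made-up scoring rates; it only shows the Monte Carlo shape of tallying outcomes over many simulated seasons:

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's algorithm for a Poisson draw (fine for small lambdas)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_season(fixtures, rates, n_sims=1000, seed=42):
    """Tally how often each team finishes top over n_sims simulated seasons."""
    rng = random.Random(seed)
    titles = {team: 0 for team in rates}
    for _ in range(n_sims):
        points = {team: 0 for team in rates}
        for home, away in fixtures:
            hg = poisson_sample(rates[home], rng)
            ag = poisson_sample(rates[away], rng)
            if hg > ag:
                points[home] += 3
            elif ag > hg:
                points[away] += 3
            else:
                points[home] += 1
                points[away] += 1
        # Ties on points go to the first team in dict order; a real
        # simulator would break ties on goal difference.
        winner = max(points, key=points.get)
        titles[winner] += 1
    return {team: wins / n_sims for team, wins in titles.items()}

# Hypothetical two-team league with made-up goal rates
rates = {"A": 1.8, "B": 1.1}
fixtures = [("A", "B"), ("B", "A")]
print(simulate_season(fixtures, rates))
```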

Samuel's contribution

Samuel's contribution sits across the full modelling workflow: assembling and cleaning inputs, engineering and testing features, evaluating models against bookmaker baselines, and presenting the results through reproducible figures and comparisons.

Key takeaway

The strongest signal from the project is not a claim of beating the market outright. It is the ability to build a disciplined predictive pipeline, benchmark it correctly, and identify where blending or simulation adds real analytical value.

geopolitical-crisis-narrative-tracker.sh — airflow — 104×32

samuel@portfolio:~/projects % cat geopolitical-crisis-narrative-tracker.md

project 03

Geopolitical Crisis Narrative Tracker

A polyglot data engineering system that ingests global news across ten sources, stores it across four database types, and exposes a LangChain agent that answers deep questions about how different countries and media outlets frame geopolitical crises.

Problem

The system addresses a question that matters in finance, intelligence, and journalism: how do BBC, Al Jazeera, Xinhua, and RT differ in covering the same conflict — and what do those patterns reveal about narrative, bias, and escalation risk?

Data sources

Ten sources feed the pipeline: GDELT DOC 2.0 and GKG APIs for global event data and tone scores, five RSS feeds (BBC, Al Jazeera, Reuters, Xinhua, RT) scraped via feedparser, Wikipedia for background context, ReliefWeb for UN humanitarian data, and UN Security Council records via BeautifulSoup.

Architecture

Airflow DAGs orchestrate ingestion on 15-minute, hourly, and daily schedules. Python ETL applies spaCy NER and entity resolution before routing records to the appropriate storage layer. OpenLineage instruments every transformation and emits provenance events to a Marquez lineage store.
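The routing step can be sketched as a dispatch function that sends each record type to its storage layer. Record shapes and route names here are illustrative; the real pipeline also runs spaCy NER and emits OpenLineage events, both omitted:

```python
def route_record(record):
    """Pick a storage layer for a record based on its declared kind.

    Mirrors the project's split: structured events go to PostgreSQL,
    raw articles to MongoDB, relationships to Neo4j, embeddings to
    ChromaDB. Kinds and field names are illustrative.
    """
    routes = {
        "gdelt_event": "postgresql",
        "article": "mongodb",
        "relationship": "neo4j",
        "embedding": "chromadb",
    }
    kind = record.get("kind")
    if kind not in routes:
        raise ValueError(f"no storage route for record kind: {kind!r}")
    return routes[kind]

print(route_record({"kind": "article", "outlet": "BBC"}))  # mongodb
```

Failing loudly on an unknown kind is the safer default here: a silently mis-routed record would corrupt lineage rather than surface as an error.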

Storage

Four databases handle different access patterns: PostgreSQL for structured GDELT event data and tone time-series, MongoDB for raw article documents with flexible per-outlet schemas, Neo4j for actor–event–source–country relationship graphs, and ChromaDB for sentence-transformer embeddings supporting semantic RAG retrieval.

Agent tools

The LangChain agent exposes six tools: semantic narrative search via ChromaDB, tone timeline queries against PostgreSQL, actor graph traversal via Neo4j Cypher, cross-outlet coverage comparison via MongoDB aggregation, live humanitarian context from ReliefWeb, and Wikipedia background lookup.
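A heavily simplified sketch of the tool layer: each tool is registered with a name and description, and the agent dispatches by name. This is a generic registry in plain Python, not the actual LangChain API; the tool names follow the list above and the return values are placeholders:

```python
# Generic tool registry sketch -- not the LangChain API.
TOOLS = {}

def tool(name, description):
    """Register a callable as a named agent tool."""
    def register(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("tone_timeline", "Query average tone over time from PostgreSQL")
def tone_timeline(query):
    # Placeholder: the real tool runs SQL against the tone time-series.
    return f"tone timeline for {query!r}"

@tool("narrative_search", "Semantic narrative search via ChromaDB")
def narrative_search(query):
    # Placeholder: the real tool embeds the query and searches ChromaDB.
    return f"narratives matching {query!r}"

def run_tool(name, query):
    return TOOLS[name]["fn"](query)

print(run_tool("tone_timeline", "ceasefire coverage"))
```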

API

FastAPI exposes the agent as a REST endpoint with an MCP interface, containerised in Docker Compose alongside all six backend services. The stack is deployable to GCP Cloud Run.

Key takeaway

The strongest signal from this project is the architectural judgement: each database was chosen for a specific access pattern, not for coverage. The tooling follows the problem, not the other way round.

marketing-analytics-segmentation.R — r — 80×26

samuel@portfolio:~/projects % cat marketing-analytics-segmentation.md

segmentation output — 3 clusters

  ●  ●●   ●        ▲  ▲▲             
 ●●●  ●  ●●       ▲ ▲ ▲▲  ▲          
  ●  ●●   ●      ▲▲  ▲  ▲▲           
   ●●  ●          ▲  ▲▲  ▲           
    ●  ●●                             
                       ■  ■■  ■       
                      ■■■ ■  ■■       
                       ■  ■■■  ■      
                      ■■  ■  ■■       

project 04

Marketing Analytics Segmentation

A segmentation-oriented analytics project focused on turning raw customer data into clearer groupings for decision-making.

Problem

The objective was to identify useful structure in customer data so that analysis could move from broad averages to more actionable groupings.

Data sources

The project works with customer or marketing data shaped into a form suitable for segmentation analysis and downstream interpretation.

Architecture

The project follows a clean analytics workflow: prepare the dataset, create segment-relevant features, test clustering or grouping logic, and present interpretable outputs.
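The clustering step can be sketched with a tiny one-dimensional Lloyd's k-means. The real project would use a library implementation over multiple features; this pure-Python version, with made-up spend values, only shows the assign-then-update loop, and k=3 matches the three segments shown above:

```python
def kmeans_1d(values, centroids, iters=20):
    """Lloyd's k-means on 1-D data with given initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to its cluster mean.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Three obvious 1-D groups (e.g. customer spend), illustrative only
spend = [10, 12, 11, 100, 105, 98, 500, 510, 495]
centroids, clusters = kmeans_1d(spend, centroids=[0, 200, 600])
print(sorted(round(c) for c in centroids))
```

Interpretability is the reason to start this simple: with one feature per axis, each centroid reads directly as "low / mid / high spend", which is exactly the decision-ready framing the project aims for.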

Pipeline / processing

The emphasis is on reproducible preparation and careful feature design so that segmentation results remain understandable rather than opaque.

Storage

Storage is deliberately lightweight here: the project's value lies in analytical structure rather than infrastructure complexity.

Analytics outputs

Outputs focus on segment definitions, comparative patterns between groups, and the practical use of those distinctions in analysis.

Samuel's contribution

Samuel's contribution centres on analytics design, feature work, and interpretation of the resulting segments, with no claims of dramatic commercial uplift.

Key takeaway

This project reinforces Samuel's ability to translate raw data into decision-ready structure, which aligns well with analytics engineering roles.

snapshot

  • Customer segmentation
  • Analytics workflow
  • Feature framing
  • Decision-ready outputs