The Indie Game Survival Guide: What Makes an Indie Game Succeed on Steam?¶
Team: Ana Marco Morales (25050808), Azada Imranzada Karimli (25133600), Samuel Brooks (25200964), Theo Chornogubsky (25191662), Orestis Metaxas (25204655)
Module: UCL MSIN0166 Data Engineering 2025/26
Repository Link https://github.com/orestismet99/indie_game_pipeline
Word Count 3984
1. Introduction to the Project¶
2. Pipeline Architecture¶
#pipeline architecture
from IPython.display import Image, display
display(Image('pipeline_notebook_images/pipeline_architecture.png', width=1600))
Storage justification¶
We use two storage systems because the data splits into two shapes.
- PostgreSQL stores structured, schema-consistent data: game metadata, IGDB ratings, SteamSpy stats, Google Trends time series, Steam Charts monthly history and Steam tags. These sources have predictable fields, need relational joins (e.g. joining games to IGDB on app_id) and benefit from normalisation. Steam tags are stored in a dedicated game_tags table with one row per game-tag pair and a vote count column, rather than as a comma-separated string. This enables the tag analysis in Q11 via a simple GROUP BY tag_name join, which would require string parsing if stored denormalised.
- MongoDB (NoSQL document store) stores nested, variable-structure documents: Steam reviews (each game has an array of reviews, each review contains an author sub-document with playtime and profile stats) and Reddit posts (each game has separate search_posts and subreddit_posts arrays, plus an optional subreddit_info sub-document that is null for 490 games without a dedicated subreddit). Flattening these into relational tables would require either wide sparse tables with many nullable columns or multiple join-heavy tables. The document model preserves the natural hierarchy and handles the variable structure without schema compromises.
3. Data Sources¶
Data was collected from eight sources spanning REST APIs, web scraping, and flat files, chosen to capture pricing, community engagement, sentiment, and discoverability signals across 1,216 indie games.
| # | Source | Type | Auth | Records | Key Fields |
|---|---|---|---|---|---|
| 1 | Kaggle Steam Dataset | Flat file (JSON) | None | 122,611 (filtered to 1,216) | app_id, name, price, peak_ccu, genres, tags |
| 2 | Steam Store API | REST API | None | 1,006 | descriptions, categories, platforms, metacritic |
| 3 | Steam Reviews API | REST API | None | 57,938 reviews | review text, voted_up, playtime, author stats |
| 4 | SteamSpy API | REST API | None | 1,216 | ownership estimates, playtime, CCU |
| 5 | Reddit JSON endpoints | Web scraping | None | 52,842 posts | post title, score, comments, subreddit info |
| 6 | IGDB via Twitch | REST API | OAuth2 | 1,150 | ratings, platforms, genres, game modes |
| 7 | Google Trends | pytrends | None | 1,216 | weekly search interest 2019–2024 |
| 8 | Steam Charts | Web scraping (BS4) | None | 40,875 data points | monthly player counts |
4. Schema Design¶
PostgreSQL (13 tables)¶
PostgreSQL follows a star-schema-inspired layout with games as the central table — every other table joins to it on app_id. We normalised tags into a dedicated game_tags table (one row per game-tag pair with a vote count) so that the tag analysis in Q11 works with a simple GROUP BY rather than string parsing. Similarly, igdb_platforms and igdb_genres are their own tables to avoid duplicating game-level data across rows. Time-series data from Google Trends and Steam Charts is stored as one row per time period per game.
MongoDB (2 collections)¶
The steam_reviews collection holds one document per game with an array of reviews inside it — each review has a nested author object containing playtime and profile info. reddit_data follows a similar pattern: one document per game with arrays for search posts and subreddit posts, plus a subreddit_info sub-document that's null for the ~490 games that don't have their own subreddit. Both collections use app_id so Spark can join them to the PostgreSQL data during processing.
PostgreSQL (13 tables)¶
All tables link back to a central games table (1,216 rows, one per game) via app_id as a foreign key.
games
The fact table. Stores name, release date, price, platform support (windows,mac,linux), review counts, playtime, ownership bracket, Metacritic score, achievements, and the success label. Every query starts here.steamspy_stats
Ownership estimates, median and average playtime, peak concurrent users, and pricing.igdb_metadata Stores
rating(user score) andaggregated_rating(critic score) from IGDB, independent of Steam. Spark renames these toigdb_critic_scoreandigdb_critic_countin the master dataset. Also stores game modes and player perspectives as PostgreSQL arrays, which Steam does not expose.steam_store
Store page metadata including supported languages, recommendations, price, and release date.google_trends (1,216 rows) Summary table with peak and average search interest per game.
google_trends_weekly (~69,000 rows) Weekly time series of search interest values per game, used by Spark for pre and post launch comparisons.
steam_charts (~40,000 rows) Monthly player counts scraped from Steam Charts, storing
avg_players,gain,gain_pct, andpeak_playersper month. Used to track whether a game held its audience over time.price_history Price snapshots from three sources collected at different dates. The primary key is a composite of (
app_id,snapshot_date,source), demonstrating how a multi-source fact table can track price changes over time.game_tags, igdb_platforms, igdb_genres
Each stores one row per game per value rather than comma separated strings. This makes SQL filtering exact and aggregation reliable. For example, counting how many successful games share a given tag becomes a straightforwardGROUP BYrather than aLIKEpattern match.sentiment_scores, pre_launch_signals
Schema only tables populated by Spark after processing. They store VADER sentiment scores, Reddit engagement metrics, and Google Trends activity split before and after each game's release date.
#postgres schema
display(Image('pipeline_notebook_images/postgres_schema.png', width=1200))
MongoDB (2 collections)¶
MongoDB holds one document per game across two collections. Reviews and Reddit posts are nested and vary in size per game, so storing them in PostgreSQL would require four extra tables and many nullable columns. Keeping them as documents is cleaner.
steam_reviews
Each document holds a review summary (total reviews, rating label) and an array of up to 100 individual reviews. Each review contains the text, a thumbs up or down flag, timestamps, and an author sub document with Steam ID, games owned, and playtime at the time of writing.reddit_posts
Each document holds asubreddit_infosub document (null if no dedicated subreddit exists), up to 25 general search posts, and up to 25 posts from the game’s own subreddit. Fields such ashas_subredditand post counts are stored at the top level so queries can filter on them without scanning the arrays.
.
#mongodb schema
display(Image('pipeline_notebook_images/mongodb_schema.png', width=1600))
#summary
display(Image('pipeline_notebook_images/schema_summary.png', width=1600))
5. Spark Pipeline¶
The Spark pipeline reads from both storage systems, transforms and joins the data, engineers analytical features, and writes the final dataset to Parquet for DuckDB to query.
Step 1 reads 11 PostgreSQL tables into Spark DataFrames via JDBC:
games,steamspy_stats,igdb_metadata,steam_store,google_trends,google_trends_weekly,steam_charts,game_tags,igdb_platforms,igdb_genresandprice_history. It then reads the two MongoDB collections (steam_reviewsandreddit_posts) using mongo-spark-connector 10.4.0. MongoDB documents are collected and converted usingrow.asDict(recursive=True)to handle the nested author sub-documents inside each review.Step 2 applies VADER (Valence Aware Dictionary and Sentiment Reasoner), a lexicon- and rule-based sentiment analysis tool designed for short informal text, to Steam review text and aggregates the results at game level, producing
avg_vader_compound,pct_positive_reviews,avg_playtime_at_reviewandnum_reviews_analysed. It also processes Reddit data to produceavg_reddit_score,total_reddit_posts,total_reddit_comments,has_dedicated_subredditandsubreddit_subscribers.Step 3 computes pre and post launch signals by joining the
google_trends_weeklytime series with each game'srelease_datefrom the games table. The release date acts as a pivot to split weekly interest values into pre-launch and post-launch averages. The pipeline also computestrend_direction(rising, falling or stable) anddays_to_peak_interest. Reddit posts are similarly split by timestamp relative to release date.Step 4 joins all 11 PostgreSQL tables plus the sentiment and pre-launch DataFrames on
app_idinto a single wide master DataFrame of 1,216 rows.Step 5 engineers 11 new features:
price_tier(5 brackets: free, under_5, 5_to_15, 15_to_30, premium_30+),review_positivity_ratio(positive / total reviews),steam_platform_count(Windows + Mac + Linux as integers),num_supported_languages(split count of the supported_languages string),playtime_retention_ratio(median playtime / average playtime, closer to 1 means more even retention),reddit_engagement(avg score multiplied by avg comments per post),has_franchise(boolean from IGDB),has_dedicated_subreddit. Ownership strings from SteamSpy (e.g. "200,000 .. 500,000") are parsed into numeric midpoints via a UDF to produceowners_estimate, with a fallback chain ofsteamspy_owners→estimated_owners→peak_ccu × 50. This drives a four-tiersuccess_tiercolumn (Breakout Hit ≥1M owners, Strong Success ≥200k, Moderate Success ≥20k, Underperformer <20k) and an integertier_rank(1–4) for ordering. The original binarysuccesscolumn is retained for backward compatibility.Step 6 writes the master dataset to Parquet partitioned by
success_tier, and writessentiment_scoresandpre_launch_signalsas separate Parquet files. A flat CSV (storage/master_dataset.csv) is also exported for notebook use. Sentiment and pre-launch tables are also written back to PostgreSQL so they are queryable via JDBC alongside the original source tables.
One key version constraint: PySpark 3.5.5 is required. Spark 4.x breaks mongo-spark-connector 10.4.0 due to API changes in the connector interface. An explicit StructType schema is also defined for MongoDB reads because auto schema inference fails on nested nullable fields.
# Load the master dataset from Parquet (Lakehouse pattern: Spark → Parquet → DuckDB)
import duckdb, pandas as pd
con = duckdb.connect()
con.execute("""
CREATE OR REPLACE VIEW games AS
SELECT * FROM read_parquet('../storage/master_dataset.parquet/**/*.parquet')
""")
df = con.execute("SELECT * FROM games").df()
print(f"Dataset: {df.shape[0]} games x {df.shape[1]} columns (from Parquet)")
df.head()
Dataset: 1216 games x 62 columns (from Parquet)
| app_id | name | release_date | required_age | price | dlc_count | short_description | windows | mac | linux | ... | review_positivity_ratio | price_tier | steam_platform_count | num_supported_languages | playtime_retention_ratio | reddit_engagement | has_franchise | owners_estimate | tier_rank | success_tier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4000 | Garry's Mod | Nov 29, 2006 | 0 | 4.99 | 0 | Garry's Mod is a physics sandbox. There aren't... | True | True | True | ... | 0.968 | under_5 | 3 | 31 | 0.135 | 237108.98 | False | 35000000.0 | 1 | Breakout Hit |
| 1 | 4920 | Natural Selection 2 | Oct 30, 2012 | 0 | 0.99 | 10 | A fast paced multiplayer shooter that pits ali... | True | False | False | ... | 0.853 | under_5 | 1 | 10 | 0.189 | 6617640.46 | False | 1500000.0 | 1 | Breakout Hit |
| 2 | 16900 | GROUND BRANCH | Aug 14, 2018 | 0 | 25.49 | 0 | From one of the developers behind the original... | True | False | False | ... | 0.893 | 15_to_30 | 1 | 0 | 0.415 | 858368.74 | False | 1500000.0 | 1 | Breakout Hit |
| 3 | 40700 | Machinarium | Oct 16, 2009 | 0 | 4.99 | 1 | Machinarium is the award-winning independent a... | True | True | False | ... | 0.950 | under_5 | 2 | 0 | 0.761 | 775480.33 | False | 1500000.0 | 1 | Breakout Hit |
| 4 | 40800 | Super Meat Boy | Nov 30, 2010 | 0 | 2.99 | 1 | The infamous, tough-as-nails platformer comes ... | True | True | True | ... | 0.946 | under_5 | 3 | 0 | 0.351 | 389284.71 | False | 3500000.0 | 1 | Breakout Hit |
5 rows × 62 columns
6. DuckDB Analysis¶
The following 12 queries run directly on Parquet files using DuckDB (Lakehouse pattern). Q11 queries PostgreSQL for normalised tag data.
import warnings
warnings.filterwarnings('ignore')
# con, df, and games view already created from Parquet in cell above
# ── Success tier thresholds (based on SteamSpy owners_estimate) ──
# Breakout Hit : 1,000,000+ owners — viral, culturally significant
# Strong Success : 200,000–999,999 — commercially healthy, healthy community
# Moderate Success : 20,000–199,999 — viable, recouped costs
# Underperformer : < 20,000 — failed to find audience
TIERS = ['Breakout Hit', 'Strong Success', 'Moderate Success', 'Underperformer']
TIER_COLORS = {
'Breakout Hit': '#7c3aed',
'Strong Success': '#2563eb',
'Moderate Success': '#16a34a',
'Underperformer': '#ef4444',
}
# ── Shorthand colour aliases & helper ─────────────────────────────────────────
C_BREAKOUT = '#7c3aed'
C_STRONG = '#2563eb'
C_MODERATE = '#16a34a'
C_UNDER = '#ef4444'
def clean_ax(ax, keep_left=True):
"""Strip chart chrome for the dark theme."""
for spine in ['top', 'right']:
ax.spines[spine].set_visible(False)
if not keep_left:
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_edgecolor(GRID)
if keep_left:
ax.spines['left'].set_edgecolor(GRID)
ax.set_facecolor(CARD)
ax.tick_params(colors=SUBTEXT)
print(f"Ready: {df.shape[0]} games x {df.shape[1]} columns (DuckDB → Parquet)")
print("\nTier breakdown (using owners_estimate from SteamSpy):")
for tier in TIERS:
if tier == 'Breakout Hit':
n = (df['owners_estimate'] >= 1_000_000).sum()
elif tier == 'Strong Success':
n = ((df['owners_estimate'] >= 200_000) & (df['owners_estimate'] < 1_000_000)).sum()
elif tier == 'Moderate Success':
n = ((df['owners_estimate'] >= 20_000) & (df['owners_estimate'] < 200_000)).sum()
else:
n = (df['owners_estimate'] < 20_000).sum()
print(f" {tier:20s}: {n:4d} games")
Ready: 1216 games x 62 columns (DuckDB → Parquet) Tier breakdown (using owners_estimate from SteamSpy): Breakout Hit : 368 games Strong Success : 490 games Moderate Success : 255 games Underperformer : 103 games
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
# ── Theme ─────────────────────────────────────────────────────────────────────
BG = '#0D1117'
CARD = '#161B22'
GRID = '#21262D'
TEXT = '#E6EDF3'
SUBTEXT = '#8B949E'
plt.rcParams.update({
'figure.facecolor': BG,
'axes.facecolor': CARD,
'axes.edgecolor': GRID,
'axes.labelcolor': TEXT,
'axes.titlepad': 16,
'text.color': TEXT,
'xtick.color': SUBTEXT,
'ytick.color': SUBTEXT,
'grid.color': GRID,
'grid.alpha': 0.4,
'font.family': 'sans-serif',
'font.size': 11,
'axes.titlesize': 15,
'axes.labelsize': 12,
'figure.titlesize': 18,
'legend.facecolor': CARD,
'legend.edgecolor': GRID,
'legend.fontsize': 10,
})
# Tier palette mapped to neon accents
tier_order = ['Breakout Hit', 'Strong Success', 'Moderate Success', 'Underperformer']
tier_colors = ['#D2A8FF', '#58A6FF', '#7EE787', '#FF7B72'] # purple, blue, green, red
tier_counts = df['success_tier'].value_counts().reindex(tier_order)
fig, axes = plt.subplots(1, 2, figsize=(14, 5), facecolor=BG)
fig.patch.set_facecolor(BG)
# ── Left: horizontal bar chart ─────────────────────────────────────────────
bars = axes[0].barh(
tier_order[::-1], tier_counts[tier_order[::-1]],
color=tier_colors[::-1], edgecolor=BG, linewidth=1.5
)
for bar, count in zip(bars, tier_counts[tier_order[::-1]]):
axes[0].text(
bar.get_width() + 8, bar.get_y() + bar.get_height() / 2,
f'{count:,} ({count/1216*100:.1f}%)',
va='center', fontsize=10, color=SUBTEXT
)
axes[0].set_xlim(0, 1400)
axes[0].set_xlabel('Number of games', fontsize=11, color=SUBTEXT)
axes[0].set_title('Games per success tier', fontsize=13, fontweight='600', color=TEXT, pad=14)
axes[0].tick_params(axis='y', labelsize=11, colors=TEXT)
axes[0].tick_params(axis='x', colors=SUBTEXT)
axes[0].set_facecolor(CARD)
for spine in axes[0].spines.values():
spine.set_edgecolor(GRID)
axes[0].spines['left'].set_visible(False)
axes[0].spines['top'].set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].yaxis.set_tick_params(length=0)
axes[0].grid(True, axis='x', alpha=0.2, color=GRID)
# ── Right: ownership KDE distribution ─────────────────────────────────────
for tier, color in zip(tier_order, tier_colors):
vals = df.loc[df.success_tier == tier, 'owners_estimate'].dropna()
vals = vals[vals > 0]
if len(vals) > 1:
sns.kdeplot(np.log10(vals), ax=axes[1], color=color, fill=True,
alpha=0.25, linewidth=2, label=tier)
# Threshold lines
axes[1].axvline(np.log10(20_000), color='#7EE787', ls='--', lw=1.2, alpha=0.5)
axes[1].axvline(np.log10(200_000), color='#58A6FF', ls='--', lw=1.2, alpha=0.5)
axes[1].axvline(np.log10(1_000_000), color='#D2A8FF', ls='--', lw=1.2, alpha=0.5)
axes[1].set_xticks([3, 4, np.log10(20_000), 5, np.log10(200_000), 6, np.log10(1_000_000), 7])
axes[1].set_xticklabels(['1k', '10k', '20k', '100k', '200k', '1M', '1M', '10M'],
fontsize=9, color=SUBTEXT)
axes[1].set_xlabel('Estimated owners (log scale)', fontsize=11, color=SUBTEXT)
axes[1].set_yticks([])
axes[1].set_title('Ownership distribution across tiers', fontsize=13, fontweight='600',
color=TEXT, pad=14)
axes[1].legend(fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID, labelcolor=TEXT)
axes[1].set_facecolor(CARD)
for spine in axes[1].spines.values():
spine.set_edgecolor(GRID)
axes[1].spines['left'].set_visible(False)
axes[1].spines['top'].set_visible(False)
axes[1].spines['right'].set_visible(False)
fig.suptitle(
'Q1 · Four-Tier Success Landscape: 1,216 Indie Games on Steam',
fontsize=14, fontweight='700', color=TEXT, y=1.02
)
plt.tight_layout()
plt.show()
Strong Success dominates at 40.3%, followed by Breakout Hit (30.3%), Moderate Success (21%), and Underperformer (8.5%). The low proportion of Underperformers reflects our CCU ≥ 50 filter, which excludes the most dormant games. The density plot shows each tier forms a distinct cluster with clear separation at the 20k, 200k and 1M thresholds. The Breakout Hit distribution extends past 10M owners, driven by outliers like Rust and Stardew Valley, with minimal overlap between tiers validating owners_estimate as a reliable classification signal.
# Q2: Price bracket vs success tier — cross-tabulation.
# For each price tier, shows the breakdown across all four success tiers,
# revealing whether price is a differentiator between Breakout Hits and Underperformers
# or whether it flattens out once you control for the middle tiers.
q2 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT price_tier,
COUNT(*) AS total_games,
SUM(CASE WHEN success_tier='Breakout Hit' THEN 1 ELSE 0 END) AS breakout_hits,
SUM(CASE WHEN success_tier='Strong Success' THEN 1 ELSE 0 END) AS strong_successes,
SUM(CASE WHEN success_tier='Moderate Success' THEN 1 ELSE 0 END) AS moderate_successes,
SUM(CASE WHEN success_tier='Underperformer' THEN 1 ELSE 0 END) AS underperformers,
ROUND(SUM(CASE WHEN success_tier='Breakout Hit' THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS pct_breakout,
ROUND(AVG(peak_ccu), 0) AS avg_peak_ccu,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered WHERE price_tier IS NOT NULL
GROUP BY price_tier ORDER BY avg_owners DESC
""").df()
display(q2)
| price_tier | total_games | breakout_hits | strong_successes | moderate_successes | underperformers | pct_breakout | avg_peak_ccu | avg_owners | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | free | 157 | 63.0 | 62.0 | 20.0 | 12.0 | 40.1 | 2353.0 | 2920478.0 |
| 1 | 15_to_30 | 181 | 44.0 | 74.0 | 51.0 | 12.0 | 24.3 | 2613.0 | 1944116.0 |
| 2 | under_5 | 352 | 134.0 | 126.0 | 65.0 | 27.0 | 38.1 | 1079.0 | 1708139.0 |
| 3 | premium_30+ | 10 | 2.0 | 5.0 | 2.0 | 1.0 | 20.0 | 2685.0 | 1214500.0 |
| 4 | 5_to_15 | 516 | 125.0 | 223.0 | 117.0 | 51.0 | 24.2 | 850.0 | 1214205.0 |
# ── Q2 Chart: Price Tier × Success Tier breakdown (revised) ──────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
price_order = ['free', 'under_5', '5_to_15', '15_to_30', 'premium_30+']
tier_label_map = {
'free': 'Free', 'under_5': 'Under $5',
'5_to_15': '$5–15', '15_to_30': '$15–30', 'premium_30+': '$30+'
}
q2_plot = q2[q2['price_tier'].isin(price_order)].copy()
q2_plot['price_tier'] = pd.Categorical(q2_plot['price_tier'], categories=price_order, ordered=True)
q2_plot = q2_plot.sort_values('price_tier')
q2_plot['label'] = q2_plot['price_tier'].map(tier_label_map)
labels = q2_plot['label'].tolist()
totals = q2_plot['total_games'].values
breakouts = q2_plot['breakout_hits'].values
strongs = q2_plot['strong_successes'].values
moderates = q2_plot['moderate_successes'].values
unders = q2_plot['underperformers'].values
pct_b = q2_plot['pct_breakout'].values
avg_own = q2_plot['avg_owners'].values / 1e6
pct_break = breakouts / totals * 100
pct_strong = strongs / totals * 100
pct_mod = moderates / totals * 100
pct_under = unders / totals * 100
x = np.arange(len(labels))
W = 0.55
# ── Layout: title row + 3 panels + legend row ─────────────────────────────────
fig = plt.figure(figsize=(18, 7), facecolor=BG)
# Leave room at bottom for legend
gs = fig.add_gridspec(2, 3, height_ratios=[1, 0.07], hspace=0.18, wspace=0.32,
left=0.05, right=0.97, top=0.88, bottom=0.14)
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[0, 2])
fig.suptitle('Q2 · Price Tier × Success Tier — Where Do Breakout Hits Come From?',
fontsize=15, fontweight='700', color=TEXT, y=0.97)
# ── Panel 1: Stacked bar ──────────────────────────────────────────────────────
b1 = ax1.bar(x, pct_break, width=W, color=C_BREAKOUT, alpha=0.92, edgecolor=BG, label='Breakout Hit')
b2 = ax1.bar(x, pct_strong, width=W, bottom=pct_break, color=C_STRONG, alpha=0.92, edgecolor=BG, label='Strong Success')
b3 = ax1.bar(x, pct_mod, width=W, bottom=pct_break+pct_strong, color=C_MODERATE, alpha=0.92, edgecolor=BG, label='Moderate Success')
b4 = ax1.bar(x, pct_under, width=W, bottom=pct_break+pct_strong+pct_mod, color=C_UNDER, alpha=0.92, edgecolor=BG, label='Underperformer')
# Annotate breakout % inside bar, only if segment tall enough
for i, pb in enumerate(pct_break):
if pb > 8:
ax1.text(i, pb / 2, f'{pb:.0f}%', ha='center', va='center',
fontsize=10, fontweight='bold', color=BG)
# n= totals above bars
for i, tot in enumerate(totals):
ax1.text(i, 103, f'n={tot}', ha='center', va='bottom', fontsize=8, color=SUBTEXT)
ax1.set_xticks(x); ax1.set_xticklabels(labels, fontsize=10)
ax1.set_ylabel('% of games', color=SUBTEXT, fontsize=10)
ax1.set_title('Tier breakdown per price band', fontsize=12, color=TEXT, pad=12)
ax1.set_ylim(0, 116)
ax1.grid(True, axis='y', alpha=0.15)
clean_ax(ax1)
# ── Panel 2: Breakout rate line ───────────────────────────────────────────────
ax2.plot(x, pct_b, color=C_BREAKOUT, linewidth=3, marker='o', markersize=10,
markerfacecolor=C_BREAKOUT, markeredgecolor=BG, markeredgewidth=2, zorder=5)
ax2.fill_between(x, pct_b, alpha=0.12, color=C_BREAKOUT)
for i, v in enumerate(pct_b):
offset = 14 if i != len(pct_b)-1 else -20
ax2.annotate(f'{v:.1f}%', (i, v), textcoords='offset points',
xytext=(0, offset), ha='center', fontsize=11,
fontweight='bold', color=C_BREAKOUT)
ax2.set_xticks(x); ax2.set_xticklabels(labels, fontsize=10)
ax2.set_ylabel('% Breakout Hits', color=SUBTEXT, fontsize=10)
ax2.set_title('Breakout hit rate by price', fontsize=12, color=TEXT, pad=12)
ax2.set_ylim(0, max(pct_b) * 1.55)
ax2.grid(True, axis='y', alpha=0.15)
clean_ax(ax2)
# ── Panel 3: Avg owners ───────────────────────────────────────────────────────
max_idx = np.argmax(avg_own)
bar_colors = [C_BREAKOUT if i == max_idx else '#2D333B' for i in range(len(avg_own))]
bars = ax3.bar(x, avg_own, color=bar_colors, alpha=0.9, edgecolor=BG, width=W)
for bar, val in zip(bars, avg_own):
ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.04,
f'{val:.2f}M', ha='center', va='bottom',
fontsize=10, fontweight='bold', color=TEXT)
ax3.set_xticks(x); ax3.set_xticklabels(labels, fontsize=10)
ax3.set_ylabel('Avg estimated owners (M)', color=SUBTEXT, fontsize=10)
ax3.set_title('Avg ownership by price tier', fontsize=12, color=TEXT, pad=12)
ax3.grid(True, axis='y', alpha=0.15)
clean_ax(ax3)
# ── Legend: single row below all panels ──────────────────────────────────────
legend_patches = [
mpatches.Patch(color=C_BREAKOUT, label='Breakout Hit (≥1M owners)'),
mpatches.Patch(color=C_STRONG, label='Strong Success (200k–1M)'),
mpatches.Patch(color=C_MODERATE, label='Moderate Success (20k–200k)'),
mpatches.Patch(color=C_UNDER, label='Underperformer (<20k)'),
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID,
bbox_to_anchor=(0.5, 0.01), handlelength=1.2, handleheight=1.0,
columnspacing=2.0)
plt.show()
Free and sub-$5 games achieve breakout rates of 40.1% and 38.1%, roughly double that of higher-priced tiers. Free games average 2.92M owners, highest across all tiers, though this likely includes wishlist adds and impulse downloads rather than genuine engagement. The $15–$30 tier is notable: despite a lower breakout rate (24.3%), it achieves the second-highest average ownership at 1.94M, suggesting games in this bracket attract more committed audiences. Pricing maximises reach but does not guarantee depth of engagement.
# Q3: Platform count (from IGDB) vs success tier.
# Breaks down the four success tiers by number of platforms available,
# allowing us to see if Breakout Hits disproportionately appear on more platforms.
q3 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT num_platforms, COUNT(*) AS num_games,
SUM(CASE WHEN success_tier='Breakout Hit' THEN 1 ELSE 0 END) AS breakout_hits,
SUM(CASE WHEN success_tier='Strong Success' THEN 1 ELSE 0 END) AS strong_successes,
SUM(CASE WHEN success_tier='Moderate Success' THEN 1 ELSE 0 END) AS moderate_successes,
SUM(CASE WHEN success_tier='Underperformer' THEN 1 ELSE 0 END) AS underperformers,
ROUND(AVG(igdb_critic_score), 1) AS avg_critic_score,
ROUND(AVG(owners_estimate), 0) AS avg_owners,
ROUND(AVG(peak_ccu), 0) AS avg_peak_ccu
FROM tiered WHERE num_platforms IS NOT NULL
GROUP BY num_platforms ORDER BY num_platforms
""").df()
display(q3)
| num_platforms | num_games | breakout_hits | strong_successes | moderate_successes | underperformers | avg_critic_score | avg_owners | avg_peak_ccu | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 3.0 | 4.0 | 1.0 | 0.0 | NaN | 1141875.0 | 236.0 |
| 1 | 1 | 421 | 79.0 | 173.0 | 120.0 | 49.0 | 79.5 | 996188.0 | 1030.0 |
| 2 | 2 | 122 | 28.0 | 48.0 | 32.0 | 14.0 | 79.1 | 1143852.0 | 1801.0 |
| 3 | 3 | 204 | 64.0 | 80.0 | 42.0 | 18.0 | 76.1 | 1488995.0 | 1141.0 |
| 4 | 4 | 91 | 25.0 | 44.0 | 17.0 | 5.0 | 74.2 | 1905769.0 | 1338.0 |
| 5 | 5 | 89 | 40.0 | 35.0 | 13.0 | 1.0 | 79.8 | 2632135.0 | 1192.0 |
| 6 | 6 | 90 | 42.0 | 42.0 | 5.0 | 1.0 | 79.9 | 2551889.0 | 1362.0 |
| 7 | 7 | 48 | 24.0 | 17.0 | 6.0 | 1.0 | 79.7 | 3025938.0 | 1572.0 |
| 8 | 8 | 41 | 24.0 | 14.0 | 3.0 | 0.0 | 81.0 | 2137805.0 | 754.0 |
| 9 | 9 | 20 | 8.0 | 10.0 | 2.0 | 0.0 | 82.2 | 1890000.0 | 698.0 |
| 10 | 10 | 7 | 4.0 | 2.0 | 0.0 | 1.0 | 83.4 | 7001429.0 | 6425.0 |
| 11 | 11 | 5 | 4.0 | 1.0 | 0.0 | 0.0 | 81.7 | 11470000.0 | 10459.0 |
| 12 | 12 | 2 | 1.0 | 0.0 | 1.0 | 0.0 | 82.0 | 1825000.0 | 251.0 |
| 13 | 15 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 81.0 | 35000000.0 | 24580.0 |
| 14 | 16 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 84.3 | 1500000.0 | 148.0 |
# ── Q3 Chart: Platform Count × Success Tier (v3) ─────────────────────────────
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
BG = '#0D1117'
CARD = '#161B22'
GRID = '#21262D'
TEXT = '#E6EDF3'
SUBTEXT = '#8B949E'
C_BREAKOUT = '#D2A8FF'
C_STRONG = '#58A6FF'
C_MODERATE = '#7EE787'
C_UNDER = '#FF7B72'
plt.rcParams.update({
'figure.facecolor': BG, 'axes.facecolor': CARD,
'axes.edgecolor': GRID, 'axes.labelcolor': TEXT,
'text.color': TEXT, 'xtick.color': SUBTEXT, 'ytick.color': SUBTEXT,
'grid.color': GRID, 'grid.alpha': 0.4, 'font.family': 'sans-serif',
'font.size': 11, 'legend.facecolor': CARD, 'legend.edgecolor': GRID,
})
def clean_ax(ax):
ax.set_facecolor(CARD)
for sp in ['top', 'right', 'left']:
ax.spines[sp].set_visible(False)
ax.spines['bottom'].set_edgecolor(GRID)
ax.tick_params(colors=SUBTEXT)
# ── Prep ──────────────────────────────────────────────────────────────────────
q3_plot = q3[(q3['num_games'] >= 10) & (q3['num_platforms'] <= 8)].copy()
q3_plot = q3_plot.sort_values('num_platforms')
plat = q3_plot['num_platforms'].astype(int).tolist()
totals = q3_plot['num_games'].values
avg_own = q3_plot['avg_owners'].values / 1e6
avg_crit = q3_plot['avg_critic_score'].fillna(0).values
pct_break = q3_plot['breakout_hits'].values / totals * 100
pct_strong = q3_plot['strong_successes'].values / totals * 100
pct_mod = q3_plot['moderate_successes'].values / totals * 100
pct_under = q3_plot['underperformers'].values / totals * 100
x = np.arange(len(plat))
xlabs = [str(p) for p in plat]
W = 0.6
# ── Figure ────────────────────────────────────────────────────────────────────
fig = plt.figure(figsize=(18, 7), facecolor=BG)
gs = fig.add_gridspec(1, 2, wspace=0.35, left=0.05, right=0.95, top=0.85, bottom=0.14)
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
fig.suptitle('Q3 · Platform Count vs Success — Does Being Everywhere Pay Off?',
fontsize=15, fontweight='700', color=TEXT, y=1.04)
# ── Panel 1: Stacked % bars ───────────────────────────────────────────────────
b_under = ax1.bar(x, pct_under, width=W, color=C_UNDER, alpha=0.92, edgecolor=BG)
b_mod = ax1.bar(x, pct_mod, width=W, bottom=pct_under, color=C_MODERATE, alpha=0.92, edgecolor=BG)
b_str = ax1.bar(x, pct_strong, width=W, bottom=pct_under+pct_mod, color=C_STRONG, alpha=0.92, edgecolor=BG)
b_brk = ax1.bar(x, pct_break, width=W, bottom=pct_under+pct_mod+pct_strong, color=C_BREAKOUT, alpha=0.92, edgecolor=BG)
for i, (pb, pu, pm, ps) in enumerate(zip(pct_break, pct_under, pct_mod, pct_strong)):
base = pu + pm + ps
if pb > 7:
ax1.text(i, base + pb / 2, f'{pb:.0f}%',
ha='center', va='center', fontsize=9, fontweight='bold', color=BG)
ax1.text(i, 102, f'n={totals[i]}', ha='center', va='bottom', fontsize=8, color=SUBTEXT)
ax1.set_xticks(x)
ax1.set_xticklabels(xlabs, fontsize=10)
ax1.set_xlabel('Number of platforms (IGDB)', color=SUBTEXT, fontsize=10)
ax1.set_ylabel('% of games', color=SUBTEXT, fontsize=10)
ax1.set_title('Success tier mix by platform count', fontsize=12, color=TEXT, pad=12)
ax1.set_ylim(0, 116)
ax1.grid(True, axis='y', alpha=0.15)
clean_ax(ax1)
# ── Panel 2: Clean line chart — platforms vs avg ownership + breakout % ───────
ax2b = ax2.twinx()
bar_cols = [C_BREAKOUT if o == avg_own.max() else '#2D333B' for o in avg_own]
ax2.bar(x, avg_own, width=W, color=bar_cols, alpha=0.75, edgecolor=BG, zorder=2)
ax2b.plot(x, pct_break, color=C_BREAKOUT, linewidth=2.5, marker='o', markersize=8,
markerfacecolor=C_BREAKOUT, markeredgecolor=BG, markeredgewidth=2, zorder=5)
ax2b.fill_between(x, pct_break, alpha=0.08, color=C_BREAKOUT)
for i, v in enumerate(avg_own):
ax2.text(i, v + 0.05, f'{v:.1f}M', ha='center', va='bottom',
fontsize=9, fontweight='bold', color=TEXT)
for i, v in enumerate(pct_break):
offset = 12 if i % 2 == 0 else -18
ax2b.annotate(f'{v:.0f}%', (i, v), textcoords='offset points',
xytext=(0, offset), ha='center', fontsize=9,
fontweight='bold', color=C_BREAKOUT)
ax2.set_xticks(x)
ax2.set_xticklabels(xlabs, fontsize=10)
ax2.set_xlabel('Number of platforms (IGDB)', color=SUBTEXT, fontsize=10)
ax2.set_ylabel('Avg estimated owners (M)', color=SUBTEXT, fontsize=10)
ax2b.set_ylabel('Breakout hit rate (%)', color=C_BREAKOUT, fontsize=10)
ax2b.tick_params(axis='y', colors=C_BREAKOUT)
ax2b.spines['right'].set_edgecolor(C_BREAKOUT)
ax2.set_title('Avg ownership & breakout rate\nby number of platforms', fontsize=12, color=TEXT, pad=12)
ax2.grid(True, axis='y', alpha=0.15)
clean_ax(ax2)
ax2.spines['right'].set_visible(True)
ax2.spines['right'].set_edgecolor(GRID)
# ── Shared legend ─────────────────────────────────────────────────────────────
legend_patches = [
mpatches.Patch(color=C_BREAKOUT, label='Breakout Hit (≥1M owners)'),
mpatches.Patch(color=C_STRONG, label='Strong Success (200k–1M)'),
mpatches.Patch(color=C_MODERATE, label='Moderate Success (20k–200k)'),
mpatches.Patch(color=C_UNDER, label='Underperformer (<20k)'),
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID,
bbox_to_anchor=(0.5, 0.01), handlelength=1.2, columnspacing=2.0)
plt.show()
Platform count shows a strong positive relationship with success. Single-platform games achieve a 19% breakout rate, while games on seven or eight platforms reach 50% and 59% respectively. Average ownership rises from 1.0M for single-platform games to 3.0M for seven-platform titles. Causality is unclear as successful games may attract ports rather than ports driving success, but multi-platform presence is one of the strongest indicators of breakout status in the dataset.
# Q4: VADER review sentiment vs success tier.
# Shows whether review sentiment separates the four tiers — particularly whether
# Breakout Hits attract qualitatively different language than Moderate Successes.
# avg_vader_compound is the mean VADER compound score across all English Steam reviews
# for each game, computed by the Spark pipeline UDF.
q4 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT success_tier, tier_rank,
COUNT(*) AS num_games,
ROUND(AVG(avg_vader_compound), 4) AS avg_vader_score,
ROUND(AVG(pct_positive_reviews), 1) AS avg_pct_positive,
ROUND(AVG(avg_playtime_at_review), 0) AS avg_playtime_at_review_mins,
ROUND(AVG(num_reviews_analysed), 0) AS avg_reviews_per_game,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered
WHERE avg_vader_compound IS NOT NULL
GROUP BY success_tier, tier_rank
ORDER BY tier_rank
""").df()
display(q4)
| success_tier | tier_rank | num_games | avg_vader_score | avg_pct_positive | avg_playtime_at_review_mins | avg_reviews_per_game | avg_owners | |
|---|---|---|---|---|---|---|---|---|
| 0 | Breakout Hit | 1 | 368 | 0.4290 | 83.2 | 7490.0 | 74.0 | 4769022.0 |
| 1 | Strong Success | 2 | 490 | 0.4147 | 81.2 | 5721.0 | 45.0 | 547551.0 |
| 2 | Moderate Success | 3 | 254 | 0.4525 | 81.2 | 4295.0 | 27.0 | 101811.0 |
| 3 | Underperformer | 4 | 100 | 0.4385 | 81.4 | 3950.0 | 17.0 | 10000.0 |
# ── Q4 Chart: VADER Sentiment × Success Tier (v2) ────────────────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
q4_plot = q4.set_index('success_tier').reindex(tier_order).reset_index()
vader = q4_plot['avg_vader_score'].values.astype(float)
pct_pos = q4_plot['avg_pct_positive'].values.astype(float)
playtime = q4_plot['avg_playtime_at_review_mins'].values.astype(float)
n_reviews = q4_plot['avg_reviews_per_game'].values.astype(float)
x = np.arange(4)
W = 0.5
short_labels = ['Breakout', 'Strong', 'Moderate', 'Under']
bar_colors = [TIER_COLORS[t] for t in tier_order]
# ── Figure ────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(17, 5.5), facecolor=BG)
fig.subplots_adjust(left=0.06, right=0.96, top=0.84, bottom=0.18, wspace=0.4)
fig.suptitle('Q4 · Review Sentiment — Do Players Feel Differently About Breakout Hits?',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Panel 1: VADER score ──────────────────────────────────────────────────────
ax = axes[0]
bars = ax.bar(x, vader, width=W, color=bar_colors, alpha=0.88, edgecolor=BG, zorder=3)
ax.axhline(0, color=SUBTEXT, linewidth=1, linestyle='--', alpha=0.4, zorder=2)
vmin = min(vader) * 0.9
vmax = max(vader) * 1.18
ax.set_ylim(vmin, vmax)
for bar, val in zip(bars, vader):
ax.text(bar.get_x() + bar.get_width()/2, val + (vmax - vmin) * 0.02,
f'{val:.3f}', ha='center', va='bottom',
fontsize=11, fontweight='bold', color=TEXT)
ax.set_xticks(x)
ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg VADER compound score', fontsize=10)
ax.set_title('VADER sentiment score', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15, zorder=0)
clean_ax(ax)
# ── Panel 2: % Positive reviews ───────────────────────────────────────────────
ax = axes[1]
bars2 = ax.bar(x, pct_pos, width=W, color=bar_colors, alpha=0.88, edgecolor=BG, zorder=3)
pmin = min(pct_pos) * 0.92
pmax = max(pct_pos) * 1.12
ax.set_ylim(pmin, pmax)
for bar, val in zip(bars2, pct_pos):
ax.text(bar.get_x() + bar.get_width()/2, val + (pmax - pmin) * 0.02,
f'{val:.1f}%', ha='center', va='bottom',
fontsize=11, fontweight='bold', color=TEXT)
ax.set_xticks(x)
ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg % positive reviews', fontsize=10)
ax.set_title('Steam review positivity', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15, zorder=0)
clean_ax(ax)
# ── Panel 3: Playtime at review (hrs) + review volume line ───────────────────
ax = axes[2]
ax2 = ax.twinx()
playtime_hrs = playtime / 60
bars3 = ax.bar(x, playtime_hrs, width=W, color=bar_colors, alpha=0.75, edgecolor=BG, zorder=3)
tmin = 0
tmax = max(playtime_hrs) * 1.3
ax.set_ylim(tmin, tmax)
for bar, val in zip(bars3, playtime_hrs):
ax.text(bar.get_x() + bar.get_width()/2, val + tmax * 0.02,
f'{val:.0f}h', ha='center', va='bottom',
fontsize=11, fontweight='bold', color=TEXT)
ax2.plot(x, n_reviews, color='#FFA657', linewidth=2.5, marker='o', markersize=8,
markerfacecolor='#FFA657', markeredgecolor=BG, markeredgewidth=2, zorder=5)
for i, val in enumerate(n_reviews):
offset = 12 if i < 2 else -18
ax2.annotate(f'{int(val)}', (i, val), textcoords='offset points',
xytext=(0, offset), ha='center', fontsize=10,
fontweight='bold', color='#FFA657')
ax.set_xticks(x)
ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg playtime at review (hrs)', fontsize=10)
ax2.set_ylabel('Avg reviews per game', color='#FFA657', fontsize=10)
ax2.tick_params(axis='y', colors='#FFA657')
ax2.spines['right'].set_edgecolor('#FFA657')
ax2.spines['top'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax.set_title('Playtime at review & review volume', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15, zorder=0)
clean_ax(ax)
ax.spines['right'].set_visible(True)
ax.spines['right'].set_edgecolor(GRID)
# ── Shared legend ─────────────────────────────────────────────────────────────
legend_patches = [
mpatches.Patch(color=C_BREAKOUT, label='Breakout Hit (≥1M owners)'),
mpatches.Patch(color=C_STRONG, label='Strong Success (200k–1M)'),
mpatches.Patch(color=C_MODERATE, label='Moderate Success (20k–200k)'),
mpatches.Patch(color=C_UNDER, label='Underperformer (<20k)'),
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID,
bbox_to_anchor=(0.5, 0.01), handlelength=1.2, columnspacing=2.0)
plt.show()
Review sentiment is very similar across tiers. VADER scores range from 0.415 to 0.453, and Steam review positivity varies only from 81.4% to 83.2% and differences are too small to be predictive. The real divergence is in playtime depth and review volume. Breakout Hit players average 125 hours before reviewing versus 66 for Underperformers, and generate 74 reviews per game versus 17. What distinguishes breakout games is not that players feel more positively, but that they play substantially longer and engage more actively.
# Q5: Dedicated subreddit presence vs success tier.
# A 4-tier breakdown shows whether the community-building effect is gradual
# (Moderate Success → Strong Success → Breakout Hit) or a binary jump.
# has_dedicated_subreddit was computed from Reddit JSON endpoint data.
q5 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT success_tier, tier_rank,
CAST(has_dedicated_subreddit AS VARCHAR) AS has_subreddit,
COUNT(*) AS num_games,
ROUND(AVG(subreddit_subscribers), 0) AS avg_subscribers,
ROUND(AVG(total_reddit_posts), 0) AS avg_posts,
ROUND(AVG(reddit_engagement), 1) AS avg_engagement,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered
GROUP BY success_tier, tier_rank, has_dedicated_subreddit
ORDER BY tier_rank, has_dedicated_subreddit DESC
""").df()
display(q5)
| success_tier | tier_rank | has_subreddit | num_games | avg_subscribers | avg_posts | avg_engagement | avg_owners | |
|---|---|---|---|---|---|---|---|---|
| 0 | Breakout Hit | 1 | true | 280 | 97599.0 | 49.0 | 1727092.9 | 5350000.0 |
| 1 | Breakout Hit | 1 | false | 88 | NaN | 47.0 | 1117135.4 | 2920455.0 |
| 2 | Strong Success | 2 | true | 303 | 8100.0 | 48.0 | 1169413.3 | 562541.0 |
| 3 | Strong Success | 2 | false | 187 | NaN | 43.0 | 1441523.4 | 523262.0 |
| 4 | Moderate Success | 3 | true | 118 | 2561.0 | 48.0 | 1048253.0 | 104449.0 |
| 5 | Moderate Success | 3 | false | 137 | NaN | 39.0 | 884266.4 | 99891.0 |
| 6 | Underperformer | 4 | true | 25 | 7829.0 | 45.0 | 724093.7 | 10000.0 |
| 7 | Underperformer | 4 | false | 78 | NaN | 38.0 | 1301657.3 | 10000.0 |
# ── Q5 Chart: Subreddit Presence × Success Tier (v3) ─────────────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
has_sub = q5[q5['has_subreddit'] == 'true'].set_index('success_tier').reindex(tier_order)
no_sub = q5[q5['has_subreddit'] == 'false'].set_index('success_tier').reindex(tier_order)
total_per_tier = has_sub['num_games'].fillna(0).values + no_sub['num_games'].fillna(0).values
pct_has_sub = has_sub['num_games'].fillna(0).values / total_per_tier * 100
# Subscribers: log scale to handle Breakout Hit dominance
avg_subs_raw = has_sub['avg_subscribers'].fillna(0).values
avg_posts = has_sub['avg_posts'].fillna(0).values
x = np.arange(4)
W = 0.5
bar_colors = [TIER_COLORS[t] for t in tier_order]
# ── Figure ────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(17, 5.5), facecolor=BG)
fig.subplots_adjust(left=0.06, right=0.96, top=0.84, bottom=0.18, wspace=0.4)
fig.suptitle('Q5 · Reddit Community — Does Having a Subreddit Signal Success?',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Panel 1: % with subreddit — dot + bar combo ───────────────────────────────
ax = axes[0]
# Background 100% track
ax.barh(x, [100]*4, height=W, color=GRID, alpha=0.5, edgecolor='none')
# Filled portion = has subreddit
bars = ax.barh(x, pct_has_sub, height=W, color=bar_colors, alpha=0.88, edgecolor='none')
for i, (pct, tot) in enumerate(zip(pct_has_sub, total_per_tier)):
ax.text(pct + 2, i, f'{pct:.0f}%', va='center', fontsize=12,
fontweight='bold', color=TEXT)
ax.text(101, i - 0.38, f'n={int(tot)}', va='top', fontsize=8, color=SUBTEXT)
ax.set_yticks(x)
ax.set_yticklabels(short_labels, fontsize=10, color=TEXT)
ax.set_xlim(0, 125)
ax.set_xlabel('% of games with a subreddit', fontsize=10)
ax.set_title('Subreddit ownership rate\nper success tier', fontsize=12, color=TEXT, pad=10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_edgecolor(GRID)
ax.spines['bottom'].set_edgecolor(GRID)
ax.set_facecolor(CARD)
ax.tick_params(colors=SUBTEXT)
ax.grid(False)
# ── Panel 2: Subscribers on log scale ────────────────────────────────────────
ax = axes[1]
log_subs = np.log10(np.clip(avg_subs_raw, 1, None))
bars2 = ax.bar(x, log_subs, width=W, color=bar_colors, alpha=0.88, edgecolor=BG)
lmax = max(log_subs) * 1.2
ax.set_ylim(0, lmax)
for bar, raw, lv in zip(bars2, avg_subs_raw, log_subs):
if raw >= 1000:
label = f'{raw/1000:.0f}k'
else:
label = f'{int(raw)}'
ax.text(bar.get_x() + bar.get_width()/2, lv + lmax * 0.02,
label, ha='center', va='bottom',
fontsize=11, fontweight='bold', color=TEXT)
tick_vals = [3, 4, 5]
ax.set_yticks(tick_vals)
ax.set_yticklabels(['1k', '10k', '100k'], fontsize=9)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg subscribers (log scale)', fontsize=10)
ax.set_title('Avg subreddit size\n(games with subreddit only)', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
clean_ax(ax)
# ── Panel 3: Avg Reddit posts per tier ───────────────────────────────────────
ax = axes[2]
bars3 = ax.bar(x, avg_posts, width=W, color=bar_colors, alpha=0.88, edgecolor=BG)
pmax = max(avg_posts) * 1.28
ax.set_ylim(0, pmax)
for bar, val in zip(bars3, avg_posts):
ax.text(bar.get_x() + bar.get_width()/2, val + pmax * 0.02,
f'{int(val)}', ha='center', va='bottom',
fontsize=11, fontweight='bold', color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg Reddit posts', fontsize=10)
ax.set_title('Reddit post volume\n(games with subreddit only)', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
clean_ax(ax)
# ── Shared legend ─────────────────────────────────────────────────────────────
legend_patches = [
mpatches.Patch(color=C_BREAKOUT, label='Breakout Hit (≥1M owners)'),
mpatches.Patch(color=C_STRONG, label='Strong Success (200k–1M)'),
mpatches.Patch(color=C_MODERATE, label='Moderate Success (20k–200k)'),
mpatches.Patch(color=C_UNDER, label='Underperformer (<20k)'),
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID,
bbox_to_anchor=(0.5, 0.01), handlelength=1.2, columnspacing=2.0)
plt.show()
Subreddit ownership rises consistently with tier: 76% of Breakout Hits have a dedicated subreddit versus 24% of Underperformers, the clearest tier-by-tier gradient in the dataset. Breakout Hits average 98k subscribers, more than 10× the size of Strong Success (8k) and Moderate Success (3k) communities. Notably, Underperformers with subreddits average 8k subscribers, suggesting niche games with small audiences. The most striking finding: average Reddit post volume is nearly identical across all tiers (45–49 posts).
# Q6: Google Trends pre/post-launch interest vs success tier.
# Tests whether pre-launch buzz is progressively stronger for higher tiers,
# or whether it only separates Breakout Hits from everyone else.
# google_interest_pre_launch and google_interest_post_launch are engineered
# features computed in spark_pipeline.py from 69,552 weekly search data points.
q6 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT success_tier, tier_rank,
COUNT(*) AS num_games,
ROUND(AVG(google_interest_pre_launch), 2) AS avg_pre_launch_interest,
ROUND(AVG(google_interest_post_launch), 2) AS avg_post_launch_interest,
ROUND(AVG(days_to_peak_interest), 0) AS avg_days_to_peak,
COUNT(CASE WHEN trend_direction='rising' THEN 1 END) AS rising_trends,
COUNT(CASE WHEN trend_direction='falling' THEN 1 END) AS falling_trends,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered
WHERE google_interest_pre_launch IS NOT NULL
GROUP BY success_tier, tier_rank
ORDER BY tier_rank
""").df()
display(q6)
| success_tier | tier_rank | num_games | avg_pre_launch_interest | avg_post_launch_interest | avg_days_to_peak | rising_trends | falling_trends | avg_owners | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakout Hit | 1 | 185 | 12.44 | 28.84 | 162.0 | 148 | 20 | 4070270.0 |
| 1 | Strong Success | 2 | 324 | 8.02 | 26.67 | 6.0 | 244 | 45 | 537654.0 |
| 2 | Moderate Success | 3 | 146 | 8.00 | 25.22 | -206.0 | 76 | 20 | 112671.0 |
| 3 | Underperformer | 4 | 49 | 9.61 | 27.38 | -365.0 | 21 | 3 | 10000.0 |
# ── Q6 Chart: Google Trends Pre/Post Launch × Success Tier (v3) ──────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
q6_plot = q6.set_index('success_tier').reindex(tier_order).reset_index()
pre = q6_plot['avg_pre_launch_interest'].values.astype(float)
post = q6_plot['avg_post_launch_interest'].values.astype(float)
days_peak = q6_plot['avg_days_to_peak'].values.astype(float)
rising = q6_plot['rising_trends'].values.astype(float)
falling = q6_plot['falling_trends'].values.astype(float)
n_games = q6_plot['num_games'].values.astype(float)
rising_pct = rising / n_games * 100
falling_pct = falling / n_games * 100
days_peak_clamped = np.clip(days_peak, 0, None)
x = np.arange(4)
W = 0.35
bar_colors = [TIER_COLORS[t] for t in tier_order]
# ── Figure ────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(17, 5.5), facecolor=BG)
fig.subplots_adjust(left=0.06, right=0.96, top=0.84, bottom=0.18, wspace=0.4)
fig.suptitle('Q6 · Google Trends — Does Pre-Launch Buzz Predict Breakout Success?',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Panel 1: Pre vs Post launch interest ─────────────────────────────────────
ax = axes[0]
bars_pre = ax.bar(x - W/2, pre, width=W, color=bar_colors, alpha=0.88, edgecolor=BG)
bars_post = ax.bar(x + W/2, post, width=W, color=bar_colors, alpha=0.35, edgecolor=BG)
vmax = max(max(pre), max(post)) * 1.28
ax.set_ylim(0, vmax)
for i, (p1, p2) in enumerate(zip(pre, post)):
ax.text(i - W/2, p1 + vmax*0.02, f'{p1:.1f}',
ha='center', va='bottom', fontsize=9, fontweight='bold', color=TEXT)
ax.text(i + W/2, p2 + vmax*0.02, f'{p2:.1f}',
ha='center', va='bottom', fontsize=9, color=SUBTEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Avg Google interest score', fontsize=10)
ax.set_title('Pre vs post-launch\nGoogle search interest', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
p1 = mpatches.Patch(color='#E6EDF3', alpha=0.88, label='Pre-launch')
p2 = mpatches.Patch(color='#E6EDF3', alpha=0.35, label='Post-launch')
ax.legend(handles=[p1, p2], fontsize=9, frameon=True,
facecolor=CARD, edgecolor=GRID, loc='upper right')
clean_ax(ax)
# ── Panel 2: Days to peak – log scale so all tiers are visible ────────────────
ax = axes[1]
days_display = np.where(days_peak_clamped > 0, days_peak_clamped, 0.5)
bars2 = ax.bar(x, days_display, width=W+0.1, color=bar_colors, alpha=0.88, edgecolor=BG)
ax.set_yscale('log')
ax.set_ylim(0.3, max(days_display) * 3)
for i, (bar, raw) in enumerate(zip(bars2, days_peak)):
if raw < 0:
label = 'Pre-launch\npeak'
col = SUBTEXT
elif raw == 0:
label = '<1d'
col = SUBTEXT
else:
label = f'{int(raw)}d'
col = TEXT
ax.text(bar.get_x() + bar.get_width()/2,
bar.get_height() * 1.15,
label, ha='center', va='bottom',
fontsize=10, fontweight='bold', color=col)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9.5)
ax.set_ylabel('Days to peak interest (log scale)', fontsize=10)
ax.set_title('How quickly does search\ninterest peak post-launch?', fontsize=12, color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
clean_ax(ax)
# ── Panel 3: Rising vs falling ───────────────────────────────────────────────
ax = axes[2]
display_order = list(reversed(range(4)))
bar_h = 0.3
spacing = 1.0
y_positions = [i * spacing for i in range(4)]
for plot_i, tier_i in enumerate(display_order):
y = y_positions[plot_i]
col = bar_colors[tier_i]
rp = rising_pct[tier_i]
fp = falling_pct[tier_i]
ax.barh(y + bar_h/2 + 0.02, 100, height=bar_h, color=GRID, alpha=0.4, edgecolor='none')
ax.barh(y + bar_h/2 + 0.02, rp, height=bar_h, color=col, alpha=0.9, edgecolor='none')
ax.text(rp + 2, y + bar_h/2 + 0.02, f'{rp:.0f}% rising',
va='center', fontsize=9, fontweight='bold', color=TEXT)
ax.barh(y - bar_h/2 - 0.02, 100, height=bar_h, color=GRID, alpha=0.4, edgecolor='none')
ax.barh(y - bar_h/2 - 0.02, fp, height=bar_h, color=col, alpha=0.3, edgecolor='none')
ax.text(fp + 2, y - bar_h/2 - 0.02, f'{fp:.0f}% falling',
va='center', fontsize=9, color=SUBTEXT)
ax.set_yticks(y_positions)
ax.set_yticklabels([short_labels[i] for i in display_order], fontsize=9.5, color=TEXT)
ax.set_xlim(0, 140)
ax.set_xlabel('% of games in tier', fontsize=10)
ax.set_title('Trend direction after launch\n(rising vs falling interest)', fontsize=12, color=TEXT, pad=10)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_edgecolor(GRID)
ax.spines['bottom'].set_edgecolor(GRID)
ax.set_facecolor(CARD)
ax.tick_params(colors=SUBTEXT)
ax.grid(False)
# ── Shared legend ─────────────────────────────────────────────────────────────
legend_patches = [
mpatches.Patch(color=C_BREAKOUT, label='Breakout Hit (≥1M owners)'),
mpatches.Patch(color=C_STRONG, label='Strong Success (200k–1M)'),
mpatches.Patch(color=C_MODERATE, label='Moderate Success (20k–200k)'),
mpatches.Patch(color=C_UNDER, label='Underperformer (<20k)'),
]
fig.legend(handles=legend_patches, loc='lower center', ncol=4,
fontsize=10, frameon=True, facecolor=CARD, edgecolor=GRID,
bbox_to_anchor=(0.5, 0.01), handlelength=1.2, columnspacing=2.0)
plt.show()
Pre-launch Google Trends interest is a weak predictor. Breakout Hits average 12.4, while Underperformers score 9.6, and Strong/Moderate Success both sit at 8.0, differences are too small and inconsistent to be meaningful. Post-launch behaviour diverges more clearly: Breakout Hits take 162 days to reach peak search interest versus 6 days for Strong Success, indicating gradual momentum rather than launch spikes. 80% of Breakout Hits show rising post-launch interest versus 43% of Underperformers, suggesting sustained public attention after release is a far stronger signal than pre-launch buzz.
# Q7: Top 20 breakout hits ranked by peak concurrent users.
# Restricted to games in the Breakout Hit tier (1M+ owners) so the ranking
# is within the elite tier rather than conflating tiers. Includes all key signals.
q7 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT name, success_tier, peak_ccu,
ROUND(owners_estimate, 0) AS owners_estimate,
price,
ROUND(review_positivity_ratio, 3) AS review_ratio,
ROUND(avg_vader_compound, 3) AS vader_score,
num_platforms,
COALESCE(subreddit_subscribers, 0) AS reddit_subs,
ROUND(trend_peak, 0) AS trend_peak
FROM tiered
WHERE success_tier = 'Breakout Hit'
ORDER BY peak_ccu DESC LIMIT 20
""").df()
display(q7)
| name | success_tier | peak_ccu | owners_estimate | price | review_ratio | vader_score | num_platforms | reddit_subs | trend_peak | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rust | Breakout Hit | 143870 | 35000000.0 | 19.99 | 0.872 | 0.190 | 2 | 390673 | 100 |
| 1 | Wallpaper Engine | Breakout Hit | 91184 | 35000000.0 | 3.99 | 0.980 | 0.566 | <NA> | 212778 | 100 |
| 2 | DELTARUNE | Breakout Hit | 83936 | 1500000.0 | 19.99 | 0.989 | 0.651 | 6 | 470882 | 100 |
| 3 | Stardew Valley | Breakout Hit | 50662 | 35000000.0 | 8.99 | 0.984 | 0.812 | 11 | 3114769 | 100 |
| 4 | Crosshair X | Breakout Hit | 41428 | 3500000.0 | 3.49 | 0.934 | 0.471 | <NA> | 0 | 100 |
| 5 | Don't Starve Together | Breakout Hit | 30542 | 15000000.0 | 5.09 | 0.953 | 0.589 | 4 | 42503 | 100 |
| 6 | Geometry Dash | Breakout Hit | 27850 | 7500000.0 | 2.49 | 0.933 | 0.460 | 4 | 646126 | 100 |
| 7 | Terraria | Breakout Hit | 24580 | 35000000.0 | 4.99 | 0.975 | 0.504 | 15 | 1590045 | 100 |
| 8 | Project Zomboid | Breakout Hit | 22864 | 15000000.0 | 13.39 | 0.942 | 0.317 | 3 | 556701 | 100 |
| 9 | Schedule I | Breakout Hit | 22757 | 15000000.0 | 13.99 | 0.984 | 0.635 | 1 | 117 | 100 |
| 10 | ARK: Survival Evolved | Breakout Hit | 22170 | 35000000.0 | 9.89 | 0.838 | 0.460 | 10 | 27355 | 100 |
| 11 | Rocket League® | Breakout Hit | 21106 | 15000000.0 | 0.00 | 0.874 | 0.352 | <NA> | 1850853 | 0 |
| 12 | RimWorld | Breakout Hit | 19211 | 7500000.0 | 27.99 | 0.980 | 0.533 | 3 | 633623 | 100 |
| 13 | Garry's Mod | Breakout Hit | 18400 | 35000000.0 | 4.99 | 0.968 | 0.593 | 3 | 8547 | 100 |
| 14 | Palworld | Breakout Hit | 18028 | 75000000.0 | 22.49 | 0.941 | 0.676 | 5 | 755450 | 100 |
| 15 | Euro Truck Simulator 2 | Breakout Hit | 17281 | 15000000.0 | 4.99 | 0.974 | 0.470 | 1 | 9 | 100 |
| 16 | Balatro | Breakout Hit | 17123 | 3500000.0 | 12.74 | 0.980 | 0.638 | 10 | 567722 | 100 |
| 17 | 7 Days to Die | Breakout Hit | 17045 | 15000000.0 | 24.74 | 0.886 | 0.589 | 7 | 219740 | 100 |
| 18 | Factorio | Breakout Hit | 16426 | 7500000.0 | 35.00 | 0.970 | 0.539 | 5 | 472507 | 100 |
| 19 | Slay the Spire | Breakout Hit | 15038 | 7500000.0 | 2.49 | 0.978 | 0.660 | 3 | 282473 | 100 |
# ── Q7 Chart: Top 20 Breakout Hits by Peak CCU ──────────────────────────────
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
BG = '#0D1117'
CARD = '#161B22'
GRID = '#21262D'
TEXT = '#E6EDF3'
SUBTEXT = '#8B949E'
C_BREAKOUT = '#D2A8FF'
plt.rcParams.update({
'figure.facecolor': BG, 'axes.facecolor': CARD,
'axes.edgecolor': GRID, 'axes.labelcolor': SUBTEXT,
'text.color': TEXT, 'xtick.color': SUBTEXT, 'ytick.color': SUBTEXT,
'grid.color': GRID, 'grid.alpha': 0.4, 'font.family': 'sans-serif',
'font.size': 11, 'legend.facecolor': CARD, 'legend.edgecolor': GRID,
})
def clean_ax(ax, keep_left=True):
ax.set_facecolor(CARD)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(keep_left)
ax.spines['bottom'].set_edgecolor(GRID)
if keep_left: ax.spines['left'].set_edgecolor(GRID)
ax.tick_params(colors=SUBTEXT)
q7_plot = q7.sort_values('peak_ccu', ascending=True).tail(20)
fig, ax = plt.subplots(figsize=(12, 8), facecolor=BG)
clean_ax(ax, keep_left=False)
bars = ax.barh(
range(len(q7_plot)), q7_plot['peak_ccu'].values,
color=C_BREAKOUT, edgecolor=BG, linewidth=0.8, alpha=0.85
)
for i, (_, row) in enumerate(q7_plot.iterrows()):
name = row['name'][:30] + '…' if len(str(row['name'])) > 30 else row['name']
ccu = row['peak_ccu']
ax.text(ccu + max(q7_plot['peak_ccu']) * 0.01, i, f'{name} ({ccu:,.0f})',
va='center', fontsize=9, color=TEXT, fontweight='500')
ax.set_yticks([])
ax.set_xlabel('Peak Concurrent Users', fontsize=11, color=SUBTEXT)
ax.set_xlim(0, max(q7_plot['peak_ccu']) * 1.45)
ax.grid(True, axis='x', alpha=0.15, color=GRID)
fig.suptitle('Q7 · Top 20 Breakout Hits Ranked by Peak Concurrent Users',
fontsize=14, fontweight='700', color=TEXT, y=0.98)
plt.tight_layout()
plt.show()
Even within the top success tier, the top 20 breakout hits differ hugely in peak concurrent players. This shows that passing 1M owners is only part of the story: the biggest outliers are likely the games that catch on socially, get strong streaming exposure, or have multiplayer mechanics that drive much higher live player counts.
# Q8: Deep profile of Underperformers (<20k owners) grouped by price tier.
# Previously only showed 1 row due to NULL price_tier — COALESCE maps NULLs to 'unknown'.
# Now also includes avg_owners within the underperformer tier and comparison
# against the Moderate Success tier average for context.
q8 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT COALESCE(price_tier, 'unknown') AS price_tier,
COUNT(*) AS underperformer_games,
ROUND(AVG(owners_estimate), 0) AS avg_owners,
ROUND(AVG(price), 2) AS avg_price,
ROUND(AVG(peak_ccu), 0) AS avg_peak_ccu,
ROUND(AVG(review_positivity_ratio), 3) AS avg_review_ratio,
ROUND(AVG(avg_vader_compound), 4) AS avg_vader,
ROUND(AVG(num_platforms), 1) AS avg_platforms,
ROUND(AVG(total_reddit_posts), 0) AS avg_reddit_posts,
SUM(CASE WHEN has_dedicated_subreddit THEN 1 ELSE 0 END) AS with_subreddit
FROM tiered WHERE success_tier = 'Underperformer'
GROUP BY price_tier ORDER BY underperformer_games DESC
""").df()
display(q8)
| price_tier | underperformer_games | avg_owners | avg_price | avg_peak_ccu | avg_review_ratio | avg_vader | avg_platforms | avg_reddit_posts | with_subreddit | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5_to_15 | 51 | 10000.0 | 10.07 | 146.0 | 0.867 | 0.4507 | 2.1 | 41.0 | 11.0 |
| 1 | under_5 | 27 | 10000.0 | 2.92 | 144.0 | 0.824 | 0.3922 | 1.7 | 36.0 | 4.0 |
| 2 | free | 12 | 10000.0 | 0.00 | 16575.0 | 0.680 | 0.4017 | 1.6 | 35.0 | 4.0 |
| 3 | 15_to_30 | 12 | 10000.0 | 17.86 | 138.0 | 0.818 | 0.5432 | 2.6 | 47.0 | 5.0 |
| 4 | premium_30+ | 1 | 10000.0 | 35.99 | 67.0 | 0.685 | 0.1148 | 1.0 | 50.0 | 1.0 |
# ── Q8 Chart: What Do Underperformers Look Like? ────────────────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
price_display = {
'free': 'Free', 'under_5': 'Under $5', '5_to_15': '$5–15',
'15_to_30': '$15–30', 'premium_30+': '$30+', 'unknown': 'Unknown'
}
q8_plot = q8.copy()
q8_plot['label'] = q8_plot['price_tier'].map(price_display).fillna(q8_plot['price_tier'])
q8_plot['subreddit_pct'] = q8_plot['with_subreddit'] / q8_plot['underperformer_games'] * 100
labels = q8_plot['label'].tolist()
x = np.arange(len(labels))
w = 0.35
fig, axes = plt.subplots(1, 3, figsize=(17, 5.5), facecolor=BG)
fig.subplots_adjust(left=0.06, right=0.96, top=0.84, bottom=0.18, wspace=0.35)
fig.suptitle('Q8 · Anatomy of an Underperformer: Where the <20k-Owner Games Cluster',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Panel 1: Distribution across price tiers ─────────────────────────────────
ax = axes[0]
clean_ax(ax)
bars = ax.bar(x, q8_plot['underperformer_games'], color=C_UNDER, edgecolor=BG, alpha=0.85)
for b in bars:
n = int(b.get_height())
pct = n / q8_plot['underperformer_games'].sum() * 100
ax.text(b.get_x() + b.get_width()/2, b.get_height() + 1,
f'{n} ({pct:.0f}%)', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(labels, fontsize=9)
ax.set_title('Where underperformers price themselves', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.set_ylabel('Games', fontsize=10, color=SUBTEXT)
ax.grid(True, axis='y', alpha=0.15)
# ── Panel 2: Review ratio vs VADER ───────────────────────────────────────────
ax = axes[1]
clean_ax(ax)
ax.bar(x - w/2, q8_plot['avg_review_ratio'], w, color=C_STRONG, alpha=0.85, label='Review ratio')
ax.bar(x + w/2, q8_plot['avg_vader'].fillna(0), w, color=C_MODERATE, alpha=0.85, label='VADER sentiment')
for i, (rr, vd) in enumerate(zip(q8_plot['avg_review_ratio'], q8_plot['avg_vader'].fillna(0))):
ax.text(i - w/2, rr + 0.01, f'{rr:.2f}', ha='center', fontsize=8, color=TEXT)
ax.text(i + w/2, vd + 0.01, f'{vd:.2f}', ha='center', fontsize=8, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(labels, fontsize=9)
ax.set_title('Quality isn\'t the problem', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.set_ylabel('Score', fontsize=10, color=SUBTEXT)
ax.set_ylim(0, 1.05)
ax.legend(fontsize=9, labelcolor=TEXT)
ax.grid(True, axis='y', alpha=0.15)
# ── Panel 3: Community presence — subreddit % with Reddit posts annotated ────
ax = axes[2]
clean_ax(ax)
bars = ax.bar(x, q8_plot['subreddit_pct'], width=0.6, color='#FFA657', edgecolor=BG, alpha=0.85)
for i, (b, rp) in enumerate(zip(bars, q8_plot['avg_reddit_posts'].fillna(0))):
pct = q8_plot['subreddit_pct'].iloc[i]
# % label above bar
ax.text(b.get_x() + b.get_width()/2, pct + 2,
f'{pct:.0f}%', ha='center', fontsize=10, fontweight='bold', color=TEXT)
# Reddit posts below the x-axis label
ax.text(b.get_x() + b.get_width()/2, -9,
f'~{int(rp)} posts', ha='center', fontsize=8, color=SUBTEXT)
ax.set_xticks(x); ax.set_xticklabels(labels, fontsize=9)
ax.set_title('Visibility is the problem', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.set_ylabel('% with dedicated subreddit', fontsize=10, color=SUBTEXT)
ax.set_ylim(-15, 115)
ax.grid(True, axis='y', alpha=0.15)
plt.show()
These underperforming games do not look like obvious quality failures. Their review ratios are still reasonably strong across price tiers, so the bigger problem seems to be visibility rather than reception. In the most crowded brackets ($5–15 and Under $5), very few have a dedicated subreddit and Reddit activity is minimal. Since half of all underperformers sit in the $5–15 tier, it looks more like they are getting lost in an overcrowded part of Steam than being rejected by players.
# Q9: Side-by-side comparison of all key features across the four success tiers.
# This is the primary evidence table — replaces the binary Successful/Failed comparison
# with four-way granularity to reveal whether signals are gradual or threshold effects.
q9 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT success_tier, tier_rank,
COUNT(*) AS n,
ROUND(AVG(owners_estimate), 0) AS avg_owners,
ROUND(AVG(price), 2) AS avg_price,
ROUND(AVG(peak_ccu), 0) AS avg_ccu,
ROUND(AVG(review_positivity_ratio), 3) AS review_ratio,
ROUND(AVG(avg_vader_compound), 4) AS vader,
ROUND(AVG(num_platforms), 1) AS platforms,
ROUND(AVG(total_reddit_posts), 0) AS reddit_posts,
ROUND(AVG(subreddit_subscribers), 0) AS reddit_subs,
ROUND(AVG(trend_peak), 1) AS trend_peak,
ROUND(AVG(steam_platform_count), 1) AS steam_platforms,
ROUND(AVG(playtime_retention_ratio), 3) AS retention,
ROUND(AVG(achievements), 0) AS achievements
FROM tiered
GROUP BY success_tier, tier_rank
ORDER BY tier_rank
""").df()
display(q9)
| success_tier | tier_rank | n | avg_owners | avg_price | avg_ccu | review_ratio | vader | platforms | reddit_posts | reddit_subs | trend_peak | steam_platforms | retention | achievements | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakout Hit | 1 | 368 | 4769022.0 | 7.25 | 3335.0 | 0.899 | 0.4290 | 4.1 | 48.0 | 97599.0 | 92.4 | 1.9 | 0.381 | 56.0 |
| 1 | Strong Success | 2 | 490 | 547551.0 | 8.83 | 383.0 | 0.871 | 0.4147 | 3.1 | 46.0 | 8100.0 | 85.3 | 1.7 | 0.496 | 65.0 |
| 2 | Moderate Success | 3 | 255 | 102000.0 | 9.77 | 237.0 | 0.851 | 0.4525 | 2.4 | 43.0 | 2561.0 | 62.0 | 1.5 | 0.858 | 49.0 |
| 3 | Underperformer | 4 | 103 | 10000.0 | 8.18 | 2058.0 | 0.834 | 0.4385 | 2.0 | 40.0 | 7829.0 | 48.5 | 1.6 | 1.019 | 41.0 |
# ── Q9 Chart: What Actually Separates the Tiers? ────────────────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
q9_plot = q9.set_index('success_tier').reindex(tier_order)
x = np.arange(4)
bar_colors = [TIER_COLORS[t] for t in tier_order]
fig, axes = plt.subplots(2, 3, figsize=(17, 10), facecolor=BG)
fig.suptitle('Q9 · What Actually Separates the Success Tiers?',
fontsize=15, fontweight='700', color=TEXT, y=0.98)
# ── Row 1: Metrics that VARY (the differentiators) ───────────────────────────
fig.text(0.03, 0.75, 'DIFFERENTIATORS', fontsize=11, fontweight='700',
color='#58A6FF', rotation=90, va='center', ha='center')
# Peak CCU
ax = axes[0, 0]
clean_ax(ax)
vals = q9_plot['avg_ccu'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + max(vals)*0.03, f'{v:,.0f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('Peak concurrent users', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
# Reddit Subs
ax = axes[0, 1]
clean_ax(ax)
vals = q9_plot['reddit_subs'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + max(vals)*0.03, f'{v:,.0f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('Avg subreddit subscribers', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
# Trend Peak
ax = axes[0, 2]
clean_ax(ax)
vals = q9_plot['trend_peak'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + max(vals)*0.03, f'{v:.1f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('Google Trends peak interest', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
# ── Row 2: Metrics that DON'T vary (quality is flat) ─────────────────────────
fig.text(0.03, 0.28, 'QUALITY\n(FLAT)', fontsize=11, fontweight='700',
color='#8B949E', rotation=90, va='center', ha='center')
# Review Ratio
ax = axes[1, 0]
clean_ax(ax)
vals = q9_plot['review_ratio'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + 0.01, f'{v:.3f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('Review positivity ratio', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.set_ylim(0, 1.1)
ax.grid(True, axis='y', alpha=0.15)
# VADER
ax = axes[1, 1]
clean_ax(ax)
vals = q9_plot['vader'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + 0.01, f'{v:.3f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('VADER sentiment score', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.set_ylim(0, 1.1)
ax.grid(True, axis='y', alpha=0.15)
# Retention
ax = axes[1, 2]
clean_ax(ax)
vals = q9_plot['retention'].values
ax.bar(x, vals, color=bar_colors, edgecolor=BG, alpha=0.85, width=0.55)
for i, v in enumerate(vals):
ax.text(i, v + max(vals)*0.03, f'{v:.3f}', ha='center', fontsize=9, color=TEXT)
ax.set_xticks(x); ax.set_xticklabels(short_labels, fontsize=9)
ax.set_title('Playtime retention ratio', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax.grid(True, axis='y', alpha=0.15)
plt.tight_layout(rect=[0.05, 0, 1, 0.95])
plt.show()
These six features split quite neatly into two groups: the ones that really distinguish successful games, and the ones that barely change across tiers. The clearest differentiators are peak CCU, subreddit size, and Google Trends peak interest. Breakout games are dramatically ahead on all three, suggesting that visibility, attention, and active player momentum matter far more than subtle differences in perceived quality. The one odd result is that underperformers still show a high average peak CCU of 2,058, which is probably being driven by a few niche titles with sharp but short-lived spikes.
The other features tell a different story. Review positivity, VADER sentiment, and playtime retention are all relatively similar across tiers, so they do not seem to explain why some games break out while others do not. In fact, underperformers have the highest retention value, which suggests some unsuccessful games still manage to build small but loyal audiences even without reaching broad commercial success.
# Q10: Playtime retention metrics across success tiers.
# playtime_retention_ratio = median_playtime / average_playtime (closer to 1 = even retention).
# FIX retained: outlier filter < 50000 mins removes Desktop Kitten Girl anomaly.
# Four-tier version reveals whether Breakout Hits have qualitatively different
# retention curves or just more of the same.
q10 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT success_tier, tier_rank,
COUNT(*) AS num_games,
ROUND(AVG(average_playtime_forever), 0) AS avg_playtime_mins,
ROUND(AVG(median_playtime_forever), 0) AS median_playtime_mins,
ROUND(AVG(playtime_retention_ratio), 3) AS avg_retention_ratio,
ROUND(AVG(steamspy_avg_playtime), 0) AS steamspy_avg_playtime,
ROUND(AVG(steamspy_median_playtime), 0) AS steamspy_median_playtime,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered
WHERE playtime_retention_ratio IS NOT NULL
AND average_playtime_forever < 50000
GROUP BY success_tier, tier_rank
ORDER BY tier_rank
""").df()
display(q10)
| success_tier | tier_rank | num_games | avg_playtime_mins | median_playtime_mins | avg_retention_ratio | steamspy_avg_playtime | steamspy_median_playtime | avg_owners | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakout Hit | 1 | 365 | 2265.0 | 583.0 | 0.381 | 2314.0 | 590.0 | 4795890.0 |
| 1 | Strong Success | 2 | 485 | 1816.0 | 614.0 | 0.497 | 2045.0 | 652.0 | 547113.0 |
| 2 | Moderate Success | 3 | 245 | 1752.0 | 1493.0 | 0.860 | 1509.0 | 1284.0 | 102224.0 |
| 3 | Underperformer | 4 | 80 | 1457.0 | 1077.0 | 1.019 | 1655.0 | 1334.0 | 10000.0 |
# ── Q10 Chart: Playtime Retention Across Success Tiers ──────────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
q10_plot = q10.set_index('success_tier').reindex(tier_order).reset_index()
x = np.arange(4)
bar_colors = [TIER_COLORS[t] for t in tier_order]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5.5), facecolor=BG,
gridspec_kw={'width_ratios': [3, 2]})
fig.subplots_adjust(left=0.06, right=0.96, top=0.84, bottom=0.15, wspace=0.3)
fig.suptitle('Q10 · Do Successful Games Keep Players Longer?',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Left: Average vs Median playtime side by side ─────────────────────────────
clean_ax(ax1)
w = 0.3
avg_vals = q10_plot['avg_playtime_mins'].values
med_vals = q10_plot['median_playtime_mins'].values
bars_avg = ax1.bar(x - w/2, avg_vals, w, color=bar_colors, alpha=0.85, edgecolor=BG)
bars_med = ax1.bar(x + w/2, med_vals, w, color=bar_colors, alpha=0.4, edgecolor=BG)
for i in range(4):
ax1.text(i - w/2, avg_vals[i] + max(avg_vals)*0.02,
f'{int(avg_vals[i]):,}', ha='center', fontsize=9, fontweight='bold', color=TEXT)
ax1.text(i + w/2, med_vals[i] + max(avg_vals)*0.02,
f'{int(med_vals[i]):,}', ha='center', fontsize=9, color=SUBTEXT)
ax1.set_xticks(x); ax1.set_xticklabels(short_labels, fontsize=10)
ax1.set_ylabel('Minutes', fontsize=10, color=SUBTEXT)
ax1.set_title('Average vs Median playtime (minutes)', fontsize=12, fontweight='600', color=TEXT, pad=10)
ax1.grid(True, axis='y', alpha=0.15)
p1 = mpatches.Patch(color=TEXT, alpha=0.85, label='Average (solid)')
p2 = mpatches.Patch(color=TEXT, alpha=0.4, label='Median (faded)')
ax1.legend(handles=[p1, p2], fontsize=9, labelcolor=TEXT, frameon=True,
facecolor=CARD, edgecolor=GRID)
# ── Right: Retention ratio as lollipop ────────────────────────────────────────
clean_ax(ax2)
retention = q10_plot['avg_retention_ratio'].values
for i, (r, color) in enumerate(zip(retention, bar_colors)):
ax2.plot([i, i], [0, r], color=color, lw=4, alpha=0.6)
ax2.scatter(i, r, color=color, s=150, zorder=5, edgecolor=BG, linewidth=2)
ax2.text(i, r + 0.025, f'{r:.3f}', ha='center', fontsize=11,
fontweight='bold', color=TEXT)
ax2.axhline(1.0, color=SUBTEXT, ls='--', lw=1, alpha=0.3, label='Perfect retention (1.0)')
ax2.set_xticks(range(4))
ax2.set_xticklabels(short_labels, fontsize=10)
ax2.set_ylabel('Median / Average playtime', fontsize=10, color=SUBTEXT)
ax2.set_ylim(0, max(retention) * 1.3)
ax2.set_title('Retention ratio\n(closer to 1.0 = more even engagement)', fontsize=12,
fontweight='600', color=TEXT, pad=10)
ax2.legend(fontsize=9, labelcolor=SUBTEXT, frameon=True, facecolor=CARD, edgecolor=GRID)
ax2.grid(True, axis='y', alpha=0.15)
plt.show()
Breakout Hits attract significantly more total playtime, but the retention ratio (median ÷ average) tells a nuanced story. Lower ratios in the top tiers indicate a wider spread between casual and hardcore players, with a small group of dedicated players pulling the average well above the median. Underperformers, with a ratio of 1.019, have the most even engagement, suggesting a small niche audience that plays consistently rather than a broad player base with power users. The key difference between tiers is not how long each player stays, but how many players show up at all. This reinforces the discoverability thesis: successful games don't necessarily create stickier experiences, they just reach more people.
# Q11: Identifies which Steam tags (genres/themes) correlate with success tiers.
# Tags are stored in game_tags (one row per game × tag). Exported from PostgreSQL
# to CSV so the notebook is self-contained without Docker.
game_tags = pd.read_csv('../storage/game_tags.csv')
con.execute("CREATE OR REPLACE TABLE game_tags AS SELECT * FROM game_tags")
q11_top = con.execute("""
SELECT t.tag_name, COUNT(*) AS num_games,
SUM(CASE WHEN g.owners_estimate >= 1000000 THEN 1 ELSE 0 END) AS breakout_hits,
SUM(CASE WHEN g.owners_estimate >= 200000 AND g.owners_estimate < 1000000 THEN 1 ELSE 0 END) AS strong_successes,
SUM(CASE WHEN g.owners_estimate >= 20000 AND g.owners_estimate < 200000 THEN 1 ELSE 0 END) AS moderate_successes,
SUM(CASE WHEN g.owners_estimate < 20000 THEN 1 ELSE 0 END) AS underperformers,
ROUND(SUM(CASE WHEN g.owners_estimate >= 200000 THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS pct_strong_or_above,
ROUND(AVG(g.peak_ccu),0) AS avg_peak_ccu
FROM game_tags t JOIN games g ON t.app_id = g.app_id
GROUP BY t.tag_name HAVING COUNT(*) >= 20
ORDER BY pct_strong_or_above DESC LIMIT 20
""").df()
q11_bot = con.execute("""
SELECT t.tag_name, COUNT(*) AS num_games,
SUM(CASE WHEN g.owners_estimate < 20000 THEN 1 ELSE 0 END) AS underperformers,
ROUND(SUM(CASE WHEN g.owners_estimate < 20000 THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS pct_underperformer
FROM game_tags t JOIN games g ON t.app_id = g.app_id
GROUP BY t.tag_name HAVING COUNT(*) >= 20
ORDER BY pct_underperformer DESC LIMIT 10
""").df()
print("Top 20 tags by % reaching Strong Success or above:")
display(q11_top)
print("\nBottom 10 tags (highest underperformer %):")
display(q11_bot)
Top 20 tags by % reaching Strong Success or above:
| tag_name | num_games | breakout_hits | strong_successes | moderate_successes | underperformers | pct_strong_or_above | avg_peak_ccu | |
|---|---|---|---|---|---|---|---|---|
| 0 | Mature | 35 | 18.0 | 16.0 | 1.0 | 0.0 | 97.1 | 3269.0 |
| 1 | 4 Player Local | 32 | 21.0 | 10.0 | 1.0 | 0.0 | 96.9 | 1500.0 |
| 2 | Great Soundtrack | 120 | 71.0 | 45.0 | 4.0 | 0.0 | 96.7 | 2198.0 |
| 3 | Competitive | 29 | 18.0 | 10.0 | 0.0 | 1.0 | 96.6 | 1197.0 |
| 4 | Fighting | 22 | 13.0 | 8.0 | 0.0 | 1.0 | 95.5 | 1018.0 |
| 5 | Memes | 44 | 24.0 | 17.0 | 2.0 | 1.0 | 93.2 | 2736.0 |
| 6 | Fast-Paced | 29 | 17.0 | 10.0 | 2.0 | 0.0 | 93.1 | 1252.0 |
| 7 | Local Co-Op | 84 | 42.0 | 35.0 | 6.0 | 1.0 | 91.7 | 841.0 |
| 8 | Blood | 21 | 10.0 | 9.0 | 2.0 | 0.0 | 90.5 | 556.0 |
| 9 | Moddable | 82 | 40.0 | 34.0 | 7.0 | 1.0 | 90.2 | 1832.0 |
| 10 | Science | 20 | 5.0 | 13.0 | 2.0 | 0.0 | 90.0 | 667.0 |
| 11 | Survival Horror | 75 | 42.0 | 25.0 | 6.0 | 2.0 | 89.3 | 1595.0 |
| 12 | Free to Play | 132 | 59.0 | 58.0 | 13.0 | 2.0 | 88.6 | 1102.0 |
| 13 | Gore | 87 | 47.0 | 30.0 | 10.0 | 0.0 | 88.5 | 681.0 |
| 14 | Voxel | 26 | 7.0 | 16.0 | 2.0 | 1.0 | 88.5 | 1143.0 |
| 15 | Nudity | 43 | 17.0 | 21.0 | 4.0 | 1.0 | 88.4 | 3910.0 |
| 16 | Massively Multiplayer | 76 | 35.0 | 32.0 | 8.0 | 1.0 | 88.2 | 1621.0 |
| 17 | Parkour | 25 | 8.0 | 14.0 | 2.0 | 1.0 | 88.0 | 295.0 |
| 18 | War | 66 | 24.0 | 34.0 | 8.0 | 0.0 | 87.9 | 1566.0 |
| 19 | Violent | 72 | 36.0 | 27.0 | 9.0 | 0.0 | 87.5 | 643.0 |
Bottom 10 tags (highest underperformer %):
| tag_name | num_games | underperformers | pct_underperformer | |
|---|---|---|---|---|
| 0 | Collectathon | 23 | 7.0 | 30.4 |
| 1 | Creature Collector | 28 | 7.0 | 25.0 |
| 2 | Idler | 85 | 14.0 | 16.5 |
| 3 | Farming Sim | 49 | 8.0 | 16.3 |
| 4 | Cozy | 33 | 5.0 | 15.2 |
| 5 | Time Management | 55 | 8.0 | 14.5 |
| 6 | Demons | 21 | 3.0 | 14.3 |
| 7 | Auto Battler | 29 | 4.0 | 13.8 |
| 8 | Score Attack | 22 | 3.0 | 13.6 |
| 9 | 2D Platformer | 45 | 6.0 | 13.3 |
The tag results suggest that social and replayable games have a clear advantage. 4 Player Local, Competitive, and Local Co-Op all rank very highly, which fits the idea that games built around shared experiences are more likely to spread through word of mouth. Great Soundtrack is one of the strongest findings overall, suggesting that production quality is not just aesthetic but also linked to wider commercial reach. Moddable points in a similar direction, implying that games which stay fresh for longer are better able to build and sustain communities.
There are also a couple of findings that challenge easy assumptions. Mature content does not seem to hold games back, since tags like Mature, Blood, and Gore still appear near the top. Free to Play also ranks highly, but that needs to be read with caution: because these games remove the cost barrier, they can accumulate owners more easily than paid games, so part of that effect may be mechanical rather than purely demand-driven.
# ── Q11 Chart: Steam Tags Most Correlated with Success (Option 1) ────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7), facecolor=BG)
fig.subplots_adjust(wspace=0.45, left=0.12, right=0.95, top=0.88, bottom=0.08)
fig.suptitle('Q11 · Which Steam Tags Predict (or Doom) Indie Success?',
fontsize=14, fontweight='700', color=TEXT, y=0.97)
# ── Left panel: Top 10 tags by success ────────────────────────────────────────
top10 = q11_top.sort_values('pct_strong_or_above', ascending=True).tail(10)
pcts_t = top10['pct_strong_or_above'].values
tags_t = top10['tag_name'].values
n_t = top10['num_games'].values
clean_ax(ax1, keep_left=False)
bars1 = ax1.barh(range(len(tags_t)), pcts_t, color=C_BREAKOUT, edgecolor=BG,
alpha=0.85, height=0.7)
for i, (pct, n) in enumerate(zip(pcts_t, n_t)):
ax1.text(pct + 0.8, i, f'{pct:.1f}% (n={int(n)})',
va='center', fontsize=9, color=TEXT)
ax1.set_yticks(range(len(tags_t)))
ax1.set_yticklabels(tags_t, fontsize=10, color=TEXT)
ax1.set_xlabel('% reaching Strong Success or above (≥200k owners)', fontsize=9, color=SUBTEXT)
ax1.set_xlim(0, max(pcts_t) * 1.25)
ax1.set_title('Top 10 Tags — Success Magnets', fontsize=12, color=C_BREAKOUT, pad=12)
ax1.grid(True, axis='x', alpha=0.15, color=GRID)
# ── Right panel: Bottom 10 tags by underperformer rate ────────────────────────
bot10 = q11_bot.sort_values('pct_underperformer', ascending=True).tail(10)
pcts_b = bot10['pct_underperformer'].values
tags_b = bot10['tag_name'].values
n_b = bot10['num_games'].values
clean_ax(ax2, keep_left=False)
bars2 = ax2.barh(range(len(tags_b)), pcts_b, color=C_UNDER, edgecolor=BG,
alpha=0.85, height=0.7)
for i, (pct, n) in enumerate(zip(pcts_b, n_b)):
ax2.text(pct + 0.8, i, f'{pct:.1f}% (n={int(n)})',
va='center', fontsize=9, color=TEXT)
ax2.set_yticks(range(len(tags_b)))
ax2.set_yticklabels(tags_b, fontsize=10, color=TEXT)
ax2.set_xlabel('% classified as Underperformer (<20k owners)', fontsize=9, color=SUBTEXT)
ax2.set_xlim(0, max(pcts_b) * 1.25)
ax2.set_title('Bottom 10 Tags — Red Flags', fontsize=12, color=C_UNDER, pad=12)
ax2.grid(True, axis='x', alpha=0.15, color=GRID)
plt.show()
# Q12: Success tier composition by release year (2010–2025).
# Instead of a single success rate, shows how each year's cohort distributes
# across the four tiers — revealing whether the market has become harder to
# break out in as game counts increased (the "Discoverability Crisis").
q12 = con.execute("""
WITH tiered AS (
SELECT *,
CASE
WHEN owners_estimate >= 1000000 THEN 'Breakout Hit'
WHEN owners_estimate >= 200000 THEN 'Strong Success'
WHEN owners_estimate >= 20000 THEN 'Moderate Success'
ELSE 'Underperformer'
END AS success_tier,
CASE
WHEN owners_estimate >= 1000000 THEN 1
WHEN owners_estimate >= 200000 THEN 2
WHEN owners_estimate >= 20000 THEN 3
ELSE 4
END AS tier_rank
FROM games
)
SELECT TRY_CAST(SPLIT_PART(release_date, ', ', 2) AS INTEGER) AS release_year,
COUNT(*) AS num_games,
SUM(CASE WHEN success_tier='Breakout Hit' THEN 1 ELSE 0 END) AS breakout_hits,
SUM(CASE WHEN success_tier='Strong Success' THEN 1 ELSE 0 END) AS strong_successes,
SUM(CASE WHEN success_tier='Moderate Success' THEN 1 ELSE 0 END) AS moderate_successes,
SUM(CASE WHEN success_tier='Underperformer' THEN 1 ELSE 0 END) AS underperformers,
ROUND(SUM(CASE WHEN success_tier='Breakout Hit' THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS pct_breakout,
ROUND(SUM(CASE WHEN success_tier='Underperformer' THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS pct_underperformer,
ROUND(AVG(peak_ccu), 0) AS avg_peak_ccu,
ROUND(AVG(owners_estimate), 0) AS avg_owners
FROM tiered
WHERE TRY_CAST(SPLIT_PART(release_date, ', ', 2) AS INTEGER) BETWEEN 2010 AND 2025
GROUP BY release_year ORDER BY release_year
""").df()
display(q12)
| release_year | num_games | breakout_hits | strong_successes | moderate_successes | underperformers | pct_breakout | pct_underperformer | avg_peak_ccu | avg_owners | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 2 | 2.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 66.0 | 3500000.0 |
| 1 | 2011 | 6 | 6.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 4233.0 | 8416667.0 |
| 2 | 2012 | 12 | 11.0 | 1.0 | 0.0 | 0.0 | 91.7 | 0.0 | 1895.0 | 4562500.0 |
| 3 | 2013 | 13 | 10.0 | 1.0 | 2.0 | 0.0 | 76.9 | 0.0 | 2435.0 | 6346154.0 |
| 4 | 2014 | 31 | 18.0 | 13.0 | 0.0 | 0.0 | 58.1 | 0.0 | 1374.0 | 2282258.0 |
| 5 | 2015 | 56 | 28.0 | 23.0 | 5.0 | 0.0 | 50.0 | 0.0 | 963.0 | 2153929.0 |
| 6 | 2016 | 53 | 31.0 | 17.0 | 4.0 | 1.0 | 58.5 | 1.9 | 2044.0 | 2928962.0 |
| 7 | 2017 | 56 | 26.0 | 21.0 | 8.0 | 1.0 | 46.4 | 1.8 | 1283.0 | 4357232.0 |
| 8 | 2018 | 87 | 37.0 | 43.0 | 6.0 | 1.0 | 42.5 | 1.1 | 3434.0 | 2657701.0 |
| 9 | 2019 | 68 | 30.0 | 31.0 | 4.0 | 3.0 | 44.1 | 4.4 | 839.0 | 1535368.0 |
| 10 | 2020 | 102 | 38.0 | 46.0 | 12.0 | 6.0 | 37.3 | 5.9 | 1242.0 | 1701765.0 |
| 11 | 2021 | 99 | 40.0 | 44.0 | 11.0 | 4.0 | 40.4 | 4.0 | 852.0 | 1439747.0 |
| 12 | 2022 | 97 | 25.0 | 51.0 | 15.0 | 6.0 | 25.8 | 6.2 | 631.0 | 1166804.0 |
| 13 | 2023 | 129 | 25.0 | 65.0 | 33.0 | 6.0 | 19.4 | 4.7 | 735.0 | 1029341.0 |
| 14 | 2024 | 197 | 23.0 | 92.0 | 64.0 | 18.0 | 11.7 | 9.1 | 1023.0 | 1287893.0 |
| 15 | 2025 | 204 | 16.0 | 40.0 | 91.0 | 57.0 | 7.8 | 27.9 | 1902.0 | 370392.0 |
# ── Q12 Chart: Success Tier Composition by Release Year (v2) ─────────────────
# ── Prep ──────────────────────────────────────────────────────────────────────
q12_plot = q12.dropna(subset=['release_year']).copy()
q12_plot = q12_plot[q12_plot['release_year'].between(2010, 2025)]
q12_plot = q12_plot.sort_values('release_year')
years = q12_plot['release_year'].astype(int).values
totals = q12_plot['num_games'].values
pct_b = q12_plot['breakout_hits'].values / totals * 100
pct_s = q12_plot['strong_successes'].values / totals * 100
pct_m = q12_plot['moderate_successes'].values / totals * 100
pct_u = q12_plot['underperformers'].values / totals * 100
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 9), facecolor=BG,
gridspec_kw={'height_ratios': [3, 1]})
fig.subplots_adjust(hspace=0.25)
fig.suptitle('Q12 · The Discoverability Crisis: Has It Gotten Harder to Break Out?',
fontsize=14, fontweight='700', color=TEXT, y=0.98)
# ── Top: stacked bar chart (%) ────────────────────────────────────────────────
clean_ax(ax1)
w = 0.7
ax1.bar(years, pct_b, w, color=C_BREAKOUT, alpha=0.85, edgecolor=BG, label='Breakout Hit')
ax1.bar(years, pct_s, w, bottom=pct_b, color=C_STRONG, alpha=0.85, edgecolor=BG, label='Strong Success')
ax1.bar(years, pct_m, w, bottom=pct_b+pct_s, color=C_MODERATE, alpha=0.85, edgecolor=BG, label='Moderate Success')
ax1.bar(years, pct_u, w, bottom=pct_b+pct_s+pct_m, color=C_UNDER, alpha=0.85, edgecolor=BG, label='Underperformer')
# Label breakout % on each bar (the key metric to watch shrink)
for i, (yr, pb) in enumerate(zip(years, pct_b)):
if pb >= 5:
ax1.text(yr, pb / 2, f'{pb:.0f}%', ha='center', va='center',
fontsize=8, fontweight='bold', color=BG)
ax1.set_ylabel('% of cohort', fontsize=11, color=SUBTEXT)
ax1.set_ylim(0, 100)
ax1.set_title('What proportion of each year\'s games reached each tier?',
fontsize=12, fontweight='600', color=TEXT, pad=10)
ax1.legend(fontsize=9, labelcolor=TEXT, ncol=4, loc='upper center',
bbox_to_anchor=(0.5, 1.15), frameon=True, facecolor=CARD, edgecolor=GRID)
ax1.grid(True, axis='y', alpha=0.15, color=GRID)
ax1.set_xticks(years)
ax1.set_xticklabels(years, rotation=45, fontsize=9)
# ── Bottom: absolute game count ───────────────────────────────────────────────
clean_ax(ax2)
ax2.bar(years, totals, color=SUBTEXT, alpha=0.5, edgecolor=BG, width=0.7)
for yr, n in zip(years, totals):
if n >= 5:
ax2.text(yr, n + max(totals)*0.03, str(int(n)),
ha='center', fontsize=8, color=SUBTEXT)
ax2.set_xlabel('Release year', fontsize=11, color=SUBTEXT)
ax2.set_ylabel('Games in\ndataset', fontsize=10, color=SUBTEXT)
ax2.set_xticks(years)
ax2.set_xticklabels(years, rotation=45, fontsize=9)
ax2.grid(True, axis='y', alpha=0.15, color=GRID)
plt.show()
The stacked bars show a clear downward trend in breakout rates over time. Early cohorts (2010–2014) had a much higher share of Breakout Hits and Strong Successes, while post-2020 cohorts are increasingly dominated by Underperformers. This aligns with Steam's well-documented flood of new releases — the bottom panel shows the dataset mirrors this growth. The implication is structural: as the marketplace becomes more crowded, the same quality of game faces longer odds of finding an audience. However, it's worth noting that newer games have also had less time to accumulate owners, which may slightly inflate the underperformer share for recent years.
7. Data Lineage¶
Data lineage tracks where each piece of data came from, how it was transformed, and where it ended up, providing an audit trail that makes the pipeline reproducible and debuggable. In this project, lineage is managed by data_lineage.py, which logs every pipeline step to a dedicated lineage table in PostgreSQL. Each entry records the step name, source system, target system, number of records processed, status, and timestamp.
A total of 22 steps is tracked across the full pipeline: 8 collection steps covering each data source, 2 loading steps for PostgreSQL and MongoDB ingestion, 11 Spark transformation steps spanning the joins, sentiment aggregation, feature engineering and Parquet writes, and 1 DuckDB analysis step. This means that at any point, the full history of the data can be queried directly from PostgreSQL, making it straightforward to verify that each stage ran correctly and produced the expected record counts.
8. Automation¶
To ensure the pipeline is fully reproducible from a single command, all stages are orchestrated by run_pipeline.sh. Running ./run_pipeline.sh from the project root executes the entire pipeline end to end without any manual intervention between steps.
The script begins with a prerequisite check, verifying that Docker is running, Python and Java are available, the .env credentials file is present, and the collected data files exist, exiting with a clear error message if any of these are missing. It then starts the PostgreSQL and MongoDB containers via Docker Compose and waits for both databases to report healthy before proceeding, which prevents loading scripts from failing silently against an unready database. Once the databases are confirmed ready, it runs load_postgres.py and load_mongodb.py in sequence to ingest all collected data, followed by spark_pipeline.py which handles the joins, sentiment analysis, feature engineering, and Parquet output. duckdb_queries.py then runs the 12 analytical queries against the Parquet files, and finally data_lineage.py logs all 22 pipeline steps to PostgreSQL.
The practical value of this is that anyone with Docker and Python installed can clone the repository, add the .env file, and reproduce every result in this notebook with a single command, no manual database setup, no step-by-step script execution, and no dependency on a specific machine configuration.
from IPython.display import display, Markdown
with open('../run_pipeline.sh', 'r') as f:
content = f.read()
display(Markdown("""
### `run_pipeline.sh` — End-to-End Pipeline Orchestration
The script automates the full pipeline in 6 steps:
| Step | Action | What it does |
|------|--------|-------------|
| 0 | Prerequisite checks | Validates Docker, Python, Java, `.env`, and data files are present |
| 1 | Start containers | `docker compose up -d` (PostgreSQL + MongoDB) |
| 2 | Wait for readiness | Polls both databases for up to 30s each |
| 3 | Load data | Runs `load_postgres.py` and `load_mongodb.py` |
| 4 | Spark pipeline | Joins, VADER sentiment, feature engineering, Parquet export |
| 5 | DuckDB queries | 12 analytical queries on Parquet files (Lakehouse pattern) |
| 6 | Data lineage | Logs all 22 pipeline steps to PostgreSQL |
**Reproducibility:** Any team member can clone the repo, run `./run_pipeline.sh`, and get identical results — Docker ensures consistent database environments, and Git tracks all code changes.
Raw Pipeline:
"""))
display(Markdown(f'```bash\n{content}\n```'))
run_pipeline.sh — End-to-End Pipeline Orchestration¶
The script automates the full pipeline in 6 steps:
| Step | Action | What it does |
|---|---|---|
| 0 | Prerequisite checks | Validates Docker, Python, Java, .env, and data files are present |
| 1 | Start containers | docker compose up -d (PostgreSQL + MongoDB) |
| 2 | Wait for readiness | Polls both databases for up to 30s each |
| 3 | Load data | Runs load_postgres.py and load_mongodb.py |
| 4 | Spark pipeline | Joins, VADER sentiment, feature engineering, Parquet export |
| 5 | DuckDB queries | 12 analytical queries on Parquet files (Lakehouse pattern) |
| 6 | Data lineage | Logs all 22 pipeline steps to PostgreSQL |
Reproducibility: Any team member can clone the repo, run ./run_pipeline.sh, and get identical results — Docker ensures consistent database environments, and Git tracks all code changes.
Raw Pipeline:
#!/bin/bash
# =============================================================================
# run_pipeline.sh — Indie Game Survival Guide
# =============================================================================
# Orchestrates the full data engineering pipeline end to end.
# Run from the project root: ./run_pipeline.sh
#
# Prerequisites:
# - Docker Desktop installed and running
# - Python 3.x with packages from requirements.txt installed
# - Java 17 installed (for PySpark)
# - .env file with database credentials
#
# What this script does:
# 1. Starts Docker containers (PostgreSQL + MongoDB)
# 2. Waits for databases to be ready
# 3. Loads data into PostgreSQL and MongoDB
# 4. Runs Spark pipeline (joins, sentiment, features, Parquet output)
# 5. Runs DuckDB analytical queries on Parquet files
# 6. Logs data lineage into PostgreSQL
# 7. Prints summary
# =============================================================================
set -e # Exit on any error
echo "============================================================"
echo " Indie Game Pipeline — Full Automation Script"
echo "============================================================"
echo ""
# ── Step 0: Check prerequisites ──────────────────────────────────────────────
echo "[0/6] Checking prerequisites..."
# Check Docker
if ! docker info > /dev/null 2>&1; then
echo " ERROR: Docker is not running. Please start Docker Desktop."
exit 1
fi
echo " ✓ Docker running"
# Check Python
if ! python3 --version > /dev/null 2>&1; then
echo " ERROR: Python3 not found."
exit 1
fi
echo " ✓ Python3 found: $(python3 --version)"
# Check Java
if ! java -version > /dev/null 2>&1; then
echo " ERROR: Java not found. Install with: brew install openjdk@17"
exit 1
fi
echo " ✓ Java found"
# Check .env
if [ ! -f .env ]; then
echo " ERROR: .env file not found. Create it with database credentials."
echo " See DOCKER_SETUP.md for instructions."
exit 1
fi
echo " ✓ .env file found"
# Check data files
if [ ! -f data_collection/indie_games_filtered.json ]; then
echo " ERROR: data_collection/indie_games_filtered.json not found."
echo " Run the collection scripts first or pull data from the repo."
exit 1
fi
echo " ✓ Data files present"
echo ""
# ── Step 1: Start Docker containers ─────────────────────────────────────────
echo "[1/6] Starting Docker containers..."
docker compose up -d
echo " ✓ Containers started"
echo ""
# ── Step 2: Wait for databases to be ready ───────────────────────────────────
echo "[2/6] Waiting for databases to be ready..."
# Wait for PostgreSQL
echo " Waiting for PostgreSQL..."
for i in $(seq 1 30); do
if docker exec indie_postgres pg_isready -U indie -d indie_games > /dev/null 2>&1; then
echo " ✓ PostgreSQL ready"
break
fi
if [ $i -eq 30 ]; then
echo " ERROR: PostgreSQL did not start within 30 seconds"
exit 1
fi
sleep 1
done
# Wait for MongoDB
echo " Waiting for MongoDB..."
for i in $(seq 1 30); do
if docker exec indie_mongo mongosh --quiet -u indie -p pipeline2026 --authenticationDatabase admin --eval "db.runCommand({ping:1})" > /dev/null 2>&1; then
echo " ✓ MongoDB ready"
break
fi
if [ $i -eq 30 ]; then
echo " ERROR: MongoDB did not start within 30 seconds"
exit 1
fi
sleep 1
done
echo ""
# ── Step 3: Load data into databases ────────────────────────────────────────
echo "[3/6] Loading data into databases..."
echo ""
echo "--- PostgreSQL ---"
python3 databases/load_postgres.py
echo ""
echo "--- MongoDB ---"
python3 databases/load_mongodb.py
echo ""
# ── Step 4: Run Spark pipeline ───────────────────────────────────────────────
echo "[4/6] Running Spark pipeline..."
echo ""
python3 spark_pipeline/spark_pipeline.py
echo ""
# ── Step 5: Run DuckDB queries ───────────────────────────────────────────────
echo "[5/6] Running DuckDB analytical queries..."
echo ""
python3 duckdb_queries.py
echo ""
# ── Step 6: Log data lineage ────────────────────────────────────────────────
echo "[6/6] Logging data lineage..."
echo ""
python3 data_lineage.py
echo ""
# ── Summary ──────────────────────────────────────────────────────────────────
echo "============================================================"
echo " PIPELINE COMPLETE"
echo "============================================================"
echo ""
echo " Databases:"
echo " PostgreSQL: 11 data tables + 2 Spark output tables + lineage"
echo " MongoDB: 2 collections (reviews + Reddit)"
echo ""
echo " Parquet files:"
echo " storage/master_dataset.parquet (1,216 games × 59 columns)"
echo " storage/sentiment_scores.parquet"
echo " storage/pre_launch_signals.parquet"
echo ""
echo " Analysis:"
echo " 12 DuckDB queries executed"
echo " storage/master_dataset.csv exported for notebook"
echo " 22 lineage entries logged"
echo " storage/data_lineage_diagram.md (Mermaid diagram)"
echo ""
echo " Next: Open notebooks/ and build visualisations"
echo "============================================================"
9. Key Findings¶
The analysis reveals that success is multifactorial: no single variable predicts whether a game will break out or underperform.
Pricing: Free-to-play and budget-priced games (under £5) produce the highest proportion of Breakout Hits at 40% and 38% respectively, but the £15-30 tier generates the highest average peak concurrent users (2,613), suggesting mid-price games attract more engaged players.
Community presence: Games with a dedicated subreddit have significantly higher success rates across all tiers, and Reddit engagement correlates positively with ownership.
Review sentiment: Breakout Hits average higher VADER compound scores and a higher percentage of positive Steam reviews. However, pre-launch Google Trends interest is nearly identical across tiers (9.26 for successful games vs 9.61 for failures), indicating that buzz alone does not predict commercial outcomes.
Tags: Certain Steam tags show near-perfect success rates when filtered to tags with at least 20 games: Great Soundtrack, Open World Survival Craft, Surreal, and Souls-like all achieved 100%, while Collectathon (69.6%) and Creature Collector (75%) lagged behind.
Market trends: While the number of indie releases has grown dramatically (from 6 games in 2011 to 204 in 2025 in our filtered dataset), the overall success rate has remained stable between 85% and 95%, suggesting that Steam's growing catalogue has not made it proportionally harder to succeed.
10. Conclusion¶
This project built an end-to-end data engineering pipeline to answer a single question: what makes an indie game succeed on Steam?
Starting from 122,611 games in the Kaggle Steam dataset, we filtered to 1,216 indie titles with meaningful player activity, then enriched them with data from seven additional sources: Steam Store metadata, 57,938 Steam reviews, SteamSpy ownership estimates, Reddit community data from 52,842 posts, IGDB cross-platform ratings, Google Trends search interest, and Steam Charts monthly player counts.
Raw data was stored in PostgreSQL (13 tables, star schema) for structured relational data and MongoDB (2 collections) for nested documents, processed and joined in Apache Spark with VADER sentiment analysis, written to Parquet for columnar analytical storage, and queried with DuckDB following the Lakehouse pattern.
11. Final Deliverable¶
The primary output of this pipeline is a single, analysis-ready dataset stored in storage/master_dataset.parquet and exported as storage/master_dataset.csv. It contains 1,216 rows (one per game) and 62 columns spanning:
- Identity:
app_id,name,release_date - Commercial performance:
price,price_tier,estimated_owners,peak_ccu - Review metrics:
positive,negative,review_positivity_ratio,metacritic_score - VADER sentiment:
avg_vader_compound,pct_positive_reviews,num_reviews_analysed - Community signals:
has_dedicated_subreddit,subreddit_subscribers,total_reddit_posts,reddit_engagement - Google Trends:
google_interest_pre_launch,google_interest_post_launch,trend_direction,days_to_peak_interest - Playtime:
average_playtime_forever,median_playtime_forever,playtime_retention_ratio - Platform presence:
num_platforms,steam_platform_count - Success classification:
success_tier,tier_rank,owners_estimate
The Parquet files are partitioned by success_tier, enabling efficient filtered reads when analysing a single tier. PostgreSQL tables remain available for relational queries, particularly for tag analysis via the normalised game_tags table.
All code, data, and Docker configuration are stored in the GitHub repository, and the full pipeline can be reproduced from a clean machine by running ./run_pipeline.sh after starting Docker.
12. Limitations and Future Work¶
12.1 Limitations¶
Several limitations should be noted.
Threshold heuristics: The four-tier success thresholds (20k / 200k / 1M owners) are heuristic rather than empirically derived. The same game could be classified differently depending on market context or threshold choice. The original binary success label is retained for backward compatibility but carries the same limitation.
Reddit noise: Games with generic names (e.g. "Mirror", "LIMBO") return search posts unrelated to the actual game, and subreddit detection is based on naming heuristics that may miss unofficial communities.
English-only sentiment: VADER sentiment analysis is English-only, so multilingual reviews are excluded from the sentiment scores.
Sparse Trends data: Google Trends data is sparse for smaller games. Interest values are capped at zero when search volume is too low to register, which may understate pre-launch buzz for niche titles.
Class imbalance: 70% of games fall into the top two tiers (Breakout Hit or Strong Success), which would require resampling or class weighting if used to train a machine learning classifier.
Ownership approximation: SteamSpy ownership estimates are reported as ranges (e.g. "200,000.. 500,000") rather than exact figures, so the numeric
owners_estimatecolumn represents a midpoint approximation.
12.2 Future Work¶
Three extensions would strengthen this pipeline.
Apache Iceberg: Could replace raw Parquet for schema evolution support. The
price_historytable already demonstrates this with a composite primary key across source and snapshot date, but Iceberg would enable time-travel queries and atomic updates across the full dataset.ML classification: The 62-feature master dataset is well-suited for a supervised classification model predicting
success_tierfrom pre-launch signals (price, platform count, pre-launch Reddit activity, Google Trends). Class imbalance would need to be addressed via SMOTE or class weighting.Genre expansion: The pipeline could be expanded beyond indie games to cover all Steam genres, enabling cross-genre comparisons and a larger training set for any downstream ML work.
On the operational side:
- Airflow: Migrating from
run_pipeline.shto Apache Airflow would add DAG-based scheduling, retry policies, and execution monitoring. - Great Expectations: Could enforce data contracts between pipeline stages.
- GitHub Actions: CI/CD would automate testing on every push.
- dbt: Would provide a version-controlled transformation layer with built-in documentation and lineage.
APPENDIX¶
Appendix A: Meeting Minutes¶
# Meeting minutes
display(Image('pipeline_notebook_images/meeting_minutes.png', width=1200))
Appendix B: Project Management¶
Overview¶
The project ran over approximately two weeks (February 26 – March 11, 2026), building a full end-to-end data engineering pipeline from data collection through to analytical querying. Work was split across individual ownership of scripts and components, with a full-team in-person session on Saturday March 7 at Ana's to integrate everything and run the pipeline together. The pipeline processes 1,216 indie games across 8 data sources, producing a 1,216 × 62 master dataset used for analysis.
Timeline¶
Wednesday, February 26: Project kickoff meeting: defined project scope, problem statement, and initial data sources. Tasks divided across the team. Trello board and tickets created.
Friday, February 28: Second meeting: data collection tasks assigned and schema discussion begun.
Sunday, March 2 (Orestis): Created GitHub repo, set up folder structure, ensured full team access to the project repo and Docker environment.
Monday–Tuesday, March 3–4 (Orestis): Validated Kaggle dataset (122,611 games) and chose the filtering threshold (CCU ≥ 50). Built filter_games.py, producing the working set of 1,216 indie games. Investigated the Reddit API, found that self-service key creation had been blocked since November 2025, and discovered a workaround using Reddit's public JSON endpoints. Collected 52,842 Reddit posts across all 1,216 games.
Individual Contributions¶
Orestis: Project setup, game filtering, Reddit data collection, IGDB collection, SteamSpy collection, data validation across all sources, Docker container setup.
Ana: PostgreSQL schema design and loading, MongoDB loading, DuckDB analytical queries, environment setup troubleshooting.
Samuel: Google Trends collection, Steam Charts web scraping, full Spark pipeline build and debugging, PySpark/MongoDB connector compatibility fix.
Theo: Steam Store API collection, data lineage tracking, Mermaid architecture diagram, project log and documentation throughout.
Azada: Pipeline automation and orchestration, team coordination, meeting organisation, project management.
Key Challenges and How They Were Resolved¶
Reddit API access was blocked after Reddit removed self-service key creation in November 2025. We used Reddit's public JSON endpoints instead, which require no authentication and count as web scraping for the brief.
External game databases we had planned to use were unavailable — RAWG's signup was broken and GiantBomb's API was being rebuilt. We switched to IGDB via Twitch OAuth2, which also gave us a genuine authenticated API example for the pipeline.
Steam Charts blocked automated requests via Cloudflare. We resolved this with randomised delays between requests and User-Agent rotation.
mongo-spark-connector was incompatible with PySpark 4.1.1 due to changes in the internal Catalyst APIs. We downgraded to PySpark 3.5.5 to restore compatibility.
PySpark Row objects do not support .get(), which caused errors during Spark processing. We fixed this by converting rows with .asDict(recursive=True) after calling .collect().
game_overview view in PostgreSQL blocked Spark's overwrite mode because Spark tried to DROP and recreate the underlying table while the view still had a dependency on it. We fixed this by explicitly dropping the view before the Spark write and recreating it afterwards.
The above is a screenshot of the Kanban board we used during the project. We divided tasks into tickets, each assigned to a different aspect and phase of the project.
# Trello board screenshots
# During project
display(Image('pipeline_notebook_images/trello_before.jpeg', width=1200))
# Project finalised screenshot
display(Image('pipeline_notebook_images/trello_after.jpeg', width=1200))
Appendix C: AI Usage Acknowledgement¶
This project was completed with assistive AI usage, in accordance with the UCL MSIN0166 guidelines.
Tool Used¶
- Claude Code (claude-sonnet-4-6) via the Anthropic Claude Code CLI — used throughout the project as a coding assistant.
How AI Was Used¶
| Area | Description |
|---|---|
| Spark Pipeline | Claude Code assisted in writing and debugging spark_pipeline.py, including the join logic across 8 data sources, VADER sentiment integration, and feature engineering (success tier classification, playtime ratios, subreddit metrics). |
| DuckDB Queries | Claude Code helped write and refine the 12 analytical queries in duckdb_queries.py, particularly for multi-source joins, aggregation logic, and formatting results for visualisation. |
| Data Collection Scripts | Assisted in debugging API integrations (Steam, SteamSpy, IGDB OAuth2, Reddit, Google Trends) and handling edge cases (rate limiting, missing fields, data normalisation). |
| Notebook Visualisations | Claude Code assisted in generating matplotlib/seaborn charts for Q1–Q12 findings and writing the accompanying insight commentary. |
| Data Lineage | Helped design and implement the 22-step lineage tracking in data_lineage.py. |
| Documentation | Assisted in drafting sections of this notebook including architecture descriptions, schema explanations, and the key findings summary. |
What AI Did Not Do¶
All analytical conclusions, research questions, schema design decisions, and project direction were determined by the team. AI was used as a productivity tool — equivalent to an advanced autocomplete — rather than as a replacement for understanding or judgement. All AI-generated code was reviewed, tested, and modified by team members before inclusion.
Declaration¶
We confirm that all AI usage falls within the Assistive category as defined in the UCL assessment brief. The use of AI tools has been fully disclosed above, and we take responsibility for all content submitted.
References¶
Bustos, M. (2026) Steam Games Dataset [Dataset]. Kaggle. Available at: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset (Accessed: 11 March 2026).
Hutto, C.J. (2014) vaderSentiment: VADER Sentiment Analysis [Software]. GitHub. Available at: https://github.com/cjhutto/vaderSentiment (Accessed: 11 March 2026).
Valve Corporation (2025) Steam Web API Documentation. Available at: https://steamcommunity.com/dev (Accessed: 11 March 2026).
Valve Corporation (2025) Steam Store API. Available at: https://store.steampowered.com/api/ (Accessed: 11 March 2026).
SteamSpy (2025) SteamSpy API Documentation. Available at: https://steamspy.com/api.php (Accessed: 11 March 2026).
Twitch Interactive, Inc. (2025) IGDB API Documentation. Available at: https://api-docs.igdb.com/ (Accessed: 11 March 2026).
Reddit, Inc. (2025) Reddit API Documentation. Available at: https://www.reddit.com/dev/api/ (Accessed: 11 March 2026).
GeneralMills (2023) pytrends: Unofficial Google Trends API. Available at: https://github.com/GeneralMills/pytrends (Accessed: 11 March 2026).
Steam Charts (2025) Steam Charts — Steam's Top Games by Current Players. Available at: https://steamcharts.com (Accessed: 11 March 2026).
Apache Software Foundation (2024) Apache Spark Documentation, Version 3.5. Available at: https://spark.apache.org/docs/3.5.0/ (Accessed: 11 March 2026).