About Split The Upright

An NFL game outcome prediction system leveraging ensemble machine learning methods and comprehensive feature engineering.

Background

I'm Keirnen, a University of Colorado Boulder graduate with a degree in Information Science specializing in data science. Split The Upright represents the culmination of applied machine learning research aimed at quantifying the predictability of NFL game outcomes.

The project integrates multiple data sources, sophisticated feature engineering pipelines, and ensemble modeling techniques to generate calibrated win probabilities that consistently outperform baseline prediction methods.

Project Overview

Each week, the system ingests team performance metrics, injury reports, meteorological data, momentum indicators, and historical matchup statistics to produce win probability estimates for every scheduled game.

This platform serves as a public implementation of the prediction pipeline, offering full transparency into methodology, model architecture, and real-time performance tracking against live NFL outcomes.

Navigation Guide

Weekly Predictions

Access game-by-game predictions including projected winner, confidence intervals, and model-derived point differentials.

Confidence Stratification

Predictions are categorized into High, Medium, and Low confidence tiers based on model certainty thresholds.

Stat Lab

Interactive analytics dashboard featuring the underlying team and player metrics that inform model predictions.

Performance Tracking

Comprehensive season-long accuracy metrics with no selection bias. Every prediction is documented and evaluated.

Matchup Simulator

The Matchup Simulator provides on-demand predictions for any NFL matchup configuration, leveraging the same ensemble model architecture deployed for weekly predictions while enabling granular control over contextual variables.

API Architecture

The simulator operates through a FastAPI backend deployed on Render, executing predictions via RESTful endpoints that accept matchup parameters and return calibrated probabilities with full statistical breakdowns.
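As a rough illustration of that request/response contract, the sketch below models a hypothetical shape for the simulator endpoint in plain Python. All field names, default values, and tier cutoffs here are assumptions for illustration, not the actual API schema.

```python
from dataclasses import dataclass, field

@dataclass
class MatchupRequest:
    """Hypothetical matchup parameters accepted by the endpoint."""
    home_team: str
    away_team: str
    neutral_site: bool = False
    dome: bool = False
    temperature_f: float = 65.0
    wind_mph: float = 5.0
    home_qb_out: bool = False
    away_qb_out: bool = False

@dataclass
class MatchupResponse:
    """Hypothetical response: calibrated probability plus the breakdown."""
    home_win_prob: float          # calibrated ensemble probability
    confidence_tier: str          # "High" / "Medium" / "Low"
    projected_spread: float       # positive favors the home team
    model_breakdown: dict = field(default_factory=dict)

def tier(prob: float) -> str:
    """Map a win probability to a confidence tier (illustrative cutoffs)."""
    edge = abs(prob - 0.5)
    if edge >= 0.15:
        return "High"
    if edge >= 0.07:
        return "Medium"
    return "Low"
```

A client would POST a `MatchupRequest`-shaped payload and receive a `MatchupResponse`-shaped body; the tier thresholds above are placeholders, since the text does not publish the deployed cutoffs.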

Technical Implementation

Real-Time Prediction Engine

Requests are processed through the same four-model ensemble (Logistic Regression, Random Forest, XGBoost, LightGBM) used for weekly predictions. Each model generates independent probabilities that are weighted, calibrated via isotonic regression, and synthesized into a final prediction with confidence bounds.
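A minimal sketch of the weighted blend, using the 35/25/20/20 weights listed in the methodology section below; the isotonic calibration step is omitted here, and the per-model probabilities are made-up inputs.

```python
# Ensemble weights from the methodology section (Brier-score derived).
WEIGHTS = {"logreg": 0.35, "rf": 0.25, "xgb": 0.20, "lgbm": 0.20}

def blend(probs: dict) -> float:
    """Weighted average of per-model home-win probabilities."""
    return sum(WEIGHTS[m] * p for m, p in probs.items())

# Example: four independent model outputs for one matchup.
p = blend({"logreg": 0.62, "rf": 0.58, "xgb": 0.66, "lgbm": 0.60})
```

The blended value would then pass through isotonic calibration and the confidence-bound logic before being returned.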

Consensus Decision Logic

The system employs dual prediction pathways: classification models output win probabilities while spread regression predicts point differentials. When both approaches agree, confidence is amplified; disagreement triggers a confidence-weighted arbitration mechanism that selects the more certain prediction.
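The arbitration rule can be sketched as follows. The agreement boost, the spread-to-confidence mapping, and the 75% ceiling application are illustrative placeholders, not the deployed values.

```python
def consensus(win_prob, spread, boost=0.05, ceiling=0.75):
    """
    Sketch of the dual-pathway consensus rule.
    win_prob: classifier's home-win probability.
    spread:   regression's projected point differential (positive = home).
    Returns (pick, confidence).
    """
    clf_pick = "home" if win_prob >= 0.5 else "away"
    reg_pick = "home" if spread >= 0 else "away"
    clf_conf = max(win_prob, 1 - win_prob)
    # Crude illustrative map from spread magnitude to a confidence level.
    reg_conf = min(0.5 + abs(spread) / 28.0, 1.0)
    if clf_pick == reg_pick:
        # Agreement amplifies confidence, capped at the ceiling.
        return clf_pick, min(clf_conf + boost, ceiling)
    # Disagreement: defer to the more certain pathway.
    if clf_conf >= reg_conf:
        return clf_pick, min(clf_conf, ceiling)
    return reg_pick, min(reg_conf, ceiling)
```

With agreement (e.g. a 62% home probability and a home-favoring spread) the pick keeps the classifier's side at boosted confidence; with disagreement, whichever pathway is more certain wins the arbitration.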

Feature Vector Construction

For each matchup, the API dynamically constructs a 40-feature vector incorporating team-specific rolling averages, ELO ratings, quarterback valuations, opponent-adjusted metrics, and user-specified contextual modifiers. This vector is preprocessed through the same imputation and scaling pipeline used during model training.
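A minimal stand-in for that shared preprocessing step, assuming mean imputation followed by standard scaling; the production pipeline may use library implementations instead, but the key pattern shown here is fitting statistics once on training data and reusing them at prediction time.

```python
import numpy as np

class Preprocessor:
    """Fit-once imputation and scaling, reused for every API request."""

    def fit(self, X):
        # Column means ignoring NaNs, used to fill missing values.
        self.means_ = np.nanmean(X, axis=0)
        filled = np.where(np.isnan(X), self.means_, X)
        # Scaling statistics computed on the imputed training matrix.
        self.mu_ = filled.mean(axis=0)
        self.sigma_ = filled.std(axis=0)
        return self

    def transform(self, X):
        filled = np.where(np.isnan(X), self.means_, X)
        return (filled - self.mu_) / self.sigma_
```

At serving time, the dynamically constructed 40-feature vector would be passed through `transform` with the statistics frozen from training, never refit on request data.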

Statistical Transparency

Every prediction returns granular model breakdowns showing individual model probabilities, agreement status, confidence tier classification, and projected point spreads, enabling full visibility into the decision-making process.

Configurable Parameters

Environmental Conditions

Weather Integration: Temperature, wind speed, and precipitation forecasts are encoded as passing condition scores that modulate offensive efficiency projections. Dome games override environmental variables with neutral passing conditions.

Neutral Site Adjustment: Removes home-field advantage coefficient from ELO calculations and strips location-based momentum indicators.

Injury Scenarios

Quarterback Impact: Starting QB absences trigger substantial adjustments to team offensive EPA projections, with elite quarterback injuries (Mahomes, Allen, Jackson) applying more severe penalties than replacement-level starters.

Future Enhancement: Expansion to model injuries for skill position players and key defensive personnel is planned for subsequent iterations.

Team Strength Modifiers

Offseason Mode: During periods without current-season data, users can apply manual strength adjustments (±15%) to account for roster changes, coaching transitions, or anticipated regression/progression.

Historical Comparisons

Coming Soon: The ability to simulate matchups using historical team data from any season since 2017, enabling "what-if" scenarios such as peak 2019 Ravens versus 2023 49ers.

Comprehensive Statistics Display

Following each prediction, the interface presents a side-by-side comparison of team statistics across three categories:

Advanced Metrics

ELO ratings, Pythagorean win expectation, and point differential. This provides high-level indicators of team strength derived from season-long performance patterns.

Offensive Statistics

EPA per play, pass/rush EPA splits, success rate, yards per play, third-down efficiency, completion percentage, turnover rate, and red zone touchdown percentage.

Defensive Statistics

EPA allowed per play, pass EPA suppression, opponent success rate, yards allowed, takeaway generation, and red zone defense. All of these metrics are opponent-adjusted to account for schedule strength.

The superior team in each statistical category is highlighted in green, enabling rapid identification of matchup advantages and providing context for the model's prediction rationale.

⚡ Cold Start Notice

The prediction API operates on Render's free tier with cold start latency. Initial requests after periods of inactivity may require 30-60 seconds for server initialization. Subsequent predictions execute in under 2 seconds.

Technical Methodology

A detailed examination of the prediction pipeline architecture, from raw data ingestion through final probability calibration.

At a glance: 189 features engineered · 40 features selected · 5,700+ historical games · 4-model ensemble

Data Sources

The system aggregates data from multiple authoritative sources to construct a comprehensive analytical foundation:

Play-by-Play Data

Complete snap-level data from 2017 to present via nflreadpy, encompassing more than 5,700 team-game observations.

Injury Intelligence

Real-time injury report data sourced from ESPN's API, cross-referenced with player value metrics derived from EPA contributions.

Meteorological Data

OpenWeatherMap forecasts calibrated to precise kickoff times, capturing game-time conditions rather than current observations.

Advanced Analytics

Expected Points Added (EPA), success rate metrics, Next Gen Stats tracking data, and FTN charting analysis.

Feature Engineering Pipeline

The pipeline constructs 189 candidate features from source data, subsequently refined through recursive feature elimination and importance-based selection to identify the 40 most predictive variables.
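The importance-based half of that selection can be sketched as a simple top-k filter; in the real pipeline, the scores come from the trained models and the filter is combined with recursive feature elimination, so this is a simplification.

```python
import numpy as np

def select_top_k(feature_names, importances, k=40):
    """
    Keep the k features with the highest importance scores,
    returned in their original column order.
    """
    top = np.argsort(importances)[::-1][:k]   # indices of the k best scores
    return [feature_names[i] for i in sorted(top)]
```

Applied to the 189 candidate features, `k=40` yields the reduced feature set the models are trained on.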

Feature Taxonomy:

Offensive Efficiency

EPA per play, success rate, yards per play, completion percentage, third-down conversion rate, turnover frequency

Defensive Performance

Defensive EPA allowed, opponent success rate suppression, takeaway generation, yards allowed per play

Situational Variables

ELO ratings, momentum trajectories, weather impact coefficients, rest differentials, primetime indicators

Personnel Impact

Skill position EPA loss from injuries, defensive availability index, quarterback status weighted by performance tier

Exponentially-Weighted Rolling Aggregations

Static season averages fail to capture team trajectory. A team that started 1-4 but won its next six games exhibits different characteristics than its aggregate statistics suggest. The system computes exponentially-weighted rolling averages across 3-, 5-, and 8-game windows:

Game t-1: 100% weight
Game t-2: ~70% weight
Game t-3: ~50% weight
Decay parameter λ = 0.35
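The weights listed above are consistent with w_k = e^(-lambda*k) at lambda = 0.35 (e^-0.35 is about 0.70, e^-0.70 about 0.50). A minimal sketch of the weighted average, with window handling simplified to a single list of recent values:

```python
import math

LAM = 0.35  # decay parameter from the text

def ewma(values, lam=LAM):
    """
    Exponentially-weighted average of recent game values, most recent first:
    values[0] is game t-1 (weight 1.0), values[1] is t-2 (~0.70), and so on.
    """
    weights = [math.exp(-lam * k) for k in range(len(values))]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

The same computation is available on pandas objects via the `ewm` family of methods, which is the likely route when the metric lives in a DataFrame of team-game rows.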

Opponent-Adjusted Metrics

Each offensive metric is paired with the corresponding defensive metric of the scheduled opponent. A high-efficiency offense facing an elite defensive unit presents a fundamentally different prediction context than the same offense against a permissive defense. The model evaluates matchup-specific dynamics for every game.

Feature Importance Analysis

The following visualization displays feature importance scores derived from the XGBoost component of the ensemble. ELO-based features exhibit dominant predictive power, complemented by offensive efficiency metrics, defensive indicators, and situational variables:

[Figure: Top 20 Feature Importances (XGBoost)]

Ensemble Architecture

The prediction system employs a heterogeneous ensemble combining four algorithms with complementary inductive biases:

Model weights and contributions:

Logistic Regression (35%): Linear baseline with strong interpretability and stable coefficient estimation across the feature space.
Random Forest (25%): Non-linear pattern detection via bagged decision trees with inherent robustness to outliers.
XGBoost (20%): Gradient-boosted trees optimized for complex feature interactions and residual error minimization.
LightGBM (20%): Histogram-based gradient boosting with a leaf-wise growth strategy for improved ensemble diversity.

Ensemble weights are determined by Brier score performance on held-out validation data. Models producing superior probability calibration receive proportionally greater influence in the aggregated prediction.
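One plausible way to derive such weights is sketched below with an inverse-Brier scheme; the text specifies only that lower Brier scores earn more weight, so the exact mapping here is an assumption.

```python
def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_weights(scores):
    """
    Map validation Brier scores to ensemble weights so better-calibrated
    models (lower Brier) receive proportionally more weight.
    Illustrative inverse-score scheme, not necessarily the deployed one.
    """
    inv = {m: 1.0 / s for m, s in scores.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}
```

Each model's held-out predictions would be scored with `brier`, and the resulting dictionary of scores normalized into the blend weights.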

Probability Calibration

Raw model outputs frequently exhibit systematic overconfidence. A prediction of 85% probability that historically resolves at 70% represents a calibration failure that undermines decision-theoretic utility.

1. Isotonic Regression

A non-parametric calibration method that learns the empirical relationship between predicted and observed probabilities, applying monotonic corrections to future outputs.

2. Confidence Ceiling (75%)

NFL outcomes exhibit substantial inherent variance. Even optimal feature configurations rarely justify extreme confidence levels, necessitating a hard probability ceiling to maintain calibration integrity.

Post-calibration, probability estimates demonstrate strong reliability: when the model outputs 70%, the empirical win rate converges to approximately 70% across sufficient sample sizes.
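The calibration steps can be sketched end to end. This uses a from-scratch Pool Adjacent Violators fit for clarity; a production pipeline would more likely use a fitted isotonic regressor (e.g. scikit-learn's IsotonicRegression) to map future raw outputs, and this sketch shows only the fit on historical pairs plus the 75% ceiling.

```python
def pava(values):
    """Pool Adjacent Violators: non-decreasing isotonic fit (unit weights)."""
    blocks = []  # each block is [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    out = []
    for m, c in blocks:
        out.extend([m] * c)
    return out

def calibrate(preds, outcomes, ceiling=0.75):
    """
    Sort by raw prediction, fit PAVA to the observed 0/1 outcomes,
    and cap the calibrated probabilities at the confidence ceiling.
    """
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    fitted = pava([outcomes[i] for i in order])
    calibrated = [0.0] * len(preds)
    for rank, i in enumerate(order):
        calibrated[i] = min(fitted[rank], ceiling)
    return calibrated
```

Note how an overconfident raw 0.9 is pulled down to the 0.75 ceiling, exactly the behavior the ceiling is meant to enforce.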

Temporal Weighting Strategy

NFL team composition and performance characteristics evolve substantially across seasons through roster turnover, coaching changes, and schematic adaptation. Historical data from distant seasons exhibits diminishing relevance to current predictive tasks. The training pipeline implements aggressive temporal weighting:

Season sample weights:

Current (2025): 5.0x
Previous (2024): 2.0x
Two seasons prior (2023): 1.0x
Three seasons prior (2022): 0.6x
Four+ seasons prior: 0.4x tapering to 0.05x

Current season observations alone constitute approximately 50% of the effective training weight. Historical data provides sample size stability while recent performance patterns dominate model optimization.
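The weighting table maps to a simple lookup. The taper for four-plus seasons is shown here as a geometric decay down to a 0.05 floor, which is an assumption about the shape of the "0.4x to 0.05x" range rather than the documented schedule.

```python
def season_weight(season, current=2025):
    """Sample weight for a training observation from a given season."""
    age = current - season
    table = {0: 5.0, 1: 2.0, 2: 1.0, 3: 0.6}
    if age in table:
        return table[age]
    # Assumed geometric taper for 4+ seasons, floored at 0.05.
    return max(0.4 * (0.5 ** (age - 4)), 0.05)
```

Each historical game row would carry `season_weight(row_season)` as its sample weight during model fitting.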

Data Integrity Protocols

Initial model iterations exhibited artificially elevated accuracy metrics exceeding 75%, subsequently traced to subtle data leakage — future information contaminating feature construction. A comprehensive pipeline audit identified and remediated these issues:

Temporal Isolation

Rolling feature calculations utilize exclusively historical observations. Week 5 predictions access only Weeks 1-4 data.
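This rule reduces to a one-line discipline: the feature for game i may aggregate only games before i (the pattern pandas users express as `.shift(1).rolling(window)`). A dependency-free sketch:

```python
def leakage_safe_rolling_mean(values, window=3):
    """
    Rolling mean over strictly prior games: the feature for game i
    averages games i-window .. i-1 and never includes game i itself.
    Returns None where no prior games exist.
    """
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if prior else None)
    return out
```

A Week 5 feature built this way can only see Weeks 1-4, by construction.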

Walk-Forward Validation

Model evaluation occurs exclusively on temporally subsequent data that was unavailable during training.
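Walk-forward evaluation can be sketched as an expanding-window split generator; the initial window and step sizes below are illustrative, not the deployed fold configuration.

```python
def walk_forward_splits(n_games, initial=4, step=1):
    """
    Yield (train_indices, test_indices) pairs where each fold trains on
    games [0, cut) and tests on [cut, cut + step), so the test data is
    always strictly later than anything seen in training.
    """
    cut = initial
    while cut < n_games:
        yield list(range(cut)), list(range(cut, min(cut + step, n_games)))
        cut += step
```

scikit-learn's TimeSeriesSplit implements the same expanding-window idea for array-shaped data.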

Pre-Game Feature Constraint

Feature engineering strictly excludes post-game statistics from pre-game prediction contexts.

Validated accuracy on genuinely unseen data provides meaningful performance benchmarks; inflated metrics from leaked information do not.

Model Performance

The system achieves approximately 65% accuracy on live 2025 season predictions. Contextual benchmarks for comparison:

Vegas Consensus

Approximately 52-55% against the spread after accounting for vigorish

Random Baseline

50% (uniform probability assignment)

This Implementation

~65% on straight-up winner prediction

Weekly variance remains substantial. Individual week accuracy ranges from approximately 55% to 85% depending on upset frequency. The objective is a demonstrable predictive edge across full-season sample sizes.

Key Technical Insights

1. Data integrity supersedes model sophistication

Architectural complexity yields minimal benefit when underlying data pipelines contain temporal contamination.

2. Feature quality dominates feature quantity

Dimensionality reduction from 189 to 40 features improved predictive performance through noise elimination.

3. Calibration determines practical utility

A model producing reliable probability estimates enables rational decision-making; miscalibrated confidence does not.

4. Irreducible uncertainty defines the ceiling

NFL outcomes contain inherent stochasticity. The objective is consistent edge generation, not deterministic prediction.

Disclaimer

Important Notice

This platform represents a personal research project developed for educational and analytical purposes. The predictions and analyses presented do not constitute financial or gambling advice. Users who choose to incorporate these predictions into wagering decisions do so at their own discretion and risk.