About Split The Upright
An NFL game outcome prediction system leveraging ensemble machine learning methods and comprehensive feature engineering.
Background
I'm Keirnen, a University of Colorado Boulder graduate with a degree in Information Science specializing in data science. Split The Upright represents the culmination of applied machine learning research aimed at quantifying the predictability of NFL game outcomes.
The project integrates multiple data sources, sophisticated feature engineering pipelines, and ensemble modeling techniques to generate calibrated win probabilities that consistently outperform baseline prediction methods.
Project Overview
Each week, the system ingests team performance metrics, injury reports, meteorological data, momentum indicators, and historical matchup statistics to produce win probability estimates for every scheduled game.
This platform serves as a public implementation of the prediction pipeline, offering full transparency into methodology, model architecture, and real-time performance tracking against live NFL outcomes.
Navigation Guide
Weekly Predictions
Access game-by-game predictions including projected winner, confidence intervals, and model-derived point differentials.
Confidence Stratification
Predictions are categorized into High, Medium, and Low confidence tiers based on model certainty thresholds.
Stat Lab
Interactive analytics dashboard featuring the underlying team and player metrics that inform model predictions.
Performance Tracking
Comprehensive season-long accuracy metrics with no selection bias. Every prediction is documented and evaluated.
Matchup Simulator
The Matchup Simulator provides on-demand predictions for any NFL matchup configuration, leveraging the same ensemble model architecture deployed for weekly predictions while enabling granular control over contextual variables.
API Architecture
The simulator operates through a FastAPI backend deployed on Render, executing predictions via RESTful endpoints that accept matchup parameters and return calibrated probabilities with full statistical breakdowns.
Technical Implementation
Real-Time Prediction Engine
Requests are processed through the same four-model ensemble (Logistic Regression, Random Forest, XGBoost, LightGBM) used for weekly predictions. Each model generates independent probabilities that are weighted, calibrated via isotonic regression, and synthesized into a final prediction with confidence bounds.
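The weighted blend described above can be sketched as a soft-voting step. The weights mirror those listed later in the methodology section; the model names and the example probabilities are placeholders for illustration, not the production objects:

```python
# Hypothetical sketch of the four-model weighted blend. The weights match
# the ensemble table in the methodology; everything else is illustrative.
WEIGHTS = {"logreg": 0.35, "rf": 0.25, "xgb": 0.20, "lgbm": 0.20}

def ensemble_probability(model_probs: dict[str, float]) -> float:
    """Blend per-model home-win probabilities into one weighted estimate."""
    return sum(WEIGHTS[name] * p for name, p in model_probs.items())

blended = ensemble_probability(
    {"logreg": 0.62, "rf": 0.58, "xgb": 0.66, "lgbm": 0.60}
)
```

Calibration (isotonic regression) and confidence bounds are applied downstream of this blend.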
Consensus Decision Logic
The system employs dual prediction pathways: classification models output win probabilities while spread regression predicts point differentials. When both approaches agree, confidence is amplified; disagreement triggers a confidence-weighted arbitration mechanism that selects the more certain prediction.
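The arbitration logic might look like the following sketch. The source states only that the more certain prediction wins a disagreement; the specific normalization (distance from 50% for the classifier, a 7-point scale for the spread model) is an assumption for illustration:

```python
def consensus(win_prob: float, spread: float) -> tuple[str, bool]:
    """Combine the classifier's home-win probability with the spread model.

    Positive spread means the home team is projected to win by that margin.
    Returns (pick, pathways_agree).
    """
    clf_pick = "home" if win_prob >= 0.5 else "away"
    spread_pick = "home" if spread > 0 else "away"
    if clf_pick == spread_pick:
        return clf_pick, True  # agreement amplifies confidence downstream
    # Confidence-weighted arbitration: compare normalized distances from
    # each model's decision boundary (the 7-point scale is an assumption).
    clf_margin = abs(win_prob - 0.5) / 0.5
    spread_margin = min(abs(spread) / 7.0, 1.0)
    return (clf_pick if clf_margin >= spread_margin else spread_pick), False
```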
Feature Vector Construction
For each matchup, the API dynamically constructs a 40-feature vector incorporating team-specific rolling averages, ELO ratings, quarterback valuations, opponent-adjusted metrics, and user-specified contextual modifiers. This vector is preprocessed through the same imputation and scaling pipeline used during model training.
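A minimal sketch of that construction step, assuming mean imputation and standard (z-score) scaling fitted at training time; the feature names shown are placeholders, not the production 40-feature schema:

```python
import numpy as np

# Placeholder slice of the feature schema; the real pipeline uses 40 names.
FEATURE_ORDER = ["elo_diff", "off_epa_ewma_5", "def_epa_ewma_5", "rest_diff"]

def build_feature_vector(raw: dict[str, float],
                         means: dict[str, float],
                         stds: dict[str, float]) -> np.ndarray:
    """Impute missing values with the training mean, then z-score scale,
    mirroring the training-time preprocessing pipeline."""
    row = []
    for name in FEATURE_ORDER:
        value = raw.get(name, means[name])              # mean imputation
        row.append((value - means[name]) / stds[name])  # standard scaling
    return np.array(row)
```

Using the same fitted means and standard deviations as training is what keeps serving-time inputs on the distribution the models saw.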
Statistical Transparency
Every prediction returns granular model breakdowns showing individual model probabilities, agreement status, confidence tier classification, and projected point spreads, enabling full visibility into the decision-making process.
Configurable Parameters
Environmental Conditions
Weather Integration: Temperature, wind speed, and precipitation forecasts are encoded as passing condition scores that modulate offensive efficiency projections. Dome games override environmental variables with neutral passing conditions.
Neutral Site Adjustment: Removes home-field advantage coefficient from ELO calculations and strips location-based momentum indicators.
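The weather encoding could be sketched as a single 0-to-1 passing-conditions score. The specific weights below are hypothetical; the source says only that temperature, wind, and precipitation modulate the score and that domes override it with neutral conditions:

```python
def passing_condition_score(temp_f: float, wind_mph: float,
                            precip_prob: float, dome: bool) -> float:
    """Collapse forecast variables into a 0-1 passing-conditions score.

    Weights are illustrative assumptions. Dome games ignore the forecast
    entirely and return the neutral (ideal) score, as described above.
    """
    if dome:
        return 1.0
    score = 1.0
    score -= min(wind_mph, 30.0) / 30.0 * 0.4   # wind is the main drag
    score -= precip_prob * 0.3                  # rain/snow probability
    if temp_f < 32.0:                           # freezing temperatures
        score -= 0.2
    return max(score, 0.0)
```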
Injury Scenarios
Quarterback Impact: Starting QB absences trigger substantial adjustments to team offensive EPA projections, with elite quarterback injuries (Mahomes, Allen, Jackson) applying more severe penalties than replacement-level starters.
Future Enhancement: Expansion to model injuries for skill position players and key defensive personnel is planned for subsequent iterations.
Team Strength Modifiers
Offseason Mode: During periods without current-season data, users can apply manual strength adjustments (±15%) to account for roster changes, coaching transitions, or anticipated regression/progression.
Historical Comparisons
Coming Soon: The ability to simulate matchups using historical team data from any season since 2017, enabling "what-if" scenarios such as peak 2019 Ravens versus 2023 49ers.
Comprehensive Statistics Display
Following each prediction, the interface presents a side-by-side comparison of team statistics across three categories:
Advanced Metrics
ELO ratings, Pythagorean win expectation, and point differential. These provide high-level indicators of team strength derived from season-long performance patterns.
Offensive Statistics
EPA per play, pass/rush EPA splits, success rate, yards per play, third-down efficiency, completion percentage, turnover rate, and red zone touchdown percentage.
Defensive Statistics
EPA allowed per play, pass EPA suppression, opponent success rate, yards allowed, takeaway generation, and red zone defense. All defensive metrics are opponent-adjusted to account for schedule strength.
The superior team in each statistical category is highlighted in green, enabling rapid identification of matchup advantages and providing context for the model's prediction rationale.
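Of the advanced metrics above, Pythagorean win expectation has a closed form worth showing. The sketch below uses the exponent of 2.37 commonly cited for the NFL; the exact exponent used by this system is an assumption:

```python
def pythagorean_expectation(points_for: float, points_against: float,
                            exponent: float = 2.37) -> float:
    """Expected win fraction from points scored and allowed.

    2.37 is the exponent commonly used for the NFL; the value this system
    actually uses is not stated in the source.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)
```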
⚡ Cold Start Notice
The prediction API operates on Render's free tier with cold start latency. Initial requests after periods of inactivity may require 30-60 seconds for server initialization. Subsequent predictions execute in under 2 seconds.
Technical Methodology
A detailed examination of the prediction pipeline architecture, from raw data ingestion through final probability calibration.
Data Sources
The system aggregates data from multiple authoritative sources to construct a comprehensive analytical foundation:
Play-by-Play Data
Complete snap-level data from 2017 to present via nflreadpy, encompassing more than 5,700 team-game observations.
Injury Intelligence
Real-time injury report data sourced from ESPN's API, cross-referenced with player value metrics derived from EPA contributions.
Meteorological Data
OpenWeatherMap forecasts calibrated to precise kickoff times, capturing game-time conditions rather than current observations.
Advanced Analytics
Expected Points Added (EPA), success rate metrics, Next Gen Stats tracking data, and FTN charting analysis.
Feature Engineering Pipeline
The pipeline constructs 189 candidate features from source data, subsequently refined through recursive feature elimination and importance-based selection to identify the 40 most predictive variables.
Feature Taxonomy:
Offensive Efficiency
EPA per play, success rate, yards per play, completion percentage, third-down conversion rate, turnover frequency
Defensive Performance
Defensive EPA allowed, opponent success rate suppression, takeaway generation, yards allowed per play
Situational Variables
ELO ratings, momentum trajectories, weather impact coefficients, rest differentials, primetime indicators
Personnel Impact
Skill position EPA loss from injuries, defensive availability index, quarterback status weighted by performance tier
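The 189-to-40 reduction described above could be sketched with scikit-learn's recursive feature elimination. The synthetic data and the choice of logistic regression as the elimination estimator are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real feature matrix: 200 games x 189 candidates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 189))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Recursively drop the 10 least important features per round until 40 remain.
selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=40, step=10)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of the 40 survivors
```

In practice this would be combined with importance-based selection from the tree models, as the text notes.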
Exponentially-Weighted Rolling Aggregations
Static season averages fail to capture team trajectory. A team that started 1-4 but then won six straight exhibits different characteristics than its aggregate statistics suggest. The system computes exponentially-weighted rolling averages across 3-, 5-, and 8-game windows:
Game t-1: 100% weight
Game t-2: ~70% weight
Game t-3: ~50% weight
Decay parameter λ = 0.35
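The weight schedule above is consistent with per-game decay of e^(-λ) for λ = 0.35 (1.00, ~0.70, ~0.50 for the last three games). A minimal sketch under that assumption:

```python
import numpy as np

LAM = 0.35  # decay parameter from the schedule above

def ewma(values: list[float], window: int) -> float:
    """Exponentially-weighted mean over the last `window` games.

    The most recent game gets full weight; each older game decays by
    e^(-LAM), matching the ~100% / ~70% / ~50% schedule.
    """
    recent = values[-window:][::-1]                  # newest game first
    weights = np.exp(-LAM * np.arange(len(recent)))  # 1.00, 0.70, 0.50, ...
    return float(np.dot(weights, recent) / weights.sum())
```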
Opponent-Adjusted Metrics
Each offensive metric is paired with the corresponding defensive metric of the scheduled opponent. A high-efficiency offense facing an elite defensive unit presents a fundamentally different prediction context than the same offense against a permissive defense. The model evaluates matchup-specific dynamics for every game.
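One plausible form of that pairing is a set of "net" matchup features, offense minus the specific defense it will face. The feature names here are placeholders for the production schema:

```python
def matchup_features(offense: dict[str, float],
                     opp_defense: dict[str, float]) -> dict[str, float]:
    """Pair each offensive metric with the scheduled opponent's
    corresponding defensive metric (illustrative key names)."""
    return {
        "net_pass_epa": offense["pass_epa"] - opp_defense["pass_epa_allowed"],
        "net_rush_epa": offense["rush_epa"] - opp_defense["rush_epa_allowed"],
        "net_success_rate": (offense["success_rate"]
                             - opp_defense["success_rate_allowed"]),
    }
```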
Feature Importance Analysis
The following visualization displays feature importance scores derived from the XGBoost component of the ensemble. ELO-based features exhibit dominant predictive power, complemented by offensive efficiency metrics, defensive indicators, and situational variables:
Ensemble Architecture
The prediction system employs a heterogeneous ensemble combining four algorithms with complementary inductive biases:
| Model | Weight | Contribution |
|---|---|---|
| Logistic Regression | 35% | Linear baseline with strong interpretability and stable coefficient estimation across feature space. |
| Random Forest | 25% | Non-linear pattern detection via bagged decision trees with inherent robustness to outliers. |
| XGBoost | 20% | Gradient boosted trees optimized for complex feature interactions and residual error minimization. |
| LightGBM | 20% | Histogram-based gradient boosting with leaf-wise growth strategy for improved ensemble diversity. |
Ensemble weights are determined by Brier score performance on held-out validation data. Models producing superior probability calibration receive proportionally greater influence in the aggregated prediction.
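One way to realize "better calibration gets more weight" is inverse-Brier weighting. The source states only that weights are derived from Brier scores on held-out data; the inverse-score scheme below is an assumption:

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def weights_from_brier(scores: dict[str, float]) -> dict[str, float]:
    """Normalize inverse Brier scores into ensemble weights
    (assumes all scores are nonzero)."""
    inv = {name: 1.0 / s for name, s in scores.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}
```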
Probability Calibration
Raw model outputs frequently exhibit systematic overconfidence. A prediction of 85% probability that historically resolves at 70% represents a calibration failure that undermines decision-theoretic utility.
Isotonic Regression
A non-parametric calibration method that learns the empirical relationship between predicted and observed probabilities, applying monotonic corrections to future outputs.
Confidence Ceiling (75%)
NFL outcomes exhibit substantial inherent variance. Even optimal feature configurations rarely justify extreme confidence levels, necessitating a hard probability ceiling to maintain calibration integrity.
Post-calibration, probability estimates demonstrate strong reliability: when the model outputs 70%, the empirical win rate converges to approximately 70% across sufficient sample sizes.
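The two-step calibration can be sketched with scikit-learn's `IsotonicRegression` followed by the hard ceiling. The synthetic overconfident model below and the symmetric 25% floor (implied by capping the favorite at 75%) are illustrative assumptions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic held-out set from a deliberately overconfident model: outcomes
# resolve at ~85% of the raw stated probability.
rng = np.random.default_rng(1)
raw = rng.uniform(0.2, 0.95, size=500)
outcomes = (rng.uniform(size=500) < raw * 0.85).astype(int)

# Learn the monotonic mapping from raw to empirical probability.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw, outcomes)

def calibrated_probability(p_raw: float, ceiling: float = 0.75) -> float:
    """Calibrate, then clamp to [1 - ceiling, ceiling]; the symmetric
    floor is an assumption implied by the 75% ceiling."""
    p = float(calibrator.predict([p_raw])[0])
    return min(max(p, 1.0 - ceiling), ceiling)
```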
Temporal Weighting Strategy
NFL team composition and performance characteristics evolve substantially across seasons through roster turnover, coaching changes, and schematic adaptation. Historical data from distant seasons exhibits diminishing relevance to current predictive tasks. The training pipeline implements aggressive temporal weighting:
| Season | Sample Weight |
|---|---|
| Current (2025) | 5.0x |
| Previous (2024) | 2.0x |
| Two seasons prior (2023) | 1.0x |
| Three seasons prior (2022) | 0.6x |
| Four+ seasons prior | 0.4x → 0.05x |
Current season observations alone constitute approximately 50% of the effective training weight. Historical data provides sample size stability while recent performance patterns dominate model optimization.
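The schedule in the table maps directly to per-sample training weights. How exactly the weight interpolates from 0.4x down to the 0.05x floor is not stated; the geometric decay below is an assumption:

```python
# Weights from the temporal-weighting table above.
SEASON_WEIGHTS = {2025: 5.0, 2024: 2.0, 2023: 1.0, 2022: 0.6}

def sample_weight(season: int, current: int = 2025) -> float:
    """Training weight for one game observation by season."""
    if season in SEASON_WEIGHTS:
        return SEASON_WEIGHTS[season]
    # Four or more seasons back: decay from 0.4x toward the 0.05x floor
    # (geometric halving is an assumption; the source gives only endpoints).
    age = current - season
    return max(0.4 * (0.5 ** (age - 4)), 0.05)
```

These values would typically be passed as `sample_weight` to each model's `fit` call.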
Data Integrity Protocols
Initial model iterations exhibited artificially elevated accuracy metrics exceeding 75%, subsequently traced to subtle data leakage — future information contaminating feature construction. A comprehensive pipeline audit identified and remediated these issues:
Temporal Isolation
Rolling feature calculations utilize exclusively historical observations. Week 5 predictions access only Weeks 1-4 data.
Walk-Forward Validation
Model evaluation occurs exclusively on temporally subsequent data that was unavailable during training.
Pre-Game Feature Constraint
Feature engineering strictly excludes post-game statistics from pre-game prediction contexts.
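The three protocols combine into a walk-forward loop: every evaluated week is predicted by a model trained only on strictly earlier weeks. A minimal sketch, where `fit` and `evaluate` are placeholder callables supplied by the caller:

```python
def walk_forward(games_by_week: dict[int, list], fit, evaluate,
                 first_test_week: int = 5):
    """Walk-forward evaluation: for each test week, train only on games
    from strictly earlier weeks, so no future data leaks into training."""
    results = []
    for week in sorted(games_by_week):
        if week < first_test_week:
            continue  # reserve the opening weeks as the initial history
        history = [g for w, gs in games_by_week.items() if w < week
                   for g in gs]
        model = fit(history)                 # trained on past weeks only
        results.append(evaluate(model, games_by_week[week]))
    return results
```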
Validated accuracy on genuinely unseen data provides meaningful performance benchmarks; inflated metrics from leaked information do not.
Model Performance
The system achieves approximately 65% accuracy on live 2025 season predictions. Contextual benchmarks for comparison:
Vegas Consensus
Approximately 52-55% against the spread after accounting for vigorish
Random Baseline
50% (uniform probability assignment)
This Implementation
~65% on straight-up winner prediction
Weekly variance remains substantial. Individual week accuracy ranges from approximately 55% to 85% depending on upset frequency. The objective is a demonstrable predictive edge across full-season sample sizes.
Key Technical Insights
Data integrity supersedes model sophistication
Architectural complexity yields minimal benefit when underlying data pipelines contain temporal contamination.
Feature quality dominates feature quantity
Dimensionality reduction from 189 to 40 features improved predictive performance through noise elimination.
Calibration determines practical utility
A model producing reliable probability estimates enables rational decision-making; miscalibrated confidence does not.
Irreducible uncertainty defines the ceiling
NFL outcomes contain inherent stochasticity. The objective is consistent edge generation, not deterministic prediction.
Disclaimer
Important Notice
This platform represents a personal research project developed for educational and analytical purposes. The predictions and analyses presented do not constitute financial or gambling advice. Users who choose to incorporate these predictions into wagering decisions do so at their own discretion and risk.