SHAP Analysis: Which Indicators Actually Matter

April 2026

TL;DR The pipeline trained XGBoost and LSTM models on a unified dataset of 4,323 daily observations (June 2014 to April 2026, three complete Bitcoin cycles) with 25 features. Walk-forward validation produced marginal directional accuracy: XGBoost averaged 48.9% on out-of-sample tests, and the best LSTM configuration reached 54%. Consistent with published research, the prediction model failed. The SHAP rankings were more informative: Power Law Position ranked #1 in 6 of 12 validation rounds, twice as often as any other feature, and appeared in the top five in 8. Two audit findings temper that: most models collapsed to a handful of trees, making individual rankings weak evidence, and the rankings surfaced results that do not fit the dashboard's story (oil ranked top-5 in 7 rounds; a re-run showed equities are not zero after all). All of it is reported below rather than omitted. The pipeline did not produce a usable prediction model. It produced a noisy feature ranking that informs, but does not mechanically set, the dashboard weights.

The Original Goal

The original goal was straightforward: build a machine learning model that predicts Bitcoin's 7-day price direction with enough accuracy to be useful. The pipeline assembled 25 features spanning on-chain metrics, macroeconomic data, equity indices, sentiment indicators, and technical signals. The dataset combined 4,323 daily rows from June 2014 to April 2026. Both XGBoost (a model that builds chains of yes/no questions about the data, each chain patching the errors of the last) and LSTM (a neural network built to read sequences in order, with a memory of what came before) models were trained using walk-forward validation.

The prediction model failed. Both architectures achieved marginal directional accuracy on out-of-sample data (data the model never trained on): XGBoost averaged 49.7% on 7-day direction (a coin flip) and 48.9% across all twelve runs (four test windows at three horizons), with a range of 44% to 56%. The best LSTM configuration from an 8-configuration hyperparameter sweep averaged 54%. This is consistent with the broader literature. A 2025 systematic review in the Decision Analytics Journal found that most Bitcoin forecasting models "perform only marginally better than random guesses." The Bank of Spain's 2023 study using the same LSTM + SHAP methodology achieved 5-21% RMSE on price regression, but with error rates that spiked during unprecedented market moves.
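As a minimal sketch of what "directional accuracy" measures, assuming the label is simply whether the price is higher a fixed number of days later (the article does not spell out the exact label definition, and both function names are illustrative):

```python
from typing import List, Optional


def direction_labels(prices: List[float], horizon: int = 7) -> List[Optional[int]]:
    """Return 1 if the price is higher `horizon` days later, else 0.

    The final `horizon` days have no known future price, so they get None.
    """
    labels: List[Optional[int]] = []
    for t in range(len(prices)):
        if t + horizon < len(prices):
            labels.append(1 if prices[t + horizon] > prices[t] else 0)
        else:
            labels.append(None)
    return labels


def directional_accuracy(predicted: List[int], actual: List[Optional[int]]) -> float:
    """Share of labelled days where the predicted direction matched reality."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if a is not None]
    if not pairs:
        raise ValueError("no labelled days to score")
    hits = sum(1 for p, a in pairs if p == a)
    return hits / len(pairs)
```

On a balanced up/down series, a model that guesses at random scores about 0.5 on this measure, which is the yardstick the 48.9% and 54% figures are read against.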

The result was accepted. Bitcoin's 7-day price direction is not reliably predictable from the feature set tested. But the SHAP analysis told a different, more useful story.

Walk-Forward Validation

Most backtests in crypto are misleading. They train a model on all available data, then test on a random subset of that same data. The model has already seen patterns from the test period during training. This is not validation; it is memorization with plausible deniability.

Walk-forward validation eliminates this. The model is trained only on data from before the test period. It never sees the future. The pipeline used an expanding-window approach across four rounds, each capturing a different Bitcoin market regime:

Round 1: Train from June 2014. Test on 2017-2018 (first mainstream cycle and its crash).

Round 2: Expand training through 2018. Test on 2019-2020 (crypto winter recovery).

Round 3: Expand through 2021. Test on 2022-2023 (post-COVID tightening, FTX).

Round 4: Expand through 2024. Test on 2025-2026 (ETF era, current regime).

Each round was trained at three prediction horizons (7, 14, and 30 days), giving 12 model runs in total. The dataset begins in June 2014, when all data sources become reliable, so the validation spans three complete market cycles, not four. An earlier version of this article described the rounds as starting in 2011; that was wrong, and the round definitions above match the saved results file.
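The round and horizon grid described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the exact start day, Round 1's training cutoff (end of 2016, inferred from its test window), and the optional `purge` step are not confirmed by the saved results. The purge option drops the last `horizon` days of training, a common guard when labels look that many days ahead.

```python
from datetime import date, timedelta
from itertools import product
from typing import Dict, List

DATA_START = date(2014, 6, 1)  # "June 2014"; the exact first day is assumed
HORIZONS = (7, 14, 30)         # prediction horizons, in days

# (train_end, test_start, test_end) for each round. Round 1's cutoff is
# inferred; the last window is nominal, since the data stops in April 2026.
ROUNDS = [
    (date(2016, 12, 31), date(2017, 1, 1), date(2018, 12, 31)),
    (date(2018, 12, 31), date(2019, 1, 1), date(2020, 12, 31)),
    (date(2021, 12, 31), date(2022, 1, 1), date(2023, 12, 31)),
    (date(2024, 12, 31), date(2025, 1, 1), date(2026, 12, 31)),
]


def walk_forward_runs(purge: bool = False) -> List[Dict]:
    """Enumerate every (round, horizon) run with an expanding training window."""
    runs: List[Dict] = []
    for (round_no, (train_end, test_start, test_end)), horizon in product(
        enumerate(ROUNDS, start=1), HORIZONS
    ):
        # Optional (assumed) guard: keep training labels out of the test window.
        cutoff = train_end - timedelta(days=horizon) if purge else train_end
        if not DATA_START <= cutoff < test_start:
            raise ValueError(f"round {round_no}: training must end before testing")
        runs.append(
            {
                "round": round_no,
                "horizon": horizon,
                "train_start": DATA_START,  # expanding window: always the same start
                "train_end": cutoff,
                "test_start": test_start,
                "test_end": test_end,
            }
        )
    return runs
```

Calling `walk_forward_runs()` yields the twelve runs (four rounds at three horizons), each with a training window that ends before its test window begins.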

Each round produces its own SHAP decomposition. A feature that ranks highly in one round might be irrelevant in another. This is not a bug. It is the central finding.

SHAP: From Black Box to Feature Ranking

SHAP (SHapley Additive exPlanations) is a game-theoretic framework that attributes each prediction to individual feature contributions. Think of it as splitting a restaurant bill by who ordered what: each prediction's total gets divided among the inputs that produced it, and features that ordered nothing pay nothing. For every data point the model evaluates, SHAP computes how much each feature pushed the prediction toward or away from the positive class. Aggregate these contributions across all predictions in a test set and you get a feature importance ranking.
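The bill-splitting idea can be made concrete with an exact Shapley calculation on a toy model. This is a sketch for intuition only: it enumerates every feature subset against a single all-zero baseline, whereas SHAP libraries use faster tree-specific algorithms and a background dataset. The toy model and its inputs are invented for the example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(present: set) -> float:
        # Features outside `present` are held at their baseline value.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(mixed)

    attributions: List[float] = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        attributions.append(phi)
    return attributions


def toy_model(v: Sequence[float]) -> float:
    # Feature 0 acts alone and through an interaction with feature 1;
    # feature 2 is ignored by the model entirely.
    return 2 * v[0] + v[0] * v[1]


phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
```

Here the attributions come out at roughly 3, 1 and 0. They sum to the model's output of 4 (the whole bill is paid), the interaction term is shared between features 0 and 1, and the ignored feature pays nothing.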

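Unlike simpler approaches such as permutation importance or Gini (split-based) importance, SHAP provides consistent and theoretically grounded attributions: the per-feature contributions for a prediction always sum to the gap between that prediction and the baseline. It accounts for feature interactions, and because the total is fixed, correlated inputs share credit rather than each being counted in full. How that credit is divided between two near-duplicate features is not something SHAP can settle.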

SHAP values were computed for every XGBoost round (12 total: 4 time periods x 3 prediction horizons) and the rankings were validated against the LSTM hyperparameter sweep.

The Results

All counts below come from the saved results file and can be regenerated with ml/src/shap_summary.py. An earlier version of this article published higher counts for several features; the corrected figures are these.

Power Law Position ranked first in 6 of the 12 rounds, twice as often as the runner-up (200-Week MA, with 3), and appeared in the top five in 8 of 12. No other feature finished first more than three times. The power law measures where Bitcoin's price sits within its long-term logarithmic growth corridor; its consistency across test windows from 2017 to 2026 is what earned it the largest dashboard weight.

BTC Transaction Fees appeared in the top five in 7 of 12 rounds, never first. Only Power Law (8) appeared more often, and only oil (also 7, discussed below) matched it. This on-chain metric, denominated in native BTC, captures real demand for block space independently of price. Fees are what users pay to get a transaction into a block: a direct meter of demand for the network, like tolls counted at a bridge. The fee ratio compares daily fees to their 365-day moving average, similar to how the Puell Multiple treats miner issuance; it asks whether today's toll take is high or low against the past year's normal.
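A minimal sketch of the fee ratio, assuming a trailing 365-day simple average that includes the current day (the article does not say whether the pipeline includes or excludes today, and the function name is illustrative):

```python
from typing import List, Optional


def fee_ratio(daily_fees_btc: List[float], window: int = 365) -> List[Optional[float]]:
    """Daily fees divided by their trailing `window`-day average.

    Days before a full window is available get None instead of a ratio.
    """
    ratios: List[Optional[float]] = []
    running_sum = 0.0
    for t, fee in enumerate(daily_fees_btc):
        running_sum += fee
        if t >= window:
            # Drop the day that just fell out of the trailing window.
            running_sum -= daily_fees_btc[t - window]
        if t + 1 < window:
            ratios.append(None)
            continue
        average = running_sum / window
        ratios.append(fee / average if average > 0 else None)
    return ratios
```

A ratio above 1 means today's fees are running above the past year's norm; a ratio below 1 means block space is cheaper than usual.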

Update, July 2026: the calendar sweep showed this ranking is era-contaminated the same way oil's was (a 25.9pp pooled edge that collapses to 3.2pp within-year). Fees' evidence is reclassified from SHAP-backed to rationale-backed; the weight is unchanged. See The Dashboard Takes the Calendar Test.

200-Week MA Distance ranked first in 3 rounds, all in the 2022-2023 test window (post-COVID tightening, FTX), and appeared in the top five only in those same 3 rounds. When it mattered, it mattered most: the metric measures proximity to Bitcoin's most reliable historical support, and 2022 was the regime that tested it.

DMA Cross (top five in 5 of 12), 30-Day Volatility (5), Pi Cycle Gap (4), and DXY (4) form the middle tier: regime-dependent contributors that surface in some windows and vanish in others.

Mayer Multiple appeared mostly through its 30-day-lagged variant (4 of 12 lagged, 1 raw). M2 Supply appeared in 3 rounds, all in the earliest test window. MVRV Ratio appeared in just 1 of 12, consistent with its character: informative at extremes (below 1.0 or above 3.5), quiet in the broad middle where Bitcoin spends most of its time.

NUPL and the Puell Multiple never appeared in any top five. Both still carry dashboard weight (5.6% and 4.7%). That tension is real and is addressed in the weights section below: the dashboard weights are informed by these rankings, not mechanically derived from them, and the choice to keep NUPL and Puell rests on their economic rationale and their behavior at cycle extremes rather than on SHAP rank.
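The first-place and top-five counts in this section are simple tallies over the per-round rankings. A sketch of that tally follows, assuming each round is stored as an ordered list of its top features. The real schema of the saved results file is not shown here, and the toy rounds below are placeholders, not the pipeline's results.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally_rankings(top_by_round: Dict[str, List[str]]) -> Tuple[Counter, Counter]:
    """Count first-place finishes and top-five appearances per feature."""
    first_place: Counter = Counter()
    top_five: Counter = Counter()
    for ranked in top_by_round.values():
        if not ranked:
            continue  # a round with no saved ranking contributes nothing
        first_place[ranked[0]] += 1
        # set() guards against a feature being listed twice in one round.
        top_five.update(set(ranked[:5]))
    return first_place, top_five


# Placeholder data: hypothetical round ids and feature names.
TOY_ROUNDS = {
    "r1_h7": ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"],
    "r1_h14": ["feature_a", "feature_c", "feature_b", "feature_f", "feature_g"],
    "r2_h7": ["feature_b", "feature_f", "feature_a", "feature_g", "feature_h"],
}

firsts, appearances = tally_rankings(TOY_ROUNDS)
```

A feature absent from both counters is one that, like NUPL and Puell in the real runs, never reached a top five.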

How Much Evidence Is a One-Tree Model?

A June 2026 audit of the saved models surfaced a caveat that belongs next to every ranking above. The training used aggressive early stopping to prevent overfitting (the model quits adding complexity the moment it stops helping on unseen data), and on this dataset early stopping frequently won: seven of the twelve models stopped at one or two trees, and nine at four or fewer. Each tree is one such chain of yes/no questions; a healthy model stacks dozens, each correcting the last. A one-tree model asked one round of questions and gave up. Only the 2022-2023 test window consistently produced models of substance (22 to 36 trees). A model with one tree has used a handful of features once; its SHAP "ranking" describes which features that stunted tree happened to split on, not a validated importance ordering.
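The way early stopping produces a one-tree model can be sketched with a small simulation. The patience value and the loss curves are hypothetical (the pipeline's settings are not given here), and the sketch assumes the model is truncated at its best round, which is one common configuration rather than a confirmed detail.

```python
from typing import List


def trees_kept(validation_loss: List[float], patience: int) -> int:
    """Number of trees retained under early stopping.

    Boosting halts once `patience` rounds pass without a new best validation
    loss; the model is then cut back to the best round seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(validation_loss):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            break  # nothing has improved for `patience` rounds
    return best_round + 1


# Hypothetical curves: one where the first tree is already the best,
# one where later trees keep helping for a while.
flat_curve = [0.693, 0.694, 0.695, 0.696, 0.697]
improving_curve = [0.690, 0.680, 0.670, 0.671, 0.672, 0.660]
```

With a patience of 3, the flat curve keeps a single tree: every added tree made validation loss slightly worse, so the model stopped at the first. That is the situation the audit found in most rounds.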

Applied to the headline numbers, the caveat cuts unevenly. Five of Power Law's six first-place finishes came from models with four or fewer trees. The 200-Week MA's three firsts all came from the healthiest models in the set. Read strictly, the most defensible ranking statement this pipeline supports is at the cluster level, not the feature level: structural valuation leads in both the stunted models (via Power Law) and the healthy ones (via the 200-Week MA, which correlates with Power Law at 0.93), and BTC Fees recurs across both kinds. Which of the two valuation twins sits first depends on which models you trust.

The deeper message of the tree counts is the same as the accuracy numbers: walk-forward XGBoost mostly declined to find a tradeable signal in this data. The rankings are a description of what weak models reached for, corroborated where possible by economic rationale and backtesting. They are not a precision instrument, and the dashboard's weights, which lean on them, inherit that uncertainty.

What Ranked High That We Don't Trust

A feature-importance analysis that only reports the convenient rows is marketing. Three results from the saved runs do not fit the dashboard's story, and they belong in the open.

Oil ranked first in 2 of 12 rounds and appeared in the top five in 7, a better showing than most indicators that made the dashboard. WTI crude has no plausible direct mechanism for predicting Bitcoin's weekly direction. The working hypothesis is that oil proxies the inflation and liquidity regime, information that overlaps with what M2 and DXY are supposed to carry. But no test in the pipeline distinguishes that hypothesis from overfitting to a coincidence, and until one does, oil stays off the dashboard as a judgment call, not a data-driven exclusion. It is the standing anomaly of this analysis.

Update, July 2026: the test has now been run, and the anomaly is resolved. Oil's rank is a regime marker, not a signal: the pooled 30-day edge on cheap-oil days (71.4% up vs ~51%) collapses to zero once the calendar year is controlled for. The exclusion stands, now as a tested decision. See The Oil Anomaly Is a Calendar.
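To show how a pooled edge can vanish once the calendar year is controlled for, here is a sketch on synthetic data. It is not the test that was run: the equal-weight average across years is an assumption, and the toy days are constructed so that the flagged days simply cluster in the stronger year.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (calendar year, flag such as "cheap oil", price up after the horizon)
Day = Tuple[int, bool, bool]


def up_rate(days: List[Day]) -> float:
    if not days:
        raise ValueError("cannot compute an up-rate on zero days")
    return sum(1 for _, _, went_up in days if went_up) / len(days)


def split_by_flag(days: List[Day]) -> Tuple[List[Day], List[Day]]:
    return [d for d in days if d[1]], [d for d in days if not d[1]]


def pooled_edge(days: List[Day]) -> float:
    """Up-rate on flagged days minus up-rate on the rest, ignoring the year."""
    flagged, unflagged = split_by_flag(days)
    return up_rate(flagged) - up_rate(unflagged)


def within_year_edge(days: List[Day]) -> float:
    """Same edge computed inside each year, then averaged across years."""
    by_year: Dict[int, List[Day]] = defaultdict(list)
    for day in days:
        by_year[day[0]].append(day)
    edges = []
    for year_days in by_year.values():
        flagged, unflagged = split_by_flag(year_days)
        if flagged and unflagged:  # a year needs both groups to be comparable
            edges.append(up_rate(flagged) - up_rate(unflagged))
    if not edges:
        raise ValueError("no year contains both flagged and unflagged days")
    return sum(edges) / len(edges)


def make_days(year: int, f_up: int, f_down: int, o_up: int, o_down: int) -> List[Day]:
    return (
        [(year, True, True)] * f_up
        + [(year, True, False)] * f_down
        + [(year, False, True)] * o_up
        + [(year, False, False)] * o_down
    )


# Synthetic: year 1 rises 80% of the time, year 2 only 40%, regardless of
# the flag; flagged days are simply more common in year 1.
TOY_DAYS = make_days(1, 8, 2, 4, 1) + make_days(2, 2, 3, 4, 6)
```

On this toy set the pooled edge is about 13 percentage points while the within-year edge is zero: the flag tells you which year you are in, not which way the next move goes.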

Google Trends was the single strongest feature in one round of the first pipeline run, the 2017 test window, where its 30-day-lagged variant dominated every other input. Everywhere else it contributed almost nothing. An earlier version of this article said Google Trends "contributed nothing," which the saved results contradict. The honest statement is that it was unstable: dominant in one retail-driven cycle, absent in all others. It was dropped for that instability and for its short reliable history, not for zero importance.

Gold and active addresses surfaced in 1 and 3 rounds respectively, without consistency across windows. Same treatment: noted, not trusted, not included.

The Equity Indices Question

This site once claimed the S&P 500 and Nasdaq showed "zero importance." The June 2026 audit forced a re-test: a third pipeline run (v3) added four equity features to the current 25-feature set and persisted the full SHAP table for every round. The result kills the "zero" claim. Equity features landed in the top five in 4 of 12 rounds (the Nasdaq reached rank 4 in the healthiest model of the set, the 7-day 2022-2023 round) and in the top ten in 21 of 48 feature-round observations. They behave like the dashboard's other macro features: small, regime-dependent contributions, mid-pack ranks, never consistent leadership.

What survives is the narrower and more useful claim: equities never led the rankings in any round, their strongest showings cluster in the one window where Bitcoin traded as a macro asset (2022-2023), and they offer no dimension the dashboard's existing macro features (DXY, M2) do not already cover. That, not "zero importance," is the supported reason they stay off the composite. The tech-beta article carries the full revision.

The Fear and Greed Index was excluded from the dashboard for a different reason: insufficient history. With data only from February 2018, it covers fewer than two complete cycles, making walk-forward validation unreliable.

Feature Importance Is Not Stationary

The most important finding is not which feature ranked first, but that the rankings shift across market regimes.

The Bank of Spain study quantified this directly. During Bitcoin's early adoption phase (2015-2017), technology variables like hash rate and transaction count drove 46% of feature importance. By the institutional cycle (2020-2021), that share had dropped to 21%, while public attention variables rose from 10% to 34%.

The walk-forward results show a similar pattern. Power Law Position was the only feature that ranked first in more than one test window (it won the 2017-2018 and 2025-2026 windows at every horizon). The 2019-2020 window went to macro features, including oil, and the 2022-2023 window went to the 200-Week MA. Every feature, Power Law included, had periods of relevance and periods of irrelevance. This non-stationarity is why static backtests are misleading: a model optimized for one regime will fail in the next.

It is also why the pipeline uses walk-forward validation with expanding windows rather than a single train/test split. Each round captures a different regime, and the aggregate SHAP rankings reflect importance across regimes, not within a single favorable period.

From SHAP to Dashboard Weights

A disclosure first. The dashboard weights (Power Law 17%, 200-Week MA 12.1%, and so on) are informed by these rankings, not computed from them. The pipeline persists only the top five SHAP features per round, which is not enough to derive a full 13-way weight vector, and the published weights also reflect judgment: NUPL and Puell keep small weights despite never reaching a top five, because their grades separate cycle tops from bottoms cleanly in the backtest record. Describing the weights as "SHAP-derived" overstated the mechanics; "SHAP-informed and then fixed by hand" is accurate. The reproducible ranking counts live in ml/results/shap_summary.json.

The initial approach cluster-corrected the weights to prevent correlated valuation indicators from dominating the composite (reducing valuation from 44% to 30%). Backtesting reversed that decision: the SHAP-informed weights, with Power Law dominant at 17%, produce better DCA outcomes. The higher valuation weight means the composite naturally drops further at cycle peaks, triggering sell signals that the cluster-corrected version missed. The ranking was second-guessed for theoretical reasons (correlation), and the backtest sided with the ranking. See the backtest analysis for the full comparison.
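One way a cluster correction of this kind could work is a proportional cap: scale the correlated cluster down to a target share and scale everything else up to fill the gap. The sketch below assumes that scheme; the article does not describe the actual method, and the four-indicator weight vector is a made-up stand-in for the real 13-indicator set.

```python
from typing import Dict, Set


def cap_cluster(weights: Dict[str, float], cluster: Set[str], cap: float) -> Dict[str, float]:
    """Rescale so the cluster's share of total weight is at most `cap`.

    Cluster members shrink proportionally; all other weights grow
    proportionally so the total stays unchanged.
    """
    if not 0 < cap < 1:
        raise ValueError("cap must be a share strictly between 0 and 1")
    total = sum(weights.values())
    cluster_total = sum(w for name, w in weights.items() if name in cluster)
    rest_total = total - cluster_total
    if cluster_total <= cap * total or rest_total <= 0:
        return dict(weights)  # already within the cap, or nothing to rebalance to
    scale_in = cap * total / cluster_total
    scale_out = (1 - cap) * total / rest_total
    return {
        name: w * (scale_in if name in cluster else scale_out)
        for name, w in weights.items()
    }


# Hypothetical weights: the two valuation indicators hold 44% between them.
TOY_WEIGHTS = {"valuation_a": 0.24, "valuation_b": 0.20, "macro": 0.30, "onchain": 0.26}
corrected = cap_cluster(TOY_WEIGHTS, {"valuation_a", "valuation_b"}, cap=0.30)
```

In this toy case the valuation pair drops from 44% to 30% of the composite while keeping its internal proportions. This is the type of adjustment the backtest later argued against.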

The weight mapping was deliberately kept static rather than automated. Indicator weights that update automatically based on the latest SHAP run would chase whichever features happened to be important in the most recent regime, potentially just as that regime was ending. Static weights informed by cross-regime analysis are more durable.

Lessons from the Pipeline

The prediction model failed. The feature importance analysis succeeded. This is a common outcome in applied machine learning, particularly in financial markets where returns are close to efficient and directional prediction is a near-random process. The failure is not an artifact of the architectures chosen here: a pretrained tabular foundation model, run later on this exact protocol, landed on the same coin flip.

The value of the pipeline is not in forecasting tomorrow's price. It is in answering a simpler, more useful question: when you look at all available data about Bitcoin's position in its cycle, which dimensions actually carry information?

The answer, drawn from three complete market cycles and consistent with independent academic research, is that structural valuation (power law, 200-week MA) and on-chain demand (fees) carry the most consistent signal, cycle-timing indicators matter moderately and intermittently, and neither equity correlations nor search interest showed stable predictive value in the runs that included them.

The dashboard weights reflect this hierarchy. They are not perfect. They will likely need revision as new cycles produce new data. But they are grounded in something better than intuition: a systematic, reproducible analysis of what actually works.

Backtesting Confirmed the Rankings

Once the ML pipeline was built, actual DCA outcomes were backtested using these weights. The results validated SHAP's central finding in a surprising way: Power Law Position alone, used as a single-indicator DCA strategy, produces the best return per dollar invested at most entry points (9 of 12). It outperforms the 13-indicator composite on capital efficiency because its grade distribution is more balanced and it sells more confidently during overvaluation.

The composite still builds the largest total portfolio because it buys more aggressively, but the feature SHAP ranked first most often also ranks first on capital efficiency: the pipeline pointed at the most effective single signal for accumulation timing.

See the full backtest analysis for the three-way comparison between Standard DCA, Power Law DCA, and Signal DCA.