The Original Goal
The original goal was straightforward: build a machine learning model that predicts Bitcoin's 7-day price direction with enough accuracy to be useful. The pipeline assembled 25 features spanning on-chain metrics, macroeconomic data, equity indices, sentiment indicators, and technical signals. The dataset combined 4,323 daily rows from June 2014 to April 2026. Both XGBoost (gradient-boosted trees) and LSTM (recurrent neural network) models were trained using walk-forward validation.
The prediction model failed. Both architectures achieved marginal directional accuracy on out-of-sample data: XGBoost averaged 49.7% on 7-day direction (essentially a coin flip), and the best LSTM configuration reached 54% on 30-day direction after an 8-configuration hyperparameter sweep. This is consistent with the broader literature. A 2025 systematic review in the Decision Analytics Journal found that most Bitcoin forecasting models "perform only marginally better than random guesses." The Bank of Spain's 2023 study using the same LSTM + SHAP methodology achieved 5-21% RMSE on price regression, but with error rates that spiked during unprecedented market moves.
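For reference, "directional accuracy" here can be read as the share of days on which the predicted up/down call matches the realized move over the horizon. The following is a minimal sketch under that reading; the function names and the toy price series are illustrative assumptions, not the pipeline's actual code or data.

```python
def direction_labels(prices, horizon=7):
    """1 if the price `horizon` days ahead is higher than today, else 0.

    The last `horizon` days have no future price yet, so they get no label.
    """
    return [int(prices[t + horizon] > prices[t]) for t in range(len(prices) - horizon)]


def directional_accuracy(predicted, actual):
    """Fraction of predictions whose direction matches the realized direction."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy daily closes (made-up numbers, not from the dataset).
closes = [100, 102, 101, 105, 107, 104, 103, 108, 99, 110, 111, 100]
labels = direction_labels(closes, horizon=7)

# A naive "always up" baseline is a useful yardstick for any classifier.
always_up = [1] * len(labels)
baseline = directional_accuracy(always_up, labels)
```

Comparing a model against a naive baseline like `always_up` on the same test window is one way to judge whether a figure such as 54% reflects skill or simply the share of up-periods in that window.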
The negative result stands: with the feature set tested, Bitcoin's 7-day price direction is not reliably predictable. But the SHAP analysis told a different, more useful story.
Walk-Forward Validation
Most backtests in crypto are misleading. They train a model on all available data, then test on a random subset of that same data. The model has already seen patterns from the test period during training. This is not validation; it is memorization with plausible deniability.
Walk-forward validation eliminates this. The model is trained only on data from before the test period. It never sees the future. The pipeline used an expanding-window approach across four rounds, each capturing a different Bitcoin market regime:
Round 1: Train on 2011-2015 (early adoption, Mt. Gox era). Test on 2017 (first mainstream cycle).
Round 2: Train on 2011-2017. Test on 2019 (crypto winter recovery).
Round 3: Train on 2011-2019. Test on 2021 (institutional cycle, COVID stimulus).
Round 4: Train on 2011-2021. Test on 2024-2026 (ETF era, current regime).
Each round produces its own SHAP decomposition. A feature that ranks highly in one round might be irrelevant in another. This is not a bug. It is the central finding.
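The four rounds above can be sketched as an expanding-window splitter. This is a minimal illustration, assuming year-level boundaries exactly as listed; the `ROUNDS` table, the function name, and the toy calendar are hypothetical and not the pipeline's actual code.

```python
from datetime import date

# (train_end_year, test_start_year, test_end_year), taken from the four rounds above.
ROUNDS = [
    (2015, 2017, 2017),
    (2017, 2019, 2019),
    (2019, 2021, 2021),
    (2021, 2024, 2026),
]


def walk_forward_splits(dates, rounds=ROUNDS):
    """Yield (train_idx, test_idx) index lists for each expanding-window round.

    Training always starts at the first available row and ends at the close
    of `train_end_year`; the test window lies strictly after it.
    """
    for train_end, test_start, test_end in rounds:
        train_idx = [i for i, d in enumerate(dates) if d.year <= train_end]
        test_idx = [i for i, d in enumerate(dates) if test_start <= d.year <= test_end]
        # Guard against leakage: every training date must precede every test date.
        if train_idx and test_idx:
            assert max(dates[i] for i in train_idx) < min(dates[i] for i in test_idx)
        yield train_idx, test_idx


# Hypothetical calendar: one row per year, just to exercise the splitter.
dates = [date(y, 1, 1) for y in range(2011, 2027)]
splits = list(walk_forward_splits(dates))
train_sizes = [len(tr) for tr, _ in splits]
test_sizes = [len(te) for _, te in splits]
```

The training set only ever grows, and the leakage check makes the "never sees the future" property explicit rather than implied.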
SHAP: From Black Box to Feature Ranking
SHAP (SHapley Additive exPlanations) is a game-theoretic framework that attributes each prediction to individual feature contributions. For every data point the model evaluates, SHAP computes how much each feature pushed the prediction toward or away from the positive class. Aggregate these contributions across all predictions in a test set and you get a feature importance ranking.
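Unlike simpler approaches such as permutation importance or impurity-based (Gini) importance, SHAP provides consistent, theoretically grounded attributions. It accounts for feature interactions, and the attributions for any single prediction sum to the model output relative to a baseline, so credit shared among correlated inputs is divided between them rather than counted twice.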
SHAP values were computed for every XGBoost round (12 total: 4 time periods x 3 prediction horizons) and the rankings were validated against the LSTM hyperparameter sweep.
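To make the attribution idea concrete, here is a brute-force Shapley computation on a toy model, followed by the mean-absolute aggregation that turns per-prediction attributions into a ranking. It is a sketch under simplifying assumptions: features outside a coalition are replaced by a single fixed baseline value, and the toy model and names are hypothetical. It is not the method used for the XGBoost rounds, where exhaustive enumeration would be far too slow.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features outside a coalition are set to their baseline value. This
    enumerates every coalition, so it is only practical for a few features.
    """
    m = len(x)
    phi = [0.0] * m
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j] for j in range(m)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def mean_abs_importance(model, rows, baseline):
    """Global ranking score: mean |Shapley value| per feature across all rows."""
    totals = [0.0] * len(baseline)
    for row in rows:
        for i, value in enumerate(shapley_values(model, row, baseline)):
            totals[i] += abs(value)
    return [t / len(rows) for t in totals]


# Hypothetical toy scorer with an interaction term; feature 2 is ignored.
def toy_model(f):
    return 2.0 * f[0] + f[0] * f[1]


zero = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x=[1.0, 2.0, 5.0], baseline=zero)
importance = mean_abs_importance(toy_model, [[1.0, 2.0, 5.0], [-1.0, 2.0, 0.0]], zero)
```

In this toy case the attributions add up to the difference between the prediction and the baseline output, and the unused feature receives exactly zero, which is the behavior the ranking relies on.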
The Results
Across all 12 XGBoost rounds, a small set of features appeared in the top 5 far more often than the rest:
Power Law Position appeared in the top 5 features in 11 of 12 rounds. No other feature came close to this consistency. The power law measures where Bitcoin's price sits within its long-term logarithmic growth corridor. It captures structural valuation in a way that transcends individual market cycles.
200-Week MA Distance ranked in the top 5 in 7 of 12 rounds. This metric measures proximity to Bitcoin's most reliable historical support level. It tends to be most informative during bear markets and early recoveries, when the distance to this support narrows.
BTC Transaction Fees ranked in the top 5 in 8 of 12 rounds. This on-chain metric, denominated in native BTC, captures real demand for block space independently of price. It ranked #3 overall and carries 12% weight on the dashboard. The fee ratio compares daily fees to their 365-day moving average, similar to how the Puell Multiple treats miner issuance.
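The fee ratio described above can be sketched as a trailing-window calculation. The function name is hypothetical, and whether the 365-day average includes the current day is an assumption (here it does); the fee values are made up.

```python
def fee_ratio(daily_fees_btc, window=365):
    """Daily fees divided by their trailing `window`-day mean.

    Returns None until a full window of history is available. Including the
    current day in the average is an assumption of this sketch.
    """
    ratios = []
    running_sum = 0.0
    for t, fee in enumerate(daily_fees_btc):
        running_sum += fee
        if t >= window:
            # Drop the day that just fell out of the trailing window.
            running_sum -= daily_fees_btc[t - window]
        if t + 1 < window:
            ratios.append(None)
            continue
        mean = running_sum / window
        ratios.append(fee / mean if mean > 0 else None)
    return ratios


# Made-up fee series (BTC per day); a short window keeps the example readable.
fees = [10.0, 10.0, 10.0, 20.0, 40.0]
ratios = fee_ratio(fees, window=3)
```

A ratio above 1 means fees are running above their own trailing average, which is the sense in which the metric tracks demand for block space independently of price.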
Mayer Multiple (4 of 12) and Pi Cycle Gap (6 of 12) provided complementary signals: mean reversion and cycle timing, respectively.
MVRV Ratio appeared in 3 of 12 rounds. Its lower frequency reflects the fact that MVRV tends to be informative at extremes (below 1.0 or above 3.5) but provides less signal in the broad middle range where Bitcoin spends most of its time.
The remaining indicators, NUPL, Puell Multiple, DMA Cross, and M2 Supply, each appeared in 2 of 12 rounds. They are not uninformative, but their predictive contribution is regime-dependent and less consistent.
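The "k of 12 rounds" counts above amount to a top-5 frequency tally over the per-round importance rankings. A minimal sketch of that tally follows; the feature names and importance values are invented for illustration and use three rounds with a top-2 cutoff to stay small.

```python
from collections import Counter


def top_k_frequency(importance_by_round, k=5):
    """Count how many rounds each feature lands in the top `k` by importance.

    `importance_by_round` is a list of {feature_name: mean |SHAP|} dicts,
    one per round.
    """
    counts = Counter()
    for importances in importance_by_round:
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:k])
    return counts


# Three made-up rounds with made-up importances (k=2 to keep it small).
rounds = [
    {"power_law": 0.30, "ma200w_dist": 0.20, "sp500": 0.00, "fees": 0.10},
    {"power_law": 0.25, "ma200w_dist": 0.05, "sp500": 0.00, "fees": 0.15},
    {"power_law": 0.20, "ma200w_dist": 0.22, "sp500": 0.01, "fees": 0.12},
]
freq = top_k_frequency(rounds, k=2)
```

Counting appearances rather than averaging raw importance rewards features that matter in many regimes over features that dominate a single one.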
What Had Zero Importance
Three features showed no meaningful SHAP contribution across any round, any horizon, or any model architecture:
S&P 500 and Nasdaq had zero aggregate importance. This undercuts the popular narrative that equities are a guide to where Bitcoin goes next. The two are correlated in short-term price movements, but that correlation is contemporaneous and provides no predictive signal for 7-day or 30-day direction. Equity indices move with Bitcoin; they do not lead it.
Google Trends (post-2018) contributed nothing. This is surprising given the Bank of Spain's finding that attention variables drove 23% of SHAP importance during the 2020-2021 cycle. The difference is methodological: the Bank of Spain study used price regression (predicting exact prices), while this analysis used directional classification (up or down). Search interest may correlate with magnitude of moves but not with direction.
Fear and Greed Index was excluded from the dashboard for a different reason: insufficient history. With data only from February 2018, it covers fewer than two complete cycles, making walk-forward validation unreliable.
Feature Importance Is Not Stationary
The most important finding is not which feature ranked first. It is that the rankings shift across market regimes.
The Bank of Spain study quantified this directly. During Bitcoin's early adoption phase (2015-2017), technology variables like hash rate and transaction count drove 46% of feature importance. By the institutional cycle (2020-2021), that share had dropped to 21%, while public attention variables rose from 10% to 34%.
The walk-forward results show a similar pattern. Power Law Position was the only feature that remained consistently dominant across all four rounds. Every other feature had periods of relevance and periods of irrelevance. This non-stationarity is why static backtests are misleading: a model optimized for one regime will fail in the next.
It is also why the pipeline uses walk-forward validation with expanding windows rather than a single train/test split. Each round captures a different regime, and the aggregate SHAP rankings reflect importance across regimes, not within a single favorable period.
From SHAP to Dashboard Weights
The initial approach cluster-corrected the SHAP weights to prevent correlated valuation indicators from dominating the composite (reducing the valuation cluster's combined weight from 44% to 30%). However, backtesting revealed that the original SHAP weights, with Power Law at 18% as the dominant signal, actually produce better dollar-cost averaging (DCA) outcomes. The higher valuation weight means the composite naturally drops further at cycle peaks, triggering sell signals that the cluster-corrected version missed.
This is a case where the ML got it right and manual intervention made it worse. SHAP said Power Law matters most. The original ranking was second-guessed for theoretical reasons (correlation). The backtest proved SHAP correct. The dashboard now uses SHAP-derived weights with DXY and Volatility added as small independent contributors. See the backtest analysis for the full comparison.
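One way to picture the cluster correction that was tried and then dropped is a proportional rescale: shrink the valuation cluster to a target share and let the remaining weights expand to fill the rest. This is a sketch of that idea only. The proportional rule is an assumption, and apart from the 18% Power Law weight and the 44% to 30% shift, the weights below are invented placeholders.

```python
def rescale_cluster(weights, cluster, target_share):
    """Shrink (or grow) one cluster's combined weight to `target_share`.

    Weights inside the cluster keep their relative proportions, as do the
    weights outside it; the result sums to 1.
    """
    total = sum(weights.values())
    inside = sum(w for name, w in weights.items() if name in cluster)
    outside = total - inside
    if inside <= 0 or outside <= 0:
        raise ValueError("cluster and its complement must both carry weight")
    scaled = {}
    for name, w in weights.items():
        if name in cluster:
            scaled[name] = w / inside * target_share
        else:
            scaled[name] = w / outside * (1.0 - target_share)
    return scaled


# Placeholder weights: only the 18% Power Law weight and the 44% -> 30%
# valuation share come from the text; the other buckets are invented.
raw = {"power_law": 0.18, "valuation_other": 0.26, "non_valuation": 0.56}
valuation = {"power_law", "valuation_other"}
corrected = rescale_cluster(raw, valuation, target_share=0.30)
```

Under this rule Power Law's weight falls from 18% to roughly 12%, which illustrates why a corrected composite would react less sharply at valuation extremes.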
The weight mapping was deliberately kept static rather than automated. Indicator weights that update automatically based on the latest SHAP run would chase whichever features happened to be important in the most recent regime, potentially just as that regime was ending. Static weights derived from cross-regime analysis are more robust.
Lessons from the Pipeline
The prediction model failed. The feature importance analysis succeeded. This is a common outcome in applied machine learning, particularly in financial markets, where prices are close to efficient and short-horizon direction is close to random.
The value of the pipeline is not in forecasting tomorrow's price. It is in answering a simpler, more useful question: when you look at all available data about Bitcoin's position in its cycle, which dimensions actually carry information?
The answer, validated across 4 market cycles and consistent with independent academic research, is that structural valuation (power law, 200-week MA) matters most, cycle timing indicators (Pi Cycle, Mayer Multiple) matter moderately, and equity correlations and search interest do not matter at all for directional prediction.
The dashboard weights reflect this hierarchy. They are not perfect. They will likely need revision as new cycles produce new data. But they are grounded in something better than intuition: a systematic, reproducible analysis of what actually works.
Backtesting Confirmed the Rankings
After building the ML pipeline, the actual DCA outcomes were backtested using these weights. The results validated SHAP's central finding in a surprising way: Power Law Position alone, used as a single-indicator DCA strategy, produces the best return per dollar invested. It outperforms the 13-indicator composite on capital efficiency because its grade distribution is more balanced and it sells more confidently during overvaluation.
The composite still produces the largest total portfolio because it buys more aggressively. But the single indicator that SHAP ranked #1 across all regimes also ranks #1 in the backtest. The ML didn't just indicate which feature matters most for prediction: it identified the most effective signal for accumulation timing.
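The distinction between "best return per dollar invested" and "largest total portfolio" can be shown with a small calculation. This is a buy-only sketch with made-up prices and schedules; it leaves out the sell signals and grading logic of the real strategies, and the function name is hypothetical.

```python
def dca_summary(prices, buy_amounts):
    """Summarize a buy-only DCA run: dollars in, final value, value per dollar.

    `buy_amounts[t]` is the dollar amount invested at `prices[t]`. Sells are
    left out to keep the sketch short.
    """
    coins = sum(amount / price for price, amount in zip(prices, buy_amounts))
    invested = sum(buy_amounts)
    final_value = coins * prices[-1]
    return {
        "invested": invested,
        "final_value": final_value,
        "value_per_dollar": final_value / invested if invested else 0.0,
    }


# Made-up prices and two made-up schedules, not the strategies from the backtest.
prices = [100.0, 50.0, 200.0, 100.0]
selective = dca_summary(prices, [0.0, 100.0, 0.0, 0.0])        # buys only the dip
aggressive = dca_summary(prices, [100.0, 200.0, 50.0, 100.0])  # buys every period
```

In this toy case the selective schedule earns more per dollar while the aggressive schedule ends with the larger portfolio, which is the same trade-off described for the single-indicator and composite strategies.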
See the full backtest analysis for the three-way comparison between Standard DCA, Power Law DCA, and Signal DCA.