Machine Learning

A Foundation Model Can't Beat a Coin Flip on Bitcoin

June 2026

TL;DR TabICL is a pretrained tabular foundation model from Inria that predicts by reading a dataset in a single forward pass, with no tuning. Its v2 checkpoint beats heavily tuned XGBoost, CatBoost, and LightGBM on roughly 80% of the TabArena benchmark. Run on this site's exact walk-forward protocol, it averaged 49.1% directional accuracy on bitcoin across 12 runs (range 39.6% to 55.7%, mean AUC 0.516): a coin flip, statistically indistinguishable from the project's own XGBoost (48.9%) and LSTM (best config 54%). The lesson is the point. When a free, tuning-free, state-of-the-art learner also lands at chance, the better-model excuse runs out. The ceiling is a property of the target, not the modeller. As a bonus, in the rounds that held any signal, TabICL independently surfaced the same valuation features the dashboard already weights.

In about ten seconds, a transformer that had never seen bitcoin read 2,770 days of it and predicted the direction of the next week. It got 48% right. The model was TabICL, a pretrained tabular foundation model from the scikit-learn group at Inria, and on the public TabArena benchmark its latest checkpoint beats heavily tuned XGBoost, CatBoost, and LightGBM on roughly four datasets in five. On bitcoin it could not beat a coin.

That is no knock on the model. If anything, it is the most useful thing a model could have told this project. Every earlier result here, the XGBoost runs averaging 48.9%, the LSTM topping out near 54%, pointed at the same conclusion: bitcoin's direction over the next 7 to 30 days, from this feature set, is close to unpredictable. The standing objection to that conclusion has always been the same. Maybe the model was the problem. Maybe a bigger architecture, or a less clumsy hand on the hyperparameters, would have found the signal.

TabICL is the cleanest way to retire that objection. It is state of the art, it is free, and it has no hyperparameters to fumble. If it lands where the home-grown models landed, the wall is not the modeller. It is the data.

What TabICL Is, and Why It Is the Right Tool Here

Most machine learning fits a model to a dataset by gradient descent (nudging internal settings step by step). A foundation model does something stranger. TabICL was pretrained on millions of synthetic tables, and at inference it performs the whole task in one forward pass: hand it the training rows, their labels, and the test rows together, and it returns predictions without ever updating a weight. It learns in context, the way a large language model answers a question from the examples in its prompt.
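To make "predicting from context without updating a weight" concrete, here is a deliberately tiny stand-in. It is not TabICL and says nothing about its internals: it is a nearest-neighbour vote, chosen only because it shares the same shape of call, where context rows, context labels, and test rows go in together and predictions come out. The function name and the toy data are hypothetical.

```python
from collections import Counter

def predict_in_context(context_X, context_y, test_X, k=3):
    """Label each test row by majority vote of its k nearest context rows.

    Nothing is fitted or stored between calls: every prediction is made
    by reading the context afresh. (Stand-in only; TabICL uses a
    pretrained transformer, not nearest neighbours.)
    """
    predictions = []
    for row in test_X:
        # Squared Euclidean distance from this test row to every context row.
        distances = [
            (sum((a - b) ** 2 for a, b in zip(row, ctx_row)), label)
            for ctx_row, label in zip(context_X, context_y)
        ]
        distances.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in distances[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Hypothetical toy data: two well-separated clusters.
context_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
context_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.05, 0.05], [1.0, 1.0]]
preds = predict_in_context(context_X, context_y, test_X)
```

The point of the sketch is the interface, not the method: there is no training loop to configure, so there is nothing to configure badly.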

The practical consequence is that there is almost nothing to get wrong. No learning rate, no tree depth, no early-stopping patience, no regularisation to tune badly. The model arrives pretrained and opinionated, and a result it produces cannot be blamed on a botched configuration. For a project whose central finding is a null result, that is exactly the property worth having. A null from a tuned model invites the retort that the tuning was the weakness. A null from a tuning-free, benchmark-leading model does not.

The Same Harness, No Thumb on the Scale

The test reused the existing walk-forward harness without changing the rules. The walk-forward rule is simple to state: the model may only read days that came before the days it is asked to predict, then the window slides forward and the test repeats. Everything else was held identical: the feature set (the 29 indicators of the v3 run, spanning valuation, momentum, on-chain, macro, and equities), the four expanding train-test splits across three market regimes, and the three horizons of 7, 14, and 30 days. TabICL's in-context set was every observation strictly before each test window, and it predicted the held-out window. No future leaked backward.
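The walk-forward rule can be sketched in a few lines. This is an illustration under stated assumptions, not the project's harness: it assumes rows sorted oldest to newest and carves equal-sized, back-to-back test windows from the end of the series, whereas the real splits are chosen to cover particular market regimes. The function name and the toy sizes are hypothetical.

```python
def expanding_walk_forward(n_rows, n_splits, test_size):
    """Yield (context_indices, test_indices) for expanding walk-forward splits.

    Assumes rows are sorted oldest to newest. The last
    n_splits * test_size rows are carved into consecutive test windows,
    and each context is every row strictly before its test window.
    """
    first_test_start = n_rows - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        test_end = test_start + test_size
        yield list(range(0, test_start)), list(range(test_start, test_end))

# Toy sizes for illustration only, not the real split lengths.
splits = list(expanding_walk_forward(n_rows=20, n_splits=4, test_size=3))
for context, test in splits:
    # The rule itself: nothing in the context is as late as the test window.
    assert max(context) < min(test)
```

Each later split keeps every earlier row in its context, which is what "expanding" means here: the model sees more history in each round, never less.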

Two differences from the gradient-boosted run are worth stating plainly, because hiding them would defeat the purpose. TabICL needs no validation split for early stopping, so the training and validation data were merged into one in-context set. And its features were median-imputed from the context, where XGBoost used native handling of missing values. In plain terms: where the data had gaps, TabICL filled them with the middle value of what it had already seen, while XGBoost has its own built-in way of working around holes. Neither change favours either model. The code is ml/src/train_tabicl.py and the full per-round output is in ml/results/walk_forward_tabicl.json, on the same protocol as train_v3.py.
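A minimal sketch of median imputation from the context, assuming gaps are represented as None and that test rows are filled with the context's medians rather than their own (so that nothing about the test window feeds back into preprocessing). The function name, the None convention, and the fallback value are assumptions; train_tabicl.py may do this differently.

```python
from statistics import median

def impute_from_context(context_rows, test_rows):
    """Fill gaps (None) in both sets using medians computed on the context only."""
    n_features = len(context_rows[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in context_rows if row[j] is not None]
        # Assumed fallback: 0.0 if a feature is missing everywhere in the context.
        medians.append(median(observed) if observed else 0.0)

    def fill(rows):
        return [
            [medians[j] if value is None else value for j, value in enumerate(row)]
            for row in rows
        ]

    return fill(context_rows), fill(test_rows)

# Hypothetical toy rows with gaps.
context = [[1.0, None], [3.0, 10.0], [None, 30.0], [5.0, 20.0]]
test = [[None, None], [2.0, 40.0]]
filled_context, filled_test = impute_from_context(context, test)
```

Computing the medians on the context alone is the design choice that matters: a median taken over context and test together would let the test window influence its own inputs.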

The Result: a Coin Flip, Again

Across the twelve runs, TabICL averaged 49.1% directional accuracy, ranging from 39.6% in the worst round to 55.7% in the best, with a mean area-under-curve of 0.516 (AUC is a 0-to-1 score of how well a model sorts up days from down days; 0.5 is what blind guessing scores). The per-round gaps against XGBoost were small and ran in both directions: a couple of points better here, a couple worse there, with no systematic edge for the foundation model. Three different families of model, asked the same question, returned the same answer.
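For readers who want the two scores pinned down, here is one way to compute them from scratch. It is a sketch using the pairwise definition of AUC (the chance that a randomly chosen up period is scored above a randomly chosen down period, with ties counted as half), not the project's evaluation code, and the labels and scores are made up for illustration.

```python
def directional_accuracy(y_true, y_pred):
    """Share of periods where the predicted direction matched the real one."""
    hits = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return hits / len(y_true)

def roc_auc(y_true, scores):
    """Chance that a random up period outscores a random down period.

    Ties count as half a win, so a model that scores everything the
    same lands on exactly 0.5.
    """
    ups = [s for truth, s in zip(y_true, scores) if truth == 1]
    downs = [s for truth, s in zip(y_true, scores) if truth == 0]
    if not ups or not downs:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for up_score in ups:
        for down_score in downs:
            if up_score > down_score:
                wins += 1.0
            elif up_score == down_score:
                wins += 0.5
    return wins / (len(ups) * len(downs))

# Made-up labels (1 = up, 0 = down) and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc = directional_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The two numbers answer different questions. Accuracy depends on where the up/down threshold is drawn, while AUC only asks whether the ordering of the scores carries information, which is why an AUC stuck near 0.5 is the harder verdict to argue with.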

Model	Mean directional accuracy	Tuning	How it learns
XGBoost (v3)	48.9%	Heavy (regularised, early-stopped)	Gradient-boosted trees, fitted per round
LSTM (best of a sweep)	~54%	Swept across configs	Recurrent network, fitted per round
TabICL v2	49.1%	None	Pretrained transformer, single forward pass

The honest reading of that table is not that TabICL tied, but that all three sit on the 50% line, where a model that has learned nothing useful also sits. The LSTM's 54% is the most flattering figure in the set, and it is the best of a configuration sweep, which is to say the high-water mark of a search, not a durable edge. The mean AUC barely above 0.5 across every approach is the tell: the models can barely sort up periods from down periods better than a random shuffle would.

Why a Better Model Did Not Help

The reason is not a defect in any of these models. It is a property of the thing being predicted. Bitcoin's price over these horizons behaves close to a martingale, a process whose best forecast of the next value is simply the current one: most of what will happen next is not written in the data available today, because anything reliably written there would already have been traded away. A learner can only extract structure that exists. When three architectures spanning trees, recurrence, and pretrained attention all converge on chance, the parsimonious explanation is that there is little structure to extract, not that all three happened to fail in the same way.
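A toy simulation shows what the martingale argument predicts. This uses synthetic data only, not bitcoin: the moves are drawn independently with zero mean, which is the idealised case, and the "momentum" rule is a hypothetical example of a predictor that reads the past. Under those assumptions no rule of this kind should drift far from 50%.

```python
import random

def momentum_hit_rate(n_days=20000, seed=0):
    """Hit rate of 'today repeats yesterday' on a simulated random walk."""
    rng = random.Random(seed)
    # Independent, zero-mean moves: the idealised case, not real market data.
    moves = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
    hits = 0
    for yesterday, today in zip(moves, moves[1:]):
        # Momentum rule: predict that today's direction matches yesterday's.
        if (yesterday > 0) == (today > 0):
            hits += 1
    return hits / (n_days - 1)

rate = momentum_hit_rate()
```

Swapping the momentum rule for anything more elaborate changes nothing in this setting, because the past moves carry no information about the next one. That is the sense in which the ceiling belongs to the target rather than to the model.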

This is also a caution about benchmark scores. TabICL earns its reputation on TabArena and TALENT, suites of mostly independent, identically distributed tabular datasets where the signal is real and the task is to find it. Those suites are worlds where each row is a fresh, unrelated example and the rules stay fixed. Financial returns are the opposite: they are neither independent nor stationary, so each day leans on the last and the rules themselves drift, and a pattern that held in 2019 need not hold in 2024. They also come from a market that is close to efficient. A model can be genuinely best-in-class at the first kind of problem and still hit a wall on the second, because the wall is informational, not architectural. Leaderboard dominance does not transfer to a near-efficient market.

What the Model Did Agree On

The accuracy was the headline, but the more interesting result came from the feature rankings. For each round, the test measured permutation importance: shuffle one feature, see how much accuracy falls. This is a model-agnostic cross-check on the order the indicators fall in, computed for a model that knows nothing about how the dashboard is built.
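Permutation importance is simple enough to write out in full. The sketch below is a generic version, not the project's code: it takes any prediction function, shuffles one column at a time, and reports the average drop in accuracy. The toy model, the data, and the number of repeats are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j scrambled.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy model that looks only at feature 0 and ignores feature 1.
def toy_predict(row):
    return 1 if row[0] > 0 else 0

data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [toy_predict(row) for row in X]
imp = permutation_importance(toy_predict, X, y)
```

In this toy case the ignored feature scores exactly zero and the used one scores well above it. When the model's baseline accuracy is itself near chance, as here on bitcoin, the drops have little room to be large, which is one reason to read the resulting rankings cautiously.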

In the rounds that held any signal at all, the 2019-2020 and 2022-2023 windows, TabICL's most important features were the valuation set the composite already leans on: the 200-week moving average distance, Power Law Position, and the Mayer Multiple led its rankings, the same cluster the XGBoost SHAP analysis (the method that splits a model's prediction into per-indicator credit) put on top and the same cluster that carries the heaviest dashboard weight. Equity features surfaced almost only in the 2022-2023 macro window, exactly the pattern the tech-beta analysis reported. A different model, a different importance method, the same hierarchy.

The agreement comes with a caveat that is itself a finding. The importance magnitudes were tiny: shuffling a top feature moved accuracy by one to seven percentage points, usually two or three. The rankings were also unstable across rounds: the early window and the current one surfaced noise rather than valuation. That instability is not a bug to be explained away. It is the independent reproduction, from a foreign architecture, of the warning this project already carries: with a near-zero signal, feature rankings are weak evidence, and the defensible claim is the cluster-level one, that valuation leads when anything does, not the per-feature one.

The Trap That Was Avoided

One detail decides whether a result like this means anything. A foundation model that learns in context is unusually easy to leak. Because the training rows and test rows pass through together, the obvious mistake is to hand the model the whole history at once and score it with cross-validation, which lets the context contain days that come after the days being predicted. On a time series, that quietly imports the future and manufactures an accuracy that evaporates the moment the model meets a date it has not already seen.

The harness avoids it by construction: the in-context set for each test window is strictly earlier than the window, the same walk-forward rule the rest of the pipeline runs on. A model is only ever asked to read the past and guess the future. The unglamorous discipline is the whole reason the 49% can be trusted as a real number rather than an artifact.
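The difference between the two ways of splitting can be counted directly. The sketch below uses a hypothetical 100-day series in which the row index stands in for the date, and compares a shuffled hold-out (the kind ordinary cross-validation produces) with a walk-forward split. The helper name and the sizes are assumptions for illustration.

```python
import random

def rows_from_the_future(context_idx, test_idx):
    """Count context rows dated at or after the earliest test row."""
    earliest_test = min(test_idx)
    return sum(1 for i in context_idx if i >= earliest_test)

n_days = 100
days = list(range(n_days))  # the index doubles as the date order

# Shuffled hold-out, as in ordinary cross-validation: test days are
# scattered through the history, so the context surrounds them.
shuffled = days[:]
random.Random(0).shuffle(shuffled)
cv_test, cv_context = shuffled[:20], shuffled[20:]

# Walk-forward: the context is everything before the test window.
wf_context, wf_test = days[:80], days[80:]

leaked_cv = rows_from_the_future(cv_context, cv_test)
leaked_wf = rows_from_the_future(wf_context, wf_test)
```

The walk-forward count is zero by construction, while the shuffled split hands the model context rows that postdate days it is being asked to predict. A check like this is cheap enough to run as an assertion before every scoring round.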

What This Does and Does Not Mean

It does not mean bitcoin is unpredictable for all time, at all horizons, with all data. It means that direction over 7 to 30 days, from this set of indicators, sits at chance across three model families, and that a state-of-the-art foundation model with no tuning does not change that. A different feature set, a longer horizon, or order-book microstructure might hold more. This test does not reach those claims, and it does not pretend to.

What it does settle is the question that prompted it. The dashboard was never a price oracle, and the backtests already showed its value is not point prediction, which is at a coin flip, but the disciplined reading of where bitcoin sits against its own valuation history. A foundation model arriving at the same accuracy ceiling is the strongest available confirmation that the ceiling is real and shared, not a local failure of one modeller's craft. The model was never the problem. The market was the point.