Remember when you were in school and accidentally saw the answer key before a test? That's essentially what's happening with many AI models being used for financial trading today.

A new research paper from Mostapha Benhenda reveals that popular large language models (LLMs) are achieving spectacular trading returns by essentially “cheating”: recalling future stock prices from their training data rather than making genuine predictions.

The 45,000-star GitHub project that exposed a critical flaw

The research centers around an open-source AI trading system that's captured the imagination of developers worldwide.

With over 45,000 stars on GitHub, the AI Hedge Fund project lets LLMs act as portfolio managers, making real trading decisions.

But when Benhenda tested these AI traders across different time periods, the results were shocking.

Standard models like Meta's Llama 3.1 and DeepSeek achieved returns exceeding 44% when trading stocks from 2021.

Impressive, but here's the catch: these models had likely seen news articles, market analyses, and retrospective reports about 2021's tech boom during their training. They weren't predicting anything. They were reciting memorized history.

When the same models were tested on data from mid-2024 (after their training cutoff), their performance collapsed.

DeepSeek's returns dropped by nearly 22 percentage points. That's not a minor adjustment. That's the difference between a successful hedge fund and bankruptcy.


Why bigger isn't always better in financial AI

Here's where things get really interesting. The research uncovered what Benhenda calls the "Scaling Paradox."

You'd expect larger, more sophisticated models to perform better, right? Wrong. At least not when they're contaminated with future knowledge.

The 70-billion parameter version of Llama 3.1 actually performed worse than its smaller 8-billion parameter sibling when faced with genuinely unknown market conditions.

Why? Larger models have better memory. They've memorized more specific details about historical stock movements. When NVIDIA surged 190% in 2023, these models didn't just learn general patterns about tech stocks. They memorized that specific fact.

Think about it this way: if you're trying to predict tomorrow's weather, would you rather have someone who understands meteorology or someone who memorized last year's weather reports?

The memorizer might look brilliant if you test them on last year's dates, but they're useless for actual forecasting.

The solution: Point-in-time models that actually work

Not all hope is lost. The research also tested a family of specialized "Point-in-Time" (PiT) models from a company called PiT-Inference.

These models are designed with strict knowledge cutoffs; they literally cannot access information beyond a certain date.

The results? While standard models showed massive performance decay between time periods, PiT models maintained consistent returns.

Even more encouraging, larger PiT models actually performed better than smaller ones, suggesting that when you remove the contamination of future data, scale does improve financial reasoning.

The largest PiT model achieved over 7% excess returns (alpha) over the buy-and-hold baseline in both test periods. That's the kind of consistent, reliable performance that actual fund managers dream about.
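To make the "excess return" figure concrete, here is a minimal sketch (not code from the paper) of how alpha against a buy-and-hold baseline is typically computed. The price series below are made-up example numbers, purely for illustration.

```python
# Illustrative sketch: excess return ("alpha" in the article's simple
# sense) of a strategy relative to a buy-and-hold benchmark.

def total_return(prices):
    """Total return over the period, e.g. 0.10 means +10%."""
    return prices[-1] / prices[0] - 1

# Hypothetical daily closing values of the AI-managed portfolio
portfolio = [100.0, 101.2, 103.5, 102.8, 107.9]
# Hypothetical buy-and-hold benchmark over the same days
benchmark = [100.0, 100.5, 101.1, 100.9, 102.0]

alpha = total_return(portfolio) - total_return(benchmark)
print(f"Excess return vs buy-and-hold: {alpha:.1%}")
```

A consistent model, in the paper's sense, is one where this number stays roughly stable whether the test window falls before or after the model's training cutoff.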

What this means for finance professionals

If you're a CFO or FP&A professional considering AI tools for financial analysis, this research should be a wake-up call.

Many vendors claiming revolutionary AI-powered trading systems might be selling you sophisticated memorization machines rather than genuine predictive tools.

The paper introduces "Look-Ahead-Bench," a standardized way to test whether an AI model is actually making predictions or just recalling training data. It's like a lie detector test for financial AI.

The benchmark uses two carefully selected time periods and measures how much a model's performance degrades when it moves from familiar to unfamiliar territory.
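The core comparison can be sketched in a few lines. This is not the actual Look-Ahead-Bench code, just a hedged illustration of the idea; the return figures are hypothetical, echoing the magnitudes mentioned in the article.

```python
# Illustrative sketch of an alpha-decay check: how much does a model's
# return drop when it moves from a pre-cutoff (familiar) period to a
# post-cutoff (unfamiliar) one? A large drop suggests memorization
# rather than genuine predictive skill.

def alpha_decay(in_sample_return, out_of_sample_return):
    """Percentage-point drop from in-sample to out-of-sample performance."""
    return in_sample_return - out_of_sample_return

# Hypothetical figures: 44% return trading 2021 stocks (in training data),
# 22% trading mid-2024 stocks (after the training cutoff)
decay = alpha_decay(0.44, 0.22)
print(f"Alpha decay: {decay:.0%} points")
```

A model making decisions from general patterns should show a decay near zero; a model reciting memorized history will not.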

Here's what you should ask any AI vendor: "Has your model been tested for look-ahead bias? What's its alpha decay between in-sample and out-of-sample periods?" If they can't answer these questions, you might be buying an expensive crystal ball that only works in hindsight.

The path forward for AI in finance

This research doesn't mean AI is useless for financial applications. Quite the opposite. It shows that when properly designed and tested, AI models can provide consistent value in portfolio management.

The key is ensuring they're making decisions based on patterns and reasoning, not memorized answers.

For finance teams looking to leverage AI, the message is clear: be skeptical of spectacular backtested results, demand rigorous temporal testing, and consider specialized models designed for financial applications rather than general-purpose chatbots.

The financial markets are perhaps the ultimate test of predictive capability. There's no partial credit for almost getting it right, and there's certainly no value in perfectly predicting the past.

As this research demonstrates, the difference between genuine intelligence and sophisticated memorization can be worth millions.

The next time someone shows you an AI model with incredible historical trading returns, remember to ask the critical question: is it predicting, or is it remembering? In finance, only one of those abilities will make you money.
