We ran 40+ battle-tested strategies through autoresearch. The textbook version was never the whole story.

A famous setup is not a finished product. It is a hypothesis with a story attached.

When you maintain a large library of named strategies—dozens of categories drawn from public methods, classic indicators, and practitioner playbooks—the honest starting point is humility. The first implementation is almost never the last. The edge often lives in the second-order details: which filter actually helps, which exit silently hurts, and whether the “canonical” parameters match the universe you measure against.

This is where autoresearch comes in: not as magic, but as a disciplined loop.

The same loop Karpathy open-sourced—applied to markets, not training runs

Andrej Karpathy’s autoresearch repository is a clean mental model. You give an agent a bounded workspace (for example train.py and instructions in program.md), a time budget, and a score. The system proposes changes, runs, reads the metric, and iterates. The domain there is small-model training; the shape of the loop is universal: hypothesis → run → measure → revise.
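That loop shape fits in a few lines of code. The sketch below is illustrative only: `score_fn`, `propose_fn`, and the parameter dict are hypothetical placeholders standing in for a real experiment harness, not the repo's actual interface.

```python
import random

def autoresearch_loop(score_fn, propose_fn, params, budget=20):
    """Minimal hypothesis -> run -> measure -> revise loop."""
    best_params, best_score = dict(params), score_fn(params)
    for _ in range(budget):
        candidate = propose_fn(best_params)     # hypothesis: mutate the best known setup
        candidate_score = score_fn(candidate)   # run + measure: one scored experiment
        if candidate_score > best_score:        # revise: keep only what improves the metric
            best_params, best_score = candidate, candidate_score
    return best_params, best_score

# Toy usage: hill-climb a one-parameter "strategy" toward its best score.
random.seed(0)
score = lambda p: -(p["ema_span"] - 21) ** 2   # pretend span 21 is the sweet spot
propose = lambda p: {"ema_span": p["ema_span"] + random.choice([-2, -1, 1, 2])}
best, best_score = autoresearch_loop(score, propose, {"ema_span": 50}, budget=200)
```

The accept-only-improvements rule is the whole discipline in miniature: the score can never get worse than the starting point, and every accepted change is a documented, measured step.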

We use that same shape on strategy code and backtests: encode the idea, benchmark it on a fixed ticker set, tune parameters and logic where the evidence supports it, and merge only what survives human review and sanity checks. Our trading stack is not a fork of that repo and does not share its metrics—but the inspiration is direct: autonomous, comparable experiments beat one-off intuition when you are trying to learn what actually moves the needle.

What “battle-tested” really means in our stack

Our platform implements 43 strategy categories end to end—from screening through to the signal workflow. Calling them “battle-tested” means they are real implementations with real tradeoffs, not slide-deck sketches. It does not mean every default parameter was optimal on day one. That is exactly why we keep a running changelog of research-driven changes.

Patterns we kept seeing

The following themes come from our internal research log. They are illustrative, not a promise that any one tweak generalizes to your account or timeframe.

Textbook parameters are often too slow for a given benchmark. A slow golden cross is easy to explain in a blog post; a faster pair of moving averages can capture more of the trend dynamics in a fixed backtest window. We have moved variants toward faster EMA bundles where the evidence supported it, and paired those changes with exit logic that matches the new rhythm—rather than leaving a fast entry tied to a sluggish exit.
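What "faster pair" means mechanically can be shown in a short sketch. The spans are illustrative choices, not our production values: a classic golden cross is usually described with 50/200 averages, while a faster bundle might use something like 9/21 EMAs.

```python
def ema(series, span):
    # Exponential moving average with the standard smoothing 2 / (span + 1).
    alpha = 2.0 / (span + 1)
    out, prev = [], series[0]
    for x in series:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

def long_regime(prices, fast_span, slow_span):
    # 1 while the fast EMA sits above the slow EMA, else 0.
    fast, slow = ema(prices, fast_span), ema(prices, slow_span)
    return [1 if f > s else 0 for f, s in zip(fast, slow)]

# A downtrend followed by a reversal: a faster pair re-enters the long
# regime sooner after the trough than a slow golden-cross pair does.
prices = [120.0 - i for i in range(30)] + [90.0 + i for i in range(1, 61)]
fast_pair = long_regime(prices, 9, 21)
slow_pair = long_regime(prices, 50, 200)
```

This is also why the exit has to move with the entry: a pair that flips regimes this quickly will whipsaw if it is still tied to an exit tuned for the slow pair's rhythm.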

Sometimes the filter is the bug. Classic trend-strength filters sound sensible until you watch them veto valid signals week after week. We have seen cases where replacing an ADX gate with a volume-based filter, or removing an exit condition that trapped positions, did more for outcomes than another tweak to entry timing. The story is not “ADX is bad”; it is “this filter interacted badly with our data and our holding assumptions.”
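One design choice makes this kind of experiment cheap: treat gates as pluggable predicates, so swapping an ADX veto for a volume veto is a one-line change. The thresholds and field names below are hypothetical, chosen only to show the shape.

```python
def adx_gate(bar, threshold=25.0):
    # Classic trend-strength veto: only trade when ADX says the trend is strong.
    return bar["adx"] > threshold

def volume_gate(bar, mult=1.5):
    # Alternative veto: only trade when volume confirms against its recent average.
    return bar["volume"] > mult * bar["avg_volume"]

def entry_signal(bar, raw_signal, gate):
    # The gate is passed in, so comparing filters changes one argument, not the strategy.
    return raw_signal and gate(bar)

# A bar with weak ADX but strong volume: the ADX gate vetoes a valid signal,
# the volume gate admits it.
bar = {"adx": 18.0, "volume": 2_000_000, "avg_volume": 1_000_000}
vetoed = entry_signal(bar, True, adx_gate)
admitted = entry_signal(bar, True, volume_gate)
```

Keeping the gate out of the strategy body is what lets a research loop test "same entry, different filter" as a clean, comparable experiment.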

Entry strictness trades off against risk. Relaxing an Ichimoku condition from “above the cloud” to “above Kijun” can admit earlier entries. That is not automatically better—it changes the distribution of trades. When we make that kind of shift, stop and take-profit structure has to be revisited in the same breath. Entry and risk are one system.
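The strict-versus-relaxed distinction is easy to state in code. The values are illustrative, and `cloud_top` here is shorthand for the upper boundary of the Kumo at the evaluation bar.

```python
def ichimoku_entry(price, cloud_top, kijun, relaxed=False):
    # Strict textbook condition: price must clear the cloud entirely.
    # Relaxed condition: price above the Kijun-sen admits earlier entries,
    # which changes the trade distribution and forces a revisit of stops/targets.
    return price > kijun if relaxed else price > cloud_top

# A price between the Kijun and the cloud top: relaxed admits, strict vetoes.
price, cloud_top, kijun = 102.0, 105.0, 100.0
strict = ichimoku_entry(price, cloud_top, kijun)
relaxed = ichimoku_entry(price, cloud_top, kijun, relaxed=True)
```

Every trade the relaxed rule admits is one the strict rule would have skipped, which is exactly why the risk structure cannot be inherited unchanged.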

Autoresearch loses sometimes—and that is valuable data. A strong score on a research branch does not always survive when merged into the fuller strategy implementation that powers live screens and merged variants. We have reverted changes when the uplift did not transfer. The changelog is supposed to record that honestly. If you only publish wins, you teach the wrong lesson.

Benchmarks can lie if stops are unrealistic. Iterative search can propose extreme risk parameters that look incredible in aggregate until you audit them. Our manual audit passes have frozen or tightened stops where floating or overly wide ATR multiples produced nonsense economics on volatile names. Research proposes; engineering and risk review dispose.
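The audit idea in miniature: clamp any proposed ATR multiple into an explicit band before it touches position risk. The band values below are hypothetical, standing in for whatever risk review has pre-agreed.

```python
def audited_stop(entry_price, atr, proposed_mult, min_mult=0.5, max_mult=3.0):
    # Iterative search may propose extreme multiples that look great in aggregate;
    # clamping into a pre-agreed band keeps stop economics sane on volatile names.
    clamped = min(max(proposed_mult, min_mult), max_mult)
    return entry_price - clamped * atr

# A search-proposed 12x ATR stop gets clamped to the 3x ceiling; a 1.5x stop passes through.
wild = audited_stop(100.0, atr=4.0, proposed_mult=12.0)   # clamped to 3.0 -> stop at 88.0
sane = audited_stop(100.0, atr=4.0, proposed_mult=1.5)    # inside the band -> stop at 94.0
```

The clamp is deliberately dumb: it does not try to be clever about volatility, it just guarantees that no research branch can smuggle an unreviewable risk parameter into production.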

Why the original authors are not “wrong”

Writers of classic methods optimize for clarity, generality, and pedagogy. They are not tuning against your universe, your slippage model, or your merge rules for multi-variant screeners.

When you measure a strategy inside a fixed process—same benchmark machinery, same merge semantics—you discover improvements that are orthogonal to the original narrative: a better trailing exit, a filter swap, a stop width that matches how often your universe gaps. Those are not betrayals of the source idea. They are the normal output of treating the idea as a live system instead of a museum piece.

What we are not claiming

This is not financial advice. Benchmark scores are research scaffolding: they rank ideas under explicit assumptions. Past measurements do not guarantee future results. The goal of autoresearch here is process quality: fewer hidden failure modes, fewer parameters that only work in a slide deck, and a library that gets sharper as the loop turns.

If you want to explore how those strategies surface in the product—screening, variants, and synthesis—the Strategies hub is the natural next step.