V7 — a NIFTY 50 forward-return forecasting system
case study · updated 2026
The problem
Short-horizon forward-return prediction on liquid Indian equities is adversarial in the mundane ways that matter most. Signal-to-noise is low, regime shifts are frequent, standard validation splits leak future information into training almost by default, and the cost stack on Indian exchanges — STT, brokerage, slippage — eats unremarkable edges alive. An accuracy of 52% can be a profitable strategy or a losing one depending entirely on where the 2% sits in the distribution of trades.
This is a small problem wearing a big problem's clothes. The hard part is not the model. The hard part is being willing to tell the truth about what the model is actually doing.
The version history
Most of what's been learned on this project is encoded in the gap between what a version claimed and what it turned out to be doing.
V3.1 — the lie. dir_acc = 0.543. I was excited. An audit of the label-construction pipeline uncovered two distinct leakage paths (documented in the post-mortem linked in the writing section). Both involved information that was only nominally in the past being used to build "forward" labels. Removing them dropped the reported metric by four percentage points in a single commit.
A model that looks too good on a first run almost always is. The habit I picked up from this: before believing any new metric, run the same pipeline on randomly shuffled labels. If dir_acc stays above 0.5, the leakage is in the pipeline, not the model.
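That habit can be made mechanical. Below is a minimal sketch of the shuffled-label smoke test; `fit_predict` is a hypothetical stand-in for the real train/predict pipeline, and the threshold logic is the only point.

```python
import numpy as np

def shuffled_label_check(features, labels, fit_predict, n_trials=20, seed=0):
    """Leakage smoke test: retrain on randomly shuffled labels.

    `fit_predict(X_train, y_train, X_test)` is a hypothetical stand-in
    for the real pipeline. If directional accuracy on shuffled labels
    stays meaningfully above 0.5, the leakage is in the pipeline.
    """
    rng = np.random.default_rng(seed)
    split = int(0.8 * len(labels))      # simple temporal holdout
    accs = []
    for _ in range(n_trials):
        y_shuf = rng.permutation(labels)
        preds = fit_predict(features[:split], y_shuf[:split], features[split:])
        accs.append(np.mean(np.sign(preds) == np.sign(y_shuf[split:])))
    return float(np.mean(accs))
```

On shuffled labels the returned accuracy should hover near 0.5; anything persistently above that is the pipeline talking, not the model.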
V5 — the honest baseline. dir_acc = 0.496. Below random. All Sharpes negative. This is the number that should have existed in V3 if the pipeline had been right. V5 is not a good model. V5 is the point from which honest progress can be measured.
V6 — the bridge. No headline metric. V6 was a full pipeline rewrite: strict temporal ordering, one place for label construction, feature manifests versioned alongside code, SQLite schemas locked with migrations. Shipping numbers was explicitly not the goal — the goal was removing the possibility of the V3.1 class of bug recurring.
V7 — the recovery. dir_acc = 0.510, IC = 0.025 on TATASTEEL and RELIANCE. Still not profitable after costs. But each improvement from V5 onward is now honest, which is the part that matters.
System architecture
Six data sources: 1-min OHLCV bars from 2015 through 2026; 5-min options chain snapshots; VIX (level and term structure); index and stock futures; FII/DII cash-market flows; and GIFT NIFTY for overnight context. All of it lands in a single SQLite database.
SQLite is a single file I can inspect with sqlite3, diff across runs, and back up trivially. The full corpus fits on disk with room to spare. The day the symbol universe or the bar frequency changes meaningfully, this gets revisited — but picking Postgres today would be solving a scale problem I do not have.
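A sketch of what "a single file I can inspect" buys: the whole store is plain SQL. The table and column names here are hypothetical stand-ins for the real schema, and the example uses an in-memory database so it runs anywhere.

```python
import sqlite3

# Hypothetical schema — the real table and column names differ.
conn = sqlite3.connect(":memory:")   # on disk this is one file, e.g. market.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS bars_1m (
        symbol TEXT NOT NULL,
        ts     TEXT NOT NULL,        -- ISO-8601 bar timestamp
        open REAL, high REAL, low REAL, close REAL,
        volume INTEGER,
        PRIMARY KEY (symbol, ts)     -- dedupes re-ingested bars
    )
""")
conn.execute(
    "INSERT OR IGNORE INTO bars_1m VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("TATASTEEL", "2026-01-05T09:15:00", 131.0, 131.4, 130.8, 131.2, 48210),
)
conn.commit()
row = conn.execute(
    "SELECT close FROM bars_1m WHERE symbol = ?", ("TATASTEEL",)
).fetchone()
```

The composite primary key makes re-ingestion idempotent, which is what lets the same loader run nightly without dedup logic elsewhere.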
Feature engineering
Two configurations run side by side. Config A is the set currently live in V7. Config B is where new features get prototyped — they graduate into Config A only when they demonstrably move held-out IC.
Config A — live. Greeks (delta, gamma, theta, vega) on liquid strikes; IV surface shape beyond ATM (skew, term); dealer-positioning proxies derived from options open interest; options flow; FII/DII net divergence; VIX term structure; a compact set of OHLCV derivatives.
Config B — in progress. Fractional differentiation to preserve memory while inducing stationarity; Parkinson, Garman–Klass, and Rogers–Satchell volatility estimators; seasonality normalization across intraday, day-of-week, and calendar effects; approximate entropy (ApEn); Hurst exponent.
I keep a mental separation between features I've built and features that have earned their place in the ensemble. Every Config B candidate is held out of the live model until it produces a statistically defensible IC lift on the last rolling window. Most candidates do not make it.
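As an illustration of the Config B volatility candidates, here is the Parkinson estimator — a standard range-based formula, not V7's exact implementation, which may differ in windowing and annualization.

```python
import numpy as np

def parkinson_vol(high, low):
    """Parkinson range-based volatility estimator.

    sigma^2 = mean(ln(H/L)^2) / (4 ln 2). Uses each bar's high/low
    range, which is more statistically efficient than close-to-close
    variance when bars have no drift or gaps. Returns per-bar vol
    at the input bar frequency (no annualization applied).
    """
    hl = np.log(np.asarray(high, dtype=float) / np.asarray(low, dtype=float))
    return float(np.sqrt(np.mean(hl ** 2) / (4.0 * np.log(2.0))))
```

Garman–Klass and Rogers–Satchell follow the same shape with extra open/close terms; each would be gated on held-out IC lift like any other candidate.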
Modelling
Three gradient-boosted regressors — LightGBM, XGBoost, CatBoost — train independently per horizon. Their out-of-fold predictions become the feature set for a Ridge meta-learner, which produces the final forecast.
Why GBMs and not a Transformer, an LSTM, or a Mamba-class state-space model: the data is tabular, heterogeneous, and irregularly missing. GBMs handle that natively; deep models would need an imputation pipeline that is itself a source of bugs. More importantly, iteration speed matters. I want to try ten feature ideas in a week, not one, and the marginal return of the deep-model path on tabular financial data of this shape is not there in the literature and has not been there in my own quick experiments.
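The stacking scheme above can be sketched as follows. This is a minimal sketch with sklearn's `GradientBoostingRegressor` standing in for all three GBM libraries and synthetic data standing in for the feature matrix; the real system trains per horizon.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = X[:, 0] * 0.1 + rng.normal(scale=0.5, size=600)   # toy target

# Stand-ins for LightGBM / XGBoost / CatBoost.
base_models = [GradientBoostingRegressor(random_state=s) for s in range(3)]

# Out-of-fold predictions from each base model become the meta-features,
# using time-ordered splits so no fold trains on its own future.
tscv = TimeSeriesSplit(n_splits=5)
oof = np.full((len(y), len(base_models)), np.nan)
for train_idx, val_idx in tscv.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict(X[val_idx])

# The earliest block never appears as validation, so it has no OOF preds.
mask = ~np.isnan(oof).any(axis=1)
meta = Ridge(alpha=1.0).fit(oof[mask], y[mask])
final_forecast = meta.predict(oof[mask])
```

The Ridge meta-learner sees only the three OOF prediction columns, so its weights measure how much each base model contributes without refitting on leaked in-fold scores.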
Loss is Huber regression with δ tuned per horizon — Huber is less punitive on the fat tails that dominate financial returns.
| horizon | δ | intuition |
|---|---|---|
| 30m | 0.005 | wide enough to not treat minor tape noise as error |
| 1h | 0.008 | scaled with horizon-implied vol |
| 4h | 0.015 | broader shoulders, still squared in the centre |
| 1d | 0.025 | accommodates overnight gap distribution |
Values are approximate and retuned each rolling window. δ that is too small makes training indistinguishable from squared error; too large and the loss is effectively MAE, which is more robust but slower to pick up signal in calm regimes.
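The piecewise behaviour the table tunes is easy to see numerically. A minimal Huber implementation, checked at the 30m setting (the framework libraries provide their own; this is just the formula):

```python
import numpy as np

def huber(residual, delta):
    """Huber loss: quadratic near zero, linear in the tails.

    0.5 * r^2 for |r| <= delta, else delta * (|r| - 0.5 * delta),
    so fat-tailed return outliers are penalized less than under
    squared error while small residuals still get squared gradients.
    """
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# At the 30m setting (delta = 0.005): a 0.3% miss stays in the
# quadratic zone, a 3% miss is charged linearly.
small, large = huber(0.003, 0.005), huber(0.03, 0.005)
```

Under squared error the 3% miss would cost 9e-4; Huber charges it 1.375e-4, which is the "less punitive on fat tails" claim made concrete.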
Validation
Walk-forward CV is the only honest choice for time-series of this shape. Each fold uses an expanding training window, a purge gap to prevent label-window leakage across the train/validation boundary, and an out-of-sample test block.
Classical purged CV uses an embargo window between validation and test to guard against serial correlation leaking via overlapping labels. V7's triple-barrier labelling already short-circuits the path that embargo is defending against, and adding an embargo on top was empirically costing roughly 0.003 IC with no improvement in out-of-sample stability I could measure. This is re-examined every time the labelling scheme changes.
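The split geometry can be sketched in a few lines. Parameter names and default sizes here are illustrative, not V7's actual configuration:

```python
def walk_forward_splits(n, n_folds=4, test_size=100, purge=10, min_train=100):
    """Expanding-window walk-forward splits with a purge gap.

    Each fold trains on [0, train_end), skips `purge` bars so
    forward-looking labels at the end of training cannot overlap
    the test block, then tests on the next `test_size` bars.
    """
    splits = []
    for k in range(n_folds):
        test_start = min_train + purge + k * test_size
        test_end = test_start + test_size
        if test_end > n:                  # not enough data for this fold
            break
        train_end = test_start - purge    # the purge gap sits here
        splits.append((list(range(train_end)),
                       list(range(test_start, test_end))))
    return splits
```

Each successive fold's training window absorbs the previous fold's test block, which is the "expanding" part; the purge gap stays fixed at the boundary.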
Conformal calibration sits on top of the Ridge forecast: for each prediction, produce a coverage-calibrated interval. Those intervals feed position sizing, not the entry decision itself. Sizing by uncertainty is where most of V7's remaining edge comes from.
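A minimal split-conformal sketch of that layer, assuming symmetric intervals and a simple inverse-width sizing rule (`size_from_width` and its `k` are hypothetical, not V7's sizing function):

```python
import numpy as np

def conformal_interval(cal_residuals, point_forecast, alpha=0.1):
    """Split-conformal interval around a point forecast.

    `cal_residuals` are |y - yhat| on a held-out calibration set.
    The finite-sample-corrected (1 - alpha) quantile of those
    residuals gives a symmetric interval with ~(1 - alpha) coverage.
    """
    n = len(cal_residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(cal_residuals), q_level)
    return point_forecast - q, point_forecast + q

def size_from_width(lo, hi, k=0.01):
    """Hypothetical sizing rule: position shrinks as the interval widens."""
    return k / max(hi - lo, 1e-9)
```

The entry decision never touches the interval; only the size does, so a confident forecast in a calm regime gets more capital than the same forecast in a noisy one.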
Results
| metric | value |
|---|---|
| dir_acc | 0.510 |
| information coef. | 0.025 |
| sharpe (net) | negative |
| coverage | on-target |
Metrics are computed over the most recent 6-month rolling window and recomputed nightly. The negative Sharpe is named on purpose: it is the thing the next six months of work needs to fix, and pretending otherwise would be the kind of mistake V3.1 is meant to teach against.
[Figure: per-symbol breakdown of dir_acc and IC for TATASTEEL and RELIANCE over the last four rolling windows, rendered from the backtest run log, with rolling-window dates on the x-axis and a note on sample size per window.]
The universe is deliberately small right now. Expanding beyond two symbols is gated on stable IC across those two, not on elapsed time or a calendar. Every previous version expanded the universe too early and paid for it in debugging time.
What I rejected, and why
Each of these is a credible-looking architectural move that I have explicitly chosen not to make in V7. Listing them matters because shipping any of them would be more visible than not shipping them, and that restraint is a deliberate decision.
Deep sequence models (Transformer, LSTM, Mamba-class). Plausible for long-range sequence modelling. Not plausible for tabular, mixed-frequency, missingness-heavy financial features at the signal-to-noise ratio I'm seeing. Revisit when the base GBM model shows stable positive net Sharpe and the bottleneck is sequence memory rather than feature quality.
Reinforcement learning. RL requires a reward signal, and the reward signal requires a working forecast first. Building an RL layer on top of a forecaster with IC = 0.025 would be optimizing against noise.
Streaming infrastructure. This is infrastructure for a scale I do not have. V7 trains on a laptop-sized corpus and runs nightly batch. A streaming rewrite would be a signal to recruiters that I know Kafka. It would not make the system better.
A hot-path performance rewrite. 99% of my time is spent in Pandas-land doing feature ideation and label audits. The hot path is not the bottleneck. A rewrite would move work from the part of the system that matters to the part that does not.
What's next
SHAP analysis across the live V7 Config A features to identify dead weight and remove it before adding more. Config B features ship incrementally, each gated on a held-out IC gain that is larger than the noise band of the rolling window — most candidates are cut here rather than graduated.
A reframe of the training objective from directional accuracy toward P(net profit per trade after costs) is the largest pending change. Directional accuracy is a proxy for a proxy; the actual question is whether the forecast crosses the cost threshold, and training for that directly is one experiment away.
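The relabelling side of that experiment is small. A sketch under loud assumptions: the cost number is a made-up placeholder for the real STT + brokerage + slippage stack, and only the long side is shown.

```python
import numpy as np

def net_profit_label(entry_px, exit_px, cost_bps=5.0):
    """Binary label: did the trade clear the cost stack?

    `cost_bps` is a hypothetical round-trip cost in basis points
    standing in for STT + brokerage + slippage; the real figure
    depends on instrument and broker. Long side only, for brevity.
    """
    entry = np.asarray(entry_px, dtype=float)
    gross = (np.asarray(exit_px, dtype=float) - entry) / entry
    net = gross - cost_bps / 1e4
    return (net > 0).astype(int)
```

Two trades with the same correct direction then get different labels if only one of them clears costs, which is exactly the distinction dir_acc cannot see.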
If dir_acc stalls at 0.51 and IC stops improving over the next two rolling windows, V7 gets a post-mortem and V8 starts from the data side — alternative labellings, different horizon sets — rather than from the model side. The model is not the bottleneck right now, and the next version will be about proving that.