V7 — a NIFTY 50 forward-return forecasting system
case study · updated 2026
The problem
Short-horizon forward-return prediction on liquid Indian equities is adversarial in the mundane ways that matter most. Signal-to-noise is low, regime shifts are frequent, standard validation splits leak future information into training almost by default, and the cost stack on Indian exchanges — STT, brokerage, slippage — eats unremarkable edges alive. An accuracy of 52% can be a profitable strategy or a losing one depending entirely on where the 2% sits in the distribution of trades.
This is a small problem wearing a big problem's clothes. The hard part is not the model. The hard part is being willing to tell the truth about what the model is actually doing.
The version history
Most of what's been learned on this project is encoded in the gap between what a version claimed and what it turned out to be doing.
V3.1 — the lie. dir_acc = 0.543. I was excited. An audit of the label-construction pipeline uncovered two distinct leakage paths (documented in the post-mortem linked in the writing section). Both involved information that was only nominally in the past being used to build "forward" labels. Removing them dropped the reported metric by four percentage points in a single commit.
A model that looks too good on a first run almost always is. The habit I picked up from this: before believing any new metric, run the same pipeline on randomly shuffled labels. If dir_acc stays above 0.5, the leakage is in the pipeline, not the model.
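That habit can be made mechanical. Below is a minimal sketch of the shuffled-label smoke test; `fit_predict` is a hypothetical stand-in for the real train/predict pipeline, and the threshold logic is the only point.

```python
import numpy as np

def shuffled_label_check(features, labels, fit_predict, n_trials=20, seed=0):
    """Leakage smoke test: retrain on randomly shuffled labels.

    `fit_predict(X_train, y_train, X_test)` is a hypothetical stand-in
    for the real pipeline. If directional accuracy on shuffled labels
    stays meaningfully above 0.5, the leakage is in the pipeline.
    """
    rng = np.random.default_rng(seed)
    split = int(0.8 * len(labels))      # simple temporal holdout
    accs = []
    for _ in range(n_trials):
        y_shuf = rng.permutation(labels)
        preds = fit_predict(features[:split], y_shuf[:split], features[split:])
        accs.append(np.mean(np.sign(preds) == np.sign(y_shuf[split:])))
    return float(np.mean(accs))
```

On shuffled labels the returned accuracy should hover near 0.5; anything persistently above that is the pipeline talking, not the model.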
V5 — the honest baseline. dir_acc = 0.496. Below random. All Sharpes negative. This is the number that should have existed in V3 if the pipeline had been right. V5 is not a good model. V5 is the point from which honest progress can be measured.
V6 — the bridge. No headline metric. V6 was a full pipeline rewrite: strict temporal ordering, one place for label construction, feature manifests versioned alongside code, SQLite schemas locked with migrations. Shipping numbers was explicitly not the goal — the goal was removing the possibility of the V3.1 class of bug recurring.
V7 — the recovery. dir_acc = 0.510, IC = 0.025 on TATASTEEL and RELIANCE. Still not profitable after costs. But each improvement from V5 onward is now honest, which is the part that matters.
System architecture
Six data sources: 1-min OHLCV bars from 2015 through 2026; 5-min options chain snapshots; VIX (level and term structure); index and stock futures; FII/DII cash-market flows; and GIFT NIFTY for overnight context. All of it lands in a single SQLite database.
SQLite is a single file I can inspect with sqlite3, diff across runs, and back up trivially. The full corpus fits on disk with room to spare. The day the symbol universe or the bar frequency changes meaningfully, this gets revisited — but picking Postgres today would be solving a scale problem I do not have.
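A sketch of what "a single file I can inspect" buys: the whole store is plain SQL. The table and column names here are hypothetical stand-ins for the real schema, and the example uses an in-memory database so it runs anywhere.

```python
import sqlite3

# Hypothetical schema — the real table and column names differ.
conn = sqlite3.connect(":memory:")   # on disk this is one file, e.g. market.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS bars_1m (
        symbol TEXT NOT NULL,
        ts     TEXT NOT NULL,        -- ISO-8601 bar timestamp
        open REAL, high REAL, low REAL, close REAL,
        volume INTEGER,
        PRIMARY KEY (symbol, ts)     -- dedupes re-ingested bars
    )
""")
conn.execute(
    "INSERT OR IGNORE INTO bars_1m VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("TATASTEEL", "2026-01-05T09:15:00", 131.0, 131.4, 130.8, 131.2, 48210),
)
conn.commit()
row = conn.execute(
    "SELECT close FROM bars_1m WHERE symbol = ?", ("TATASTEEL",)
).fetchone()
```

The composite primary key makes re-ingestion idempotent, which is what lets the same loader run nightly without dedup logic elsewhere.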
Feature engineering
Two configurations run side by side. Config A is the set currently live in V7. Config B is where new features get prototyped — they graduate into Config A only when they demonstrably move held-out IC.
Config A — live. Greeks (delta, gamma, theta, vega) on liquid strikes; IV surface shape beyond ATM (skew, term); dealer-positioning proxies derived from options open interest; options flow; FII/DII net divergence; VIX term structure; a compact set of OHLCV derivatives.
Config B — in progress. Fractional differentiation to preserve memory while inducing stationarity; Parkinson, Garman–Klass, and Rogers–Satchell volatility estimators; seasonality normalization across intraday, day-of-week, and calendar effects; approximate entropy (ApEn); Hurst exponent.
I keep a mental separation between features I've built and features that have earned their place in the ensemble. Every Config B candidate is held out of the live model until it produces a statistically defensible IC lift on the last rolling window. Most candidates do not make it.
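As an illustration of the Config B volatility candidates, here is the Parkinson estimator — a standard range-based formula, not V7's exact implementation, which may differ in windowing and annualization.

```python
import numpy as np

def parkinson_vol(high, low):
    """Parkinson range-based volatility estimator.

    sigma^2 = mean(ln(H/L)^2) / (4 ln 2). Uses each bar's high/low
    range, which is more statistically efficient than close-to-close
    variance when bars have no drift or gaps. Returns per-bar vol
    at the input bar frequency (no annualization applied).
    """
    hl = np.log(np.asarray(high, dtype=float) / np.asarray(low, dtype=float))
    return float(np.sqrt(np.mean(hl ** 2) / (4.0 * np.log(2.0))))
```

Garman–Klass and Rogers–Satchell follow the same shape with extra open/close terms; each would be gated on held-out IC lift like any other candidate.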
Modelling
Three gradient-boosted regressors — LightGBM, XGBoost, CatBoost — train independently per horizon. Their out-of-fold predictions become the feature set for a Ridge meta-learner, which produces the final forecast.
Why GBMs and not a Transformer, an LSTM, or a Mamba-class state-space model: the data is tabular, heterogeneous, and irregularly missing. GBMs handle that natively; deep models would need an imputation pipeline that is itself a source of bugs. More importantly, iteration speed matters. I want to try ten feature ideas in a week, not one, and the marginal return of the deep-model path on tabular financial data of this shape is not there in the literature and has not been there in my own quick experiments.
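The stacking scheme above can be sketched as follows. This is a minimal sketch with sklearn's `GradientBoostingRegressor` standing in for all three GBM libraries and synthetic data standing in for the feature matrix; the real system trains per horizon.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = X[:, 0] * 0.1 + rng.normal(scale=0.5, size=600)   # toy target

# Stand-ins for LightGBM / XGBoost / CatBoost.
base_models = [GradientBoostingRegressor(random_state=s) for s in range(3)]

# Out-of-fold predictions from each base model become the meta-features,
# using time-ordered splits so no fold trains on its own future.
tscv = TimeSeriesSplit(n_splits=5)
oof = np.full((len(y), len(base_models)), np.nan)
for train_idx, val_idx in tscv.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict(X[val_idx])

# The earliest block never appears as validation, so it has no OOF preds.
mask = ~np.isnan(oof).any(axis=1)
meta = Ridge(alpha=1.0).fit(oof[mask], y[mask])
final_forecast = meta.predict(oof[mask])
```

The Ridge meta-learner sees only the three OOF prediction columns, so its weights measure how much each base model contributes without refitting on leaked in-fold scores.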
Loss is Huber regression with δ tuned per horizon — Huber is less punitive on the fat tails that dominate financial returns.
| horizon | δ | intuition |
|---|---|---|
| 30m | 0.005 | wide enough to not treat minor tape noise as error |
| 1h | 0.008 | scaled with horizon-implied vol |
| 4h | 0.015 | broader shoulders, still squared in the centre |
| 1d | 0.025 | accommodates overnight gap distribution |
Values are approximate and retuned each rolling window. δ that is too small makes training indistinguishable from squared error; too large and the loss is effectively MAE, which is more robust but slower to pick up signal in calm regimes.
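The piecewise behaviour the table tunes is easy to see numerically. A minimal Huber implementation, checked at the 30m setting (the framework libraries provide their own; this is just the formula):

```python
import numpy as np

def huber(residual, delta):
    """Huber loss: quadratic near zero, linear in the tails.

    0.5 * r^2 for |r| <= delta, else delta * (|r| - 0.5 * delta),
    so fat-tailed return outliers are penalized less than under
    squared error while small residuals still get squared gradients.
    """
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# At the 30m setting (delta = 0.005): a 0.3% miss stays in the
# quadratic zone, a 3% miss is charged linearly.
small, large = huber(0.003, 0.005), huber(0.03, 0.005)
```

Under squared error the 3% miss would cost 9e-4; Huber charges it 1.375e-4, which is the "less punitive on fat tails" claim made concrete.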
Validation
Walk-forward CV is the only honest choice for time-series of this shape. Each fold uses an expanding training window, a purge gap to prevent label-window leakage across the train/validation boundary, and an out-of-sample test block.
Classical purged CV uses an embargo window between validation and test to guard against serial correlation leaking via overlapping labels. V7's triple-barrier labelling already short-circuits the path that embargo is defending against, and adding an embargo on top was empirically costing roughly 0.003 IC with no improvement in out-of-sample stability I could measure. This is re-examined every time the labelling scheme changes.
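The split geometry can be sketched in a few lines. Parameter names and default sizes here are illustrative, not V7's actual configuration:

```python
def walk_forward_splits(n, n_folds=4, test_size=100, purge=10, min_train=100):
    """Expanding-window walk-forward splits with a purge gap.

    Each fold trains on [0, train_end), skips `purge` bars so
    forward-looking labels at the end of training cannot overlap
    the test block, then tests on the next `test_size` bars.
    """
    splits = []
    for k in range(n_folds):
        test_start = min_train + purge + k * test_size
        test_end = test_start + test_size
        if test_end > n:                  # not enough data for this fold
            break
        train_end = test_start - purge    # the purge gap sits here
        splits.append((list(range(train_end)),
                       list(range(test_start, test_end))))
    return splits
```

Each successive fold's training window absorbs the previous fold's test block, which is the "expanding" part; the purge gap stays fixed at the boundary.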
Conformal calibration sits on top of the Ridge forecast: for each prediction, produce a coverage-calibrated interval. Those intervals feed position sizing, not the entry decision itself. Sizing by uncertainty is where most of V7's remaining edge comes from.
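A minimal split-conformal sketch of that layer, assuming symmetric intervals and a simple inverse-width sizing rule (`size_from_width` and its `k` are hypothetical, not V7's sizing function):

```python
import numpy as np

def conformal_interval(cal_residuals, point_forecast, alpha=0.1):
    """Split-conformal interval around a point forecast.

    `cal_residuals` are |y - yhat| on a held-out calibration set.
    The finite-sample-corrected (1 - alpha) quantile of those
    residuals gives a symmetric interval with ~(1 - alpha) coverage.
    """
    n = len(cal_residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(cal_residuals), q_level)
    return point_forecast - q, point_forecast + q

def size_from_width(lo, hi, k=0.01):
    """Hypothetical sizing rule: position shrinks as the interval widens."""
    return k / max(hi - lo, 1e-9)
```

The entry decision never touches the interval; only the size does, so a confident forecast in a calm regime gets more capital than the same forecast in a noisy one.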
Results
| metric | value |
|---|---|
| dir_acc | 0.510 |
| information coef. | 0.025 |
| sharpe (net) | negative |
| coverage | on-target |
Metrics are computed over the most recent 6-month rolling window and recomputed nightly. The negative Sharpe is named on purpose: it is the thing the next six months of work needs to fix, and pretending otherwise would be the kind of mistake V3.1 is meant to teach against.
[Figure: per-symbol breakdown of dir_acc and IC for TATASTEEL and RELIANCE over the last four rolling windows, rendered from the backtest run log, with rolling-window dates on the x-axis and a note on sample size per window.]
The universe is deliberately small right now. Expanding beyond two symbols is gated on stable IC across those two, not on elapsed time or a calendar. Every previous version expanded the universe too early and paid for it in debugging time.
What I rejected, and why
Each of these is a credible-looking architectural move that I have explicitly chosen not to make in V7. Listing them matters because shipping any of them would be more visible than not shipping them, and that restraint is a deliberate decision.
Deep sequence models (Transformer, LSTM, Mamba-class). Plausible for long-range sequence modelling. Not plausible for tabular, mixed-frequency, missingness-heavy financial features at the signal-to-noise ratio I'm seeing. Revisit when the base GBM model shows stable positive net Sharpe and the bottleneck is sequence memory rather than feature quality.
Reinforcement learning. RL requires a reward signal, and the reward signal requires a working forecast first. Building an RL layer on top of a forecaster with IC = 0.025 would be optimizing against noise.
Streaming infrastructure. This is infrastructure for a scale I do not have. V7 trains on a laptop-sized corpus and runs nightly batch. A streaming rewrite would be a signal to recruiters that I know Kafka. It would not make the system better.
A hot-path performance rewrite. 99% of my time is spent in Pandas-land doing feature ideation and label audits. The hot path is not the bottleneck. A rewrite would move work from the part of the system that matters to the part that does not.
What's next
SHAP analysis across the live V7 Config A features to identify dead weight and remove it before adding more. Config B features ship incrementally, each gated on a held-out IC gain that is larger than the noise band of the rolling window — most candidates are cut here rather than graduated.
A reframe of the training objective from directional accuracy toward P(net profit per trade after costs) is the largest pending change. Directional accuracy is a proxy for a proxy; the actual question is whether the forecast crosses the cost threshold, and training for that directly is one experiment away.
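The relabelling side of that experiment is small. A sketch under loud assumptions: the cost number is a made-up placeholder for the real STT + brokerage + slippage stack, and only the long side is shown.

```python
import numpy as np

def net_profit_label(entry_px, exit_px, cost_bps=5.0):
    """Binary label: did the trade clear the cost stack?

    `cost_bps` is a hypothetical round-trip cost in basis points
    standing in for STT + brokerage + slippage; the real figure
    depends on instrument and broker. Long side only, for brevity.
    """
    entry = np.asarray(entry_px, dtype=float)
    gross = (np.asarray(exit_px, dtype=float) - entry) / entry
    net = gross - cost_bps / 1e4
    return (net > 0).astype(int)
```

Two trades with the same correct direction then get different labels if only one of them clears costs, which is exactly the distinction dir_acc cannot see.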
If dir_acc stalls at 0.51 and IC stops improving over the next two rolling windows, V7 gets a post-mortem and V8 starts from the data side — alternative labellings, different horizon sets — rather than from the model side. The model is not the bottleneck right now, and the next version will be about proving that.