
The Role of Alternative Data in Quantitative Finance

Satellite imagery, transaction panels, and e-commerce pricing data as nowcasting signals, built on rigorous data governance.

Abstract

This paper examines the role of alternative (non-traditional) data sources in quantitative finance, including satellite imagery, transaction panels, and e-commerce pricing data. We present a nowcasting framework, a machine learning pipeline for alternative data features, and a data governance checklist addressing timestamp integrity, survivorship bias, and legal rights. The core thesis is that the real edge lies in the translation layer that converts messy external traces into clean, point-in-time, economically interpretable variables.

Key Takeaways

- Alternative data earns its keep when it is earlier, orthogonal, or structured in a way the market has not fully absorbed.
- A nowcasting regression is only as good as the point-in-time integrity of its features.
- Governance (timestamp integrity, survivorship bias, legal rights) is a research problem, not a compliance afterthought.
- The translation layer that converts messy external traces into clean, economically interpretable variables is where most of the alpha lives.

Introduction

Traditional market data tells you what prices did. Alternative data aims to tell you why they might move next. If public prices already summarize common information, a quant edge must often come from signals that are earlier, orthogonal, or simply structured in a way that the market has not fully absorbed.

A useful framing: for a news item or data point arriving at time \(t\), the research task is:

Nowcasting Model
$$y_{t+1} = \beta^\top x_t + \epsilon_{t+1}$$

The challenge is not writing that equation; it is making sure \(x_t\) is truly available at time \(t\), properly normalized, and economically linked to the target. Alternative data examples include:

- Satellite imagery, used as a physical-activity proxy (e.g. retail parking-lot counts)
- Transaction panels of aggregated consumer spending
- E-commerce pricing and web-traffic data
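
The availability requirement can be enforced mechanically with an as-of join rather than trusted by convention. A minimal sketch using pandas' merge_asof; the column names (publish_time, decision_time, value) and timestamps are illustrative, not from the paper:

```python
import pandas as pd

# Raw alternative-data feed: each record carries the time it became public.
feed = pd.DataFrame({
    "publish_time": pd.to_datetime(
        ["2024-01-02 18:00", "2024-01-04 09:00", "2024-01-05 22:00"]),
    "value": [1.0, 1.4, 0.9],
})

# Decision times: suppose we trade at each day's 16:00 close.
decisions = pd.DataFrame({
    "decision_time": pd.to_datetime(
        ["2024-01-03 16:00", "2024-01-04 16:00", "2024-01-05 16:00"]),
})

# For each decision time, merge_asof picks the latest record whose
# publish_time is <= decision_time, so x_t is truly available at time t.
# Both frames must be sorted on their join keys.
aligned = pd.merge_asof(decisions, feed,
                        left_on="decision_time", right_on="publish_time")
print(aligned)
```

Note that the 2024-01-05 decision still sees the 2024-01-04 value, because the later record was published after the close; a naive join on calendar date would have leaked it.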

Machine Learning Pipeline

alt_data_model.py
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import r2_score

# Toy alternative-data features; real features must be point-in-time aligned.
df = pd.DataFrame({
    "web_price_change":    [0.01, -0.02,  0.00,  0.03,  0.01, -0.01],
    "traffic_index":       [102,    98,   101,   110,   108,    99 ],
    "transaction_growth":  [0.05,  0.01,  0.02,  0.06,  0.04,  0.00],
    "target":              [0.02, -0.01,  0.00,  0.03,  0.01, -0.02],
})

X = df.drop(columns=["target"])
y = df["target"]

# Walk-forward validation: each fold trains on the past and tests on the
# future. With only 6 rows, n_splits=2 yields 2-sample test folds; r2_score
# is undefined on single-sample folds, so n_splits=3 would produce NaN here.
tscv = TimeSeriesSplit(n_splits=2)
scores = []

for train_idx, test_idx in tscv.split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    scores.append(r2_score(y.iloc[test_idx], pred))

print("Mean OOS R²:", sum(scores) / len(scores))
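
TimeSeriesSplit prevents training on the future, but features built from trailing windows can still straddle the train/test boundary. A common remedy is an embargo gap between the two sets. A minimal hand-rolled sketch (embargoed_splits is a hypothetical helper, not part of scikit-learn):

```python
def embargoed_splits(n, n_splits=3, embargo=1):
    """Walk-forward splits that skip `embargo` rows between train and test,
    so trailing-window features cannot leak future information into training."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        test_start = train_end + embargo
        test_end = min(test_start + fold, n)
        if test_start >= test_end:
            break
        yield list(range(train_end)), list(range(test_start, test_end))

for train_idx, test_idx in embargoed_splits(12, n_splits=3, embargo=2):
    print(len(train_idx), test_idx)
```

In practice, scikit-learn's TimeSeriesSplit exposes a gap parameter that serves the same purpose; the point of the sketch is that the embargo length should match the longest lookback window in the feature set.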

Data Governance Checklist

- Timestamp integrity: store both the observation time and the availability (publication) time of every record, and backtest only on what was available.
- Survivorship bias: keep delisted securities, discontinued products, and departed panel members in the history rather than only today's survivors.
- Legal rights: verify licensing terms, personal-data compliance, and that collection methods do not convey material non-public information.

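The timestamp-integrity item can be enforced in code before any record reaches a backtest. A minimal sketch, assuming each record carries an observation time and a publication time; check_timestamps and the column names are illustrative, not from the paper:

```python
import pandas as pd

def check_timestamps(df, obs_col="obs_time", pub_col="publish_time", as_of=None):
    """Return a list of point-in-time violations found in the panel."""
    issues = []
    # A record cannot be published before the event it describes.
    if (df[pub_col] < df[obs_col]).any():
        issues.append("publish_time earlier than obs_time (impossible ordering)")
    # Nothing published after the backtest's as-of date may be visible.
    if as_of is not None and (df[pub_col] > pd.Timestamp(as_of)).any():
        issues.append("rows published after the as-of date (lookahead risk)")
    return issues

panel = pd.DataFrame({
    "obs_time":     pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "publish_time": pd.to_datetime(["2024-01-03", "2024-01-02"]),  # 2nd row broken
})
print(check_timestamps(panel, as_of="2024-01-02"))
```

A check like this is cheap to run on every vendor delivery, and it catches the most common failure mode: a vendor restating history with today's knowledge.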
Conclusion

The best alternative-data teams behave less like headline chasers and more like measurement scientists. They spend as much time on ontology, joins, timestamp integrity, and missing-data behavior as they do on modeling. The real edge comes from converting messy external traces into clean, point-in-time, economically interpretable variables. That translation layer is where most of the alpha lives.