Probabilistic Sharpe Ratio (PSR) and Backtest Overfitting
A statistically rigorous alternative to the raw Sharpe ratio that adjusts for sample length, skewness, and kurtosis, helping to detect backtest overfitting.
This paper examines the Probabilistic Sharpe Ratio (PSR), a statistically rigorous extension of the classical Sharpe ratio that accounts for sample length, skewness, and kurtosis. PSR estimates the probability that an observed Sharpe ratio exceeds a benchmark, providing an inferential framework that penalizes short histories and non-normal return distributions. We present the mathematical formulation, a Python implementation, and discuss its application in detecting backtest overfitting.
Key Takeaways
- A raw Sharpe ratio says nothing about statistical confidence, non-normality, or multiple testing; PSR directly addresses the first two.
- PSR inflates uncertainty when returns are asymmetric or fat-tailed, penalizing strategies that benefit from favorable sampling noise.
- Ranking strategies by PSR instead of raw Sharpe forces a stronger evidentiary standard: short histories and ugly tail behavior are penalized.
- Standard Sharpe is descriptive; PSR is inferential -- it lets you test whether estimated skill exceeds a benchmark such as SR* = 1.
Introduction
The Sharpe ratio is one of the most abused statistics in quantitative finance. Two strategies can have the same Sharpe ratio even if one is estimated from a short, skewed, fat-tailed sample and the other from a long, well-behaved history. A raw Sharpe number says nothing about statistical confidence, non-normality, or multiple testing.
The Probabilistic Sharpe Ratio (PSR) estimates the probability that an observed Sharpe ratio \(\widehat{SR}\) exceeds a benchmark \(SR^*\), while adjusting for skewness and kurtosis:

\[
PSR(SR^*) = \Phi\!\left( \frac{(\widehat{SR} - SR^*)\sqrt{T-1}}{\sqrt{1 - \gamma_3\,\widehat{SR} + \frac{\gamma_4 - 1}{4}\,\widehat{SR}^2}} \right)
\]
Here, \(T\) is the sample length, \(\gamma_3\) is the skewness, \(\gamma_4\) is the (non-excess, Pearson) kurtosis, and \(\Phi\) is the standard normal CDF. The denominator inflates the uncertainty of the Sharpe estimate when returns are asymmetric or fat-tailed, which is exactly what the classical Sharpe ratio ignores.
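To see the non-normality penalty in isolation, the denominator above can be read as the standard error of the Sharpe estimate. The sketch below (not from the original paper; the per-period Sharpe and the skew/kurtosis values are illustrative assumptions) compares that standard error for Gaussian returns versus negatively skewed, fat-tailed returns:

```python
import numpy as np

def sharpe_std_error(sr_hat, T, g3, g4):
    """Standard error of the Sharpe estimate implied by the PSR denominator:
    sqrt((1 - g3*SR + (g4 - 1)/4 * SR^2) / (T - 1))."""
    return np.sqrt((1 - g3 * sr_hat + (g4 - 1) / 4.0 * sr_hat**2) / (T - 1))

sr_hat, T = 0.1, 252  # hypothetical per-period Sharpe over one year of daily data

# Gaussian returns: skew 0, Pearson kurtosis 3
se_normal = sharpe_std_error(sr_hat, T, g3=0.0, g4=3.0)
# Negatively skewed, fat-tailed returns: skew -1, Pearson kurtosis 8
se_fat = sharpe_std_error(sr_hat, T, g3=-1.0, g4=8.0)

print(se_normal, se_fat)  # the fat-tailed standard error is larger
```

With a wider standard error, the same observed Sharpe translates into a lower probability of genuinely beating the benchmark.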
Python Implementation
```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, norm

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0, periods_per_year=252):
    """Probability that the true Sharpe ratio exceeds sr_benchmark
    (benchmark given in annualized terms)."""
    r = pd.Series(returns).dropna()
    T = len(r)
    # The PSR formula applies to the per-period (non-annualized) Sharpe,
    # so compute it per period and convert the benchmark to the same units.
    sr_hat = r.mean() / r.std(ddof=1)
    sr_star = sr_benchmark / np.sqrt(periods_per_year)
    g3 = skew(r, bias=False)
    g4 = kurtosis(r, fisher=False, bias=False)  # Pearson (non-excess) kurtosis
    numerator = (sr_hat - sr_star) * np.sqrt(T - 1)
    denominator = np.sqrt(1 - g3 * sr_hat + ((g4 - 1) / 4.0) * sr_hat**2)
    z = numerator / denominator
    return {
        "sharpe_annualized": np.sqrt(periods_per_year) * sr_hat,
        "psr": norm.cdf(z),
        "skew": g3,
        "kurtosis": g4,
        "z_score": z,
    }

# Example: 500 simulated daily returns
np.random.seed(42)
rets = np.random.normal(0.0005, 0.01, 500)
print(probabilistic_sharpe_ratio(rets, sr_benchmark=1.0))
```
Ranking strategies by PSR instead of raw Sharpe forces the strategy to "earn" its Sharpe under a stronger evidentiary standard. It penalizes short histories, punishes ugly tail behavior, and gives you a way to compare estimated skill against a benchmark such as \(SR^* = 1\). Standard Sharpe is descriptive. PSR is inferential.
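The sample-length penalty can be sketched with a closed-form helper (the `psr` function and the per-period Sharpe values below are illustrative assumptions, not from the original paper): two strategies with identical Sharpe estimates receive very different PSRs when one history is ten times longer.

```python
import numpy as np
from scipy.stats import norm

def psr(sr_hat, sr_star, T, g3=0.0, g4=3.0):
    """Closed-form PSR from per-period Sharpe statistics."""
    z = (sr_hat - sr_star) * np.sqrt(T - 1) / np.sqrt(
        1 - g3 * sr_hat + (g4 - 1) / 4.0 * sr_hat**2
    )
    return norm.cdf(z)

# Same per-period Sharpe of 0.08 (roughly 1.27 annualized at 252 periods),
# benchmark SR* = 0 per period, Gaussian tails assumed.
short_history = psr(0.08, 0.0, T=126)   # six months of daily data
long_history = psr(0.08, 0.0, T=1260)  # five years of daily data

print(short_history, long_history)  # the longer history earns a higher PSR
```

The point estimate is identical in both cases; only the evidence behind it differs, and PSR is what makes that difference visible.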