Probabilistic Sharpe Ratio (PSR) and Backtest Overfitting

A statistically rigorous alternative to the raw Sharpe ratio that accounts for sample length, skewness, and kurtosis, providing an inferential tool for detecting backtest overfitting.

Abstract

This paper examines the Probabilistic Sharpe Ratio (PSR), a statistically rigorous extension of the classical Sharpe ratio that accounts for sample length, skewness, and kurtosis. PSR estimates the probability that an observed Sharpe ratio exceeds a benchmark, providing an inferential framework that penalizes short histories and non-normal return distributions. We present the mathematical formulation, a Python implementation, and discuss its application in detecting backtest overfitting.

Key Takeaways

- PSR converts an observed Sharpe ratio into the probability that it exceeds a benchmark \(SR^*\).
- Short samples, negative skew, and fat tails all reduce PSR, even when the raw Sharpe ratio looks attractive.
- Raw Sharpe is descriptive; PSR is inferential, which makes it better suited to screening backtests for overfitting.

Introduction

The Sharpe ratio is one of the most abused statistics in quantitative finance. Two strategies can have the same Sharpe ratio even if one is estimated from a short, skewed, fat-tailed sample and the other from a long, well-behaved history. A raw Sharpe number says nothing about statistical confidence, non-normality, or multiple testing.

Standard Sharpe Ratio
$$\widehat{SR} = \frac{\hat{\mu}}{\hat{\sigma}}$$

The Probabilistic Sharpe Ratio (PSR) estimates the probability that an observed Sharpe ratio exceeds a benchmark \(SR^*\), while adjusting for skewness and kurtosis:

Probabilistic Sharpe Ratio
$$PSR(SR^*) = \Phi\left( \frac{(\widehat{SR} - SR^*)\sqrt{T-1}} {\sqrt{1 - \gamma_3 \widehat{SR} + \frac{\gamma_4 - 1}{4}\widehat{SR}^2}} \right)$$

Here, \(T\) is the sample length, \(\gamma_3\) is skewness, \(\gamma_4\) is kurtosis, and \(\Phi\) is the standard normal CDF. The denominator inflates uncertainty when returns are asymmetric or fat-tailed — exactly what classical Sharpe ignores.
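To make that inflation concrete, consider an illustrative case (numbers chosen for exposition, not taken from any dataset): a per-period \(\widehat{SR} = 0.1\) with left skew \(\gamma_3 = -1\) and fat tails \(\gamma_4 = 6\). The denominator becomes

$$\sqrt{1 - (-1)(0.1) + \tfrac{6-1}{4}(0.1)^2} = \sqrt{1.1125} \approx 1.055,$$

versus \(\sqrt{1 + \tfrac{3-1}{4}(0.1)^2} \approx 1.002\) under normality. The same Sharpe gap therefore earns a z-score about 5% smaller, and a correspondingly lower PSR.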

Python Implementation

psr.py

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, norm

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0, periods_per_year=252):
    """Probability that the true Sharpe ratio exceeds sr_benchmark
    (benchmark given in annualized terms)."""
    r = pd.Series(returns).dropna()

    # The PSR statistic is defined on the per-period Sharpe ratio;
    # plugging an annualized Sharpe into the formula with sqrt(T - 1)
    # would overstate confidence. Convert the benchmark to match.
    sr_per  = r.mean() / r.std(ddof=1)
    sr_star = sr_benchmark / np.sqrt(periods_per_year)

    T  = len(r)
    g3 = skew(r, bias=False)
    g4 = kurtosis(r, fisher=False, bias=False)  # Pearson (non-excess) kurtosis

    numerator   = (sr_per - sr_star) * np.sqrt(T - 1)
    denominator = np.sqrt(1 - g3 * sr_per + ((g4 - 1) / 4.0) * sr_per**2)
    z = numerator / denominator

    return {
        "sharpe":   np.sqrt(periods_per_year) * sr_per,  # annualized, for reporting
        "psr":      norm.cdf(z),
        "skew":     g3,
        "kurtosis": g4,
        "z_score":  z,
    }

# Example
np.random.seed(42)
rets = np.random.normal(0.0005, 0.01, 500)
print(probabilistic_sharpe_ratio(rets, sr_benchmark=1.0))

Ranking strategies by PSR instead of raw Sharpe forces the strategy to "earn" its Sharpe under a stronger evidentiary standard. It penalizes short histories, punishes ugly tail behavior, and gives you a way to compare estimated skill against a benchmark such as \(SR^* = 1\). Standard Sharpe is descriptive. PSR is inferential.
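A minimal sketch of what such a ranking might look like in practice. The strategy names and return series below are synthetic, and the compact psr helper simply restates the formula above in per-period units:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, norm

def psr(r, sr_star=0.0):
    # PSR = Phi((SR - SR*) * sqrt(T - 1) / sqrt(1 - g3*SR + (g4 - 1)/4 * SR^2)),
    # with SR and SR* in per-period units.
    r = pd.Series(r).dropna()
    sr = r.mean() / r.std(ddof=1)
    g3 = skew(r, bias=False)
    g4 = kurtosis(r, fisher=False, bias=False)  # Pearson kurtosis
    z = (sr - sr_star) * np.sqrt(len(r) - 1) / np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr**2)
    return norm.cdf(z)

rng = np.random.default_rng(0)
# Synthetic contrast: a long, well-behaved history vs. a short, fat-tailed one.
strategies = {
    "long_normal":     rng.normal(0.0004, 0.01, 2000),
    "short_fat_tails": rng.standard_t(3, 120) * 0.01 + 0.0008,
}

table = pd.DataFrame({
    "sharpe_ann": {k: np.sqrt(252) * np.mean(v) / np.std(v, ddof=1)
                   for k, v in strategies.items()},
    "psr_vs_0":   {k: psr(v) for k, v in strategies.items()},
}).sort_values("psr_vs_0", ascending=False)
print(table)
```

Sorting on the PSR column rather than the annualized Sharpe column is the point: the short, fat-tailed sample has to clear a higher evidentiary bar before it ranks ahead of the long, well-behaved one.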