When faced with sequential decision-making problems, it is often useful to be
able to predict what would happen if decisions were made using a new policy.
Those predictions must often be based on data collected under some previously
used decision-making rule. Many previous methods enable such off-policy (or
counterfactual) estimation of the expected value of a performance measure
called the return. In this paper, we take the first steps towards a universal
off-policy estimator (UnO): one that provides off-policy estimates and
high-confidence bounds for any parameter of the return distribution. We use UnO
for estimating and simultaneously bounding the mean, variance,
quantiles/median, inter-quantile range, CVaR, and the entire cumulative
distribution of returns. Finally, we discuss UnO’s applicability in
various settings, including those that are fully observable, partially
observable (i.e., with unobserved confounders), Markovian, non-Markovian,
stationary, or smoothly non-stationary, and those with discrete distribution
shifts.