Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm. (arXiv:2103.09847v1 [cs.LG])

In this paper, we investigate the sample complexity of policy evaluation in
infinite-horizon offline reinforcement learning (also known as the off-policy
evaluation problem) with linear function approximation. We identify a hard
regime $dgamma^{2}>1$, where $d$ is the dimension of the feature vector and
$gamma$ is the discount rate. In this regime, for any $qin[gamma^{2},1]$, we
can construct a hard instance such that the smallest eigenvalue of its feature
covariance matrix is $q/d$ and it requires
samples to approximate the value function up to an additive error
$varepsilon$. Note that the lower bound of the sample complexity is
exponential in $d$. If $q=gamma^{2}$, even infinite data cannot suffice. Under
the low distribution shift assumption, we show that there is an algorithm that
needs at most $Oleft(maxleft{ frac{leftVert theta^{pi}rightVert
right)$ samples ($theta^{pi}$ is the parameter of the policy in linear
function approximation) and guarantees approximation to the value function up
to an additive error of $varepsilon$ with probability at least $1-delta$.



Related post