# Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm. (arXiv:2103.09847v1 [cs.LG])

In this paper, we investigate the sample complexity of policy evaluation in

infinite-horizon offline reinforcement learning (also known as the off-policy

evaluation problem) with linear function approximation. We identify a hard

regime $dgamma^{2}>1$, where $d$ is the dimension of the feature vector and

$gamma$ is the discount rate. In this regime, for any $qin[gamma^{2},1]$, we

can construct a hard instance such that the smallest eigenvalue of its feature

covariance matrix is $q/d$ and it requires

$Omegaleft(frac{d}{gamma^{2}left(q-gamma^{2}right)varepsilon^{2}}expleft(Thetaleft(dgamma^{2}right)right)right)$

samples to approximate the value function up to an additive error

$varepsilon$. Note that the lower bound of the sample complexity is

exponential in $d$. If $q=gamma^{2}$, even infinite data cannot suffice. Under

the low distribution shift assumption, we show that there is an algorithm that

needs at most $Oleft(maxleft{ frac{leftVert theta^{pi}rightVert

_{2}^{4}}{varepsilon^{4}}logfrac{d}{delta},frac{1}{varepsilon^{2}}left(d+logfrac{1}{delta}right)right}

right)$ samples ($theta^{pi}$ is the parameter of the policy in linear

function approximation) and guarantees approximation to the value function up

to an additive error of $varepsilon$ with probability at least $1-delta$.

Source: https://arxiv.org/abs/2103.09847