Decision-making for urban autonomous driving is challenging due to the
stochastic nature of interactive traffic participants and the complexity of
road structures. Although reinforcement learning (RL)-based decision-making
schemes are promising for handling urban driving scenarios, they suffer from low
sample efficiency and poor adaptability. In this paper, we propose the Scene-Rep
Transformer to improve the RL decision-making capabilities with better scene
representation encoding and sequential predictive latent distillation.
Specifically, a multi-stage Transformer (MST) encoder is constructed to model
not only interaction awareness between the ego vehicle and its neighbors
but also intention awareness between the agents and their candidate routes. A
sequential latent Transformer (SLT) with self-supervised learning objectives is
employed to distill future predictive information into the latent scene
representation in order to reduce the exploration space and speed up training.
The final decision-making module based on soft actor-critic (SAC) takes as
input the refined latent scene representation from the Scene-Rep Transformer
and outputs driving actions. The framework is validated in five challenging
simulated urban scenarios with dense traffic, where quantitative results show
substantial improvements in data efficiency and in driving performance in
terms of success rate, safety, and efficiency. The qualitative
results reveal that our framework can extract the intentions of neighboring
agents to aid decision-making and deliver more diversified driving behaviors.
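
To make the two kinds of awareness described for the MST encoder concrete, the following is a minimal sketch of staged scaled dot-product attention in NumPy. It is not the paper's architecture: the function names (`attend`, `encode_scene`), the tensor layout (ego vehicle in row 0), the stage ordering, and the residual-style fusion are all assumptions made for illustration, and learned projections, multi-head attention, and the SLT and SAC modules are omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query, keys, values):
    """Scaled dot-product attention for one query of shape (d,)
    over keys and values of shape (n, d)."""
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ values, weights


def encode_scene(agents, routes):
    """agents: (n_agents, d), row 0 is the ego vehicle (assumed layout).
    routes: (n_agents, n_routes, d), candidate routes per agent (assumed)."""
    fused, route_weights = [], []
    # Stage 1 (intention): each agent attends over its own candidate routes.
    for agent, agent_routes in zip(agents, routes):
        context, w = attend(agent, agent_routes, agent_routes)
        fused.append(agent + context)  # residual-style fusion (assumption)
        route_weights.append(w)
    fused = np.stack(fused)
    # Stage 2 (interaction): the ego attends over all route-aware agents.
    latent, agent_weights = attend(fused[0], fused, fused)
    return latent, np.stack(route_weights), agent_weights


rng = np.random.default_rng(0)
agents = rng.normal(size=(4, 8))      # ego + 3 neighbors, feature size 8
routes = rng.normal(size=(4, 3, 8))   # 3 candidate routes per agent
latent, route_weights, agent_weights = encode_scene(agents, routes)
print(latent.shape, route_weights.shape, agent_weights.shape)
```

In this sketch, each row of `route_weights` can be read as a soft preference of one agent over its candidate routes, and `agent_weights` as how strongly the ego attends to each agent; the resulting `latent` vector stands in for the scene representation that a downstream policy would consume.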