Gaussian processes with derivative information are useful in many settings
where gradient observations are available, including numerous Bayesian
optimization and regression tasks that arise in the natural sciences.
Incorporating derivative observations, however, comes with a dominating
$O(N^3D^3)$ computational cost when training on $N$ points in $D$ input
dimensions. This is intractable for even moderately sized problems. While
recent work has addressed this intractability in the low-$D$ setting, the
high-$N$, high-$D$ setting remains unexplored and is of great value, particularly
as machine learning problems become increasingly high dimensional. In this
paper, we introduce methods to achieve fully scalable Gaussian process
regression with derivatives using variational inference. Analogous to the use
of inducing values to sparsify the labels of a training set, we introduce the
concept of inducing directional derivatives to sparsify the partial derivative
information of a training set. This enables us to construct a variational
posterior that incorporates derivative information but whose size depends
neither on the full dataset size $N$ nor the full dimensionality $D$. We
demonstrate the full scalability of our approach on a variety of tasks, ranging
from a high dimensional stellarator fusion regression task to training graph
convolutional neural networks on Pubmed using Bayesian optimization.
Surprisingly, we find that our approach can improve regression performance even
in settings where only label data is available.
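
To make the $O(N^3D^3)$ cost concrete, the following is a minimal sketch of the dense baseline, not the variational method introduced in this paper. It assumes an RBF kernel, and the function name `rbf_kernel_with_gradients` and the `lengthscale` parameter are illustrative choices. Stacking each function value with its $D$ partial derivatives gives a joint covariance matrix of size $N(D+1) \times N(D+1)$, so a Cholesky factorization of it scales as $O(N^3D^3)$.

```python
import numpy as np

def rbf_kernel_with_gradients(X, lengthscale=1.0):
    """Joint covariance of [f(x), grad f(x)] at all rows of X under an RBF kernel.

    Returns an (N*(D+1), N*(D+1)) matrix. This is an illustrative dense
    construction (an assumption for exposition), not a sparse approximation.
    """
    N, D = X.shape
    l2 = lengthscale ** 2
    diff = X[:, None, :] - X[None, :, :]                # diff[a, b] = x_a - x_b
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / l2)  # (N, N) value-value block

    K = np.empty((N, D + 1, N, D + 1))
    K[:, 0, :, 0] = k
    # cov(f(x_a), d f(x_b) / d x_j) = k * (x_a - x_b)_j / l^2
    K[:, 0, :, 1:] = k[:, :, None] * diff / l2
    # cov(d f(x_a) / d x_i, f(x_b)) = -k * (x_a - x_b)_i / l^2
    K[:, 1:, :, 0] = -(k[:, :, None] * diff / l2).transpose(0, 2, 1)
    # cov(d f(x_a) / d x_i, d f(x_b) / d x_j) = k * (delta_ij / l^2 - diff_i diff_j / l^4)
    outer = np.einsum("abi,abj->aibj", diff, diff)
    eye = np.eye(D)[None, :, None, :]
    K[:, 1:, :, 1:] = k[:, None, :, None] * (eye / l2 - outer / l2 ** 2)
    return K.reshape(N * (D + 1), N * (D + 1))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # N = 5 points in D = 3 dimensions
K = rbf_kernel_with_gradients(X)
# Exact inference factorizes this matrix; a small jitter is added for stability.
L = np.linalg.cholesky(K + 1e-6 * np.eye(K.shape[0]))
print(K.shape)  # (20, 20), i.e. N * (D + 1) on each side
```

Even at $N = 5$, $D = 3$ the matrix has 20 rows; at, say, $N = 10^4$ and $D = 100$ it would have about $10^6$ rows, which is why the dense factorization becomes impractical.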