The efficacy of machine learning (ML) models depends on both algorithms and
data. Training data defines what we want our models to learn, and testing data
provides the means by which their empirical progress is measured. Benchmark
datasets define the entire world within which models exist and operate, yet
research continues to focus on critiquing and improving the algorithms rather
than the data on which those models operate. If “data is the new oil,” we are
still missing work on the refineries by which the data itself could be
optimized for more effective use.