Good decision making requires machine-learning models to provide trustworthy
confidence scores. To this end, recent work has focused on miscalibration, i.e.,
the over- or under-confidence of model scores. Yet, contrary to widespread
belief, calibration is not enough: even a classifier with the best possible
accuracy and perfect calibration can have confidence scores far from the true
posterior probabilities. This is due to the grouping loss, created by samples
with the same confidence scores but different true posterior probabilities.
Proper scoring rule theory shows that, given the calibration loss, the missing
piece to characterize individual errors is the grouping loss. While there are
many estimators of the calibration loss, none exists for the grouping loss in
standard settings. Here, we propose an estimator to approximate the grouping
loss. We use it to study modern neural network architectures in vision and NLP.
We find that the grouping loss varies markedly across architectures, and that
it is a key model-comparison factor across the most accurate, calibrated
models. We also show that distribution shifts lead to high grouping loss.
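
As a minimal illustration of the claim that calibration is not enough, consider a toy binary problem. This is a sketch under stated assumptions, not the estimator proposed here: the two subgroups, their posteriors, and the constant score 0.8 are hypothetical, and the grouping loss is written for a squared (Brier-type) loss, up to a convention-dependent constant factor.

```python
import numpy as np

# Hypothetical population: two equally likely subgroups whose true
# posteriors P(Y=1 | X) differ (values chosen only for illustration).
true_posterior = np.array([0.6, 1.0])
weights = np.array([0.5, 0.5])

# The classifier assigns the same confidence score to every sample.
score = np.full(2, 0.8)

# Calibrated score: average true posterior among samples sharing a score.
calibrated_score = np.sum(weights * true_posterior)

# Calibration loss (squared-loss form): gap between score and calibrated score.
calibration_loss = np.sum(weights * (score - calibrated_score) ** 2)

# Grouping loss (squared-loss form, an assumption of this sketch):
# spread of the true posteriors around the calibrated score.
grouping_loss = np.sum(weights * (true_posterior - calibrated_score) ** 2)

# Expected accuracy of thresholding the score at 0.5, compared with the
# best achievable accuracy given the true posteriors.
predict_positive = score > 0.5
accuracy = np.sum(
    weights * np.where(predict_positive, true_posterior, 1 - true_posterior)
)
best_accuracy = np.sum(weights * np.maximum(true_posterior, 1 - true_posterior))

print(calibration_loss, grouping_loss, accuracy, best_accuracy)
```

In this toy setting the classifier is perfectly calibrated and reaches the best possible accuracy, yet each individual score is off by 0.2 from the true posterior, which is exactly what the grouping loss captures.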