An Empirical Investigation of the Role of Pre-training in Lifelong Learning. (arXiv:2112.09153v1 [cs.LG])

The lifelong learning paradigm in machine learning is an attractive
alternative to the more prominent isolated learning scheme not only due to its
resemblance to biological learning, but also its potential to reduce energy
waste by obviating excessive model re-training. A key challenge to this
paradigm is the phenomenon of catastrophic forgetting. With the increasing
popularity and success of pre-trained models in machine learning, we pose the
question: What role does pre-training play in lifelong learning, specifically
with respect to catastrophic forgetting? We investigate existing methods in the
context of large, pre-trained models and evaluate their performance on a
variety of text and image classification tasks, including a large-scale study
using a novel dataset of 15 diverse NLP tasks. Across all settings, we observe
that generic pre-training implicitly alleviates the effects of catastrophic
forgetting when learning multiple tasks sequentially compared to randomly
initialized models. We then further investigate why pre-training alleviates
forgetting in this setting. We study this phenomenon by analyzing the loss
landscape, finding that pre-trained weights appear to ease forgetting by
leading to wider minima. Based on this insight, we propose jointly optimizing
for current task loss and loss basin sharpness in order to explicitly encourage
wider basins during sequential fine-tuning. We show that this optimization
approach leads to performance comparable to the state-of-the-art in
task-sequential continual learning across multiple settings, without retaining
a memory that scales in size with the number of tasks.



Related post