The traditional notion of “Junk DNA” has long been linked to the non-coding
segments of the human genome, which make up roughly 98% of it.
However, recent research has unveiled the critical roles some of these
seemingly non-functional DNA sequences play in cellular processes.
Intriguingly, the weights of deep neural networks show a redundancy remarkably
similar to that observed in human genes. It has long been believed that the
weights of very large models are excessively redundant and can be removed
without compromising performance. This paper challenges that conventional
wisdom with a counter-argument. We employ sparsity as a
tool to isolate and quantify the nuanced significance of low-magnitude weights
in pre-trained large language models (LLMs). Our study demonstrates a strong
correlation between these weight magnitudes and the knowledge they encapsulate,
from a downstream task-centric angle. We propose the “Junk DNA Hypothesis”, backed
by our in-depth investigation: while small-magnitude weights may appear
“useless” for simple tasks and suitable for pruning, they actually encode
crucial knowledge necessary for solving more difficult downstream tasks.
Removing these seemingly insignificant weights can lead to irreversible
knowledge forgetting and performance degradation on difficult tasks. These findings
offer fresh insights into how LLMs encode knowledge in a task-sensitive manner,
suggest future research directions in model pruning, and open avenues for
task-aware conditional computation during inference.
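
To make the notion of removing low-magnitude weights concrete, the following is a minimal sketch of one-shot unstructured magnitude pruning. It assumes the weights are given as a flat list of floats and that a target sparsity fraction is chosen by hand; the function name `magnitude_prune` and the toy values are hypothetical and serve only as illustration, not as the procedure used in the study.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    `sparsity` is the fraction of weights to remove, in [0, 1].
    This is an illustrative sketch, not the study's actual pruning pipeline.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(sparsity * len(weights))  # number of weights to zero out
    # Rank positions by absolute value, smallest magnitude first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Toy example: prune half of six weights.
weights = [0.9, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.0, -1.2, 0.0, 0.4]
```

Under the hypothesis stated above, one would evaluate the pruned model at several sparsity levels on tasks of varying difficulty and compare how quickly performance degrades on each.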