A Survey on Protein Representation Learning: Retrospect and Prospect. (arXiv:2301.00813v1 [cs.LG])
Proteins are fundamental biological entities that play a key role in the
processes of life. The amino acid sequences of proteins can fold into stable 3D
structures in the real physicochemical world, forming a special kind of
sequence-structure data. With the development of Artificial Intelligence (AI)
techniques, Protein Representation Learning (PRL) has recently emerged as a
promising research topic for extracting informative knowledge from massive
protein sequences or structures. To pave the way for AI researchers with little
bioinformatics background, we present a timely and comprehensive review of PRL
formulations and existing PRL methods from the perspective of model
architectures, pretext tasks, and downstream applications. We first briefly
introduce the motivations for protein representation learning and formulate it
in a general and unified framework. Next, we divide existing PRL methods into
three main categories: sequence-based, structure-based, and sequence-structure
co-modeling. Finally, we discuss some technical challenges and potential
directions for improving protein representation learning. The latest advances
in PRL methods are summarized in a GitHub repository:
https://github.com/LirongWu/awesome-protein-representation-learning.
Source: https://arxiv.org/abs/2301.00813