Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph. (arXiv:2202.12307v1 [cs.LG])
This paper addresses the unsupervised learning of content-style decomposed
representations. We first give a definition of style and then model the
content-style representation as a token-level bipartite graph. An unsupervised
framework, named Retriever, is proposed to learn such representations. First, a
cross-attention module is employed to retrieve permutation-invariant (P.I.)
information, defined as style, from the input data. Second, a vector
quantization (VQ) module, together with human-designed constraints, produces
interpretable content tokens. Finally, a novel link attention module serves as
the decoder, reconstructing the data from the decomposed content and style with
the help of linking keys. Being modality-agnostic, the proposed
Retriever is evaluated in both speech and image domains. The state-of-the-art
zero-shot voice conversion performance confirms the disentangling ability of
our framework. Top performance is also achieved in the part discovery task for
images, verifying the interpretability of our representation. In addition, the
vivid part-based style transfer results demonstrate the potential of Retriever
to support a variety of generative tasks. Project page at
https://ydcustc.github.io/retriever-demo/.
Source: https://arxiv.org/abs/2202.12307
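For readers who want a concrete picture of the pipeline the abstract describes, the sketch below mimics its three stages in PyTorch: learnable queries cross-attend to the input to pool permutation-invariant style tokens, a codebook vector-quantizes frames into discrete content tokens, and an attention step links content back to style for reconstruction. All names, shapes, module sizes, and the use of torch.nn.MultiheadAttention (standing in for the paper's link attention and constraint machinery) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class ToyRetriever(nn.Module):
    def __init__(self, dim=64, n_style=4, codebook_size=32, n_heads=4):
        super().__init__()
        # Learnable queries that cross-attend to the input and pool
        # permutation-invariant (style) information.
        self.style_queries = nn.Parameter(torch.randn(n_style, dim))
        self.style_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Codebook for vector-quantizing frames into discrete content tokens.
        self.codebook = nn.Embedding(codebook_size, dim)
        # "Linking keys" pairing content queries with style values in the
        # decoder's link-attention step (hypothetical simplification).
        self.link_keys = nn.Parameter(torch.randn(n_style, dim))
        self.link_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        b, _, d = x.shape
        # 1) Style: cross-attention with learnable queries; attention-pooling
        #    over positions is invariant to permutations of the input order.
        q = self.style_queries.unsqueeze(0).expand(b, -1, -1)
        style, _ = self.style_attn(q, x, x)                   # (b, n_style, dim)
        # 2) Content: nearest-codeword lookup with a straight-through gradient.
        flat = x.reshape(-1, d)                               # (b*seq_len, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1).view(b, -1)
        quantized = self.codebook(idx)                        # (b, seq_len, dim)
        content = x + (quantized - x).detach()                # straight-through
        # 3) Decode: content tokens query style values via the linking keys.
        keys = self.link_keys.unsqueeze(0).expand(b, -1, -1)
        linked, _ = self.link_attn(content, keys, style)      # (b, seq_len, dim)
        return self.out_proj(content + linked), idx, style


# Usage: reconstruct a random batch and inspect the discrete content tokens.
model = ToyRetriever()
x = torch.randn(2, 10, 64)
recon, content_tokens, style_tokens = model(x)
print(recon.shape, content_tokens.shape, style_tokens.shape)

The straight-through codebook lookup above is the standard VQ-VAE trick; the paper's actual constraints, losses, and training schedule are not reproduced here.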