Multi-Scale Spectrogram Modelling for Neural Text-to-Speech. (arXiv:2106.15649v1 [eess.AS])

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to
synthesise speech with improved coarse and fine-grained prosody. We present
a generic multi-scale spectrogram prediction mechanism in which the system
first predicts coarser-scale mel-spectrograms that capture the suprasegmental
information in speech, and then uses these coarser-scale mel-spectrograms to
predict finer-scale mel-spectrograms that capture fine-grained prosody.
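
As a rough illustration of what a coarser-scale spectrogram might look like, the sketch below pools frame-level mel-spectrogram frames into one averaged frame per linguistic unit. This is an assumption made for illustration only: the abstract does not state how the coarser scales are computed, and the helper `pool_frames`, the span boundaries, and the toy values are all hypothetical. It covers only the construction of coarse targets, not the prediction model itself.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one mel-spectrogram frame: one energy value per mel bin


def pool_frames(frames: Sequence[Frame], spans: Sequence[Tuple[int, int]]) -> List[Frame]:
    """Average the frames inside each [start, end) span into one coarse frame.

    Averaging is an illustrative assumption, not necessarily the paper's method.
    """
    pooled = []
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"bad span: {(start, end)}")
        segment = frames[start:end]
        n_bins = len(segment[0])
        pooled.append([sum(f[b] for f in segment) / len(segment) for b in range(n_bins)])
    return pooled


# Toy data: 6 frames with 2 mel bins each (values are made up).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
phoneme_spans = [(0, 2), (2, 3), (3, 6)]  # hypothetical phoneme boundaries
word_spans = [(0, 3), (3, 6)]             # hypothetical word boundaries

phoneme_level = pool_frames(frames, phoneme_spans)  # 3 coarse frames
word_level = pool_frames(frames, word_spans)        # 2 coarser frames
```

Under this reading, a coarse-to-fine system would predict `word_level` first, then condition on it when predicting `phoneme_level`, and finally the full frame-level sequence.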

We detail two specific versions of MSS, Word-level MSS and Sentence-level
MSS, in which the scales are motivated by linguistic units. Word-level MSS
models word-, phoneme-, and frame-level spectrograms, while Sentence-level MSS
models a sentence-level spectrogram in addition.

Subjective evaluations show that Word-level MSS performs statistically
significantly better than the baseline on two voices.


