Whispered and Lombard Neural Speech Synthesis. (arXiv:2101.05313v1 [eess.AS])

It is desirable for a text-to-speech system to take into account the
environment where synthetic speech is presented, and provide appropriate
context-dependent output to the user. In this paper, we present and compare
various approaches for generating different speaking styles, namely, normal,
Lombard, and whisper speech, using only limited data. We propose and assess the
following systems: 1) pre-training and fine-tuning a model for each style;
2) Lombard and whisper speech conversion through a signal-processing-based
approach; 3) multi-style generation using a single model based on a speaker
verification model. Our mean opinion score and AB preference listening tests
show that 1) the pre-training/fine-tuning approach generates high-quality
speech for all speaking styles, and 2) although our speaker verification (SV)
model is not explicitly trained to discriminate between speaking styles, and no
Lombard or whisper speech is used to pre-train it, the SV model can serve as a
style encoder that generates style embeddings as input to the Tacotron system.
We also show that the resulting synthetic Lombard speech yields a significant
gain in intelligibility.
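
As a rough illustration of the third system, the sketch below shows one common
way an utterance-level embedding can condition a sequence-to-sequence
synthesizer: embeddings for a style are averaged into a single unit-length
vector, which is then appended to every text-encoder frame. This is only a
minimal sketch under stated assumptions, not the authors' implementation. The
function names, the embedding sizes, and the random vectors standing in for SV
model and Tacotron encoder outputs are all hypothetical; the abstract does not
specify how the style embedding is injected.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def style_embedding(utterance_embeddings):
    """Average unit-length utterance embeddings into one unit-length style vector."""
    normed = [l2_normalize(e) for e in utterance_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def condition_encoder_outputs(encoder_outputs, style_vec):
    """Append the same style vector to every encoder frame (assumed conditioning scheme)."""
    return [list(frame) + list(style_vec) for frame in encoder_outputs]


def fake_vectors(rng, count, dim):
    """Random stand-ins for model outputs; a real system would run the models here."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
EMBED_DIM = 16    # assumed SV embedding size, kept small for the example
ENCODER_DIM = 32  # assumed text-encoder output size

# Five hypothetical reference utterances per speaking style.
styles = {
    name: style_embedding(fake_vectors(rng, 5, EMBED_DIM))
    for name in ("normal", "lombard", "whisper")
}

# Ten hypothetical text-encoder frames, conditioned on the whisper style.
encoder_outputs = fake_vectors(rng, 10, ENCODER_DIM)
conditioned = condition_encoder_outputs(encoder_outputs, styles["whisper"])
print(len(conditioned), len(conditioned[0]))
```

Selecting a different key from `styles` at synthesis time would switch the
requested speaking style without changing the rest of the model, which is the
appeal of a single multi-style system over one fine-tuned model per style.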

Source: https://arxiv.org/abs/2101.05313


