Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment. (arXiv:2205.03432v1 [cs.SD])

Automatic pronunciation assessment is an important technology for helping
self-directed language learners. While pronunciation quality has multiple
aspects, including accuracy, fluency, completeness, and prosody, previous
efforts typically model only one aspect (e.g., accuracy) at one granularity
(e.g., at the phoneme level). In this work, we explore modeling multi-aspect
pronunciation assessment at multiple granularities. Specifically, we train a
Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task
learning. Experiments show that GOPT achieves the best results on
speechocean762 with a public automatic speech recognition (ASR) acoustic model
trained on Librispeech.
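The abstract does not spell out the GOP features or the training objective. As a rough illustration only, the sketch below makes two assumptions. First, it computes classic GOP-style features for one force-aligned phone segment from frame-level phone posteriors of an ASR acoustic model: the mean log posterior of the canonical phone (LPP) and its log posterior ratio (LPR) against every phone. Second, it forms a multi-task objective as a weighted sum of per-task mean squared errors, one task per aspect and granularity. The function names, task keys, and equal weights are hypothetical and not taken from GOPT; the actual feature layout and loss in the paper may differ.

```python
import math
from typing import Dict, List, Sequence

EPS = 1e-10  # floor to avoid log(0) on zero posteriors (assumption)


def gop_features(frame_posteriors: Sequence[Sequence[float]], canonical: int) -> List[float]:
    """Hypothetical GOP-style features for one aligned phone segment.

    frame_posteriors: T x N phone posteriors for the frames aligned to this phone.
    canonical: index of the phone the learner was expected to say.
    Returns [LPP, LPR_0, ..., LPR_{N-1}], where LPP is the mean log posterior of
    the canonical phone and LPR_q = LPP - mean log posterior of phone q.
    """
    num_frames = len(frame_posteriors)
    num_phones = len(frame_posteriors[0])
    # Mean log posterior of every phone over the segment's frames.
    mean_log = [
        sum(math.log(max(row[q], EPS)) for row in frame_posteriors) / num_frames
        for q in range(num_phones)
    ]
    lpp = mean_log[canonical]
    return [lpp] + [lpp - mean_log[q] for q in range(num_phones)]


def multitask_loss(
    preds: Dict[str, List[float]],
    targets: Dict[str, List[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task mean squared errors (one task per aspect/granularity)."""
    total = 0.0
    for task, weight in weights.items():
        p, t = preds[task], targets[task]
        mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        total += weight * mse
    return total


if __name__ == "__main__":
    # Two frames, three phones; phone 0 is the canonical phone.
    feats = gop_features([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]], canonical=0)
    print(feats)
    # Hypothetical tasks: phone-level accuracy and utterance-level score.
    loss = multitask_loss(
        {"phone_accuracy": [1.0, 2.0], "utt_total": [5.0]},
        {"phone_accuracy": [1.0, 1.0], "utt_total": [7.0]},
        {"phone_accuracy": 1.0, "utt_total": 1.0},
    )
    print(loss)
```

In this sketch a positive LPR against phone q means the canonical phone was, on average, more probable than q over the segment; a per-phone vector of such features is the kind of input a Transformer could consume, while the multi-task loss lets one model be trained on several score targets at once.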



