NAVI's Text-To-Speech System for VLSP 2021

Hanoi University of Science and Technology
Abstract

The Association for Vietnamese Language and Speech Processing (VLSP) has organized a series of workshops that aim to bring together researchers and professionals working in NLP and to attempt a synthesis of research on the Vietnamese language. One of the shared tasks held at the eighth workshop is text-to-speech (TTS) [1] using a dataset that consists solely of spontaneous audio. This poses a challenge for current TTS models, which perform well only when synthesizing reading-style speech (e.g., audiobooks). Moreover, the quality of the audio in the dataset has a large impact on model performance: samples with a noisy background, or with multiple voices speaking at the same time, degrade the output of our model. In this paper, we describe our approach to this problem: we first preprocess the training data, then use it to train a FastSpeech2 [3] acoustic model in which the external aligner model is replaced, and finally use a HiFi-GAN [2] vocoder to generate the waveform. In the official evaluation of the VLSP 2021 TTS task, our approach achieves an in-domain MOS of 3.729, an out-of-domain MOS of 3.557, and a SUS score of 79.70%.
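The two-stage pipeline described above (text to mel-spectrogram via the acoustic model, then mel-spectrogram to waveform via the vocoder) can be sketched as follows. This is a minimal illustration of the data flow only, not our implementation: `DummyAcousticModel` and `DummyVocoder` are stand-ins for the trained FastSpeech2 and HiFi-GAN networks, the character-to-id front end is hypothetical, and the constants (`N_MELS`, `HOP_LENGTH`, a fixed 5-frame duration per symbol) are common defaults assumed for the sketch.

```python
import numpy as np

N_MELS = 80        # mel-spectrogram channels, a common FastSpeech2 setting
HOP_LENGTH = 256   # waveform samples generated per mel frame (typical HiFi-GAN config)

class DummyAcousticModel:
    """Stand-in for FastSpeech2: maps a symbol sequence to a mel-spectrogram.

    A real model predicts per-symbol durations and expands the sequence
    accordingly; here every symbol is given a fixed duration of 5 frames.
    """
    def __call__(self, symbol_ids):
        frames_per_symbol = 5
        n_frames = len(symbol_ids) * frames_per_symbol
        # Random values in place of predicted log-mel energies.
        return np.random.randn(n_frames, N_MELS)

class DummyVocoder:
    """Stand-in for HiFi-GAN: upsamples a mel-spectrogram to a waveform."""
    def __call__(self, mel):
        n_samples = mel.shape[0] * HOP_LENGTH
        return np.random.uniform(-1.0, 1.0, size=n_samples)

def synthesize(text, acoustic_model, vocoder):
    # Hypothetical front end: map each character to an integer id.
    # A real system would use a Vietnamese grapheme-to-phoneme converter.
    symbol_ids = [ord(c) for c in text]
    mel = acoustic_model(symbol_ids)   # stage 1: text -> mel-spectrogram
    waveform = vocoder(mel)            # stage 2: mel-spectrogram -> waveform
    return mel, waveform

if __name__ == "__main__":
    mel, wav = synthesize("xin chào", DummyAcousticModel(), DummyVocoder())
    print(mel.shape, wav.shape)
```

The point of the two-stage split is that the acoustic model and vocoder can be trained and swapped independently, which is what allows pairing FastSpeech2 with HiFi-GAN here.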

Audio Samples

(*): The raw audio as provided by the VLSP 2021 organizers.

(**): The same audio after our preprocessing pipeline.


Cái này bạn chỉ còn trên bốn tuổi là có thể sử dụng được (English: This one can be used by anyone just over four years old)

GT(*)

GT(**)

NAVI's System

Cái thời đi học của mình nó khá là bi đát mà mình khá là (English: My school days were pretty miserable, and I was pretty...)

GT(*)

GT(**)

NAVI's System

Dở thế nhưng mà ngày mai mình sẽ làm tốt hơn ngày hôm nay, điều tiếp theo mình học được là khi giận dữ thì nên im lặng gần như (English: That bad, but tomorrow I will do better than today; the next thing I learned is that when angry, one should stay almost completely silent...)

GT(*)

GT(**)

NAVI's System

Nếu như bây giờ, nếu mà mình phải bắc ghế lên thì mình phải ra mình lấy cái ghế nhựa đúng không, không (English: If right now, if I had to set up a chair, I would have to go get the plastic chair, right? No...)

GT(*)

GT(**)

NAVI's System

References

[1] VLSP 2021 shared task: Text-To-Speech. Association for Vietnamese Language and Speech Processing, 2021.

[2] J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In NeurIPS, 2020.

[3] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In ICLR, 2021.