A Pitch-Controlled End-to-End Voice Conversion System for Brazilian Portuguese

Authors

DOI:

https://doi.org/10.14209/jcis.2024.13

Keywords:

Voice Conversion, Generative Adversarial Networks

Abstract

Speech conversion is a technique that modifies the identity of the voice in a speech signal without changing the spoken content. Accurate pitch conversion is a requirement the best speech conversion systems must address, as this characteristic is essential to the correct identification of the target speaker. This work proposes a pitch-controlled end-to-end voice conversion model that combines state-of-the-art ideas from both speaking and singing voice conversion with a novel cost function to ensure artifact-free pitch tracking. The model is trained on Brazilian Portuguese speech, overcoming the lack of high-quality data by improving a large but flawed dataset with a filtering operation. Our model outperforms other popular open-source models on most listening tests and objective measurements. In particular, on a 5-point MOS scale, we obtained the highest speaker similarity score (4.05) and a naturalness score of 3.48, second only to a system whose similarity score was 2.62.

Author Biographies

Victor Pereira da Costa, Universidade Federal do Rio de Janeiro

Victor P. da Costa received a B.Sc. in Electronic and Computing Engineering (2015) and an M.Sc. in Electrical Engineering (2017), both from Universidade Federal do Rio de Janeiro (UFRJ), Brazil. He is currently pursuing the D.Sc. degree in Electrical Engineering at COPPE/UFRJ. His interests are digital signal processing, particularly audio processing, and machine learning.

Sergio Lima Netto, Universidade Federal do Rio de Janeiro

Dr. Sergio L. Netto holds BSc (Federal University of Rio de Janeiro, 1991), MSc (COPPE/Federal University of Rio de Janeiro, 1992), and PhD (University of Victoria, Canada, 1996) degrees, all in Electrical Engineering. He is currently a full professor at the Federal University of Rio de Janeiro. He is a coauthor of Digital Signal Processing: System Analysis and Design (Cambridge University Press, 2nd ed., 2010) and Variational Methods for Machine Learning with Applications to Deep Networks (Springer, 2021). His main teaching and research interests include speech processing, information theory, and applied artificial intelligence.

Luiz Wagner Pereira Biscainho, Universidade Federal do Rio de Janeiro

Luiz W. P. Biscainho was born in Rio de Janeiro, Brazil, in 1962. He received the Electronic Engineering degree (magna cum laude) from the EE (now Poli) at Universidade Federal do Rio de Janeiro (UFRJ), Brazil, in 1985, and the M.Sc. and D.Sc. degrees in Electrical Engineering from the COPPE at UFRJ in 1990 and 2000, respectively. Having worked in the telecommunications industry between 1985 and 1993, Dr. Biscainho is now an Associate Professor at the Department of Electronic and Computer Engineering (DEL) of Poli and the Electrical Engineering Program (PEE) of COPPE, at UFRJ. His research area is digital audio processing. He is currently a member of the IEEE (Institute of Electrical and Electronics Engineers), the AES (Audio Engineering Society), the SBrT (Brazilian Telecommunications Society), and the SBC (Brazilian Computer Society).

Ranniery Maia, Universidade Federal do Rio Grande do Norte

Ranniery Maia received the B.Sc. degree in Electrical Engineering from Federal University of Rio Grande do Norte (1998), the M.Sc. degree in Electrical Engineering from Federal University of Rio de Janeiro (2000), and the D.Eng. degree in Engineering and Computer Science from Nagoya Institute of Technology (2006). From 2006 to 2009 he was a Research Scientist at NICT/ATR Spoken Language Communication Labs, Kyoto, Japan. From 2009 to 2016 he was a Research Engineer at Toshiba Research Europe Limited, Cambridge, UK. From 2018 to 2022 he was a Consultant on Machine Learning and Text-to-Speech at DeepZen Limited, London, United Kingdom, and a Visiting Researcher at Federal University of Santa Catarina, Florianopolis, Brazil. He is currently an Assistant Professor at Federal University of Rio Grande do Norte, Natal, Brazil. His interests are artificial intelligence, machine learning, deep learning, speech synthesis, speech recognition, and voice conversion.

Published

2024-08-13

How to Cite

Pereira da Costa, V., Lima Netto, S., Pereira Biscainho, L. W., & Maia, R. (2024). A Pitch-Controlled End-to-End Voice Conversion System for Brazilian Portuguese. Journal of Communication and Information Systems, 39(1), 127–136. https://doi.org/10.14209/jcis.2024.13
Received 2024-05-02
Accepted 2024-08-07
Published 2024-08-13