A Pitch-Controlled End-to-End Voice Conversion System for Brazilian Portuguese

Victor Pereira da Costa; Sergio Lima Netto; Luiz Wagner Pereira Biscainho; Ranniery Maia

doi:10.14209/jcis.2024.13

Authors

Victor Pereira da Costa Universidade Federal do Rio de Janeiro https://orcid.org/0009-0006-7024-2840
Sergio Lima Netto Universidade Federal do Rio de Janeiro https://orcid.org/0000-0001-7389-1463
Luiz Wagner Pereira Biscainho Universidade Federal do Rio de Janeiro https://orcid.org/0000-0003-2959-6963
Ranniery Maia Universidade Federal do Rio Grande do Norte https://orcid.org/0000-0002-5512-3038

Keywords:

Voice Conversion, Generative Adversarial Networks

Abstract

Speech conversion is a technique that modifies the identity of the voice in a speech signal without changing the spoken content. Accurate pitch conversion is a requirement that the best speech conversion systems must address, as this characteristic is essential to the correct identification of the target speaker. This work proposes a pitch-controlled end-to-end voice conversion model that combines state-of-the-art ideas from both speaking and singing voice conversion with a novel cost function to ensure artifact-free pitch tracking. The model is trained on Brazilian Portuguese speech, overcoming the lack of high-quality data by improving a large but flawed dataset with a filtering operation. Our model mostly outperforms other popular open-source models in both listening tests and objective measurements. In particular, on a 5-point mean opinion score (MOS) scale, it obtained the highest speaker similarity score (4.05) and a naturalness score of 3.48, second only to a system whose similarity score was 2.62.
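
For readers unfamiliar with the metric, a mean opinion score is simply the average of listener ratings on a fixed scale. The sketch below is a minimal illustration of that arithmetic, not the authors' evaluation code: the function name `mean_opinion_score`, the normal-approximation 95% confidence interval, and the example ratings are all assumptions introduced here, and the ratings are made-up values rather than data from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for ratings on a 1-5 scale.

    Hypothetical helper: the confidence interval uses a normal
    approximation (1.96 * standard error), which is an assumption
    of this sketch and not a procedure described in the paper.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # With a single rating there is no spread to estimate.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Made-up ratings from five hypothetical listeners for one utterance.
example_ratings = [4, 5, 4, 3, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean helps show whether a gap between two systems, such as the naturalness and similarity scores quoted above, is larger than the listener-to-listener variation.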

Author Biographies

Victor Pereira da Costa, Universidade Federal do Rio de Janeiro

Victor P. da Costa received a B.Sc. in Electronic and Computing Engineering (2015) and an M.Sc. in Electrical Engineering (2017), both from Universidade Federal do Rio de Janeiro (UFRJ), Brazil. He is currently pursuing the D.Sc. degree in electrical engineering at COPPE/UFRJ. His interests are digital signal processing, particularly audio processing, and machine learning.

Sergio Lima Netto, Universidade Federal do Rio de Janeiro

Dr. Sergio L. Netto holds BSc (Federal University of Rio de Janeiro, 1991), MSc (COPPE/Federal University of Rio de Janeiro, 1992), and PhD (University of Victoria, Canada, 1996) degrees, all in Electrical Engineering. He is currently a full professor at the Federal University of Rio de Janeiro. He is a coauthor of Digital Signal Processing: System Analysis and Design (Cambridge University Press, 2nd ed., 2010) and Variational Methods for Machine Learning with Applications to Deep Networks (Springer, 2021). His main teaching and research interests include speech processing, information theory, and applied artificial intelligence.

Luiz Wagner Pereira Biscainho, Universidade Federal do Rio de Janeiro

Luiz W. P. Biscainho was born in Rio de Janeiro, Brazil, in 1962. He received the Electronic Engineering degree (magna cum laude) from the EE (now Poli) at Universidade Federal do Rio de Janeiro (UFRJ), Brazil, in 1985, and the M.Sc. and D.Sc. degrees in Electrical Engineering from the COPPE at UFRJ in 1990 and 2000, respectively. Having worked in the telecommunications industry between 1985 and 1993, Dr. Biscainho is now Associate Professor at the Department of Electronic and Computer Engineering (DEL) of Poli and the Electrical Engineering Program (PEE) of COPPE, at UFRJ. His research area is digital audio processing. He is currently a member of the IEEE (Institute of Electrical and Electronics Engineers), the AES (Audio Engineering Society), the SBrT (Brazilian Telecommunications Society), and the SBC (Brazilian Computer Society).

Ranniery Maia, Universidade Federal do Rio Grande do Norte

Ranniery Maia received the B.Sc. degree in Electrical Engineering from Federal University of Rio Grande do Norte (1998), M.Sc. degree in Electrical Engineering from Federal University of Rio de Janeiro (2000) and D.Eng. degree in Engineering and Computer Science from Nagoya Institute of Technology (2006). From 2006 to 2009 he was a Research Scientist at NICT/ATR Spoken Language Communication Labs, Kyoto, Japan. From 2009 to 2016 he was a Research Engineer at Toshiba Research Europe Limited, Cambridge, UK. From 2018 to 2022 he was a Consultant on Machine Learning and Text-to-Speech at DeepZen Limited, London, United Kingdom, and a Visiting Researcher at Federal University of Santa Catarina, Florianopolis, Brazil. Currently he is an Assistant Professor at Federal University of Rio Grande do Norte, Natal, Brazil. His interests are artificial intelligence, machine learning, deep learning, speech synthesis, speech recognition and voice conversion.

Published

2024-08-13

How to Cite

Pereira da Costa, V., Lima Netto, S., Pereira Biscainho, L. W., & Maia, R. (2024). A Pitch-Controlled End-to-End Voice Conversion System for Brazilian Portuguese. Journal of Communication and Information Systems, 39(1), 127–136. https://doi.org/10.14209/jcis.2024.13

Issue

Vol. 39 No. 1 (2024)

Section

Regular Papers

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors who publish in this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors can enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) before and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).

___________

Received 2024-05-02
Accepted 2024-08-07
Published 2024-08-13
