An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora

In this work, we present a baseline end-to-end system based on deep learning for automatic speech recognition in Brazilian Portuguese. To build such a model, we employ a speech corpus containing 158 hours of annotated speech, assembled from four individual datasets, three of them publicly available, and a text corpus containing 10.2 million sentences. We train an acoustic model based on the DeepSpeech 2 network, with two convolutional and five bidirectional recurrent layers. By adding a newly trained 15-gram language model at the character level, we achieve a character error rate of only 10.49% and a word error rate of 25.45%, which are on a par with other works in different languages using a similar amount of training data.


I. INTRODUCTION
Automatic speech recognition (ASR) technology has been around for over sixty years, and it is embedded in many products, from automated calls to personal assistants [1], [2]. However, ASR systems are far from perfect: their performance degrades quickly with background noise or far-field speech, and they tend to have a higher error rate if insufficient annotated training data is available.
Current state-of-the-art ASR systems rely on deep learning techniques [3]. Such algorithms have made it possible to train speech recognizers under different levels of background noise and microphone distances, moving away from previous approaches based on Gaussian mixture models and hidden Markov model systems [4].
Unfortunately, deep-learning-based ASR systems are strongly data driven, requiring considerable amounts of data to produce good models. Building a large corpus is time-consuming and not a trivial task. Open-source efforts, such as Common Voice [5], led by Mozilla, were able to gather more than 1,000 hours of English speech, still far from its original 10,000-hour goal. For non-English languages, such as Dutch, less than 31 hours of annotated speech are available (from a 1,200-hour goal). These numbers are orders of magnitude below the ones that private companies, such as Baidu and Google, have been reporting with their results [6], [7].
The lack of annotated speech and public corpora makes it difficult to evaluate ASR systems for several languages, whose accuracies are, therefore, much worse than the ones reported for English and Mandarin [8]. This is especially true for languages with only a few dozen hours of public data available, such as Brazilian Portuguese (PT-BR). This work aims to establish a new baseline system for ASR using deep neural networks for PT-BR. By doing so, this work addresses the lack of a large annotated speech corpus by studying and showing how English-trained ASR systems can benefit under-represented languages through transfer-learning techniques. As a result of this work, we make the following contributions:
• A new PT-BR text corpus, assembled from three text corpora totalling 10.2 million sentences, among which the WikiText-PT-BR, built for this work by scraping text from Wikipedia;
• An assorted PT-BR speech corpus containing a total of 158 hours of speech, gathered from four smaller datasets (three of which are free to distribute);
• An open-source pre-trained model with weights fine-tuned from the DeepSpeech 2-based architecture, so that anyone can evaluate or further improve the baseline end-to-end PT-BR speech recognizer;
• New open-source character- and word-level language models based on deep-learning techniques for PT-BR.
All development code employed in this work will be open-sourced under the MIT License and available at http://github.com/igormq/speech2text, as well as the speech and text corpora.
The remaining sections of this paper are organized as follows. Section II reviews the related works on the field of deep-learning ASR. Section III details the acoustic model and the loss function, whereas Section IV details the decoding strategy and Section V, the language model. Section VI specifies all speech and text datasets employed in this work. Sections VII, VIII, and IX describe the experimental results for the developed language model, acoustic model, and final ASR system, respectively. Finally, Section X concludes the paper emphasizing its main contributions.

II. DEVELOPMENT CONTEXT AND GENERAL DIAGRAM OF THE SYSTEM
Most languages in the world lack the amount of text, speech, and/or linguistic resources required to build large models based on deep neural networks. There has been an increasing research interest in how to build a high-accuracy ASR system for languages with insufficient annotated data (using from a few hours to a few dozen hours of annotated speech), such as Brazilian Portuguese [8]; Indian languages (Gujarati, Tamil, and Telugu) [9], [10]; and Seneca (an Indigenous North-American language) [11].
To overcome data scarcity, Swietojanski et al. [12] performed unsupervised pretraining on different languages, while Dalmia et al. [13] shared the same weights from the recurrent layers across different languages and trained additional layers to develop a multilingual ASR system, which improved the final word error rate by over 6% when compared to monolingual systems. Renduchintala et al. [14] proposed a multimodal data augmentation scheme for attention-based models, which only requires text data. Zhou et al. [15] showed that a single transformer network performs well on reduced training data in a multilingual setting. While [14] relies on training an extra encoder for its augmentation scheme, and [15] uses a transformer topology with more than 200 million parameters, our DeepSpeech 2-based model has fewer parameters (42 million) and does not require extra parameters for pretraining.
The present paper extends upon [16], [17] by: showing the gains from adding a newly trained external language model based on over 10.2 million sentences scraped from the Portuguese Wikipedia; considering a transfer-learning approach to train an acoustic model based on a large English dataset; and showing the impact of adding 148 more hours of PT-BR speech in the acoustic-model training stage.
The overall system described in this paper is represented in Figure 1, and detailed in the next three sections.
Fig. 1. The overall end-to-end automatic speech recognition system: the acoustic model is a DeepSpeech-2-based model trained using the connectionist temporal classification (CTC) loss function; the language model is an n-gram trained using the Kneser-Ney algorithm; and the decoding is the beam-search scheme adapted for the CTC algorithm.

III. ACOUSTIC MODEL AND LOSS FUNCTION
This section presents the acoustic model with the corresponding loss function employed in its development, similar to the work by Amodei et al. [6].
A. Acoustic model

Fig. 2 illustrates the deep-learning neural-network architecture considered in this work, which is based on the DeepSpeech 2 [6] model, introduced by Baidu, with two convolutional (Conv) layers and five bidirectional recurrent layers of the gated-recurrent-unit (GRU) type.
Here, as in [6], we use a normalized power spectrogram calculated over the audio signal, with $F = 161$ frequency bins, as the network input. The first main layers are spatial convolutions, usually found in image-related tasks to increase the model capacity without exponentially increasing the number of parameters. In [6], the authors argue that a convolution in the frequency domain models the speakers' variability better than fully connected layers. Moreover, tuning the convolution-layer parameters, such as strides and kernel sizes, helps to remove redundant information found in the input spectrogram, as well as to reduce the number of outputs to be fed into the subsequent and more expensive layers. Table I shows the parameters used in this work, which yield the best performance according to [6]. Each convolution layer is followed by a batch normalization layer [18], which improves the training speed by allowing higher learning rates, and a clipped ReLU nonlinearity of the form $\min(\max(x, 0), 20)$. The output from the convolutional layers is fed to a stack of five bidirectional recurrent layers. Differently from [6], this work uses gated recurrent units (GRUs) instead of Elman recurrent layers. The GRU is a simplified version of the long short-term memory [19] in which the forget and input gates are fused, thus reducing the overall number of parameters. In the bidirectional setting, there are two unidirectional GRUs in each layer, one proceeding forward and the other backward in time. Then, the two outputs are summed into a single output to be fed into the next layer.
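As a concrete illustration, the sketch below assembles a DeepSpeech-2-style stack in PyTorch (an assumption: the paper does not state its implementation framework). The kernel sizes, strides, and padding are typical DeepSpeech 2 choices rather than necessarily the exact values of Table I; the clipped ReLU is rendered with Hardtanh, and the bidirectional GRU outputs are summed as described above.

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """Sketch: two strided convolutions + five bidirectional GRU layers + linear output."""
    def __init__(self, n_freq=161, n_hidden=800, n_labels=29):
        super().__init__()
        # Convolutions over (time, frequency); kernel/stride values are illustrative.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),   # clipped ReLU: min(max(x, 0), 20)
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        rnn_in = 32 * 41  # 161 frequency bins shrink to 41 after the two strided convolutions
        self.rnns = nn.ModuleList(
            nn.GRU(rnn_in if i == 0 else n_hidden, n_hidden,
                   bidirectional=True, batch_first=True)
            for i in range(5)
        )
        self.fc = nn.Linear(n_hidden, n_labels)  # unnormalized scores over the label set

    def forward(self, spec):                     # spec: (batch, time, freq)
        x = self.conv(spec.unsqueeze(1))         # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        for rnn in self.rnns:
            x, _ = rnn(x)                        # (batch, time', 2 * n_hidden)
            h = x.size(-1) // 2
            x = x[..., :h] + x[..., h:]          # sum forward and backward directions
        return self.fc(x)                        # fed to a softmax / CTC loss downstream
```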
After the bidirectional GRUs, one fully connected layer is employed to generate the unnormalized scores over the label set. At each time step, the softmax layer outputs a distribution over the augmented label set defined by the loss function. Finally, the predicted transcription is decoded from the given sequences according to the probability distributions. Given the input-output pair and the current network coefficients, the loss function and its gradient with respect to the network parameters can be calculated over the data batches. The gradient is then backpropagated through the network in order to update its coefficients.

B. Loss function
To build an ASR system, we need a dataset comprising audio clips and their corresponding transcriptions. Both the input sequence $x = (x_1, \ldots, x_T)$ of size $T$, with $x_t \in \mathbb{R}^D$, where $D$ is the input size, and its transcription $y = (y_1, \ldots, y_U)$ of size $U$, where each $y_u$ belongs to a set of $K$ labels, may vary in length and ratio and may not have an accurate temporal alignment, which is not suitable for the usual supervised-learning setting. Performing manual alignment of the input sequence and its transcription requires intense human labor and is too time-consuming. The well-known connectionist temporal classification (CTC) [20] approach is an alignment-free algorithm which considers all possible alignments between $x$ and $y$ before opting for the most probable one. To handle repeated characters and regions of speech that do not contain any audio, the CTC introduces a special blank token $\varnothing$. The alignments considered by CTC have the same length as the input and must map to the output after merging character repetitions and removing $\varnothing$ tokens. As an example, a valid alignment for $y = [h, e, l, l, o]$ with input size $T = 8$ can be $[h, h, e, l, \varnothing, l, l, o]$, while $[\varnothing, h, h, e, l, l, l, o]$ is an invalid alignment, since it leads to $[h, e, l, o]$. Therefore, the CTC alignment between $x$ and $y$ is a many-to-one operation where the length of $y$ cannot be greater than the length of $x$.
The CTC approach aims at maximizing the log-likelihood $\log P(y|x)$ of the label sequence given the inputs, with
$P(y|x) = \sum_{a \in \mathcal{A}_{x,y}} \prod_{t=1}^{T} P(a_t|x)$,
where $\mathcal{A}_{x,y}$ is the set of all possible alignments $a = (a_1, \ldots, a_T)$ between $x$ and $y$, and the output probabilities $P(a_t|x)$ at each time step are assumed to be independent given $x$. The probability $P(a_t|x)$ can be estimated using any learning algorithm which produces a distribution over output classes (e.g., number of characters plus the blank label) given a fixed-size slice of the input. Usually, as considered in this work, a recurrent neural network (RNN) is employed to estimate $P(a_t|x)$ [21].
Summing over all possible CTC alignments is computationally impractical. Fortunately, the likelihood can be efficiently computed via dynamic programming [4]. The overall CTC loss function is the negative log-probability of correctly labeling the entire training set $\mathcal{D}$, that is,
$\mathcal{L}(\mathcal{D}) = -\sum_{(x,y) \in \mathcal{D}} \log P(y|x)$.
The loss $\mathcal{L}(\mathcal{D})$ is differentiable with respect to the output probabilities and can be used by any gradient-based optimization method to update the network coefficients. The CTC algorithm can directly map speech into text without any alignment by considering all possible conversion paths. To do so, the CTC assumes that the output symbols are conditionally independent of each other given the input, i.e., no linguistic information is directly imposed in the process, which means that the acoustic and language models are separated. While this separation allows for domain independence and adaptation or reuse of some of the speech recognition components, it also brings at least one drawback to the decoding scheme: the model does not know what a word is nor how the previous and current symbols correlate. This aspect, despite simplifying the ASR development, reduces the overall system performance in practice. The CTC decoding schemes discussed in the next section aim to overcome this issue.
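For illustration, the snippet below evaluates the CTC loss with PyTorch's built-in dynamic-programming implementation (nn.CTCLoss); the tensor shapes and the choice of index 0 as the blank label are assumptions for the example, not values from the paper.

```python
import torch
import torch.nn as nn

T, B, K, U = 100, 16, 29, 20   # time steps, batch size, labels (incl. blank), target length
log_probs = torch.randn(T, B, K, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, K, (B, U), dtype=torch.long)      # label indices, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# nn.CTCLoss sums over all valid alignments via dynamic programming.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradient of -log P(y|x) w.r.t. the network outputs
```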

IV. CTC DECODING WITH A LANGUAGE MODEL
This section describes a decoding procedure aided by an external language model that can potentially improve the performance of CTC-based ASR systems. We also describe thoroughly the beam-search decoding algorithm for both character-level and word-level language models in the same framework, which had previously been handled separately by different works [22], [23].
There are several ways of decoding the output, each one with its pros and cons. The best-path or greedy decoding [24], for instance, is the fastest decoding scheme. It considers that the highest symbol probability at each step yields the best hypothesis, that is, it computes
$a^* = \arg\max_{a} \prod_{t=1}^{T} P(a_t|x)$
by taking the most probable symbol at each time step. Then, it collapses the character duplicates and removes all blanks to get the final transcription. The best-path decoding is both fast and straightforward. For many applications, this approach works quite well, mainly when most of the probability mass is allocated to a single alignment, such as in handwriting recognition [24]. In practice, however, the primary goal of the decoding algorithm is not to find the best instantaneous match, but to find the final transcription with the highest probability, and a single transcription can have many paths. More precisely, we want to solve
$y^* = \arg\max_{y} P(y|x)$.
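A minimal sketch of best-path decoding follows, assuming the per-time-step probabilities come as a NumPy array and that index 0 is the blank; the alphabet mapping is a placeholder.

```python
import numpy as np

def greedy_decode(log_probs, alphabet, blank=0):
    """Best-path decoding: argmax per time step, collapse repeats, then drop blanks."""
    best_path = np.argmax(log_probs, axis=-1)   # shape: (time,)
    transcript, prev = [], blank
    for k in best_path:
        if k != prev and k != blank:
            transcript.append(alphabet[k])
        prev = k
    return "".join(transcript)

# Example usage: alphabet = ["<blank>", " ", "a", ..., "z"]; log_probs of shape (T, len(alphabet)).
```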
The example illustrated in Fig. 3 shows such a case, in which the transcription with the highest total probability (summed over all of its alignments) differs from the one found by best-path decoding. As mentioned in the previous section, the standard CTC algorithm does not include the constraint of either a lexicon or a language model. Under lexicon/language-model constraints, the best hypothesis is defined as
$y^* = \arg\max_{y} P(y|x)\, p_{\text{lm}}(y)$,
where $p_{\text{lm}}(y)$ is the language-model prior distribution. In practice, this prior distribution may be too restrictive, and thus it is down-weighted by a factor $\alpha > 0$, and a length penalty (or bonus) term $|y|$, controlled by $\beta \in \mathbb{R}$, is also included in the best hypothesis, which then becomes
$y^* = \arg\max_{y} P(y|x)\, p_{\text{lm}}(y)^{\alpha}\, |y|^{\beta}$,
where the hyperparameters $\alpha$ and $\beta$ are set by cross-validation. The most popular decoding algorithm which takes into account the many-to-one mapping and may include lexicon/language-model constraints is the beam-search decoding [22], [23]. This scheme iteratively searches for the best hypothesis in a tree of hypotheses and is flexible enough to handle both constrained and unconstrained vocabularies.
At each time step, the beam search computes a new set of hypotheses (beams), generated from the previous set by extending each hypothesis with all possible output symbols (labels) and keeping only the top candidates. Given its limited computation, the beam-search algorithm does not find the most probable output (Eq. (4)), but it empowers the machine-learning practitioner to trade off computation (i.e., a larger beam size) for an asymptotically better solution.
The vanilla beam search needs some adjustments to handle the CTC many-to-one mapping. Instead of keeping a list of alignments in the beam, the CTC beam-decoding algorithm stores the output prefixes after the mapping (i.e., collapsing the repeats and removing the blank symbols). At each step, the algorithm accumulates the score of a given prefix.
By storing the outputs after mapping, a new hypothesis can now map to two different prefixes if the character is a repeat, as exemplified in Fig. 4. To extend a prefix with a repeated character, a blank symbol is required between the repeated characters; then, we must only consider in the new score the part of the alignment that ends with $\varnothing$. Conversely, to keep the collapsed prefix (with a single occurrence of the character), we must only consider the part of the previous score for alignments that do not end with $\varnothing$.
Algorithm 1 outlines the employed decoding procedure. Instead of one score for each beam, the CTC beam-search algorithm has to keep track of two probabilities for each prefix in the beam, $p_b(\hat{y} \mid x_{1:t})$ and $p_{nb}(\hat{y} \mid x_{1:t})$, the probabilities of the candidate prefix $\hat{y}$ ending in $\varnothing$ or not, respectively, given the first $t$ time steps of the input $x$. The final score for a given prefix $\hat{y}$ is the sum of these two probabilities, i.e., $p(\hat{y} \mid x_{1:t}) = p_b(\hat{y} \mid x_{1:t}) + p_{nb}(\hat{y} \mid x_{1:t})$. The hypothesis sets $\mathcal{H}_{t-1}$ and $\mathcal{H}_t$ maintain a list of active prefixes at the previous and current time steps, respectively, and $\mathcal{H}_{t-1}$ is never larger than the beam width $W$. Also, $\hat{y} + k$ denotes the concatenation of the label $k$ with the prefix $\hat{y}$, while $\hat{y}_{-1}$ is the last label in the prefix $\hat{y}$.
The language-model constraint is only added for a new prefix $\hat{y} + k$ if the language model is character-based or if $k$ is a space in the case of a word-based language model. Finally, the overall probability of a prefix is the product of a language-model insertion term and the sum of the non-blank and blank probabilities, that is,
$p(\hat{y} \mid x_{1:t})\, p_{\text{lm}}(\hat{y})^{\alpha}\, |\hat{y}|^{\beta}$,
where $|\hat{y}|$ is the number of characters in $\hat{y}$ for a character-based language model or the number of words for a word-based language model. The final decoding scheme is significantly simpler and faster than weighted finite-state transducers [25], can naturally handle out-of-vocabulary words, and accepts an arbitrary language model (i.e., character/word level, rule/neural-network based).
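The sketch below is a simplified character-level CTC prefix beam search in Python, illustrating the two scores kept per prefix. It applies the language-model term when ranking the beams rather than incrementally as in Algorithm 1, and all function and variable names are illustrative.

```python
import math
from collections import defaultdict

NEG_INF = -math.inf

def logsumexp(*args):
    m = max(args)
    return NEG_INF if m == NEG_INF else m + math.log(sum(math.exp(a - m) for a in args))

def ctc_prefix_beam_search(log_probs, alphabet, beam_width=25, blank=0,
                           lm=None, alpha=0.5, beta=1.0):
    """log_probs: (T, K) array of per-time-step log probabilities; alphabet: index -> char.
    lm(prefix), if given, returns the log prior of a prefix (character-level LM hook)."""
    def score(item):
        prefix, (p_b, p_nb) = item
        s = logsumexp(p_b, p_nb)
        if lm is not None and prefix:
            s += alpha * lm(prefix) + beta * math.log(len(prefix))
        return s

    beams = {(): (0.0, NEG_INF)}   # empty prefix: p_blank = 1, p_non_blank = 0 (log space)
    for t in range(len(log_probs)):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(log_probs[t]):
                if k == blank:                      # blank: the prefix stays the same
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                new_prefix = prefix + (k,)
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and k == prefix[-1]:
                    # Repeated character: extending the prefix requires a blank in between ...
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # ... otherwise the repeat collapses onto the existing prefix.
                    ob_b, ob_nb = next_beams[prefix]
                    next_beams[prefix] = (ob_b, logsumexp(ob_nb, p_nb + p))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))
        beams = dict(sorted(next_beams.items(), key=score, reverse=True)[:beam_width])
    best = max(beams.items(), key=score)[0]
    return "".join(alphabet[k] for k in best)
```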

V. LANGUAGE MODEL
In the previous section, we described the beam-search decoding for CTC, which may add linguistic information through a language model. This section walks through one of the main approaches to construct a language model $p_{\text{lm}}(\cdot)$ using a nonparametric model based on counting statistics, the Kneser-Ney algorithm [26], [27]. At the end, we briefly discuss other language models based on neural networks [28], [29], [30], [31].

A. Statistical language model
Algorithm 1: Connectionist temporal classification (CTC) beam-search decoding, a unified approach for both [22], [23]. The language-model term is applied only if the language model is character-based, or if the new label is a space and $\hat{y}_{-1}$ is not a space; at each step, $\mathcal{H}_t$ keeps the $W$ prefixes with the highest language-model-weighted scores, and the algorithm returns the highest-scoring prefix in $\mathcal{H}_t$.

A language model computes the probability $P(w_{1:N})$ over a sequence of words $w_{1:N} = (w_1, \ldots, w_N)$ [32], which can be factorized, using the chain rule, as
$P(w_{1:N}) = \prod_{i=1}^{N} P(w_i \mid w_{1:i-1})$.
Each conditional probability can be estimated by counting: the number of times the sequence $w_{1:i}$ occurs, $C(w_{1:i})$, divided by the total number of times the sequence of words minus the last word occurs, $C(w_{1:i-1})$, that is, $P(w_i \mid w_{1:i-1}) = C(w_{1:i}) / C(w_{1:i-1})$. However, computing this conditional probability for long histories would require virtually infinite data. Under a Markov assumption, we can approximate it by conditioning on just the $n$ previous words, that is,
$P(w_i \mid w_{1:i-1}) \approx P(w_i \mid w_{i-n:i-1})$,
such that the probability over a sequence of words becomes
$P(w_{1:N}) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n:i-1})$,
which is the definition of a vanilla $n$-gram model. For the unigram model ($n = 0$), the simplest case, the conditional probability is approximated by the probability of the current word, whereas in a bigram model ($n = 1$) the conditional probability is conditioned only on the previous word.
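As a toy illustration of the counting estimate above, the snippet below computes maximum-likelihood bigram probabilities from a two-sentence corpus; the sentences and the start/end markers are made up for the example.

```python
from collections import Counter

def bigram_mle(corpus_sentences):
    """Maximum-likelihood bigram estimates: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])                    # counts of the conditioning words
        bigrams.update(zip(tokens[:-1], tokens[1:]))    # counts of adjacent word pairs
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_mle(["o gato dorme", "o gato come"])
print(probs[("o", "gato")])   # 1.0: "gato" always follows "o" in this toy corpus
```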

B. Perplexity
It is expected that a trained language model assigns higher probabilities to frequently observed sequences than to rarely observed ones. In practice, however, how can we evaluate a trained model? One way is the extrinsic evaluation, i.e., testing it in the task it was designed for (e.g., ASR) and comparing the resulting evaluation metrics (e.g., WER, the word error rate). This method is time-consuming; besides, many factors can impact the performance of the task and hide some training problems. An alternative way, commonly used in the ASR context, is the intrinsic evaluation using the so-called perplexity measure, which is the inverse probability of the test set, normalized by the number of words, that is,
$PP(W) = P(w_1, \ldots, w_N)^{-1/N}$.
In other words, perplexity measures how well the model can predict the next word. A better text model is one that assigns a higher probability to the word that actually occurs. Better models mean minimizing the perplexity or, equivalently, maximizing the probability $P(W)$ of the test set.
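A direct translation of this definition into code might look like the following sketch; model_logprob is a hypothetical callable returning the total natural-log probability a language model assigns to a sentence.

```python
import math

def perplexity(model_logprob, test_sentences):
    """PP = P(test set)^(-1/N): exponentiate the negative average log probability per word."""
    total_logprob, total_words = 0.0, 0
    for sentence in test_sentences:
        total_logprob += model_logprob(sentence)    # natural-log probability of the sentence
        total_words += len(sentence.split())
    return math.exp(-total_logprob / total_words)
```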

C. Generalization
For any $n$-gram that occurred a significant number of times, the trained model usually provides a good probability estimate. As the training data are limited, however, many acceptable $n$-grams are bound to be missing, being assigned zero probability in spite of having non-zero probability in real-world scenarios. This means that our model is underestimating the probability of several word sequences, which can hurt its performance, and that the perplexity of the test set cannot even be calculated due to divisions by zero.
One way to overcome the zero probabilities is called smoothing, or discounting. The intuition in this approach is to redistribute the probability mass to generalize better. The simplest algorithm is add-1 smoothing (or Laplacian smoothing), which adds one to all $n$-gram counts. All the counts that used to be zero will now have a count of one, the counts of one will be two, and so on. However, there is additional information that the model can rely on instead of "adding one". If an $n$-gram model does not find a particular $n$-gram, the model can instead estimate the probability by using the $(n-1)$-gram. Similarly, if the $(n-1)$-gram model does not have a particular $(n-1)$-gram count, the model can look at the $(n-2)$-gram, and so on, down to the unigram model. In this so-called backoff approach, the $n$-gram evidence is used when it is sufficient; otherwise, the language model uses lower-order $n$-gram information. For a backoff $n$-gram model to yield a correct probability distribution, one has to discount the higher-order $n$-grams to save some probability mass for the lower-order $n$-grams.
One of the most commonly used and best-performing $n$-gram smoothing methods is the interpolated Kneser-Ney algorithm, which is based on absolute discounting [33] and subtracts a fixed (absolute) discount from each count. The Kneser-Ney discounting augments absolute discounting with a more sophisticated way of handling the lower-order $n$-gram distribution. A standard $n$-gram model assigns higher probabilities to frequent occurrences, whereas the Kneser-Ney discounting estimates the probability of the $n$-gram occurring as a novel continuation in a new, unseen context.
For example, consider that, in some training data, the word Francisco is more common than glasses, since San Francisco is a very frequent phrase. The Kneser-Ney method captures the intuition that, although Francisco is more frequent, it is mainly frequent alongside San, whereas the word glasses has a much wider distribution. The number of times that a word $w$ appears as a novel continuation can be expressed as
$P_{\text{CONT}}(w) = \dfrac{|\{v : C(v, w) > 0\}|}{\sum_{w'} |\{v : C(v, w') > 0\}|}$,
where the numerator is the cardinality of the set containing all the non-zero counts, i.e., $C(v, w) > 0$, of the bigram $[v, w]$ for all $v$, and the denominator is the normalization factor.
Then, a frequent word such as Francisco occurring in only one context (San) will have a low continuation probability $P_{\text{CONT}}$. The interpolated Kneser-Ney probability is given by
$P_{\text{KN}}(w_i \mid w_{i-n+1:i-1}) = \dfrac{\max\big(C_{\text{KN}}(w_{i-n+1:i}) - d,\, 0\big)}{C_{\text{KN}}(w_{i-n+1:i-1})} + \lambda(w_{i-n+1:i-1})\, P_{\text{KN}}(w_i \mid w_{i-n+2:i-1})$,
where $\lambda(w_{i-n+1:i-1})$ is a normalizing constant that distributes the discounted probability mass, composed of the normalized discount $d / C_{\text{KN}}(w_{i-n+1:i-1})$ multiplied by $|\{w : C_{\text{KN}}(w_{i-n+1:i-1}, w) > 0\}|$, the number of times the normalized discount was applied. In these expressions, $C_{\text{KN}}$ is given by the raw count for the highest order and by the continuation count $C_{\text{CONT}}(\cdot)$ for lower orders, where $C_{\text{CONT}}(\cdot)$ is the number of unique single-word contexts for "$\cdot$".
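To make the continuation-count intuition concrete, the toy snippet below computes $P_{\text{CONT}}$ from bigram counts; the counts themselves are invented for the Francisco/glasses example.

```python
from collections import defaultdict

def continuation_probability(bigram_counts):
    """P_CONT(w): fraction of distinct bigram types that end in w (Kneser-Ney intuition)."""
    ends_in = defaultdict(set)          # word -> set of distinct left contexts
    for (v, w), count in bigram_counts.items():
        if count > 0:
            ends_in[w].add(v)
    total_types = sum(len(ctx) for ctx in ends_in.values())
    return {w: len(ctx) / total_types for w, ctx in ends_in.items()}

# "Francisco" appears often but only after "San", so its continuation probability is low.
counts = {("San", "Francisco"): 50, ("my", "glasses"): 3,
          ("new", "glasses"): 2, ("reading", "glasses"): 4}
print(continuation_probability(counts))   # glasses: 0.75, Francisco: 0.25
```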
A modified version of the Kneser-Ney smoothing [27], employed in this work, uses three different discounts $d_1$, $d_2$, and $d_{3+}$ for $n$-grams with counts of one, two, and three or more, respectively.
Neural language models address both the $n$-gram data sparsity and the limited context issues. Due to their parametric nature, such models require more data and take longer than statistical language models to train. The $n$-gram data sparsity problem is addressed through word embeddings (representing each word as a real-valued vector instead of a one-hot vector) used as inputs to a neural network [28], [29]. The key idea behind word embeddings is to create semantic relationships between words in the feature space, i.e., words like "cat" and "dog" should somehow be close in the embedding space, since they tend to appear in similar contexts. Many language-modeling methods use RNNs [29], [34], and, more recently, transformer-like architectures [35], [36] have been proposed in the literature, achieving state-of-the-art results on many datasets. However, in this work, a statistical language model using the Kneser-Ney algorithm is employed, as it leads to comparable results under data-resource constraints, which are inherent to the PT-BR scenario, leaving the use of a neural language model as future work.
VI. DATASETS

So far, we have described how to build, train, and evaluate both the acoustic and language models employed in this work. As explained in Section I, the key idea of this paper is to train a backbone acoustic model using an openly available English dataset, and then fine-tune it using our PT-BR speech dataset. Finally, we use a PT-BR language model to increase the system accuracy even further. This section describes all datasets employed to train both the acoustic (backbone and fine-tuned) and the language models. For the acoustic model, three datasets were used: 1) LibriSpeech, a freely available English dataset containing nearly 1,000 hours of speech, used to train the backbone model; 2) Brazilian Portuguese speech dataset (BRSD) v1 [8], with approximately 14 hours of PT-BR speech; and 3) Brazilian Portuguese speech dataset (BRSD) v2, containing the first version (BRSD v1) plus 144 extra hours of speech data. The separation between BRSD v1 and v2 is employed in this work to illustrate the effect of additional training data on the final performance of the ASR system and how the use of language models can mitigate the problem of having only a few hours of annotated speech.
For the language model, we built a single PT-BR text dataset (BRTD) by combining three publicly available datasets: LapsNews [37]; CETENFolha [38]; and the new WikiText PT-BR, comprising data scraped from the Wikipedia website, totalling more than 8 million sentences, which is 71 times larger than LapsNews and 5.5 times larger than CETENFolha.
A. Acoustic model datasets

1) LibriSpeech: The LibriSpeech [39] is a speech corpus derived from read audiobooks from the LibriVox project, totalling almost 1,000 hours of read speech sampled at 16 kHz. Due to its massive amount of data, the LibriSpeech corpus is a perfect candidate to pre-train end-to-end ASR models.
2) BRSD v1: This dataset is an ensemble of three publicly available datasets (Sid, VoxForge, and LapsBM) and one paid dataset (PT-BR Spoltech) [8]. It contains almost 14 hours of non-conversational speech from 425 different speakers, comprising more than 12,000 utterances sampled at 16 kHz in a non-controlled environment. As reported in [43], some audio recordings have missing or erroneous transcriptions. All datasets above were pruned of samples with no or wrong transcriptions, too-short recordings, and other defects that could produce wrong results. All audio files were re-sampled at 16 kHz. Recording lengths concentrate around 3 s but can reach up to 25 s.
3) BRSD v2: This second BRSD version includes its previous version, BRSD v1, plus the CETUC dataset [44], which totals almost 145 hours of speech signals produced by 50 male and 50 female speakers, each one pronouncing 1,000 phonetically balanced sentences selected from the CETENFolha corpus [38]. The CETUC dataset was recorded in a controlled environment at a sampling rate of 16 kHz.
For the reader's convenience, a summary of all acoustic datasets employed in this work is provided in Table II. For the sake of comparison, the most important datasets employed in the ASR literature are the 5.4-h TIMIT [45], the 73-h Wall Street Journal (WSJ) [46], [47], the 300-h Switchboard [48], and the 1,000-h LibriSpeech [39], all in English. Due to its non-controlled environmental conditions, multiplicity of recording hardware, and distinct speaker dialects, the 158-h BRSD v2 is far more challenging than the WSJ. In addition, Switchboard contains conversational speech, which is not found in BRSD v2. A summary of all language model datasets used in this work is given in Table III.

VII. LANGUAGE-MODEL EXPERIMENTS
We consider as a baseline model the LapsLM [37], which is a word-level 3-gram model trained with the modified Kneser-Ney smoothing technique and is freely available to use.
When training the PT-BR language model, we used the BRTD, keeping all words with at least three occurrences in the dataset, totalling almost 512k words. Words not in the vocabulary were replaced by an 'unknown' token. We also removed sentences in common with the LapsBM and CETUC datasets so as not to bias our results. We split the text corpus into two sets: the training set, containing 90% of all sentences, and the test set with the remaining ones. We evaluate the word perplexity on both the BRTD test set and the LapsBM dataset (using its utterance transcriptions).
All $n$-gram models were trained with the KenLM [49] toolkit and, unlike [6], the use of word- and character-level language models is studied in the context of a much smaller speech corpus. Two word-level $n$-gram models, with $n = 3$ and $n = 5$, were trained using the same parameters as the LapsLM. For the character-level language models, we study how the context influences perplexity by training $n$-grams with $n = 5, 10, 15, 20$, following the same procedure as in [50]. In the $n = 15$ and $n = 20$ cases, rare sequences were pruned as follows: 6-, 7-, and 8-grams appearing only once and 9-grams appearing once or twice were dropped, and, for $n \geq 10$, all $n$-grams with fewer than 4 appearances were dropped.
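As an aside, the hypothetical snippet below shows how an ARPA model produced by KenLM's lmplz tool can be scored from Python using the KenLM bindings; the file names and the example sentence are placeholders, and the perplexity is derived from the total log10 score returned by the bindings.

```python
import math
import kenlm   # Python bindings distributed with KenLM

# Hypothetical model built beforehand with, e.g.:  lmplz -o 5 < brtd_train.txt > word_5gram.arpa
model = kenlm.Model("word_5gram.arpa")

sentence = "o rato roeu a roupa do rei de roma"
log10_prob = model.score(sentence, bos=True, eos=True)   # total log10 probability
n_tokens = len(sentence.split()) + 1                     # words plus the </s> marker
print("perplexity:", 10 ** (-log10_prob / n_tokens))
```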
In order to compare word-level and character-level models, one may estimate the word-level perplexity in all cases. In this strategy, following [50], given a previous context $C$, the word probability can be estimated as
$P(w \mid C) = \prod_{j=1}^{m} P(c_j \mid C, c_1, \ldots, c_{j-1})$,
where $c_1, \ldots, c_m$ are the letters in the word $w$. This approach, used by [50], does not take into account that word-level models are constrained to a fixed-size lexicon, while character-based models have a virtually infinite vocabulary. Therefore, Eq. (18) shall be used only as an upper bound for the true word-level perplexity of the character-level models. The word-level perplexity results attained by the different language models, along with their respective receptive fields (average number of characters seen in all $n$-grams) and storage sizes, are shown in Table IV. As one can see, the LapsLM presents the best perplexity results over the LapsBM dataset, but it does not generalize over the BRTD test set as well as some of the other models. The LapsLM model was probably trained on a smaller dataset from a text domain similar to that of the LapsBM test set, which may explain the difference in perplexity between the two test sets (a gap of 138 PP), while ours was trained on more text, exhibiting better generalization over different test sets. As expected, increasing the context increases the model complexity and decreases the perplexity, indicating a better performance. For $n = 15, 20$, both character-level models exhibit competitive performances against the trained word-level models while having a virtually infinite vocabulary. In Section IX, we investigate how these newly trained language models may improve the ASR performance when combined with the proposed acoustic model.

VIII. ACOUSTIC-MODEL EXPERIMENTS
In this section, we develop new acoustic models based on the DeepSpeech 2 deep neural-network architecture. All hyperparameter tuning was performed on validation sets, and the final performance was evaluated on the test sets. We use tempo and gain perturbation as data augmentation in all experiments [51]: the tempo was randomly modified to between 85% and 115% of the original rate, while the gain was randomly altered from −6 dB to 8 dB relative to the original one. System performances are evaluated with respect to their final word error rate (WER) and character error rate (CER), which are the edit distances at word and character levels, respectively. In each case, we consider as the best model the one that achieves the lowest WER on the validation set.
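For reference, a compact sketch of the WER/CER computation via the Levenshtein (edit) distance is given below; it is a generic implementation, not necessarily the exact scoring script used in the experiments.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row dynamic programming)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    """Word error rate: edit distance over word tokens, normalized by the reference length."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: edit distance over characters, normalized by the reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```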
In the next subsections, we show the results for ASR models trained on BRSD v1 and v2, and discuss how a well-trained language model can further improve the results and even reduce the gap between speech recognition systems trained on smaller and larger speech corpora.

A. Backbone model training
When training the backbone acoustic model for English, we employed the same training, validation, and test sets as provided in the original publication [39]. In particular, there are two versions of each validation and test set, containing clean and noisy speech, respectively.
Table V summarizes the backbone model architecture and related hyperparameters. The network input is the normalized spectrogram, as described in Section III, calculated using a Hamming window of 320 samples and a hop size of 160 samples, resulting in $F = 161$ frequency bins. Each recurrent layer has 800 hidden units, and the output alphabet contains 29 labels, corresponding to the letters {A, B, ..., Z} plus the apostrophe, space, and blank characters.
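A possible rendering of this preprocessing step is sketched below with torch.stft; the log compression and per-utterance normalization are common choices assumed for the example rather than details taken from the paper.

```python
import torch

def normalized_power_spectrogram(waveform, n_fft=320, hop=160):
    """Power spectrogram with a 320-sample Hamming window and 160-sample hop,
    giving n_fft // 2 + 1 = 161 frequency bins for 16 kHz audio."""
    window = torch.hamming_window(n_fft)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)
    power = stft.abs() ** 2                          # (161, time)
    log_power = torch.log1p(power)                   # log compression (assumed choice)
    # Per-utterance normalization: zero mean, unit variance.
    return (log_power - log_power.mean()) / (log_power.std() + 1e-8)
```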
Network training was carried out using the stochastic gradient-descent method with momentum [3], with a learning rate of 4.8×10⁻⁴, a momentum of 0.9, an annealing rate of 0.9091, and a gradient-norm clipping [52] of 400, for over 20 epochs with a batch size of 16. In the first epoch, we sort the utterances by their lengths, accelerating the network training, as proposed in [6]. After the first epoch, the batches are randomly organized. The predicted sequence is decoded using the greedy search [20]. Table VI shows the results of the backbone model, which are comparable to the ones found in the literature obtained without a proper language model for decoding and without extra data. The PaddlePaddle [53] implementation differs from the others by adopting larger recurrent layers (with 2,048 hidden units each) and a different activation function, while Sean Naren's implementation [54] employs a different padding scheme in the convolutional layers.
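A PyTorch sketch of this training setup follows; whether the annealing rate multiplies or divides the learning rate each epoch is not stated, so the exponential-decay interpretation below is an assumption, and the data loader is assumed to already yield lengths matching the model's output time resolution.

```python
import torch
import torch.nn as nn

def train_backbone(model, train_loader, epochs=20):
    """Training-loop sketch: SGD with momentum, exponential LR annealing, gradient clipping."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=4.8e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9091)  # assumed decay
    for _ in range(epochs):
        # Batches sorted by length in the first epoch, shuffled afterwards (handled by the loader).
        for spec, targets, in_lens, tgt_lens in train_loader:
            optimizer.zero_grad()
            log_probs = model(spec).log_softmax(dim=-1)          # (batch, time, labels)
            loss = ctc(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=400)
            optimizer.step()
        scheduler.step()
```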

B. PT-BR model training
For the PT-BR acoustic model, two experiments were performed, using v1 and v2 of the BRSD dataset (see Subsection VI-A), in order to evaluate the impact of their sizes (approximately 14 and 159 hours, respectively) on the final results.
For the BRSD v1 experiment, we follow the previous works [16], [17] by using the same validation set (termed v1-val), containing 21 speakers from the LapsBM dataset, whereas the training set (v1-train) comprised the Sid, VoxForge, and CSLU datasets. It is worth mentioning that both the LapsBM and CETUC datasets were built using sentences gathered from the CETENFolha dataset, so they are not utterance independent, which may bias our results.
To mitigate such utterance contamination in the BRSD v2 experiment, we considered a new test set (v2-test) using 20 speakers from the CETUC dataset, each one speaking only 200 sentences out of the possible 1,000, instead of using the remaining speakers of the LapsBM as done in [16], [17]. We use v2-test in both the BRSD v1 and v2 experiments to report the final results. The other 80 speakers in the CETUC set were used to expand the v1-train set into the v2-train set with the remaining 800 sentences not included in the test set. We made three different 200 : 800 random splits, and the results are reported in the form of mean and variance. Finally, the validation set (v2-val) in the BRSD v2 experiment remained the same as in the v1 case.
We conduct an experiment to compare the performances attained without pre-training (i.e., training from scratch) and with pre-training (i.e., fine-tuning the model obtained in Subsection VIII-A).
The PT-BR model has a broader character set, in order to include the Brazilian Portuguese accented characters, so the number of output labels is 43. Since the number of characters differs between the backbone and the fine-tuned models, the weights of the last fully connected layer must be initialized from scratch. The PT-BR acoustic models are trained using the same procedure as the backbone model, except for a batch size of 32 and a learning rate decay of 0.99. We trained the model from scratch for 100 epochs and the fine-tuned model for 50 epochs. The training curves are depicted in Fig. 5, where one notices that the model performances in both cases did not improve after about 20 epochs. As one can see, using a pretrained model accelerates training and reaches a better WER, besides a lower bias.
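A hypothetical sketch of this weight-transfer step is shown below, reusing the DeepSpeech2Like class sketched in Section III; the checkpoint file name and the assumption that the output-layer parameters are prefixed with "fc." are illustrative.

```python
import torch

# DeepSpeech2Like: the architecture sketch from Section III (assumed importable here).
# Load the English backbone weights and transfer them to a PT-BR model with 43 output labels.
backbone_state = torch.load("backbone_librispeech.pt", map_location="cpu")  # saved state_dict
model = DeepSpeech2Like(n_freq=161, n_hidden=800, n_labels=43)

# Drop the output-layer weights, whose shape depends on the label set, and keep the rest.
transferable = {k: v for k, v in backbone_state.items() if not k.startswith("fc.")}
missing, unexpected = model.load_state_dict(transferable, strict=False)
# Only the final fully connected layer ("fc.*") remains randomly initialized.
```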
The final comparison between the acoustic PT-BR models is shown in Table VII for both BRSD v1 and v2 experiments. From these results, it is clear that fine-tuning is advantageous, reducing the final CER by 11.51% and 1.41% for the v1 and v2 cases, respectively. Also, it is noteworthy that increasing the dataset size was quite beneficial, significantly reducing the CER by 8.21%.

IX. COMPLETE ASR EXPERIMENTS
TABLE VII. PT-BR model comparison between training from random initialization (scratch) and training from a pre-trained backbone model (the number between parentheses is the standard deviation calculated over the different v2-test splits); fine-tuning is clearly advantageous, reducing the final CER for both v1 and v2.

In this section, we combine the language and acoustic models using the beam-search decoder discussed in Section IV. The results are shown in Tables VIII and IX for the BRSD v1 and v2 datasets, respectively. The best ASR system, trained with the BRSD v2 dataset and using a 15-gram character-based language model, achieved a CER and a WER of 10.49% and 25.45%, respectively. This is similar to the result found in [9] (15.2% of WER) for the Tamil language, and comparable to the improvements reported by [13] and [14] using multilingual training and multimodal data augmentation, respectively. From these tables, one clearly verifies the advantage of incorporating the language model into the acoustic models obtained in Section VIII, with the WER dropping from 71.62% to 30.50% for the BRSD v1 set, and a 21.96% improvement for the BRSD v2 set. Comparing the v1 and v2 results, one concludes that more training data makes the final ASR system less dependent on an external language model, as also observed in [6]. The ASR systems based on the $n$-gram character-level language models with $n \geq 10$ achieve better performance than the ones using word-level language models, most probably due to the former's virtually infinite vocabulary. It is worth mentioning that the use of a well-trained language model greatly reduces the gap between the BRSD v1 and v2 results. This is due to the $\alpha$ parameter in the decoding objective: in our experiments, a worse acoustic model (i.e., trained on v1 data) has a higher $\alpha$, indicating that the overall ASR system relies more on the language model to transcribe the audio than on the audio itself.
Some of the transcriptions provided by our best ASR system are shown in Table X. As one can see, some errors arise from grammatical mistakes (example 1), are phonetically similar to the expected transcriptions (examples 2 and 3), or are due to proper names (example 4). Overall, most of the transcriptions are plausible and can be easily understood by human readers.

X. CONCLUSIONS
Through this work, we described all steps necessary to build an end-to-end ASR system for Brazilian Portuguese. Our proposal employs a DeepSpeech-2-based architecture and a transfer-learning approach from a backbone model trained on