AN HMM-BASED BRAZILIAN PORTUGUESE SPEECH SYNTHESIZER AND ITS CHARACTERISTICS
Volume 21, Número 2, Agosto 2006

Research in the speech synthesis area has made great progress recently, perhaps motivated by its numerous applications, of which text-to-speech converters and dialog systems are examples. Several improvements have been reported in the technical literature related to existing state-of-the-art techniques, as well as in the development of new ideas related to the alteration of voice characteristics and their eventual application to different languages. Nevertheless, in spite of the attention that the speech synthesis field has been receiving, the technique which employs unit selection and concatenation of waveform segments still remains the most popular approach among those available nowadays. In this paper, we report how a synthesizer for the Brazilian Portuguese language was constructed according to a technique in which the speech waveform is generated through parameters directly determined from Hidden Markov Models. When compared with systems based on unit selection and concatenation, the proposed synthesizer presents the advantage of being trainable, with the utilization of contextual factors including information related to different levels of the following acoustic units: phones, syllables, words, phrases and utterances. Such information is brought into effect through a set of questions for context-clustering. Thus, both the spectral and the prosodic characteristics of the system are managed by decision-trees generated for each one of the following parameters: mel-cepstral coefficients, fundamental frequency and state durations. As a typical characteristic of the technique based on Hidden Markov Models, synthesized speech with quality comparable to commercial applications built under the unit selection and concatenation approach can be obtained even from a database as small as eighteen minutes of speech. This was verified by a subjective comparison of samples from the synthesizer in question and other systems currently available for Brazilian Portuguese.


INTRODUCTION
The speech synthesis area has been stimulating great interest for speech processing researchers in the last years.Aside from topics related to multilingual speech synthesis, with the attempt at designing unified TTS engines which could possibly work on different languages [1], the tendency nowadays has also been driven towards the synthesis of voices with different styles and emotions [2,3].
Although a few speech synthesis techniques exist, the approach wherein speech is synthesized through the selection and concatenation of waveform units has been largely applied [4,5]. One of its main advantages when compared with the other techniques is the fact that synthesized speech with high quality can be achieved due to the utilization of natural speech waveforms as units for concatenation, selected according to some specific cost functions. Nevertheless, for this technique the synthesis of voices with different styles and emotions, as well as the achievement of high quality itself, requires the availability of large corpora.
Recently, a trainable approach in which the speech waveform is synthesized from parameters directly derived from Hidden Markov Models (HMMs) has been reported to work well for some languages [6][7][8]. One of the main advantages of the referred HMM-based synthesis technique when compared with the unit selection and concatenation method is the fact that voice alteration can be performed with no need of large databases [9][10][11]. Another advantage is that synthesized speech with applicability 1 can be achieved by training the system with a database as small as eighty sentences, as reported in [8]. Besides, still considering small databases (about one hour of speech), HMM-based synthesizers can be competitive in quality with unit selection and concatenation ones [12,13]. On the other hand, one of the main disadvantages of the referred approach is the buzzy quality of the synthesized speech. This drawback is caused by the source-filter model used during the waveform generation stage, which basically consists of a linear predictive vocoder, though in [14] it is reported that the mentioned buzz can be removed with the utilization of a mixed excitation scheme. Another approach to solve this problem is shown in [12], which consists of an adaptation of the vocoding method introduced in [15] to HMM-based speech synthesis.
Turning to the Portuguese language, considerable advances have been achieved in the speech synthesis area for both the European (e.g. [16,17]) and Brazilian (e.g. [18,19]) dialects. However, the common aspect among most of the contributions to the Portuguese language is that they are related to synthesizers based on the waveform selection and concatenation approach. In this paper, HMM-based Brazilian Portuguese speech synthesis [8,20] is the focus. More specifically, the contribution of this paper consists of the description and discussion of topics related to the application of the HMM-based speech synthesis approach to Brazilian Portuguese, namely: determination of a list of contextual factors, definition of utterance information which enables the derivation of all the specified features, and elaboration of questions which can bring into effect all the factors, according to a tree-based context-clustering algorithm. Finally, in order to learn the characteristics of the trained system as well as to empirically improve the list of features and questions, a rough inspection of the generated decision-trees is performed.
This paper is organized as follows: in Section 2 all the procedures carried out by the synthesizer engine are described, from the database training to the synthesis of a given utterance. Section 3 describes the aspects of HMM-based speech synthesis applied to Brazilian Portuguese. In Section 4, two subjective tests are presented: the first one concerns the perceptual importance of contextual factors related to syllable, stress, and part-of-speech (POS) [21], whereas the second test corresponds to an evaluation comparing the synthesizer in question with other systems currently available for Brazilian Portuguese.

1 Meaning that it could possibly be employed by some applications due to the naturalness of the synthesized speech prosody.

ENGINE DESCRIPTION
The procedures of training and synthesis carried out by the speech synthesis engine are depicted in the block diagram of Figure 1. The present engine corresponds to an improved version of the one already described in the literature, e.g. [6][7][8]. The enhancements correspond to: (1) the application of the high-quality vocoding method described in [15]; (2) the utilization of HMMs with explicit state durations (Hidden Semi-Markov Models) [22]; and (3) the generation of parameters considering global variance [23]. These improvements are described in more detail in [12]. In the following sections an outline of the whole engine is given.

TRAINING PART
The synthesizer is trained through the following steps: (1) speech parameter extraction; (2) label generation; and (3) HMM training.

SPEECH PARAMETER EXTRACTION
The training starts with parameter extraction. In this step, initially a sequence of fundamental frequency logarithms, {log(F0_1), . . ., log(F0_N)}, including voicing decision information (if F0 = 0 the frame is considered unvoiced), where N is the total number of frames of all the utterances from the training database, is extracted on a short-time basis. After that, a sequence of mel-cepstral coefficient vectors which represent speech envelope spectra [24], {c^1, . . ., c^N}, with c^i = [c_0^i, . . ., c_M^i]^T, where the superscript i indicates the frame number and [·]^T denotes transposition, is derived through an M-th order mel-cepstral analysis, taking into account the already extracted sequence of log(F0) in order to remove signal periodicity [15]. Finally, a sequence of aperiodicity coefficient vectors, {b^1, . . ., b^N}, is also obtained from all the utterances at the same rate as the mel-cepstral coefficients and log(F0).
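As a minimal illustration of how the voicing decision is folded into the log(F0) stream, the sketch below converts per-frame F0 estimates (assumed to have already been produced by some pitch tracker) into a sequence whose unvoiced frames are marked explicitly; the function name and the use of `None` as the unvoiced marker are illustrative choices, not part of the original system.

```python
import math

def logf0_sequence(f0_values):
    """Convert per-frame F0 estimates (Hz) into the log(F0) stream.

    Frames with F0 == 0 are treated as unvoiced and marked with None,
    mirroring the voiced/unvoiced decision described in the text.
    """
    return [math.log(f0) if f0 > 0 else None for f0 in f0_values]

# Example: three voiced frames around 120 Hz with one unvoiced frame.
stream = logf0_sequence([120.0, 0.0, 118.0, 121.5])
```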

LABEL GENERATION
In this step, the utterance information for all the sentences of the training database is converted into HMM contextual labels. The descriptions of utterance information and contextual label are given in Sections 3.2 and 3.3, respectively.

HMM TRAINING
Each HMM corresponds to a no-skip S-state left-to-right model, with S = 5. Each output observation vector o_i for the i-th frame consists of five streams, as illustrated in Figure 2, where:
• o_1^i: vector composed of the mel-cepstral coefficients, {c_0^i, . . ., c_M^i}, their corresponding delta components, {Δc_0^i, . . ., Δc_M^i}, and delta-delta components, {Δ²c_0^i, . . ., Δ²c_M^i};
• o_2^i, o_3^i and o_4^i: composed respectively of the fundamental frequency logarithm, log(F0_i), its corresponding delta, Δlog(F0_i), and delta-delta, Δ²log(F0_i);
• o_5^i: vector composed of the aperiodicity parameters, b^i, and their corresponding delta and delta-delta components.
The observation vector o_i is output by an HMM state s according to a probability distribution given by

b_s(o_i) = ∏_{j=1}^{5} [ Σ_{l=1}^{R_j} ω_sjl N(o_j^i; μ_sjl, Σ_sjl) ]^{γ_j},

where N(·; μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ, ω_sjl is the weight for the l-th mixture component of the j-th stream vector o_j^i output by the state s, and γ_j is the output probability weight for the j-th stream, with R_j being the corresponding number of mixture components. The first and fifth stream vectors, o_1^i and o_5^i, are modeled by single-mixture continuous Gaussian distributions, where the dimensionality is 3(M + 1) for o_1^i and fifteen for o_5^i. For the second, third and fourth scalar streams, o_2^i = log(F0_i), o_3^i = Δlog(F0_i), and o_4^i = Δ²log(F0_i), the output probability is modeled by multi-space Gaussian distributions [25] with two mixture components.
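The per-state output probability above can be sketched for scalar streams as follows; the mixture parameters used in the test are toy values and the function names are illustrative. The code computes b_s(o) as the product over streams of each mixture likelihood raised to its stream weight γ_j.

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def stream_likelihood(x, mixtures):
    """Mixture likelihood for one scalar stream:
    sum_l w_l * N(x; mu_l, var_l), with mixtures as (w, mu, var) tuples."""
    return sum(w * gaussian(x, mu, var) for w, mu, var in mixtures)

def output_probability(observation, streams, gammas):
    """b_s(o) = prod_j [ sum_l w_jl N(o_j; mu_jl, var_jl) ]^gamma_j."""
    p = 1.0
    for x, mixtures, gamma in zip(observation, streams, gammas):
        p *= stream_likelihood(x, mixtures) ** gamma
    return p
```

Setting a stream weight γ_j to zero effectively removes that stream from the state likelihood, which is one way such weights are used in practice.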
For each HMM k, the durations of the S states are grouped into a vector, d_k = [d_1^k, . . ., d_S^k]^T, where d_s^k represents the duration of the s-th state. Further, each of the duration vectors, {d_1, . . ., d_K}, where K is the total number of HMMs representing the database, is modeled by an S-dimensional single-mixture Gaussian distribution. The output probabilities of the state duration vectors are thus re-estimated by Baum-Welch iterations in the same way as the output probabilities of the speech parameters [22].
During the training, a context-clustering technique is applied to the streams of mel-cepstral coefficients, log(F0) and aperiodicity parameters, as well as to the state duration models. At the end of the process, 3S + 1 different acoustic decision-trees are generated: S trees for mel-cepstral coefficients (one tree for each state s), S trees for the logarithms of fundamental frequencies (one tree for each state s), S trees for aperiodicity parameters (one tree for each state s), and finally one tree for state duration.

SYNTHESIS PART
The procedure of synthesis of a given sentence into the corresponding speech is conducted through the following steps: (1) label generation; (2) HMM selection and concatenation; (3) parameter determination; and (4) excitation construction and filtering.

LABEL GENERATION AND HMM SELECTION/CONCATENATION
The synthesis procedure starts with the conversion of the utterance information of a given sentence into contextual labels, which are eventually used to select corresponding leaves from each one of the 3S + 1 decision-trees generated by the context-clustering procedure in the training stage. In the end, four HMM sequences are obtained: one for mel-cepstral coefficients, one for log(F0), one for aperiodicity parameters, and one for state durations.

PARAMETER DETERMINATION
The four above-mentioned HMM sequences are then used to derive mel-cepstral coefficients, log(F0) and aperiodicity parameters. The whole procedure is conducted as follows. Initially, the duration vectors {d_1, . . ., d_K}, where K is the number of HMMs in each sequence, are determined from the K S-dimensional Gaussian distributions, defining the state sequence s = {s_1, . . ., s_L}, with L being the number of frames of the utterance to be synthesized and s_i the HMM state to which the i-th frame belongs. After that, mel-cepstral coefficient vectors, {c^1, . . ., c^L}, aperiodicity parameters, {b^1, . . ., b^L}, and logarithms of the fundamental frequencies, {log(F0_1), . . ., log(F0_L)}, are determined from each corresponding HMM sequence so as to maximize their output probability given s, taking into account the delta and delta-delta components, according to the algorithm described in [23].
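The maximization over static and delta components reduces, per dimension, to a weighted least-squares problem. The sketch below is a simplified one-dimensional version of the parameter generation idea of [23], assuming a single first-order delta window Δc_t = (c_{t+1} - c_{t-1})/2, diagonal variances, and zero padding at the trajectory edges; it is illustrative, not the exact published algorithm.

```python
import numpy as np

def mlpg_1d(mu_static, var_static, mu_delta, var_delta):
    """Maximum-likelihood parameter generation for one scalar trajectory.

    Solves (W^T S^-1 W) c = W^T S^-1 mu, where W stacks the identity
    (static features) over a first-order delta window
    delta c_t = (c_{t+1} - c_{t-1}) / 2, with zero padding at the edges.
    """
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)                      # static rows
    for t in range(T):                     # delta rows
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.diag(1.0 / np.concatenate([var_static, var_delta]))
    A = W.T @ prec @ W
    b = W.T @ prec @ mu
    return np.linalg.solve(A, b)
```

When the delta means are consistent with the static means, the generated trajectory reproduces the static means exactly; otherwise the variances control how the two constraints are traded off, which is what smooths the generated parameter tracks.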

EXCITATION CONSTRUCTION AND FILTERING
The last step of the synthesis process is divided into two parts. In the first one, an excitation signal is derived from the sequences of generated fundamental frequency logarithms, {log(F0_1), . . ., log(F0_L)}, and aperiodicity parameters, {b^1, . . ., b^L}, using the same approach described in the high-quality vocoding method of [15], which is based on mixed excitation construction according to frequency subband strengths. In the second part, the speech waveform is generated with the utilization of the Mel Log Spectrum Approximation (MLSA) filter [24], whose corresponding coefficients are derived from the sequence of generated mel-cepstral coefficients, {c^1, . . ., c^L}.
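A toy version of the excitation construction step can be sketched as follows. Real mixed excitation is built per frequency subband from the aperiodicity strengths; here a single scalar aperiodicity weight per frame is assumed, and all names and defaults (frame length, sampling rate) are illustrative.

```python
import numpy as np

def build_excitation(logf0, aperiodicity, frame_len=80, fs=16000):
    """Toy mixed-excitation sketch: per frame, blend a pulse train
    (periodic part) with white noise (aperiodic part) according to a
    scalar aperiodicity weight in [0, 1]. Unvoiced frames (logf0 None)
    are pure noise. A real implementation mixes per frequency subband.
    """
    rng = np.random.default_rng(0)
    out = []
    phase = 0.0
    for lf0, ap in zip(logf0, aperiodicity):
        noise = rng.standard_normal(frame_len)
        if lf0 is None:
            out.append(noise)
            continue
        period = fs / np.exp(lf0)            # samples per pitch period
        pulses = np.zeros(frame_len)
        pos = phase
        while pos < frame_len:
            pulses[int(pos)] = np.sqrt(period)   # energy-normalized pulse
            pos += period
        phase = pos - frame_len              # keep pitch phase continuous
        out.append((1 - ap) * pulses + ap * noise)
    return np.concatenate(out)
```

The resulting signal would then drive the MLSA synthesis filter; carrying the pulse phase across frame boundaries avoids discontinuities in the pitch period.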

THE PHONE SET
The synthesizer employs a set of 40 phones, including long and short pause models, as the basic acoustic units, which are shown in Table 1 represented with the use of SAMPA (Speech Assessment Methods Phonetic Alphabet) [26]. Although diphthongs are sometimes considered as independent acoustic units due to their peculiar characteristics [27], with the formants of the initial vowel/semi-vowel smoothly changing into the formants of the succeeding semi-vowel/vowel, in this work they were not considered as independent units.

DEFINITION OF UTTERANCE INFORMATION
For the present synthesizer, utterance information corresponds to the basic text knowledge which is input to the system in order to generate speech. The utterance information defined henceforth is thus composed of the following parts:
• phone part: phone symbol;
• syllable part: syllable transcription and stress indication;
• word part: word transcription and POS tag.
Table 2 shows the utterance information for the sentence "Leila tem um lindo jardim" (Leila has a beautiful garden).

TEXT PROCESSING: UTTERANCE INFORMATION CONSTRUCTION
Utterance information can be derived by a natural language processing (NLP) module. According to the definition of the utterance information, the NLP module is required to perform the following procedures: (1) grapheme-phone conversion; (2) syllabication; (3) stress determination; and (4) POS tagging.
Although NLP is beyond the scope of this paper, in the following paragraphs the procedures carried out by a text processor specifically designed for the present synthesizer are briefly outlined. Details of the NLP module can be found in [30][31][32].

Grapheme-phone conversion
The grapheme-phone converter is rule-based [30], with a database of word exceptions. Most of these exceptions are composed of terms for which the transcription rules do not cover the problem of open-closed vowel alternation, though [33] presents some directions which could possibly solve this drawback. Special procedures are also applied in order to handle homographs [31].
Syllabication and stress determination

Even though considerable contributions have been reported concerning automatic syllabication for Portuguese TTS systems, e.g. [34,35], in the present case this task has been performed through the application of orthographic rules to the non-transcribed word tokens [36]. Stress is also determined before grapheme-phone conversion, according to the algorithm presented in [32].
POS tagging

The method of classification consists of verifying whether the input word belongs to a list of possible function words. If so, the word is classified as function, otherwise as content [37].
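The classification above can be sketched in a few lines; the function-word list below is a small illustrative subset of Brazilian Portuguese function words, not the actual list used by the NLP module.

```python
# Illustrative subset of Brazilian Portuguese function words
# (articles, prepositions, conjunctions, etc.).
FUNCTION_WORDS = {"o", "a", "os", "as", "um", "uma", "de", "em", "para",
                  "com", "que", "e", "ou", "não", "se", "por"}

def pos_tag(word):
    """Classify a token as 'function' if it appears in the list,
    otherwise as 'content', as described in the text."""
    return "function" if word.lower() in FUNCTION_WORDS else "content"
```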

THE CONTEXTUAL FACTORS
In speech synthesis, some factors usually need to be taken into account in order to provide a natural reproduction of the prosody. These factors might include context-dependent terms, such as the preceding/succeeding phone, syllable, word, phrase, etc., and are referred to as contextual factors in this paper, though the term features might also be employed [27].
The determination of contextual factors for a particular language is based on the prosodic characteristics of that language, and consequently linguistic assumptions should be considered. Besides this theoretical approach, empirical analysis can also be carried out in order to tune the features, by extending the factors that prove important and eliminating the ones which are not.
The contextual factors listed below, which correspond to the ones employed by the present synthesizer, were first derived from those used in HMM-based English speech synthesis [7] and eventually adjusted, through theoretical and empirical approaches, to the characteristics of the Brazilian Portuguese language:
• phone level;
• phrase level:
1. number of {syllables, words} in {preceding, current, succeeding} phrase;
2. position of current phrase in current utterance;
• utterance level:
1. number of {syllables, words, phrases} in the utterance.

FORMAT OF THE CONTEXTUAL LABEL
The contextual labels include all the information listed in Section 3.3.1 on a phone-by-phone basis. In other words, for each phone of the input utterance information, the whole set of features related to the respective phone is included in the corresponding label. Table 3 describes the label format for the synthesizer.
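The phone part of such a label can be sketched as follows, using the quinphone pattern LL^L-C+R=RR that appears in the example label sil^l-e+j=l shown later in the text; the padding symbol "x" for utterance edges and the function name are illustrative, and the syllable, word, phrase and utterance fields are omitted.

```python
def quinphone(phones, i):
    """Build the phone part of a contextual label (LL^L-C+R=RR),
    padding the utterance edges with the symbol 'x'."""
    ctx = ["x", "x"] + phones + ["x", "x"]
    ll, l, c, r, rr = ctx[i:i + 5]
    return f"{ll}^{l}-{c}+{r}={rr}"

# Phone sequence for the beginning of "Leila ...": sil l e j l a
labels = [quinphone(["sil", "l", "e", "j", "l", "a"], i) for i in range(6)]
```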

THE PROBLEM
Since for each phone of the speech database there is a corresponding contextual label which includes all its related features, a wide range of different contextual labels can result during the training stage of the synthesizer. Thus, it would be impractical to train such a large number of different HMMs from a relatively small database, and consequently the resulting models would not be adequately re-estimated during the training process. This problem can be illustrated through the following example. Considering the utterance information shown in Table 2, the corresponding contextual label for the phone /e/ of the word /lejla/ would be:

sil^l-e+j=l/M2:2 2 /S1:y @y-1 @3+0 @2/S2:1 2/S3:1 8/S4:0 2/S5:0 5/S6:e /W1:y #y-content #2+content #1/W2:1 5/W3:0 3/W4:0 2 /P1:y !y-8 !5+y !y/P2:1 1 /U:8 $5 &1

where the letter y means does not apply. Because of the contextual information attached to the phone /e/, which extends up to the utterance level, it is probable that only a few examples of the exact same label, if any at all, could be derived from the training corpus, although /e/ is one of the most frequent phones in Brazilian Portuguese.
Furthermore, during the synthesis stage, the utterance information of a given sentence may generate some contextual labels which do not correspond to any model in the trained set of HMMs.

THE SOLUTION: TREE-BASED CONTEXT-CLUSTERING
In order to solve the problems discussed above, a decision tree-based context-clustering technique is applied [38]. This technique has the property of training models robustly from a proportionally small database (solving the training problem) and of constructing unseen models (solving the synthesis problem).
Because the contextual factors are responsible for the spectral and prosodic characteristics of the system, how these features are clustered is of major importance. Therefore, the determination of the questions for context-clustering represents an important issue in order to achieve synthesized speech with good quality.
Questions about contextual factors

Several questions are applied for each feature listed in Section 3.3.1. As an example of application, the questions for the feature "position of current syllable in the word" are listed:
• Is current syllable in position 1 within the current word?
• Is current syllable in position 2 within the current word?
. . .
According to the classification above, some examples of questions are listed below:
• Is current phone a voiced fricative?
• Is pre-preceding phone voiced?
• Is succeeding phone an oral semi-vowel?
• Is post-succeeding phone a convex alveolar consonant?
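One way to represent such questions in code is as phone-set membership tests over a quinphone context; the SAMPA sets below are small illustrative subsets of the phone classes of Table 1, not the complete classes.

```python
# Illustrative subsets of the phone classes used by the questions.
VOICED_FRICATIVES = {"v", "z", "Z"}
ORAL_SEMIVOWELS = {"j", "w"}

def make_question(position, phone_set):
    """Return a predicate over (ll, l, c, r, rr) quinphone contexts,
    true when the phone at the given position belongs to phone_set."""
    index = {"ll": 0, "l": 1, "c": 2, "r": 3, "rr": 4}[position]
    return lambda ctx: ctx[index] in phone_set

# "Is current phone a voiced fricative?"
is_current_voiced_fricative = make_question("c", VOICED_FRICATIVES)
# "Is succeeding phone an oral semi-vowel?"
is_next_oral_semivowel = make_question("r", ORAL_SEMIVOWELS)
```

During clustering, each node of the decision tree would hold one such predicate and route every contextual label to its yes or no child.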
Questions concerning diphthongs

Questions regarding diphthongs are determined by considering diphones, in order to take advantage of the possible sequence of a vowel with its succeeding semi-vowel (descendant diphthong), or a semi-vowel with its succeeding vowel (ascendant diphthong). For example, the question "Is current phone part of a descendant diphthong?" would be true if the current phone was the vowel /e/ and its corresponding right context was the semi-vowel /j/, forming the descendant diphthong /ej/.

Table 3. Label format for the HMM-based Brazilian Portuguese synthesizer. "/Si:", "/Wi:" and "/Pi:" mean i-th syllable, word and phrase-based contextual information part, respectively, whereas "/U:" means utterance-based contextual information part.
• Does current phone form with right context a descendant diphthong?→ Is current phone a vowel and the succeeding one a semi-vowel?
• Is right context an ascendant diphthong?→ Is succeeding phone a semi-vowel and the post-succeeding one a vowel?
• Is left context a descendant diphthong?→ Is prepreceding phone a vowel and the preceding one a semivowel?
Questions considering phones under specific contexts

Aside from diphthongs, other questions concerning diphones, and even triphones, are also taken into account in order to track some peculiar properties of the units under certain contexts. Some of the questions are related to: vowels in the end of utterances (vowel followed by silence), which are normally uttered with lower intonation and energy; inter-word sequences of vowels, which tend to concatenate forming a diphthong, or even an allophonic realization of one of them; and vowels preceded by stops and followed by silences, e.g., the phone /i/ in the end of the word "qualidade" (quality) pronounced by a native of the northeastern part of Brazil. However, these questions might be effective only if these special situations occur in the recorded database, since the HMM-based speech synthesis technique tends to mimic the characteristics of the material from which the training is carried out.

THE CORPUS
The text material used to train the synthesizer comprised the 200 phonetically balanced sentences for Brazilian Portuguese spoken in Rio de Janeiro listed in [40], and the 21 phonetically balanced utterances from the joint project reported in [29]. The sentences were recorded by a male Brazilian speaker. The recorded utterances correspond to 18 minutes and 48 seconds of speech including silence regions, where the average duration of each utterance is approximately five seconds, with silence regions ranging from one to two seconds. The database was recorded at a sampling rate of 48 kHz with 16 bits per sample, and later downsampled to 16 kHz.
The phonetic labeling of the database was carried out using the phone set shown in Table 1. Time label boundaries were obtained by manual correction of the label boundaries generated by Viterbi alignment. Further, syllable and word labeling as described by the utterance information in Section 3.2 were also manually conducted for each sentence. Thus, the database information was carefully included in a considerably time-consuming process.

PARAMETER EXTRACTION
Fundamental frequencies, mel-cepstral coefficients and aperiodicity parameters were extracted from the speech corpus every 5 ms. Mel-cepstral coefficients were obtained through a 39-th order analysis (M = 39) with the utilization of 25-ms Blackman windows. The computation of aperiodicity components and the smoothing of mel-cepstral coefficients were carried out from the speech and F0 in such a way that the high-quality vocoding technique described in [15] could be applied during the synthesis. The dynamic parameters were derived according to the first-order regression formulas

Δx_i = (x_{i+1} - x_{i-1}) / 2,   Δ²x_i = x_{i-1} - 2x_i + x_{i+1},

where x_i corresponds to the original feature (log(F0), mel-cepstral coefficients or aperiodicity parameters) for the i-th frame, and Δx_i and Δ²x_i are the corresponding delta and delta-delta parameters, respectively. Each HMM had five states (S = 5).
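The delta and delta-delta computation can be sketched directly; edge frames are handled here by replicating the first and last values, which is an illustrative choice (zero padding is also common).

```python
def deltas(x):
    """First-order regression dynamics of a scalar feature track:
    dx_t = (x_{t+1} - x_{t-1}) / 2 and ddx_t = x_{t-1} - 2 x_t + x_{t+1},
    with edge frames handled by replicating the boundary values."""
    pad = [x[0]] + list(x) + [x[-1]]
    d = [(pad[i + 2] - pad[i]) / 2 for i in range(len(x))]
    dd = [pad[i] - 2 * pad[i + 1] + pad[i + 2] for i in range(len(x))]
    return d, dd

d, dd = deltas([0.0, 1.0, 2.0, 3.0])
```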

GENERATED DECISION-TREES
Figure 3 and Figure 4 show respectively the top part of the decision-trees generated for mel-cepstral coefficients and log(F0) (for the third HMM state, i.e., s = 3), whereas Figure 5 shows the decision-tree generated for state duration, when the training procedure was concluded. Table 4 shows the number of leaves derived at the end of the process for each tree, as well as the model reduction rate, which corresponds to the ratio between the number of states after and before performing context-clustering.
By observing Figure 3, Figure 4 and Figure 5, and based on the assumption that top nodes are more important with respect to the parameter being clustered, one can notice that knowledge regarding syllable, word, phrase and utterance is more crucial for log(F0) and state duration. On the other hand, questions related to phones are more significant for the tree of mel-cepstral coefficients.
Table 5 presents, for each generated decision-tree, the number of nodes concerning the information input to the synthesizer, namely: phone, syllable, stress, and POS.

Figure 5. Top of the decision-tree constructed to cluster the distribution for state durations. The terms "C ", "L ", "R " and "RR " stand for current, left, right and after-the-right contexts, respectively.

It should be noted that nodes regarding phrase, utterance and word (aside from POS) are not listed. From Table 5 the importance of phone, syllable, stress and POS for the quality of the synthesized speech can be figured out. Thus, assuming the total number of questions related to each of these pieces of information as the evaluation parameter, the following order of importance can be retrieved: (1) phone; (2) syllable; (3) stress; and (4) POS.

EXAMPLE OF SYNTHESIS
Figure 6 shows the spectrograms for the natural utterance "Quando eu vim para cá, eu sempre gostei de jogar futebol" (Since I came here, I have always enjoyed playing football) and its synthesized version. It should be noted that the referred sentence was not part of the training database 2. Aside from the reproduction of the phones, it can also be observed from Figure 6 that the synthesized utterance presents a speaking rate similar to that of the natural speech. This represents an important characteristic of the HMM-based speech synthesis approach: the ability to mimic the prosody of the speech corpus used to train the system.
Although it has been reported that it is possible to synthesize speech even with a database as small as eighty utterances [8], the lack of data strongly affects the quality. When the HMMs do not properly capture the characteristics of the several contextual labels derived from the training database, inconsistent parameters might be generated during the synthesis part, consequently resulting in synthesized speech with poor quality. Because of the trade-off between the number of contextual labels and the amount of speech material, the severity of this problem might depend on the language. For example, for Japanese, which has a small phonetic system, intelligible synthesized speech (although with badly reproduced prosody) can be achieved even by training on sixty utterances.

INFLUENCE OF SOME CONTEXTUAL FACTORS ON THE SYNTHESIZED SPEECH
A subjective evaluation was conducted in order to investigate the importance for the synthesized speech of: (1) POS; (2) syllable; and (3) syllable stress. In addition to empirically improving the speech quality by tuning the contextual factors, the analysis also contributed to the development of the NLP module which has been applied to the synthesizer.
The tests were divided into two parts so as not to fatigue the listeners who took part in them. The decision of which factors should be evaluated in the first and second tests was taken so as to match the difficulty levels of the NLP module. Hence, the first test attempted to verify the importance of carrying out POS tagging and/or syllabification. The second test, which concerns the importance of syllable stress, was conducted in order to investigate the need to proceed beyond syllabification.

INFLUENCE OF POS AND SYLLABLE
To evaluate the influence of POS and syllable on the synthesized speech, versions of the synthesizer were built under the conditions of no POS and no syllable information. This was achieved by training the original system with the exclusion of all the questions for context-clustering which concerned POS and syllable, respectively. In this way, all the features related to POS and syllable were automatically excluded from the training and synthesis procedures.
The perceptual evaluation corresponded to a forced AB comparison test, whose procedure is described as follows. Each test sentence was synthesized into three different utterances: the original, no-POS, and no-syllable-information versions. The resulting utterances were combined in pairs so that three different test pairs were obtained for each test sentence. Each subject had to listen, for each test sentence, to the three pairs of utterances, giving for each pair his/her preference concerning which utterance presented better quality. The order of the pairs as well as the order of the utterances within the respective pairs were randomly chosen for each listener. In case of indistinguishable quality, the subjects were instructed to choose the first synthesized utterance of the corresponding test pair, so that options A and B could have equal probability of being chosen.
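The forced-choice protocol, including the tie-breaking rule that keeps options A and B equally likely, can be sketched as follows; the judge callback and all names are illustrative.

```python
import random

def ab_preferences(pairs, judge, rng=random.Random(0)):
    """Tally a forced AB comparison test.

    Each pair is shuffled before presentation; judge(a, b) returns the
    preferred item, or None for 'cannot distinguish', which is resolved
    in favour of the first item played, so that under ties options A
    and B stay equally likely.
    """
    tally = {}
    for a, b in pairs:
        first, second = (a, b) if rng.random() < 0.5 else (b, a)
        choice = judge(first, second) or first
        tally[choice] = tally.get(choice, 0) + 1
    return tally
```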
To perform the test, a total of ten sentences which were not used to train the system were randomly chosen from a list of twenty phrases, selected from a newspaper database through a genetic algorithm [41]. Thus, each listener had to listen to thirty different pairs of utterances. Among the eleven Brazilian listeners who participated in the test, four were speech processing specialists. Figure 7(a) shows the choices of the listeners according to each test pair, whereas Figure 7(b) presents the overall preference. It can be observed that the lack of information related to syllable and POS degrades the quality of the synthesized speech, with the absence of syllable information being more severely sensed.

INFLUENCE OF SYLLABLE STRESS
For the evaluation of the influence of syllable stress on the speech quality, the original synthesizer was trained with the exclusion of all the questions for context-clustering related to syllable stress.
The subjective test was performed in the same way as the previous test, in which the influence of POS and syllable was verified. However, for the present case only two utterance versions, namely, original and no stress information, were compared. The same listeners who participated in the previous test took part in this one.
Figure 7(c) shows the preference of the listeners for this case, where it can be seen that the lack of features related to syllable stress information strongly degrades the quality of the synthesized speech.

DISCUSSION
Comparing the results shown in Figure 7 with the number of nodes of the generated decision-trees presented in Table 5, a correlation can be observed between the influence of POS, syllable and stress on the synthesized speech and the corresponding number of nodes. However, it should be stated that a more precise evaluation should also take into account whether those nodes belong to the upper or lower parts of the generated trees.
By observing Figure 7(a) and Figure 7(c), another point that might raise discussion is that the absence of stress apparently degrades the synthesized speech more than that of syllable, even though stress information corresponds to a subset of the syllable information according to the way in which the experiment was carried out. This difference might have occurred due to the conditions in which the tests were conducted, since syllable information was tested jointly with POS in the first evaluation, whereas stress was separately evaluated in a test where there was only one pair of utterances for each sentence.

COMPARISON OF THE CURRENT SYNTHESIZER WITH OTHER SYSTEMS
A Mean Opinion Score (MOS) test was conducted to compare the HMM-based synthesizer with four other systems available for Brazilian Portuguese.
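For reference, the MOS of a system is simply the mean of all its 1-to-5 ratings across listeners and sentences; a minimal sketch, with the nested-list layout being an illustrative choice:

```python
def mean_opinion_score(ratings):
    """MOS: each listener rates each utterance on a 1-5 scale;
    the score of a system is the mean over all collected ratings.
    `ratings` is a list of per-listener rating lists."""
    flat = [r for listener in ratings for r in listener]
    return sum(flat) / len(flat)
```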

THE SYNTHESIZERS
The synthesizers considered in the subjective test were:
• the Brazilian Portuguese engine of the MBROLA Project [42];
• the optional free TTS engine which can be used in the DOSVOX system, i.e., the Lernout & Hauspie engine [43];
• two commercial systems (Commercial System 1 and Commercial System 2).
All the systems described above are based on the unit selection and concatenation method. The first synthesizer is part of the MBROLA Project, which aims to provide free speech synthesis engines for several languages. The DOSVOX system is a free operating environment for the visually impaired. It includes its own speech synthesizer and offers the possibility of using other engines; for this test, the Lernout & Hauspie engine for Brazilian Portuguese was considered, which has also been applied to other multilingual TTS-based applications. Commercial System 1 and Commercial System 2 were kept unidentified to avoid any sort of implication.

THE SENTENCES
For the test, the same twenty sentences selected from the newspaper database used in the test of Section 4.1 were employed. It should be noted that the utterance information produced by the NLP modules connected to the HMM-based and MBROLA synthesizers was manually corrected in order to avoid transcription and/or stress-related errors in the synthesized speech. No such manual correction was carried out for the other systems, since it was impossible to access the intermediate parts of their respective TTS engines. Nevertheless, according to listening tests they were able to perform text processing with no apparent errors, ensuring a fair comparison among the synthesizers.

THE SUBJECTS
A total of twenty Brazilian subjects participated in the test. Since the intention was to evaluate the overall quality of the synthesizers from the viewpoint of the general user, the chosen listeners had no training and were not familiar with the speech processing area.

THE RESULTS
Figure 8 shows the overall MOS obtained for each synthesizer. It can be seen that, except for Commercial System 2, the HMM-based synthesizer outperforms the other systems.
Although the HMM-based system achieved the second-best score in terms of overall quality, the difference between it and Commercial System 2 is considerable. The quality of the database seemed to be crucial for the decision of the listeners. Among the evaluated synthesizers, Commercial System 2 was the only engine with a female voice. Furthermore, that system appears to be designed from a speech inventory of considerable size, since most of the time it was hard to detect discontinuity distortions in the synthesized speech. This sort of artifact is typical of unit selection and concatenation-based systems derived from limited corpora, although it is possible to reduce its effect through prosody modification techniques (e.g., [44]).
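An overall MOS is simply the mean of the 1-5 ratings collected per system, usually reported with a confidence interval so that differences such as the one discussed above can be judged. A minimal sketch, with hypothetical ratings (not the test's actual data):

```python
import math

def mos(scores):
    """Mean Opinion Score with a normal-approximation 95% interval.

    `scores` holds one 1-5 rating per (listener, sentence) judgment;
    20 listeners rating 20 sentences would give 400 ratings per system.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)                      # 95% half-width
    return mean, (mean - half, mean + half)

# Hypothetical ratings for one system:
ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 3]
m, ci = mos(ratings)
```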

CONCLUSION AND FUTURE WORK
This paper described a Brazilian Portuguese speech synthesizer and its corresponding characteristics. The system is based on a technique wherein the speech waveform is generated from parameters directly derived from HMMs. The main advantage of this approach, when compared with other techniques, is the possibility of obtaining synthesized speech with good quality, and of modifying voice characteristics/styles, even from relatively small speech databases. It was shown that, with the proper application of questions for context-clustering and determination of the list of features, the HMMs, through the generated decision-trees, were able to track the characteristics of the Brazilian Portuguese language. In order to investigate the value of the information input to the synthesizer, subjective tests verified that POS tags, syllable and stress information are important for the quality of the synthesized speech. Also, according to a MOS test performed with listeners not familiar with the speech processing area, the synthesizer in question performed well when compared to other systems based on the unit selection and concatenation method.
Future work concerns the utilization of larger corpora (at least one hour of speech) and experiments related to speaker adaptation as well as synthesis of voices with different styles and expressions (beyond read speech).

Figure 1. Block diagram illustrating the basic procedures conducted by the speech synthesis engine.

Figure 4. Top of the decision-tree constructed to cluster the third HMM state for log(F0). The terms "C", "L", "R" and "RR" stand for the current, left, right and after-the-right contexts, respectively.

Figure 6. Figure 6 shows the spectrograms for the natural utterance "Quando eu vim para cá, eu sempre gostei de jogar futebol" (Since I came here, I have always enjoyed playing football) and its synthesized version. It should be noted that the referred sentence was not part of the training database. Aside from the reproduction of the phones, it can also be observed from Figure 6 that the synthesized version presents speaking

Figure 8. Overall result of the MOS test comparing the synthesizer with four other systems.

Table 1. Phone set employed by the synthesizer as the basic acoustic units, with some corresponding examples.

Table 2. Utterance information for "Leila tem um lindo jardim" (Leila has a beautiful garden).

Volume 21, Número 2, Agosto 2006

• word level: 1. part-of-speech of {preceding, current, succeeding} word; 2. number of syllables in {preceding, current, succeeding} word;

s10 position of current syllable in current phrase (backward)
s11 number of stressed syllables before current syllable in current phrase
s12 number of stressed syllables after current syllable in current phrase
s13 number of syllables, counting from the previous stressed syllable to the current syllable in the utterance
s14 number of syllables, counting from the current syllable to the next stressed syllable in the utterance
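Features such as s11 and s12 above are straightforward counts over the stress marks of the syllables in a phrase. A minimal sketch, assuming a hypothetical boolean-list representation of the stress pattern (the actual utterance information produced by the NLP modules is richer than this):

```python
def syllable_stress_features(phrase_stress, i):
    """Compute features s11 and s12 from the list above for syllable `i`:
    the number of stressed syllables before and after it in the phrase.

    `phrase_stress` is a list of booleans, one per syllable, marking
    lexical stress.
    """
    s11 = sum(phrase_stress[:i])
    s12 = sum(phrase_stress[i + 1:])
    return s11, s12

# Hypothetical stress pattern for an eight-syllable phrase:
stress = [True, False, True, False, True, False, False, True]
```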

Table 4. Number of leaves for each generated decision-tree. The total number of models according to the contextual labels (logical models) is 6273.

Table 5. Number of nodes with questions related to phone, syllable, stress and POS for each generated decision-tree.
Figure 5. Top of the decision-tree constructed to cluster the third HMM state for mel-cepstral coefficients. The terms "C", "L" and "R" stand for the current, left and right contexts, respectively.
Figure 7. Results of the comparison tests: (a) for the first test, listeners' preferences according to each test pair; (b) for the first test, listeners' preferences according to each synthesized version; and (c) for the second test, listeners' preferences according to each synthesized version.