A Rule-Based Method for Homograph Disambiguation in Brazilian Portuguese Text-to-Speech Systems

This work presents a set of rule-based algorithms that decide the pronunciation of homographs in a Brazilian Portuguese (BP) text-to-speech (TTS) system. The proposed approach is composed of a morphosyntactic analysis, which deals with homographs that belong to different parts of speech (POS), and a semantic analysis, which deals with homographs that belong to the same POS. The algorithms were implemented to solve ambiguities for 111 homograph pairs organized into 23 disambiguation algorithms, and tested with three types of texts: news, Bible and literature. Computer experiments showed that the correct homograph pronunciation is obtained in 99.00% of the occurrences.


I. INTRODUCTION
In text-to-speech (TTS) systems, deciding the pronunciation of heterophonic homographs is a nontrivial problem. In Brazilian Portuguese (BP), whenever a homograph appears, the algorithms that perform grapheme-to-phone (G2P) conversion need to decide whether the stressed vowel is open ([E]/[O]) or closed ([e]/[o]) [1]. Words such as <seca> (noun, "the drought", and verb, "he dries") have the same spelling but different meanings and pronunciations. If such words are not correctly analyzed, they may give rise to a wrong phonetic transcription.
Homographs usually represent a small percentage of the analyzed text (about 1.0% in the text database used in this work), but in the context of speech synthesis, mistaken phonetic transcriptions lower the evaluation of a TTS system even when they occur only a few times. Therefore, minimizing G2P errors for homographs is fundamental to obtaining a satisfactory evaluation of a TTS system.
Homographs have been widely studied in several languages. [2] presents a typology of homograph pairs in the English language and some traditionally used disambiguation techniques, such as Bayesian classifiers, n-gram taggers and decision trees, and proposes a hybrid system that combines the best of the three approaches. In [3], the subject is treated in languages such as Thai, Chinese and Japanese, in which words have no word-boundary delimiter, and a pattern recognition approach called "winnow" is proposed to solve the word segmentation and homograph ambiguity problems together. [4] presents a study on the relation between Chinese characters and their pronunciations and also considers a solution for the disambiguation of polyphonic characters. Regarding disambiguation in European Portuguese TTS systems, [5] and [6] use morphosyntactic information, while in [7] the disambiguation is obtained through morphosyntactic as well as semantic information. For Brazilian Portuguese, a morphosyntactic analyzer is applied in [8] and [9], and both morphosyntactic and semantic approaches are presented in [10] and [11], but the algorithms were designed for only one homograph.
In this work, a set of rule-based algorithms is proposed to solve homograph disambiguation in a BP TTS system [12]. The proposed approach is composed of a morphosyntactic analysis, which deals with homographs that belong to different POS, and a semantic analysis, which deals with homographs that belong to the same POS. Modifications produced by a recent orthographic agreement for the Portuguese language [13] are also taken into account. The algorithms were implemented to solve ambiguities for 111 homograph pairs organized into 23 disambiguation algorithms, and tested with three types of texts: news, Bible and literature. The overall rate of correct homograph pronunciation achieved in computer experiments is 99.00%.
This work is organized as follows. In Section II, the proposed method for homograph disambiguation and its characteristics are described. In Section III, computer experiments with data extracted from the CETENFolha text database [14], the Holy Bible [15] and Brazilian literature [16] are presented. Finally, Section IV contains our conclusions.

II. APPLIED METHODOLOGY
The proposed method relies on the following libraries:
• A lemmas library, which features the Portuguese Jspell dictionary with approximately 34 000 morphologically annotated words [17];
• An irregular verbs library, with the inflected forms of the main irregular verbs in BP;
• A library consisting of the verb "to be" in the third person followed by an adjective;
• A restricted lexical combinations library, with idiomatic expressions, proverbs, or fixed expressions of one or more words. This library is used only in the semantic analysis;
• A Wordnets library, developed under the concept of Wordnets [18], [19], with words that are semantically and cognitively related to the analyzed homograph. This library is also required only in the semantic analysis.
In the processing, the text is split into words and phrases. The system searches for every homograph and applies the algorithm corresponding to its type.
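The processing step described above (split the text into phrases and words, find each homograph, apply the algorithm of its type) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `HOMOGRAPH_TYPE`, `ALGORITHMS`, `split_phrases` and `disambiguate` are hypothetical, the type number assigned to <seca> is assumed, and the registered algorithm is a placeholder.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical homograph table: word -> homograph type number.
# <seca> is the paper's example; the type number used here is an assumption.
HOMOGRAPH_TYPE: Dict[str, int] = {"seca": 1}


def placeholder_algorithm(words: List[str], i: int) -> str:
    """Stand-in for a real disambiguation algorithm; always answers 'closed'."""
    return "closed"


# One disambiguation routine per homograph type (only type 1 is registered).
ALGORITHMS: Dict[int, Callable[[List[str], int], str]] = {
    1: placeholder_algorithm,
}


def split_phrases(text: str) -> List[List[str]]:
    """Split text into phrases at punctuation, then into lowercase words."""
    phrases = re.split(r"[.!?;:]+", text)
    return [re.findall(r"\w+", p.lower()) for p in phrases if p.strip()]


def disambiguate(text: str) -> List[Tuple[str, str]]:
    """Return (homograph, decision) for every homograph found in the text."""
    decisions = []
    for words in split_phrases(text):
        for i, word in enumerate(words):
            htype = HOMOGRAPH_TYPE.get(word)
            if htype is not None:
                # Dispatch to the algorithm registered for this type.
                decisions.append((word, ALGORITHMS[htype](words, i)))
    return decisions
```

Keeping the type lookup separate from the per-type routines mirrors the organization of 111 homograph pairs into 23 algorithms: adding a homograph only requires a new table entry when its type already has an algorithm.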
The homographs that belong to different POS and to the same POS are shown in Table II and in Table III, respectively. As shown in Table II, the most frequent oppositions are between nouns and verbs, from the morphological point of view, and between [e]/[E] and [o]/[O], from the phonetic point of view. The evidence indicates that the stressed vowel is typically closed in nouns and open in verbal forms. Type 1 and 2 homographs represent 61.3% of the total number of homographs in the test library. Type 13, 14, 15 and 20 homographs need both morphosyntactic and semantic analysis.
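For homographs of the same POS, the semantic analysis draws on the restricted lexical combinations library and the Wordnets library. The sketch below shows one way such a check could look: it reports whether the homograph lies inside a known fixed expression or shares its phrase with a semantically related word. The library entries and function names are invented for illustration and are not taken from the actual libraries.

```python
from typing import List, Set

# Hypothetical library contents, invented for this example only.
FIXED_EXPRESSIONS: List[List[str]] = [["forma", "de", "bolo"]]
RELATED_WORDS: Set[str] = {"bolo", "assar", "forno"}


def in_fixed_expression(words: List[str], i: int,
                        expressions: List[List[str]]) -> bool:
    """True if words[i] lies inside a fixed expression present in the phrase."""
    for expr in expressions:
        n = len(expr)
        # Candidate start positions are those whose window covers index i.
        first = max(0, i - n + 1)
        last = min(i, len(words) - n)
        for start in range(first, last + 1):
            if words[start:start + n] == expr:
                return True
    return False


def semantic_match(words: List[str], i: int) -> bool:
    """True if either library gives semantic evidence for the homograph."""
    if in_fixed_expression(words, i, FIXED_EXPRESSIONS):
        return True
    # Otherwise look for any related word elsewhere in the same phrase.
    return any(w in RELATED_WORDS for j, w in enumerate(words) if j != i)
```

For example, `semantic_match(["pus", "a", "forma", "no", "forno"], 2)` returns `True` because a related word occurs in the phrase, whereas `semantic_match(["a", "forma", "do", "texto"], 1)` returns `False`.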
All the proposed algorithms, from Algorithm 1 (Homograph type 1) up to Algorithm 23 (Homograph type 23), can be found in the Appendix. The symbols used in the algorithms are listed in Table IV. Algorithm 16 was included to account for the recently signed Orthographic Agreement [13]. This agreement is only orthographic; therefore, it is restricted to the written language and does not affect any aspect of the spoken language.
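The appendix algorithms test the words and tags at positions relative to the homograph (P-1, P-2, P+1, and so on). The sketch below shows how the two conditions of Algorithm 1 could be evaluated over a tagged phrase. It is illustrative only: the underscore spellings of the tags (for example `P_DEM`, `P_PESS_SU`) are assumed, the helper names are hypothetical, and the function returns the branch that fires rather than a pronunciation.

```python
from typing import Iterable, List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, tag); tag spellings are assumed


def tag_at(tokens: List[Token], i: int, offset: int) -> Optional[str]:
    """Tag at position i + offset, or None when outside the phrase."""
    j = i + offset
    return tokens[j][1] if 0 <= j < len(tokens) else None


def word_at(tokens: List[Token], i: int, offset: int) -> Optional[str]:
    """Word at position i + offset, or None when outside the phrase."""
    j = i + offset
    return tokens[j][0] if 0 <= j < len(tokens) else None


def any_tag(tokens: List[Token], i: int,
            offsets: Iterable[int], tags: Set[str]) -> bool:
    """True if any of the given relative positions carries one of the tags."""
    return any(tag_at(tokens, i, k) in tags for k in offsets)


def algorithm1_branch(tokens: List[Token], i: int) -> str:
    """Report which branch of Algorithm 1 fires for the homograph at i."""
    if (any_tag(tokens, i, [-1], {"P_DEM", "P_IND", "P_INT", "P_POSS"})
            or any_tag(tokens, i, [-1, -2, -3], {"A_IND"})
            or any_tag(tokens, i, [-1, -2], {"HN", "CONTR", "PREPO"})
            or word_at(tokens, i, 1) == "que"
            or tag_at(tokens, i, 1) == "P_RELA"):
        return "first"
    if (any_tag(tokens, i, [-1], {"P_PESS_SU", "P_PESS_O_1", "CS"})
            or any_tag(tokens, i, [1], {"PREPO", "CONTR", "P_PESS_O_1", "HN"})
            or (tag_at(tokens, i, 1) == "A_IND"
                and tag_at(tokens, i, 2) == "nc")
            or word_at(tokens, i, -1) in {"não", "nunca"}
            or word_at(tokens, i, -2) in {"não", "nunca"}):
        return "second"
    return "default"
```

Returning `None` for out-of-range positions lets every condition be written without explicit bounds checks, since `None` never matches a tag or a word.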

III. COMPUTER EXPERIMENTS
The proposed algorithms were tested with three different types of texts: news, Bible and literature. The results can be found in Tables V, VI and VII.
The CETENFolha text database, built by the Computational Processing of Portuguese Project, is a corpus containing approximately 24 million words in BP extracted from the Folha de São Paulo newspaper [14]. The system was tested with a random extract containing 1 564 591 words, in which 20 308 homographs were detected (1.30% of the processed text). The text was processed and a correctness rate of 99.00% was achieved.
The second database is a text-format version of the Holy Bible in BP [15]. It is composed of 750 000 words and has a more formal style than the CETENFolha database. This test detected 7 904 homographs (1.05% of the processed text), and a correctness rate of 99.00% was achieved.
The text from Brazilian literature [16] is composed of 70 000 words. It is a novel narrated in the first person. A correctness rate of 99.00% was achieved.
Most of the errors occurred while running Algorithms 1 and 2, when the homograph was followed by a preposition or contraction, or preceded by conjugated verbal forms. The performance of the proposed algorithms did not vary significantly with the type of text.
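The homograph proportions reported for the CETENFolha extract and the Bible follow directly from the word and homograph counts given in this section. The short check below reproduces them; the function name `percentage` is ours.

```python
def percentage(part: int, total: int) -> float:
    """Share of `part` in `total`, in percent, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Counts reported in the text for each corpus: (homographs, words).
cetenfolha = percentage(20308, 1564591)
bible = percentage(7904, 750000)

print(f"CETENFolha: {cetenfolha:.2f}% of the words are homographs")
print(f"Holy Bible: {bible:.2f}% of the words are homographs")
```

Both values agree with the 1.30% and 1.05% quoted above.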

IV. CONCLUSIONS
This work presented a set of algorithms based on linguistic rules for homograph disambiguation in a BP TTS system. The proposed algorithms are capable of determining the correct pronunciation of 111 pairs of homographs in BP. The algorithms are based on morphosyntactic and semantic analysis. The algorithm set was implemented and tested on a randomly chosen extract of a newspaper text database, the Holy Bible and a text from Brazilian literature. An overall correct pronunciation rate of 99.00% was achieved in computer experiments.

APPENDIX PROPOSED ALGORITHMS
Algorithm 1
1: if (Word is a homograph of the type 1) then
2:   if (P-1 = P DEM, P IND, P INT or P POSS) or (P-1, P-2 or P-3 = A IND) or (P-1 or P-2 = HN, CONTR or PREPO) or (P+1 = <que> or P RELA) then
3:
4:   else if (P-1 = P PESSO SU, P PESS O 1 or CS) or (P+1 = PREPO, CONTR, P PESS O 1 or HN) or (P+1 = A IND and P+2 = nc) or (P-1 or P-2 = <não> or <nunca>) then
5:
6:   else
7:

Algorithm 3
1: if (Word is a homograph of the type 3) then
2:   if (P+1 = <pelo>, ad or adv) or (P-2 or P-3 = A IND or HN) or (P-1 = <que>, <ele>, <ela>, <se>, <não>, <já>, <as>, nc, CC or CS) or (P-1 or P-2 = P DEM, P IND, P INT or P POSS) or (P+1 = <e> and P+2 = <rebola>) then
3:
4:   else if (P+1 = <de>, <do>, <da>, <dos>, <das> or CONTR) or (P-1 or P-2 = <lá>, <cá> or <aí>) or (P-1 or P-2 ends with <-mente>) or (P-1 or P-2 begins with <deit->, <deix->, <atir->, <empat->, <consider->, <fic->, <est-> or <jog->) or (P-1 = <borda>, <jantar>, <comer>, <noite>, <mundo>, <dia>, <tarde>, <por>, <de> or <para>) or (P-1 ends with <-ar>, <-er> or <-ir>) then

Algorithm 11
1: if (Word is a homograph of the type 11) then
2:   if (P+1 = <ti>, <mim> or <si>, HN, P PESS SU or P PESS O 1) or (P-1 = P PESS SU or P PESS O 1 and P+1 = A IND) or (P-1, P-2 or P-3 = VERB or VERB IRR) or (P-1 = nc or P PESS SU and P+1 or P+2 = nc) then
3:
4:   else if (P-1 = P PESS SU, P PESS O 1 or CS) or (P-1 or P-2 = <não> or <nunca>) or ((P-1 or P-2 = <que> or <ainda>) and (P+1 = A IND)) or (P+1 = PREPO, CONTR or P PESS O 1) then

Algorithm 13
1: if (Word is a homograph of the type 13) then
2:   if (The homograph is inside the BC forma o) or (WN forma o is on F0) or (P-1 = <uma> and the word is <corte>) or (P-1 = <um> and the word is <molho> or <soco>) then
3:

Denilson da Cruz da Silva received the B.Sc. degree in telecommunication engineering from the Federal Center of Technological Education of Rio de Janeiro (CEFET-RJ), Rio de Janeiro, Brazil, in 1999, and the M.Sc. and D.Sc. degrees in electrical engineering from the Federal University of Rio de Janeiro (UFRJ/COPPE), Rio de Janeiro, Brazil, in 2005 and 2011, respectively. He is currently working with the Brazilian Air Force. His research interests include emotional speech synthesis, natural language processing and robust speech recognition.

Daniela Braga holds a degree in Linguistics (2000) from the University of Oporto, Portugal, a Master's in Linguistics from the University of Minho, Portugal, and a European PhD in Speech Synthesis (2008) from the University of A Coruña, Spain. From 2000 to 2006 she was a Researcher in Speech Technology at the University of Oporto (Portugal) as well as Assistant Lecturer at the Universities of Oporto (Portugal) and A Coruña (Spain). She has participated in national and international R&D projects and consortia (FP6 and FP7 funded networks and projects, COST actions, QREN national funded projects) since 2001. From Nov. 2006 to Nov. 2010, she was the head of the Text-to-Speech and Language Expansion team at MLDC, the Microsoft Language Development Center (Lisbon), where she was responsible for end-to-end product life cycles and several linguistic-related feature areas. From Nov. 2010 to Nov. 2011, she was a Program Manager in the Speech team at Microsoft in Beijing, where she was the technical manager responsible for the Prosody enhancement, the technology roadmap and the TTS release for Windows 8. In Nov. 2011, she moved to the Microsoft headquarters in the US, joining the Information Platform and Experiences (IPE) team in Redmond, WA, where she is responsible for driving the crowdsourcing data collection strategy for IPE, including the Speech team. She is author and co-author of over 70 papers in Text-to-Speech Conversion, Speech Synthesis, Phonetics, Prosody, and Speech Recognition, and has been a member of the scientific committees of several international conferences and journals on Speech and Language Processing.

Fernando Gil Vianna Resende Junior received the B.Sc. degree from the Military Institute of Engineering (IME), Brazil, in 1990, and the M.Sc. and Ph.D. degrees from the Tokyo Institute of Technology (TIT), Japan, in 1994 and 1997, respectively, all in electrical engineering. Since 1998 he has been with the Department of Electronic Engineering and Computer Science, Polytechnic School, Federal University of Rio de Janeiro (UFRJ), as Associate Professor. Since 2003 he has also been with the Program of Electrical Engineering, COPPE/UFRJ. His research interests are in the areas of natural language processing, speech synthesis, speech and speaker recognition, and speech coding.

TABLE I. HOMOGRAPH SET SPLIT BY TYPE.

TABLE II. EXAMPLES WITH HOMOGRAPHS THAT BELONG TO DIFFERENT POS.

TABLE III. EXAMPLES WITH HOMOGRAPHS THAT BELONG TO THE SAME POS.

TABLE IV. APPLIED SYMBOLOGY IN THE DISAMBIGUATION ALGORITHMS.

TABLE V. TESTS WITH PROPOSED ALGORITHM - CETENFOLHA.

TABLE VI. TESTS WITH PROPOSED ALGORITHM - HOLY BIBLE.

TABLE VII. TESTS WITH PROPOSED ALGORITHM - BRAZILIAN LITERATURE.