Tonal languages speech synthesis using an indirect pitch markers and the quantitative target approximation methods

Ta Yen Thai; Hoang Ngo Huy; Dao Van Tuyet; Sergey V. Ablameyko; Nguyen Van Hung; Doan Van Hoa

doi:10.33581/2520-6508-2019-3-105-121

Ta Yen Thai, Hanoi University of Business and Technology, 29A Vinh Tuy Street, Vinh Tuy Ward, Hai Ba Trung Dist, Hanoi, Vietnam
Hoang Ngo Huy, Electric Power University, Vietnam Ministry of Industry and Trade, 235 Hoang Quoc Viet Street, Co Nhue, Tu Liem, Hanoi 129823, Vietnam
Dao Van Tuyet, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus; Binh Duong University, 504 Binh Duong Avenue, Thu Dau Mot Town 820000, Binh Duong Province, Vietnam. ORCID: https://orcid.org/0000-0002-3194-8844
Sergey V. Ablameyko, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus
Nguyen Van Hung, Military Institute of Science and Technology, 17 Hoang Sam Street, Nghia Do Ward, Cau Giay District, Hanoi, Vietnam
Doan Van Hoa, Military Institute of Science and Technology, 17 Hoang Sam Street, Nghia Do Ward, Cau Giay District, Hanoi, Vietnam

Abstract

Tone synthesis plays an important role in text-to-speech systems for tonal languages. Its two key steps are determining the pitch markers of utterances and synthesizing F0 trajectories for lexical tones. In this paper, we propose two efficient algorithms: the first locates pitch markers at the peaks of the cumulative signal of each voiced part of the input utterance, and the second generates F0 trajectories of tones from the quantitative target approximation (qTA) parameters of the Xu model. Experiments show that the proposed algorithms locate pitch markers with high accuracy, which makes it possible to generate tones with complex shapes.
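
The second step can be illustrated with a minimal sketch of qTA-style F0 generation. It is not the authors' implementation: it assumes the third-order critically damped formulation commonly given for qTA in the literature, in which each syllable has a linear pitch target x(t) = m*t + b approached at rate lambda, and the F0 value, velocity and acceleration are carried over at syllable boundaries. The function names, the 5 ms step and the sample target values are hypothetical and chosen only for illustration.

```python
import math


def qta_syllable(m, b, lam, duration, state, step=0.005):
    """Return (samples, end_state) for one syllable.

    m, b     : slope and height of the linear pitch target x(t) = m*t + b
    lam      : rate of target approximation (1/s), must be positive
    duration : syllable duration in seconds
    state    : (f0, velocity, acceleration) inherited at syllable onset
    step     : sampling step in seconds (assumed value)
    """
    f0_0, v0, a0 = state
    # Coefficients of the transient term, fixed by the onset state.
    c1 = f0_0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def evaluate(t):
        # f0(t) = x(t) + (c1 + c2*t + c3*t^2) * exp(-lam*t), plus derivatives.
        decay = math.exp(-lam * t)
        p = c1 + c2 * t + c3 * t * t
        dp = c2 + 2.0 * c3 * t
        ddp = 2.0 * c3
        f0 = m * t + b + p * decay
        vel = m + (dp - lam * p) * decay
        acc = (ddp - 2.0 * lam * dp + lam * lam * p) * decay
        return f0, vel, acc

    n = int(round(duration / step))
    samples = [evaluate(i * step)[0] for i in range(n)]
    return samples, evaluate(duration)


def synthesize_f0(targets, f0_start, step=0.005):
    """Chain syllables; each target is a tuple (m, b, lam, duration)."""
    state = (f0_start, 0.0, 0.0)  # assume the utterance starts at rest
    contour = []
    for m, b, lam, duration in targets:
        samples, state = qta_syllable(m, b, lam, duration, state, step)
        contour.extend(samples)
    return contour


if __name__ == "__main__":
    # Hypothetical targets: a level tone followed by a falling tone.
    demo = [(0.0, 95.0, 40.0, 0.2), (-20.0, 90.0, 40.0, 0.2)]
    for value in synthesize_f0(demo, f0_start=90.0)[::10]:
        print(round(value, 2))
```

Because the onset state of each syllable is the end state of the previous one, the resulting contour is continuous across syllable boundaries, which is how a sequence of simple linear targets can produce surface tone shapes that look complex.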

Author Biographies

Ta Yen Thai, Hanoi University of Business and Technology, 29A Vinh Tuy Street, Vinh Tuy Ward, Hai Ba Trung Dist, Hanoi, Vietnam

lecturer at the faculty of informatics

Hoang Ngo Huy, Electric Power University, Vietnam Ministry of Industry and Trade, 235 Hoang Quoc Viet Street, Co Nhue, Tu Liem, Hanoi 129823, Vietnam

PhD (informatics); vice dean of the faculty of informatics

Dao Van Tuyet, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus; Binh Duong University, 504 Binh Duong Avenue, Thu Dau Mot Town 820000, Binh Duong Province, Vietnam

senior researcher at the Biomedical Informatics Center, Binh Duong University; postgraduate student at the department of web-technologies and computer simulation, faculty of mechanics and mathematics, Belarusian State University

Sergey V. Ablameyko, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

academician of the National Academy of Sciences of Belarus, doctor of science (engineering), full professor; professor at the department of web-technologies and computer simulation, faculty of mechanics and mathematics

Nguyen Van Hung, Military Institute of Science and Technology, 17 Hoang Sam Street, Nghia Do Ward, Cau Giay District, Hanoi, Vietnam

PhD (informatics); lecturer at the faculty of informatics

Doan Van Hoa, Military Institute of Science and Technology, 17 Hoang Sam Street, Nghia Do Ward, Cau Giay District, Hanoi, Vietnam

PhD (informatics); lecturer at the faculty of informatics
