Tonal language speech synthesis using indirect pitch markers and quantitative target approximation methods
Abstract
Tone synthesis plays an important role in text-to-speech systems for tonal languages. It involves two key steps: determining the pitch markers of voice utterances and synthesizing F0 trajectories for lexical tones. In this paper, we propose two efficient algorithms. The first locates pitch markers at the peaks of the cumulative signal of each voiced part of the input utterance. The second generates F0 trajectories of tones from the quantitative target approximation (qTA) parameters of Xu's model. Experiments show that the proposed algorithms locate pitch markers with high accuracy, which in turn makes it possible to generate tones with complex shapes.
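To make the second step concrete, the following is a minimal sketch of F0 generation in the commonly published form of the qTA model, not the implementation evaluated in the paper. Each syllable has a linear pitch target x(t) = m·t + b, and F0 approaches it as f0(t) = x(t) + (c1 + c2·t + c3·t²)·exp(−λt), with c1, c2 and c3 fixed by the F0 value, velocity and acceleration carried over from the previous syllable. The function names, the sampling step, the assumption that F0 is in semitones, and the parameter values in the usage note are illustrative assumptions. How the parameters are estimated from speech is not shown.

```python
import math


def qta_syllable(m, b, lam, duration, state0, step=0.005):
    """Generate the F0 contour of one syllable under the qTA model.

    m, b     : slope and height of the linear pitch target x(t) = m*t + b
    lam      : rate of target approximation (1/s)
    duration : syllable duration in seconds
    state0   : (f0, f0', f0'') at syllable onset
    step     : sampling step in seconds (assumed value, not from the paper)
    Returns (times, f0_values, final_state).
    """
    f0_0, d1_0, d2_0 = state0
    # Coefficients of the transient, from the initial conditions at t = 0.
    c1 = f0_0 - b
    c2 = d1_0 + c1 * lam - m
    c3 = (d2_0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def evaluate(t):
        # Return F0 and its first two derivatives at time t.
        decay = math.exp(-lam * t)
        p = c1 + c2 * t + c3 * t * t
        dp = c2 + 2.0 * c3 * t
        ddp = 2.0 * c3
        f0 = m * t + b + p * decay
        d1 = m + (dp - lam * p) * decay
        d2 = (ddp - 2.0 * lam * dp + lam * lam * p) * decay
        return f0, d1, d2

    n = int(round(duration / step))
    times = [i * step for i in range(n + 1)]
    values = [evaluate(t)[0] for t in times]
    return times, values, evaluate(duration)


def qta_utterance(targets, onset_f0):
    """Chain syllables: the final state of one is the onset state of the next.

    targets : list of (m, b, lam, duration) tuples, one per syllable
    Returns a list of (time, f0) pairs for the whole utterance.
    """
    state = (onset_f0, 0.0, 0.0)  # assume F0 starts at rest
    contour = []
    offset = 0.0
    for m, b, lam, dur in targets:
        times, values, state = qta_syllable(m, b, lam, dur, state)
        contour.extend((offset + t, v) for t, v in zip(times, values))
        offset += dur
    return contour
```

For example, `qta_utterance([(0.0, 90.0, 40.0, 0.2), (-20.0, 88.0, 30.0, 0.2)], 85.0)` produces a contour that rises toward a level target and then moves toward a falling one. Passing the final state across the syllable boundary keeps the contour continuous there, which is what lets a sequence of simple linear targets produce complex tone shapes.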
Copyright (c) 2019 Journal of the Belarusian State University. Mathematics and Informatics
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.