eSpeak

eSpeakNG is a compact, open-source, software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size. Much of the programming for eSpeakNG's language support is done using rule files with feedback from native speakers.

eSpeakNG
Original author(s): Jonathan Duddington
Developer(s): Reece Dunn
Initial release: February 2006
Stable release: 1.51 / 2 April 2022
Repository: github.com/espeak-ng/espeak-ng/
Written in: C
Operating system: Linux, Windows, macOS, FreeBSD
Type: Speech synthesizer
License: GPLv3
Website: github.com/espeak-ng/espeak-ng/

Because of its small size and many languages, it is included as the default speech synthesizer in the NVDA[1] open source screen reader for Windows, as well as Android,[2] Ubuntu[3] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016[4] and was used by Google Translate for 27 languages in 2010;[5] 17 of these were subsequently replaced by commercial voices.[6]

The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.[7] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English.[8] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.[9] Development on Speak continued until version 1.14, when it was renamed to eSpeak.

Development of eSpeak continued from 1.16 (there was no 1.15 release)[9] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first to be version controlled using Subversion,[10] with separate source and binary downloads made available on SourceForge.[9] From version 1.27, eSpeak used the GPLv3 license.[11] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.[12] The last development release of eSpeak was 1.48.15 on 16 April 2015.[13]

eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.[14]

eSpeak NG

On 25 June 2010,[15] Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.

On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.[16][17]

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the 8 months since the last eSpeak development release. This evolved into discussions about continuing development of eSpeak in Jonathan's absence.[18][19] The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

On 11 December 2015, the espeak-ng fork was started.[20] The first release of espeak-ng was 1.49.0 on 10 September 2016,[21] containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.

It supports Speech Synthesis Markup Language (SSML).
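
As a rough illustration of SSML input, the sketch below builds a small SSML document in Python. The helper name `build_ssml` is hypothetical, and which SSML elements eSpeakNG actually honours is an assumption here; the `<speak>` and `<break>` elements themselves come from the SSML specification.

```python
import xml.etree.ElementTree as ET

def build_ssml(text, pause_ms=None):
    """Return a minimal SSML document as a string.

    Hypothetical helper: it only builds markup and does not call eSpeakNG.
    """
    speak = ET.Element("speak")
    speak.text = text
    if pause_ms is not None:
        # <break time="..."/> requests a pause of the given length.
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")

if __name__ == "__main__":
    print(build_ssml("Hello", pause_ms=500))
```

The resulting string could then be passed to the command-line program with markup interpretation enabled (the `-m` option).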

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
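
The naming scheme above (a language code, optionally followed by "+" and a variant name) can be sketched as follows. The function names `voice_spec` and `build_command` are hypothetical; only the `-v` option and the "af+f2" example come from the text.

```python
import shlex

def voice_spec(language, variant=None):
    """Combine a language code and an optional voice variant, e.g. 'af+f2'."""
    if not language:
        raise ValueError("language code must not be empty")
    return f"{language}+{variant}" if variant else language

def build_command(text, language, variant=None):
    """Build an argument list for the espeak-ng program (it is not executed here)."""
    return ["espeak-ng", "-v", voice_spec(language, variant), text]

if __name__ == "__main__":
    # Print the shell-quoted command for the Afrikaans voice with the f2 variant.
    print(shlex.join(build_command("Goeie more", "af", "f2")))
```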

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.

Phonetic representations can be included within text input by including them within double square-brackets. For example: espeak-ng -v en "Hello [[w3:ld]]" will say Hello world in English.
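
A minimal sketch of mixing plain text with a phonetic representation from Python is shown below. The helpers `with_phonemes` and `speak` are hypothetical names; `speak` only runs the program if an `espeak-ng` executable is found on the PATH.

```python
import shutil
import subprocess

def with_phonemes(plain_text, phonemes):
    """Append a phoneme string to plain text, wrapped in double square brackets."""
    return f"{plain_text} [[{phonemes}]]"

def speak(text, voice="en"):
    """Run espeak-ng if it is installed; return True when it was invoked."""
    if shutil.which("espeak-ng") is None:
        return False
    subprocess.run(["espeak-ng", "-v", voice, text], check=True)
    return True

if __name__ == "__main__":
    # Same input as the command-line example: "Hello [[w3:ld]]".
    speak(with_phonemes("Hello", "w3:ld"))
```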

Synthesis method

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which text-to-speech translation step the user wants to use.

Step 1: text-to-phoneme translation

There are many languages (notably English) which don't have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks).
  2. Pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced in a monotone way).

To add intonation to speech, prosody data are necessary (e.g. syllable stress, falling or rising pitch of the fundamental frequency, pauses), along with other information that allows more human, non-monotonous speech to be synthesized. For example, in eSpeakNG format a stressed syllable is marked with an apostrophe: z'i@r0ks, which is spoken with intonation and sounds more natural.

For comparison, two samples without and with prosody data:

  1. [[DIs Iz m0noUntoUn spi:tS]] is spoken in a monotone way
  2. [[DIs Iz 'Int@n,eItI2d sp'i:tS]] is spoken with intonation

If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.

Step 2: sound synthesis from prosody data

eSpeakNG provides two different types of formant speech synthesis: its own eSpeakNG synthesizer and a Klatt synthesizer:[22]

  1. The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by additive synthesis, adding together sine waves to make the total sound. Unvoiced consonants, e.g. /s/, are made by playing recorded sounds,[23] because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer. However, it also produces sounds by subtractive synthesis: it starts with generated noise, which is rich in harmonics, and then applies digital filters and enveloping to obtain the frequency spectrum and sound envelope needed for a particular consonant (s, t, k) or sonorant (l, m, n) sound.
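
The two approaches can be contrasted with a toy sketch. This is not eSpeakNG's implementation: the bell-shaped formant envelope, the function names and the default parameters are assumptions made for illustration. The resonator uses the two-pole difference equation y[n] = A·x[n] + B·y[n-1] + C·y[n-2] described by Klatt.[22]

```python
import math
import random

def formant_envelope(freq_hz, formants):
    """Toy spectral envelope: each (centre_hz, bandwidth_hz, gain) adds a bell-shaped peak."""
    return sum(gain * math.exp(-((freq_hz - centre) / bandwidth) ** 2)
               for centre, bandwidth, gain in formants)

def additive_voiced(f0_hz, formants, duration_s=0.05, sample_rate=22050):
    """Additive synthesis: sum sine waves at the harmonics of f0, weighted by the envelope."""
    n_samples = int(duration_s * sample_rate)
    harmonics = []
    k = 1
    while k * f0_hz < sample_rate / 2:  # stay below the Nyquist frequency
        freq = k * f0_hz
        harmonics.append((freq, formant_envelope(freq, formants)))
        k += 1
    samples = [
        sum(amp * math.sin(2 * math.pi * freq * i / sample_rate) for freq, amp in harmonics)
        for i in range(n_samples)
    ]
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    return [s / peak for s in samples]  # normalise to the range -1..1

def resonator(signal, centre_hz, bandwidth_hz, sample_rate=22050):
    """Two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * centre_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def subtractive_noise(centre_hz, bandwidth_hz, n_samples=1000, seed=0):
    """Subtractive synthesis: shape white noise with a resonant filter."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    return resonator(noise, centre_hz, bandwidth_hz)

if __name__ == "__main__":
    vowel = additive_voiced(120, [(700, 100, 1.0), (1200, 120, 0.5)])
    hiss = subtractive_noise(4000, 500)
    print(len(vowel), len(hiss))
```

The additive path needs an explicit list of harmonics, which suits periodic voiced sounds; the subtractive path starts from a source that already contains all frequencies and removes what is not wanted, which suits noise-like sounds.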

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program using the PHO file format and captures the audio output produced by MBROLA. That audio is then handled by eSpeakNG.
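
As an illustration of the PHO format, the sketch below writes one phoneme per line: the phoneme name, its duration in milliseconds, then optional pairs of position (percent of the duration) and pitch (Hz). The function name `format_pho` and the sample segments are hypothetical; real phoneme names depend on the MBROLA voice, and eSpeakNG's own phoneme mapping is not reproduced here.

```python
def format_pho(segments):
    """Render (phoneme, duration_ms, [(position_pct, pitch_hz), ...]) tuples as PHO text."""
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        fields = [phoneme, str(duration_ms)]
        for position_pct, pitch_hz in pitch_points:
            fields += [str(position_pct), str(pitch_hz)]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # "_" denotes silence; the vowel falls from 120 Hz to 110 Hz over its duration.
    print(format_pho([("_", 100, []), ("a", 80, [(0, 120), (100, 110)])]), end="")
```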

Languages

eSpeakNG performs text-to-speech synthesis for the following languages:[24][25]

  1. Abaza
  2. Abenaki
  3. Achinese
  4. Adyghe
  5. Afar
  6. Afrikaans[26]
  7. Albanian[27]
  8. Amharic
  9. Apache
  10. Arabela
  11. Ancient Greek
  12. Arabic[note 1]
  13. Aragonese[28]
  14. Arapaho
  15. Armenian (Eastern Armenian)
  16. Armenian (Western Armenian)
  17. Aromanian
  18. Assiniboine
  19. Assamese
  20. Avaric
  21. Awadhi
  22. Azerbaijani
  23. Bashkir
  24. Basque
  25. Basic English
  26. Belarusian
  27. Bengali
  28. Bhojpuri
  29. Bodo
  30. Bishnupriya Manipuri
  31. Bosnian
  32. Bulgarian[28]
  33. Breton
  34. Burmese
  35. Caddo
  36. Cahuilla
  37. Cantonese[28]
  38. Carrier
  39. Catalan[28]
  40. Catawba
  41. Cayuga
  42. Cebuano
  43. Chamorro
  44. Chechen
  45. Cherokee
  46. Cheyenne
  47. Chhattisgarhi
  48. Chichewa
  49. Chickasaw
  50. Chinese (Mandarin)
  51. Chitonga
  52. Chittagonian
  53. Choctaw
  54. Conestoga
  55. Corsican
  56. Croatian[28]
  57. Crow
  58. Czech
  59. Chuvash
  60. Church Slavonic
  61. Crimean Tatar
  62. Dakota
  63. Danish[28]
  64. Dogri
  65. Dogrib
  66. Dutch[28]
  67. Dzongkha
  68. English (American)[28]
  69. English (British)
  70. English (Caribbean)
  71. English (Lancastrian)
  72. English (Received Pronunciation)
  73. English (Scottish)
  74. English (West Midlands)
  75. Esperanto[28]
  76. Estonian[28]
  77. Ewe
  78. Eyak
  79. Finnish[28]
  80. Filipino
  81. Fon
  82. Fox
  83. French (Belgian)[28]
  84. French (France)
  85. French (Swiss)
  86. Frisian
  87. Gagauz
  88. Galician
  89. Garhwali
  90. Garifuna
  91. Garo
  92. Georgian[28]
  93. German[28]
  94. Greek (Modern)[28]
  95. Greenlandic
  96. Guarani
  97. Gujarati
  98. Gwichin
  99. Haida
  100. Haisla
  101. Hakka Chinese[note 3]
  102. Haitian Creole
  103. Hän
  104. Haryanvi
  105. Hausa
  106. Hawaiian
  107. Hebrew
  108. Hidatsa
  109. High Valyrian
  110. Hiligaynon
  111. Hindi[28]
  112. Hmong
  113. Ho-Chunk
  114. Hopi
  115. Hungarian[28]
  116. Hunsrik
  117. Iban
  118. Ibibio
  119. Icelandic[28]
  120. Igbo
  121. Iloko
  122. Indonesian[28]
  123. Ido
  124. Interlingua
  125. Interlingue
  126. Irish[28]
  127. Italian[28]
  128. Japanese[note 4][29]
  129. Javanese
  130. Kannada[28]
  131. Kansa
  132. Kashmiri
  133. Kazakh
  134. Kedah Malay
  135. Khakas
  136. Khmer
  137. Klingon
  138. Kʼicheʼ
  139. Kirundi
  140. Kikuyu
  141. Kinyarwanda
  142. Konkani[30]
  143. Korean
  144. Krio
  145. Kumyk
  146. Kurdish[28]
  147. Kyrgyz
  148. Quechua
  149. Ladakhi
  150. Lakota
  151. Lao
  152. Latin
  153. Ladino
  154. Latgalian
  155. Latvian[28]
  156. Lang Belta
  157. Lingua Franca Nova
  158. Lepcha
  159. Lezgi
  160. Limbu
  161. Limburgish
  162. Lingala
  163. Lithuanian
  164. Lojban[28]
  165. Luganda
  166. Luxembourgish
  167. Macedonian
  168. Madurese
  169. Magahi
  170. Maguindanao
  171. Maithili
  172. Makassarese
  173. Malagasy
  174. Malay[28]
  175. Malayalam[28]
  176. Maltese
  177. Mandan
  178. Manipuri
  179. Māori
  180. Marathi[28]
  181. Marwari
  182. Minangkabau
  183. Mizo
  184. Mohawk
  185. Mongolian
  186. Montenegrin
  187. Nahuatl (Classical)
  188. Navajo
  189. Nepali[28]
  190. Norwegian (Bokmål)[28]
  191. Northern Sotho
  192. Nogai
  193. Odia
  194. Omaha-Ponca
  195. Oneida
  196. Onondaga
  197. Oromo
  198. Occitan
  199. Pampanga
  200. Papiamento
  201. Palauan
  202. Pashto
  203. Pawnee
  204. Persian[28]
  205. Persian (Latin alphabet)[note 2]
  206. Polish[28]
  207. Portuguese (Brazilian)[28]
  208. Portuguese (Portugal)
  209. Punjabi[31]
  210. Pyash (a constructed language)
  211. Quapaw
  212. Romanian[28]
  213. Raramuri
  214. Russian[28]
  215. Russian (Latvia)
  216. Sadri
  217. Salar
  218. Samoan
  219. Sanskrit
  220. Santali
  221. Scottish Gaelic
  222. Seneca
  223. Serbian[28]
  224. Shan (Tai Yai)
  225. Sharda
  226. Sesotho
  227. Shipibo
  228. Shona
  229. Sicilian
  230. Sindhi
  231. Sinhala
  232. Slovak[28]
  233. Slovenian
  234. Somali
  235. Spanish (Spain)[28]
  236. Spanish (Latin American)
  237. Spanish (United States)
  238. Stoney
  239. Sundanese
  240. Surjapuri
  241. Swahili[26]
  242. Swedish[28]
  243. Sylheti
  244. Tajik
  245. Tamil[28]
  246. Tatar
  247. Telugu
  248. Tibetan
  249. Tswana
  250. Thai
  251. Turkmen
  252. Turkish[28]
  253. Tatar
  254. Uyghur
  255. Ukrainian
  256. Urarina
  257. Urdu
  258. Uzbek
  259. Vietnamese (Central Vietnamese)[28]
  260. Vietnamese (Northern Vietnamese)
  261. Vietnamese (Southern Vietnamese)
  262. Volapük
  263. Wayuu
  264. Welsh
  265. Wolof
  266. Xavante
  267. Xhosa
  268. Yiddish
  269. Yoruba
  270. Yucateco
  271. Zulu
  272. Zuni

Notes

  1. Currently, only fully diacritized Arabic is supported.
  2. Persian written using English (Latin) characters.
  3. Currently, only Pha̍k-fa-sṳ is supported.
  4. Currently, only Hiragana and Katakana are supported.

References

  1. Switch to eSpeak NG in NVDA distribution #5651
  2. eSpeak TTS for Android
  3. espeak-ng package in Ubuntu
  4. "Download voices for Immersive Reader, Read Mode, and Read Aloud".
  5. Google blog, Giving a voice to more languages on Google Translate, May 2010
  6. Google blog, Listen to us now, December 2010.
  7. eSpeak Speech Synthesizer 3. LANGUAGES
  8. http://espeak.sourceforge.net/
  9. "ESpeak: Speech synthesis - Browse /Espeak at SourceForge.net".
  10. Subversion history (revision 1)
  11. Subversion history (revision 56)
  12. "Espeak: Downloads".
  13. http://espeak.sourceforge.net/test/latest.html
  14. van Leussen, Jan-Willem; Tromp, Maarten (26 July 2007). "Latin to Speech": 6. CiteSeerX 10.1.1.396.7811.
  15. "Build: Allow portaudio 18 and 19 to be switched easily. · rhdunn/Espeak@63daaec". GitHub.
  16. "Espeakedit: Fix argument processing for unicode argv types · rhdunn/Espeak@61522a1". GitHub.
  17. "Switch to eSpeak NG in NVDA distribution · Issue #5651 · nvaccess/Nvda". GitHub.
  18. Taking ownership of the eSpeak project and its future
  19. Vote for new main eSpeak developer
  20. Rebrand the espeak program to espeak-ng.
  21. espeak-ng 1.49.0
  22. Klatt, Dennis H. (1980). "Software for a cascade/parallel formant synthesizer". Journal of the Acoustical Society of America, 67(3), March 1980.
  23. List of recorded fricatives in eSpeakNG
  24. "ESpeak NG Text-to-Speech". GitHub. 13 February 2022.
  25. "ESpeak NG Text-to-Speech". GitHub. 22 October 2021.
  26. Butgereit, L., & Botha, A. (2009, May). Hadeda: The noisy way to practice spelling vocabulary using a cell phone. In The IST-Africa 2009 Conference, Kampala, Uganda.
  27. Hamiti, M., & Kastrati, R. (2014). Adapting eSpeak for converting text into speech in Albanian. International Journal of Computer Science Issues (IJCSI), 11(4), 21.
  28. Kayte, S., & Gawali, D. B. (2015). Marathi Speech Synthesis: A review. International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
  29. Pronk, R. (2013). Adding Japanese language synthesis support to the eSpeak system. University of Amsterdam.
  30. Mohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). Text Reader for Konkani Language. Automation and Autonomous System, 4(8), 409-414.
  31. Kaur, R., & Sharma, D. (2016). An Improved System for Converting Text into Speech for Punjabi Language using eSpeak. International Research Journal of Engineering and Technology, 3(4), 500-504.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.