Gnuspeech is an extensible text-to-speech software package that produces artificial speech output using real-time articulatory speech synthesis by rules. It converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory speech synthesizer; and uses these parameters to drive an articulatory model of the human vocal tract, producing output suitable for the normal sound output devices used by various computer operating systems. For adult speech, it does this at the same rate as, or faster than, the speech is spoken.
Developer(s) | Trillium Sound Research |
---|---|
Initial release | 2002 |
Stable release | 0.9[1] / 14 October 2015 |
Repository | |
Platform | Cross-platform |
Type | Text-to-speech |
License | GNU General Public License |
Website | www |
Design
The synthesizer is a tube resonance, or waveguide, model that simulates the behavior of the real vocal tract directly and reasonably accurately, unlike formant synthesizers, which model the speech spectrum indirectly.[2] The control problem is solved by using René Carré's Distinctive Region Model,[3] which relates changes in the radii of eight longitudinal divisions of the vocal tract to corresponding changes in the three formant frequencies in the speech spectrum that convey much of the information of speech. The regions are, in turn, based on work by the Stockholm Speech Technology Laboratory[4] of the Royal Institute of Technology (KTH) on "formant sensitivity analysis", that is, how formant frequencies are affected by small changes in the radius of the vocal tract at various places along its length.[5]
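As a rough illustration of how a tube model turns vocal tract radii into acoustic behavior, the sketch below computes the reflection coefficient at each junction between adjacent tube sections, in the style of a textbook Kelly-Lochbaum tube model. It is not taken from the Gnuspeech source: the function name, the example radii, and the sign convention are assumptions chosen for illustration, and the sections are simply listed in order from glottis to lips without modeling their lengths.

```python
import math


def junction_reflection_coefficients(radii_cm):
    """Return one reflection coefficient per junction between adjacent sections.

    Assumes circular cross-sections and the convention
    k = (A_left - A_right) / (A_left + A_right); other texts flip the sign.
    All radii must be positive.
    """
    areas = [math.pi * r * r for r in radii_cm]
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]


# Hypothetical radii (cm) for eight regions, glottis to lips; illustrative only.
example_radii = [0.8, 0.8, 1.0, 1.2, 1.2, 1.0, 0.6, 0.9]
coefficients = junction_reflection_coefficients(example_radii)
```

Eight sections give seven junctions. Where two neighboring sections have the same radius the coefficient is zero, so a wave passes through unchanged; a narrowing or widening produces a partial reflection, which is what shapes the resonances of the tube.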
History
Gnuspeech was originally commercial software produced by the now-defunct Trillium Sound Research for the NeXT computer as various grades of "TextToSpeech" kit. Trillium Sound Research was a technology transfer spin-off company formed at the University of Calgary, Alberta, Canada, based on long-standing research in the computer science department on computer-human interaction using speech, where papers and manuals relevant to the system are maintained.[6] The initial version in 1992 used a formant-based speech synthesizer. When NeXT ceased manufacturing hardware, the synthesizer software was completely rewritten[7] using the waveguide approach to acoustic tube modeling, based on research at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford University, especially the Music Kit, and was also ported to NSFIP (NextStep For Intel Processors). The synthesis approach is explained in more detail in a paper presented to the American Voice I/O Society in 1995.[8] To run the waveguide (also known as the tube model), the system used the onboard 56001 Digital Signal Processor (DSP) on the NeXT computer, and a Turtle Beach add-on board with the same DSP on the NSFIP version. Because the sample rate for the waveguide computations increases as the vocal tract length decreases, speed limitations meant that the shortest vocal tract length that could be used for speech in real time (that is, generated at the same or faster rate than it was "spoken") was around 15 centimeters. Faster processor speeds are progressively removing this restriction, an important advance for producing children's speech in real time.
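The link between tract length and sample rate can be sketched with a simplified waveguide in which sound travels exactly one tube section per sample period. Under that assumption the required sample rate is the speed of sound divided by the section length. The function name, the default of 10 sections, and the 350 m/s speed of sound are illustrative assumptions, not values confirmed from the Gnuspeech source.

```python
def required_sample_rate(tract_length_cm, sections=10, speed_of_sound_m_s=350.0):
    """Sample rate (Hz) at which sound crosses one section per sample.

    Simplified model: the tract is split into `sections` equal-length pieces,
    so a shorter tract means shorter pieces and a higher sample rate.
    """
    if tract_length_cm <= 0 or sections <= 0:
        raise ValueError("tract length and section count must be positive")
    section_length_m = (tract_length_cm / 100.0) / sections
    return speed_of_sound_m_s / section_length_m


# Compare an adult-sized tract with progressively shorter ones.
rates = {length: required_sample_rate(length) for length in (17.5, 15.0, 10.0)}
```

Under these assumptions a 17.5 cm tract needs roughly 20 kHz, while a 10 cm tract needs roughly 35 kHz, so the per-second computation grows as the tract shortens. This is why a fixed-speed DSP imposes a lower bound on the tract length that can be synthesized in real time.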
Since NeXTSTEP is discontinued and NeXT computers are rare, one option for executing the original code is the use of virtual machines. The Previous emulator, for example, can emulate the DSP in NeXT computers, which can be used by the Trillium software.
Trillium ceased trading in the late 1990s, and the Gnuspeech project was first entered into the GNU Savannah repository in 2002 under the terms of the GNU General Public License, as official GNU software.
Due to its free and open source license, which allows customization of the code, Gnuspeech has been utilized in academic research.[9][10]
References
- ^ https://directory.fsf.org/wiki/gnuspeech
- ^ COOK, P.R. (1989) Synthesis of the singing voice using a physically parameterized model of the human vocal tract. International Computer Music Conference, Columbus Ohio
- ^ CARRE, R. (1992) Distinctive regions in acoustic tubes. Speech production modelling. Journal d'Acoustique, 5, 141-159
- ^ Now Department for Speech, Music and Hearing
- ^ FANT, G. & PAULI, S. (1974) Spatial characteristics of vocal tract resonance models. Proceedings of the Stockholm Speech Communication Seminar, KTH, Stockholm, Sweden
- ^ Relevant U of Calgary website
- ^ The Tube Resonance Model Speech Synthesizer
- ^ HILL, D.R., MANZARA, L. & TAUBE-SCHOCK, C-R. (1995) Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95 14th Annual International Voice Technologies Conf, San Jose, 12-14 September 1995, 27-44
- ^ D'Este, F. - Articulatory Speech Synthesis with Parallel Multi-Objective Genetic Algorithm. Master's Thesis, Leiden Institute of Advanced Computer Science, 2010.
- ^ Xiong, F.; Barker, J. - Deep Learning of Articulatory-Based Representations and Applications for Improving Dysarthric Speech Recognition. ITG Conference on Speech Communication, Germany, 2018.