Speech synthesizers
Speech synthesis is the method of generating artificial speech by mechanical means or by a computer algorithm. The software can read texts aloud and should be able to create new sentences.[1] Consequently, it is used when there is a need to express information acoustically. Nowadays, it is found in text-to-speech applications (screen text reading, assistance for the visually impaired) and virtual assistants (GPS navigation, mobile assistants such as Google Now or Apple Siri). In addition, it is employed in any other situation where information usually available in text has to be communicated acoustically. It often comes paired with voice recognition.[2]
These applications require speech that is intelligible and natural-sounding. Current speech synthesis systems achieve a great deal of naturalness compared to a real human voice. Yet they are still perceived as non-human because minor audible glitches remain in the output utterances.[3] It may be that modern speech synthesis has reached the 'uncanny valley', where artificial speech comes so close to perfection that humans find it unnatural.[4]
Main Characteristics
Speech synthesizers create synthetic speech by various methods. Articulatory synthesis mimics how the human vocal tract works and how the air passes through it. Formant synthesis manipulates sounds to create the basic building blocks of speech. Finally, concatenation synthesis assembles new utterances from a large database of pre-recorded audio samples.[5]
The speech synthesizer can be a device, such as DECtalk, which is used by Stephen Hawking,[6] or it can be an app for smartphones, which is more widespread at present.[7]
Articulatory Synthesis
Articulatory synthesis is the oldest form of speech synthesis. It tries to mimic the entire human vocal tract and the way the air passes through it when speaking. Modern synthesizers of this kind are based on the acoustic tube model, which formant synthesis also uses. But compared to those models, articulatory synthesis treats the entire system as the parameter and the tubes themselves as the controls. This allows the creation of very complex models, but with increasing complexity the difficulty of effectively tracking the behaviour of the innards of the system increases, because a small change in the tube settings propagates into complex speech patterns. This same property also makes articulatory synthesis attractive, because researchers do not need to model complex formant trajectories explicitly. In other words, artificial speech is produced by controlling the modelled vocal tract and not by controlling the produced sound itself.[8]
This synthesis simulates the airflow through the vocal tract of a human. Unlike formant synthesis, the entire modelled system is controlled as a whole, as it does not comprise connected, discrete modules. By changing the characteristics of the components and tubes of the articulatory system, different aspects of the produced voice can be achieved.[9]
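As a rough illustration of the acoustic tube idea, the vocal tract can be approximated as a chain of cylindrical sections whose cross-sectional areas are the control parameters; the reflection coefficients between adjacent sections then determine how an acoustic wave scatters on its way through the tract. The following is a minimal Python sketch with purely illustrative area values, not the model of any cited synthesizer:

    import numpy as np

    def reflection_coefficients(areas):
        # Kelly-Lochbaum junction coefficients between adjacent tube
        # sections: k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}).
        # A coefficient of 0 lets the wave pass straight through;
        # values near +/-1 reflect most of it back.
        a = np.asarray(areas, dtype=float)
        return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

    # Two coarse 8-section tract shapes (areas in cm^2, illustrative):
    uniform = [2.6] * 8                                  # neutral tube
    vowel_a = [1.0, 1.2, 1.6, 2.2, 3.2, 4.4, 5.2, 5.0]   # rough [a]-like shape

    print(reflection_coefficients(uniform))  # all zeros: no scattering
    print(reflection_coefficients(vowel_a))  # non-zero: shaped resonances

Changing a single area value alters several coefficients at once, which hints at why small articulatory adjustments propagate into complex acoustic patterns.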
Articulatory synthesis deals with two inherent problems. First, the model must strike a balance between being precise but complex and being simplistic but observable and controllable. Because the vocal tract is very complex, models that mimic it closely are inherently very complicated, and this makes them difficult to track and control. This is connected to the second problem: it is important to base the model of the vocal tract on data that is as precise as possible. Modern imaging methods such as MRI or X-rays are used to observe the human vocal tract and how it works when producing speech. This data is then used to construct the articulatory model, but such methods are often intrusive, and the model is difficult to derive from their data.[9]
Both these obstacles diminish the quality of the speech produced by these models compared to modern methods, and articulatory synthesis was largely abandoned for the purposes of speech synthesis. It is still used, however, for research into the physiology and phonology of the human vocal tract, or for visualising the human head when speaking.[9]
Formant Synthesis
Formant synthesis is the first truly crystallized method of synthesizing speech, and it was the dominant method of speech synthesis until the 1980s. Some call it synthesis 'from scratch' or 'synthesis by rule'.[9] In contrast to the more recent approach of concatenation synthesis, formant synthesis does not use sampled audio. Instead, the sound is created from scratch. The result is intelligible, clear-sounding speech, but listeners immediately notice that it was not spoken by a human. The model is too simplistic and does not accurately reflect the subtleties of a real human vocal tract, in which many different articulators work together.[5]
Formant synthesis uses a modular, model-based approach to create artificial speech. Synthesizers utilizing this method rely on an observable and controllable model of an acoustic tube. This model typically has the layout of a vocal tract with two systems functioning in parallel. The sound is generated by a source and then fed into a vocal tract model. The vocal tract is modelled so that the oral and nasal cavities are separate and parallel, and the sound passes through only one of them, depending on the type of sound needed at the moment (e.g., whether it is nasalized). Both cavities are combined at the output and pass through a radiation component that simulates the sound characteristics of real lips and nose.[9]
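The source-filter structure described above can be sketched in a few lines of Python (a minimal, hypothetical illustration, not the layout of any cited system): an impulse-train source stands in for the glottis, and a cascade of second-order resonators supplies one formant each. The formant frequencies and bandwidths below are typical textbook values for the vowel [a]:

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000  # sampling rate in Hz

    def formant_filter(x, freq, bw):
        # Second-order IIR resonator: one formant at `freq` Hz with
        # bandwidth `bw` Hz (Klatt-style difference equation).
        r = np.exp(-np.pi * bw / fs)
        b1 = 2 * r * np.cos(2 * np.pi * freq / fs)
        b2 = -r * r
        a0 = 1 - b1 - b2  # normalise for unity gain at 0 Hz
        return lfilter([a0], [1, -b1, -b2], x)

    # Source: a 120 Hz impulse train as a crude glottal excitation.
    source = np.zeros(fs)        # one second of samples
    source[::fs // 120] = 1.0

    # Filter: cascade the first three formants of a typical [a].
    speech = source
    for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:
        speech = formant_filter(speech, freq, bw)

Varying the formant frequencies over time, according to the 'rules' of synthesis by rule, is what turns such static resonances into moving, speech-like sound.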
However, this method does not represent an accurate model of the vocal tract, because it allows separate and independent control over every aspect of the individual formants. In a real vocal tract, each formant is produced by the entire system as a whole; it is not possible to single out one part of it and base an artificial model on that part alone.[9]
Concatenation Synthesis
Concatenation synthesis is a form of speech synthesis that uses short samples of previously recorded speech to construct new utterances. These samples vary in length, ranging from several milliseconds to about one second. The samples are subsequently modified so they fit together better, according to the type, structure, and mainly the contents of the synthesized text.[3] These samples, or units, can be individual phones, diphones, morphemes, or even entire phrases. So instead of shaping the characteristics of a generated sound into intelligible speech, concatenation synthesis works with a large set of pre-recorded data from which, much like a building set, it builds utterances that were not originally present in the data set.[10]
The use of real recorded sounds allows the production of very high-quality artificial speech. This is an advantage over formant synthesis, as no approximate model of the entire vocal tract is required. However, some speech characteristics, such as voice character or mood, cannot be modified as easily.[3] A unit placed in a less than ideal position can result in audible glitches and unnatural-sounding speech, because the contours of the unit do not fit perfectly with the surrounding units. This problem is minimal in domain-specific synthesizers, but general models require large data sets and additional output control that detects and repairs these glitches.[9]
First, a data set of high-quality audio samples needs to be recorded and divided into units. The most common type of unit is the diphone, but larger or smaller units can also be used. The set of such units forms the speech corpus. A diphone inventory can be created artificially[9] or based on many hours of recorded speech produced by one speaker,[3][9] spoken either monotonously, so as to create units with neutral prosody, or naturally. The corpus is then segmented into discrete units. Segments such as silence or breaths are also marked and saved.
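As a toy illustration of how diphone units relate to a phone-level transcription (hypothetical labels; real corpora are segmented by aligning such labels against the recorded audio), each unit spans the transition from one phone into the next:

    def to_diphones(phones):
        # Pair each phone with its successor; a diphone keeps the
        # acoustically difficult transition intact and cuts in the
        # stable middle portions of the phones instead.
        return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

    # 'sil' marks the silence segments, which are also kept in the corpus.
    print(to_diphones(["sil", "h", "@", "l", "oU", "sil"]))
    # ['sil-h', 'h-@', '@-l', 'l-oU', 'oU-sil']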
The speech corpus prepared in this way is then used to construct new utterances. A large database, in other words a large set of units with varied prosodic and sound characteristics, should safeguard more natural-sounding speech.[5] In unit selection synthesis, one of the approaches concatenation synthesis uses, the target utterance is first predicted, suitable units are picked from the speech corpus and assembled, and their suitability to the target utterance is tested.[11]
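A minimal sketch of the unit-selection search itself follows (hypothetical cost functions and toy data, not the algorithm of any cited system). Each candidate unit is scored by a target cost, how well it matches the predicted specification, and a join cost, how smoothly it concatenates with its neighbour; a Viterbi-style dynamic programme then picks the cheapest sequence:

    def select_units(targets, candidates, target_cost, join_cost):
        # targets[i]    : specification of the i-th unit to synthesize
        # candidates[i] : list of database units usable at position i
        # best[i][j]    : (cumulative cost, backpointer) for candidate j
        best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
        for i in range(1, len(targets)):
            row = []
            for u in candidates[i]:
                tc = target_cost(targets[i], u)
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev, u) + tc, k)
                    for k, prev in enumerate(candidates[i - 1]))
                row.append((cost, back))
            best.append(row)
        # Backtrack from the cheapest final candidate.
        j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
        path = []
        for i in range(len(targets) - 1, -1, -1):
            path.append(candidates[i][j])
            j = best[i][j][1]
        return path[::-1]

    # Toy usage: units are (name, pitch) pairs; both costs compare pitch.
    tgt = [("h", 120), ("@", 118), ("l", 115)]
    cands = [[("h", 110), ("h", 124)], [("@", 119)], [("l", 100), ("l", 116)]]
    print(select_units(tgt, cands,
                       lambda t, u: abs(t[1] - u[1]),
                       lambda a, b: abs(a[1] - b[1])))
    # [('h', 124), ('@', 119), ('l', 116)]

In a real system the two cost functions compare prosodic and spectral features of the units rather than a single pitch value.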
Historical overview
Although the first attempts to build a speaking machine appeared in antiquity and the Middle Ages, the first speech synthesizers were developed at the beginning of the modern era. The history of speech synthesis dates back to the 18th century, when the Hungarian civil servant and inventor Wolfgang von Kempelen created a machine of pipes, bellows, and assorted parts of musical instruments. He achieved a sufficient imitation of the human vocal tract with the third iteration. He published a comprehensive description of the design in his book Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine [The Mechanism of Human Speech, with a Description of a Speaking Machine] in 1791.[12] Christian Kratzenstein, a Danish scientist who worked in Russia, introduced his own speaking machine at about the same time; however, it could produce only five vowels. Von Kempelen's speaking machine was popularised by Charles Wheatstone in 1837.[13]
The design was picked up by Sir Charles Wheatstone, a Victorian-era English inventor, who improved on Kempelen's design. The newly sparked interest in phonetic research and Wheatstone's work inspired Alexander Graham Bell to do his own research into the matter and eventually arrive at the idea of the telephone. In the 1930s, Bell Laboratories created the Voice Operating Demonstrator, or Voder for short, a human speech synthesizer. The Voder was a model of the human vocal tract, essentially an early version of a formant synthesizer, and required a trained operator who created utterances manually using a console with 15 touch-sensitive keys and a pedal.[12]
Digital speech synthesizers began to emerge in the 1980s, notably the DECtalk text-to-speech synthesizer, which grew out of research at MIT.[14] This synthesizer was famously used by the physicist Stephen Hawking.[6]
At the beginning of modern-day speech synthesis, two main approaches appeared: articulatory synthesis, which tries to model the entire vocal tract of a human, and formant synthesis, which aims to create the sounds speech is made of from scratch. Nowadays, however, both methods have gradually been replaced by concatenation synthesis. This form of synthesis uses a large set of high-quality pre-recorded audio samples, the speech corpus, which can be assembled to form new utterances.[3]
Purpose
The purpose of speech synthesis is to model, research, and create synthetic speech for applications where communicating information via text is undesirable or cumbersome. It is used to 'give voice' to virtual assistants and in text-to-speech, most notably as a speech aid for visually impaired people or for those who have lost their own voice.
Important Dates
- 1769 - the first speech synthesizer was developed by Wolfgang von Kempelen
- 1770 - Christian Kratzenstein unveiled his speaking machine
- 1837 - Charles Wheatstone introduced von Kempelen's machine in England[13]
- 1939 - the Voder was unveiled at the World's Fair in New York[14]
- 1953 - the first formant synthesizer PAT (Parametric Artificial Talker) was developed by Walter Lawrence[13]
- 1984 - DECtalk was introduced by Digital Equipment Corporation[15]
Enhancement/Therapy/Treatment
Speech synthesis plays an important role in the development of intelligent personal assistants, which respond in a natural voice.[2] Tihelka also points out that speech synthesis can be beneficial whenever information should be delivered audibly, and that it can be used in interactive toys for children.[3] Taylor points out that speech synthesis is used in GPS navigation and call centres, as well as in various systems for visually disabled people.[9] Dutoit claims that speech synthesis could be used in language education.[1] Speech synthesis is also used at present in applications that help people who have lost their voice.[7]
At present, speech synthesis is able to preserve a patient's own voice if it is recorded before it is lost. However, the quality of the recording is still an issue. There are usually only a few days between the moment patients are informed about the surgery and the surgery itself, so they cannot record a considerable number of sentences. In addition, patients tend to be in a difficult psychological state due to the diagnosis, and they are not trained speakers.[16] This issue can be overcome by a voice donor, i.e. a person whose voice is not negatively affected by disease and is similar to the patient's, for instance a brother or a daughter.[17]
Speech synthesis can also be beneficial for students with various disorders, such as attention, learning, or reading disorders. It can also be used for learning a foreign language.[18]
Ethical & Health Issues
Jan Romportl claims that the more natural-sounding voices produced by current speech synthesizers might not be entirely accepted, due to the uncanny valley effect. The concept of the uncanny valley, introduced by Masahiro Mori, holds that natural-appearing artificial systems provoke negative feelings in humans. However, Romportl points out that his research demonstrates a difference in the acceptance of natural-sounding synthetic voices between participants who came from a technical environment and those with a background in the humanities: while the former group was more immune to the uncanny valley, the effect did occur in the reactions of the latter.[4] In contrast, Paul Taylor claims that people are irritated by the unnatural-sounding voices of certain speech synthesizers and prefer natural ones.[9]
Some users point out privacy issues with speech synthesis apps. This is primarily an issue with apps that are free of charge.[7]
For more issues linked with speech technologies, see speech technologies synopsis.
Public & Media Impact and Presentation
The most renowned user of a speech synthesizer might be Stephen Hawking, who said:
This system allowed me to communicate much better than I could before. I can manage up to 15 words a minute. I can either speak what I have written, or save it to disk. I can then print it out, or call it back and speak it sentence by sentence. Using this system, I have written a book, and dozens of scientific papers. I have also given many scientific and popular talks. They have all been well received. I think that is in a large part due to the quality of the speech synthesiser, which is made by Speech Plus. One's voice is very important. If you have a slurred voice, people are likely to treat you as mentally deficient: Does he take sugar? This synthesiser is by far the best I have heard, because it varies the intonation, and doesn't speak like a Dalek. The only trouble is that it gives me an American accent.[19]
Speech synthesis also allowed the Pulitzer Prize-winning critic Roger Ebert to speak again. Since Ebert's voice was preserved in many recordings, the speech synthesis company CereProc was able to reconstruct it.[20]
Speech synthesis also appears in culture. A renowned example is HAL, a speaking computer that plays an important role in Kubrick's film 2001: A Space Odyssey, based on Arthur C. Clarke's novel.[13]
Public Policy
We have not yet recorded any public policy regarding speech synthesis.
Related Technologies, Projects or Scientific Research
Speech synthesis apps are dependent on smartphones, tablets, or PCs.[7]
References
- ↑ 1.0 1.1 DUTOIT, Thierry. High-quality text-to-speech synthesis: an overview. Journal of Electrical & Electronics Engineering [online]. 1997. Available online at: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=E773C6478622AEE97E0E678AE3E3B285?doi=10.1.1.495.9289&rep=rep1&type=pdf (Retrieved 7th February, 2017).
- ↑ 2.0 2.1 Google. Speech Processing. Research.google.com [online]. Available online at: https://research.google.com/pubs/SpeechProcessing.html (Retrieved 7th February, 2017).
- ↑ 3.0 3.1 3.2 3.3 3.4 3.5 TIHELKA, Daniel. The Unit Selection Approach in Czech TTS Synthesis. Pilsen. Dissertation Thesis. University of West Bohemia. Faculty of Applied Sciences. 2005.
- ↑ 4.0 4.1 ROMPORTL, Jan. Speech Synthesis and Uncanny Valley. In: Text, Speech, and Dialogue. Cham: Springer, 2014, p. 595-602. Doi: 10.1007/978-3-319-10816-2_72 Available online at: http://link.springer.com/chapter/10.1007/978-3-319-10816-2_72 (Retrieved 2nd February, 2017).
- ↑ 5.0 5.1 5.2 ROMPORTL, Jan. Zvyšování přirozenosti strojově vytvářené řeči v oblasti suprasegmentálních zvukových jevů. Pilsen. Dissertation Thesis. University of West Bohemia. Faculty of Applied Sciences. 2008.
- ↑ 6.0 6.1 DIZON, Jan. 'Zoolander 2' Trailer Is Narrated By Stephen Hawking's DECTalk Dennis Speech Synthesizer. Tech Times [online]. 2015, Aug 3. Available online at: http://www.techtimes.com/articles/73832/20150803/zoolander-2-trailer-narrated-stephen-hawkings-dectalk-dennis-speech-synthesizer.htm#sthash.hj3Hwq95.dpuf (Retrieved 2nd February, 2017).
- ↑ 7.0 7.1 7.2 7.3 WebWhispers.org. Text to speech apps for Phones and Pads. WebWhispers.org [online]. 2017. Available online at: http://www.webwhispers.org/library/TexttoSpeechApps.asp (Retrieved 16th February, 2017).
- ↑ BIRKHOLZ, Peter. About Articulatory Speech Synthesis. VocalTractLab [online]. 2016. Available online at: http://www.vocaltractlab.de/index.php?page=background-articulatory-synthesis (Retrieved 2nd February, 2017).
- ↑ 9.00 9.01 9.02 9.03 9.04 9.05 9.06 9.07 9.08 9.09 9.10 TAYLOR, Paul. Text-to-Speech Synthesis. University of Cambridge Department of Engineering [online]. 2014. Available online at: http://mi.eng.cam.ac.uk/~pat40/ttsbook_draft_2.pdf (Retrieved 2nd February, 2017).
- ↑ BLACK, Alan W. Perfect Synthesis for all of the people all of the time. Carnegie Mellon School of Computer Science [online]. 2009, Sep 30. Available online at: http://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html (Retrieved 2nd February, 2017).
- ↑ CLARK, A. J. KING, Simon. Joint Prosodic and Segmental Unit Selection Speech Synthesis. The Centre for Speech Technology Research [online]. 2006. Available online at: http://www.cstr.ed.ac.uk/downloads/publications/2006/clarkking_interspeech_2006.pdf (Retrieved 2nd February, 2017).
- ↑ 12.0 12.1 DUDLEY, Homer, TARNOCZY, T. H. The speaking machine of Wolfgang von Kempelen. Journal of the Acoustical Society of America 22, 151-166. Doi:10.1121/1.1906583. Available online at: http://pubman.mpdl.mpg.de/pubman/item/escidoc:2316415:3/component/escidoc:2316414/Dudley_1950_Speaking_machine.pdf (Retrieved 2nd February, 2017).
- ↑ 13.0 13.1 13.2 13.3 WOODFORD, Chris. Speech synthesizers. EXPLAINTHATSTUFF [online]. 2017, Jan 21. Available online at: http://www.explainthatstuff.com/how-speech-synthesis-works.html (Retrieved 16th February, 2017).
- ↑ 14.0 14.1 DUNCAN. Klatt’s Last Tapes: A History of Speech Synthesisers. Communication Aids [online]. 2013, Aug 10. Available online at: http://communicationaids.info/history-speech-synthesisers (Retrieved 2nd February, 2017).
- ↑ ZIENTARA, Peggy. DECtalk lets micros read messages over phones. InfoWorld. 1984, Jan 16. p. 21. Available online at: https://books.google.com.au/books?id=ey4EAAAAMBAJ&lpg=PA21&dq=DECtalk%20lets%20micros%20read%20messages%20over%20phone%5D%2C%20By%20Peggy%20Zientara&pg=PA21#v=onepage&q=DECtalk%20lets%20micros%20read%20messages%20over%20phone%5D%2C%20By%20Peggy%20Zientara&f=false (Retrieved 2nd February, 2017).
- ↑ JŮZOVÁ, M., TIHELKA, D., MATOUŠEK, J. Designing High-Coverage Multi-level Text Corpus for Non-professional-voice Conservation. In: Speech and Computer. Volume 9811 of the series Lecture Notes in Computer Science. Cham: Springer, 2016, pp 207-215. Doi: 10.1007/978-3-319-43958-7_24 Available online at: http://link.springer.com/chapter/10.1007/978-3-319-43958-7_24 (Retrieved 16th February, 2017).
- ↑ YAMAGISHI, Junichi. Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction. Acoustical Science and Technology. 2012, 33(1), p. 1-5. Doi: 10.1250/ast.33.1 Available online at: https://www.jstage.jst.go.jp/article/ast/33/1/33_1_1/_article (Retrieved 16th February, 2017).
- ↑ The University of Kansas. Who can benefit from speech synthesis? The University of Kansas [online]. Available online at: http://www.specialconnections.ku.edu/?q=instruction/universal_design_for_learning/teacher_tools/speech_synthesis (Retrieved 16th February, 2017).
- ↑ HAWKING, Stephen. Disability my experience with ALS. Hawking.or.uk [online]. Available online at: https://web.archive.org/web/20000614043350/http://www.hawking.org.uk/disable/disable.html (Retrieved 2nd February, 2017).
- ↑ MILLAR, Hayley. New voice for film critic Roger Ebert. BBC News [online]. 2010, Mar 3. Available online at: http://news.bbc.co.uk/2/hi/uk_news/scotland/edinburgh_and_east/8547645.stm (Retrieved 16th February, 2017).