Abstract
When a user speaks to a conversational interface, the system has to be able to recognize what was said. The automatic speech recognition (ASR) component processes the acoustic signal that represents the spoken utterance and outputs a sequence of word hypotheses, thus transforming the speech into text. The other side of the coin is text-to-speech synthesis (TTS), in which written text is transformed into speech. There has been extensive research in both these areas, and striking improvements have been made over the past decade. In this chapter, we provide an overview of the processes of ASR and TTS.
Web Pages
Comparison of Android TTS engines http://www.geoffsimons.com/2012/06/7-best-android-text-to-speech-engines.html
Computer Speech and Language http://www.journals.elsevier.com/computer-speech-and-language/
EURASIP Journal on Audio, Speech, and Music Processing http://www.asmp.eurasipjournals.com/
History of text-to-speech systems http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
IEEE/ACM Transactions on Audio, Speech, and Language Processing http://www.signalprocessingsociety.org/publications/periodicals/taslp/
International Journal of Speech Technology http://link.springer.com/journal/10772
Resources for TTS http://technav.ieee.org/tag/2739/speech-synthesis
Speech Communication http://www.journals.elsevier.com/speech-communication/
SSML: a language for the specification of synthetic speech http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
McTear, M., Callejas, Z., Griol, D. (2016). Speech Input and Output. In: The Conversational Interface. Springer, Cham. https://doi.org/10.1007/978-3-319-32967-3_5
DOI: https://doi.org/10.1007/978-3-319-32967-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32965-9
Online ISBN: 978-3-319-32967-3
eBook Packages: Engineering, Engineering (R0)