Speech Input and Output

McTear, Michael; Callejas, Zoraida; Griol, David

doi:10.1007/978-3-319-32967-3_5

Abstract

When a user speaks to a conversational interface, the system has to be able to recognize what was said. The automatic speech recognition (ASR) component processes the acoustic signal that represents the spoken utterance and outputs a sequence of word hypotheses, thus transforming the speech into text. The other side of the coin is text-to-speech synthesis (TTS), in which written text is transformed into speech. There has been extensive research in both these areas, and striking improvements have been made over the past decade. In this chapter, we provide an overview of the processes of ASR and TTS.

Notes

1. http://cmusphinx.sourceforge.net/wiki/tutorialconcepts. Accessed February 20, 2016.
2. http://arxiv.org/abs/cs/0504020v2. Accessed February 20, 2016.
3. http://www.eunison.eu/. Accessed February 20, 2016.
4. https://www.youtube.com/watch?v=t4YzfGD0f6s&feature=youtu.be. Accessed February 20, 2016.
5. http://www.w3.org/TR/speech-synthesis/. Accessed February 20, 2016.
6. https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/speech-synthesis-markup-language-ssml-reference. Accessed February 20, 2016.
7. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/index.htm. Accessed February 20, 2016.
8. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/assignments/. Accessed February 20, 2016.
9. http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html. Accessed February 20, 2016.
10. https://www.ivona.com/. Accessed February 20, 2016.
11. http://www.naturalreaders.com/index.html. Accessed February 20, 2016.
12. http://www.cepstral.com/en/demos. Accessed February 20, 2016.
13. http://speech.diotek.com/en/. Accessed February 20, 2016.
14. http://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html. Accessed February 20, 2016.
15. http://mary.dfki.de/. Accessed February 20, 2016.

Web Pages

Comparison of Android TTS engines http://www.geoffsimons.com/2012/06/7-best-android-text-to-speech-engines.html
Computer Speech and Language http://www.journals.elsevier.com/computer-speech-and-language/
EURASIP journal on Audio, Speech, and Music Processing http://www.asmp.eurasipjournals.com/
History of text-to-speech systems www.cs.indiana.edu/rhythmsp/ASA/Contents.html
IEEE/ACM Transactions on Audio, Speech, and Language Processing http://www.signalprocessingsociety.org/publications/periodicals/taslp/
International Journal of Speech Technology http://link.springer.com/journal/10772
Resources for TTS http://technav.ieee.org/tag/2739/speech-synthesis
Speech Communication http://www.journals.elsevier.com/speech-communication/
SSML: a language for the specification of synthetic speech http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/

Author information

Authors and Affiliations

School of Computing and Mathematics, Ulster University, Northern Ireland, UK
Michael McTear
ETSI Informática y Telecomunicación, University of Granada, Granada, Spain
Zoraida Callejas
Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain
David Griol

Corresponding author

Correspondence to Michael McTear.

About this chapter

Cite this chapter

McTear, M., Callejas, Z., Griol, D. (2016). Speech Input and Output. In: The Conversational Interface. Springer, Cham. https://doi.org/10.1007/978-3-319-32967-3_5

DOI: https://doi.org/10.1007/978-3-319-32967-3_5
Published: 20 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32965-9
Online ISBN: 978-3-319-32967-3
eBook Packages: Engineering, Engineering (R0)
