Speech Input and Output

The Conversational Interface

Abstract

When a user speaks to a conversational interface, the system has to be able to recognize what was said. The automatic speech recognition (ASR) component processes the acoustic signal that represents the spoken utterance and outputs a sequence of word hypotheses, thus transforming the speech into text. The other side of the coin is text-to-speech synthesis (TTS), in which written text is transformed into speech. There has been extensive research in both these areas, and striking improvements have been made over the past decade. In this chapter, we provide an overview of the processes of ASR and TTS.

Correspondence to Michael McTear.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

McTear, M., Callejas, Z., Griol, D. (2016). Speech Input and Output. In: The Conversational Interface. Springer, Cham. https://doi.org/10.1007/978-3-319-32967-3_5

  • DOI: https://doi.org/10.1007/978-3-319-32967-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32965-9

  • Online ISBN: 978-3-319-32967-3

  • eBook Packages: Engineering, Engineering (R0)
