Toward adaptive conversational interfaces: Modeling speech convergence with animated personas

Published: 01 September 2004

Abstract

The design of robust interfaces that process conversational speech is a challenging research direction, largely because users' spoken language is so variable. This research explored a new dimension of speaker stylistic variation by examining whether users' speech converges systematically with the text-to-speech (TTS) output heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7- to 10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children's amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech by 10--50%, with the largest adaptations involving utterance pause structure and amplitude. Children's speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when children were introduced to new partners, and they generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70--95% of children converging with their partner's TTS, although individual differences in the magnitude of adaptation were evident. In the design of future conversational systems, users' spontaneous convergence could be exploited to guide their speech within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new conversational interfaces.
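To make the reported magnitudes concrete, convergence for a single acoustic-prosodic feature can be expressed as a proportional shift from the speaker's baseline toward the partner's TTS value. The sketch below is a hypothetical illustration in Python, not the paper's actual scoring procedure; the function name and the example feature values are assumptions chosen for exposition.

    def convergence_percent(user_baseline, user_with_partner, partner_tts):
        """Proportional shift of one acoustic-prosodic feature (e.g., mean
        amplitude in dB, or average pause duration in ms) from the user's
        solo baseline toward the partner's TTS value.

        Returns 0.0 for no adaptation and 100.0 for a full match with the
        partner; the 10--50% adaptations reported in the abstract would
        correspond to return values of 10.0 to 50.0.
        """
        gap = partner_tts - user_baseline
        if gap == 0:
            return 0.0  # baseline already matches the partner; nothing to converge toward
        shift = user_with_partner - user_baseline
        return 100.0 * shift / gap

    # Hypothetical example: a child whose utterance pauses average 400 ms at
    # baseline, conversing with a TTS voice that pauses 800 ms, lengthens her
    # pauses to 500 ms -- a 25% convergence, within the reported 10--50% range.
    print(convergence_percent(400.0, 500.0, 800.0))  # 25.0

Because the measure is signed, it also accommodates the bidirectional adaptations described above: the score is positive whenever the user moves toward the partner's value, whether that value lies above or below the user's baseline, and negative if the user diverges.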

