Abstract
The design of robust interfaces that process conversational speech is a challenging research direction, largely because users' spoken language is so variable. This research explored a new dimension of speaker stylistic variation by examining whether users' speech converges systematically with the text-to-speech (TTS) output heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7- to 10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children's amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech by 10--50%, with the largest adaptations involving utterance pause structure and amplitude. Children's speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when children were introduced to new partners, and they generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70--95% of children converging with their partner's TTS voice, although individual differences in the magnitude of adaptation were evident. In the design of future conversational systems, users' spontaneous convergence could be exploited to guide their speech to remain within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new conversational interfaces.
References
- Andersen, E. S. 1990. Speaking with Style: The Sociolinguistic Skills of Children. Routledge & Kegan Paul, London, England.
- Andre, E., Muller, J., and Rist, T. 1996. The PPP persona: A multipurpose animated presentation agent. In Advanced Visual Interfaces. ACM Press, 245--247.
- Bickmore, T. 2003. Relational agents: Effecting change through human-computer relationships. Ph.D. Thesis, MIT, February.
- Bickmore, T. and Cassell, J. 2004. Social dialogue with embodied conversational agents. In Natural, Intelligent and Effective Interaction with Multimodal Dialogue Systems, J. Van Kuppevelt, L. Dybkjaer, and N. Bernsen, Eds. Kluwer Academic, New York, NY.
- Boughman, J. M. 1997. Greater spear-nosed bats give group-distinctive calls. Behavioral Ecology and Sociobiology 40, 61--70.
- Burgoon, J., Stern, L., and Dillman, L. 1995. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge Univ. Press, Cambridge, UK.
- Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjalmsson, H., and Yan, H. 1999. Embodiment in conversational interfaces: Rea. In Proceedings of CHI'99. ACM Press, Pittsburgh, PA, 520--527.
- Cassell, J. and Thorisson, K. R. 1999. The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents. Appl. Artif. Intell. 13, 4--5, 519--538.
- Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., Eds. 2000. Embodied Conversational Agents. MIT Press, Cambridge, MA.
- Coulston, R., Oviatt, S. L., and Darves, C. 2002. Amplitude convergence in children's conversational speech with animated personas. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'2002), J. Hansen and B. Pellom, Eds. Causal Prod. Ltd., Denver, CO, vol. 4, 2689--2692.
- Coulston, R. and Darves, C. 2001. Duration scoring procedures. Unpublished manuscript, Oregon Health and Science University, November.
- Cowlishaw, G. 1992. Song function in gibbons. Behaviour 121, 1--2, 131--153.
- Darves, C. and Oviatt, S. L. 2004. Talking to digital fish: Designing effective conversational interfaces for educational software. In Evaluating Conversational Agents, Z. Ruttkay and C. Pelachaud, Eds. Kluwer Academic, Dordrecht, The Netherlands.
- Dehn, D. M. and Van Mulken, S. 2000. The impact of animated interface agents: A review of empirical research. Int. J. Hum. Comput. Studies 52, 1--22.
- Elowson, A. M. and Snowdon, C. T. 1994. Pygmy marmosets, Cebuella pygmaea, modify vocal structure in response to changed social environment. Animal Behaviour 47, 1267--1277.
- Giles, H., Mulac, A., Bradac, J., and Johnson, P. 1987. Speech accommodation theory: The first decade and beyond. In Communication Yearbook 10, M. L. McLaughlin, Ed. Sage Publications, London, UK, 13--48.
- Gong, L., Nass, C., Simard, C., and Takhteyev, Y. 2001. When non-human is better than semi-human: Consistency in speech interfaces. In Usability Evaluation and Interface Design: Cognitive Engineering, Intelligent Agents and Virtual Reality, Vol. 1, M. Smith, G. Salvendy, D. Harris, and R. Koubek, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 390--394.
- Haimoff, E. H. 1984. Acoustic and organizational features of gibbon songs. In The Lesser Apes, H. Preuschoft et al., Eds. Edinburgh University Press, Edinburgh, Scotland, 333--353.
- Janik, V. M. and Slater, P. 1997. Vocal learning in mammals. Advances in the Study of Behavior 26, 59--99.
- Junqua, J. C. 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Amer. 93, 1, 510--524.
- Karat, C. M., Vergo, J., and Nahamoo, D. 2003. Conversational interface technologies. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 169--186.
- Lai, J. and Yankelovich, N. 2003. Conversational speech interfaces. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 698--713.
- Leiser, R. G. 1989. Improving natural language and speech interfaces by the use of metalinguistic phenomena. Appl. Ergonomics 20, 3, 168--173.
- Ladefoged, P. 1993. A Course in Phonetics. Harcourt Brace Jovanovich, Ft. Worth, TX.
- Maples, E. G., Haraway, M. M., and Hutto, C. W. 1989. Development of coordinated singing in a newly formed siamang pair (Hylobates syndactylus). Zoo Biology 8, 367--378.
- Massaro, D., Cohen, M., Beskow, J., and Cole, R. 2000. Developing and evaluating conversational agents. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 287--318.
- Mirghafori, N., Fosler, E., and Morgan, N. 1996. Towards robustness to fast speech in ASR. In Proceedings of ICASSP-96, vol. 1, 335--338.
- Mitani, J. C. and Brandt, K. L. 1994. Social factors influence the acoustic variability in the long-distance calls of male chimpanzees. Ethology 96, 233--252.
- Moreno, R., Mayer, R., Spires, H., and Lester, J. 2001. The case for social agency in computer-based teaching: Do students learn more deeply when they interact with animated pedagogical agents? Cognition and Instruction 19, 2, 177--213.
- Nass, C. and Lee, K. 2000. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the Conference on Human Factors in Computing Systems. ACM Press, New York, NY, 329--336.
- Nass, C. and Lee, K. 2001. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. J. Exper. Psych. Appl. 7, 3, 171--181.
- Nass, C., Isbister, K., and Lee, E. 2000. Truth is beauty: Researching embodied conversational agents. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 374--402.
- Nass, C., Steuer, J., and Tauber, E. 1994. Computers are social actors. In Proceedings of the Conference on Human Factors in Computing Systems. ACM Press, Boston, MA, 72--78.
- Oviatt, S. L. 2003. Multimodal interfaces. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 286--304.
- Oviatt, S. L. 1996. User-centered design of spoken language and multimodal interfaces. IEEE Multimedia 3, 4 (Winter), 26--35. Reprinted in Readings in Intelligent User Interfaces, M. Maybury and W. Wahlster, Eds. Morgan Kaufmann.
- Oviatt, S. L. and Adams, B. 2000. Designing and evaluating conversational interfaces with animated characters. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 319--343.
- Oviatt, S., Levow, G., Moreton, E., and MacEachern, M. 1998. Modeling global and focal hyperarticulation during human-computer error resolution. J. Acoust. Soc. Amer. 104, 5, 3080--3098.
- Pisoni, D. 1997. Perception of synthetic speech. In Progress in Speech Synthesis, J. Van Santen, R. Sproat, J. Olive, and J. Hirschberg, Eds. Springer-Verlag, New York, NY, 541--556.
- Pols, L. and Jekosch, U. 1997. A structured way of looking at the performance of text-to-speech systems. In Progress in Speech Synthesis, J. Van Santen, R. Sproat, J. Olive, and J. Hirschberg, Eds. Springer-Verlag, New York, NY, 519--527.
- Potamianos, A., Narayanan, S., and Lee, S. 1997. Automatic speech recognition for children. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech'97), vol. 5, 2371--2374.
- Praat speech signal analysis software. www.praat.org.
- Rickel, J. and Johnson, W. L. 1998. Animated agents for procedural training in virtual reality: Perception, cognition and motor control. Appl. Artif. Intell. 13, 4--5, 343--382.
- Rickenberg, R. and Reeves, B. 2000. The effects of animated characters on anxiety, task performance, and evaluations of user interfaces. In Proceedings of CHI 2000. ACM Press, The Hague, The Netherlands, 49--56.
- Scherer, K. R. 1979. Personality markers in speech. In Social Markers in Speech, K. Scherer and H. Giles, Eds. Cambridge Univ. Press, Cambridge, UK, 147--209.
- Smith, B. L., Brown, B. L., Strong, W. J., and Rencher, A. C. 1975. Effects of speech rate on personality perception. Language and Speech 18, 145--152.
- Snowdon, C. T. and Elowson, M. A. 1999. Pygmy marmosets modify call structure when paired. Ethology 105, 893--908.
- Street, R., Street, N., and Van Kleeck, A. 1983. Speech convergence among talkative and reticent three-year-olds. Language Sciences 5, 79--86.
- Tusing, K. J. and Dillard, J. P. 2000. The sounds of dominance: Vocal precursors of perceived dominance during interpersonal influence. Hum. Comm. Res. 26, 148--171.
- Ward, N. and Nakagawa, S. 2002. Automatic user-adaptive speaking rate selection for information delivery. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'2002), J. Hansen and B. Pellom, Eds. Causal Prod. Ltd., Denver, CO, vol. 1, 549--552.
- Weiss, D. J., Garibaldi, B. T., and Hauser, M. D. 2001. The production and perception of long calls by cotton-top tamarins (Saguinus oedipus): Acoustic analyses and playback experiments. J. Comp. Psych. 115, 3, 258--271.
- Welkowitz, J., Cariffe, G., and Feldstein, S. 1976. Conversational congruence as a criterion of socialization in children. Child Develop. 47, 269--272.
- Welkowitz, J., Feldstein, S., Finklestein, M., and Aylesworth, L. 1972. Changes in vocal intensity as a function of interspeaker influence. Perceptual and Motor Skills 35, 715--718.
- Wilpon, J. and Jacobsen, C. 1996. A study of speech recognition for children and the elderly. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96). IEEE Press, Atlanta, GA, 349--352.
- Yeni-Komshian, G., Kavanaugh, J., and Ferguson, C., Eds. 1980. Child Phonology, Volume 1: Production. Academic Press, New York, NY.
- Zoltan-Ford, E. 1991. How to get people to say and type what computers can understand. Int. J. Man-Mach. Studies 34, 527--547.