Abstract
The design of robust interfaces that process conversational speech is a challenging research direction, largely because users' spoken language is so variable. This research explored a new dimension of speaker stylistic variation by examining whether users' speech converges systematically with the text-to-speech (TTS) output heard from a software partner. To pursue this question, a study was conducted in which twenty-four 7- to 10-year-old children conversed with animated partners that embodied different TTS voices. An analysis of children's amplitude, durational features, and dialogue response latencies confirmed that they spontaneously adapt several basic acoustic-prosodic features of their speech by 10--50%, with the largest adaptations involving utterance pause structure and amplitude. Children's speech adaptations were relatively rapid, bidirectional, and dynamically readaptable when children were introduced to new partners, and they generalized across different types of users and TTS voices. Adaptations also occurred consistently, with 70--95% of children converging with their partner's TTS voice, although individual differences in the magnitude of adaptation were evident. In the design of future conversational systems, users' spontaneous convergence could be exploited to guide their speech to remain within system processing bounds, thereby enhancing robustness. Adaptive system processing could yield further significant performance gains. The long-term goal of this research is the development of predictive models of human-computer communication to guide the design of new conversational interfaces.
References
- Andersen, E. S. 1990. Speaking with Style: The Sociolinguistic Skills of Children. Routledge & Kegan Paul, London, England.
- Andre, E., Muller, J., and Rist, T. 1996. The PPP persona: A multipurpose animated presentation agent. In Advanced Visual Interfaces. ACM Press, 245--247.
- Bickmore, T. 2003. Relational agents: Effecting change through human-computer relationships. Ph.D. Thesis, MIT, February.
- Bickmore, T. and Cassell, J. 2004. Social dialogue with embodied conversational agents. In Natural, Intelligent and Effective Interaction with Multimodal Dialogue Systems, J. Van Kuppevelt, L. Dybkjaer, and N. Bernsen, Eds. Kluwer Academic, New York, NY.
- Boughman, J. M. 1997. Greater spear-nosed bats give group-distinctive calls. Behavioral Ecology and Sociobiology 40, 61--70.
- Burgoon, J., Stern, L., and Dillman, L. 1995. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge Univ. Press, Cambridge, UK.
- Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjalmsson, H., and Yan, H. 1999. Embodiment in conversational interfaces: Rea. In Proceedings of CHI'99. ACM Press, Pittsburgh, PA, 520--527.
- Cassell, J. and Thorisson, K. R. 1999. The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents. Appl. Artif. Intell. 13, 4--5, 519--538.
- Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., Eds. 2000. Embodied Conversational Agents. MIT Press, Cambridge, MA.
- Coulston, R., Oviatt, S. L., and Darves, C. 2002. Amplitude convergence in children's conversational speech with animated personas. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'2002), J. Hansen and B. Pellom, Eds. Causal Prod. Ltd., Denver, CO, vol. 4, 2689--2692.
- Coulston, R. and Darves, C. 2001. Duration scoring procedures. Unpublished manuscript, Oregon Health and Science University, November.
- Cowlishaw, G. 1992. Song function in gibbons. Behaviour 121, 1--2, 131--153.
- Darves, C. and Oviatt, S. L. 2004. Talking to digital fish: Designing effective conversational interfaces for educational software. In Evaluating Conversational Agents, Z. Ruttkay and C. Pelachaud, Eds. Kluwer Academic, Dordrecht, The Netherlands.
- Dehn, D. M. and Van Mulken, S. 2000. The impact of animated interface agents: A review of empirical research. Int. J. Hum. Comput. Studies 52, 1--22.
- Elowson, A. M. and Snowdon, C. T. 1994. Pygmy marmosets, Cebuella pygmaea, modify vocal structure in response to changed social environment. Animal Behaviour 47, 1267--1277.
- Giles, H., Mulac, A., Bradac, J., and Johnson, P. 1987. Speech accommodation theory: The first decade and beyond. In Communication Yearbook 10, M. L. McLaughlin, Ed. Sage Publications, London, UK, 13--48.
- Gong, L., Nass, C., Simard, C., and Takhteyev, Y. 2001. When non-human is better than semi-human: Consistency in speech interfaces. In Usability Evaluation and Interface Design: Cognitive Engineering, Intelligent Agents and Virtual Reality, Vol. 1, M. Smith, G. Salvendy, D. Harris, and R. Koubek, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 390--394.
- Haimoff, E. H. 1984. Acoustic and organizational features of gibbon songs. In The Lesser Apes, H. Preuschoft et al., Eds. Edinburgh University Press, Edinburgh, Scotland, 333--353.
- Janik, V. M. and Slater, P. 1997. Vocal learning in mammals. Advances in the Study of Behavior 26, 59--99.
- Junqua, J. C. 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Amer. 93, 1, 510--524.
- Karat, C. M., Vergo, J., and Nahamoo, D. 2003. Conversational interface technologies. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 169--186.
- Lai, J. and Yankelovich, N. 2003. Conversational speech interfaces. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 698--713.
- Leiser, R. G. 1989. Improving natural language and speech interfaces by the use of metalinguistic phenomena. Appl. Ergonomics 20, 3, 168--173.
- Ladefoged, P. 1993. A Course in Phonetics. Harcourt Brace Jovanovich, Ft. Worth, TX.
- Maples, E. G., Haraway, M. M., and Hutto, C. W. 1989. Development of coordinated singing in a newly formed siamang pair (Hylobates syndactylus). Zoo Biology 8, 367--378.
- Massaro, D., Cohen, M., Beskow, J., and Cole, R. 2000. Developing and evaluating conversational agents. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 287--318.
- Mirghafori, N., Fosler, E., and Morgan, N. 1996. Towards robustness to fast speech in ASR. In Proceedings of ICASSP-96, vol. 1, 335--338.
- Mitani, J. C. and Brandt, K. L. 1994. Social factors influence the acoustic variability in the long-distance calls of male chimpanzees. Ethology 96, 233--252.
- Moreno, R., Mayer, R., Spires, H., and Lester, J. 2001. The case for social agency in computer-based teaching: Do students learn more deeply when they interact with animated pedagogical agents? Cognition and Instruction 19, 2, 177--213.
- Nass, C. and Lee, K. 2000. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the Conference on Human Factors in Computing Systems. ACM Press, New York, NY, 329--336.
- Nass, C. and Lee, K. 2001. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. J. Exper. Psych. Appl. 7, 3, 171--181.
- Nass, C., Isbister, K., and Lee, E. 2000. Truth is beauty: Researching embodied conversational agents. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 374--402.
- Nass, C., Steuer, J., and Tauber, E. 1994. Computers are social actors. In Proceedings of the Conference on Human Factors in Computing Systems. ACM Press, Boston, MA, 72--78.
- Oviatt, S. L. 2003. Multimodal interfaces. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, J. Jacko and A. Sears, Eds. Lawrence Erlbaum Assoc., Mahwah, NJ, 286--304.
- Oviatt, S. L. 1996. User-centered design of spoken language and multimodal interfaces. IEEE Multimedia 3, 4 (Winter), 26--35. Reprinted in Readings in Intelligent User Interfaces, M. Maybury and W. Wahlster, Eds. Morgan Kaufmann.
- Oviatt, S. L. and Adams, B. 2000. Designing and evaluating conversational interfaces with animated characters. In Embodied Conversational Agents, J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, Eds. MIT Press, Cambridge, MA, 319--343.
- Oviatt, S., Levow, G., Moreton, E., and MacEachern, M. 1998. Modeling global and focal hyperarticulation during human-computer error resolution. J. Acoust. Soc. Amer. 104, 5, 3080--3098.
- Pisoni, D. 1997. Perception of synthetic speech. In Progress in Speech Synthesis, J. Van Santen, R. Sproat, J. Olive, and J. Hirschberg, Eds. Springer-Verlag, New York, NY, 541--556.
- Pols, L. and Jekosch, U. 1997. A structured way of looking at the performance of text-to-speech systems. In Progress in Speech Synthesis, J. Van Santen, R. Sproat, J. Olive, and J. Hirschberg, Eds. Springer-Verlag, New York, NY, 519--527.
- Potamianos, A., Narayanan, S., and Lee, S. 1997. Automatic speech recognition for children. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech'97), vol. 5, 2371--2374.
- Praat speech signal analysis software. www.praat.org.
- Rickel, J. and Johnson, W. L. 1998. Animated agents for procedural training in virtual reality: Perception, cognition and motor control. Appl. Artif. Intell. 13, 4--5, 343--382.
- Rickenberg, R. and Reeves, B. 2000. The effects of animated characters on anxiety, task performance, and evaluations of user interfaces. In Proceedings of CHI 2000. ACM Press, The Hague, The Netherlands, 49--56.
- Scherer, K. R. 1979. Personality markers in speech. In Social Markers in Speech, K. Scherer and H. Giles, Eds. Cambridge Univ. Press, Cambridge, UK, 147--209.
- Smith, B. L., Brown, B. L., Strong, W. J., and Rencher, A. C. 1975. Effects of speech rate on personality perception. Language and Speech 18, 145--152.
- Snowdon, C. T. and Elowson, M. A. 1999. Pygmy marmosets modify call structure when paired. Ethology 105, 893--908.
- Street, R., Street, N., and Van Kleeck, A. 1983. Speech convergence among talkative and reticent three-year-olds. Language Sciences 5, 79--86.
- Tusing, K. J. and Dillard, J. P. 2000. The sounds of dominance: Vocal precursors of perceived dominance during interpersonal influence. Hum. Comm. Res. 26, 148--171.
- Ward, N. and Nakagawa, S. 2002. Automatic user-adaptive speaking rate selection for information delivery. In Proceedings of the International Conference on Spoken Language Processing (ICSLP'2002), J. Hansen and B. Pellom, Eds. Causal Prod. Ltd., Denver, CO, vol. 1, 549--552.
- Weiss, D. J., Garibaldi, B. T., and Hauser, M. D. 2001. The production and perception of long calls by cotton-top tamarins (Saguinus oedipus): Acoustic analyses and playback experiments. J. Comp. Psych. 115, 3, 258--271.
- Welkowitz, J., Cariffe, G., and Feldstein, S. 1976. Conversational congruence as a criterion of socialization in children. Child Develop. 47, 269--272.
- Welkowitz, J., Feldstein, S., Finklestein, M., and Aylesworth, L. 1972. Changes in vocal intensity as a function of interspeaker influence. Perceptual and Motor Skills 35, 715--718.
- Wilpon, J. and Jacobsen, C. 1996. A study of speech recognition for children and the elderly. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96). IEEE Press, Atlanta, GA, 349--352.
- Yeni-Komshian, G., Kavanaugh, J., and Ferguson, C., Eds. 1980. Child Phonology, Volume 1: Production. Academic Press, New York, NY.
- Zoltan-Ford, E. 1991. How to get people to say and type what computers can understand. Int. J. Man-Mach. Studies 34, 527--547.