
Book chapter

Building and Designing Expressive Speech Synthesis

Matthew P. Aylett, Leigh Clark, Benjamin R. Cowan, Ilaria Torre

The Handbook on Socially Interactive Agents, Volume: 1, Pages: 173 - 212

Swansea University Author: Leigh Clark

DOI (Published version): 10.1145/3477322.3477329

Abstract

We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact with, support and mediate our social relationships 1) with each other, 2) with digital information, and, increasingly, 3) with AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results. However, as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls.

Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should speak at all. These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user's successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech.

The debates and literature within this topic are vast and fundamentally multidisciplinary, covering disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
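The abstract's split between cross-speaker parameters (fixed per voice, e.g. accent and personality) and within-speaker parameters (varying during an interaction, e.g. emotion and intonation) can be sketched as a simple data model. This is an illustrative sketch only; the class and field names are our own and do not come from the chapter or any particular TTS system.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpeakerIdentity:
    """Cross-speaker parameters: fixed once a synthetic voice is chosen."""
    accent: str
    personality: str

@dataclass
class UtteranceStyle:
    """Within-speaker parameters: free to vary across an interaction."""
    vocal_style: str = "neutral"
    emotion: str = "neutral"
    intonation_model: str = "default"

@dataclass
class SyntheticVoice:
    identity: SpeakerIdentity                 # immutable design decision
    style: UtteranceStyle = field(default_factory=UtteranceStyle)

    def speak(self, text: str) -> str:
        # Placeholder: a real system would hand these parameters to a TTS engine.
        return f"[{self.identity.accent}/{self.style.emotion}] {text}"

voice = SyntheticVoice(SpeakerIdentity(accent="Scottish English", personality="warm"))
voice.style.emotion = "excited"   # within-speaker parameters can change mid-interaction
print(voice.speak("Hello!"))      # prints: [Scottish English/excited] Hello!
```

Note that even the "default" values here are themselves design decisions, which is the abstract's point about neutral voices.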

Published in: The Handbook on Socially Interactive Agents
ISBN: 978-1-4503-8720-0
Published: New York, NY, USA: ACM, 10 September 2021
URI: https://cronfa.swan.ac.uk/Record/cronfa56508