
Book chapter

Building and Designing Expressive Speech Synthesis

Matthew P. Aylett, Leigh Clark, Benjamin R. Cowan, Ilaria Torre

The Handbook on Socially Interactive Agents, Volume: 1, Pages: 173 - 212

Swansea University Author: Leigh Clark

DOI (Published version): 10.1145/3477322.3477329

Abstract

We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain they are required to interact with, support and mediate our social relationships 1) with each other, 2) with digital information, and, increasingly, 3) with AI-based algorithms and processes. Socially Interactive Agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research has tested this assumption, with mixed results. However, as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6]. It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology often termed expressive speech synthesis uncomfortably falls.

Uncomfortably, because it is often overshadowed by issues in interactivity and the underlying intelligence of the system, which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken from when and to whom they are spoken can seem an impossible task. This is an even greater challenge in evaluation and in characterising full systems which have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should speak at all. These considerations cannot be ignored. Any speech synthesis used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion and an intonational model. Dimensions like accent and personality (cross-speaker parameters), as well as vocal style, emotion and intonation during an interaction (within-speaker parameters), need to be built into the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore them is to blindly accept a set of design decisions that ignores the complex effect speech has on the user's successful interaction with a system. Thus expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building and evaluation of such artificial speech.

The debates and literature within this topic are vast and fundamentally multidisciplinary, covering disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human-computer interaction (HCI), to name a few. It is not our aim to synthesise these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasising future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try to define what we actually mean by expressive speech.
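The abstract's split between cross-speaker parameters (fixed per voice, e.g. accent and personality) and within-speaker parameters (varying during an interaction, e.g. emotion and intonation) can be sketched as a simple data model. This is an illustrative sketch only; the class and field names are our own and do not come from the chapter or any particular TTS system.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpeakerIdentity:
    """Cross-speaker parameters: fixed once a synthetic voice is chosen."""
    accent: str
    personality: str

@dataclass
class UtteranceStyle:
    """Within-speaker parameters: free to vary across an interaction."""
    vocal_style: str = "neutral"
    emotion: str = "neutral"
    intonation_model: str = "default"

@dataclass
class SyntheticVoice:
    identity: SpeakerIdentity                 # immutable design decision
    style: UtteranceStyle = field(default_factory=UtteranceStyle)

    def speak(self, text: str) -> str:
        # Placeholder: a real system would hand these parameters to a TTS engine.
        return f"[{self.identity.accent}/{self.style.emotion}] {text}"

voice = SyntheticVoice(SpeakerIdentity(accent="Scottish English", personality="warm"))
voice.style.emotion = "excited"   # within-speaker parameters can change mid-interaction
print(voice.speak("Hello!"))      # prints: [Scottish English/excited] Hello!
```

Note that even the "default" values here are themselves design decisions, which is the abstract's point about neutral voices.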

Published in: The Handbook on Socially Interactive Agents
ISBN: 978-1-4503-8720-0
Published: New York, NY, USA: ACM, 10 September 2021
URI: https://cronfa.swan.ac.uk/Record/cronfa56508