Date: Tue, 20 Apr 2004 20:05:40 +0200
From: Marianne Jessen
Subject: Expression in Speech: Analysis and synthesis
AUTHOR: Tatham, Mark; Morton, Katherine
TITLE: Expression in Speech
SUBTITLE: Analysis and synthesis
PUBLISHER: Oxford University Press
YEAR: 2004
Marianne Jessen, Dept of Logopedics, Fachhochschule Fresenius, Idstein. Michael Jessen, Forensic Speech and Audio Dept, Bundeskriminalamt, Wiesbaden.
''Expression in Speech'' focuses on the issue of how current speech synthesis systems (e.g. within text-to-speech applications or dialogue systems) can be improved by adding or enhancing acoustic correlates of expression. ''Expression'' is seen as a ''manner of speaking, a way of externalizing feelings, attitudes, and moods - conveying information about our emotional state'' (p. 39); Tatham and Morton (TM) also use the term ''tone of voice'' synonymously with expression (p. 65). TM are not interested in any quick, short-sighted solutions to the issue of expression in speech synthesis. Instead, before turning to more concrete implementation design proposals in the latter part of their book, TM go to great lengths to capture the issue of expression in speech more generally, including its foundations in the biology and psychology of emotions and the linguistic pragmatics of emotive expression in speech. They point out explicitly that the phonetics of expression in speech is not just a set of salient acoustic correlates of strong basic emotions overlaid on entire utterances, and that it should not be synthesized in this manner. Instead, what happens in natural speech is that often very subtle and blended emotions are conveyed for only small sections of speech, that there is a complicated interaction between acoustic and linguistic (choice of lexical items etc.) cues to emotions, and that the speaker is not just a passive victim of the biology of emotion and its reflection in speech but can modify and adjust expression in speech on a cognitive and sometimes conscious level. This cognitive mediation includes the fact that the speaker can perceive or infer the reaction of the listener to the expressive content of his/her speech within the context of the conversation and is able to make adjustments. TM propose that a speech synthesis system should be able to model all of these aspects.
As for the incorporation of listener reactions, TM claim that an automatic speech recognition module can increase the capabilities of the speech synthesis module. In general, TM emphasize that speech synthesis should not end with a model of the speaker and her/his expression capabilities but should ultimately be listener-oriented. This not only would be an appropriate way of capturing the goal-oriented nature of speech production on a scientific level but it would also be of commercial interest - after all, it is the customer who will be the listener of the synthetic speech.
In the final part of their book TM propose a speech production model (see Fig. 16.1, p. 365) in which, on a ''static plane'', the phonology and phonetics of a language and their interface are captured as the set of grammatical/linguistic-phonetic rules and constraints of speech with ''neutral expression'' (p. 302). In addition to this static plane there is a ''dynamic prosody/phonology tier'', responsible for planning utterances, and a ''dynamic phonetic tier'', responsible for rendering utterances. The rendering module receives input from a ''dynamic cognitive phonetics agent'', which supervises and modifies the rendering process based on contextual and environmental information. Apparently, while the static components cover what is addressed in most of current phonology and phonetics, the dynamic components focus on psycholinguistic and linguistic-pragmatic factors. This model implies a plea by TM for a broad-sighted view of phonetics, in which psycholinguistic and pragmatic factors are taken into account, so that a topic like expression in speech does not assume a marginal role in phonetics. TM mention that their theory of ''Cognitive Phonetics'' (e.g. p. 360) is a proposal in that direction. TM make proposals as to how their speech production model and their account of expression in speech can be implemented as part of a speech synthesis architecture. Within this agenda they present a number of XML declarations in which they lay out a prosodic hierarchy. A node for expression sits on top of this hierarchy, which proceeds further down through a series of prosodic categories (p. 370). Aside from the practical aspects of this hierarchy (capturing that expressions usually have a longer temporal domain, i.e. change less rapidly than units of linguistic prosody), TM also claim that in the planning of an utterance the speaker first formulates the ''prosodic wrapper'' and subsequently the segmental content, contrary to the more traditional notion that the segmental makeup of an utterance is planned first and then provided with linguistic and expressive prosody (pp. 384-386).
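The wrapper-first planning order can be illustrated with a minimal Python sketch. It is not TM's declaration: the element names (expression, phrase, accent_group, syllable), the helper functions and the example syllables are all hypothetical and chosen for illustration only. The sketch builds the empty prosodic wrapper first, with the expression node spanning the whole utterance, and only then fills in the segmental content.

```python
import xml.etree.ElementTree as ET


def build_wrapper(expression, syllable_counts):
    """Plan the prosodic wrapper first.

    Element names here are hypothetical, not taken from the book.
    The expression node dominates the whole utterance; the prosodic
    units below it are created empty, one accent group per count.
    """
    root = ET.Element("expression", {"type": expression})
    phrase = ET.SubElement(root, "phrase")
    for count in syllable_counts:
        group = ET.SubElement(phrase, "accent_group")
        for _ in range(count):
            ET.SubElement(group, "syllable")
    return root


def fill_segments(wrapper, syllables):
    """Fill the already planned wrapper with segmental content."""
    slots = list(wrapper.iter("syllable"))
    if len(slots) != len(syllables):
        raise ValueError("segmental content does not fit the wrapper")
    for slot, text in zip(slots, syllables):
        slot.text = text
    return wrapper


# Wrapper first (two accent groups with 2 and 1 syllables), segments second.
wrapper = build_wrapper("low-emotion friendly", [2, 1])
utterance = fill_segments(wrapper, ["hel", "lo", "there"])
xml_text = ET.tostring(utterance, encoding="unicode")
print(xml_text)
```

Because the expression attribute lives on the topmost node, it applies to everything below it and changes only when a new wrapper is planned, which mirrors the longer temporal domain attributed to expression above.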
Since ''Expression in Speech'' is a lot about imagining how speech synthesis can be improved in the future, let us for illustration purposes (and fun) beam aboard the Enterprise 1701-D and listen to the type of (Sci-Fi-projected) speech synthesis found there. (To Tatham and Morton: this is not to ridicule your book but to cherish its value; to all who don't like or know Paramount Pictures' Star Trek: The Next Generation: please skip to the next paragraph.) First there is the voice of the ship's computer, which everybody can talk to from the bridge, the elevator and all over the ship. The computer speaks in a voice that is essentially expressionless. Actually the voice is not fully without expression: it speaks in an overall friendly manner, which is an illustration of TM's point that ''all speech is expression-based'' (title of Chapter 14). But this friendly kind of voice by the computer is always the same, no matter how inappropriate for the context and how annoying for the listener. In TM's terms, the top node of the prosodic hierarchy has an attribute such as ''low-emotion friendly'' as a permanent setting for every utterance. This kind of inflexible way of including expression in speech synthesis is what TM argue against. What will probably meet their expectations, however, is the voice of the unique android Lieutenant Commander Data. Data is not able to experience emotions but in his speech and nonverbal behavior is able to express a certain degree of emotion. He usually cannot express strong and basic emotions; at least he is not very good at it, although when demanded in situations like a theater play his expressive abilities in that direction improve (cf. TM's XML declaration of emotive aspects in Hamlet's speech, p. 304f.).
According to TM, what is both more difficult and more required of a speech synthesis system is the ability to express subtle and blended rather than extreme and basic emotions. What an interactive system needs in their words is ''less intense expressiveness to increase its naturalness and credibility'' (p. 90) - a feature certainly met in Data's speech. Data also meets TM's proposal that a speech synthesis system should be able to perceive or infer listener reactions and to relate those reactions to the verbal or vocal expressive content of its speech with the ability to adjust it. In his regular interactions with the other crew members Data can for example perceive physical or verbal/vocal signs of distress in reaction to his behavior and can ask if he in any way offended the person he talked to. Another point: is the goal of expressive speech synthesis to model just the expression or also the physical and perhaps psychological aspects that come with an emotive reaction, as a stage prior to or interacting with the expression (TM pp. 277-280 for discussion)? More philosophically: can or should machines ever cross the body-mind barrier and even be able to EXPERIENCE emotions? That certainly went wrong with Data's android brother Lore, who turned into a raving lunatic over his abilities to experience emotions - but who knows. By the way, being an android, Data is also the perfect embodiment of an articulatory synthesizer, which many in the field of speech synthesis think will ultimately be the best way of doing synthesis.
''Expression in Speech'' in some ways has more the character of a monograph for the advanced reader than of a basic textbook or handbook because it presupposes that - or is of maximal value if - the reader is familiar with or willing to familiarize her/himself elsewhere with the principles of speech synthesis, with the literature on emotion in speech, and with background subjects such as phonology or psycholinguistics. For example, although different speech synthesis techniques such as formant synthesis, unit-selection synthesis, or diphone synthesis are all mentioned, discussed and in part illustrated, the reader still has to turn to other sources when wanting to know how e.g. formant synthesis works (the distinction between source and filter parameters, the cascade and the parallel branch, etc.). And although the most important correlates of emotion in speech that have been reported in the literature are summarized in the form of tables (pp. 55, 115), TM essentially do not provide a literature overview on this topic (by mentioning the original sources such as Williams and Stevens 1972 and many others) but cite a few secondary sources, one of which is a probably not very accessible Ph.D. thesis, to which the interested reader can turn for further literature. [We want to mention at this point that there has also been some interesting work on emotion in speech in Germany including Tischer (1993; with extensive literature review up to that date), Klasmeyer and Sendlmeier (2000), Burkhardt (2001; with special reference to emotion in speech synthesis), and Kienast (2002).]
The importance of phonology and prosody is mentioned throughout the book, but except for a few remarks on the Firthian prosodic framework, metrical phonology and articulatory phonology (pp. 21f.), their theory of Cognitive Phonetics (pp. 209, 334 etc.), or on the limitations of Pierrehumbert's intonation model and the ToBI system for speech synthesis (p. 118), it is not really clear which model of phonology TM have in mind as background for their work on expression in speech (e.g. in their production model mentioned above) or whether they think a combination of models is best for the practical goals at hand. In our opinion, for example, it would be too harsh a judgment to question the usefulness of autosegmental phonology for the purpose of speech synthesis, if this is what TM have in mind (see Clements and Hertz 1996 for the autosegmental ''Delta'' model of speech synthesis and its phonological motivation). The unfamiliar reader would need a few phonology textbooks and perhaps an introduction to the history of linguistics explaining the differences between British and American linguistic traditions (e.g. Anderson 1985) to get a perspective. Regarding psycholinguistics, it would have been useful had TM explained how their speech production model is similar to or differs from at least the one of Levelt (1989). On the positive side, TM mention quite a bit of literature on the biology and psychology of emotions. For that purpose they also provide a bibliography (p. 411f.) following their list of references.
''Expression in Speech'' is written in a clear and explicit style, avoiding as much technical language as possible. It also focuses on some topics and explains them in quite some detail (e.g. what the syllable-internal constituents are and how hierarchical syllable structure can be expressed in XML; pp. 372-374). These aspects make the book again more textbook- than monograph-like, with the positive consequence that it will be understood by many interested persons outside the specialized emotion-in-speech-synthesis community. This corresponds to the announcement in the text on the book cover that the book will be of interest for researchers in linguistics, speech science, pathology, technology and behavioral or cognitive science. In some instances, however, clarity and explicit style turn into redundancy. The book contains 16 chapters, not all of which deal with separate topics. TM have the habit of bringing up a topic and explaining some aspects of it, then bringing it up again in a different chapter with a certain shift in detail or perspective. Some readers will enjoy this way of arranging the book - and it can be a way of ultimately grasping the subject matter better than with a more redundancy-free style - but other readers, who cannot invest the same amount of time or may wish to concentrate on some aspects while leaving others, might find it difficult to extract the information they need without missing something important that occurs elsewhere in the book (TM do provide a subject and author index, however).
We have two technical comments on speech synthesis. First, to our knowledge, the HLsyn system by the Sensimetrics company is based on the revised and expanded parameter set described in Klatt and Klatt (1990) and not the 1980 model of the Klatt formant synthesizer (p. 239). Second, it is essentially correct that formant frequencies and amplitudes (including correlates of articulatory precision) as well as voice quality parameters cannot be modified with signal processing methods in concatenative synthesis (see table on p. 237). However, there has been research and development in that direction, and it will probably increase strongly in the future, driven in part by the motivation to enable synthesizers to speak with different individual voices (see e.g. Quatieri and McAulay 1986, d'Alessandro and Doval 1998, Kain and Macon 1998, Stylianou 2001). [Thanks to Karlheinz Stöber for discussion on that subject and for giving us information on literature.]
The few critical comments we made here are essentially about issues of style and the selection and organization of background information. They leave untouched our central impression of the book: that it is extremely useful as a guide to anyone working on the interface between emotion in speech and speech synthesis. Tatham and Morton offer a far-sighted perspective on this topic and make explicit many issues the developer of synthesis systems might not think about at all. In this sense the book is also a very good example of how the linguist and phonetician can make valuable contributions to speech technology, and it shows that in the end the best results will be obtained if speech technologists and linguists/phoneticians work together.
Anderson, S. R. (1985) Phonology in the twentieth century: theories of rules and theories of representations, Chicago: The University of Chicago Press.
Burkhardt, F. (2001) Simulation emotionaler Sprechweise mit Sprachsynthesesystemen, Aachen: Shaker Verlag.
Clements, G. N. and Hertz, S. R. (1996) An integrated approach to phonology and phonetics. In Durand, J. and Laks, B. (eds.) Current trends in phonology: models and methods, pp. 143-173, University of Salford, European Studies Research Institute.
d'Alessandro, C. and Doval, B. (1998) Experiments in voice quality modification of natural speech signals: the spectral approach. In: The Third ESCA/COCOSDA Workshop on Speech Synthesis (on CD).
Kain, A. and Macon, M. (1998) Personalizing a speech synthesizer by voice adaptation. In: The Third ESCA/COCOSDA Workshop on Speech Synthesis (on CD).
Kienast, M. (2002) Phonetische Veränderungen in emotionaler Sprechweise, Aachen: Shaker Verlag.
Klasmeyer, G. and Sendlmeier, W. F. (2000) Voice and emotional states. In R. D. Kent and M. J. Ball (eds.) Voice quality measurement, pp. 339-357, San Diego: Singular Publishing Group.
Klatt, D. H. and Klatt, L. C. (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers, Journal of the Acoustical Society of America 87, pp. 820-857.
Levelt, W. J. M. (1989) Speaking: from intention to articulation. Cambridge, MA: The MIT Press.
Quatieri, T. F. and McAulay, R. J. (1986) Speech transformations based on sinusoidal representation, IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-34, pp. 1449-1464.
Stylianou, Y. (2001) Applying the harmonic plus noise model in concatenative speech synthesis, IEEE Transactions on Speech and Audio Processing, 9, 1, pp. 21-29.
Tischer, B. (1993) Die vokale Kommunikation von Gefühlen. Weinheim: Beltz.
Williams, C. and Stevens, K. N. (1972) Emotions and speech: some acoustical correlates, Journal of the Acoustical Society of America 52, pp. 1238-1250.
ABOUT THE REVIEWER:
Marianne Jessen is a lecturer at the Department of Logopedics, Europa-Fachhochschule Fresenius in Idstein, Germany - the first academically-based program in Logopedics in Germany - where she is responsible for the section on voice. Her interests include speech under stress, voice quality, and dysphagia. Michael Jessen works at the Forensic Speech and Audio Department of the Bundeskriminalamt (Federal Criminal Police Office) in Wiesbaden, Germany. His interests include voicing and voice quality, laboratory phonology, and speaker identification.