Review of Wired for Speech


Reviewer: Richard W Sproat
Book Title: Wired for Speech
Book Author: Clifford Nass, Scott Brave
Publisher: MIT Press
Linguistic Field(s): Computational Linguistics
Sociolinguistics
Book Announcement: 17.65

Review:
Date: Sun, 1 Jan 2006 23:14:52 -0500
From: Richard Sproat <rws@xoba.com>
Subject: Wired for Speech

AUTHOR: Nass, Clifford; Brave, Scott
TITLE: Wired for Speech
SUBTITLE: How Voice Activates and Advances the Human-Computer
Relationship
PUBLISHER: MIT Press
YEAR: 2005

Richard Sproat, Departments of Linguistics and ECE, University of
Illinois at Urbana-Champaign

OVERVIEW

The topic of this book is voice user interfaces, an example of which is
the automated system that one interacts with when one calls United
Airlines and wishes to check on the arrival or departure time for a
flight. Other examples include systems where one speaks to a
graphical avatar (a ''talking head'') that serves as an automated
information kiosk; or the National Oceanic and Atmospheric
Administration's Weather Radio, which presents the weather using a
text-to-speech (TTS) synthesizer. In short, a voice user interface is
any automated system that allows a user to access information,
possibly with automatic speech recognition (ASR) for voice input, and
with either prerecorded prompts or TTS technology to produce output.

The book is not about the technology underlying voice user interfaces.
Rather, it is about how humans react to them and interact with them in
controlled experiments, and how this information should guide the
design of the ''persona'' that the interface presents to the world.

The book is divided into fourteen main chapters, and a two-page
summary chapter. After a brief stage-setting first chapter, the authors
turn in chapters 2-3 to their first topic, namely the gender of voices.
How do people react to synthetic male versus female voices? How
does gender stereotyping affect people's perception of the quality or
believability of a voice user interface system? Can one get around
user prejudices by having a ''gender neutral'' voice?

Chapters 4-5 turn to the issue of voice ''personality''. People infer
many aspects of other people's personality by the way they talk, and
the kinds of words they use. And, as it turns out, people
impute ''personalities'' to voice user interfaces. As with gender,
preconceived notions about personalities have a strong effect on the
user's perception of a voice user interface.

Chapter 6 deals with the issue of regional or foreign accents and
perceived ethnicity. Once again people's prejudices about accent and
race carry over to machines, even though the notion that a machine
has a geographical or ethnic background is obviously absurd.

Chapters 7-8 discuss emotion and how that should be expressed, or
not expressed, in voice user interfaces. One of the clear suggestions
of this section of the book is that, where possible, it is important for a
voice user interface to match its emotion to the (expected) emotional
state of the user.

Chapter 9 asks when and how a voice user interface should use
multiple voices. A couple of conclusions are drawn: first, if multiple
voices are used, they should be matched to the tasks being
performed. For example, the authors suggest using an officious
sounding voice to guide users through a complex menu system, and a
warm friendly sounding voice to reassure users that they are being
guided to the right place. Second, despite the common notion
of ''voice fonts'' (e.g. Raman 2004), users do not treat different voices
the same way as they treat different textual fonts since a change of
voice has social implications that a change of font does not.

Chapter 10 deals with the question of whether voice interfaces should
say ''I'', and thereby make perceived claims to being human. From the
authors' experiments, it seems that systems that use synthetic (TTS)
voices should not say ''I''.

Chapter 11 deals with recorded speech versus TTS, and real faces
versus synthetic faces, and concludes that people react better to a
system that has either a synthetic face speaking with a synthetic
voice, or a real face speaking with a real voice; users do not like it
when the conditions are crossed.

Chapter 12 argues that it is generally bad to mix obviously recorded
speech with obviously synthetic speech: for example, it would be a
bad design choice to have a system produce a canned phrase
like ''Good Morning Ms.'' using a prerecorded voice, and then finish
the utterance with an obviously synthetic voice saying the name of the
user. This, at least, is one section of the book that will seem obvious
to anyone who has worked on the technology of speech synthesis: we
have known for a long time that it is not a good idea to mix high quality
prerecorded speech with poor quality synthesis. The chapter also
contains a discussion of humor, though it is not obvious how this
relates to the main topic of the chapter.

The final two chapters, 13-14, shift the ground from voice (and video)
output to voice input. Chapter 13 deals with the issue of how
comfortably people will interact when they know, or are constantly
reminded, that they are being recorded. A set of experiments comparing
various kinds of attached microphones with unobtrusive array
microphones showed that users who used the less obtrusive
microphones were more creative in their responses and more willing
to disclose sensitive information. Finally, chapter 14 discusses what
systems should do when they misrecognize a user: what are the
relative costs and benefits of the system accepting blame (''I'm sorry, I
did not understand you'') versus placing blame on the user (''You
are speaking too quickly, please slow down.'').

Three main themes run through this book.

The first is simply this: we are ''wired'' for speech. Even though users
know they are dealing with an automated system, if the system takes
speech as input, or produces speech as output, users cannot help but
treat the system as if it were another human, and will apply the same
beliefs and prejudices to the automated system as they would to a
human that had the same behavior. That is, if a person feels more
comfortable with a male human explaining how to operate a complex
piece of equipment than with a female, then that prejudice will carry
over to an automated assistant that has a female voice. This first
point is consistent with previous work from Nass's lab: in general,
people seem to treat computers as if they were people, even though
they know full well that they are not (Nass, Steuer & Tauber, 1994).

The second theme is that there is no way around the first theme: for
instance, we cannot solve the problem of speakers' inherent gender
biases by making a system with a voice of ambiguous gender. Users
will just think that the system is weird and will react to it worse than if
the system clearly indicates that it is ''female'' or ''male''.

Finally, user perceptions of voice user interfaces have direct
implications for users' views of whatever service or product the system
is trying to sell. Just as a skilled salesman can make a product seem
more desirable than it might otherwise seem, so a well-designed voice
interface can make claims of a product's value seem more believable.

DETAILED CRITIQUE

To place the current research in some historical perspective it is worth
noting that Nass's research on ''Computers as Social Actors'' was the
inspiration for Microsoft ''Bob'' which, after its demise, led eventually
to ''Clippy'', the Microsoft Office automated assistant. Neither of these
products has been well received and there has been much
discussion of why (e.g., Schwartz, 2003), a topic that would take us
beyond the scope of this review. The authors are evidently very
proud of their long experience at providing user-interface design
advice to corporations: the preface to the book is highly self-laudatory,
and contains a fairly long list of consulting contracts that Nass's lab
has had with various companies over the years, including such varied
companies as BMW, Charles Schwab, General Magic, Macromedia,
NTT, Philips and US West.

In the overview above, reference was made to experiments conducted
by the authors to validate their claims about design issues for voice
user interfaces, and it is worth summarizing one of those experiments
just to give a flavor of the kind of research the authors performed. For
example, in assessing the importance of gender stereotyping, the
authors conducted an experiment in which participants were directed to
an online auction site which offered a set of stereotypically male and
stereotypically female merchandise, with descriptions from eBay.
Descriptions were read to the listeners either with a female voice or a
male voice generated with the Festival TTS system (Taylor, Black &
Caley, 1998). Subjects were then asked to rate how credible the
description they heard was. The results of this experiment (reported
on pages 25-27) were that subjects rated the product descriptions as
more credible if the gender of the voice matched the ''gender'' of the
product.

While the focus of the book is on the use of technology, rather than
the technology itself, one cannot forget that any use of a technology
presumes some understanding of the technology that is being used.
From the technological perspective there are a couple of points of
interest about this book. First, I personally found it noteworthy that
the majority of the discussion focusses on synthesis rather than
recognition. This focus is the opposite of the focus in the speech
technology community, where synthesis has long taken a back seat to
recognition, and where synthesis has traditionally been regarded as
much easier than recognition. But the focus of the current book on
synthesis is, after all, natural: although speech recognition is an
important part of many voice user interfaces, it is the voice with which
the system speaks that gives it its ''personality'' and its apparent
human-like qualities.

Second, it is unfortunately the case that the authors do not always
seem to understand the technology that they are evaluating. On
several occasions they imply that changing voices, changing the
emotions of voices, and changing the gender of voices is a
straightforward process. This is misleading at a number of levels.
First, consider emotion. While Nass and Brave are correct that many
of the acoustic correlates of emotion are known, and while it is true
that rendering emotion in synthetic speech has been a research topic
since Cahn's work (Cahn, 1989), it is still not possible to produce
convincing renditions of all emotions. Second, while it might seem as if
it should be easy in general to change the voice or the gender used
by the system, in practice there are limitations. To understand this, it is
necessary to briefly remind the reader of the various methods used to
produce speech output in TTS systems. The oldest approach,
exemplified by the Klatt synthesizer (Klatt 1980) and its commercial
offspring DECTalk, is a fully parametric system where all parameters
of the voice, including pitch, formant values, spectral tilt, and many
others, are controllable. In such a system it is indeed in principle easy
to produce new voices --- but at a cost: the quality of the resulting
speech sounds distinctly mechanical, largely because we do not have
good models of how to control the parameters over time. Such
limitations in our understanding have been sidestepped in much of the
recent work on ''unit selection'' based methods. These methods,
pioneered in work on the CHATR system (Hunt & Black, 1996), and
exemplified in commercial systems such as AT&T's ''Natural Voices'',
depend upon a huge database of speech from one speaker. During
synthesis, a set of units as closely as possible matching the intended
utterance is selected on the fly from the database. The resulting
speech can sound very good in the best case --- and downright silly in
the worst. But one of the practical discoveries of this work is that the
less one fiddles with the speech, the better and more natural the
resulting synthesis sounds. This means that modification of speech
such as changing the pitch is to be eschewed. The result: if you want
a different voice, you have to record a different speaker, and analyze
their speech. If you want a different emotion, you have to record your
speaker performing speech with that emotion. This is certainly a lot
easier to do than it used to be, but at a minimum one is looking at
recording an hour's worth of speech. This clearly involves more than
turning a few knobs, which is all that Nass and Brave seem to imply is
needed.
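
The selection step can be illustrated with a toy sketch. The Python
below is not the CHATR algorithm or any commercial system's code; it is
a minimal dynamic-programming search under assumed, simplified costs.
The names Unit, target_cost, join_cost and select_units, the use of
mean pitch as the only feature, and the rule that units adjacent in
the recording join at zero cost are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str    # phone label of the recorded unit
    pitch: float  # mean pitch in Hz (toy feature, an assumption)
    uid: int      # position of the unit in the recorded database


def target_cost(unit, want_pitch):
    # How far the recorded unit is from the requested pitch.
    return abs(unit.pitch - want_pitch)


def join_cost(prev, cur):
    # Units that were adjacent in the recording join for free;
    # otherwise penalise the pitch jump at the boundary.
    if cur.uid == prev.uid + 1:
        return 0.0
    return abs(cur.pitch - prev.pitch)


def select_units(database, targets):
    """Pick the cheapest unit sequence for targets [(phone, pitch), ...].

    Returns (list of Unit, total cost). Units are only selected,
    never modified.
    """
    cands = [[u for u in database if u.phone == ph] for ph, _ in targets]
    if any(not c for c in cands):
        raise ValueError("no recorded unit for some target phone")

    # best[i][j] = (cost of best path ending in cands[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in cands[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in cands[i]:
            tc = target_cost(u, targets[i][1])
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(cands[i - 1])
            )
            row.append((cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(cands[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Because the sketch never modifies a unit, a different voice or emotion
can only come from a different database, which is the practical
limitation described above.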

Turning away from technological issues, there are problems with the
design of the book itself. As the authors state at the outset, they use
endnotes extensively for background information that can be freely
skipped by the casual reader. For example, the data and statistical
analyses of all the experiments are presented in endnotes, not in the
body of the text, which merely summarizes the results. Also,
bibliographic references are all given in the endnotes. This design
choice has both a good and a bad aspect. It surely helps the non-
specialist reader, who will not necessarily be inclined to look at the
authors' data in detail, but will be satisfied with the authors' synopsis
of the results. But it is annoying for someone who has a technical
background in the field, since following up any single point
necessitates thumbing to the back of the book. The lack of a standard
bibliography is also an extremely bothersome feature of the book.

Negatives aside, this is a book worth reading by anyone interested in
speech technology. Those of us who have worked on developing the
technology underlying voice user interfaces have traditionally not
thought much about the actual design of the end product. Nass and
Brave have clearly thought about these issues more than anyone
else.

Still, while it is useful to understand what features of speech work best
for which applications, we should not lose sight of the fact that the
underlying technology is itself immature, and that just building a
system that can communicate effectively with inexperienced users is
still a challenge. In ''The Restaurant at the End of the Universe'', the
second book in Douglas Adams' ''Hitchhiker'' series, Ford Prefect (the
Betelgeusian companion of the hero, Arthur Dent) berates the
Golgafrincham colonizers of prehistoric Earth for not having made
much progress on the invention of the wheel. A marketing consultant
fires back at Ford and asks him, if he is so smart, what color it should
be.

REFERENCES

Cahn, J. 1989. ''Generating Expression in Synthesized Speech.''
Master's thesis. Massachusetts Institute of Technology.

Hunt, A. and Black, A. 1996. ''Unit selection in a concatenative speech
synthesis system using a large speech database.'' Proceedings of
ICASSP 96, vol 1, pp 373-376, Atlanta, Georgia.

Klatt, D. 1980. ''Software for a cascade/parallel formant synthesizer.''
Journal of the Acoustical Society of America, 67.3, 971-995.

Nass, C., Steuer, J. S. and Tauber, E. 1994. ''Computers are social
actors.'' Proceedings of the CHI Conference, 72-77. Boston, MA.

Raman, T. V. 2004. ''Emacspeak -- The Complete Audio Desktop.''
http://emacspeak.sourceforge.net/

Schwartz, L. 2003. ''Why people hate the paperclip: Labels,
appearance, behavior and social responses to user interface agents.''
Master's Thesis, Stanford University.

Taylor, P., Black, A. and Caley, R. 1998. ''The architecture of the
Festival Speech Synthesis System'' 3rd ESCA Workshop on Speech
Synthesis, pp. 147-151, Jenolan Caves, Australia.
 
ABOUT THE REVIEWER:


Richard Sproat is professor in the departments of Linguistics and
Electrical & Computer Engineering at the University of Illinois at
Urbana-Champaign. His interests include multilingual text processing
and speech technology. Prior to coming to the University of Illinois,
Sproat worked in industrial research at AT&T Bell Laboratories, with
his primary area of research being text-to-speech synthesis. Sproat
was one of the main architects of the Bell Labs multilingual text-to-
speech synthesizer. He was also involved in the design of the SABLE
text-to-speech markup language, a precursor to the W3C's SSML.


Versions:
Format: Hardback
ISBN: 0262140926
ISBN-13: N/A
Pages: 296
Prices: U.S. $ 32.50