Agara is an autonomous virtual voice agent that can carry out end-to-end voice conversations with customers over the phone. Given the multitude of text-based chatbots that we interact with daily and the attention they have received, I often come across the question:

“Doesn’t building such a voicebot amount to using a chatbot, which additionally
a) works on the text transcript of the customer’s speech, obtained using an Automatic Speech Recognition (ASR) engine, and
b) converts its own text response to voice using a Text-to-Speech (TTS) engine?”

However, saying that a voicebot is tantamount to a chatbot – with an ASR on the way in, and a TTS on the way out – is essentially saying that the way you text is also the way you speak. A simple thought experiment is telling here.

Let’s imagine you ‘converted’ one of your old text chats with a friend into a voice call. Let’s say you read out your part of the text, and your voice assistant – an Alexa or a Google Assistant – reads out your friend’s parts. Alternatively, imagine having the same conversation with your friend over a phone call. Would both these scenarios sound the same? Most likely not.

The dissimilarity in the basic nature of conversation is fundamental, but it does not even scratch the surface of the technical challenges of processing real-time audio. These challenges include latency, transcription errors, and the fact that even state-of-the-art speech synthesis systems can sound robotic.

On the topic of voice assistants, it is worth noting upfront that Agara is not a command-and-control voice assistant like Alexa or Siri. Agara carries out complete, goal-oriented conversations with customers. Throughout this article, we use the term ‘voicebot’ to refer to an AI-powered voice agent of this nature, one that aims to have conversations that are as human-like as possible.

This article is an attempt at highlighting some of the challenges and considerations involved in designing and building such a voicebot. Since chatbots are rather ubiquitous these days, this article also tries to contrast them with voicebots and, in the process, shed light on the oft-asked “Isn’t Voicebot = ASR + Chatbot + TTS?”

The continuum of speech

Text chat is, by its very nature, designed to be turn-based. For the most part, we send a text, wait for a reply, and then text again. We only see the other person’s text in units of complete messages. We formulate a response, type it out, and make multiple edits along the way, and all the while the person at the other end sees only the ubiquitous ‘Typing…’. What they finally receive is merely the product of our ‘turn’ in the text conversation.

On the other hand, in a spoken conversation, we frequently interrupt each other and provide verbal cues of acknowledgment, (dis)agreement, empathy, etc. All this happens while the other person is still speaking their ‘turn’ of the conversation. Spoken conversation between two people must be thought of as two separate continuous streams of audio. Each stream is a continuum of spoken words, non-word verbal cues (such as oh’s, ah’s, and hmm’s) as well as background noise of any form. 

Interestingly, many customer support desks enable their live chat agents (advisors) to see what customers type, as they type it. They can see each letter as the customer types it, including all the messy deletions and to-be-edited autocorrects. This, of course, is to allow them to pre-empt queries and fetch information quickly. There are also concept apps like the Beam Messenger that allow people on either side to see what the other person is texting in real time. However, they have not gained much traction. One set of user reviews seems to indicate that people would still like to retain the luxury of articulating better and editing when it comes to text-based chats. Another set of reviews alludes to the fact that real-time texting of this fashion isn’t quite ‘natural’ as compared to, say, a voice or a video call.

What do humans do?

Since we are in the business of building voicebots that talk to customers over the phone, looking at this question from a customer support standpoint might be a good start, without losing generality. It might be insightful to compare how similar situations are handled in text and voice conversations. 

Let’s say you are a customer support advisor, chatting with a customer over text chat. You see a text from them and start typing a response, but as you type, you notice them ‘Typing…’ again. You stop typing (possibly even delete what you started typing), let the customer finish, and then respond, now factoring in their new text.

In contrast, let’s say you are talking to a customer over the phone. The following possibilities (and more) can take place:

  • While you are saying something, the customer might interject to correct you, or pre-empt the rest of your statement.
  • If they are in a hurry, they would like you to get straight to the point. You might choose to stop speaking mid-way in such cases and let them ‘take over’ the ‘turn.’ The conversation might then proceed along a new path. 
  • You might hear noise from their end while you’re speaking, and choose to repeat yourself just in case the customer missed hearing what you said.  
  • They might be using ‘fillers’ like hmm’s, ah’s, and okay’s to let you know they’re acknowledging what you’re saying without meaning to interrupt your speech. In such cases, you listen to them but continue speaking.

Of course, there are times when you, as the advisor, might need to interrupt the customer’s speech. Maybe they are reading out a wrong piece of information, and you’d like to save them time by letting them know before they read out its entirety. Or maybe you’d like to interrupt a verbally abusive customer and politely let them know that profanity will not be accepted.

In our analysis of customer support calls, we found that 22% of the turns of agents’ conversation had overlapping customer-side speech. Some of these were interrupts, and others were filler words that weren’t intended to interrupt.
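The distinction between fillers and genuine interrupts can be pictured with a small rule-based check. The sketch below is illustrative only: the `FILLERS` vocabulary, the three-word limit, and the function names are assumptions made for this example, and a production voicebot would more likely rely on trained models and audio cues than on a word list.

```python
import re

# Hypothetical filler vocabulary; a real system would derive this from call data.
FILLERS = {"hmm", "hm", "mm", "mhm", "uh", "um", "ah", "oh", "ok", "okay", "yeah", "right"}


def _words(text: str) -> list:
    """Lower-case the transcript and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_backchannel(overlap_text: str, max_words: int = 3) -> bool:
    """True if the overlapping customer speech is short and made up only of fillers."""
    words = _words(overlap_text)
    if not words or len(words) > max_words:
        return False
    return all(w in FILLERS for w in words)


def should_stop_speaking(overlap_text: str) -> bool:
    """The bot yields its turn only when the overlap contains words that are not fillers."""
    return bool(_words(overlap_text)) and not is_backchannel(overlap_text)
```

Under these assumptions, an overlapping “hmm okay” lets the bot keep talking, while “No, that's a B.” makes it yield the turn. An empty transcript (for example, background noise with no recognized words) is not treated as an interrupt.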


A voicebot that strives to be human-like should account for all these artifacts or possibilities that come with the real-time nature of voice-based conversations. 

Let’s look at an example here to highlight some of these considerations. 

Bot: May I have your reference number, John?
Customer: Yes, it’s ACB432
Bot: I heard that as ACP4… (Bot hears the customer’s interrupt here)
Customer: No, that’s a B.

In this case, the ASR mistranscribed ‘ACB432’ as ‘ACP432’. The response to this should be along the lines of “Sorry, is your reference number ACB432?”. This involves the bot understanding that:

  1. Firstly, the customer is attempting to correct one of the letters, and hence this is a case where the bot should stop speaking and change course
  2. ‘B’ and ‘P’ sound similar and are often confused with each other, and hence the customer is likely trying to correct the ‘P’ to a ‘B.’
  3. The customer interrupted it around the time when it said ‘P,’ even though there is no guarantee about how much time had passed since ‘P’ when the interrupt happened – latency is also added in sending the customer’s audio over the network, as well as in transcription by the ASR.

While all of these are intuitive to a human, they need to be explicitly taught to the AI models powering this conversation. Further, there are considerations of UX, which are described later. If the bot stops abruptly when it hears an interrupt, the user experience is unnatural. For instance, halting our speech mid-word is something that we humans rarely do, even upon hearing overlapping talk.
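To make the second and third points of that list concrete, here is a minimal sketch of how a correction such as “No, that's a B.” might be applied to the misheard ‘ACP432’. Everything in it is an assumption for illustration: the `CONFUSABLE_GROUPS` table, the function names, and the idea of searching backwards from the interruption point are not taken from Agara's implementation.

```python
import re
from typing import Optional

# Hypothetical groups of letters that sound alike over a phone line.
CONFUSABLE_GROUPS = [set("BCDEGPTVZ"), set("MN"), set("FS"), set("AJK")]


def extract_corrected_letter(utterance: str) -> Optional[str]:
    """Return the last stand-alone letter in the utterance, upper-cased, or None."""
    letters = re.findall(r"\b[A-Za-z]\b", utterance)
    return letters[-1].upper() if letters else None


def apply_letter_correction(heard: str, spoken_so_far: int, corrected: str) -> str:
    """Replace the most recently read-back character that is confusable with `corrected`.

    heard: the ID as transcribed by the ASR, e.g. "ACP432"
    spoken_so_far: how many characters the bot had read back when it was interrupted
    corrected: the letter the customer supplied, e.g. "B"
    """
    corrected = corrected.upper()
    groups = [g for g in CONFUSABLE_GROUPS if corrected in g]
    # Walk backwards from the interruption point: because of network and ASR
    # latency the interrupt may lag the mistake, so nearer characters are tried first.
    for i in range(min(spoken_so_far, len(heard)) - 1, -1, -1):
        ch = heard[i].upper()
        if ch != corrected and any(ch in g for g in groups):
            return heard[:i] + corrected + heard[i + 1:]
    return heard  # no plausible target found; the bot should ask again
```

With the bot interrupted after reading back “ACP4”, calling `apply_letter_correction("ACP432", 4, "B")` replaces the ‘P’ and yields the ID that the bot can then confirm with the customer.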

The gaffes of ASR

ASR systems (also referred to as speech-to-text systems) are used to transcribe speech to text. Publicly available ASR systems (such as those on Google Cloud Platform – GCP, and Amazon Web Services – AWS) are notoriously bad at transcribing speech that 1) is noisy, 2) is in an accent for which the ASR has not seen many examples during training, and 3) contains words or phrases that are not often used in general language.



Noise

In the context of an ASR that is meant to transcribe the customer’s speech, pretty much any other sound in the audio is considered noise. A few frequent sources of noise include other people speaking in the background, audio degradation during transmission through phone lines (the infamous crackling and muffling of phone audio), traffic sounds, public transport announcement systems, and even a fan or a loud air conditioner. These types of noise pose a transcription challenge and cause the transcripts to have missing, incorrectly transcribed or partially transcribed portions of customer speech.


Accents

Publicly available ASRs perform well on ‘mainstream’ accents while struggling with ‘accented speech’. To give perspective, GCP’s ASR supports only 16 accents of English at the time of writing this article. These accents are listed under the countries they are mainly spoken in, but this ignores the sub-dialects and English accents within those countries. For instance, the United States alone has at least 20 distinguishable accents. Further, as a recent study points out, there is a huge disparity even across mainstream English accents, with the Word Error Rates (WERs) being “23% or higher on datasets in Australian and Indian accents, as opposed to a WER of 13.2% on US accents.” WER is, roughly, the percentage of mistranscribed words: the number of substituted, deleted, and inserted words divided by the number of words actually spoken.
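For readers who want to see how a WER figure is arrived at, the sketch below computes it as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. This is the standard textbook definition rather than the exact evaluation script behind the numbers quoted in this article; the lower-casing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the ASR produces many more words than were actually spoken, which is why it is only roughly a percentage of mistranscribed words.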

Domain-specific words

Words that are not spoken often as part of mainstream or general vocabulary are out-of-vocabulary words for the ASR. Since spoken samples of these words are not available in large numbers in standard ASR datasets, public ASRs perform poorly while transcribing them. These domain-specific ‘entities’ include arbitrary sequences of letters and digits (such as a reference number), all types of proper nouns (names, places, organizations, brands), and most sequences of words and numbers that make up information such as email addresses and physical addresses. Accurately extracting and processing these entities is an important task.

Magnitude of errors

The Word Error Rate of an ASR, in general, is a good indicator of how ‘off’ its transcripts are. However, most publicly reported WER numbers are not indicative of the true magnitude of errors made by off-the-shelf ASRs when used in domain-specific contexts such as in customer support calls. For instance, this report from last year pegs GCP’s WER at 4.9%, indicating that over 95% of the words are transcribed correctly.

However, even the best real-time GCP ASR gave us a WER of over 33% on phone call recordings of customer support calls – it could only transcribe two-thirds of the words correctly. To put this in perspective, imagine talking to a person and mishearing every third word they say.

More significantly, the words that are mistranscribed disproportionately often are, in fact, the words that are most crucial to goal-oriented conversations. These are the ‘entities’ or domain-specific words mentioned before. If you call a banking voicebot to block your credit card, the call is a no-go if it can’t get your credit card number and bank account information transcribed correctly. If it is unable to get your name or date of birth correctly, it cannot verify that you are indeed the legitimate holder of the card. The ASR is most likely to get exactly these pieces of information wrong, simply because it has not been trained on many hours of audio containing entities of these sorts conveyed over a phone line. As Figure 1 shows, GCP’s real-time ASR mistranscribed email addresses around 90% of the time, mailing addresses 36% of the time, and zip codes 29% of the time in these calls.

Figure 1

Conversational Design accounting for ASR errors

The possibility that information might be mistranscribed is unique to voicebots. While it is possible that one could make spelling errors while interacting with a text-based chatbot, these errors are far fewer and more deterministic than ASR mistranscriptions. Most notably, in the case of text-based chatbots, the text of the information that the bot receives is the same as that typed in by the customer. ASR errors mean that this is not true in the case of voicebots – what is said by the customer might not be what is read by the bot.

This makes designing conversations tricky. If the bot were to verify everything the ASR transcribed by repeating it back to the customer, that would make for a long and onerous conversation. However, it is crucial that it gets certain entities right, for security reasons and otherwise. These trade-offs made during conversation design form the core of the user experience offered by the bot.
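One way to picture this trade-off is a simple confirmation policy: read back only the entities that are critical, or that the ASR seems unsure about. The sketch below is a hypothetical illustration. The entity kinds in `CRITICAL_KINDS`, the `confidence` field, and the 0.85 threshold are assumptions made for this example, not values used by Agara.

```python
from dataclasses import dataclass

# Hypothetical entity types that are always read back, e.g. for security reasons.
CRITICAL_KINDS = {"card_number", "account_number", "date_of_birth"}


@dataclass
class Entity:
    kind: str          # e.g. "card_number", "zip_code"
    value: str         # the transcribed text
    confidence: float  # assumed 0-1 score from the ASR or entity extractor


def needs_confirmation(entity: Entity, threshold: float = 0.85) -> bool:
    """Confirm critical entities always, and other entities only when confidence is low."""
    return entity.kind in CRITICAL_KINDS or entity.confidence < threshold


def entities_to_confirm(entities: list) -> list:
    """Return only the entities the bot should repeat back to the customer."""
    return [e for e in entities if needs_confirmation(e)]
```

Raising the threshold makes the conversation safer but longer; lowering it makes it shorter but more error-prone. Choosing that balance is the conversation-design decision described above.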


Not only are ASR errors a challenge, but it turns out that merely knowing whether the customer has finished speaking is itself not a trivial task. While Voice Activity Detection (VAD) has come a long way since its early days, state-of-the-art VAD models (simply VADs, hereon) still find it hard to confidently say whether a customer is still speaking. VADs are particularly sensitive to background noise and are easily confused by other people talking in the background.

I highlight just two fundamental issues with VAD – false positives and false negatives – in the interest of brevity. False positives occur if the VAD believes that the customer is still speaking when, in reality, they are not. This can cause the bot to wait for longer than it should, leading to awkward pauses. On the other hand, false negatives occur if the VAD falsely concludes the customer is not speaking, even before they stop talking. Consequently, the bot might start talking over the customer, responding to an incomplete customer utterance.
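Both failure modes show up even in a toy end-of-speech detector. The sketch below is a deliberately naive energy-threshold endpointer, not a state-of-the-art VAD; the per-frame `energies` input, the `threshold`, and the `hangover_frames` parameter are all assumptions chosen for illustration.

```python
from typing import Optional, Sequence


def end_of_speech_frame(
    energies: Sequence[float],
    threshold: float = 0.1,
    hangover_frames: int = 3,
) -> Optional[int]:
    """Return the frame index at which the speaker is declared finished, or None.

    A frame counts as speech when its energy is at or above `threshold`. The
    speaker is declared finished after `hangover_frames` consecutive
    non-speech frames following some speech.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None  # still considered to be speaking (or never spoke)


# A customer who pauses briefly mid-sentence (frames 3-4) before continuing.
pause_then_continue = [0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0]
# Steady background noise above the threshold after the customer stops at frame 1.
noisy_line = [0.5, 0.2, 0.2, 0.2, 0.2]
```

With `hangover_frames=2`, the detector declares the first example finished at frame 4, before the customer resumes, so the bot would talk over them. With `hangover_frames=3` it waits until frame 8. On the noisy example it never declares an end at all, and the bot would sit through an awkward pause.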


While seamless user experience (UX) is a key consideration throughout this article, this section focuses on three illustrative but not limiting aspects – latency, speech characteristics and voice persona.


Latency

A salient aspect in which voice conversation differs from chat is its ‘blocking’ nature. Being on a phone call demands more dedicated attention from the customer because of its real-time nature, unlike texting, where messages can be read or revisited intermittently. This puts the onus on a voicebot to keep its response time to a minimum. A slight delay in a chatbot’s response might be pardonable to customers who are free to perform other tasks while texting. By contrast, a voicebot that is slow on the uptake can be quite annoying to talk to. Further, if a voicebot treats customers to a bout of silence while its cogs of comprehension turn slowly, chances are they will assume that it hasn’t understood them and will end up repeating themselves. These considerations aside, speaking with a droid that makes you wait a few seconds after everything you say is just unnatural.


Let’s assume for a moment a very rudimentary voicebot that is indeed an ASR + chatbot + TTS. Two extra modules are now added to a chatbot, both of which involve real-time processing of audio, yet the voicebot must have much lower latency than is acceptable for a standalone chatbot. Sending audio over the network, converting it to text, converting text back to speech, and sending it back over the network can together be quite time-consuming. Figure 2 and Figure 3 show, respectively, the latency incurred by GCP in converting speech to text (ASR) and text to speech (TTS) as a function of the number of words in the text.

On average, a customer speaks about 16 words per turn, and an agent speaks about 23 words per turn. Even assuming our rudimentary voicebot instantaneously generated text responses from ASR transcripts, it would still have to a) wait for the ASR to finish transcribing the customer’s turn, and b) wait while the TTS converted all of its own text response to speech.

This combined latency alone is close to 5 seconds (factoring in the network latency of receiving and sending audio). Five seconds is a substantially long time if you consider that a good part of this time is just silence while you wait for the bot to reply.

This, of course, does not include the extra time taken by the bot to understand what the customer has said and to generate a text response in a real-world scenario.
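The arithmetic behind such an estimate is a plain sum over the stages of a strictly sequential pipeline. The sketch below only illustrates that structure. The `TurnLatency` class and its field names are assumptions, and the numbers in the example are placeholders chosen so that the total lands near the roughly 5 seconds mentioned above. They are not the measurements behind Figures 2 and 3.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Per-turn delays, in seconds, for a strictly sequential ASR + chatbot + TTS pipeline."""
    network_in_s: float           # customer audio travelling to the bot
    asr_s: float                  # transcribing the customer's full turn
    response_generation_s: float  # understanding and generating the text reply
    tts_s: float                  # synthesizing the bot's full reply
    network_out_s: float          # bot audio travelling back to the customer

    def total(self) -> float:
        """Each stage waits for the previous one to finish, so the delays add up."""
        return (
            self.network_in_s
            + self.asr_s
            + self.response_generation_s
            + self.tts_s
            + self.network_out_s
        )


# Placeholder values only: an "instantaneous" chatbot between a slow ASR and TTS.
example = TurnLatency(
    network_in_s=0.2,
    asr_s=2.0,
    response_generation_s=0.0,
    tts_s=2.5,
    network_out_s=0.2,
)
```

Here `example.total()` comes to about 4.9 seconds. Any non-zero `response_generation_s` adds directly to that, because no stage overlaps with another in this rudimentary design.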

Figure 2

Figure 3

Owing to these high latencies, a rudimentary chaining of an ASR, a chatbot, and a TTS, cannot make for a voicebot that offers smooth UX.

Speech Characteristics

As a medium of communication, speech is richer than text in that it carries ‘prosody’. Prosody refers to features such as intonation, tone, and emphasis, along with acoustic features such as pitch and loudness. Together, these ‘prosodic features’ convey context about a speaker that goes beyond just the words of their speech. Humans rely on prosody when communicating with each other to establish a shared context comprising elements such as emotion, urgency, assuredness, and empathy. Landmark studies by the psychologist Albert Mehrabian have led to the now-famous ‘7-38-55 rule’: the total impact of a message is a) about 7 percent verbal (words only), b) 38 percent vocal (including tone of voice, inflection, and other sounds) and c) 55 percent nonverbal (body language).

Because humans innately rely on voice prosody for a significant part of their communication, it is natural for customers to expect a voicebot to understand their emotions and context through their voice. The urgency of an issue, exasperation at a bot that does not seem to understand them, and the need to be transferred to a human agent immediately are all cases in point. With a text chatbot, these would be expressed explicitly in words, making it easier for text-based models to pick up on such cues.

Figure 4


Detecting emotion and other cues from speech is a nascent field of research. Recent work performed emotion detection on a clean dataset of dialogues from Friends. This is a high-quality audio dataset (unlike phone calls), and the task was to extract a few high-level emotions (such as happiness, sadness, and anger). Even in this experimental setting, the reported emotion detection accuracies (F1-scores, for the more technically inclined) were only in the range of 7% to 55%, depending on the emotion detected. Figure 4 shows these numbers. Research has just scratched the surface of using prosodic features to extract meaningful context from speech.

Voice Persona

Depending on their context, customers calling into a phone line might need to be reassured, empathized with, and sometimes even placated. A seasoned customer support agent recognizes this and adapts their voice persona to the customer’s demographic and calling context. Similarly, the voice element brings a bot closer to life, affording it the possibility of a more distinct and adaptive persona than a text chatbot. However, this comes with a flip side. A robotic-sounding voicebot that fails to capitalize on the possibilities offered by voice lends a markedly more monotonous overtone to the conversation than a text chatbot does.

These are a few of the challenges and considerations involved in building virtual voice AI agents. We continue to address each of them, enabling Agara to push the boundaries of conversational voice AI. Read our blog post on how Agara achieves high accuracy in speech recognition to learn how we are exceeding the performance of state-of-the-art ASR models. Look out for more such under-the-hood details and insights in our future blog posts!
