‘Sorry, I am not allowed to upgrade the wife,’ responded the voice bot. It had misinterpreted the actual command, which was to upgrade the Wi-Fi.

Human language is highly ambiguous, and we speak with varied accents, pronunciation, and diction. Voice bots are notorious for misinterpreting what we say when the underlying speech recognition is inaccurate. No customer-facing business can afford such misinterpretations when using voice bots to deliver customer service: at best, the customer merely laughs at the bot; at worst, they sever their relationship with the company for good.

The current state of the art
The job of a voice bot is clear: ignore the bells and whistles, ingest a speech signal, analyze it, interpret intent and emotion, and respond in natural language, in real time. A voice bot must be able to interpret a voice request, convert it into text, and deliver an accurate vocal response in return, all in a matter of seconds.

Automatic Speech Recognition (ASR) is the technology that lets humans speak with a computer interface in a way that resembles a regular conversation. ASR converts speech into text transcripts; Natural Language Understanding (NLU) techniques applied over these transcripts produce the responses.
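The ASR-to-NLU pipeline above can be sketched as a three-stage loop. The function bodies below are stubs standing in for real acoustic and language models, not Agara’s actual API:

```python
# Minimal sketch of the ASR -> NLU -> response turn of a voice bot.
# All stage implementations are illustrative stubs.

def transcribe(audio_frames):
    """ASR stage: convert a speech signal into a text transcript (stubbed)."""
    return " ".join(audio_frames)  # stand-in for a real acoustic model

def understand(transcript):
    """NLU stage: extract an intent from the transcript (stubbed keyword match)."""
    if "wi-fi" in transcript.lower():
        return {"intent": "upgrade_wifi"}
    return {"intent": "unknown"}

def respond(intent):
    """Response stage: map the intent to a natural-language reply."""
    replies = {
        "upgrade_wifi": "Sure, I can upgrade your Wi-Fi plan.",
        "unknown": "Could you rephrase that?",
    }
    return replies[intent["intent"]]

def handle_turn(audio_frames):
    """One full voice-bot turn: speech in, reply out."""
    return respond(understand(transcribe(audio_frames)))

print(handle_turn(["please", "upgrade", "my", "Wi-Fi"]))
```

In a production system, each stage would be a model served with strict latency budgets rather than a pure function, but the control flow is the same.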

Speech recognition accuracy in voice bots is not yet robust and is influenced by several factors: poor articulation, a high degree of acoustic variability caused by accents, noise, interruptions, sloppy pronunciation, hesitation, repetition, and more. As a result, the accuracy of voice bots is about 65% to 70% on real-world data sets. Accuracy dips further for real-time transcription as opposed to batch transcription, which isn’t time-sensitive. Today, users largely ‘pardon’ or ignore the error rate for general-purpose use cases like closed captions on YouTube videos.
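Accuracy figures like these are typically reported via word error rate (WER): the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. A minimal implementation:

```python
# Word error rate (WER), the standard metric behind accuracy figures like
# the 65-70% cited above.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four: WER = 0.25, i.e. 75% word accuracy.
print(wer("please upgrade the wi-fi", "please upgrade the wife"))
```

Note that a single substituted word can be enough to flip the meaning of a command, which is why raw word accuracy understates the problem for mission-critical calls.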

Although ASR has matured to transform customer service and commercial applications, a high error rate remains one of the critical impediments to the full acceptance of speech technology, particularly in the enterprise landscape, where the conversations are mission-critical. The key to the accuracy of a voice bot is interpreting and responding to what users say appropriately. Building a valuable voice bot is more an outcome of good design and implementation combined with the right technology.

Enter Agara
Agara is an autonomous virtual voice agent powered by Real-time Voice AI. Be it troubleshooting, marketing, transactions, or credit and collections, it is designed to hold intelligent conversations with your customers without any human agent’s assistance.

Agara is built to think like its users. It reasons through dialog flows and predicts what the user could say next based on a set of semantic possibilities, not a fixed choice of words, and triggers a response. Users seldom hear it say, ‘I am sorry, I’m not sure what you just said.’

Agara is built on robust natural language understanding systems that learn to accommodate errors based on context. It asks relevant questions to clear up confusion or uncertainty in a conversation, keeping in mind that the UX should mirror how a good human agent would handle things if they had misheard.

Best of both worlds
Agara uses publicly available ASRs combined with its proprietary ASR technology. 

Agara uses publicly available ASRs like Google and Amazon to transcribe generic speech. This gives it the ability to transcribe most words spoken by the user and get the base transcript of the conversation.

While these provide reasonably accurate results and are an excellent way to get a project started, they will not match the accuracy of a solution optimized for a specific use case. This is because the machine learning models used by the cloud providers are trained on generic data rather than domain-specific language.

Entity-specific speech recognition
In parallel, Agara’s proprietary entity-specific Deep Learning-based ASR models are designed to accurately recognize domain-specific information, phrases, and intents. In a mission-critical setup, accurately capturing the facts the customer provides is critical. For instance, in an insurance call, getting the policy number and the nature of the problem right is central to a great experience.

The entities that qualify for this treatment include:

  • Names
  • Cities, states and countries
  • Numbers (policy number, order number, account number, ticket number)
  • Alphanumeric strings
  • IDs: passport, driver’s license
  • Dates & days
  • Reason for calling
  • And more…

These independent, entity-specific ASR models are hyper-personalized to particular use-cases and industries for maximum impact. For instance, Agara can accurately capture a string of numbers and characters (such as a PNR or record locator for airlines), which is a hard task for generic ASRs.
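To see why alphanumeric strings are hard, consider the post-processing such an entity model’s output needs: spelled-out digits and filler tokens have to be mapped back to characters. The vocabulary below is a small assumed sample for illustration, not Agara’s actual grammar:

```python
# Illustrative post-processing for a spoken alphanumeric string such as an
# airline PNR / record locator. Vocabulary is a small assumed sample.

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
FILLERS = {"uh", "um", "as", "in"}  # tokens to drop entirely

def normalize_pnr(spoken):
    """Collapse a spoken character string into its written form."""
    out = []
    for token in spoken.lower().split():
        if token in FILLERS:
            continue
        if token in DIGITS:
            out.append(DIGITS[token])
        elif len(token) == 1:
            out.append(token.upper())  # single spoken letter
        else:
            out.append(token)  # leave unrecognized tokens untouched
    return "".join(out)

print(normalize_pnr("A B three seven X nine"))  # → AB37X9
```

A real system would also handle homophones (‘two’ vs ‘to’), phonetic alphabets (‘X as in x-ray’), and confidence scores from the acoustic model; this sketch shows only the core mapping step.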

These ASR models have been developed in-house and are built to deliver the highest accuracy for specific data entities. They are trained on voice inputs that mimic how customers typically speak these entities (for instance, ‘Christmas Eve’ instead of ‘24 December’).

Agara controls which of these models are invoked at a given point in time based on what the conversation is about. For instance, when asked the reason for filing a new insurance claim, the customer will most likely provide reasons (‘I had an accident yesterday’ or ‘I already have a claim initiated. I want to add more details’) or ask to speak to an agent (‘I have provided the details. I want to get an update from an agent’). With this context, Agara can invoke the appropriate ASR models to get the best read on the user’s speech.
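This context-driven selection can be pictured as a registry keyed on the bot’s current dialog state: each question the bot asks determines which entity models are worth running alongside the generic ASR. The state names and model lists below are illustrative assumptions, not Agara’s internals:

```python
# Sketch of context-driven ASR model selection. The dialog states and the
# model registry contents are hypothetical examples.

MODEL_REGISTRY = {
    "ask_claim_reason": ["generic_asr", "reason_for_calling_asr"],
    "ask_policy_number": ["generic_asr", "number_asr", "alphanumeric_asr"],
    "ask_date": ["generic_asr", "date_asr"],
}

def models_for_state(dialog_state):
    """Return the ASR models to run for the bot's current question."""
    # Fall back to the generic model when no entity is expected.
    return MODEL_REGISTRY.get(dialog_state, ["generic_asr"])

print(models_for_state("ask_policy_number"))
# → ['generic_asr', 'number_asr', 'alphanumeric_asr']
```

Running only the relevant entity models per turn keeps latency down while still giving each expected answer type a specialized decoder.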

Data Collection & Training
Agara’s tech is built on millions of customer conversations – both proprietary to our clients and generated in-house by Agara.

  • Agara works with independent contractors from around the world to generate data specific to its requirements
  • Contractors are employed from specific parts of the world to generate authentic accent data
  • Contractors are asked to record conversations in their natural surroundings, ensuring that natural, real-world noises are included in the data sets

Models are retrained regularly, with a focus on areas where improvement is needed. They are additionally trained to meet the specific needs of clients.

A strong source of data and improvements comes from recorded customer conversations provided to us by our clients. These conversations are the closest data to what the voice bot will encounter and are used in additional training of the models. It is important to note that no client data is ever shared with anyone in any form for any reason.

Delivering Results
All these details matter only if the results are made available in a fast, usable manner. To run the multiple ASR models as well as the subsequent NLU models, Agara does a couple of things:

One, entity-specific ASRs operate directly on the speech input and do not require language models to create their output. The result is much faster processing than public ASRs such as Google’s and Amazon’s.

Two, Agara runs its models on GPU machines for fast responses. A high-performance Nvidia Tesla V100 GPU cluster serves both the ASR and NLU models. The average processing time for Agara’s ASR engines (external and entity-specific combined) is ~350 ms (about a third of a second), which makes for a seamless, natural conversational flow.

Voice-first and voice-only
McKinsey reports that human customer service interactions will come down by 30% in the next two years. This means bots have the potential to own nearly a third of these communications in the near future.

Transcript accuracy is one of Agara’s most significant objectives. Together with a relentless drive to solve unique voice-related problems, these innovations enable Agara to exceed the performance of generic ASR models, deliver near-human accuracy in call automation, and push the boundaries of AI.

Listen to this conversation between a customer and Agara.

Remember, the part from Agara is generated automatically in real-time!


Agara’s ability to handle nuanced conversations and provide highly personalized behavior makes it one of the most advanced Real-time Voice AI products anywhere. Learn how Real-time Voice AI can help you deliver the best experience for your customers. Click here to schedule a demo. For any queries, feel free to reach out to us at [email protected]