How does Agara manage to achieve high accuracy in speech recognition?

Agara uses publicly available ASRs combined with its proprietary ASR technology. 

Agara uses publicly available ASRs like Google and Amazon to transcribe generic speech. This gives it the ability to transcribe most words spoken by the user and get the base transcript of the conversation.

While these provide reasonably accurate results and are an excellent way to get started on a project, they will not achieve the accuracy of a solution optimized for a specific use case. This is because the machine learning models used by the cloud providers have been trained on generic data rather than domain-specific language.

Entity-specific speech recognition

In parallel, Agara’s proprietary entity-specific Deep Learning-based ASR models are designed to accurately recognize domain-specific information, phrases, and intents. In a mission-critical setup, correctly capturing the facts the customer provides is essential. For instance, in an insurance call, getting the policy number and the nature of the problem right is central to a great experience.

The entities that qualify for this treatment include:

  • Names
  • Cities, states and countries
  • Numbers (policy number, order number, account number, ticket number)
  • Alphanumeric strings
  • IDs: Passport, driver’s license
  • Dates & days
  • Reason for calling
  • And more…

These independent, entity-specific ASR models are hyper-personalized to particular use-cases and industries for maximum impact. For instance, Agara can accurately capture a string of numbers and characters (such as a PNR number or record locator for airlines), which is a hard task for generic ASRs.
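To make the alphanumeric case concrete, here is a minimal sketch of the kind of post-processing an entity-specific model enables: turning word-level speech tokens into a character string such as a PNR. The function name, the token map, and the NATO-style spellings are illustrative assumptions, not Agara's actual implementation.

```python
# Hypothetical sketch: converting spoken tokens from an ASR into an
# alphanumeric string (e.g. an airline PNR). The mapping below is a
# small illustrative subset, not a production vocabulary.
SPOKEN_TO_CHAR = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
    # NATO-style spellings customers often use on calls
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D",
}

def normalize_alphanumeric(tokens):
    """Map spoken tokens to characters; keep single letters as-is."""
    chars = []
    for tok in tokens:
        t = tok.lower()
        if t in SPOKEN_TO_CHAR:
            chars.append(SPOKEN_TO_CHAR[t])
        elif len(t) == 1 and t.isalpha():
            chars.append(t.upper())
        # filler words ("umm", "it's") are simply dropped
    return "".join(chars)

print(normalize_alphanumeric(["bravo", "seven", "x", "two", "delta", "nine"]))
# B7X2D9
```

A generic ASR would typically emit free text here ("bravo seven x two delta nine"), which is why a dedicated entity model plus normalization step captures these strings far more reliably.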

These ASR models have been developed in-house and are built to deliver the highest accuracy for specific data entities. They are trained on voice inputs mimicking how customers typically speak these entities (for instance, "Christmas Eve" instead of "24 December").
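The "Christmas Eve" example can be sketched as a resolution step that maps colloquial date phrases onto calendar dates. This is a toy illustration of the idea, assuming a hypothetical lookup table and function name; real entity models learn such mappings from training data rather than a hand-written dictionary.

```python
from datetime import date

# Hypothetical sketch: resolving colloquial date phrases (the way
# customers actually speak dates) onto concrete calendar dates.
COLLOQUIAL_DATES = {
    "christmas eve": (12, 24),
    "christmas": (12, 25),
    "new year's eve": (12, 31),
    "new year's day": (1, 1),
}

def resolve_date_phrase(phrase, year=2024):
    """Return a date for a known colloquial phrase, else None."""
    key = phrase.strip().lower()
    if key in COLLOQUIAL_DATES:
        month, day = COLLOQUIAL_DATES[key]
        return date(year, month, day)
    return None

print(resolve_date_phrase("Christmas Eve"))
# 2024-12-24
```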

Agara controls which of these models are invoked at a given point in time based on what the conversation is about. For instance, when asked the reason for filing a new insurance claim, the customer will most likely provide reasons (‘I had an accident yesterday’ or ‘I already have a claim initiated. I want to add more details’) or ask to speak to an agent (‘I have provided the details. I want to get an update from an agent’). With this context, Agara can invoke the appropriate ASR models to get the most accurate read on the customer’s speech.
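This context-driven selection can be pictured as a simple dispatch from dialogue state to entity models. The state names and model names below are invented for illustration; the point is only that the current turn's context narrows which entity ASRs run alongside the generic transcript.

```python
# Hypothetical sketch of context-driven model selection. Dialogue
# states and model names are illustrative, not Agara's actual ones.
CONTEXT_TO_MODELS = {
    "ask_claim_reason": ["reason_for_calling", "dates"],
    "ask_policy_number": ["numbers", "alphanumeric"],
    "ask_travel_details": ["alphanumeric", "cities", "dates"],
}

def select_asr_models(dialogue_state):
    """Return the entity-specific ASR models to invoke this turn."""
    return CONTEXT_TO_MODELS.get(dialogue_state, [])

print(select_asr_models("ask_policy_number"))
# ['numbers', 'alphanumeric']
```

Restricting each turn to a few relevant entity models keeps recognition both faster and more accurate than running every model on every utterance.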

Data Collection & Training

Agara’s tech is built on millions of customer conversations – both proprietary to our clients and generated in-house by Agara.

  • Agara works with independent contractors from around the world to generate data specific to its requirements
  • Contractors are employed from specific parts of the world to generate authentic accent data
  • Contractors are asked to record conversations in their natural surroundings ensuring that natural, real-world noises are included in the data sets

Models are retrained on a regular basis, focusing on areas where improvements are needed. They also receive additional training to handle the specific needs of individual clients.

A strong source of data and improvements comes from recorded customer conversations provided to us by our clients. These conversations are the closest data to what the voice bot will encounter and are used in additional training of the models. It is important to note that no client data is ever shared with anyone in any form for any reason.

Delivering Results

All these details matter only if the results are delivered quickly and in a usable form. To run the multiple ASR models as well as the subsequent NLU models, Agara does a couple of things:

One, entity-specific ASRs operate directly on the speech input and do not require language models to produce their output. The result is much faster processing than public ASRs like Google and Amazon can offer.

Two, Agara processes its models on GPU machines for fast responses. A high-performance Nvidia Tesla v100 GPU cluster runs ASR models as well as NLU models. The average processing times for Agara ASR engines (external and entity-specific combined) are ~350ms (1/3rd of a second) which make for a seamless, natural flow in the conversation.