Does Agara use any of the public AI services from Amazon or Google? How does this work with Agara’s proprietary technology?

Agara uses publicly available ASRs combined with its proprietary ASR technology. 

Agara uses publicly available ASRs from providers like Google and Amazon to transcribe generic speech. This gives it the ability to transcribe most words spoken by the user and produce the base transcript of the conversation.
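As a rough illustration, obtaining this base transcript from a public ASR could look like the sketch below. It uses Google Cloud Speech-to-Text via the google-cloud-speech Python client and assumes a single turn of 16 kHz, LINEAR16-encoded audio; this is a minimal example, not Agara’s actual integration.

```python
# Minimal sketch: base transcript for one conversation turn from a public ASR
# (Google Cloud Speech-to-Text). Assumes 16 kHz, LINEAR16-encoded audio bytes.
from google.cloud import speech

def base_transcript(audio_bytes: bytes, language_code: str = "en-US") -> str:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language_code,
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    # Join the top hypothesis of each result into a single transcript string.
    return " ".join(result.alternatives[0].transcript for result in response.results)
```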

While these provide reasonably accurate results and are an excellent way to get started on a project, they cannot match the accuracy of a solution that is optimized for a specific use-case. This is because the machine learning models used by the cloud providers have been trained on generic data rather than domain-specific language. The TTS used right now is GCP’s only; Agara is still developing its in-house TTS and entity-specific speech recognition.

In parallel, Agara’s proprietary entity-specific Deep Learning-based ASR models are designed to accurately recognize domain-specific information, phrases, or intents. In a mission-critical setup, correctly capturing the facts provided by the customer is critical. For instance, in an insurance call, getting the policy number and the nature of the problem right is central to a great experience.

The entities that qualify for this treatment include (an illustrative sketch follows the list):

  • Names
  • Cities, states and countries
  • Numbers (policy number, order number, account number, ticket number)
  • Alphanumeric strings
  • IDs: passport, driver’s license
  • Dates & days
  • Reason for calling
  • And more…
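Purely for illustration, these categories could be represented as an enumeration like the one below; the names are hypothetical and not Agara’s internal schema.

```python
# Illustrative only: entity categories that dedicated ASR/SLU models
# might specialize in. The names are hypothetical.
from enum import Enum

class EntityType(Enum):
    NAME = "name"                  # customer names
    LOCATION = "location"          # cities, states, countries
    NUMBER = "number"              # policy, order, account, ticket numbers
    ALPHANUMERIC = "alphanumeric"  # e.g. PNRs and other record locators
    ID_DOCUMENT = "id_document"    # passport, driver's license
    DATE = "date"                  # dates and days
    CALL_REASON = "call_reason"    # reason for calling
```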

These independent, entity-specific ASR models are hyper-personalized to particular use-cases and industries for maximum impact. For instance, Agara can accurately capture a string of numbers and characters (such as a PNR number or record locator for airlines), which is a hard task for generic ASRs.

These ASR models have been developed in-house and are built to deliver the highest accuracy for specific data entities. They are trained on voice inputs that mimic how customers typically speak these entities (for instance, ‘Christmas Eve’ instead of ‘24 December’).
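To make the ‘Christmas Eve’ example concrete, a toy normalizer that maps such spoken forms to a calendar date might look like this; the helper and the formats it handles are assumptions for illustration, not Agara’s model.

```python
# Toy sketch (not Agara's model): normalize spoken date phrases such as
# "Christmas eve" or "24 December" to a canonical (month, day) pair.
# Named days are listed explicitly because customers often say them
# instead of the calendar date.
from datetime import datetime
from typing import Optional, Tuple

NAMED_DAYS = {
    "christmas eve": (12, 24),
    "christmas": (12, 25),
    "new year's eve": (12, 31),
}

def normalize_spoken_date(phrase: str) -> Optional[Tuple[int, int]]:
    key = phrase.strip().lower()
    if key in NAMED_DAYS:
        return NAMED_DAYS[key]
    for fmt in ("%d %B", "%B %d", "%d %b"):  # "24 December", "December 24", "24 Dec"
        try:
            parsed = datetime.strptime(phrase.strip(), fmt)
            return (parsed.month, parsed.day)
        except ValueError:
            continue
    return None

# normalize_spoken_date("Christmas eve") -> (12, 24)
# normalize_spoken_date("24 December")   -> (12, 24)
```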

Agara controls which of these models are invoked at a given point in time based on what the conversation is about. For instance, when asked the reason for filing a new insurance claim, the customer will most likely provide a reason (‘I had an accident yesterday’ or ‘I already have a claim initiated. I want to add more details’) or ask to speak to an agent (‘I have provided the details. I want to get an update from an agent’). With this context, Agara can invoke the appropriate ASR models to get the best read on the customer’s response.
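A much-simplified sketch of that routing decision is below; the dialog states and model names are made up for illustration, and the real selection logic is proprietary.

```python
# Hypothetical sketch: pick which entity-specific models to turn 'on' for a
# turn, based on where the conversation is. The generic ASR (GCP/Amazon) is
# always on; these are the additional in-house models.
from typing import Dict, List

# Illustrative registry: dialog state -> entity-specific models worth running.
MODEL_REGISTRY: Dict[str, List[str]] = {
    "ask_policy_number": ["alphanumeric_asr", "number_asr"],
    "ask_claim_reason": ["call_reason_slu"],
    "ask_travel_date": ["date_asr"],
}

def models_for_turn(dialog_state: str) -> List[str]:
    """Return the entity-specific models to invoke for this turn."""
    return MODEL_REGISTRY.get(dialog_state, [])

# models_for_turn("ask_claim_reason") -> ["call_reason_slu"]
```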

How we use GCP and in-house ASRs together (Additional explanation)

We use accent- and language-specific ASRs from GCP to get the transcript of every turn of the customer’s conversation – they’re always ‘on’. – (A)

We also have accent- and language-specific ASRs built in-house that are used to get transcripts of certain turns of the conversation. We know which turns to turn them ‘on’ or ‘off’ for because we have a reasonable expectation of what the customer is about to say. For instance, if the customer was asked the reason for calling us, they would most likely provide a reason from the set of reasons that most customers give for that industry and domain (‘I want to cancel my order’, ‘I want to book a new order’, etc.). – (B)

Further, we also have in-house SLU (spoken language understanding) systems that provide intents and entities directly from the customer’s speech. Again, we know when to turn them ‘on’ or ‘off’ because we know what to expect from the customer. For instance, if we have an SLU built specifically to capture strings of numbers and characters (such as a PNR number), which is a hard task for generic ASRs, it will provide the PNR directly from the customer’s speech without transcribing the rest of the utterance. – (C)
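As a rough illustration of (C), a dedicated ‘PNR SLU’ would return a structured entity rather than a transcript. The sketch below approximates this with a regular expression over a text hypothesis; the real system works on speech directly, and the 6-character airline-style PNR format is an assumption made for the example.

```python
# Rough illustration of (C): return the PNR as a structured entity instead of
# a transcript. Approximated here with a regex over a text hypothesis; a real
# SLU consumes speech directly. The 6-character PNR format is an assumption.
import re
from typing import Optional

PNR_PATTERN = re.compile(r"\b[A-Z0-9]{6}\b")

def extract_pnr(hypothesis: str) -> Optional[dict]:
    # Collapse spelled-out characters, e.g. "Q 7 X 4 K 2" -> "Q7X4K2".
    collapsed = re.sub(
        r"\b([A-Z0-9])\s+(?=[A-Z0-9]\b)", r"\1", hypothesis.upper()
    )
    match = PNR_PATTERN.search(collapsed)
    if not match:
        return None
    # The confidence value is illustrative only.
    return {"entity": "pnr", "value": match.group(0), "confidence": 0.9}

# extract_pnr("my pnr is Q 7 X 4 K 2") -> {"entity": "pnr", "value": "Q7X4K2", ...}
```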

Given that the goal of the system is to get structured entities and intents from the customer’s speech, an NLU system takes in A, B (when available), and C (when available), and applies proprietary methods to ‘combine/reconcile’ these multiple transcripts and SLU-provided intents and entities to arrive at the final set of intents and entities.
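A highly simplified sketch of that reconciliation step follows; the data shapes and the precedence rule (prefer C, then entities extracted from B, then from A) are assumptions for illustration, since the actual combining methods are proprietary.

```python
# Highly simplified sketch of reconciling the three sources:
#   A: generic ASR transcript (always available)
#   B: in-house ASR transcript (only for selected turns)
#   C: SLU-provided entities/intents (only for selected turns)
# The precedence rule used here is an illustrative assumption.
from typing import Callable, Dict, List, Optional

def reconcile(
    generic_transcript: str,                         # A
    inhouse_transcript: Optional[str],               # B (may be None)
    slu_entities: Optional[List[dict]],              # C (may be None)
    extract_from_text: Callable[[str], List[dict]],  # e.g. a rule-based extractor
) -> Dict[str, dict]:
    candidates: Dict[str, List[dict]] = {}

    def add(items: List[dict], source: str) -> None:
        # Record every candidate value together with the source that produced it.
        for item in items:
            candidates.setdefault(item["entity"], []).append({**item, "source": source})

    if slu_entities:
        add(slu_entities, "C")
    if inhouse_transcript:
        add(extract_from_text(inhouse_transcript), "B")
    add(extract_from_text(generic_transcript), "A")

    # Per-entity resolution: take the highest-precedence candidate (C, then B,
    # then A). In practice this logic differs per entity.
    return {name: values[0] for name, values in candidates.items()}
```

As the next paragraph notes, that per-entity resolution can either be hand-written, as in this toy version, or learned by a model.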

This ‘combining/reconciling’ logic is different for different entities and intents, and is either written algorithmically or learned by a machine-learning model.